-
Notifications
You must be signed in to change notification settings - Fork 0
Description
GDAL
A couple of years ago I met with Nyall Dawson in person, we had been chatting on the socials now and then and decided we should have a longer conversation about things.
One of my questions for Nyall was about some smallish contributions I wanted to make to GDAL, at the time it had this VRT connection string syntax for which bands you wanted in a virtual dataset created on the fly.
vrt://{path_to_gdal_dataset}?[bands=num1,...,numN]
What that does is refer to the real dataset and provide you with only the band/s you wanted. So, a 3-band RGB tif could be re-ordered to BGR
vrt://RGB.tif?bands=3,2,1
or, you could choose only the second band, and as well as specifying this virtually, you can treat this description as if it was single band file and convert it a new real file.
gdal_translate vrt://RGB.tif?bands=2 green.tif
Of course, you can choose a new format, and any other options allowed by gdal_translate
. All this feature did was allow choice of band/s, they could be plucked arbitrarily, reordered, or repeated as you like.
This would give a 4-band virtual data set, and even though it might not be useful in the real world it's about the abstraction provided for your use case.
gdalinfo vrt://RGB.tif?bands=3,3,3,1
This is a powerful idea, but there was only one available feature: "bands".
R workarounds
In R we had a lot of code like this, the user could choose a time slice of a variable for a given date.
readvar <- function(date, ...) {
}
Under the hood we used the {raster} package which, via GDAL could read GeoTIFF, NetCDF, GRIB, ASC, PNG, HDF4, SDTS, and so on, so the function only needed to do this:
readvar <- function(date, ...) {
file <- var_file_at_date(date)
raster::raster(file)
}
There was more to it, but essentially that was the key, we used raster to gloss over the details because it could drive the G(eospatial)D(ata)A(bstraction)L(ibrary).
But, as anyone who has seen a lot of raster files, there's still a lot of details that bubble up to the user. A really common one we saw was the Assumption That The Data Uses Longitude/Latitude Coordinates.
That meant we would add workarounds like this:
readvar <- function(date, ...) {
var_crs <- "EPSG:4326" ## it was 2013 but this syntax is better than proj string
file <- var_file_at_date(date)
out <- raster::raster(file)
raster::projection(out) <- var_crs ## because we know these files don't have compliant CRS metadata
out
}
gdal_translate
We knew that the command line interface to translate files had these handy options to fix problems like this. You can do
gdal_translate problemfile.nc problemfile_fixed.tif -a_srs "+proj=longlat +datum=WGS84"
And so we would run this to copy every data file and make it compliant and then re-run that every day to get any new files that turned up.
No we didn't, that's a lie. Harder than writing code is maintaining a new set of files and they're really big too. What about a virtual format?
This works too:
gdal_translate problemfile.nc problemfile_fixed.vrt -a_srs "EPSG:4326"
We get a tiny virtual copy of our original file, but yuck still now we have a new problem, a new file for every old file. This gets hard to manage. And, missing CRS isn't the only problem, sometimes the bbox is wrong, sometimes we only want one of the variables, we need to fix the scale/offset of the values, on and on. And so we end up with a bunch of fixes for each kind of file and a new bespoke set of files that no one else has.
The R workarounds don't look so bad then.
Extending VRT connection string
When I asked Nyall about contributing, I had worried about 'a_ullr' quite a bit, and I thought I need a huge great PR that added several for the effort to be worthwhile:
-a_ullr: should we have separate xmin,ymin,xmax,ymax options, or one comma separate string (what about Euro formatted decimal numbers with comma-sep?)
... what other options should be included.
Nyall said: don't worry about these questions, carve out the smallest new feature that would be useful, write documentation and test/s for it, and contribute that. The repo issues is the place to hash out the details, implications, and problems. Make a PR that is self-contained and doesn't add too many new things.
I was excited because I'd started to realize that if we took the logic out of our R functions and put them into GDAL virtual syntax we could simply modify our file list to apply fixes that we only had in R for known datasets. Our file list could be
originalfile, fixedfile, date
fileday1.nc, "vrt://fileday1.nc?a_srs=EPSG:4326", day1
fileday2.nc, "vrt://fileday2.nc?a_srs=EPSG:4326", day2
...
and that meant that now we could open these filedfile descriptions with any R package that opens data with GDAL, or any Python package, ... or or ..., the command line, and QGIS, and so on.
Looking over the fence
Here's the PR that I first contributed, for the 'a_srs' option:
https://github.com/OSGeo/gdal/pull/6893/files
There are three files modified, one is C++, one is Python (the tests), and one is RST (the documentation).
The GDAL maintainer Even Rouault was extremely prompt and kind and added lots of helpful suggestions about my code.
C++ is really hard, and I knew very little Python, and I hadn't documented in RST before. I read the suggestions and folded in the required changes (sometimes the reviewer simply makes the change for you, and you can accept as-is).
At this point I had learnt enough about GDAL to know:
- where in the code the changes were needed (there's the work detecting the option itself, and small step of preparation before the option-processing starts)
- how to checkout the most recent version and build GDAL itself, to confirm the current behaviour, and then confirm the new behaviour that I wanted after I made the changes
- how to write a Python test to confirm the changed behaviour (at this point I wasn't able to run the tests I don't think)
- how to write documentation in RST for the new feature
- how to create a new branch on my GDAL fork, submit that as a PR
Requirements for contributing
There's quite a lot involved in even a simple PR. Probably the hardest is getting the very latest tip of the GDAL source and building it. And note that the very latest edge keeps moving, so the state of the code when you made your changes is not the state now, and if your changes take time to do, to review, and make decisions on this can happen multiple times (this is where rebase can be important).
The simplest contributions are documentation
You can submit a PR right there in GDAL github itself. This is fine for tiny ad hoc changes, but it's also good to bundle multiple edits of one file into a single commit, and that really takes an proper PR. You don't need to build GDAL or build the documentation to make changes to it.
Ask the question
Use the mailing list, is my understanding correct, are the implications I've not thought of.
getting the source
building GDAL
running the code
running the tests
rebasing your changes
When the core has moved on, when your commit history is messier than one clean commit.
When will my changes appear ?
GDAL releases vs downstream packages ...
What are some good example PRs that are needed now?
What are some hard PRs that are needed?
Example PRs
Let's pick some key ones and discuss the details of what happened to 1) investigate the need 2) determine where the code needs to change 3) what tests and doc were required 4) was it a straightforward review