Creating new catalog entries

Relevant geospatial data is not always formatted for optimum performance in analysis pipelines. Some geospatial formats are not compatible with virtual filesystem access at all, while others (zipped shapefiles, NetCDF, GRIB) are usable but inefficient when accessed over network-based interfaces. Many recent formats, such as Cloud Optimized GeoTIFF, Zarr, or GeoParquet, are designed precisely for this kind of access. These formats enable fast, seamless pipelines without assuming data is first downloaded in full, which is both slower and adds friction to production pipelines that need access to constantly evolving data streams, such as satellite imagery, in order to provide up-to-date indicators.
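For example, a Cloud Optimized GeoTIFF can be opened lazily over plain HTTPS via GDAL's virtual filesystem, so that only the windows and overviews actually requested are fetched. A minimal sketch (the URL below is a placeholder, not a catalog asset):

library(stars)
# sketch: lazily open a remote Cloud Optimized GeoTIFF (placeholder URL);
# with proxy = TRUE, pixel data is only fetched when actually needed
cog <- read_stars("/vsicurl/https://example.org/some-raster.tif", proxy = TRUE)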

Example: Rasterizing large vectors

Here we take a look at data from the California Fish and Wildlife Areas of Conservation Emphasis (ACE). The original data product is distributed as a set of tiled hexes, which is not particularly efficient for some analysis or visualization pipelines, so we convert it to raster layers:

library(stars)
library(dplyr)
library(sf)

We use the terrestrial climate change resilience layer; its URL can be copied from the DSE Biodiversity Catalog.

# note, in sf this is not actually a lazy read
data <- "/vsicurl/https://minio.carlboettiger.info/public-biodiversity/ACE_summary/ds2738.gdb"
ex <- sf::st_read(data)
Reading layer `ds2738' from data source 
  `/vsicurl/https://minio.carlboettiger.info/public-biodiversity/ACE_summary/ds2738.gdb' 
  using driver `OpenFileGDB'
Simple feature collection with 63982 features and 9 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -373987.9 ymin: -604495.8 xmax: 540034.3 ymax: 450023.2
Projected CRS: NAD83 / California Albers
names(ex)
 [1] "Hex_ID"          "CLIM_RANK"       "VegRefugiaScore" "pct_hex"        
 [5] "Eco_Sect"        "Eco_Name"        "County"          "Shape_Length"   
 [9] "Shape_Area"      "Shape"          

Typically we would specify the grid resolution we want for our raster, but in this case we can also let stars guess an appropriate scale based on the size of the original hexagon tiles. We will rasterize each value column: the climate resilience rank (on a scale of 1-5) and the vegetation refugia score (continuous scale, 0-1):

climate_resilience <-
  ex |> 
  select(CLIM_RANK) |>
  stars::st_rasterize()

# optional static plot:
# plot(climate_resilience, col=viridisLite::turbo(n = 5), nbreaks=6)
veg_refugia <- ex |> select(VegRefugiaScore) |> stars::st_rasterize() 
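
If we preferred explicit control over the output grid, we could instead pass a template raster to st_rasterize(). The sketch below assumes a 1000 m cell size, which is an arbitrary illustrative choice rather than a value from the original workflow:

# sketch: rasterize onto an explicit grid rather than letting stars guess;
# the 1000 m cell size is an arbitrary illustrative choice
template <- stars::st_as_stars(sf::st_bbox(ex), dx = 1000, dy = 1000, values = NA_real_)
climate_resilience_1km <- ex |>
  select(CLIM_RANK) |>
  stars::st_rasterize(template)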

We can now easily add these as layers on an interactive leaflet map. (Trying this with the original hexes is far more resource-intensive in leaflet!) Use the map controls to switch the base map and toggle the layers:

library(tmap)
tmap_mode("view")

tm_shape(veg_refugia) +
  tm_raster(palette = viridisLite::mako(9), legend.show = FALSE, alpha = 0.8) +
  tm_shape(climate_resilience) +
  tm_raster(palette = viridisLite::turbo(5), legend.show = FALSE, alpha = 0.8)

Writing derived layers

We can also write derived layers back to our cloud storage system, where they can be used in other workflows. Note that not all formats that support VSI reads also support VSI writes (see st_drivers("all"), illustrated below). VSI writes to GeoTIFF files require an additional environment variable:

Sys.setenv("CPL_VSIL_USE_TEMP_FILE_FOR_RANDOM_WRITE"="YES") 
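
To check which GDAL drivers advertise both virtual filesystem (VSI) support and write capability, we can inspect the driver table directly; a quick sketch:

# sketch: list drivers that report both VSI support and write capability
drivers <- sf::st_drivers("all")
drivers[drivers$vsi & drivers$write, c("name", "long_name")]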

We can then write directly to bucket storage (assuming we have configured AWS credentials with write privileges):

climate_resilience |> 
  write_stars("/vsis3/public-biodiversity/ace_summary/climate-resilience.tif")
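
If credentials are not already configured, GDAL reads them from standard environment variables, which we can set within the R session. The values below are placeholders, and AWS_S3_ENDPOINT and AWS_VIRTUAL_HOSTING are only needed for a non-AWS, S3-compatible store such as MinIO:

# sketch: placeholder credentials for GDAL's /vsis3/ driver;
# the endpoint settings apply only to S3-compatible stores such as MinIO
Sys.setenv(
  "AWS_ACCESS_KEY_ID"     = "YOUR_ACCESS_KEY",
  "AWS_SECRET_ACCESS_KEY" = "YOUR_SECRET_KEY",
  "AWS_S3_ENDPOINT"       = "minio.example.org",
  "AWS_VIRTUAL_HOSTING"   = "FALSE"
)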

In certain cases it may be necessary to write first to local storage and then copy the data over to the bucket. For instance, gdalcubes can stream a given virtual cube into a collection of constituent COG assets, including ‘overview’ files, but at present this only works with a local directory. An S3 client such as aws-cli or the minio R package can then be used for bulk copies between local directories and the object store:

library(minio) # install_github("cboettig/minio")
mc("cp -r local_dir minio/bucket-name/path")

Creating new STAC catalog entries

In addition to writing out a derived layer, we may want to create a metadata entry that adds the resulting product to the catalog. We can also write new entries for data products that are already available on other portals. A STAC catalog entry is a simple JSON file that conforms to the STAC spec; it can be generated programmatically or simply written by hand. See the JSON files in the DSE Biodiversity Catalog on GitHub for examples. To add new entries, commit the JSON files to the GitHub repository and link them from the top-level catalog.json or the appropriate sub-catalog.
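
As a rough sketch, a minimal STAC item for the derived raster could be assembled and serialized from R with jsonlite; the id, bbox, datetime, and asset href below are illustrative placeholders rather than the catalog's actual metadata:

library(jsonlite)

# sketch: a minimal STAC item; all field values are illustrative placeholders
item <- list(
  type         = "Feature",
  stac_version = "1.0.0",
  id           = "climate-resilience",
  geometry     = NULL,
  bbox         = c(-124.4, 32.5, -114.1, 42.0),
  properties   = list(datetime = "2024-01-01T00:00:00Z"),
  links        = list(),
  assets       = list(
    data = list(
      href = "https://example.org/ace_summary/climate-resilience.tif",
      type = "image/tiff; application=geotiff; profile=cloud-optimized"
    )
  )
)
write_json(item, "climate-resilience.json",
           auto_unbox = TRUE, pretty = TRUE, null = "null")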

(Details / embedded examples to come)