Preprocessing External Data

Convert geospatial data to Josh’s optimized .jshd format

Introduction

Josh simulations can incorporate external geospatial data like climate projections, soil maps, or elevation models. Before use, this data must be preprocessed into Josh’s optimized .jshd (Josh Data) format.

Preprocessing does several things:

  1. Spatial alignment: Resamples data to match your simulation grid
  2. Coordinate transformation: Handles CRS conversions
  3. Temporal alignment: Maps time steps between data and simulation
  4. Binary optimization: Creates fast-loading binary format

Supported Input Formats

Josh supports three input formats, each with a dedicated config class:

| Format | Config Class | Use Case |
|--------|--------------|----------|
| NetCDF | NetcdfPreprocessConfig | Climate projections, gridded data with time dimension |
| GeoTIFF/COG | GeotiffPreprocessConfig | Satellite imagery, elevation models, land cover maps |
| CSV | CsvPreprocessConfig | Point observations, station data |

This tutorial demonstrates:

  1. Creating synthetic test data with clear spatial patterns
  2. Preprocessing NetCDF files using NetcdfPreprocessConfig
  3. Preprocessing GeoTIFF files using GeotiffPreprocessConfig
  4. Preprocessing CSV files using CsvPreprocessConfig
  5. Verifying preprocessed data with load_jshd()

Prerequisites: joshpy installed with pip install joshpy[all]

Step 1: Create Synthetic Test Data

For this tutorial, we create synthetic soil quality data with three distinct spatial patterns. These patterns make it easy to verify that preprocessing works correctly and that simulations respond to external data as expected.

The Patterns

We’ll create three NetCDF files:

| Pattern | Description | Expected Effect |
|---------|-------------|-----------------|
| Gradient | Soil quality increases west-to-east (0% to 100%) | Trees grow taller in the east |
| Triangle | Peak quality in center, decreasing to edges | Trees tallest in center |
| Stripes | Alternating bands of 20% and 80% quality | Bands of short and tall trees |

Grid Specification

Our synthetic data matches the tutorial simulation grid:

  • Extent: 33.7-34.0° latitude, -116.4 to -115.4° longitude
  • Resolution: ~0.01° (~1km cells)
  • Size: Approximately 31 x 102 cells
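Why "approximately"? With a fractional step, np.arange can include or drop the endpoint depending on floating-point rounding, so the exact counts can shift by one. A quick sanity check of the grid dimensions:

```python
import numpy as np

# Same construction as the NetCDF creation code: arange with the stop
# extended by one step so the endpoint is included.
LAT_MIN, LAT_MAX = 33.7, 34.0
LON_MIN, LON_MAX = -116.4, -115.4
RESOLUTION = 0.01

lats = np.arange(LAT_MIN, LAT_MAX + RESOLUTION, RESOLUTION)
lons = np.arange(LON_MIN, LON_MAX + RESOLUTION, RESOLUTION)

# Roughly 31 x 102, but either count can be off by one because 0.01
# has no exact binary floating-point representation.
print(len(lats), len(lons))
```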
NetCDF creation code:
from pathlib import Path
import numpy as np
from scipy.io import netcdf_file

# Output directory
OUTPUT_DIR = Path("../../examples/external_data")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Grid specification (matches external_sweep.josh)
LAT_MIN, LAT_MAX = 33.7, 34.0
LON_MIN, LON_MAX = -116.4, -115.4

# Resolution: ~0.01 degrees (~1km)
RESOLUTION = 0.01
lats = np.arange(LAT_MIN, LAT_MAX + RESOLUTION, RESOLUTION)
lons = np.arange(LON_MIN, LON_MAX + RESOLUTION, RESOLUTION)

# Single timestep - required by Josh preprocessing
times = np.array([2025.0])

print(f"Grid size: {len(lats)} x {len(lons)} = {len(lats) * len(lons)} cells")

def create_netcdf_with_time(filepath, data, lats, lons, times):
    """Create a NetCDF file with soil_quality variable and time dimension."""
    # Add time dimension to data (shape: time x lat x lon)
    data_3d = data[np.newaxis, :, :]
    
    with netcdf_file(str(filepath), 'w') as f:
        # Create dimensions
        f.createDimension('time', len(times))
        f.createDimension('lat', len(lats))
        f.createDimension('lon', len(lons))
        
        # Time variable
        time_var = f.createVariable('time', 'f8', ('time',))
        time_var[:] = times
        time_var.units = 'years'
        
        # Coordinate variables
        lat_var = f.createVariable('lat', 'f8', ('lat',))
        lat_var[:] = lats
        lat_var.units = 'degrees_north'
        
        lon_var = f.createVariable('lon', 'f8', ('lon',))
        lon_var[:] = lons
        lon_var.units = 'degrees_east'
        
        # Data variable
        soil_var = f.createVariable('soil_quality', 'f4', ('time', 'lat', 'lon'))
        soil_var[:] = data_3d
        soil_var.units = 'percent'

# Create meshgrid for calculations
lon_grid, lat_grid = np.meshgrid(lons, lats)

# Pattern 1: Gradient (west-to-east)
gradient_data = ((lon_grid - LON_MIN) / (LON_MAX - LON_MIN)) * 100
create_netcdf_with_time(OUTPUT_DIR / "soil_quality_gradient.nc", gradient_data, lats, lons, times)

# Pattern 2: Triangle (center peak)
center_lon = (LON_MIN + LON_MAX) / 2
center_lat = (LAT_MIN + LAT_MAX) / 2
max_dist = np.sqrt((LON_MAX - center_lon)**2 + (LAT_MAX - center_lat)**2)
dist_from_center = np.sqrt((lon_grid - center_lon)**2 + (lat_grid - center_lat)**2)
triangle_data = np.maximum(0, (1 - dist_from_center / max_dist)) * 100
create_netcdf_with_time(OUTPUT_DIR / "soil_quality_triangle.nc", triangle_data, lats, lons, times)

# Pattern 3: Stripes (alternating bands)
stripe_width = 0.1  # degrees longitude
stripe_index = ((lon_grid - LON_MIN) / stripe_width).astype(int)
stripes_data = np.where(stripe_index % 2 == 0, 80.0, 20.0)
create_netcdf_with_time(OUTPUT_DIR / "soil_quality_stripes.nc", stripes_data, lats, lons, times)

print(f"Created 3 NetCDF files in {OUTPUT_DIR}")
Grid size: 31 x 102 = 3162 cells
Created 3 NetCDF files in ../../examples/external_data

Visualize Input Patterns

import matplotlib.pyplot as plt
from scipy.io import netcdf_file
from pathlib import Path
import numpy as np

OUTPUT_DIR = Path("../../examples/external_data")

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

patterns = [
    ("soil_quality_gradient.nc", "Gradient (West-to-East)"),
    ("soil_quality_triangle.nc", "Triangle (Center Peak)"),
    ("soil_quality_stripes.nc", "Stripes (Alternating)"),
]

for ax, (filename, title) in zip(axes, patterns):
    with netcdf_file(str(OUTPUT_DIR / filename), 'r') as f:
        data = f.variables['soil_quality'][0, :, :].copy()  # First timestep
        lats = f.variables['lat'][:].copy()
        lons = f.variables['lon'][:].copy()
    
    im = ax.imshow(data, extent=[lons.min(), lons.max(), lats.min(), lats.max()],
                   origin='lower', cmap='YlGn', vmin=0, vmax=100, aspect='equal')
    ax.set_xlabel('Longitude')
    ax.set_ylabel('Latitude')
    ax.set_title(title)

plt.colorbar(im, ax=axes, label='Soil Quality (%)', shrink=0.8)
plt.tight_layout()
plt.show()
Figure 1: Synthetic soil quality patterns for preprocessing

Step 2: Preprocess NetCDF with JoshCLI

The cli.preprocess() method converts geospatial data to Josh’s .jshd format. It requires a reference simulation to define the target grid.

Reference Simulation

The preprocessing command needs a Josh simulation file to know the target grid:

from pathlib import Path

SOURCE_PATH = Path("../../examples/external_sweep.josh")
print(SOURCE_PATH.read_text())
# External data sweep simulation - tree growth affected by soil quality
# Demonstrates using external geospatial data in Josh simulations
#
# The `external soil_quality` expression reads preprocessed .jshd data
# that varies spatially across the grid. Tree growth rate scales with
# soil quality, creating visible spatial patterns in the output.

start simulation Main

  # Grid extent for the tutorial area
  # Note: grid.low = northwest corner (higher lat, lower lon)
  #       grid.high = southeast corner (lower lat, higher lon)
  grid.size = 1000 m
  grid.low = 34.0 degrees latitude, -116.4 degrees longitude
  grid.high = 33.7 degrees latitude, -115.4 degrees longitude
  grid.patch = "Default"

  # 10 timesteps to observe growth patterns
  steps.low = 0 count
  steps.high = 10 count

  # Output exports to files (run_hash passed as custom-tag by joshpy)
  exportFiles.patch = "file:///tmp/external_sweep_{run_hash}_{replicate}.csv"

end simulation

start patch Default

  # Create trees in each patch
  ForeverTree.init = create 10 count of ForeverTree

  # Read soil quality from external preprocessed data
  # This value varies spatially based on the .jshd file content
  soil_quality.step = external soil_quality

  # Export patch-level averages
  export.average_height.step = mean(ForeverTree.height)
  export.average_age.step = mean(ForeverTree.age)
  export.soil_quality.step = soil_quality

end patch

start organism ForeverTree

  age.init = 0 year
  age.step = prior.age + 1 year

  height.init = 0 meters

  # Growth rate scales with soil quality from external data
  # 0% soil quality -> 0 meters max growth
  # 100% soil quality -> 10 meters max growth
  max_growth.step = map here.soil_quality from [0 percent, 100 percent] to [0 meters, 10 meters] linear

  # Actual growth is random up to max_growth
  height.step = prior.height + sample uniform from 0 meters to max_growth

end organism

start unit year

  alias years
  alias yr
  alias yrs

end unit

Preprocess Each Pattern

The name a .jshd file is registered under determines how Josh resolves external references. For example, passing --data soil_quality=soil_quality_gradient.jshd makes soil_quality_gradient.jshd available as external soil_quality.
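The mapping is a simple name-to-path pair; the helper below is purely illustrative (parse_data_arg is not a joshpy API) and just shows how a --data argument decomposes:

```python
from pathlib import Path

def parse_data_arg(arg: str) -> tuple[str, Path]:
    """Split a --data style 'name=path' pair (illustrative helper, not
    a joshpy API): the name before '=' is what `external <name>`
    resolves; the path after it is the .jshd file to load."""
    name, path = arg.split("=", 1)
    return name, Path(path)

name, path = parse_data_arg("soil_quality=soil_quality_gradient.jshd")
print(name, path.name)  # soil_quality soil_quality_gradient.jshd
```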

from pathlib import Path
from joshpy.cli import JoshCLI, NetcdfPreprocessConfig
from joshpy.jar import JarMode

# Setup CLI (auto-downloads JAR if needed)
cli = JoshCLI(josh_jar=JarMode.DEV)

SOURCE_PATH = Path("../../examples/external_sweep.josh")
DATA_DIR = Path("../../examples/external_data")

# Patterns to preprocess
patterns = ["gradient", "triangle", "stripes"]

for pattern in patterns:
    nc_file = DATA_DIR / f"soil_quality_{pattern}.nc"
    jshd_file = DATA_DIR / f"soil_quality_{pattern}.jshd"

    print(f"Preprocessing {nc_file.name}...")

    result = cli.preprocess(NetcdfPreprocessConfig(
        script=SOURCE_PATH,
        simulation="Main",
        data_file=nc_file,
        variable="soil_quality",
        units="percent",
        output=jshd_file,
        x_coord="lon",
        y_coord="lat",
        time_coord="time",
        parallel=True,  # Enable parallel processing for faster preprocessing
    ))

    if result.success:
        print(f"  -> {jshd_file.name} created successfully")
    else:
        raise RuntimeError(f"Preprocessing failed for {nc_file.name}: {result.stderr}")
Preprocessing soil_quality_gradient.nc...
  -> soil_quality_gradient.jshd created successfully
Preprocessing soil_quality_triangle.nc...
  -> soil_quality_triangle.jshd created successfully
Preprocessing soil_quality_stripes.nc...
  -> soil_quality_stripes.jshd created successfully
Tip: Parallel Preprocessing

The parallel=True option enables parallel processing of grid patches, providing approximately Nx speedup on machines with N CPU cores. This is especially beneficial for large grids or when preprocessing many files. For small grids like this tutorial example, the overhead may outweigh the benefit, but for production-scale data it can significantly reduce preprocessing time.
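The "approximately Nx" figure is the fully parallel ideal; any serial portion of the work (I/O, grid setup) caps the achievable speedup, per Amdahl's law. A back-of-envelope check (the 10% serial fraction is an illustrative assumption, not a measured value):

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Upper bound on speedup when a fraction of the work stays serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

# Fully parallel work approaches the ideal Nx speedup...
print(round(amdahl_speedup(0.0, 8), 1))  # 8.0
# ...but even 10% serial work caps 8 cores well below 8x
print(round(amdahl_speedup(0.1, 8), 1))  # 4.7
```

This is also why small grids like the tutorial example may see no benefit at all: fixed overhead dominates the parallelizable portion.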

NetcdfPreprocessConfig Options

| Parameter | Description |
|-----------|-------------|
| script | Josh simulation file (defines target grid) |
| simulation | Name of simulation to use |
| data_file | Input NetCDF file (.nc, .nc4, .netcdf) |
| variable | NetCDF variable name to extract |
| units | Units for the data |
| output | Output .jshd file path |
| x_coord | Name of X/longitude dimension (default: "lon") |
| y_coord | Name of Y/latitude dimension (default: "lat") |
| time_coord | Name of time dimension (default: "time") |
| timestep | Extract specific time slice (optional) |
| amend | Add to existing .jshd file (default: False) |
| crs | Coordinate reference system |
| parallel | Enable parallel processing (~Nx speedup on N cores) |

Step 3: Verify with cli.inspect_jshd()

Use cli.inspect_jshd() to verify that preprocessing worked correctly by spot-checking values at known coordinates.

from joshpy.cli import JoshCLI, InspectJshdConfig
from joshpy.jar import JarMode
from pathlib import Path

cli = JoshCLI(josh_jar=JarMode.DEV)
DATA_DIR = Path("../../examples/external_data")

# Test coordinates (grid space, not lat/lon)
# Note: the .jshd grid is defined by the simulation's 1000 m cells
# (roughly 94 x 35), not by the 31 x 102 input grid, so probes past
# the east edge can fall out of range
test_points = [
    (0, 15, "West edge"),
    (50, 15, "Center"),
    (100, 15, "East edge"),
]

print("Gradient pattern (should increase west-to-east):")
for x, y, label in test_points:
    result = cli.inspect_jshd(InspectJshdConfig(
        jshd_file=DATA_DIR / "soil_quality_gradient.jshd",
        variable="data",  # Preprocessed data stored as 'data'
        timestep=0,
        x=x,
        y=y,
    ))
    if result.success:
        value = result.stdout.strip()
        print(f"  ({x:3d}, {y:2d}) {label}: {value}")
    else:
        print(f"  ({x:3d}, {y:2d}) {label}: ERROR - {result.stderr[:80]}")

print("\nStripes pattern (should alternate 20/80):")
for x in [0, 10, 20, 30, 40]:
    result = cli.inspect_jshd(InspectJshdConfig(
        jshd_file=DATA_DIR / "soil_quality_stripes.jshd",
        variable="data",
        timestep=0,
        x=x,
        y=15,
    ))
    if result.success:
        value = result.stdout.strip()
        print(f"  x={x:3d}: {value}")
Gradient pattern (should increase west-to-east):
  (  0, 15) West edge: Value at (0, 15, 0): 1 percent
  ( 50, 15) Center: Value at (50, 15, 0): 55 percent
  (100, 15) East edge: ERROR - No value found at coordinates (100, 15) for timestep 0 in variable 'data'


Stripes pattern (should alternate 20/80):
  x=  0: Value at (0, 15, 0): 80 percent
  x= 10: Value at (10, 15, 0): 20 percent
  x= 20: Value at (20, 15, 0): 80 percent
  x= 30: Value at (30, 15, 0): 20 percent
  x= 40: Value at (40, 15, 0): 80 percent
Note: Variable Name in JSHD Files

When inspecting .jshd files, the data is stored under the generic name data, not the original variable name from the NetCDF file. However, when Josh resolves external soil_quality, it uses the name assigned via --data, not the internal variable name.

Step 4: Visual Verification with plot_jshd()

For a more comprehensive check, use load_jshd() and plot_jshd() to load the entire JSHD contents and visualize them. This is especially useful when debugging preprocessing issues or verifying spatial patterns.

import matplotlib.pyplot as plt
from pathlib import Path
from joshpy import JoshCLI, load_jshd, plot_jshd
from joshpy.jar import JarMode

cli = JoshCLI(josh_jar=JarMode.DEV)
DATA_DIR = Path("../../examples/external_data")

patterns = ["gradient", "triangle", "stripes"]

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

for ax, pattern in zip(axes, patterns):
    jshd_path = DATA_DIR / f"soil_quality_{pattern}.jshd"
    
    # Load JSHD data
    data = load_jshd(cli, jshd_path)
    arr = data.to_array(timestep=0)
    
    # Plot
    im = ax.imshow(arr, origin="lower", cmap="YlGn", aspect="equal")
    ax.set_xlabel("X (grid index)")
    ax.set_ylabel("Y (grid index)")
    ax.set_title(f"{pattern.title()}\n({data.metadata.width}×{data.metadata.height}, "
                 f"range: {arr.min():.0f}–{arr.max():.0f})")

plt.colorbar(im, ax=axes, label="Soil Quality (%)", shrink=0.8)
plt.tight_layout()
plt.show()
Figure 2: Visual verification of preprocessed JSHD files showing all three spatial patterns

The visualization confirms that preprocessing preserved the spatial patterns:

  • Gradient: Values increase from left (low) to right (high)
  • Triangle: Values peak in the center and decrease toward edges
  • Stripes: Alternating bands of low (20%) and high (80%) values

Single File Inspection

For detailed inspection of a single file, use plot_jshd() directly:

data = load_jshd(cli, DATA_DIR / "soil_quality_gradient.jshd")

# Print metadata
print(f"Grid: {data.metadata.width} × {data.metadata.height}")
print(f"Timesteps: {data.metadata.num_timesteps}")
print(f"Units: {data.metadata.units}")

# Create debug plot with full metadata
fig = plot_jshd(data, timestep=0, cmap="YlGn", title="Gradient Pattern - JSHD Debug View")
Grid: 94 × 35
Timesteps: 11
Units: percent
Figure 3: Detailed JSHD inspection with metadata

Step 5: Preprocessing GeoTIFF/COG Files

GeoTIFF and Cloud-Optimized GeoTIFF (COG) files are common for satellite imagery, elevation models, and land cover data. Use GeotiffPreprocessConfig for these files.

Important: Required timestep Parameter

GeoTIFF files have no time dimension, so you must specify timestep to indicate which simulation step the data maps to. For initial conditions, use timestep=0.

Create Synthetic GeoTIFF

# This example requires rasterio - shown for reference
import numpy as np
import rasterio
from rasterio.transform import from_bounds

DATA_DIR = Path("../../examples/external_data")

# Grid specification (same as NetCDF examples)
LAT_MIN, LAT_MAX = 33.7, 34.0
LON_MIN, LON_MAX = -116.4, -115.4

# Create elevation data - simple gradient from west to east
width, height = 102, 31
elevation = np.tile(np.linspace(100, 500, width), (height, 1)).astype(np.float32)

# Write GeoTIFF
transform = from_bounds(LON_MIN, LAT_MIN, LON_MAX, LAT_MAX, width, height)
with rasterio.open(
    DATA_DIR / "elevation.tif",
    'w',
    driver='GTiff',
    height=height,
    width=width,
    count=1,
    dtype=elevation.dtype,
    crs='EPSG:4326',
    transform=transform,
) as dst:
    dst.write(elevation, 1)

print(f"Created elevation.tif ({width}x{height})")

Preprocess GeoTIFF

from joshpy.cli import JoshCLI, GeotiffPreprocessConfig
from joshpy.jar import JarMode
from pathlib import Path

cli = JoshCLI(josh_jar=JarMode.DEV)
SOURCE_PATH = Path("../../examples/external_sweep.josh")
DATA_DIR = Path("../../examples/external_data")

result = cli.preprocess(GeotiffPreprocessConfig(
    script=SOURCE_PATH,
    simulation="Main",
    data_file=DATA_DIR / "elevation.tif",
    band=0,                    # Band index (0-based)
    units="meters",
    output=DATA_DIR / "elevation.jshd",
    timestep=0,                # Required: maps to simulation step 0
    crs="EPSG:4326",           # Specify if not embedded in TIF
))

if result.success:
    print("GeoTIFF preprocessed successfully!")
else:
    print(f"Error: {result.stderr}")

GeotiffPreprocessConfig Options

| Parameter | Description |
|-----------|-------------|
| script | Josh simulation file (defines target grid) |
| simulation | Name of simulation to use |
| data_file | Input GeoTIFF file (.tif, .tiff) |
| band | Band index to extract (0-based) |
| units | Units for the data |
| output | Output .jshd file path |
| timestep | Required: simulation timestep this data maps to |
| amend | Add to existing .jshd file (default: False) |
| crs | Coordinate reference system (if not embedded in file) |
| parallel | Enable parallel processing (~Nx speedup on N cores) |

Note: GeoTIFF Coordinates

GeoTIFF spatial coordinates are embedded in the file format itself, so x_coord and y_coord options are not needed (and will be ignored if provided).

Step 6: Preprocessing CSV Point Data

CSV files are useful for point observations like weather station data or field measurements. Use CsvPreprocessConfig for CSV files.

Warning: CSV Format Requirements

CSV files must have columns named exactly longitude and latitude. All other columns are available as variables to extract.
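Before preprocessing, it is worth confirming the header really contains those two columns. A stdlib-only sanity check (check_csv_columns is an illustrative helper, not part of joshpy) might look like:

```python
import csv
import io

REQUIRED = {"longitude", "latitude"}

def check_csv_columns(text: str) -> set[str]:
    """Return the columns usable as variables, raising if the required
    longitude/latitude columns are missing."""
    reader = csv.reader(io.StringIO(text))
    header = set(next(reader))
    missing = REQUIRED - header
    if missing:
        raise ValueError(f"CSV missing required columns: {sorted(missing)}")
    return header - REQUIRED

cols = check_csv_columns("longitude,latitude,temperature\n-116.0,33.8,25.1\n")
print(sorted(cols))  # → ['temperature']
```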

Create Synthetic CSV

import pandas as pd
import numpy as np
from pathlib import Path

DATA_DIR = Path("../../examples/external_data")
DATA_DIR.mkdir(parents=True, exist_ok=True)

# Create weather station data
np.random.seed(42)
n_stations = 50

# Random station locations within the grid
lons = np.random.uniform(-116.4, -115.4, n_stations)
lats = np.random.uniform(33.7, 34.0, n_stations)

# Temperature increases with latitude (north is warmer in this example)
temperature = 20 + (lats - 33.7) * 30 + np.random.normal(0, 2, n_stations)

# Precipitation decreases west to east
precipitation = 500 - (lons + 116.4) * 400 + np.random.normal(0, 20, n_stations)
precipitation = np.clip(precipitation, 50, 800)

df = pd.DataFrame({
    'longitude': lons,
    'latitude': lats,
    'temperature': temperature.round(1),
    'precipitation': precipitation.round(0),
})

csv_path = DATA_DIR / "weather_stations.csv"
df.to_csv(csv_path, index=False)

print(f"Created {csv_path.name} with {len(df)} stations")
print(df.head())
Created weather_stations.csv with 50 stations
    longitude   latitude  temperature  precipitation
0 -116.025460  33.990875         28.9          330.0
1 -115.449286  33.932540         26.4          108.0
2 -115.668006  33.981850         28.6          209.0
3 -115.801342  33.968448         24.1          250.0
4 -116.243981  33.879370         24.9          407.0

Preprocess CSV

from joshpy.cli import JoshCLI, CsvPreprocessConfig
from joshpy.jar import JarMode
from pathlib import Path

cli = JoshCLI(josh_jar=JarMode.DEV)
SOURCE_PATH = Path("../../examples/external_sweep.josh")
DATA_DIR = Path("../../examples/external_data")

# Preprocess temperature
result = cli.preprocess(CsvPreprocessConfig(
    script=SOURCE_PATH,
    simulation="Main",
    data_file=DATA_DIR / "weather_stations.csv",
    variable="temperature",    # Column name to extract
    units="celsius",
    output=DATA_DIR / "station_temperature.jshd",
    timestep=0,                # Required: maps to simulation step 0
))

if result.success:
    print("Temperature CSV preprocessed successfully!")
else:
    print(f"Error: {result.stderr}")

# Preprocess precipitation
result = cli.preprocess(CsvPreprocessConfig(
    script=SOURCE_PATH,
    simulation="Main",
    data_file=DATA_DIR / "weather_stations.csv",
    variable="precipitation",
    units="mm",
    output=DATA_DIR / "station_precipitation.jshd",
    timestep=0,
))

if result.success:
    print("Precipitation CSV preprocessed successfully!")
else:
    print(f"Error: {result.stderr}")
Temperature CSV preprocessed successfully!
Precipitation CSV preprocessed successfully!

CsvPreprocessConfig Options

| Parameter | Description |
|-----------|-------------|
| script | Josh simulation file (defines target grid) |
| simulation | Name of simulation to use |
| data_file | Input CSV file (.csv) |
| variable | Column name to extract |
| units | Units for the data |
| output | Output .jshd file path |
| timestep | Required: simulation timestep this data maps to |
| amend | Add to existing .jshd file (default: False) |
| crs | Coordinate reference system |
| parallel | Enable parallel processing (~Nx speedup on N cores) |

Warning: CSV Performance Note

CSV preprocessing uses nearest-neighbor interpolation with a brute-force O(n) scan per grid cell. This may be slow for large CSV files (thousands of points) or large simulation grids. For better performance with large datasets, consider:

  1. Converting CSV to GeoTIFF using tools like GDAL
  2. Using NetCDF format which supports spatial indexing
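To see why indexing matters, compare the brute-force scan against a spatial index on synthetic points. This sketch uses scipy.spatial.cKDTree purely for illustration (Josh's CSV preprocessor does not use it); both approaches return the same nearest neighbors, but the tree answers each query in O(log n) on average instead of O(n):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.uniform([-116.4, 33.7], [-115.4, 34.0], size=(500, 2))  # station lon/lat
cells = rng.uniform([-116.4, 33.7], [-115.4, 34.0], size=(200, 2))   # grid-cell centers

# Brute force: one full distance scan per grid cell
d2 = ((cells[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
brute_idx = d2.argmin(axis=1)

# KDTree: build once, then query all cells
tree = cKDTree(points)
_, tree_idx = tree.query(cells)

print(np.array_equal(brute_idx, tree_idx))
```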

Verify CSV Preprocessing

from joshpy import load_jshd, plot_jshd

# Load and visualize
data = load_jshd(cli, DATA_DIR / "station_temperature.jshd")

print(f"Grid: {data.metadata.width} × {data.metadata.height}")
print(f"Units: {data.metadata.units}")

arr = data.to_array(timestep=0)
print(f"Value range: {arr.min():.1f} to {arr.max():.1f}")

fig = plot_jshd(data, timestep=0, cmap="RdYlBu_r", 
                title="Temperature from Weather Stations (interpolated)")
Grid: 94 × 35
Units: celsius
Value range: 0.0 to 31.5

CSV point data interpolated to simulation grid

Step 7: GridSpec-Based Preprocessing

Steps 2–6 showed the low-level cli.preprocess() API where you manage output paths and file inventories yourself. GridSpec provides a higher-level workflow: you preprocess through the grid, and it automatically registers each file and writes a grid.yaml manifest. This is the recommended approach for structured projects.

Static files (auto-registered)

When you call grid.preprocess_netcdf() without a variant argument, the output goes to output_dir/{josh_name}.jshd and the file is auto-registered in grid.files:

import tempfile
from pathlib import Path
from joshpy.grid import GridSpec
from joshpy.cli import JoshCLI
from joshpy.jar import JarMode

cli = JoshCLI(josh_jar=JarMode.DEV)
DATA_DIR = Path("../../examples/external_data")

static_dir = Path(tempfile.mkdtemp()) / "static_grid"

grid = GridSpec(
    name="tutorial_static",
    output_dir=static_dir,
    size_m=1000,
    low=(34.0, -116.4),
    high=(33.7, -115.4),
    steps=10,
)

# Preprocess — auto-registers in grid.files
result = grid.preprocess_netcdf(
    cli,
    josh_name="soil_quality",
    data_file=DATA_DIR / "soil_quality_gradient.nc",
    variable="soil_quality",
    units="percent",
    x_coord="lon",
    y_coord="lat",
    time_coord="time",
)

print(f"Success: {result.success}")
print(f"Registered files: {list(grid.files.keys())}")
print(f"Entry: {grid.files['soil_quality']}")

# Save to YAML
yaml_path = grid.save()
print(f"\n--- {yaml_path.name} ---")
print(yaml_path.read_text())
Success: True
Registered files: ['soil_quality']
Entry: {'path': 'soil_quality.jshd', 'units': 'percent'}

--- grid.yaml ---
name: tutorial_static
grid:
  size_m: 1000
  low:
  - 34.0
  - -116.4
  high:
  - 33.7
  - -115.4
  steps: 10
files:
  soil_quality:
    path: soil_quality.jshd
    units: percent

The resulting grid.yaml contains both grid geometry and the file inventory. Load it later with GridSpec.from_yaml() to get file_mappings and template_vars for JobConfig.

Variant files

When data files vary by scenario (e.g., different soil patterns or climate projections), declare variant axes in the GridSpec and preprocess each value with the variant parameter. This produces the template_path entries that power variant_sweep().

variant_dir = Path(tempfile.mkdtemp()) / "variant_grid"

grid_v = GridSpec(
    name="tutorial_variant",
    output_dir=variant_dir,
    size_m=1000,
    low=(34.0, -116.4),
    high=(33.7, -115.4),
    steps=10,
    variants={
        "pattern": {
            "values": ["gradient", "triangle", "stripes"],
            "default": "gradient",
        }
    },
    files={
        "soil_quality": {
            "template_path": "soil_quality_{pattern}.jshd",
            "units": "percent",
        }
    },
)

# Preprocess each variant value
for pattern in ["gradient", "triangle", "stripes"]:
    result = grid_v.preprocess_netcdf(
        cli,
        josh_name="soil_quality",
        variant={"pattern": pattern},
        data_file=DATA_DIR / f"soil_quality_{pattern}.nc",
        variable="soil_quality",
        units="percent",
        x_coord="lon",
        y_coord="lat",
        time_coord="time",
    )
    print(f"  {pattern}: {'OK' if result.success else 'FAIL'}")

yaml_path = grid_v.save()
print(f"\n--- {yaml_path.name} ---")
print(yaml_path.read_text())
  gradient: OK
  triangle: OK
  stripes: OK

--- grid.yaml ---
name: tutorial_variant
grid:
  size_m: 1000
  low:
  - 34.0
  - -116.4
  high:
  - 33.7
  - -115.4
  steps: 10
variants:
  pattern:
    values:
    - gradient
    - triangle
    - stripes
    default: gradient
files:
  soil_quality:
    template_path: soil_quality_{pattern}.jshd
    units: percent

Loading back

Once saved, the YAML is the single source of truth for grid geometry and data inventory:

loaded = GridSpec.from_yaml(yaml_path)

print(f"Grid: {loaded.name}")
print(f"Template vars: {loaded.template_vars}")
print(f"\nDefault file_mappings:")
for name, path in loaded.file_mappings.items():
    print(f"  {name}: {path.name}")

print(f"\nfile_mappings_for(pattern='stripes'):")
for name, path in loaded.file_mappings_for(pattern="stripes").items():
    print(f"  {name}: {path.name}")
Grid: tutorial_variant
Template vars: {'size_m': 1000, 'low_lat': 34.0, 'low_lon': -116.4, 'high_lat': 33.7, 'high_lon': -115.4, 'steps': 10}

Default file_mappings:
  soil_quality: soil_quality_gradient.jshd

file_mappings_for(pattern='stripes'):
  soil_quality: soil_quality_stripes.jshd
Tip: When to Use Which

| Approach | When to use |
|----------|-------------|
| cli.preprocess() | Quick one-off preprocessing, scripts, or non-grid workflows |
| grid.preprocess_*() | Structured projects where you want a grid.yaml manifest |

Both produce identical .jshd files – the difference is whether the inventory is tracked in a YAML manifest or managed manually.

Summary

In this tutorial, we:

  1. Created synthetic test data with clear spatial patterns using scipy.io.netcdf_file
  2. Preprocessed NetCDF to .jshd using NetcdfPreprocessConfig
  3. Preprocessed GeoTIFF using GeotiffPreprocessConfig (with required timestep)
  4. Preprocessed CSV point data using CsvPreprocessConfig (with required timestep)
  5. Verified results using cli.inspect_jshd() and plot_jshd()
  6. Built a GridSpec manifest using grid.preprocess_netcdf() and grid.save()

The preprocessed .jshd files are now ready for use in simulations. The next tutorial, Sweeping Over External Data, demonstrates how to run parameter sweeps over multiple external data files. For the GridSpec-based variant sweep workflow, see GridSpec Variant Sweeps.

Format Comparison

| Format | Config Class | timestep | Coordinates | Best For |
|--------|--------------|----------|-------------|----------|
| NetCDF | NetcdfPreprocessConfig | Optional | Via x_coord, y_coord, time_coord | Climate data with time series |
| GeoTIFF | GeotiffPreprocessConfig | Required | Embedded in file | Raster imagery, DEMs |
| CSV | CsvPreprocessConfig | Required | Must be longitude, latitude columns | Point observations |

Files Created

| File | Description |
|------|-------------|
| examples/external_data/soil_quality_gradient.nc | Synthetic NetCDF - west-to-east gradient |
| examples/external_data/soil_quality_triangle.nc | Synthetic NetCDF - center peak |
| examples/external_data/soil_quality_stripes.nc | Synthetic NetCDF - alternating stripes |
| examples/external_data/soil_quality_gradient.jshd | Preprocessed - gradient pattern |
| examples/external_data/soil_quality_triangle.jshd | Preprocessed - triangle pattern |
| examples/external_data/soil_quality_stripes.jshd | Preprocessed - stripes pattern |
| examples/external_data/weather_stations.csv | Synthetic CSV - station observations |
| examples/external_data/station_temperature.jshd | Preprocessed - temperature from stations |
| examples/external_data/station_precipitation.jshd | Preprocessed - precipitation from stations |

Tip: Grid Alignment & JSHD Reuse

JSHD files are tied to specific grid definitions. If you change grid.size, grid.low, or grid.high in your simulation, you must regenerate the JSHD files. See Best Practices: Preprocessing & Grid Alignment for compatibility rules and directory organization patterns.
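One lightweight guard against stale files is recording the grid parameters each .jshd was built against and comparing them before reuse. The sidecar-file scheme below is an illustrative sketch, not a joshpy feature:

```python
import json
from pathlib import Path

def grid_signature(size_m: float, low: tuple, high: tuple) -> str:
    """Serialize the grid parameters a .jshd file was built against."""
    return json.dumps({"size_m": size_m, "low": low, "high": high}, sort_keys=True)

def needs_regeneration(sig_file: Path, current_sig: str) -> bool:
    """True if no signature was recorded or the grid has since changed."""
    return not sig_file.exists() or sig_file.read_text() != current_sig

sig = grid_signature(1000, (34.0, -116.4), (33.7, -115.4))
marker = Path("soil_quality.jshd.gridsig")  # illustrative sidecar path
print(needs_regeneration(marker, sig))  # True until the signature is recorded
```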

Learn More