Sweeping Over External Data Files

Use FileSweepParameter to run simulations with different input data

Introduction

joshpy’s FileSweepParameter enables sweeping over external data files, running one simulation job per input file. This is similar to sweeping over configuration parameters, but the variation comes from .jshd data files rather than .jshc configuration values.

This tutorial demonstrates:

Using the external keyword in Josh to reference preprocessed data
Configuring a file sweep with FileSweepParameter
Running with SweepManager and CartesianStrategy
Visualizing how different input patterns produce different outputs

Prerequisites:

Complete Preprocessing External Data first to create the .jshd files
joshpy installed with pip install joshpy[all]

The Simulation

Our simulation reads soil_quality from external data and uses it to set tree growth rates. Trees in patches with high soil quality grow faster, on average, than trees in low-quality patches.

View Simulation File (external_sweep.josh)


from pathlib import Path

SOURCE_PATH = Path("../../examples/external_sweep.josh")
print(SOURCE_PATH.read_text())

# External data sweep simulation - tree growth affected by soil quality
# Demonstrates using external geospatial data in Josh simulations
#
# The `external soil_quality` expression reads preprocessed .jshd data
# that varies spatially across the grid. Tree growth rate scales with
# soil quality, creating visible spatial patterns in the output.

start simulation Main

  # Grid extent for the tutorial area
  # Note: grid.low = northwest corner (higher lat, lower lon)
  #       grid.high = southeast corner (lower lat, higher lon)
  grid.size = 1000 m
  grid.low = 34.0 degrees latitude, -116.4 degrees longitude
  grid.high = 33.7 degrees latitude, -115.4 degrees longitude
  grid.patch = "Default"

  # 10 timesteps to observe growth patterns
  steps.low = 0 count
  steps.high = 10 count

  # Output exports to files (run_hash passed as custom-tag by joshpy)
  exportFiles.patch = "file:///tmp/external_sweep_{run_hash}_{replicate}.csv"

end simulation

start patch Default

  # Create trees in each patch
  ForeverTree.init = create 10 count of ForeverTree

  # Read soil quality from external preprocessed data
  # This value varies spatially based on the .jshd file content
  soil_quality.step = external soil_quality

  # Export patch-level averages
  export.average_height.step = mean(ForeverTree.height)
  export.average_age.step = mean(ForeverTree.age)
  export.soil_quality.step = soil_quality

end patch

start organism ForeverTree

  age.init = 0 year
  age.step = prior.age + 1 year

  height.init = 0 meters

  # Growth rate scales with soil quality from external data
  # 0% soil quality -> 0 meters max growth
  # 100% soil quality -> 10 meters max growth
  max_growth.step = map here.soil_quality from [0 percent, 100 percent] to [0 meters, 10 meters] linear

  # Actual growth is random up to max_growth
  height.step = prior.height + sample uniform from 0 meters to max_growth

end organism

start unit year

  alias years
  alias yr
  alias yrs

end unit

Key Concepts

The external keyword:

soil_quality.step = external soil_quality

This reads the soil_quality variable from a preprocessed .jshd file. The value varies spatially across the grid based on the file content.

Accessing patch data from organisms:

max_growth.step = map here.soil_quality from [0 percent, 100 percent] to [0 meters, 10 meters] linear

The here keyword lets organisms access variables from their containing patch. The maximum growth per step scales linearly with soil quality; the actual growth is a uniform random draw between 0 and that maximum.
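To make the growth rule concrete, the arithmetic behind the map ... linear and sample uniform expressions can be sketched in plain Python. This only illustrates the formula and is not how Josh evaluates it; linear_map and grow are hypothetical helper names, and any clamping Josh may apply to out-of-range inputs is not modeled.

```python
import random


def linear_map(value, src_low, src_high, dst_low, dst_high):
    """Linearly rescale value from [src_low, src_high] to [dst_low, dst_high]."""
    fraction = (value - src_low) / (src_high - src_low)
    return dst_low + fraction * (dst_high - dst_low)


def grow(prior_height, soil_quality_percent, rng):
    """One growth step: add a uniform draw between 0 and max_growth meters."""
    max_growth = linear_map(soil_quality_percent, 0.0, 100.0, 0.0, 10.0)
    return prior_height + rng.uniform(0.0, max_growth)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
for quality in (0.0, 50.0, 100.0):
    height = 0.0
    for _ in range(10):
        height = grow(height, quality, rng)
    max_growth = linear_map(quality, 0.0, 100.0, 0.0, 10.0)
    print(f"soil_quality={quality:5.1f}%  max_growth={max_growth:4.1f} m  "
          f"height after 10 steps={height:.2f} m")
```

Because the draw is uniform, the expected growth per step is half of max_growth, so a patch at 50 percent soil quality gains about 2.5 meters per step on average.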

Step 1: Configure the File Sweep

We use FileSweepParameter to sweep over three preprocessed .jshd files:

from pathlib import Path
from joshpy.jobs import JobConfig, SweepConfig, FileSweepParameter
from joshpy.strategies import CartesianStrategy

SOURCE_PATH = Path("../../examples/external_sweep.josh")
TEMPLATE_PATH = Path("../../examples/templates/external_config.jshc.j2")
DATA_DIR = Path("../../examples/external_data")

config = JobConfig(
    template_path=TEMPLATE_PATH,
    source_path=SOURCE_PATH,
    simulation="Main",
    replicates=3,
    sweep=SweepConfig(
        file_parameters=[
            FileSweepParameter(
                name="soil_quality",  # Matches `external soil_quality` in Josh
                paths=[
                    DATA_DIR / "soil_quality_gradient.jshd",
                    DATA_DIR / "soil_quality_triangle.jshd",
                    DATA_DIR / "soil_quality_stripes.jshd",
                ],
            ),
        ],
        strategy=CartesianStrategy(),
    ),
)

print(f"Sweep creates {len(config.sweep.file_parameters[0].paths)} jobs")

Sweep creates 3 jobs

print(f"Each job runs {config.replicates} replicates")

Each job runs 3 replicates

print(f"Total simulation runs: {len(config.sweep.file_parameters[0].paths) * config.replicates}")

Total simulation runs: 9
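The counts above follow from how a Cartesian sweep combines parameters: one job per combination of values, with each job repeated for every replicate. A minimal sketch of that arithmetic using itertools.product, assuming CartesianStrategy takes the full cross product as its name suggests. The second parameter (elevation) and its file names are hypothetical and only show how the count would grow.

```python
from itertools import product

# File names used in this tutorial.
soil_quality = [
    "soil_quality_gradient.jshd",
    "soil_quality_triangle.jshd",
    "soil_quality_stripes.jshd",
]
replicates = 3

# One file parameter: one job per file.
jobs = list(product(soil_quality))
print(f"{len(jobs)} jobs, {len(jobs) * replicates} runs")

# Hypothetical second file parameter: jobs multiply.
elevation = ["elevation_flat.jshd", "elevation_hilly.jshd"]
jobs_2d = list(product(soil_quality, elevation))
print(f"{len(jobs_2d)} jobs, {len(jobs_2d) * replicates} runs")
```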

FileSweepParameter Details

Attribute	Description
`name`	Key used in `file_mappings` and passed to `--data name=path`
`paths`	List of `.jshd` file paths to sweep over
`labels`	Auto-derived from filename stems (e.g., `soil_quality_gradient.jshd` -> `soil_quality_gradient`)

The name must match what Josh expects when resolving external soil_quality. The Josh runtime looks for a file mapping with this name.
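The relationship between name, paths, and labels in the table can be sketched with pathlib alone. This is an illustration of the naming convention described above, not joshpy's internal code; the variable names are hypothetical.

```python
from pathlib import Path

name = "soil_quality"  # must match `external soil_quality` in the Josh source
data_dir = Path("../../examples/external_data")
paths = [
    data_dir / "soil_quality_gradient.jshd",
    data_dir / "soil_quality_triangle.jshd",
    data_dir / "soil_quality_stripes.jshd",
]

# Labels are the filename stems (filename without the .jshd suffix).
labels = [path.stem for path in paths]

# Each job pairs the parameter name with one path, in name=path form.
data_args = [f"{name}={path.as_posix()}" for path in paths]

for label, arg in zip(labels, data_args):
    print(f"{label}: --data {arg}")
```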

Step 2: Build SweepManager

from joshpy.sweep import SweepManager
from joshpy.cli import JoshCLI
from joshpy.jar import JarMode

REGISTRY_PATH = "external_sweep.duckdb"

manager = (
    SweepManager.builder(config)
    .with_registry(REGISTRY_PATH, experiment_name="soil_quality_sweep")
    .with_cli(JoshCLI(josh_jar=JarMode.DEV))
    .build()
)

print(f"Session ID: {manager.session_id}")

Session ID: 505798c2-aab1-48f7-9250-3a3eca445e54

print(f"Total jobs: {manager.job_set.total_jobs}")

Total jobs: 3

print(f"Total replicates: {manager.job_set.total_replicates}")

Total replicates: 9

Job Labels

Each job gets a label derived from the file stem:

for job in manager.job_set.jobs:
    print(f"Job {job.run_hash[:8]}: soil_quality={job.custom_tags.get('soil_quality', 'N/A')}")

Job 09dabd26: soil_quality=soil_quality_gradient
Job 76863b0b: soil_quality=soil_quality_triangle
Job cf6c3df2: soil_quality=soil_quality_stripes

Step 3: Run the Sweep

results = manager.run()

Running 3 jobs (9 total replicates)
[1/3] Running (local): {'soil_quality': 'soil_quality_gradient'}
  [OK] Completed successfully
[2/3] Running (local): {'soil_quality': 'soil_quality_triangle'}
  [OK] Completed successfully
[3/3] Running (local): {'soil_quality': 'soil_quality_stripes'}
  [OK] Completed successfully
Completed: 3 succeeded, 0 failed

print("\nSweep complete!")


Sweep complete!

print(f"Succeeded: {results.succeeded}")

Succeeded: 3

print(f"Failed: {results.failed}")

Failed: 0


# Fail the tutorial if any jobs failed - include actual error details
if results.failed > 0:
    errors = []
    for job, result in results:
        if not result.success:
            error_msg = result.stderr.strip() if result.stderr else "No error message"
            errors.append(f"Job {job.run_hash}: {error_msg[:500]}")
    error_detail = "\n".join(errors)
    raise RuntimeError(f"Sweep failed: {results.failed} job(s) failed\n\n{error_detail}")

Step 4: Load Results

manager.load_results()

Loading patch results from: /tmp/external_sweep_{run_hash}_{replicate}.csv
  Loaded 34782 rows from external_sweep_09dabd260a36_0.csv
  Loaded 34782 rows from external_sweep_09dabd260a36_1.csv
  Loaded 34782 rows from external_sweep_09dabd260a36_2.csv
  Loaded 34782 rows from external_sweep_76863b0b149a_0.csv
  Loaded 34782 rows from external_sweep_76863b0b149a_1.csv
  Loaded 34782 rows from external_sweep_76863b0b149a_2.csv
  Loaded 34782 rows from external_sweep_cf6c3df21ea0_0.csv
  Loaded 34782 rows from external_sweep_cf6c3df21ea0_1.csv
  Loaded 34782 rows from external_sweep_cf6c3df21ea0_2.csv

Results:
  Jobs in sweep: 3
  Jobs with results loaded: 3
  Total rows loaded: 313038
313038
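The file names and the row total in this output can be checked by hand. The sketch below expands the export path template for every run hash and replicate shown above and multiplies out the rows; str.format is only a stand-in for whatever substitution Josh and joshpy actually perform.

```python
# Export path template from the simulation file (without the file:// prefix).
template = "/tmp/external_sweep_{run_hash}_{replicate}.csv"

# Values copied from the output above.
run_hashes = ["09dabd260a36", "76863b0b149a", "cf6c3df21ea0"]
replicates = range(3)
rows_per_file = 34782

# One CSV per (job, replicate) pair.
files = [
    template.format(run_hash=run_hash, replicate=replicate)
    for run_hash in run_hashes
    for replicate in replicates
]
total_rows = rows_per_file * len(files)

print(f"{len(files)} files, {total_rows} rows")
```

Nine files of 34782 rows each give the 313038 rows reported by load_results().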

# Verify data loaded
summary = manager.registry.get_data_summary()
print(summary)

Registry Data Summary
========================================
Sessions: 1
Configs:  3
Runs:     3
Rows:     313,038

Variables: average_age, average_height, soil_quality
Entity types: patch
Parameters: soil_quality
Steps: 0 - 10
Replicates: 0 - 2
Spatial extent: lon [-116.39, -115.40], lat [33.70, 34.00]

Step 5: Analysis & Visualization

Now we can analyze how different soil quality patterns affected tree growth.

Data Discovery

print("Export variables:", manager.registry.list_export_variables())

Export variables: ['average_age', 'average_height', 'soil_quality']

print("Config parameters:", manager.registry.list_config_parameters())

Config parameters: ['soil_quality']

Compare Patterns Over Time

from joshpy.diagnostics import SimulationDiagnostics

diag = SimulationDiagnostics(manager.registry)

diag.plot_comparison(
    "average_height",
    group_by="soil_quality",
    title="Tree Height by Soil Quality Pattern",
)

Figure 1: Tree height trajectories for different soil quality patterns

Final Height Comparison

diag.plot_comparison(
    "average_height",
    group_by="soil_quality",
    step=10,
    title="Final Tree Height by Soil Pattern",
)

Figure 2: Final tree height at step 10 for each soil pattern

Spatial Patterns

The key test: do the spatial patterns in tree height match the input soil quality patterns?

import matplotlib.pyplot as plt
from joshpy.cell_data import DiagnosticQueries

queries = DiagnosticQueries(manager.registry)

# Get run hashes for each pattern
patterns = ["soil_quality_gradient", "soil_quality_triangle", "soil_quality_stripes"]
pattern_jobs = {job.custom_tags.get("soil_quality"): job for job in manager.job_set.jobs}

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

for ax, pattern in zip(axes, patterns):
    job = pattern_jobs.get(pattern)
    if job is None:
        ax.set_title(f"{pattern}: Not found")
        continue
    
    df = queries.get_spatial_snapshot(
        step=10,
        variable="average_height",
        run_hash=job.run_hash,
        replicate=0,
    )
    
    if len(df) == 0:
        ax.set_title(f"{pattern}: No data")
        continue
    
    scatter = ax.scatter(
        df['longitude'], 
        df['latitude'],
        c=df['value'],  # get_spatial_snapshot returns 'value' column
        cmap='YlGn',
        s=10,
        alpha=0.8,
    )
    ax.set_xlabel('Longitude')
    ax.set_ylabel('Latitude')
    ax.set_title(pattern.replace('soil_quality_', '').title())

plt.colorbar(scatter, ax=axes, label='Avg Height (m)', shrink=0.8)


plt.tight_layout()
plt.show()

Figure 3: Spatial distribution of tree height for each soil quality pattern

Verify Soil Quality Export

We also exported soil_quality to verify the external data was read correctly:

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

for ax, pattern in zip(axes, patterns):
    job = pattern_jobs.get(pattern)
    if job is None:
        continue
    
    df = queries.get_spatial_snapshot(
        step=5,  # Mid-simulation
        variable="soil_quality",
        run_hash=job.run_hash,
        replicate=0,
    )
    
    if len(df) == 0:
        continue
    
    scatter = ax.scatter(
        df['longitude'], 
        df['latitude'],
        c=df['value'],  # get_spatial_snapshot returns 'value' column
        cmap='YlGn',
        vmin=0,
        vmax=100,
        s=10,
        alpha=0.8,
    )
    ax.set_xlabel('Longitude')
    ax.set_ylabel('Latitude')
    ax.set_title(f"{pattern.replace('soil_quality_', '').title()} (Soil Quality)")

plt.colorbar(scatter, ax=axes, label='Soil Quality (%)', shrink=0.8)


plt.tight_layout()
plt.show()

Figure 4: Exported soil quality values confirm external data was read correctly
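The visual comparison can be backed by a number: the correlation between soil quality and final height across patches. The sketch below assumes you have two aligned lists of per-patch values (for example, the value columns of two get_spatial_snapshot results). Here the data is synthetic, generated from the growth rule in the simulation, and pearson and simulate_patch are hypothetical helpers.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_patch(soil_quality_percent, rng, trees=10, steps=10):
    """Mean final height of the trees in one synthetic patch."""
    max_growth = 10.0 * soil_quality_percent / 100.0
    heights = [
        sum(rng.uniform(0.0, max_growth) for _ in range(steps))
        for _ in range(trees)
    ]
    return sum(heights) / trees


rng = random.Random(0)  # fixed seed so the sketch is repeatable
soil = [float(q) for q in range(0, 101, 5)]      # synthetic soil quality values
height = [simulate_patch(q, rng) for q in soil]  # synthetic mean heights

r = pearson(soil, height)
print(f"Pearson r = {r:.3f}")
```

A correlation close to 1 indicates that the height pattern tracks the soil quality pattern; a value near 0 would suggest the external data is not reaching the growth rule.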

Direct SQL Queries

For custom analysis, query the registry directly. File parameter labels are stored in config_parameters just like config params, making them easy to query:

# Compare mean height by soil pattern
# File parameter labels are stored in config_parameters, just like config params
result = manager.registry.query("""
    SELECT 
        cp.soil_quality as pattern,
        cd.step,
        AVG(cd.average_height) as mean_height,
        STDDEV(cd.average_height) as std_height
    FROM cell_data cd
    JOIN config_parameters cp ON cd.run_hash = cp.run_hash
    GROUP BY cp.soil_quality, cd.step
    ORDER BY cp.soil_quality, cd.step
""")
result.df().head(15)

	pattern	step	mean_height	std_height
0	soil_quality_gradient	0	251.212955	153.985997
1	soil_quality_gradient	1	502.614158	299.300450
2	soil_quality_gradient	2	753.774360	444.307096
3	soil_quality_gradient	3	1006.095663	590.763974
4	soil_quality_gradient	4	1258.231998	736.401820
5	soil_quality_gradient	5	1509.590436	881.379216
6	soil_quality_gradient	6	1762.403899	1027.390708
7	soil_quality_gradient	7	2014.227007	1172.672281
8	soil_quality_gradient	8	2266.419816	1318.130916
9	soil_quality_gradient	9	2518.633282	1464.063508
10	soil_quality_gradient	10	2770.238283	1609.246545
11	soil_quality_stripes	0	253.289744	159.307646
12	soil_quality_stripes	1	505.849162	308.898449
13	soil_quality_stripes	2	759.300500	459.186458
14	soil_quality_stripes	3	1012.700934	608.837888
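To see what the join and grouping do on a small scale, the same query shape can be run against an in-memory database. This sketch uses sqlite3 from the standard library as a stand-in for the DuckDB registry; the two-table schema and all values are simplified assumptions, and STDDEV is omitted because SQLite does not provide it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE config_parameters (run_hash TEXT, soil_quality TEXT);
    CREATE TABLE cell_data (run_hash TEXT, step INTEGER, average_height REAL);

    -- Toy data: two runs, a handful of patch rows each.
    INSERT INTO config_parameters VALUES ('aaa', 'gradient'), ('bbb', 'stripes');
    INSERT INTO cell_data VALUES
        ('aaa', 0, 1.0), ('aaa', 0, 3.0), ('aaa', 1, 4.0),
        ('bbb', 0, 2.0), ('bbb', 1, 6.0), ('bbb', 1, 8.0);
""")

# Join each data row to its run's parameter label, then average per label and step.
rows = conn.execute("""
    SELECT cp.soil_quality AS pattern, cd.step, AVG(cd.average_height) AS mean_height
    FROM cell_data cd
    JOIN config_parameters cp ON cd.run_hash = cp.run_hash
    GROUP BY cp.soil_quality, cd.step
    ORDER BY cp.soil_quality, cd.step
""").fetchall()
conn.close()

for row in rows:
    print(row)
```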

Summary

This tutorial demonstrated the file sweep workflow:

external keyword - Reference preprocessed .jshd data in Josh simulations
FileSweepParameter - Define which files to sweep over
SweepManager - Orchestrate execution with automatic registry tracking
Spatial visualization - Confirm output patterns match input data

Key Differences: Config vs File Sweeps

Aspect	ConfigSweepParameter	FileSweepParameter
Variation source	Jinja template values	External data files
Storage	`job.parameters` dict	`job.file_mappings` (paths) + `job.parameters` (labels)
Josh reference	`config name.variable`	`external variable`
Hash includes	Rendered config content	File content hash
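The last row has a practical consequence: if a job's identity depends on file content, editing a .jshd file should produce a different hash even when the file name stays the same. A sketch of that idea with hashlib follows. It is not joshpy's actual hashing scheme, and content_hash is a hypothetical helper.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(path, length=12):
    """Short hex digest of a file's bytes (hypothetical helper)."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return digest[:length]


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "soil_quality_gradient.jshd"

    data_file.write_bytes(b"version 1")
    before = content_hash(data_file)

    # Same file name, different content.
    data_file.write_bytes(b"version 2")
    after = content_hash(data_file)

print(before, after, before != after)
```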

When to Use File Sweeps

Climate scenarios: Compare RCP 2.6 vs 4.5 vs 8.5 projections
Sensitivity analysis: Test different soil maps or elevation data
Ensemble runs: Use multiple realizations of stochastic inputs
Data source comparison: Compare satellite vs ground-truth inputs

GridSpec Variant Sweeps

When your grid has data files that vary by scenario (e.g., 24 monthly climate files × 3 SSP scenarios), constructing FileSweepParameter objects by hand gets tedious. GridSpec variants automate this.

Declare variant axes in your grid.yaml:

variants:
  scenario:
    values: [ssp245, ssp370, ssp585]
    default: ssp245

files:
  cover:
    path: cover.jshd                              # static — no templating
    units: percent
  futureTempJan:
    template_path: monthly/tas_{scenario}_jan.jshd # resolved per scenario
    units: K
  futurePrecipJan:
    template_path: monthly/pr_{scenario}_jan.jshd
    units: mm/year
  # ... 24 monthly files total

Then generate the entire sweep with a single call:

grid = GridSpec.from_yaml("data/grids/dev_fine/grid.yaml")

config = JobConfig(
    file_mappings=grid.file_mappings,              # defaults (ssp245)
    sweep=SweepConfig(
        compound_parameters=[
            grid.variant_sweep("scenario"),        # 24 FileSweepParameters, zipped
        ],
    ),
    ...
)

variant_sweep("scenario") finds every template_path file referencing {scenario}, builds one FileSweepParameter per file, and wraps them in a CompoundSweepParameter so all 24 files switch together per scenario. Static path files pass through unchanged.
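The expansion described here can be sketched with plain dictionaries. This is an illustration of the idea only, not GridSpec's implementation: expand_variant is a hypothetical helper, and the files dictionary mirrors the first three entries of the YAML above.

```python
files = {
    "cover": {"path": "cover.jshd"},
    "futureTempJan": {"template_path": "monthly/tas_{scenario}_jan.jshd"},
    "futurePrecipJan": {"template_path": "monthly/pr_{scenario}_jan.jshd"},
}
scenarios = ["ssp245", "ssp370", "ssp585"]


def expand_variant(files, axis, values):
    """Return {file name: [one path per value]} for files templated on axis."""
    placeholder = "{" + axis + "}"
    expanded = {}
    for name, spec in files.items():
        template = spec.get("template_path")
        if template and placeholder in template:
            expanded[name] = [template.replace(placeholder, v) for v in values]
    return expanded


per_file = expand_variant(files, "scenario", scenarios)

# Zip rather than cross: the i-th job takes the i-th path of every file,
# so all templated files switch scenario together.
jobs = [
    {name: paths[i] for name, paths in per_file.items()}
    for i in range(len(scenarios))
]

for scenario, job in zip(scenarios, jobs):
    print(scenario, job)
```

Zipping gives 3 jobs for 3 scenarios regardless of how many files are templated, whereas a cross product over the same files would grow exponentially with the file count.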

See Project Organization for setting up GridSpec with variants.

Cleanup

import os

manager.cleanup()
manager.close()

# Remove temporary registry and any WAL files
for f in [REGISTRY_PATH, f"{REGISTRY_PATH}.wal"]:
    if os.path.exists(f):
        os.remove(f)

Learn More

Preprocessing External Data - Create .jshd files from NetCDF/GeoTIFF
Manual Workflow - Step-by-step control using individual components
SweepManager Workflow - Simplified orchestration with config sweeps
Analysis & Visualization - Query and visualize any registry