Sweeping Over External Data Files

Use FileSweepParameter to run simulations with different input data

Introduction

joshpy’s FileSweepParameter enables sweeping over external data files, running one simulation job per input file. This is similar to sweeping over configuration parameters, but the variation comes from .jshd data files rather than .jshc configuration values.

This tutorial demonstrates:

  1. Using the external keyword in Josh to reference preprocessed data
  2. Configuring a file sweep with FileSweepParameter
  3. Running with SweepManager and CartesianStrategy
  4. Visualizing how different input patterns produce different outputs

Prerequisites:

The Simulation

Our simulation reads soil_quality from an external data file and uses it to scale tree growth. Trees in patches with high soil quality grow faster, on average, than those in low-quality patches.

from pathlib import Path

SOURCE_PATH = Path("../../examples/external_sweep.josh")
print(SOURCE_PATH.read_text())
# External data sweep simulation - tree growth affected by soil quality
# Demonstrates using external geospatial data in Josh simulations
#
# The `external soil_quality` expression reads preprocessed .jshd data
# that varies spatially across the grid. Tree growth rate scales with
# soil quality, creating visible spatial patterns in the output.

start simulation Main

  # Grid extent for the tutorial area
  # Note: grid.low = northwest corner (higher lat, lower lon)
  #       grid.high = southeast corner (lower lat, higher lon)
  grid.size = 1000 m
  grid.low = 34.0 degrees latitude, -116.4 degrees longitude
  grid.high = 33.7 degrees latitude, -115.4 degrees longitude
  grid.patch = "Default"

  # 10 timesteps to observe growth patterns
  steps.low = 0 count
  steps.high = 10 count

  # Output exports to files (run_hash passed as custom-tag by joshpy)
  exportFiles.patch = "file:///tmp/external_sweep_{run_hash}_{replicate}.csv"

end simulation

start patch Default

  # Create trees in each patch
  ForeverTree.init = create 10 count of ForeverTree

  # Read soil quality from external preprocessed data
  # This value varies spatially based on the .jshd file content
  soil_quality.step = external soil_quality

  # Export patch-level averages
  export.average_height.step = mean(ForeverTree.height)
  export.average_age.step = mean(ForeverTree.age)
  export.soil_quality.step = soil_quality

end patch

start organism ForeverTree

  age.init = 0 year
  age.step = prior.age + 1 year

  height.init = 0 meters

  # Growth rate scales with soil quality from external data
  # 0% soil quality -> 0 meters max growth
  # 100% soil quality -> 10 meters max growth
  max_growth.step = map here.soil_quality from [0 percent, 100 percent] to [0 meters, 10 meters] linear

  # Actual growth is random up to max_growth
  height.step = prior.height + sample uniform from 0 meters to max_growth

end organism

start unit year

  alias years
  alias yr
  alias yrs

end unit

Key Concepts

The external keyword:

soil_quality.step = external soil_quality

This reads the soil_quality variable from a preprocessed .jshd file. The value varies spatially across the grid based on the file content.

Accessing patch data from organisms:

max_growth.step = map here.soil_quality from [0 percent, 100 percent] to [0 meters, 10 meters] linear

The here keyword lets organisms access variables from their containing patch. The maximum growth per step scales linearly with soil quality; the actual growth in each step is drawn uniformly between zero and that maximum.
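To make the mapping concrete, here is a small Python sketch of what a linear map followed by a uniform draw computes. It illustrates the arithmetic only and is not how the Josh runtime is implemented; the function names linear_map and grow_one_step are hypothetical.

```python
import random


def linear_map(value, src_low, src_high, dst_low, dst_high):
    """Linearly rescale value from [src_low, src_high] to [dst_low, dst_high]."""
    fraction = (value - src_low) / (src_high - src_low)
    return dst_low + fraction * (dst_high - dst_low)


def grow_one_step(prior_height, soil_quality_percent, rng):
    """Mirror the two Josh lines: compute max_growth, then add a uniform draw."""
    max_growth = linear_map(soil_quality_percent, 0.0, 100.0, 0.0, 10.0)
    return prior_height + rng.uniform(0.0, max_growth)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
max_growth_mid = linear_map(50.0, 0.0, 100.0, 0.0, 10.0)  # 5.0 meters
height_after = grow_one_step(2.0, 50.0, rng)  # somewhere in [2.0, 7.0] meters
print(max_growth_mid, height_after)
```

At 50 percent soil quality a tree can gain at most 5 meters per step, so a 2-meter tree ends the step between 2 and 7 meters tall.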

Step 1: Configure the File Sweep

We use FileSweepParameter to sweep over three preprocessed .jshd files:

from pathlib import Path
from joshpy.jobs import JobConfig, SweepConfig, FileSweepParameter
from joshpy.strategies import CartesianStrategy

SOURCE_PATH = Path("../../examples/external_sweep.josh")
TEMPLATE_PATH = Path("../../examples/templates/external_config.jshc.j2")
DATA_DIR = Path("../../examples/external_data")

config = JobConfig(
    template_path=TEMPLATE_PATH,
    source_path=SOURCE_PATH,
    simulation="Main",
    replicates=3,
    sweep=SweepConfig(
        file_parameters=[
            FileSweepParameter(
                name="soil_quality",  # Matches `external soil_quality` in Josh
                paths=[
                    DATA_DIR / "soil_quality_gradient.jshd",
                    DATA_DIR / "soil_quality_triangle.jshd",
                    DATA_DIR / "soil_quality_stripes.jshd",
                ],
            ),
        ],
        strategy=CartesianStrategy(),
    ),
)

print(f"Sweep creates {len(config.sweep.file_parameters[0].paths)} jobs")
Sweep creates 3 jobs
print(f"Each job runs {config.replicates} replicates")
Each job runs 3 replicates
print(f"Total simulation runs: {len(config.sweep.file_parameters[0].paths) * config.replicates}")
Total simulation runs: 9
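
The job count above follows from how a Cartesian sweep expands: one job per combination of values across all swept axes, with replicates multiplying the number of runs. The sketch below reproduces that arithmetic with itertools.product. It is a stand-alone illustration that does not call joshpy; the second axis (max_growth) is hypothetical and is included only to show how adding an axis multiplies the job count.

```python
from itertools import product

file_axis = {
    "soil_quality": [
        "soil_quality_gradient",
        "soil_quality_triangle",
        "soil_quality_stripes",
    ],
}
replicates = 3


def expand(axes):
    """Return one dict per combination of axis values (Cartesian product)."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*(axes[n] for n in names))]


jobs = expand(file_axis)
print(len(jobs), len(jobs) * replicates)  # 3 jobs, 9 runs

# Hypothetical second axis: every value pairs with every soil pattern.
two_axes = {**file_axis, "max_growth": [5, 10]}
print(len(expand(two_axes)))  # 6 jobs
```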

FileSweepParameter Details

Attribute  Description
name       Key used in file_mappings and passed to --data name=path
paths      List of .jshd file paths to sweep over
labels     Auto-derived from filename stems (e.g., soil_quality_gradient.jshd -> soil_quality_gradient)

The name must match what Josh expects when resolving external soil_quality. The Josh runtime looks for a file mapping with this name.
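
As a rough illustration of these attributes, the following sketch derives labels from filename stems and pairs each path with the parameter name in the name=path form. It uses only pathlib and does not reproduce joshpy's internals; the helper name describe_file_sweep is hypothetical.

```python
from pathlib import Path


def describe_file_sweep(name, paths):
    """Return (label, 'name=path') pairs, one per swept file."""
    rows = []
    for path in paths:
        path = Path(path)
        label = path.stem  # filename without the .jshd suffix
        rows.append((label, f"{name}={path.as_posix()}"))
    return rows


data_dir = Path("../../examples/external_data")
rows = describe_file_sweep(
    "soil_quality",
    [
        data_dir / "soil_quality_gradient.jshd",
        data_dir / "soil_quality_triangle.jshd",
        data_dir / "soil_quality_stripes.jshd",
    ],
)
for label, data_arg in rows:
    print(label, "->", data_arg)
```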

Step 2: Build SweepManager

from joshpy.sweep import SweepManager
from joshpy.cli import JoshCLI
from joshpy.jar import JarMode

REGISTRY_PATH = "external_sweep.duckdb"

manager = (
    SweepManager.builder(config)
    .with_registry(REGISTRY_PATH, experiment_name="soil_quality_sweep")
    .with_cli(JoshCLI(josh_jar=JarMode.DEV))
    .build()
)

print(f"Session ID: {manager.session_id}")
Session ID: 505798c2-aab1-48f7-9250-3a3eca445e54
print(f"Total jobs: {manager.job_set.total_jobs}")
Total jobs: 3
print(f"Total replicates: {manager.job_set.total_replicates}")
Total replicates: 9

Job Labels

Each job gets a label derived from the file stem:

for job in manager.job_set.jobs:
    print(f"Job {job.run_hash[:8]}: soil_quality={job.custom_tags.get('soil_quality', 'N/A')}")
Job 09dabd26: soil_quality=soil_quality_gradient
Job 76863b0b: soil_quality=soil_quality_triangle
Job cf6c3df2: soil_quality=soil_quality_stripes

Step 3: Run the Sweep

results = manager.run()
Running 3 jobs (9 total replicates)
[1/3] Running (local): {'soil_quality': 'soil_quality_gradient'}
  [OK] Completed successfully
[2/3] Running (local): {'soil_quality': 'soil_quality_triangle'}
  [OK] Completed successfully
[3/3] Running (local): {'soil_quality': 'soil_quality_stripes'}
  [OK] Completed successfully
Completed: 3 succeeded, 0 failed
print("\nSweep complete!")

Sweep complete!
print(f"Succeeded: {results.succeeded}")
Succeeded: 3
print(f"Failed: {results.failed}")
Failed: 0

# Fail the tutorial if any jobs failed - include actual error details
if results.failed > 0:
    errors = []
    for job, result in results:
        if not result.success:
            error_msg = result.stderr.strip() if result.stderr else "No error message"
            errors.append(f"Job {job.run_hash}: {error_msg[:500]}")
    error_detail = "\n".join(errors)
    raise RuntimeError(f"Sweep failed: {results.failed} job(s) failed\n\n{error_detail}")

Step 4: Load Results

manager.load_results()
Loading patch results from: /tmp/external_sweep_{run_hash}_{replicate}.csv
  Loaded 34782 rows from external_sweep_09dabd260a36_0.csv
  Loaded 34782 rows from external_sweep_09dabd260a36_1.csv
  Loaded 34782 rows from external_sweep_09dabd260a36_2.csv
  Loaded 34782 rows from external_sweep_76863b0b149a_0.csv
  Loaded 34782 rows from external_sweep_76863b0b149a_1.csv
  Loaded 34782 rows from external_sweep_76863b0b149a_2.csv
  Loaded 34782 rows from external_sweep_cf6c3df21ea0_0.csv
  Loaded 34782 rows from external_sweep_cf6c3df21ea0_1.csv
  Loaded 34782 rows from external_sweep_cf6c3df21ea0_2.csv

Results:
  Jobs in sweep: 3
  Jobs with results loaded: 3
  Total rows loaded: 313038
313038
# Verify data loaded
summary = manager.registry.get_data_summary()
print(summary)
Registry Data Summary
========================================
Sessions: 1
Configs:  3
Runs:     3
Rows:     313,038

Variables: average_age, average_height, soil_quality
Entity types: patch
Parameters: soil_quality
Steps: 0 - 10
Replicates: 0 - 2
Spatial extent: lon [-116.39, -115.40], lat [33.70, 34.00]
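
The row counts in the log can be cross-checked by hand. The sketch below expands the export path template for every run hash and replicate, then multiplies by the per-file row count reported above. The run hashes and counts are copied from the output; expanding the template with str.format is an assumption about the naming scheme, inferred from the file names in the log, and not a description of joshpy's code.

```python
template = "/tmp/external_sweep_{run_hash}_{replicate}.csv"
run_hashes = ["09dabd260a36", "76863b0b149a", "cf6c3df21ea0"]
replicates = range(3)
rows_per_file = 34782  # as reported by load_results() above

expected_files = [
    template.format(run_hash=h, replicate=r) for h in run_hashes for r in replicates
]
total_rows = rows_per_file * len(expected_files)

print(len(expected_files))  # 9
print(expected_files[0])    # /tmp/external_sweep_09dabd260a36_0.csv
print(total_rows)           # 313038
```

Three jobs times three replicates gives nine files, and nine files of 34782 rows each account for the 313,038 rows in the registry summary.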

Step 5: Analysis & Visualization

Now we can analyze how different soil quality patterns affected tree growth.

Data Discovery

print("Export variables:", manager.registry.list_export_variables())
Export variables: ['average_age', 'average_height', 'soil_quality']
print("Config parameters:", manager.registry.list_config_parameters())
Config parameters: ['soil_quality']

Compare Patterns Over Time

from joshpy.diagnostics import SimulationDiagnostics

diag = SimulationDiagnostics(manager.registry)

diag.plot_comparison(
    "average_height",
    group_by="soil_quality",
    title="Tree Height by Soil Quality Pattern",
)
Figure 1: Tree height trajectories for different soil quality patterns

Final Height Comparison

diag.plot_comparison(
    "average_height",
    group_by="soil_quality",
    step=10,
    title="Final Tree Height by Soil Pattern",
)
Figure 2: Final tree height at step 10 for each soil pattern

Spatial Patterns

The key test: do the spatial patterns in tree height match the input soil quality patterns?

import matplotlib.pyplot as plt
from joshpy.cell_data import DiagnosticQueries

queries = DiagnosticQueries(manager.registry)

# Get run hashes for each pattern
patterns = ["soil_quality_gradient", "soil_quality_triangle", "soil_quality_stripes"]
pattern_jobs = {job.custom_tags.get("soil_quality"): job for job in manager.job_set.jobs}

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

for ax, pattern in zip(axes, patterns):
    job = pattern_jobs.get(pattern)
    if job is None:
        ax.set_title(f"{pattern}: Not found")
        continue
    
    df = queries.get_spatial_snapshot(
        step=10,
        variable="average_height",
        run_hash=job.run_hash,
        replicate=0,
    )
    
    if len(df) == 0:
        ax.set_title(f"{pattern}: No data")
        continue
    
    scatter = ax.scatter(
        df['longitude'], 
        df['latitude'],
        c=df['value'],  # get_spatial_snapshot returns 'value' column
        cmap='YlGn',
        s=10,
        alpha=0.8,
    )
    ax.set_xlabel('Longitude')
    ax.set_ylabel('Latitude')
    ax.set_title(pattern.replace('soil_quality_', '').title())

plt.colorbar(scatter, ax=axes, label='Avg Height (m)', shrink=0.8)
plt.tight_layout()
plt.show()
Figure 3: Spatial distribution of tree height for each soil quality pattern

Verify Soil Quality Export

We also exported soil_quality to verify the external data was read correctly:

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

for ax, pattern in zip(axes, patterns):
    job = pattern_jobs.get(pattern)
    if job is None:
        continue
    
    df = queries.get_spatial_snapshot(
        step=5,  # Mid-simulation
        variable="soil_quality",
        run_hash=job.run_hash,
        replicate=0,
    )
    
    if len(df) == 0:
        continue
    
    scatter = ax.scatter(
        df['longitude'], 
        df['latitude'],
        c=df['value'],  # get_spatial_snapshot returns 'value' column
        cmap='YlGn',
        vmin=0,
        vmax=100,
        s=10,
        alpha=0.8,
    )
    ax.set_xlabel('Longitude')
    ax.set_ylabel('Latitude')
    ax.set_title(f"{pattern.replace('soil_quality_', '').title()} (Soil Quality)")

plt.colorbar(scatter, ax=axes, label='Soil Quality (%)', shrink=0.8)
plt.tight_layout()
plt.show()
Figure 4: Exported soil quality values confirm external data was read correctly

Direct SQL Queries

For custom analysis, query the registry directly. File parameter labels are stored in config_parameters just like config params, making them easy to query:

# Compare mean height by soil pattern
# File parameter labels are stored in config_parameters, just like config params
result = manager.registry.query("""
    SELECT 
        cp.soil_quality as pattern,
        cd.step,
        AVG(cd.average_height) as mean_height,
        STDDEV(cd.average_height) as std_height
    FROM cell_data cd
    JOIN config_parameters cp ON cd.run_hash = cp.run_hash
    GROUP BY cp.soil_quality, cd.step
    ORDER BY cp.soil_quality, cd.step
""")
result.df().head(15)
                  pattern  step  mean_height   std_height
0   soil_quality_gradient     0   251.212955   153.985997
1   soil_quality_gradient     1   502.614158   299.300450
2   soil_quality_gradient     2   753.774360   444.307096
3   soil_quality_gradient     3  1006.095663   590.763974
4   soil_quality_gradient     4  1258.231998   736.401820
5   soil_quality_gradient     5  1509.590436   881.379216
6   soil_quality_gradient     6  1762.403899  1027.390708
7   soil_quality_gradient     7  2014.227007  1172.672281
8   soil_quality_gradient     8  2266.419816  1318.130916
9   soil_quality_gradient     9  2518.633282  1464.063508
10  soil_quality_gradient    10  2770.238283  1609.246545
11   soil_quality_stripes     0   253.289744   159.307646
12   soil_quality_stripes     1   505.849162   308.898449
13   soil_quality_stripes     2   759.300500   459.186458
14   soil_quality_stripes     3  1012.700934   608.837888
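
The join-then-aggregate shape of this query can be tried without a registry. The sketch below uses Python's built-in sqlite3 as a stand-in for the DuckDB registry, with two tiny tables whose names and columns mirror the query above. The rows are made-up toy values, not tutorial results, and STDDEV is omitted because SQLite has no built-in standard deviation aggregate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE config_parameters (run_hash TEXT, soil_quality TEXT);
    CREATE TABLE cell_data (run_hash TEXT, step INTEGER, average_height REAL);
    INSERT INTO config_parameters VALUES ('aaa', 'gradient'), ('bbb', 'stripes');
    INSERT INTO cell_data VALUES
        ('aaa', 0, 1.0), ('aaa', 0, 3.0), ('aaa', 1, 4.0),
        ('bbb', 0, 2.0), ('bbb', 1, 6.0), ('bbb', 1, 8.0);
    """
)

# Same structure as the registry query: join labels onto cell rows, then group.
rows = conn.execute(
    """
    SELECT cp.soil_quality AS pattern, cd.step, AVG(cd.average_height) AS mean_height
    FROM cell_data cd
    JOIN config_parameters cp ON cd.run_hash = cp.run_hash
    GROUP BY cp.soil_quality, cd.step
    ORDER BY cp.soil_quality, cd.step
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```

Because the label lives in config_parameters and is joined on run_hash, the same pattern works for any swept parameter, whether it came from a config value or a file.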

Summary

This tutorial demonstrated the file sweep workflow:

  1. external keyword - Reference preprocessed .jshd data in Josh simulations
  2. FileSweepParameter - Define which files to sweep over
  3. SweepManager - Orchestrate execution with automatic registry tracking
  4. Spatial visualization - Confirm output patterns match input data

Key Differences: Config vs File Sweeps

Aspect            ConfigSweepParameter     FileSweepParameter
Variation source  Jinja template values    External data files
Storage           job.parameters dict      job.file_mappings (paths) + job.parameters (labels)
Josh reference    config name.variable     external variable
Hash includes     Rendered config content  File content hash
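
The last row says that the hash for a file sweep depends on file content. The sketch below shows the general idea with hashlib: two files with different bytes hash differently, while a renamed copy hashes the same. The choice of SHA-256, the 12-character truncation, and the helper content_hash are assumptions for illustration; the tutorial does not state which algorithm joshpy uses.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(path, length=12):
    """Hash a file's bytes and return a short hex prefix."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return digest[:length]


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "a.jshd").write_bytes(b"gradient data")
    (tmp / "b.jshd").write_bytes(b"stripes data")
    (tmp / "a_copy.jshd").write_bytes(b"gradient data")

    hash_a = content_hash(tmp / "a.jshd")
    hash_b = content_hash(tmp / "b.jshd")
    hash_copy = content_hash(tmp / "a_copy.jshd")

print(hash_a != hash_b)     # True: different content
print(hash_a == hash_copy)  # True: same content, different name
```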

When to Use File Sweeps

  • Climate scenarios: Compare RCP 2.6 vs 4.5 vs 8.5 projections
  • Sensitivity analysis: Test different soil maps or elevation data
  • Ensemble runs: Use multiple realizations of stochastic inputs
  • Data source comparison: Compare satellite vs ground-truth inputs

Tip: GridSpec Variant Sweeps

When your grid has data files that vary by scenario (e.g., 24 monthly climate files × 3 SSP scenarios), constructing FileSweepParameter objects by hand gets tedious. GridSpec variants automate this.

Declare variant axes in your grid.yaml:

variants:
  scenario:
    values: [ssp245, ssp370, ssp585]
    default: ssp245

files:
  cover:
    path: cover.jshd                              # static — no templating
    units: percent
  futureTempJan:
    template_path: monthly/tas_{scenario}_jan.jshd # resolved per scenario
    units: K
  futurePrecipJan:
    template_path: monthly/pr_{scenario}_jan.jshd
    units: mm/year
  # ... 24 monthly files total

Then generate the entire sweep in one line:

grid = GridSpec.from_yaml("data/grids/dev_fine/grid.yaml")

config = JobConfig(
    file_mappings=grid.file_mappings,              # defaults (ssp245)
    sweep=SweepConfig(
        compound_parameters=[
            grid.variant_sweep("scenario"),        # 24 FileSweepParameters, zipped
        ],
    ),
    ...
)

variant_sweep("scenario") finds every template_path file referencing {scenario}, builds one FileSweepParameter per file, and wraps them in a CompoundSweepParameter so all 24 files switch together per scenario. Static path files pass through unchanged.
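
To illustrate what "switch together" means, the sketch below expands {scenario} templates by hand and contrasts the zipped result with what an unconstrained Cartesian product of the same files would produce. It is plain Python that mimics the behavior described above; it does not use GridSpec or CompoundSweepParameter, and the helper name expand_variant is hypothetical.

```python
from itertools import product

scenarios = ["ssp245", "ssp370", "ssp585"]
templated_files = {
    "futureTempJan": "monthly/tas_{scenario}_jan.jshd",
    "futurePrecipJan": "monthly/pr_{scenario}_jan.jshd",
}


def expand_variant(files, axis, values):
    """One file mapping per variant value; all templated files change together."""
    return [
        {name: tpl.format(**{axis: value}) for name, tpl in files.items()}
        for value in values
    ]


zipped = expand_variant(templated_files, "scenario", scenarios)
print(len(zipped))  # 3 jobs, one per scenario
print(zipped[1])

# Sweeping each file independently would mix scenarios across files.
independent = list(product(scenarios, repeat=len(templated_files)))
print(len(independent))  # 9 combinations for just two files
```

With 24 templated files, the independent product would grow to 3^24 combinations, almost all of them mixing scenarios across files, which is why the files are zipped into a single compound axis.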

See Project Organization for setting up GridSpec with variants.

Cleanup

import os

manager.cleanup()
manager.close()

# Remove temporary registry and any WAL files
for f in [REGISTRY_PATH, f"{REGISTRY_PATH}.wal"]:
    if os.path.exists(f):
        os.remove(f)

Learn More