Project Organization

Directory layout, model templating, and data management for josh projects

Introduction

As josh projects grow beyond a single simulation, you need conventions for organizing model files, configs, data, and results. This guide covers:

  1. Recommended directory layout for multi-grid, multi-model projects
  2. Grid specifications (GridSpec) for managing grid geometry and data manifests
  3. Model templating (.josh.j2) to avoid duplicating model files
  4. Config templating (.jshc.j2) vs raw config files (.jshc)
  5. Data directory helpers to reduce file-mapping boilerplate
  6. Run labels for ad-hoc iteration and comparison
  7. Cross-experiment tracking with ProjectCatalog

These features compose with SweepManager (see SweepManager Workflow) and work identically for local and remote execution.

Directory Layout

my-josh-project/
├── models/
│   ├── canonical.josh.j2              # Model template (one, not N copies)
│   └── experimental_drought.josh.j2   # Model variant for testing
├── configs/
│   ├── baseline.jshc                  # Raw config for dev tinkering
│   └── sweep_config.jshc.j2           # Jinja template for parameter sweeps
├── data/
│   ├── raw/                           # Source GeoTIFF/NetCDF
│   └── grids/
│       ├── dev_fine/                   # One dir per grid definition
│       │   ├── grid.yaml              # GridSpec manifest
│       │   ├── cover.jshd
│       │   ├── fire_rbr.jshd
│       │   └── monthly/
│       │       ├── tas_ssp245_jan.jshd
│       │       └── ...
│       ├── test_fine/
│       └── full_fine/
├── experiments/                        # One registry per experiment
│   ├── baseline_dev_fine.duckdb
│   └── fire_sweep_test_fine.duckdb
└── notebooks/
    ├── run.ipynb
    └── analysis.ipynb

Key conventions:

  • models/ – One .josh.j2 template per model variant, not one per grid.
  • configs/ – Raw .jshc files for single runs, .jshc.j2 templates for sweeps.
  • data/grids/<grid_name>/ – All .jshd files for a grid, plus a grid.yaml manifest.
  • experiments/ – One .duckdb registry per experiment. Named descriptively.

Grid Specifications (GridSpec)

Each grid permutation is defined by a grid.yaml file co-located with its data directory. This file declares grid geometry AND inventories the data files preprocessed for that grid:

# data/grids/dev_fine/grid.yaml
name: dev_fine
grid:
  size_m: 30
  low: [33.902, -116.0465]
  high: [33.908, -116.0395]
  steps: 86

files:
  cover:
    path: cover.jshd
    units: percent
  fireRbr:
    path: fire_rbr.jshd
    units: rbr
  futureTempJan:
    path: monthly/tas_ssp245_jan.jshd
    units: K

Load it and use directly in JobConfig:

from pathlib import Path

from joshpy.grid import GridSpec
from joshpy.jobs import JobConfig

grid = GridSpec.from_yaml("data/grids/dev_fine/grid.yaml")

config = JobConfig(
    source_template_path=Path("models/canonical.josh.j2"),
    template_vars={
        **grid.template_vars,                    # grid geometry
        "simulation_name": "CanonicalDevFine",   # model concern, not grid concern
        "export_prefix": "canonical_dev_fine",
        "debug": True,
    },
    file_mappings=grid.file_mappings,            # all .jshd files for this grid
    config_path=Path("configs/baseline.jshc"),
)

GridSpec also handles preprocessing – see the Preprocessing tutorial for details.

For the ad-hoc iteration workflow (run, tweak, compare with labels), see the Single Run & Iteration tutorial.

Variant Data Files

Grids often have data files that vary by scenario (e.g., SSP245 vs SSP370 climate projections) while sharing the same spatial geometry. GridSpec variants let you declare these axes in YAML:

# data/grids/dev_fine/grid.yaml
name: dev_fine
grid:
  size_m: 30
  low: [33.902, -116.0465]
  high: [33.908, -116.0395]
  steps: 86

variants:
  scenario:
    values: [ssp245, ssp370, ssp585]
    default: ssp245

files:
  # Static files use `path` — no templating
  cover:
    path: cover.jshd
    units: percent
  fireRbr:
    path: fireRbr.jshd
    units: rbr

  # Variant files use `template_path` — {scenario} resolved at runtime
  futureTempJan:
    template_path: monthly/tas_{scenario}_jan.jshd
    units: K
  futurePrecipJan:
    template_path: monthly/pr_{scenario}_jan.jshd
    units: mm/year
  # ... etc for all monthly files

Key rules:

  • path — literal file path, no templating. For static/scenario-independent files.
  • template_path — contains {variant_name} placeholders. Mutually exclusive with path.
  • variants — declared once at the top level. Each axis has values and default.
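Under the hood this is plain placeholder substitution. The resolution rule can be sketched with nothing but str.format() (an illustration of the mechanics, not joshpy's internals):

```python
# Sketch of template_path resolution: fill each declared variant axis with
# its default unless overridden. Illustrative only -- not joshpy's internals.
variants = {"scenario": {"values": ["ssp245", "ssp370", "ssp585"],
                         "default": "ssp245"}}
files = {
    "cover":         {"path": "cover.jshd"},
    "futureTempJan": {"template_path": "monthly/tas_{scenario}_jan.jshd"},
}

def resolve(files, variants, **overrides):
    axes = {name: overrides.get(name, spec["default"])
            for name, spec in variants.items()}
    resolved = {}
    for key, entry in files.items():
        if "path" in entry:
            resolved[key] = entry["path"]                          # static file
        else:
            resolved[key] = entry["template_path"].format(**axes)  # variant file
    return resolved

resolve(files, variants)["futureTempJan"]
# 'monthly/tas_ssp245_jan.jshd'
resolve(files, variants, scenario="ssp370")["futureTempJan"]
# 'monthly/tas_ssp370_jan.jshd'
```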

Use the variant API to work with scenario-varying data:

grid = GridSpec.from_yaml("data/grids/dev_fine/grid.yaml")

# Inspect available axes
grid.variants
# {"scenario": {"values": ["ssp245", "ssp370", "ssp585"], "default": "ssp245"}}

# Default file_mappings — template_path resolved with defaults
grid.file_mappings
# {"cover": Path(".../cover.jshd"),
#  "futureTempJan": Path(".../monthly/tas_ssp245_jan.jshd"), ...}

# Specific scenario
grid.file_mappings_for(scenario="ssp370")
# {"cover": Path(".../cover.jshd"),                          ← unchanged
#  "futureTempJan": Path(".../monthly/tas_ssp370_jan.jshd"), ← resolved}

# Generate a sweep over all scenarios (one line!)
param = grid.variant_sweep("scenario")
# → CompoundSweepParameter with one FileSweepParameter per template_path file

For preprocessing variant files, pass the variant kwarg so the output path is resolved from the template:

for scenario in ["ssp245", "ssp370", "ssp585"]:
    grid.preprocess_netcdf(cli,
        josh_name="futureTempJan",
        variant={"scenario": scenario},
        data_file=Path(f"raw/tas_{scenario}_jan.nc"),
        variable="tas",
        units="K",
    )
grid.save()

For a full walkthrough of sweeping over variant data files, see Sweeping Over External Data.

Model Templating (.josh.j2)

When the same model logic runs across multiple grid configurations, use source_template_path to template the .josh file with Jinja2 rather than maintaining separate copies. The template is rendered once with template_vars to produce a concrete .josh file.

{# models/canonical.josh.j2 #}

start simulation {{ simulation_name }}
  grid.size = {{ size_m }} m
  grid.low = {{ low_lat }} degrees latitude, {{ low_lon }} degrees longitude
  grid.high = {{ high_lat }} degrees latitude, {{ high_lon }} degrees longitude
  steps.low = 0 count
  steps.high = {{ steps }} count

  exportFiles.patch = "file:///outputs/{{ export_prefix }}_{replicate}_{run_hash}.csv"
  {% if debug %}
  debugFiles.patch = "file:///outputs/debug_{{ export_prefix }}_patch.txt"
  debugFiles.organism = "file:///outputs/debug_{{ export_prefix }}_organism.txt"
  {% endif %}
end simulation

{# Everything below is shared across all grids -- written once, not N times #}

start patch Default
  Tree.init = create 10 count of Tree
  export.treeCount.step = count(Tree)
  export.averageHeight.step = mean(Tree.height)
end patch

start organism Tree
  age.init = 0 years
  age.step = prior.age + 1 year

  {% if debug %}
  const growthAmount = sample uniform from 0 meters to config sweep_config.maxGrowth
  const d = debug("growth", growthAmount)
  height.step = prior.height + growthAmount
  {% else %}
  height.step = prior.height + sample uniform from 0 meters to config sweep_config.maxGrowth
  {% endif %}
end organism

This collapses what would otherwise be multiple near-identical .josh files into one template plus a dict of grid-specific values.

Usage

from pathlib import Path
from joshpy.jobs import JobConfig, discover_jshd_files
from joshpy.sweep import SweepManager

config = JobConfig(
    source_template_path=Path("models/canonical.josh.j2"),
    template_vars=GRIDS["dev_fine"],  # From grid_configs.py
    config_path=Path("configs/baseline.jshc"),
    file_mappings=discover_jshd_files("data/grids/dev_fine", recursive=True),
)

with SweepManager.from_config(config, registry="experiments/baseline_dev_fine.duckdb") as mgr:
    mgr.run()
    mgr.load_results()

Switching grids is then just a matter of changing the dict and the data directory:

config = JobConfig(
    source_template_path=Path("models/canonical.josh.j2"),
    template_vars=GRIDS["test_fine"],  # Different grid
    config_path=Path("configs/baseline.jshc"),
    file_mappings=discover_jshd_files("data/grids/test_fine", recursive=True),
)

Live demo: template rendering

Let’s see .josh.j2 rendering in action. We create a small template and render it with two different grid configs:

import tempfile
from pathlib import Path
from joshpy.jobs import JobConfig, JobExpander

# Create a josh template in a temp directory
tmpdir = tempfile.mkdtemp()

josh_template = Path(tmpdir) / "model.josh.j2"
josh_template.write_text("""start simulation {{ simulation_name }}
  grid.size = {{ grid_size }} m
  grid.low = {{ grid_low_lat }} degrees latitude, {{ grid_low_lon }} degrees longitude
  grid.high = {{ grid_high_lat }} degrees latitude, {{ grid_high_lon }} degrees longitude
  steps.low = 0 count
  steps.high = {{ steps_high }} count
  {% if debug %}debugFiles.patch = "file:///tmp/debug.txt"{% endif %}
end simulation

start patch Default
end patch
""")

config_file = Path(tmpdir) / "config.jshc"
config_file.write_text("maxGrowth = 50 meters")

# Render with "dev_fine" grid (debug=True)
config_dev = JobConfig(
    source_template_path=josh_template,
    template_vars={
        "simulation_name": "DevFine", "grid_size": 30,
        "grid_low_lat": 33.902, "grid_low_lon": -116.046,
        "grid_high_lat": 33.908, "grid_high_lon": -116.039,
        "steps_high": 86, "debug": True,
    },
    config_path=config_file,
)

expander = JobExpander()
job_set = expander.expand(config_dev)
print("=== dev_fine (debug=True) ===")
print(job_set.jobs[0].source_path.read_text())
job_set.cleanup()

# Render with "test_fine" grid (debug=False)
config_test = JobConfig(
    source_template_path=josh_template,
    template_vars={
        "simulation_name": "TestFine", "grid_size": 30,
        "grid_low_lat": 33.5, "grid_low_lon": -116.4,
        "grid_high_lat": 34.0, "grid_high_lon": -115.4,
        "steps_high": 86, "debug": False,
    },
    config_path=config_file,
)

job_set = expander.expand(config_test)
print("=== test_fine (debug=False) ===")
print(job_set.jobs[0].source_path.read_text())
job_set.cleanup()
=== dev_fine (debug=True) ===
start simulation DevFine
  grid.size = 30 m
  grid.low = 33.902 degrees latitude, -116.046 degrees longitude
  grid.high = 33.908 degrees latitude, -116.039 degrees longitude
  steps.low = 0 count
  steps.high = 86 count
  debugFiles.patch = "file:///tmp/debug.txt"
end simulation

start patch Default
end patch
=== test_fine (debug=False) ===
start simulation TestFine
  grid.size = 30 m
  grid.low = 33.5 degrees latitude, -116.4 degrees longitude
  grid.high = 34.0 degrees latitude, -115.4 degrees longitude
  steps.low = 0 count
  steps.high = 86 count
  
end simulation

start patch Default
end patch

One template, two different rendered .josh files. The debug line only appears in the dev_fine version.

Config Files: Raw (.jshc) vs Template (.jshc.j2)

Raw config (single runs)

Use config_path when you have a pre-written .jshc file and don’t need parameter substitution:

config = JobConfig(
    config_path=Path("configs/baseline.jshc"),
    source_path=Path("model.josh"),
)

The file content is used as-is – no Jinja rendering. This is the simplest path for dev tinkering: edit the .jshc, re-run, inspect results.

Template config (parameter sweeps)

Use template_path when you need to sweep over config values:

config = JobConfig(
    template_path=Path("configs/sweep_config.jshc.j2"),
    source_path=Path("model.josh"),
    sweep=SweepConfig(
        config_parameters=[
            ConfigSweepParameter(name="maxGrowth", values=[10, 50, 100]),
        ],
    ),
)

template_vars are available in .jshc.j2 rendering alongside sweep parameters. If a sweep parameter has the same name as a template var, the sweep value wins.
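That precedence rule behaves like an ordinary dict merge in which the later mapping wins. A tiny sketch of the collision behavior (illustrative, not the joshpy rendering code):

```python
# Sweep parameters override template_vars on a name collision -- modeled
# here as a plain dict merge where the right-hand mapping wins.
template_vars = {"maxGrowth": 25, "simulation_name": "Baseline"}
sweep_values = {"maxGrowth": 100}            # collides with a template var

render_context = {**template_vars, **sweep_values}
render_context["maxGrowth"]        # 100 -- the sweep value wins
render_context["simulation_name"]  # 'Baseline' -- untouched
```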

Mutual exclusivity

config_path, template_path, and template_string are mutually exclusive – use exactly one. Similarly, source_path and source_template_path are mutually exclusive.
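An exactly-one-of check takes only a few lines; the check_exclusive helper below is a hypothetical sketch of the rule, not joshpy's actual validation code:

```python
# Hypothetical validator for the exactly-one rule among config sources.
def check_exclusive(**sources):
    provided = [name for name, value in sources.items() if value is not None]
    if len(provided) != 1:
        raise ValueError(
            f"expected exactly one of {sorted(sources)}, got {provided}")
    return provided[0]

check_exclusive(config_path="configs/baseline.jshc",
                template_path=None,
                template_string=None)
# 'config_path'
```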

Discovering Data Files

discover_jshd_files() auto-builds a file_mappings dict from a directory, mapping each .jshd file’s stem to its absolute path:

import tempfile
from pathlib import Path
from joshpy.jobs import discover_jshd_files

# Create a mock data directory
data_dir = Path(tempfile.mkdtemp())
(data_dir / "cover.jshd").write_bytes(b"mock cover data")
(data_dir / "fire_rbr.jshd").write_bytes(b"mock fire data")
monthly = data_dir / "monthly"
monthly.mkdir()
(monthly / "tas_ssp245_jan.jshd").write_bytes(b"mock temp data")
(monthly / "pr_ssp245_jan.jshd").write_bytes(b"mock precip data")

# Flat discovery (top-level only)
flat = discover_jshd_files(data_dir)
print(f"Flat: {len(flat)} files")
for name, path in sorted(flat.items()):
    print(f"  {name} -> {path.name}")

# Recursive discovery (includes subdirectories)
recursive = discover_jshd_files(data_dir, recursive=True)
print(f"\nRecursive: {len(recursive)} files")
for name, path in sorted(recursive.items()):
    print(f"  {name} -> {path.name}")
Flat: 2 files
  cover -> cover.jshd
  fire_rbr -> fire_rbr.jshd

Recursive: 4 files
  cover -> cover.jshd
  fire_rbr -> fire_rbr.jshd
  pr_ssp245_jan -> pr_ssp245_jan.jshd
  tas_ssp245_jan -> tas_ssp245_jan.jshd

In recursive mode, discover_jshd_files() raises ValueError if two files share a stem. Use unique filenames or separate directories.
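The discovery-and-collision behavior can be sketched with pathlib alone; discover() below is a simplified stand-in that mirrors the documented behavior, not the library function itself:

```python
import tempfile
from pathlib import Path

# Simplified stand-in for discover_jshd_files(): map stem -> path, and
# refuse duplicate stems so file_mappings keys stay unambiguous.
def discover(data_dir, recursive=False):
    pattern = "**/*.jshd" if recursive else "*.jshd"
    mapping = {}
    for path in sorted(Path(data_dir).glob(pattern)):
        if path.stem in mapping:
            raise ValueError(f"duplicate stem {path.stem!r}: "
                             f"{mapping[path.stem]} vs {path}")
        mapping[path.stem] = path.resolve()
    return mapping

d = Path(tempfile.mkdtemp())
(d / "cover.jshd").write_bytes(b"x")
(d / "sub").mkdir()
(d / "sub" / "cover.jshd").write_bytes(b"y")   # same stem, different dir

sorted(discover(d))              # ['cover'] -- flat mode ignores sub/
# discover(d, recursive=True)    # raises ValueError: duplicate stem 'cover'
```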

Cross-Experiment Comparison with ProjectCatalog

When you have many experiments across grids and model variants, ProjectCatalog provides a lightweight provenance index that tracks what was run with what inputs, and where results live.
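The core idea is content addressing: each input is identified by a hash of its bytes, so identical inputs collapse to the same key and "did I already run this?" becomes a lookup. A minimal sketch of that idea (truncated SHA-256 is an assumption here; the hashing scheme inside ProjectCatalog is an internal detail):

```python
import hashlib

# Content addressing: identical bytes -> identical hash; any edit -> new hash.
def content_hash(data: bytes, length: int = 12) -> str:
    return hashlib.sha256(data).hexdigest()[:length]

v1 = b"start simulation Main\n  grid.size = 30 m\nend simulation"
v2 = v1 + b"\n# drought fix"

content_hash(v1) == content_hash(v1)   # True: stable, so lookups can dedup
content_hash(v1) == content_hash(v2)   # False: provenance tracks every edit
```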

Setup

import tempfile
from pathlib import Path
from joshpy.catalog import ProjectCatalog
from joshpy.jobs import JobConfig, SweepConfig, ConfigSweepParameter

# Create catalog and test fixtures
catalog = ProjectCatalog(":memory:")
fixture_dir = Path(tempfile.mkdtemp())

# Create mock project files
model_v1 = fixture_dir / "canonical_v1.josh"
model_v1.write_text("start simulation Main\n  grid.size = 30 m\nend simulation")

model_v2 = fixture_dir / "canonical_v2.josh"
model_v2.write_text("start simulation Main\n  grid.size = 30 m\n  # drought fix\nend simulation")

config_file = fixture_dir / "baseline.jshc"
config_file.write_text("maxGrowth = 50 meters\nfireYear = 75 count")

data_a = fixture_dir / "cover.jshd"
data_a.write_bytes(b"cover data for dev_fine grid")
data_b = fixture_dir / "fire.jshd"
data_b.write_bytes(b"fire data for dev_fine grid")

print("Fixtures created")
Fixtures created

Registering models and data

# Register models
h1 = catalog.register_model(model_v1, name="canonical_v1")
h2 = catalog.register_model(model_v2, name="canonical_v2")
print(f"Model v1 hash: {h1}")
print(f"Model v2 hash: {h2}")
print(f"Different content = different hash: {h1 != h2}")

# Register data manifest
dh = catalog.register_data(
    {"cover": data_a, "fire": data_b},
    name="dev_fine",
)
print(f"\nData manifest hash: {dh}")

# Check what was registered
info = catalog.get_data_manifest(dh)
print(f"Data set '{info.name}': {info.file_count} files, {info.total_size} bytes")
Model v1 hash: 9d17d9ee63eb
Model v2 hash: 14004efde975
Different content = different hash: True

Data manifest hash: 2d8d59ee52aa
Data set 'dev_fine': 2 files, 55 bytes

Registering experiments

# Register two experiments with different models
config_a = JobConfig(
    config_path=config_file,
    source_path=model_v1,
    simulation="Main",
    file_mappings={"cover": data_a, "fire": data_b},
)

config_b = JobConfig(
    config_path=config_file,
    source_path=model_v2,
    simulation="Main",
    file_mappings={"cover": data_a, "fire": data_b},
)

exp1 = catalog.register_experiment(
    config_a,
    registry_path="experiments/baseline_v1.duckdb",
    name="baseline_v1",
)
catalog.update_experiment_status(exp1, "completed", summary={"succeeded": 1, "failed": 0})

exp2 = catalog.register_experiment(
    config_b,
    registry_path="experiments/baseline_v2.duckdb",
    name="baseline_v2",
)
catalog.update_experiment_status(exp2, "completed", summary={"succeeded": 1, "failed": 0})

print(f"Registered experiment: {exp1} (baseline_v1)")
print(f"Registered experiment: {exp2} (baseline_v2)")
Registered experiment: e0f57dd4-623 (baseline_v1)
Registered experiment: 913b1de2-57b (baseline_v2)

Deduplication: “Did I already run this?”

# Check if config_a was already run
existing = catalog.find_experiment(config_a)
if existing and existing.status == "completed":
    print(f"Already run: '{existing.name}' -> {existing.registry_path}")
else:
    print("Not found -- would need to run")

# A different config won't match
different_config_file = fixture_dir / "different.jshc"
different_config_file.write_text("maxGrowth = 100 meters")
config_c = JobConfig(
    config_path=different_config_file,
    source_path=model_v1,
    simulation="Main",
    file_mappings={"cover": data_a, "fire": data_b},
)
print(f"Different config matches: {catalog.find_experiment(config_c) is not None}")
Already run: 'baseline_v1' -> experiments/baseline_v1.duckdb
Different config matches: False

Listing and filtering

# List all experiments
print("All experiments:")
for exp in catalog.list_experiments():
    print(f"  {exp.name}: status={exp.status}, registry={exp.registry_path}")

# Filter by status
completed = catalog.list_experiments(status="completed")
print(f"\nCompleted: {len(completed)}")

# Filter by model name
v1_exps = catalog.list_experiments(model_name="canonical_v1")
print(f"Using canonical_v1: {len(v1_exps)}")
All experiments:
  baseline_v2: status=completed, registry=experiments/baseline_v2.duckdb
  baseline_v1: status=completed, registry=experiments/baseline_v1.duckdb

Completed: 2
Using canonical_v1: 1

Orchestration metadata

Each joshpy-created experiment automatically stores provenance metadata:

exp = catalog.get_experiment(exp1)
orch = exp.orchestration
print(f"Tool: {orch['tool']}")
print(f"JobConfig keys: {list(orch['job_config'].keys())}")
Tool: joshpy
JobConfig keys: ['config_path', 'source_path', 'file_mappings']

This metadata is in an optional orchestration JSON field. The core catalog fields (model_hash, config_hash, data_manifest_hash) are josh-agnostic – any tool can produce and consume them.

SweepManager integration

The catalog can be wired into SweepManager directly via the builder. It auto-registers the experiment on build() and updates status after run():

catalog = ProjectCatalog("catalog.duckdb")

manager = (
    SweepManager.builder(config)
    .with_registry("experiments/sweep.duckdb")
    .with_catalog(catalog, experiment_name="fire_sensitivity_dev_fine")
    .build()  # Registers experiment as "running"
)
results = manager.run()  # Updates to "completed" or "failed"

Cross-experiment queries

The catalog can open multiple experiment registries simultaneously via DuckDB ATTACH, enabling cross-experiment SQL:

experiments = catalog.list_experiments(status="completed")

with catalog.open_registries(experiments) as conn:
    df = conn.execute("""
        SELECT 'baseline' as experiment, step, AVG(averageHeight) as mean_height
        FROM exp_0.cell_data GROUP BY step
        UNION ALL
        SELECT 'drought_fix' as experiment, step, AVG(averageHeight) as mean_height
        FROM exp_1.cell_data GROUP BY step
    """).df()

Cleanup

catalog.close()