Directory layout, model templating, and data management for josh projects
Introduction
As josh projects grow beyond a single simulation, you need conventions for organizing model files, configs, data, and results. This guide covers:
Recommended directory layout for multi-grid, multi-model projects
Grid specifications (GridSpec) for managing grid geometry and data manifests
Model templating (.josh.j2) to avoid duplicating model files
Config templating (.jshc.j2) vs raw config files (.jshc)
Data directory helpers to reduce file-mapping boilerplate
Run labels for ad-hoc iteration and comparison
Cross-experiment tracking with ProjectCatalog
These features compose with SweepManager (see SweepManager Workflow) and work identically for local and remote execution.
Directory Layout
```
my-josh-project/
├── models/
│   ├── canonical.josh.j2             # Model template (one, not N copies)
│   └── experimental_drought.josh.j2  # Model variant for testing
├── configs/
│   ├── baseline.jshc                 # Raw config for dev tinkering
│   └── sweep_config.jshc.j2          # Jinja template for parameter sweeps
├── data/
│   ├── raw/                          # Source GeoTIFF/NetCDF
│   └── grids/
│       ├── dev_fine/                 # One dir per grid definition
│       │   ├── grid.yaml             # GridSpec manifest
│       │   ├── cover.jshd
│       │   ├── fire_rbr.jshd
│       │   └── monthly/
│       │       ├── tas_ssp245_jan.jshd
│       │       └── ...
│       ├── test_fine/
│       └── full_fine/
├── experiments/                      # One registry per experiment
│   ├── baseline_dev_fine.duckdb
│   └── fire_sweep_test_fine.duckdb
└── notebooks/
    ├── run.ipynb
    └── analysis.ipynb
```
Key conventions:
models/ – One .josh.j2 template per model variant, not one per grid.
configs/ – Raw .jshc files for single runs, .jshc.j2 templates for sweeps.
data/grids/<grid_name>/ – All .jshd files for a grid, plus a grid.yaml manifest.
experiments/ – One .duckdb registry per experiment. Named descriptively.
Grid Specifications (GridSpec)
Each grid permutation is defined by a grid.yaml file co-located with its data directory. This file declares grid geometry AND inventories the data files preprocessed for that grid:
```yaml
# data/grids/dev_fine/grid.yaml
name: dev_fine
grid:
  size_m: 30
  low: [33.902, -116.0465]
  high: [33.908, -116.0395]
  steps: 86
files:
  cover:
    path: cover.jshd
    units: percent
  fireRbr:
    path: fireRbr.jshd
    units: rbr
  futureTempJan:
    path: monthly/futureTempJan.jshd
    units: K
```
Load it and use it directly in JobConfig:
```python
from joshpy.grid import GridSpec

grid = GridSpec.from_yaml("data/grids/dev_fine/grid.yaml")

config = JobConfig(
    source_template_path=Path("models/canonical.josh.j2"),
    template_vars={
        **grid.template_vars,                   # grid geometry
        "simulation_name": "CanonicalDevFine",  # model concern, not grid concern
        "export_prefix": "canonical_dev_fine",
        "debug": True,
    },
    file_mappings=grid.file_mappings,  # all .jshd files for this grid
    config_path=Path("configs/baseline.jshc"),
)
```
GridSpec also handles preprocessing – see the Preprocessing tutorial for details.
For the ad-hoc iteration workflow (run, tweak, compare with labels), see the Single Run & Iteration tutorial.
Variant Data Files
Grids often have data files that vary by scenario (e.g., SSP245 vs SSP370 climate projections) while sharing the same spatial geometry. GridSpec variants let you declare these axes in YAML:
```yaml
# data/grids/dev_fine/grid.yaml
name: dev_fine
grid:
  size_m: 30
  low: [33.902, -116.0465]
  high: [33.908, -116.0395]
  steps: 86
variants:
  scenario:
    values: [ssp245, ssp370, ssp585]
    default: ssp245
files:
  # Static files use `path` — no templating
  cover:
    path: cover.jshd
    units: percent
  fireRbr:
    path: fireRbr.jshd
    units: rbr
  # Variant files use `template_path` — {scenario} resolved at runtime
  futureTempJan:
    template_path: monthly/tas_{scenario}_jan.jshd
    units: K
  futurePrecipJan:
    template_path: monthly/pr_{scenario}_jan.jshd
    units: mm/year
  # ... etc for all monthly files
```
Key rules:
path — literal file path, no templating. For static/scenario-independent files.
template_path — contains {variant_name} placeholders. Mutually exclusive with path.
variants — declared once at the top level. Each axis has values and default.
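Conceptually, resolving a template_path is plain placeholder substitution against the declared variant axes. A minimal stdlib sketch (illustrative only, not the joshpy internals):

```python
# Variant axes as declared in grid.yaml, and one template_path entry.
variants = {"scenario": {"values": ["ssp245", "ssp370", "ssp585"], "default": "ssp245"}}
template = "monthly/tas_{scenario}_jan.jshd"

# Resolve with defaults, or with an explicit choice per axis.
defaults = {axis: spec["default"] for axis, spec in variants.items()}
print(template.format(**defaults))         # monthly/tas_ssp245_jan.jshd
print(template.format(scenario="ssp370"))  # monthly/tas_ssp370_jan.jshd
```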
Use the variant API to work with scenario-varying data:
```python
grid = GridSpec.from_yaml("data/grids/dev_fine/grid.yaml")

# Inspect available axes
grid.variants
# {"scenario": {"values": ["ssp245", "ssp370", "ssp585"], "default": "ssp245"}}

# Default file_mappings — template_path resolved with defaults
grid.file_mappings
# {"cover": Path(".../cover.jshd"),
#  "futureTempJan": Path(".../monthly/tas_ssp245_jan.jshd"), ...}

# Specific scenario
grid.file_mappings_for(scenario="ssp370")
# {"cover": Path(".../cover.jshd"),                           ← unchanged
#  "futureTempJan": Path(".../monthly/tas_ssp370_jan.jshd")}  ← resolved

# Generate a sweep over all scenarios (one line!)
param = grid.variant_sweep("scenario")
# → CompoundSweepParameter with one FileSweepParameter per template_path file
```
For preprocessing variant files, pass the variant kwarg so the output path is resolved from the template:
```python
for scenario in ["ssp245", "ssp370", "ssp585"]:
    grid.preprocess_netcdf(
        cli,
        josh_name="futureTempJan",
        variant={"scenario": scenario},
        data_file=Path(f"raw/tas_{scenario}_jan.nc"),
        variable="tas",
        units="K",
    )
grid.save()
```
Model Templating (.josh.j2)
When the same model logic runs across multiple grid configurations, use source_template_path to template the .josh file with Jinja2 rather than maintaining separate copies. The template is rendered once with template_vars to produce a concrete .josh file:
```jinja
{# models/canonical.josh.j2 #}
start simulation {{ simulation_name }}

  grid.size = {{ size_m }} m
  grid.low = {{ low_lat }} degrees latitude, {{ low_lon }} degrees longitude
  grid.high = {{ high_lat }} degrees latitude, {{ high_lon }} degrees longitude

  steps.low = 0 count
  steps.high = {{ steps }} count

  exportFiles.patch = "file:///outputs/{{ export_prefix }}_{replicate}_{run_hash}.csv"
  {% if debug %}
  debugFiles.patch = "file:///outputs/debug_{{ export_prefix }}_patch.txt"
  debugFiles.organism = "file:///outputs/debug_{{ export_prefix }}_organism.txt"
  {% endif %}

end simulation

{# Everything below is shared across all grids -- written once, not N times #}

start patch Default
  Tree.init = create 10 count of Tree
  export.treeCount.step = count(Tree)
  export.averageHeight.step = mean(Tree.height)
end patch

start organism Tree
  age.init = 0 years
  age.step = prior.age + 1 year

  {% if debug %}
  const growthAmount = sample uniform from 0 meters to config sweep_config.maxGrowth
  const d = debug("growth", growthAmount)
  height.step = prior.height + growthAmount
  {% else %}
  height.step = prior.height + sample uniform from 0 meters to config sweep_config.maxGrowth
  {% endif %}
end organism
```
This collapses what would otherwise be multiple near-identical .josh files into one template plus a dict of grid-specific values.
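To make the data flow concrete, the substitution step can be sketched with a minimal stand-in. The real pipeline uses Jinja2 (including conditionals like {% if debug %}); this hypothetical `render_vars` only handles simple {{ name }} placeholders:

```python
import re

# Minimal stand-in for the render step: substitute template_vars into
# {{ name }} placeholders. Illustrative only -- joshpy uses Jinja2.
def render_vars(template: str, template_vars: dict) -> str:
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: str(template_vars[m.group(1)]),
        template,
    )

line = "grid.size = {{ size_m }} m"
print(render_vars(line, {"size_m": 30}))  # grid.size = 30 m
```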
Usage
```python
from pathlib import Path

from joshpy.jobs import JobConfig, discover_jshd_files
from joshpy.sweep import SweepManager

config = JobConfig(
    source_template_path=Path("models/canonical.josh.j2"),
    template_vars=GRIDS["dev_fine"],  # From grid_configs.py
    config_path=Path("configs/baseline.jshc"),
    file_mappings=discover_jshd_files("data/grids/dev_fine", recursive=True),
)

with SweepManager.from_config(config, registry="experiments/baseline_dev_fine.duckdb") as mgr:
    mgr.run()
    mgr.load_results()
```
Switching grids is just a matter of changing the template_vars dict and the data directory.
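Because every path in the layout above is derived from the grid name, the swap reduces to changing one string. A hypothetical sketch (variable names are illustrative, not joshpy API):

```python
# Swap "dev_fine" for "test_fine": every derived path follows the layout conventions.
grid_name = "test_fine"  # was "dev_fine"

data_dir = f"data/grids/{grid_name}"
registry = f"experiments/baseline_{grid_name}.duckdb"
export_prefix = f"canonical_{grid_name}"
```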
template_vars are available in .jshc.j2 rendering alongside sweep parameters. If a sweep parameter has the same name as a template var, the sweep value wins.
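The precedence rule amounts to a dict merge where sweep parameters are applied last. An illustrative sketch (the actual merge happens inside joshpy's rendering):

```python
# Sweep parameters override template vars of the same name.
template_vars = {"max_growth": 5, "steps": 86}
sweep_params = {"max_growth": 10}  # same name as a template var

render_context = {**template_vars, **sweep_params}
print(render_context["max_growth"])  # 10 -- the sweep value wins
print(render_context["steps"])       # 86 -- untouched
```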
Mutual exclusivity
config_path, template_path, and template_string are mutually exclusive – use exactly one. Similarly, source_path and source_template_path are mutually exclusive.
Discovering Data Files
discover_jshd_files() auto-builds a file_mappings dict from a directory, mapping each .jshd file's stem to its absolute path.
Raises ValueError if duplicate stems are found in recursive mode. Use unique filenames or separate directories.
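The behavior is roughly equivalent to this stdlib sketch (illustrative, not the joshpy source):

```python
from pathlib import Path

# Rough equivalent of discover_jshd_files(): map each .jshd stem to its
# absolute path, rejecting duplicate stems in recursive mode.
def discover_sketch(directory: str, recursive: bool = True) -> dict[str, Path]:
    pattern = "**/*.jshd" if recursive else "*.jshd"
    mappings: dict[str, Path] = {}
    for f in sorted(Path(directory).glob(pattern)):
        if f.stem in mappings:
            raise ValueError(f"Duplicate stem: {f.stem}")
        mappings[f.stem] = f.resolve()
    return mappings
```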
Cross-Experiment Comparison with ProjectCatalog
When you have many experiments across grids and model variants, ProjectCatalog provides a lightweight provenance index that tracks what was run with what inputs, and where results live.
```python
# Register models
h1 = catalog.register_model(model_v1, name="canonical_v1")
h2 = catalog.register_model(model_v2, name="canonical_v2")
print(f"Model v1 hash: {h1}")
print(f"Model v2 hash: {h2}")
print(f"Different content = different hash: {h1 != h2}")

# Register data manifest
dh = catalog.register_data(
    {"cover": data_a, "fire": data_b},
    name="dev_fine",
)
print(f"\nData manifest hash: {dh}")

# Check what was registered
info = catalog.get_data_manifest(dh)
print(f"Data set '{info.name}': {info.file_count} files, {info.total_size} bytes")
```
```
Model v1 hash: 9d17d9ee63eb
Model v2 hash: 14004efde975
Different content = different hash: True

Data manifest hash: 2d8d59ee52aa
Data set 'dev_fine': 2 files, 55 bytes
```
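The key property to rely on is content addressing: identical bytes produce identical identifiers. A conceptual sketch (the exact hash scheme ProjectCatalog uses is an assumption here):

```python
import hashlib

# Conceptual sketch of content-addressed registration -- the actual hash
# scheme used by ProjectCatalog is not specified here, only the property:
# same bytes -> same id, different bytes -> different id.
def content_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()[:12]

h1 = content_hash(b"start simulation A ...")
h2 = content_hash(b"start simulation B ...")
print(h1 != h2)  # True -- different content, different hash
```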
```python
# Check if config_a was already run
existing = catalog.find_experiment(config_a)
if existing and existing.status == "completed":
    print(f"Already run: '{existing.name}' -> {existing.registry_path}")
else:
    print("Not found -- would need to run")

# A different config won't match
different_config_file = fixture_dir / "different.jshc"
different_config_file.write_text("maxGrowth = 100 meters")
config_c = JobConfig(
    config_path=different_config_file,
    source_path=model_v1,
    simulation="Main",
    file_mappings={"cover": data_a, "fire": data_b},
)
print(f"Different config matches: {catalog.find_experiment(config_c) is not None}")
```
```
Already run: 'baseline_v1' -> experiments/baseline_v1.duckdb
Different config matches: False
```
Listing and filtering
```python
# List all experiments
print("All experiments:")
for exp in catalog.list_experiments():
    print(f"  {exp.name}: status={exp.status}, registry={exp.registry_path}")

# Filter by status
completed = catalog.list_experiments(status="completed")
print(f"\nCompleted: {len(completed)}")

# Filter by model name
v1_exps = catalog.list_experiments(model_name="canonical_v1")
print(f"Using canonical_v1: {len(v1_exps)}")
```
```
All experiments:
  baseline_v2: status=completed, registry=experiments/baseline_v2.duckdb
  baseline_v1: status=completed, registry=experiments/baseline_v1.duckdb

Completed: 2
Using canonical_v1: 1
```
Orchestration metadata
Each joshpy-created experiment automatically stores provenance metadata. This metadata lives in an optional orchestration JSON field; the core catalog fields (model_hash, config_hash, data_manifest_hash) are josh-agnostic, so any tool can produce and consume them.
SweepManager integration
The catalog can be wired into SweepManager directly via the builder. It auto-registers the experiment on build() and updates status after run():
```python
catalog = ProjectCatalog("catalog.duckdb")

manager = (
    SweepManager.builder(config)
    .with_registry("experiments/sweep.duckdb")
    .with_catalog(catalog, experiment_name="fire_sensitivity_dev_fine")
    .build()  # Registers experiment as "running"
)
results = manager.run()  # Updates to "completed" or "failed"
```
Cross-experiment queries
The catalog can open multiple experiment registries simultaneously via DuckDB ATTACH, enabling cross-experiment SQL:
```python
experiments = catalog.list_experiments(status="completed")

with catalog.open_registries(experiments) as conn:
    df = conn.execute("""
        SELECT 'baseline' as experiment, step, AVG(averageHeight) as mean_height
        FROM exp_0.cell_data
        GROUP BY step
        UNION ALL
        SELECT 'drought_fix' as experiment, step, AVG(averageHeight) as mean_height
        FROM exp_1.cell_data
        GROUP BY step
    """).df()
```