registry.RunRegistry

registry.RunRegistry(
    db_path='josh_runs.duckdb',
    enable_spatial=True,
    _conn=None,
    _spatial_filter_bbox=None,
    _spatial_filter_geojson=None,
    _time_filter_range=None,
)

DuckDB-backed registry for tracking parameter sweeps and job runs.

Supports both file-based persistence and in-memory mode (using ":memory:").

Attributes

Name Type Description
db_path Path | str Path to the DuckDB database file, or ":memory:" for in-memory.
enable_spatial bool If True, load spatial extension and create geometry column.
_conn Any DuckDB connection (created automatically).

Examples

>>> # File-based (persistent)
>>> registry = RunRegistry("experiment.duckdb")
>>> # In-memory (for testing)
>>> registry = RunRegistry(":memory:")
>>> # With spatial support enabled (default)
>>> registry = RunRegistry("experiment.duckdb", enable_spatial=True)
>>> # Context manager
>>> with RunRegistry("experiment.duckdb") as registry:
...     session_id = registry.create_session(...)

Methods

Name Description
bottle Create a self-contained bottle archive from a registered run.
check_sparsity Check for sparse columns in cell_data.
close Close the database connection.
compare_configs Export two run configs and open a side-by-side diff in an IDE.
compare_josh Export two runs’ josh sources and open a side-by-side diff in an IDE.
complete_run Record the completion of a job run.
create_session Create a new sweep session.
export_config Export a run’s config content to a file for IDE diffing.
export_josh Export a run’s josh source content to a file for IDE viewing/diffing.
export_results_df Export session results as a pandas DataFrame.
get_config_by_hash Get config information by run hash.
get_configs_for_session Get all configs for a session.
get_data_summary Get summary of all data in registry.
get_debug_output_files Get debug output file paths for a labeled/hashed run.
get_file_mappings Get the external data file mappings for a run.
get_replicate_count Get the number of distinct replicates for a run hash from cell_data.
get_run Get run information by ID.
get_runs_by_parameters Query runs by parameter values.
get_runs_for_hash Get all runs for a run hash.
get_session Get session information by ID.
get_session_summary Get aggregated statistics for a session.
label_run Assign a human-readable label to a run configuration.
list_config_columns List all parameter column names in config_parameters.
list_config_parameters List all config parameter names from sweep configurations.
list_entity_types List all entity types found in cell_data.
list_export_variables List all export variable names from simulation outputs.
list_labels List all labeled runs.
list_parameters Deprecated alias for list_config_parameters(); call list_config_parameters() instead.
list_sessions List sessions, optionally filtered by experiment name.
list_variable_columns List all variable column names in cell_data.
list_variables Deprecated alias for list_export_variables(); call list_export_variables() instead.
load_debug Load debug messages for a run from registered debug output files.
query Execute a SQL query with parameters.
register_output Register an output file from a run.
register_run Register a job configuration (run specification).
resolve_config_source Locate the original .jshc file on disk and check if it still matches.
resolve_josh_source Locate the original .josh file on disk and check if it still matches.
resolve_label Get the run_hash for a labeled run.
resolve_latest Get the run_hash of the most recently created run whose label starts with prefix.
spatial_filter Context manager for spatial filtering of queries.
start_run Record the start of a job run.
time_filter Context manager for temporal filtering of queries.
to_csv Export a table to CSV format.
to_parquet Export a table to Parquet format.
update_session_status Update the status of a session.

bottle

registry.RunRegistry.bottle(
    label_or_hash,
    output_dir=Path('bottles'),
    cli=None,
    omit_jshd=False,
)

Create a self-contained bottle archive from a registered run.

By default, copies data files into the archive and raises if any are missing. Use omit_jshd=True for lightweight archives when the recipient has the data locally.

Parameters

Name Type Description Default
label_or_hash str Run label or run_hash to bottle. required
output_dir str | Path Directory for the archive. Default: ./bottles/. Path('bottles')
cli Any | None Optional JoshCLI instance for JAR metadata. None
omit_jshd bool If True, skip copying .jshd data files. False

Returns

Name Type Description
Path Path to the created .tar.gz archive.

Raises

Name Type Description
KeyError If the run is not found.
ValueError If josh content is not stored for the run.
FileNotFoundError If omit_jshd is False and a data file is missing.
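
The omit_jshd contract described above can be illustrated with a small standard-library sketch. The function name make_bottle, the archive file name, and the data/ layout are hypothetical; this is a stand-in for the documented behaviour, not the way RunRegistry.bottle is implemented.

```python
import tarfile
import tempfile
from pathlib import Path


def make_bottle(josh_file, data_files, output_dir, omit_jshd=False):
    """Hypothetical stand-in: pack a run's files into a .tar.gz archive."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    if not omit_jshd:
        # Mirror the documented contract: fail early if data is missing.
        missing = [str(p) for p in data_files if not Path(p).exists()]
        if missing:
            raise FileNotFoundError(f"missing data files: {missing}")
    archive = output_dir / "run.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(josh_file, arcname=Path(josh_file).name)
        if not omit_jshd:
            for p in data_files:
                tar.add(p, arcname=f"data/{Path(p).name}")
    return archive


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    josh = tmp / "simulation.josh"
    josh.write_text("# placeholder source\n")
    absent = tmp / "climate.jshd"  # deliberately never created

    # Lightweight archive: the missing data file is simply skipped.
    light = make_bottle(josh, [absent], tmp / "bottles", omit_jshd=True)
    with tarfile.open(light) as tar:
        light_names = tar.getnames()

    # Default mode: the same missing file is an error.
    try:
        make_bottle(josh, [absent], tmp / "bottles")
        raised = False
    except FileNotFoundError:
        raised = True
```

Checking for missing files before opening the archive avoids leaving a half-written .tar.gz behind when the call fails.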

check_sparsity

registry.RunRegistry.check_sparsity()

Check for sparse columns in cell_data.

Sparse columns (>50% NULL by default) often indicate that different simulation types are being mixed in the same registry, which hurts query performance.

Returns

Name Type Description
SparsityReport SparsityReport with statistics for each variable column.

Examples

>>> report = registry.check_sparsity()
>>> if report.should_warn:
...     print(report)
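
For intuition about what a sparsity check measures, the sketch below computes the NULL fraction per column and flags those above 50%. It uses sqlite3 as a stand-in for DuckDB so that it runs with the standard library alone; the helper null_fractions and the column fireSeverity are hypothetical, and SparsityReport itself may compute its statistics differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # sqlite3 stands in for DuckDB here
conn.execute(
    'CREATE TABLE cell_data (step INTEGER, "treeCount" REAL, "fireSeverity" REAL)'
)
conn.executemany(
    "INSERT INTO cell_data VALUES (?, ?, ?)",
    [(0, 12.0, None), (1, 15.0, None), (2, 11.0, None), (3, 9.0, 0.4)],
)


def null_fractions(conn, table, columns):
    """Return {column: fraction of rows that are NULL}."""
    total = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    result = {}
    for col in columns:
        # COUNT(col) skips NULLs, so total - COUNT(col) is the NULL count.
        non_null = conn.execute(
            f'SELECT COUNT("{col}") FROM "{table}"'
        ).fetchone()[0]
        result[col] = (total - non_null) / total if total else 0.0
    return result


fractions = null_fractions(conn, "cell_data", ["treeCount", "fireSeverity"])
sparse = [col for col, frac in fractions.items() if frac > 0.5]
```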

close

registry.RunRegistry.close()

Close the database connection.

compare_configs

registry.RunRegistry.compare_configs(
    label_or_hash_1,
    label_or_hash_2,
    ide='vscode',
)

Export two run configs and open a side-by-side diff in an IDE.

Convenience wrapper around joshpy.diff.open_diff.

Parameters

Name Type Description Default
label_or_hash_1 str Label or run_hash of the first run. required
label_or_hash_2 str Label or run_hash of the second run. required
ide str IDE to open diff in (default: "vscode"). Supported: "vscode", "cursor". 'vscode'

Returns

Name Type Description
tuple[Path, Path] Tuple of Paths to the exported config files.

Raises

Name Type Description
KeyError If a label or hash is not found.
RuntimeError If the IDE CLI is not found in PATH.

compare_josh

registry.RunRegistry.compare_josh(
    label_or_hash_1,
    label_or_hash_2,
    ide='vscode',
)

Export two runs’ josh sources and open a side-by-side diff in an IDE.

Convenience wrapper around joshpy.inspect.open_josh_diff.

Parameters

Name Type Description Default
label_or_hash_1 str Label or run_hash of the first run. required
label_or_hash_2 str Label or run_hash of the second run. required
ide str IDE to open diff in (default: "vscode"). Supported: "vscode", "cursor". 'vscode'

Returns

Name Type Description
tuple[Path, Path] Tuple of Paths to the exported josh files.

Raises

Name Type Description
KeyError If a label or hash is not found, or josh content is not stored.
RuntimeError If the IDE CLI is not found in PATH.

complete_run

registry.RunRegistry.complete_run(run_id, exit_code, error_message=None)

Record the completion of a job run.

Parameters

Name Type Description Default
run_id str The run ID to update. required
exit_code int Process exit code (0 = success). required
error_message str | None Error message if run failed. None

create_session

registry.RunRegistry.create_session(
    config,
    experiment_name=None,
    session_id=None,
)

Create a new sweep session.

Parameters

Name Type Description Default
config Any Job configuration containing simulation, template, and sweep info. Must provide simulation and template_path attributes and a to_dict() method (typically a JobConfig from joshpy.jobs). required
experiment_name str | None Name for the experiment. Defaults to config.simulation. None
session_id str | None Optional externally-provided session ID. If None, generates a UUID. This allows the frontend/API layer to manage session IDs (e.g., using project IDs). None

Returns

Name Type Description
str The session ID (generated or provided).

Examples

>>> from joshpy.jobs import JobConfig, SweepConfig, ConfigSweepParameter
>>> config = JobConfig(
...     template_path=Path("template.jshc.j2"),
...     source_path=Path("simulation.josh"),
...     simulation="Main",
...     sweep=SweepConfig(
...         config_parameters=[ConfigSweepParameter(name="maxGrowth", values=[10, 20, 30])]
...     ),
... )
>>> session_id = registry.create_session(config=config)

Note

total_jobs and total_replicates are computed from the JobSet after job expansion. Use job_set.total_jobs and job_set.total_replicates.

export_config

registry.RunRegistry.export_config(label_or_hash, output_dir)

Export a run’s config content to a file for IDE diffing.

Resolves by label first, falls back to run_hash.

Parameters

Name Type Description Default
label_or_hash str Label or run_hash to look up. required
output_dir str | Path Directory to write the config file to. required

Returns

Name Type Description
Path Path to the written file.

Raises

Name Type Description
KeyError If no matching run is found.
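
The "label first, then run_hash" lookup order can be pictured with the following minimal sketch. The function resolve_run_hash and its dictionary-based inputs are hypothetical; the registry performs the equivalent lookup against its own tables.

```python
def resolve_run_hash(label_or_hash, labels, known_hashes):
    """Hypothetical lookup order: try the label table, then the raw hash."""
    if label_or_hash in labels:
        return labels[label_or_hash]
    if label_or_hash in known_hashes:
        return label_or_hash
    raise KeyError(f"no run matches {label_or_hash!r}")


labels = {"baseline": "aaa111"}        # label -> run_hash
known_hashes = {"aaa111", "bbb222"}    # every registered run_hash

by_label = resolve_run_hash("baseline", labels, known_hashes)
by_hash = resolve_run_hash("bbb222", labels, known_hashes)
```

One consequence of this order is that a label which happens to equal another run's hash would win the lookup, so labels that resemble hashes are best avoided.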

export_josh

registry.RunRegistry.export_josh(label_or_hash, output_dir)

Export a run’s josh source content to a file for IDE viewing/diffing.

Resolves by label first, falls back to run_hash.

Parameters

Name Type Description Default
label_or_hash str Label or run_hash to look up. required
output_dir str | Path Directory to write the josh file to. required

Returns

Name Type Description
Path Path to the written file.

Raises

Name Type Description
KeyError If no matching run is found or josh content is not stored.

export_results_df

registry.RunRegistry.export_results_df(session_id)

Export session results as a pandas DataFrame.

Parameters

Name Type Description Default
session_id str The session to export. required

Returns

Name Type Description
Any pandas DataFrame with run results and parameters.

Raises

Name Type Description
ImportError If pandas is not installed.

get_config_by_hash

registry.RunRegistry.get_config_by_hash(run_hash)

Get config information by run hash.

Parameters

Name Type Description Default
run_hash str The run hash to look up. required

Returns

Name Type Description
ConfigInfo | None ConfigInfo if found, None otherwise.

get_configs_for_session

registry.RunRegistry.get_configs_for_session(session_id)

Get all configs for a session.

Parameters

Name Type Description Default
session_id str The session ID to get configs for. required

Returns

Name Type Description
list[ConfigInfo] List of ConfigInfo objects.

get_data_summary

registry.RunRegistry.get_data_summary(session_id=None)

Get summary of all data in registry.

Provides counts, available variables, parameters, and data ranges for diagnostic purposes.

Parameters

Name Type Description Default
session_id str | None Optional session ID to filter by. None

Returns

Name Type Description
DataSummary DataSummary with counts and metadata.

get_debug_output_files

registry.RunRegistry.get_debug_output_files(
    label_or_hash,
    *,
    run_id=None,
    entity_types=None,
    existing_only=True,
)

Get debug output file paths for a labeled/hashed run.

Parameters

Name Type Description Default
label_or_hash str Run label or run_hash. required
run_id str | None Optional explicit run execution ID. If omitted, uses the latest execution for the run hash. None
entity_types list[str] | None Optional debug entity types to include (e.g., ["organism", "patch"]). If omitted, includes all. None
existing_only bool If True, return only paths that exist on disk. True

Returns

Name Type Description
list[Path] List of debug file paths for the selected run execution.

Raises

Name Type Description
KeyError If the run or run execution is not found.
ValueError If the explicit run_id does not belong to the run hash.

get_file_mappings

registry.RunRegistry.get_file_mappings(label_or_hash)

Get the external data file mappings for a run.

Parameters

Name Type Description Default
label_or_hash str Run label or run_hash. required

Returns

Name Type Description
dict[str, Path] | None Dict mapping data names to file paths, or None if no file mappings were registered for this run.

Raises

Name Type Description
KeyError If the label or hash is not found.

get_replicate_count

registry.RunRegistry.get_replicate_count(run_hash)

Get the number of distinct replicates for a run hash from cell_data.

This is the source-of-truth count, derived from actual loaded data rather than from job_runs metadata. Returns 0 if no data has been loaded yet.

Counts distinct (run_id, replicate) pairs rather than just distinct replicate values, because pooled runs may reuse replicate numbers across different CLI invocations.

Parameters

Name Type Description Default
run_hash str The run hash to count replicates for. required

Returns

Name Type Description
int Number of distinct replicates in cell_data.
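
The difference between counting distinct replicate values and counting distinct (run_id, replicate) pairs shows up as soon as two invocations reuse replicate numbers. The sketch below uses sqlite3 as a stand-in for DuckDB, with a reduced cell_data table whose columns are assumed for illustration; the SQL the registry actually issues may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the DuckDB connection
conn.execute(
    "CREATE TABLE cell_data (run_hash TEXT, run_id TEXT, replicate INTEGER, step INTEGER)"
)
conn.executemany(
    "INSERT INTO cell_data VALUES (?, ?, ?, ?)",
    [
        ("abc123", "run-1", 0, 0),
        ("abc123", "run-1", 0, 1),
        ("abc123", "run-1", 1, 0),
        ("abc123", "run-2", 0, 0),  # second invocation reuses replicate 0
        ("abc123", "run-2", 1, 0),  # ...and replicate 1
    ],
)

# Counting replicate values alone undercounts pooled runs.
naive = conn.execute(
    "SELECT COUNT(DISTINCT replicate) FROM cell_data WHERE run_hash = ?",
    ["abc123"],
).fetchone()[0]

# Counting (run_id, replicate) pairs sees each replicate of each invocation.
paired = conn.execute(
    "SELECT COUNT(*) FROM "
    "(SELECT DISTINCT run_id, replicate FROM cell_data WHERE run_hash = ?)",
    ["abc123"],
).fetchone()[0]
```

Here naive is 2 while paired is 4, which is the count the method description calls the source of truth.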

get_run

registry.RunRegistry.get_run(run_id)

Get run information by ID.

Parameters

Name Type Description Default
run_id str The run ID to look up. required

Returns

Name Type Description
RunInfo | None RunInfo if found, None otherwise.

get_runs_by_parameters

registry.RunRegistry.get_runs_by_parameters(**params)

Query runs by parameter values.

Parameters

Name Type Description Default
**params Any Parameter name-value pairs to filter by. {}

Returns

Name Type Description
list[dict[str, Any]] List of dicts containing run info and parameters.
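
Because parameter names may contain characters such as dots, keyword filters have to be turned into quoted identifiers with bound values. The sketch below shows one way to do that; quote_identifier and build_parameter_filter are hypothetical helpers, not part of the registry API.

```python
def quote_identifier(name):
    """Double-quote an SQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_parameter_filter(**params):
    """Hypothetical: turn keyword filters into a WHERE clause plus bind values."""
    if not params:
        return "", []
    clauses = [f"{quote_identifier(name)} = ?" for name in params]
    return "WHERE " + " AND ".join(clauses), list(params.values())


# Names that are not valid Python identifiers can be passed via ** unpacking.
where, values = build_parameter_filter(maxGrowth=10, **{"soil.moisture": 0.3})
```

The same unpacking trick applies when calling get_runs_by_parameters with a name like soil.moisture, since it cannot be written as a plain keyword argument.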

get_runs_for_hash

registry.RunRegistry.get_runs_for_hash(run_hash)

Get all runs for a run hash.

Parameters

Name Type Description Default
run_hash str The run hash to get runs for. required

Returns

Name Type Description
list[RunInfo] List of RunInfo objects.

get_session

registry.RunRegistry.get_session(session_id)

Get session information by ID.

Parameters

Name Type Description Default
session_id str The session ID to look up. required

Returns

Name Type Description
SessionInfo | None SessionInfo if found, None otherwise.

get_session_summary

registry.RunRegistry.get_session_summary(session_id)

Get aggregated statistics for a session.

Parameters

Name Type Description Default
session_id str The session ID to summarize. required

Returns

Name Type Description
SessionSummary | None SessionSummary with counts, or None if session not found.

label_run

registry.RunRegistry.label_run(run_hash, label, force=False, on_collision=None)

Assign a human-readable label to a run configuration.

Labels are unique within a registry. When a collision occurs, the behavior depends on force and on_collision:

  • Default: raise ValueError
  • force=True: silently drop the old label and reassign
  • on_collision="timestamp": rename the old label with a timestamp suffix (e.g., baseline becomes baseline_20260402_153000) and assign the bare label to the new run

Parameters

Name Type Description Default
run_hash str The run hash to label. required
label str Human-readable label (e.g., “baseline”, “high_mortality”). required
force bool If True, reassign the label even if already taken. False
on_collision str | None Collision strategy. "timestamp" archives the old label with a timestamp suffix. Mutually exclusive with force. None

Raises

Name Type Description
KeyError If run_hash does not exist.
ValueError If label is already assigned to a different run and neither force nor on_collision is set, or if both force and on_collision are set, or if on_collision has an invalid value.
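
The three collision behaviours can be modelled with a plain dictionary. The sketch below is a hypothetical in-memory model (assign_label and its now argument are not part of the registry API) that follows the rules listed above, including the mutual exclusion of force and on_collision.

```python
from datetime import datetime


def assign_label(labels, run_hash, label, force=False, on_collision=None, now=None):
    """Hypothetical model of the documented collision rules.

    `labels` maps label -> run_hash and is modified in place.
    """
    if force and on_collision is not None:
        raise ValueError("force and on_collision are mutually exclusive")
    if on_collision not in (None, "timestamp"):
        raise ValueError(f"invalid on_collision: {on_collision!r}")

    current = labels.get(label)
    if current is not None and current != run_hash:
        if on_collision == "timestamp":
            stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
            labels[f"{label}_{stamp}"] = current  # archive the old owner
        elif not force:
            raise ValueError(f"label {label!r} is already assigned")
    labels[label] = run_hash  # the bare label now points at the new run


labels = {"baseline": "aaa111"}
assign_label(
    labels, "bbb222", "baseline",
    on_collision="timestamp",
    now=datetime(2026, 4, 2, 15, 30, 0),
)
```

After the call, baseline points at the new run and the previous owner is kept under baseline_20260402_153000.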

list_config_columns

registry.RunRegistry.list_config_columns()

List all parameter column names in config_parameters.

Returns the dynamically-added parameter columns. Column names preserve original names with special characters (e.g., ‘soil.moisture’).

Returns

Name Type Description
list[str] Sorted list of parameter column names.

Examples

>>> registry.list_config_columns()
['maxGrowth', 'scenario', 'soil.moisture']

list_config_parameters

registry.RunRegistry.list_config_parameters(session_id=None)

List all config parameter names from sweep configurations.

These are the parameters you defined in your JobConfig sweep, stored as typed columns in the config_parameters table.

Parameters

Name Type Description Default
session_id str | None Optional session ID to filter by. None

Returns

Name Type Description
list[str] Sorted list of parameter names.

Examples

>>> registry.list_config_parameters()
['maxGrowth', 'scenario', 'survivalProb']

list_entity_types

registry.RunRegistry.list_entity_types(session_id=None)

List all entity types found in cell_data.

Parameters

Name Type Description Default
session_id str | None Optional session ID to filter by. None

Returns

Name Type Description
list[str] Sorted list of entity type names.

list_export_variables

registry.RunRegistry.list_export_variables(session_id=None)

List all export variable names from simulation outputs.

These are the variables exported by Josh simulations, stored as typed columns in the cell_data table. Variable names preserve original .josh names (e.g., ‘avg.height’).

When session_id is provided, only returns variables that have at least one non-NULL value for runs in that session.

Parameters

Name Type Description Default
session_id str | None Optional session ID to filter by. If provided, only returns variables with data in that session. None

Returns

Name Type Description
list[str] Sorted list of variable column names.

Examples

>>> registry.list_export_variables()
['averageAge', 'avg.height', 'treeCount']
>>> registry.list_export_variables(session_id="abc123")
['treeCount']  # Only variables with data in this session

list_labels

registry.RunRegistry.list_labels()

List all labeled runs.

Returns

Name Type Description
list[tuple[str, str]] List of (label, run_hash) tuples, sorted by label.

list_parameters

registry.RunRegistry.list_parameters(session_id=None)

Deprecated alias for list_config_parameters(); call list_config_parameters() instead.

list_sessions

registry.RunRegistry.list_sessions(experiment_name=None, limit=100)

List sessions, optionally filtered by experiment name.

Parameters

Name Type Description Default
experiment_name str | None Filter by experiment name (optional). None
limit int Maximum number of sessions to return. 100

Returns

Name Type Description
list[SessionInfo] List of SessionInfo objects, ordered by creation time (newest first).

list_variable_columns

registry.RunRegistry.list_variable_columns()

List all variable column names in cell_data.

Returns the dynamically-added variable columns. Column names preserve original names with special characters (e.g., ‘avg.height’).

Returns

Name Type Description
list[str] Sorted list of variable column names.

Examples

>>> registry.list_variable_columns()
['averageAge', 'avg.height', 'treeCount']

list_variables

registry.RunRegistry.list_variables(session_id=None)

Deprecated alias for list_export_variables(); call list_export_variables() instead.

load_debug

registry.RunRegistry.load_debug(
    label_or_hash,
    *,
    run_id=None,
    entity_types=None,
    existing_only=True,
)

Load debug messages for a run from registered debug output files.

Parameters

Name Type Description Default
label_or_hash str Run label or run_hash. required
run_id str | None Optional explicit run execution ID. If omitted, uses latest. None
entity_types list[str] | None Optional debug entity types to include. None
existing_only bool If True, only load files that currently exist. True

Returns

Name Type Description
Any DebugMessageStore with messages merged across all selected files.

Raises

Name Type Description
KeyError If the run or run execution is not found.
ValueError If no matching debug files are available.
FileNotFoundError If existing_only=False and any file is missing.

query

registry.RunRegistry.query(sql, params=None)

Execute a SQL query with parameters.

This provides direct access to DuckDB for custom queries beyond the pre-built methods. Use this when you need to run complex queries or explore the data in ways not covered by the API.

Parameters

Name Type Description Default
sql str SQL query with ? placeholders for parameters. required
params list | None List of parameter values. None

Returns

Name Type Description
Any DuckDB relation (call .df() for DataFrame, .fetchall() for tuples).

Examples

>>> # Get DataFrame
>>> df = registry.query(
...     "SELECT * FROM cell_data WHERE step BETWEEN ? AND ?",
...     [0, 10]
... ).df()
>>> # Get raw results
>>> rows = registry.query(
...     "SELECT COUNT(*) FROM cell_data WHERE run_hash = ?",
...     ["abc123"]
... ).fetchone()

register_output

registry.RunRegistry.register_output(
    run_id,
    output_type,
    file_path,
    file_size=None,
    row_count=None,
)

Register an output file from a run.

Parameters

Name Type Description Default
run_id str The run this output belongs to. required
output_type str Type of output (e.g., ‘csv’, ‘log’, ‘error’). required
file_path str Path to the output file. required
file_size int | None Size of the file in bytes. None
row_count int | None Number of rows (for tabular data). None

Returns

Name Type Description
str The generated output ID.

register_run

registry.RunRegistry.register_run(
    session_id,
    run_hash,
    josh_path,
    config_content,
    file_mappings,
    parameters,
    josh_content=None,
)

Register a job configuration (run specification).

Parameters

Name Type Description Default
session_id str Session this config belongs to. required
run_hash str MD5 hash of josh + config + file_mappings (12 chars). required
josh_path str Path to the .josh script file. required
config_content str Full text of the rendered configuration. required
file_mappings dict[str, dict[str, str]] | None Dict mapping names to {“path”: “…”, “hash”: “…”}. required
parameters dict[str, Any] Parameter values used to generate this config. required
josh_content str | None Rendered .josh source content (optional). None
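
A 12-character hash over josh source, config, and file mappings could be derived roughly as follows. The serialization (JSON with sorted keys, newline-joined) and the helper name sketch_run_hash are assumptions made for illustration; the hashes joshpy produces are not guaranteed to match this scheme.

```python
import hashlib
import json


def sketch_run_hash(josh_content, config_content, file_mappings):
    """Hypothetical: hash the three inputs and keep the first 12 hex chars."""
    # sort_keys makes the digest independent of dict insertion order.
    mappings_text = json.dumps(file_mappings or {}, sort_keys=True)
    payload = "\n".join([josh_content, config_content, mappings_text])
    return hashlib.md5(payload.encode("utf-8")).hexdigest()[:12]


mappings = {"soil": {"path": "soil.jshd", "hash": "f00"}}  # placeholder values
h1 = sketch_run_hash("placeholder josh source", "placeholder config A", mappings)
h2 = sketch_run_hash("placeholder josh source", "placeholder config B", mappings)
h1_again = sketch_run_hash("placeholder josh source", "placeholder config A", mappings)
```

Identical inputs give the same hash, and changing any one of the three inputs changes it, which is what lets run_hash identify a run specification.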

resolve_config_source

registry.RunRegistry.resolve_config_source(run_hash)

Locate the original .jshc file on disk and check if it still matches.

Looks up the session metadata to find the original config_path, then checks whether the file exists and whether its content has changed since it was registered.

Parameters

Name Type Description Default
run_hash str The run hash to look up. required

Returns

Name Type Description
A ConfigSourceInfo describing the file’s status.

resolve_josh_source

registry.RunRegistry.resolve_josh_source(run_hash)

Locate the original .josh file on disk and check if it still matches.

Compares the file at josh_path against the stored josh_content.

Parameters

Name Type Description Default
run_hash str The run hash to look up. required

Returns

Name Type Description
A ConfigSourceInfo describing the file’s status.

resolve_label

registry.RunRegistry.resolve_label(label)

Get the run_hash for a labeled run.

Parameters

Name Type Description Default
label str The label to look up. required

Returns

Name Type Description
str The run_hash associated with the label.

Raises

Name Type Description
KeyError If no run has this label.

resolve_latest

registry.RunRegistry.resolve_latest(prefix)

Get the run_hash of the most recently created run whose label starts with prefix.

Useful after on_collision="timestamp" to find the latest run among "baseline", "baseline_20260402_153000", etc.

Parameters

Name Type Description Default
prefix str Label prefix to match (e.g., “baseline”). required

Returns

Name Type Description
str The run_hash of the most recently created matching run.

Raises

Name Type Description
KeyError If no runs have labels starting with prefix.
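
The prefix lookup can be thought of as "filter labels by prefix, then take the run with the newest creation time". The record layout and the helper latest_for_prefix below are hypothetical and only model that rule.

```python
from datetime import datetime

# Hypothetical label records: (label, run_hash, creation time of the run).
records = [
    ("baseline_20260402_153000", "aaa111", datetime(2026, 4, 2, 15, 30)),
    ("baseline", "bbb222", datetime(2026, 4, 3, 9, 0)),
    ("high_mortality", "ccc333", datetime(2026, 4, 4, 12, 0)),
]


def latest_for_prefix(records, prefix):
    """Return the run_hash of the newest run whose label starts with prefix."""
    matches = [r for r in records if r[0].startswith(prefix)]
    if not matches:
        raise KeyError(f"no labels start with {prefix!r}")
    return max(matches, key=lambda r: r[2])[1]


latest = latest_for_prefix(records, "baseline")
```

Note that the match is a plain prefix test, so a prefix such as "base" would also match unrelated labels like "baseline_alt" if they existed.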

spatial_filter

registry.RunRegistry.spatial_filter(bbox=None, geojson=None)

Context manager for spatial filtering of queries.

All DiagnosticQueries within this context will be spatially filtered. Can be nested with time_filter().

Parameters

Name Type Description Default
bbox tuple[float, float, float, float] | None Bounding box as (min_lon, max_lon, min_lat, max_lat). None
geojson str | dict | None GeoJSON polygon string or dict. None

Raises

Name Type Description
ValueError If both bbox and geojson are provided.

Examples

>>> with registry.spatial_filter(bbox=(-116, -115, 33.5, 34.0)):
...     df = queries.get_timeseries("height", run_hash="abc123")
>>> # Nested with time filter
>>> with registry.spatial_filter(geojson=park_boundary):
...     with registry.time_filter(step_range=(0, 50)):
...         df = queries.get_timeseries("height", run_hash="abc123")
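
Nestable filter contexts are commonly built by saving the previous filter on entry and restoring it on exit. The sketch below shows that pattern with contextlib; the FilterState class is hypothetical and only mirrors the idea suggested by the private filter attributes in the constructor signature, not the registry's actual code.

```python
from contextlib import contextmanager


class FilterState:
    """Hypothetical holder for the active filters; not the RunRegistry class."""

    def __init__(self):
        self.bbox = None
        self.step_range = None

    @contextmanager
    def spatial_filter(self, bbox):
        previous, self.bbox = self.bbox, bbox
        try:
            yield self
        finally:
            self.bbox = previous  # restore even if the body raises

    @contextmanager
    def time_filter(self, step_range):
        previous, self.step_range = self.step_range, step_range
        try:
            yield self
        finally:
            self.step_range = previous


state = FilterState()
seen = []
with state.spatial_filter(bbox=(-116, -115, 33.5, 34.0)):
    with state.time_filter(step_range=(0, 50)):
        seen.append((state.bbox, state.step_range))  # both filters active
    seen.append((state.bbox, state.step_range))      # only spatial remains
seen.append((state.bbox, state.step_range))          # everything cleared
```

Restoring the previous value rather than resetting to None is what makes nesting the same kind of filter safe in this pattern.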

start_run

registry.RunRegistry.start_run(
    run_hash,
    *,
    session_id,
    replicate=0,
    output_path=None,
    metadata=None,
)

Record the start of a job run.

Parameters

Name Type Description Default
run_hash str Run hash for this run. required
session_id str Session that initiated this run. required
replicate int Replicate number (0-indexed). 0
output_path str | None Path where output will be written. None
metadata dict[str, Any] | None Additional metadata. None

Returns

Name Type Description
str The generated run ID.

time_filter

registry.RunRegistry.time_filter(step_range)

Context manager for temporal filtering of queries.

All DiagnosticQueries within this context will be filtered to the specified step range. Can be nested with spatial_filter().

Parameters

Name Type Description Default
step_range tuple[int, int] Tuple of (min_step, max_step) inclusive. required

Examples

>>> with registry.time_filter(step_range=(0, 50)):
...     df = queries.get_timeseries("height", run_hash="abc123")
>>> # Nested with spatial filter
>>> with registry.time_filter(step_range=(10, 20)):
...     with registry.spatial_filter(bbox=(-116, -115, 33.5, 34.0)):
...         df = queries.get_comparison("height", group_by="maxGrowth")
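
Since step_range is inclusive at both ends, it corresponds to an SQL BETWEEN condition on the step column. The sketch below demonstrates the boundary behaviour with sqlite3 as a stand-in for DuckDB; the table contents are made up, and how DiagnosticQueries injects the condition is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the DuckDB connection
conn.execute("CREATE TABLE cell_data (step INTEGER, height REAL)")
conn.executemany(
    "INSERT INTO cell_data VALUES (?, ?)",
    [(step, 1.5 * step) for step in range(6)],  # steps 0..5
)

step_range = (1, 3)  # (min_step, max_step), both ends included
steps = [
    row[0]
    for row in conn.execute(
        "SELECT step FROM cell_data WHERE step BETWEEN ? AND ? ORDER BY step",
        step_range,
    )
]
```

Both boundary steps, 1 and 3, appear in the result.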

to_csv

registry.RunRegistry.to_csv(path, table='cell_data')

Export a table to CSV format.

Parameters

Name Type Description Default
path str | Path Output file path. required
table str Table name to export (default: cell_data). 'cell_data'

Examples

>>> registry.to_csv("results.csv")

In R:

>>> # df <- readr::read_csv("results.csv")

to_parquet

registry.RunRegistry.to_parquet(path, table='cell_data')

Export a table to Parquet format.

Parquet is recommended for R/Python analysis due to compression and type preservation.

Parameters

Name Type Description Default
path str | Path Output file path. required
table str Table name to export (default: cell_data). 'cell_data'

Examples

>>> registry.to_parquet("results.parquet")

In R:

>>> # df <- arrow::read_parquet("results.parquet")

In Python:

>>> import pandas as pd
>>> df = pd.read_parquet("results.parquet")

update_session_status

registry.RunRegistry.update_session_status(session_id, status)

Update the status of a session.

Parameters

Name Type Description Default
session_id str The session to update. required
status str New status (e.g., ‘running’, ‘completed’, ‘failed’). required