registry.RunRegistry

registry.RunRegistry(
    db_path='josh_runs.duckdb',
    enable_spatial=True,
    _conn=None,
    _spatial_filter_bbox=None,
    _spatial_filter_geojson=None,
    _time_filter_range=None,
)

DuckDB-backed registry for tracking parameter sweeps and job runs.

Supports both file-based persistence and in-memory mode (pass ":memory:" as db_path).

Attributes

Name	Type	Description
db_path	Path | str	Path to the DuckDB database file, or ":memory:" for an in-memory database.
enable_spatial	bool	If True, load spatial extension and create geometry column.
_conn	Any	DuckDB connection (created automatically).

Examples

>>> # File-based (persistent)
>>> registry = RunRegistry("experiment.duckdb")

>>> # In-memory (for testing)
>>> registry = RunRegistry(":memory:")

>>> # With spatial support enabled (default)
>>> registry = RunRegistry("experiment.duckdb", enable_spatial=True)

>>> # Context manager
>>> with RunRegistry("experiment.duckdb") as registry:
...     session_id = registry.create_session(...)

Methods

Name	Description
check_sparsity	Check for sparse columns in cell_data.
close	Close the database connection.
complete_run	Record the completion of a job run.
create_session	Create a new sweep session.
export_results_df	Export session results as a pandas DataFrame.
get_config_by_hash	Get config information by run hash.
get_configs_for_session	Get all configs for a session.
get_data_summary	Get summary of all data in registry.
get_run	Get run information by ID.
get_runs_by_parameters	Query runs by parameter values.
get_runs_for_hash	Get all runs for a run hash.
get_session	Get session information by ID.
get_session_summary	Get aggregated statistics for a session.
list_config_columns	List all parameter column names in config_parameters.
list_config_parameters	List all config parameter names from sweep configurations.
list_entity_types	List all entity types found in cell_data.
list_export_variables	List all export variable names from simulation outputs.
list_parameters	Deprecated alias for list_config_parameters(); use list_config_parameters() instead.
list_sessions	List sessions, optionally filtered by experiment name.
list_variable_columns	List all variable column names in cell_data.
list_variables	Deprecated alias for list_export_variables(); use list_export_variables() instead.
query	Execute a SQL query with parameters.
register_output	Register an output file from a run.
register_run	Register a job configuration (run specification).
spatial_filter	Context manager for spatial filtering of queries.
start_run	Record the start of a job run.
time_filter	Context manager for temporal filtering of queries.
to_csv	Export a table to CSV format.
to_parquet	Export a table to Parquet format.
update_session_status	Update the status of a session.

check_sparsity

registry.RunRegistry.check_sparsity()

Check for sparse columns in cell_data.

Sparse columns (>50% NULL by default) often indicate that different simulation types are being mixed in the same registry, which hurts query performance.

Returns

Name	Type	Description
	SparsityReport	SparsityReport with statistics for each variable column.

Examples

>>> report = registry.check_sparsity()
>>> if report.should_warn:
...     print(report)
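
To make the threshold concrete, here is a minimal sketch of how a per-column NULL fraction could be computed. It is not joshpy's implementation: the `null_fractions` helper and the sample rows are hypothetical, and only the 50% default comes from the description above.

```python
def null_fractions(rows, columns):
    """Return the fraction of None values per column across rows."""
    total = len(rows)
    if total == 0:
        return {col: 0.0 for col in columns}
    return {
        col: sum(1 for row in rows if row.get(col) is None) / total
        for col in columns
    }


# Hypothetical cell_data rows mixing two different simulation types.
rows = [
    {"treeCount": 12, "avg.height": 3.1, "fireCount": None},
    {"treeCount": 15, "avg.height": 3.4, "fireCount": None},
    {"treeCount": None, "avg.height": None, "fireCount": 2},
    {"treeCount": 9, "avg.height": 2.8, "fireCount": None},
]

fractions = null_fractions(rows, ["treeCount", "avg.height", "fireCount"])
THRESHOLD = 0.5  # the ">50% NULL" default described above
sparse = sorted(col for col, frac in fractions.items() if frac > THRESHOLD)
print(sparse)
```

In this made-up data only `fireCount` crosses the threshold, which is the pattern the method is meant to flag.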

close

registry.RunRegistry.close()

Close the database connection.

complete_run

registry.RunRegistry.complete_run(run_id, exit_code, error_message=None)

Record the completion of a job run.

Parameters

Name	Type	Description	Default
run_id	str	The run ID to update.	required
exit_code	int	Process exit code (0 = success).	required
error_message	str | None	Error message if run failed.	`None`
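
The sketch below shows one way a process result could be mapped onto these arguments. `RecordingRegistry` is a hypothetical stand-in that only mirrors the `complete_run` signature shown above, and `run_and_record` is an illustrative helper, not part of joshpy.

```python
import subprocess
import sys


class RecordingRegistry:
    """Hypothetical stand-in exposing the complete_run signature shown above."""

    def __init__(self):
        self.completed = []

    def complete_run(self, run_id, exit_code, error_message=None):
        self.completed.append((run_id, exit_code, error_message))


def run_and_record(registry, run_id, command):
    """Run a command, then report its exit code and any stderr text."""
    result = subprocess.run(command, capture_output=True, text=True)
    failed = result.returncode != 0
    registry.complete_run(
        run_id,
        exit_code=result.returncode,
        # Only pass an error message when the process failed.
        error_message=(result.stderr.strip() or None) if failed else None,
    )
    return result.returncode


registry = RecordingRegistry()
run_and_record(registry, "run-ok", [sys.executable, "-c", "pass"])
run_and_record(
    registry,
    "run-bad",
    [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(3)"],
)
```

Passing `error_message` only on failure keeps successful runs free of stray stderr noise; whether that matches how the registry is normally driven is an assumption.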

create_session

registry.RunRegistry.create_session(
    config,
    experiment_name=None,
    session_id=None,
)

Create a new sweep session.

Parameters

Name	Type	Description	Default
config	Any	Job configuration containing simulation, template, and sweep info. Must have `simulation`, `template_path`, and `to_dict()` attributes (typically a JobConfig from joshpy.jobs).	required
experiment_name	str | None	Name for the experiment. Defaults to config.simulation.	`None`
session_id	str | None	Optional externally-provided session ID. If None, generates a UUID. This allows the frontend/API layer to manage session IDs (e.g., using project IDs).	`None`

Returns

Name	Type	Description
	str	The session ID (generated or provided).

Examples

>>> from pathlib import Path
>>> from joshpy.jobs import JobConfig, SweepConfig, SweepParameter
>>> config = JobConfig(
...     template_path=Path("template.jshc.j2"),
...     source_path=Path("simulation.josh"),
...     simulation="Main",
...     sweep=SweepConfig(
...         parameters=[SweepParameter(name="maxGrowth", values=[10, 20, 30])]
...     ),
... )
>>> session_id = registry.create_session(config=config)

Note

total_jobs and total_replicates are computed from the JobSet after job expansion. Use job_set.total_jobs and job_set.total_replicates.
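
The `session_id` fallback described in the parameter table can be pictured with the sketch below. The `resolve_session_id` helper is hypothetical, and the use of a version 4 UUID is an assumption: the documentation only says that a UUID is generated.

```python
import uuid


def resolve_session_id(session_id=None):
    """Return the caller's ID if given, otherwise a freshly generated UUID string."""
    if session_id is not None:
        return session_id
    # Assumption: a random (version 4) UUID; the docs do not name the version.
    return str(uuid.uuid4())


external = resolve_session_id("project-42")  # ID managed by a frontend/API layer
generated = resolve_session_id()             # no ID supplied, so one is generated
```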

export_results_df

registry.RunRegistry.export_results_df(session_id)

Export session results as a pandas DataFrame.

Parameters

Name	Type	Description	Default
session_id	str	The session to export.	required

Returns

Name	Type	Description
	Any	pandas DataFrame with run results and parameters.

Raises

Name	Type	Description
	ImportError	If pandas is not installed.

get_config_by_hash

registry.RunRegistry.get_config_by_hash(run_hash)

Get config information by run hash.

Parameters

Name	Type	Description	Default
run_hash	str	The run hash to look up.	required

Returns

Name	Type	Description
	ConfigInfo | None	ConfigInfo if found, None otherwise.

get_configs_for_session

registry.RunRegistry.get_configs_for_session(session_id)

Get all configs for a session.

Parameters

Name	Type	Description	Default
session_id	str	The session ID to get configs for.	required

Returns

Name	Type	Description
	list[ConfigInfo]	List of ConfigInfo objects.

get_data_summary

registry.RunRegistry.get_data_summary(session_id=None)

Get summary of all data in registry.

Provides counts, available variables, parameters, and data ranges for diagnostic purposes.

Parameters

Name	Type	Description	Default
session_id	str | None	Optional session ID to filter by.	`None`

Returns

Name	Type	Description
	DataSummary	DataSummary with counts and metadata.

get_run

registry.RunRegistry.get_run(run_id)

Get run information by ID.

Parameters

Name	Type	Description	Default
run_id	str	The run ID to look up.	required

Returns

Name	Type	Description
	RunInfo | None	RunInfo if found, None otherwise.

get_runs_by_parameters

registry.RunRegistry.get_runs_by_parameters(**params)

Query runs by parameter values.

Parameters

Name	Type	Description	Default
**params	Any	Parameter name-value pairs to filter by.	`{}`

Returns

Name	Type	Description
	list[dict[str, Any]]	List of dicts containing run info and parameters.
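
Parameter names that are not valid Python identifiers (for example 'soil.moisture') cannot be written as plain keyword arguments. The sketch below shows the general Python mechanics using a hypothetical `collect_filters` function with the same `**params` shape; whether `get_runs_by_parameters` accepts dotted names this way is an assumption.

```python
def collect_filters(**params):
    """Hypothetical stand-in that accepts the same **params shape."""
    return dict(params)


# Identifier-like names can be passed as ordinary keyword arguments.
simple = collect_filters(maxGrowth=20, scenario="baseline")

# Names such as 'soil.moisture' are not valid Python identifiers,
# so they have to be supplied by unpacking a dict.
dotted = collect_filters(**{"soil.moisture": 0.3, "maxGrowth": 20})
```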

get_runs_for_hash

registry.RunRegistry.get_runs_for_hash(run_hash)

Get all runs for a run hash.

Parameters

Name	Type	Description	Default
run_hash	str	The run hash to get runs for.	required

Returns

Name	Type	Description
	list[RunInfo]	List of RunInfo objects.

get_session

registry.RunRegistry.get_session(session_id)

Get session information by ID.

Parameters

Name	Type	Description	Default
session_id	str	The session ID to look up.	required

Returns

Name	Type	Description
	SessionInfo | None	SessionInfo if found, None otherwise.

get_session_summary

registry.RunRegistry.get_session_summary(session_id)

Get aggregated statistics for a session.

Parameters

Name	Type	Description	Default
session_id	str	The session ID to summarize.	required

Returns

Name	Type	Description
	SessionSummary | None	SessionSummary with counts, or None if session not found.

list_config_columns

registry.RunRegistry.list_config_columns()

List all parameter column names in config_parameters.

Returns the dynamically-added parameter columns. Column names preserve original names with special characters (e.g., ‘soil.moisture’).

Returns

Name	Type	Description
	list[str]	Sorted list of parameter column names.

Examples

>>> registry.list_config_columns()
['maxGrowth', 'scenario', 'soil.moisture']

list_config_parameters

registry.RunRegistry.list_config_parameters(session_id=None)

List all config parameter names from sweep configurations.

These are the parameters you defined in your JobConfig sweep, stored as typed columns in the config_parameters table.

Parameters

Name	Type	Description	Default
session_id	str | None	Optional session ID to filter by.	`None`

Returns

Name	Type	Description
	list[str]	Sorted list of parameter names.

Examples

>>> registry.list_config_parameters()
['maxGrowth', 'scenario', 'survivalProb']

list_entity_types

registry.RunRegistry.list_entity_types(session_id=None)

List all entity types found in cell_data.

Parameters

Name	Type	Description	Default
session_id	str | None	Optional session ID to filter by.	`None`

Returns

Name	Type	Description
	list[str]	Sorted list of entity type names.

list_export_variables

registry.RunRegistry.list_export_variables(session_id=None)

List all export variable names from simulation outputs.

These are the variables exported by Josh simulations, stored as typed columns in the cell_data table. Variable names preserve original .josh names (e.g., ‘avg.height’).

When session_id is provided, only returns variables that have at least one non-NULL value for runs in that session.

Parameters

Name	Type	Description	Default
session_id	str | None	Optional session ID to filter by. If provided, only returns variables with data in that session.	`None`

Returns

Name	Type	Description
	list[str]	Sorted list of variable column names.

Examples

>>> registry.list_export_variables()
['averageAge', 'avg.height', 'treeCount']

>>> registry.list_export_variables(session_id="abc123")
['treeCount']  # Only variables with data in this session

list_parameters

registry.RunRegistry.list_parameters(session_id=None)

Deprecated alias for list_config_parameters(); use list_config_parameters() instead.

list_sessions

registry.RunRegistry.list_sessions(experiment_name=None, limit=100)

List sessions, optionally filtered by experiment name.

Parameters

Name	Type	Description	Default
experiment_name	str | None	Filter by experiment name (optional).	`None`
limit	int	Maximum number of sessions to return.	`100`

Returns

Name	Type	Description
	list[SessionInfo]	List of SessionInfo objects, ordered by creation time (newest first).

list_variable_columns

registry.RunRegistry.list_variable_columns()

List all variable column names in cell_data.

Returns the dynamically-added variable columns. Column names preserve original names with special characters (e.g., ‘avg.height’).

Returns

Name	Type	Description
	list[str]	Sorted list of variable column names.

Examples

>>> registry.list_variable_columns()
['averageAge', 'avg.height', 'treeCount']
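
Because these names can contain characters such as '.', they generally need to be written as double-quoted identifiers in custom SQL. The sketch below uses the standard-library `sqlite3` module as a stand-in so that it runs without DuckDB; DuckDB follows the same standard SQL convention for quoted identifiers. The table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Double quotes make "avg.height" a single column identifier.
conn.execute('CREATE TABLE cell_data (step INTEGER, "avg.height" REAL)')
conn.executemany(
    "INSERT INTO cell_data VALUES (?, ?)",
    [(0, 1.5), (1, 2.5), (2, 3.5)],
)

mean_height = conn.execute(
    'SELECT AVG("avg.height") FROM cell_data WHERE step >= ?', [1]
).fetchone()[0]
conn.close()
print(mean_height)
```

Without the quotes, `avg.height` would be parsed as column `height` of a table named `avg`.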

list_variables

registry.RunRegistry.list_variables(session_id=None)

Deprecated alias for list_export_variables(); use list_export_variables() instead.

query

registry.RunRegistry.query(sql, params=None)

Execute a SQL query with parameters.

This provides direct access to DuckDB for custom queries beyond the pre-built methods. Use this when you need to run complex queries or explore the data in ways not covered by the API.

Parameters

Name	Type	Description	Default
sql	str	SQL query with ? placeholders for parameters.	required
params	list | None	List of parameter values.	`None`

Returns

Name	Type	Description
	Any	DuckDB relation (call .df() for DataFrame, .fetchall() for tuples).

Examples

>>> # Get DataFrame
>>> df = registry.query(
...     "SELECT * FROM cell_data WHERE step BETWEEN ? AND ?",
...     [0, 10]
... ).df()

>>> # Get raw results
>>> rows = registry.query(
...     "SELECT COUNT(*) FROM cell_data WHERE run_hash = ?",
...     ["abc123"]
... ).fetchone()
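
A common need with `?` placeholders is filtering on a list of values. The sketch below builds an `IN (...)` clause with one placeholder per value. It uses the standard-library `sqlite3` module as a stand-in, since it shares the `?` placeholder style; the `in_clause` helper and the sample rows are hypothetical. With a registry, the same SQL string and list would presumably be passed as `registry.query(sql, hashes)`.

```python
import sqlite3


def in_clause(values):
    """Build '?, ?, ?' with one placeholder per value."""
    return ", ".join("?" for _ in values)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cell_data (run_hash TEXT, step INTEGER)")
conn.executemany(
    "INSERT INTO cell_data VALUES (?, ?)",
    [("abc123", 0), ("abc123", 1), ("def456", 0), ("fff000", 0)],
)

hashes = ["abc123", "def456"]
# Only the placeholders are interpolated into the SQL text; values stay bound.
sql = f"SELECT COUNT(*) FROM cell_data WHERE run_hash IN ({in_clause(hashes)})"
count = conn.execute(sql, hashes).fetchone()[0]
conn.close()
```

Keeping the values out of the SQL string avoids quoting mistakes and injection issues.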

register_output

registry.RunRegistry.register_output(
    run_id,
    output_type,
    file_path,
    file_size=None,
    row_count=None,
)

Register an output file from a run.

Parameters

Name	Type	Description	Default
run_id	str	The run this output belongs to.	required
output_type	str	Type of output (e.g., ‘csv’, ‘log’, ‘error’).	required
file_path	str	Path to the output file.	required
file_size	int | None	Size of the file in bytes.	`None`
row_count	int | None	Number of rows (for tabular data).	`None`

Returns

Name	Type	Description
	str	The generated output ID.
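
The optional `file_size` and `row_count` values can be gathered before registration. A minimal sketch for a CSV output follows; the `describe_csv` helper is hypothetical, and counting data rows without the header is an assumption about what `row_count` is meant to hold.

```python
import csv
import os
import tempfile


def describe_csv(path):
    """Return (file_size_in_bytes, data_row_count) for a CSV with a header row."""
    file_size = os.path.getsize(path)
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header
        row_count = sum(1 for _ in reader)
    return file_size, row_count


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "output.csv")
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["step", "treeCount"])
        writer.writerows([[0, 12], [1, 15], [2, 9]])
    file_size, row_count = describe_csv(csv_path)
```

The two values could then be passed as `file_size=file_size, row_count=row_count` alongside `run_id`, `output_type`, and `file_path`.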

register_run

registry.RunRegistry.register_run(
    session_id,
    run_hash,
    josh_path,
    config_content,
    file_mappings,
    parameters,
)

Register a job configuration (run specification).

Parameters

Name	Type	Description	Default
session_id	str	Session this config belongs to.	required
run_hash	str	MD5 hash of josh + config + file_mappings, shortened to 12 characters.	required
josh_path	str	Path to the .josh script file.	required
config_content	str	Full text of the rendered configuration.	required
file_mappings	dict[str, dict[str, str]] | None	Dict mapping names to {"path": "…", "hash": "…"}.	required
parameters	dict[str, Any]	Parameter values used to generate this config.	required
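
For orientation, a 12-character MD5-based hash over these inputs might be derived as sketched below. This is illustrative only: the `short_run_hash` helper is hypothetical, and the way joshpy actually serializes and combines the josh source, config content, and file mappings is not documented here.

```python
import hashlib
import json


def short_run_hash(josh_text, config_content, file_mappings):
    """Illustrative 12-character MD5 digest over the three inputs."""
    digest = hashlib.md5()
    digest.update(josh_text.encode("utf-8"))
    digest.update(config_content.encode("utf-8"))
    # sort_keys keeps the digest stable regardless of dict insertion order.
    digest.update(json.dumps(file_mappings or {}, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()[:12]


mappings = {"soil": {"path": "soil.tif", "hash": "aa11"}}
hash_a = short_run_hash("josh source text", "maxGrowth = 10", mappings)
hash_b = short_run_hash("josh source text", "maxGrowth = 20", mappings)
```

Any change to one of the inputs yields a different hash, which is what lets a run hash identify a specific configuration.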

spatial_filter

registry.RunRegistry.spatial_filter(bbox=None, geojson=None)

Context manager for spatial filtering of queries.

All DiagnosticQueries within this context will be spatially filtered. Can be nested with time_filter().

Parameters

Name	Type	Description	Default
bbox	tuple[float, float, float, float] | None	Bounding box as (min_lon, max_lon, min_lat, max_lat).	`None`
geojson	str | dict | None	GeoJSON polygon string or dict.	`None`

Raises

Name	Type	Description
	ValueError	If both bbox and geojson are provided.

Examples

>>> with registry.spatial_filter(bbox=(-116, -115, 33.5, 34.0)):
...     df = queries.get_timeseries("height", run_hash="abc123")

>>> # Nested with time filter
>>> with registry.spatial_filter(geojson=park_boundary):
...     with registry.time_filter(step_range=(0, 50)):
...         df = queries.get_timeseries("height", run_hash="abc123")
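
Note that the bbox order here, (min_lon, max_lon, min_lat, max_lat), differs from the GeoJSON bbox convention of [west, south, east, north]. The sketch below converts between the two and checks a point against the result. Both helpers are hypothetical, and treating the box edges as inclusive is an assumption.

```python
def geojson_bbox_to_registry_bbox(bbox):
    """Reorder [west, south, east, north] into (min_lon, max_lon, min_lat, max_lat)."""
    west, south, east, north = bbox
    return (west, east, south, north)


def point_in_bbox(lon, lat, bbox):
    """Check a point against a (min_lon, max_lon, min_lat, max_lat) box."""
    min_lon, max_lon, min_lat, max_lat = bbox
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


registry_bbox = geojson_bbox_to_registry_bbox([-116, 33.5, -115, 34.0])
inside = point_in_bbox(-115.5, 33.8, registry_bbox)
outside = point_in_bbox(-115.5, 35.0, registry_bbox)
```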

start_run

registry.RunRegistry.start_run(
    run_hash,
    replicate=0,
    output_path=None,
    metadata=None,
)

Record the start of a job run.

Parameters

Name	Type	Description	Default
run_hash	str	Run hash for this run.	required
replicate	int	Replicate number (0-indexed).	`0`
output_path	str | None	Path where output will be written.	`None`
metadata	dict[str, Any] | None	Additional metadata.	`None`

Returns

Name	Type	Description
	str	The generated run ID.

time_filter

registry.RunRegistry.time_filter(step_range)

Context manager for temporal filtering of queries.

All DiagnosticQueries within this context will be filtered to the specified step range. Can be nested with spatial_filter().

Parameters

Name	Type	Description	Default
step_range	tuple[int, int]	Tuple of (min_step, max_step) inclusive.	required

Examples

>>> with registry.time_filter(step_range=(0, 50)):
...     df = queries.get_timeseries("height", run_hash="abc123")

>>> # Nested with spatial filter
>>> with registry.time_filter(step_range=(10, 20)):
...     with registry.spatial_filter(bbox=(-116, -115, 33.5, 34.0)):
...         df = queries.get_comparison("height", group_by="maxGrowth")
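
To illustrate how nested filter contexts can compose and then unwind, here is a self-contained sketch. `FilterState` is a hypothetical class written for this example and is not how RunRegistry is implemented; it only mirrors the documented behavior of an inclusive step range combined with a (min_lon, max_lon, min_lat, max_lat) box.

```python
from contextlib import contextmanager


class FilterState:
    """Hypothetical holder for active filters; not joshpy's implementation."""

    def __init__(self):
        self.step_range = None
        self.bbox = None

    @contextmanager
    def time_filter(self, step_range):
        previous = self.step_range
        self.step_range = step_range
        try:
            yield self
        finally:
            self.step_range = previous  # restore on exit, even after errors

    @contextmanager
    def spatial_filter(self, bbox):
        previous = self.bbox
        self.bbox = bbox
        try:
            yield self
        finally:
            self.bbox = previous

    def keep(self, row):
        """Apply whichever filters are currently active to a row dict."""
        if self.step_range is not None:
            lo, hi = self.step_range
            if not (lo <= row["step"] <= hi):  # inclusive on both ends
                return False
        if self.bbox is not None:
            min_lon, max_lon, min_lat, max_lat = self.bbox
            in_lon = min_lon <= row["lon"] <= max_lon
            in_lat = min_lat <= row["lat"] <= max_lat
            if not (in_lon and in_lat):
                return False
        return True


rows = [
    {"step": 5, "lon": -115.5, "lat": 33.8},
    {"step": 15, "lon": -115.5, "lat": 33.8},
    {"step": 15, "lon": -117.0, "lat": 33.8},
    {"step": 20, "lon": -115.5, "lat": 33.8},
]

state = FilterState()
with state.time_filter((10, 20)):
    with state.spatial_filter((-116, -115, 33.5, 34.0)):
        kept_steps = [row["step"] for row in rows if state.keep(row)]

# Outside both contexts no filter is active any more.
unfiltered = [state.keep(row) for row in rows]
```

Restoring the previous value in `finally` is what makes the contexts safe to nest in either order.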

to_csv

registry.RunRegistry.to_csv(path, table='cell_data')

Export a table to CSV format.

Parameters

Name	Type	Description	Default
path	str | Path	Output file path.	required
table	str	Table name to export (default: cell_data).	`'cell_data'`

Examples

>>> registry.to_csv("results.csv")

In R:

>>> # df <- readr::read_csv("results.csv")

to_parquet

registry.RunRegistry.to_parquet(path, table='cell_data')

Export a table to Parquet format.

Parquet is recommended for R/Python analysis due to compression and type preservation.

Parameters

Name	Type	Description	Default
path	str | Path	Output file path.	required
table	str	Table name to export (default: cell_data).	`'cell_data'`

Examples

>>> registry.to_parquet("results.parquet")

In R:

>>> # df <- arrow::read_parquet("results.parquet")

In Python:

>>> import pandas as pd
>>> df = pd.read_parquet("results.parquet")
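
The type-preservation point can be seen with a small round trip. CSV stores everything as text, so a reader has to infer or cast types, whereas Parquet carries a schema. The sketch below uses only the standard-library `csv` module and made-up values.

```python
import csv
import io

rows = [{"step": 0, "treeCount": 12, "avg.height": 3.1}]

# Write the row to an in-memory CSV.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)

# Read it back: every value is now a string.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored[0])
```

Libraries such as pandas or readr infer column types when reading CSV, but that inference is a guess made from the text rather than something stored in the file.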

update_session_status

registry.RunRegistry.update_session_status(session_id, status)

Update the status of a session.

Parameters

Name	Type	Description	Default
session_id	str	The session to update.	required
status	str	New status (e.g., ‘running’, ‘completed’, ‘failed’).	required