DuckDB-backed registry for tracking parameter sweeps and job runs.
Supports both file-based persistence and in-memory mode (using ":memory:").
Methods
check_sparsity
registry.RunRegistry.check_sparsity()
Check for sparse columns in cell_data.
Sparse columns (>50% NULL by default) often indicate that different simulation types are being mixed in the same registry, which hurts query performance.
Returns
SparsityReport
SparsityReport with statistics for each variable column.
Examples
>>> report = registry.check_sparsity()
>>> if report.should_warn:
...     print(report)
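To make the default threshold concrete, here is a minimal standalone sketch of the ">50% NULL" rule. It does not call the registry: `null_fraction` and `sparse_columns` are hypothetical helpers, the rows stand in for cell_data records, and the real SparsityReport may compute its statistics differently.

```python
from typing import Any


def null_fraction(rows: list[dict[str, Any]], column: str) -> float:
    """Return the fraction of rows where `column` is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def sparse_columns(
    rows: list[dict[str, Any]], columns: list[str], threshold: float = 0.5
) -> list[str]:
    """Return the columns whose NULL fraction is strictly above `threshold`."""
    return sorted(c for c in columns if null_fraction(rows, c) > threshold)


# Illustrative rows: one simulation type exports treeCount, another avg.height.
rows = [
    {"treeCount": 12, "avg.height": None},
    {"treeCount": 15, "avg.height": None},
    {"treeCount": None, "avg.height": 3.2},
    {"treeCount": 9, "avg.height": None},
]

sparse = sparse_columns(rows, ["treeCount", "avg.height"])
print(sparse)
```

Here avg.height is NULL in three of four rows, so it is the only column flagged.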
close
registry.RunRegistry.close()
Close the database connection.
complete_run
registry.RunRegistry.complete_run(run_id, exit_code, error_message=None)
Record the completion of a job run.
Parameters
run_id
str
The run ID to update.
required
exit_code
int
Process exit code (0 = success).
required
error_message
str | None
Error message if run failed.
None
create_session
registry.RunRegistry.create_session(
    config,
    experiment_name=None,
    session_id=None,
)
Create a new sweep session.
Parameters
config
Any
Job configuration containing simulation, template, and sweep info. Must have simulation, template_path, and to_dict() attributes (typically a JobConfig from joshpy.jobs).
required
experiment_name
str | None
Name for the experiment. Defaults to config.simulation.
None
session_id
str | None
Optional externally-provided session ID. If None, generates a UUID. This allows the frontend/API layer to manage session IDs (e.g., using project IDs).
None
Returns
str
The session ID (generated or provided).
Examples
>>> from pathlib import Path
>>> from joshpy.jobs import JobConfig, SweepConfig, SweepParameter
>>> config = JobConfig(
...     template_path=Path("template.jshc.j2"),
...     source_path=Path("simulation.josh"),
...     simulation="Main",
...     sweep=SweepConfig(
...         parameters=[SweepParameter(name="maxGrowth", values=[10, 20, 30])]
...     ),
... )
>>> session_id = registry.create_session(config=config)
Note
total_jobs and total_replicates are computed from the JobSet after job expansion. Use job_set.total_jobs and job_set.total_replicates.
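The session_id fallback described above can be sketched as follows. `resolve_session_id` is a hypothetical helper, and the use of a version-4 UUID is an assumption; the source only states that a UUID is generated when session_id is None.

```python
import uuid
from typing import Optional


def resolve_session_id(session_id: Optional[str] = None) -> str:
    """Return the caller's ID if given, otherwise a freshly generated UUID string."""
    # Assumption: uuid4 is used; the registry may use a different UUID version.
    return session_id if session_id is not None else str(uuid.uuid4())


generated = resolve_session_id()
provided = resolve_session_id("project-42")
print(provided)
```

Passing an external ID (for example a project ID) lets the calling layer look the session up later without storing the generated value.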
export_results_df
registry.RunRegistry.export_results_df(session_id)
Export session results as a pandas DataFrame.
Parameters
session_id
str
The session to export.
required
Returns
Any
pandas DataFrame with run results and parameters.
Raises
ImportError
If pandas is not installed.
get_config_by_hash
registry.RunRegistry.get_config_by_hash(run_hash)
Get config information by run hash.
Parameters
run_hash
str
The run hash to look up.
required
Returns
ConfigInfo | None
ConfigInfo if found, None otherwise.
get_configs_for_session
registry.RunRegistry.get_configs_for_session(session_id)
Get all configs for a session.
Parameters
session_id
str
The session ID to get configs for.
required
Returns
list[ConfigInfo]
List of ConfigInfo objects.
get_data_summary
registry.RunRegistry.get_data_summary(session_id=None)
Get summary of all data in registry.
Provides counts, available variables, parameters, and data ranges for diagnostic purposes.
Parameters
session_id
str | None
Optional session ID to filter by.
None
Returns
DataSummary
DataSummary with counts and metadata.
get_run
registry.RunRegistry.get_run(run_id)
Get run information by ID.
Parameters
run_id
str
The run ID to look up.
required
Returns
RunInfo | None
RunInfo if found, None otherwise.
get_runs_by_parameters
registry.RunRegistry.get_runs_by_parameters(**params)
Query runs by parameter values.
Parameters
**params
Any
Parameter name-value pairs to filter by.
{}
Returns
list[dict[str, Any]]
List of dicts containing run info and parameters.
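Because the filters are passed as keyword arguments, parameter names that are not valid Python identifiers (such as 'soil.moisture') have to be supplied through dict unpacking. The sketch below shows that calling pattern on a stand-in function. `filter_runs` is hypothetical, and matching by equality with all pairs combined by AND is an assumption about how the registry interprets the filters.

```python
from typing import Any


def filter_runs(runs: list[dict[str, Any]], **params: Any) -> list[dict[str, Any]]:
    """Keep runs whose stored parameters equal every given name-value pair."""
    return [
        run
        for run in runs
        if all(run.get(name) == value for name, value in params.items())
    ]


# Illustrative run records; the real method returns dicts built from the registry.
runs = [
    {"run_id": "r1", "maxGrowth": 10, "soil.moisture": 0.3},
    {"run_id": "r2", "maxGrowth": 20, "soil.moisture": 0.3},
    {"run_id": "r3", "maxGrowth": 10, "soil.moisture": 0.5},
]

# Identifier-like names can be passed as plain keyword arguments.
by_growth = filter_runs(runs, maxGrowth=10)

# Names containing dots must be passed by unpacking a dict.
by_both = filter_runs(runs, **{"maxGrowth": 10, "soil.moisture": 0.3})
print([run["run_id"] for run in by_both])
```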
get_runs_for_hash
registry.RunRegistry.get_runs_for_hash(run_hash)
Get all runs for a run hash.
Parameters
run_hash
str
The run hash to get runs for.
required
Returns
list[RunInfo]
List of RunInfo objects.
get_session
registry.RunRegistry.get_session(session_id)
Get session information by ID.
Parameters
session_id
str
The session ID to look up.
required
Returns
SessionInfo | None
SessionInfo if found, None otherwise.
get_session_summary
registry.RunRegistry.get_session_summary(session_id)
Get aggregated statistics for a session.
Parameters
session_id
str
The session ID to summarize.
required
Returns
SessionSummary | None
SessionSummary with counts, or None if session not found.
list_config_columns
registry.RunRegistry.list_config_columns()
List all parameter column names in config_parameters.
Returns the dynamically-added parameter columns. Column names preserve original names with special characters (e.g., ‘soil.moisture’).
Returns
list[str]
Sorted list of parameter column names.
Examples
>>> registry.list_config_columns()
['maxGrowth', 'scenario', 'soil.moisture']
list_config_parameters
registry.RunRegistry.list_config_parameters(session_id=None)
List all config parameter names from sweep configurations.
These are the parameters you defined in your JobConfig sweep, stored as typed columns in the config_parameters table.
Parameters
session_id
str | None
Optional session ID to filter by.
None
Returns
list[str]
Sorted list of parameter names.
Examples
>>> registry.list_config_parameters()
['maxGrowth', 'scenario', 'survivalProb']
list_entity_types
registry.RunRegistry.list_entity_types(session_id=None)
List all entity types found in cell_data.
Parameters
session_id
str | None
Optional session ID to filter by.
None
Returns
list[str]
Sorted list of entity type names.
list_export_variables
registry.RunRegistry.list_export_variables(session_id=None)
List all export variable names from simulation outputs.
These are the variables exported by Josh simulations, stored as typed columns in the cell_data table. Variable names preserve original .josh names (e.g., ‘avg.height’).
When session_id is provided, only returns variables that have at least one non-NULL value for runs in that session.
Parameters
session_id
str | None
Optional session ID to filter by. If provided, only returns variables with data in that session.
None
Returns
list[str]
Sorted list of variable column names.
Examples
>>> registry.list_export_variables()
['averageAge', 'avg.height', 'treeCount']
>>> registry.list_export_variables(session_id="abc123")
['treeCount']  # Only variables with data in this session
list_parameters
registry.RunRegistry.list_parameters(session_id=None)
Deprecated alias for list_config_parameters(); use list_config_parameters() instead.
list_sessions
registry.RunRegistry.list_sessions(experiment_name=None, limit=100)
List sessions, optionally filtered by experiment name.
Parameters
experiment_name
str | None
Filter by experiment name (optional).
None
limit
int
Maximum number of sessions to return.
100
Returns
list[SessionInfo]
List of SessionInfo objects, ordered by creation time (newest first).
list_variable_columns
registry.RunRegistry.list_variable_columns()
List all variable column names in cell_data.
Returns the dynamically-added variable columns. Column names preserve original names with special characters (e.g., ‘avg.height’).
Returns
list[str]
Sorted list of variable column names.
Examples
>>> registry.list_variable_columns()
['averageAge', 'avg.height', 'treeCount']
list_variables
registry.RunRegistry.list_variables(session_id=None)
Deprecated alias for list_export_variables(); use list_export_variables() instead.
query
registry.RunRegistry.query(sql, params=None)
Execute a SQL query with parameters.
This provides direct access to DuckDB for custom queries beyond the pre-built methods. Use this when you need to run complex queries or explore the data in ways not covered by the API.
Parameters
sql
str
SQL query with ? placeholders for parameters.
required
params
list | None
List of parameter values.
None
Returns
Any
DuckDB relation (call .df() for DataFrame, .fetchall() for tuples).
Examples
>>> # Get DataFrame
>>> df = registry.query(
...     "SELECT * FROM cell_data WHERE step BETWEEN ? AND ?",
...     [0, 10],
... ).df()
>>> # Get raw results
>>> rows = registry.query(
...     "SELECT COUNT(*) FROM cell_data WHERE run_hash = ?",
...     ["abc123"],
... ).fetchone()
register_output
registry.RunRegistry.register_output(
    run_id,
    output_type,
    file_path,
    file_size=None,
    row_count=None,
)
Register an output file from a run.
Parameters
run_id
str
The run this output belongs to.
required
output_type
str
Type of output (e.g., ‘csv’, ‘log’, ‘error’).
required
file_path
str
Path to the output file.
required
file_size
int | None
Size of the file in bytes.
None
row_count
int | None
Number of rows (for tabular data).
None
Returns
str
The generated output ID.
register_run
registry.RunRegistry.register_run(
    session_id,
    run_hash,
    josh_path,
    config_content,
    file_mappings,
    parameters,
)
Register a job configuration (run specification).
Parameters
session_id
str
Session this config belongs to.
required
run_hash
str
MD5 hash of josh + config + file_mappings (12 chars).
required
josh_path
str
Path to the .josh script file.
required
config_content
str
Full text of the rendered configuration.
required
file_mappings
dict[str, dict[str, str]] | None
Dict mapping names to {"path": "…", "hash": "…"}.
required
parameters
dict[str, Any]
Parameter values used to generate this config.
required
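The run_hash description above (a 12-character MD5 hash of josh + config + file_mappings) could be sketched like this. `compute_run_hash` is a hypothetical helper: the field order, the newline separator, the JSON serialization of file_mappings, and hashing the script text rather than its path are all assumptions, so hashes produced here will not necessarily match those generated by joshpy.

```python
import hashlib
import json
from typing import Optional


def compute_run_hash(
    josh_text: str,
    config_content: str,
    file_mappings: Optional[dict[str, dict[str, str]]] = None,
) -> str:
    """Return the first 12 hex characters of an MD5 digest over the three inputs."""
    # Assumption: inputs are joined with newlines and mappings are serialized
    # with sorted keys so that dict ordering does not change the hash.
    payload = "\n".join(
        [
            josh_text,
            config_content,
            json.dumps(file_mappings or {}, sort_keys=True),
        ]
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()[:12]


# Illustrative inputs only; the contents are placeholders.
mappings = {"elevation": {"path": "data/elevation.jshd", "hash": "0f3a"}}
hash_a = compute_run_hash("start simulation Main", "maxGrowth = 10", mappings)
hash_b = compute_run_hash("start simulation Main", "maxGrowth = 20", mappings)
print(len(hash_a), hash_a != hash_b)
```

Hashing the rendered content means two configs with identical inputs share one run_hash, which is what allows several replicates to be recorded against the same configuration.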
spatial_filter
registry.RunRegistry.spatial_filter(bbox=None, geojson=None)
Context manager for spatial filtering of queries.
All DiagnosticQueries within this context will be spatially filtered. Can be nested with time_filter().
Parameters
bbox
tuple[float, float, float, float] | None
Bounding box as (min_lon, max_lon, min_lat, max_lat).
None
geojson
str | dict | None
GeoJSON polygon string or dict.
None
Raises
ValueError
If both bbox and geojson are provided.
Examples
>>> with registry.spatial_filter(bbox=(-116, -115, 33.5, 34.0)):
...     df = queries.get_timeseries("height", run_hash="abc123")
>>> # Nested with time filter
>>> with registry.spatial_filter(geojson=park_boundary):
...     with registry.time_filter(step_range=(0, 50)):
...         df = queries.get_timeseries("height", run_hash="abc123")
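The bbox ordering documented here, (min_lon, max_lon, min_lat, max_lat), differs from the GeoJSON bbox convention of [west, south, east, north], so it is worth unpacking explicitly. The sketch below shows the ordering and the mutually exclusive arguments. `validate_spatial_args` and `in_bbox` are hypothetical helpers, and treating the box edges as inclusive is an assumption.

```python
from typing import Any, Optional

BBox = tuple[float, float, float, float]


def validate_spatial_args(bbox: Optional[BBox] = None, geojson: Any = None) -> None:
    """Mirror the documented rule: bbox and geojson cannot both be provided."""
    if bbox is not None and geojson is not None:
        raise ValueError("Provide either bbox or geojson, not both")


def in_bbox(lon: float, lat: float, bbox: BBox) -> bool:
    """Test a point against a box given as (min_lon, max_lon, min_lat, max_lat)."""
    min_lon, max_lon, min_lat, max_lat = bbox
    # Assumption: points on the boundary count as inside.
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


bbox = (-116, -115, 33.5, 34.0)
inside = in_bbox(-115.5, 33.7, bbox)
outside = in_bbox(-115.5, 35.0, bbox)
print(inside, outside)
```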
start_run
registry.RunRegistry.start_run(
    run_hash,
    replicate=0,
    output_path=None,
    metadata=None,
)
Record the start of a job run.
Parameters
run_hash
str
Run hash for this run.
required
replicate
int
Replicate number (0-indexed).
0
output_path
str | None
Path where output will be written.
None
metadata
dict[str, Any] | None
Additional metadata.
None
Returns
str
The generated run ID.
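start_run and complete_run are meant to bracket a single job execution. The sketch below shows one way to pair them around a subprocess. `StubRegistry` is a stand-in that only mimics the two documented signatures so the example runs without joshpy; its run ID format and the `run_and_record` helper are hypothetical.

```python
import subprocess
import sys


class StubRegistry:
    """Stand-in exposing the documented start_run/complete_run signatures."""

    def __init__(self):
        self.runs = {}

    def start_run(self, run_hash, replicate=0, output_path=None, metadata=None):
        # Assumption: the real registry generates its own run IDs.
        run_id = f"{run_hash}-{replicate}"
        self.runs[run_id] = {"exit_code": None, "error_message": None}
        return run_id

    def complete_run(self, run_id, exit_code, error_message=None):
        self.runs[run_id].update(exit_code=exit_code, error_message=error_message)


def run_and_record(registry, run_hash, command, replicate=0):
    """Record the start, execute the command, then record exit code and error."""
    run_id = registry.start_run(run_hash, replicate=replicate)
    result = subprocess.run(command, capture_output=True, text=True)
    error_message = None
    if result.returncode != 0:
        error_message = result.stderr.strip() or "unknown error"
    registry.complete_run(run_id, result.returncode, error_message=error_message)
    return run_id


registry = StubRegistry()
ok_id = run_and_record(registry, "abc123", [sys.executable, "-c", "pass"])
bad_id = run_and_record(
    registry, "abc123", [sys.executable, "-c", "import sys; sys.exit(3)"], replicate=1
)
print(registry.runs[ok_id]["exit_code"], registry.runs[bad_id]["exit_code"])
```

With the real registry, the same helper would accept a RunRegistry instance in place of the stub, since it relies only on the two documented methods.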
time_filter
registry.RunRegistry.time_filter(step_range)
Context manager for temporal filtering of queries.
All DiagnosticQueries within this context will be filtered to the specified step range. Can be nested with spatial_filter().
Parameters
step_range
tuple[int, int]
Tuple of (min_step, max_step) inclusive.
required
Examples
>>> with registry.time_filter(step_range=(0, 50)):
...     df = queries.get_timeseries("height", run_hash="abc123")
>>> # Nested with spatial filter
>>> with registry.time_filter(step_range=(10, 20)):
...     with registry.spatial_filter(bbox=(-116, -115, 33.5, 34.0)):
...         df = queries.get_comparison("height", group_by="maxGrowth")
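How nested filter contexts can compose and then unwind is sketched below with contextlib. This is not the registry's implementation: `FilterContext`, its `filters` dict, and `matches` are hypothetical, and the rows are illustrative. It only demonstrates the inclusive step range and the fact that each filter is removed when its `with` block exits.

```python
from contextlib import contextmanager


class FilterContext:
    """Hypothetical holder of active filters, applied to plain dict rows."""

    def __init__(self):
        self.filters = {}

    @contextmanager
    def _scoped(self, key, value):
        missing = object()
        previous = self.filters.get(key, missing)
        self.filters[key] = value
        try:
            yield self
        finally:
            # Restore whatever was active before this block was entered.
            if previous is missing:
                del self.filters[key]
            else:
                self.filters[key] = previous

    def time_filter(self, step_range):
        return self._scoped("step_range", step_range)

    def spatial_filter(self, bbox):
        return self._scoped("bbox", bbox)

    def matches(self, row):
        if "step_range" in self.filters:
            lo, hi = self.filters["step_range"]  # inclusive on both ends
            if not lo <= row["step"] <= hi:
                return False
        if "bbox" in self.filters:
            min_lon, max_lon, min_lat, max_lat = self.filters["bbox"]
            if not (min_lon <= row["lon"] <= max_lon and min_lat <= row["lat"] <= max_lat):
                return False
        return True


rows = [
    {"step": 5, "lon": -115.5, "lat": 33.7},
    {"step": 15, "lon": -115.5, "lat": 33.7},
    {"step": 15, "lon": -114.0, "lat": 33.7},
    {"step": 20, "lon": -115.2, "lat": 34.0},
]

ctx = FilterContext()
with ctx.time_filter(step_range=(10, 20)):
    time_only = [r for r in rows if ctx.matches(r)]
    with ctx.spatial_filter(bbox=(-116, -115, 33.5, 34.0)):
        both = [r for r in rows if ctx.matches(r)]
remaining = dict(ctx.filters)
print(len(time_only), len(both), remaining)
```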
to_csv
registry.RunRegistry.to_csv(path, table='cell_data')
Export a table to CSV format.
Parameters
path
str | Path
Output file path.
required
table
str
Table name to export (default: cell_data).
'cell_data'
Examples
>>> registry.to_csv("results.csv")
In R:
>>> # df <- readr::read_csv("results.csv")
to_parquet
registry.RunRegistry.to_parquet(path, table='cell_data')
Export a table to Parquet format.
Parquet is recommended for R/Python analysis due to compression and type preservation.
Parameters
path
str | Path
Output file path.
required
table
str
Table name to export (default: cell_data).
'cell_data'
Examples
>>> registry.to_parquet("results.parquet")
In R:
>>> # df <- arrow::read_parquet("results.parquet")
In Python:
>>> import pandas as pd
>>> df = pd.read_parquet("results.parquet")
update_session_status
registry.RunRegistry.update_session_status(session_id, status)
Update the status of a session.
Parameters
session_id
str
The session to update.
required
status
str
New status (e.g., ‘running’, ‘completed’, ‘failed’).
required