Analysis & Visualization
Introduction
This demo focuses on exploring and visualizing simulation results after they’ve been collected. The analysis layer in joshpy is decoupled from orchestration - it only needs a RunRegistry containing data. Whether the data came from SweepManager, the manual workflow, or someone else’s experiment doesn’t matter.
Prerequisites: Run either the Manual Workflow or the SweepManager Workflow first to generate demo_registry.duckdb.
For orchestration workflows, see:
- Manual Workflow - Step-by-step control using individual components
- SweepManager Workflow - Simplified workflow with SweepManager
This demo covers:
- Data Discovery - What’s in the registry?
- Diagnostic Plots - Quick matplotlib visualizations
- Custom Queries - DiagnosticQueries for common patterns
- Direct SQL - Full access to DuckDB for advanced analysis
- R/ggplot2 - Publication-quality figures
Setup: Load Registry
Load the registry saved by a previous orchestration demo:
from joshpy.registry import RunRegistry
# Load the registry from disk (created by manual-workflow.qmd or sweep-manager.qmd)
REGISTRY_PATH = "demo_registry.duckdb"
registry = RunRegistry(REGISTRY_PATH)
Part 1: Data Discovery
Before plotting, discover what data is available in the registry.
Full Summary
summary = registry.get_data_summary()
print(summary)
Registry Data Summary
========================================
Sessions: 2
Configs: 10
Runs: 20
Rows: 87,780
Variables: averageAge, averageHeight
Entity types: patch
Parameters: maxGrowth
Steps: 0 - 10
Replicates: 0 - 2
Spatial extent: lon [-115.37, -114.40], lat [33.41, 33.68]
Specific Queries
# What variables were exported from the simulation?
registry.list_export_variables()
['averageAge', 'averageHeight']
# What parameters were swept?
registry.list_config_parameters()
['maxGrowth']
# What entity types are in the data?
registry.list_entity_types()
['patch']
Variable Columns
# List all variable columns in the cell_data table
registry.list_variable_columns()
['averageAge', 'averageHeight']
Part 2: Diagnostic Plots (SimulationDiagnostics)
The SimulationDiagnostics class provides quick matplotlib-based visualizations for simulation sanity checks. These are designed for rapid exploration, not publication.
from joshpy.diagnostics import SimulationDiagnostics
diag = SimulationDiagnostics(registry)
Time Series
Plot how a variable evolves over simulation steps. By default, spatially aggregates across patches and shows uncertainty bands across replicates.
# Filter by parameter value
diag.plot_timeseries(
"averageHeight",
maxGrowth=50,
title="Average Tree Height Over Time (maxGrowth=50)",
)
Aggregation Options
# Sum across patches
diag.plot_timeseries("averageHeight", maxGrowth=50, aggregate="sum", title="Sum")
# Min across patches
diag.plot_timeseries("averageHeight", maxGrowth=50, aggregate="min", title="Min")
Parameter Comparison
Compare a variable across different parameter values - ideal for sweep results.
diag.plot_comparison(
"averageHeight",
group_by="maxGrowth",
title="Tree Height by Growth Rate Parameter",
)
Bar Chart at Specific Step
diag.plot_comparison(
"averageHeight",
group_by="maxGrowth",
step=10,
title="Final Tree Height vs Growth Rate",
)
Spatial Snapshot
Scatter plot colored by value at a specific timestep:
# Get a run hash for maxGrowth=50
run_hash_50 = registry.get_runs_by_parameters(maxGrowth=50)[0]["run_hash"]
diag.plot_spatial(
"averageHeight",
step=10,
run_hash=run_hash_50,
title="Spatial Distribution (maxGrowth=50, step=10)",
)
Saving Figures
All plot methods return a matplotlib Figure that can be saved:
fig = diag.plot_comparison(
"averageHeight",
group_by="maxGrowth",
show=False, # Don't display inline
)
fig.savefig("/tmp/comparison_plot.png", dpi=150, bbox_inches="tight")
Part 3: Custom Queries (DiagnosticQueries)
For more control than SimulationDiagnostics, use DiagnosticQueries directly. These return pandas DataFrames for further processing.
from joshpy.cell_data import DiagnosticQueries
queries = DiagnosticQueries(registry)
Parameter Comparison
df = queries.get_parameter_comparison(
variable="averageHeight",
param_name="maxGrowth",
)
df.head(15)
| | param_value | step | mean_value | std_value | n_cells |
|---|---|---|---|---|---|
| 0 | 10.0 | 0 | 5.002052 | 0.949872 | 798 |
| 1 | 10.0 | 1 | 10.017796 | 1.316734 | 798 |
| 2 | 10.0 | 2 | 15.035342 | 1.586715 | 798 |
| 3 | 10.0 | 3 | 19.980981 | 1.825273 | 798 |
| 4 | 10.0 | 4 | 24.937082 | 2.022142 | 798 |
| 5 | 10.0 | 5 | 29.970522 | 2.196011 | 798 |
| 6 | 10.0 | 6 | 34.976634 | 2.380003 | 798 |
| 7 | 10.0 | 7 | 39.937856 | 2.495450 | 798 |
| 8 | 10.0 | 8 | 44.960226 | 2.691332 | 798 |
| 9 | 10.0 | 9 | 49.950621 | 2.836932 | 798 |
| 10 | 10.0 | 10 | 54.976967 | 2.961161 | 798 |
| 11 | 20.0 | 0 | 10.024647 | 1.749574 | 798 |
| 12 | 20.0 | 1 | 20.038832 | 2.439569 | 798 |
| 13 | 20.0 | 2 | 30.057417 | 3.191414 | 798 |
| 14 | 20.0 | 3 | 40.080597 | 3.764329 | 798 |
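Because get_parameter_comparison returns a plain pandas DataFrame, the usual reshaping tools apply for further processing. As a sketch, a frame with the same columns (built synthetically here rather than pulled from the registry) can be pivoted to one column per parameter value:

```python
import pandas as pd

# Synthetic stand-in for the get_parameter_comparison output above,
# keeping the same column names (param_value, step, mean_value).
df = pd.DataFrame({
    "param_value": [10.0, 10.0, 20.0, 20.0],
    "step": [0, 1, 0, 1],
    "mean_value": [5.0, 10.0, 10.0, 20.0],
})

# Pivot to wide form: one row per step, one column per parameter value
wide = df.pivot(index="step", columns="param_value", values="mean_value")
print(wide)
```

The wide form makes per-step differences between parameter values a simple column subtraction.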
Replicate Uncertainty
Get statistics across replicates for a specific run:
df = queries.get_replicate_uncertainty(
variable="averageHeight",
run_hash=run_hash_50,
)
df.head(10)
| | step | mean | std | ci_low | ci_high | n_replicates |
|---|---|---|---|---|---|---|
| 0 | 0 | 24.959723 | 0.334604 | 24.581090 | 25.338356 | 3 |
| 1 | 1 | 49.847802 | 0.242053 | 49.573898 | 50.121705 | 3 |
| 2 | 2 | 74.751186 | 0.249176 | 74.469222 | 75.033150 | 3 |
| 3 | 3 | 100.123437 | 0.235847 | 99.856556 | 100.390318 | 3 |
| 4 | 4 | 125.486203 | 0.498842 | 124.921720 | 126.050685 | 3 |
| 5 | 5 | 150.693025 | 0.424666 | 150.212479 | 151.173571 | 3 |
| 6 | 6 | 175.916505 | 0.670069 | 175.158265 | 176.674745 | 3 |
| 7 | 7 | 201.128909 | 0.812048 | 200.210006 | 202.047811 | 3 |
| 8 | 8 | 226.277511 | 0.833181 | 225.334696 | 227.220327 | 3 |
| 9 | 9 | 251.387767 | 0.923702 | 250.342518 | 252.433015 | 3 |
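The ci_low/ci_high bounds above appear consistent with a normal-approximation 95% interval, roughly mean ± 1.96 · std / √n. As a sketch (an observation about this output, not a documented guarantee of get_replicate_uncertainty), the first row of the table can be reproduced from its summary columns:

```python
import pandas as pd

# Recompute normal-approximation 95% CI bounds from summary columns.
# Values taken from the first row of the table above.
df = pd.DataFrame({
    "mean": [24.959723],
    "std": [0.334604],
    "n_replicates": [3],
})
half_width = 1.96 * df["std"] / df["n_replicates"].pow(0.5)
df["ci_low"] = df["mean"] - half_width
df["ci_high"] = df["mean"] + half_width
print(df[["ci_low", "ci_high"]].round(6))
```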
Spatial Snapshot
Get all spatial data for a given timestep:
df = queries.get_spatial_snapshot(
step=10,
variable="averageHeight",
run_hash=run_hash_50,
)
df.head(10)
| | longitude | latitude | value |
|---|---|---|---|
| 0 | -115.264878 | 33.587585 | 280.058142 |
| 1 | -114.508196 | 33.452687 | 263.347895 |
| 2 | -115.156781 | 33.542619 | 257.482334 |
| 3 | -114.724391 | 33.407720 | 281.867308 |
| 4 | -114.832488 | 33.452687 | 271.387664 |
| 5 | -114.832488 | 33.677517 | 273.771064 |
| 6 | -115.264878 | 33.542619 | 289.841778 |
| 7 | -115.210829 | 33.497653 | 271.214587 |
| 8 | -115.102732 | 33.632551 | 270.783724 |
| 9 | -115.102732 | 33.452687 | 259.565090 |
Get All Variables at Once
Retrieve all exported variables for a specific step:
df = queries.get_all_variables_at_step(
step=10,
run_hash=run_hash_50,
)
df.head(5)
| | longitude | latitude | position_x | position_y | averageAge | averageHeight |
|---|---|---|---|---|---|---|
| 0 | -115.264878 | 33.587585 | 2.5 | 2.5 | 11.0 | 280.058142 |
| 1 | -114.508196 | 33.452687 | 16.5 | 5.5 | 11.0 | 263.347895 |
| 2 | -115.156781 | 33.542619 | 4.5 | 3.5 | 11.0 | 257.482334 |
| 3 | -114.724391 | 33.407720 | 12.5 | 6.5 | 11.0 | 281.867308 |
| 4 | -114.832488 | 33.452687 | 10.5 | 5.5 | 11.0 | 271.387664 |
Part 4: Direct SQL Queries
For full flexibility, query DuckDB directly.
Simple Aggregations
# Mean, min, max across all cells per step
result = registry.query("""
SELECT
step,
AVG(averageHeight) as mean_height,
MIN(averageHeight) as min_height,
MAX(averageHeight) as max_height,
STDDEV(averageHeight) as std_height,
COUNT(*) as n_cells
FROM cell_data
GROUP BY step
ORDER BY step
""")
result.df()
| | step | mean_height | min_height | max_height | std_height | n_cells |
|---|---|---|---|---|---|---|
| 0 | 0 | 27.449281 | 2.571519 | 81.261263 | 15.424800 | 7980 |
| 1 | 1 | 54.946575 | 5.344454 | 141.446671 | 29.806147 | 7980 |
| 2 | 2 | 82.415020 | 9.365887 | 200.431595 | 44.181379 | 7980 |
| 3 | 3 | 110.022066 | 13.070597 | 258.265273 | 58.589413 | 7980 |
| 4 | 4 | 137.611730 | 17.214230 | 309.244176 | 73.050728 | 7980 |
| 5 | 5 | 165.177986 | 21.954278 | 364.878327 | 87.426829 | 7980 |
| 6 | 6 | 192.695592 | 26.523429 | 412.071149 | 101.860115 | 7980 |
| 7 | 7 | 220.171013 | 31.515315 | 469.674767 | 116.095081 | 7980 |
| 8 | 8 | 247.716202 | 36.695347 | 526.101379 | 130.517679 | 7980 |
| 9 | 9 | 275.254514 | 40.568335 | 578.888969 | 144.922821 | 7980 |
| 10 | 10 | 302.741507 | 46.718865 | 633.356189 | 159.222472 | 7980 |
Filtering on Variable Values
# Find cells where trees grew above a threshold
result = registry.query("""
SELECT
step,
COUNT(*) as n_cells_above_50,
AVG(averageHeight) as mean_height_of_those
FROM cell_data
WHERE averageHeight > 50
GROUP BY step
ORDER BY step
""")
result.df()
| | step | n_cells_above_50 | mean_height_of_those |
|---|---|---|---|
| 0 | 0 | 676 | 56.237291 |
| 1 | 1 | 4311 | 78.262376 |
| 2 | 2 | 5666 | 104.167978 |
| 3 | 3 | 6352 | 130.435111 |
| 4 | 4 | 6770 | 156.420390 |
| 5 | 5 | 7178 | 180.274691 |
| 6 | 6 | 7182 | 210.219921 |
| 7 | 7 | 7182 | 240.196919 |
| 8 | 8 | 7203 | 269.605582 |
| 9 | 9 | 7576 | 287.388078 |
| 10 | 10 | 7939 | 304.052147 |
Percentile Analysis
# Get distribution statistics per step
result = registry.query("""
SELECT
step,
APPROX_QUANTILE(averageHeight, 0.25) as p25,
APPROX_QUANTILE(averageHeight, 0.50) as median,
APPROX_QUANTILE(averageHeight, 0.75) as p75,
APPROX_QUANTILE(averageHeight, 0.95) as p95
FROM cell_data
GROUP BY step
ORDER BY step
""")
result.df()
| | step | p25 | median | p75 | p95 |
|---|---|---|---|---|---|
| 0 | 0 | 14.333138 | 26.583360 | 38.896391 | 54.049355 |
| 1 | 1 | 29.717330 | 53.901597 | 78.920256 | 103.584791 |
| 2 | 2 | 45.065265 | 81.175304 | 119.418901 | 153.739298 |
| 3 | 3 | 59.867460 | 108.434569 | 159.078913 | 203.425428 |
| 4 | 4 | 74.906988 | 136.388891 | 199.157198 | 254.036799 |
| 5 | 5 | 89.727411 | 163.824428 | 239.437937 | 302.780912 |
| 6 | 6 | 104.488440 | 191.694419 | 279.790472 | 352.457984 |
| 7 | 7 | 119.703968 | 218.730664 | 319.603379 | 401.400480 |
| 8 | 8 | 134.630376 | 246.288961 | 359.761569 | 451.124598 |
| 9 | 9 | 149.649509 | 273.035080 | 399.361658 | 502.023270 |
| 10 | 10 | 164.947037 | 300.665859 | 438.918273 | 550.910851 |
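Note that APPROX_QUANTILE is approximate by design. For small extracts, the same per-step percentiles can be cross-checked with exact pandas quantiles. The sketch below uses a tiny synthetic frame with the same step/averageHeight columns in place of the real cell_data table:

```python
import pandas as pd

# Synthetic cell_data-like frame: two steps, five cells each.
df = pd.DataFrame({
    "step": [0] * 5 + [1] * 5,
    "averageHeight": [10, 20, 30, 40, 50, 20, 40, 60, 80, 100],
})

# Per-step quantiles, analogous to the APPROX_QUANTILE query above
# (pandas computes exact, interpolated quantiles).
pct = (
    df.groupby("step")["averageHeight"]
    .quantile([0.25, 0.50, 0.75])
    .unstack()
)
print(pct)
```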
Joining with Config Parameters
Config parameters are stored as typed columns in the config_parameters table:
# Compare results across different parameter values
result = registry.query("""
SELECT
cp.maxGrowth,
cd.step,
AVG(cd.averageHeight) as mean_height,
STDDEV(cd.averageHeight) as std_height
FROM cell_data cd
JOIN config_parameters cp ON cd.run_hash = cp.run_hash
GROUP BY cp.maxGrowth, cd.step
ORDER BY cp.maxGrowth, cd.step
""")
result.df().head(15)
| | maxGrowth | step | mean_height | std_height |
|---|---|---|---|---|
| 0 | 10.0 | 0 | 5.002052 | 0.949872 |
| 1 | 10.0 | 1 | 10.017796 | 1.316734 |
| 2 | 10.0 | 2 | 15.035342 | 1.586715 |
| 3 | 10.0 | 3 | 19.980981 | 1.825273 |
| 4 | 10.0 | 4 | 24.937082 | 2.022142 |
| 5 | 10.0 | 5 | 29.970522 | 2.196011 |
| 6 | 10.0 | 6 | 34.976634 | 2.380003 |
| 7 | 10.0 | 7 | 39.937856 | 2.495450 |
| 8 | 10.0 | 8 | 44.960226 | 2.691332 |
| 9 | 10.0 | 9 | 49.950621 | 2.836932 |
| 10 | 10.0 | 10 | 54.976967 | 2.961161 |
| 11 | 20.0 | 0 | 10.024647 | 1.749574 |
| 12 | 20.0 | 1 | 20.038832 | 2.439569 |
| 13 | 20.0 | 2 | 30.057417 | 3.191414 |
| 14 | 20.0 | 3 | 40.080597 | 3.764329 |
# Filter by parameter value (maxGrowth > 50)
result = registry.query("""
SELECT
cd.step,
AVG(cd.averageHeight) as mean_height
FROM cell_data cd
JOIN config_parameters cp ON cd.run_hash = cp.run_hash
WHERE cp.maxGrowth > 50
GROUP BY cd.step
ORDER BY cd.step
""")
result.df()
| | step | mean_height |
|---|---|---|
| 0 | 0 | 39.960817 |
| 1 | 1 | 79.949623 |
| 2 | 2 | 119.928205 |
| 3 | 3 | 160.040159 |
| 4 | 4 | 200.189526 |
| 5 | 5 | 240.277122 |
| 6 | 6 | 280.317346 |
| 7 | 7 | 320.171237 |
| 8 | 8 | 360.241799 |
| 9 | 9 | 400.280759 |
| 10 | 10 | 440.219348 |
Export to File
# Export query results to CSV
registry.query("""
SELECT step, AVG(averageHeight) as mean_height
FROM cell_data
GROUP BY step
ORDER BY step
""").df().to_csv("/tmp/height_by_step.csv", index=False)
Part 5: R/ggplot2 Visualization
For publication-quality figures, there are three routes to get data from joshpy into R:
- CSV Export - Simple, always works
- Direct DuckDB - Custom SQL queries from R, no Python needed
- py$df via reticulate - Zero-copy Python/R interop
Route 1: Export to CSV (Simple, Always Works)
# Python exports to CSV
csv_df = queries.get_parameter_comparison(
variable="averageHeight",
param_name="maxGrowth",
)
csv_df.to_csv("/tmp/sweep_results.csv", index=False)
# R reads CSV
df_csv <- read.csv("/tmp/sweep_results.csv")
df_csv$param_value <- as.numeric(df_csv$param_value)
head(df_csv, 5)
  param_value step mean_value std_value n_cells
1 10 0 5.002052 0.9498717 798
2 10 1 10.017796 1.3167339 798
3 10 2 15.035342 1.5867147 798
4 10 3 19.980981 1.8252732 798
5 10 4 24.937082 2.0221417 798
Pros: Always works, simple, portable
Cons: File I/O overhead, temp file management
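The temp-file management mentioned in the cons can be delegated to Python's standard tempfile module. A minimal sketch of the roundtrip, using a synthetic frame in place of a real query result:

```python
import os
import tempfile

import pandas as pd

# Synthetic stand-in for a query result
df = pd.DataFrame({"step": [0, 1, 2], "mean_height": [5.0, 10.0, 15.0]})

# A TemporaryDirectory removes the CSV automatically when the block exits.
with tempfile.TemporaryDirectory() as tmpdir:
    csv_path = os.path.join(tmpdir, "sweep_results.csv")
    df.to_csv(csv_path, index=False)
    # Hand csv_path to R (or any other tool) inside this block;
    # for the sketch we just read it back to verify the roundtrip.
    df_back = pd.read_csv(csv_path)

print(df_back.equals(df))
```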
Route 2: Direct DuckDB from R (Custom Queries)
R can query DuckDB directly with full SQL flexibility:
library(DBI)
library(duckdb)
con <- dbConnect(duckdb(), "demo_registry.duckdb", read_only = TRUE)
df_duckdb <- dbGetQuery(con, "
SELECT
cp.maxGrowth as param_value,
cd.step,
AVG(cd.averageHeight) as mean_value,
STDDEV(cd.averageHeight) as std_value
FROM cell_data cd
JOIN config_parameters cp ON cd.run_hash = cp.run_hash
GROUP BY cp.maxGrowth, cd.step
ORDER BY cp.maxGrowth, cd.step
")
dbDisconnect(con, shutdown = TRUE)
head(df_duckdb, 10)
   param_value step mean_value std_value
1 10 0 5.002052 0.9498717
2 10 1 10.017796 1.3167339
3 10 2 15.035342 1.5867147
4 10 3 19.980981 1.8252732
5 10 4 24.937082 2.0221417
6 10 5 29.970522 2.1960110
7 10 6 34.976634 2.3800031
8 10 7 39.937856 2.4954505
9 10 8 44.960226 2.6913317
10 10 9 49.950621 2.8369320
Pros: Full SQL flexibility, no Python needed for custom queries
Cons: Requires r-duckdb package (linux-64 only in pixi)
Route 3: py$df via reticulate (Zero-Copy)
In theory, you can access Python DataFrames directly via py$df:
# This should work:
# df <- reticulate::py_to_r(py$sweep_df)
We’re working on this - for now, Routes 1 and 2 are recommended.
Plotting with ggplot2
The following examples use df_duckdb from Route 2 (direct DuckDB query). We assign it to df for cleaner ggplot code:
# Use the DuckDB query result for plotting
df <- df_duckdb
Time Series with Uncertainty Ribbon
ggplot(df, aes(x = step, y = mean_value,
color = factor(param_value),
fill = factor(param_value))) +
geom_ribbon(aes(ymin = mean_value - std_value,
ymax = mean_value + std_value),
alpha = 0.2, color = NA) +
geom_line(linewidth = 0.8) +
geom_point(size = 2) +
labs(
x = "Simulation Step",
y = "Average Tree Height (meters)",
title = "Tree Growth Over Time by maxGrowth Parameter",
color = "maxGrowth",
fill = "maxGrowth"
) +
theme_minimal() +
theme(
legend.position = "right",
panel.grid.minor = element_blank()
)
Bar Chart with Error Bars
final_df <- df |>
dplyr::filter(step == max(step))
ggplot(final_df, aes(x = param_value, y = mean_value)) +
geom_col(fill = "steelblue", alpha = 0.7) +
geom_errorbar(
aes(ymin = mean_value - std_value, ymax = mean_value + std_value),
width = 3,
linewidth = 0.6
) +
labs(
x = "maxGrowth (meters/step)",
y = "Final Average Height (meters)",
title = "Final Tree Height vs Growth Rate"
) +
theme_minimal() +
theme(panel.grid.minor = element_blank())
Faceted Plot
ggplot(df, aes(x = step, y = mean_value)) +
geom_ribbon(aes(ymin = mean_value - std_value,
ymax = mean_value + std_value),
alpha = 0.3, fill = "steelblue") +
geom_line(color = "steelblue", linewidth = 0.8) +
geom_point(color = "steelblue", size = 1.5) +
facet_wrap(~param_value, ncol = 5, labeller = label_both) +
labs(
x = "Simulation Step",
y = "Average Tree Height (meters)",
title = "Tree Growth Trajectories by maxGrowth Parameter"
) +
theme_minimal() +
theme(
panel.grid.minor = element_blank(),
strip.background = element_rect(fill = "gray90", color = NA)
)
Summary
This demo covered the joshpy analysis workflow:
| Task | Tool | Notes |
|---|---|---|
| Discovery | registry.get_data_summary() | What’s in the registry? |
| Quick plots | SimulationDiagnostics | matplotlib-based, fast iteration |
| DataFrames | DiagnosticQueries | Returns pandas for further processing |
| Full SQL | registry.query() | Direct DuckDB access |
| R (CSV) | read.csv() | Simple, always works |
| R (DuckDB) | dbConnect(duckdb(), ...) | Custom SQL directly from R |
| R (reticulate) | py$df | Zero-copy Python/R interop |
Key Points:
- Analysis is decoupled from orchestration - same workflow for any data source
- Export variables are typed columns in cell_data
- Config parameters are typed columns in config_parameters
- R can query DuckDB directly
- Three routes for R: CSV export, direct DuckDB, or py$df via reticulate
Cleanup
registry.close()