Analysis & Visualization

Explore and visualize simulation results with joshpy

Introduction

This demo focuses on exploring and visualizing simulation results after they’ve been collected. The analysis layer in joshpy is decoupled from orchestration - it only needs a RunRegistry containing data. Whether the data came from SweepManager, a manual workflow, or someone else’s experiment doesn’t matter.

Prerequisites: Run either Manual Workflow or SweepManager Workflow first to generate demo_registry.duckdb.

For orchestration workflows, see the Manual Workflow and SweepManager Workflow demos.

This demo covers:

  1. Data Discovery - What’s in the registry?
  2. Diagnostic Plots - Quick matplotlib visualizations
  3. Custom Queries - DiagnosticQueries for common patterns
  4. Direct SQL - Full access to DuckDB for advanced analysis
  5. R/ggplot2 - Publication-quality figures

Setup: Load Registry

Load the registry saved by a previous orchestration demo:

from joshpy.registry import RunRegistry

# Load registry from disk (created by manual-workflow.qmd or sweep-manager.qmd)
REGISTRY_PATH = "demo_registry.duckdb"
registry = RunRegistry(REGISTRY_PATH)

Part 1: Data Discovery

Before plotting, discover what data is available in the registry.

Full Summary

summary = registry.get_data_summary()
print(summary)
Registry Data Summary
========================================
Sessions: 2
Configs:  10
Runs:     20
Rows:     87,780

Variables: averageAge, averageHeight
Entity types: patch
Parameters: maxGrowth
Steps: 0 - 10
Replicates: 0 - 2
Spatial extent: lon [-115.37, -114.40], lat [33.41, 33.68]

Specific Queries

# What variables were exported from the simulation?
registry.list_export_variables()
['averageAge', 'averageHeight']
# What parameters were swept?
registry.list_config_parameters()
['maxGrowth']
# What entity types are in the data?
registry.list_entity_types()
['patch']

Variable Columns

# List all variable columns in the cell_data table
registry.list_variable_columns()
['averageAge', 'averageHeight']

Part 2: Diagnostic Plots (SimulationDiagnostics)

The SimulationDiagnostics class provides quick matplotlib-based visualizations for simulation sanity checks. These are designed for rapid exploration, not publication.

from joshpy.diagnostics import SimulationDiagnostics

diag = SimulationDiagnostics(registry)

Time Series

Plot how a variable evolves over simulation steps. By default, it aggregates spatially across patches and shows uncertainty bands across replicates.

# Filter by parameter value
diag.plot_timeseries(
    "averageHeight",
    maxGrowth=50,
    title="Average Tree Height Over Time (maxGrowth=50)",
)
Figure 1: Tree height over time for maxGrowth=50, with replicate uncertainty.

Aggregation Options

# Sum across patches
diag.plot_timeseries("averageHeight", maxGrowth=50, aggregate="sum", title="Sum")
# Min across patches
diag.plot_timeseries("averageHeight", maxGrowth=50, aggregate="min", title="Min")
Figure 2: Different spatial aggregation methods.
Figure 3: Different spatial aggregation methods.

Parameter Comparison

Compare a variable across different parameter values - ideal for sweep results.

diag.plot_comparison(
    "averageHeight",
    group_by="maxGrowth",
    title="Tree Height by Growth Rate Parameter",
)
Figure 4: Tree height trajectories across all maxGrowth values.

Bar Chart at Specific Step

diag.plot_comparison(
    "averageHeight",
    group_by="maxGrowth",
    step=10,
    title="Final Tree Height vs Growth Rate",
)
Figure 5: Final tree height (step 10) for each maxGrowth value.

Spatial Snapshot

Scatter plot colored by value at a specific timestep:

# Get a run hash for maxGrowth=50
run_hash_50 = registry.get_runs_by_parameters(maxGrowth=50)[0]["run_hash"]

diag.plot_spatial(
    "averageHeight",
    step=10,
    run_hash=run_hash_50,
    title="Spatial Distribution (maxGrowth=50, step=10)",
)
Figure 6: Spatial distribution of tree height at step 10 for maxGrowth=50.

Saving Figures

All plot methods return a matplotlib Figure that can be saved:

fig = diag.plot_comparison(
    "averageHeight",
    group_by="maxGrowth",
    show=False,  # Don't display inline
)
fig.savefig("/tmp/comparison_plot.png", dpi=150, bbox_inches="tight")
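Because the return value is a standard matplotlib Figure, it can also be customized further before saving - for example, relabeling axes via fig.axes. Sketched below with a stand-in figure in place of diag’s actual output:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted saving
import matplotlib.pyplot as plt

# Stand-in for the Figure returned by diag.plot_comparison(show=False)
fig, ax = plt.subplots()
ax.plot([0, 5, 10], [5, 150, 300])

# Any matplotlib customization applies before saving
ax = fig.axes[0]
ax.set_xlabel("Simulation step")
ax.set_ylabel("Average height (m)")
fig.savefig("/tmp/comparison_custom.png", dpi=150, bbox_inches="tight")
```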

Part 3: Custom Queries (DiagnosticQueries)

For more control than SimulationDiagnostics, use DiagnosticQueries directly. These return pandas DataFrames for further processing.

from joshpy.cell_data import DiagnosticQueries

queries = DiagnosticQueries(registry)

Parameter Comparison

df = queries.get_parameter_comparison(
    variable="averageHeight",
    param_name="maxGrowth",
)
df.head(15)
param_value step mean_value std_value n_cells
0 10.0 0 5.002052 0.949872 798
1 10.0 1 10.017796 1.316734 798
2 10.0 2 15.035342 1.586715 798
3 10.0 3 19.980981 1.825273 798
4 10.0 4 24.937082 2.022142 798
5 10.0 5 29.970522 2.196011 798
6 10.0 6 34.976634 2.380003 798
7 10.0 7 39.937856 2.495450 798
8 10.0 8 44.960226 2.691332 798
9 10.0 9 49.950621 2.836932 798
10 10.0 10 54.976967 2.961161 798
11 20.0 0 10.024647 1.749574 798
12 20.0 1 20.038832 2.439569 798
13 20.0 2 30.057417 3.191414 798
14 20.0 3 40.080597 3.764329 798
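Because these queries return plain pandas DataFrames, the usual reshaping tools apply. For example, the long-format comparison above can be pivoted to one column per parameter value - shown here on a small stand-in frame with the same column names:

```python
import pandas as pd

# Stand-in for queries.get_parameter_comparison(...) output:
# long format, one row per (param_value, step)
df = pd.DataFrame({
    "param_value": [10.0, 10.0, 20.0, 20.0],
    "step": [0, 1, 0, 1],
    "mean_value": [5.0, 10.0, 10.0, 20.0],
})

# Wide format: one row per step, one column per param_value
wide = df.pivot(index="step", columns="param_value", values="mean_value")
print(wide)
```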

Replicate Uncertainty

Get statistics across replicates for a specific run:

df = queries.get_replicate_uncertainty(
    variable="averageHeight",
    run_hash=run_hash_50,
)
df.head(10)
step mean std ci_low ci_high n_replicates
0 0 24.959723 0.334604 24.581090 25.338356 3
1 1 49.847802 0.242053 49.573898 50.121705 3
2 2 74.751186 0.249176 74.469222 75.033150 3
3 3 100.123437 0.235847 99.856556 100.390318 3
4 4 125.486203 0.498842 124.921720 126.050685 3
5 5 150.693025 0.424666 150.212479 151.173571 3
6 6 175.916505 0.670069 175.158265 176.674745 3
7 7 201.128909 0.812048 200.210006 202.047811 3
8 8 226.277511 0.833181 225.334696 227.220327 3
9 9 251.387767 0.923702 250.342518 252.433015 3
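The ci_low/ci_high columns give absolute bounds; a useful derived check is the relative spread across replicates (coefficient of variation), computed here on a stand-in frame with the same column names:

```python
import pandas as pd

# Stand-in for queries.get_replicate_uncertainty(...) output
df = pd.DataFrame({
    "step": [0, 1],
    "mean": [25.0, 50.0],
    "std": [0.5, 0.25],
})

# Coefficient of variation: std relative to the mean, per step
df["cv"] = df["std"] / df["mean"]
print(df)
```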

Spatial Snapshot

Get all spatial data for a given timestep:

df = queries.get_spatial_snapshot(
    step=10,
    variable="averageHeight",
    run_hash=run_hash_50,
)
df.head(10)
longitude latitude value
0 -115.264878 33.587585 280.058142
1 -114.508196 33.452687 263.347895
2 -115.156781 33.542619 257.482334
3 -114.724391 33.407720 281.867308
4 -114.832488 33.452687 271.387664
5 -114.832488 33.677517 273.771064
6 -115.264878 33.542619 289.841778
7 -115.210829 33.497653 271.214587
8 -115.102732 33.632551 270.783724
9 -115.102732 33.452687 259.565090

Get All Variables at Once

Retrieve all exported variables for a specific step:

df = queries.get_all_variables_at_step(
    step=10,
    run_hash=run_hash_50,
)
df.head(5)
longitude latitude position_x position_y averageAge averageHeight
0 -115.264878 33.587585 2.5 2.5 11.0 280.058142
1 -114.508196 33.452687 16.5 5.5 11.0 263.347895
2 -115.156781 33.542619 4.5 3.5 11.0 257.482334
3 -114.724391 33.407720 12.5 6.5 11.0 281.867308
4 -114.832488 33.452687 10.5 5.5 11.0 271.387664

Part 4: Direct SQL Queries

For full flexibility, query DuckDB directly.

Simple Aggregations

# Mean, min, max across all cells per step
result = registry.query("""
    SELECT 
        step,
        AVG(averageHeight) as mean_height,
        MIN(averageHeight) as min_height,
        MAX(averageHeight) as max_height,
        STDDEV(averageHeight) as std_height,
        COUNT(*) as n_cells
    FROM cell_data
    GROUP BY step
    ORDER BY step
""")
result.df()
step mean_height min_height max_height std_height n_cells
0 0 27.449281 2.571519 81.261263 15.424800 7980
1 1 54.946575 5.344454 141.446671 29.806147 7980
2 2 82.415020 9.365887 200.431595 44.181379 7980
3 3 110.022066 13.070597 258.265273 58.589413 7980
4 4 137.611730 17.214230 309.244176 73.050728 7980
5 5 165.177986 21.954278 364.878327 87.426829 7980
6 6 192.695592 26.523429 412.071149 101.860115 7980
7 7 220.171013 31.515315 469.674767 116.095081 7980
8 8 247.716202 36.695347 526.101379 130.517679 7980
9 9 275.254514 40.568335 578.888969 144.922821 7980
10 10 302.741507 46.718865 633.356189 159.222472 7980

Filtering on Variable Values

# Find cells where trees grew above a threshold
result = registry.query("""
    SELECT 
        step,
        COUNT(*) as n_cells_above_50,
        AVG(averageHeight) as mean_height_of_those
    FROM cell_data
    WHERE averageHeight > 50
    GROUP BY step
    ORDER BY step
""")
result.df()
step n_cells_above_50 mean_height_of_those
0 0 676 56.237291
1 1 4311 78.262376
2 2 5666 104.167978
3 3 6352 130.435111
4 4 6770 156.420390
5 5 7178 180.274691
6 6 7182 210.219921
7 7 7182 240.196919
8 8 7203 269.605582
9 9 7576 287.388078
10 10 7939 304.052147

Percentile Analysis

# Get distribution statistics per step
result = registry.query("""
    SELECT 
        step,
        APPROX_QUANTILE(averageHeight, 0.25) as p25,
        APPROX_QUANTILE(averageHeight, 0.50) as median,
        APPROX_QUANTILE(averageHeight, 0.75) as p75,
        APPROX_QUANTILE(averageHeight, 0.95) as p95
    FROM cell_data
    GROUP BY step
    ORDER BY step
""")
result.df()
step p25 median p75 p95
0 0 14.333138 26.583360 38.896391 54.049355
1 1 29.717330 53.901597 78.920256 103.584791
2 2 45.065265 81.175304 119.418901 153.739298
3 3 59.867460 108.434569 159.078913 203.425428
4 4 74.906988 136.388891 199.157198 254.036799
5 5 89.727411 163.824428 239.437937 302.780912
6 6 104.488440 191.694419 279.790472 352.457984
7 7 119.703968 218.730664 319.603379 401.400480
8 8 134.630376 246.288961 359.761569 451.124598
9 9 149.649509 273.035080 399.361658 502.023270
10 10 164.947037 300.665859 438.918273 550.910851

Joining with Config Parameters

Config parameters are stored as typed columns in the config_parameters table:

# Compare results across different parameter values
result = registry.query("""
    SELECT 
        cp.maxGrowth,
        cd.step,
        AVG(cd.averageHeight) as mean_height,
        STDDEV(cd.averageHeight) as std_height
    FROM cell_data cd
    JOIN config_parameters cp ON cd.run_hash = cp.run_hash
    GROUP BY cp.maxGrowth, cd.step
    ORDER BY cp.maxGrowth, cd.step
""")
result.df().head(15)
maxGrowth step mean_height std_height
0 10.0 0 5.002052 0.949872
1 10.0 1 10.017796 1.316734
2 10.0 2 15.035342 1.586715
3 10.0 3 19.980981 1.825273
4 10.0 4 24.937082 2.022142
5 10.0 5 29.970522 2.196011
6 10.0 6 34.976634 2.380003
7 10.0 7 39.937856 2.495450
8 10.0 8 44.960226 2.691332
9 10.0 9 49.950621 2.836932
10 10.0 10 54.976967 2.961161
11 20.0 0 10.024647 1.749574
12 20.0 1 20.038832 2.439569
13 20.0 2 30.057417 3.191414
14 20.0 3 40.080597 3.764329
# Filter by parameter value (maxGrowth > 50)
result = registry.query("""
    SELECT 
        cd.step,
        AVG(cd.averageHeight) as mean_height
    FROM cell_data cd
    JOIN config_parameters cp ON cd.run_hash = cp.run_hash
    WHERE cp.maxGrowth > 50
    GROUP BY cd.step
    ORDER BY cd.step
""")
result.df()
step mean_height
0 0 39.960817
1 1 79.949623
2 2 119.928205
3 3 160.040159
4 4 200.189526
5 5 240.277122
6 6 280.317346
7 7 320.171237
8 8 360.241799
9 9 400.280759
10 10 440.219348

Export to File

# Export query results to CSV
registry.query("""
    SELECT step, AVG(averageHeight) as mean_height
    FROM cell_data
    GROUP BY step
    ORDER BY step
""").df().to_csv("/tmp/height_by_step.csv", index=False)
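A related pattern is splitting an export by parameter value - one file per group - which pandas handles with groupby. Sketched on a stand-in frame whose columns match the comparison query above:

```python
import pandas as pd
from pathlib import Path

# Stand-in for a parameter-comparison DataFrame
df = pd.DataFrame({
    "param_value": [10.0, 10.0, 20.0, 20.0],
    "step": [0, 1, 0, 1],
    "mean_value": [5.0, 10.0, 10.0, 20.0],
})

# Write one CSV per parameter value
out_dir = Path("/tmp")
for value, group in df.groupby("param_value"):
    group.to_csv(out_dir / f"height_maxGrowth_{value}.csv", index=False)
```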

Part 5: R/ggplot2 Visualization

For publication-quality figures, there are three routes to get data from joshpy into R:

  1. CSV Export - Simple, always works
  2. Direct DuckDB - Custom SQL queries from R, no Python needed
  3. py$df via reticulate - Zero-copy Python/R interop

Route 1: Export to CSV (Simple, Always Works)

# Python exports to CSV
csv_df = queries.get_parameter_comparison(
    variable="averageHeight",
    param_name="maxGrowth",
)
csv_df.to_csv("/tmp/sweep_results.csv", index=False)
# R reads CSV
df_csv <- read.csv("/tmp/sweep_results.csv")
df_csv$param_value <- as.numeric(df_csv$param_value)
head(df_csv, 5)
  param_value step mean_value std_value n_cells
1          10    0   5.002052 0.9498717     798
2          10    1  10.017796 1.3167339     798
3          10    2  15.035342 1.5867147     798
4          10    3  19.980981 1.8252732     798
5          10    4  24.937082 2.0221417     798

Pros: Always works, simple, portable
Cons: File I/O overhead, temp file management

Route 2: Direct DuckDB from R (Custom Queries)

R can query DuckDB directly with full SQL flexibility:

library(DBI)
library(duckdb)

con <- dbConnect(duckdb(), "demo_registry.duckdb", read_only = TRUE)

df_duckdb <- dbGetQuery(con, "
    SELECT 
        cp.maxGrowth as param_value,
        cd.step,
        AVG(cd.averageHeight) as mean_value,
        STDDEV(cd.averageHeight) as std_value
    FROM cell_data cd
    JOIN config_parameters cp ON cd.run_hash = cp.run_hash
    GROUP BY cp.maxGrowth, cd.step
    ORDER BY cp.maxGrowth, cd.step
")

dbDisconnect(con, shutdown = TRUE)
head(df_duckdb, 10)
   param_value step mean_value std_value
1           10    0   5.002052 0.9498717
2           10    1  10.017796 1.3167339
3           10    2  15.035342 1.5867147
4           10    3  19.980981 1.8252732
5           10    4  24.937082 2.0221417
6           10    5  29.970522 2.1960110
7           10    6  34.976634 2.3800031
8           10    7  39.937856 2.4954505
9           10    8  44.960226 2.6913317
10          10    9  49.950621 2.8369320

Pros: Full SQL flexibility, no Python needed for custom queries
Cons: Requires r-duckdb package (linux-64 only in pixi)

Route 3: py$df via reticulate (Zero-Copy)

In theory, you can access Python DataFrames directly via py$df:

# This should work:
# df <- reticulate::py_to_r(py$sweep_df)

We’re working on this - for now, Routes 1 and 2 are recommended.

Plotting with ggplot2

The following examples use df_duckdb from Route 2 (direct DuckDB query). We assign it to df for cleaner ggplot code:

# Use the DuckDB query result for plotting
df <- df_duckdb

Time Series with Uncertainty Ribbon

ggplot(df, aes(x = step, y = mean_value, 
                     color = factor(param_value), 
                     fill = factor(param_value))) +
  geom_ribbon(aes(ymin = mean_value - std_value, 
                  ymax = mean_value + std_value), 
              alpha = 0.2, color = NA) +
  geom_line(linewidth = 0.8) +
  geom_point(size = 2) +
  labs(
    x = "Simulation Step",
    y = "Average Tree Height (meters)",
    title = "Tree Growth Over Time by maxGrowth Parameter",
    color = "maxGrowth",
    fill = "maxGrowth"
  ) +
  theme_minimal() +
  theme(
    legend.position = "right",
    panel.grid.minor = element_blank()
  )
Figure 7: Tree height trajectories with uncertainty ribbons (ggplot2).

Bar Chart with Error Bars

final_df <- df |>
  dplyr::filter(step == max(step))

ggplot(final_df, aes(x = param_value, y = mean_value)) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  geom_errorbar(
    aes(ymin = mean_value - std_value, ymax = mean_value + std_value),
    width = 3,
    linewidth = 0.6
  ) +
  labs(
    x = "maxGrowth (meters/step)",
    y = "Final Average Height (meters)",
    title = "Final Tree Height vs Growth Rate"
  ) +
  theme_minimal() +
  theme(panel.grid.minor = element_blank())
Figure 8: Final tree height vs growth rate parameter with error bars.

Faceted Plot

ggplot(df, aes(x = step, y = mean_value)) +
  geom_ribbon(aes(ymin = mean_value - std_value, 
                  ymax = mean_value + std_value), 
              alpha = 0.3, fill = "steelblue") +
  geom_line(color = "steelblue", linewidth = 0.8) +
  geom_point(color = "steelblue", size = 1.5) +
  facet_wrap(~param_value, ncol = 5, labeller = label_both) +
  labs(
    x = "Simulation Step",
    y = "Average Tree Height (meters)",
    title = "Tree Growth Trajectories by maxGrowth Parameter"
  ) +
  theme_minimal() +
  theme(
    panel.grid.minor = element_blank(),
    strip.background = element_rect(fill = "gray90", color = NA)
  )
Figure 9: Faceted time series - one panel per maxGrowth value.

Summary

This demo covered the joshpy analysis workflow:

Task            Tool                          Notes
Discovery       registry.get_data_summary()   What’s in the registry?
Quick plots     SimulationDiagnostics         matplotlib-based, fast iteration
DataFrames      DiagnosticQueries             Returns pandas for further processing
Full SQL        registry.query()              Direct DuckDB access
R (CSV)         read.csv()                    Simple, always works
R (DuckDB)      dbConnect(duckdb(), ...)      Custom SQL directly from R
R (reticulate)  py$df                         Zero-copy Python/R interop

Key Points:

  • Analysis is decoupled from orchestration - same workflow for any data source
  • Export variables are typed columns in cell_data
  • Config parameters are typed columns in config_parameters
  • R can query DuckDB directly
  • Three routes for R: CSV export, direct DuckDB, or py$df via reticulate

Cleanup

registry.close()