Analysis & Visualization

Explore and visualize simulation results with joshpy

Introduction

This demo focuses on exploring and visualizing simulation results after they’ve been collected. The analysis layer in joshpy is decoupled from orchestration - it only needs a RunRegistry containing data. Whether the data came from SweepManager, a manual workflow, or someone else’s experiment doesn’t matter.

Prerequisites: Run either Manual Workflow or SweepManager Workflow first to generate demo_registry.duckdb.

For orchestration workflows, see the Manual Workflow and SweepManager Workflow demos.

This demo covers:

  1. Data Discovery - What’s in the registry?
  2. Diagnostic Plots - Quick matplotlib visualizations
  3. Custom Queries - DiagnosticQueries for common patterns
  4. Direct SQL - Full access to DuckDB for advanced analysis
  5. R/ggplot2 - Publication-quality figures

Setup: Load Registry

Load the registry saved by a previous orchestration demo:

from joshpy.registry import RunRegistry

# Load registry from disk (created by manual-workflow.qmd or sweep-manager.qmd)
REGISTRY_PATH = "demo_registry.duckdb"
registry = RunRegistry(REGISTRY_PATH)

Part 1: Data Discovery

Before plotting, discover what data is available in the registry.

Full Summary

summary = registry.get_data_summary()
print(summary)
Registry Data Summary
========================================
Sessions: 2
Configs:  10
Runs:     20
Rows:     87,780

Variables: averageAge, averageHeight
Entity types: patch
Parameters: maxGrowth
Steps: 0 - 10
Replicates: 0 - 2
Spatial extent: lon [-115.37, -114.40], lat [33.41, 33.68]

Specific Queries

# What variables were exported from the simulation?
registry.list_export_variables()
['averageAge', 'averageHeight']
# What parameters were swept?
registry.list_config_parameters()
['maxGrowth']
# What entity types are in the data?
registry.list_entity_types()
['patch']

Variable Columns

# List all variable columns in the cell_data table
registry.list_variable_columns()
['averageAge', 'averageHeight']

Part 2: Diagnostic Plots (SimulationDiagnostics)

The SimulationDiagnostics class provides quick matplotlib-based visualizations for simulation sanity checks. These are designed for rapid exploration, not publication.

from joshpy.diagnostics import SimulationDiagnostics

diag = SimulationDiagnostics(registry)

Time Series

Plot how a variable evolves over simulation steps. By default, it aggregates spatially across patches and shows uncertainty bands across replicates.

# Filter by parameter value
diag.plot_timeseries(
    "averageHeight",
    maxGrowth=50,
    title="Average Tree Height Over Time (maxGrowth=50)",
)
Figure 1: Tree height over time for maxGrowth=50, with replicate uncertainty.

Aggregation Options

# Sum across patches
diag.plot_timeseries("averageHeight", maxGrowth=50, aggregate="sum", title="Sum")
# Min across patches
diag.plot_timeseries("averageHeight", maxGrowth=50, aggregate="min", title="Min")
Figure 2: Different spatial aggregation methods.
Figure 3: Different spatial aggregation methods.

Parameter Comparison

Compare a variable across different parameter values - ideal for sweep results.

diag.plot_comparison(
    "averageHeight",
    group_by="maxGrowth",
    title="Tree Height by Growth Rate Parameter",
)
Figure 4: Tree height trajectories across all maxGrowth values.

Bar Chart at Specific Step

diag.plot_comparison(
    "averageHeight",
    group_by="maxGrowth",
    step=10,
    title="Final Tree Height vs Growth Rate",
)
Figure 5: Final tree height (step 10) for each maxGrowth value.

Spatial Snapshot

Scatter plot colored by value at a specific timestep:

# Get a run hash for maxGrowth=50
run_hash_50 = registry.get_runs_by_parameters(maxGrowth=50)[0]["run_hash"]

diag.plot_spatial(
    "averageHeight",
    step=10,
    run_hash=run_hash_50,
    title="Spatial Distribution (maxGrowth=50, step=10)",
)
Figure 6: Spatial distribution of tree height at step 10 for maxGrowth=50.

Saving Figures

All plot methods return a matplotlib Figure that can be saved:

fig = diag.plot_comparison(
    "averageHeight",
    group_by="maxGrowth",
    show=False,  # Don't display inline
)
fig.savefig("/tmp/comparison_plot.png", dpi=150, bbox_inches="tight")
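Because the return value is a standard matplotlib Figure, it can also be customized further before saving - for example, relabeling axes via fig.axes. Sketched below with a stand-in figure in place of diag’s actual output:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted saving
import matplotlib.pyplot as plt

# Stand-in for the Figure returned by diag.plot_comparison(show=False)
fig, ax = plt.subplots()
ax.plot([0, 5, 10], [5, 150, 300])

# Any matplotlib customization applies before saving
ax = fig.axes[0]
ax.set_xlabel("Simulation step")
ax.set_ylabel("Average height (m)")
fig.savefig("/tmp/comparison_custom.png", dpi=150, bbox_inches="tight")
```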

Part 3: Custom Queries (DiagnosticQueries)

For more control than SimulationDiagnostics, use DiagnosticQueries directly. These return pandas DataFrames for further processing.

from joshpy.cell_data import DiagnosticQueries

queries = DiagnosticQueries(registry)

Parameter Comparison

df = queries.get_parameter_comparison(
    variable="averageHeight",
    param_name="maxGrowth",
)
df.head(15)
param_value step mean_value std_value n_cells
0 10.0 0 5.002052 0.949872 798
1 10.0 1 10.017796 1.316734 798
2 10.0 2 15.035342 1.586715 798
3 10.0 3 19.980981 1.825273 798
4 10.0 4 24.937082 2.022142 798
5 10.0 5 29.970522 2.196011 798
6 10.0 6 34.976634 2.380003 798
7 10.0 7 39.937856 2.495450 798
8 10.0 8 44.960226 2.691332 798
9 10.0 9 49.950621 2.836932 798
10 10.0 10 54.976967 2.961161 798
11 20.0 0 10.024647 1.749574 798
12 20.0 1 20.038832 2.439569 798
13 20.0 2 30.057417 3.191414 798
14 20.0 3 40.080597 3.764329 798
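Because these queries return plain pandas DataFrames, the usual reshaping tools apply. For example, the long-format comparison above can be pivoted to one column per parameter value - shown here on a small stand-in frame with the same column names:

```python
import pandas as pd

# Stand-in for queries.get_parameter_comparison(...) output:
# long format, one row per (param_value, step)
df = pd.DataFrame({
    "param_value": [10.0, 10.0, 20.0, 20.0],
    "step": [0, 1, 0, 1],
    "mean_value": [5.0, 10.0, 10.0, 20.0],
})

# Wide format: one row per step, one column per param_value
wide = df.pivot(index="step", columns="param_value", values="mean_value")
print(wide)
```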

Replicate Uncertainty

Get statistics across replicates for a specific run:

df = queries.get_replicate_uncertainty(
    variable="averageHeight",
    run_hash=run_hash_50,
)
df.head(10)
step mean std ci_low ci_high n_replicates
0 0 24.959723 0.334604 24.581090 25.338356 3
1 1 49.847802 0.242053 49.573898 50.121705 3
2 2 74.751186 0.249176 74.469222 75.033150 3
3 3 100.123437 0.235847 99.856556 100.390318 3
4 4 125.486203 0.498842 124.921720 126.050685 3
5 5 150.693025 0.424666 150.212479 151.173571 3
6 6 175.916505 0.670069 175.158265 176.674745 3
7 7 201.128909 0.812048 200.210006 202.047811 3
8 8 226.277511 0.833181 225.334696 227.220327 3
9 9 251.387767 0.923702 250.342518 252.433015 3
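The ci_low/ci_high columns give absolute bounds; a useful derived check is the relative spread across replicates (coefficient of variation), computed here on a stand-in frame with the same column names:

```python
import pandas as pd

# Stand-in for queries.get_replicate_uncertainty(...) output
df = pd.DataFrame({
    "step": [0, 1],
    "mean": [25.0, 50.0],
    "std": [0.5, 0.25],
})

# Coefficient of variation: std relative to the mean, per step
df["cv"] = df["std"] / df["mean"]
print(df)
```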

Spatial Snapshot

Get all spatial data for a given timestep:

df = queries.get_spatial_snapshot(
    step=10,
    variable="averageHeight",
    run_hash=run_hash_50,
)
df.head(10)
longitude latitude value
0 -115.264878 33.587585 280.058142
1 -114.508196 33.452687 263.347895
2 -115.156781 33.542619 257.482334
3 -114.724391 33.407720 281.867308
4 -114.832488 33.452687 271.387664
5 -114.832488 33.677517 273.771064
6 -115.264878 33.542619 289.841778
7 -115.210829 33.497653 271.214587
8 -115.102732 33.632551 270.783724
9 -115.102732 33.452687 259.565090

Get All Variables at Once

Retrieve all exported variables for a specific step:

df = queries.get_all_variables_at_step(
    step=10,
    run_hash=run_hash_50,
)
df.head(5)
longitude latitude position_x position_y averageAge averageHeight
0 -115.264878 33.587585 2.5 2.5 11.0 280.058142
1 -114.508196 33.452687 16.5 5.5 11.0 263.347895
2 -115.156781 33.542619 4.5 3.5 11.0 257.482334
3 -114.724391 33.407720 12.5 6.5 11.0 281.867308
4 -114.832488 33.452687 10.5 5.5 11.0 271.387664

Part 4: Direct SQL Queries

For full flexibility, query DuckDB directly.

Simple Aggregations

# Mean, min, max across all cells per step
result = registry.query("""
    SELECT 
        step,
        AVG(averageHeight) as mean_height,
        MIN(averageHeight) as min_height,
        MAX(averageHeight) as max_height,
        STDDEV(averageHeight) as std_height,
        COUNT(*) as n_cells
    FROM cell_data
    GROUP BY step
    ORDER BY step
""")
result.df()
step mean_height min_height max_height std_height n_cells
0 0 27.449281 2.571519 81.261263 15.424800 7980
1 1 54.946575 5.344454 141.446671 29.806147 7980
2 2 82.415020 9.365887 200.431595 44.181379 7980
3 3 110.022066 13.070597 258.265273 58.589413 7980
4 4 137.611730 17.214230 309.244176 73.050728 7980
5 5 165.177986 21.954278 364.878327 87.426829 7980
6 6 192.695592 26.523429 412.071149 101.860115 7980
7 7 220.171013 31.515315 469.674767 116.095081 7980
8 8 247.716202 36.695347 526.101379 130.517679 7980
9 9 275.254514 40.568335 578.888969 144.922821 7980
10 10 302.741507 46.718865 633.356189 159.222472 7980

Filtering on Variable Values

# Find cells where trees grew above a threshold
result = registry.query("""
    SELECT 
        step,
        COUNT(*) as n_cells_above_50,
        AVG(averageHeight) as mean_height_of_those
    FROM cell_data
    WHERE averageHeight > 50
    GROUP BY step
    ORDER BY step
""")
result.df()
step n_cells_above_50 mean_height_of_those
0 0 676 56.237291
1 1 4311 78.262376
2 2 5666 104.167978
3 3 6352 130.435111
4 4 6770 156.420390
5 5 7178 180.274691
6 6 7182 210.219921
7 7 7182 240.196919
8 8 7203 269.605582
9 9 7576 287.388078
10 10 7939 304.052147

Percentile Analysis

# Get distribution statistics per step
result = registry.query("""
    SELECT 
        step,
        APPROX_QUANTILE(averageHeight, 0.25) as p25,
        APPROX_QUANTILE(averageHeight, 0.50) as median,
        APPROX_QUANTILE(averageHeight, 0.75) as p75,
        APPROX_QUANTILE(averageHeight, 0.95) as p95
    FROM cell_data
    GROUP BY step
    ORDER BY step
""")
result.df()
step p25 median p75 p95
0 0 14.333138 26.583360 38.896391 54.049355
1 1 29.717330 53.901597 78.920256 103.584791
2 2 45.065265 81.175304 119.418901 153.739298
3 3 59.867460 108.434569 159.078913 203.425428
4 4 74.906988 136.388891 199.157198 254.036799
5 5 89.727411 163.824428 239.437937 302.780912
6 6 104.488440 191.694419 279.790472 352.457984
7 7 119.703968 218.730664 319.603379 401.400480
8 8 134.630376 246.288961 359.761569 451.124598
9 9 149.649509 273.035080 399.361658 502.023270
10 10 164.947037 300.665859 438.918273 550.910851

Joining with Config Parameters

Config parameters are stored as typed columns in the config_parameters table:

# Compare results across different parameter values
result = registry.query("""
    SELECT 
        cp.maxGrowth,
        cd.step,
        AVG(cd.averageHeight) as mean_height,
        STDDEV(cd.averageHeight) as std_height
    FROM cell_data cd
    JOIN config_parameters cp ON cd.run_hash = cp.run_hash
    GROUP BY cp.maxGrowth, cd.step
    ORDER BY cp.maxGrowth, cd.step
""")
result.df().head(15)
maxGrowth step mean_height std_height
0 10.0 0 5.002052 0.949872
1 10.0 1 10.017796 1.316734
2 10.0 2 15.035342 1.586715
3 10.0 3 19.980981 1.825273
4 10.0 4 24.937082 2.022142
5 10.0 5 29.970522 2.196011
6 10.0 6 34.976634 2.380003
7 10.0 7 39.937856 2.495450
8 10.0 8 44.960226 2.691332
9 10.0 9 49.950621 2.836932
10 10.0 10 54.976967 2.961161
11 20.0 0 10.024647 1.749574
12 20.0 1 20.038832 2.439569
13 20.0 2 30.057417 3.191414
14 20.0 3 40.080597 3.764329
# Filter by parameter value (maxGrowth > 50)
result = registry.query("""
    SELECT 
        cd.step,
        AVG(cd.averageHeight) as mean_height
    FROM cell_data cd
    JOIN config_parameters cp ON cd.run_hash = cp.run_hash
    WHERE cp.maxGrowth > 50
    GROUP BY cd.step
    ORDER BY cd.step
""")
result.df()
step mean_height
0 0 39.960817
1 1 79.949623
2 2 119.928205
3 3 160.040159
4 4 200.189526
5 5 240.277122
6 6 280.317346
7 7 320.171237
8 8 360.241799
9 9 400.280759
10 10 440.219348

Export to File

# Export query results to CSV
registry.query("""
    SELECT step, AVG(averageHeight) as mean_height
    FROM cell_data
    GROUP BY step
    ORDER BY step
""").df().to_csv("/tmp/height_by_step.csv", index=False)
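A related pattern is splitting an export by parameter value - one file per group - which pandas handles with groupby. Sketched on a stand-in frame whose columns match the comparison query above:

```python
import pandas as pd
from pathlib import Path

# Stand-in for a parameter-comparison DataFrame
df = pd.DataFrame({
    "param_value": [10.0, 10.0, 20.0, 20.0],
    "step": [0, 1, 0, 1],
    "mean_value": [5.0, 10.0, 10.0, 20.0],
})

# Write one CSV per parameter value
out_dir = Path("/tmp")
for value, group in df.groupby("param_value"):
    group.to_csv(out_dir / f"height_maxGrowth_{value}.csv", index=False)
```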

Part 5: R/ggplot2 Visualization

For publication-quality figures, there are three routes to get data from joshpy into R:

  1. CSV Export - Simple, always works
  2. Direct DuckDB - Custom SQL queries from R, no Python needed
  3. py$df via reticulate - Zero-copy Python/R interop

Route 1: Export to CSV (Simple, Always Works)

# Python exports to CSV
csv_df = queries.get_parameter_comparison(
    variable="averageHeight",
    param_name="maxGrowth",
)
csv_df.to_csv("/tmp/sweep_results.csv", index=False)
# R reads CSV
df_csv <- read.csv("/tmp/sweep_results.csv")
df_csv$param_value <- as.numeric(df_csv$param_value)
head(df_csv, 5)
  param_value step mean_value std_value n_cells
1          10    0   5.002052 0.9498717     798
2          10    1  10.017796 1.3167339     798
3          10    2  15.035342 1.5867147     798
4          10    3  19.980981 1.8252732     798
5          10    4  24.937082 2.0221417     798

Pros: Always works, simple, portable
Cons: File I/O overhead, temp file management

Route 2: Direct DuckDB from R (Custom Queries)

R can query DuckDB directly with full SQL flexibility:

library(DBI)
library(duckdb)

con <- dbConnect(duckdb(), "demo_registry.duckdb", read_only = TRUE)

df_duckdb <- dbGetQuery(con, "
    SELECT 
        cp.maxGrowth as param_value,
        cd.step,
        AVG(cd.averageHeight) as mean_value,
        STDDEV(cd.averageHeight) as std_value
    FROM cell_data cd
    JOIN config_parameters cp ON cd.run_hash = cp.run_hash
    GROUP BY cp.maxGrowth, cd.step
    ORDER BY cp.maxGrowth, cd.step
")

dbDisconnect(con, shutdown = TRUE)
head(df_duckdb, 10)
   param_value step mean_value std_value
1           10    0   5.002052 0.9498717
2           10    1  10.017796 1.3167339
3           10    2  15.035342 1.5867147
4           10    3  19.980981 1.8252732
5           10    4  24.937082 2.0221417
6           10    5  29.970522 2.1960110
7           10    6  34.976634 2.3800031
8           10    7  39.937856 2.4954505
9           10    8  44.960226 2.6913317
10          10    9  49.950621 2.8369320

Pros: Full SQL flexibility, no Python needed for custom queries
Cons: Requires r-duckdb package (linux-64 only in pixi)

Route 3: py$df via reticulate (Zero-Copy)

In theory, you can access Python DataFrames directly via py$df:

# This should work:
# df <- reticulate::py_to_r(py$sweep_df)

We’re working on this - for now, Routes 1 and 2 are recommended.

Plotting with ggplot2

The following examples use df_duckdb from Route 2 (direct DuckDB query). We assign it to df for cleaner ggplot code:

# Use the DuckDB query result for plotting
df <- df_duckdb

Time Series with Uncertainty Ribbon

ggplot(df, aes(x = step, y = mean_value, 
                     color = factor(param_value), 
                     fill = factor(param_value))) +
  geom_ribbon(aes(ymin = mean_value - std_value, 
                  ymax = mean_value + std_value), 
              alpha = 0.2, color = NA) +
  geom_line(linewidth = 0.8) +
  geom_point(size = 2) +
  labs(
    x = "Simulation Step",
    y = "Average Tree Height (meters)",
    title = "Tree Growth Over Time by maxGrowth Parameter",
    color = "maxGrowth",
    fill = "maxGrowth"
  ) +
  theme_minimal() +
  theme(
    legend.position = "right",
    panel.grid.minor = element_blank()
  )
Figure 7: Tree height trajectories with uncertainty ribbons (ggplot2).

Bar Chart with Error Bars

final_df <- df |>
  dplyr::filter(step == max(step))

ggplot(final_df, aes(x = param_value, y = mean_value)) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  geom_errorbar(
    aes(ymin = mean_value - std_value, ymax = mean_value + std_value),
    width = 3,
    linewidth = 0.6
  ) +
  labs(
    x = "maxGrowth (meters/step)",
    y = "Final Average Height (meters)",
    title = "Final Tree Height vs Growth Rate"
  ) +
  theme_minimal() +
  theme(panel.grid.minor = element_blank())
Figure 8: Final tree height vs growth rate parameter with error bars.

Faceted Plot

ggplot(df, aes(x = step, y = mean_value)) +
  geom_ribbon(aes(ymin = mean_value - std_value, 
                  ymax = mean_value + std_value), 
              alpha = 0.3, fill = "steelblue") +
  geom_line(color = "steelblue", linewidth = 0.8) +
  geom_point(color = "steelblue", size = 1.5) +
  facet_wrap(~param_value, ncol = 5, labeller = label_both) +
  labs(
    x = "Simulation Step",
    y = "Average Tree Height (meters)",
    title = "Tree Growth Trajectories by maxGrowth Parameter"
  ) +
  theme_minimal() +
  theme(
    panel.grid.minor = element_blank(),
    strip.background = element_rect(fill = "gray90", color = NA)
  )
Figure 9: Faceted time series - one panel per maxGrowth value.

Summary

This demo covered the joshpy analysis workflow:

Task            Tool                          Notes
Discovery       registry.get_data_summary()   What’s in the registry?
Quick plots     SimulationDiagnostics         matplotlib-based, fast iteration
DataFrames      DiagnosticQueries             Returns pandas for further processing
Full SQL        registry.query()              Direct DuckDB access
R (CSV)         read.csv()                    Simple, always works
R (DuckDB)      dbConnect(duckdb(), ...)      Custom SQL directly from R
R (reticulate)  py$df                         Zero-copy Python/R interop

Key Points:

  • Analysis is decoupled from orchestration - same workflow for any data source
  • Export variables are typed columns in cell_data
  • Config parameters are typed columns in config_parameters
  • R can query DuckDB directly
  • Three routes for R: CSV export, direct DuckDB, or py$df via reticulate

Cleanup

registry.close()