TowerGuard — The Cost of Doing Nothing

T1 · Scope

Scope

A strategic, national-aggregate decision-support simulator for the US air-traffic-controller (CPC) staffing crisis, FY2025–FY2036.
Projects the workforce across 5 policy scenarios and quantifies the cost and safety risk of delaying intervention.

Key Assumptions

Calibrated to GAO-26-107320 and the FAA Controller Workforce Plan; every parameter is listed with its source and a confidence rating in T6.
The burnout-loop coefficients are illustrative (low confidence), so the collapse depth and dollar figures are order-of-magnitude, not point predictions. (See T5 sensitivity, T6 ledger.)

Non-goals

NOT a facility-level scheduler — use FAA CRWG/AFN tools for a single tower.
NOT an accident-prediction tool — safety outputs are RELATIVE risk, not probabilities.
NOT for operational use — decision-support only.
Does NOT decide policy or set “acceptable” risk — humans decide (see T9).
Endorses NEITHER the FAA nor the NATCA staffing target.

T2 · Scenario Dashboard

Certified controllers (CPC), FY2025–FY2036

Five policy scenarios shown together. Confidence band (P10–P90) is drawn for the highlighted scenario so projections are never read as point estimates.

Highlight band

Actual CPC (history, FY2020–FY2025) · Projection start (FY2025) · FAA target (12,563) · NATCA target (14,633) · P10–P90 band (Current Plan (FAA CWP))

Peak safety risk index

1.8×

for Current Plan (FAA CWP)

Note. Risk_index is a RELATIVE multiplier (1.0 = rested baseline), NOT an accident probability (§9.2). Corroborating context (not a fitted relationship): FY2023 saw 19 serious near-misses, a 7-year high, during the staffing crisis.

Months below 85% staffing

132

over the 12-year horizon

Cumulative cost of delay

$154.8B

FY2025–FY2036, Current Plan (FAA CWP)

T3 · Architecture & Design

Two halves, one loop

TowerGuard is two decoupled halves that close a loop: a simulator that projects the cost of doing nothing, and a live system that checks whether reality is unfolding the way the model assumed.

Simulator: "What will doing nothing cost?"

models/ (system dynamics + Monte Carlo) → scenario_results.json (frozen JSON contract) → T2–T12 panels

Live Validation (closes the loop): "Is reality unfolding as the model assumed?"

OpenSky-style traffic → modules/ (3 deterministic risk signals) → Redis (frozen Redis contract) → agents/Claude (phrases only, never decides) → T13

Two frozen contracts (the JSON contract and the Redis contract) keep the halves independent and testable — 304 tests guard the boundaries.

Design tradeoffs

System dynamics, not machine learning

Policy needs an interpretable causal mechanism and the data is sparse (~10 annual points). ML would be a data-hungry black box that can't explain why the workforce collapses.

Monte Carlo confidence bands, not point estimates

A policy model must represent uncertainty, not fake precision — we show P10–P90, not a single confident line.
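A minimal sketch of how a P10–P90 band can be produced by Monte Carlo, assuming a toy one-stock update rather than the actual TowerGuard equations. The starting CPC (11,000) and the 0.1 attrition rate come from the T6 ledger; the spread of the attrition draw, the fixed inflow of 900 certifications per year, and the function names are illustrative assumptions.

```python
import random
import statistics


def simulate_cpc(years, cpc0, attrition_rate, certified_per_year):
    """Toy annual stock update: lose a fraction to attrition, gain new certifications."""
    path = [cpc0]
    for _ in range(years):
        path.append(path[-1] * (1.0 - attrition_rate) + certified_per_year)
    return path


def percentile_bands(n_runs=2000, years=11, seed=7):
    """Return one (P10, P50, P90) tuple per year across n_runs random trajectories."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_runs):
        # Assumption: attrition is uncertain around 0.10 with a made-up spread of 0.02.
        attrition = max(0.0, rng.gauss(0.10, 0.02))
        # Assumption: 900 certifications per year is a placeholder inflow.
        runs.append(simulate_cpc(years, 11_000, attrition, certified_per_year=900))

    bands = []
    for t in range(years + 1):
        values = [run[t] for run in runs]
        cuts = statistics.quantiles(values, n=10)  # nine cut points: P10 ... P90
        bands.append((cuts[0], statistics.median(values), cuts[8]))
    return bands


bands = percentile_bands()
for offset, (p10, p50, p90) in enumerate(bands):
    print(f"FY{2025 + offset}: P10={p10:,.0f}  P50={p50:,.0f}  P90={p90:,.0f}")
```

Because every run starts from the same headcount, the band has zero width in FY2025 and widens as parameter uncertainty compounds, which is the behaviour the dashboard band is meant to convey.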

The LLM augments, it never decides (Option B)

The deterministic engine owns every escalation decision (auditable, testable); Claude only phrases the human-facing text, with a template fallback. Safety-critical AI must not 'drive'.
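The split between deciding and phrasing could look roughly like the sketch below. It is not the project's code: the `Advisory` type, the 0.8 threshold, and the `phrase_fn` callable (a stand-in for whatever wraps the LLM call) are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Advisory:
    escalate: bool    # set only by the deterministic rule
    text: str         # human-facing wording, from the LLM or the template
    phrased_by: str   # "llm" or "template"


def decide_escalation(risk_score: float, threshold: float = 0.8) -> bool:
    """Deterministic, auditable rule. The 0.8 threshold is a placeholder."""
    return risk_score >= threshold


def template_text(signal: str, risk_score: float, escalate: bool) -> str:
    action = "ESCALATE" if escalate else "MONITOR"
    return f"[{action}] {signal} risk score {risk_score:.2f}."


def build_advisory(
    signal: str,
    risk_score: float,
    phrase_fn: Optional[Callable[[str], str]] = None,
) -> Advisory:
    # The decision is fixed before any phrasing happens.
    escalate = decide_escalation(risk_score)
    fallback = template_text(signal, risk_score, escalate)
    if phrase_fn is None:
        return Advisory(escalate, fallback, "template")
    try:
        text = phrase_fn(fallback)  # the LLM only rewords the template text
    except Exception:
        return Advisory(escalate, fallback, "template")
    if not text or not text.strip():
        return Advisory(escalate, fallback, "template")
    return Advisory(escalate, text, "llm")
```

Whatever `phrase_fn` returns, the `escalate` flag is unchanged, so a failed or misbehaving phrasing step degrades to the template instead of altering the decision.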

Template-first policy brief, LLM optional

Every figure in the brief is pulled live from the model and traceable; the LLM only rephrases — it invents no numbers.

Frozen contracts between the halves

A JSON contract (simulator → frontend) and a Redis contract (modules ↔ agents) let the two halves be built independently and stay decoupled.

Replay for the deployed demo, live engine for proof

Free tunnels buffer SSE, so the always-on dashboard replays a real captured session (reliable); the live engine runs in the demo video as proof the agent is real.

T4 · Intervention Timing Comparator

Every year of delay is locked in

Drag the slider to choose the year the FAA Current Plan hiring ramp begins. The CPC trajectory and the cost penalty update together — cost already locked in by waiting cannot be recovered by starting later.

Intervention start year: FY2027

Locked-in cost of starting in FY2027

$69.9B

Net cost of delay vs. acting in FY2026.

Cumulative cost gap through FY2036

$295.5B

Remaining cost still recoverable if you start in this year (locked-in cost + this = $365.4B total cost of doing nothing).

CPC trajectory if intervention starts FY2027

Locked-in cost of delay by start year — every year you wait, more is permanently lost

Takeaway: starting one year late costs roughly $69.9B in locked-in delay cost — money that no later acceleration can recover.
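The locked-in versus recoverable split is simple arithmetic, sketched here with the figures quoted on this page: the $365.4B total and $69.9B for FY2027 from this panel, and the rounded FY2028–FY2030 values from the T12 brief. The function names are illustrative, and FY2026 is set to zero because it is the reference start year.

```python
TOTAL_COST_OF_DOING_NOTHING_B = 365.4  # $B, from the slider panel

# Locked-in cost of delay by start year, in $B (FY2028-FY2030 are rounded brief figures).
LOCKED_IN_B = {2026: 0.0, 2027: 69.9, 2028: 139.0, 2029: 206.0, 2030: 271.0}


def split_cost(start_year):
    """Return (locked-in, still recoverable) cost in $B for a given start year."""
    locked = LOCKED_IN_B[start_year]
    return locked, round(TOTAL_COST_OF_DOING_NOTHING_B - locked, 1)


def marginal_cost_of_waiting(start_year):
    """Extra locked-in cost from starting in start_year instead of one year earlier."""
    return round(LOCKED_IN_B[start_year] - LOCKED_IN_B[start_year - 1], 1)


for year in sorted(LOCKED_IN_B):
    locked, recoverable = split_cost(year)
    print(f"Start FY{year}: locked in ${locked}B, still recoverable ${recoverable}B")
```

For FY2027 this reproduces the panel: $69.9B locked in plus $295.5B still recoverable equals the $365.4B total.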

T5 · Sensitivity (Tornado)

How far each input parameter swings between its low and high plausibility, anchored at baseline. Sorted by swing magnitude — the top row is the parameter the answer is most fragile to.

Parameter	Low	High	Baseline	Swing
cpc_attrition_rate	1112	6210	3273	156%
effectiveness_gap_sensitivity	1022	5334	3273	132%
attrition_amplify	1757	4568	3273	86%
ojt_pass_rate	2793	3507	3273	22%
certification_rate	3124	3211	3273	3%

Bars span low ↔ high parameter values; the vertical tick marks the baseline. Swing % is (high − low) ÷ |baseline|.
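The swing formula in the caption can be checked directly against the values shown in the chart. The sketch below uses those values as printed; the dictionary layout and function name are illustrative.

```python
def swing_pct(low, high, baseline):
    """Swing % = (high - low) / |baseline|, expressed as a percentage."""
    return 100.0 * (high - low) / abs(baseline)


# (low, high, baseline) as shown in the tornado chart.
TORNADO = {
    "cpc_attrition_rate": (1112, 6210, 3273),
    "effectiveness_gap_sensitivity": (1022, 5334, 3273),
    "attrition_amplify": (1757, 4568, 3273),
    "ojt_pass_rate": (2793, 3507, 3273),
    "certification_rate": (3124, 3211, 3273),
}

# Sort by swing magnitude so the most fragile parameter comes first.
ranked = sorted(TORNADO, key=lambda name: swing_pct(*TORNADO[name]), reverse=True)

for name in ranked:
    print(f"{name:32s} swing {swing_pct(*TORNADO[name]):.0f}%")
```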

T6 · Assumption Ledger

Every numeric input the model rests on, its source, and how confident we are in it. Click a column header to sort.

Parameter	Value	Source	Confidence
total_controllers_fy2025	13,164	GAO-26-107320	high
ojt_pass_rate	0.865	GAO FY2017-2022 funnel	high
academy_washout_rate	0.3	GAO (>30% FY2024)	high
fatigue_threshold	0.77	SAFTE-FAST (Hursh 2004)	high
annual_delay_cost_usd	33,000,000,000	FAA/Nextor 2019	high
cpc_fy2025	11,000	FAA CWP 2026-2028	medium
certification_median_years	3	FAA/National Academies	medium
staffing_floor	0.85	CRWG operational floor (§10.3)	medium
controller_delay_share_shutdown	0.61	A4A (Nov 2025 shutdown)	medium
cpc_attrition_rate_forward	0.1	calibrated to FAA plan (D14)	low
r1_effectiveness_gap_sensitivity	1.2	illustrative (D7)	low
r1_attrition_amplify	15	illustrative, Brazil study order-of-magnitude (D7)	low

cpc_attrition_rate_forward, r1_effectiveness_gap_sensitivity, r1_attrition_amplify are the illustrative / weakly-calibrated inputs — exactly the ones the T5 sensitivity analysis stress-tests and the T12 limitations flag. This is why the do-nothing collapse DEPTH and cost are order-of-magnitude, not point estimates.

T7 · Causal Loops

Why staffing shortfalls don't stay linear. Two reinforcing loops (R) compound the gap; one balancing loop (B) absorbs pressure by throttling throughput.

R1 · Burnout → Attrition

Reinforcing

1 Staffing gap → 2 Overtime / 6-day weeks → 3 Burnout → 4 Attrition → 1

Understaffing forces overtime; overtime fuels burnout; burnout drives attrition; attrition deepens the gap.

Loop strength is set by r1_attrition_amplify (=15) and r1_effectiveness_gap_sensitivity (=1.2) — both LOW confidence (see T6). This is why the collapse depth is order-of-magnitude (T12).
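To show why a reinforcing loop makes the decline accelerate rather than level off, here is a toy version of R1. The functional form is an assumption, not the model's actual equation (which this page does not give): attrition is scaled up linearly with the shortfall below the 0.85 staffing floor, using the r1_attrition_amplify value of 15. The inflow of 900 per year is a placeholder, and the FAA target of 12,563 is used only as a denominator for the staffing ratio; r1_effectiveness_gap_sensitivity is not modelled here.

```python
def r1_attrition_rate(base_rate, staffing_ratio, floor=0.85, amplify=15.0):
    """Hypothetical form: attrition grows linearly with the shortfall below the floor."""
    shortfall = max(0.0, floor - staffing_ratio)
    return min(1.0, base_rate * (1.0 + amplify * shortfall))


def run_r1(years=11, cpc0=11_000.0, target=12_563.0,
           base_rate=0.10, inflow=900.0, amplify=15.0):
    """Step the CPC stock forward one year at a time with gap-dependent attrition."""
    cpc, path = cpc0, [cpc0]
    for _ in range(years):
        rate = r1_attrition_rate(base_rate, cpc / target, amplify=amplify)
        cpc = max(0.0, cpc * (1.0 - rate) + inflow)
        path.append(cpc)
    return path


with_loop = run_r1()
without_loop = run_r1(amplify=0.0)  # loop switched off: attrition stays at base_rate
print(f"End of horizon with R1:    {with_loop[-1]:,.0f}")
print(f"End of horizon without R1: {without_loop[-1]:,.0f}")
```

With the loop switched off the stock drifts toward a steady state; with it on, each loss widens the gap and raises the next year's attrition, so the two paths diverge.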

R2 · Knowledge Drain

Reinforcing

1 Senior CPC departures → 2 OJT instructor capacity ↓ → 3 Time-to-CPC ↑ → 4 Net CPC growth ↓ → 1

Senior CPCs leave; OJT instructor capacity shrinks; trainees take longer to certify; the senior pool keeps thinning.

B1 · Load Shedding

Balancing

1 Workload pressure → 2 Ground stops / metering → 3 Throughput ↓ → 4 Pressure relieved → 1

When facilities run too hot, traffic is metered, flow programs activate, and ops are deferred — pressure drops, but at a cost.

The “cost” of shedding load is the delay quantified in T4 and the community exposure ranked in T11 — these loops generate the dollar figures; they are not decorative.

T8 · Model Validation

What this model gets wrong — and how we know

Validation is rendered straight from the model's own backtest. The monitor flags its own failure modes; it does not smooth them away.

Backtest · FY2020-FY2025

Predicted vs. actual CPC, year by year

✕ BREACH · MAPE 7.91% vs. threshold 5.0%

Y-axis starts at 8,000 to show the divergence — not zero-based.

Gap |Δ| = absolute headcount difference |Actual − Predicted| per year. The 7.91% in the breach badge is the MEAN absolute percentage error across years (MAPE).

Annual error %

Caption. The drift monitor catches the COVID structural break (bypass condition #5). A model that flags its own failure mode is more credible than a fake-perfect one.

Extreme-condition checks

Does the model behave sensibly under stress?

zero_hiring_decays
Expected: With hiring = 0 the workforce shrinks well below the start.
Observed: total 13,164 -> 0 over 15 yr
Pass
hiring_cannot_shortcut_certification
Expected: A 1-yr hiring flood does not raise next-year CPC (2-3 yr lag).
Observed: next-yr CPC: normal 10,524.1 vs flood 10,524.1
Pass
zero_attrition_never_shrinks_cpc
Expected: With attrition = 0 the CPC stock is non-decreasing.
Observed: CPC 11,000 -> 20,907, monotonic=True
Pass
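The three checks above can be expressed as ordinary assertions against any pipeline model. The sketch below runs them against a toy stand-in, not the TowerGuard model: hires sit in a fixed 3-year pipeline (certification_median_years), a 0.865 share certifies (ojt_pass_rate), and CPCs leave at 0.1 per year. The initial cohort size of 700 and the hiring numbers are placeholders.

```python
from collections import deque


def simulate(years, hires_per_year, cpc0=11_000.0, attrition=0.10,
             pass_rate=0.865, lag_years=3, pipeline0=700.0):
    """Toy pipeline: each hire cohort certifies after lag_years, scaled by pass_rate."""
    pipeline = deque([pipeline0] * lag_years)  # cohorts already in training, oldest first
    cpc, path = cpc0, [cpc0]
    for year in range(years):
        graduating = pipeline.popleft() * pass_rate
        hires = hires_per_year(year) if callable(hires_per_year) else hires_per_year
        pipeline.append(hires)
        cpc = cpc * (1.0 - attrition) + graduating
        path.append(cpc)
    return path


# zero_hiring_decays: with no hiring the certified stock ends well below its start.
zero_hiring = simulate(15, 0)
assert zero_hiring[-1] < 0.5 * zero_hiring[0]

# hiring_cannot_shortcut_certification: a one-year flood leaves next-year CPC unchanged.
normal = simulate(5, 1500)
flood = simulate(5, lambda year: 5000 if year == 0 else 1500)
assert normal[1] == flood[1]
assert flood[4] > normal[4]  # the flood cohort only shows up after the lag

# zero_attrition_never_shrinks_cpc: with attrition = 0 the stock is non-decreasing.
no_attrition = simulate(10, 1500, attrition=0.0)
assert all(later >= earlier for earlier, later in zip(no_attrition, no_attrition[1:]))

print("all extreme-condition checks passed")
```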

Historical reproduction

Re-running the model against known history.

INDEPENDENT = not fitted, true out-of-sample test · SEMI-INDEPENDENT = partially fitted · IN-SAMPLE = fitted, internal-consistency check only.

Case	Type	Predicted	Actual	|err| %	Note
backcast_fy2015_to_2025_total	IN-SAMPLE (tight fit)	13,166	13,164	0.01%	IN-SAMPLE consistency check: hiring (~1,326/yr) was SOLVED to reproduce this total (D11), so a tight fit confirms internal consistency, not predictive skill. The independent legs are the composition (below) and the FY2020-2025 backtest.
backcast_fy2025_cpc_composition	SEMI-INDEPENDENT	10,982	11,000	0.16%	SEMI-INDEPENDENT: hiring was solved for the TOTAL, not the CPC/Dev split, so reproducing the ~11,000 CPC composition is a meaningful check (OJT washout, D13, is what pinned it down).
forward_developmental_checkpoint_fy2026	INDEPENDENT	2,661	3,000	11.29%	Independent (not fitted), order-of-magnitude only: FAA/Reuters Apr-2026 reports ~4,000 'in training' = Dev + ~1,000 CPC-IT, so the Dev-only comparator is ~3,000 (derived). The model lands in the right ballpark for the in-training surge.

Method note. Validated as a strategic system-dynamics model (behaviour reproduction, extreme-condition tests, out-of-sample backtest, sensitivity; see the `sensitivity` block), NOT as a point-prediction accuracy score. Policy models must represent uncertainty, not certainty.

Face validity. The do-nothing collapse and the 'hiring grows headcount but certified controllers barely move' dynamic match the qualitative warnings of GAO-26-107320 and the National Academies/TRB (Jun 2025); this is directionally consistent with independent expert assessment.

T9 · Lifecycle & Governance

How the model is monitored, overruled, and retired

TowerGuard's responsible-AI framework — monitored, overrulable, and honest about its own limits.

Responsible AI — at a glance

AI informs · humans decide

Self-monitoring drift: 7.91% 🟡

5 explicit MODEL-VOID conditions

Every output versioned + uncertainty-banded

No cherry-picking — full comparison context required

Bias & framing: shows BOTH targets, no cherry-picking

Freshness

Monitor

7.91% / 5% threshold

Basis: Mean absolute CPC error of the FY2020-2025 out-of-sample backtest (N-eval). Green <=5%, yellow <=10%, red >10%.

Drift detection

Triggered

Method: Compare the model's CPC projection against each new FAA CWP actual; |predicted - actual| / actual is the drift.

current 7.91% / threshold 5%

Same figure as the T8 backtest MAPE — the model monitors its own out-of-sample validation error; the COVID structural break is what drives it.

On trigger → Recalibrate against the latest CWP and flag bypass condition #5 (structural break) — the FY2020-2025 divergence is the COVID era.
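The drift metric and the colour bands described in this section can be sketched as follows. The thresholds (green <=5%, yellow <=10%, red >10%) and the |predicted - actual| / actual formula are taken from the text; the function names are illustrative, and the sample inputs in the last lines are made-up numbers, not the backtest data.

```python
def mape(predicted, actual):
    """Mean absolute percentage error: mean of |predicted - actual| / actual, in percent."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and the same length")
    errors = [abs(p - a) / a for p, a in zip(predicted, actual)]
    return 100.0 * sum(errors) / len(errors)


def drift_status(mape_pct):
    """Map a drift percentage onto the monitor's colour bands."""
    if mape_pct <= 5.0:
        return "green"
    if mape_pct <= 10.0:
        return "yellow"
    return "red"


def drift_triggered(mape_pct, threshold_pct=5.0):
    """The monitor fires when drift exceeds the threshold."""
    return mape_pct > threshold_pct


# The reported backtest figure falls in the yellow band and triggers recalibration.
print(drift_status(7.91), drift_triggered(7.91))

# Made-up example series, only to show the call shape.
print(f"{mape([110.0, 90.0], [100.0, 100.0]):.2f}%")
```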

Human-in-the-Loop

AI informs. Humans decide. The boundary is not negotiable.

AI informs

Projects the workforce pipeline across scenarios (with confidence bands)
Quantifies the cost of delaying an intervention
Generates the policy brief and explains the feedback-loop dynamics

Human decides

How many controllers to hire and at what pace
Where to allocate staff across facilities
What level of safety risk is acceptable
Whether and when to act on the projection

Two-review cycle required

Changing any calibration parameter (requires a source citation + a confidence rating before it is accepted)
Publishing a brief that feeds an appropriations decision (model owner + domain reviewer sign-off)

Bypass conditions (model void)

Governance

01 Parameter changes require a source citation and a confidence rating.
02 Every output is tagged with the model version and calibration date.
03 A single scenario cannot be exported without its full comparison context (prevents cherry-picking).
04 Safety outputs always carry uncertainty bands and the relative-risk disclaimer; both the FAA and NATCA targets are always shown.
05 Bias & framing: both the FAA (12,563) and NATCA (14,633) targets are always shown (the model endorses neither); a scenario cannot be exported without its full comparison context (no cherry-picking); known data undercounts are flagged (e.g. BTS excludes FedEx, understating the Memphis cargo hub). The model quantifies RELATIVE risk; it never sets what level of risk is acceptable.

Changelog

1.0 · 2026-06-17
Initial release. Calibrated to GAO-26-107320 (FY2015-2025) and FAA CWP 2025-2028 / 2026-2028; validated with the FY2020-2025 backtest.

T10 · Stakeholders & who bears the risk

Exposure and cost are not evenly shared

Stakeholder	What's at stake	Where in this tool
FAA / agency leadership	Owns the hiring plan: how many to hire, at what pace.	T2, T4
Congress / appropriators	Funds the training pipeline; the cost of waiting lands on the budget.	T4, T12
Air traffic controllers	Bear the burnout, overtime, and fatigue risk of understaffing.	T13, safety view
Travelers & airlines	Pay the delay cost when sectors are metered.	T4, T11
Affected metro communities	Exposure is concentrated, not national-average.	T11
The flying public (safety)	The ultimate stake: relative fatigue-error risk rises to ~3.6x on the do-nothing path.	T2 safety

Exposure and cost are not evenly shared — the tool names who bears each, and which decision each informs.

T11 · Community Exposure

Who gets hurt first

National averages hide that the staffing gap is concentrated. The ranking below reflects where understaffing meets traffic — exposure tracks the gap, not the volume.

Most exposed

rank #1

New York · N90

JFK · LGA · EWR

Staffed

72%

Exposure

1.00

Least exposed

rank #6

Chicago · C90

ORD

Staffed

107%

Exposure

0.00

Facility ranking

FAA Core-30 volume share · 25.2%

Rank	Metro · Facility	Airports	Staffed	Exposure	NAS delay cost	NAS delay minutes
#1	New York N90	JFK · LGA · EWR	72%	1.00	$219M	2,174,255
#2	San Francisco Bay NCT	SFO	80%	0.59	$111M	1,102,917
#3	Atlanta A80	ATL	78%	0.34	$77M	763,292
#4	Seattle S46	SEA	67%	0.32	$41M	404,312
#5	Memphis M03	MEM	71%	0.16	$5M	49,014
#6	Chicago C90	ORD	107%	0.00	$110M	1,088,047

About the NAS delay column. NAS-category delay already on the table — an upper bound, not a staffing-attributable cost. Many factors (weather, equipment, volume) contribute. We report it so the headline ranking is not confused with a dollar attribution.

Methodology

Exposure = max(0, 1 - staffing%) x operations share — two real facts (National Academies Table 2-6 staffing, FAA OPSNET FY2024 ops) ranked relatively. The dollar is the airport's CY2024 NAS-category delay-minutes (BTS) x $100.76/min (A4A block-time). Unit = the governing ATC facility/metro; N90 sums JFK+LGA+EWR.
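The two calculations in this methodology can be written out as below. The exposure formula, the $100.76/min rate, and the delay-minute totals are from this page. Two things are assumptions: the operations shares are placeholders (the page does not list them), and dividing by the top score is only one plausible reading of "ranked relatively".

```python
COST_PER_DELAY_MINUTE_USD = 100.76  # A4A block-time cost per minute, as cited above


def raw_exposure(staffed_fraction, ops_share):
    """Exposure = max(0, 1 - staffing%) x operations share."""
    return max(0.0, 1.0 - staffed_fraction) * ops_share


def relative_exposure(raw_scores):
    """Assumption: scale so the most exposed facility scores 1.00."""
    top = max(raw_scores.values())
    return {name: (score / top if top > 0 else 0.0) for name, score in raw_scores.items()}


def nas_delay_cost_usd(delay_minutes):
    """Upper-bound dollar figure: NAS-category delay minutes x cost per minute."""
    return delay_minutes * COST_PER_DELAY_MINUTE_USD


# Staffing levels are from the ranking; the ops shares are HYPOTHETICAL placeholders.
facilities = {
    "N90": {"staffed": 0.72, "ops_share": 0.30},
    "C90": {"staffed": 1.07, "ops_share": 0.20},
}
raw = {name: raw_exposure(f["staffed"], f["ops_share"]) for name, f in facilities.items()}
print(relative_exposure(raw))

# Delay cost for N90 from its CY2024 delay minutes, in $M.
print(round(nas_delay_cost_usd(2_174_255) / 1e6))
```

A facility staffed above 100%, such as C90 at 107%, scores zero exposure regardless of its traffic volume, which is why it ranks last despite a large NAS delay total.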

Caveats

The NAS-category delay cost is an UPPER BOUND on the staffing-relevant cost, NOT a 'staffing cost': BTS 'NAS' bundles non-extreme weather, volume and equipment (the staffing-controllable slice is the volume portion — FAA Core-30 FY2024 ~25%). BTS covers reporting carriers, arriving flights only (~70-80%), so totals undercount; MEM is passenger-only (FedEx does not report), understating the cargo hub. NCT delay = SFO only. Facility ops are FY2024 (all traffic); delay-minutes are CY2024 (arrivals) — not divided against each other. This is exposure attribution, not an independent regional economic model.

T12 · Policy Brief

Executive summary for decision-makers

One-page brief drawn from the model output. Intended for budget and policy staff, not operators.

TowerGuard Policy Brief

FY2025–FY2036 workforce projection · decision-support only

−78%

CPC by FY2036 (do-nothing)

$365B

vs the current plan, 2026-2036

3.6×

relative fatigue-error risk vs rested

+$271B

if the plan starts in 2030, not 2026

Executive summary

On the do-nothing path, the certified controller (CPC) workforce is projected to fall from 11,000 to about 2,412 by FY2036 (~78%), while staffing stays below the 85% safety floor for the entire projection. Executing the current FAA plan instead of doing nothing avoids on the order of $365B in controller-attributable delay and overtime costs over the decade. The cost of waiting is front-loaded and largely irreversible: certification takes 2-3 years, so a controller hired today cannot relieve the line before then.

Cost of delay

Net cost of delaying the current plan, relative to starting in 2026: 2027 → +$70B, 2028 → +$139B, 2029 → +$206B, 2030 → +$271B.

Key findings

1 Do-nothing collapse: CPCs fall ~78% by FY2036, driven by a reinforcing burnout-attrition loop.
2 Cost of doing nothing: ~$365B versus the current plan, up to ~$500B versus accelerated hiring (controller-attributable delay + overtime, FY2026-2036).
3 Safety (the cost money can't buy back): doing nothing pushes the relative fatigue-error risk to ~3.6x the rested baseline, and CPCs stay below the 85% floor for all 11 projected years.
4 Front-loaded delay: starting the plan one year late locks in ~$70B; four years late ~$271B.

Recommendations

1 Begin or sustain the hiring ramp now: the certification lag means delay compounds for years before it can be reversed.
2 Plan against the confidence ranges, not the point estimates (the bands widen with the least-calibrated assumptions).
3 Evaluate outcomes against BOTH the FAA (12,563) and NATCA (14,633) staffing targets; this model endorses neither.

Limitations

Strategic, aggregate model — not a facility-level or accident-prediction tool. Safety outputs are RELATIVE risk indicators with wide uncertainty, not accident forecasts. The burnout-loop coefficients are illustrative (literature-anchored, not yet calibrated), so collapse depth and the do-nothing cost are order-of-magnitude. The 'retirement cliff' framing in some source material is outdated for the 2026-2036 horizon.

T13 · Live Validation

Real-time model check against live traffic

Recorded from the live pipeline: real OpenSky-style traffic → deterministic risk modules → Claude phrases the advisory & briefing (it never decides escalation) → controller confirms.

Live pipeline

Replay · recorded real session

Modules score (deterministic risk) → Claude phrases (claude-opus-4-8) → Advisory ready → Controller confirms

Stage 2 timing reflects the measured ~2s median phrasing latency (Claude Opus 4.8) — this is a pipeline visualization of a recorded session, not a live API call.

Module signals

Traffic Density: UNKNOWN (awaiting…)
Conflict Geometry: UNKNOWN (awaiting…)
Workload Index: UNKNOWN (awaiting…)

Live advisory feed

No advisories yet — awaiting events.

Relief briefing

Briefing renders here when Claude drafts one for the most recent advisory.

A companion system streams live OpenSky position data into deterministic risk modules that score congestion, weather-impacted sectors, and staffing pressure in real time.

When the modules detect a significant deviation from the projected trajectory, Claude drafts the advisory phrasing. A controller reviews and confirms it before it is logged.

This closes the loop: the annual model informs the budget cycle; the real-time model checks whether the world is unfolding the way the annual model assumed.

The AI phrases; it never decides. Escalation is determined by the deterministic modules and confirmed by a human controller.

T14 · Tools & Data

Tools

Python (system-dynamics + Monte Carlo) · Anthropic Claude (claude-opus-4-8) via the Anthropic SDK · FastAPI + SSE · Redis · OpenSky API · pypdf · React/Vite/Tailwind/shadcn (Lovable) · Leaflet · pytest (304 tests) · cloudflared.

Data

GAO-26-107320 · FAA Controller Workforce Plan 2025-2028/2026-2028 · BTS NAS delay (CY2024) · A4A (Nov 2025 shutdown) · SAFTE-FAST/Hursh 2004 · National Academies/TRB (Jun 2025) · FAA/Nextor 2019 · OpenSky Network.

Every model parameter's source and confidence is in T6.


Bypass conditions (model void), from T9:
01 During a government shutdown
02 Individual facility staffing decisions
03 Extrapolation beyond the calibration range
04 Setting an 'acceptable' risk level
05 Immediately after a structural break
