TowerGuard — The Cost of Doing Nothing

Workforce projection FY2025–FY2036 · decision-support, not for operational use

model 1.0 · calibrated 2026-06-17
Monitor drift: 7.91% / 5% threshold

T1 · Scope

Scope

  • A strategic, national-aggregate decision-support simulator for the US air-traffic-controller (CPC) staffing crisis, FY2025–FY2036.
  • Projects the workforce across 5 policy scenarios and quantifies the cost and safety risk of delaying intervention.

Key Assumptions

  • Calibrated to GAO-26-107320 and the FAA Controller Workforce Plan; every parameter is listed with its source and a confidence rating in T6.
  • The burnout-loop coefficients are illustrative (low confidence), so the collapse depth and dollar figures are order-of-magnitude, not point predictions. (See T5 sensitivity, T6 ledger.)

Non-goals

  • NOT a facility-level scheduler — use FAA CRWG/AFN tools for a single tower.
  • NOT an accident-prediction tool — safety outputs are RELATIVE risk, not probabilities.
  • NOT for operational use — decision-support only.
  • Does NOT decide policy or set “acceptable” risk — humans decide (see T9).
  • Endorses NEITHER the FAA nor the NATCA staffing target.

T2 · Scenario Dashboard

Certified controllers (CPC), FY2025–FY2036

Five policy scenarios shown together. Confidence band (P10–P90) is drawn for the highlighted scenario so projections are never read as point estimates.

Highlight band
Legend: Actual CPC (history, FY2020–FY2025) · Projection start (FY2025) · FAA target (12,563) · NATCA target (14,633) · P10–P90 band (Current Plan (FAA CWP))

Peak safety risk index

1.8×

for Current Plan (FAA CWP)

Note. Risk_index is a RELATIVE multiplier (1.0 = rested baseline), NOT an accident probability (§9.2). Corroborating context (not a fitted relationship): FY2023 saw 19 serious near-misses, a 7-year high, during the staffing crisis.

Months below 85% staffing

132

over the 12-year horizon

Cumulative cost of delay

$154.8B

FY2025–FY2036, Current Plan (FAA CWP)

T3 · Architecture & Design

Two halves, one loop

TowerGuard is two decoupled halves that close a loop: a simulator that projects the cost of doing nothing, and a live system that checks whether reality is unfolding the way the model assumed.

Simulator: "What will doing nothing cost?"
models/ (system dynamics + Monte Carlo) → scenario_results.json (frozen JSON contract) → T2–T12 panels

Live Validation (closes the loop): "Is reality unfolding as the model assumed?"
OpenSky-style traffic → modules/ (3 deterministic risk signals) → Redis (frozen Redis contract) → agents/Claude (phrases only, never decides) → T13

Two frozen contracts (the JSON contract and the Redis contract) keep the halves independent and testable — 304 tests guard the boundaries.

Design tradeoffs

System dynamics, not machine learning

Policy needs an interpretable causal mechanism and the data is sparse (~10 annual points). ML would be a data-hungry black box that can't explain why the workforce collapses.

Monte Carlo confidence bands, not point estimates

A policy model must represent uncertainty, not fake precision — we show P10–P90, not a single confident line.
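A minimal sketch of how a P10–P90 band can be read off Monte Carlo draws. It assumes a toy one-stock decay model, not TowerGuard's actual engine. The function names, the uniform attrition range, and the fixed seed are illustrative assumptions; only the 11,000 starting CPC count comes from the T6 ledger.

```python
import random
import statistics


def project_cpc(cpc0: float, attrition: float, years: int) -> float:
    """Toy projection: the stock decays at a constant annual attrition rate."""
    return cpc0 * (1.0 - attrition) ** years


def p10_p90_band(cpc0: float, years: int, draws: int = 2000, seed: int = 7):
    """Sample an uncertain attrition rate; return (P10, P50, P90) of the outcome."""
    rng = random.Random(seed)  # fixed seed so the band is reproducible
    outcomes = [
        # Hypothetical plausibility range for attrition, not a TowerGuard parameter.
        project_cpc(cpc0, rng.uniform(0.05, 0.15), years)
        for _ in range(draws)
    ]
    deciles = statistics.quantiles(outcomes, n=10)  # 9 cut points: P10 ... P90
    return deciles[0], deciles[4], deciles[8]


p10, p50, p90 = p10_p90_band(cpc0=11_000, years=11)
print(f"P10={p10:,.0f}  P50={p50:,.0f}  P90={p90:,.0f}")
```

Reporting all three values, rather than P50 alone, is what keeps the projection from being read as a single confident line.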

The LLM augments, it never decides (Option B)

The deterministic engine owns every escalation decision (auditable, testable); Claude only phrases the human-facing text, with a template fallback. Safety-critical AI must not 'drive'.
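The split between deciding and phrasing can be sketched as follows. This is an illustrative pattern, not TowerGuard's code: the `Advisory` type, `build_advisory`, and the 0.8 threshold are hypothetical names and values, and the `phrase` callable stands in for whatever LLM call the agent makes.

```python
from dataclasses import dataclass
from typing import Callable, Optional

ESCALATION_THRESHOLD = 0.8  # hypothetical cut-off, not a TowerGuard value


@dataclass(frozen=True)
class Advisory:
    escalate: bool   # set by the deterministic rule only
    text: str        # human-facing wording; may come from an LLM
    phrased_by: str  # "llm" or "template"


def template_text(signal: str, score: float, escalate: bool) -> str:
    """Deterministic wording used when no LLM phrasing is available."""
    action = "ESCALATE" if escalate else "monitor"
    return f"{signal} score {score:.2f}: {action}."


def build_advisory(signal: str, score: float,
                   phrase: Optional[Callable[[str], str]] = None) -> Advisory:
    # The decision is made before the phraser is ever called,
    # so the phraser cannot influence it.
    escalate = score >= ESCALATION_THRESHOLD
    fallback = template_text(signal, score, escalate)
    if phrase is None:
        return Advisory(escalate, fallback, "template")
    try:
        text = phrase(fallback)
    except Exception:
        # Any phrasing failure falls back to the template; the decision is unchanged.
        return Advisory(escalate, fallback, "template")
    return Advisory(escalate, text, "llm")
```

Because `escalate` is computed before `phrase` runs, an outage or an odd LLM response can change only the wording, never the escalation decision.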

Template-first policy brief, LLM optional

Every figure in the brief is pulled live from the model and traceable; the LLM only rephrases — it invents no numbers.

Frozen contracts between the halves

A JSON contract (simulator → frontend) and a Redis contract (modules ↔ agents) let the two halves be built independently and stay decoupled.

Replay for the deployed demo, live engine for proof

Free tunnels buffer server-sent events (SSE), which breaks live streaming, so the always-on dashboard reliably replays a real captured session; the live engine runs in the demo video as proof that the agent is real.

T4 · Intervention Timing Comparator

Every year of delay is locked in

Drag the slider to choose the year the FAA Current Plan hiring ramp begins. The CPC trajectory and the cost penalty update together — cost already locked in by waiting cannot be recovered by starting later.

FY2027

Locked-in cost of starting in FY2027

$69.9B

Net cost of delay vs. acting in FY2026.

Cumulative cost gap through FY2036

$295.5B

Remaining cost still recoverable if you start in this year (locked-in cost + this = $365.4B total cost of doing nothing).

CPC trajectory if intervention starts FY2027

Locked-in cost of delay by start year — every year you wait, more is permanently lost

Takeaway: starting one year late costs roughly $69.9B in locked-in delay cost — money that no later acceleration can recover.
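The arithmetic behind the comparator can be sketched in a few lines. The locked-in figures are the rounded cost-of-delay values quoted in the T12 brief, and the 365.4 and 69.9 inputs are the panel values above; the function names are illustrative assumptions, not the simulator's API.

```python
# Net cost of delay ($B) by start year, relative to starting in FY2026 (T12 brief).
LOCKED_IN_BY_START_YEAR = {2027: 70, 2028: 139, 2029: 206, 2030: 271}


def marginal_cost_per_year(locked_in: dict[int, float]) -> dict[int, float]:
    """Extra cost locked in by each additional year of delay."""
    marginal = {}
    previous = 0  # starting in FY2026 is the zero-delay reference
    for year in sorted(locked_in):
        marginal[year] = locked_in[year] - previous
        previous = locked_in[year]
    return marginal


def recoverable(total_do_nothing: float, locked_in: float) -> float:
    """Cost still avoidable once `locked_in` has already been lost."""
    return total_do_nothing - locked_in


print(marginal_cost_per_year(LOCKED_IN_BY_START_YEAR))
print(round(recoverable(365.4, 69.9), 1))
```

With these inputs the per-year increments are 70, 69, 67 and 65 ($B), so the largest single increment belongs to the first year of delay, and the recoverable remainder for an FY2027 start is the $295.5B shown in the panel.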

T5 · Sensitivity (Tornado)

How far the output swings as each input parameter moves between its low and high plausible values, anchored at baseline. Sorted by swing magnitude: the top row is the parameter the answer is most fragile to.

Parameter | Low | High | Baseline | Swing
cpc_attrition_rate | 1112 | 6210 | 3273 | 156%
effectiveness_gap_sensitivity | 1022 | 5334 | 3273 | 132%
attrition_amplify | 1757 | 4568 | 3273 | 86%
ojt_pass_rate | 2793 | 3507 | 3273 | 22%
certification_rate | 3124 | 3211 | 3273 | 3%

Bars span low ↔ high parameter values; the vertical tick marks the baseline. Swing % is (high − low) ÷ |baseline|.
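The swing formula can be checked against the values shown in this panel. This sketch assumes the three numbers listed per parameter are low, high and baseline in that order; the `swing_pct` helper name is illustrative.

```python
def swing_pct(low: float, high: float, baseline: float) -> float:
    """Swing % = (high - low) / |baseline|, expressed in percent."""
    return (high - low) / abs(baseline) * 100.0


# (low, high, baseline) per parameter, as listed in the T5 panel.
TORNADO = {
    "cpc_attrition_rate": (1112, 6210, 3273),
    "effectiveness_gap_sensitivity": (1022, 5334, 3273),
    "attrition_amplify": (1757, 4568, 3273),
    "ojt_pass_rate": (2793, 3507, 3273),
    "certification_rate": (3124, 3211, 3273),
}

# Rank by swing magnitude, largest first, as the tornado chart does.
ranked = sorted(TORNADO, key=lambda name: swing_pct(*TORNADO[name]), reverse=True)
for name in ranked:
    print(f"{name:32s} swing {swing_pct(*TORNADO[name]):.0f}%")
```

Under that column assumption, the computed values round to the swing percentages displayed for each row.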

T6 · Assumption Ledger

Every numeric input the model rests on, its source, and how confident we are in it. Click a column header to sort.

Parameter | Value | Source | Confidence
total_controllers_fy2025 | 13,164 | GAO-26-107320 | high
ojt_pass_rate | 0.865 | GAO FY2017-2022 funnel | high
academy_washout_rate | 0.3 | GAO (>30% FY2024) | high
fatigue_threshold | 0.77 | SAFTE-FAST (Hursh 2004) | high
annual_delay_cost_usd | 33,000,000,000 | FAA/Nextor 2019 | high
cpc_fy2025 | 11,000 | FAA CWP 2026-2028 | medium
certification_median_years | 3 | FAA/National Academies | medium
staffing_floor | 0.85 | CRWG operational floor (§10.3) | medium
controller_delay_share_shutdown | 0.61 | A4A (Nov 2025 shutdown) | medium
cpc_attrition_rate_forward | 0.1 | calibrated to FAA plan (D14) | low
r1_effectiveness_gap_sensitivity | 1.2 | illustrative (D7) | low
r1_attrition_amplify | 15 | illustrative, Brazil study order-of-magnitude (D7) | low

cpc_attrition_rate_forward, r1_effectiveness_gap_sensitivity, r1_attrition_amplify are the illustrative / weakly-calibrated inputs — exactly the ones the T5 sensitivity analysis stress-tests and the T12 limitations flag. This is why the do-nothing collapse DEPTH and cost are order-of-magnitude, not point estimates.

T7 · Causal Loops

Why staffing shortfalls don't stay linear. Two reinforcing loops (R) compound the gap; one balancing loop (B) absorbs pressure by throttling throughput.

R1 · Burnout → Attrition

Reinforcing
Staffing gap → Overtime / 6-day weeks → Burnout → Attrition → Staffing gap

Understaffing forces overtime; overtime fuels burnout; burnout drives attrition; attrition deepens the gap.

Loop strength is set by r1_attrition_amplify (=15) and r1_effectiveness_gap_sensitivity (=1.2) — both LOW confidence (see T6). This is why the collapse depth is order-of-magnitude (T12).

R2 · Knowledge Drain

Reinforcing
Senior CPC departures → OJT instructor capacity ↓ → Time-to-CPC ↑ → Net CPC growth ↓ → Senior CPC departures

Senior CPCs leave; OJT instructor capacity shrinks; trainees take longer to certify; the senior pool keeps thinning.

B1 · Load Shedding

Balancing
Workload pressure → Ground stops / metering → Throughput ↓ → Pressure relieved → Workload pressure

When facilities run too hot, traffic is metered, flow programs activate, and ops are deferred — pressure drops, but at a cost.

The “cost” of shedding load is the delay quantified in T4 and the community exposure ranked in T11 — these loops generate the dollar figures; they are not decorative.

T8 · Model Validation

What this model gets wrong — and how we know

Validation is rendered straight from the model's own backtest. The monitor flags its own failure modes; it does not smooth them away.

Backtest · FY2020-FY2025

Predicted vs. actual CPC, year by year

✕ BREACH · MAPE 7.91% vs. threshold 5.0%

Y-axis starts at 8,000 to show the divergence — not zero-based.

Gap |Δ| = absolute headcount difference |Actual − Predicted| per year. The 7.91% in the breach badge is the mean absolute percentage error (MAPE) across years.

Annual error %

Caption. The drift monitor catches the COVID structural break (bypass condition #5). A model that flags its own failure mode is more credible than a fake-perfect one.

Extreme-condition checks

Does the model behave sensibly under stress?

  • zero_hiring_decays

    Expected: With hiring = 0 the workforce shrinks well below the start.

    Observed: total 13,164 -> 0 over 15 yr

    Pass

  • hiring_cannot_shortcut_certification

    Expected: A 1-yr hiring flood does not raise next-year CPC (2-3 yr lag).

    Observed: next-yr CPC: normal 10,524.1 vs flood 10,524.1

    Pass

  • zero_attrition_never_shrinks_cpc

    Expected: With attrition = 0 the CPC stock is non-decreasing.

    Observed: CPC 11,000 -> 20,907, monotonic=True

    Pass
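The three checks above can be expressed as assertions against a toy stock-and-flow pipeline. This is a simplified stand-in, not the TowerGuard model: `simulate_cpc` and its even spread of existing trainees are assumptions, and the 2,164 developmental count is simply total minus CPC from T6. The other inputs (11,000 CPC, 0.865 pass rate, 3-year certification, 0.1 attrition, ~1,326 hires/yr) are figures quoted elsewhere on this page.

```python
def simulate_cpc(cpc0, dev0, hires_per_year, attrition, lag=3, pass_rate=0.865):
    """Toy pipeline: each hire waits `lag` years, then a pass_rate share certifies."""
    pipeline = [dev0 / lag] * lag  # existing trainees, spread evenly across cohorts
    cpc = cpc0
    path = [cpc]
    for hires in hires_per_year:
        graduates = pipeline.pop(0) * pass_rate  # oldest cohort certifies this year
        pipeline.append(hires)                   # new hires join the back of the queue
        cpc = cpc * (1.0 - attrition) + graduates
        path.append(cpc)
    return path


CPC0, DEV0 = 11_000, 13_164 - 11_000

# zero_hiring_decays: with no hiring, the certified stock shrinks well below the start.
no_hiring = simulate_cpc(CPC0, DEV0, [0] * 15, attrition=0.1)
assert no_hiring[-1] < 0.5 * no_hiring[0]

# hiring_cannot_shortcut_certification: a one-year flood leaves next-year CPC unchanged.
normal = simulate_cpc(CPC0, DEV0, [1_326] * 5, attrition=0.1)
flood = simulate_cpc(CPC0, DEV0, [10_000] + [1_326] * 4, attrition=0.1)
assert flood[1] == normal[1]
assert flood[4] > normal[4]  # the flood cohort only shows up after the lag

# zero_attrition_never_shrinks_cpc: with no attrition the stock is non-decreasing.
no_attrition = simulate_cpc(CPC0, DEV0, [1_326] * 11, attrition=0.0)
assert all(later >= earlier for earlier, later in zip(no_attrition, no_attrition[1:]))
```

The toy model will not reproduce the observed values listed above (it has no burnout loop, for example); it only illustrates the shape of each check.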

Historical reproduction

Re-running the model against known history.

INDEPENDENT = not fitted, true out-of-sample test · SEMI-INDEPENDENT = partially fitted · IN-SAMPLE = fitted, internal-consistency check only.

backcast_fy2015_to_2025_total · IN-SAMPLE · Tight fit
Predicted 13,166 · Actual 13,164 · |err| 0.01%
Note: IN-SAMPLE consistency check: hiring (~1,326/yr) was SOLVED to reproduce this total (D11), so a tight fit confirms internal consistency, not predictive skill. The independent legs are the composition (below) and the FY2020-2025 backtest.

backcast_fy2025_cpc_composition · SEMI-INDEPENDENT
Predicted 10,982 · Actual 11,000 · |err| 0.16%
Note: SEMI-INDEPENDENT: hiring was solved for the TOTAL, not the CPC/Dev split, so reproducing the ~11,000 CPC composition is a meaningful check (OJT washout, D13, is what pinned it down).

forward_developmental_checkpoint_fy2026 · INDEPENDENT
Predicted 2,661 · Actual 3,000 · |err| 11.29%
Note: Independent (not fitted), order-of-magnitude only: FAA/Reuters Apr-2026 reports ~4,000 'in training' = Dev + ~1,000 CPC-IT, so the Dev-only comparator is ~3,000 (derived). The model lands in the right ballpark for the in-training surge.

Method note. Validated as a strategic system-dynamics model (behaviour reproduction, extreme-condition tests, out-of-sample backtest, sensitivity; see the `sensitivity` block), NOT as a point-prediction accuracy score. Policy models must represent uncertainty, not certainty.

Face validity. The do-nothing collapse and the 'hiring grows headcount but certified controllers barely move' dynamic match the qualitative warnings of GAO-26-107320 and the National Academies/TRB (Jun 2025), which makes the model directionally consistent with independent expert assessment.

T9 · Lifecycle & Governance

How the model is monitored, overruled, and retired

TowerGuard's responsible-AI framework — monitored, overrulable, and honest about its own limits.

Responsible AI — at a glance

AI informs · humans decide
Self-monitoring drift: 7.91% 🟡
5 explicit MODEL-VOID conditions
Every output versioned + uncertainty-banded
No cherry-picking — full comparison context required
Bias & framing: shows BOTH targets, no cherry-picking

Freshness

Monitor

7.91% / 5% threshold

Basis: Mean absolute CPC error of the FY2020-2025 out-of-sample backtest (N-eval). Green <=5%, yellow <=10%, red >10%.

Drift detection

Triggered

Method: Compare the model's CPC projection against each new FAA CWP actual; |predicted - actual| / actual is the drift.

current 7.91% / threshold 5%

Same figure as the T8 backtest MAPE — the model monitors its own out-of-sample validation error; the COVID structural break is what drives it.

On trigger → Recalibrate against the latest CWP and flag bypass condition #5 (structural break) — the FY2020-2025 divergence is the COVID era.
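The drift rule described in this panel can be sketched as below, using the stated formula and the green/yellow/red bands. The function names and the two backtest pairs in the usage lines are hypothetical; the page does not list the year-by-year backtest values.

```python
def drift_pct(predicted: float, actual: float) -> float:
    """Per-observation drift: |predicted - actual| / actual, in percent."""
    return abs(predicted - actual) * 100.0 / actual


def mape(pairs) -> float:
    """Mean absolute percentage error over (predicted, actual) pairs."""
    return sum(drift_pct(p, a) for p, a in pairs) / len(pairs)


def drift_status(error_pct: float) -> str:
    """Green <= 5%, yellow <= 10%, red > 10% (bands from the Monitor panel)."""
    if error_pct <= 5.0:
        return "green"
    if error_pct <= 10.0:
        return "yellow"
    return "red"


# The reported backtest MAPE breaches the 5% threshold but stays under 10%.
print(drift_status(7.91))

# Hypothetical (predicted, actual) pairs, for illustration only.
print(mape([(10_500, 10_000), (9_000, 10_000)]))
```

A yellow status corresponds to the "Triggered" state above: the threshold is breached and recalibration is called for, but the error has not yet reached the red band.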

Human-in-the-Loop

AI informs. Humans decide. The boundary is not negotiable.

AI informs

  • Projects the workforce pipeline across scenarios (with confidence bands)
  • Quantifies the cost of delaying an intervention
  • Generates the policy brief and explains the feedback-loop dynamics

Human decides

  • How many controllers to hire and at what pace
  • Where to allocate staff across facilities
  • What level of safety risk is acceptable
  • Whether and when to act on the projection

Two-review cycle required

  • Changing any calibration parameter (requires a source citation + a confidence rating before it is accepted)
  • Publishing a brief that feeds an appropriations decision (model owner + domain reviewer sign-off)

Bypass conditions (model void)

Governance

  1. Parameter changes require a source citation and a confidence rating.
  2. Every output is tagged with the model version and calibration date.
  3. A single scenario cannot be exported without its full comparison context (prevents cherry-picking).
  4. Safety outputs always carry uncertainty bands and the relative-risk disclaimer; both the FAA and NATCA targets are always shown.
  5. Bias & framing: both the FAA (12,563) and NATCA (14,633) targets are always shown (the model endorses neither); a scenario cannot be exported without its full comparison context (no cherry-picking); known data undercounts are flagged (e.g. BTS excludes FedEx, understating the Memphis cargo hub). The model quantifies RELATIVE risk; it never sets what level of risk is acceptable.

Changelog

  • 1.0 · 2026-06-17

    Initial release. Calibrated to GAO-26-107320 (FY2015-2025) and FAA CWP 2025-2028 / 2026-2028; validated with the FY2020-2025 backtest.

T10 · Stakeholders & who bears the risk

Exposure and cost are not evenly shared

Stakeholder | What's at stake | Where in this tool
FAA / agency leadership | Owns the hiring plan: how many to hire, at what pace. | T2, T4
Congress / appropriators | Funds the training pipeline; the cost of waiting lands on the budget. | T4, T12
Air traffic controllers | Bear the burnout, overtime, and fatigue risk of understaffing. | T13, safety view
Travelers & airlines | Pay the delay cost when sectors are metered. | T4, T11
Affected metro communities | Exposure is concentrated, not national-average. | T11
The flying public (safety) | The ultimate stake: relative fatigue-error risk rises to ~3.6x on the do-nothing path. | T2 safety

Exposure and cost are not evenly shared — the tool names who bears each, and which decision each informs.

T11 · Community Exposure

Who gets hurt first

National averages hide that the staffing gap is concentrated. The ranking below reflects where understaffing meets traffic: volume alone creates no exposure, so a fully staffed facility scores zero however busy it is.

Most exposed: rank #1 · New York · N90 (JFK · LGA · EWR) · Staffed 72% · Exposure 1.00

Least exposed: rank #6 · Chicago · C90 (ORD) · Staffed 107% · Exposure 0.00

Facility ranking

FAA Core-30 volume share · 25.2%

Rank | Metro | Facility | Airports | Staffed | Exposure | NAS delay cost | NAS delay minutes
#1 | New York | N90 | JFK · LGA · EWR | 72% | 1.00 | $219M | 2,174,255 min
#2 | San Francisco Bay | NCT | SFO | 80% | 0.59 | $111M | 1,102,917 min
#3 | Atlanta | A80 | ATL | 78% | 0.34 | $77M | 763,292 min
#4 | Seattle | S46 | SEA | 67% | 0.32 | $41M | 404,312 min
#5 | Memphis | M03 | MEM | 71% | 0.16 | $5M | 49,014 min
#6 | Chicago | C90 | ORD | 107% | 0.00 | $110M | 1,088,047 min

About the NAS delay column. NAS-category delay already on the table — an upper bound, not a staffing-attributable cost. Many factors (weather, equipment, volume) contribute. We report it so the headline ranking is not confused with a dollar attribution.

Methodology

Exposure = max(0, 1 - staffing%) x operations share — two real facts (National Academies Table 2-6 staffing, FAA OPSNET FY2024 ops) ranked relatively. The dollar is the airport's CY2024 NAS-category delay-minutes (BTS) x $100.76/min (A4A block-time). Unit = the governing ATC facility/metro; N90 sums JFK+LGA+EWR.
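The two formulas in this methodology can be sketched as follows. The delay minutes and the $100.76/min rate are the values quoted on this page. The operations shares in the usage lines are hypothetical placeholders, and dividing by the maximum to get a 0–1 score is an assumption that is consistent with, but not confirmed by, the 1.00 and 0.00 endpoints in the ranking.

```python
COST_PER_DELAY_MINUTE_USD = 100.76  # A4A block-time cost per minute


def raw_exposure(staffing_pct: float, ops_share: float) -> float:
    """max(0, 1 - staffing%) x operations share; zero once staffing reaches 100%."""
    return max(0.0, 1.0 - staffing_pct / 100.0) * ops_share


def relative_exposure(raw: dict[str, float]) -> dict[str, float]:
    """Scale raw scores so the most exposed facility is 1.0 (assumed normalisation)."""
    top = max(raw.values())
    return {name: (score / top if top > 0 else 0.0) for name, score in raw.items()}


def nas_delay_cost_usd(delay_minutes: int) -> float:
    """NAS-category delay minutes priced at the per-minute block-time cost."""
    return delay_minutes * COST_PER_DELAY_MINUTE_USD


# Delay dollars use the page's own minutes: N90 and C90.
print(f"N90 ${nas_delay_cost_usd(2_174_255) / 1e6:.0f}M")
print(f"C90 ${nas_delay_cost_usd(1_088_047) / 1e6:.0f}M")

# Hypothetical operations shares, for illustration only.
raw = {"N90": raw_exposure(72, 0.05), "C90": raw_exposure(107, 0.04)}
print(relative_exposure(raw))
```

The `max(0, ...)` clamp is why C90, staffed at 107%, scores zero exposure while still carrying a large NAS delay figure: the two columns measure different things.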

Caveats

The NAS-category delay cost is an UPPER BOUND on the staffing-relevant cost, NOT a 'staffing cost': BTS 'NAS' bundles non-extreme weather, volume and equipment (the staffing-controllable slice is the volume portion — FAA Core-30 FY2024 ~25%). BTS covers reporting carriers, arriving flights only (~70-80%), so totals undercount; MEM is passenger-only (FedEx does not report), understating the cargo hub. NCT delay = SFO only. Facility ops are FY2024 (all traffic); delay-minutes are CY2024 (arrivals) — not divided against each other. This is exposure attribution, not an independent regional economic model.

T12 · Policy Brief

Executive summary for decision-makers

One-page brief drawn from the model output. Intended for budget and policy staff, not operators.

TowerGuard Policy Brief

FY2025–FY2036 workforce projection · decision-support only

−78%

CPC by FY2036 (do-nothing)

$365B

vs the current plan, 2026-2036

3.6×

relative fatigue-error risk vs rested

+$271B

if the plan starts in 2030, not 2026

Executive summary

On the do-nothing path, the certified controller (CPC) workforce is projected to fall from 11,000 to about 2,412 by FY2036 (~78%), while staffing stays below the 85% safety floor for the entire projection. Executing the current FAA plan instead of doing nothing avoids on the order of $365B in controller-attributable delay and overtime costs over the decade. The cost of waiting is front-loaded and largely irreversible: the certification pipeline takes years, so a hire today reaches the line in 2-3 years.

Cost of delay

Net cost of delaying the current plan, relative to starting in 2026: 2027 → +$70B, 2028 → +$139B, 2029 → +$206B, 2030 → +$271B.

Key findings

  • 1Do-nothing collapse: CPCs fall ~78% by FY2036, driven by a reinforcing burnout-attrition loop.
  • 2Cost of doing nothing: ~$365B versus the current plan, up to ~$500B versus accelerated hiring (controller-attributable delay + overtime, FY2026-2036).
  • 3Safety (the cost money can't buy back): doing nothing pushes the relative fatigue-error risk to ~3.6x the rested baseline, and CPCs stay below the 85% floor for all 11 projected years.
  • 4Front-loaded delay: starting the plan one year late locks in ~$70B; four years late ~$271B.

Recommendations

  • 1Begin or sustain the hiring ramp now — the certification lag means delay compounds for years before it can be reversed.
  • 2Plan against the confidence ranges, not the point estimates (the bands widen with the least-calibrated assumptions).
  • 3Evaluate outcomes against BOTH the FAA (12,563) and NATCA (14,633) staffing targets; this model endorses neither.

Limitations

Strategic, aggregate model — not a facility-level or accident-prediction tool. Safety outputs are RELATIVE risk indicators with wide uncertainty, not accident forecasts. The burnout-loop coefficients are illustrative (literature-anchored, not yet calibrated), so collapse depth and the do-nothing cost are order-of-magnitude. The 'retirement cliff' framing in some source material is outdated for the 2026-2036 horizon.

T13 · Live Validation

Real-time model check against live traffic

Recorded from the live pipeline: real OpenSky-style traffic → deterministic risk modules → Claude phrases the advisory & briefing (it never decides escalation) → controller confirms.

Live pipeline

Replay · recorded real session

  1. Modules score (deterministic risk)
  2. Claude phrases (claude-opus-4-8)
  3. Advisory ready
  4. Controller confirms

Stage 2 timing reflects the measured ~2s median phrasing latency (Claude Opus 4.8) — this is a pipeline visualization of a recorded session, not a live API call.

Module signals

Traffic Density · Conflict Geometry · Workload Index (each currently UNKNOWN, awaiting events)

Live advisory feed

No advisories yet — awaiting events.

Relief briefing

Briefing renders here when Claude drafts one for the most recent advisory.

A companion system streams live OpenSky position data into deterministic risk modules that score congestion, weather-impacted sectors, and staffing pressure in real time.

When the modules detect a significant deviation from the projected trajectory, Claude drafts the advisory phrasing. A controller reviews and confirms it before it is logged.

This closes the loop: the annual model informs the budget cycle; the real-time model checks whether the world is unfolding the way the annual model assumed.

The AI phrases; it never decides. Escalation is determined by the deterministic modules and confirmed by a human controller.

T14 · Tools & Data

Tools

Python (system-dynamics + Monte Carlo) · Anthropic Claude (claude-opus-4-8) via the Anthropic SDK · FastAPI + SSE · Redis · OpenSky API · pypdf · React/Vite/Tailwind/shadcn (Lovable) · Leaflet · pytest (304 tests) · cloudflared.

Data

GAO-26-107320 · FAA Controller Workforce Plan 2025-2028/2026-2028 · BTS NAS delay (CY2024) · A4A (Nov 2025 shutdown) · SAFTE-FAST/Hursh 2004 · National Academies/TRB (Jun 2025) · FAA/Nextor 2019 · OpenSky Network.

Every model parameter's source and confidence is in T6.