Benchmarks

All numbers are backed by reproducible experiment scripts in the experiments/ directory.

Scope Notice

All recovery claims are restricted to programmatically induced failures. Real-world organic failures are untested. All timing measurements were conducted on CPU.

Recovery: 100% (25/25)
Prediction Accuracy: 97.5%
False Positives: 0
Overhead: <10% (models with 250K+ parameters)

Baseline Comparison

Protocol: 4 methods, each evaluated on 5 failure types x 5 seeds = 25 scenarios per method (100 runs in total).
Script: experiments/baseline_comparison.py

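| Method | Detection | Recovery | False Pos. | Avg. Time |
| --- | --- | --- | --- | --- |
| No Protection | 52.0% | 0.0% | 0 | 926 ms |
| Gradient Clipping | 20.0% | 0.0% | 0 | 1297 ms |
| Loss-Only Monitor | 80.0% | 80.0% | 0 | 1359 ms |
| Full ARC | 100% | 100% | 0 | 1722 ms |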

Detection Rate

[Bar chart: detection rate per method. No Protection 52%, Gradient Clipping 20%, Loss-Only Monitor 80%, Full ARC 100%.]

Key Finding

ARC's optimizer state monitoring catches silent failures (momentum buffer zeroing) that loss-only monitoring misses entirely.
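
To illustrate why a loss-only monitor can miss this kind of failure: zeroing a momentum buffer does not change the weights, so the loss computed at that step looks normal. The sketch below shows one way such a check could work, by comparing each buffer's L2 norm against its recent history. It is a minimal illustration, not ARC's implementation: the class name `MomentumMonitor`, the `collapse_ratio` threshold, and the warm-up length are all assumptions, and buffers are represented as plain lists of floats.

```python
import math


def buffer_norm(values):
    """L2 norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


class MomentumMonitor:
    """Flags momentum buffers whose norm collapses or becomes non-finite.

    Hypothetical helper for illustration only; not ARC's actual API.
    """

    def __init__(self, collapse_ratio=1e-3, warmup_steps=3):
        self.collapse_ratio = collapse_ratio  # assumed threshold
        self.warmup_steps = warmup_steps      # assumed warm-up length
        self.history = {}                     # buffer name -> past healthy norms

    def check(self, buffers):
        """Return the names of buffers that look corrupted at this step."""
        flagged = []
        for name, values in buffers.items():
            norm = buffer_norm(values)
            past = self.history.setdefault(name, [])
            if not math.isfinite(norm):
                flagged.append(name)
                continue
            if len(past) >= self.warmup_steps:
                baseline = sum(past) / len(past)
                if baseline > 0 and norm < self.collapse_ratio * baseline:
                    flagged.append(name)
                    continue  # keep the collapsed norm out of the baseline
            past.append(norm)
        return flagged


monitor = MomentumMonitor()
for _ in range(3):
    monitor.check({"fc.weight": [0.2, -0.1, 0.05]})  # healthy steps
print(monitor.check({"fc.weight": [0.0, 0.0, 0.0]}))  # ['fc.weight']
```

A relative threshold is used here rather than an exact zero test so that a buffer scaled down to near zero is also caught; whether that matches ARC's actual criterion is not stated in this document.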

Failure Prediction

Protocol: 4 architectures x 5 failure types x 5 seeds x 2 labels = 200 scenarios, 5-fold CV.
Script: experiments/prediction_200_v2.py

| Classifier | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| Logistic Regression (12 features) | 95.5% | 100% | 91.0% | 0.953 |
| MLP (12 features) | 97.5% | 100% | 95.0% | 0.974 |

v1 to v2 Improvement

| Metric | v1 (8 features, LogReg) | v2 (12 features, MLP) |
| --- | --- | --- |
| Accuracy | 86.5% | 97.5% |
| Recall | 73.0% | 95.0% |
| F1 | 0.844 | 0.974 |
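
The reported metrics can be cross-checked from confusion counts. The counts below are not taken from the experiment output; they are inferred under the assumption that the 200 scenarios split evenly into 100 failure and 100 healthy runs ("x 2 labels") and that no healthy run was flagged (0 false positives, which is also an assumption for v1, whose precision is not reported). Under those assumptions the standard formulas reproduce the published accuracy, recall, and F1.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


# (tp, fp, fn, tn): inferred counts, assuming 100 failure + 100 healthy runs.
confusions = {
    "v1 LogReg (8 features)": (73, 0, 27, 100),
    "v2 LogReg (12 features)": (91, 0, 9, 100),
    "v2 MLP (12 features)": (95, 0, 5, 100),
}

for name, counts in confusions.items():
    acc, prec, rec, f1 = classification_metrics(*counts)
    print(f"{name}: acc={acc:.3f} prec={prec:.3f} rec={rec:.3f} f1={f1:.3f}")
# v1 LogReg (8 features): acc=0.865 prec=1.000 rec=0.730 f1=0.844
# v2 LogReg (12 features): acc=0.955 prec=1.000 rec=0.910 f1=0.953
# v2 MLP (12 features): acc=0.975 prec=1.000 rec=0.950 f1=0.974
```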

Ablation Study

Protocol: 7 failure types x 5 seeds = 35 scenarios per config.
Script: experiments/ablation_experiment.py

| Configuration | Detection | Delta vs. Full ARC |
| --- | --- | --- |
| Full ARC (all components) | 85.7% | - |
| Without Weight Health | 85.7% | 0.0 pp |
| Without Gradient Monitoring | 85.7% | 0.0 pp |
| Without Loss Monitoring | 85.7% | 0.0 pp |
| Without Optimizer State | 71.4% | -14.3 pp |
| Loss Only (baseline) | 71.4% | -14.3 pp |

Interpretation

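Removing weight-health, gradient, or loss monitoring individually leaves detection unchanged at 85.7%, so these three components provide redundant coverage (defense-in-depth). Removing optimizer state monitoring drops detection to 71.4%, the same as the loss-only baseline, which makes it the one component that is uniquely valuable, specifically for silent failures like momentum buffer corruption.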
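
The arithmetic behind the ablation figures can be sketched as follows. The detected-scenario counts are inferred from the reported percentages (30/35 = 85.7%, 25/35 = 71.4%) rather than read from the script's output, so treat them as an assumption. The 5-scenario gap equals one failure type across all 5 seeds, which is consistent with a single failure type being visible only to optimizer state monitoring.

```python
SCENARIOS = 7 * 5  # 7 failure types x 5 seeds

# Detected counts inferred from the reported rates (assumption).
detected = {
    "Full ARC (all components)": 30,
    "Without Weight Health": 30,
    "Without Gradient Monitoring": 30,
    "Without Loss Monitoring": 30,
    "Without Optimizer State": 25,
    "Loss Only (baseline)": 25,
}

full_rate = detected["Full ARC (all components)"] / SCENARIOS
rates = {config: count / SCENARIOS for config, count in detected.items()}
# Delta in percentage points relative to the full configuration.
deltas = {config: (rate - full_rate) * 100 for config, rate in rates.items()}

for config in detected:
    print(f"{config:<30} {rates[config]:6.1%} {deltas[config]:+6.1f} pp")
```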

Overhead Analysis

Protocol: median of 100 iterations, timed with time.perf_counter on CPU.
Script: experiments/overhead_measurement.py
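
A timing harness matching this protocol might look like the sketch below. It is not the contents of experiments/overhead_measurement.py: the function name `median_time_ms`, the warm-up phase, and the stand-in workload are assumptions made for illustration.

```python
import statistics
import time


def median_time_ms(fn, iterations=100, warmup=10):
    """Median wall-clock time of fn() in milliseconds.

    The warm-up runs are an assumption; the protocol above only
    specifies the median of 100 iterations with time.perf_counter.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def stand_in_monitor_step():
    """Placeholder workload standing in for one monitoring step."""
    sum(i * i for i in range(10_000))


print(f"median: {median_time_ms(stand_in_monitor_step):.3f} ms")
```

The median is less sensitive than the mean to occasional slow iterations caused by scheduling or garbage collection, which is a common reason to prefer it for micro-benchmarks.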

Per-Component Timing (288K CNN)

| Component | Time (ms) | % of Total |
| --- | --- | --- |
| Gradient Norm | 0.12 | 9.0% |
| Weight Statistics | 1.06 | 76.9% |
| Loss Analysis | 0.01 | 0.6% |
| Checkpoint (amortized) | 0.13 | 9.6% |
| Forecasting | 0.06 | 4.1% |
| Total ARC | 1.38 | 100% |

Overhead by Model Scale

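| Model | Parameters | ARC (ms) | Baseline (ms) | Relative Overhead |
| --- | --- | --- | --- | --- |
| Small MLP | 50K | 0.86 | 1.45 | ~60% |
| Medium CNN | 288K | 1.38 | 14.17 | ~10% |
| Large CNN | 2.5M | 7.04 | 74.24 | ~9.5% |

Scaling Behavior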

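Overhead decreases with model size because forward/backward cost grows faster than monitoring cost, which is O(n) in the parameter count. The effect is strongest between the small and medium models (baseline time grows about 9.8x, ARC time about 1.6x); between the medium and large models both grow about 5x, so relative overhead levels off near 10%. GPU deployments may show lower relative overhead, but this is untested: all measurements here are on CPU.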
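
The relative overhead figures follow directly from the two timing columns. The sketch below assumes the relative figure is defined as ARC time divided by baseline step time, which matches the rounded values in the table; the helper name `relative_overhead` is illustrative.

```python
def relative_overhead(arc_ms, baseline_ms):
    """ARC monitoring time as a fraction of the baseline step time."""
    return arc_ms / baseline_ms


# (model, ARC ms, baseline ms) as reported in the table above.
measurements = [
    ("Small MLP", 0.86, 1.45),
    ("Medium CNN", 1.38, 14.17),
    ("Large CNN", 7.04, 74.24),
]

for model, arc_ms, baseline_ms in measurements:
    print(f"{model}: {relative_overhead(arc_ms, baseline_ms):.1%}")
# Small MLP: 59.3%
# Medium CNN: 9.7%
# Large CNN: 9.5%
```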

Large Model Stress Tests

Protocol: 20 stable steps, then a failure is injected and recovery is verified within 10 steps.
Script: experiments/validate_claims_phase2.py

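| Model | Params | Failure Type | Recovery | Rollbacks |
| --- | --- | --- | --- | --- |
| NanoGPT | 10M | LR Spike (50x) | Yes | 2 |
| ResNet-50 | 25.6M | Loss Singularity | Yes | 1 |
| YOLOv11 | 30M | Catastrophic LR | Yes | 3 |
| GPT-2 Small | 50M | NaN Bomb | Yes | 4 |
| SD-UNet | 60M | Gradient Apocalypse | Yes | 4 |
| Wide ResNet | 68M | Loss Supernova | Yes | 3 |
| Llama-style | 70M | Catastrophic LR | Yes | 5 |
| ViT-Base | 86M | Inf Nuke | Yes | 1 |
| GPT-2 Medium | 117M | NaN Bomb | Yes | 3 |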

Result: 9/9 models recovered successfully (100% recovery rate).
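
The shape of this protocol (stable phase, injected failure, rollback, recovery check) can be sketched on a toy problem. The example below is not the stress-test script: it minimizes f(w) = w^2 with plain gradient descent, injects a 50x learning-rate spike after 20 steps, and rolls back to the last good checkpoint when the loss jumps. The function name `run_scenario`, the 10x loss-jump threshold, and per-step checkpointing are all assumptions chosen to keep the sketch short.

```python
import math


def run_scenario(stable_steps=20, recovery_window=10, lr_spike=50.0):
    """Toy version of the protocol; returns (recovered, rollbacks)."""
    w, lr = 1.0, 0.1
    checkpoint = (w, lr)          # last known-good weights and learning rate
    last_good_loss = w * w
    loss_before_failure = None
    rollbacks = 0

    for step in range(stable_steps + recovery_window):
        if step == stable_steps:
            loss_before_failure = last_good_loss
            lr *= lr_spike        # injected failure: learning-rate spike

        w -= lr * 2.0 * w         # gradient step on f(w) = w^2
        loss = w * w

        # Assumed detection rule: non-finite loss or a 10x jump.
        if not math.isfinite(loss) or loss > 10.0 * last_good_loss:
            w, lr = checkpoint    # roll back weights and learning rate
            rollbacks += 1
            continue

        checkpoint, last_good_loss = (w, lr), loss

    # Recovered if training ended below the pre-failure loss.
    recovered = last_good_loss < loss_before_failure
    return recovered, rollbacks


print(run_scenario())  # (True, 1)
```

Restoring the learning rate together with the weights matters here: rolling back the weights alone would leave the spiked learning rate in place and trigger the same divergence on the next step.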

Known Limitations

| Claim | Status |
| --- | --- |
| GPU overhead | Untested - all measurements on CPU |
| Scale beyond 117M | Untested - behavior at 1B+ unknown |
| Organic failures | Untested - all failures injected programmatically |
| Distributed training | Untested - single-process only |
| Non-CIFAR datasets | Limited - prediction experiments primarily use CIFAR-10 |
| Data corruption | Not supported - no input data validation |