Benchmarks
All numbers are backed by reproducible experiment scripts in the experiments/ directory.
All recovery claims are restricted to programmatically induced failures. Real-world organic failures are untested. All timing measurements were conducted on CPU.
Baseline Comparison
Protocol: 4 methods, each evaluated on 5 failure types x 5 seeds = 25 scenarios per method (100 runs in total).
Script: experiments/baseline_comparison.py
| Method | Detection | Recovery | False Positives | Avg. Time |
|---|---|---|---|---|
| No Protection | 52.0% | 0.0% | 0 | 926ms |
| Gradient Clipping | 20.0% | 0.0% | 0 | 1297ms |
| Loss-Only Monitor | 80.0% | 80.0% | 0 | 1359ms |
| Full ARC | 100% | 100% | 0 | 1722ms |
Detection Rate
The 20-point gap between Loss-Only Monitor (80.0%) and Full ARC (100%) is covered by ARC's optimizer state monitoring, which catches silent failures (momentum buffer zeroing) that loss-only monitoring misses entirely.
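To illustrate why optimizer state is worth watching separately from the loss, here is a minimal sketch of a zeroed-buffer check. It is not ARC's implementation: `l2_norm`, `zeroed_buffers`, and the `eps` threshold are hypothetical names chosen for this example, and buffers are represented as plain lists of floats so the snippet runs without any dependencies. In PyTorch, the equivalent values for SGD with momentum would come from `optimizer.state[p]["momentum_buffer"]`.

```python
import math
from typing import Dict, List, Sequence


def l2_norm(values: Sequence[float]) -> float:
    """Euclidean norm of a flat sequence of floats."""
    return math.sqrt(sum(v * v for v in values))


def zeroed_buffers(
    previous: Dict[str, Sequence[float]],
    current: Dict[str, Sequence[float]],
    eps: float = 1e-12,
) -> List[str]:
    """Return names of buffers that carried signal last step but are ~zero now.

    Hypothetical helper (not ARC's API): a buffer is flagged when its norm
    was above eps at the previous step and is at or below eps at this step.
    """
    flagged = []
    for name, values in current.items():
        before = l2_norm(previous.get(name, ()))
        after = l2_norm(values)
        if before > eps and after <= eps:
            flagged.append(name)
    return flagged


# Illustrative snapshots of two momentum buffers on consecutive steps.
prev_state = {"layer1.weight": [0.20, -0.10, 0.05], "layer1.bias": [0.010]}
curr_state = {"layer1.weight": [0.0, 0.0, 0.0], "layer1.bias": [0.012]}

flagged = zeroed_buffers(prev_state, curr_state)
print(flagged)  # ['layer1.weight']
```

The loss can stay in a normal range for several steps after a buffer is zeroed, which is why a check on the loss alone may not fire.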
Failure Prediction
Protocol: 4 architectures x 5 failure types x 5 seeds x 2 labels = 200 scenarios, 5-fold CV.
Script: experiments/prediction_200_v2.py
| Classifier | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Logistic Regression (12 features) | 95.5% | 100% | 91.0% | 0.953 |
| MLP (12 features) | 97.5% | 100% | 95.0% | 0.974 |
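The four metrics in the table are mutually consistent, which can be checked from a confusion matrix. The sketch below assumes a balanced split of 100 failing and 100 healthy scenarios (inferred from "x 2 labels", not stated explicitly) and counts pooled across the 5 folds. The specific counts are derived from the reported precision and recall, not read from the experiment output.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Assumed counts: 100 positive (failure) and 100 negative (healthy) scenarios.
# 100% precision means no false positives; recall fixes the false negatives.
logreg = binary_metrics(tp=91, fp=0, fn=9, tn=100)
mlp = binary_metrics(tp=95, fp=0, fn=5, tn=100)

for name, m in (("LogReg", logreg), ("MLP", mlp)):
    print(f"{name}: acc={m['accuracy']:.3f} recall={m['recall']:.3f} f1={m['f1']:.3f}")
```

Under that assumption, every misclassification is a missed failure: 9 of 200 for logistic regression and 5 of 200 for the MLP.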
v1 to v2 Improvement
| Metric | v1 (8 feat, LogReg) | v2 (12 feat, MLP) |
|---|---|---|
| Accuracy | 86.5% | 97.5% |
| Recall | 73.0% | 95.0% |
| F1 | 0.844 | 0.974 |
Ablation Study
Protocol: 7 failure types x 5 seeds = 35 scenarios per config.
Script: experiments/ablation_experiment.py
| Configuration | Detection | Delta (pp) |
|---|---|---|
| Full ARC (all components) | 85.7% | - |
| Without Weight Health | 85.7% | 0.0 |
| Without Gradient Monitoring | 85.7% | 0.0 |
| Without Loss Monitoring | 85.7% | 0.0 |
| Without Optimizer State | 71.4% | -14.3 |
| Loss Only (baseline) | 71.4% | -14.3 |
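Removing any single one of weight, gradient, or loss monitoring leaves detection unchanged at 85.7%, so these components provide redundant coverage (defense-in-depth). Removing optimizer state monitoring drops detection to 71.4%, the same as the loss-only baseline, so it is the only component that is uniquely valuable in this study, specifically for silent failures like momentum buffer corruption. This study uses 7 failure types rather than the 5 in Baseline Comparison, so the 85.7% here is not directly comparable to the 100% reported there.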
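The percentages in the ablation table map onto whole scenario counts out of 35. The sketch below uses counts inferred from the reported rates (30 and 25 detected scenarios); these are assumptions for illustration, not values taken from experiments/ablation_experiment.py.

```python
SCENARIOS = 35  # 7 failure types x 5 seeds

# Detected-scenario counts inferred from the reported percentages (assumption).
detected = {
    "Full ARC (all components)": 30,
    "Without Optimizer State": 25,
    "Loss Only (baseline)": 25,
}

rates = {name: 100.0 * count / SCENARIOS for name, count in detected.items()}
full = rates["Full ARC (all components)"]

# Delta is a difference of two percentages, so it is in percentage points.
deltas = {name: rate - full for name, rate in rates.items()}

for name in detected:
    print(f"{name}: {rates[name]:.1f}% ({deltas[name]:+.1f} pp)")
```

A drop of 5 scenarios equals one failure type across all 5 seeds, which is consistent with optimizer state monitoring being the only detector for one of the seven failure types.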
Overhead Analysis
Protocol: median of 100 iterations, timed with time.perf_counter on CPU.
Script: experiments/overhead_measurement.py
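A timing protocol of this shape can be sketched as follows. This is not the contents of experiments/overhead_measurement.py: `median_ms`, `toy_weight_statistics`, and the warm-up phase are assumptions made for the example, and the workload is a pure-Python stand-in for a real monitoring component.

```python
import statistics
import time
from typing import Callable

# Stand-in for a flattened weight tensor (hypothetical data, not from the repo).
WEIGHTS = [float(i % 7) for i in range(10_000)]


def toy_weight_statistics() -> tuple:
    """Mean and variance over WEIGHTS; a placeholder for a monitoring step."""
    mean = sum(WEIGHTS) / len(WEIGHTS)
    variance = sum((w - mean) ** 2 for w in WEIGHTS) / len(WEIGHTS)
    return mean, variance


def median_ms(fn: Callable[[], object], iterations: int = 100, warmup: int = 5) -> float:
    """Median wall-clock time of fn in milliseconds over `iterations` calls.

    The warm-up calls are an assumption of this sketch; they are discarded
    so that one-off setup costs do not enter the samples.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


elapsed_ms = median_ms(toy_weight_statistics)
print(f"median over 100 iterations: {elapsed_ms:.3f} ms")
```

The median is less sensitive than the mean to occasional slow iterations caused by scheduler or garbage-collection pauses, which makes it a reasonable summary for sub-millisecond measurements.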
Per-Component Timing (288K CNN)
| Component | Time (ms) | % of Total |
|---|---|---|
| Gradient Norm | 0.12 | 9.0% |
| Weight Statistics | 1.06 | 76.9% |
| Loss Analysis | 0.01 | 0.6% |
| Checkpoint (amortized) | 0.13 | 9.6% |
| Forecasting | 0.06 | 4.1% |
| Total ARC | 1.38 | 100% |
Overhead by Model Scale
| Model | Parameters | ARC (ms) | Baseline (ms) | Relative |
|---|---|---|---|---|
| Small MLP | 50K | 0.86 | 1.45 | ~60% |
| Medium CNN | 288K | 1.38 | 14.17 | ~10% |
| Large CNN | 2.5M | 7.04 | 74.24 | ~9.5% |
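Relative overhead falls from ~60% on the small MLP to ~10% on the medium CNN, then stays roughly flat for the large CNN (~9.5%). Monitoring cost is O(n) in the parameter count, while the forward/backward cost of a convolutional network also scales with the number of spatial positions each weight is applied to, so the training step likely dominates for the CNNs. Relative overhead on GPU has not been measured.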
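The Relative column appears to be the ARC time divided by the baseline step time. The sketch below recomputes it from the table values under that assumption; the `rows` structure and `relative_overhead` helper are illustrative names only.

```python
# (model, parameters, ARC ms, baseline ms) copied from the table above.
rows = [
    ("Small MLP", 50_000, 0.86, 1.45),
    ("Medium CNN", 288_000, 1.38, 14.17),
    ("Large CNN", 2_500_000, 7.04, 74.24),
]


def relative_overhead(arc_ms: float, baseline_ms: float) -> float:
    """ARC time as a percentage of the baseline training-step time."""
    return 100.0 * arc_ms / baseline_ms


overheads = {name: relative_overhead(arc, base) for name, _, arc, base in rows}

for name, params, arc, base in rows:
    print(f"{name} ({params:,} params): {overheads[name]:.1f}%")
```

Between the medium and large CNN, ARC time and baseline time both grow by roughly 5x, which is why the ratio barely moves.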
Large Model Stress Tests
Protocol: train for 20 stable steps, inject the failure, then verify recovery within the next 10 steps.
Script: experiments/validate_claims_phase2.py
| Model | Params | Failure Type | Recovery | Rollbacks |
|---|---|---|---|---|
| NanoGPT | 10M | LR Spike (50x) | Yes | 2 |
| ResNet-50 | 25.6M | Loss Singularity | Yes | 1 |
| YOLOv11 | 30M | Catastrophic LR | Yes | 3 |
| GPT-2 Small | 50M | NaN Bomb | Yes | 4 |
| SD-UNet | 60M | Gradient Apocalypse | Yes | 4 |
| Wide ResNet | 68M | Loss Supernova | Yes | 3 |
| Llama-style | 70M | Catastrophic LR | Yes | 5 |
| ViT-Base | 86M | Inf Nuke | Yes | 1 |
| GPT-2 Medium | 117M | NaN Bomb | Yes | 3 |
Result: 9/9 models recovered successfully (100% recovery rate).
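The inject-then-recover protocol can be sketched on a toy problem. This is a simplified stand-in, not the logic in experiments/validate_claims_phase2.py: the quadratic objective, the `blowup_factor` divergence test, the per-step checkpointing, and every function name are assumptions made for this example.

```python
import math


def loss_and_grad(w: float, target: float = 3.0) -> tuple:
    """Quadratic toy objective standing in for a real model's loss."""
    diff = w - target
    return diff * diff, 2.0 * diff


def run_stress_test(
    stable_steps: int = 20,
    recovery_window: int = 10,
    lr_spike: float = 50.0,
    blowup_factor: float = 10.0,
) -> dict:
    """Train, inject an LR spike, and roll back to the last good checkpoint."""
    w, lr = 0.0, 0.1
    checkpoint = (w, lr)  # last known-good (weight, learning rate)
    last_good_loss, _ = loss_and_grad(w)
    pre_failure_loss = last_good_loss
    rollbacks = 0

    for step in range(stable_steps + recovery_window):
        if step == stable_steps:
            pre_failure_loss = last_good_loss
            lr *= lr_spike  # injected failure: learning-rate spike

        _, grad = loss_and_grad(w)
        w -= lr * grad
        loss, _ = loss_and_grad(w)

        # Assumed divergence test: non-finite loss or a sudden large jump.
        diverged = not math.isfinite(loss) or loss > blowup_factor * last_good_loss
        if diverged:
            w, lr = checkpoint  # restore weight and learning rate
            rollbacks += 1
        else:
            checkpoint = (w, lr)
            last_good_loss = loss

    final_loss, _ = loss_and_grad(w)
    recovered = math.isfinite(final_loss) and final_loss <= pre_failure_loss
    return {"recovered": recovered, "rollbacks": rollbacks, "final_loss": final_loss}


result = run_stress_test()
print(result["recovered"], result["rollbacks"])
```

In this toy setup a single rollback suffices because the checkpoint also restores the learning rate. The higher rollback counts in the table above (1 to 5) suggest that real failures can take several attempts to clear.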
Known Limitations
| Area | Status |
|---|---|
| GPU overhead | Untested - all measurements on CPU |
| Scale beyond 117M | Untested - behavior at 1B+ unknown |
| Organic failures | Untested - all failures injected programmatically |
| Distributed training | Untested - single-process only |
| Non-CIFAR datasets | Limited - prediction experiments primarily use CIFAR-10 |
| Data corruption | Not supported - no input data validation |