Benchmarks

All numbers are backed by reproducible experiment scripts in the experiments/ directory.

Scope Notice

All recovery claims are restricted to programmatically induced failures. Real-world organic failures are untested. All timing measurements were conducted on CPU.

Metric	Value
Recovery (25/25)	100%
Prediction Accuracy	97.5%
False Positives	0% (implied by the 100% precision reported under Failure Prediction)
Overhead (250K+)	<10%

Baseline Comparison

Protocol: 4 methods, each evaluated on 5 failure types x 5 seeds = 25 scenarios per method (100 runs in total).
Script: experiments/baseline_comparison.py
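The scenario count in the protocol can be checked with a short enumeration. This is a minimal sketch, not the contents of experiments/baseline_comparison.py: the failure type names are placeholders (the page does not list them), and the seed values 0 to 4 are an assumption.

```python
from itertools import product

# Method names are taken from the results table.
methods = ["No Protection", "Gradient Clipping", "Loss-Only Monitor", "Full ARC"]

# Placeholder failure type names and assumed seed values (hypothetical).
failure_types = [f"failure_{i}" for i in range(5)]
seeds = range(5)

# Every method sees the same failure type x seed grid.
scenarios_per_method = list(product(failure_types, seeds))
all_runs = list(product(methods, failure_types, seeds))

print(len(scenarios_per_method), len(all_runs))
```

Running every method on the same grid is what makes the per-method percentages in the table directly comparable.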

Method	Detection	Recovery	Avg. Time
No Protection	52.0%	0.0%	926ms
Gradient Clipping	20.0%	0.0%	1297ms
Loss-Only Monitor	80.0%	80.0%	1359ms
Full ARC	100%	100%	1722ms

Detection Rate (chart): No Protection 52%, Gradient Clipping 20%, Loss-Only Monitor 80%, Full ARC 100%.

Key Finding

ARC's optimizer state monitoring catches silent failures (momentum buffer zeroing) that loss-only monitoring misses entirely.
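To illustrate why a silent failure can slip past a loss-only check, here is a minimal sketch that contrasts the two kinds of monitor. It is not ARC's implementation: the function names, the thresholds, and the representation of a momentum buffer as a plain list of floats are all assumptions made for illustration.

```python
import math

def loss_only_alarm(losses, spike_factor=3.0):
    """Hypothetical loss monitor: flag a non-finite loss or a spike above the recent mean."""
    *history, latest = losses
    if not math.isfinite(latest):
        return True
    baseline = sum(history) / len(history)
    return latest > spike_factor * baseline

def momentum_zeroed(prev_buffer, curr_buffer, eps=1e-12):
    """Hypothetical optimizer-state monitor: flag a non-zero buffer that collapses to ~0."""
    prev_norm = math.sqrt(sum(v * v for v in prev_buffer))
    curr_norm = math.sqrt(sum(v * v for v in curr_buffer))
    return prev_norm > eps and curr_norm <= eps

# Illustrative values: the loss still looks healthy at the step where the buffer is wiped.
losses = [0.52, 0.50, 0.49, 0.48]
prev_buf = [0.03, -0.01, 0.02]
curr_buf = [0.0, 0.0, 0.0]

loss_flag = loss_only_alarm(losses)
state_flag = momentum_zeroed(prev_buf, curr_buf)
print(loss_flag, state_flag)
```

In this toy case the loss check stays quiet while the state check fires, which is the pattern the finding describes.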

Failure Prediction

Protocol: 4 architectures x 5 failure types x 5 seeds x 2 labels = 200 scenarios, 5-fold CV.
Script: experiments/prediction_200_v2.py

Classifier	Accuracy	Precision	Recall	F1
Logistic Regression (12 features)	95.5%	100%	91.0%	0.953
MLP (12 features)	97.5%	100%	95.0%	0.974
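The four columns are tied together by the usual confusion-matrix formulas, which the sketch below applies. The counts are assumptions, not reported values: they assume the 200 scenarios split into 100 failing and 100 healthy runs (suggested by the "x 2 labels" factor) and are chosen to be consistent with the table.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Assumed counts (hypothetical): 100 positive and 100 negative scenarios, no false positives.
mlp = binary_metrics(tp=95, fp=0, fn=5, tn=100)
logreg = binary_metrics(tp=91, fp=0, fn=9, tn=100)

for name, m in (("MLP", mlp), ("Logistic Regression", logreg)):
    print(f"{name}: acc={m['accuracy']:.3f} prec={m['precision']:.3f} "
          f"rec={m['recall']:.3f} f1={m['f1']:.3f}")
```

Under these assumed counts the formulas reproduce the accuracy, recall, and F1 values in the table, so the reported columns are mutually consistent.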

v1 to v2 Improvement

Metric	v1 (8 feat, LogReg)	v2 (12 feat, MLP)
Accuracy	86.5%	97.5%
Recall	73.0%	95.0%
F1	0.844	0.974

Ablation Study

Protocol: 7 failure types x 5 seeds = 35 scenarios per config.
Script: experiments/ablation_experiment.py

Configuration	Detection	Delta (percentage points)
Full ARC (all components)	85.7%	-
Without Weight Health	85.7%	0.0
Without Gradient Monitoring	85.7%	0.0
Without Loss Monitoring	85.7%	0.0
Without Optimizer State	71.4%	-14.3
Loss Only (baseline)	71.4%	-14.3

Interpretation

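Removing weight health, gradient, or loss monitoring one at a time leaves detection unchanged at 85.7%, so these three components provide redundant coverage (defense-in-depth). Removing optimizer state monitoring drops detection to 71.4%, the same as the loss-only baseline, which makes it the only component with unique coverage in this study. It is what catches silent failures such as momentum buffer corruption.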
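The delta values are differences between two detection rates, measured in percentage points. A minimal sketch of that arithmetic follows; the counts of 30 and 25 detected scenarios are inferred from the reported rates and the 35-scenario protocol, not taken from the experiment output.

```python
def detection_rate(detected, total=35):
    """Detection rate in percent for one configuration."""
    return 100.0 * detected / total

# Inferred counts (assumption): 30/35 gives 85.7%, 25/35 gives 71.4%.
full_arc = detection_rate(30)
without_optimizer_state = detection_rate(25)

# The difference of two percentages is a percentage-point delta.
delta_pp = without_optimizer_state - full_arc

print(f"{full_arc:.1f}% {without_optimizer_state:.1f}% {delta_pp:+.1f} pp")
```

With 5 seeds per failure type, a drop of 5 scenarios is the size of one full failure type, though the page does not break detections down by type.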

Overhead Analysis

Protocol: Median of 100 iterations, time.perf_counter, CPU.
Script: experiments/overhead_measurement.py
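The measurement protocol can be sketched as follows. This is an illustration of median-of-N timing with time.perf_counter, not the contents of experiments/overhead_measurement.py; the helper name median_time_ms and the stand-in workload are hypothetical.

```python
import statistics
import time

def median_time_ms(fn, iterations=100):
    """Run fn repeatedly and return the median wall-clock time in milliseconds."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

def gradient_norm_stub():
    """Stand-in workload (hypothetical): an L2 norm over 10,000 floats."""
    values = [0.001 * i for i in range(10_000)]
    return sum(v * v for v in values) ** 0.5

elapsed_ms = median_time_ms(gradient_norm_stub)
print(f"median over 100 iterations: {elapsed_ms:.3f} ms")
```

The median is less sensitive than the mean to occasional slow iterations caused by scheduling or garbage collection, which matters at sub-millisecond scales.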

Per-Component Timing (288K CNN)

Component	Time (ms)	% of Total
Gradient Norm	0.12	9.0%
Weight Statistics	1.06	76.9%
Loss Analysis	0.01	0.6%
Checkpoint (amortized)	0.13	9.6%
Forecasting	0.06	4.1%
Total ARC	1.38	100%

Overhead by Model Scale

Model	Parameters	ARC (ms)	Baseline (ms)	Relative
Small MLP	50K	0.86	1.45	~60%
Medium CNN	288K	1.38	14.17	~10%
Large CNN	2.5M	7.04	74.24	~9.5%

Scaling Behavior

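Relative overhead falls from ~60% at 50K parameters to ~10% at 288K, then stays nearly flat at 2.5M (~9.5%). Monitoring cost is O(n) in the parameter count, and between the small MLP and the CNNs the forward/backward cost grows much faster than the monitoring cost (about 9.8x versus 1.6x in the table above). GPU overhead has not been measured, so lower relative overhead on GPU is an expectation rather than a result.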
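The Relative column is the ARC time divided by the baseline step time. The sketch below recomputes it from the rounded values in the table, so small differences from figures derived from unrounded timings are possible.

```python
# (model, ARC ms, baseline ms) copied from the table above.
rows = [
    ("Small MLP", 0.86, 1.45),
    ("Medium CNN", 1.38, 14.17),
    ("Large CNN", 7.04, 74.24),
]

# Relative overhead in percent of the baseline step time.
relative = {name: 100.0 * arc_ms / base_ms for name, arc_ms, base_ms in rows}

for name, pct in relative.items():
    print(f"{name}: {pct:.1f}%")
```

This gives roughly 59.3%, 9.7%, and 9.5%, in line with the approximate values in the table.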

Large Model Stress Tests

Protocol: 20 stable steps, failure injected, recovery verified within 10 steps.
Script: experiments/validate_claims_phase2.py
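The three phases of this protocol can be sketched on a toy problem. Everything here is an assumption made for illustration: the one-parameter quadratic objective, the 10x loss-jump threshold, and the rollback policy that restores both the weight and the learning rate. It does not reproduce experiments/validate_claims_phase2.py or ARC's recovery logic.

```python
import math

def loss(w):
    """Toy objective: f(w) = w^2."""
    return w * w

def train_step(w, lr):
    """One gradient descent step on f(w) = w^2 (gradient is 2w)."""
    return w - lr * 2.0 * w

def run_protocol(stable_steps=20, recovery_window=10, lr_spike=50.0):
    w, lr = 1.0, 0.1
    checkpoint = (w, lr)
    rollbacks = 0

    # Phase 1: stable training, checkpointing after every healthy step.
    for _ in range(stable_steps):
        w = train_step(w, lr)
        checkpoint = (w, lr)
    loss_before_failure = loss(w)

    # Phase 2: inject the failure (here, a learning rate spike).
    lr *= lr_spike

    # Phase 3: keep training; roll back when the loss jumps or becomes non-finite.
    for _ in range(recovery_window):
        candidate = train_step(w, lr)
        reference = loss(checkpoint[0])
        if not math.isfinite(loss(candidate)) or loss(candidate) > 10.0 * reference:
            w, lr = checkpoint  # hypothetical policy: restore weight and learning rate
            rollbacks += 1
        else:
            w = candidate
            checkpoint = (w, lr)

    recovered = math.isfinite(loss(w)) and loss(w) <= loss_before_failure
    return recovered, rollbacks

recovered, rollbacks = run_protocol()
print(recovered, rollbacks)
```

In this toy run a single rollback is enough because the checkpoint also restores the learning rate. The table below reports between 1 and 5 rollbacks per model.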

Model	Params	Failure Type	Recovery	Rollbacks
NanoGPT	10M	LR Spike (50x)	Yes	2
ResNet-50	25.6M	Loss Singularity	Yes	1
YOLOv11	30M	Catastrophic LR	Yes	3
GPT-2 Small	50M	NaN Bomb	Yes	4
SD-UNet	60M	Gradient Apocalypse	Yes	4
Wide ResNet	68M	Loss Supernova	Yes	3
Llama-style	70M	Catastrophic LR	Yes	5
ViT-Base	86M	Inf Nuke	Yes	1
GPT-2 Medium	117M	NaN Bomb	Yes	3

Result: 9/9 models recovered successfully (100% recovery rate).

Known Limitations

Claim	Status
GPU overhead	Untested - all measurements on CPU
Scale beyond 117M	Untested - behavior at 1B+ unknown
Organic failures	Untested - all failures injected programmatically
Distributed training	Untested - single-process only
Non-CIFAR datasets	Limited - prediction experiments primarily use CIFAR-10
Data corruption	Not supported - no input data validation