ARC monitors, predicts, and recovers from training failures automatically. Three lines of code. Less than 10% overhead. 100% recovery on tested scenarios.
pip install arc-training

from arc import ArcV2

arc = ArcV2.auto(model, optimizer)
for epoch in range(100):
    for batch in dataloader:
        optimizer.zero_grad()
        loss = model(batch)
        loss.backward()
        optimizer.step()
        arc.step(loss)
    arc.end_epoch(epoch)
A single NaN gradient at hour 47 of a 48-hour run can destroy days of compute. Engineers waste enormous time adding manual checkpointing and babysitting long runs.
Loss becomes NaN or Inf with no warning, corrupting the entire training run irreversibly.
Vanishing or exploding gradients silently halt learning, wasting compute without any alert.
Cloud GPU spending exceeds $30B annually. Every unrecoverable failure burns money directly.
Engineers monitor training overnight, writing custom recovery scripts and checkpointing logic.
Tracks 12+ real-time signals including loss trajectory, gradient norms, weight health, optimizer state integrity, and activation statistics.
An MLP classifier over signal-based features detects failures before they become irreversible, with 97.5% accuracy and zero false positives.
Automatically rolls back to the last healthy checkpoint and applies corrective measures including learning rate reduction and gradient clipping.
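The monitor, predict, and recover cycle described above can be illustrated without ARC itself. What follows is a minimal pure-Python sketch, not ARC's implementation: the function names (`grad_norm`, `is_healthy`, `clip_by_norm`, `guarded_step`), the thresholds, and the dict-based training state are all assumptions made for illustration, and a simple finiteness check stands in for the learned predictor.

```python
import copy
import math


def grad_norm(grads):
    """L2 norm of a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in grads))


def is_healthy(loss, grads, max_norm=1e3):
    """Hypothetical health check: finite loss and a finite, bounded gradient norm."""
    if not math.isfinite(loss):
        return False
    norm = grad_norm(grads)
    return math.isfinite(norm) and norm <= max_norm


def clip_by_norm(grads, max_norm):
    """Rescale gradients so their L2 norm does not exceed max_norm."""
    norm = grad_norm(grads)
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def guarded_step(state, checkpoint, loss, grads, lr_decay=0.5, clip=1.0):
    """Apply one SGD step, or roll back and lower the learning rate on failure.

    Returns (new_state, new_checkpoint, recovered).
    """
    if not is_healthy(loss, grads):
        # Roll back to the last healthy snapshot and reduce the learning rate.
        restored = copy.deepcopy(checkpoint)
        restored["lr"] *= lr_decay
        # Keep the reduced learning rate in the checkpoint so repeated
        # failures keep lowering it instead of resetting it.
        return restored, copy.deepcopy(restored), True
    clipped = clip_by_norm(grads, clip)
    new_weights = [w - state["lr"] * g for w, g in zip(state["weights"], clipped)]
    new_state = {"weights": new_weights, "lr": state["lr"]}
    # The step was healthy, so it becomes the new rollback target.
    return new_state, copy.deepcopy(new_state), False
```

In this sketch every healthy step refreshes the rollback target. A real system would more likely checkpoint on an interval to limit overhead, at the cost of losing a few steps on rollback.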
Automatic rollback with learning rate reduction, gradient clipping, and full state restoration on failure detection.
12+ signals tracked simultaneously: gradients, activations, weights, optimizer state, loss curvature, spectral features.
MLP classifier trained on 200 scenarios achieves 97.5% accuracy with 100% precision (zero false positives).
Specialized loss balancing for physics-informed neural networks with adaptive curriculum scheduling.
Detect adversarial inputs, apply randomized smoothing, and train models with certified robustness guarantees.
Conformal prediction for distribution-free coverage guarantees with Venn-Abers calibration support.
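For readers new to conformal prediction, a minimal split-conformal sketch for regression follows. It illustrates the general technique only and is not ARC's API: the function names and the absolute-residual score are assumptions, and Venn-Abers calibration is not covered.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    # 1-indexed rank ceil((n + 1) * (1 - alpha)) among the sorted scores.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: the interval is unbounded.
        return math.inf
    return sorted(scores)[rank - 1]


def prediction_interval(point_prediction, calib_preds, calib_targets, alpha=0.1):
    """Interval covering the true target with probability >= 1 - alpha,
    assuming calibration and test points are exchangeable."""
    # Nonconformity score: absolute residual on held-out calibration data.
    scores = [abs(y - p) for p, y in zip(calib_preds, calib_targets)]
    q = conformal_quantile(scores, alpha)
    return point_prediction - q, point_prediction + q
```

The guarantee is distribution-free in the sense that it needs no assumption on the model or the data distribution beyond exchangeability. The interval width depends entirely on how good the underlying predictor is.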
Elastic Weight Consolidation prevents catastrophic forgetting when training across sequential tasks.
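As a reminder of how Elastic Weight Consolidation works, the standard penalty adds (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2 to the task loss, where theta* are the parameters learned on the previous task and F_i is the diagonal Fisher information. The sketch below writes that formula in plain Python over flat lists. The function names are hypothetical and nothing here reflects how ARC's `learning/` module is actually structured.

```python
def ewc_penalty(params, anchor_params, fisher_diag, lam):
    """(lam / 2) * sum_i F_i * (theta_i - theta_star_i) ** 2."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher_diag)
    )


def ewc_penalty_grad(params, anchor_params, fisher_diag, lam):
    """Gradient of the penalty with respect to each parameter."""
    return [lam * f * (p - a) for p, a, f in zip(params, anchor_params, fisher_diag)]


def total_loss(task_loss, params, anchor_params, fisher_diag, lam=1.0):
    """New-task loss plus the EWC penalty that anchors important weights."""
    return task_loss + ewc_penalty(params, anchor_params, fisher_diag, lam)
```

Parameters with a large Fisher value are pulled strongly back toward their old values, while parameters with a value near zero remain free to adapt to the new task.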
Less than 10% overhead for models above 250K parameters. Adaptive sampling keeps monitoring efficient.
Quantized FP16 checkpoints with 50% memory reduction and automatic lifecycle management.
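The size saving from half-precision storage can be shown with the standard library alone. This is an illustrative sketch, not ARC's checkpoint format: the function names are hypothetical, and the comparison assumes the baseline is FP32 storage, which the 50% figure is consistent with but which is not stated above.

```python
import struct


def pack_fp32(values):
    """Pack floats as little-endian IEEE 754 single precision (4 bytes each)."""
    return struct.pack(f"<{len(values)}f", *values)


def quantize_fp16(values):
    """Pack floats as little-endian IEEE 754 half precision (2 bytes each).

    struct raises OverflowError for magnitudes beyond the FP16 range
    (about 65504), so very large weights would need separate handling.
    """
    return struct.pack(f"<{len(values)}e", *values)


def dequantize_fp16(blob):
    """Unpack half-precision bytes back into Python floats."""
    count = len(blob) // 2
    return list(struct.unpack(f"<{count}e", blob))
```

Values that are not exactly representable in FP16 come back slightly rounded, so restoring from such a checkpoint is lossy. Whether that matters depends on the model and on which tensors are quantized.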
arc/
    core/            Self-healing engine
    signals/         Multi-signal collectors
    features/        Feature extraction and buffering
    prediction/      Failure prediction models
    intervention/    Recovery strategies
    checkpointing/   Checkpoint management
    introspection/   Hessian + Fisher analysis
    physics/         PINN stabilization
    uncertainty/     Conformal prediction
    security/        Adversarial defense
    learning/        Continual learning (EWC)
    evaluation/      Benchmarking harness
Add or remove signal collectors independently without touching the core engine.
Use MLP, Logistic Regression, or custom classifiers. Hot-swap models at runtime.
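One way such a pluggable design could look is sketched below. This is an assumption about the pattern, not ARC's actual interface: `SketchEngine`, `add_collector`, `remove_collector`, and `failure_score` are hypothetical names, and the collectors and predictors are plain callables.

```python
class SketchEngine:
    """Hypothetical engine: named collectors feed a replaceable predictor."""

    def __init__(self, predictor):
        self._collectors = {}
        # Any callable mapping a list of feature values to a score.
        self.predictor = predictor

    def add_collector(self, name, fn):
        """Register a collector: a callable from step info (dict) to a float."""
        self._collectors[name] = fn

    def remove_collector(self, name):
        """Drop a collector without touching the rest of the engine."""
        del self._collectors[name]

    def failure_score(self, step_info):
        """Collect features in a stable (sorted-name) order and score them."""
        features = [self._collectors[n](step_info) for n in sorted(self._collectors)]
        return self.predictor(features)
```

Because the predictor is an ordinary attribute, swapping it at runtime is a single assignment, for example `engine.predictor = new_model`. The sketch uses predictors that accept any number of features. A trained classifier with a fixed input size would need retraining or a feature mask when collectors change.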
Dataclass-based configuration with presets: Config.low_overhead() and Config.high_accuracy().
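The preset pattern can be sketched with a frozen dataclass and classmethod constructors. Only the preset names `low_overhead` and `high_accuracy` come from the text above. The class name `SketchConfig`, every field, and every value are hypothetical and chosen only to show the shape of the pattern.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SketchConfig:
    # All fields and defaults are illustrative assumptions, not ARC's options.
    sample_every_n_steps: int = 10
    checkpoint_every_n_steps: int = 500
    track_activations: bool = True

    @classmethod
    def low_overhead(cls):
        """Preset that samples rarely and skips the costlier signals."""
        return cls(
            sample_every_n_steps=50,
            checkpoint_every_n_steps=2000,
            track_activations=False,
        )

    @classmethod
    def high_accuracy(cls):
        """Preset that samples every step and checkpoints more often."""
        return cls(
            sample_every_n_steps=1,
            checkpoint_every_n_steps=100,
            track_activations=True,
        )


# Start from a preset and override a single field without mutating it.
custom = replace(SketchConfig.low_overhead(), track_activations=True)
```

Freezing the dataclass keeps presets immutable, so a tweak always produces a new object through `dataclasses.replace` rather than changing a shared preset.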
Requires only PyTorch and NumPy. Optional integrations for Lightning, SciPy, and tqdm.
Validated on NanoGPT, ResNet-50, YOLOv11, GPT-2, ViT-Base, Stable Diffusion UNet, and more.