ARC - Autonomous Recovery Controller

The Problem

Training neural networks is fragile

A single NaN gradient at hour 47 of a 48-hour run can destroy days of compute. Engineers waste enormous time adding manual checkpointing and babysitting long runs.

01

NaN Explosions

Loss becomes NaN or Inf with no warning, corrupting the entire training run irreversibly.

02

Gradient Collapse

Vanishing or exploding gradients halt learning without raising any alert, so compute keeps burning unnoticed.

03

Wasted Compute

Cloud GPU spending exceeds $30B annually. Every unrecoverable failure burns money directly.

04

Manual Monitoring

Engineers monitor training overnight, writing custom recovery scripts and checkpointing logic.

How It Works

Three autonomous stages

01

Monitor

Tracks 12+ real-time signals including loss trajectory, gradient norms, weight health, optimizer state integrity, and activation statistics.
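As a rough illustration of what per-step signal collection can look like, the sketch below computes two of the signals named above (loss and global gradient norm) plus finiteness flags. It uses plain Python lists in place of tensors, and the function and key names are hypothetical, not ARC's API.

```python
import math
from typing import Dict, Iterable, List


def global_grad_norm(grads: Iterable[List[float]]) -> float:
    """L2 norm over all gradient values, treated as one flat vector."""
    return math.sqrt(sum(g * g for layer in grads for g in layer))


def collect_signals(loss: float, grads: List[List[float]]) -> Dict[str, float]:
    """Gather a few per-step health signals (key names are hypothetical)."""
    norm = global_grad_norm(grads)
    return {
        "loss": loss,
        "loss_is_finite": float(math.isfinite(loss)),
        "grad_norm": norm,
        "grad_is_finite": float(math.isfinite(norm)),
    }


# Two "layers" of gradients; the flat vector is (3, 4, 0), so the norm is 5.
signals = collect_signals(loss=0.5, grads=[[3.0, 4.0], [0.0]])
```

A NaN or Inf anywhere in the gradients propagates into the norm, so a single finiteness check on the norm covers every layer.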

→

02

Predict

An MLP classifier operating on signal-based features flags failures before they become irreversible. It reaches 97.5% accuracy with zero false positives.

→

03

Recover

Automatically rolls back to the last healthy checkpoint and applies corrective measures including learning rate reduction and gradient clipping.
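The recovery step can be pictured with a minimal sketch: keep a copy of the last healthy state, and on a non-finite loss restore it with a reduced learning rate. Gradient clipping by global norm is shown alongside. The class name, the state layout, and the 0.5 decay factor are assumptions for illustration, not ARC's actual implementation.

```python
import copy
import math
from typing import Dict, List, Optional


def clip_by_global_norm(grads: List[float], max_norm: float) -> List[float]:
    """Scale gradients down so their L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


class RollbackGuard:
    """Keeps the last healthy state and restores it on a non-finite loss."""

    def __init__(self, lr_decay: float = 0.5) -> None:
        self.lr_decay = lr_decay
        self._healthy: Optional[Dict] = None

    def step(self, state: Dict, loss: float) -> Dict:
        if math.isfinite(loss):
            self._healthy = copy.deepcopy(state)  # snapshot a healthy state
            return state
        if self._healthy is None:
            raise RuntimeError("no healthy checkpoint to roll back to")
        restored = copy.deepcopy(self._healthy)
        restored["lr"] *= self.lr_decay  # corrective measure: lower the LR
        self._healthy = copy.deepcopy(restored)  # repeated failures decay further
        return restored


guard = RollbackGuard()
state = {"weights": [1.0, 2.0], "lr": 0.1}
state = guard.step(state, loss=0.7)            # healthy step, snapshot taken
state["weights"] = [float("nan"), 2.0]         # simulate a corrupted update
state = guard.step(state, loss=float("nan"))   # rolled back, LR halved
```

Snapshotting on every step is only practical for a toy example; a real controller would snapshot at intervals and accept losing a few steps on rollback.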

Capabilities

Everything needed for resilient training

01

Self-Healing Engine

Automatic rollback with learning rate reduction, gradient clipping, and full state restoration on failure detection.

02

Multi-Signal Monitoring

12+ signals tracked simultaneously: gradients, activations, weights, optimizer state, loss curvature, spectral features.

03

Failure Prediction

MLP classifier trained on 200 scenarios achieves 97.5% accuracy with 100% precision (zero false positives).
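To make the relationship between these metrics concrete, the sketch below computes accuracy, precision, and recall from confusion-matrix counts. The counts are hypothetical, picked only so the arithmetic lands on similar figures; the source does not publish ARC's confusion matrix.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts: no false alarms (fp=0), five missed failures (fn=5).
metrics = classification_metrics(tp=45, fp=0, fn=5, tn=150)
```

With zero false positives, precision is 100% by definition, so every remaining error has to be a missed failure. That makes recall the metric worth checking alongside the headline accuracy.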

04

PINN Stabilizer

Specialized loss balancing for physics-informed neural networks with adaptive curriculum scheduling.
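One common heuristic for balancing PINN loss terms is to weight each term by the inverse of its running magnitude, so that no single term (PDE residual, boundary, data) dominates the total. The sketch below shows that heuristic only; it is an assumption about the general technique, not a description of ARC's stabilizer, and the class and term names are hypothetical.

```python
from typing import Dict


class LossBalancer:
    """Weights loss terms by the inverse of their exponential moving average."""

    def __init__(self, decay: float = 0.9) -> None:
        self.decay = decay
        self._ema: Dict[str, float] = {}

    def update(self, losses: Dict[str, float]) -> Dict[str, float]:
        for name, value in losses.items():
            prev = self._ema.get(name, value)  # first call seeds the average
            self._ema[name] = self.decay * prev + (1 - self.decay) * value
        inverse = {name: 1.0 / (self._ema[name] + 1e-12) for name in losses}
        total = sum(inverse.values())
        # Normalize so the weights sum to the number of terms.
        return {name: len(losses) * inv / total for name, inv in inverse.items()}


balancer = LossBalancer()
weights = balancer.update({"pde": 100.0, "boundary": 1.0})
# The large PDE term gets a small weight, so both weighted terms end up equal.
```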

05

Adversarial Defense

Detect adversarial inputs, apply randomized smoothing, and train models with certified robustness guarantees.
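Randomized smoothing, in its basic form, classifies many Gaussian-perturbed copies of an input and returns the majority vote. The sketch below shows only that prediction step with a toy one-dimensional classifier; it does not compute a robustness certificate, and all names are hypothetical.

```python
import random
from collections import Counter
from typing import Callable, List


def smoothed_predict(
    classify: Callable[[List[float]], int],
    x: List[float],
    sigma: float,
    n_samples: int = 200,
    seed: int = 0,
) -> int:
    """Majority vote of the base classifier over Gaussian-noised inputs."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    votes: Counter = Counter()
    for _ in range(n_samples):
        noisy = [v + rng.gauss(0.0, sigma) for v in x]
        votes[classify(noisy)] += 1
    return votes.most_common(1)[0][0]


def base_classifier(x: List[float]) -> int:
    """Toy classifier: class 1 if the first feature is positive."""
    return 1 if x[0] > 0.0 else 0


label = smoothed_predict(base_classifier, [2.0], sigma=0.5)
```

A certified radius additionally requires a statistical lower bound on the top-class vote probability, which this sketch leaves out.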

06

Uncertainty Quantification

Conformal prediction for distribution-free coverage guarantees with Venn-Abers calibration support.
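For readers unfamiliar with conformal prediction, here is a minimal split-conformal sketch for regression: take absolute residuals on a held-out calibration set, pick the ceil((n + 1)(1 - alpha))-th smallest, and use it as the interval half-width. Function names and the sample numbers are illustrative assumptions, not ARC's API.

```python
import math
from typing import List, Tuple


def conformal_quantile(scores: List[float], alpha: float) -> float:
    """Calibrated score threshold for target coverage 1 - alpha."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the calibrated score
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[k - 1]


def prediction_interval(pred: float, q: float) -> Tuple[float, float]:
    """Symmetric interval around a point prediction."""
    return (pred - q, pred + q)


# Absolute residuals |y - y_hat| on a hypothetical calibration set (n = 7).
scores = [0.3, 0.1, 0.5, 0.2, 0.9, 0.4, 0.7]
q = conformal_quantile(scores, alpha=0.25)  # k = ceil(8 * 0.75) = 6
interval = prediction_interval(2.0, q)
```

If calibration and test points are exchangeable, intervals built this way cover the true value with probability at least 1 - alpha, without any assumption on the data distribution.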

07

Continual Learning

Elastic Weight Consolidation prevents catastrophic forgetting when training across sequential tasks.
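The core of Elastic Weight Consolidation is a quadratic penalty that anchors each parameter to its value after the previous task, scaled by that parameter's Fisher information: (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2. A minimal sketch on flat Python lists, with hypothetical names and values:

```python
from typing import List


def ewc_penalty(
    params: List[float],
    anchor: List[float],
    fisher: List[float],
    lam: float,
) -> float:
    """EWC regularizer: (lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor, fisher)
    )


# The first parameter moved by 1.0 and has high Fisher information (4.0),
# so it accounts for the whole penalty; the second did not move.
penalty = ewc_penalty(
    params=[1.0, 2.0], anchor=[0.0, 2.0], fisher=[4.0, 1.0], lam=0.5
)
```

The penalty is added to the new task's loss, so parameters that mattered for earlier tasks are expensive to move while unimportant ones stay free.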

08

Low Overhead

Less than 10% overhead for models above 250K parameters. Adaptive sampling keeps monitoring efficient.

09

Smart Checkpointing

FP16-quantized checkpoints take 50% less memory than their FP32 equivalents, with automatic lifecycle management.
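The 50% figure follows directly from the storage width: a half-precision float takes 2 bytes and a single-precision float takes 4. The sketch below demonstrates this with the standard library's struct module; the helper names are hypothetical and this is not ARC's checkpoint format.

```python
import struct
from typing import List


def pack_weights(weights: List[float], half: bool) -> bytes:
    """Serialize weights as little-endian FP16 ('e') or FP32 ('f')."""
    code = "e" if half else "f"
    return struct.pack("<%d%s" % (len(weights), code), *weights)


def unpack_weights(blob: bytes, half: bool) -> List[float]:
    """Inverse of pack_weights."""
    code, size = ("e", 2) if half else ("f", 4)
    count = len(blob) // size
    return list(struct.unpack("<%d%s" % (count, code), blob))


weights = [0.5, -1.25, 3.0, 0.1]
fp32_blob = pack_weights(weights, half=False)  # 4 values * 4 bytes
fp16_blob = pack_weights(weights, half=True)   # 4 values * 2 bytes
restored = unpack_weights(fp16_blob, half=True)
```

The saving comes at the cost of precision and range: values such as 0.1 are stored only approximately in FP16, and magnitudes beyond about 65504 cannot be represented at all.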

Architecture

Modular by design

project structure

arc/
  core/            Self-healing engine
  signals/         Multi-signal collectors
  features/        Feature extraction and buffering
  prediction/      Failure prediction models
  intervention/    Recovery strategies
  checkpointing/   Checkpoint management
  introspection/   Hessian + Fisher analysis
  physics/         PINN stabilization
  uncertainty/     Conformal prediction
  security/        Adversarial defense
  learning/        Continual learning (EWC)
  evaluation/      Benchmarking harness

Pluggable Collectors

Add or remove signal collectors independently without touching the core engine.
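A pluggable collector design can be sketched as a small registry that maps signal names to functions. The engine only calls the registry, so adding or removing a collector never touches engine code. The class, method names, and step_info layout below are hypothetical, not ARC's actual interface.

```python
from typing import Callable, Dict

# A collector maps per-step training info to one scalar signal.
Collector = Callable[[Dict[str, float]], float]


class SignalRegistry:
    """Holds named collectors and runs all of them on each step."""

    def __init__(self) -> None:
        self._collectors: Dict[str, Collector] = {}

    def register(self, name: str, fn: Collector) -> None:
        self._collectors[name] = fn

    def unregister(self, name: str) -> None:
        self._collectors.pop(name, None)  # removing a missing name is a no-op

    def collect(self, step_info: Dict[str, float]) -> Dict[str, float]:
        return {name: fn(step_info) for name, fn in self._collectors.items()}


registry = SignalRegistry()
registry.register("loss", lambda info: info["loss"])
registry.register("lr_scaled_grad", lambda info: info["lr"] * info["grad_norm"])
snapshot = registry.collect({"loss": 0.4, "lr": 0.5, "grad_norm": 2.0})
registry.unregister("lr_scaled_grad")
```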

Swappable Prediction

Use MLP, Logistic Regression, or custom classifiers. Hot-swap models at runtime.
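Swappable prediction usually comes down to a shared interface. In the sketch below, any object with a predict_proba method can be dropped into a slot, and the slot can be reassigned while training continues. The protocol, class names, and the 0.5 threshold are illustrative assumptions rather than ARC's API.

```python
import math
from typing import Protocol, Sequence


class FailurePredictor(Protocol):
    def predict_proba(self, features: Sequence[float]) -> float: ...


class LogisticPredictor:
    """Logistic regression over a feature vector."""

    def __init__(self, weights: Sequence[float], bias: float = 0.0) -> None:
        self.weights, self.bias = list(weights), bias

    def predict_proba(self, features: Sequence[float]) -> float:
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        if z >= 0:  # numerically stable sigmoid
            return 1.0 / (1.0 + math.exp(-z))
        e = math.exp(z)
        return e / (1.0 + e)


class ThresholdPredictor:
    """Custom rule: flag failure when the first feature exceeds a limit."""

    def __init__(self, limit: float) -> None:
        self.limit = limit

    def predict_proba(self, features: Sequence[float]) -> float:
        return 1.0 if features[0] > self.limit else 0.0


class PredictorSlot:
    """Holds the active predictor; swap() replaces it at runtime."""

    def __init__(self, model: FailurePredictor) -> None:
        self.model = model

    def swap(self, model: FailurePredictor) -> None:
        self.model = model

    def should_intervene(self, features: Sequence[float], threshold: float = 0.5) -> bool:
        return self.model.predict_proba(features) >= threshold


slot = PredictorSlot(LogisticPredictor(weights=[1.0]))
before = slot.should_intervene([3.0])   # sigmoid(3) is about 0.95
slot.swap(ThresholdPredictor(limit=10.0))
after = slot.should_intervene([3.0])    # 3.0 does not exceed the limit
```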

Configurable

Dataclass-based configuration with presets: Config.low_overhead() and Config.high_accuracy().
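A dataclass with preset constructors might look like the following sketch. Only the names Config, low_overhead, and high_accuracy come from the text above; every field name and value is a hypothetical placeholder, not ARC's real configuration schema.

```python
from dataclasses import dataclass


@dataclass
class Config:
    # All fields below are hypothetical, for illustration only.
    monitor_every_n_steps: int = 10
    checkpoint_every_n_steps: int = 500
    predictor: str = "mlp"

    @classmethod
    def low_overhead(cls) -> "Config":
        """Preset that samples signals rarely to minimize cost."""
        return cls(
            monitor_every_n_steps=50,
            checkpoint_every_n_steps=2000,
            predictor="logistic",
        )

    @classmethod
    def high_accuracy(cls) -> "Config":
        """Preset that monitors every step for the earliest detection."""
        return cls(
            monitor_every_n_steps=1,
            checkpoint_every_n_steps=100,
            predictor="mlp",
        )


fast = Config.low_overhead()
careful = Config.high_accuracy()
```

Presets as classmethods keep the defaults in one place, and any single field can still be overridden afterwards, for example with dataclasses.replace.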

Minimal Dependencies

Requires only PyTorch and NumPy. Optional integrations for Lightning, SciPy, and tqdm.

Compatibility

Works with any architecture

CNN / ResNet

Transformers

YOLO / ViT

PINNs

GPT / LLMs

Diffusion UNet

RNN / LSTM

GANs

Validated on NanoGPT, ResNet-50, YOLOv11, GPT-2, ViT-Base, Stable Diffusion UNet, and more.

Start using ARC

Install in seconds. Never lose a training run again.

Read the Docs | View on GitHub

Autonomous Recovery Controller for Neural Network Training