Frequently Asked Questions
Quick answers to the most common questions. Can't find what you need? Ask on Discord or open an issue.
General
ARC (Autonomous Recovery Controller) is an open-source Python library that makes neural network training self-healing. It wraps your existing PyTorch training loop and autonomously monitors training signals, predicts failures before they happen, and automatically recovers when things go wrong by rolling back to the last healthy checkpoint and applying corrective measures.
ARC detects 5 core failure modes: Divergence (NaN/Inf loss, loss explosion), Vanishing Gradients, Exploding Gradients, Representation Collapse (all activations become identical), and Severe Overfitting. It can also detect silent failures like optimizer momentum corruption and dead neurons.
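The divergence case (NaN/Inf loss, loss explosion) is the simplest of these to picture. Below is a minimal sketch of such a check in plain Python. It is not ARC's detector: the function name `is_diverging`, the `explosion_factor` of 10 and the 5-step `window` are all hypothetical values chosen for illustration.

```python
import math

def is_diverging(loss_history, explosion_factor=10.0, window=5):
    """Return True if the newest loss is NaN/Inf or far above the recent average.

    explosion_factor and window are illustrative, not ARC defaults.
    """
    if not loss_history:
        return False
    latest = loss_history[-1]
    if not math.isfinite(latest):
        return True  # NaN or +/-Inf loss
    # Up to `window` losses immediately before the newest one.
    previous = loss_history[-(window + 1):-1]
    if not previous:
        return False  # nothing to compare against yet
    baseline = sum(previous) / len(previous)
    return latest > explosion_factor * max(baseline, 1e-12)

healthy = is_diverging([1.0, 0.9, 0.8, 0.7])
nan_loss = is_diverging([1.0, 0.9, 0.8, float("nan")])
exploded = is_diverging([1.0, 0.9, 50.0])
```

A check like this only looks at the scalar loss. The gradient and activation failure modes listed above need per-layer statistics, which is why ARC attaches hooks to the model.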
ARC is released under the AGPL-3.0 license and is free for both academic and commercial use. The core library will always remain free and open source.
ARC supports PyTorch natively. It also provides a one-line integration for PyTorch Lightning via ArcCallback. TensorFlow and JAX are not supported at this time.
Installation
ARC requires Python 3.8+, PyTorch 1.9+, and NumPy 1.19+, and keeps its dependencies minimal by design. Optional dependencies such as SciPy and tqdm can be installed with pip install arc-training[full].
Clone the repository and install in development mode: git clone https://github.com/a-kaushik2209/ARC.git && cd ARC && pip install -e .
Usage
For vanilla PyTorch, 3 lines: import, create the controller, call arc.step(loss). For PyTorch Lightning, 1 line: add ArcCallback() to your trainer's callbacks.
No code changes are needed. ARC is a wrapper, not a modification: it attaches monitoring hooks to your existing model and optimizer. Your architecture, loss function, optimizer, and training loop all stay exactly the same.
ARC works with any PyTorch model: CNNs (ResNet, VGG, EfficientNet), Transformers (GPT, BERT, ViT), object detection (YOLO), diffusion models (UNet), RNNs/LSTMs, GANs, PINNs, and more. Stress-tested up to 117M parameters.
Distributed training (multi-GPU/multi-node) is not yet supported. ARC currently operates in single-process mode. Distributed support is on the roadmap.
When auto_intervene=True, ARC automatically applies corrective actions when a critical-risk failure is predicted. When False (default), ARC only monitors and reports.
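To make the difference between the two modes concrete, here is a toy sketch in plain Python. It does not use ARC: it minimizes f(w) = w^2 with a deliberately unstable learning rate, and the explosion rule (loss more than 2x the last healthy loss) and the corrective action (restore the checkpoint, halve the learning rate) are assumptions made for illustration. Only the `auto_intervene` name comes from ARC.

```python
def train(steps, lr, auto_intervene):
    """Gradient descent on f(w) = w**2 with a toy monitor.

    Returns (final_loss, final_lr, steps_where_explosion_was_flagged).
    """
    w = 1.0
    checkpoint = {"w": w, "loss": w * w}  # last healthy state
    events = []
    for step in range(steps):
        grad = 2.0 * w
        w = w - lr * grad
        loss = w * w
        if loss > 2.0 * checkpoint["loss"]:  # toy "loss explosion" rule
            events.append(step)
            if auto_intervene:
                w = checkpoint["w"]  # roll back to the healthy weights
                lr *= 0.5            # corrective measure: smaller step size
                continue
            # Monitor-only mode: report the event, change nothing.
        else:
            checkpoint = {"w": w, "loss": loss}
    return w * w, lr, events

loss_on, lr_on, events_on = train(10, 1.5, auto_intervene=True)
loss_off, lr_off, events_off = train(10, 1.5, auto_intervene=False)
```

With intervention enabled, the run is flagged once at the first step, rolls back, and then converges at the halved learning rate. In monitor-only mode every step is flagged and the loss keeps growing.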
Performance
For models above 250K parameters, ARC adds less than 10% overhead when measured on CPU. Overhead shrinks as models grow because forward/backward pass time grows superlinearly while ARC's monitoring cost is O(n). For very small models (under 50K parameters), overhead can be much higher (around 60%).
The MLP classifier reaches 97.5% accuracy using 12 features, evaluated on 200 scenarios with 5-fold cross-validation. It achieves 100% precision (zero false positives) and 95% recall.
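These three figures are mutually consistent, which can be checked with a little arithmetic. The sketch below assumes the percentages are exact and pooled over all 200 scenarios (rather than averaged per fold); under that assumption they pin down a single confusion matrix.

```python
from fractions import Fraction

total = 200
accuracy = Fraction(975, 1000)
recall = Fraction(95, 100)

errors = total * (1 - accuracy)      # misclassified scenarios
fp = 0                               # 100% precision means no false positives
fn = errors - fp                     # so every error is a missed failure
tp = recall * fn / (1 - recall)      # from recall = tp / (tp + fn)
tn = total - tp - fp - fn

print(int(tp), int(fp), int(fn), int(tn))
```

Under these assumptions the evaluation set contains 100 failing and 100 healthy scenarios, and all 5 errors are failures the classifier missed.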
ARC has been stress-tested up to 117M parameters (GPT-2 Medium) with successful recovery. Behavior at 1B+ parameters is untested.
Advanced Features
PINNStabilizer provides adaptive loss weighting for Physics-Informed Neural Networks with multiple competing loss terms. It automatically balances losses and applies gradient clipping to prevent chaotic training dynamics.
EWC prevents catastrophic forgetting during continual learning. After finishing a task, call arc.consolidate_task() to compute Fisher Information. When training on a new task, add arc.get_ewc_loss() to penalize changes to important parameters.
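For intuition, the standard EWC penalty is (lambda / 2) * sum_i F_i * (theta_i - theta_star_i)^2, where theta_star are the parameters saved after the previous task and F is a diagonal Fisher Information estimate. The plain-Python sketch below implements that textbook form. How arc.consolidate_task() and arc.get_ewc_loss() compute these quantities internally is not specified here, so the helper names `fisher_diagonal` and `ewc_penalty` and the `lam` parameter are hypothetical.

```python
def fisher_diagonal(per_sample_grads):
    """Diagonal Fisher estimate: mean of squared per-sample gradients."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]

def ewc_penalty(params, anchor_params, fisher, lam=1.0):
    """Quadratic penalty for moving away from the previous task's parameters."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )

# Two parameters; only the first one mattered for the old task.
fisher = fisher_diagonal([[1.0, 0.0], [3.0, 0.0]])
anchor = [0.5, -1.0]

moved_important = ewc_penalty([1.5, -1.0], anchor, fisher, lam=2.0)
moved_unimportant = ewc_penalty([0.5, 2.0], anchor, fisher, lam=2.0)
```

Moving the parameter with high Fisher Information is penalized, while moving the one with zero Fisher Information costs nothing, which is how the new task stays free to use capacity the old task did not rely on.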
ConformalPredictor provides distribution-free coverage guarantees. Instead of a single prediction, it returns a prediction set guaranteed to contain the true label with a specified probability (e.g., 90%).
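One common way to obtain such sets is split conformal prediction, sketched below in plain Python. This illustrates the general technique, not ConformalPredictor's actual implementation: the function names, the nonconformity score (1 minus the probability of the true label) and the toy calibration data are assumptions.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Score threshold from a held-out calibration set (split conformal)."""
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if rank > n:
        return float("inf")  # too few calibration points: include every label
    return scores[rank - 1]

def prediction_set(probs, threshold):
    """All labels whose nonconformity score is within the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]

# Toy calibration data: predicted class probabilities and true labels.
cal_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6], [0.5, 0.4, 0.1]]
cal_labels = [0, 1, 2, 1]

threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.2)
uncertain = prediction_set([0.5, 0.45, 0.05], threshold)
confident = prediction_set([0.9, 0.05, 0.05], threshold)
```

Ambiguous inputs get larger sets and confident inputs get smaller ones. The coverage guarantee holds on average over exchangeable calibration and test data, and a real calibration set would be far larger than the four points used here.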
Use the low-overhead preset: config = Config.low_overhead(). This reduces activation sampling to 5%, disables curvature computation, limits MC dropout to 5 samples, and sets a 2% overhead budget.
Troubleshooting
Call arc.attach(model, optimizer) before calling on_epoch_end(). If you use ArcV2.auto(), attach is done automatically; make sure you pass your model and optimizer as arguments.
Set verbose=False when creating the Arc instance. You can also adjust thresholds: config.prediction.high_risk_threshold = 0.85 to reduce sensitivity.
ARC state can be saved and restored. Use arc.save_state("path") and arc.load_state("path") to persist and restore the signal buffer, normalizer, and recommender state across sessions.
To contribute, fork the repo, open an issue describing your proposed change, create a feature branch, implement your changes, and open a PR. Join the Discord community for discussions.