Frequently Asked Questions

Quick answers to the most common questions. Can't find what you need? Ask on Discord or open an issue.

General

What is ARC?

ARC (Autonomous Recovery Controller) is an open-source Python library that makes neural network training self-healing. It wraps your existing PyTorch training loop, monitors training signals, predicts failures before they happen, and recovers when things go wrong by rolling back to the last healthy checkpoint and applying corrective measures.
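
The rollback idea can be illustrated with a minimal sketch in plain Python. This is not ARC's implementation or API: the RollbackGuard class, its explosion_factor parameter, and the "halve the learning rate" correction are hypothetical stand-ins chosen for illustration, and the sketch assumes positive loss values.

```python
import copy
import math


class RollbackGuard:
    """Hypothetical stand-in (not ARC's API): keeps the last healthy state."""

    def __init__(self, state, explosion_factor=10.0):
        self.state = state                      # mutable dict of training state
        self.explosion_factor = explosion_factor
        self.best_loss = math.inf               # lowest healthy loss seen so far
        self.checkpoint = copy.deepcopy(state)  # last known-good snapshot

    def step(self, loss):
        """Return True for a healthy step, False if a rollback happened."""
        exploded = loss > self.explosion_factor * self.best_loss
        if not math.isfinite(loss) or exploded:
            # Restore the snapshot in place, then apply a corrective measure
            # (here: halve the learning rate, an illustrative choice).
            self.state.clear()
            self.state.update(copy.deepcopy(self.checkpoint))
            self.state["lr"] *= 0.5
            return False
        self.best_loss = min(self.best_loss, loss)
        self.checkpoint = copy.deepcopy(self.state)
        return True


state = {"weights": [0.5, -0.2], "lr": 0.1}
guard = RollbackGuard(state)
guard.step(1.0)               # healthy: snapshot taken
state["weights"] = [1e9, -1e9]
guard.step(float("nan"))      # unhealthy: weights restored, lr halved
```

A real controller would snapshot model and optimizer state rather than a plain dict, but the control flow is the same: snapshot on healthy steps, restore and correct on unhealthy ones.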

What failure types can ARC detect?

ARC detects five core failure modes: Divergence (NaN/Inf loss, loss explosion), Vanishing Gradients, Exploding Gradients, Representation Collapse (all activations become identical), and Severe Overfitting. It can also detect silent failures such as optimizer momentum corruption and dead neurons.
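
To make three of these failure modes concrete, here is a rough sketch of threshold-based checks on the loss and the gradient norm. The function name classify_step and both thresholds are assumptions made for illustration; ARC's actual detection logic is not described in this FAQ and may differ substantially.

```python
import math


def classify_step(loss, grad_norm, vanish_below=1e-7, explode_above=1e3):
    """Label one training step. Thresholds are illustrative assumptions."""
    if not math.isfinite(loss):
        return "divergence"            # NaN or Inf loss
    if not math.isfinite(grad_norm) or grad_norm > explode_above:
        return "exploding_gradients"
    if grad_norm < vanish_below:
        return "vanishing_gradients"
    return "healthy"


print(classify_step(float("nan"), 1.0))
print(classify_step(0.5, 1e5))
print(classify_step(0.5, 1e-9))
print(classify_step(0.5, 0.3))
```

Representation collapse and overfitting need more than a single step's scalars (activation statistics and a validation curve, respectively), so they are left out of this sketch.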

Is ARC free to use?

Yes. ARC is released under the AGPL-3.0 license and is free for both academic and commercial use, subject to that license's copyleft terms. The core library will always remain free and open source.

What frameworks does ARC support?

ARC supports PyTorch natively. It also provides a one-line integration for PyTorch Lightning via ArcCallback. TensorFlow and JAX are not supported at this time.

Installation

What are the minimum requirements?

Python 3.8+, PyTorch 1.9+, and NumPy 1.19+. ARC has minimal dependencies by design. Optional dependencies like SciPy and tqdm can be installed with pip install arc-training[full].
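
If you want to check an environment against these minimums before installing, a small version comparison is enough. The helper names below (version_tuple, meets_minimum, python_ok) are hypothetical and not part of ARC; the sketch uses only the standard library.

```python
import re
import sys


def version_tuple(text):
    """Parse the leading numeric components, e.g. '1.9.0+cu111' -> (1, 9, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum):
    """True when the installed version is at least the minimum."""
    have, need = version_tuple(installed), version_tuple(minimum)
    width = max(len(have), len(need))
    # Pad with zeros so that '2.0' and '2.0.0' compare as equal.
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def python_ok():
    """Check the running interpreter against the stated Python 3.8 minimum."""
    current = ".".join(str(part) for part in sys.version_info[:3])
    return meets_minimum(current, "3.8")
```

The same comparison can be applied to torch.__version__ and numpy.__version__ once those packages are importable, for example meets_minimum(torch.__version__, "1.9").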

How do I install from source?

Clone the repository and install in development mode: git clone https://github.com/a-kaushik2209/ARC.git && cd ARC && pip install -e .

Usage

How many lines of code does integration require?

For vanilla PyTorch, 3 lines: import, create the controller, call arc.step(loss). For PyTorch Lightning, 1 line: add ArcCallback() to your trainer's callbacks.

Do I need to change my model architecture?

No. ARC is a wrapper, not a modification. It attaches monitoring hooks to your existing model and optimizer. Your architecture, loss function, optimizer, and training loop all stay exactly the same.

Which model architectures are supported?

ARC works with any PyTorch model: CNNs (ResNet, VGG, EfficientNet), Transformers (GPT, BERT, ViT), object detection (YOLO), diffusion models (UNet), RNNs/LSTMs, GANs, PINNs, and more. It has been stress-tested up to 117M parameters.

Can I use ARC with distributed training?

Distributed training (multi-GPU/multi-node) is not yet supported. ARC currently operates in single-process mode. Distributed support is on the roadmap.

What does auto_intervene do?

When auto_intervene=True, ARC automatically applies corrective actions when a critical-risk failure is predicted. When auto_intervene=False (the default), ARC only monitors and reports.

Performance

How much overhead does ARC add?

For models above 250K parameters, ARC adds less than 10% overhead on CPU. Relative overhead shrinks as models grow because forward/backward pass time grows superlinearly while ARC's monitoring cost is O(n) in the parameter count n. For very small models (under 50K parameters), overhead can be much higher (around 60%).
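
To measure overhead on your own model, time a training step with and without monitoring and compare. The sketch below uses only the standard library; mean_step_time and relative_overhead are hypothetical helper names, and step_fn stands for whatever callable runs one training step in your code.

```python
import time


def mean_step_time(step_fn, repeats=20):
    """Average wall-clock seconds per call to step_fn."""
    start = time.perf_counter()
    for _ in range(repeats):
        step_fn()
    return (time.perf_counter() - start) / repeats


def relative_overhead(baseline_seconds, monitored_seconds):
    """Fractional slowdown: 0.08 means the monitored step is 8% slower."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return (monitored_seconds - baseline_seconds) / baseline_seconds


# Example with made-up timings: 50 ms baseline, 54 ms with monitoring.
print(f"{relative_overhead(0.050, 0.054):.1%}")
```

Run a few warm-up steps before timing, and use the same batch for both measurements so that the difference reflects monitoring cost rather than data variation.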

What is the prediction accuracy?

The MLP classifier reaches 97.5% accuracy using 12 features, evaluated on 200 scenarios with 5-fold cross-validation. It achieves 100% precision (zero false positives) and 95% recall.
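
These three figures are mutually consistent under one plausible class split. The FAQ does not state how the 200 scenarios divide, so the 100 failing / 100 healthy split below is an assumption made purely to show the arithmetic.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Assumed split (not stated in the FAQ): 100 failing, 100 healthy scenarios.
# 95 failures caught, 5 missed, no healthy run flagged.
result = metrics(tp=95, fp=0, fn=5, tn=100)
print(result)  # accuracy 0.975, precision 1.0, recall 0.95
```

With zero false positives, every error is a missed failure, so accuracy is (95 + 100) / 200 = 97.5%.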

What is the maximum model size tested?

ARC has been stress-tested up to 117M parameters (GPT-2 Small) with successful recovery. Behavior at 1B+ parameters is untested.

Advanced Features

What is the PINN Stabilizer?

PINNStabilizer provides adaptive loss weighting for Physics-Informed Neural Networks with multiple competing loss terms. It automatically balances losses and applies gradient clipping to prevent chaotic training dynamics.
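
As a rough illustration of the two ideas, the sketch below balances loss terms by inverse magnitude and clips a gradient vector by its global norm. Inverse-magnitude weighting is only one of several adaptive schemes and is not confirmed to be what PINNStabilizer uses; both function names are hypothetical.

```python
import math


def inverse_magnitude_weights(losses, eps=1e-12):
    """Weights proportional to 1/|loss|, scaled so they sum to len(losses)."""
    inverse = [1.0 / (abs(value) + eps) for value in losses]
    scale = len(losses) / sum(inverse)
    return [w * scale for w in inverse]


def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    return [g * (max_norm / norm) for g in grads]


# A PDE residual term that dwarfs the boundary term gets a smaller weight,
# so both terms contribute equally to the weighted total.
pde_loss, boundary_loss = 100.0, 1.0
w_pde, w_boundary = inverse_magnitude_weights([pde_loss, boundary_loss])
total = w_pde * pde_loss + w_boundary * boundary_loss
```

In a real training loop the weights would be treated as constants (no gradient flows through them) and typically smoothed over steps to avoid oscillation.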

How does EWC work in ARC?

EWC prevents catastrophic forgetting during continual learning. After finishing a task, call arc.consolidate_task() to compute Fisher Information. When training on a new task, add arc.get_ewc_loss() to penalize changes to important parameters.
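
The quantity behind this is the standard EWC penalty: (strength / 2) times the sum over parameters of Fisher information times the squared distance from the values learned on the previous task. The sketch below computes it on plain Python lists; it is not ARC's code, and the function names and the strength parameter are illustrative.

```python
def diagonal_fisher(per_sample_grads):
    """Diagonal Fisher estimate: mean squared gradient per parameter."""
    count = len(per_sample_grads)
    width = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / count for i in range(width)]


def ewc_penalty(params, anchor_params, fisher, strength=1.0):
    """(strength / 2) * sum_i fisher_i * (param_i - anchor_i) ** 2."""
    if not (len(params) == len(anchor_params) == len(fisher)):
        raise ValueError("all inputs must have the same length")
    total = sum(f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher))
    return 0.5 * strength * total


# Two parameters: the first mattered for the old task, the second did not.
fisher = diagonal_fisher([[1.0, 0.0], [3.0, 0.0]])
penalty = ewc_penalty([1.0, 5.0], [0.0, 0.0], fisher, strength=2.0)
```

Moving the second parameter costs nothing because its Fisher value is zero, while moving the first is penalized. That is how EWC protects only the parameters that were important for earlier tasks.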

What is Conformal Prediction?

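ConformalPredictor provides distribution-free coverage guarantees. Instead of a single prediction, it returns a prediction set that contains the true label with at least a specified probability (e.g., 90%). The guarantee is marginal (it holds on average over inputs) and assumes the calibration and test data are exchangeable.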
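
For intuition, here is a minimal sketch of split conformal prediction for classification, using 1 minus the predicted probability as the nonconformity score. It shows the general technique rather than ConformalPredictor's interface, which this FAQ does not document; both function names are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """k-th smallest calibration score, k = ceil((n + 1) * (1 - alpha))."""
    n = len(calibration_scores)
    # round() guards against float noise such as 9.000000000000002.
    k = math.ceil(round((n + 1) * (1 - alpha), 9))
    if k > n:
        return math.inf  # too few calibration points: include every label
    return sorted(calibration_scores)[k - 1]


def prediction_set(class_probs, threshold):
    """Labels whose nonconformity score 1 - p is within the threshold."""
    return [label for label, p in enumerate(class_probs) if 1.0 - p <= threshold]


# Calibration scores: 1 - p(true label) on held-out examples.
scores = [0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.75]
threshold = conformal_threshold(scores, alpha=0.1)   # target 90% coverage
labels = prediction_set([0.7, 0.28, 0.02], threshold)
```

Uncertain inputs produce larger sets and confident inputs produce smaller ones, so the set size itself is a usable uncertainty signal.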

How do I reduce overhead for production?

Use the low-overhead preset: config = Config.low_overhead(). This reduces activation sampling to 5%, disables curvature computation, limits MC dropout to 5 samples, and sets a 2% overhead budget.

Troubleshooting

I get "RuntimeError: Arc not attached"

Call arc.attach(model, optimizer) before calling on_epoch_end(). If you use ArcV2.auto(), attachment happens automatically, so make sure you pass your model and optimizer as arguments.

ARC is printing too many warnings

Set verbose=False when creating the Arc instance. You can also reduce sensitivity by raising the risk threshold, for example config.prediction.high_risk_threshold = 0.85.

Can I serialize and restore ARC state?

Yes. Use arc.save_state("path") and arc.load_state("path") to persist and restore the signal buffer, normalizer, and recommender state across sessions.

How do I contribute to ARC?

Fork the repo, open an issue describing your proposed change, create a feature branch, implement your changes, and open a PR. Join the Discord community for discussions.