# Sandbox for Automated Weak-to-Strong AI Alignment Research
Provides datasets, baselines, and a Claude agent that automates weak-to-strong generalization experiments, measuring how much of the weak-to-strong performance gap the transferred model recovers via PGR = (transfer_acc - weak_acc) / (strong_acc - weak_acc).
## Weak-to-Strong Generalization and the PGR Metric
Weak-to-strong generalization tackles the problem of aligning superhuman AI: a weak model is trained on labeled data and used to pseudo-label unlabeled data, then a stronger model is fine-tuned on those pseudo-labels. Success is quantified by Performance Gap Recovery (PGR) = (transfer_acc - weak_acc) / (strong_acc - weak_acc), where PGR = 0 means no improvement over the weak model and PGR = 1 means full recovery of the strong model's oracle (ceiling) performance. The setup uses three datasets (chat, math, code), each split into `test.jsonl`, `train_label.jsonl`, and `train_unlabel.jsonl`. Ground-truth labels are held server-side; agents access only unlabeled data via the API, which prevents cheating.
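The PGR definition above can be sketched as a small helper (the function name is illustrative, not part of the repo's API):

```python
def performance_gap_recovery(weak_acc: float, strong_acc: float, transfer_acc: float) -> float:
    """Fraction of the weak-to-strong accuracy gap recovered by the transferred model.

    PGR = (transfer_acc - weak_acc) / (strong_acc - weak_acc):
    0 means no gain over the weak teacher, 1 means the strong ceiling is reached.
    """
    gap = strong_acc - weak_acc
    if gap == 0:
        raise ValueError("strong_acc must differ from weak_acc for PGR to be defined")
    return (transfer_acc - weak_acc) / gap

# Example: weak 60%, strong ceiling 90%, transferred model 81%
pgr = performance_gap_recovery(0.60, 0.90, 0.81)  # recovers roughly 70% of the gap
```

Note that PGR can be negative (the transferred model underperforms the weak teacher) or exceed 1 (it beats the strong ceiling).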
## Baselines and Custom Experiments
Pre-computed baseline results are cached in `cache_results.tar.gz`. Rerun or extend these methods:
| Baseline | Technique |
|---|---|
| `vanilla_w2s` | Standard: train the strong model directly on weak pseudo-labels |
| `train_only_on_confident_labels` | Filter weak labels by weak-model confidence before training |
| `critic` | Use the strong model to critique and refine weak labels |
| `ue_zeroshot` | Unsupervised elicitation via zero-shot prompting |
| `ue_fewshot` | Few-shot in-context learning (less effective on small models such as Qwen3-4B-Base) |
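The confidence-filtering baseline boils down to keeping only the pseudo-labels the weak model is sure about. A minimal sketch, assuming each example carries a `confidence` field (e.g. the weak model's max softmax probability) and a `pseudo_label` field; both names are illustrative, not the repo's schema:

```python
def filter_confident(examples: list[dict], threshold: float = 0.9) -> list[dict]:
    """Keep only pseudo-labeled examples whose weak-model confidence
    meets the threshold; the rest are dropped before strong-model training."""
    return [ex for ex in examples if ex["confidence"] >= threshold]

pseudo_labeled = [
    {"pseudo_label": 1, "confidence": 0.97},
    {"pseudo_label": 0, "confidence": 0.42},  # low confidence, filtered out
]
train_set = filter_confident(pseudo_labeled, threshold=0.9)
```

The trade-off is the usual one: a higher threshold yields cleaner labels but a smaller (and possibly biased) training set.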
New ideas are implemented as a `run.py` built around `RunConfig`, reusing the cached weak-model predictions. Training uses Unsloth + LoRA on PyTorch/Transformers; evaluation computes PGR via the API. See Idea.md for inspiration.
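A hypothetical skeleton of such a `run.py`; the `RunConfig` fields and the step comments below are assumptions about the repo's interface, not its actual definition:

```python
from dataclasses import dataclass


@dataclass
class RunConfig:
    """Placeholder mirroring the repo's RunConfig; field names are assumed."""
    dataset: str        # one of "chat", "math", "code"
    method: str         # name of the new idea being tested
    lora_rank: int = 16


def run(cfg: RunConfig) -> None:
    # 1. Load cached weak-model pseudo-labels for cfg.dataset.
    # 2. Transform or filter them according to cfg.method.
    # 3. Fine-tune the strong model with Unsloth + LoRA (PyTorch/Transformers).
    # 4. Submit predictions to the evaluation API, which computes PGR
    #    against the server-held ground truth.
    ...


if __name__ == "__main__":
    run(RunConfig(dataset="math", method="my_new_idea"))
```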
## Automated Claude Agent Researcher
A Claude agent iteratively proposes ideas from Idea.md, implements them in `ideas/`, trains models, evaluates PGR, and shares results via the leaderboard tools. Start the Flask dashboard at http://localhost:8000 (requires `ANTHROPIC_API_KEY`). Three execution modes offer increasing isolation:
- Local (subprocess): quick debugging, but the agent can see the labels (not a legitimate run).
- Local Docker: an isolated container mounts only `data/` and read-only `cache_results/`; GPU passthrough.
- RunPod Cloud: parallel agents on cloud GPUs; the server handles deployment, retries on errors, and stores results via S3.
The project uses Python (uv / `uv.lock`), Docker, a Flask UI, the Anthropic SDK, and vLLM for inference.