# Sandbox for Automated Weak-to-Strong AI Alignment Research
Provides datasets, baselines, and a Claude agent that automates weak-to-strong generalization experiments, measuring how much of the weak-to-strong performance gap the transferred model recovers via PGR = (transfer_acc - weak_acc) / (strong_acc - weak_acc).
## Weak-to-Strong Generalization and the PGR Metric
Weak-to-strong generalization tackles the problem of aligning superhuman AI: a weak model is trained on labeled data and used to pseudo-label unlabeled data, then a stronger model is fine-tuned on those pseudo-labels. Success is quantified by Performance Gap Recovery (PGR) = (transfer_acc - weak_acc) / (strong_acc - weak_acc), where PGR = 0 means no improvement over the weak model and PGR = 1 means full recovery of the strong model's oracle (ceiling) performance. The setup uses three datasets (chat, math, code), each split into `test.jsonl`, `train_label.jsonl`, and `train_unlabel.jsonl`. Ground-truth labels are held server-side; agents access only unlabeled data via the API, which prevents cheating.
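The PGR definition above can be sketched as a small helper (the function name is illustrative, not part of the repo's API):

```python
def performance_gap_recovery(weak_acc: float, strong_acc: float, transfer_acc: float) -> float:
    """Fraction of the weak-to-strong accuracy gap recovered by the transferred model.

    PGR = (transfer_acc - weak_acc) / (strong_acc - weak_acc):
    0 means no gain over the weak teacher, 1 means the strong ceiling is reached.
    """
    gap = strong_acc - weak_acc
    if gap == 0:
        raise ValueError("strong_acc must differ from weak_acc for PGR to be defined")
    return (transfer_acc - weak_acc) / gap

# Example: weak 60%, strong ceiling 90%, transferred model 81%
pgr = performance_gap_recovery(0.60, 0.90, 0.81)  # recovers roughly 70% of the gap
```

Note that PGR can be negative (the transferred model underperforms the weak teacher) or exceed 1 (it beats the strong ceiling).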
## Baselines and Custom Experiments
Pre-computed baseline results are cached in `cache_results.tar.gz`. Rerun or extend these methods:
| Baseline | Technique |
|---|---|
| `vanilla_w2s` | Standard: train the strong model directly on weak pseudo-labels |
| `train_only_on_confident_labels` | Filter weak labels by weak-model confidence before training |
| `critic` | Use the strong model to critique and refine weak labels |
| `ue_zeroshot` | Unsupervised elicitation via zero-shot prompting |
| `ue_fewshot` | Few-shot in-context learning (less effective on small models such as Qwen3-4B-Base) |
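The confidence-filtering baseline boils down to keeping only the pseudo-labels the weak model is sure about. A minimal sketch, assuming each example carries a `confidence` field (e.g. the weak model's max softmax probability) and a `pseudo_label` field; both names are illustrative, not the repo's schema:

```python
def filter_confident(examples: list[dict], threshold: float = 0.9) -> list[dict]:
    """Keep only pseudo-labeled examples whose weak-model confidence
    meets the threshold; the rest are dropped before strong-model training."""
    return [ex for ex in examples if ex["confidence"] >= threshold]

pseudo_labeled = [
    {"pseudo_label": 1, "confidence": 0.97},
    {"pseudo_label": 0, "confidence": 0.42},  # low confidence, filtered out
]
train_set = filter_confident(pseudo_labeled, threshold=0.9)
```

The trade-off is the usual one: a higher threshold yields cleaner labels but a smaller (and possibly biased) training set.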
New ideas are implemented as a `run.py` built around `RunConfig`, reusing the cached weak-model predictions. Training uses Unsloth + LoRA on PyTorch/Transformers; evaluation computes PGR via the API. See Idea.md for inspiration.
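A hypothetical skeleton of such a `run.py`; the `RunConfig` fields and the step comments below are assumptions about the repo's interface, not its actual definition:

```python
from dataclasses import dataclass


@dataclass
class RunConfig:
    """Placeholder mirroring the repo's RunConfig; field names are assumed."""
    dataset: str        # one of "chat", "math", "code"
    method: str         # name of the new idea being tested
    lora_rank: int = 16


def run(cfg: RunConfig) -> None:
    # 1. Load cached weak-model pseudo-labels for cfg.dataset.
    # 2. Transform or filter them according to cfg.method.
    # 3. Fine-tune the strong model with Unsloth + LoRA (PyTorch/Transformers).
    # 4. Submit predictions to the evaluation API, which computes PGR
    #    against the server-held ground truth.
    ...


if __name__ == "__main__":
    run(RunConfig(dataset="math", method="my_new_idea"))
```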
## Automated Claude Agent Researcher
A Claude agent iteratively proposes ideas from Idea.md, implements them in `ideas/`, trains models, evaluates PGR, and shares results via the leaderboard tools. Start the Flask dashboard at http://localhost:8000 (requires `ANTHROPIC_API_KEY`). Three execution modes offer increasing isolation:
- Local (subprocess): quick debugging, but the agent can see the labels (not a legitimate run).
- Local Docker: an isolated container mounts only `data/` and read-only `cache_results/`; GPU passthrough.
- RunPod Cloud: parallel agents on cloud GPUs; the server handles deployment, retries on errors, and stores results via S3.
The project uses Python (uv / `uv.lock`), Docker, a Flask UI, the Anthropic SDK, and vLLM for inference.