Sandbox for Automated Weak-to-Strong AI Alignment Research

Provides datasets, baselines, and a Claude agent to automate weak-to-strong generalization experiments, measuring how much of the weak-to-strong performance gap a transfer-trained strong model recovers via PGR = (transfer_acc - weak_acc) / (strong_acc - weak_acc).

Weak-to-Strong Generalization and PGR Metric

Weak-to-strong generalization tackles the problem of aligning superhuman AI: a weak model is trained on labeled data and used to pseudo-label unlabeled data, then a stronger model is fine-tuned on those pseudo-labels. Success is quantified by Performance Gap Recovery (PGR): (transfer_acc - weak_acc) / (strong_acc - weak_acc), where PGR = 0 means no improvement over the weak model and PGR = 1 means full recovery of the strong model's oracle performance. The setup uses three datasets—chat, math, code—each split into test.jsonl, train_label.jsonl, and train_unlabel.jsonl. Ground-truth labels are held server-side; agents access only unlabeled data via an API, which prevents cheating.
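As a concrete illustration, the PGR formula above can be computed directly from the three accuracies (the function name is illustrative, not from the repo):

```python
def performance_gap_recovery(transfer_acc: float, weak_acc: float, strong_acc: float) -> float:
    """Fraction of the weak-to-strong performance gap recovered by transfer training.

    PGR = (transfer_acc - weak_acc) / (strong_acc - weak_acc)
    """
    gap = strong_acc - weak_acc
    if gap <= 0:
        raise ValueError("PGR is only defined when the strong ceiling exceeds weak accuracy")
    return (transfer_acc - weak_acc) / gap

# Example: weak model at 0.60, strong ceiling at 0.90, transfer-trained
# strong model at 0.78 recovers 0.18 / 0.30 = 60% of the gap.
pgr = performance_gap_recovery(transfer_acc=0.78, weak_acc=0.60, strong_acc=0.90)
```

A PGR above 1 is possible in principle (the transfer-trained model beating its own oracle ceiling), which is why the function does not clamp the result.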

Baselines and Custom Experiments

Pre-computed baseline results are cached in cache_results.tar.gz. Rerun or extend with these methods:

  • vanilla_w2s: train the strong model directly on weak pseudo-labels (standard baseline)
  • train_only_on_confident_labels: filter weak labels by model confidence before training
  • critic: use the strong model to critique and refine weak labels
  • ue_zeroshot: unsupervised elicitation via zero-shot prompting
  • ue_fewshot: few-shot in-context learning (less effective on small models like Qwen3-4B-Base)
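As a rough sketch of the confidence-filtering baseline (names and signature are illustrative; the repo's actual implementation lives in its own run.py), train_only_on_confident_labels keeps only pseudo-labels whose weak-model probability clears a threshold:

```python
from typing import List

def filter_confident(
    examples: List[dict],
    weak_probs: List[float],  # weak model's probability for its predicted label
    threshold: float = 0.9,
) -> List[dict]:
    """Keep only examples the weak model labels with high confidence.

    Training the strong model on this subset trades label coverage for
    label quality, which can raise PGR over vanilla weak-to-strong.
    """
    return [ex for ex, p in zip(examples, weak_probs) if p >= threshold]

examples = [{"id": 0}, {"id": 1}, {"id": 2}]
confident = filter_confident(examples, weak_probs=[0.95, 0.55, 0.99], threshold=0.9)
# examples 0 and 2 survive; example 1 is dropped as low-confidence
```

The threshold is the key hyperparameter: too high and the strong model starves for data, too low and it inherits the weak model's noise.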

New ideas are implemented as a run.py built around RunConfig, leveraging cached weak-model predictions. Training uses Unsloth + LoRA on PyTorch/Transformers; evaluation computes PGR via the API. See Idea.md for inspiration.
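A new idea's run.py might look roughly like the following. This is a hypothetical skeleton: the RunConfig field names and the run() steps are guesses assembled from the description above, not the repo's actual API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RunConfig:
    """Hypothetical stand-in for the repo's RunConfig; real fields may differ."""
    dataset: str = "math"               # one of: chat, math, code
    base_model: str = "Qwen3-4B-Base"   # small base model mentioned in the baselines
    method: str = "vanilla_w2s"
    confidence_threshold: Optional[float] = None  # used by filtering-style methods

def run(cfg: RunConfig) -> None:
    # Sketch of the expected flow:
    # 1. load cached weak-model predictions for cfg.dataset
    # 2. optionally refine or filter them according to cfg.method
    # 3. fine-tune the strong model with Unsloth + LoRA on the pseudo-labels
    # 4. submit test.jsonl predictions to the evaluation API, which returns PGR
    raise NotImplementedError("illustrative skeleton only")
```

Keeping every knob in one dataclass makes runs reproducible: the config can be logged alongside the resulting PGR on the leaderboard.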

Automated Claude Agent Researcher

A Claude agent iteratively proposes ideas from Idea.md, implements them in ideas/, trains models, evaluates PGR, and shares results via the leaderboard tools. Start the Flask dashboard at http://localhost:8000 (requires ANTHROPIC_API_KEY). Three execution modes offer increasing isolation:

  • Local (subprocess): quick debugging, but the agent can see ground-truth labels, so results are not legitimate.
  • Local Docker: an isolated container mounts only data/ and read-only cache_results/, with GPU passthrough.
  • RunPod Cloud: parallel agents on cloud GPUs; the server handles deployment, retries on errors, and stores results via S3.
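The Docker mode's isolation can be sketched as a launch command. The image name and mount paths below are illustrative, not the repo's exact invocation:

```python
from typing import List

def agent_container_cmd(workdir: str, image: str = "w2s-sandbox") -> List[str]:
    """Assemble a `docker run` invocation that mounts only what the agent
    may see: data/ with the unlabeled splits, cache_results/ read-only,
    GPUs passed through. Ground-truth labels never enter the container;
    they stay server-side behind the evaluation API.
    """
    return [
        "docker", "run", "--rm", "--gpus", "all",
        "-v", f"{workdir}/data:/workspace/data",
        "-v", f"{workdir}/cache_results:/workspace/cache_results:ro",
        image, "python", "run.py",
    ]

cmd = agent_container_cmd("/path/to/sandbox")
# Execute with subprocess.run(cmd) once Docker and a GPU are available.
```

Returning the argument list instead of running it keeps the sketch testable without Docker, and lets callers log or audit exactly what the container can access.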

The project uses Python (uv/uv.lock), Docker, a Flask UI, the Anthropic SDK, and vLLM for inference.

Summarized by x-ai/grok-4.1-fast via openrouter
