OpenAI's gpt-oss-120b/20b: Open-weight LLMs for agents
OpenAI's gpt-oss-120b and gpt-oss-20b open-weight models excel at reasoning and agentic tasks but require the harmony response format; run them via Transformers, vLLM, or Ollama with BF16 and temperature=1.0/top_p=1.0 sampling.
Core Model Specs and Requirements
OpenAI released gpt-oss-120b (120B params) and gpt-oss-20b as open-weight models optimized for reasoning, agentic workflows, and developer tasks. Download weights from Hugging Face: openai/gpt-oss-120b and openai/gpt-oss-20b. Both require the harmony response format (via the openai-harmony package or the Transformers chat template) to produce correct output; calling model.generate() directly means applying harmony formatting yourself. Use BF16 for activations; the MoE layers use MXFP4 quantization (weights stored as tensor.blocks in uint8 plus tensor.scales) for the linear projections, which lets gpt-oss-120b run on a single 80GB GPU with the Triton implementation. Recommended sampling: temperature=1.0, top_p=1.0. The models integrate browsing and python tools natively.
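To make the MXFP4 layout concrete, here is a minimal pure-Python sketch of dequantizing the MoE weights: each uint8 in `blocks` packs two 4-bit FP4 (E2M1) values, and each 32-element block shares one power-of-two (E8M0) scale from `scales`. The field names follow the checkpoint layout described above; the low-nibble-first packing order and the helper names are assumptions for illustration.

```python
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # positive FP4 magnitudes

def fp4_to_float(nibble: int) -> float:
    """Decode one 4-bit E2M1 value: top bit is sign, low 3 bits index the table."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * FP4_E2M1[nibble & 0x7]

def dequantize_mxfp4(blocks: bytes, scales: bytes) -> list[float]:
    """Expand packed FP4 weights; 16 bytes (32 values) share one scale byte."""
    out = []
    for i, byte in enumerate(blocks):
        scale = 2.0 ** (scales[i // 16] - 127)       # E8M0: biased exponent, no mantissa
        out.append(fp4_to_float(byte & 0x0F) * scale)  # low nibble first (assumed order)
        out.append(fp4_to_float(byte >> 4) * scale)    # then high nibble
    return out

# 0x31 packs 0.5 (low nibble) and 1.5 (high nibble); scale byte 128 -> 2**1 = 2.0
print(dequantize_mxfp4(bytes([0x31]), bytes([128])))  # → [1.0, 3.0]
```

The block-shared power-of-two scale is what keeps MXFP4 cheap: dequantization is a table lookup plus an exponent shift, with no per-element multiply by an arbitrary float.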
Inference Options for Production and Local Use
For high-throughput serving, use vLLM's OpenAI-compatible server (uv pip install vllm; it auto-downloads the model). Transformers applies harmony automatically through its chat template. On consumer hardware, use Ollama (ollama run gpt-oss:20b) or LM Studio (direct download). Reference implementations (not for production): PyTorch (tensor-parallel MoE; needs 4xH100 or 2xH200; upcasts weights to BF16), Triton (requires nightly builds; optimized MoE/attention kernels; use an expandable allocator to avoid OOM), and Metal (Apple Silicon; convert SafeTensors checkpoints first). Install via PyPI (pip install gpt-oss) or locally with pip install -e .[metal]. The terminal chat and Responses API servers support torch, triton, vllm, metal, ollama, and transformers backends; the Codex client works with Ollama on port 11434.
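Because the vLLM server speaks the OpenAI chat-completions protocol, any plain HTTP client can talk to it. A stdlib-only sketch of assembling such a request, assuming the server's default port 8000 and the model name used above (the helper name is hypothetical):

```python
import json
import urllib.request

def build_chat_request(prompt: str, host: str = "http://localhost:8000"):
    """Assemble an OpenAI-style chat-completions request with the recommended sampling."""
    payload = {
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,  # recommended sampling for gpt-oss
        "top_p": 1.0,
    }
    return urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("Explain MoE routing in two sentences.")
# urllib.request.urlopen(req) would send it once the vLLM server is running.
```

The same payload works against the Responses API servers listed above wherever they expose a chat-completions-compatible route; only the base URL changes.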
Agentic Tools and Harmony Integration
Embed tools in the system prompt via harmony (with_browser_tool(), with_python(), with_tools()). The browser tool (ExaBackend or YouComBackend) provides search/open/find over scrollable 50+20-line windows with caching and citations; create a new instance per request, and treat it as educational only. The python tool (a stateless override) runs code in a permissive Docker container for CoT calculations; add restrictions in production. The apply_patch tool creates, updates, and deletes local files. The harmony library (github.com/openai/harmony) standardizes the chat format; see cookbook.openai.com for Transformers/vLLM/Ollama guides, and awesome-gpt-oss.md for community resources.
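The browser tool's scrollable window can be pictured as simple pagination over a list of lines. A minimal sketch, assuming "50+20" means roughly 50 lines on first open and 20 more per scroll step; the function name and exact numbers are hypothetical, not the tool's real API:

```python
def page_window(lines: list[str], cursor: int = 0, first: bool = True) -> tuple[list[str], int]:
    """Return the currently visible window and the cursor for the next scroll."""
    size = 50 if first else 20          # assumed: 50 lines on open, 20 per scroll
    window = lines[cursor:cursor + size]
    return window, cursor + len(window)

doc = [f"line {i}" for i in range(100)]
view, cur = page_window(doc)                      # open: first 50 lines
view2, cur = page_window(doc, cur, first=False)   # scroll: next 20 lines
```

Windowing like this is why the tool can cache pages and attach stable citations: each view is addressed by its line range rather than by raw offsets into the fetched page.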