Run GPT-OSS-20B in Colab with Quantized Inference & Tools

Load OpenAI's 20B open-weight GPT-OSS model in Colab using MXFP4 quantization and torch.bfloat16 (needs 16GB+ VRAM), then implement reasoning controls, JSON schemas, multi-turn chat, streaming, tool calling, and batch processing for production-like workflows.

Precise Model Loading for Local Open-Weight Execution

To run GPT-OSS-20B (~40GB download), install transformers>=4.51.0, accelerate, sentencepiece, protobuf, huggingface_hub, gradio, ipywidgets, and openai-harmony. Verify a T4/A100 GPU with 16GB+ VRAM via torch.cuda.get_device_properties(0).total_memory / 1e9; free Colab T4s often fall short, so upgrade to Colab Pro if needed. Load with AutoModelForCausalLM.from_pretrained('openai/gpt-oss-20b', torch_dtype=torch.bfloat16, device_map='auto', trust_remote_code=True) plus the matching AutoTokenizer for native MXFP4 quantization, which keeps the loaded model at ~16GB VRAM. Wrap both in pipeline('text-generation') with pad_token_id=tokenizer.eos_token_id. OpenAI recommends temperature=1.0 and top_p=1.0; tune temperature lower (0.7-0.8) for more consistent outputs. This setup exposes full controllability absent in closed APIs, trading latency for transparency.
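A condensed version of that setup, as a minimal sketch (download time and VRAM headroom depend on your Colab runtime):

    # pip install "transformers>=4.51.0" accelerate sentencepiece protobuf huggingface_hub gradio ipywidgets openai-harmony
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

    # Check that the runtime GPU has enough memory (roughly 16 GB or more).
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1e9:.1f} GB")

    model_id = "openai/gpt-oss-20b"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # bf16 activations; MXFP4 weights load natively where supported
        device_map="auto",           # place layers on the available GPU(s) automatically
        trust_remote_code=True,
    )

    # Reusable generation pipeline; padding with EOS silences tokenizer warnings.
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        pad_token_id=tokenizer.eos_token_id,
    )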

Basic generation: Format as chat messages [{'role': 'user', 'content': '...'}], call pipe(messages, max_new_tokens=256, do_sample=True, temperature=0.8, top_p=1.0); extract output[0]['generated_text'][-1]['content']. Handles Q&A, code gen, creative tasks reliably.
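For example, a minimal sketch assuming the pipe built above (the prompt is illustrative):

    messages = [{"role": "user", "content": "Explain quicksort in two sentences."}]

    output = pipe(
        messages,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.8,
        top_p=1.0,
    )

    # The pipeline returns the whole chat transcript; the assistant reply is the last message.
    print(output[0]["generated_text"][-1]["content"])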

Adjustable Reasoning and Structured Outputs

Control reasoning depth via a ReasoningEffortController with three configs (see the sketch after this list):

  • Low: 'Be concise', max_tokens=200, temp=0.7 → fast facts.
  • Medium: 'Think step-by-step', max_tokens=400, temp=0.8 → balanced.
  • High: 'Analyze thoroughly, chain-of-thought', max_tokens=800, temp=1.0 → deep logic (e.g., puzzles).
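A minimal sketch of such a controller, assuming the pipe built earlier (the exact prompt wording and class internals in the source may differ):

    EFFORT_CONFIGS = {
        "low":    {"system": "Be concise.", "max_new_tokens": 200, "temperature": 0.7},
        "medium": {"system": "Think step-by-step.", "max_new_tokens": 400, "temperature": 0.8},
        "high":   {"system": "Analyze thoroughly, using chain-of-thought reasoning.",
                   "max_new_tokens": 800, "temperature": 1.0},
    }

    class ReasoningEffortController:
        """Prepends an effort-specific system prompt and applies matching sampling limits."""

        def __init__(self, pipe):
            self.pipe = pipe

        def generate(self, prompt: str, effort: str = "medium") -> str:
            cfg = EFFORT_CONFIGS[effort]
            messages = [
                {"role": "system", "content": cfg["system"]},
                {"role": "user", "content": prompt},
            ]
            out = self.pipe(
                messages,
                max_new_tokens=cfg["max_new_tokens"],
                do_sample=True,
                temperature=cfg["temperature"],
                top_p=1.0,
            )
            return out[0]["generated_text"][-1]["content"]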

Prepend the effort-specific system prompt to the messages; higher effort boosts accuracy on complex reasoning but increases token usage and latency. For JSON, use a StructuredOutputGenerator: feed a schema (e.g., {'name': 'string', 'prep_time_minutes': 'integer', ...}) into a strict system prompt ('Output ONLY valid JSON, no markdown'). Strip stray markdown fences via regex (re.sub(r'^```(?:json)?\s*', '', text)), parse with json.loads, and retry up to 2x on failure, feeding the parse error back to the model. temperature=0.3 improves conformity; this handles entity extraction and recipe generation. Trade-off: retries add latency but reach 90%+ validity vs. raw prompting.
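A function-style sketch of that clean-parse-retry loop (the source wraps it in a StructuredOutputGenerator class; the generate_json name and exact prompts are illustrative):

    import json
    import re

    def generate_json(pipe, task: str, schema: dict, max_retries: int = 2) -> dict:
        """Prompt for schema-conformant JSON, strip fences, parse, and retry on failure."""
        system = (
            "Output ONLY valid JSON matching this schema, no markdown, no commentary:\n"
            + json.dumps(schema)
        )
        messages = [
            {"role": "system", "content": system},
            {"role": "user", "content": task},
        ]
        for attempt in range(max_retries + 1):
            out = pipe(messages, max_new_tokens=400, do_sample=True,
                       temperature=0.3, top_p=1.0)
            text = out[0]["generated_text"][-1]["content"]
            # Remove markdown fences the model sometimes wraps around JSON.
            text = re.sub(r"^```(?:json)?\s*", "", text.strip())
            text = re.sub(r"\s*```$", "", text)
            try:
                return json.loads(text)
            except json.JSONDecodeError as err:
                # Feed the parse error back so the retry can self-correct.
                messages.append({"role": "assistant", "content": text})
                messages.append({"role": "user",
                                 "content": f"Invalid JSON ({err}). Output ONLY corrected JSON."})
        raise ValueError("No valid JSON after retries")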

Stateful Interactions, Streaming, Tools, and Batch Efficiency

ConversationManager persists history: append user/assistant pairs to self.history and prepend the system prompt plus history to each pipe call (max_tokens=300, temp=0.8). It tracks turn count (len(history)//2) and can render short previews of the exchange. Context is maintained (e.g., recalling a user's name and field across 4 turns) without token explosion.
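A minimal sketch of the manager, with illustrative internals (the source's preview helper is omitted):

    class ConversationManager:
        """Keeps multi-turn history and replays it with a system prompt on each call."""

        def __init__(self, pipe, system_prompt: str = "You are a helpful assistant."):
            self.pipe = pipe
            self.system_prompt = system_prompt
            self.history = []  # alternating user/assistant messages

        @property
        def turns(self) -> int:
            return len(self.history) // 2

        def chat(self, user_message: str) -> str:
            messages = ([{"role": "system", "content": self.system_prompt}]
                        + self.history
                        + [{"role": "user", "content": user_message}])
            out = self.pipe(messages, max_new_tokens=300, do_sample=True,
                            temperature=0.8, top_p=1.0)
            reply = out[0]["generated_text"][-1]["content"]
            self.history += [{"role": "user", "content": user_message},
                             {"role": "assistant", "content": reply}]
            return reply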

Streaming: TextIteratorStreamer(tokenizer, skip_prompt=True) plus a threaded model.generate(inputs, streamer=streamer, max_new_tokens=200) yields tokens live (for token in streamer: print(token)), revealing decoding pace in real time; ideal for UX or debugging.
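A sketch of the streaming loop, assuming the model and tokenizer loaded earlier:

    from threading import Thread
    from transformers import TextIteratorStreamer

    def stream_reply(model, tokenizer, prompt: str, max_new_tokens: int = 200):
        """Generate in a background thread and print tokens as they arrive."""
        messages = [{"role": "user", "content": prompt}]
        inputs = tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, return_tensors="pt"
        ).to(model.device)

        streamer = TextIteratorStreamer(tokenizer, skip_prompt=True,
                                        skip_special_tokens=True)
        thread = Thread(target=model.generate,
                        kwargs={"input_ids": inputs,
                                "streamer": streamer,
                                "max_new_tokens": max_new_tokens})
        thread.start()
        for token in streamer:  # yields decoded text chunks as they are produced
            print(token, end="", flush=True)
        thread.join()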

Tools via ToolExecutor: register functions with a decorator (e.g., a safe-eval calculator with a math whitelist, datetime.now(), simulated weather/search). The system prompt lists the available tools; the model outputs 'TOOL: name\nARGS: {...}', which is parsed and executed, with the result fed back ('Tool result: ... Now final answer.') before regenerating. Handles math (15*23+7), time, and lookup queries; this simulates production agent loops.
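A condensed sketch of that register-parse-execute-feed-back loop; the regex and prompt wording here are assumptions, and only a trivial clock tool is registered (the source also registers a safe-eval calculator and simulated weather/search):

    import json
    import re
    from datetime import datetime

    class ToolExecutor:
        """Registers callables as tools and runs the TOOL/ARGS protocol described above."""

        def __init__(self, pipe):
            self.pipe = pipe
            self.tools = {}

        def register(self, fn):
            self.tools[fn.__name__] = fn
            return fn

        def run(self, question: str) -> str:
            system = (f"You can call tools: {', '.join(self.tools)}. "
                      "To call one, reply exactly:\nTOOL: <name>\nARGS: <json object>")
            messages = [{"role": "system", "content": system},
                        {"role": "user", "content": question}]
            out = self.pipe(messages, max_new_tokens=200, do_sample=True, temperature=0.7)
            reply = out[0]["generated_text"][-1]["content"]

            match = re.search(r"TOOL:\s*(\w+)\s*ARGS:\s*(\{.*\})", reply, re.DOTALL)
            if not match:
                return reply  # model answered directly without a tool
            name, args = match.group(1), json.loads(match.group(2))
            result = self.tools[name](**args)

            # Feed the tool result back and regenerate the final answer.
            messages += [{"role": "assistant", "content": reply},
                         {"role": "user",
                          "content": f"Tool result: {result}. Now give the final answer."}]
            out = self.pipe(messages, max_new_tokens=200, do_sample=True, temperature=0.7)
            return out[0]["generated_text"][-1]["content"]

    executor = ToolExecutor(pipe)

    @executor.register
    def current_time() -> str:
        return datetime.now().isoformat()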

Batch: batch_generate(prompts, batch_size=2) processes lists (e.g., 5 Q&A prompts) in chunks via a single batched pipe([messages1, messages2]) call with max_tokens=100, temp=0.7, cutting overhead 2x+ vs. serial generation for throughput testing.
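A minimal sketch of the chunked batching (assumes pad_token_id was set when building the pipeline, as above; actual speedup depends on GPU utilization):

    def batch_generate(pipe, prompts, batch_size: int = 2, max_new_tokens: int = 100):
        """Process prompts in chunks, passing a list of chats to the pipeline at once."""
        results = []
        for i in range(0, len(prompts), batch_size):
            chunk = [[{"role": "user", "content": p}] for p in prompts[i:i + batch_size]]
            outputs = pipe(chunk, max_new_tokens=max_new_tokens,
                           do_sample=True, temperature=0.7, batch_size=batch_size)
            # List input returns one result list per prompt; take each assistant reply.
            results += [o[0]["generated_text"][-1]["content"] for o in outputs]
        return results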

These patterns turn GPT-OSS into a flexible local stack: memory use stays under 16GB post-load; scale via batching, control via params and prompts. Unlike hosted APIs, there are no rate limits and everything is inspectable, but you manage VRAM and hosting yourself.
