Build Multimodal Qwen 3.6 Agents with Thinking & Tools

This tutorial builds a complete Qwen 3.6-35B-A3B framework: adaptive loading, thinking control, streaming, vision, agents, RAG, and MoE inspection, ready for production prototyping on a Colab A100.

GPU-Adaptive Loading for Efficient Multimodal Inference

Start by probing your GPU's VRAM to pick the optimal quantization: bf16 for 75+ GB, int8 for 40+ GB, int4 otherwise. This ensures the 35B MoE model (3B active params) fits without OOM errors. Install transformers>=4.48, accelerate, bitsandbytes, qwen-vl-utils, and sentence-transformers. Prefer flash_attention_2 if available, falling back to sdpa.

import importlib.util
import torch
from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

VRAM_GB = torch.cuda.get_device_properties(0).total_memory / 1e9  # probe available VRAM
ATTN_IMPL = "flash_attention_2" if importlib.util.find_spec("flash_attn") else "sdpa"
if VRAM_GB >= 75: LOAD_MODE = "bf16"
elif VRAM_GB >= 40: LOAD_MODE = "int8"
else: LOAD_MODE = "int4"
kwargs = dict(device_map="auto", trust_remote_code=True, attn_implementation=ATTN_IMPL, torch_dtype=torch.bfloat16)
if LOAD_MODE == "int8":
    kwargs["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)
elif LOAD_MODE == "int4":
    kwargs["quantization_config"] = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True)
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, **kwargs)

Principle: Dynamic loading maximizes hardware utilization—int4 on L4 GPUs still delivers coherent multimodal output. Common mistake: Fixed quantization ignores VRAM variance, crashing on smaller GPUs. Test on A100: ~70GB download, loads in ~60s, uses 20-40GB VRAM depending on mode.

Sampling presets tune behavior: thinking_general (temp=1.0, top_p=0.95) for open reasoning, thinking_coding (temp=0.6) for precise code. Thinking tags <think>...</think> enable inspectable chain-of-thought.
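
A minimal sketch of these presets as generate() kwargs; the preset names match the summary, while do_sample and the coding preset's top_p are assumed:

# Hypothetical preset table; temperature/top_p values are the ones quoted above.
SAMPLING = {
    "thinking_general": dict(do_sample=True, temperature=1.0, top_p=0.95),
    "thinking_coding": dict(do_sample=True, temperature=0.6, top_p=0.95),  # top_p assumed
}
out = model.generate(**inputs, max_new_tokens=1024, **SAMPLING["thinking_coding"])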

"GPU: A100 | VRAM: 80.0 GB | CUDA 12.1 | torch 2.4.0" — auto-detects setup for reproducibility.

QwenChat: Session-Persistent Framework with Thinking Split

Core class manages history, tools, chat templates. Methods: user(content), assistant(content, reasoning), tool_result(name, result). _inputs() applies template with enable_thinking=True for reasoning blocks.
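
A minimal skeleton under those names, assuming a globally loaded model and tokenizer; the bodies sketch the described behavior rather than the tutorial's exact code:

class QwenChat:
    def __init__(self, system: str = "", tools: list | None = None):
        self.history = [{"role": "system", "content": system}] if system else []
        self.tools = tools or []

    def user(self, content):
        self.history.append({"role": "user", "content": content})

    def assistant(self, content, reasoning: str = ""):
        self.history.append({"role": "assistant", "content": content,
                             "reasoning_content": reasoning})

    def tool_result(self, name: str, result):
        self.history.append({"role": "tool", "name": name, "content": str(result)})

    def _inputs(self):
        # Render history through the chat template; enable_thinking keeps <think> blocks.
        text = tokenizer.apply_chat_template(
            self.history, tools=self.tools or None,
            add_generation_prompt=True, enable_thinking=True, tokenize=False)
        return tokenizer(text, return_tensors="pt").to(model.device)

    def generate(self, **gen_kw):
        ids = self._inputs()
        out = model.generate(**ids, max_new_tokens=2048, **gen_kw)
        raw = tokenizer.decode(out[0, ids["input_ids"].shape[1]:], skip_special_tokens=True)
        think, answer = split_thinking(raw)  # helper defined below
        self.assistant(answer, reasoning=think)
        return think, answer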

generate() produces a think/answer split: split_thinking(raw) extracts the <think> content, and the turn is appended to history automatically. stream() runs generation in a background thread with on_thinking/on_answer callbacks for real-time UIs: it buffers until </think>, then switches (see the streaming sketch after the code below).

THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

def split_thinking(text: str):
    # Return (thinking, answer); thinking is empty if no complete <think> block exists.
    if THINK_OPEN in text and THINK_CLOSE in text:
        a = text.index(THINK_OPEN) + len(THINK_OPEN)
        b = text.index(THINK_CLOSE)
        return text[a:b].strip(), text[b + len(THINK_CLOSE):].strip()
    return "", text.strip()

Save/load JSON for persistence. Principle: Separating reasoning from answers prevents dilution in multi-turn chats; preserve_thinking=True carries prior thinking forward for agents. Mistake: ignoring the split leads to garbled streams; always parse the tags.

Quality criteria: Thinking ~100-200 tokens for complex tasks; answers concise post-</think>.

"Loaded in 62s | VRAM used: 28.4 GB" — efficient even quantized.

Thinking Budget, Tool Agents, and Structured Outputs

ThinkingBudget(StoppingCriteria) caps reasoning tokens after <think>: it stops generation if the budget is exceeded before </think> appears. Example: the frog-in-a-well puzzle with budget=150 tokens; the model simulates the days without looping endlessly.
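
A sketch of the criterion as a transformers StoppingCriteria, assuming the QwenChat skeleton and tag constants from earlier; counting all post-prompt tokens is a simplification of "tokens after <think>":

import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class ThinkingBudget(StoppingCriteria):
    def __init__(self, tokenizer, prompt_len: int, budget: int = 150):
        self.tok, self.prompt_len, self.budget = tokenizer, prompt_len, budget

    def __call__(self, input_ids, scores, **kwargs) -> bool:
        new = self.tok.decode(input_ids[0, self.prompt_len:])
        # Stop once we are past the budget and still inside an unclosed <think> block.
        over = input_ids.shape[1] - self.prompt_len > self.budget
        return over and THINK_OPEN in new and THINK_CLOSE not in new

ids = chat._inputs()
out = model.generate(**ids, max_new_tokens=512,
                     stopping_criteria=StoppingCriteriaList(
                         [ThinkingBudget(tokenizer, ids["input_ids"].shape[1], budget=150)]))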

Agent loop: run_agent(user_msg, max_steps=5) generates with tools, parses <tool_call>{json}</tool_call>, executes the call (calc, search_docs, get_time), and feeds results back. Schema-defined tools enable function calling; a loop sketch follows the snippet below.

import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.S)
TOOLS_SCHEMA = [{"type": "function", "function": {"name": "calculate", ...}}]
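
A sketch of the loop, assuming the QwenChat skeleton above plus a hypothetical TOOL_REGISTRY mapping tool names to Python callables:

import json

TOOL_REGISTRY = {  # hypothetical name -> callable map for the tutorial's three tools
    "calculate": calc, "search_docs": search_docs, "get_time": get_time,
}

def run_agent(user_msg: str, max_steps: int = 5) -> str:
    chat = QwenChat(tools=TOOLS_SCHEMA)
    chat.user(user_msg)
    answer = ""
    for _ in range(max_steps):
        _, answer = chat.generate()
        calls = TOOL_CALL_RE.findall(answer)
        if not calls:  # no tool requested: this is the final answer
            return answer
        for raw in calls:
            call = json.loads(raw)
            result = TOOL_REGISTRY[call["name"]](**call.get("arguments", {}))
            chat.tool_result(call["name"], result)
    return answer  # max_steps reached: return the last visible text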

JSON extraction: json_with_retry(prompt, schema) strips fences, parses the balanced {...}, validates against a JSON Schema, and retries up to 3x with error feedback, handling malformed output reliably (sketch after the prompt below).

"You reply with ONLY a single JSON object matching the user's schema. No markdown fences. No blocks." — strict system prompt ensures purity.

Principle: Budgets prevent verbose reasoning explosions; retries fix hallucinated JSON. For the Inception movie example, 1-2 tries yield a valid {"title":"Inception","year":2010,...}.

MoE Inspection, Benchmarks, RAG, and Vision Handling

Hook the routers (256 experts, top-8 plus shared): forward hooks count activations per token. A short prompt activates ~50 distinct experts unevenly, revealing the routing dynamics.
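
A sketch of the counting hooks, assuming the router linears are named mlp.gate as in Qwen's MoE blocks:

import torch
from collections import Counter

expert_counts = Counter()

def count_experts(module, args, output):
    # Router logits have shape [..., num_experts]; tally the top-8 routed experts.
    top = torch.topk(output.float(), k=8, dim=-1).indices
    expert_counts.update(top.flatten().tolist())

handles = [m.register_forward_hook(count_experts)
           for name, m in model.named_modules() if name.endswith("mlp.gate")]
model.generate(**tokenizer("Ping", return_tensors="pt").to(model.device), max_new_tokens=16)
for h in handles:
    h.remove()
print("distinct experts activated:", len(expert_counts))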

Benchmark: batch sizes 1-4, measuring tok/s (e.g., 150+ tok/s at batch=1) and peak VRAM; the CUDA cache is reset between runs.
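
A sketch of the benchmark body; the tok/s figure assumes every sequence generates the full max_new_tokens:

import time
import torch

def bench(prompts: list[str], max_new_tokens: int = 128):
    tokenizer.padding_side = "left"  # decoder-only models need left padding for batching
    ids = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    t0 = time.perf_counter()
    out = model.generate(**ids, max_new_tokens=max_new_tokens, do_sample=False)
    dt = time.perf_counter() - t0
    # Approximate: counts generated positions times batch size.
    tok_s = (out.shape[1] - ids["input_ids"].shape[1]) * len(prompts) / dt
    print(f"batch={len(prompts)} | {tok_s:.1f} tok/s | "
          f"peak VRAM {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")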

Mini-RAG: SentenceTransformer('all-MiniLM-L6-v2') embeds the KB facts; retrieval is top-k by cosine similarity. rag_answer() injects the retrieved context and instructs a grounded response.
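
A sketch of the retrieval path with a hypothetical KB list, reusing the QwenChat skeleton for the grounded generation:

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")
KB = ["fact one ...", "fact two ...", "fact three ..."]  # hypothetical knowledge base
kb_emb = embedder.encode(KB, convert_to_tensor=True, normalize_embeddings=True)

def rag_answer(question: str, k: int = 3) -> str:
    q_emb = embedder.encode(question, convert_to_tensor=True, normalize_embeddings=True)
    top = util.cos_sim(q_emb, kb_emb)[0].topk(min(k, len(KB))).indices.tolist()
    context = "\n".join(KB[i] for i in top)
    chat = QwenChat(system="Answer using ONLY the provided context.")
    chat.user(f"Context:\n{context}\n\nQuestion: {question}")
    return chat.generate()[1]  # visible answer; thinking is discarded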

Vision: Multimodal content mixes {"type":"image","image":"url"} and {"type":"text","text":"prompt"} entries. Handles math figures and object grounding (bounding-box JSON):

c.history.append({"role": "user", "content": [
    {"type": "image", "image": IMG},
    {"type": "text", "text": "Locate every distinct object..."},
]})

YaRN override for 1M context: set rope_scaling to rope_type "yarn" with factor=4.0 in the config, then reload the model with it.
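
A sketch of that reload; the original_max_position_embeddings field is an assumption derived from factor 4.0 targeting ~1M tokens:

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(MODEL_ID, trust_remote_code=True)
cfg.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 262144,  # assumed native length: 4x -> ~1M
}
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, config=cfg, **kwargs)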

Principle: Inspect the MoE for debugging (e.g., expert skew); the RAG uses lightweight embeddings instead of heavy rerankers for prototyping. Mistake: batching without padding_side='left' on a decoder-only model degrades both output quality and throughput.

"distinct experts activated: 48" — hands-on routing visibility.

Key Takeaways

  • Probe VRAM and auto-quantize: int4 for <40GB GPUs keeps multimodal coherent.
  • Use <think>...</think> + split for inspectable reasoning; stream separately for UIs.
  • Implement ThinkingBudget to cap tokens—avoids infinite loops in puzzles/agents.
  • Agent loop: Parse tool_calls regex, execute schema tools, max_steps=5 prevents drift.
  • JSON retry + jsonschema: Reliable structured output even from creative models.
  • Hook MoE gates to count expert fires—debug routing imbalances.
  • Mini-RAG with MiniLM: Embed KB, cosine retrieve top-3 for grounded answers.
  • Vision prompts: List content with image URLs—native without extra VL utils.
  • Benchmark batches with left-padding: Quantify tok/s before scaling.
  • Persist sessions via JSON: Enables long-running prototypes across restarts.
