Uncensored SuperGemma-4 Powers Local Agent Workflows
SuperGemma-4 uncensors Gemma 4 26B for text, coding, tool-use, and planning; runs on Apple Silicon via MLX (24GB+ RAM, 46.2 t/s) or GGUF (16.8GB); integrates with Hermes and OpenClaw for uncensored local agents.
Uncensored Fine-Tune Enhances Gemma 4 for Practical Agents
SuperGemma-4 refines Google's Gemma 4 26B A4B (instruction-tuned, 256K context, native system prompts, function calling, 3.8B active MoE params) into an uncensored variant optimized for text, coding, planning, tool use, browser tasks, and logic, avoiding chaotic role-play while staying useful. Creator Jun Song's MLX 4-bit v2 claims a QuickBench score of 95.8 (vs. 91.4 baseline) and 46.2 tokens/second (vs. 42.5), with gains in code, logic, Korean, and browser workflows. The embedded chat templates are neutral, preventing prompt drift into unwanted coding or tool modes and keeping chat and agent behavior clean; stick with them, as manual chat-template overrides can corrupt responses.
This balance makes it ideal for local power users needing permissive models that retain agent-ready architecture, outperforming stock Gemma 4 in unfiltered workflows without sacrificing reasoning.
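
One quick way to sanity-check the embedded template before wiring anything up is to print it with the Hugging Face tokenizer API; a sketch, with the repo name taken from the release:

  python -c "from transformers import AutoTokenizer; \
    tok = AutoTokenizer.from_pretrained('jun-song/super-gemma-4-26b-mlx-4bit-v2'); \
    print(tok.chat_template)"

If the output is a plain conversational template with no forced tool or code scaffolding, the neutrality claim holds for your download.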
Apple Silicon Setup Requires 24GB+ RAM for Smooth Inference
On Macs, install MLX-LM with pip install -U mlx-lm, then launch the OpenAI-compatible server via mlx_lm.server --model jun-song/super-gemma-4-26b-mlx-4bit-v2 --port 8080, letting it auto-detect the chat template. Test with mlx_lm.generate and a short prompt at --max-tokens 512. Plan on 24GB of unified memory at minimum, 32GB+ preferred; tune wired memory via sysctl if needed. For broader compatibility, the GGUF variant is 16.8GB at Q4_K_M quantization.
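
A minimal sketch of the full sequence, assuming current mlx-lm CLI flags and treating the sysctl value as a machine-specific placeholder:

  # Install and serve (the server auto-detects the embedded chat template)
  pip install -U mlx-lm
  mlx_lm.server --model jun-song/super-gemma-4-26b-mlx-4bit-v2 --port 8080

  # Smoke test without the server
  mlx_lm.generate --model jun-song/super-gemma-4-26b-mlx-4bit-v2 \
    --prompt "Outline a three-step plan to refactor a CLI tool." --max-tokens 512

  # Optional: raise the GPU wired-memory limit (value in MB; size to your machine)
  sudo sysctl iogpu.wired_limit_mb=26624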
These steps yield fast, local inference leveraging Gemma's MoE efficiency, enabling seamless tool integration without cloud dependency.
Pair with Hermes or OpenClaw for Terminal and Assistant Agents
Connect the MLX or GGUF server (both OpenAI-compatible) to Hermes Agent for terminal-first workflows with tools, memory, MCP, and messaging: select the custom OpenAI endpoint option and enter the local URL and model name. Hermes leverages Gemma's native function calling, so agent behavior comes naturally rather than through forced adaptations.
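
Under the hood, both agent shells speak the standard OpenAI chat-completions protocol. A hedged sketch of a raw tool-call request against the local server (get_weather is a hypothetical placeholder, and tool passthrough assumes the server forwards the tools array to the chat template):

  curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "jun-song/super-gemma-4-26b-mlx-4bit-v2",
      "messages": [{"role": "user", "content": "What is the weather in Seoul?"}],
      "tools": [{
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Look up current weather for a city",
          "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
          }
        }
      }]
    }'

A model with native function calling should respond with a structured tool_calls entry rather than free-text JSON.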
For multi-channel assistants, route OpenClaw to the local endpoint as its reasoning model, supporting automation and task-running. The GGUF build works identically via llama.cpp, LM Studio, Jan, or Open WebUI servers.
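
On the GGUF side, a minimal llama.cpp sketch (the filename is assumed from the Q4_K_M build; context size and GPU-layer count are placeholders to size per machine) exposes the same OpenAI-compatible API:

  # Serve the Q4_K_M GGUF; llama-server uses the model's embedded chat
  # template by default, so no --chat-template override is needed (or wise)
  llama-server -m super-gemma-4-26b-Q4_K_M.gguf --port 8080 -c 16384 -ngl 99

Point Hermes or OpenClaw at http://localhost:8080/v1 exactly as with the MLX server.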
This stack delivers uncensored, production-like local agents: Gemma base + permissive fine-tune + agent shells, practical for coding/planning without refusals.