Super Gemma 4: Uncensored Local Agent Booster
A community fine-tune of Gemma 4 26B removes censorship while improving performance (QuickBench 95.8 vs 91.4 baseline; 46.2 t/s vs 42.5) on agent tasks like coding and tool use, optimized for MLX on Apple Silicon or GGUF elsewhere.
Uncensored Fine-Tune Enhances Gemma 4 for Practical Agent Work
Super Gemma 4 26B builds on Google's Gemma 4 26B A4B base, which activates only 3.8B of its 25B parameters during inference and supports native system prompts, function calling, and a 256K context window. This community version by Jun Song removes restrictions without sacrificing utility, targeting text-only tasks such as coding, logic, tool use, browser workflows, and planning. Benchmarks put QuickBench overall at 95.8 (vs 91.4 for the baseline) and generation at 46.2 tokens/second (vs 42.5), with gains in code, logic, Korean, and browser tasks. Unlike more chaotic uncensored models, it stays practical for agent shells, avoiding refusals while maintaining reasoning.
MLX Setup Unlocks Fast Apple Silicon Inference
On Macs, install MLX-LM with pip install -U mlx-lm, then launch the server: mlx_lm.server --model jun-song/super-gemma-4-26b-it-mlx-4bit-v2 --port 8080. Let MLX auto-detect the bundled chat template; forcing one manually corrupts responses. Verify generation with mlx_lm.generate --model jun-song/super-gemma-4-26b-it-mlx-4bit-v2 --prompt "test" --max-tokens 512. The server exposes an OpenAI-compatible endpoint at localhost:8080, enabling seamless integration without custom hacks, as the client sketch below shows.
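A minimal client sketch, assuming the server above is running on port 8080 and the openai Python package is installed; the api_key value is a placeholder, since local OpenAI-compatible servers typically ignore it.

    from openai import OpenAI

    # Point the standard OpenAI client at the local MLX server.
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="jun-song/super-gemma-4-26b-it-mlx-4bit-v2",
        messages=[
            {"role": "system", "content": "You are a concise coding assistant."},
            {"role": "user", "content": "Reverse a string in one line of Python."},
        ],
        max_tokens=256,
    )
    print(resp.choices[0].message.content)

Because the endpoint speaks the standard chat-completions protocol, any client or agent framework that accepts a custom base URL works unchanged.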
Agent Integrations Leverage Native Capabilities
Pair it with the Hermes agent by selecting a custom OpenAI provider, pointing it to the MLX endpoint, and choosing the Super Gemma model; its native function calling aligns well with terminal-based tools, memory, MCP, and messaging. For Open Claw personal assistants, configure the custom OpenAI provider the same way, raising the wired-memory limit via sysctl if needed. Both benefit from the model's agent-ready design, turning local uncensored inference into production-like workflows without cloud dependency. A tool-calling sketch follows.
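A hedged sketch of function calling through the same endpoint, issued the way an agent framework would; the read_file tool here is hypothetical, invented purely to illustrate the request shape.

    import json
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

    # Hypothetical tool definition; a real agent registers its own tools.
    tools = [{
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a text file from the local workspace.",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="jun-song/super-gemma-4-26b-it-mlx-4bit-v2",
        messages=[{"role": "user", "content": "Open notes.txt and summarize it."}],
        tools=tools,
    )

    # If the model decided to call the tool, arguments arrive as a JSON string.
    for call in resp.choices[0].message.tool_calls or []:
        print(call.function.name, json.loads(call.function.arguments))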
GGUF Variant Extends to Non-Mac Ecosystems
For Windows or Linux, use the Q4_K_M GGUF (16.8 GB) with llama.cpp, LM Studio, Jan, or Open Web UI. It embeds a neutral chat template to keep prompts from drifting into stray code or erratic tool calls, ensuring clean chat. Serve it through an OpenAI-compatible interface for Hermes/Open Claw compatibility, broadening access beyond Apple Silicon while preserving speed and uncensored utility. A loading sketch follows.
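A minimal sketch for loading the GGUF directly with the llama-cpp-python bindings; the filename is illustrative rather than a confirmed release artifact, and context size should be tuned to available RAM.

    from llama_cpp import Llama

    # Illustrative filename; substitute the actual Q4_K_M download.
    llm = Llama(
        model_path="super-gemma-4-26b-it-Q4_K_M.gguf",
        n_ctx=8192,       # raise toward the 256K maximum only if memory allows
        n_gpu_layers=-1,  # offload every layer when a GPU backend is available
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Plan a three-step browser workflow."}],
        max_tokens=256,
    )
    print(out["choices"][0]["message"]["content"])

For agent use, llama.cpp's built-in server offers the same OpenAI-compatible interface as the MLX path, so the earlier client sketches apply here too.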