Architecture and Efficiency

North Mini Code is a decoder-only Transformer built on a sparse Mixture-of-Experts (MoE) architecture. Of its 30 billion total parameters, only 3 billion are active for any given token, because a router selects 8 of the model's 128 experts per token. Its attention layers interleave sliding-window attention (using RoPE) with global attention in a 3:1 ratio. The design targets a single H100 GPU at FP8 precision, which makes sovereign AI deployments possible without large compute clusters.
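
The routing and layer interleaving described above can be illustrated with a small sketch. It is not the model's actual implementation: the softmax over the selected logits, the function names route_token and layer_pattern, and the ordering of three sliding-window layers followed by one global layer are all assumptions made for illustration. Only the expert count (128), the number of experts per token (8), and the 3:1 ratio come from the text.

```python
import math
import random

NUM_EXPERTS = 128  # from the text
TOP_K = 8          # from the text


def route_token(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts and normalise their weights.

    Assumption: weights are a softmax over the selected logits only.
    """
    if len(router_logits) < k:
        raise ValueError("need at least k router logits")
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Subtract the peak before exponentiating for numerical stability.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def layer_pattern(num_layers, local_per_global=3):
    """Assumed ordering: local_per_global sliding-window layers, then one global."""
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "sliding"
            for i in range(num_layers)]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    for expert, weight in route_token(logits):
        print(f"expert {expert:3d}  weight {weight:.3f}")
    print(layer_pattern(8))
```

Because only 8 of the 128 expert blocks run for each token, the compute per token tracks the active parameter count rather than the total, which is where the efficiency claim comes from.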

Performance and Capabilities

Optimized for code generation, agentic software engineering, and terminal operations, the model supports a 256K-token context window and a maximum generation length of 64K tokens. In internal benchmarks, North Mini Code delivered up to 2.8x higher output throughput than Devstral Small 2 on identical hardware, along with a 30% improvement in inter-token latency. The model was trained with a two-stage cascaded supervised fine-tuning (SFT) process followed by reinforcement learning with verifiable rewards (RLVR) to strengthen its agentic reasoning and native tool-use capabilities.
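
The two limits above interact when sizing a request. The helper below is a rough sketch under stated assumptions: that "256K" and "64K" mean 262,144 and 65,536 tokens, and that the context window covers prompt and generated tokens together. Neither is confirmed by the text, and max_new_tokens is a hypothetical name.

```python
CONTEXT_WINDOW = 256 * 1024   # assumption: "256K" = 262,144 tokens
MAX_GENERATION = 64 * 1024    # assumption: "64K" = 65,536 tokens


def max_new_tokens(prompt_tokens, requested=MAX_GENERATION):
    """Largest generation budget that respects both limits.

    Assumption: the context window is shared by prompt and output.
    """
    if prompt_tokens < 0 or requested < 0:
        raise ValueError("token counts must be non-negative")
    remaining = CONTEXT_WINDOW - prompt_tokens
    if remaining <= 0:
        raise ValueError("prompt does not fit in the context window")
    return min(requested, MAX_GENERATION, remaining)


if __name__ == "__main__":
    for prompt in (1_000, 200_000, 250_000):
        print(prompt, max_new_tokens(prompt))
```

Under these assumptions, the generation cap is the binding limit for short prompts, while the remaining context becomes the binding limit once the prompt exceeds roughly 192K tokens.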

Implementation and Deployment

The model is available under an Apache 2.0 license (with additional usage addenda) via Hugging Face, the Cohere API, and OpenRouter. Developers can integrate it using Hugging Face Transformers or serve it via vLLM using the cohere_melody library for accurate tool-call parsing. Recommended sampling parameters are a temperature of 1.0 and top_p of 0.95. Quantized versions are also compatible with standard local inference tools like Ollama, LM Studio, and llama.cpp.
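
To make the recommended sampling parameters concrete, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top_p) filtering. It illustrates the general technique, not the code path of any particular inference engine; the names RECOMMENDED, nucleus, and sample are hypothetical. Only the values 1.0 and 0.95 come from the text.

```python
import math
import random

RECOMMENDED = {"temperature": 1.0, "top_p": 0.95}  # values from the text


def nucleus(logits, temperature=1.0, top_p=0.95):
    """Return {token_index: prob} for the smallest top set reaching top_p mass."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature 1.0 leaves the logits unchanged.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Walk tokens from most to least likely until the mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalise the surviving tokens so they sum to 1.
    return {i: probs[i] / mass for i in kept}


def sample(logits, rng, temperature=1.0, top_p=0.95):
    """Draw one token index from the filtered distribution."""
    dist = nucleus(logits, temperature=temperature, top_p=top_p)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


if __name__ == "__main__":
    toy_logits = [math.log(p) for p in (0.6, 0.3, 0.06, 0.04)]
    print(nucleus(toy_logits, **RECOMMENDED))
    print(sample(toy_logits, random.Random(0), **RECOMMENDED))
```

With a temperature of 1.0 the model's distribution is used as-is, and a top_p of 0.95 trims only the least likely tail, so the combination keeps output diverse while excluding very improbable tokens.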