Architecture and Efficiency

Command A+ is a decoder-only sparse Mixture-of-Experts (MoE) model with 218B total parameters, of which 25B are active per token. It has 128 experts and routes each token to 8 of them, alongside a shared expert. The architecture uses sliding-window and global attention layers in a 3:1 ratio and supports a 128K-token input context and a 64K-token generation length.
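The routing step described above can be illustrated with a minimal sketch. The expert count and top-8 selection come from the text; the softmax over the selected logits, the additive shared expert, and the names route and moe_layer are assumptions made for illustration, not Cohere's implementation. Scalars stand in for hidden-state vectors.

```python
import math
import random

NUM_EXPERTS = 128  # from the text
TOP_K = 8          # from the text


def route(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:top_k]
    # Assumed normalization: softmax over the selected logits only.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(x, router_logits, experts, shared_expert):
    """Add the shared expert's output to the weighted outputs of the routed experts."""
    out = shared_expert(x)  # assumed to run for every token
    for idx, weight in route(router_logits):
        out += weight * experts[idx](x)
    return out


# Toy usage: each "expert" is a scalar function standing in for a feed-forward block.
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
experts = [lambda x, s=i: s * x for i in range(NUM_EXPERTS)]
selected = route(logits)
output = moe_layer(1.0, logits, experts, shared_expert=lambda x: x)
```

Only 8 of the 128 expert functions run per token in this sketch, which is why the active parameter count (25B) is far below the total (218B).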

To reduce deployment cost, Cohere applies NVFP4 W4A4 quantization (4-bit weights and 4-bit activations) to the MoE experts while keeping the attention path and KV cache at full precision. To limit the quality loss from this aggressive quantization, the model undergoes Quantization-Aware Distillation (QAD), in which the quantized student model is trained to match the full-precision teacher's output distribution.
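One common way to make a student match a teacher's output distribution is to minimize the KL divergence between their next-token distributions. The sketch below assumes that objective; the text does not state which loss QAD uses, and the function names are hypothetical. In a real setup the student logits would come from a forward pass with the quantized experts.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_sum for v in logits]


def distillation_loss(teacher_logits, student_logits):
    """KL(teacher || student) for a single token position (assumed objective)."""
    log_p = log_softmax(teacher_logits)  # full-precision teacher
    log_q = log_softmax(student_logits)  # quantized student
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy usage with a 4-token vocabulary.
teacher = [2.0, 0.5, -1.0, 0.0]
student = [1.8, 0.7, -0.9, 0.1]
loss = distillation_loss(teacher, student)
```

The loss is zero when the two distributions are identical and grows as the student's distribution drifts from the teacher's, so minimizing it pushes the quantized model back toward full-precision behavior.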

Hardware and Performance

Command A+ targets enterprise agentic workflows, including retrieval-augmented generation (RAG), multilingual processing, and multimodal document analysis. Hardware requirements depend on the serving precision:

  • BF16: 8x H100 or 4x B200
  • FP8: 4x H100 or 2x B200
  • W4A4: 2x H100 or 1x B200
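A back-of-the-envelope check helps explain why the GPU counts above halve with each precision step. The sketch below estimates weight memory only, ignoring KV cache, activations, and runtime overhead. The per-GPU memory figures (80 GB for H100, 180 GB for B200) are assumptions not given in the text, and treating every parameter as 4-bit under W4A4 is a lower bound, since only the MoE experts are quantized.

```python
TOTAL_PARAMS = 218e9  # from the text

# Bytes per parameter. W4A4 covers only the MoE experts in the text,
# so 0.5 bytes for every parameter understates the real footprint.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "W4A4": 0.5}

# Assumed per-GPU memory in GB (not stated in the text).
GPU_MEMORY_GB = {"H100": 80, "B200": 180}

# GPU counts listed in the text.
CONFIGS = {
    "BF16": {"H100": 8, "B200": 4},
    "FP8": {"H100": 4, "B200": 2},
    "W4A4": {"H100": 2, "B200": 1},
}


def weight_gb(precision):
    """Approximate size of the weights in GB at the given precision."""
    return TOTAL_PARAMS * BYTES_PER_PARAM[precision] / 1e9


def headroom_gb(precision, gpu):
    """Aggregate GPU memory left after loading weights, in GB."""
    total = CONFIGS[precision][gpu] * GPU_MEMORY_GB[gpu]
    return total - weight_gb(precision)


for precision, gpus in CONFIGS.items():
    for gpu, count in gpus.items():
        print(f"{precision}: {count}x {gpu}, weights ~{weight_gb(precision):.0f} GB, "
              f"headroom ~{headroom_gb(precision, gpu):.0f} GB")
```

Under these assumptions every listed configuration leaves memory to spare after loading weights, which is the space a deployment would need for the KV cache and activations.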

Benchmarks show gains over previous iterations, including a 20% increase in agentic QA accuracy and a 32% improvement in spreadsheet analysis. The model is also faster: W4A4 quantization provides a 47% speed increase and a 13% latency reduction. When combined with architecture-specific speculative decoding, inference speed increases by 1.5–1.6x.
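For readers unfamiliar with speculative decoding, the sketch below shows the generic greedy draft-and-verify loop: a cheap draft model proposes several tokens, and the target model checks them, keeping the longest matching prefix plus one token of its own. This is not Cohere's architecture-specific method, whose details the text does not give. The toy models and all names here are hypothetical.

```python
def speculative_decode(target_next, draft_next, prompt, num_tokens, k=4):
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    tokens = list(prompt)
    goal = len(prompt) + num_tokens
    target_passes = 0
    while len(tokens) < goal:
        # The draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(tokens + proposal))
        # The target checks every proposed position. A real system does this
        # in one batched forward pass, which is where the speedup comes from.
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(tokens + proposal[:i])
            accepted.append(expected)
            if proposal[i] != expected:
                break  # keep the target's correction, drop the rest
        else:
            # Every draft token matched: the same pass yields a bonus token.
            accepted.append(target_next(tokens + proposal))
        tokens.extend(accepted)
    return tokens[:goal], target_passes


def toy_target(tokens):
    """Stand-in for the large model: a deterministic next-token rule."""
    return (tokens[-1] * 3 + 1) % 11


def toy_draft(tokens):
    """Stand-in for the draft model: agrees with the target most of the time."""
    guess = toy_target(tokens)
    return (guess + 1) % 11 if len(tokens) % 4 == 0 else guess


def greedy_decode(next_token, prompt, num_tokens):
    """Plain greedy decoding, one model call per generated token."""
    tokens = list(prompt)
    for _ in range(num_tokens):
        tokens.append(next_token(tokens))
    return tokens


spec_tokens, passes = speculative_decode(toy_target, toy_draft, [1], 20)
baseline = greedy_decode(toy_target, [1], 20)
```

In this greedy variant the output is identical to decoding with the target alone, while the target runs fewer passes than the number of generated tokens. How much faster this is in practice depends on how often the draft agrees with the target.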