Sparse MoE Delivers Massive Capacity at Low Compute

AntAngelMed packs 103B total parameters into a Mixture-of-Experts (MoE) architecture with a 1/32 activation ratio, activating just 6.1B parameters per inference pass. This lets it match the performance of roughly 40B dense models while achieving up to 7x the efficiency of equivalently sized dense setups, and the speed advantage grows with longer outputs. MoE works by routing each input token to a small subset of "expert" sub-networks rather than through every parameter, scaling knowledge capacity without a proportional increase in compute. The model builds on the Ling-flash-2.0 base via the Ling Scaling Laws, with refinements including finer expert granularity, an optimized shared-expert ratio, attention balancing, auxiliary-loss-free sigmoid routing, a Multi-Token Prediction (MTP) layer, QK-Norm, and Partial-RoPE applied to a subset of attention heads.

On H20 GPUs the model exceeds 200 tokens/second (3x a 36B dense model) and extends to a 128K context window via YaRN, enough for full clinical documents or long multi-turn dialogues. FP8 quantization combined with EAGLE3 speculative decoding yields throughput uplifts of 71% on HumanEval, 45% on GSM8K, and 94% on Math-500 at 32-way concurrency, stabilizing throughput on these coding and math proxy workloads.
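To make the sparse routing concrete, below is a minimal PyTorch sketch of token-level top-k routing through a sigmoid gate, in the auxiliary-loss-free style named above. The layer sizes, expert count, and top-k value are illustrative assumptions, not AntAngelMed's actual configuration.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Toy sparse MoE layer: each token activates only top_k of num_experts
    feed-forward experts, so capacity grows without proportional compute."""

    def __init__(self, d_model=512, d_ff=1024, num_experts=32, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: [num_tokens, d_model]
        # Sigmoid gate (auxiliary-loss-free style) instead of a softmax router.
        scores = torch.sigmoid(self.gate(x))            # [num_tokens, num_experts]
        weights, idx = scores.topk(self.top_k, dim=-1)  # pick k experts per token
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize chosen gates
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = SparseMoE()
print(moe(torch.randn(8, 512)).shape)  # torch.Size([8, 512])
```

Only the selected experts run a forward pass for a given token, which is why a 103B-parameter model can cost about as much per token as a 6.1B dense one.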

Three-Stage Training Infuses Medical Depth

The training pipeline layers medical specialization atop general reasoning in three stages: (1) continual pre-training on vast medical corpora (encyclopedias, web text, research papers), starting from the Ling-flash-2.0 checkpoint; (2) Supervised Fine-Tuning (SFT) on mixed instructions that preserve chain-of-thought via math, coding, and logic tasks alongside doctor-patient Q&A, diagnostics, and ethics/safety data; (3) GRPO reinforcement learning, a lighter PPO variant that estimates baselines from group scores (per the DeepSeekMath paper; see the sketch below), with rewards targeting empathy, structured clinical outputs, safety, and evidence-based reasoning to cut hallucinations. This progression embeds domain expertise without eroding broad capabilities.
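For intuition on why GRPO is lighter than PPO, here is a minimal sketch of its core step: advantages come from each group's own reward statistics rather than from a learned value network, following the DeepSeekMath formulation. The shapes and reward values are illustrative.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: sample a group of responses per prompt,
    score each, and normalize by the group's mean and std (the baseline)."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# 2 prompts, 4 sampled responses each; in training, scalar rewards would
# combine the empathy / structure / safety / evidence terms upstream.
rewards = torch.tensor([[0.2, 0.9, 0.4, 0.5],
                        [1.0, 0.1, 0.3, 0.6]])
print(grpo_advantages(rewards))
```

Dropping the value network halves the number of large models held in memory during RL, which is the main source of the savings over PPO.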

Leads Benchmarks, Deploys Easily as Open Source

AntAngelMed tops HealthBench (OpenAI's multi-turn clinical-dialogue benchmark): #1 among open-source models, beating proprietary models, with the widest margin on HealthBench-Hard. It dominates MedAIBench (from China's National AI Medical Facility), ranking elite in knowledge Q&A and ethics/safety, and places #1 overall on MedBench (36 datasets, ~700K samples spanning knowledge QA, understanding, generation, complex reasoning, and safety/ethics).

Weights are Apache 2.0 (HuggingFace: MedAIBase/AntAngelMed) and code is MIT (GitHub: MedAIBase/AntAngelMed). Loading via Transformers takes one call, AutoModelForCausalLM.from_pretrained("MedAIBase/AntAngelMed", device_map="auto", trust_remote_code=True); fuller examples follow below. The model runs on vLLM v0.11.0 (4-GPU tensor parallel), SGLang with FlashAttention-3, and vLLM-Ascend (Huawei 910B NPUs). It comes from the Health Information Center of Zhejiang Province, Ant Healthcare, and Zhejiang Anzhen'er Medical AI Technology Co., Ltd.
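A fuller version of the one-line Transformers load, as a sketch: the model ID and from_pretrained arguments come from the text above, while the chat-template usage and the prompt are illustrative assumptions about the released tokenizer.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MedAIBase/AntAngelMed"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",       # shard the MoE weights across available GPUs
    trust_remote_code=True,  # custom architecture code ships with the repo
)

messages = [{"role": "user", "content": "What are common causes of chest pain?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```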
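For serving, a sketch of offline inference through vLLM's Python API with the 4-way tensor parallelism noted above; the sampling parameters and prompt are illustrative, not an official recipe.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="MedAIBase/AntAngelMed",
    tensor_parallel_size=4,  # 4-GPU tensor parallel, per the release notes
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Summarize the contraindications of metformin."], params)
print(outputs[0].outputs[0].text)
```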