MLX-VLM: Inference and Fine-Tuning for VLMs on Mac with MLX

The MLX-VLM package runs vision-language models (VLMs) and omni models on Apple Silicon via MLX. It supports text, image, audio, and video inference; multi-modal inputs; CLI, UI, and server interfaces; and LoRA fine-tuning.

Core Setup and Inference Workflows

Install with pip install -U mlx-vlm (add the [torch] extra for models such as Qwen2-VL). For quick generation from the command line, mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit "prompt" image.jpg handles text, images, audio (e.g., audio.wav), video, or any mix of modalities. Launch a Gradio chat UI with mlx_vlm.chat_ui --model <model>. In Python, import load and generate from mlx_vlm, load the model with model, processor = load('model_path'), and format the prompt with apply_chat_template; pass lists for multiple images (num_images=len(images)) or audio clips (num_audios=len(audios)), as in the sketch below. For thinking models such as Qwen3.5, set --thinking-budget <tokens> to cap internal reasoning; when the budget is exceeded, a newline is forced to end the reasoning block. Related flags include --enable-thinking and custom reasoning start/end tokens.
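
A minimal Python sketch of this workflow, based on the calls named above; the exact generate signature and the load_config helper are assumptions that may differ across mlx-vlm versions:

    from mlx_vlm import load, generate
    from mlx_vlm.prompt_utils import apply_chat_template
    from mlx_vlm.utils import load_config  # assumed helper for reading the model config

    model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
    model, processor = load(model_path)
    config = load_config(model_path)

    # Format a single-image prompt with the model's chat template.
    images = ["image.jpg"]
    prompt = apply_chat_template(
        processor, config, "Describe this image.", num_images=len(images)
    )

    # Generate text conditioned on the image.
    output = generate(model, processor, prompt, images, verbose=False)
    print(output)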

The FastAPI server (mlx_vlm.server --model <path>) exposes an OpenAI-compatible /v1/chat/completions endpoint that accepts streamed text, image, and audio inputs (e.g., {"messages": [{"role": "user", "content": [{"type": "text", "text": "Describe"}, {"type": "input_image", "image_url": "/path.jpg"}]}]}); a request sketch follows. The server caches one model at a time; unload it via /unload and list available models via /models. Supported parameters include max_tokens, temperature, top_p, top_k, min_p, repetition_penalty, and stream.
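
A hedged request sketch using Python's requests library, mirroring the payload shape above; the localhost:8000 address is an assumption about the server's default bind, not confirmed by the source:

    import requests

    # Text + image chat message in the OpenAI-compatible format shown above.
    payload = {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image."},
                    {"type": "input_image", "image_url": "/path/to/image.jpg"},
                ],
            }
        ],
        "max_tokens": 256,
        "temperature": 0.7,
        "stream": False,
    }

    # POST to the chat completions endpoint; adjust host/port if the server runs elsewhere.
    resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
    print(resp.json()["choices"][0]["message"]["content"])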

Optimization and Multi-Modal Capabilities

When running MLX on NVIDIA CUDA, enable activation quantization for mxfp8/nvfp4 models via --quantize-activations (CLI) or quantize_activations=True (Python); this converts QuantizedLinear layers to QQLinear so that both weights and activations are quantized (not needed on Apple Metal). Multi-image chat works by passing a list of images (e.g., mlx_vlm.generate ... "Compare these" image1.jpg image2.jpg), enabling cross-image reasoning; see the sketch below. Video captioning and summarization are supported for Qwen2-VL/2.5-VL, Idefics3, and LLaVA via the CLI or Python by passing video paths.
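
A multi-image sketch under the same assumptions as the single-image example above; only the image list and num_images change:

    from mlx_vlm import load, generate
    from mlx_vlm.prompt_utils import apply_chat_template
    from mlx_vlm.utils import load_config  # assumed helper, as above

    model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
    model, processor = load(model_path)
    config = load_config(model_path)

    # Tell the chat template how many images to expect, then pass them all.
    images = ["image1.jpg", "image2.jpg"]
    prompt = apply_chat_template(
        processor, config, "Compare these images.", num_images=len(images)
    )

    output = generate(model, processor, prompt, images, verbose=False)
    print(output)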

Model Ecosystem and Customization

Detailed docs for models such as DeepSeek-OCR, Phi-4 Reasoning Vision/Multimodal, MiniCPM-o, and Moondream3 cover prompts and best practices. The repo has 481 commits, 2.3k stars, and 302 forks. Fine-tune with LoRA/QLoRA (see LoRA.md); trained adapters can be loaded via --adapter-path, as sketched below. Topics: mlx, vision-language-model, llava, local-ai. The codebase is 100% Python.
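
A short sketch of loading a trained adapter in Python; the adapter_path keyword is an assumption mirroring the --adapter-path CLI flag:

    from mlx_vlm import load

    # adapter_path is assumed to mirror the CLI's --adapter-path flag.
    model, processor = load(
        "mlx-community/Qwen2-VL-2B-Instruct-4bit",
        adapter_path="path/to/adapters",
    )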
