#multimodal
Every summary, chronological.
NaviGen: Bridging User History and Personalized Multimodal Generation
NaviGen translates implicit user interaction history into explicit, high-fidelity generation instructions using a dual-identifier representation and a two-stage SFT+RL alignment pipeline.
ReMMD: Agentic Verification for Multimodal Misinformation
ReMMD is a new framework for detecting complex, multilingual, multi-image misinformation by decomposing posts into atomic points and using persistent-memory agents to verify claims, significantly reducing costs compared to previous methods.
Building Custom Vision Agents with Gemini, MCP, and Veo 3
Learn how to build a cloud-native vision agent that orchestrates real-time camera input, image style transfer via Nano Banana, and cinematic video generation using Veo 3, all controlled via natural language.
Google Cloud Tech
Visual-Seeker: Active Visual Reasoning for Multimodal Agents
Visual-Seeker introduces a visual-native agentic search framework that moves beyond text-based retrieval by employing active visual reasoning to navigate and interpret complex multimodal environments.
Building Multimodal Audio Applications with Gemini 3
Google DeepMind's Gemini 3 models enable unified audio understanding, steerable speech generation, and real-time multimodal interaction, allowing developers to build complex audio-to-audio applications with structured outputs.
AI Engineer
Step 3.7 Flash: A 198B MoE Model for Agentic Workflows
StepFun’s new 198B parameter MoE model features native vision capabilities, improved tool-use reliability, and an 'Advisor Mode' that delivers near-Opus performance at a fraction of the cost.
Next-Gen Agentic Architecture: Gemini 3.5 & ADK
Google's Gemini 3.5 Flash and Gemini Omni introduce higher intelligence, lower costs, and advanced multimodal capabilities, while the Agent Development Kit (ADK) streamlines the lifecycle of building, scaling, and governing AI agents.
Google Cloud Tech
Cohere's Command A+: A 218B Sparse MoE Model for Agentic Workflows
Command A+ is a 218B parameter sparse MoE model designed for enterprise agentic tasks, featuring multimodal capabilities, a 128K context window, and efficient W4A4 quantization that allows it to run on as few as two H100 GPUs.
Building Native Multimodal Agents with Gemini
Learn to build agentic, multimodal applications using Gemini's native understanding and generation capabilities, moving beyond hardcoded pipelines to reasoning-based agent loops.
AI Engineer
Real-Time Multimodal Interpretation with Qwen3.5-LiveTranslate-Flash
Alibaba's Qwen3.5-LiveTranslate-Flash achieves 2.8-second latency for real-time interpretation across 60 languages by integrating visual context and real-time voice cloning.
Google Overhauls Gemini App into Multimodal AI Hub
Google is transforming the Gemini app from a chatbot into an agentic, multimodal hub featuring a redesigned interface, 24/7 background agents, and native video generation.
Google Transforms Gemini App into an Agentic AI Hub
Google is pivoting the Gemini app from a static chatbot to a proactive, agentic hub featuring personalized daily briefings, background task automation, and native video generation.
Building Multi-Modal AI Media Pipelines with Google DeepMind
Guillaume Vernade demonstrates how to orchestrate a multi-modal media pipeline using Gemini, Imagen, Veo, and Lyria, highlighting the role of LLMs as prompt engineers and the efficiency of stateful interaction APIs.
AI Engineer
Visual Primitives Solve LMM Reference Gap
DeepSeek's withdrawn paper introduces 'Thinking with Visual Primitives'—embedding bounding boxes and points into every reasoning step—to fix ambiguous referencing in multimodal models, achieving 77.2% on spatial benchmarks with 10x fewer tokens than rivals.