Scale GenAI to Billions of Rows in BigQuery at 94% Less Cost

BigQuery's optimized mode distills LLM calls into lightweight, task-specific models built on embeddings. In tests on 34k images and 50k voice commands, it cut token use by 94% (55M down to 3M) and query time from 16 minutes to 2, and the approach scales to billions of rows.

Replace Per-Row LLM Calls with Distilled Models for Massive Savings

Standard BigQuery AI functions such as AI.CLASSIFY and AI.IF send every row to an LLM, burning tokens and time on datasets with millions of rows, such as product reviews, claims, or support tickets. Optimized mode fixes this by automatically distilling a task-specific lightweight model: BigQuery samples your data, sends only that sample to the LLM for labeling, generates embeddings, and trains the distilled model directly on BigQuery compute. That model then processes the remaining rows from their semantic embeddings, delivering LLM-quality classification, filtering, or rating without a full LLM inference per row. The result: billions of rows processed at BigQuery speeds, with latency and cost savings that compound as data volume grows.
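
To make the contrast concrete, here is what the standard, non-optimized pattern looks like, where every row triggers an LLM inference. This is a minimal sketch: the table, columns, and categories are hypothetical, and argument details should be checked against the current AI.CLASSIFY reference.

    -- Baseline: one LLM call per row (hypothetical schema).
    -- A connection_id argument may also be required depending on your setup.
    SELECT
      review_id,
      review_text,
      AI.CLASSIFY(
        review_text,
        categories => ['complaint', 'praise', 'question']
      ) AS label
    FROM `my_project.support.product_reviews`;

At a few thousand rows this is workable; at millions, the per-row token spend is exactly what optimized mode eliminates.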

Trigger Optimization Automatically or with One Parameter

No code rewrites are needed: optimized mode activates for supported functions when you supply embeddings as a parameter (for example, adding an embeddings column to AI.CLASSIFY) or when BigQuery's autonomously generated embeddings already exist on the table. BigQuery auto-detects them, samples the data, distills, and optimizes inline. For image analysis on 34k self-driving-car camera shots, adding embeddings cut tokens from 55M+ to 3M (a 94% reduction) and runtime from 16 minutes to 2, with the vast majority of rows handled by the distilled model. On 50k driver voice commands, AI.IF filtering for 'slow down' requests was auto-optimized for most rows without any query changes, returning filtered results quickly and cheaply.
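
Both triggers can be sketched as follows, under stated assumptions: the tables and columns are hypothetical, the name of the embedding parameter to AI.CLASSIFY is an assumption to verify against the docs, and the tuple-style prompt for AI.IF follows the pattern used in Google's announcement examples.

    -- Trigger 1: supply a precomputed embedding column.
    -- 'embedding' as the parameter name is an assumption; check the docs.
    SELECT
      command_id,
      AI.CLASSIFY(
        transcript,
        categories => ['navigation', 'speed', 'media'],
        embedding => transcript_embedding
      ) AS intent
    FROM `my_project.driving.voice_commands`;

    -- Trigger 2: rely on auto-detection. If BigQuery's autonomous embeddings
    -- already exist on the table, the same query is optimized unchanged.
    SELECT *
    FROM `my_project.driving.voice_commands`
    WHERE AI.IF(('The driver is asking to slow down: ', transcript));

In both cases the query text stays close to the unoptimized version; the embeddings are what let BigQuery route most rows to the distilled model.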

Trade-offs and When to Use

Distillation trades full LLM flexibility for speed and cost on repetitive tasks like classification, making it ideal for large-scale filtering where you don't need per-row creativity. Quality matches the LLM on the sampled rows and generalizes to the rest via embeddings; check the job information tab after a query for optimization statistics (e.g., the percentage of rows handled by the distilled model). Start by adding embeddings to your existing AI queries; the approach scales best on growing datasets where per-row LLM calls become prohibitive.
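
If a table has no embedding column yet, one way to create one is ML.GENERATE_EMBEDDING with a remote embedding model; a minimal sketch, assuming a Cloud resource connection already exists (all names here are hypothetical):

    -- One-time setup: a remote model over a Vertex AI embedding endpoint.
    CREATE OR REPLACE MODEL `my_project.models.text_embedder`
      REMOTE WITH CONNECTION `us.my_connection`
      OPTIONS (ENDPOINT = 'text-embedding-005');

    -- Materialize embeddings next to the source rows; the input column must
    -- be named 'content', and the output adds ml_generate_embedding_result.
    CREATE OR REPLACE TABLE `my_project.driving.voice_commands_embedded` AS
    SELECT *
    FROM ML.GENERATE_EMBEDDING(
      MODEL `my_project.models.text_embedder`,
      (SELECT command_id, transcript AS content
       FROM `my_project.driving.voice_commands`),
      STRUCT(TRUE AS flatten_json_output)
    );

The resulting embedding column can then be passed into AI.CLASSIFY or AI.IF queries, and the job information tab will show what fraction of rows the distilled model handled.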
