Google's Gemini-SQL2 Sets New BIRD Benchmark Record

Performance and Benchmarking

Google Research has introduced Gemini-SQL2, a specialized text-to-SQL capability built on the Gemini 3.1 Pro foundation model. It achieved 80.04% execution accuracy on the BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) benchmark in the Single Trained Model track. This track matters because it prohibits complex ensembling and external agentic frameworks, isolating the model's core ability to translate natural language into valid, executable SQL.

For context, BIRD is widely regarded as a rigorous text-to-SQL benchmark, containing 12,751 question-SQL pairs across 37 domains. Unlike older benchmarks, it requires models to handle 'dirty' database values and incorporate external knowledge. While Gemini-SQL2 leads the current leaderboard, it remains 12.92 percentage points behind the estimated human performance of 92.96%.

Technical Implementation and Integration

Gemini-SQL2 is designed to address the inherent difficulty of mapping natural language to complex business contexts. While Google has not yet released a standalone model string or API endpoint for Gemini-SQL2, the current implementation pattern for Gemini-based SQL generation involves schema-grounded prompting. Developers are encouraged to use the google-genai SDK to pass schema definitions and natural language questions to the model.
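Schema-grounded prompting can be sketched roughly as follows. This is a minimal illustration, not Google's published method: the helper names (read_schema, build_prompt, generate_sql) and the prompt wording are assumptions, SQLite stands in for the target database, and the model identifier is left to the caller because no Gemini-SQL2 model string has been released.

```python
import sqlite3


def read_schema(conn: sqlite3.Connection) -> str:
    """Return the CREATE TABLE statements stored in a SQLite database."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'table' AND sql IS NOT NULL ORDER BY name"
    ).fetchall()
    return "\n\n".join(row[0] for row in rows)


def build_prompt(schema: str, question: str) -> str:
    """Combine the schema and the question into one grounded prompt."""
    # The wording below is an assumption, not an official prompt template.
    return (
        "You translate questions into SQLite SQL.\n"
        "Use only the tables and columns defined in this schema:\n\n"
        f"{schema}\n\n"
        f"Question: {question}\n"
        "Reply with a single SQL query and nothing else."
    )


def generate_sql(prompt: str, model: str) -> str:
    """Send the prompt to a Gemini model through the google-genai SDK.

    `model` must be supplied by the caller: no Gemini-SQL2 identifier exists yet.
    """
    # Imported lazily so the rest of the sketch runs without the SDK installed.
    from google import genai

    client = genai.Client()  # picks up the API key from the environment
    response = client.models.generate_content(model=model, contents=prompt)
    return (response.text or "").strip()


# Build a prompt for a toy in-memory database (no network call is made here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
)
prompt = build_prompt(read_schema(conn), "What is the total revenue per customer?")
print(prompt)
```

Passing the full CREATE TABLE text keeps column names and types visible to the model; for large schemas, a real system would probably need to select only the relevant tables.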

To achieve production-grade reliability, the article recommends implementing an execution verification loop:

1. Generate the SQL query.
2. Attempt to execute the query against the database.
3. If an error occurs, capture the error message and feed it back into the model for a corrective retry.
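The three steps above can be sketched as a small retry loop. This is an illustrative sketch under stated assumptions: generate_with_retry and fake_model are hypothetical names, SQLite stands in for the production database, and fake_model is a stub that replaces the real model call so the example runs offline.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple


def generate_with_retry(
    conn: sqlite3.Connection,
    question: str,
    generate: Callable[[str, Optional[str]], str],
    max_attempts: int = 3,
) -> Tuple[str, List[tuple]]:
    """Generate SQL, run it, and retry with the error message on failure."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        sql = generate(question, feedback)  # step 1: generate the query
        try:
            rows = conn.execute(sql).fetchall()  # step 2: try to execute it
            return sql, rows
        except sqlite3.Error as exc:  # step 3: capture the error for the retry
            feedback = f"The query `{sql}` failed with: {exc}"
    raise RuntimeError(
        f"No executable query after {max_attempts} attempts: {feedback}"
    )


def fake_model(question: str, feedback: Optional[str]) -> str:
    """Stub for a model call: wrong column first, corrected once it sees feedback."""
    if feedback is None:
        return "SELECT SUM(amount) FROM orders"  # `amount` does not exist
    return "SELECT SUM(total) FROM orders"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO orders (total) VALUES (?)", [(10.0,), (5.5,)])

sql, rows = generate_with_retry(conn, "What is total revenue?", fake_model)
print(sql, rows)
```

Capping the number of attempts keeps a persistently failing question from looping forever. Because the SQL comes from a model, running it over a read-only connection is a sensible precaution.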

This iterative approach echoes the execution-based evaluation used by the BIRD benchmark, which scores a query by running it and comparing its results with those of a reference query. The retry loop confirms that the generated SQL actually runs against the specific database schema, not merely that it is syntactically correct.
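To make the execution accuracy metric concrete, here is a simplified sketch that runs a predicted query and a reference query and compares their result sets. It is an assumption-laden illustration, not the official BIRD evaluation script: execution_match and execution_accuracy are hypothetical names, and the toy table and queries are invented for the example.

```python
import sqlite3
from typing import Iterable, Tuple


def execution_match(
    conn: sqlite3.Connection, predicted_sql: str, reference_sql: str
) -> bool:
    """True when both queries return the same set of rows."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run counts as a miss
    reference = conn.execute(reference_sql).fetchall()
    # Row order and duplicates are ignored in this simplified comparison.
    return set(predicted) == set(reference)


def execution_accuracy(
    conn: sqlite3.Connection, pairs: Iterable[Tuple[str, str]]
) -> float:
    """Fraction of (predicted, reference) pairs whose results match."""
    pairs = list(pairs)
    matches = sum(execution_match(conn, p, r) for p, r in pairs)
    return matches / len(pairs)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [("ann", 10.0), ("bob", 5.5), ("ann", 4.5)],
)

pairs = [
    # Different SQL text, same rows: counted as correct.
    ("SELECT customer FROM orders WHERE total > 5",
     "SELECT customer FROM orders WHERE total >= 5.5"),
    # Runs without error but returns a different answer: counted as wrong.
    ("SELECT COUNT(*) FROM orders",
     "SELECT COUNT(DISTINCT customer) FROM orders"),
]
accuracy = execution_accuracy(conn, pairs)
print(accuracy)
```

The second pair shows why an error-driven retry loop alone does not guarantee correctness: a query can execute cleanly and still return the wrong answer.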
