DuckDB: Fast In-Process OLAP SQL Everywhere

Columnar Engine Powers Fast, Memory-Efficient Analytics

DuckDB's columnar engine supports larger-than-memory workloads by spilling intermediate results to disk, which helps avoid out-of-memory failures during analytics. It queries Parquet, CSV, and JSON files directly, including files on S3 or served over HTTP, without first loading them into tables, e.g., SELECT station_name, count(*) AS num_services FROM 'https://blobs.duckdb.org/train_services.parquet' GROUP BY ALL ORDER BY num_services DESC LIMIT 10;. Here GROUP BY ALL groups by every non-aggregated column in the SELECT list. CSV dialect, column names, and types are auto-detected: CREATE TABLE stations AS FROM 'https://blobs.duckdb.org/stations.csv';. The spatial extension adds functions such as ST_Distance(ST_Point(lng1, lat1), ST_Point(lng2, lat2)) * 111139, which approximates the crow-flies distance between stations in meters by scaling a distance measured in degrees by roughly 111,139 m per degree. The core, the extensions, and the DuckLake format are all MIT-licensed, so each can be used and extended freely.

Install in Seconds, Run Anywhere

Install on any major OS or CPU architecture with a one-liner: pip install duckdb (Python), npm install @duckdb/node-api (Node.js), curl https://install.duckdb.org | sh (CLI), cargo add duckdb --features bundled (Rust), go get github.com/duckdb/duckdb-go/v2 (Go). DuckDB is portable across browsers, laptops, and servers. Features are added modularly through the extension system; many core features are themselves extensions. Each language gets an idiomatic API, and setup stays minimal because DuckDB runs in-process, with no server to manage.

Embed SQL in Python/R/JS/Java Workflows

Python: query DataFrames with duckdb.sql('SELECT ... FROM df_in').to_df() and register UDFs with con.create_function('plus_one', lambda x: x+1, ['BIGINT'], 'BIGINT'). R: call duckdb_register(con, 'iris', iris), then run dplyr/duckplyr pipelines such as iris |> filter(Sepal.Length > 5) |> group_by(Species) |> summarize(n(), max(Sepal.Width)) |> collect(). Java: connect over JDBC with DriverManager.getConnection("jdbc:duckdb:") and use bulk appenders for inserts. Node.js: run queries asynchronously with await connection.runAndReadAll('SELECT ...'), for example inside Express endpoints that return API responses. All of these keep the full SQL dialect (e.g., monthname(date) = 'May') while accelerating pandas and dplyr workloads.