DuckDB-Python: Fast Analytics Pipelines with Zero-Copy DataFrames

Integrate DuckDB with Python for zero-copy queries on Pandas/Polars/Arrow, advanced SQL (window functions, UDFs, recursive CTEs), bulk inserts (50,000 rows in under 0.1s), Hive-partitioned Parquet I/O, and 10x+ speedups over Pandas on 1M-row aggregations.

Zero-Copy Queries and Seamless DataFrame Integration

Query Pandas, Polars, or PyArrow tables directly without loading them: con.sql('SELECT * FROM pdf') accesses the DataFrame in place via replacement scans, which resolve names like pdf (or even a dict such as my_dict_data) from the surrounding Python scope. Convert results flexibly: .df() for Pandas, .pl() for Polars, .arrow() for Arrow, .fetchnumpy() for NumPy arrays, or .fetchall() for Python lists. Generate synthetic data quickly with generate_series(1, 100000), e.g., a sales table with dates, categories, amounts, regions, and return flags. Chain the relational API, con.table('sales').filter('NOT returned').aggregate('category, region, SUM(amount) AS revenue').order('revenue DESC'), to run a filtered aggregation as a single query plan instead of a sequence of intermediate Python steps; note the AS revenue alias, which the ORDER BY relies on.
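
A minimal sketch of the round trip, assuming a small hypothetical pdf DataFrame (the column names and sample values below are illustrative, not from the original):

    import duckdb
    import pandas as pd

    con = duckdb.connect()

    # Hypothetical sales-like DataFrame; 'pdf' is found in the Python scope.
    pdf = pd.DataFrame({
        "category": ["toys", "toys", "books"],
        "region":   ["US", "EU", "US"],
        "amount":   [120.0, 80.0, 45.0],
        "returned": [False, True, False],
    })

    # Replacement scan: DuckDB reads the DataFrame in place, no copy or load.
    con.sql("SELECT region, SUM(amount) FROM pdf GROUP BY region").show()

    # One relation, many output formats (relations re-execute lazily).
    rel = con.sql("SELECT * FROM pdf WHERE NOT returned")
    as_pandas = rel.df()        # Pandas DataFrame
    as_arrow  = rel.arrow()     # PyArrow Table
    as_rows   = rel.fetchall()  # list of tuples

    # Relational API chaining; the explicit groups argument mirrors the
    # implicit GROUP BY, and AS revenue makes the ORDER BY unambiguous.
    con.sql("CREATE TABLE sales AS SELECT * FROM pdf")
    top = (
        con.table("sales")
           .filter("NOT returned")
           .aggregate("category, region, SUM(amount) AS revenue",
                      "category, region")
           .order("revenue DESC")
    )
    print(top.fetchall())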

Advanced SQL for Complex Analytics

Apply window functions such as SUM(daily_rev) OVER (PARTITION BY region ORDER BY order_date) for cumulative revenue and AVG(daily_rev) OVER (PARTITION BY region ORDER BY order_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) for 7-day rolling averages; filter windowed results with QUALIFY row_number() OVER (PARTITION BY region ORDER BY daily_rev DESC) <= 3. Pivot with PIVOT sales ON region USING SUM(amount) GROUP BY category. Handle nested types: struct fields (name.first), list indices (scores[1], 1-based), map lookups (metadata['tier']), and unnest(scores) to flatten lists into rows. Create Python UDFs: a scalar c2f(celsius) or a vectorized Arrow discount(prices) built on pyarrow.compute's pc.multiply(prices, 0.85). Define macros such as revenue_tier(amt) for reusable CASE logic, or table macros like top_by_category(cat, n) for parameterized subqueries. Traverse hierarchies with recursive CTEs: WITH RECURSIVE org ... UNION ALL builds org charts with depth counters and path strings. Match time series with ASOF JOINs: trades ASOF JOIN stock_prices ON trades.ticker = stock_prices.ticker AND trades.trade_ts >= stock_prices.ts links each trade to the latest price at or before it.
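
A self-contained sketch of the windowing, QUALIFY, and scalar-UDF pieces, using a hypothetical daily table whose rows are invented for illustration:

    import duckdb

    con = duckdb.connect()

    # Hypothetical per-day revenue data.
    con.sql("""
        CREATE TABLE daily AS
        SELECT * FROM (VALUES
            ('US', DATE '2024-01-01', 100.0),
            ('US', DATE '2024-01-02', 150.0),
            ('US', DATE '2024-01-03',  90.0),
            ('EU', DATE '2024-01-01',  80.0),
            ('EU', DATE '2024-01-02', 120.0)
        ) AS t(region, order_date, daily_rev)
    """)

    # Cumulative and rolling windows; QUALIFY filters on a window result
    # without a wrapping subquery.
    con.sql("""
        SELECT
            region, order_date, daily_rev,
            SUM(daily_rev) OVER (PARTITION BY region ORDER BY order_date) AS cum_rev,
            AVG(daily_rev) OVER (PARTITION BY region ORDER BY order_date
                                 ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7d
        FROM daily
        QUALIFY row_number() OVER (PARTITION BY region ORDER BY daily_rev DESC) <= 3
        ORDER BY region, order_date
    """).show()

    # Scalar Python UDF; with type annotations DuckDB infers the signature.
    def c2f(celsius: float) -> float:
        return celsius * 9 / 5 + 32

    con.create_function("c2f", c2f)
    print(con.sql("SELECT c2f(100.0)").fetchone())  # (212.0,)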

High-Performance Execution and Profiling

Bulk insert 50,000 rows from Pandas in under 0.1s with con.append('fast_load', bulk_df), far faster than row-by-row inserts. A 1M-row benchmark puts DuckDB groupby aggregations (sum/mean/std/min/max) at ~0.05s versus ~0.5s in Pandas, a 10x speedup. Profile with EXPLAIN for query plans, or PRAGMA enable_profiling='json' together with PRAGMA profiling_output='profile.json' to write timings to a file. Run multi-threaded workloads by giving each thread its own connection (duckdb.connect(), or a con.cursor() of a shared connection) for parallel table creation and sums over 10k rows without conflicts. Configure resources at connect time, e.g., config={'threads': 2, 'memory_limit': '512MB'}. Use lambdas in SQL: list_transform([1, 2, 3], x -> x * x) squares a list, and list_filter([1, 2, 3, 4], x -> x % 2 = 0) keeps the evens.
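
A rough sketch of the bulk-load and profiling flow; the 50k-row frame is generated here, and exact timings will vary by machine:

    import time
    import duckdb
    import numpy as np
    import pandas as pd

    con = duckdb.connect()

    # con.append() needs an existing table with a matching schema.
    con.sql("CREATE TABLE fast_load (id BIGINT, value DOUBLE)")
    bulk_df = pd.DataFrame({
        "id": np.arange(50_000),
        "value": np.random.rand(50_000),
    })

    t0 = time.perf_counter()
    con.append("fast_load", bulk_df)  # single bulk transfer, not row-by-row
    print(f"appended 50k rows in {time.perf_counter() - t0:.3f}s")

    # JSON profiling: enable_profiling sets the format,
    # profiling_output redirects it to a file.
    con.sql("PRAGMA enable_profiling='json'")
    con.sql("PRAGMA profiling_output='profile.json'")
    con.sql("SELECT id % 10 AS g, SUM(value) FROM fast_load GROUP BY g").fetchall()

    # SQL lambdas over lists.
    print(con.sql("SELECT list_filter([1, 2, 3, 4], x -> x % 2 = 0)").fetchone())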

Production I/O and Storage Patterns

Export to CSV, Parquet, or JSON: COPY (SELECT ...) TO 'file.parquet' (FORMAT PARQUET, COMPRESSION ZSTD); Parquet is typically smallest (e.g., for the summary files: CSV 1kB, Parquet 500B, JSON 2kB). Write Hive-partitioned Parquet with COPY sales TO 'partitioned_data' (FORMAT PARQUET, PARTITION_BY (region, category)) and read back selectively: read_parquet('partitioned_data/**/*.parquet', hive_partitioning = true) with WHERE region = 'US' prunes untouched partitions. Query remote Parquet over HTTPS after con.install_extension('httpfs') and con.load_extension('httpfs'): read_parquet('https://blobs.duckdb.org/data/yellow_tripdata_2010-01.parquet') counts 1.5M+ rows. Parameterize queries with $1 placeholders in prepared statements, or with SET VARIABLE target_region = 'EU'. Manage transactions from Python via con.begin(), con.commit(), and con.rollback(). Add full-text search indexes with PRAGMA create_fts_index(...) from the fts extension for BM25-ranked searches. Persist databases with duckdb.connect('tutorial.duckdb'), and define enums such as CREATE TYPE mood AS ENUM ('happy', 'neutral', 'sad').
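
A condensed sketch of the partitioned write/read cycle and a $1-parameterized query, using a tiny hypothetical sales table and a scratch output directory:

    import duckdb

    con = duckdb.connect()
    con.sql("""
        CREATE TABLE sales AS
        SELECT * FROM (VALUES
            ('US', 'toys', 120.0),
            ('US', 'books', 45.0),
            ('EU', 'toys',  80.0)
        ) AS t(region, category, amount)
    """)

    # Hive-partitioned layout: partitioned_data/region=US/category=toys/...
    con.sql("""
        COPY sales TO 'partitioned_data'
        (FORMAT PARQUET, PARTITION_BY (region, category))
    """)

    # hive_partitioning recovers region/category from the paths, and the
    # WHERE clause prunes non-US partitions before any file is scanned.
    print(con.sql("""
        SELECT category, SUM(amount) AS revenue
        FROM read_parquet('partitioned_data/**/*.parquet', hive_partitioning = true)
        WHERE region = 'US'
        GROUP BY category
    """).fetchall())

    # Positional $1 parameter in a prepared statement.
    print(con.execute(
        "SELECT SUM(amount) FROM sales WHERE region = $1", ["EU"]
    ).fetchone())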

