FinanceBench: LLM Eval Dataset for SEC Filing QA

FinanceBench benchmarks LLMs on more than 10,000 financial question-answering tasks drawn from real 10-K and 10-Q filings. The tasks cover metric extraction, numerical ratios such as return on assets (ROA: -0.02 for AES), and domain reasoning such as assessing liquidity via the quick ratio (0.96 for 3M).

Core Structure Enables LLM Financial Reasoning Benchmarks

FinanceBench structures QA pairs from public-company SEC filings (10-K, 10-Q, 8-K) across sectors such as Industrials (3M), IT (Adobe), and Utilities (AES). Key columns include financebench_id, company, doc_name (e.g., 3M_2018_10K), question_type (metrics-generated, domain-relevant, novel-generated), question_reasoning (information extraction, numerical/logical reasoning), question, answer, justification, evidence (text snippets and pages), gics_sector, doc_type, doc_period (e.g., 2018-2023), and doc_link. All subsets are labeled OPEN_SOURCE. This structure enables testing LLMs on production-grade tasks: direct extraction (e.g., 3M FY2018 CAPEX of $1577M, read from 'Purchases of PP&E'), calculated metrics (e.g., Adobe FY2015 operating cash flow ratio of 0.66 = cash from operations / current liabilities), and multi-year averages (e.g., Activision Blizzard FY2017-19 CAPEX/revenue of 1.9%).
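The column list above can be sketched as a typed record. This is a minimal illustration rather than the dataset's official schema: the field names follow the columns listed, but the Python types, the helper by_question_type, and the example values marked as hypothetical are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FinanceBenchRecord:
    """One QA pair; field names mirror the columns listed above.

    The types are assumptions for illustration (e.g. doc_period as int).
    """
    financebench_id: str
    company: str
    doc_name: str
    question_type: str       # metrics-generated, domain-relevant, novel-generated
    question_reasoning: str  # e.g. information extraction, numerical reasoning
    question: str
    answer: str
    justification: str
    evidence: List[str]      # text snippets; page references omitted here
    gics_sector: str
    doc_type: str
    doc_period: int
    doc_link: str


def by_question_type(records: List[FinanceBenchRecord],
                     question_type: str) -> List[FinanceBenchRecord]:
    """Hypothetical helper: keep only records of one question type."""
    return [r for r in records if r.question_type == question_type]


# Example record. The id, question wording, evidence text, and link are
# hypothetical placeholders; the remaining values come from the text above.
example = FinanceBenchRecord(
    financebench_id="example_id_0001",
    company="3M",
    doc_name="3M_2018_10K",
    question_type="metrics-generated",
    question_reasoning="information extraction",
    question="What was 3M's FY2018 capital expenditure?",
    answer="$1577M",
    justification="Read from the 'Purchases of PP&E' line item.",
    evidence=["(evidence snippet placeholder)"],
    gics_sector="Industrials",
    doc_type="10K",
    doc_period=2018,
    doc_link="https://example.com/placeholder.pdf",
)

metrics_only = by_question_type([example], "metrics-generated")
print(len(metrics_only), metrics_only[0].doc_name)
```

Filtering by question_type in this way makes it possible to report model accuracy separately for extraction, calculation, and judgment questions.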

Numerical Reasoning Tasks Build Real-World Ratios

The dataset stresses formula-based computations drawn from balance sheets, income statements, and cash flow statements. Examples: fixed asset turnover (Activision Blizzard FY2019: 24.26 = revenue / average PP&E); days payable outstanding, or DPO (Amazon FY2017: 93.86 = 365 * average payables / (COGS + Δinventory)); inventory turnover (AES FY2022: 9.5 = cost of sales / inventory); ROA (AES FY2022: -0.02 = net income / average total assets); free cash flow (FCF) conversion (Adobe FY2022: improved from 143% to 156%, computed as (operating cash flow - CAPEX) / net income); and year-over-year changes (Amazon revenue FY16-17: 30.8%; Adobe operating income FY15-16: 65.4%). Justifications spell out the line items used (e.g., 'Net cash provided by operating activities') and the math steps, and the evidence texts and page references make each answer verifiable.
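The formulas quoted above translate directly into small functions. The sketch below is one possible reading of them; the function and parameter names are assumptions, and the input figures are round illustrative numbers chosen for easy arithmetic, not values taken from any filing.

```python
def average(start: float, end: float) -> float:
    """Simple two-point average of opening and closing balances."""
    return (start + end) / 2


def fixed_asset_turnover(revenue: float, ppe_start: float, ppe_end: float) -> float:
    """Revenue divided by average PP&E."""
    return revenue / average(ppe_start, ppe_end)


def days_payable_outstanding(payables_start: float, payables_end: float,
                             cogs: float, inventory_start: float,
                             inventory_end: float) -> float:
    """365 * average payables / (COGS + change in inventory)."""
    delta_inventory = inventory_end - inventory_start
    return 365 * average(payables_start, payables_end) / (cogs + delta_inventory)


def inventory_turnover(cost_of_sales: float, inventory: float) -> float:
    """Cost of sales divided by inventory."""
    return cost_of_sales / inventory


def return_on_assets(net_income: float, assets_start: float, assets_end: float) -> float:
    """Net income divided by average total assets."""
    return net_income / average(assets_start, assets_end)


def fcf_conversion(operating_cash_flow: float, capex: float, net_income: float) -> float:
    """(Operating cash flow - CAPEX) divided by net income."""
    return (operating_cash_flow - capex) / net_income


def yoy_change(current: float, prior: float) -> float:
    """Year-over-year change as a fraction of the prior-year value."""
    return (current - prior) / prior


# Illustrative inputs only (not filing data).
print(f"fixed asset turnover: {fixed_asset_turnover(1000, 40, 60):.2f}")
print(f"DPO: {days_payable_outstanding(90, 110, 700, 50, 80):.2f}")
print(f"inventory turnover: {inventory_turnover(950, 100):.2f}")
print(f"ROA: {return_on_assets(-20, 900, 1100):.2f}")
print(f"FCF conversion: {fcf_conversion(200, 44, 100):.0%}")
print(f"YoY change: {yoy_change(130.8, 100):.1%}")
```

Keeping each ratio as a separate function mirrors how the justifications are described: one named line item per argument, one arithmetic step per formula.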

Domain-Relevant and Novel Questions Test Analyst Insights

Beyond extraction, the dataset probes qualitative and quantitative judgment: capital intensity (3M FY2022: not capital-intensive, based on 5.1% CAPEX/revenue, 20% fixed assets/total assets, and 12.4% ROA); liquidity (3M Q2 FY2023: quick ratio of 0.96 = (current assets - inventory) / current liabilities, judged as needing improvement); operating margin drivers (3M FY2022: a 1.7% decline attributed to litigation and the PFAS exit); segment growth (3M Consumer: -0.9% organic growth, excluding M&A); dividend stability (3M: 65 consecutive years of increases); debt securities (3M Q2 2023: MMM26, MMM30, and MMM31 listed on the NYSE); and restructuring costs (AES FY2022: 0, as none are outlined). Novel tasks such as identifying the 'segment dragging growth' or summarizing 8-K agendas (Amcor 2022: debt substitution) mimic analyst workflows and ground LLMs in evidence-based reasoning over filings.
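The liquidity and capital-intensity judgments above rest on a few simple ratios. The sketch below shows one way to compute them, under stated assumptions: the 1.0 cutoff for the quick ratio is a common rule of thumb used here for illustration rather than a threshold stated in the dataset, the function names are hypothetical, and the inputs are illustrative numbers, not filing data.

```python
from typing import Dict


def quick_ratio(current_assets: float, inventory: float,
                current_liabilities: float) -> float:
    """(Current assets - inventory) divided by current liabilities."""
    return (current_assets - inventory) / current_liabilities


def liquidity_needs_improvement(ratio: float, threshold: float = 1.0) -> bool:
    """Flag a quick ratio below the threshold.

    The default of 1.0 is an assumed rule of thumb, not a dataset rule.
    """
    return ratio < threshold


def capital_intensity_indicators(capex: float, revenue: float,
                                 fixed_assets: float, total_assets: float,
                                 net_income: float,
                                 avg_total_assets: float) -> Dict[str, float]:
    """Return the three ratios cited in the capital-intensity example."""
    return {
        "capex_to_revenue": capex / revenue,
        "fixed_assets_to_total_assets": fixed_assets / total_assets,
        "roa": net_income / avg_total_assets,
    }


# Illustrative inputs only (not filing data).
ratio = quick_ratio(current_assets=146.0, inventory=50.0, current_liabilities=100.0)
print(f"quick ratio: {ratio:.2f}, needs improvement: {liquidity_needs_improvement(ratio)}")

indicators = capital_intensity_indicators(
    capex=51.0, revenue=1000.0,
    fixed_assets=200.0, total_assets=1000.0,
    net_income=124.0, avg_total_assets=1000.0,
)
for name, value in indicators.items():
    print(f"{name}: {value:.1%}")
```

The final yes/no judgment on capital intensity is left out deliberately: the source gives the supporting ratios but no numeric cutoffs, so any threshold here would be a guess.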

Summarized by x-ai/grok-4.1-fast via openrouter
