Spark's 50k Small Files Kill Downstream Query Speed
Spark jobs that write 10TB as 50,000 files of ~200MB each add minutes of metadata and listing overhead to every downstream read; even though each file sits inside the 128MB-1GB range big-data engines are designed around, the sheer file count slows queries.
Avoid Small-File Outputs for Production Spark Jobs
Spark jobs tuned only for write completion can produce 50,000 files of ~200MB each for a 10TB dataset. This creates production issues: downstream systems like Spark, Presto, or Trino see high latency because their first step, listing and scheduling work across 50k files, takes minutes before any data processing starts. Result: dashboards go red hours after a successful write, frustrating the consuming teams.
Fix the root cause upfront: target output files in the 128MB–1GB range to enable locality (data concentrated on fewer nodes) and efficient batching, matching big-data engines' core assumptions. At ~1GB per file, a 10TB job should produce on the order of 10,000 files rather than 50,000, cutting metadata load and read-side planning time several-fold.
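As a quick sizing check, the target file count falls out of simple arithmetic. A minimal sketch, assuming the ~10TB output and the ~1GB per-file target above (the figures are illustrative):

```python
import math

# Back-of-envelope sizing for output files (illustrative figures only).
total_output_bytes = 10 * 1024**4      # ~10TB of output data (assumed)
target_file_bytes = 1024**3            # aim near the top of the 128MB-1GB band

target_file_count = math.ceil(total_output_bytes / target_file_bytes)
print(target_file_count)               # -> 10240 files at ~1GB vs 50,000 at ~200MB
```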
Metadata Overhead and Engine Assumptions
Every file adds listing overhead: with 50k files, Spark's driver must catalog paths, sizes, and partitions before it can assign a single task, burning time on coordination rather than compute. A 200MB file reads fine on its own, but 50,000 of them fragment HDFS/S3 directories and defeat optimizations such as:
- Locality: Data is spread across too many objects, so tasks are less likely to run next to their data and more reads go over the network.
- Batching: Engines expect larger files so that vectorized I/O and predicate pushdown over column statistics pay off (see the read-side sketch below).
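For context on the batching point: Spark packs input files into read tasks using size-based settings, and with tens of thousands of files the per-file open cost starts to dominate planning. A minimal read-side sketch, assuming a hypothetical dataset path; spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes are standard Spark SQL options, and tuning them only mitigates, rather than fixes, an over-fragmented layout:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-side-batching-sketch")
    # Bytes packed into a single read task (Spark's default is 128MB).
    .config("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))
    # Estimated cost of opening a file, expressed in bytes; raising it makes
    # Spark pack more small files into each task instead of one task per file.
    .config("spark.sql.files.openCostInBytes", str(8 * 1024 * 1024))
    .getOrCreate()
)

# Hypothetical path: with 50k files, listing and planning are already expensive
# before the first task runs, which is why fixing the writes matters more.
df = spark.read.parquet("s3://example-bucket/events/")
print(df.rdd.getNumPartitions())  # number of read tasks after file packing
```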
Trade-off: Larger files improve reads but can add some write time; prioritize downstream velocity over upstream completion speed. In interviews, demonstrate the fix by repartitioning the write (e.g., df.repartition(1000).write...), choosing the partition count from data volume and cluster size rather than a fixed number.
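A fuller version of that repartition example, as a minimal PySpark sketch; the table name, output path, and size estimate are hypothetical, and the partition count is derived from data volume rather than hard-coded:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-output-sketch").getOrCreate()

# Hypothetical source table, assumed to be ~10TB after transformations.
df = spark.table("events_enriched")

# Derive the output file count from data volume and the ~1GB target;
# in practice the size estimate could come from table or storage statistics.
estimated_output_bytes = 10 * 1024**4
target_file_bytes = 1024**3
num_output_files = max(1, estimated_output_bytes // target_file_bytes)

(
    df.repartition(num_output_files)   # roughly one output file per partition
      .write
      .mode("overwrite")
      .parquet("s3://example-bucket/events_compacted/")
)
```

If the job already has more partitions than it needs, df.coalesce(n) reduces the count without a full shuffle, at the cost of uneven file sizes when the upstream partitions are skewed.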