Data and Beyond
Spark's 50k Small Files Kill Downstream Query Speed
Spark jobs that write 10TB as 50,000 files of roughly 200MB each can add minutes of metadata overhead to every read and violate the 128MB–1GB file sizes that big-data engines assume, slowing downstream queries.
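The usual fix is to consolidate output into fewer, larger files by choosing the partition count from the data size and a target file size. A minimal sketch of that sizing math, assuming a hypothetical 512MB target (`target_partitions` is an illustrative helper, not a Spark API):

```python
# Pick a repartition count so output files land near a target size,
# instead of letting Spark's default parallelism dictate file count.
def target_partitions(total_bytes: int, target_file_bytes: int = 512 * 1024**2) -> int:
    """Number of output partitions that yields roughly target-size files."""
    return max(1, -(-total_bytes // target_file_bytes))  # ceiling division

ten_tb = 10 * 1024**4
print(target_partitions(ten_tb))  # 20480 partitions of ~512MB, not 50,000 of ~200MB

# In Spark (sketch): df.repartition(target_partitions(estimated_bytes)) \
#                      .write.parquet(path)
```

The estimate of `total_bytes` is the hard part in practice; it typically comes from input file sizes or a prior run's output, since Spark does not know the compressed output size in advance.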