
If your Iceberg queries are getting slower and your S3 bill is climbing, the most likely culprit is not your data volume — it is your file count. The small files problem is the single most common structural issue in production Iceberg lakehouses, and it compounds silently until query latency doubles and object storage costs spike.
A well-tuned Iceberg table stores data in 256–512 MB Parquet files. A streaming table with 10-minute Flink checkpoints across 100 partitions produces 14,400 new files per day — most of them a few megabytes each. Within a week, that table carries over 100,000 tiny files that every query must plan against, every manifest must track, and every LIST call must enumerate. The performance impact is measurable; the cost impact is material.
This guide walks through the full lifecycle of the small files problem: why it happens, how to measure it, how to fix it manually, how to prevent it, and how to automate the solution at production scale.
What the small files problem is and why it matters
Apache Iceberg stores table data as immutable Parquet files in object storage. Each file is tracked in manifest files, which are themselves tracked in manifest lists referenced by snapshots. When a table accumulates thousands of undersized files instead of hundreds of properly sized ones, three cost vectors compound simultaneously.
Query planning overhead. Before executing a single row scan, the query engine must read every manifest file to build the list of data files that match the query predicates. A table with 500 manifest files tracking 200,000 data files requires the planner to open, decompress, and evaluate each manifest — hundreds of S3 GET requests before any data is read. Planning time that should take milliseconds stretches to seconds or tens of seconds.
S3 request costs. S3 charges per API request regardless of object size. A GET on a 2 MB file costs the same as a GET on a 512 MB file — $0.0004 per 1,000 requests. A table with 500,000 small files queried 50 times per day generates roughly 25 million GET requests per month. The same data compacted to 2,000 files drops that to 100,000 GETs. PUT costs on write, LIST costs during planning, and HEAD costs during maintenance all multiply proportionally with file count (see Iceberg cost optimization strategies for a full breakdown of these cost vectors).
Metadata bloat. Every data file gets an entry in a manifest file recording its partition values, column-level statistics (min, max, null count), file size, and record count. More files mean more manifest entries, larger manifest files, and more manifest files overall. The metadata layer that enables Iceberg's fast scan planning becomes the bottleneck — manifest reads dominate query start time, and manifest rewrites during maintenance take longer.
The net effect: a table with the same logical data stored in 500,000 × 5 MB files performs 10–50× worse on planning, costs 10–100× more in API requests, and takes dramatically longer to maintain than the same data stored in 5,000 × 500 MB files.

Solutions like LakeOps automate this entire lifecycle — detecting small file accumulation, triggering compaction, and optimizing sort order based on actual query patterns. Rather than building custom cron jobs or monitoring pipelines, a dedicated lakehouse management platform handles detection, execution, and prevention continuously across every table. We'll explore this approach in detail later in this guide.
Root causes: where small files come from
Small files are rarely a single misconfiguration. In production lakehouses, five patterns create them — often simultaneously on the same table.
Streaming micro-batches. Flink and Spark Structured Streaming commit data at checkpoint intervals. A Flink job checkpointing every 60 seconds against a table with 50 active partitions creates 50 new files per minute — 72,000 files per day. Each file contains only the rows that arrived in that 60-second window for that partition, typically 1–10 MB. This is by far the most prolific source of small files in production (see Flink Iceberg optimization for streaming-specific tuning).
High-cardinality partitioning. Partitioning by a column with too many distinct values — user_id, device_id, or a timestamp at hour granularity on a low-volume table — spreads each write across many partitions. If a batch job writes 10 GB of data partitioned by hour across 30 days, each of the 720 partitions receives roughly 14 MB. Partition transforms like bucket(N) or overly fine truncate() create similar fragmentation (see partitioning best practices for guidance on choosing the right scheme).
Frequent upserts and deletes. MERGE INTO operations and row-level deletes in Iceberg v2 create delete files (position or equality) alongside data files. Each merge cycle adds new small data files for inserted rows and delete files for updated or removed rows. Without compaction, delete files accumulate and every read must reconcile them against data files — a per-query cost that grows linearly with delete file count.
Concurrent writers. Multiple Spark jobs or Flink pipelines writing to the same table — even different partitions — each produce their own file sets per commit. Optimistic concurrency control handles conflicts, but the file proliferation is additive: three concurrent writers each producing 1,000 files per hour means 3,000 files per hour, each writer unaware of the others' output.
Under-tuned Spark and Flink jobs. Default Spark shuffle partition counts (200), small advisory partition sizes, and untuned write distribution modes all produce undersized output files. A Spark job with 200 shuffle partitions writing 5 GB of data produces 200 files averaging 25 MB — well below the 256 MB target. The defaults optimize for parallelism, not file size.
How Iceberg's metadata tree amplifies the problem
Iceberg's metadata structure is a tree: the current metadata file points to a manifest list, which points to manifests, which track data files. Each layer amplifies the cost of small files.
Every commit — whether a streaming checkpoint, a batch append, or a compaction job — creates a new snapshot with its own manifest list. Each manifest list references one or more manifest files. Frequent commits from streaming jobs produce many manifests, each tracking a small number of new files. A table with 10-minute streaming commits over 30 days accumulates roughly 4,300 snapshots, each with at least one manifest.
Query planning must traverse this tree. The engine reads the current manifest list, then reads every referenced manifest to collect the full set of data files. If manifests are fragmented — many manifests each tracking a handful of files — the planner issues hundreds of small S3 GETs instead of a few large reads. Even with manifest caching, the initial cold-start planning cost is proportional to manifest count.
The compounding effect is clear: more data files → more manifest entries → more manifests → slower planning → more API calls → higher cost. Compaction addresses the data file layer; manifest rewriting (rewrite_manifests) addresses the manifest layer. Both are necessary for tables with severe small-file accumulation. For a comprehensive view of how metadata health intersects with table maintenance best practices, see our dedicated guide.
Measuring the problem
Before compacting, establish where the problem is worst. Three metrics identify tables with actionable small-file issues.
Average file size. Target: 256–512 MB for analytic workloads, 64–128 MB for point-lookup tables. Anything consistently below 32 MB is a problem. Query the Iceberg metadata to compute this:
1-- Spark SQL: average file size per table2SELECT3 COUNT(*) AS total_files,4 ROUND(AVG(file_size_in_bytes) / 1048576, 1) AS avg_file_mb,5 ROUND(MIN(file_size_in_bytes) / 1048576, 1) AS min_file_mb,6 ROUND(MAX(file_size_in_bytes) / 1048576, 1) AS max_file_mb,7 ROUND(SUM(file_size_in_bytes) / 1073741824, 2) AS total_gb8FROM catalog.db.my_table.files;Files per partition. A partition with 10,000 files of 2 MB each should be 40 files of 500 MB each. Partitions with extreme file counts are compaction priorities:
1-- Files per partition — find the worst partitions2SELECT3 partition,4 COUNT(*) AS file_count,5 ROUND(AVG(file_size_in_bytes) / 1048576, 1) AS avg_mb6FROM catalog.db.my_table.files7GROUP BY partition8ORDER BY file_count DESC9LIMIT 20;Manifest count and size. Too many manifests slow down every query plan. Check the manifest list:
1-- Manifest count and average size2SELECT3 COUNT(*) AS manifest_count,4 ROUND(AVG(length) / 1048576, 2) AS avg_manifest_mb,5 SUM(added_data_files_count) AS total_tracked_files6FROM catalog.db.my_table.manifests;If you find tables with average file sizes under 32 MB, partitions with thousands of files, and hundreds of manifests, those are the tables to compact first. Prioritize by query frequency — a heavily queried table with 100,000 small files has a far higher cost impact than a rarely accessed table with the same structure.
Manual compaction: rewrite_data_files
Iceberg provides the rewrite_data_files stored procedure for manual compaction. It reads small files, merges them, and writes new properly sized files — all as an atomic Iceberg commit with snapshot isolation.
Binpack strategy
Binpack is the fastest compaction strategy. It merges small files into target-size files without changing data order — no shuffle, no sort, just concatenation into larger files:
1CALL catalog.system.rewrite_data_files(2 table => 'db.events',3 strategy => 'binpack',4 options => map(5 'target-file-size-bytes', '536870912',6 'min-file-size-bytes', '402653184',7 'max-file-size-bytes', '671088640',8 'min-input-files', '5',9 'max-concurrent-file-group-rewrites', '10'10 )11);Key parameters:
target-file-size-bytes: The ideal output file size. 512 MB (536870912 bytes) is a good default for analytics workloads; 256 MB for BI/low-latency tables.min-file-size-bytes: Files below this size are always candidates for compaction. Default is 75% of target. Leave at 75% to avoid rewriting files already close to optimal.max-file-size-bytes: Files above this size are always candidates for compaction. Default is 180% of target; the example above uses a tighter 125% for more aggressive normalization.min-input-files: Minimum number of candidate files needed before a rewrite group runs. Default is 5. Lower to 2 for aggressive compaction on streaming tables.max-concurrent-file-group-rewrites: Parallelism. Increase on larger clusters; decrease if you see memory pressure.partial-progress.enabled: Set totrueon large tables so compaction commits incrementally per file group rather than failing atomically across the entire table.
Sort strategy
Sort compaction rewrites data in a specified column order. It is more expensive — the engine must shuffle all data — but the payoff is dramatic for query performance. Sorted data enables Parquet min/max statistics to skip entire row groups, often reducing scan volume by 80–95%:
1CALL catalog.system.rewrite_data_files(2 table => 'db.events',3 strategy => 'sort',4 sort_order => 'event_date ASC NULLS LAST, user_id ASC NULLS LAST',5 options => map(6 'target-file-size-bytes', '536870912',7 'min-file-size-bytes', '402653184',8 'max-file-size-bytes', '671088640',9 'rewrite-all', 'true'10 )11);Set rewrite-all to true for sort compaction — you want every file rewritten in the new order, not just the small ones. Choose sort columns based on your actual query predicates: the columns that appear most frequently in WHERE and JOIN clauses.
Delete file compaction
Tables using merge-on-read (MERGE INTO, UPDATE, DELETE) accumulate position delete files alongside data files. rewrite_data_files does not remove delete files — it only merges data files. You need a separate call to rewrite_position_delete_files to compact and resolve delete files (see the full delete files guide for deep coverage):
1CALL catalog.system.rewrite_position_delete_files(2 table => 'db.events',3 options => map(4 'rewrite-all', 'true'5 )6);For high-update tables, run both procedures in sequence: rewrite_data_files first to merge small data files, then rewrite_position_delete_files to resolve accumulated deletes. Without both, query performance degrades because readers must reconcile every delete file against every data file in the partition — a cost that grows linearly with delete file count.
Filtering compaction scope
On large tables, compact specific partitions rather than the entire table. This limits compute cost and reduces conflict probability with concurrent writers:
1CALL catalog.system.rewrite_data_files(2 table => 'db.events',3 strategy => 'binpack',4 where => 'event_date >= current_date - INTERVAL 7 DAYS',5 options => map(6 'target-file-size-bytes', '536870912'7 )8);After compaction, clean up the metadata layer. Compaction creates a new snapshot; the old snapshot still references the original small files. Expire snapshots and rewrite manifests to complete the cleanup:
1-- Expire snapshots older than 5 days, keep at least 32CALL catalog.system.expire_snapshots(3 table => 'db.events',4 older_than => TIMESTAMP '2026-05-21 00:00:00',5 retain_last => 36);7 8-- Consolidate manifests after compaction9CALL catalog.system.rewrite_manifests('db.events');Spark tuning: preventing small files at write time
The best compaction run is one you never need. Tuning Spark write behavior reduces small file creation at the source.
Write distribution mode. Controls how Spark distributes data across tasks before writing. The default (none) creates one file per task, regardless of partition. Set to hash for partitioned tables to colocate partition data and produce fewer, larger files:
1spark.sql("""2 ALTER TABLE catalog.db.events3 SET TBLPROPERTIES (4 'write.distribution-mode' = 'hash',5 'write.target-file-size-bytes' = '536870912'6 )7""")Distribution mode options:
none— No redistribution. Fastest writes, most small files. Use only when upstream data is already well-distributed.hash— Hash-partitions data by partition key before writing. Produces fewer files per partition. The default choice for most partitioned tables.range— Range-distributes data, producing globally sorted output. Most expensive writes, best query performance. Use for sort-optimized tables.
Advisory partition size. Controls the target size of Spark output partitions. Increase it to produce larger files:
1spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "512m")2spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")Fanout writers. For tables with many partitions, enable fanout writers to keep a writer open per partition rather than sorting and splitting. This avoids the shuffle overhead of hash distribution at the cost of higher memory usage per executor:
1spark.sql("""2 ALTER TABLE catalog.db.events3 SET TBLPROPERTIES (4 'write.spark.fanout.enabled' = 'true'5 )6""")Shuffle partitions. The default spark.sql.shuffle.partitions = 200 is too high for small-to-medium tables and too low for very large ones. Set it proportional to data volume — roughly one partition per 256–512 MB of output data.
Flink tuning: checkpoint and commit intervals
Flink is the primary producer of small files in streaming Iceberg lakehouses. Two settings control file size: checkpoint interval and commit interval.
Checkpoint interval. Each Flink checkpoint triggers a file flush. Shorter intervals mean smaller files. The default of 60 seconds is aggressive for Iceberg — 5–10 minute checkpoints produce 5–10× larger files per partition:
1StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();2env.enableCheckpointing(600_000); // 10 minutes3env.getCheckpointConfig().setMinPauseBetweenCheckpoints(300_000);Commit interval. Iceberg's Flink connector can buffer commits across multiple checkpoints. The upsert mode in particular benefits from longer commit intervals that batch more changes per file:
1CREATE TABLE iceberg_sink (2 event_id STRING,3 event_date DATE,4 payload STRING5) PARTITIONED BY (event_date)6WITH (7 'connector' = 'iceberg',8 'catalog-name' = 'my_catalog',9 'write.target-file-size-bytes' = '536870912',10 'write.upsert.enabled' = 'true'11);The trade-off is latency versus file size. Longer checkpoint intervals reduce small file production but increase the data-freshness lag. For most analytics use cases, 5–10 minute checkpoints provide an acceptable balance. For near-real-time dashboards where sub-minute freshness is required, accept the small files and rely on downstream compaction.
Automated compaction approaches
Manual compaction works for a handful of tables. At production scale — tens to hundreds of tables with different write patterns, query loads, and SLAs — automation is essential. See our solutions overview for how LakeOps addresses these challenges across the full table maintenance lifecycle.
Cron-based compaction
The simplest approach: schedule a compaction job to run on a fixed interval. A nightly Spark job iterates over all tables and runs rewrite_data_files on each:
1from pyspark.sql import SparkSession2 3spark = SparkSession.builder.getOrCreate()4 5tables = ["db.events", "db.orders", "db.clicks", "db.sessions"]6 7for table in tables:8 spark.sql(f"""9 CALL catalog.system.rewrite_data_files(10 table => '{table}',11 strategy => 'binpack',12 options => map(13 'target-file-size-bytes', '536870912',14 'min-file-size-bytes', '402653184'15 )16 )17 """)18 spark.sql(f"CALL catalog.system.rewrite_manifests('{table}')")Cron-based compaction has obvious limitations:
- Waste on idle tables. A table with no new data still gets a compaction run — the Spark job starts, scans metadata, finds nothing to do, and shuts down. The cluster time is wasted.
- Missed degradation between runs. A streaming table can accumulate 100,000 small files between nightly runs. Queries during those 23 hours pay the full planning and API cost penalty.
- No priority ordering. All tables are treated equally. A critical customer-facing table with 200,000 small files waits in the same queue as a low-priority staging table with 50.
- Cluster management. Someone must provision, right-size, and maintain the Spark cluster. Over-provision wastes money; under-provision causes jobs to fail with OOM errors on large tables.
Event-driven compaction
A step up from cron: trigger compaction based on table state rather than wall-clock time. Monitor file counts, average file sizes, or commit frequency, and compact only when thresholds are crossed. This requires infrastructure to watch table metadata and dispatch compaction jobs — typically a combination of a metadata polling service, a job scheduler, and a compute backend.
Event-driven compaction fixes the waste problem — idle tables are not compacted — but building and maintaining the monitoring, triggering, and execution pipeline is significant operational work. Most teams that attempt this end up with a custom system that needs its own monitoring, alerting, and on-call rotation. For a deeper look at what fully autonomous table maintenance looks like in practice, see how a control-plane approach eliminates this overhead entirely.
How LakeOps solves small files at scale
The approaches above work at low scale but break down as tables multiply, engines diversify, and maintenance operations interact. LakeOps is a dedicated lakehouse control plane built in Rust on Apache DataFusion that automates the entire small file lifecycle — detection, compaction, cleanup, and prevention — across every table in the lake. Instead of building custom cron jobs, monitoring pipelines, and Spark clusters, a single control plane watches every table's structural health, triggers compaction only when thresholds are crossed, executes at a fraction of Spark's cost, and sequences the full maintenance stack so each step operates on the clean output of the previous one.

Key capabilities for small file resolution:
- Event-driven triggers — Continuously monitors structural signals per table and partition (file count, average file size, delete-file-to-data-file ratio, manifest depth) and fires compaction only when thresholds are crossed; a streaming table may compact multiple times per hour while a weekly batch table compacts once
- Rust execution engine — Purpose-built on Apache DataFusion with Arrow columnar buffers, lock-free parallelism, and no JVM overhead; 86% faster than Spark in production benchmarks (221s vs 1,612s across 5.5 TB, peak 2,522 MB/s) at ~$5/TB vs $50/TB
- Per-table target sizes and strategies — Point-lookup tables compact to 64–128 MB, analytics tables to 256–512 MB; each table gets its own policy (strategy, target size, trigger thresholds, schedule constraints) with a specificity hierarchy from table → namespace → catalog
- Coordinated maintenance stack — Sequences the full pipeline (snapshot expiration → orphan cleanup → compaction → manifest optimization) so each step operates on clean output, eliminating wasted rewrites of soon-to-be-dereferenced files
- Query-aware sort optimization — Collects query telemetry across every connected engine (Trino, Spark, Snowflake, Athena, DuckDB, Flink) and chooses sort orders that maximize data skipping based on actual WHERE, JOIN, and GROUP BY patterns
- Lake-wide observability — Every table continuously classified as Critical, Warning, or Healthy; Insights engine surfaces proactive alerts at four severity levels before query latency spikes, giving platform teams full visibility into which tables need attention





Key takeaways
Compaction is the cure; prevention reduces how often you need it. Apply these practices regardless of which compaction approach you use.
- 1.Automate compaction with a dedicated control plane. Small files are an ongoing structural problem, not a one-time fix. A control plane like LakeOps continuously monitors file counts across every table, triggers compaction based on actual thresholds, applies query-aware sort orders, and sequences the full maintenance pipeline (expire → cleanup → compact → rewrite manifests) automatically. Whether you run it in autopilot mode or use its observability to make manual decisions, treating small file management as infrastructure — not a cron job — is the most impactful best practice.
- 2.Right-size checkpoint intervals. If freshness requirements allow it, set Flink checkpoints to 5–10 minutes instead of 60 seconds. This alone can reduce small file production by 5–10×.
- 3.Use hash or range distribution mode. Set
write.distribution-modetohashon every partitioned Iceberg table written by Spark. The defaultnonevirtually guarantees small files. - 4.Avoid high-cardinality partitions. Partition on columns with bounded cardinality — date, region, category. Never partition on user IDs, device IDs, or other high-cardinality columns. Use Iceberg's hidden partitioning transforms (
days(),months(),bucket(),truncate()) to control cardinality. - 5.Set target file size at the table level. Declare
write.target-file-size-bytesin table properties so every engine writing to the table respects the same target. - 6.Tune shuffle partitions. Set
spark.sql.shuffle.partitionsproportional to output data volume — one partition per 256–512 MB of data — rather than using the default 200. - 7.Enable adaptive query execution. Spark's AQE coalesces small partitions automatically when
spark.sql.adaptive.enabledistrueandspark.sql.adaptive.coalescePartitions.enabledistrue. - 8.Monitor continuously. Track average file size, files per partition, and manifest count per table. Set alerts when metrics cross thresholds — 10,000+ files or average size below 32 MB — and investigate before degradation becomes critical.
- 9.Sequence maintenance correctly. Always expire snapshots before orphan cleanup, and run both before compaction. Compacting first wastes compute rewriting files that are about to be garbage-collected.
The small files problem is structural, not accidental. Every streaming pipeline, every concurrent writer, every under-tuned Spark job adds files that degrade the table over time. The question is not whether your tables will accumulate small files — they will. The question is whether you detect and fix the problem before it costs you in query latency, S3 spend, and engineering time. Measure your tables today, compact the worst offenders, tune your writers to prevent recurrence, and automate the process so it runs continuously without human intervention.



