
Streaming from Kafka into Apache Iceberg is the most common ingestion pattern in production lakehouses. It is also the most prolific source of small files, snapshot bloat, and metadata growth — the three things that make Iceberg tables slow and expensive over time.
A Flink job checkpointing every 60 seconds against a table with 100 active partitions produces 144,000 new files per day. Kafka Connect with 5-minute commit intervals against 50 partitions produces 14,400. Even at moderate volumes, a streaming table accumulates over 100,000 undersized files in its first week — each one tracked in manifests, each one triggering an S3 GET on every query, each one making your lakehouse a little slower.
Compaction is the fix: read many small files, write fewer large ones. But standard compaction — a nightly Spark job that rewrites everything — does not work for streaming tables. It conflicts with active writers, leaves 23 hours of degradation between runs, wastes compute on healthy partitions, and never catches up on high-throughput tables.
This guide covers how to compact streaming Iceberg tables correctly: how to measure when compaction is needed, the right strategy, the right timing, the right partition boundaries, the right execution engine, and how to coordinate compaction with the rest of the maintenance lifecycle. It also covers how a purpose-built control plane like LakeOps replaces manual compaction scripts with autonomous, event-driven maintenance that adapts to each table's write pattern.
Why streaming tables are different
Batch tables accumulate small files predictably — after each load, the table is stable until the next one. Compaction can run once after the load completes, targeting the full output. Streaming tables never stop writing. Every checkpoint creates new files in the active partition while compaction tries to consolidate older ones.
This creates four problems that batch compaction approaches cannot solve:
Continuous file creation. A streaming table's file count grows linearly with time and partition count. There is no "after the load" quiet period. A table with 50 partitions and 5-minute checkpoints adds 14,400 files per day, 100,800 per week, 432,000 per month. Without continuous compaction, the table degrades indefinitely.
Active partition conflicts. Iceberg uses optimistic concurrency control. Each writer assumes exclusive access and validates at commit time. When a compaction job rewrites files in a partition that a streaming writer is still appending to, the commit fails with a ValidationException. One of them must retry — and on a hot partition with sub-minute writes, retries cascade. This is the single most common cause of compaction failures on streaming tables.
Snapshot accumulation. Every checkpoint commit creates a new Iceberg snapshot. A table with 5-minute commits generates 8,640 snapshots per month. Each snapshot pins references to data files, preventing storage reclamation. Without snapshot expiration running in coordination with compaction, the metadata layer bloats even as data files are consolidated.
Delete file overhead (CDC tables). Tables receiving CDC events via Debezium accumulate delete files on every commit. At read time, the engine reconciles each delete file against matching data files — a per-query cost that grows linearly. Without compaction to merge deletes into base data, a CDC table with 10,000 updates per hour becomes progressively slower to query.

For a deeper look at the full small files lifecycle — root causes, measurement, and prevention — see Fixing Small Files in Apache Iceberg.
Measuring compaction need
Before running compaction, quantify the problem. Iceberg exposes structural metadata through system tables that every query engine can read. Use these to determine which tables and partitions need compaction — and how urgently.
File count and size per partition
1SELECT2 partition,3 COUNT(*) AS file_count,4 AVG(file_size_in_bytes) / 1048576 AS avg_size_mb,5 SUM(file_size_in_bytes) / 1073741824 AS total_size_gb,6 SUM(CASE WHEN file_size_in_bytes < 67108864 THEN 1 ELSE 0 END) AS small_files7FROM analytics.clickstream.files8GROUP BY partition9ORDER BY file_count DESC;A healthy partition has fewer than 100 files averaging 256–512 MB each. A streaming partition with 5,000 files averaging 3 MB is in critical condition. The small_files column (files under 64 MB) tells you exactly how many files compaction needs to consolidate.
Manifest health
1SELECT2 COUNT(*) AS manifest_count,3 AVG(added_data_files_count + existing_data_files_count) AS avg_files_per_manifest,4 SUM(length) / 1048576 AS total_manifest_size_mb5FROM analytics.clickstream.manifests;Target: fewer than 100 manifests per snapshot. Streaming tables with 5-minute commits can accumulate 8,600+ manifests in 30 days. When manifest count exceeds 100, query planning time increases measurably — the engine must open, decompress, and evaluate each manifest before reading any data. Manifest rewrite (rewrite_manifests) consolidates them after compaction.
Snapshot velocity
1SELECT2 COUNT(*) AS snapshot_count,3 MIN(committed_at) AS oldest_snapshot,4 MAX(committed_at) AS latest_snapshot5FROM analytics.clickstream.snapshots;If snapshot count exceeds 1,000, expiration is not running or is not aggressive enough. Each snapshot pins references to data files, preventing deletion of superseded data. On streaming tables with 5-minute commits, 3–7 day retention is typical.
Table health score
Combine file count, average file size, and manifest count into a single assessment:
1WITH file_stats AS (2 SELECT3 COUNT(*) AS total_files,4 AVG(file_size_in_bytes) AS avg_file_size,5 SUM(CASE WHEN file_size_in_bytes < 67108864 THEN 1 ELSE 0 END) AS small_files6 FROM analytics.clickstream.files7),8manifest_stats AS (9 SELECT COUNT(*) AS total_manifests10 FROM analytics.clickstream.manifests11)12SELECT13 total_files,14 ROUND(avg_file_size / 1048576, 1) AS avg_file_size_mb,15 small_files,16 total_manifests,17 CASE18 WHEN avg_file_size < 67108864 AND total_files > 1000 THEN 'Critical — compact immediately'19 WHEN avg_file_size < 134217728 AND total_files > 500 THEN 'Warning — schedule compaction'20 ELSE 'Healthy'21 END AS status22FROM file_stats, manifest_stats;Run this query periodically (or wire it into Airflow/Dagster) to trigger compaction based on actual table state rather than fixed schedules.
The three compaction strategies
Iceberg's `rewriteDataFiles` action supports three strategies. Each makes a different tradeoff between speed and query performance improvement.
Binpack
Groups small files by partition and concatenates them into target-sized files without reordering data. No sort, no shuffle — pure I/O. This is the fastest strategy and the lowest-cost per byte compacted.
1CALL catalog.system.rewrite_data_files(2 table => 'analytics.clickstream',3 strategy => 'binpack',4 options => map(5 'target-file-size-bytes', '268435456',6 'min-file-size-bytes', '67108864',7 'max-file-size-bytes', '483183820',8 'min-input-files', '5',9 'partial-progress.enabled', 'true',10 'partial-progress.max-commits', '10',11 'max-concurrent-file-group-rewrites', '10'12 )13);When to use for streaming tables: As your baseline. Every streaming table should get binpack compaction at minimum. It eliminates the file-count penalty — more files per query plan, more S3 GET requests, more manifest entries — without the compute overhead of sorting. For append-only event streams where queries do full-partition scans (e.g., daily aggregations), binpack is often sufficient.
Limitation: Binpack does not improve data layout. Queries that filter on specific columns (WHERE event_type = 'purchase') still scan entire files because related rows are scattered randomly across all files in the partition.
Sort
Reads files, sorts data by specified columns, and writes sorted output. This physically clusters related rows so that Parquet min/max statistics enable file-level pruning — queries skip entire files that cannot contain matching data.
1CALL catalog.system.rewrite_data_files(2 table => 'analytics.clickstream',3 strategy => 'sort',4 sort_order => 'event_type ASC NULLS LAST, event_timestamp ASC',5 options => map(6 'target-file-size-bytes', '268435456',7 'min-file-size-bytes', '67108864',8 'min-input-files', '5',9 'partial-progress.enabled', 'true',10 'partial-progress.max-commits', '10'11 )12);When to use for streaming tables: When your queries consistently filter on the same 1–2 columns. A clickstream table where every dashboard query includes WHERE event_type = ... benefits from sorting by event_type. The sort adds CPU cost during compaction but pays for itself on every subsequent query through data skipping.
Tradeoff: Sort compaction is 3–5x slower than binpack on Spark. For streaming tables requiring multiple compaction passes per day, this compute cost is material. A Rust-based compaction engine narrows this gap — sort compaction at 780 seconds versus Spark's 1,612 seconds on identical 200 GB datasets — making sort viable even for high-frequency streaming maintenance.
Z-order
A multidimensional clustering technique that interleaves bits from multiple columns to create a space-filling curve. This enables file pruning across any combination of the Z-ordered columns — unlike linear sort, which only prunes efficiently on the leading column.
1CALL catalog.system.rewrite_data_files(2 table => 'analytics.clickstream',3 strategy => 'sort',4 sort_order => 'zorder(user_id, event_type, page_url)',5 options => map(6 'target-file-size-bytes', '536870912',7 'partial-progress.enabled', 'true',8 'max-concurrent-file-group-rewrites', '5'9 )10);When to use for streaming tables: When ad-hoc analytics queries filter on unpredictable combinations of 2–4 columns. Z-order is expensive — the most compute-intensive strategy — but it is the only way to optimize for multi-dimensional range predicates. Limit to 2–4 columns; more columns dilute the clustering benefit.
Tradeoff: The highest compaction cost for the broadest query benefit. On streaming tables, Z-order is typically run less frequently than binpack — perhaps daily or weekly — while binpack handles the continuous small-file consolidation.
Strategy comparison for streaming tables
| Strategy | Speed | Query improvement | Best streaming use case |
|---|---|---|---|
| Binpack | Fastest (I/O only) | File count reduction only | Baseline for all streaming tables |
| Sort | 3–5x slower (Spark) | 2–10x scan reduction on sorted columns | Predictable filter patterns (dashboards, reports) |
| Z-order | Slowest | Multi-column pruning | Ad-hoc analytics across 2–4 filter columns |
In practice, most streaming tables benefit from a two-tier approach: frequent binpack to keep file counts manageable, periodic sort compaction to optimize layout for the dominant query pattern. For a detailed comparison of compaction engines and strategies, see 9 Iceberg Compaction Tools Compared.
The impact of compaction: before and after
Concrete performance differences on a streaming clickstream table (50 partitions, 7 days of 5-minute Flink checkpoints):
| Metric | Before compaction (100,800 files) | After compaction (700 files) |
|---|---|---|
| Average file size | 3 MB | 430 MB |
| Query planning time | 8.2 seconds | 0.3 seconds |
| S3 GET requests per full scan | 100,800 | 700 |
| Column statistics effectiveness | Poor (wide min/max ranges) | Good (tight ranges per file) |
| Query execution time (daily aggregate) | 45 seconds | 4 seconds |
| Manifest files | 2,016 | 28 |
The 10x+ query speedup comes from three sources: fewer files to plan against, fewer S3 requests to execute, and tighter column statistics that enable predicate pushdown to skip entire files.
Tuning rewriteDataFiles parameters
The rewrite_data_files procedure has over 15 parameters. Most teams run with defaults and then wonder why compaction is slow, fails on large tables, or causes conflicts. Here are the parameters that matter for streaming tables:
File selection
- `target-file-size-bytes` (default: 512 MB) — the target output file size. 256 MB is better for streaming tables with frequent reads; 512 MB for heavy scan workloads. Set this as a table property (
write.target-file-size-bytes) so both writers and compaction use the same target - `min-file-size-bytes` (default: 75% of target) — files smaller than this are candidates for compaction. Increase to be more aggressive; decrease to skip borderline files
- `max-file-size-bytes` (default: 180% of target) — files larger than this are always rewritten, regardless of other criteria. Prevents oversized files from accumulating
- `min-input-files` (default: 5) — minimum number of candidate files in a partition before compaction runs. Set to 5 for streaming to skip healthy partitions. Set to 2 if you want aggressive consolidation
Parallelism and grouping
- `max-concurrent-file-group-rewrites` (default: 5) — how many file groups are rewritten simultaneously. Increase to 10–20 if your cluster has capacity. Each file group runs as an independent task, so higher concurrency finishes faster but uses more memory
- `max-file-group-size-bytes` (default: 100 GB) — maximum data in a single file group. Reduce to 5–10 GB for memory-constrained environments. Larger groups produce fewer commits but risk OOM on Spark
Partial progress (critical for streaming)
- `partial-progress.enabled` (default: false) — set to `true` for streaming tables. Without it, a conflict on one partition invalidates the entire compaction job. With it, each file group commits independently
- `partial-progress.max-commits` (default: 10) — maximum number of separate commits the job produces. If you have 100 file groups and max-commits is 10, Iceberg batches 10 groups per commit. Higher values increase fault tolerance but add metadata overhead (each commit is a new snapshot)
- `partial-progress.max-failed-commits` (default: same as max-commits) — how many commit failures are allowed before the entire job fails. On streaming tables with active writers, set this to at least 3–5 to tolerate occasional hot-partition conflicts
Advanced options
- `use-starting-sequence-number` (default: true) — uses the sequence number of the snapshot at compaction start rather than the new snapshot. This prevents V2 delete files from being applied to compacted data files that should not have them — important for CDC tables
- `rewrite-all` (default: false) — force rewriting all files regardless of size thresholds. Use only for one-time sort optimization when changing the sort order of an existing table
- `rewrite-job-order` — controls the order in which file groups are processed. Options include
bytes-asc,bytes-desc,files-asc,files-desc, andnone. Usefiles-descto prioritize the most fragmented partitions first
Partition-aware compaction: the critical pattern
The single most important technique for compacting streaming tables is: never compact the partition currently being written to. This eliminates writer conflicts entirely.
Iceberg's optimistic concurrency model detects conflicts at commit time. When a compaction job rewrites files in a partition that received new writes since the compaction started, the commit is rejected. On streaming tables with minute-level writes, the hot partition conflicts on almost every compaction attempt.
The solution is partition-aware compaction: restrict compaction to historical (cold) partitions where no new data is arriving. The streaming job writes to today's partition; compaction targets yesterday's and older.
Implementation in Spark
1-- Compact only cold partitions (older than 1 day)2CALL catalog.system.rewrite_data_files(3 table => 'analytics.clickstream',4 strategy => 'sort',5 sort_order => 'event_type ASC',6 where => 'event_date < current_date()',7 options => map(8 'target-file-size-bytes', '268435456',9 'min-input-files', '5',10 'partial-progress.enabled', 'true',11 'partial-progress.max-commits', '10',12 'max-concurrent-file-group-rewrites', '10',13 'rewrite-job-order', 'files-desc'14 )15);The where clause limits the scope to partitions older than today. The streaming writer appends to today's partition without interference. When today becomes yesterday, the next compaction run picks it up.
`partial-progress.enabled` is critical for streaming tables. It allows the compaction job to commit in chunks — if a conflict occurs on one chunk, only that chunk retries rather than the entire table rewrite. On a table with 100+ partitions, this prevents a single conflict from invalidating hours of compaction work.
`min-input-files` prevents wasting compute on partitions that do not need compaction. A partition with only 3 files, each 200 MB, is already well-sized. Setting min-input-files to 5 skips these partitions and focuses compute on the ones with hundreds of tiny files.
`rewrite-job-order` set to files-desc ensures the most fragmented partitions — the ones with the most files — are compacted first. If the job is interrupted or times out, the partitions with the worst degradation have already been handled.
Implementation in Flink
Flink provides the RewriteDataFilesAction through the iceberg-flink-runtime JAR. For streaming workloads, run compaction as a separate Flink job — not in the same pipeline as ingestion — to isolate resource consumption:
1Actions.forTable(table)2 .rewriteDataFiles()3 .binPack()4 .filter(Expressions.lessThan("event_date", today))5 .option("target-file-size-bytes", "268435456")6 .option("min-input-files", "5")7 .option("partial-progress.enabled", "true")8 .option("partial-progress.max-commits", "10")9 .option("max-concurrent-file-group-rewrites", "10")10 .execute();Run this job on a dedicated Flink cluster or task manager to avoid contention with the streaming ingestion pipeline. Scheduling through Airflow or a similar orchestrator at 1–4 hour intervals keeps file counts manageable without over-compacting.
Orchestrating with Airflow
For teams using Apache Airflow, a practical DAG pattern for streaming compaction:
1from airflow import DAG2from airflow.providers.apache.spark.operators.spark_sql import SparkSqlOperator3from datetime import datetime4 5with DAG(6 'streaming_table_maintenance',7 schedule_interval='0 */2 * * *',8 start_date=datetime(2026, 1, 1),9 catchup=False,10) as dag:11 12 expire_snapshots = SparkSqlOperator(13 task_id='expire_snapshots',14 sql="""15 CALL catalog.system.expire_snapshots(16 table => 'analytics.clickstream',17 older_than => current_timestamp() - INTERVAL 5 DAYS,18 retain_last => 10019 )20 """,21 )22 23 remove_orphans = SparkSqlOperator(24 task_id='remove_orphans',25 sql="""26 CALL catalog.system.remove_orphan_files(27 table => 'analytics.clickstream',28 older_than => current_timestamp() - INTERVAL 7 DAYS29 )30 """,31 )32 33 compact_files = SparkSqlOperator(34 task_id='compact_files',35 sql="""36 CALL catalog.system.rewrite_data_files(37 table => 'analytics.clickstream',38 strategy => 'binpack',39 where => 'event_date < current_date()',40 options => map(41 'target-file-size-bytes', '268435456',42 'min-input-files', '5',43 'partial-progress.enabled', 'true',44 'partial-progress.max-commits', '10'45 )46 )47 """,48 )49 50 rewrite_manifests = SparkSqlOperator(51 task_id='rewrite_manifests',52 sql="""53 CALL catalog.system.rewrite_manifests(54 table => 'analytics.clickstream'55 )56 """,57 )58 59 expire_snapshots >> remove_orphans >> compact_files >> rewrite_manifestsThis DAG enforces the correct maintenance order and runs every 2 hours. Each step is a separate task so failures are isolated — if orphan cleanup fails, it does not block compaction from running on the data that is already clean.
The full maintenance sequence
Compaction alone is not enough. Streaming tables require four coordinated maintenance operations, in a specific order:
1. Expire snapshots — remove snapshots beyond the retention window. A streaming table with 5-minute commits accumulates 8,640 snapshots per month. Each pins references to data files, preventing deletion of superseded bytes. Typical streaming retention is 3–7 days.
1CALL catalog.system.expire_snapshots(2 table => 'analytics.clickstream',3 older_than => TIMESTAMP '2026-06-04 00:00:00',4 retain_last => 1005);2. Remove orphan files — failed checkpoints, aborted commits, and concurrent writer conflicts leave data files in S3 that no snapshot references. These orphan files are invisible to Iceberg but billable by S3. On mature streaming lakes, orphans routinely account for 25–40% of billable storage.
1CALL catalog.system.remove_orphan_files(2 table => 'analytics.clickstream',3 older_than => TIMESTAMP '2026-05-31 00:00:00'4);Always use a safety window (7+ days) for orphan cleanup to avoid deleting files from in-progress writes.
3. Compact data files — merge small files into 256–512 MB targets. This is where partition-aware compaction from the previous section applies.
4. Rewrite manifests — consolidate the fragmented manifest files that streaming commits created. Each commit adds at least one manifest. After 30 days of 5-minute commits, the table carries roughly 8,600 manifests that query planning must traverse.
1CALL catalog.system.rewrite_manifests(2 table => 'analytics.clickstream'3);Manifest rewrite is the most underappreciated maintenance operation. After compaction merges 100,000 files into 700, the 2,016 manifests still reference the old file layout. Planning still reads all of them. rewrite_manifests consolidates them into a handful of properly sized manifests, cutting planning time from seconds to milliseconds.
The order matters. Expiring snapshots first frees data file references for orphan cleanup. Orphan cleanup removes files before compaction runs — so compaction does not waste compute rewriting files that are about to be deleted. Manifest rewrite runs last because it should reflect the final compacted file set. Running these operations independently — separate cron jobs on separate schedules — produces conflicts and wasted work. For a deep dive on maintenance sequencing, see Iceberg Table Health & Maintenance.

Why cron-based compaction fails for streaming
A nightly Spark compaction job is the default approach for most teams. Here is why it falls apart on streaming tables:
23 hours of degradation. Between nightly runs, every query pays the full small-file penalty. A streaming table accumulating 14,400 files per day means queries run against increasingly fragmented data all day. By the time compaction runs at midnight, the table has 14,400 uncompacted files. After compaction, the cycle restarts — by noon the next day, 7,200 new files have already accumulated.
Hot partition conflicts. A nightly job that targets the entire table will try to compact today's partition alongside historical ones. The streaming writer is still appending to today's partition. The compaction commit fails. Without partition-aware filtering, the entire job may abort — including the work on historical partitions that would have succeeded.
No priority ordering. A cron schedule treats all tables equally. Your critical customer-facing clickstream table with 200,000 small files waits in the same queue as a low-priority staging table with 50 files. Resources are spread uniformly instead of concentrated where they are needed.
Wasted compute on healthy partitions. Without min-input-files filtering, the job scans every partition — including ones that are already optimally sized. On a table with 365 daily partitions, only the most recent 7 typically need compaction. The other 358 are rewritten unnecessarily.
No coordination with other maintenance. Snapshot expiration runs on a separate schedule. Orphan cleanup runs on another. Compaction may rewrite files that snapshot expiration is about to dereference — wasted I/O. Orphan cleanup may run before snapshot expiration has freed the references — missing files. The operations interfere with each other instead of feeding into each other.
The cost of the compaction engine
For streaming tables, the compaction engine runs frequently — multiple times per day on high-throughput tables. The per-run cost compounds quickly.
JVM overhead
Spark-based compaction carries structural overhead that is irrelevant to the compaction workload itself:
- Cluster provisioning — starting a Spark job requires provisioning an EMR/Dataproc cluster or allocating capacity on a shared cluster. Cold starts take 2–5 minutes before any compaction work begins
- JVM startup and GC — garbage collection pauses during compaction introduce latency variance. On large tables with high delete file counts, full GC events can stall compaction for seconds. Spark routinely OOMs on tables above 1 TB with complex delete file patterns — requiring cluster resizing or job splitting
- Idle cluster costs — if you keep a Spark cluster running for compaction, you pay for idle time between runs. If you provision on demand, you pay for startup latency
- Executor overhead — Spark's executor model requires over-provisioning because under-provisioning causes failures. The compute cost of keeping tables healthy often rivals the compute cost of querying them
Cost comparison
| Spark (EMR) | Rust/DataFusion (LakeOps) | |
|---|---|---|
| 200 GB binpack | 1,612 seconds / ~$1.54 | 221 seconds / ~$0.21 |
| 200 GB sort | Not benchmarked (estimated ~$3.50) | 780 seconds / ~$0.75 |
| Cost per TB | ~$50/TB | ~$5/TB |
| 1.2 TB table | OOM failure | 11 minutes |
| Peak throughput | ~400 MB/s | 2,522 MB/s |
| JVM startup | 2–5 minutes | 0 seconds |
| GC pauses | Yes (unpredictable) | None |
Data from production benchmarks across 10 tables totaling 5.5 TB. The gap is structural, not marginal — Rust eliminates GC pauses, Apache DataFusion provides vectorized columnar execution with Arrow, and lock-free parallelism means worker threads never stall.

When you compact streaming tables multiple times per day across dozens of tables, the cost difference between $5/TB and $50/TB is the difference between compaction as an affordable utility and compaction as a significant line item on your cloud bill. See Iceberg Cost Optimization for a broader cost analysis.
Compaction for CDC streaming tables
Tables receiving CDC events from Kafka via Debezium face an additional compaction challenge: delete file accumulation.
Every update and delete in the source database produces delete files in the Iceberg table — position deletes (V2) or deletion vectors (V3). Without compaction, these delete files accumulate and every query must reconcile them against data files at read time. The cost is linear: 1,000 delete files means 1,000 additional reconciliation passes per query.
CDC compaction strategy
- Use merge-on-read (MoR) for the write path — lightweight streaming writes that create delete files alongside data files
- Run compaction frequently (1–4 hours) to merge delete files into base data — restoring read performance to baseline
- Set `delete-file-threshold` to trigger compaction when delete file count exceeds a threshold (e.g., 50 per partition) rather than on a fixed schedule
- On Iceberg V3, enable [deletion vectors](/blog/apache-iceberg-1-11-whats-new) — compressed Roaring bitmaps that replace position delete files and reduce both storage overhead and read-time reconciliation cost
- Monitor delete-to-data file ratios — a rising ratio signals compaction is not keeping pace with updates. See Iceberg Delete Files Guide for monitoring techniques
Measuring delete file overhead
1SELECT2 content,3 COUNT(*) AS file_count,4 SUM(file_size_in_bytes) / 1048576 AS total_size_mb5FROM analytics.customers.all_data_files6GROUP BY content;7-- content = 0: data files, content = 1: position deletes, content = 2: equality deletesWhen delete file count approaches or exceeds data file count, read performance is severely degraded. Compaction merges the deletes into clean data files, resetting the ratio to zero.
V3 deletion vectors change the game
Iceberg V3's deletion vectors store deleted row positions as Roaring bitmaps inside Puffin files, attached to each data file. This replaces the V2 pattern of creating separate position delete files — which fragment metadata and scale poorly on high-update tables.
For CDC tables, deletion vectors reduce compaction urgency. Where a V2 table with 10,000 updates per hour might need compaction every 1–2 hours to keep delete file counts manageable, a V3 table with deletion vectors can tolerate longer intervals because the bitmap overhead is minimal. Compaction still matters — it merges deleted rows into clean data files — but the read-time penalty between compaction runs is dramatically lower.
Autonomous compaction with LakeOps
LakeOps replaces cron-based compaction with an autonomous, event-driven maintenance system built specifically for the problems streaming tables create.

Event-driven triggers
LakeOps continuously monitors structural signals per table and partition — file count, average file size relative to target, delete-file-to-data-file ratio, manifest depth, and snapshot velocity. Compaction fires only when thresholds are crossed:
- A streaming table accumulating 1,000 files per hour may compact every 30 minutes
- A low-volume batch table may compact once a week
- A CDC table with rising delete ratios triggers compaction when the delete-file threshold is crossed
- A partition that already has optimal file sizes is skipped entirely
No wasted runs on healthy tables. No missed runs on degraded ones. No fixed schedule that treats every table identically.
Query-aware sort optimization
The compaction engine does not just merge files — it reorganizes data based on how tables are actually queried. LakeOps collects query telemetry across every connected engine (Trino, Spark, Snowflake, Athena, DuckDB, Flink) and identifies which columns appear in WHERE, JOIN, and GROUP BY clauses per table.
Compaction then applies the sort order that maximizes data skipping for the observed query mix. The planner evaluates single-dimension sort, multi-column sort, and Z-order independently per table — not a global default applied to every table. Sort orders adapt as query patterns evolve, so a clickstream table whose dominant filter shifts from event_type to user_id gets re-sorted automatically.
This turns compaction from a pure file-consolidation operation into a continuous query optimization.
Coordinated maintenance pipeline
Compaction runs as step 3 in a sequenced pipeline: snapshot expiration → orphan cleanup → compaction → manifest rewrite. Each step's output feeds the next. The full pipeline is coordinated per table, with conflict-aware execution that targets cold partitions while active streaming partitions continue receiving writes.
Policies at scale
When you operate dozens of streaming tables, per-table cron configuration does not scale. LakeOps policies apply maintenance rules at the catalog or namespace level — every table in scope inherits the policy automatically, including new tables created after the policy is defined. A single Adaptive Maintenance policy bundles compaction, snapshot expiration, orphan cleanup, and manifest rewrite into a data-driven workflow that reacts to table activity without separate schedules.


Configuration checklist for streaming compaction
Writer-side prevention
- Set checkpoint/commit intervals to 5 minutes for analytical workloads. Sub-minute freshness is rarely needed and creates 5x more files
- Use `hash` write distribution mode in Flink to align records with partitions before writing — reduces cross-partition file scatter
- Set `write.target-file-size-bytes` to
268435456(256 MB) for general analytics,536870912(512 MB) for heavy scan workloads - Prefer `day(timestamp)` partitioning over
hour(timestamp)unless volume exceeds 5 GB/hour. Coarser partitions concentrate files where compaction can consolidate them. See Partitioning Best Practices
Compaction-side configuration
- Always use partition-aware compaction — filter with
whereto exclude the active (hot) partition - Enable `partial-progress.enabled` — allows compaction to commit in chunks, so a single conflict does not invalidate hours of work
- Set `partial-progress.max-commits` to 10–20 for large tables with many partitions
- Set `max-concurrent-file-group-rewrites` to 10 if cluster capacity allows — increases parallelism
- Set `min-input-files` to 5 — skip partitions with only a few large files that are already well-sized
- Set `rewrite-job-order` to
files-desc— compact the most fragmented partitions first - Target 256–512 MB output files — optimal for S3 GET costs and Parquet column-chunk read efficiency
- Run binpack every 1–4 hours as a baseline; run sort compaction daily or weekly for query optimization
- For CDC tables, set `delete-file-threshold` to trigger compaction when delete files exceed 50 per partition
Maintenance coordination
- Run maintenance in order: expire snapshots → remove orphans → compact → rewrite manifests
- Set snapshot retention to 3–7 days for streaming tables
- Use a 7+ day safety window for orphan cleanup to avoid deleting in-progress writes
- Never schedule these operations as independent cron jobs — coordinate them as a pipeline (Airflow DAG) or use LakeOps Adaptive Maintenance
Monitoring
- Track file count per partition over time — the leading indicator of small-file accumulation. Alert when any partition exceeds 500 files
- Monitor average file size — should stay above 128 MB post-compaction. Alert when average drops below 64 MB
- Query `table.files` and `table.manifests` metadata tables periodically to compute a table health score
- Watch S3 GET request costs in CloudWatch/Billing — a sudden increase signals small-file proliferation before query latency degrades visibly
- Track delete file count on CDC tables — rising delete-to-data ratios signal compaction is falling behind
- Monitor compaction duration — if it exceeds the compaction interval, you are falling behind. Increase frequency, switch to a faster engine, or scale out
- Track query planning time — if planning exceeds 5 seconds on a table that previously planned in milliseconds, manifest bloat or small files are the cause
Summary
Compaction on streaming Iceberg tables is not a nightly job. It is a continuous system that must run alongside active writers without conflicts, target the right partitions at the right frequency, and coordinate with snapshot expiration, orphan cleanup, and manifest maintenance.
The minimum viable approach is partition-aware binpack compaction on a 1–4 hour schedule, with partial progress enabled, the active partition excluded, and metadata-driven thresholds determining when each table needs attention. The production-grade approach adds query-aware sort optimization, event-driven triggers, maintenance sequencing, and an execution engine that makes frequent compaction economically viable.
LakeOps provides the full stack — from catalog connection to autonomous, sequenced maintenance that keeps streaming tables compacted, lean, and queryable without scripts, cron jobs, or manual intervention.
Further reading
- Kafka to Iceberg: Ingestion Guide — ingestion paths, configuration, and the operational reality of streaming into Iceberg
- 9 Iceberg Compaction Tools Compared — deep comparison of LakeOps, Spark, AWS Glue, S3 Tables, Snowflake, and more
- Apache Iceberg with Flink: Streaming Optimization — checkpoint tuning, distribution modes, and Flink-specific compaction
- Fixing Small Files in Apache Iceberg — root causes, measurement, and automated resolution
- Iceberg Table Health & Maintenance — the correct maintenance sequence and why order matters
- Iceberg Delete Files Guide — monitoring and resolving delete file accumulation on CDC tables
- Apache Iceberg 1.11.0 — What's New? — deletion vectors, Variant type, and V3 improvements
- Reduce S3 Costs on Iceberg Tables — storage cost optimization for streaming lakehouses
- Iceberg Partitioning Best Practices — choosing the right partition strategy for streaming tables



