Kafka to Iceberg Compaction — Done Right

Streaming from Kafka into Apache Iceberg creates small files faster than any other write pattern. This guide covers why standard compaction approaches fail for streaming tables, how to measure compaction need, how to implement partition-aware compaction that avoids writer conflicts, how to tune rewriteDataFiles parameters, and how to run maintenance autonomously at scale.

Kafka to Iceberg Compaction — Kafka events streaming into an Iceberg table, compacted through a gear process into optimized blocks.

Streaming from Kafka into Apache Iceberg is the most common ingestion pattern in production lakehouses. It is also the most prolific source of small files, snapshot bloat, and metadata growth — the three things that make Iceberg tables slow and expensive over time.

A Flink job checkpointing every 60 seconds against a table with 100 active partitions produces 144,000 new files per day. Kafka Connect with 5-minute commit intervals against 50 partitions produces 14,400. Even at moderate volumes, a streaming table accumulates over 100,000 undersized files in its first week — each one tracked in manifests, each one triggering an S3 GET on every query, each one making your lakehouse a little slower.
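The file counts above follow from simple arithmetic. A minimal sketch, assuming each commit writes exactly one file per active partition (the function name `files_per_day` is illustrative, not part of any Iceberg API):

```python
def files_per_day(commit_interval_seconds: int, active_partitions: int) -> int:
    """Estimate new data files per day, assuming one file per partition per commit."""
    commits_per_day = 86_400 // commit_interval_seconds
    return commits_per_day * active_partitions


flink_daily = files_per_day(60, 100)     # 1,440 commits x 100 partitions
connect_daily = files_per_day(300, 50)   # 288 commits x 50 partitions

print(flink_daily)        # 144000
print(connect_daily)      # 14400
print(connect_daily * 7)  # 100800 files after one week
```

Real writers can produce more than one file per partition per commit (for example, one per writer task), so this is best read as a lower bound.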

Compaction is the fix: read many small files, write fewer large ones. But standard compaction — a nightly Spark job that rewrites everything — does not work for streaming tables. It conflicts with active writers, leaves 23 hours of degradation between runs, wastes compute on healthy partitions, and never catches up on high-throughput tables.

This guide covers how to compact streaming Iceberg tables correctly: how to measure when compaction is needed, the right strategy, the right timing, the right partition boundaries, the right execution engine, and how to coordinate compaction with the rest of the maintenance lifecycle. It also covers how a purpose-built control plane like LakeOps replaces manual compaction scripts with autonomous, event-driven maintenance that adapts to each table's write pattern.

Why streaming tables are different

Batch tables accumulate small files predictably — after each load, the table is stable until the next one. Compaction can run once after the load completes, targeting the full output. Streaming tables never stop writing. Every checkpoint creates new files in the active partition while compaction tries to consolidate older ones.

This creates four problems that batch compaction approaches cannot solve:

Continuous file creation. A streaming table's file count grows linearly with time and partition count. There is no "after the load" quiet period. A table with 50 partitions and 5-minute checkpoints adds 14,400 files per day, 100,800 per week, 432,000 per month. Without continuous compaction, the table degrades indefinitely.

Active partition conflicts. Iceberg uses optimistic concurrency control. Each writer works without locks, assuming nobody else will change the same files, and validates that assumption at commit time. When a compaction job rewrites files in a partition that a streaming writer is still appending to, the commit fails with a ValidationException. One of them must retry, and on a hot partition with sub-minute writes, retries cascade. This is the single most common cause of compaction failures on streaming tables.

Snapshot accumulation. Every checkpoint commit creates a new Iceberg snapshot. A table with 5-minute commits generates 8,640 snapshots per month. Each snapshot pins references to data files, preventing storage reclamation. Without snapshot expiration running in coordination with compaction, the metadata layer bloats even as data files are consolidated.

Delete file overhead (CDC tables). Tables receiving CDC events via Debezium accumulate delete files on every commit. At read time, the engine reconciles each delete file against matching data files — a per-query cost that grows linearly. Without compaction to merge deletes into base data, a CDC table with 10,000 updates per hour becomes progressively slower to query.

Multiple engines and streaming jobs writing to Iceberg tables — fragmented files
Multiple engines and streaming jobs writing to the same Iceberg tables. Each writer produces different file sizes and commit frequencies — creating the fragmented file distribution that compaction must resolve.

For a deeper look at the full small files lifecycle — root causes, measurement, and prevention — see Fixing Small Files in Apache Iceberg.

Measuring compaction need

Before running compaction, quantify the problem. Iceberg exposes structural metadata through system tables that every query engine can read. Use these to determine which tables and partitions need compaction — and how urgently.

File count and size per partition

```sql
SELECT
  partition,
  COUNT(*) AS file_count,
  AVG(file_size_in_bytes) / 1048576 AS avg_size_mb,
  SUM(file_size_in_bytes) / 1073741824 AS total_size_gb,
  SUM(CASE WHEN file_size_in_bytes < 67108864 THEN 1 ELSE 0 END) AS small_files
FROM analytics.clickstream.files
GROUP BY partition
ORDER BY file_count DESC;
```

A healthy partition has fewer than 100 files averaging 256–512 MB each. A streaming partition with 5,000 files averaging 3 MB is in critical condition. The small_files column (files under 64 MB) tells you exactly how many files compaction needs to consolidate.

Manifest health

```sql
SELECT
  COUNT(*) AS manifest_count,
  AVG(added_data_files_count + existing_data_files_count) AS avg_files_per_manifest,
  SUM(length) / 1048576 AS total_manifest_size_mb
FROM analytics.clickstream.manifests;
```

Target: fewer than 100 manifests per snapshot. Streaming tables with 5-minute commits can accumulate 8,600+ manifests in 30 days. When manifest count exceeds 100, query planning time increases measurably — the engine must open, decompress, and evaluate each manifest before reading any data. Manifest rewrite (rewrite_manifests) consolidates them after compaction.

Snapshot velocity

```sql
SELECT
  COUNT(*) AS snapshot_count,
  MIN(committed_at) AS oldest_snapshot,
  MAX(committed_at) AS latest_snapshot
FROM analytics.clickstream.snapshots;
```

If snapshot count exceeds 1,000, expiration is not running or is not aggressive enough. Each snapshot pins references to data files, preventing deletion of superseded data. On streaming tables with 5-minute commits, 3–7 day retention is typical.
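To judge whether an observed snapshot count is reasonable, it helps to know what a given retention window implies. A small sketch, assuming one snapshot per commit and expiration that keeps exactly the retention window (`retained_snapshots` is a hypothetical helper):

```python
def retained_snapshots(commit_interval_minutes: int, retention_days: int) -> int:
    """Snapshots kept at steady state, assuming one snapshot per commit."""
    commits_per_day = (24 * 60) // commit_interval_minutes
    return commits_per_day * retention_days


for days in (3, 5, 7, 30):
    print(days, retained_snapshots(5, days))
# 3 864
# 5 1440
# 7 2016
# 30 8640
```

At 5-minute commits, a 7-day window alone holds about 2,000 snapshots, so a fixed alert threshold is best read against the retention window actually in use.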

Table health score

Combine file count, average file size, and manifest count into a single assessment:

```sql
WITH file_stats AS (
  SELECT
    COUNT(*) AS total_files,
    AVG(file_size_in_bytes) AS avg_file_size,
    SUM(CASE WHEN file_size_in_bytes < 67108864 THEN 1 ELSE 0 END) AS small_files
  FROM analytics.clickstream.files
),
manifest_stats AS (
  SELECT COUNT(*) AS total_manifests
  FROM analytics.clickstream.manifests
)
SELECT
  total_files,
  ROUND(avg_file_size / 1048576, 1) AS avg_file_size_mb,
  small_files,
  total_manifests,
  CASE
    WHEN avg_file_size < 67108864 AND total_files > 1000 THEN 'Critical: compact immediately'
    WHEN avg_file_size < 134217728 AND total_files > 500 THEN 'Warning: schedule compaction'
    ELSE 'Healthy'
  END AS status
FROM file_stats, manifest_stats;
```

Run this query periodically (or wire it into Airflow/Dagster) to trigger compaction based on actual table state rather than fixed schedules.
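If the query result is consumed by an orchestrator, the same classification can be applied in code before deciding whether to launch a job. A minimal sketch that mirrors the CASE expression above; the function names and the way the two numbers reach the function are assumptions:

```python
MB = 1024 * 1024


def table_status(total_files: int, avg_file_size_bytes: float) -> str:
    """Apply the same thresholds as the SQL health score."""
    if avg_file_size_bytes < 64 * MB and total_files > 1000:
        return "critical"
    if avg_file_size_bytes < 128 * MB and total_files > 500:
        return "warning"
    return "healthy"


def should_compact(total_files: int, avg_file_size_bytes: float) -> bool:
    """Trigger compaction for anything that is not healthy."""
    return table_status(total_files, avg_file_size_bytes) != "healthy"


print(table_status(100_800, 3 * MB))   # critical
print(table_status(700, 430 * MB))     # healthy
```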

The three compaction strategies

Iceberg's `rewriteDataFiles` action supports three strategies. Each makes a different tradeoff between speed and query performance improvement.

Binpack

Groups small files by partition and concatenates them into target-sized files without reordering data. No sort, no shuffle — pure I/O. This is the fastest strategy and the lowest-cost per byte compacted.

```sql
CALL catalog.system.rewrite_data_files(
  table => 'analytics.clickstream',
  strategy => 'binpack',
  options => map(
    'target-file-size-bytes', '268435456',
    'min-file-size-bytes', '67108864',
    'max-file-size-bytes', '483183820',
    'min-input-files', '5',
    'partial-progress.enabled', 'true',
    'partial-progress.max-commits', '10',
    'max-concurrent-file-group-rewrites', '10'
  )
);
```

When to use for streaming tables: As your baseline. Every streaming table should get binpack compaction at minimum. It eliminates the file-count penalty — more files per query plan, more S3 GET requests, more manifest entries — without the compute overhead of sorting. For append-only event streams where queries do full-partition scans (e.g., daily aggregations), binpack is often sufficient.

Limitation: Binpack does not improve data layout. Queries that filter on specific columns (WHERE event_type = 'purchase') still scan entire files because related rows are scattered randomly across all files in the partition.
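The core idea of binpack can be illustrated in a few lines. This is a simplified greedy sketch, not Iceberg's actual planner: it walks file sizes in order and starts a new output group whenever adding the next file would exceed the target size. The name `plan_binpack` is hypothetical.

```python
def plan_binpack(file_sizes_mb: list, target_mb: int = 256) -> list:
    """Greedily pack files, in order, into groups of at most target_mb."""
    groups, current, current_size = [], [], 0
    for size in file_sizes_mb:
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


groups = plan_binpack([3] * 200, target_mb=256)   # 200 files of 3 MB each
print([len(g) for g in groups])                   # [85, 85, 30]
```

Rows are copied in whatever order they arrive, which is why binpack reduces file count but leaves the data layout unchanged.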

Sort

Reads files, sorts data by specified columns, and writes sorted output. This physically clusters related rows so that Parquet min/max statistics enable file-level pruning — queries skip entire files that cannot contain matching data.

```sql
CALL catalog.system.rewrite_data_files(
  table => 'analytics.clickstream',
  strategy => 'sort',
  sort_order => 'event_type ASC NULLS LAST, event_timestamp ASC',
  options => map(
    'target-file-size-bytes', '268435456',
    'min-file-size-bytes', '67108864',
    'min-input-files', '5',
    'partial-progress.enabled', 'true',
    'partial-progress.max-commits', '10'
  )
);
```

When to use for streaming tables: When your queries consistently filter on the same 1–2 columns. A clickstream table where every dashboard query includes WHERE event_type = ... benefits from sorting by event_type. The sort adds CPU cost during compaction but pays for itself on every subsequent query through data skipping.

Tradeoff: Sort compaction is 3–5x slower than binpack on Spark. For streaming tables requiring multiple compaction passes per day, this compute cost is material. A Rust-based compaction engine narrows this gap: in the cost comparison later in this guide, it sorted a 200 GB dataset in 780 seconds, less than the 1,612 seconds Spark needed for binpack alone on the same data (Spark sort was not benchmarked). That makes sort viable even for high-frequency streaming maintenance.
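The effect of sorting on file-level pruning can be seen with a toy model. The sketch below is an illustration under simplifying assumptions (synthetic data, a file is just a slice of rows, pruning uses only per-file min/max); it counts how many files an engine would have to open for `event_type = 'purchase'` before and after sorting.

```python
import random


def files_to_scan(rows, rows_per_file, wanted):
    """Count files whose [min, max] range could contain `wanted`."""
    files = [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]
    return sum(1 for f in files if min(f) <= wanted <= max(f))


random.seed(7)
event_types = ["click", "purchase", "scroll", "view"]
rows = [random.choice(event_types) for _ in range(10_000)]

unsorted_scan = files_to_scan(rows, 1_000, "purchase")
sorted_scan = files_to_scan(sorted(rows), 1_000, "purchase")

print(unsorted_scan)  # every file spans click..view, so none can be skipped
print(sorted_scan)    # only the files that actually hold 'purchase' rows
```

In the unsorted case every file contains the full range of values, so the min/max statistics exclude nothing. After sorting, only the handful of files covering the `purchase` range remain candidates.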

Z-order

A multidimensional clustering technique that interleaves bits from multiple columns to create a space-filling curve. This enables file pruning across any combination of the Z-ordered columns — unlike linear sort, which only prunes efficiently on the leading column.

```sql
CALL catalog.system.rewrite_data_files(
  table => 'analytics.clickstream',
  strategy => 'sort',
  sort_order => 'zorder(user_id, event_type, page_url)',
  options => map(
    'target-file-size-bytes', '536870912',
    'partial-progress.enabled', 'true',
    'max-concurrent-file-group-rewrites', '5'
  )
);
```

When to use for streaming tables: When ad-hoc analytics queries filter on unpredictable combinations of 2–4 columns. Z-order is expensive — the most compute-intensive strategy — but it is the only way to optimize for multi-dimensional range predicates. Limit to 2–4 columns; more columns dilute the clustering benefit.

Tradeoff: The highest compaction cost for the broadest query benefit. On streaming tables, Z-order is typically run less frequently than binpack — perhaps daily or weekly — while binpack handles the continuous small-file consolidation.
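Bit interleaving is easier to picture with a concrete example. The following sketch computes a two-column Z-value for small non-negative integers; it illustrates the general technique only, and Iceberg's own implementation handles arbitrary column types differently. `interleave_bits` is a hypothetical name.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x takes the even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y takes the odd bit positions
    return z


points = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(points, key=lambda p: interleave_bits(*p))
print(ordered[:4])   # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Sorting by the Z-value keeps points that are close in both dimensions near each other in the file order, which is what lets min/max statistics prune on either column.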

Strategy comparison for streaming tables

| Strategy | Speed | Query improvement | Best streaming use case |
| --- | --- | --- | --- |
| Binpack | Fastest (I/O only) | File count reduction only | Baseline for all streaming tables |
| Sort | 3–5x slower (Spark) | 2–10x scan reduction on sorted columns | Predictable filter patterns (dashboards, reports) |
| Z-order | Slowest | Multi-column pruning | Ad-hoc analytics across 2–4 filter columns |

In practice, most streaming tables benefit from a two-tier approach: frequent binpack to keep file counts manageable, periodic sort compaction to optimize layout for the dominant query pattern. For a detailed comparison of compaction engines and strategies, see 9 Iceberg Compaction Tools Compared.

The impact of compaction: before and after

Concrete performance differences on a streaming clickstream table (50 partitions, 7 days of 5-minute Flink checkpoints):

| Metric | Before compaction (100,800 files) | After compaction (700 files) |
| --- | --- | --- |
| Average file size | 3 MB | 430 MB |
| Query planning time | 8.2 seconds | 0.3 seconds |
| S3 GET requests per full scan | 100,800 | 700 |
| Column statistics effectiveness | Poor (wide min/max ranges) | Good (tight ranges per file) |
| Query execution time (daily aggregate) | 45 seconds | 4 seconds |
| Manifest files | 2,016 | 28 |

The 10x+ query speedup comes from three sources: fewer files to plan against, fewer S3 requests to execute, and tighter column statistics that enable predicate pushdown to skip entire files.

Tuning rewriteDataFiles parameters

The rewrite_data_files procedure has over 15 parameters. Most teams run with defaults and then wonder why compaction is slow, fails on large tables, or causes conflicts. Here are the parameters that matter for streaming tables:

File selection

  • `target-file-size-bytes` (default: 512 MB) — the target output file size. 256 MB is better for streaming tables with frequent reads; 512 MB for heavy scan workloads. Set this as a table property (write.target-file-size-bytes) so both writers and compaction use the same target
  • `min-file-size-bytes` (default: 75% of target) — files smaller than this are candidates for compaction. Increase to be more aggressive; decrease to skip borderline files
  • `max-file-size-bytes` (default: 180% of target) — files larger than this are always rewritten, regardless of other criteria. Prevents oversized files from accumulating
  • `min-input-files` (default: 5) — minimum number of candidate files in a partition before compaction runs. Set to 5 for streaming to skip healthy partitions. Set to 2 if you want aggressive consolidation
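How the size thresholds and `min-input-files` interact is easier to see in code. This is a deliberately simplified sketch of the selection rule described in the list above, using the documented default ratios (75% and 180% of target); the real planner applies further conditions, and `rewrite_candidates` is a hypothetical name.

```python
MB = 1024 * 1024


def rewrite_candidates(file_sizes, target=512 * MB, min_input_files=5):
    """Return the sizes that would be picked for rewrite, or [] if too few."""
    min_size = int(target * 0.75)   # default min-file-size-bytes
    max_size = int(target * 1.80)   # default max-file-size-bytes
    picked = [s for s in file_sizes if s < min_size or s > max_size]
    return picked if len(picked) >= min_input_files else []


partition = [3 * MB] * 10 + [200 * MB] * 3
print(len(rewrite_candidates(partition, target=256 * MB)))        # 10
print(rewrite_candidates([200 * MB] * 3, target=256 * MB))        # []
```

With a 256 MB target, the 200 MB files sit above the 192 MB minimum and are left alone, while the ten 3 MB files clear the `min-input-files` bar and are rewritten.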

Parallelism and grouping

  • `max-concurrent-file-group-rewrites` (default: 5) — how many file groups are rewritten simultaneously. Increase to 10–20 if your cluster has capacity. Each file group runs as an independent task, so higher concurrency finishes faster but uses more memory
  • `max-file-group-size-bytes` (default: 100 GB) — maximum data in a single file group. Reduce to 5–10 GB for memory-constrained environments. Larger groups produce fewer commits but risk OOM on Spark

Partial progress (critical for streaming)

  • `partial-progress.enabled` (default: false) — set to `true` for streaming tables. Without it, a conflict on one partition invalidates the entire compaction job. With it, each file group commits independently
  • `partial-progress.max-commits` (default: 10) — maximum number of separate commits the job produces. If you have 100 file groups and max-commits is 10, Iceberg batches 10 groups per commit. Higher values increase fault tolerance but add metadata overhead (each commit is a new snapshot)
  • `partial-progress.max-failed-commits` (default: same as max-commits) — how many commit failures are allowed before the entire job fails. On streaming tables with active writers, set this to at least 3–5 to tolerate occasional hot-partition conflicts
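The batching behaviour of `partial-progress.max-commits` can be sketched as follows. This assumes groups are split evenly with the batch size rounded up, which matches the 100-groups, 10-commits example above; the helper names are illustrative.

```python
import math


def groups_per_commit(total_groups: int, max_commits: int = 10) -> int:
    """File groups bundled into each commit, rounded up."""
    return math.ceil(total_groups / max_commits)


def commit_batches(group_ids: list, max_commits: int = 10) -> list:
    """Split file groups into the batches that would be committed together."""
    if not group_ids:
        return []
    size = groups_per_commit(len(group_ids), max_commits)
    return [group_ids[i:i + size] for i in range(0, len(group_ids), size)]


print(groups_per_commit(100))                    # 10
print(len(commit_batches(list(range(100)))))     # 10 commits
print(len(commit_batches(list(range(25)))))      # 9 commits of up to 3 groups
```

A conflict then costs at most one batch of work instead of the whole job.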

Advanced options

  • `use-starting-sequence-number` (default: true) — uses the sequence number of the snapshot at compaction start rather than the new snapshot. This prevents V2 delete files from being applied to compacted data files that should not have them — important for CDC tables
  • `rewrite-all` (default: false) — force rewriting all files regardless of size thresholds. Use only for one-time sort optimization when changing the sort order of an existing table
  • `rewrite-job-order` — controls the order in which file groups are processed. Options include bytes-asc, bytes-desc, files-asc, files-desc, and none. Use files-desc to prioritize the most fragmented partitions first

Partition-aware compaction: the critical pattern

The single most important technique for compacting streaming tables is: never compact the partition currently being written to. This eliminates writer conflicts entirely.

Iceberg's optimistic concurrency model detects conflicts at commit time. When a compaction job rewrites files in a partition that received new writes since the compaction started, the commit is rejected. On streaming tables with minute-level writes, the hot partition conflicts on almost every compaction attempt.

The solution is partition-aware compaction: restrict compaction to historical (cold) partitions where no new data is arriving. The streaming job writes to today's partition; compaction targets yesterday's and older.

Implementation in Spark

```sql
-- Compact only cold partitions (older than 1 day)
CALL catalog.system.rewrite_data_files(
  table => 'analytics.clickstream',
  strategy => 'sort',
  sort_order => 'event_type ASC',
  where => 'event_date < current_date()',
  options => map(
    'target-file-size-bytes', '268435456',
    'min-input-files', '5',
    'partial-progress.enabled', 'true',
    'partial-progress.max-commits', '10',
    'max-concurrent-file-group-rewrites', '10',
    'rewrite-job-order', 'files-desc'
  )
);
```

The where clause limits the scope to partitions older than today. The streaming writer appends to today's partition without interference. When today becomes yesterday, the next compaction run picks it up.

`partial-progress.enabled` is critical for streaming tables. It allows the compaction job to commit in chunks — if a conflict occurs on one chunk, only that chunk retries rather than the entire table rewrite. On a table with 100+ partitions, this prevents a single conflict from invalidating hours of compaction work.

`min-input-files` prevents wasting compute on partitions that do not need compaction. A partition with only 3 files, each 200 MB, is already well-sized. Setting min-input-files to 5 skips these partitions and focuses compute on the ones with hundreds of tiny files.

`rewrite-job-order` set to files-desc ensures the most fragmented partitions — the ones with the most files — are compacted first. If the job is interrupted or times out, the partitions with the worst degradation have already been handled.
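When the call is generated by a scheduler, the cutoff date can be passed in explicitly instead of relying on `current_date()`. A minimal sketch that renders the statement as a string; the function name is hypothetical, the column `event_date` comes from the example above, and the double-quoted literal inside `where` follows the quoting style used in Iceberg's procedure examples.

```python
from datetime import date


def cold_partition_compaction_sql(table: str, today: date, options: dict) -> str:
    """Build a rewrite_data_files call restricted to partitions before `today`."""
    option_args = ",\n    ".join(f"'{key}', '{value}'" for key, value in options.items())
    return (
        "CALL catalog.system.rewrite_data_files(\n"
        f"  table => '{table}',\n"
        "  strategy => 'binpack',\n"
        f"  where => 'event_date < \"{today.isoformat()}\"',\n"
        f"  options => map(\n    {option_args}\n  )\n"
        ");"
    )


sql = cold_partition_compaction_sql(
    "analytics.clickstream",
    date(2026, 6, 9),
    {"min-input-files": "5", "partial-progress.enabled": "true"},
)
print(sql)
```

Passing the date in makes reruns and backfills deterministic: a retry of yesterday's job still targets yesterday's cutoff.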

Implementation in Flink

Flink provides the RewriteDataFilesAction through the iceberg-flink-runtime JAR. For streaming workloads, run compaction as a separate Flink job, not in the same pipeline as ingestion, to isolate resource consumption:

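```java
import java.time.LocalDate;
import org.apache.iceberg.expressions.Expressions;
import org.apache.iceberg.flink.actions.Actions;

// `table` is an already-loaded org.apache.iceberg.Table.
// `today` is the current date as an ISO string; partitions before it are cold.
String today = LocalDate.now().toString();

// The builder methods and option names below are reproduced as published;
// check them against the Actions API of your iceberg-flink-runtime version.
Actions.forTable(table)
  .rewriteDataFiles()
  .binPack()
  .filter(Expressions.lessThan("event_date", today))
  .option("target-file-size-bytes", "268435456")
  .option("min-input-files", "5")
  .option("partial-progress.enabled", "true")
  .option("partial-progress.max-commits", "10")
  .option("max-concurrent-file-group-rewrites", "10")
  .execute();
```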

Run this job on a dedicated Flink cluster or task manager to avoid contention with the streaming ingestion pipeline. Scheduling through Airflow or a similar orchestrator at 1–4 hour intervals keeps file counts manageable without over-compacting.

Orchestrating with Airflow

For teams using Apache Airflow, a practical DAG pattern for streaming compaction:

```python
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_sql import SparkSqlOperator
from datetime import datetime

with DAG(
    'streaming_table_maintenance',
    schedule_interval='0 */2 * * *',
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:

    expire_snapshots = SparkSqlOperator(
        task_id='expire_snapshots',
        sql="""
            CALL catalog.system.expire_snapshots(
                table => 'analytics.clickstream',
                older_than => current_timestamp() - INTERVAL 5 DAYS,
                retain_last => 100
            )
        """,
    )

    remove_orphans = SparkSqlOperator(
        task_id='remove_orphans',
        sql="""
            CALL catalog.system.remove_orphan_files(
                table => 'analytics.clickstream',
                older_than => current_timestamp() - INTERVAL 7 DAYS
            )
        """,
    )

    compact_files = SparkSqlOperator(
        task_id='compact_files',
        sql="""
            CALL catalog.system.rewrite_data_files(
                table => 'analytics.clickstream',
                strategy => 'binpack',
                where => 'event_date < current_date()',
                options => map(
                    'target-file-size-bytes', '268435456',
                    'min-input-files', '5',
                    'partial-progress.enabled', 'true',
                    'partial-progress.max-commits', '10'
                )
            )
        """,
    )

    rewrite_manifests = SparkSqlOperator(
        task_id='rewrite_manifests',
        sql="""
            CALL catalog.system.rewrite_manifests(
                table => 'analytics.clickstream'
            )
        """,
    )

    expire_snapshots >> remove_orphans >> compact_files >> rewrite_manifests
```

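This DAG enforces the correct maintenance order and runs every 2 hours. Each step is a separate task, so a failure is easy to locate and retry on its own. Note that with Airflow's default trigger rule (all_success), a failed step blocks everything downstream: if orphan cleanup fails, compaction does not run. To let compaction proceed on the data that is already clean, set trigger_rule='all_done' on the downstream tasks.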

The full maintenance sequence

Compaction alone is not enough. Streaming tables require four coordinated maintenance operations, in a specific order:

1. Expire snapshots — remove snapshots beyond the retention window. A streaming table with 5-minute commits accumulates 8,640 snapshots per month. Each pins references to data files, preventing deletion of superseded bytes. Typical streaming retention is 3–7 days.

```sql
CALL catalog.system.expire_snapshots(
  table => 'analytics.clickstream',
  older_than => TIMESTAMP '2026-06-04 00:00:00',
  retain_last => 100
);
```

2. Remove orphan files — failed checkpoints, aborted commits, and concurrent writer conflicts leave data files in S3 that no snapshot references. These orphan files are invisible to Iceberg but billable by S3. On mature streaming lakes, orphans routinely account for 25–40% of billable storage.

```sql
CALL catalog.system.remove_orphan_files(
  table => 'analytics.clickstream',
  older_than => TIMESTAMP '2026-05-31 00:00:00'
);
```

Always use a safety window (7+ days) for orphan cleanup to avoid deleting files from in-progress writes.

3. Compact data files — merge small files into 256–512 MB targets. This is where partition-aware compaction from the previous section applies.

4. Rewrite manifests — consolidate the fragmented manifest files that streaming commits created. Each commit adds at least one manifest. After 30 days of 5-minute commits, the table carries roughly 8,600 manifests that query planning must traverse.

```sql
CALL catalog.system.rewrite_manifests(
  table => 'analytics.clickstream'
);
```

Manifest rewrite is the most underappreciated maintenance operation. After compaction merges 100,000 files into 700, the 2,016 manifests still reference the old file layout. Planning still reads all of them. rewrite_manifests consolidates them into a handful of properly sized manifests, cutting planning time from seconds to milliseconds.

The order matters. Expiring snapshots first frees data file references for orphan cleanup. Orphan cleanup removes files before compaction runs — so compaction does not waste compute rewriting files that are about to be deleted. Manifest rewrite runs last because it should reflect the final compacted file set. Running these operations independently — separate cron jobs on separate schedules — produces conflicts and wasted work. For a deep dive on maintenance sequencing, see Iceberg Table Health & Maintenance.
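One way to keep the four operations in a single, ordered unit is to generate them together from one timestamp. The sketch below is an assumption-laden illustration: `maintenance_plan` is a hypothetical helper, the 5-day and 7-day windows are example values taken from the ranges above, and the returned strings still have to be executed by whatever Spark client you use.

```python
from datetime import datetime, timedelta


def maintenance_plan(table: str, now: datetime,
                     snapshot_retention_days: int = 5,
                     orphan_safety_days: int = 7) -> list:
    """Return (step_name, sql) pairs in the order they should run."""
    fmt = "%Y-%m-%d %H:%M:%S"
    expire_before = (now - timedelta(days=snapshot_retention_days)).strftime(fmt)
    orphan_before = (now - timedelta(days=orphan_safety_days)).strftime(fmt)
    return [
        ("expire_snapshots",
         f"CALL catalog.system.expire_snapshots(table => '{table}', "
         f"older_than => TIMESTAMP '{expire_before}', retain_last => 100)"),
        ("remove_orphan_files",
         f"CALL catalog.system.remove_orphan_files(table => '{table}', "
         f"older_than => TIMESTAMP '{orphan_before}')"),
        ("rewrite_data_files",
         f"CALL catalog.system.rewrite_data_files(table => '{table}', "
         f"where => 'event_date < current_date()')"),
        ("rewrite_manifests",
         f"CALL catalog.system.rewrite_manifests(table => '{table}')"),
    ]


plan = maintenance_plan("analytics.clickstream", datetime(2026, 6, 9))
for name, statement in plan:
    print(name, "->", statement)
```

Deriving both cutoffs from the same `now` keeps the snapshot and orphan windows consistent with each other across runs.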

Compaction transforms scattered small files into optimized blocks
The compaction pipeline: hundreds of small files from streaming writes funneled through an optimization engine into properly sized, query-ready file blocks.

Why cron-based compaction fails for streaming

A nightly Spark compaction job is the default approach for most teams. Here is why it falls apart on streaming tables:

23 hours of degradation. Between nightly runs, every query pays the full small-file penalty. A streaming table accumulating 14,400 files per day means queries run against increasingly fragmented data all day. By the time compaction runs at midnight, the table has 14,400 uncompacted files. After compaction, the cycle restarts — by noon the next day, 7,200 new files have already accumulated.

Hot partition conflicts. A nightly job that targets the entire table will try to compact today's partition alongside historical ones. The streaming writer is still appending to today's partition. The compaction commit fails. Without partition-aware filtering, the entire job may abort — including the work on historical partitions that would have succeeded.

No priority ordering. A cron schedule treats all tables equally. Your critical customer-facing clickstream table with 200,000 small files waits in the same queue as a low-priority staging table with 50 files. Resources are spread uniformly instead of concentrated where they are needed.

Wasted compute on healthy partitions. Without min-input-files filtering, the job evaluates every partition, including ones that are already optimally sized. On a table with 365 daily partitions, only the most recent 7 typically need compaction. Any work spent on the other 358 is wasted.

No coordination with other maintenance. Snapshot expiration runs on a separate schedule. Orphan cleanup runs on another. Compaction may rewrite files that snapshot expiration is about to dereference — wasted I/O. Orphan cleanup may run before snapshot expiration has freed the references — missing files. The operations interfere with each other instead of feeding into each other.

The cost of the compaction engine

For streaming tables, the compaction engine runs frequently — multiple times per day on high-throughput tables. The per-run cost compounds quickly.

JVM overhead

Spark-based compaction carries structural overhead that is irrelevant to the compaction workload itself:

  • Cluster provisioning — starting a Spark job requires provisioning an EMR/Dataproc cluster or allocating capacity on a shared cluster. Cold starts take 2–5 minutes before any compaction work begins
  • JVM startup and GC — garbage collection pauses during compaction introduce latency variance. On large tables with high delete file counts, full GC events can stall compaction for seconds. Spark routinely OOMs on tables above 1 TB with complex delete file patterns — requiring cluster resizing or job splitting
  • Idle cluster costs — if you keep a Spark cluster running for compaction, you pay for idle time between runs. If you provision on demand, you pay for startup latency
  • Executor overhead — Spark's executor model requires over-provisioning because under-provisioning causes failures. The compute cost of keeping tables healthy often rivals the compute cost of querying them

Cost comparison

| Metric | Spark (EMR) | Rust/DataFusion (LakeOps) |
| --- | --- | --- |
| 200 GB binpack | 1,612 seconds / ~$1.54 | 221 seconds / ~$0.21 |
| 200 GB sort | Not benchmarked (estimated ~$3.50) | 780 seconds / ~$0.75 |
| Cost per TB | ~$50/TB | ~$5/TB |
| 1.2 TB table | OOM failure | 11 minutes |
| Peak throughput | ~400 MB/s | 2,522 MB/s |
| JVM startup | 2–5 minutes | 0 seconds |
| GC pauses | Yes (unpredictable) | None |

Data from production benchmarks across 10 tables totaling 5.5 TB. The gap is structural, not marginal — Rust eliminates GC pauses, Apache DataFusion provides vectorized columnar execution with Arrow, and lock-free parallelism means worker threads never stall.

Compaction benchmarks — LakeOps Rust engine vs Spark
Production compaction benchmarks — LakeOps Rust engine vs Spark across file counts and data volumes. Consistent 5–8x speed improvements translate directly to lower compute cost for streaming tables requiring frequent compaction.

When you compact streaming tables multiple times per day across dozens of tables, the cost difference between $5/TB and $50/TB is the difference between compaction as an affordable utility and compaction as a significant line item on your cloud bill. See Iceberg Cost Optimization for a broader cost analysis.
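To see how the per-TB rate compounds, here is a back-of-the-envelope sketch. The workload (20 tables, 200 GB rewritten per run, 4 runs per day) is a made-up example; only the $50/TB and $5/TB rates come from the comparison above.

```python
def monthly_compaction_cost(tables: int, gb_per_run: int, runs_per_day: int,
                            usd_per_tb: float, days: int = 30) -> float:
    """Monthly cost in USD, treating 1 TB as 1,000 GB."""
    total_gb = tables * gb_per_run * runs_per_day * days
    return total_gb * usd_per_tb / 1000


# Hypothetical workload: 20 tables x 200 GB per run x 4 runs per day.
print(monthly_compaction_cost(20, 200, 4, usd_per_tb=50))  # 24000.0
print(monthly_compaction_cost(20, 200, 4, usd_per_tb=5))   # 2400.0
```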

Compaction for CDC streaming tables

Tables receiving CDC events from Kafka via Debezium face an additional compaction challenge: delete file accumulation.

Every update and delete in the source database produces delete files in the Iceberg table — position deletes (V2) or deletion vectors (V3). Without compaction, these delete files accumulate and every query must reconcile them against data files at read time. The cost is linear: 1,000 delete files means 1,000 additional reconciliation passes per query.

CDC compaction strategy

  • Use merge-on-read (MoR) for the write path — lightweight streaming writes that create delete files alongside data files
  • Run compaction frequently (1–4 hours) to merge delete files into base data — restoring read performance to baseline
  • Set `delete-file-threshold` so that compaction rewrites any data file with at least that many associated delete files (e.g., 50), rather than relying only on size thresholds or a fixed schedule
  • On Iceberg V3, enable [deletion vectors](/blog/apache-iceberg-1-11-whats-new) — compressed Roaring bitmaps that replace position delete files and reduce both storage overhead and read-time reconciliation cost
  • Monitor delete-to-data file ratios — a rising ratio signals compaction is not keeping pace with updates. See Iceberg Delete Files Guide for monitoring techniques

Measuring delete file overhead

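```sql
-- The files metadata table lists both data files and delete files of the
-- current snapshot; all_data_files would return data files only (content = 0).
SELECT
  content,
  COUNT(*) AS file_count,
  SUM(file_size_in_bytes) / 1048576 AS total_size_mb
FROM analytics.customers.files
GROUP BY content;
-- content = 0: data files, content = 1: position deletes, content = 2: equality deletes
```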

When delete file count approaches or exceeds data file count, read performance is severely degraded. Compaction merges the deletes into clean data files, resetting the ratio to zero.
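The per-content counts returned by the query above reduce to a single ratio that is easy to alert on. A minimal sketch, assuming the counts have been loaded into a dict keyed by the `content` value (`delete_to_data_ratio` is a hypothetical helper):

```python
def delete_to_data_ratio(counts_by_content: dict) -> float:
    """Ratio of delete files (content 1 and 2) to data files (content 0)."""
    data_files = counts_by_content.get(0, 0)
    delete_files = counts_by_content.get(1, 0) + counts_by_content.get(2, 0)
    if data_files == 0:
        return float("inf") if delete_files else 0.0
    return delete_files / data_files


print(delete_to_data_ratio({0: 400, 1: 100, 2: 100}))  # 0.5
print(delete_to_data_ratio({0: 400}))                  # 0.0 right after compaction
```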

V3 deletion vectors change the game

Iceberg V3's deletion vectors store deleted row positions as Roaring bitmaps inside Puffin files, attached to each data file. This replaces the V2 pattern of creating separate position delete files — which fragment metadata and scale poorly on high-update tables.

For CDC tables, deletion vectors reduce compaction urgency. Where a V2 table with 10,000 updates per hour might need compaction every 1–2 hours to keep delete file counts manageable, a V3 table with deletion vectors can tolerate longer intervals because the bitmap overhead is minimal. Compaction still matters — it merges deleted rows into clean data files — but the read-time penalty between compaction runs is dramatically lower.
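Conceptually, a deletion vector is a set of row positions that readers skip. The sketch below uses a plain Python integer as a stand-in bitmap to show the idea; real deletion vectors are Roaring bitmaps stored in Puffin files, and nothing here reflects the on-disk format.

```python
def make_deletion_vector(deleted_positions) -> int:
    """Build a bitmap with one bit set per deleted row position."""
    bitmap = 0
    for position in deleted_positions:
        bitmap |= 1 << position
    return bitmap


def live_rows(rows: list, bitmap: int) -> list:
    """Return rows whose position is not marked as deleted."""
    return [row for i, row in enumerate(rows) if not (bitmap >> i) & 1]


rows = ["a", "b", "c", "d"]
dv = make_deletion_vector({1, 3})
print(live_rows(rows, dv))   # ['a', 'c']
```

The reader checks one bitmap per data file rather than matching against many separate delete files, which is why the read-time penalty between compaction runs is lower.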

Autonomous compaction with LakeOps

LakeOps replaces cron-based compaction with an autonomous, event-driven maintenance system built specifically for the problems streaming tables create.

LakeOps architecture — control plane for Iceberg catalogs, engines, and storage
LakeOps connects to your existing catalogs and storage as a control plane — Kafka-sourced streaming tables, batch tables, and multi-engine queries all managed through a single system.

Event-driven triggers

LakeOps continuously monitors structural signals per table and partition — file count, average file size relative to target, delete-file-to-data-file ratio, manifest depth, and snapshot velocity. Compaction fires only when thresholds are crossed:

  • A streaming table accumulating 1,000 files per hour may compact every 30 minutes
  • A low-volume batch table may compact once a week
  • A CDC table with rising delete ratios triggers compaction when the delete-file threshold is crossed
  • A partition that already has optimal file sizes is skipped entirely

No wasted runs on healthy tables. No missed runs on degraded ones. No fixed schedule that treats every table identically.
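A threshold-based trigger of this kind can be approximated in a few lines. The sketch below is not LakeOps's actual logic: the `PartitionStats` fields, the default thresholds, and the function name are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class PartitionStats:
    file_count: int
    avg_file_size_mb: float
    delete_files: int
    data_files: int


def needs_compaction(stats: PartitionStats, target_mb: int = 256,
                     min_small_files: int = 5, max_delete_ratio: float = 0.1) -> bool:
    """Fire when files are undersized or delete files are piling up."""
    undersized = (stats.avg_file_size_mb < 0.75 * target_mb
                  and stats.file_count >= min_small_files)
    delete_ratio = stats.delete_files / stats.data_files if stats.data_files else 0.0
    return undersized or delete_ratio > max_delete_ratio


hot_stream = PartitionStats(file_count=2_000, avg_file_size_mb=3, delete_files=0, data_files=2_000)
healthy = PartitionStats(file_count=3, avg_file_size_mb=200, delete_files=0, data_files=3)
cdc = PartitionStats(file_count=10, avg_file_size_mb=250, delete_files=4, data_files=10)

print(needs_compaction(hot_stream), needs_compaction(healthy), needs_compaction(cdc))
# True False True
```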

Query-aware sort optimization

The compaction engine does not just merge files — it reorganizes data based on how tables are actually queried. LakeOps collects query telemetry across every connected engine (Trino, Spark, Snowflake, Athena, DuckDB, Flink) and identifies which columns appear in WHERE, JOIN, and GROUP BY clauses per table.

Compaction then applies the sort order that maximizes data skipping for the observed query mix. The planner evaluates single-dimension sort, multi-column sort, and Z-order independently per table — not a global default applied to every table. Sort orders adapt as query patterns evolve, so a clickstream table whose dominant filter shifts from event_type to user_id gets re-sorted automatically.
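The idea of deriving a sort order from query telemetry can be sketched with a frequency count. This is an illustrative approximation, not the product's planner: the input format (one list of filter columns per query), the share threshold, and the function name are assumptions.

```python
from collections import Counter


def dominant_filter_columns(query_filters: list, max_columns: int = 2,
                            min_share: float = 0.25) -> list:
    """Pick the columns that appear in the largest share of query filters."""
    counts = Counter(col for cols in query_filters for col in set(cols))
    total = len(query_filters)
    ranked = [col for col, n in counts.most_common() if n / total >= min_share]
    return ranked[:max_columns]


observed = [
    ["event_type"],
    ["event_type", "user_id"],
    ["event_type"],
    ["page_url"],
    ["event_type", "user_id"],
]
print(dominant_filter_columns(observed))   # ['event_type', 'user_id']
```

Re-running the same count over a sliding window of recent queries is one simple way to notice when the dominant filter shifts.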

This turns compaction from a pure file-consolidation operation into a continuous query optimization.
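To make the idea concrete, here is a minimal sketch of choosing a sort order from query telemetry. The heuristic is an assumption for illustration, not the LakeOps planner: if one column accounts for most observed filters it becomes a single-column sort, otherwise the top few columns are Z-ordered. The function name `choose_sort_order` and the `dominance` cutoff are hypothetical. The returned strings follow the `sort_order` syntax accepted by Iceberg's `rewrite_data_files` Spark procedure.

```python
from collections import Counter
from typing import Iterable


def choose_sort_order(filter_columns: Iterable[str],
                      dominance: float = 0.6,
                      max_columns: int = 3) -> str:
    """Pick a sort_order string from the columns seen in query predicates.

    filter_columns holds one entry per observed use of a column in a
    WHERE, JOIN, or GROUP BY clause. An empty result means "no telemetry,
    fall back to binpack".
    """
    counts = Counter(filter_columns)
    if not counts:
        return ""
    total = sum(counts.values())
    ranked = counts.most_common(max_columns)
    top_column, top_hits = ranked[0]
    # One clearly dominant column: a plain sort gives the best data skipping.
    if len(ranked) == 1 or top_hits / total >= dominance:
        return f"{top_column} ASC NULLS LAST"
    # Several columns share the load: Z-order across the most frequent ones.
    return "zorder(" + ",".join(column for column, _ in ranked) + ")"
```

Re-running the function on a fresh telemetry window is what lets the sort order follow a shift such as event_type giving way to user_id.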

Coordinated maintenance pipeline

Compaction runs as step 3 in a sequenced pipeline: snapshot expiration → orphan cleanup → compaction → manifest rewrite. Each step's output feeds the next. The full pipeline is coordinated per table, with conflict-aware execution that targets cold partitions while active streaming partitions continue receiving writes.

Policies at scale

When you operate dozens of streaming tables, per-table cron configuration does not scale. LakeOps policies apply maintenance rules at the catalog or namespace level — every table in scope inherits the policy automatically, including new tables created after the policy is defined. A single Adaptive Maintenance policy bundles compaction, snapshot expiration, orphan cleanup, and manifest rewrite into a data-driven workflow that reacts to table activity without separate schedules.
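Scope-based inheritance can be pictured with a small lookup. The sketch below is an assumption about how such resolution could work, not the LakeOps policy model: policies are keyed by catalog or `catalog.namespace`, and the most specific matching scope wins. The function name and the policy names in the usage are hypothetical.

```python
from typing import Dict, Optional


def resolve_policy(table: str, policies: Dict[str, str]) -> Optional[str]:
    """Return the policy bound to the most specific scope containing the table.

    table is a dotted identifier such as "lake.streaming.raw_clickstream".
    """
    parts = table.split(".")
    # Walk from the narrowest scope (catalog.namespace) up to the catalog.
    for depth in range(len(parts) - 1, 0, -1):
        scope = ".".join(parts[:depth])
        if scope in policies:
            return policies[scope]
    return None


policies = {
    "lake": "weekly-batch",
    "lake.streaming": "adaptive-maintenance",
}
```

Because resolution happens by identifier at lookup time, a table created after the policy was defined picks it up without any extra configuration.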

LakeOps table health monitoring — Healthy, Warning, Critical classification
Lake-wide table health monitoring — every table classified by health status. Streaming tables with small-file accumulation or delete file buildup surface as Warning or Critical immediately.
LakeOps events — compaction, snapshot expiration, and manifest rewrites per table
Every maintenance operation logged with before/after metrics — raw_clickstream compacted from 970 files to 87, snapshots expired, orphans cleaned. The operations pipeline runs in the correct sequence without manual coordination.
LakeOps in action — from catalog connection to autonomous compaction and optimization of streaming Iceberg tables.

Configuration checklist for streaming compaction

Writer-side prevention

  • Set checkpoint/commit intervals to 5 minutes for analytical workloads. Sub-minute freshness is rarely needed, and a 1-minute interval creates 5x more files than a 5-minute one
  • Set `write.distribution-mode` to `hash` in Flink to align records with partitions before writing, which reduces cross-partition file scatter
  • Set `write.target-file-size-bytes` to 268435456 (256 MB) for general analytics, 536870912 (512 MB) for heavy scan workloads
  • Prefer `day(timestamp)` partitioning over `hour(timestamp)` unless volume exceeds 5 GB/hour. Coarser partitions concentrate files where compaction can consolidate them. See Partitioning Best Practices
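The writer-side settings above can be sketched as follows. The first function reproduces the file-count arithmetic under the assumption that every commit writes one file per active partition; the second builds a Spark SQL statement for the two Iceberg table properties. The function names and the example table identifier are hypothetical.

```python
MB = 1024 * 1024


def files_per_day(commit_interval_seconds: int, active_partitions: int) -> int:
    """Estimate new files per day, assuming one file per partition per commit."""
    commits_per_day = 86_400 // commit_interval_seconds
    return commits_per_day * active_partitions


def writer_properties_sql(table: str, target_file_size_mb: int = 256) -> str:
    """Build an ALTER TABLE statement for the writer-side Iceberg properties."""
    target_bytes = target_file_size_mb * MB
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        f"'write.target-file-size-bytes' = '{target_bytes}', "
        f"'write.distribution-mode' = 'hash')"
    )


# 1-minute vs 5-minute checkpoints against 100 active partitions.
one_minute = files_per_day(60, 100)
five_minute = files_per_day(300, 100)
```

The generated statement would be run once per table, for example via `spark.sql(writer_properties_sql("lake.streaming.raw_clickstream"))`.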

Compaction-side configuration

  • Always use partition-aware compaction: filter with `where` to exclude the active (hot) partition
  • Enable `partial-progress.enabled` — allows compaction to commit in chunks, so a single conflict does not invalidate hours of work
  • Set `partial-progress.max-commits` to 10–20 for large tables with many partitions
  • Set `max-concurrent-file-group-rewrites` to 10 if cluster capacity allows — increases parallelism
  • Set `min-input-files` to 5 — skip partitions with only a few large files that are already well-sized
  • Set `rewrite-job-order` to `files-desc` to compact the most fragmented partitions first
  • Target 256–512 MB output files — optimal for S3 GET costs and Parquet column-chunk read efficiency
  • Run binpack every 1–4 hours as a baseline; run sort compaction daily or weekly for query optimization
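  • For CDC tables, set `delete-file-threshold` so that data files are rewritten once enough delete files reference them. Iceberg evaluates this threshold per data file rather than per partition, so choose the value per file instead of carrying over a per-partition count such as 50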
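Put together, the compaction-side options map onto a single call to Iceberg's `rewrite_data_files` Spark procedure. The sketch below only builds the SQL text. The catalog, table, and partition column names are placeholders, and the `where` predicate assumes a date-typed partition source column that the engine can compare against a quoted date; adjust it to your partition spec.

```python
from datetime import date

MB = 1024 * 1024


def rewrite_data_files_sql(catalog: str, table: str, partition_column: str,
                           hot_partition: date,
                           target_file_size_mb: int = 256) -> str:
    """Build a binpack CALL that skips the hot partition and everything after it."""
    options = {
        "target-file-size-bytes": str(target_file_size_mb * MB),
        "partial-progress.enabled": "true",
        "partial-progress.max-commits": "10",
        "max-concurrent-file-group-rewrites": "10",
        "min-input-files": "5",
        "rewrite-job-order": "files-desc",
    }
    options_sql = ", ".join(f"'{key}', '{value}'" for key, value in options.items())
    # Only partitions strictly older than the one still receiving writes.
    where = f'{partition_column} < "{hot_partition.isoformat()}"'
    return (
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', strategy => 'binpack', "
        f"options => map({options_sql}), "
        f"where => '{where}')"
    )


sql = rewrite_data_files_sql("lake", "streaming.raw_clickstream",
                             "event_date", date(2024, 5, 2))
```

Passing today's date as `hot_partition` on a day-partitioned table keeps the compaction job away from files the streaming writer is still committing.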

Maintenance coordination

  • Run maintenance in order: expire snapshots → remove orphans → compact → rewrite manifests
  • Set snapshot retention to 3–7 days for streaming tables
  • Use a 7+ day safety window for orphan cleanup to avoid deleting in-progress writes
  • Never schedule these operations as independent cron jobs — coordinate them as a pipeline (Airflow DAG) or use LakeOps Adaptive Maintenance
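A minimal sketch of the sequenced pipeline, assuming Spark with the Iceberg procedures available: the plan is an ordered list of CALL statements, and the runner stops at the first failure so later steps never run against a half-maintained table. The function names, the 5-day retention default, and the table identifier are illustrative choices within the ranges given above.

```python
from datetime import datetime, timedelta
from typing import Callable, List


def maintenance_plan(catalog: str, table: str, now: datetime,
                     snapshot_retention_days: int = 5,
                     orphan_safety_days: int = 7) -> List[str]:
    """Return the four maintenance calls in the order they should run."""
    def cutoff(days: int) -> str:
        return (now - timedelta(days=days)).strftime("%Y-%m-%d %H:%M:%S")

    prefix = f"CALL {catalog}.system"
    return [
        f"{prefix}.expire_snapshots(table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff(snapshot_retention_days)}')",
        f"{prefix}.remove_orphan_files(table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff(orphan_safety_days)}')",
        f"{prefix}.rewrite_data_files(table => '{table}')",
        f"{prefix}.rewrite_manifests(table => '{table}')",
    ]


def run_plan(plan: List[str], execute: Callable[[str], object]) -> None:
    """Run each statement in order; an exception aborts the remaining steps."""
    for statement in plan:
        execute(statement)


plan = maintenance_plan("lake", "streaming.raw_clickstream",
                        datetime(2024, 5, 10, 12, 0, 0))
```

In a Spark session the runner would be invoked as `run_plan(plan, spark.sql)`; in an Airflow DAG each statement would become one task with a dependency on the previous one.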

Monitoring

  • Track file count per partition over time — the leading indicator of small-file accumulation. Alert when any partition exceeds 500 files
  • Monitor average file size — should stay above 128 MB post-compaction. Alert when average drops below 64 MB
  • Query `table.files` and `table.manifests` metadata tables periodically to compute a table health score
  • Watch S3 GET request costs in CloudWatch/Billing — a sudden increase signals small-file proliferation before query latency degrades visibly
  • Track delete file count on CDC tables — rising delete-to-data ratios signal compaction is falling behind
  • Monitor compaction duration — if it exceeds the compaction interval, you are falling behind. Increase frequency, switch to a faster engine, or scale out
  • Track query planning time — if planning exceeds 5 seconds on a table that previously planned in milliseconds, manifest bloat or small files are the cause
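The first two alerts can be computed directly from the `files` metadata table. The sketch below pairs a query template with a small checker. The query uses columns that the Iceberg `files` metadata table exposes (`partition`, `content`, `file_size_in_bytes`); the function name and the row shape `(partition, file_count, avg_file_size)` are assumptions for the example.

```python
from typing import Iterable, List, Tuple

MB = 1024 * 1024

# Per-partition file statistics for data files only (content = 0).
PARTITION_STATS_SQL = """
SELECT partition,
       COUNT(*) AS file_count,
       AVG(file_size_in_bytes) AS avg_file_size
FROM {table}.files
WHERE content = 0
GROUP BY partition
"""


def partition_alerts(rows: Iterable[Tuple[str, int, float]],
                     max_files: int = 500,
                     min_avg_mb: int = 64) -> List[Tuple[str, str]]:
    """Return (partition, message) pairs for partitions that breach a threshold."""
    alerts = []
    for partition, file_count, avg_file_size in rows:
        if file_count > max_files:
            alerts.append((partition, f"file count {file_count} exceeds {max_files}"))
        # A lone small file is not a small-file problem, so require more than one.
        if file_count > 1 and avg_file_size < min_avg_mb * MB:
            alerts.append((partition, f"average file size below {min_avg_mb} MB"))
    return alerts


rows = [("2024-05-01", 970, 12 * MB), ("2024-04-30", 87, 240 * MB)]
alerts = partition_alerts(rows)
```

With Spark, the rows would come from `spark.sql(PARTITION_STATS_SQL.format(table="lake.streaming.raw_clickstream")).collect()`, with the partition struct converted to a string before it is passed in.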

Summary

Compaction on streaming Iceberg tables is not a nightly job. It is a continuous system that must run alongside active writers without conflicts, target the right partitions at the right frequency, and coordinate with snapshot expiration, orphan cleanup, and manifest maintenance.

The minimum viable approach is partition-aware binpack compaction on a 1–4 hour schedule, with partial progress enabled, the active partition excluded, and metadata-driven thresholds determining when each table needs attention. The production-grade approach adds query-aware sort optimization, event-driven triggers, maintenance sequencing, and an execution engine that makes frequent compaction economically viable.

LakeOps provides the full stack — from catalog connection to autonomous, sequenced maintenance that keeps streaming tables compacted, lean, and queryable without scripts, cron jobs, or manual intervention.

How streaming ingestion inflates storage and compute costs — and how autonomous compaction recovers them.
