Automating Apache Iceberg Table Maintenance

Apache Iceberg gives you ACID transactions, schema evolution, partition evolution, and time travel — all in an open table format readable by every major query engine. What it does not give you is a way to keep tables healthy over time.

Every write to an Iceberg table creates a new snapshot, adds manifest entries, and produces data files. Over weeks and months, this accumulates into a structural debt that degrades query performance, inflates storage costs, and slows down every engine connected to the table. The fix requires four distinct maintenance operations — compaction, snapshot expiration, orphan file cleanup, and manifest rewriting — executed in the right order, at the right frequency, across every table in the lake.

Iceberg ships the SQL procedures and Java APIs for each operation. But it does not ship scheduling, health detection, sequencing logic, conflict handling, multi-table prioritization, observability, or alerting. Those are left to the operator — and the gap between "we have the procedures" and "the lake is healthy" is where most teams struggle.

This guide covers what each maintenance operation does, why the order matters, how teams typically automate with scripts and orchestrators, where that approach breaks down, and how a dedicated control plane like LakeOps replaces manual maintenance with an autonomous, policy-driven system that keeps every table optimized without human intervention.

The five maintenance operations

Iceberg's maintenance surface consists of five operations. Each targets a different layer of the table — data files, metadata, or storage — and each has dependencies on the others.

1. Snapshot expiration

Every commit to an Iceberg table — every write, every compaction, every schema change — creates a new snapshot. Snapshots enable time travel and rollback, but they also pin references to data files. A table with 5-minute streaming commits accumulates 8,640 snapshots per month. Each one prevents the storage layer from reclaiming superseded data files.

sql

1CALL catalog.system.expire_snapshots(2  table => 'analytics.clickstream',3  older_than => TIMESTAMP '2026-06-04 00:00:00',4  retain_last => 1005);

Key parameters:

older_than — snapshots older than this timestamp are expired. Typical streaming retention is 3–7 days; batch tables may retain 14–30 days for compliance
retain_last — minimum number of snapshots to keep regardless of age. Prevents accidentally expiring all snapshots on a low-write table
max_concurrent_deletes — parallelism for file deletion. Increase on tables with many expired files to reduce job duration

Without regular expiration, the metadata layer grows unboundedly and query planning slows as engines traverse an ever-longer snapshot chain. See Iceberg Table Health & Maintenance for a detailed walkthrough of snapshot lifecycle management.

2. Orphan file cleanup

Orphan files are data files that exist in object storage but are not referenced by any Iceberg snapshot. They are created by failed writes, aborted commits, crashed compaction jobs, and concurrent writer conflicts. Iceberg does not track them — they are invisible to query engines but fully billable by S3, GCS, or ADLS.

On mature data lakes, orphan files routinely account for 25–40% of billable storage on affected table prefixes.

sql

1CALL catalog.system.remove_orphan_files(2  table => 'analytics.clickstream',3  older_than => TIMESTAMP '2026-05-31 00:00:00'4);

Critical safety rule: Always use a retention threshold of 7+ days. Files from in-progress writes are temporarily orphaned until the writer commits. Deleting them prematurely corrupts the table. The default threshold in Iceberg is 3 days; production environments with long-running Spark or Flink jobs should use 7 days or more.

3. Data file compaction

Streaming ingestion, micro-batch ETL, and frequent appends create thousands of undersized data files. A Flink job checkpointing every 60 seconds against 100 partitions produces 144,000 new files per day — each one tracked in manifests, each one triggering an S3 GET on every query.

Compaction merges these small files into 256–512 MB targets, reducing file count by 10–100x and improving query performance proportionally. Iceberg supports three strategies — binpack, sort, and Z-order — each making a different tradeoff between compaction speed and query improvement.

sql

1CALL catalog.system.rewrite_data_files(2  table => 'analytics.clickstream',3  strategy => 'sort',4  sort_order => 'event_type ASC NULLS LAST',5  where => 'event_date < current_date()',6  options => map(7    'target-file-size-bytes', '268435456',8    'min-input-files', '5',9    'partial-progress.enabled', 'true',10    'partial-progress.max-commits', '10',11    'max-concurrent-file-group-rewrites', '10'12  )13);

The where clause is critical for streaming tables — it restricts compaction to cold partitions, preventing conflicts with active writers. partial-progress.enabled allows the job to commit in chunks so a single conflict does not invalidate the entire run. For a deep dive on compaction strategies and tuning, see Kafka to Iceberg Compaction — Done Right.

4. Manifest rewriting

Every commit creates at least one new manifest file. Manifests track which data files belong to which partitions, along with per-file column statistics. After weeks of streaming writes, a table can accumulate thousands of manifest files — each one read during query planning.

sql

1CALL catalog.system.rewrite_manifests(2  table => 'analytics.clickstream'3);

Manifest rewriting consolidates fragmented manifests into a handful of properly sized ones, organized by partition spec. On a table with 2,000+ manifests, this can cut query planning time from seconds to milliseconds. Always run manifest rewriting after compaction — compaction changes the file set, and manifests should reflect the final layout.

5. Metadata file cleanup

Each snapshot also produces a metadata file. Iceberg can clean these automatically if you enable the table property:

sql

1ALTER TABLE analytics.clickstream SET TBLPROPERTIES (2  'write.metadata.delete-after-commit.enabled' = 'true',3  'write.metadata.previous-versions-max' = '100'4);

This is a write-time optimization that does not require a separate maintenance job. Enable it on every production table.

Why the order matters

The four active maintenance operations have dependencies. Running them in the wrong order wastes compute, misses cleanup opportunities, or produces stale metadata.

The correct sequence is: expire snapshots → remove orphan files → compact data files → rewrite manifests.

Compaction transforms scattered small files into optimized blocks — The maintenance pipeline: scattered small files consolidated into optimized blocks through the correct sequence of operations.

Why expire before compact. Snapshot expiration dereferences data files that are no longer needed. If you compact first, the compaction engine reads and rewrites files that expiration would have removed — wasting compute on data no query will ever read again. On a table with months of accumulated snapshots, this waste can amount to hundreds of gigabytes of unnecessary I/O.

Why clean orphans before compact. After expiration releases file references, orphan cleanup removes the physical files. This reduces the storage footprint before compaction runs, so compaction operates on a clean dataset. It also prevents stale orphan files from inflating per-partition file counts and skewing compaction decisions.

Why compact before rewriting manifests. Compaction removes input files and creates new output files. Manifests track files. If you rewrite manifests first, they will be aligned to the pre-compaction layout. The moment compaction runs, those carefully rewritten manifests become stale and fragmented again. Compact first, then rewrite manifests against the final file set.

Running operations independently — separate cron jobs on separate schedules — produces exactly these problems. Snapshot expiration runs on Monday, compaction runs on Tuesday, orphan cleanup runs on Wednesday. Each operation has already missed the window where its output would feed the next step cleanly.

Measuring table health

Before automating maintenance, you need to know which tables need attention and how urgently. Iceberg's metadata tables provide the raw signals.

File health per partition

sql

1SELECT2  partition,3  COUNT(*) AS file_count,4  AVG(file_size_in_bytes) / 1048576 AS avg_size_mb,5  SUM(CASE WHEN file_size_in_bytes < 67108864 THEN 1 ELSE 0 END) AS small_files6FROM analytics.clickstream.files7GROUP BY partition8ORDER BY file_count DESC;

Healthy: fewer than 100 files per partition, averaging 256–512 MB. Warning: 500+ files or average size below 128 MB. Critical: 1,000+ files or average below 64 MB.

Snapshot accumulation

sql

1SELECT2  COUNT(*) AS snapshot_count,3  MIN(committed_at) AS oldest_snapshot,4  MAX(committed_at) AS latest_snapshot5FROM analytics.clickstream.snapshots;

If snapshot count exceeds 1,000, expiration is not running or not aggressive enough. Each snapshot pins file references and adds planning overhead.

Manifest fragmentation

sql

1SELECT2  COUNT(*) AS manifest_count,3  AVG(added_data_files_count + existing_data_files_count) AS avg_files_per_manifest4FROM analytics.clickstream.manifests;

Target: fewer than 100 manifests per snapshot. Streaming tables with 5-minute commits can accumulate 8,600+ manifests in 30 days.

Running these queries manually does not scale beyond a handful of tables. At 50+ tables, you need automated health classification — which is one of the core capabilities a control plane like LakeOps provides. LakeOps continuously monitors these signals across every table and classifies each as Healthy, Warning, or Critical, surfacing the tables that need attention without manual SQL inspection.

Automating with Airflow and scripts

The most common approach to Iceberg maintenance automation is an Airflow DAG that runs Spark SQL procedures on a schedule:

python

1from airflow import DAG2from airflow.providers.apache.spark.operators.spark_sql import SparkSqlOperator3from datetime import datetime, timedelta4 5with DAG(6    'iceberg_maintenance',7    schedule_interval='0 2 * * *',8    start_date=datetime(2026, 1, 1),9    catchup=False,10) as dag:11 12    expire = SparkSqlOperator(13        task_id='expire_snapshots',14        sql="""15            CALL catalog.system.expire_snapshots(16                table => 'analytics.clickstream',17                older_than => current_timestamp() - INTERVAL 5 DAYS,18                retain_last => 10019            )20        """,21    )22 23    orphans = SparkSqlOperator(24        task_id='remove_orphans',25        sql="""26            CALL catalog.system.remove_orphan_files(27                table => 'analytics.clickstream',28                older_than => current_timestamp() - INTERVAL 7 DAYS29            )30        """,31    )32 33    compact = SparkSqlOperator(34        task_id='compact_files',35        sql="""36            CALL catalog.system.rewrite_data_files(37                table => 'analytics.clickstream',38                strategy => 'binpack',39                where => 'event_date < current_date()',40                options => map(41                    'target-file-size-bytes', '268435456',42                    'min-input-files', '5',43                    'partial-progress.enabled', 'true'44                )45            )46        """,47    )48 49    manifests = SparkSqlOperator(50        task_id='rewrite_manifests',51        sql="""52            CALL catalog.system.rewrite_manifests(53                table => 'analytics.clickstream'54            )55        """,56    )57 58    expire >> orphans >> compact >> manifests

This works for a single table. It enforces the correct operation order, runs daily, and isolates failures per step. But it has structural limitations that compound as the lake grows.

Where scripts and cron jobs break down

The gap between Iceberg's maintenance procedures and production-grade automation is not a single missing feature — it is an entire operational layer. Here is what scripts and Airflow DAGs cannot provide:

Capability	Iceberg procedures	Scripts / Airflow	What you actually need
Maintenance operations	Four SQL procedures	Same, wrapped in tasks	Same
Scheduling	None	Cron / Airflow scheduler	Event-driven triggers based on table state
Health detection	None	Manual SQL queries	Continuous per-table monitoring with thresholds
Multi-table prioritization	None	First-in-first-out queue	Cost-aware ordering (worst tables first)
Operation sequencing	None	DAG dependency chains	Coordinated pipeline with awareness of each step's output
Conflict handling	`ValidationException` on retry	Catch-and-retry logic you build	Partition-aware execution that avoids hot partitions
Resource isolation	None	Cluster sizing	Bounded memory per operation, no OOM
Observability	Procedure return values	Log parsing	Per-operation audit trail with before/after metrics
Alerting	None	Custom integrations	Severity-ranked insights (Critical / Warning / Healthy)
Policy governance	None	DAG configuration files	Versioned, auditable policies at catalog/namespace/table scope

Fixed schedules are wasteful. A nightly compaction job runs on every table regardless of whether it needs compaction. A streaming table that accumulated 50,000 small files in 6 hours waits until midnight. A batch table that received no new data since last week gets compacted anyway.

Per-table DAGs do not scale. At 10 tables, you maintain 10 DAGs. At 100 tables across 3 catalogs, you maintain 100 DAGs with different schedules, different retention windows, different compaction strategies. Every new table requires a new DAG — and the DAGs themselves become a maintenance burden.

No awareness of table state. Airflow runs what you schedule, when you schedule it. It does not know that raw_clickstream has 200,000 small files and needs urgent compaction while dim_products has 12 files and needs nothing. Both get the same treatment at the same time.

JVM overhead compounds. Each Spark maintenance job carries JVM startup (2–5 minutes), garbage collection pauses, and executor provisioning costs. Across 100 tables, the compaction compute bill can rival the query compute bill. On tables above 1 TB with complex delete patterns, Spark routinely OOMs — requiring cluster resizing or job splitting.

No cross-operation coordination. Even with DAG dependencies enforcing the correct order per table, there is no coordination across tables. Compaction on Table A may conflict with a streaming writer on the same cluster. Orphan cleanup on Table B may start before snapshot expiration on Table B has finished because they are in separate DAGs.

LakeOps: autonomous maintenance for every table

LakeOps is an autonomous control plane for Apache Iceberg that replaces scripts, cron jobs, and Airflow DAGs with a closed-loop maintenance system. It connects to your existing catalogs (Glue, REST, Polaris, Nessie, S3 Tables) and query engines (Trino, Spark, Flink, Snowflake, Athena, DuckDB) without moving data or changing pipelines. Your data stays in your storage account; LakeOps reads metadata and telemetry, executes maintenance, and commits through standard Iceberg APIs.

LakeOps architecture — control plane for Iceberg catalogs, engines, and storage — LakeOps connects to your existing catalogs and storage as a control plane — streaming tables, batch tables, and multi-engine queries all managed through a single system.

Here is how LakeOps automates each maintenance operation, component by component.

Compaction: Rust engine with query-aware sort

LakeOps runs compaction on a purpose-built engine written in Rust on Apache DataFusion. The engine processes Parquet data through Arrow columnar buffers with bounded memory, lock-free parallelism, and no JVM — eliminating the startup, GC, and OOM problems that plague Spark-based compaction.

Production benchmarks show binpack completing in 221 seconds versus Spark's 1,612 seconds on identical 200 GB datasets — at roughly $5/TB versus $50/TB. A 1.2 TB table that caused Spark to OOM completed in 11 minutes on the Rust engine. For a full benchmark analysis, see 9 Iceberg Compaction Tools Compared.

Compaction benchmarks — LakeOps Rust engine vs Spark — Production compaction benchmarks across 10 tables totaling 5.5 TB. The Rust engine delivers 5–8x faster compaction at a fraction of the compute cost.

Beyond speed, LakeOps compaction is query-aware. The engine collects query telemetry across every connected engine — Trino, Spark, Snowflake, Athena, DuckDB, Flink — and identifies which columns appear in WHERE, JOIN, and GROUP BY clauses per table. Compaction then sorts data by those columns, so Parquet min/max statistics enable engines to skip entire files. The result: every query runs faster across every engine, without changing a single query or pipeline.

Compaction is configurable per table through the Optimization tab — strategy (binpack or sort), predicate scoping, cron schedule, and an instant Execute button for on-demand runs.

Snapshot expiration: automated retention with safety guardrails

LakeOps runs snapshot expiration as the first step in the maintenance pipeline. Retention is configurable per policy — use the table's own history.expire.max-snapshot-age-ms setting, or override with a custom retention period (e.g., 5 days for streaming, 30 days for compliance). A minimum snapshot retention count prevents accidentally expiring all snapshots on low-write tables.

Configuration options include whether to delete associated metadata files and data files — giving you control over how aggressively storage is reclaimed. The operation runs automatically on the configured cron schedule or as part of Adaptive Maintenance, with every execution logged in the Events tab.

Orphan file cleanup: safe reclamation with age thresholds

After snapshot expiration releases file references, LakeOps runs orphan cleanup to remove the physical files from storage. The configurable age threshold (default: 7 days) ensures files from in-progress writes are never deleted. On mature streaming lakes where orphans account for 25–40% of storage costs, this operation alone can produce significant savings. See Reduce S3 Costs on Iceberg Tables for storage cost analysis.

Manifest rewriting: planning-time optimization

After compaction produces the final file layout, LakeOps rewrites manifests to match. Fragmented manifests from weeks of streaming writes are consolidated into properly sized manifests organized by partition spec. This cuts query planning time — the overhead engines pay before reading any data — from seconds to milliseconds on tables with heavy write activity.

The coordinated pipeline

These four operations do not run independently. LakeOps sequences them per table in the correct order: expire snapshots → remove orphans → compact → rewrite manifests. Each step's output feeds the next. The pipeline is conflict-aware — compaction targets cold partitions while active streaming partitions continue receiving writes. If a conflict occurs, LakeOps retries the affected partition on the next cycle. No data is lost or corrupted.

LakeOps events — compaction, snapshot expiration, and manifest rewrites per table — Every maintenance operation logged with before/after metrics — raw_clickstream compacted from 970 files to 87, snapshots expired, orphans cleaned. The operations pipeline runs in the correct sequence without manual coordination.

Policies: maintenance rules at scale

Configuring maintenance per table works at 10 tables. At 100 or 500, you need policies. LakeOps policies apply maintenance rules at the catalog, namespace, or table level — every table in scope inherits the policy automatically, including new tables created after the policy is defined.

Policy types

Adaptive Maintenance — bundles compaction, snapshot expiration, manifest rewrite, and delete file handling into a single data-driven policy. LakeOps monitors table activity and runs the right operations at the right time. When active on a table, it manages all individual optimization sections automatically
Expire Snapshots — configurable retention period, minimum snapshot count, metadata/data file deletion toggles
Remove Orphan Files — configurable age threshold (default: 7 days)
Rewrite Manifests — consolidate manifests on a schedule
Configuration & Governance — enforce table-level settings like Iceberg format version, default file format (Parquet/ORC/Avro), write distribution mode, and commit retry settings across the organization

Scope and precedence

Policies cascade: catalog-wide → namespace → table. More specific policies override broader ones. A per-table policy always takes precedence over a namespace or catalog-wide policy for the same operation type. This lets you set sensible defaults at the catalog level and override only where needed.

New tables automatically inherit policies from their parent namespace or catalog — no manual setup required. Tables can be excluded from inherited policies with a single toggle.

Example policy set

Policy	Type	Scope	Schedule
prod_adaptive_maintenance	Adaptive Maintenance	analytics catalog	Data-driven
prod_expire_snapshots	Expire Snapshots	analytics catalog	Hourly
prod_rewrite_manifests	Rewrite Manifests	analytics catalog	Daily at 4:00 AM
org_orphan_cleanup	Remove Orphan Files	All catalogs	Daily at 3:00 AM
streaming_compaction_override	Per-table compaction	raw_clickstream	Every 2 hours

Every policy change is versioned and auditable — you can see when a policy was created, modified, enabled, or disabled, and by whom.

LakeOps Policies — versioned governance across catalogs and namespaces — The Policies dashboard: compaction, orphan removal, snapshot expiration, and configuration policies — each with status toggle, type, schedule, and audit timestamps. One place replaces scattered Airflow DAGs.

Observability: seeing the health of every table

Autonomous maintenance without observability is a black box. LakeOps provides three levels of visibility:

Lake-wide dashboard

The dashboard shows aggregate metrics across all connected catalogs: total operations run, average query acceleration (pre vs. post optimization), estimated cost savings, CPU and storage reduction, and total data optimized. This is the executive view — a single screen that answers "is the lake getting healthier or worse?"

Per-table health classification

Every table is continuously classified as Healthy, Warning, or Critical based on file count, file size distribution, manifest fragmentation, snapshot depth, and orphan file accumulation. Streaming tables with small-file buildup or delete file accumulation surface as Warning or Critical immediately — not after a user reports slow queries.

LakeOps table health monitoring — Healthy, Warning, Critical classification — Lake-wide table health monitoring — every table classified by health status. Tables with structural issues surface immediately.

Insights: proactive alerts

The Insights tab surfaces problems at four severity levels before users notice them:

CRITICAL — severe fragmentation (e.g., raw_clickstream with 312 partitions and extreme file count)
HIGH — excessive manifests or snapshot accumulation (e.g., search_query_logs with 2,000+ manifests)
WARNING — partition skew, small file buildup, or rising delete-to-data ratios
LOW — early signs of accumulation that will become problems if unaddressed

Each insight links directly to the affected table for one-click investigation. After policy-driven maintenance resolves the issue, the insight clears automatically.

LakeOps Insights — proactive alerts driven by table and query signals — Severity-ranked Insights: CRITICAL for partition file issues, HIGH for excessive manifests and snapshots, WARNING for partition skew and small files — surfaced before query latency degrades.

Per-table events and metrics

The Events tab provides a complete audit trail — every compaction, snapshot expiration, orphan removal, and manifest rewrite with duration, before/after file counts, data volume, and status. The Metrics tab shows convergence toward targets: avgFileSize trending toward 256 MB, totalFiles decreasing, snapshot count stabilizing.

Simulations: preview before you apply

Before committing to a compaction strategy, LakeOps lets you simulate the impact on a real Iceberg branch. Layout simulations show how many files will be rewritten, expected output file count and average size, field access frequency analysis for query-aware optimization, and side-by-side comparison of multiple strategies. This eliminates the guesswork of "should I binpack or sort?" — run both simulations and compare the projected query improvement.

Getting started: from manual to autonomous

The path from manual scripts to autonomous maintenance is incremental. LakeOps supports every step:

1. Connect your catalogs. Point LakeOps at your existing Glue, REST, Polaris, Nessie, or S3 Tables catalog. Discovery and health classification begin immediately — no agents to install, no data to move.

LakeOps — connect catalogs and start — Connecting a catalog to LakeOps — point at your catalog, discovery and health classification begin immediately.

2. Inspect table health. Review which tables are Healthy, Warning, or Critical. Check the Insights tab for severity-ranked issues. Identify the tables that need attention first.

3. Run maintenance manually. Use the per-table Optimization tab to run compaction, snapshot expiration, orphan cleanup, or manifest rewriting on individual tables. Review results in the Events and Metrics tabs. Build confidence in how each operation affects your tables.

LakeOps Optimization — per-table compaction and maintenance configuration — Per-table optimization controls — compaction strategy, target file size, snapshot retention, and scheduling. Start manual, then toggle Enabled for continuous automation.

4. Enable per-table schedules. Toggle on scheduled compaction, snapshot expiration, and manifest rewriting for individual tables. Set cron expressions and verify results over a few cycles.

5. Create catalog-wide policies. Once validated on individual tables, create policies at the namespace or catalog level. Every existing and future table inherits the maintenance rules automatically.

6. Enable Adaptive Maintenance. For fully autonomous operation, activate Adaptive Maintenance. LakeOps monitors table activity signals and runs the right operations at the right time — no fixed schedules, no manual tuning. The system adapts as workloads change.

See the full product walkthrough — from connecting a catalog to autonomous optimization across the lake:

LakeOps demo — autonomous Iceberg lake management: observability, query-aware compaction, snapshots, manifests, orphan cleanup, policies, and agentic readiness.

Configuration checklist

Table properties (set once per table)

Enable metadata cleanup: write.metadata.delete-after-commit.enabled = true
Set target file size: write.target-file-size-bytes = 268435456 (256 MB) for analytics, 536870912 (512 MB) for heavy scans
For streaming tables, use hash write distribution mode to reduce cross-partition file scatter
Prefer day(timestamp) hidden partitioning over hour(timestamp) unless volume exceeds 5 GB/hour. See Partitioning Best Practices

Maintenance operations (per table or via policy)

Expire snapshots: 3–7 day retention for streaming, 14–30 days for batch/compliance. Retain at least 100 snapshots
Orphan cleanup: 7+ day age threshold. Run daily after snapshot expiration
Compaction: Binpack every 1–4 hours for streaming tables, daily for batch. Sort compaction daily or weekly for query optimization. Always exclude the active partition
Manifest rewrite: Daily, after compaction completes

Monitoring thresholds (alert when breached)

File count per partition exceeds 500
Average file size drops below 64 MB
Snapshot count exceeds 1,000
Manifest count exceeds 100 per snapshot
Query planning time exceeds 5 seconds
S3 GET request costs spike more than 20% week-over-week

Summary

Apache Iceberg provides the maintenance operations. It does not provide the automation, scheduling, health detection, conflict handling, sequencing, observability, or governance that production lakehouses require. Most teams start with Airflow DAGs and Spark SQL procedures — and most teams outgrow that approach within months as the table count crosses 50.

The production-grade approach is a dedicated control plane that monitors every table continuously, sequences operations correctly, adapts to each table's write pattern, executes compaction on an engine built for the workload, and provides lake-wide observability with per-table audit trails.

LakeOps provides the full stack — from catalog connection to autonomous, policy-driven maintenance that keeps every Iceberg table compacted, lean, and queryable without scripts, cron jobs, or manual intervention.

How table maintenance debt inflates storage and compute costs — and how autonomous maintenance recovers them.