Apache Iceberg Table Health and Maintenance: A Complete Guide

Apache Iceberg Table Health and Maintenance — health score dashboard showing 92 Healthy with status indicators for Snapshots, Manifests, Delete Files, Orphan Files, and File Health beside a geometric iceberg

An Iceberg table that ran perfectly during development will, left unattended in production, slow down. Storage costs will climb. Query latency will drift upward. Metadata operations that once took milliseconds will start taking seconds. None of this happens because of a code bug or a schema change. It happens because Iceberg tables are living data structures — and living data structures need maintenance.

This guide covers what table health means in practice, the five maintenance operations every production lakehouse needs, why the order you run them in matters more than most teams realize, the metrics that reveal problems before users notice them, and how to automate the full lifecycle so tables stay healthy without manual intervention.

Optimizing Iceberg Lakehouse Performance — problems, solutions, and outcomes — The four common Iceberg table health problems (small files, fragmented manifests, unsorted data, delete file accumulation) and the outcomes of proper maintenance (faster queries, lower costs, higher throughput, healthier data).

What table health means for Iceberg

Every Iceberg table has three layers that determine query performance and storage efficiency:

Metadata layer. The metadata tree — metadata files, manifest lists, and manifests — is how query engines discover which data files to read. Every commit adds a new snapshot pointing to a new manifest list. Over time, the manifest tree deepens and widens: more manifests reference more data files across more partitions. Scan planning — the phase where an engine reads metadata to figure out which files contain relevant data — slows in direct proportion to metadata volume. A table with 15 manifests plans a query in milliseconds; the same table with 1,500 manifests may take seconds before a single data file is opened.

File layout. Iceberg stores data in immutable Parquet files on object storage. Writes create new files; they never modify existing ones. A streaming pipeline checkpointing every 60 seconds against 100 active partitions generates 144,000 new files per day. Each file carries overhead — HTTP round trips for each S3 GET, Parquet footer parsing, memory allocation for reader initialization. When the average file size drops below a few megabytes, that per-file overhead dominates actual data processing time.

Physical organization. Within data files, row ordering determines how effectively engines can skip irrelevant data. Parquet stores column min/max statistics per row group. If a column used in a WHERE clause is randomly distributed across files, every file's min/max range overlaps with the filter predicate, and the engine reads all of them. If the same column is sorted, most files can be skipped entirely. The difference between a well-sorted and a randomly-ordered table can be 10x or more in scan volume for filtered queries.

A healthy table has compact metadata (few manifests relative to data volume), right-sized data files (128–512 MB depending on workload), minimal delete file backlog, and a physical sort order aligned with production query patterns. An unhealthy table has some combination of: too many small files, thousands of manifests, accumulated snapshots pointing to files no query will ever need, orphan files consuming storage silently, and pending deletes forcing every reader to reconcile row-level changes on the fly.

Healthy vs Unhealthy Iceberg tables — the contrast autonomous maintenance creates — The visual contrast: an unhealthy Iceberg table (fragmented files, scattered manifests, slow queries) versus a healthy one after autonomous maintenance (compacted files, aligned metadata, fast reads). This is what the five maintenance operations achieve.

The five maintenance operations

1. Snapshot expiration

Every write, append, overwrite, or delete creates a new snapshot. Snapshots enable time travel — querying the table as it existed at a prior point — and provide rollback safety. But snapshots also anchor data files: as long as a snapshot exists, every data file it references remains on storage, even if newer snapshots have replaced that data.

In a busy production table receiving hundreds of writes per day, snapshot count grows quickly. A table with 30 days of snapshots at 200 commits per day holds 6,000 snapshots. Each snapshot references a manifest list, which references manifests, which reference data files. The metadata tree becomes enormous, and scan planning degrades.

Snapshot expiration removes snapshots older than a retention threshold and dereferences the data files they exclusively held. Files still referenced by at least one retained snapshot remain untouched.

sql

1-- Expire snapshots older than 7 days2CALL catalog.system.expire_snapshots(3  table => 'db.events',4  older_than => TIMESTAMP '2026-05-19 00:00:00',5  retain_last => 106);

Retention policies. The right retention window depends on your operational needs. Teams that rely on time travel for debugging or rollback typically retain 3–7 days. Audit-heavy environments may keep 30 days. The key tradeoff is between safety (the ability to roll back) and cost (the storage and metadata overhead of keeping old snapshots). In most production systems, 5–7 days with a minimum of 5–10 retained snapshots covers the realistic rollback window while keeping metadata manageable. The retain_last parameter is critical — it protects against expiring a snapshot that a long-running query is actively reading, which would cause the query to fail.

What happens to data files. Expiring a snapshot deletes data files that are no longer referenced by any retained snapshot. However, this deletion is not always complete — concurrent operations, catalog limitations, or partial failures can leave files behind. Additionally, expiration cannot clean up files from writes that never committed successfully. Both cases produce orphan files — which brings us to the next operation.

2. Orphan file cleanup

Orphan files are data files, metadata files, or manifest files that exist on storage but are not referenced by any current table metadata. They accumulate from several sources:

Incomplete expiration. Snapshot expiration attempts to delete unreferenced files, but concurrent operations or partial failures can leave some behind.
Failed writes. A Spark job that writes Parquet files to S3 but crashes before committing leaves files on storage that no snapshot ever referenced.
Compaction residue. Compaction reads input files, writes new merged files, and commits the new files to the table. If the post-commit cleanup is interrupted, old input files linger on storage.
Schema evolution artifacts. Certain metadata operations can leave behind unreferenced manifest files.

Orphan files are invisible to query engines — they do not appear in any snapshot, manifest, or metadata file. But they consume storage and inflate S3 bills. In production lakehouses, orphan accumulation can reach hundreds of terabytes before anyone notices.

sql

1-- Remove orphan files older than 3 days2CALL catalog.system.remove_orphan_files(3  table => 'db.events',4  older_than => TIMESTAMP '2026-05-23 00:00:00',5  dry_run => true6);

Always run a dry run first. The older_than parameter is a safety mechanism — it ensures you do not delete files from writes that are still in progress. A 3-day buffer is standard for most environments. Files younger than the threshold are left untouched even if they appear unreferenced, because an in-flight commit may still claim them.

3. Data file compaction

Compaction is the most discussed maintenance operation — and for good reason. It directly determines query performance. Compaction reads small data files and rewrites them into larger, optimally-sized files. Two strategies exist:

Binpack merges files without changing row order. It is fast, low-risk, and solves the small-file problem. Use binpack when your queries scan full partitions, when no single column dominates filter predicates, or when you need to compact quickly with minimal compute.

sql

1-- Binpack compaction: merge small files into ~256 MB targets2CALL catalog.system.rewrite_data_files(3  table => 'db.events',4  strategy => 'binpack',5  options => map(6    'target-file-size-bytes', '268435456',7    'min-file-size-bytes',   '201326592',8    'max-file-size-bytes',   '335544320'9  )10);

Sort compaction rewrites files with rows physically ordered by one or more columns. This maximizes data skipping for filtered queries — Parquet min/max statistics become tight per-file ranges instead of overlapping intervals. Sort compaction is more expensive (it requires a full shuffle and sort) but delivers the largest query performance gains for workloads with consistent filter patterns.

sql

1-- Sort compaction: order by event_date and user_id2CALL catalog.system.rewrite_data_files(3  table => 'db.events',4  strategy => 'sort',5  sort_order => 'event_date ASC NULLS LAST, user_id ASC NULLS LAST',6  options => map(7    'target-file-size-bytes', '268435456'8  )9);

Target file sizes. The optimal file size depends on the workload. Point-lookup tables (high selectivity, few rows returned) benefit from smaller files in the 64–128 MB range — smaller files mean finer-grained skipping. Full-scan analytical tables benefit from larger files in the 256–512 MB range — fewer files mean less per-file overhead. Most production tables land at 256 MB as a reasonable default. For a deeper treatment of compaction strategies and file sizing, see our compaction guide.

When to compact. Compact when the average file size in active partitions drops significantly below the target, or when file count exceeds a threshold that makes scan planning expensive. A partition with 500 files averaging 4 MB each is a clear compaction candidate. A partition with 20 files averaging 240 MB each is not. If small-file accumulation is your primary concern, our small files guide covers the problem in depth.

4. Manifest rewriting

Manifests are the index layer between metadata and data files. Each manifest tracks a set of data files — their paths, partition values, column statistics, and file sizes. When you append data, Iceberg creates a new manifest for the new files. When you compact data, the old manifests are replaced. Over time, manifests fragment: too many small manifests, manifests that track files across unrelated partitions, or manifests left over from historical operations that no longer reflect the current file layout.

Manifest fragmentation slows scan planning. The query engine must open and parse every manifest in the current snapshot's manifest list to identify relevant files. More manifests means more metadata I/O before any data file is touched.

sql

1-- Rewrite manifests to consolidate and align with partitions2CALL catalog.system.rewrite_manifests(3  table => 'db.events'4);

Manifest rewriting consolidates small manifests into larger ones and optionally aligns manifests with partition boundaries so each manifest covers a contiguous partition range. This makes partition pruning during scan planning more effective — the engine can skip entire manifests rather than opening them to discover that none of their files match the query filter.

When to rewrite. Rewrite manifests when the manifest count is high relative to the table's data volume, when manifests are small (tracking fewer than 100–200 files each), or after a large compaction pass that has dramatically changed the file layout. The rule of thumb: if scan planning takes noticeably longer than it should for the data volume, manifest fragmentation is a likely cause.

5. Delete file cleanup

Iceberg supports row-level deletes through delete files — position delete files that identify specific rows by file path and row position, and equality delete files that match rows by column values. Delete files are efficient for writes (no data rewriting needed at write time) but expensive for reads: every query must reconcile data files against pending delete files to filter out deleted rows.

As delete files accumulate, read performance degrades. A table with a high delete-file-to-data-file ratio forces every scan to do significant reconciliation work, even for queries that would otherwise touch only a small fraction of the data.

sql

1-- Rewrite data files to physically apply pending deletes2CALL catalog.system.rewrite_data_files(3  table => 'db.events',4  options => map(5    'delete-file-threshold', '3'6  )7);

This rewrites only data files that have more than the specified number of associated delete files. The delete files are physically applied — deleted rows are excluded from the new data files — and the delete files themselves are removed. The result is a clean dataset with no pending deletes, restoring read performance to its pre-delete baseline. For a full treatment of delete file mechanics and merge-on-read overhead, see our delete files guide.

Maintenance sequencing: why order matters

The five operations are not independent. Running them in the wrong order wastes compute, misses cleanup opportunities, or creates new problems.

The optimal sequence is: (1) snapshot expiration → (2) orphan file cleanup → (3) data file compaction → (4) manifest rewriting.

Why expire before compact. Snapshot expiration dereferences data files that are no longer needed. If you compact first, the compaction engine may read and rewrite files that snapshot expiration would have removed moments later. You have spent compute processing data that was about to be garbage-collected. In a table with months of accumulated snapshots, this waste can be substantial — compaction rewrites hundreds of gigabytes of data that no query will ever read again.

Why clean orphans before compact. Orphan cleanup after expiration removes the physical files that expiration dereferenced. This reduces the total storage footprint before compaction runs, so compaction operates on a clean dataset. It also prevents confusion — orphan files sitting in partition directories can interfere with certain storage-level operations and inflate metrics.

Why compact before rewriting manifests. Compaction changes the file layout — it removes input files and creates new output files. Manifests track files. If you rewrite manifests first, they will be aligned to the pre-compaction layout. The moment compaction runs, those carefully rewritten manifests become stale and fragmented again. Compact first, then rewrite manifests against the final file layout.

Where delete file cleanup fits. Delete file rewriting is a form of compaction — it rewrites data files to apply pending deletes. It should run in the compaction phase, either as part of the same compaction pass or immediately after.

Common maintenance mistakes

Running compaction before expiring snapshots. The single most common mistake. Teams schedule nightly compaction but forget snapshot expiration. The compaction job processes gigabytes of data referenced only by ancient snapshots that no one queries. The fix is simple: always expire first.

Expiring snapshots too aggressively. The opposite extreme is also dangerous. If you expire snapshots while a long-running query is using one of them, the query can fail because the data files it needs may be deleted. Always retain snapshots for at least as long as your longest-running query, and use retain_last as a safety floor.

Not cleaning orphan files. Orphans are invisible to query engines and to most monitoring tools. They do not affect query performance, so teams do not notice them — until the storage bill arrives. In one production deployment, orphan accumulation reached approximately 200 TB across 324 tables before detection. Regular orphan cleanup should be a standard part of every maintenance schedule.

Setting orphan cleanup retention too short. The Iceberg docs explicitly warn: removing orphan files with a retention interval shorter than the time expected for any write to complete is dangerous — it can corrupt the table by deleting in-progress files. The default is 3 days. Never go below this unless you can guarantee all writes complete in less time.

Ignoring manifest fragmentation. Teams that master compaction often overlook manifests. A well-compacted table with 2,000 tiny manifests will still have slow scan planning. Manifest rewriting is the cheapest maintenance operation (it only rewrites metadata, not data) and delivers outsized planning-time improvements.

Compacting healthy partitions. Not every partition needs compaction on every run. A cron job that compacts all partitions nightly wastes compute on partitions that are already at target file size. Condition-based triggers — compact only when file count exceeds a threshold or average file size drops below a minimum — eliminate wasted runs.

Using one target file size for all tables. A 512 MB target file size is great for full-scan analytical tables but oversized for point-lookup tables that return a few hundred rows per query. Per-table (or per-workload) file size targets produce better results than a single global default.

Neglecting delete file accumulation. Delete files are easy to forget because they do not cause errors — queries still return correct results. But read performance degrades silently as delete files pile up. Monitor the delete-file-to-data-file ratio and rewrite when it exceeds a threshold (typically 3–5 delete files per data file).

Monitoring table health: key metrics

You cannot maintain what you do not measure. These are the metrics that matter for Iceberg table health. For a lake-wide observability strategy, see our Iceberg observability solution.

File count per partition. The most direct indicator of small-file accumulation. A partition with 50 files at 256 MB each is healthy. A partition with 5,000 files at 2 MB each needs immediate compaction. Track the 95th percentile across partitions — a single hot partition with excessive file count can drag down queries that touch it.

Average file size. Complements file count. A partition with 100 files averaging 4 MB is a compaction candidate even though the file count is moderate. Target: within 50–100% of your configured target file size.

Manifest count. The number of manifest files in the current snapshot. Healthy tables typically have fewer than one manifest per 50–100 data files. A table with 500 data files and 400 manifests has extreme fragmentation.

Delete file ratio. The number of delete files relative to data files. A ratio above 0.1 (one delete file per ten data files) indicates accumulating deletes that should be applied. A ratio above 0.5 is a read-performance emergency.

Snapshot count. Total snapshots retained. Directly impacts metadata size and scan-planning overhead. If your retention policy is 7 days but your table has 15,000 snapshots, expiration is not running or not keeping up.

Partition skew. The ratio between the largest and smallest partitions by file count or data size. High skew means some partitions are healthy while others are severely fragmented. Lake-wide averages can hide partition-level problems — always inspect the distribution. For strategies to minimize skew through better partitioning, see our partitioning best practices guide.

Total table size vs. referenced data. If total storage (all files in the table's directory) significantly exceeds the data referenced by the current snapshot, orphan files or old snapshot data are consuming space.

Automating maintenance

Manual maintenance — running SQL procedures by hand or scheduling one-off scripts — works for a handful of tables. It does not scale to hundreds or thousands of tables with different write patterns, query profiles, and SLA requirements.

Cron-based scheduling is the most common first step. A nightly job runs snapshot expiration, orphan cleanup, compaction, and manifest rewriting in sequence. It is simple to implement and covers the majority of maintenance needs. The limitations emerge at scale: cron cannot adapt to table state. A table that receives no writes still gets compacted. A streaming table that accumulates 100,000 files between midnight runs waits 24 hours for relief. All tables get the same target file size, the same retention policy, the same compaction strategy regardless of workload.

Event-driven approaches trigger maintenance based on table state rather than wall-clock time. When file count exceeds a threshold, compact. When snapshot count crosses a limit, expire. When orphan volume grows, clean. Event-driven maintenance eliminates wasted runs on idle tables and responds immediately to tables that need attention. The complexity cost is building the monitoring infrastructure that detects these conditions and the orchestration layer that triggers the right operation at the right time. For a deep dive into autonomous table maintenance, see our dedicated guide.

The ideal system combines both: baseline cron schedules as a safety net with event-driven triggers for responsive, condition-based maintenance. Per-table policies allow different retention windows, file size targets, and compaction strategies for different workload profiles.

How LakeOps automates Iceberg table maintenance

LakeOps is a dedicated lakehouse control plane for Apache Iceberg that automates every maintenance operation described in this guide. It connects to your existing catalogs (Glue, REST, Polaris, Nessie, S3 Tables) and query engines (Trino, Spark, Flink, Snowflake, Athena, DuckDB) without moving data or changing pipelines. Once connected, it classifies every table's structural health and runs the full maintenance lifecycle autonomously.

Modern Lakehouse Architecture — LakeOps control plane for autonomous management — LakeOps sits as a dedicated control plane connecting catalogs (Glue, REST, S3 Tables), query engines (Spark, Trino, Flink, Snowflake, Athena, DuckDB), and the Iceberg data layer — orchestrating maintenance across the entire multi-engine lakehouse.

LakeOps walkthrough — table health analysis, autonomous maintenance, and optimization across your Iceberg catalog.

Key capabilities for table health and maintenance:

Lake-wide health classification — every table continuously scored as Critical, Warning, or Healthy based on file counts, manifest state, snapshot age, and delete file ratios. Proactive Insights at four severity levels (Critical, High, Warning, Low) surface degradation before queries slow down.
Coordinated maintenance pipeline — the full sequence runs automatically: snapshot expiration → orphan cleanup → compaction → manifest optimization. Each operation's output feeds the next. Triggered by structural signals per table, not fixed schedules.
Event-driven triggers — continuous monitoring of file count, average file size, delete-file ratio, manifest count, and snapshot depth per table and partition. Maintenance fires only when thresholds are crossed — no wasted runs on healthy tables, no missed runs on degraded ones.
Rust execution engine — built on Apache DataFusion with no JVM startup, no GC pauses, no executor provisioning. Binpack compaction completes 86% faster than Spark in production benchmarks across 5.5 TB. Non-blocking commits ensure concurrent readers and writers are never interrupted.
Query-aware sort optimization — telemetry from every connected engine identifies which columns production queries actually filter, join, and group on. Sort orders adapt as access patterns shift.
Layout simulations — test proposed sort changes on a real Iceberg branch with production queries replayed, comparing cost and performance impact before modifying any production data.
Per-table policies — maintenance rules at catalog, namespace, or table scope. A streaming events table gets aggressive compaction triggers and short snapshot retention; a compliance audit table gets 30-day retention and conservative thresholds. Policies are versioned and auditable.
Full observability — every operation logged with before/after metrics (file counts, data sizes, duration, status). Dashboard shows 30-day optimization activity, cost savings, and health trends across the lake.

LakeOps Table Monitoring — health classification across the lake — Lake-wide table health classification: Critical, Warning, and Healthy tables with file counts, data sizes, snapshot counts, and partition details — the single view that reveals which tables need maintenance.

LakeOps Dashboard — lake-wide health overview — LakeOps Dashboard: 30-day optimization activity, cost savings, health distribution (Critical/Warning/Healthy), and recent maintenance operations — the single pane of glass for your Iceberg lakehouse.

LakeOps maintenance pipeline — connect, operate, observe — The coordinated maintenance pipeline: connect catalogs, choose manual or autonomous mode, run operations in sequence, then observe with metrics, health scores, and policies.

LakeOps Table Events — every maintenance operation logged — Table-level Events: every step in the maintenance pipeline — Compact Data Files (970→87 files), Expire Snapshots, Rewrite Manifests — with duration, impact, and status per table.

LakeOps Insights — proactive alerts driven by table and query signals — Insights: severity-ranked alerts for file structure issues, excessive manifests and snapshots, partition skew, and small files — surfacing the specific structural problems that drive maintenance decisions.

LakeOps Layout Simulations — data-driven sort order decisions — Layout Simulations: field access frequency from real queries per column, candidate sort configurations compared against the baseline, and projected impact — ensuring sort compaction delivers maximum data skipping.

LakeOps Optimization tab — compaction and snapshot configuration — Per-table optimization: compaction strategy (binpack/sort), target file size, snapshot expiration policy, and orphan cleanup settings — tunable per table with lake-wide defaults as the baseline.

Production maintenance checklist

Use this checklist to assess your current maintenance posture and identify gaps:

[ ] A dedicated control plane or automation system manages the full maintenance lifecycle — a tool like LakeOps that classifies table health (Critical/Warning/Healthy), sequences operations correctly, triggers maintenance based on actual table state, and provides lake-wide observability. Manual scripts and cron jobs are a starting point, not an end state.
[ ] Snapshot expiration is running on every table with a defined retention policy (typically 5–7 days)
[ ] Orphan file cleanup runs regularly with a safety buffer (≥3 days) and has been verified with a dry run
[ ] Compaction runs with appropriate strategies — binpack for general tables, sort for tables with consistent filter patterns
[ ] Target file sizes are configured per workload — 128–256 MB for selective queries, 256–512 MB for full scans
[ ] Manifest rewriting runs after compaction to consolidate metadata against the current file layout
[ ] Delete files are monitored and rewritten when the ratio exceeds 3–5 delete files per data file
[ ] Sequencing follows the correct order: expire → clean orphans → compact → rewrite manifests
[ ] Monitoring tracks file count, average file size, manifest count, delete file ratio, snapshot count, and partition skew
[ ] Alerting is configured for critical thresholds — excessive file counts, snapshot accumulation, orphan growth
[ ] Per-table policies exist for tables with distinct SLA requirements or workload profiles

Iceberg gives you a powerful, open table format. Keeping it healthy in production requires deliberate, coordinated maintenance across all five operations. The teams that treat table maintenance as a first-class operational concern — not an afterthought — are the ones running lakehouses that scale smoothly, cost predictably, and query fast. If you're ready to automate the full lifecycle, LakeOps handles health classification, sequenced maintenance, and per-table policies across your entire catalog — explore the managed Iceberg solution or see how it drives cost optimization at scale.