
Iceberg Metadata Lifecycle: Maintenance and Optimization

A deep technical guide to managing the metadata layer that makes Apache Iceberg fast — snapshots, manifests, metadata.json files, and Puffin statistics — covering expiration, consolidation, orphan cleanup, and the sequencing that prevents production incidents.

LakeOps Data Lake Insights showing metadata health alerts across Iceberg tables — manifest fragmentation, snapshot accumulation, and partition skew

Apache Iceberg's performance advantage over Hive-style data lakes comes entirely from its metadata layer. Instead of listing directories at query time, Iceberg maintains a structured tree of metadata files, manifest lists, manifest files, and column-level statistics that allows query planners to eliminate irrelevant data without opening a single Parquet file. This metadata layer is what makes partition pruning, file-level data skipping, and row-group filtering possible at petabyte scale.

But metadata is not static. Every commit creates new metadata files. Every append adds manifest entries. Every snapshot accumulates until explicitly expired. On production tables with continuous ingest, this accumulation becomes its own performance problem — inflated planning times, bloated storage, and garbage collection that never runs. The metadata that makes Iceberg fast can make it slow if you do not maintain it.

This guide covers the full metadata lifecycle for production Iceberg tables: what each metadata component does, how it degrades, the correct maintenance operations in the correct sequence, and how to automate the lifecycle at lake scale with a control plane like LakeOps.

Iceberg's metadata architecture: the four-layer tree

Every Iceberg table maintains a metadata tree with four distinct layers. Understanding what each layer stores and how it grows is essential for knowing when and how to maintain it.

Layer 1: metadata.json. The root of the tree. Created on every commit, it contains the current schema, partition spec, sort order, default spec/sort IDs, properties, and a list of snapshots. The catalog (Glue, REST, Nessie) stores a pointer to the current metadata.json file — this is the single atomic pointer that makes Iceberg commits safe.

Layer 2: Snapshot and manifest list. Each snapshot represents a complete, immutable table state at a point in time. A snapshot points to a manifest list (an Avro file) that indexes all manifest files for that version. The manifest list carries partition-level summary statistics (min/max per partition column) enabling the first pruning pass during query planning.

Layer 3: Manifest files. Avro files that track individual data files. Each manifest entry records the data file's path, partition values, file size, record count, and per-column statistics (value count, null count, lower bound, upper bound). This is where file-level data skipping lives — the planner uses these statistics to eliminate files whose ranges cannot match the query predicate.

Layer 4: Puffin statistics. Optional sidecar files containing advanced statistics that manifest min/max cannot express: NDV (number of distinct values) sketches via Apache DataSketches Theta algorithm, Bloom filters for high-cardinality point lookups, and histograms. Query engines use these for join-order optimization and definitive "not present" answers.

Each layer grows independently. metadata.json grows with schema changes and snapshot accumulation. Manifest lists grow with snapshot count. Manifest files fragment with every append, compaction, and delete operation. Understanding which layer is degrading tells you which maintenance operation to run.
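To make the file-level skipping in Layer 3 concrete, here is a minimal Python sketch. It is not Iceberg's implementation: the `ManifestEntry` class, the file paths, and the integer bounds are all hypothetical, and a real manifest stores bounds per column in a binary encoding. It only illustrates the overlap test a planner applies to lower and upper bounds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestEntry:
    """Simplified stand-in for one data-file entry in a manifest."""
    path: str
    record_count: int
    lower: int  # lower bound of the filtered column in this file
    upper: int  # upper bound of the filtered column in this file


def files_to_scan(entries, lo, hi):
    """Keep files whose [lower, upper] range overlaps the query range [lo, hi]."""
    return [e.path for e in entries if e.upper >= lo and e.lower <= hi]


entries = [
    ManifestEntry("data/a.parquet", 1_000, lower=1, upper=100),
    ManifestEntry("data/b.parquet", 1_000, lower=101, upper=200),
    ManifestEntry("data/c.parquet", 1_000, lower=201, upper=300),
]

# A predicate such as `id BETWEEN 150 AND 180` overlaps only the second file.
selected = files_to_scan(entries, 150, 180)
print(selected)  # ['data/b.parquet']
```

The other two files are eliminated from the scan using metadata alone, without opening any Parquet file.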

How metadata degrades: the five failure modes

Metadata degradation is silent. Tables continue to return correct results — they just get slower. These are the five failure modes that accumulate in production:

1. Snapshot accumulation. Every commit creates a snapshot. A streaming table with 5-minute checkpoints creates 288 snapshots per day (8,640 per month). Each snapshot adds an entry to metadata.json and keeps its own manifest list alive. Worse, data files that only old snapshots still reference cannot be garbage-collected until those snapshots are expired. Production tables have been observed holding 120 TB of reclaimable data behind snapshots that had aged past retention but were never cleaned up.

2. Manifest fragmentation. Each append creates new manifest files. A table receiving hourly batch appends for a year accumulates 8,760 manifests. During query planning, the engine must read each surviving manifest to evaluate file-level statistics. Planning that takes 0.5 seconds with 30 manifests takes 4+ seconds with 300. The threshold for interactive workloads is roughly 50–100 manifests per table.

LakeOps Insights — manifest overload degrading query planning
Table Insights showing 92 manifest files (threshold: 50) with 43 undersized manifests severely impacting query performance — a common metadata health problem on high-write analytics tables that have never run manifest consolidation.

3. metadata.json bloat. Iceberg stores all historical schemas in the metadata.json file. Tables with frequent schema evolution (column additions, type promotions) accumulate hundreds of schema versions. Production reports show metadata.json files reaching 10 MB compressed (250 MB uncompressed), consuming ~4 GB of memory on load in engines like Trino. The problem is particularly acute for tables with thousands of columns where each schema includes the full column list.

4. Orphan file accumulation. Failed Spark jobs, speculative execution, interrupted compaction, and retried streaming operations leave data files on object storage that no metadata references. These orphans waste storage and can confuse operators into thinking tables are larger than they are. Without periodic cleanup, orphan accumulation grows linearly with write frequency and failure rate.

5. Stale statistics. After compaction rewrites files, the manifest statistics for the new files are fresh. But Puffin statistics (NDV, Bloom filters) computed before compaction reference files that no longer exist. Stale statistics mislead query optimizers into suboptimal join orders and missed pruning opportunities.
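The accumulation figures quoted for the first two failure modes follow from simple arithmetic. The sketch below reproduces them under stated assumptions: one snapshot per commit, one new manifest per append, and a 30-day month. The function name is illustrative only.

```python
def commits_per_day(interval_minutes: int) -> int:
    """Number of commits in 24 hours at a fixed commit interval."""
    return (24 * 60) // interval_minutes


streaming_per_day = commits_per_day(5)        # 5-minute checkpoints
streaming_per_month = streaming_per_day * 30  # assuming a 30-day month
hourly_per_year = commits_per_day(60) * 365   # hourly batch appends for a year

print(streaming_per_day, streaming_per_month, hourly_per_year)  # 288 8640 8760
```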

Snapshot expiration: the first maintenance operation

Snapshot expiration is always the first operation in the maintenance sequence because it determines what downstream operations can safely do. Until a snapshot is expired, all files it references — data files, delete files, manifests, manifest lists — remain live and cannot be cleaned up.

What expiration does:

  1. Removes the snapshot entry from the metadata.json snapshot list.
  2. Deletes manifest lists that are no longer referenced by any remaining snapshot.
  3. Deletes manifest files that are no longer referenced by any remaining manifest list.
  4. Identifies data files and delete files that no remaining snapshot references. By default Iceberg deletes these as part of expiration; if file cleanup is disabled or restricted to metadata, they stay on storage for a later cleanup pass.

Retention strategies by workload type:

  • Batch ETL tables (daily loads): Retain 5–20 snapshots or 7 days, whichever is shorter. Time-travel beyond a week rarely has operational value for batch tables.
  • Streaming tables (minute-level commits): Retain 100+ snapshots or 3–7 days. High commit velocity means snapshot IDs turn over rapidly; retaining too few breaks concurrent readers that hold snapshot references.
  • Audit-critical tables: Retain 30–90 days or use Iceberg tags/branches to preserve specific snapshots indefinitely without preventing expiration of intermediate ones.
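One way to encode these retention strategies is a small lookup that renders a Spark `expire_snapshots` procedure call. This is a sketch under assumptions: the preset values are arbitrary picks from within the ranges above, and the catalog name `my_catalog` and table `db.events` are placeholders. The `older_than` and `retain_last` arguments are the ones the Iceberg Spark procedure accepts.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Retention:
    max_age_days: int
    retain_last: int


# Hypothetical presets chosen from within the ranges listed above.
PRESETS = {
    "batch": Retention(max_age_days=7, retain_last=10),
    "streaming": Retention(max_age_days=3, retain_last=100),
    "audit": Retention(max_age_days=90, retain_last=100),
}


def expire_call(catalog: str, table: str, workload: str, now: datetime) -> str:
    """Render a Spark SQL CALL statement for the given workload type."""
    r = PRESETS[workload]
    cutoff = (now - timedelta(days=r.max_age_days)).strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{cutoff}', "
        f"retain_last => {r.retain_last})"
    )


sql = expire_call("my_catalog", "db.events", "streaming", datetime(2026, 5, 10, 12, 0, 0))
print(sql)
```

Keeping the presets in one place makes the per-workload differences explicit and reviewable instead of scattering them across job definitions.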

Parallel expiration for large tables. The sequential table.expireSnapshots() API becomes slow on tables with thousands of snapshots and millions of referenced files. Use Spark Actions for parallel execution — or a control plane that handles parallelism and conflict detection automatically across your entire catalog.

Snapshots tab — 83 snapshots on a high-write table ready for expiration
Snapshots tab showing 83 snapshots with IDs, timestamps, and operations — Tag, Branch, Rollback, and Set Current actions available per snapshot. Retention beyond your SLA is pure metadata and storage cost with no operational benefit.
Expire Snapshots — production event showing 2,928 snapshots removed
Production expire run: 2,928 snapshots deleted, 5,819 total files removed, 263 MB reclaimed, 2,891 manifests and 2,928 manifest lists deleted in 3m 47s. On high-write tables, expiration is the highest-impact metadata operation for both planning speed and storage cost.

Safety guardrails:

  • Always set retain_last or min_snapshots_to_keep to prevent accidental deletion of all snapshots during aggressive expiration.
  • Never expire snapshots newer than your longest-running query's start time — an active reader holding a snapshot reference will fail if that snapshot disappears.
  • Use METADATA_ONLY cleanup level when data files are shared across tables (e.g., table clones or branched development).

The history.expire.min-snapshots-to-keep table property acts as a floor — even if the time-based threshold would expire more, this minimum is always preserved. Set it to at least 2 for any table with concurrent readers.
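The interaction between the time threshold and the minimum-to-keep floor can be sketched as below. This is a simplified model, not Iceberg's code: it treats the table as a single linear history and ignores branches and tags, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def snapshots_to_expire(snapshot_times, older_than, min_to_keep):
    """Return timestamps that would expire: older than the cutoff,
    but never any of the newest `min_to_keep` snapshots."""
    ordered = sorted(snapshot_times)  # oldest first
    protected = set(ordered[-min_to_keep:]) if min_to_keep > 0 else set()
    return [t for t in ordered if t < older_than and t not in protected]


now = datetime(2026, 5, 10)
snapshots = [now - timedelta(days=d) for d in (10, 9, 8)]
cutoff = now - timedelta(days=7)

# All three snapshots are older than the cutoff, but the floor of 2
# protects the two newest, so only the 10-day-old snapshot expires.
expired = snapshots_to_expire(snapshots, cutoff, min_to_keep=2)
print(expired)
```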

Orphan file cleanup: reclaiming storage safely

After snapshot expiration dereferences files, orphan cleanup physically deletes them from object storage. This is a destructive, irreversible operation that requires careful sequencing and safety margins.

What creates orphan files:

  • Failed or interrupted compaction jobs that write new files but never commit the rewrite.
  • Speculative execution in Spark where multiple attempts write the same file; only one commit succeeds.
  • Streaming checkpoint recovery where a writer restarts and abandons in-flight files.
  • Expired snapshots whose data files were dereferenced but not yet physically deleted.

The critical safety parameter: `olderThan`. The default is 3 days. This means only files with modification timestamps older than 3 days (and not referenced by any metadata) are candidates for deletion. This prevents deleting files that concurrent writers are actively creating but have not yet committed to metadata.
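The selection rule behind `olderThan` amounts to a set difference with an age filter. The sketch below models it in plain Python; the paths, timestamps, and the `find_orphans` helper are hypothetical, and a real run lists object storage and walks table metadata rather than taking in-memory inputs.

```python
from datetime import datetime, timedelta


def find_orphans(listed, referenced, now, older_than=timedelta(days=3)):
    """listed: dict of path -> last-modified time from a storage listing.
    referenced: set of paths reachable from table metadata.
    A file is a deletion candidate only if it is unreferenced AND old enough."""
    cutoff = now - older_than
    return sorted(
        path
        for path, modified in listed.items()
        if path not in referenced and modified < cutoff
    )


now = datetime(2026, 5, 10, 12, 0)
listed = {
    "data/committed.parquet": now - timedelta(days=10),
    "data/failed-job.parquet": now - timedelta(days=5),
    "data/in-flight.parquet": now - timedelta(hours=1),
}
referenced = {"data/committed.parquet"}

orphans = find_orphans(listed, referenced, now)
print(orphans)  # ['data/failed-job.parquet']
```

The in-flight file is unreferenced but survives because it is newer than the safety window, which is the case the default protects against.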

Orphan file removal — production results across multiple tables
Remove Orphan Files across the lake: 13.6 GB reclaimed from ice_html5_sdk_events (1m 9s), 74.8 GB from ice_desktop_sdk_events (13m 6s), and multiple staging tables cleaned in under 1 second each — all with SUCCESS status.

Risks and mitigation:

  • Concurrent writer conflicts. If a writer is mid-commit (file written, metadata not yet updated), orphan cleanup may incorrectly identify that file as orphaned. The olderThan parameter prevents this by only targeting files older than the safety window.
  • Clock skew. Object storage timestamps and the cleanup job's clock must be reasonably synchronized. S3 guarantees strong read-after-write consistency but metadata propagation can lag.
  • Cross-table references. If tables share data files (via branching or cloning), orphan cleanup on one table may delete files still referenced by another. Scope cleanup carefully or use table-level isolation.

Sequencing rule: Always run orphan cleanup after snapshot expiration, never before. Running cleanup first is safe, because files still referenced by unexpired snapshots are protected, but it is wasteful: the job spends compute scanning files it cannot delete, and anything the later expiration dereferences sits on storage until the next cleanup cycle.

Manifest consolidation: reducing planning overhead

Manifest rewriting (RewriteManifests) consolidates fragmented manifest files into fewer, larger manifests aligned with the current file layout. This directly reduces query planning time because the engine opens fewer files during the manifest-reading stage.

When manifest fragmentation hurts:

Every append operation creates at least one new manifest file. On a table receiving 24 hourly appends per day, that is 24 new manifests daily. After a month without consolidation, the table has 720+ manifests — each requiring an Avro file read during planning. At 5–15ms per manifest read (S3 GET + Avro parse), planning alone costs 3.6–10.8 seconds before any data is scanned.
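The estimate above can be reproduced with a few lines of Python. The sketch assumes manifests are read one after another; the `parallelism` parameter is an added assumption to show how concurrent reads would change the figure, not something the numbers above account for.

```python
def planning_seconds(manifest_count, ms_per_manifest, parallelism=1):
    """Estimated time spent reading manifests during planning, in seconds."""
    return manifest_count * ms_per_manifest / 1000 / parallelism


manifests = 24 * 30  # hourly appends for a 30-day month, one manifest each

best = planning_seconds(manifests, 5)    # 5 ms per manifest read
worst = planning_seconds(manifests, 15)  # 15 ms per manifest read
print(best, worst)
```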

What RewriteManifests does:

  1. Reads all current manifest files referenced by the latest snapshot.
  2. Groups file entries by partition ranges to maximize the effectiveness of manifest-list pruning.
  3. Writes new consolidated manifests with optimal entry counts (typically 8 MB target size).
  4. Atomically commits a new snapshot pointing to the rewritten manifest list.

Partition-aligned manifest rewriting. The most effective manifest consolidation groups entries by partition value so that each manifest covers a narrow partition range. This maximizes manifest-list-level pruning — when a query filters on event_date >= '2026-05-01', the planner can skip entire manifests whose partition ranges are entirely before May 2026.
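A toy model shows why alignment matters. In the sketch below, 150 daily files are grouped into 10 manifests in two ways: in a fragmented order where every manifest spans nearly the whole date range, and sorted by partition value. The helper names and the file layout are hypothetical; only the range-overlap test mirrors what manifest-list pruning does.

```python
from datetime import date, timedelta


def build_manifests(file_dates, per_manifest):
    """Chunk file partition dates into manifests and record each
    manifest's (min, max) partition range."""
    chunks = [file_dates[i:i + per_manifest]
              for i in range(0, len(file_dates), per_manifest)]
    return [(min(c), max(c)) for c in chunks]


def manifests_to_read(manifests, since):
    """Count manifests whose range can contain event_date >= since."""
    return sum(1 for _, upper in manifests if upper >= since)


days = [date(2026, 1, 1) + timedelta(days=i) for i in range(150)]

# Fragmented: each group of 15 files is scattered across the whole range.
scattered = [days[a + 10 * j] for a in range(10) for j in range(15)]
fragmented = build_manifests(scattered, per_manifest=15)

# Aligned: entries sorted by partition value before chunking.
aligned = build_manifests(sorted(days), per_manifest=15)

since = date(2026, 5, 1)
print(manifests_to_read(fragmented, since), manifests_to_read(aligned, since))  # 10 2
```

Both layouts hold the same 150 entries in 10 manifests, but only the aligned layout lets the planner skip 8 of them for the May filter.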

Trino recently added an optimize_manifests procedure that performs this consolidation. In Spark, the RewriteManifests action supports custom partition ordering via the sort() method to align manifest boundaries with primary query patterns.

Rewrite Manifests — consolidation and Puffin statistics controls
Optimization tab showing Rewrite Manifests (consolidate fragmented manifest files for improved metadata performance), Rewrite Position Delete Files, and Compute Table Statistics (Puffin) — the three metadata-layer operations that address planning overhead without touching data files.

Sequencing rule: Run manifest rewriting after compaction, not before. Compaction changes the physical file layout — files are merged, split, or rewritten. If you consolidate manifests first, compaction immediately fragments them again. Rewriting after compaction produces manifests that accurately reflect the stable file layout.

metadata.json management: preventing root-level bloat

The metadata.json file is the root of Iceberg's metadata tree. Every table operation that changes schema, properties, or snapshot state creates a new metadata.json version. Without lifecycle management, this file grows unbounded.

Growth vectors:

  • Snapshot history: Each snapshot entry adds ~200–500 bytes. At 288 commits/day, that is 56–140 KB/day of snapshot entries alone.
  • Schema accumulation: Iceberg never drops historical schemas from metadata.json. Tables with frequent column additions accumulate all previous schemas. A table with 500 columns that has undergone 200 schema changes stores all 200 complete schema versions.
  • Partition spec history: Similar to schemas — evolved partition specs are retained forever.
  • Properties: Table properties set and unset over time accumulate in the properties map.
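The first two growth vectors can be checked with quick arithmetic. The sketch below uses the per-entry sizes quoted above and reports KiB (1,024 bytes). The schema figure is an upper bound that assumes every stored version carries the full column list; the function names are illustrative.

```python
def snapshot_entries_kib_per_day(commits_per_day, bytes_per_entry):
    """Daily growth of the snapshot list in metadata.json, in KiB."""
    return commits_per_day * bytes_per_entry / 1024


low = snapshot_entries_kib_per_day(288, 200)
high = snapshot_entries_kib_per_day(288, 500)


def stored_column_definitions(columns, schema_versions):
    """Upper bound on column definitions held across all schema versions."""
    return columns * schema_versions


total_columns = stored_column_definitions(500, 200)
print(low, high, total_columns)  # 56.25 140.625 100000
```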

Configuration for automatic management:

  • write.metadata.delete-after-commit.enabled = true — automatically deletes the oldest tracked metadata files after each new commit. Prevents unbounded accumulation of metadata.json versions on storage.
  • write.metadata.previous-versions-max = 100 — caps how many previous metadata.json files are tracked. The default of 100 is appropriate for most workloads; streaming tables may need higher values to support concurrent time-travel queries.
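As a sketch of how these two properties might be applied, the helper below renders a Spark SQL `ALTER TABLE ... SET TBLPROPERTIES` statement. The table identifier `my_catalog.db.events` and the helper name are placeholders; the property keys are the ones listed above.

```python
def set_properties_sql(table: str, props: dict) -> str:
    """Render an ALTER TABLE statement that sets the given table properties."""
    pairs = ", ".join(f"'{key}' = '{value}'" for key, value in props.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({pairs})"


props = {
    "write.metadata.delete-after-commit.enabled": "true",
    "write.metadata.previous-versions-max": "100",
}

stmt = set_properties_sql("my_catalog.db.events", props)
print(stmt)
```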

The memory problem. When a query engine loads a table, it deserializes the entire metadata.json file into memory. At 250 MB uncompressed (observed in production for large, schema-heavy tables), this consumes ~4 GB of heap — a significant portion of a coordinator's memory budget. The Iceberg community is actively working on solutions: external schema files (lazy-loaded on demand), incremental metadata diffs (append-only instead of full rewrites), and metadata.json pruning of historical schemas no longer needed for backward compatibility.

Practical mitigation today: Keep snapshot expiration aggressive to reduce the snapshot list in metadata.json. Avoid unnecessary schema churn (batch column additions into single commits). Monitor metadata.json file sizes as part of table health checks — any table where metadata.json exceeds 5 MB should be investigated.

Puffin statistics: enriching metadata beyond min/max

Manifest files store per-file column statistics (min, max, null count, value count). These are sufficient for range-based pruning but insufficient for three common analytics patterns:

  • Join optimization. The query optimizer needs NDV (number of distinct values) to choose between broadcast and shuffle joins. Without NDV, it defaults to shuffle — orders of magnitude slower for small dimension tables.
  • Point lookups on high-cardinality keys. A filter WHERE transaction_id = 'abc123' cannot benefit from min/max statistics on a UUID column because min/max spans the entire value space in every file. Bloom filters provide definitive "this file does not contain the value" answers.
  • Cardinality estimation. Query cost models need accurate row count estimates after applying predicates. NDV sketches combined with min/max enable better selectivity estimates.

The ComputeTableStats action analyzes data files and writes statistics to Puffin sidecar files using the Apache DataSketches Theta sketch algorithm. AWS Glue Data Catalog integrates this natively — generating NDV statistics that Amazon Redshift Spectrum and Athena use for join optimization.

Bloom filter file-skipping (emerging). A proof-of-concept Puffin-backed Bloom filter index reduces file planning from 658 files to 1 file for point lookups on high-cardinality columns — consulted during planning before any data files are opened. This is a planning-time optimization (stage 3 of the pruning pipeline) rather than a read-time filter within Parquet row groups.
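To illustrate the idea (not the proof-of-concept itself, whose format is not described here), the sketch below builds one tiny Bloom filter per data file and uses it to shortlist files for a point lookup. The `BloomFilter` class, its sizing, and the file contents are all hypothetical. The property that matters is that a Bloom filter never produces a false negative, so a "no" answer is definitive while a "yes" only means "maybe".

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value):
        # Derive several bit positions by salting the value with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


files = {
    "data/a.parquet": ["tx-001", "tx-002"],
    "data/b.parquet": ["tx-003", "tx-004"],
    "data/c.parquet": ["abc123", "tx-005"],
}

filters = {}
for path, ids in files.items():
    bloom = BloomFilter()
    for value in ids:
        bloom.add(value)
    filters[path] = bloom


def candidate_files(filters, value):
    """Files that cannot be ruled out for WHERE transaction_id = value."""
    return [path for path, bloom in filters.items() if bloom.might_contain(value)]


candidates = candidate_files(filters, "abc123")
print(candidates)
```

The file that actually holds the value is always among the candidates. Other files may occasionally appear as false positives, at a rate governed by the filter size and the number of hash functions.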

When to compute statistics:

  • After sort compaction stabilizes file layout (statistics must match current files).
  • On join key columns for star-schema fact tables.
  • On high-cardinality filter columns used in point lookups.
  • Not on columns that change with every compaction cycle — recompute after layout stabilizes.

The maintenance sequence: why order prevents incidents

Each metadata maintenance operation has dependencies on the others. Running them out of order wastes compute, produces stale results, or risks data loss. The correct sequence for production tables — and why each step depends on the previous one:

1. Expire snapshots → Determines which files are eligible for deletion. Without expiration running first, orphan cleanup cannot identify true orphans (files still referenced by unexpired snapshots are protected). Compaction should not rewrite files that are about to be dereferenced.

2. Remove orphan files → Reclaims storage from files that expiration dereferenced. Must run after expiration so newly orphaned files are caught. Must run before compaction so the compaction engine does not waste time processing dead files.

3. Compact data files → Merges small files, applies delete files, and optionally re-sorts data. Must run after orphan cleanup (avoid compacting files that will be deleted). Produces a new physical layout that downstream operations depend on.

4. Rewrite manifests → Consolidates the manifest tree to reflect the new post-compaction layout. Must run after compaction — rewriting before compaction produces manifests that immediately become fragmented again.

5. Compute statistics → Refreshes Puffin blobs with NDV and Bloom filters matching the current file set. Must run last because statistics reference specific data files — if computed before compaction, they reference files that no longer exist.

6. Observe and alert → Monitor file count trends, manifest depth, snapshot accumulation rate, and delete-file growth. Trigger the next maintenance cycle when signals cross thresholds.
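The first five steps can be written down as an ordered plan of Iceberg Spark procedure calls. This is a sketch only: the catalog, table, timestamps, and column names are placeholders, `retain_last => 2` is an arbitrary floor, and `compute_table_stats` is assumed to be available, which requires a recent Iceberg release. The function only builds the statements; executing them, for example via `spark.sql`, is left out.

```python
def maintenance_plan(catalog, table, expire_before, orphan_before, stats_columns):
    """Return (step name, Spark SQL statement) pairs in dependency order."""
    ns = f"{catalog}.system"
    cols = ", ".join(f"'{c}'" for c in stats_columns)
    return [
        ("expire_snapshots",
         f"CALL {ns}.expire_snapshots(table => '{table}', "
         f"older_than => TIMESTAMP '{expire_before}', retain_last => 2)"),
        ("remove_orphan_files",
         f"CALL {ns}.remove_orphan_files(table => '{table}', "
         f"older_than => TIMESTAMP '{orphan_before}')"),
        ("rewrite_data_files",
         f"CALL {ns}.rewrite_data_files(table => '{table}')"),
        ("rewrite_manifests",
         f"CALL {ns}.rewrite_manifests(table => '{table}')"),
        ("compute_table_stats",
         f"CALL {ns}.compute_table_stats(table => '{table}', "
         f"columns => array({cols}))"),
    ]


plan = maintenance_plan(
    catalog="my_catalog",
    table="db.events",
    expire_before="2026-05-03 00:00:00",
    orphan_before="2026-05-07 00:00:00",
    stats_columns=["customer_id", "transaction_id"],
)

for name, statement in plan:
    print(name, "->", statement)
```

Holding the order in a single list keeps the dependencies visible: a scheduler that runs the list top to bottom cannot compute statistics before compaction or clean orphans before expiration.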

Lake-wide maintenance events — full audit trail
Lake-wide Events log showing every maintenance operation across catalogs with type, duration, and impact — filterable by namespace, operation type, or status. The audit trail that proves maintenance is running correctly across hundreds of tables.
Per-table event history — maintenance operations with duration and impact
Table-level event history: each maintenance operation with type, duration, files processed, and bytes reclaimed — the audit trail that answers "when did this table last get maintained?" without digging through Airflow logs.

Running this sequence manually across hundreds of tables with different ingest rates, retention requirements, and compaction schedules is where manual Airflow DAGs break down. The coordination overhead grows with table count. A single scheduling mistake — expiring snapshots concurrently with an active compaction job — can corrupt table state.

Automating the metadata lifecycle at lake scale

At 10–50 tables, shell scripts and Airflow DAGs handle metadata maintenance. At 200+ tables across multiple catalogs with different engines writing concurrently, manual coordination becomes the bottleneck. The problems are systemic:

  • Heterogeneous retention needs. Audit tables need 90-day snapshot retention. Streaming tables need 3 days. Applying one policy to all tables either wastes storage or breaks compliance.
  • Conflict awareness. Maintenance operations must not conflict with active writers. OCC (optimistic concurrency control) retries handle most conflicts, but compaction that repeatedly conflicts with a streaming writer wastes cluster resources.
  • Cross-table dependencies. Shared catalogs mean one table's orphan cleanup must not accidentally affect another table's files (especially with table clones or branches).
  • Observability. Knowing which tables are degrading — and which specific metadata layer is the problem — requires continuous monitoring that static scripts do not provide.

A managed Iceberg control plane solves this by treating metadata maintenance as a system-level concern rather than a per-table script. LakeOps implements this as a closed-loop system: observe table metadata state → classify which tables need which operations → execute in the correct sequence with conflict awareness → log results → repeat. The system handles the heterogeneity that makes manual approaches fragile — different retention policies per namespace, different compaction strategies per workload type, and different observation thresholds per table tier.

Maintenance policies — scheduled operations across the lake
Policies dashboard: snapshot expiration on high-write namespaces, orphan cleanup every 7 days, compaction on critical analytics facts — each with next run, last run, and enable toggle. Policies encode the maintenance sequence so it runs correctly without manual coordination.
Maintenance policy wizard — six operation types
Policy creation wizard: six operation types — Expire Snapshots, Remove Orphan Files, Compact Data Files, Rewrite Manifests, Rewrite Position Delete Files, and Rewrite Equality Delete Files. Each type has per-operation configuration for retention windows, age thresholds, and target sizes.

Production deployments running this automated approach across 786+ tables and 112+ PB report 12× average query acceleration and up to 80% cost reduction, with every maintenance operation logged, reversible, and auditable. The key insight is that metadata maintenance is not a one-time fix: it is a continuous loop that must run faster than degradation accumulates. Compaction's place in this loop is direct: it produces the file layout that manifest consolidation and statistics computation depend on.

Monitoring metadata health: what to measure

Effective metadata maintenance requires continuous monitoring of specific signals per table. These metrics directly indicate which maintenance operation is needed:

Snapshot-layer metrics:

  • Snapshot count (trend over time; should be bounded by retention policy)
  • Oldest snapshot age (should not exceed retention SLA)
  • Unreferenced data files count (files eligible for cleanup after expiration)

Manifest-layer metrics:

  • Total manifest file count (threshold: 50–100 for interactive workloads)
  • Average manifest file size (undersized manifests indicate fragmentation)
  • Manifest entries per manifest (should be balanced; too few entries per manifest means too many manifests)

metadata.json metrics:

  • File size in bytes (alert at >5 MB, investigate at >10 MB)
  • Schema version count (indicates schema evolution velocity)
  • Time to load/deserialize (directly measured by query engines)

Puffin statistics metrics:

  • Statistics freshness (computed-at snapshot vs current snapshot)
  • Column coverage (which columns have statistics vs which are used in joins/filters)
  • Statistics file size and read latency
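These signals map naturally onto a small rule set. The sketch below is one possible encoding, not a description of any product's logic: the `TableHealth` fields, the action names, and the default thresholds are assumptions drawn from the numbers in this guide, and statistics are treated as stale whenever they were computed at a snapshot other than the current one, which is a simplification.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TableHealth:
    snapshot_count: int
    oldest_snapshot_age_days: float
    manifest_count: int
    metadata_json_mb: float
    stats_snapshot_id: Optional[int]  # None if statistics were never computed
    current_snapshot_id: int


def needed_actions(h: TableHealth, max_snapshots: int, retention_days: float,
                   max_manifests: int = 100,
                   metadata_alert_mb: float = 5.0) -> List[str]:
    """Map metric readings to the maintenance operations they call for."""
    actions = []
    if h.snapshot_count > max_snapshots or h.oldest_snapshot_age_days > retention_days:
        actions.append("expire_snapshots")
    if h.manifest_count > max_manifests:
        actions.append("rewrite_manifests")
    if h.metadata_json_mb > metadata_alert_mb:
        actions.append("investigate_metadata_json")
    if h.stats_snapshot_id != h.current_snapshot_id:
        actions.append("compute_statistics")
    return actions


health = TableHealth(
    snapshot_count=2928,
    oldest_snapshot_age_days=45.0,
    manifest_count=92,
    metadata_json_mb=1.2,
    stats_snapshot_id=None,
    current_snapshot_id=42,
)

actions = needed_actions(health, max_snapshots=100, retention_days=7, max_manifests=50)
print(actions)  # ['expire_snapshots', 'rewrite_manifests', 'compute_statistics']
```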

LakeOps table metrics — structural health of an optimized table
Per-table metrics showing 9.5B total records, 3.0K active data files at 129 MB average — a well-maintained table. When manifest count or snapshot depth deviates from healthy baselines, the metrics view reveals exactly which metadata layer needs attention.

Platforms that surface these signals across every table in a catalog, with automated alerting when thresholds are crossed, eliminate the reactive firefighting that characterizes most data platform teams. Instead of discovering metadata bloat after dashboard timeouts, you address it before users notice. This is the operational model that moves data engineering from maintenance toil toward building data products. Metadata health is also the foundation for query performance optimization and cost reduction: no amount of engine tuning compensates for a planning pipeline bottlenecked on fragmented manifests or bloated snapshot lists.

Summary

Iceberg's metadata layer is not a set-and-forget system. It is a living structure that grows with every commit, fragments with every append, and degrades silently until queries slow down or storage bills spike. The five failure modes — snapshot accumulation, manifest fragmentation, metadata.json bloat, orphan file buildup, and stale statistics — each have specific remediation operations that must run in the correct sequence.

The maintenance sequence (expire → clean orphans → compact → rewrite manifests → compute statistics → observe) exists because each step enables the next. Breaking this sequence wastes compute, produces stale results, or risks data loss. At scale, encoding this sequence into automated policies with conflict awareness and per-table tuning is the only reliable path. Manual scripts that worked at 30 tables become the primary source of incidents at 300.

For production deployments, the combination of continuous metadata observability with autonomous maintenance execution keeps every table in the lake within healthy metadata bounds — manifests consolidated, snapshots bounded, orphans reclaimed, and statistics current. That is the difference between a data platform that degrades between maintenance windows and one that stays optimized continuously.

Platform walkthrough — catalog connection, table health analysis, and autonomous metadata lifecycle management for production Iceberg tables.
