Iceberg Metadata at Scale: Keep Query Planning Fast on Petabyte Tables


At petabyte scale, Iceberg metadata is often the performance bottleneck, not the Parquet files and not your query engine. Trino, Spark, and Flink are fast once planning finishes. What breaks is the metadata tree: manifests multiply into the thousands, snapshot chains deepen with every commit, and column statistics balloon. Query planning drifts from milliseconds to minutes. Coordinators OOM. S3 GET costs spike. And the query has not read a single row of data yet.

This guide synthesizes production patterns for managing Iceberg metadata at petabyte scale with Iceberg's official maintenance procedures and the evolving V4 spec proposals. Understanding the problem requires understanding the structure first.

How Iceberg metadata is structured

Every Iceberg table has a metadata tree with four distinct layers. Understanding this tree is essential because every performance problem at scale traces back to one or more layers growing beyond what the query planner can efficiently traverse.

Metadata files. The top of the tree. A metadata file is a JSON document that records the table's current schema, partition spec, sort order, default values, and — critically — the list of snapshots. Each snapshot is a pointer to a specific manifest list. The metadata file also tracks properties like write.metadata.previous-versions-max, which controls how many prior metadata files are retained. On tables with frequent schema changes or property updates, the metadata file itself can grow to tens of megabytes, but this is rarely the primary bottleneck.

Snapshots and manifest lists. Each commit to an Iceberg table creates a new snapshot. A snapshot is a pointer to a manifest list file, an Avro file that enumerates all the manifest files comprising that snapshot's view of the table. The manifest list also stores a summary for each manifest: the partition spec it was written under, lower and upper bounds for each partition field, and counts of added, existing, and deleted files and rows. These summaries allow the planner to skip entire manifests without opening them, but only if the manifests are well-organized, a point that becomes critical at scale.

Manifest files. The core of the metadata tree. Each manifest is an Avro file that contains one row per data file or delete file. Each row records the file's path on storage, its partition tuple, file size, record count, column-level statistics (lower bounds, upper bounds, null counts, NaN counts per column), and split offsets. Manifests are immutable — once written, they are never modified. New appends create new manifests; compaction creates new manifests that replace old ones. A manifest typically tracks hundreds to thousands of data files, and its size ranges from a few hundred kilobytes to tens of megabytes depending on how many files and columns the table has.

Data file entries. The leaves of the metadata tree. Each entry in a manifest corresponds to a single Parquet (or ORC, or Avro) data file on object storage. The entry does not contain the data itself — it contains the metadata about the data: where the file lives, what partition it belongs to, and the column-level statistics that enable file-level data skipping. For a table with 100 columns, each data file entry carries statistics for all 100 columns — lower bound, upper bound, null count, and NaN count. That is 400 statistics values per file. Multiply by a million files and the statistics alone occupy significant memory during planning.

The traversal path for query planning is: read the current metadata file → find the latest snapshot → open the manifest list → read each manifest file → evaluate each data file entry's partition values and column statistics against the query predicate → produce the list of files to scan. Every layer in this tree is a potential bottleneck at sufficient scale.
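The two pruning stages in that traversal can be sketched with a toy in-memory model. This is a minimal illustration, not Iceberg's planner: the class names, the integer value ranges, and the assumption that the filter column is also the partition column are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataFileEntry:
    path: str
    lower: int  # lower bound of the filter column in this file
    upper: int  # upper bound of the filter column in this file


@dataclass
class Manifest:
    partition_lower: int  # smallest partition value covered (from the manifest list summary)
    partition_upper: int  # largest partition value covered (from the manifest list summary)
    entries: List[DataFileEntry]


def plan_scan(manifests: List[Manifest], lo: int, hi: int) -> List[str]:
    """Return paths of files whose value range may overlap the filter [lo, hi]."""
    selected = []
    for m in manifests:
        # Stage 1: skip the whole manifest using only its summary.
        if m.partition_upper < lo or m.partition_lower > hi:
            continue
        # Stage 2: the manifest had to be opened; test every entry's bounds.
        for e in m.entries:
            if e.upper >= lo and e.lower <= hi:
                selected.append(e.path)
    return selected


manifests = [
    Manifest(1, 10, [DataFileEntry("a.parquet", 1, 5), DataFileEntry("b.parquet", 6, 10)]),
    Manifest(11, 20, [DataFileEntry("c.parquet", 11, 20)]),
]
print(plan_scan(manifests, 7, 12))  # ['b.parquet', 'c.parquet']
```

Stage 1 costs nothing per skipped manifest, while stage 2 costs one object read plus one comparison per entry, which is why manifest organization matters more than raw file count.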

Why metadata grows

Metadata growth is not a bug — it is a direct consequence of how Iceberg works. The properties that make Iceberg powerful (immutable files, snapshot isolation, schema evolution, column-level statistics) are the same properties that cause metadata to accumulate. Understanding the growth drivers is the first step toward controlling them.

Streaming writes create many small manifests. Every commit creates at least one new manifest file. A streaming pipeline committing every 60 seconds generates 1,440 manifests per day per table. After a month without consolidation, that single table has 43,200 manifests. Each manifest may track only one or a handful of data files — a wildly inefficient ratio. The manifest layer was designed for batch workloads where each commit adds thousands of files in a single manifest. Streaming inverts this assumption: many commits, few files each, and the manifest count grows linearly with commit frequency.
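The arithmetic behind those figures is simple enough to check directly. The sketch below assumes exactly one new manifest per commit and no consolidation; the function name is illustrative.

```python
SECONDS_PER_DAY = 86_400


def manifests_accumulated(commit_interval_seconds: int, days: int,
                          manifests_per_commit: int = 1) -> int:
    """Manifests created over `days` if nothing merges or rewrites them."""
    commits_per_day = SECONDS_PER_DAY // commit_interval_seconds
    return commits_per_day * days * manifests_per_commit


print(manifests_accumulated(60, 1))   # 1440
print(manifests_accumulated(60, 30))  # 43200
```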

Each commit adds a snapshot. Snapshots provide time-travel capability and rollback safety, but each snapshot anchors its entire manifest list and all referenced data files in storage. A table receiving 200 commits per day accumulates 6,000 snapshots in a month. The metadata file must enumerate all retained snapshots, the catalog must track them, and any metadata operation that scans the snapshot list (like finding the current snapshot, or traversing the history) pays a cost proportional to the total count.

Each file gets statistics entries. Column-level statistics are what make Iceberg's data skipping work — without them, the planner cannot skip files whose data range does not overlap the query filter. But statistics come at a cost. A table with 200 columns stores 800 statistics values (lower bound, upper bound, null count, NaN count) per data file entry in the manifest. A manifest tracking 1,000 files for a 200-column table stores 800,000 statistics values. When the planner reads this manifest into memory to evaluate predicates, all 800,000 values must be deserialized and held in the coordinator's heap — even if the query filters on only two columns.

Partition evolution compounds the problem. When a table's partition spec changes (for example, moving from daily to hourly partitioning), Iceberg retains manifests written under both specs. Old manifests use the old spec; new manifests use the new one. The planner must understand both specs and apply the appropriate predicate transformation for each manifest. This works correctly, but it doubles the manifest population during the transition period and can leave legacy manifests permanently fragmented if they are never rewritten.

Compaction creates replacement manifests. Data file compaction merges small files into larger ones, but the metadata side effect is that old manifests (tracking the input files) are replaced by new manifests (tracking the output files). If compaction runs on a subset of partitions, the result is a mix of old and new manifests — some aligned with the current file layout, some referencing files that no longer exist in the latest snapshot. Without a follow-up manifest rewrite, this fragmentation persists.

What metadata bloat looks like in production

Metadata bloat does not announce itself with an error message. It manifests as a gradual degradation that teams often misattribute to engine problems, network issues, or data volume growth. Recognizing the symptoms early is critical.

Planning times drift from milliseconds to minutes. The most visible symptom. A query that planned in 200 milliseconds six months ago now takes 45 seconds before the first data file is opened. The data volume may have only doubled, but the manifest count has increased tenfold. Planning time scales with manifest count and total statistics volume, not with raw data size. A 500 TB table with 50 well-organized manifests plans faster than a 50 TB table with 5,000 fragmented manifests.

Coordinator OOM failures. The query coordinator (Trino's coordinator node, Spark's driver, Flink's job manager) must read all manifests into memory during planning. Each manifest is deserialized from Avro into in-memory data structures — file paths, partition tuples, statistics arrays. On a table with 10,000 manifests averaging 500 file entries each, the coordinator must hold metadata for 5 million files simultaneously. With 100 columns of statistics per file, this can consume 20–40 GB of heap memory. Coordinators sized for the data volume — not the metadata volume — crash with OutOfMemoryError during planning, before any executor has processed a single row.
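A back-of-the-envelope estimate shows how the heap figure arises. The sketch below counts four statistics values per column per file, as described earlier, and assumes 10 to 20 bytes of heap per deserialized value. That per-value cost is an assumption chosen for illustration, not a measured number, and real overhead depends on column types and the engine's object model.

```python
def planning_heap_gb(manifests: int, entries_per_manifest: int, columns: int,
                     bytes_per_value: int, stats_per_column: int = 4) -> float:
    """Rough heap needed to hold all file-level statistics at once, in GB."""
    values = manifests * entries_per_manifest * columns * stats_per_column
    return values * bytes_per_value / 1e9


# 10,000 manifests x 500 entries x 100 columns x 4 stats = 2 billion values
print(planning_heap_gb(10_000, 500, 100, bytes_per_value=10))  # 20.0
print(planning_heap_gb(10_000, 500, 100, bytes_per_value=20))  # 40.0
```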

S3 GET explosion. Every manifest file is a separate object on S3 (or GCS, or ADLS). Reading 5,000 manifests means 5,000 S3 GET requests during planning. At S3's typical first-byte latency of 50–100ms per GET, sequential manifest reading takes 250–500 seconds. Even with parallelism, the sheer number of GET requests creates pressure on S3 rate limits (5,500 GET requests per second per prefix). Tables whose manifests all sit under a single prefix can trigger S3 throttling (HTTP 503 SlowDown errors), adding retry overhead on top of the already-excessive latency.
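The latency arithmetic can be sketched as follows. The model is deliberately crude: it assumes every GET takes the same fixed latency and that parallel requests complete in lockstep waves, and it ignores throttling and transfer time. The parallelism of 64 is an arbitrary example value.

```python
import math


def manifest_fetch_seconds(manifest_count: int, latency_ms: int,
                           parallelism: int = 1) -> float:
    """Wall-clock time to fetch all manifests with a fixed per-GET latency."""
    waves = math.ceil(manifest_count / parallelism)
    return waves * latency_ms / 1000


print(manifest_fetch_seconds(5000, 50))        # 250.0
print(manifest_fetch_seconds(5000, 100))       # 500.0
print(manifest_fetch_seconds(5000, 100, 64))   # 7.9
```

Parallelism hides most of the latency, but the request count itself is unchanged, so the cost and throttling exposure remain.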

Metadata files outgrow the data they describe. In extreme cases, metadata volume exceeds 1% of the data volume — sometimes dramatically. A production deployment running streaming IoT ingestion reported metadata reaching 5 TB for a table whose actual data was smaller. The metadata file JSON alone was hundreds of megabytes. The manifest list referenced thousands of manifests, each tracking a handful of small files created by per-minute commits. The table was functional — queries returned correct results — but planning a simple aggregation triggered OOM errors on the coordinator because the system tried to materialize file-level statistics for millions of files.

Catalog latency increases. REST catalogs, Glue, Nessie, and Polaris all must read the current metadata file to resolve table state. When the metadata JSON grows to tens of megabytes (from thousands of retained snapshots and accumulated properties), catalog operations slow down. A DESCRIBE TABLE that should take 50 milliseconds takes 2 seconds. A REFRESH TABLE that should be instant requires downloading and parsing a large JSON blob. This affects every engine connected to the catalog, not just the one running the heavy query.

LakeOps Dashboard: lake-wide health overview showing optimization activity, cost savings, and health distribution across Critical, Warning, and Healthy tables — the single view that reveals metadata degradation across the lake.

Iceberg's built-in mechanisms for metadata control

Iceberg is not unaware of the metadata growth problem. The spec and the standard libraries include several mechanisms for controlling metadata size. The challenge is that these mechanisms require explicit invocation and careful sequencing — they do not run automatically.

Manifest merging on commit

When a commit creates a new snapshot, Iceberg can optionally merge small manifests together. The commit.manifest-merge.enabled property (default: true) controls this behavior. When enabled, the commit path checks whether the new snapshot's manifest list contains manifests that are small enough to benefit from merging. If so, it reads the entries from multiple small manifests, writes them into a single larger manifest, and references the merged manifest in the new snapshot's manifest list.

The target manifest size is controlled by commit.manifest.target-size-bytes (default: 8 MB). Manifests smaller than this threshold are candidates for merging. The minimum count of manifests to merge is controlled by commit.manifest.min-count-to-merge (default: 100).

Manifest merging on commit reduces manifest proliferation for streaming workloads, but it has limitations. It runs inline on the commit path, not in the background, so it adds commit latency that can conflict with streaming latency SLAs. It packs manifests together by size once the merge threshold is reached rather than re-clustering entries by partition, so it does not repair historically fragmented manifests. And it uses the table's default merge parameters, which may not be optimal for all partition layouts. For tables with severe manifest fragmentation from historical writes, commit-time merging is insufficient, and explicit manifest rewriting is required.

Manifest rewriting procedure

The rewrite_manifests procedure reads all manifests in the current snapshot, consolidates them into optimally sized manifests aligned with partition boundaries, and commits a new snapshot with the rewritten manifests. This is a metadata-only operation — no data files are read or rewritten. It is fast relative to data file compaction and delivers outsized planning-time improvements.

sql

-- Rewrite manifests to consolidate and align with partitions
CALL catalog.system.rewrite_manifests(
  table => 'db.events'
);

The impact is direct and measurable. A table with 2,000 fragmented manifests — each tracking 10–50 files scattered across multiple partitions — might consolidate to 40 manifests, each tracking 500–1,000 files within a contiguous partition range. Planning time drops from 8 seconds to 200 milliseconds. The only cost is reading and rewriting the manifest Avro files, which typically completes in seconds to low minutes even for large tables.

Manifest rewriting should always run after data file compaction. Compaction changes the file layout — it removes small input files and creates larger output files. Manifests track files. If you rewrite manifests before compaction, the carefully organized manifests become stale the moment compaction changes the underlying files. The correct sequence is always: expire snapshots → remove orphan files → compact data files → rewrite manifests.
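One way to keep that ordering from drifting in a cron job is to generate the calls from a single function. The sketch below only builds SQL strings for the four Iceberg Spark procedures in the order stated above; the function name is hypothetical, the procedures are called with default options, and actually executing the statements (for example through a Spark session) is left out.

```python
from typing import List


def maintenance_calls(catalog: str, table: str, older_than: str,
                      retain_last: int) -> List[str]:
    """Return the maintenance statements in the order they should run."""
    return [
        # 1. Drop old snapshots so their files become unreferenced.
        f"CALL {catalog}.system.expire_snapshots(table => '{table}', "
        f"older_than => TIMESTAMP '{older_than}', retain_last => {retain_last})",
        # 2. Delete files that no snapshot references any more.
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}')",
        # 3. Compact small data files into larger ones.
        f"CALL {catalog}.system.rewrite_data_files(table => '{table}')",
        # 4. Rewrite manifests last, against the post-compaction layout.
        f"CALL {catalog}.system.rewrite_manifests(table => '{table}')",
    ]


for stmt in maintenance_calls("catalog", "db.events", "2026-06-21 00:00:00", 10):
    print(stmt)
```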

LakeOps enforces this sequencing automatically. After its Rust-based compaction engine completes a compaction pass, it immediately rewrites manifests against the new file layout — ensuring manifests always reflect the current state of the data. This eliminates the gap between compaction and manifest rewriting that causes fragmentation to persist in manual or cron-based maintenance schedules.

Snapshot expiration

Snapshot expiration removes snapshots older than a retention threshold and dereferences the manifest lists and data files they exclusively held. This is the most important metadata control mechanism because snapshots are the multiplier — each retained snapshot anchors its entire manifest tree, preventing garbage collection of manifests and data files that newer snapshots have superseded.

sql

-- Expire snapshots older than 7 days, keeping at least 10
CALL catalog.system.expire_snapshots(
  table => 'db.events',
  older_than => TIMESTAMP '2026-06-21 00:00:00',
  retain_last => 10
);

On a table with 6,000 accumulated snapshots, expiring all but the last 10 can reduce the metadata file size from 50 MB to 500 KB. The manifest list for the current snapshot becomes the only manifest list that matters. Manifests exclusively referenced by expired snapshots become candidates for garbage collection. The metadata tree shrinks dramatically, and catalog operations speed up.

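The retain_last parameter is essential. It guarantees that the most recent N snapshots survive regardless of age, so an aggressive older_than value can never strip a table of its rollback targets. The older_than window is what protects readers: a long-running query pinned to a snapshot fails if that snapshot is expired and its data files are garbage-collected mid-read, so the window must exceed your longest query runtime. For most production environments, retaining 5–10 snapshots with a 5–7 day age window covers the realistic rollback and debugging window.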
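How the two parameters combine can be shown with a simplified model: a snapshot is expired only if it is older than the cutoff and is not among the most recent retain_last snapshots. This sketch treats the table as a single linear history and ignores branches and tags, which also pin snapshots in a real table.

```python
from datetime import datetime
from typing import List


def snapshots_to_expire(snapshot_times: List[datetime], older_than: datetime,
                        retain_last: int) -> List[datetime]:
    """Pick snapshots older than the cutoff that are not in the newest N."""
    ordered = sorted(snapshot_times)
    protected = set(ordered[-retain_last:]) if retain_last > 0 else set()
    return [t for t in ordered if t < older_than and t not in protected]


times = [datetime(2026, 6, d) for d in (1, 10, 20, 22, 23)]
cutoff = datetime(2026, 6, 21)

# retain_last=3 protects June 20 even though it is older than the cutoff.
print([t.day for t in snapshots_to_expire(times, cutoff, retain_last=3)])  # [1, 10]
print([t.day for t in snapshots_to_expire(times, cutoff, retain_last=1)])  # [1, 10, 20]
```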

LakeOps automates snapshot expiration as the first step in its maintenance pipeline, running it continuously across all connected tables. When snapshot count exceeds configurable thresholds, expiration fires immediately — not on a fixed schedule. This prevents the metadata chain from growing unbounded during periods of heavy write activity.

Metadata cleanup

Beyond snapshots and manifests, Iceberg tables accumulate metadata file versions. Each commit writes a new metadata JSON file, and the table property write.metadata.previous-versions-max (default: 100) controls how many prior versions are tracked. Older files are deleted only when write.metadata.delete-after-commit.enabled is set to true (default: false). With both in place, a table with 10,000 commits keeps only the most recent 100 metadata files. However, if deletion is not enabled, if the limit is set too high, or if neither was configured during the table's early life, hundreds or thousands of metadata JSON files can accumulate on storage, each carrying its own snapshot list.

Cleaning up old metadata files is a storage-level operation: list the metadata directory, compare against the files referenced by the current and retained metadata versions, and delete the rest. This is conceptually simple but operationally dangerous — deleting a metadata file that a concurrent reader or catalog is using can corrupt the table state from that reader's perspective. LakeOps handles this as part of its orphan file cleanup pass, with safety checks that ensure only truly unreferenced metadata files are removed.

The V4 spec: architectural relief for metadata scale

The mechanisms above are operational patches — they manage growth after it happens. The Iceberg V4 spec proposals attack the root cause by rearchitecting the metadata tree itself. Foundational V4 types were merged into the Iceberg codebase in early 2026, Parquet-based manifest writers are under active development, and the root manifest integration is progressing through community review. While V4 is not yet formally released, the implementation is well underway — these proposals represent the most significant metadata changes in Iceberg's history.

Adaptive metadata trees

The headline V4 proposal replaces the rigid two-level structure (manifest list → manifest files) with a dynamic, adaptive tree built around a Root Manifest. The Root Manifest serves as the single entry point for each snapshot and can contain entries at different levels depending on the workload.

For small writes (streaming commits that add one or a few files), entries are inlined directly into the Root Manifest. There is no separate manifest file created — the commit writes a single Parquet node containing the new file entries. This achieves O(1) commit overhead, eliminating the manifest-per-commit proliferation that drives metadata bloat on streaming tables.

For large batch writes that add thousands of files, the Root Manifest points to leaf manifests exactly as the current architecture does. The tree adapts to the workload shape: streaming tables keep hot writes near the root for speed; batch tables use deeper trees for organization. One spec, two operating modes.

As the root fills with inlined entries, background maintenance rebalances them into leaf manifests. Writers can choose when to pay this rebalancing cost — immediately at commit time or deferred to a maintenance window. This gives streaming pipelines sub-second commit latency without sacrificing the organizational benefits of manifest-based metadata for large tables.

Parquet-based manifests

V4 proposes switching manifest storage from Apache Avro to Apache Parquet. This is not a cosmetic change — it fundamentally alters how planners interact with metadata.

With Avro manifests, the planner must deserialize the entire manifest to access any column's statistics. Reading the lower bounds for a single filter column requires deserializing all 200 columns' statistics for every file entry. Memory usage is proportional to the full manifest size, regardless of query selectivity.

With Parquet manifests, the planner can use columnar projection — reading only the statistics columns needed for the query's filter predicates. A query filtering on event_date reads only the event_date lower-bound and upper-bound columns from the manifest, skipping the statistics for the other 199 columns entirely. Memory usage drops by 99% for narrow filters on wide tables. Column pruning and predicate pushdown — the same techniques that make Parquet fast for data — now apply to metadata itself.
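The size of the saving follows directly from the ratio of filtered columns to total columns. The sketch below assumes every column's statistics cost the same to read, which is a simplification, and uses a two-column filter on a 200-column table as the example.

```python
def stats_skipped_percent(filter_columns: int, total_columns: int) -> float:
    """Share of per-column statistics a projecting reader never loads."""
    return (total_columns - filter_columns) * 100 / total_columns


print(stats_skipped_percent(2, 200))  # 99.0
print(stats_skipped_percent(5, 2000))  # 99.75
```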

This change is particularly impactful for wide tables common in ML and feature store workloads, where tables may have 500–2,000 columns but queries typically filter on 2–5. Current Avro manifests force the planner to load statistics for all columns; Parquet manifests let it read only what it needs.

Single-file commits

In the current V2/V3 architecture, every commit writes at least three files: a new metadata JSON, a new manifest list, and one or more new manifest files. For a streaming pipeline committing every second, this means three or more S3 PUT operations per commit — 259,200 file writes per day for metadata alone.

V4's single-file commits collapse this into one file write per commit. The Root Manifest contains everything: the updated metadata, the manifest references, and the new file entries. Combined with server-side catalog commits (where the catalog stores small commits directly in its database rather than writing to object storage), V4 can achieve zero-file commits for trivial operations — the commit is a database row update, not a storage write.
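The write counts compare as follows, using the three-files-per-commit figure for V2/V3 and one file per commit for the proposed V4 layout. The function name is illustrative.

```python
SECONDS_PER_DAY = 86_400


def metadata_writes_per_day(commit_interval_seconds: int,
                            files_per_commit: int) -> int:
    """Metadata objects written per day at a fixed commit interval."""
    return (SECONDS_PER_DAY // commit_interval_seconds) * files_per_commit


print(metadata_writes_per_day(1, files_per_commit=3))  # 259200 (V2/V3 minimum)
print(metadata_writes_per_day(1, files_per_commit=1))  # 86400 (single-file commit)
```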

The compound effect is substantial. Fewer metadata files means fewer S3 GETs during planning, fewer objects to manage during garbage collection, and a simpler metadata tree to traverse. For tables at petabyte scale with millions of historical commits, this architectural change prevents metadata growth from becoming proportional to commit frequency.

Colocated deletion vectors

V3 introduced deletion vectors, stored in Puffin files, as the replacement for position delete files. V4 proposes going a step further and embedding the deletion vector directly in the data file's manifest entry. Instead of maintaining separate delete files that every reader must reconcile, deleted rows are tracked as a bitmask within the manifest entry itself. This eliminates the delete-file-to-data-file reconciliation overhead that degrades read performance on tables with frequent row-level updates, and it removes an entire class of metadata objects (delete files and their manifest entries) from the metadata tree.

Practical strategies for today

V4 promises architectural relief, but production tables need help now. These strategies work with the current V2/V3 spec and have been validated across production deployments managing hundreds of petabytes.

Manifest rewriting schedules

Do not wait for manifest fragmentation to cause planning problems — schedule manifest rewriting proactively. The right frequency depends on write patterns:

Streaming tables (continuous commits): rewrite manifests every 4–6 hours. Streaming creates manifests fastest, and planning degradation is most visible on tables that serve interactive queries alongside continuous ingest.
Frequent batch tables (hourly or more): rewrite manifests daily, immediately after compaction.
Infrequent batch tables (daily or weekly loads): rewrite manifests after each load cycle.

Condition-based triggers outperform fixed schedules. Instead of rewriting manifests every 6 hours regardless of state, trigger a rewrite when manifest count exceeds a threshold (e.g., 200 manifests per snapshot) or when the average entries per manifest drops below a minimum (e.g., fewer than 100 file entries per manifest). This avoids wasted runs on tables that are already well-organized.
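A condition-based trigger of this kind is a few lines of logic. The sketch below uses the example thresholds from the paragraph above as defaults; the function name is hypothetical, and the single-manifest guard is an added assumption so that tiny tables with nothing to consolidate never trigger.

```python
def should_rewrite_manifests(manifest_count: int, total_file_entries: int,
                             max_manifests: int = 200,
                             min_avg_entries: int = 100) -> bool:
    """Trigger a rewrite on too many manifests or too few entries per manifest."""
    if manifest_count <= 1:
        return False  # nothing to consolidate
    avg_entries = total_file_entries / manifest_count
    return manifest_count > max_manifests or avg_entries < min_avg_entries


print(should_rewrite_manifests(2000, 50_000))  # True: 2,000 manifests, 25 entries each
print(should_rewrite_manifests(40, 30_000))    # False: 40 manifests, 750 entries each
print(should_rewrite_manifests(1, 20))         # False: single manifest
```

The inputs are available from the table's manifests metadata table, so the check can run far more often than the rewrite itself.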

LakeOps monitors manifest count per snapshot continuously across all connected tables. When fragmentation exceeds configurable thresholds — for example, more than 200 manifests or fewer than 50 entries per manifest — it triggers manifest consolidation automatically, immediately after compaction completes. This condition-based approach eliminates 95% of the unnecessary manifest rewrites that fixed schedules produce while catching every instance of genuine fragmentation.

LakeOps Compaction: per-table optimization settings for compaction strategy, target file size, snapshot expiration, and orphan cleanup, tunable per table with lake-wide defaults.

Target manifest sizes

The commit.manifest.target-size-bytes property controls how large manifests grow before they are split or merged. The default of 8 MB is reasonable for many workloads, but the optimal value depends on table shape:

Narrow tables (10–30 columns): each manifest entry is small because per-file statistics are compact. Target 16–32 MB manifests to reduce manifest count without creating manifests that are expensive to parse.
Wide tables (100+ columns): each manifest entry is large because of per-column statistics. Target 8–16 MB to keep individual manifests parseable without consuming excessive memory.
Very wide tables (500+ columns): consider using write.metadata.metrics.default to limit which columns carry statistics. Not every column needs lower/upper bounds — restrict statistics to columns that appear in query filters. This reduces manifest entry size by 80–90% and dramatically cuts planning memory.

The write.metadata.metrics.column.* properties allow per-column statistics configuration. Setting write.metadata.metrics.column.description to none for a text column that is never filtered eliminates its statistics from every manifest entry — across millions of files, the cumulative savings are substantial.

Partition specs that limit manifest proliferation

Partition design directly affects manifest count because every data file belongs to exactly one partition, and every data file needs its own manifest entry. A table partitioned by day(event_timestamp) needs at least one data file, and therefore one manifest entry, per day of data. The same table partitioned by hour(event_timestamp) needs at least 24 per day, and suffers up to 24x more manifest fragmentation over time.

Over-partitioning is the single largest driver of manifest bloat at scale. A table partitioned by (region, day, hour, user_type) with 10 regions, 365 days, 24 hours, and 5 user types has 438,000 logical partitions. Each partition creates manifest entries, and many will contain only a few small files. The manifest layer must track all of them.

Coarser partitioning with sort-based data skipping produces better results at petabyte scale. Partition by day(event_timestamp) and sort by (region, user_type) within each partition. The partition count stays manageable (365 per year), manifest entries cluster efficiently by day, and the sort order provides data skipping on the sub-partition columns through tight per-file min/max statistics. This approach keeps manifest count proportional to the partition count (hundreds) rather than the combinatorial explosion of multi-column partitioning (hundreds of thousands).
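The combinatorial gap between the two designs is easy to verify. The sketch multiplies the cardinality of each partition column, using the example values from the text; it gives the upper bound on logical partitions, assuming every combination actually occurs.

```python
from math import prod
from typing import List


def partition_count(cardinalities: List[int]) -> int:
    """Upper bound on logical partitions for a multi-column partition spec."""
    return prod(cardinalities)


# (region, day, hour, user_type) for one year of data
print(partition_count([10, 365, 24, 5]))  # 438000
# day only, with region and user_type handled by sort order
print(partition_count([365]))             # 365
```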

LakeOps Layout Simulations: model the impact of partition and sort-order changes on manifest count, file distribution, and planning performance before applying them to production tables.

Snapshot retention policies

Snapshot accumulation is the easiest metadata problem to prevent and the most commonly neglected. Set explicit retention policies on every production table:

sql

-- Configure automatic snapshot expiration properties
ALTER TABLE db.events SET TBLPROPERTIES (
  'history.expire.max-snapshot-age-ms' = '604800000',  -- 7 days
  'history.expire.min-snapshots-to-keep' = '10'
);

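These properties set table-level defaults for snapshot expiration; they do not expire anything by themselves. The expire_snapshots procedure falls back to them when older_than and retain_last are omitted, and some catalogs and maintenance services honor them automatically. Wherever nothing runs expiration for you, run expire_snapshots explicitly on a schedule, daily at minimum for tables with frequent writes.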

The interaction between snapshot expiration and time travel is the key tradeoff. Expiring snapshots beyond 7 days means you cannot query the table as it existed 10 days ago. For most production workloads, 5–7 days of time travel is sufficient. Compliance-heavy environments may require 30 days, in which case the metadata overhead is an accepted cost. The critical mistake is having no expiration policy at all — letting snapshots accumulate indefinitely is the fastest path to metadata bloat.

Statistics management

Column statistics (stored in Puffin files for table-level statistics and in manifest entries for file-level statistics) enable the data skipping that makes Iceberg queries fast. But statistics management requires active attention at scale.

Generate statistics for critical columns. Table-level statistics (NDV — number of distinct values, histograms) stored in Puffin files help cost-based optimizers choose better join strategies and aggregation plans. Generate them for columns frequently used in joins and GROUP BY clauses:

sql

-- Generate table-level statistics
ANALYZE TABLE db.events COMPUTE STATISTICS FOR COLUMNS
  event_date, user_id, region;

Refresh statistics after compaction. Compaction rewrites data files, which changes the file-level statistics in manifests. But table-level statistics (Puffin files) are not automatically regenerated. Stale statistics mislead the optimizer — it may choose a hash join when a broadcast join would be faster, or estimate cardinalities incorrectly. Refresh statistics after every major compaction pass.

LakeOps generates and refreshes column statistics as part of its post-compaction pipeline. After file compaction and manifest rewriting complete, it identifies columns that benefit from updated statistics based on query engine telemetry — which columns are filtered, joined, or grouped — and regenerates statistics for those columns. This ensures the optimizer always works with current data distributions, not stale estimates from pre-compaction file layouts.

Limit statistics scope on wide tables. For tables with 200+ columns, storing full statistics (lower bound, upper bound, null count, NaN count) for every column on every file is wasteful. Most queries filter on 5–10 columns. Configure the table to store statistics only for those columns:

sql

-- Restrict statistics to columns used in filters
ALTER TABLE db.events SET TBLPROPERTIES (
  'write.metadata.metrics.default' = 'none',
  'write.metadata.metrics.column.event_date' = 'full',
  'write.metadata.metrics.column.user_id' = 'full',
  'write.metadata.metrics.column.region' = 'full'
);

This reduces manifest entry size by 90–95% on wide tables, cutting planning memory proportionally. The tradeoff is that filters on columns without statistics cannot skip files — but if those columns are rarely filtered, the tradeoff is overwhelmingly positive.
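For tables where the filter-column list changes over time, the property set can be generated from a list instead of maintained by hand. The sketch below is a hypothetical helper that builds the same ALTER TABLE statement shown above; the property names are the ones used in this section, while the function names and the source of the column list are assumptions.

```python
from typing import Dict, List


def metrics_properties(filter_columns: List[str], mode: str = "full") -> Dict[str, str]:
    """Disable statistics by default, then enable them per filter column."""
    props = {"write.metadata.metrics.default": "none"}
    for col in filter_columns:
        props[f"write.metadata.metrics.column.{col}"] = mode
    return props


def alter_table_sql(table: str, props: Dict[str, str]) -> str:
    """Render the properties as a single ALTER TABLE statement."""
    body = ",\n".join(f"  '{k}' = '{v}'" for k, v in props.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES (\n{body}\n);"


props = metrics_properties(["event_date", "user_id", "region"])
print(alter_table_sql("db.events", props))
```

The new settings apply to files written after the change, so existing manifest entries keep their old statistics until the data files are rewritten.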

LakeOps Tables: health classification across the lake with file counts, manifest state, snapshot depth, and partition details, surfacing metadata health issues before planning degrades.

Monitoring metadata health

You cannot control what you do not measure. These are the metrics that reveal metadata health problems before they become query performance crises. A dedicated lakehouse observability layer that tracks these signals continuously is essential for teams managing more than a handful of tables.

Manifest count per snapshot

The number of manifest files in the current snapshot is the single most important metadata health metric. It directly determines how many S3 GETs the planner must execute and how many Avro files it must deserialize during planning.

Healthy range: at most 1 manifest per 50–100 data files. A table with 5,000 data files should have no more than 50–100 manifests, and fewer is better. If it has 2,000 manifests, fragmentation is severe and manifest rewriting is overdue.

Track this metric over time, not as a point-in-time measurement. A manifest count that grows by 50 per day without corresponding data volume growth indicates that manifest merging on commit is not keeping pace with write frequency. This is common on streaming tables where commits happen faster than the merge threshold allows consolidation.

Average entries per manifest

Complements manifest count by measuring manifest efficiency. A manifest tracking 5 file entries is wasting the overhead of a separate Avro file for trivial content. A manifest tracking 2,000 file entries is dense and efficient.

Healthy range: 200–2,000 entries per manifest, depending on table width. Narrow tables can support higher entry counts per manifest because each entry is smaller. Wide tables hit manifest size limits sooner due to per-column statistics, resulting in lower entry counts.

When the average drops below 50 entries per manifest, the table is paying significant per-manifest overhead for minimal content — manifest rewriting will deliver large planning improvements.
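The thresholds above can be turned into a simple classifier. This is a sketch under stated assumptions: the function names and the "watch" label for the band between 50 and 200 entries are invented for this example (the text only names the healthy range and the sub-50 danger zone), and the per-manifest entry counts are assumed to be collected elsewhere.

```python
def average_entries(entry_counts):
    """Mean number of file entries across the manifests of one snapshot."""
    return sum(entry_counts) / len(entry_counts)


def classify_density(avg_entries):
    """Map an average entry count to a coarse health label.

    Below 50: per-manifest overhead dominates, so rewriting pays off.
    50 up to 200: under the healthy range (the "watch" label is an assumption).
    200 and above: within or beyond the healthy range described in the text.
    """
    if avg_entries < 50:
        return "rewrite"
    if avg_entries < 200:
        return "watch"
    return "healthy"
```

For a wide table you may want to lower the 200-entry floor, since per-column statistics make each entry larger.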

Planning time trends

Planning time is the end-to-end metric that captures all metadata health issues. It measures the time from when the engine receives a query to when it produces the list of files to scan. This includes metadata file retrieval, manifest list parsing, manifest file reading, statistics evaluation, and partition pruning.

Measure planning time for representative queries on a regular cadence — not just when users complain. A query that planned in 300 milliseconds last month and plans in 3 seconds this month has a 10x regression even though both times are technically under most SLA thresholds. The trend matters more than the absolute value.
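One way to make the trend explicit is to compare the median planning time of a recent window against a baseline window. The sketch below assumes you already record planning times in milliseconds for a representative query; the function names and the default 2x alert factor are illustrative choices, not a standard.

```python
from statistics import median


def regression_factor(baseline_ms, current_ms):
    """Ratio of current median planning time to the baseline median."""
    return median(current_ms) / median(baseline_ms)


def planning_regressed(baseline_ms, current_ms, factor_threshold=2.0):
    """True when planning time has grown by more than the given factor."""
    return regression_factor(baseline_ms, current_ms) > factor_threshold
```

Using the median rather than the mean keeps a single slow outlier from triggering the alert; for the example in the text, samples around 300 ms last month and around 3 seconds this month yield a factor of 10.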

LakeOps surfaces planning time trends as part of its Health Insights system. When planning time increases by more than 2x for any table, it generates a proactive alert: 'HIGH: excessive manifests' or 'WARNING: planning time increasing' — identifying the specific metadata cause (manifest fragmentation, snapshot accumulation, statistics bloat) and recommending the appropriate remediation. This catches degradation weeks before it impacts production queries.

Snapshot chain depth

The total number of retained snapshots directly impacts metadata file size and catalog operation latency. Track the current count against your retention policy — if the policy says 'retain 7 days' but the table has 15,000 snapshots spanning 90 days, expiration is not running or not keeping up.

Healthy range: daily_commits × retention_days, plus a small buffer. A table receiving 200 commits per day with a 7-day retention policy should have approximately 1,400 snapshots. If it has 10,000, expiration is behind.
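The healthy-range arithmetic is simple enough to encode directly. In this sketch the function names and the 10% buffer are assumptions chosen for illustration; the text only says "a small buffer".

```python
def expected_snapshots(daily_commits, retention_days, buffer_percent=10):
    """Upper bound on retained snapshots implied by the retention policy."""
    baseline = daily_commits * retention_days
    return baseline + baseline * buffer_percent // 100


def expiration_is_behind(actual_snapshots, daily_commits, retention_days,
                         buffer_percent=10):
    """True when the table holds more snapshots than the policy allows."""
    limit = expected_snapshots(daily_commits, retention_days, buffer_percent)
    return actual_snapshots > limit
```

For 200 commits per day and 7 days of retention, the baseline is 1,400 snapshots and the buffered limit is 1,540, so a table holding 10,000 snapshots is flagged.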

Metadata file size

The size of the current metadata JSON file indicates overall metadata health. A metadata file under 1 MB is typical for a well-maintained table. A metadata file over 10 MB suggests excessive snapshot retention or accumulated table properties. A metadata file over 100 MB indicates a serious maintenance gap.

Monitor this by periodically listing the metadata directory and recording the size of the newest metadata file. Metadata file size grows with the number of retained snapshots, so expiring snapshots and lowering write.metadata.previous-versions-max are the primary controls.
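A minimal sketch of that check follows. It assumes the directory listing has already been fetched as (name, size_bytes, modified_timestamp) tuples, for example from an object-store list call, and that the most recently modified file ending in .metadata.json is the current one. Both the input shape and the "elevated" label for the 1–10 MB band are assumptions for this example.

```python
MB = 1024 * 1024


def current_metadata_size(listing):
    """Size in bytes of the most recently modified *.metadata.json file.

    `listing` is a list of (name, size_bytes, modified_timestamp) tuples.
    """
    candidates = [entry for entry in listing
                  if entry[0].endswith(".metadata.json")]
    newest = max(candidates, key=lambda entry: entry[2])
    return newest[1]


def classify_metadata_size(size_bytes):
    """Apply the size thresholds from the text to one metadata file."""
    if size_bytes > 100 * MB:
        return "serious"
    if size_bytes > 10 * MB:
        return "excessive"
    if size_bytes >= 1 * MB:
        return "elevated"  # band not named in the text; label is an assumption
    return "typical"
```

If your catalog exposes the current metadata location directly, reading that pointer is more reliable than inferring it from modification times.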

The correct maintenance sequence for metadata

All of the mechanisms and strategies above must be executed in the right order to be effective. The optimal sequence for metadata health is:

(1) Expire snapshots → (2) Remove orphan files → (3) Compact data files → (4) Rewrite manifests → (5) Refresh statistics

Why expire first. Snapshot expiration dereferences manifests and data files that are no longer needed. Every subsequent operation benefits from a smaller working set. If you compact before expiring, you may rewrite files that expiration would have garbage-collected — wasting compute on data no query will ever read. On tables with months of accumulated snapshots, this waste can mean compacting hundreds of gigabytes of effectively dead data.

Why orphans before compaction. Orphan file cleanup removes physical files left behind by incomplete expiration or failed writes. Running it before compaction means the compaction job starts from a clean storage footprint, with no untracked files left in partition directories.

Why compact before rewriting manifests. Compaction changes the file layout — removing input files and creating output files. Manifests track files. Rewriting manifests against a pre-compaction layout produces beautifully organized manifests that become immediately stale when compaction runs. Always compact first, then rewrite manifests against the final file layout.

Why statistics last. Table-level statistics (Puffin files) describe the data distribution in the current files. If you generate statistics before compaction, they describe files that compaction is about to replace. Generate statistics after compaction and manifest rewriting, when the file layout is stable.
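The ordering can be captured in a small helper that emits the maintenance calls in sequence. This sketch builds Spark SQL CALL statements as strings and does not execute them. It assumes the Iceberg Spark procedures expire_snapshots, remove_orphan_files, rewrite_data_files, rewrite_manifests, and compute_table_stats are available in your Iceberg version (compute_table_stats in particular exists only in recent releases). The catalog name, table name, and cutoff timestamp are placeholders.

```python
def maintenance_statements(catalog, table, older_than):
    """Return Spark SQL CALL statements in the recommended maintenance order.

    Arguments are interpolated without escaping, so pass trusted values only.
    """
    target = f"table => '{table}'"
    cutoff = f"older_than => TIMESTAMP '{older_than}'"
    return [
        # 1. Shrink the working set before anything else runs.
        f"CALL {catalog}.system.expire_snapshots({target}, {cutoff})",
        # 2. Remove files no snapshot references (procedure default age applies).
        f"CALL {catalog}.system.remove_orphan_files({target})",
        # 3. Compact data files so the file layout is final.
        f"CALL {catalog}.system.rewrite_data_files({target})",
        # 4. Rewrite manifests against the post-compaction layout.
        f"CALL {catalog}.system.rewrite_manifests({target})",
        # 5. Refresh statistics once the layout is stable.
        f"CALL {catalog}.system.compute_table_stats({target})",
    ]
```

A caller would pass each statement to spark.sql(...) in order and stop on the first failure, since later steps assume the earlier ones completed.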

LakeOps executes this full sequence as a coordinated pipeline — expire → orphan → compact → manifests → statistics — triggered by structural signals per table, not fixed schedules. Each operation's output feeds the next. This sequenced approach prevents the metadata-first, data-second ordering mistake that wastes compute in manual and cron-based maintenance systems.

LakeOps walkthrough — monitoring metadata health and triggering manifest optimization.

Putting it all together

Metadata management at petabyte scale is not a one-time fix — it is an ongoing operational discipline. The core principles are:

Prevent before you cure. Set manifest merge properties, statistics scope, and snapshot retention policies when you create the table — not after planning times degrade. Partition specs should be chosen with manifest proliferation in mind, not just query predicate alignment. Coarser partitions with sort-based data skipping produce fewer manifests and better planning performance at scale.

Monitor continuously. Track manifest count, entries per manifest, planning time, snapshot depth, and metadata file size as time-series metrics. Set alerts at thresholds that give you time to react — not at the point where users are already impacted. A manifest count trending upward at 100 per day needs attention this week, not after it reaches 5,000.

Automate the full lifecycle. Manual maintenance does not scale beyond a handful of tables. Cron-based maintenance does not adapt to table state. The right approach is condition-based triggers that fire maintenance operations when specific thresholds are crossed — manifest count too high, snapshot depth too deep, planning time increasing — with the full sequence executed in the correct order every time.
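To make the idea of condition-based triggers concrete, here is a hedged sketch that maps per-table metrics to the operations that should run, always emitted in the maintenance order described earlier. The metric names, threshold keys, and the mapping from each condition to its operations are all assumptions for this example, not a description of any specific product's logic.

```python
# Operations listed in the order they must run.
SEQUENCE = [
    "expire_snapshots",
    "remove_orphan_files",
    "compact_data_files",
    "rewrite_manifests",
    "refresh_statistics",
]


def plan_maintenance(metrics, thresholds):
    """Select operations whose trigger condition fired, in sequence order."""
    needed = set()
    if metrics["snapshots"] > thresholds["max_snapshots"]:
        # Too many snapshots: expire them, then clean up what they leave behind.
        needed.update({"expire_snapshots", "remove_orphan_files"})
    if metrics["manifests"] > thresholds["max_manifests"]:
        needed.add("rewrite_manifests")
    slow = thresholds["planning_factor"] * metrics["baseline_planning_ms"]
    if metrics["planning_ms"] > slow:
        # Planning regressed: assume manifests and statistics both need work.
        needed.update({"rewrite_manifests", "refresh_statistics"})
    return [op for op in SEQUENCE if op in needed]
```

Filtering the fixed SEQUENCE list, rather than appending operations as conditions fire, is what guarantees the correct order regardless of which thresholds were crossed.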

Prepare for V4. The adaptive metadata tree, Parquet-based manifests, single-file commits, and colocated deletion vectors will fundamentally change the metadata economics of Iceberg. Tables that are well-maintained today will transition smoothly when V4 lands. Tables with severe metadata debt will face a difficult migration. Investing in metadata health now pays dividends immediately and positions your lakehouse for the architectural shift ahead.

Iceberg at petabyte scale is a metadata management challenge as much as a data management challenge. The data files are fast by design: Parquet, columnar, compressed, and immutable. The metadata layer is what determines whether a query plans in milliseconds or minutes, whether the coordinator survives the planning phase, and whether S3 serves files or serves HTTP 503 errors. Keep the metadata lean, and the queries stay fast.

If you are ready to automate metadata health across your entire lakehouse, LakeOps handles manifest optimization, snapshot lifecycle, statistics management, and sequenced maintenance across all your catalogs. Explore the table health maintenance guide, automated maintenance solution, or the lakehouse performance solution.