Manifest & Metadata Optimization
LakeOps consolidates and rewrites manifest files so query planning stays fast at any scale. Smaller, optimized manifests mean faster planning and fewer metadata scans for Trino, Spark, Flink, and every engine that touches your lake.
What are manifest files?
In Apache Iceberg, manifest files are metadata files that track which data files belong to a table snapshot. Each manifest records:
- File paths and storage locations
- Partition values for each file
- Record counts and file sizes
- Column-level statistics (min, max, null count) for data skipping
Query engines read manifests during planning to determine which data files need to be scanned. When manifests become too numerous or undersized, planning slows down because the engine must read and parse more metadata before execution begins.
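To make the planning cost concrete, here is a minimal sketch in Python. It is a simplified model, not Iceberg's or LakeOps's actual code: the `DataFileEntry`, `Manifest`, and `plan_scan` names are hypothetical, and it assumes the engine opens every manifest (real engines can also skip whole manifests using partition summaries). It shows that the same four data files cost four metadata reads when spread across four manifests and one read when they sit in a single manifest.

```python
from dataclasses import dataclass, field


@dataclass
class DataFileEntry:
    """One row of a manifest (hypothetical, simplified shape)."""
    path: str             # file path / storage location
    partition: dict       # partition values for this file
    record_count: int
    file_size_bytes: int


@dataclass
class Manifest:
    path: str
    entries: list = field(default_factory=list)


def plan_scan(manifests, partition_filter):
    """Return (data files to scan, number of manifests read)."""
    selected = []
    manifests_read = 0
    for manifest in manifests:
        manifests_read += 1  # each manifest costs one metadata read
        for entry in manifest.entries:
            if all(entry.partition.get(k) == v for k, v in partition_filter.items()):
                selected.append(entry.path)
    return selected, manifests_read


entries = [
    DataFileEntry(f"data/f{i}.parquet",
                  {"day": "2024-01-01" if i < 2 else "2024-01-02"},
                  1000, 1 << 20)
    for i in range(4)
]
fragmented = [Manifest(f"m{i}.avro", [e]) for i, e in enumerate(entries)]
consolidated = [Manifest("m-all.avro", entries)]

files_a, reads_a = plan_scan(fragmented, {"day": "2024-01-01"})
files_b, reads_b = plan_scan(consolidated, {"day": "2024-01-01"})
print(files_a == files_b, reads_a, reads_b)  # True 4 1
```

Both layouts select the same data files; only the number of metadata reads differs.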
Why manifest optimization matters
Manifests fragment over time due to frequent small writes, compaction cycles, and schema changes. LakeOps Insights surfaces the problem:
An Excessive Manifests insight is raised when a table's manifest count exceeds the threshold. A large percentage of undersized manifests severely degrades planning performance and calls for a rewrite.
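The kind of check behind such an insight can be sketched as follows. This is an illustration only: the function name `manifest_health` and every threshold value (`max_manifests`, `target_size_bytes`, `undersized_ratio`) are assumptions chosen for the example, not LakeOps's actual thresholds.

```python
def manifest_health(manifest_sizes_bytes,
                    max_manifests=100,                   # assumed threshold
                    target_size_bytes=8 * 1024 * 1024,   # assumed target size
                    undersized_ratio=0.5):               # assumed cutoff
    """Summarize manifest count and the share of undersized manifests."""
    count = len(manifest_sizes_bytes)
    cutoff = target_size_bytes * undersized_ratio
    undersized = sum(1 for size in manifest_sizes_bytes if size < cutoff)
    return {
        "manifest_count": count,
        "undersized_fraction": undersized / count if count else 0.0,
        "excessive": count > max_manifests,
    }


# 150 tiny manifests (64 KiB) plus 50 full-size ones (8 MiB).
sizes = [64 * 1024] * 150 + [8 * 1024 * 1024] * 50
report = manifest_health(sizes)
print(report)
```

With these example inputs the table holds 200 manifests, three quarters of them undersized, so it would be a candidate for a rewrite under the assumed thresholds.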
What LakeOps optimizes
LakeOps performs manifest consolidation through the Rewrite Manifests operation, and supports additional metadata optimization capabilities:
Manifest consolidation
Merges many small manifest files into fewer, larger manifests. This reduces the number of metadata reads during query planning. The rewrite is atomic — the table transitions from old manifests to new ones in a single commit.
Impact
- Fewer manifest files mean faster metadata reads during planning
- Fewer read requests to object storage
- Lower planning latency across all connected engines
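Consolidation is essentially a packing problem: group small manifests until each group approaches a target size. The sketch below uses a simple next-fit strategy in input order. It is an assumption about how such grouping could work, not a description of the algorithm LakeOps or Iceberg uses, and `pack_manifests` is a hypothetical name.

```python
def pack_manifests(manifest_sizes, target_size):
    """Group manifest sizes in order so each group stays within target_size.

    Next-fit packing: start a new group whenever adding the next manifest
    would exceed the target. An oversized manifest gets a group of its own.
    """
    groups = []
    current, current_size = [], 0
    for size in manifest_sizes:
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


small_manifests = [3, 3, 3, 3, 3]          # sizes in MB, illustrative
groups = pack_manifests(small_manifests, target_size=8)
print(len(small_manifests), "->", len(groups))  # 5 -> 3
```

Each group would become one rewritten manifest, and the new set replaces the old one in a single commit.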
Position delete optimization
Rewrites position delete files to reduce read amplification during merge-on-read queries. Position deletes mark specific rows for deletion without rewriting the entire data file. Over time, many small delete files accumulate, causing engines to perform extra I/O during reads.
Impact
- Reduces merge-on-read overhead for delete-heavy workloads
- Consolidates many small delete files into fewer, larger ones
- Improves read performance without a full table rewrite
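A minimal sketch of the idea, assuming each position delete file is modeled as a list of `(data_file_path, row_position)` pairs. The helper names `merge_delete_files` and `read_with_deletes` are hypothetical, and real delete files are columnar files on object storage rather than Python lists.

```python
def merge_delete_files(delete_files):
    """Merge many small delete files into one sorted, de-duplicated list."""
    return sorted({pair for delete_file in delete_files for pair in delete_file})


def read_with_deletes(data_file_path, rows, deletes):
    """Merge-on-read: drop rows whose positions are marked deleted."""
    deleted = {pos for path, pos in deletes if path == data_file_path}
    return [row for pos, row in enumerate(rows) if pos not in deleted]


# Three small delete files, one with a duplicate entry.
delete_files = [
    [("a.parquet", 2)],
    [("a.parquet", 0)],
    [("b.parquet", 1), ("a.parquet", 2)],
]
merged = merge_delete_files(delete_files)
visible = read_with_deletes("a.parquet", ["r0", "r1", "r2", "r3"], merged)
print(merged)
print(visible)  # ['r1', 'r3']
```

After merging, a reader opens one delete file instead of three and sees the same rows.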
Puffin statistics computation
Generates column-level statistics in Puffin format. These statistics include approximate distinct values (NDV), min/max values, and null counts. Engines use them to skip entire data files that can't contain matching rows.
Impact
- Enables data skipping at the file level
- Reduces scan volume for selective queries
- Improves performance for Trino, Spark, and Flink without schema changes
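File-level skipping with min/max statistics can be sketched as below. The `ColumnStats` type and `files_to_scan` helper are hypothetical names for illustration, and the example covers only an equality predicate on one integer column. The check is conservative: a file whose range covers the value may still contain no match, but a file outside the range can be skipped safely.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    min_value: int
    max_value: int
    null_count: int


def may_contain(stats, value):
    """True if a file could hold `value`, judged by its min/max range."""
    return stats.min_value <= value <= stats.max_value


def files_to_scan(file_stats, value):
    """Keep only files whose stats do not rule out the value."""
    return [path for path, stats in file_stats.items() if may_contain(stats, value)]


file_stats = {
    "f0.parquet": ColumnStats(min_value=1, max_value=100, null_count=0),
    "f1.parquet": ColumnStats(min_value=101, max_value=200, null_count=3),
    "f2.parquet": ColumnStats(min_value=150, max_value=300, null_count=0),
}
to_scan = files_to_scan(file_stats, 175)
print(to_scan)  # ['f1.parquet', 'f2.parquet']
```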
Using the manifest rewrite UI
Navigate to Explore, select a table, then open the Optimization tab. Scroll to the Rewrite Manifests card.
The default schedule is 0 4 * * * * (daily at 4:00 AM).
Auto vs. Manual mode
Manifest rewrites support two execution modes, toggled at the top of the Rewrite Manifests card in the Optimization tab:
Auto (autopilot)
LakeOps rewrites manifests autonomously on the configured cron schedule. Recommended for most production tables — manifest rewrites are lightweight and safe to run frequently.
Non-disruptive: rewrites are atomic and never block concurrent reads or writes.
Manual (on-demand)
You control when rewrites run. Useful when you want to preview simulation results first, or for tables that rarely change.
The configuration is saved; trigger execution whenever you're ready.
Configuration reference
Manifest rewrites can also be applied at scale via Policies (Rewrite Manifests type).
| Setting | Default | Description |
|---|---|---|
| Schedule (cron) | 0 4 * * * * | When manifest rewrites run. Default: daily at 4:00 AM. |
| Auto / Manual | Auto | Auto: executes on schedule. Manual: on-demand only. |
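As an illustration of how these two settings interact, the sketch below models them in Python. It is not a LakeOps API: the `RewriteManifestsSettings` class and `should_run` function are hypothetical, the defaults are copied from the table above, and the rule that Auto mode runs only when the schedule fires is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewriteManifestsSettings:
    schedule: str = "0 4 * * * *"  # default from the table above
    mode: str = "auto"             # "auto" or "manual"

    def __post_init__(self):
        if self.mode not in ("auto", "manual"):
            raise ValueError(f"unknown mode: {self.mode!r}")


def should_run(settings, schedule_fired, manually_triggered):
    """Decide whether a rewrite runs now (assumed semantics)."""
    if settings.mode == "auto":
        return schedule_fired       # Auto: executes on schedule
    return manually_triggered       # Manual: on-demand only


defaults = RewriteManifestsSettings()
manual = RewriteManifestsSettings(mode="manual")
print(should_run(defaults, schedule_fired=True, manually_triggered=False))  # True
print(should_run(manual, schedule_fired=True, manually_triggered=False))    # False
```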
Monitoring manifest operations
Track the impact of manifest rewrites through:
- Events tab: shows "Rewrite Manifests" operations with before/after manifest count and duration.
- Insights: "Excessive Manifests" findings should resolve after successful rewrites.
- Query performance: planning latency should decrease for the affected table across all connected engines.
When to enable manifest optimization
- Tables with frequent small writes (streaming, micro-batch)
- Tables that receive many compaction cycles (each creates new manifests)
- Large tables where planning latency is a bottleneck
- Tables flagged by LakeOps Insights with manifest-related findings
Enabling Auto mode with the default schedule is recommended for most production tables — manifest rewrites are lightweight and safe to run frequently.
