Manifest & Metadata Optimization

LakeOps consolidates and rewrites manifest files so query planning stays fast at any scale. Smaller, optimized manifests mean faster planning and fewer metadata scans for Trino, Snowflake, Athena, and every engine that touches your lake.

What are manifest files?

In Apache Iceberg, manifest files are metadata files that track which data files belong to a table snapshot. Each manifest records:

• File paths and storage locations
• Partition values for each file
• Record counts and file sizes
• Column-level statistics (min, max, null count) for data skipping

Query engines read manifests during planning to determine which data files need to be scanned. When manifests become too numerous or undersized, planning slows down because the engine must read and parse more metadata before execution begins.
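As a rough illustration of why manifest count matters, the sketch below models a manifest as a list of data file entries and counts how many manifests a planner has to open. The class and function names are hypothetical and are not part of LakeOps or Iceberg. The model is deliberately simplified: it opens every manifest, whereas real engines can also prune manifests using partition summaries in the manifest list.

```python
from dataclasses import dataclass, field


@dataclass
class DataFileEntry:
    """One data file tracked by a manifest (hypothetical, simplified shape)."""
    path: str
    partition: dict
    record_count: int
    file_size_bytes: int
    # column name -> (min, max, null_count)
    column_stats: dict = field(default_factory=dict)


@dataclass
class Manifest:
    path: str
    entries: list


def plan_scan(manifests, partition_filter):
    """Return matching data file paths and the number of manifests read."""
    manifests_read = 0
    selected = []
    for manifest in manifests:
        manifests_read += 1  # each manifest is opened and parsed once
        for entry in manifest.entries:
            if all(entry.partition.get(k) == v for k, v in partition_filter.items()):
                selected.append(entry.path)
    return selected, manifests_read


manifests = [
    Manifest("m1.avro", [DataFileEntry("a.parquet", {"day": "2024-01-01"}, 100, 1024)]),
    Manifest("m2.avro", [DataFileEntry("b.parquet", {"day": "2024-01-02"}, 200, 2048)]),
]
files, reads = plan_scan(manifests, {"day": "2024-01-02"})
```

In this model the planning cost grows with the number of manifests, even when only one data file matches the filter.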

Why manifest optimization matters

Manifests fragment over time due to frequent small writes, compaction cycles, and schema changes. LakeOps Insights flags these issues:

HIGH severity insight

LakeOps flags an Excessive Manifests insight when manifest count exceeds the threshold. A large percentage of undersized manifests severely impacts planning performance and requires rewriting.
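The check behind such an insight can be pictured roughly as follows. This is a minimal sketch and not the LakeOps implementation: the function name and both threshold values (`max_manifests`, `min_size_bytes`) are assumptions chosen only for illustration, since the actual thresholds are not stated here.

```python
def check_manifests(manifest_sizes_bytes, max_manifests=200, min_size_bytes=1_000_000):
    """Summarize manifest health for one table.

    Both thresholds are hypothetical defaults, not documented LakeOps values.
    """
    count = len(manifest_sizes_bytes)
    undersized = sum(1 for size in manifest_sizes_bytes if size < min_size_bytes)
    return {
        "flagged": count > max_manifests,  # manifest count exceeds the threshold
        "manifest_count": count,
        "undersized_ratio": undersized / count if count else 0.0,
    }


# 300 tiny manifests plus 100 well-sized ones
report = check_manifests([50_000] * 300 + [8_000_000] * 100)
```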

What LakeOps optimizes

LakeOps performs manifest consolidation through the Rewrite Manifests operation, and supports additional metadata optimization capabilities:

Manifest consolidation

Merges many small manifest files into fewer, larger manifests. This reduces the number of metadata reads during query planning. The rewrite is atomic — the table transitions from old manifests to new ones in a single commit.

Impact

• Fewer manifest files mean faster metadata reads during planning
• Fewer read requests to object storage, since each manifest is a separate object to fetch
• Lower planning latency across all connected engines
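The consolidation step described above can be sketched as a greedy packing of small manifests into groups up to a target size. This is an illustrative simplification with hypothetical names: a real rewrite works on manifest entries rather than byte sizes, typically keeps entries for the same partitions together, and commits the result as a single new snapshot.

```python
def consolidate(manifest_sizes, target_size):
    """Greedily pack manifest sizes into groups of at most target_size bytes.

    Each returned group stands for one rewritten manifest. A single manifest
    larger than target_size ends up in a group of its own.
    """
    groups = []
    current, current_size = [], 0
    for size in manifest_sizes:
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Five 2-byte manifests packed into 4-byte targets
groups = consolidate([2, 2, 2, 2, 2], target_size=4)
print(len(groups))  # 3
```

Five input manifests become three, so a planner opens three objects instead of five.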

Position delete optimization

Rewrites position delete files to reduce read amplification during merge-on-read queries. Position deletes mark specific rows for deletion without rewriting the entire data file. Over time, many small delete files accumulate, causing engines to perform extra I/O during reads.

Impact

• Reduces merge-on-read overhead for delete-heavy workloads
• Consolidates many small delete files into fewer, larger ones
• Improves read performance without full table rewrite
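To make the read amplification concrete, the sketch below treats each position delete file as a list of `(data_file_path, row_position)` pairs. The function names are hypothetical and the in-memory lists stand in for real delete files. It shows how many small delete files can be merged into one sorted, deduplicated set, and how a reader applies that set when scanning a data file.

```python
def consolidate_deletes(delete_files):
    """Merge many small delete files into one sorted, deduplicated list."""
    merged = set()
    for delete_file in delete_files:
        merged.update(delete_file)
    return sorted(merged)


def read_with_deletes(path, rows, deletes):
    """Return the rows of one data file, skipping positions marked as deleted."""
    deleted_positions = {pos for file_path, pos in deletes if file_path == path}
    return [row for pos, row in enumerate(rows) if pos not in deleted_positions]


# Three small delete files, one of them repeating an earlier delete
small_delete_files = [
    [("a.parquet", 1)],
    [("a.parquet", 3)],
    [("a.parquet", 1)],
]
merged = consolidate_deletes(small_delete_files)
live_rows = read_with_deletes("a.parquet", ["r0", "r1", "r2", "r3"], merged)
```

Before consolidation a reader would open three delete files to scan `a.parquet`; afterwards it opens one, and the data file itself is untouched.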

Puffin statistics computation

Generates column-level statistics in Puffin format. These statistics include approximate distinct values (NDV), min/max values, and null counts. Engines use them to skip entire data files that can't contain matching rows.

Impact

• Enables data skipping at the file level
• Reduces scan volume for selective queries
• Improves performance for all connected query engines without schema changes
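File-level skipping with min/max statistics can be illustrated as follows. This is a minimal sketch that assumes the statistics are already loaded into a plain dictionary; it does not read the Puffin format, and the file names, column name, and helper function are made up for the example.

```python
def files_to_scan(file_stats, column, value):
    """Keep only files whose [min, max] range for `column` could contain `value`."""
    kept = []
    for path, stats in file_stats.items():
        lo, hi = stats[column]
        if lo <= value <= hi:
            kept.append(path)
    return kept


# Hypothetical per-file (min, max) statistics for one column
file_stats = {
    "a.parquet": {"order_id": (1, 1000)},
    "b.parquet": {"order_id": (1001, 2000)},
    "c.parquet": {"order_id": (2001, 3000)},
}
print(files_to_scan(file_stats, "order_id", 1500))  # ['b.parquet']
```

For an equality filter on `order_id = 1500`, two of the three files are skipped without being opened.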

Using the manifest rewrite UI

Navigate to Data > Explore, select a table, then open the Optimization tab. Scroll to the Rewrite Manifests card.

1. Toggle Enabled to activate scheduled manifest rewrites.

2. Set the Cron Expression to control when rewrites run (default: 0 4 * * *, daily at 4:00 AM).

3. Click Save to persist, or Execute to run a rewrite immediately.
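To sanity-check what a daily cron expression such as the default means, the sketch below computes the next run time for expressions of the form `minute hour * * *`. It is a hypothetical helper, not a LakeOps API. It handles only fixed minute and hour values, and it ignores time zones, since the zone LakeOps uses to evaluate schedules is not stated here.

```python
from datetime import datetime, timedelta


def next_daily_run(cron_expr, now):
    """Return the next run time for a 'minute hour * * *' cron expression."""
    minute, hour, day_of_month, month, day_of_week = cron_expr.split()
    if (day_of_month, month, day_of_week) != ("*", "*", "*"):
        raise ValueError("this sketch only handles daily schedules")
    candidate = now.replace(hour=int(hour), minute=int(minute), second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has already passed, so the next run is tomorrow
        candidate += timedelta(days=1)
    return candidate


print(next_daily_run("0 4 * * *", datetime(2024, 1, 1, 5, 0)))  # 2024-01-02 04:00:00
```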

Execution control

Manifest rewrites run on the configured cron schedule when Enabled. You can also click Execute to trigger an immediate rewrite without waiting for the next scheduled time. Manifest rewrites are lightweight, atomic, and never block concurrent reads or writes.

If Adaptive Maintenance is active for the table, manifest rewrites are managed automatically and the individual section is locked.

Configuration reference

Manifest rewrites can also be applied at scale via Policies (Rewrite Manifests type).

Setting	Default	Description
Schedule (cron)	`0 4 * * *`	When manifest rewrites run. Default: daily at 4:00 AM.
Enabled	On	Toggle to enable or disable scheduled manifest rewrites.

Monitoring manifest operations

Track the impact of manifest rewrites through:

• Events tab: shows “Rewrite Manifests” operations with before/after manifest count and duration.
• Insights: “Excessive Manifests” findings should resolve after successful rewrites.
• Query performance: planning latency should decrease for the affected table across all connected engines.

When to enable manifest optimization

• Tables with frequent small writes (streaming, micro-batch)
• Tables that undergo frequent compaction (each cycle creates new manifests)
• Large tables where planning latency is a bottleneck
• Tables flagged by LakeOps Insights with manifest-related findings

Enabling scheduled rewrites with the default schedule is recommended for most production tables, because manifest rewrites are lightweight and safe to run frequently.
