
Iceberg's metadata architecture is designed for speed. Scan planning runs on a single node. Manifest files store per-file column statistics enabling engines to skip irrelevant data. Hidden partitioning prevents accidental full scans. A well-maintained Iceberg table can plan queries over petabytes with sub-second overhead.
Most production tables are not well-maintained. A table with 47,000 small files, 200 fragmented manifests, no sort order, and 23,000 accumulated delete files will be slow on every engine — Trino, Spark, Snowflake, Athena — regardless of compute. The performance ceiling is the physical state of the table, not the engine.
This article covers the six layers that determine Iceberg query performance. Each layer compounds with the others: fix one and you improve the rest. Fix all six and queries run 12× faster. LakeOps automates all six — here's how each one works.
How Iceberg decides what to read
Every query passes through three pruning layers. Each eliminates data before the next layer runs:
Layer 1: Partition pruning. Predicates are evaluated against the partition spec. A query filtering `event_time >= '2026-05-01'` on a table partitioned by `day(event_time)` eliminates every partition outside that range — often 99%+ of data eliminated in one step.
Layer 2: File pruning (data skipping). For surviving partitions, per-file min/max statistics in manifests determine which files can match the predicate. On sorted tables, this eliminates 60–95% of remaining files without opening them.
Layer 3: Row group filtering. Within surviving files, Parquet row group statistics and bloom filters skip individual row groups. For point lookups, this eliminates 80–99% of remaining data.
The compound effect: a petabyte query might plan over 12 manifests, prune to 50 files, and scan 2 GB — all on a single node. But each layer depends on physical table state. When state degrades, pruning becomes ineffective and queries slow down across every engine.
1. File sizing: eliminate per-file overhead
Every file requires at minimum one S3 GET request to read its Parquet footer. A table with 47,000 files means 47,000 metadata round trips before reading a single data row. Query time scales with file count, not data volume.
The measurement: 47,000 files → 280 files reduced query time from 52s to 5.8s. Same data, same query, same engine. 9× faster purely from file count reduction.

Why files get small. Streaming pipelines checkpointing every 60 seconds against 100 partitions create 144,000 new files per day. Within one week: 100,000+ files per table. Target file size for production: 128–512 MB depending on workload.
How [LakeOps](https://lakeops.dev) solves it. LakeOps monitors file-size distribution per table in real time. When average file size drops below threshold, compaction triggers automatically on a Rust engine executing at 2,522 MB/s peak throughput. A streaming table with 42,633 files was compacted to 69 — 99.8% file reduction — in 138 seconds. Tables never degrade because compaction is event-driven, not nightly.
2. Sort order: make data skipping effective
File sizing reduces per-file overhead. Sort order determines how much data within those files gets skipped.
When data is sorted by the columns queries filter on, Parquet row group min/max statistics become useful. A query filtering `WHERE event_date = '2026-05-15'` on sorted data can immediately skip every row group where max < '2026-05-15' or min > '2026-05-15' — without reading a single byte. On unsorted data, min/max ranges span the entire domain and statistics are useless. Every row group must be scanned.
The measurement: sorted tables scan 51% less data per query than unsorted equivalents. Every query, every engine, every day. Sorted data also compresses 9% better (163 GB vs 178 GB for 1 TB TPC-H). This directly reduces query compute cost across Athena ($5/TB scanned) and CPU-second engines alike.

The challenge: choosing the right sort columns. The wrong key accelerates one query pattern while degrading another. Z-order clustering helps for multi-dimensional filters (e.g., `pickup_location` + `fare_amount`) by providing balanced skipping across all dimensions.
How LakeOps solves it. LakeOps tracks actual query patterns per table — which columns appear in WHERE, JOIN, GROUP BY — and sorts data by the columns that deliver the highest aggregate skipping benefit during compaction. This is the single most impactful optimization: when files are organized by how data is actually queried, every engine skips irrelevant data aggressively using row group statistics that are now maximally effective.
The sort strategy adapts continuously. If a new dashboard starts filtering on a previously cold column, the next compaction cycle incorporates it. On three consecutive runs of a 1.2 TB table, the planner improved runtime from 22 minutes to 11 — zero configuration changes. It observes which strategies produce the best skip ratios and converges toward the optimal layout.
3. Manifest optimization: fast query planning
Before reading any data, the engine plans the query by reading manifests — metadata files that track which data files exist, their partition values, and column statistics. Planning speed is determined by how many manifests must be read.
A well-maintained table needs ~12 remote calls to plan regardless of size. But manifests fragment over time from streaming writes, compaction cycles, and schema changes. At 200+ manifests, planning overhead often dominates total query time — the engine spends more time deciding what to read than actually reading it.

How LakeOps solves it. Three operations target manifest and metadata health:
1. Rewrite Manifests — consolidates many small manifests into fewer, larger ones. Atomic operation, completes in seconds, planning latency drops immediately across all engines. Runs daily by default and triggers automatically after compaction.
2. Compute Puffin Statistics — generates column-level statistics (NDV, min/max, null counts) stored in Puffin files. Enables engines to make smarter join decisions (broadcast vs shuffle) and skip files more aggressively.
3. Auto-triggering — manifest rewrites run after every compaction cycle (which creates new manifests), so metadata never drifts from the optimized data layout.
4. Delete files: eliminate read-time reconciliation
Iceberg's merge-on-read (MOR) mode writes small delete files instead of rewriting data files on updates. Writes are fast, but every read must reconcile data files against delete files at scan time — performance degrades linearly with delete file count.
The production reality: 23,433 delete files covering 551M deleted rows. Every query reconciled that entire delete history. After cleanup, reconciliation overhead dropped to zero.

How LakeOps solves it. Two operations target delete file overhead:
- Rewrite Position Delete Files — consolidates many small delete files into fewer, larger ones without rewriting data files. Immediate read improvement, lightweight operation.
- Full compaction — physically applies deletions by merging delete files with corresponding data files. Zero delete files remain, zero reconciliation overhead on reads.
Both operations run on configurable schedules or trigger automatically when delete file count exceeds thresholds. Because LakeOps sequences maintenance (expire snapshots → remove orphans → compact), delete cleanup always runs against the pruned dataset — never wasting compute on files about to be expired.
5. Partition strategy: the coarsest pruning layer
Partitioning is the first filter — it eliminates entire directories of data before file-level statistics are evaluated. The right partition strategy prunes 99%+ of data in one step. The wrong one forces full manifest scans on every query.
Key principles:
- Partition by your most common filter column with low cardinality. For time-series: `day(event_time)`.
- Avoid high-cardinality keys (`user_id`) that create millions of tiny partitions — inflating the manifest tree.
- Target each partition at 2–5× the target file size, so compaction produces well-sized files within each partition.
Iceberg advantages for partitioning: Hidden partitioning maps predicates to partitions automatically (no need for users to know physical layout). Partition evolution allows changing granularity (monthly → daily) as a metadata-only operation — no data rewrite, no downtime.
6. Multi-engine routing: right engine for every query
Optimized tables unlock faster engines. A degraded table with 47,000 small files might only work on Spark (which parallelizes across the mess). After compaction to 280 sorted files, DuckDB handles the same query in 0.5s instead of 52s. Table optimization makes more engines viable for more workloads — this is where LakeOps multi-engine routing delivers compound acceleration.

[LakeOps routing strategies](https://lakeops.dev/docs/query-routing):
- Latency — route to the fastest engine for the query shape. DuckDB: 0.5s. Same query on Athena: 2.3s.
- Cost — route to the cheapest engine meeting latency target. DuckDB at $0.01/query vs Snowflake at $0.08.
- Throughput — balance load across engines to prevent bottlenecks under mixed workloads.
The routing layer learns from execution history. If a query consistently runs faster on one engine, future executions route there automatically. Combined with table-level optimization, production deployments achieve 12× average query acceleration.
How the six layers compound
Each layer amplifies the others. The correct maintenance sequence for a degraded table:
1. Expire snapshots + remove orphans — prune dead data before compaction wastes CPU on it.
2. Compact to target file size — merge small files, physically apply deletes.
3. Sort by query patterns — use layout simulations to test before committing.
4. Rewrite manifests — consolidate against the new layout.
5. Compute Puffin statistics — enable aggressive engine-level pruning.
6. Route queries — direct each workload to the fastest viable engine.

Why continuous beats one-time
Table state degrades continuously. Every streaming checkpoint creates small files. Every MOR update creates delete files. Every append fragments manifests. A one-time compaction fix lasts days or weeks before degradation returns.

LakeOps runs this optimization loop autonomously. Connect your catalogs (AWS Glue, REST, S3 Tables) in 10 minutes. It monitors every table and runs the right maintenance at the right frequency — compaction, manifest rewrites, snapshot expiration, delete cleanup, sort-order adaptation — coordinated in the correct sequence, per table. Production deployments achieve 12× average query acceleration with zero pipeline changes.

The performance ceiling of your Iceberg lakehouse is not the query engine. It is the physical state of your tables. Keep that state optimized — continuously, autonomously — and every engine, every query, every user benefits.
Explore the Iceberg Lakehouse Performance solution page for detailed benchmarks, or read 7 Iceberg Lakehouse Cost Reduction Strategies to see how performance and cost optimization reinforce each other.



