Back to blog

Optimizing Iceberg Lakehouse Performance

Iceberg tables degrade silently — small files from streaming, unsorted data, fragmented manifests, accumulated delete files. Each one caps query speed regardless of engine. Six concrete optimization layers, how they interact, and how autonomous maintenance keeps every table at peak performance.

Modern Iceberg Lakehouse Architecture — LakeOps control plane for autonomous management, optimization, multi-engine routing, and AI readiness

Iceberg's metadata architecture is designed for speed. Scan planning runs on a single node. Manifest files store per-file column statistics enabling engines to skip irrelevant data. Hidden partitioning prevents accidental full scans. A well-maintained Iceberg table can plan queries over petabytes with sub-second overhead.

Most production tables are not well-maintained. A table with 47,000 small files, 200 fragmented manifests, no sort order, and 23,000 accumulated delete files will be slow on every engine — Trino, Spark, Snowflake, Athena — regardless of compute. The performance ceiling is the physical state of the table, not the engine.

This article covers the six layers that determine Iceberg query performance. Each layer compounds with the others: fix one and you improve the rest. Fix all six and queries run 12× faster. LakeOps automates all six — here's how each one works.

How Iceberg decides what to read

Every query passes through three pruning layers. Each eliminates data before the next layer runs:

Layer 1: Partition pruning. Predicates are evaluated against the partition spec. A query filtering `event_time >= '2026-05-01'` on a table partitioned by `day(event_time)` eliminates every partition outside that range — often 99%+ of data eliminated in one step.

Layer 2: File pruning (data skipping). For surviving partitions, per-file min/max statistics in manifests determine which files can match the predicate. On sorted tables, this eliminates 60–95% of remaining files without opening them.

Layer 3: Row group filtering. Within surviving files, Parquet row group statistics and bloom filters skip individual row groups. For point lookups, this eliminates 80–99% of remaining data.

The compound effect: a petabyte query might plan over 12 manifests, prune to 50 files, and scan 2 GB — all on a single node. But each layer depends on physical table state. When state degrades, pruning becomes ineffective and queries slow down across every engine.

1. File sizing: eliminate per-file overhead

Every file requires at minimum one S3 GET request to read its Parquet footer. A table with 47,000 files means 47,000 metadata round trips before reading a single data row. Query time scales with file count, not data volume.

The measurement: 47,000 files → 280 files reduced query time from 52s to 5.8s. Same data, same query, same engine. 9× faster purely from file count reduction.

Multi-engine file ingestion chaos — small files piling into an Iceberg lakehouse
The reality of multi-engine writes: six engines continuously producing small files that pile into disorganized heaps. Without compaction, file count compounds with every ingestion cycle.

Why files get small. Streaming pipelines checkpointing every 60 seconds against 100 partitions create 144,000 new files per day. Within one week: 100,000+ files per table. Target file size for production: 128–512 MB depending on workload.

How [LakeOps](https://lakeops.dev) solves it. LakeOps monitors file-size distribution per table in real time. When average file size drops below threshold, compaction triggers automatically on a Rust engine executing at 2,522 MB/s peak throughput. A streaming table with 42,633 files was compacted to 69 — 99.8% file reduction — in 138 seconds. Tables never degrade because compaction is event-driven, not nightly.

2. Sort order: make data skipping effective

File sizing reduces per-file overhead. Sort order determines how much data within those files gets skipped.

When data is sorted by the columns queries filter on, Parquet row group min/max statistics become useful. A query filtering `WHERE event_date = '2026-05-15'` on sorted data can immediately skip every row group where max < '2026-05-15' or min > '2026-05-15' — without reading a single byte. On unsorted data, min/max ranges span the entire domain and statistics are useless. Every row group must be scanned.

The measurement: sorted tables scan 51% less data per query than unsorted equivalents. Every query, every engine, every day. Sorted data also compresses 9% better (163 GB vs 178 GB for 1 TB TPC-H). This directly reduces query compute cost across Athena ($5/TB scanned) and CPU-second engines alike.

LakeOps Layout Simulations — field access frequency and sort strategy comparison
Layout Simulations: field access frequency analysis shows which columns appear in FILTER, SELECT, and JOIN operations — driving sort-order decisions. Multiple strategies tested side-by-side on real query patterns.

The challenge: choosing the right sort columns. The wrong key accelerates one query pattern while degrading another. Z-order clustering helps for multi-dimensional filters (e.g., `pickup_location` + `fare_amount`) by providing balanced skipping across all dimensions.

How LakeOps solves it. LakeOps tracks actual query patterns per table — which columns appear in WHERE, JOIN, GROUP BY — and sorts data by the columns that deliver the highest aggregate skipping benefit during compaction. This is the single most impactful optimization: when files are organized by how data is actually queried, every engine skips irrelevant data aggressively using row group statistics that are now maximally effective.

The sort strategy adapts continuously. If a new dashboard starts filtering on a previously cold column, the next compaction cycle incorporates it. On three consecutive runs of a 1.2 TB table, the planner improved runtime from 22 minutes to 11 — zero configuration changes. It observes which strategies produce the best skip ratios and converges toward the optimal layout.

3. Manifest optimization: fast query planning

Before reading any data, the engine plans the query by reading manifests — metadata files that track which data files exist, their partition values, and column statistics. Planning speed is determined by how many manifests must be read.

A well-maintained table needs ~12 remote calls to plan regardless of size. But manifests fragment over time from streaming writes, compaction cycles, and schema changes. At 200+ manifests, planning overhead often dominates total query time — the engine spends more time deciding what to read than actually reading it.

LakeOps Insights tab — manifest and file health diagnostics
Table Insights: HIGH alert for 92 manifest files (threshold: 50) with 43 undersized — severely impacting planning performance. These diagnostics identify exactly which layer is degraded.

How LakeOps solves it. Three operations target manifest and metadata health:

1. Rewrite Manifests — consolidates many small manifests into fewer, larger ones. Atomic operation, completes in seconds, planning latency drops immediately across all engines. Runs daily by default and triggers automatically after compaction.

2. Compute Puffin Statistics — generates column-level statistics (NDV, min/max, null counts) stored in Puffin files. Enables engines to make smarter join decisions (broadcast vs shuffle) and skip files more aggressively.

3. Auto-triggering — manifest rewrites run after every compaction cycle (which creates new manifests), so metadata never drifts from the optimized data layout.

4. Delete files: eliminate read-time reconciliation

Iceberg's merge-on-read (MOR) mode writes small delete files instead of rewriting data files on updates. Writes are fast, but every read must reconcile data files against delete files at scan time — performance degrades linearly with delete file count.

The production reality: 23,433 delete files covering 551M deleted rows. Every query reconciled that entire delete history. After cleanup, reconciliation overhead dropped to zero.

LakeOps compaction — small files through the Rust engine to organized output
The compaction pipeline: fragmented small files (including delete files) flow through the Rust engine, producing clean outputs. Delete files are physically applied — eliminating reconciliation permanently.

How LakeOps solves it. Two operations target delete file overhead:

- Rewrite Position Delete Files — consolidates many small delete files into fewer, larger ones without rewriting data files. Immediate read improvement, lightweight operation.

- Full compaction — physically applies deletions by merging delete files with corresponding data files. Zero delete files remain, zero reconciliation overhead on reads.

Both operations run on configurable schedules or trigger automatically when delete file count exceeds thresholds. Because LakeOps sequences maintenance (expire snapshots → remove orphans → compact), delete cleanup always runs against the pruned dataset — never wasting compute on files about to be expired.

5. Partition strategy: the coarsest pruning layer

Partitioning is the first filter — it eliminates entire directories of data before file-level statistics are evaluated. The right partition strategy prunes 99%+ of data in one step. The wrong one forces full manifest scans on every query.

Key principles:

- Partition by your most common filter column with low cardinality. For time-series: `day(event_time)`.

- Avoid high-cardinality keys (`user_id`) that create millions of tiny partitions — inflating the manifest tree.

- Target each partition at 2–5× the target file size, so compaction produces well-sized files within each partition.

Iceberg advantages for partitioning: Hidden partitioning maps predicates to partitions automatically (no need for users to know physical layout). Partition evolution allows changing granularity (monthly → daily) as a metadata-only operation — no data rewrite, no downtime.

6. Multi-engine routing: right engine for every query

Optimized tables unlock faster engines. A degraded table with 47,000 small files might only work on Spark (which parallelizes across the mess). After compaction to 280 sorted files, DuckDB handles the same query in 0.5s instead of 52s. Table optimization makes more engines viable for more workloads — this is where LakeOps multi-engine routing delivers compound acceleration.

LakeOps control plane — multi-engine architecture with query routing
The LakeOps control plane routes each query to the engine matching its performance profile — interactive queries to Trino/DuckDB, heavy scans to Spark, high-concurrency BI to Snowflake.

[LakeOps routing strategies](https://lakeops.dev/docs/query-routing):

- Latency — route to the fastest engine for the query shape. DuckDB: 0.5s. Same query on Athena: 2.3s.

- Cost — route to the cheapest engine meeting latency target. DuckDB at $0.01/query vs Snowflake at $0.08.

- Throughput — balance load across engines to prevent bottlenecks under mixed workloads.

The routing layer learns from execution history. If a query consistently runs faster on one engine, future executions route there automatically. Combined with table-level optimization, production deployments achieve 12× average query acceleration.

How the six layers compound

Each layer amplifies the others. The correct maintenance sequence for a degraded table:

1. Expire snapshots + remove orphans — prune dead data before compaction wastes CPU on it.

2. Compact to target file size — merge small files, physically apply deletes.

3. Sort by query patterns — use layout simulations to test before committing.

4. Rewrite manifests — consolidate against the new layout.

5. Compute Puffin statistics — enable aggressive engine-level pruning.

6. Route queries — direct each workload to the fastest viable engine.

Healthy vs unhealthy Iceberg tables for queries
The performance difference: unhealthy tables (small files, fragmented manifests, slow queries) on the left. Healthy tables (compacted, sorted, fast) on the right. Autonomous maintenance continuously moves tables from left to right.

Why continuous beats one-time

Table state degrades continuously. Every streaming checkpoint creates small files. Every MOR update creates delete files. Every append fragments manifests. A one-time compaction fix lasts days or weeks before degradation returns.

Getting started with autonomous optimization
Minutes to value: connect catalogs, choose autonomous or manual mode, operations run continuously with full observability from day one.

LakeOps runs this optimization loop autonomously. Connect your catalogs (AWS Glue, REST, S3 Tables) in 10 minutes. It monitors every table and runs the right maintenance at the right frequency — compaction, manifest rewrites, snapshot expiration, delete cleanup, sort-order adaptation — coordinated in the correct sequence, per table. Production deployments achieve 12× average query acceleration with zero pipeline changes.

Watch LakeOps in action — from catalog connection to autonomous optimization delivering 12× query acceleration.
LakeOps table Metrics — file sizes, record counts, and health over time
Per-table metrics: 9.5B records, 379.83 GB, 3.0K data files at 129.35 MB avg, zero delete files. Records distribution shows a healthy table converging toward optimal state.

The performance ceiling of your Iceberg lakehouse is not the query engine. It is the physical state of your tables. Keep that state optimized — continuously, autonomously — and every engine, every query, every user benefits.

Explore the Iceberg Lakehouse Performance solution page for detailed benchmarks, or read 7 Iceberg Lakehouse Cost Reduction Strategies to see how performance and cost optimization reinforce each other.

Related articles

Found this useful? Share it with your team.