
Apache Iceberg is the right table format for open analytics: ACID commits, hidden partitioning, engine-agnostic catalogs, and metadata that lets planners skip data before reading it. None of that removes the need for physical optimization. Analytics workloads — BI dashboards refreshing every few minutes, analysts running overlapping ad-hoc SQL, semantic layers issuing the same filter patterns thousands of times per day — amplify every structural mistake in your tables.
The performance ceiling for analytics on Iceberg is not your query engine. It is the physical layout of the data: partition granularity, file count and size, sort order, manifest depth, snapshot backlog, and delete-file accumulation. A table with 47,000 small files, 200 fragmented manifests, and no sort order will be slow on Trino, Snowflake, Spark, and Athena alike.
This guide is written for data platform engineers who own lakehouse reliability. It explains how analytics queries interact with Iceberg metadata, what to measure before changing anything, and eight optimization components you can implement in order — manually today, or continuously with an autonomous control plane like LakeOps. For engine-level pruning mechanics, see Optimizing Iceberg Lakehouse Performance; this article focuses on the analytics runbook.
Why analytics workloads need a different optimization mindset
ETL and streaming writers optimize for ingest throughput. Analytics readers optimize for selective scan efficiency and planning latency. The same table serves both, which is why production lakes degrade silently:
- Read-heavy ratio. A dashboard table might see 10,000 reads per write. A one-time compaction that helps ingest layout does nothing for the analyst at 9 AM if 8,000 small files landed overnight.
- Repeated predicates. BI tools filter on
event_date,region, andproduct_idon every refresh. If data is not sorted and partitioned for those columns, every refresh scans the same 90% of irrelevant bytes. - Concurrency spikes. Morning dashboard loads hit the same hot partitions simultaneously — file count and manifest depth matter as much as bytes scanned.
- Multi-engine access. Trino, Snowflake, DuckDB, and Athena read the same Iceberg metadata but bill differently ($/TB scanned vs $/credit vs $/query). Layout that helps one engine helps all; routing picks the cheapest winner.

Diagnose before you rewrite: signals that hurt analytics
Run these checks per table before changing partition specs or kicking off a full sort compaction:
| Signal | Where to look | Analytics symptom |
|---|---|---|
| File count >> data volume | Table metrics, files metadata | Slow planning, high S3 LIST/GET cost, p95 latency spikes |
| Avg file size < 64 MB on a fact table | File size distribution | Dashboard timeouts despite selective WHERE clauses |
| Manifest count > 50–100 | Manifest list, Insights alerts | Seconds of planning before first byte read |
| Snapshot depth > retention SLA | Snapshots tab | Storage bloat; planning reads stale manifest entries |
| Delete files growing | Partitions / table metrics | MOR tables slow down linearly even for narrow filters |
| Partition skew | Files per partition | One hot partition caps concurrency |
| Filter columns ≠ partition/sort keys | Query logs, BI SQL | Full scans inside "pruned" partitions |

How Iceberg plans an analytics query (three pruning layers)
Every analytics SQL statement goes through the same planning pipeline, regardless of engine:
1. Partition pruning. Predicates on partition source columns eliminate whole partition groups using manifest-list min/max bounds. A filter event_time >= '2026-05-01' on days(event_time) can drop 99%+ of metadata before file lists are evaluated — but only if the partition spec matches how analysts actually filter.
2. File pruning (data skipping). For remaining partitions, per-file column statistics in manifests eliminate files whose min/max ranges cannot match the predicate. This is where sort order matters: unsorted files have wide min/max spans, so pruning fails and engines open every file.
3. Row-group pruning. Inside each Parquet file, row-group statistics (and optional bloom filters in Puffin blobs) skip byte ranges. Tight sort order plus appropriately sized row groups is what makes sub-second dashboard queries possible on terabyte tables.
Iceberg's design keeps scan planning on a single coordinator with bounded remote metadata reads — far cheaper than Hive-style directory listing at scale. That advantage disappears when manifests fragment or files multiply. The eight components below restore pruning effectiveness for analytics.
1. Partition for how analysts filter — not how data arrives
Goal: eliminate whole date ranges, regions, or tenants in one metadata step before opening files.
Steps:
- 1.Pull 30 days of query history (or BI semantic-layer SQL) and rank columns by filter frequency.
- 2.Choose one primary partition column — usually
days(timestamp)ormonths(timestamp)for event analytics. - 3.Use Iceberg hidden transforms so analysts filter on
event_timedirectly; engines map to partition keys automatically. - 4.For secondary high-cardinality filters used in almost every query (
tenant_id,country), addbucket(N)only if each bucket will hold multiple target-sized files — not millions of tiny partitions. - 5.When workloads shift from monthly reporting to daily drill-down, use partition evolution (metadata-only) instead of a full rewrite. New writes use the new spec; historical data stays readable.
Anti-patterns: partitioning on user_id for a billion-user table; raw string partitions that BI tools do not filter on; more than 10,000 active partitions with <3 files each.

2. Size files and row groups for your analytics latency target
Goal: minimize per-query metadata overhead without sacrificing parallelism for large aggregations.
Each data file costs at least one object-store GET to read Parquet footers during planning. Analytics tables with tens of thousands of small files spend more time on metadata than on data — dashboard p95 latency tracks file count, not row count.
Recommended targets:
- Interactive BI / sub-second targets:
write.target-file-size-bytes= 128–256 MB;write.parquet.row-group-size-bytes= 16–32 MB. - Ad-hoc analyst exploration: 256–512 MB files, 32–64 MB row groups.
- Large aggregation jobs that scan most of a partition: 512 MB–1 GB files, 64–128 MB row groups.
Production example: 47,000 files → 280 files at ~512 MB average cut query time from 52s to 5.8s — same SQL, same engine, same data volume. Nine times faster from file consolidation alone.

3. Sort (or Z-order) by columns that appear in WHERE and JOIN
Goal: make file- and row-group-level statistics tight enough that selective analytics queries skip 50–95% of bytes.
Steps:
- 1.Rank columns by appearances in WHERE, JOIN, and GROUP BY from production query logs — not guessed at schema design time.
- 2.Run sort compaction on the top 1–2 columns for single-dimension dashboards (e.g.
event_date, thenregion). - 3.Evaluate Z-order when two dimensions are equally hot (geo + product category) and neither alone dominates.
- 4.Before a costly rewrite, test candidate layouts on an Iceberg branch with real query replay — measure bytes scanned and planning time vs baseline.
- 5.Re-evaluate quarterly; new dashboards introduce new filter columns.
Sorted fact tables routinely scan ~51% less data per query than unsorted equivalents on the same engine. On Athena-style pricing ($/TB scanned), that is direct savings on every refresh. On CPU-billed engines, it is lower concurrency pressure during peak hours.
Query-pattern-driven sort in practice. The hardest part of sort optimization is not running the compaction — it is knowing which columns to sort by, and when to change. LakeOps solves this by collecting cross-engine query telemetry — every WHERE, JOIN, and GROUP BY across Trino, Spark, Snowflake, Athena, and DuckDB — and automatically selecting the sort order that maximizes data skipping for the actual workload mix. When a new dashboard starts filtering on a previously cold column, the next compaction cycle incorporates it. Production deployments achieve 12× average query acceleration across the lake using this feedback loop — without manual sort-key selection or quarterly re-tuning.

4. Keep metadata lean: manifests, snapshots, and Puffin stats
Goal: sub-second planning so analytics time is spent reading data, not listing it.
Manifests. Each append and compaction adds manifest files. Hundreds of small manifests inflate planning IO. Schedule rewrite manifests after compaction so the manifest tree matches the current file set. Teams report planning dropping from multiple seconds to sub-second once manifest counts fall from hundreds to dozens.
Snapshots. Every commit creates a snapshot. Analytics tables with continuous ingest can accumulate thousands. Retain 7–30 days for audit and time travel, then expire aggressively. Expiration also dereferences data files so orphan cleanup can reclaim storage. One production table held 120 TB of reclaimable snapshot-linked data — pure cost with zero analytics value.
Puffin statistics. Manifest min/max handles most file pruning. For join-heavy analytics (star-schema facts to many dimensions), compute Puffin blobs with NDV sketches and bloom filters on join keys — engines use them for better join order and existence checks on high-cardinality IDs. Run after layout stabilizes so statistics match the files analysts actually query.

5. Run compaction continuously — and pick binpack vs sort deliberately
Goal: sustain file size and sort order as ingest runs 24/7, not only after nightly batch windows.
Nightly cron compaction fails analytics SLAs: data lands at 7 AM, dashboards run at 9 AM, compaction ran at 2 AM. Event-driven triggers — file-count thresholds, average size below target, delete-file ratio — compact when the table actually degrades.
| Strategy | When to use for analytics | Tradeoff |
|---|---|---|
| Binpack | Streaming ingest tables needing fast, cheap file consolidation | Restores file size; does not improve filter skipping |
| Sort | Fact tables with stable filter columns and read:write > 50:1 | Higher rewrite cost; largest scan reduction per dashboard |
| Z-order | Two+ equally hot filter dimensions | Balanced skipping; less tight per column than pure sort |
Why compaction engine choice matters for analytics. Compaction is not a one-time migration — it runs continuously. On a 200 GB table, Spark sort compaction ran ~1,612s (~$1.54); the LakeOps Rust engine (built on Apache DataFusion) ran ~221s (~$0.21) in production benchmarks — same data, 86% lower maintenance cost. When compaction runs across hundreds of analytics tables weekly, engine efficiency determines whether maintenance is a budget line item or noise. The LakeOps engine also self-improves: on three consecutive runs of the same 1.2 TB table it reduced runtime from 22 minutes to 11 — learning data distribution and access patterns without configuration changes.
Serving analytics directly from the lake. Teams running dashboards on well-compacted, sorted Iceberg tables report that the performance gap between a lakehouse and a dedicated analytics warehouse closes entirely for read-heavy workloads. Some production deployments render BI dashboards and embedded analytics straight from the Iceberg lake to the UI — no CDC pipeline, no materialized views, no separate serving layer — because continuous sort compaction keeps query latency in the sub-second range across Trino and DuckDB. The prerequisite is that compaction runs faster than data arrives, files stay sorted by dashboard filter patterns, and manifests stay consolidated. When those conditions hold, the lake is the analytics engine.


For engine architecture detail, see Efficient Lakehouse Compaction at Scale.
6. Collapse delete files before they tax every dashboard refresh
Goal: zero read-time reconciliation for merge-on-read (MOR) dimension and fact tables.
CDC, late-arriving facts, and BI-driven corrections create position and equality delete files. Each analytics scan merges deletes in memory. A table with 23,000+ delete files covering hundreds of millions of rows added seconds to every query — even narrow ones.
Steps:
- 1.Monitor delete-file count per partition alongside active data files.
- 2.Run lightweight rewrite position deletes when counts spike but full rewrite is not yet warranted.
- 3.Schedule sort compaction with delete merging on analytics tables with daily updates — physical apply deletes so reads hit clean Parquet only.
- 4.Sequence after snapshot expiration so you do not compact files about to be removed.

7. Route analytics SQL to the engine that matches cost and latency
Goal: same optimized Iceberg table, cheapest engine that meets the dashboard SLA.
Open lakes standardize storage; engines still price differently. Example production spreads: DuckDB ~$0.01/query at ~0.5s p50; Trino ~$0.03 at ~1.8s; Snowflake ~$0.08 at ~2.1s for comparable selective queries. Wrong defaults at 10k queries/day compound into six-figure annual waste — the same class of problem covered in 7 Iceberg lakehouse cost reduction strategies.
Practical routing rules for analytics:
- Latency mode — executive dashboards and embedded analytics → fastest engine for selective SQL (often Trino or DuckDB on well-compacted tables).
- Cost mode — scheduled reports and wide aggregations → cheapest engine within a relaxed SLA (Athena, Spark, StarRocks depending on stack).
- Throughput mode — peak morning load → spread concurrent BI sessions to avoid single-engine queueing.
Routing only works when table layout makes "cheap" engines viable. Compacting to a few hundred sorted files is what lets DuckDB or Trino beat Spark on interactive SQL. See multi-engine query routing for the full stack view.

8. Automate the maintenance loop — or analytics performance will regress
Goal: warehouse-like reliability on open storage without warehouse lock-in.
Every layer above decays without continuous execution. Streaming adds small files; MOR adds deletes; commits add snapshots; new dashboards change optimal sort keys. The correct sequence for analytics tables — and why order matters — is covered in Autonomous Iceberg Table Maintenance:
- 1.Expire snapshots — release dereferenced files; shorten planning paths.
- 2.Remove orphan files — reclaim storage; avoid compacting dead objects.
- 3.Compact data files — binpack or sort based on table role; apply deletes.
- 4.Rewrite manifests — align metadata with new layout.
- 5.Compute statistics — refresh Puffin / table stats for join-heavy models.
- 6.Observe and alert — file count, manifest depth, skew, delete ratio.
Manual Airflow DAGs across hundreds of tables rarely keep pace. A managed Iceberg control plane that sequences these operations per table — event-driven triggers, per-table tuning, and full audit logs — is how platform teams stop firefighting dashboard latency tickets without adding another nightly script. LakeOps implements this loop as a system: it observes which tables degrade between runs, which sort orders are stale relative to current query patterns, and which partitions have drifted from target file size — then acts in the correct sequence, per table, autonomously.

Analytics optimization checklist (copy for runbooks)
| # | Component | Action | Expected analytics impact |
|---|---|---|---|
| 1 | Partitioning | Match spec to top filter columns; evolve when drill-down changes | 90–99% partition elimination |
| 2 | File sizing | 128–512 MB targets by workload; fix small-file drift | Up to 9× latency reduction on selective SQL |
| 3 | Sort / Z-order | Align layout to WHERE/JOIN telemetry | 51–95% fewer bytes scanned |
| 4 | Metadata | Rewrite manifests; expire snapshots; refresh Puffin | Sub-second planning |
| 5 | Compaction | Event-driven; sort on read-heavy facts | Sustains layers 2–3 |
| 6 | Delete files | Merge MOR deletes on update-heavy tables | Removes per-query reconciliation tax |
| 7 | Routing | Cost/latency policies per workload class | Up to 8× lower cost per query |
| 8 | Automation | Sequenced policies + health alerts | Prevents regression between sprints |
Layers compound: sorted partitions make routing to Trino/DuckDB viable; routing makes scan reduction financially visible; automation keeps dashboards fast on Monday morning after a weekend of ingest.
Getting started without boiling the ocean
Pick the five tables behind your slowest dashboards. For each: check file count, average size, manifest count, and top three filter columns from query logs. Run orphan cleanup and snapshot expiration first (cheap, immediate storage win), then sort compaction on the worst fact table, then manifest rewrite.
If you operate at lake scale, connect catalogs once and let policies run the sequence above continuously. LakeOps plugs into AWS Glue, REST catalogs (Polaris, Nessie, Gravitino, Lakekeeper), DynamoDB, and S3 Tables in about ten minutes — reads metadata and query telemetry only, no data copy. Manual mode lets you inspect and trigger; autonomous mode executes against your retention and compaction policies with full audit trails. Production results across customer deployments: 12× average query acceleration, up to 80% cost reduction, and 786+ tables managed autonomously across 112+ PB — with every operation logged, reversible, and auditable.




