Back to blog

Iceberg Lake for Data Analytics: Optimization Guide

Eight optimization layers for data platform engineers running BI, ad-hoc SQL, and aggregation pipelines on Apache Iceberg — from partition design and file sizing through compaction, routing, and continuous maintenance.

Iceberg Lake for Data Analytics: Optimization Guide — iceberg on water with analytics dashboard showing 9.4× query speed, 68% cost efficiency gain, and 82% less data scanned

Apache Iceberg is the right table format for open analytics: ACID commits, hidden partitioning, engine-agnostic catalogs, and metadata that lets planners skip data before reading it. None of that removes the need for physical optimization. Analytics workloads — BI dashboards refreshing every few minutes, analysts running overlapping ad-hoc SQL, semantic layers issuing the same filter patterns thousands of times per day — amplify every structural mistake in your tables.

The performance ceiling for analytics on Iceberg is not your query engine. It is the physical layout of the data: partition granularity, file count and size, sort order, manifest depth, snapshot backlog, and delete-file accumulation. A table with 47,000 small files, 200 fragmented manifests, and no sort order will be slow on Trino, Snowflake, Spark, and Athena alike.

This guide is written for data platform engineers who own lakehouse reliability. It explains how analytics queries interact with Iceberg metadata, what to measure before changing anything, and eight optimization components you can implement in order — manually today, or continuously with an autonomous control plane like LakeOps. For engine-level pruning mechanics, see Optimizing Iceberg Lakehouse Performance; this article focuses on the analytics runbook.

Why analytics workloads need a different optimization mindset

ETL and streaming writers optimize for ingest throughput. Analytics readers optimize for selective scan efficiency and planning latency. The same table serves both, which is why production lakes degrade silently:

  • Read-heavy ratio. A dashboard table might see 10,000 reads per write. A one-time compaction that helps ingest layout does nothing for the analyst at 9 AM if 8,000 small files landed overnight.
  • Repeated predicates. BI tools filter on event_date, region, and product_id on every refresh. If data is not sorted and partitioned for those columns, every refresh scans the same 90% of irrelevant bytes.
  • Concurrency spikes. Morning dashboard loads hit the same hot partitions simultaneously — file count and manifest depth matter as much as bytes scanned.
  • Multi-engine access. Trino, Snowflake, DuckDB, and Athena read the same Iceberg metadata but bill differently ($/TB scanned vs $/credit vs $/query). Layout that helps one engine helps all; routing picks the cheapest winner.
Multi-engine ingestion — how analytics tables degrade between compaction runs
Multiple engines writing on schedule produce small, unsorted files continuously. Analytics queries pay for that disorder on every dashboard refresh until compaction catches up.

Diagnose before you rewrite: signals that hurt analytics

Run these checks per table before changing partition specs or kicking off a full sort compaction:

SignalWhere to lookAnalytics symptom
File count >> data volumeTable metrics, files metadataSlow planning, high S3 LIST/GET cost, p95 latency spikes
Avg file size < 64 MB on a fact tableFile size distributionDashboard timeouts despite selective WHERE clauses
Manifest count > 50–100Manifest list, Insights alertsSeconds of planning before first byte read
Snapshot depth > retention SLASnapshots tabStorage bloat; planning reads stale manifest entries
Delete files growingPartitions / table metricsMOR tables slow down linearly even for narrow filters
Partition skewFiles per partitionOne hot partition caps concurrency
Filter columns ≠ partition/sort keysQuery logs, BI SQLFull scans inside "pruned" partitions
LakeOps Insights — alerts that map to analytics slowdowns
Lake-wide Insights surface the structural problems analytics feels first: partition file overload, excessive manifests and snapshots, small-file buildup, and partition skew — each maps to a specific fix in the sections below.

How Iceberg plans an analytics query (three pruning layers)

Every analytics SQL statement goes through the same planning pipeline, regardless of engine:

1. Partition pruning. Predicates on partition source columns eliminate whole partition groups using manifest-list min/max bounds. A filter event_time >= '2026-05-01' on days(event_time) can drop 99%+ of metadata before file lists are evaluated — but only if the partition spec matches how analysts actually filter.

2. File pruning (data skipping). For remaining partitions, per-file column statistics in manifests eliminate files whose min/max ranges cannot match the predicate. This is where sort order matters: unsorted files have wide min/max spans, so pruning fails and engines open every file.

3. Row-group pruning. Inside each Parquet file, row-group statistics (and optional bloom filters in Puffin blobs) skip byte ranges. Tight sort order plus appropriately sized row groups is what makes sub-second dashboard queries possible on terabyte tables.

Iceberg's design keeps scan planning on a single coordinator with bounded remote metadata reads — far cheaper than Hive-style directory listing at scale. That advantage disappears when manifests fragment or files multiply. The eight components below restore pruning effectiveness for analytics.

1. Partition for how analysts filter — not how data arrives

Goal: eliminate whole date ranges, regions, or tenants in one metadata step before opening files.

Steps:

  1. 1.Pull 30 days of query history (or BI semantic-layer SQL) and rank columns by filter frequency.
  2. 2.Choose one primary partition column — usually days(timestamp) or months(timestamp) for event analytics.
  3. 3.Use Iceberg hidden transforms so analysts filter on event_time directly; engines map to partition keys automatically.
  4. 4.For secondary high-cardinality filters used in almost every query (tenant_id, country), add bucket(N) only if each bucket will hold multiple target-sized files — not millions of tiny partitions.
  5. 5.When workloads shift from monthly reporting to daily drill-down, use partition evolution (metadata-only) instead of a full rewrite. New writes use the new spec; historical data stays readable.

Anti-patterns: partitioning on user_id for a billion-user table; raw string partitions that BI tools do not filter on; more than 10,000 active partitions with <3 files each.

LakeOps Partitions tab — skew and file counts per partition
87 partitions, 3,012 files, 34.6 avg files per partition — use these metrics to spot skew and decide when partition evolution beats rewriting data.

2. Size files and row groups for your analytics latency target

Goal: minimize per-query metadata overhead without sacrificing parallelism for large aggregations.

Each data file costs at least one object-store GET to read Parquet footers during planning. Analytics tables with tens of thousands of small files spend more time on metadata than on data — dashboard p95 latency tracks file count, not row count.

Recommended targets:

  • Interactive BI / sub-second targets: write.target-file-size-bytes = 128–256 MB; write.parquet.row-group-size-bytes = 16–32 MB.
  • Ad-hoc analyst exploration: 256–512 MB files, 32–64 MB row groups.
  • Large aggregation jobs that scan most of a partition: 512 MB–1 GB files, 64–128 MB row groups.

Production example: 47,000 files → 280 files at ~512 MB average cut query time from 52s to 5.8s — same SQL, same engine, same data volume. Nine times faster from file consolidation alone.

Production compaction benchmarks — file reduction on real analytics tables
Ten production tables (5.5 TB): streaming workloads reached 99.8% file reduction; batch and delete-heavy tables consolidated in minutes. File sizing is the fastest win for dashboard latency.

3. Sort (or Z-order) by columns that appear in WHERE and JOIN

Goal: make file- and row-group-level statistics tight enough that selective analytics queries skip 50–95% of bytes.

Steps:

  1. 1.Rank columns by appearances in WHERE, JOIN, and GROUP BY from production query logs — not guessed at schema design time.
  2. 2.Run sort compaction on the top 1–2 columns for single-dimension dashboards (e.g. event_date, then region).
  3. 3.Evaluate Z-order when two dimensions are equally hot (geo + product category) and neither alone dominates.
  4. 4.Before a costly rewrite, test candidate layouts on an Iceberg branch with real query replay — measure bytes scanned and planning time vs baseline.
  5. 5.Re-evaluate quarterly; new dashboards introduce new filter columns.

Sorted fact tables routinely scan ~51% less data per query than unsorted equivalents on the same engine. On Athena-style pricing ($/TB scanned), that is direct savings on every refresh. On CPU-billed engines, it is lower concurrency pressure during peak hours.

Query-pattern-driven sort in practice. The hardest part of sort optimization is not running the compaction — it is knowing which columns to sort by, and when to change. LakeOps solves this by collecting cross-engine query telemetry — every WHERE, JOIN, and GROUP BY across Trino, Spark, Snowflake, Athena, and DuckDB — and automatically selecting the sort order that maximizes data skipping for the actual workload mix. When a new dashboard starts filtering on a previously cold column, the next compaction cycle incorporates it. Production deployments achieve 12× average query acceleration across the lake using this feedback loop — without manual sort-key selection or quarterly re-tuning.

LakeOps Layout Simulations — sort strategies tested against real SQL
Field access frequency (SELECT, FILTER, JOIN) from real workloads, plus branch simulations that compare predicted scan reduction before you commit to a full sort rewrite.

4. Keep metadata lean: manifests, snapshots, and Puffin stats

Goal: sub-second planning so analytics time is spent reading data, not listing it.

Manifests. Each append and compaction adds manifest files. Hundreds of small manifests inflate planning IO. Schedule rewrite manifests after compaction so the manifest tree matches the current file set. Teams report planning dropping from multiple seconds to sub-second once manifest counts fall from hundreds to dozens.

Snapshots. Every commit creates a snapshot. Analytics tables with continuous ingest can accumulate thousands. Retain 7–30 days for audit and time travel, then expire aggressively. Expiration also dereferences data files so orphan cleanup can reclaim storage. One production table held 120 TB of reclaimable snapshot-linked data — pure cost with zero analytics value.

Puffin statistics. Manifest min/max handles most file pruning. For join-heavy analytics (star-schema facts to many dimensions), compute Puffin blobs with NDV sketches and bloom filters on join keys — engines use them for better join order and existence checks on high-cardinality IDs. Run after layout stabilizes so statistics match the files analysts actually query.

LakeOps table Insights — manifest overload hurting planning
92 manifest files vs a 50-file threshold — a common analytics planning bottleneck. Rewrite manifests after compaction to align metadata with the optimized file layout.

5. Run compaction continuously — and pick binpack vs sort deliberately

Goal: sustain file size and sort order as ingest runs 24/7, not only after nightly batch windows.

Nightly cron compaction fails analytics SLAs: data lands at 7 AM, dashboards run at 9 AM, compaction ran at 2 AM. Event-driven triggers — file-count thresholds, average size below target, delete-file ratio — compact when the table actually degrades.

StrategyWhen to use for analyticsTradeoff
BinpackStreaming ingest tables needing fast, cheap file consolidationRestores file size; does not improve filter skipping
SortFact tables with stable filter columns and read:write > 50:1Higher rewrite cost; largest scan reduction per dashboard
Z-orderTwo+ equally hot filter dimensionsBalanced skipping; less tight per column than pure sort

Why compaction engine choice matters for analytics. Compaction is not a one-time migration — it runs continuously. On a 200 GB table, Spark sort compaction ran ~1,612s (~$1.54); the LakeOps Rust engine (built on Apache DataFusion) ran ~221s (~$0.21) in production benchmarks — same data, 86% lower maintenance cost. When compaction runs across hundreds of analytics tables weekly, engine efficiency determines whether maintenance is a budget line item or noise. The LakeOps engine also self-improves: on three consecutive runs of the same 1.2 TB table it reduced runtime from 22 minutes to 11 — learning data distribution and access patterns without configuration changes.

Serving analytics directly from the lake. Teams running dashboards on well-compacted, sorted Iceberg tables report that the performance gap between a lakehouse and a dedicated analytics warehouse closes entirely for read-heavy workloads. Some production deployments render BI dashboards and embedded analytics straight from the Iceberg lake to the UI — no CDC pipeline, no materialized views, no separate serving layer — because continuous sort compaction keeps query latency in the sub-second range across Trino and DuckDB. The prerequisite is that compaction runs faster than data arrives, files stay sorted by dashboard filter patterns, and manifests stay consolidated. When those conditions hold, the lake is the analytics engine.

LakeOps Optimization tab — binpack vs sort policies per table
Per-table policy: binpack for ingest-heavy tables, sort for read-heavy analytics facts. Set target file size and schedule; simulate before applying sort to a petabyte fact table.
Rust compaction engine — maintenance without JVM overhead
Purpose-built compaction avoids Spark JVM startup and GC — relevant when analytics tables need frequent, small compaction runs instead of rare batch rewrites.

For engine architecture detail, see Efficient Lakehouse Compaction at Scale.

6. Collapse delete files before they tax every dashboard refresh

Goal: zero read-time reconciliation for merge-on-read (MOR) dimension and fact tables.

CDC, late-arriving facts, and BI-driven corrections create position and equality delete files. Each analytics scan merges deletes in memory. A table with 23,000+ delete files covering hundreds of millions of rows added seconds to every query — even narrow ones.

Steps:

  1. 1.Monitor delete-file count per partition alongside active data files.
  2. 2.Run lightweight rewrite position deletes when counts spike but full rewrite is not yet warranted.
  3. 3.Schedule sort compaction with delete merging on analytics tables with daily updates — physical apply deletes so reads hit clean Parquet only.
  4. 4.Sequence after snapshot expiration so you do not compact files about to be removed.
Compaction pipeline — delete files merged into clean data files
Full compaction applies position deletes physically — fragmented inputs in, organized blocks out. Analytics queries stop paying per-read reconciliation overhead.

7. Route analytics SQL to the engine that matches cost and latency

Goal: same optimized Iceberg table, cheapest engine that meets the dashboard SLA.

Open lakes standardize storage; engines still price differently. Example production spreads: DuckDB ~$0.01/query at ~0.5s p50; Trino ~$0.03 at ~1.8s; Snowflake ~$0.08 at ~2.1s for comparable selective queries. Wrong defaults at 10k queries/day compound into six-figure annual waste — the same class of problem covered in 7 Iceberg lakehouse cost reduction strategies.

Practical routing rules for analytics:

  • Latency mode — executive dashboards and embedded analytics → fastest engine for selective SQL (often Trino or DuckDB on well-compacted tables).
  • Cost mode — scheduled reports and wide aggregations → cheapest engine within a relaxed SLA (Athena, Spark, StarRocks depending on stack).
  • Throughput mode — peak morning load → spread concurrent BI sessions to avoid single-engine queueing.

Routing only works when table layout makes "cheap" engines viable. Compacting to a few hundred sorted files is what lets DuckDB or Trino beat Spark on interactive SQL. See multi-engine query routing for the full stack view.

Engine cost and latency comparison for the same analytics SQL
Per-query cost and latency vary sharply by engine. After table optimization, route each analytics pattern to the engine that meets SLA at lowest cost.

8. Automate the maintenance loop — or analytics performance will regress

Goal: warehouse-like reliability on open storage without warehouse lock-in.

Every layer above decays without continuous execution. Streaming adds small files; MOR adds deletes; commits add snapshots; new dashboards change optimal sort keys. The correct sequence for analytics tables — and why order matters — is covered in Autonomous Iceberg Table Maintenance:

  1. 1.Expire snapshots — release dereferenced files; shorten planning paths.
  2. 2.Remove orphan files — reclaim storage; avoid compacting dead objects.
  3. 3.Compact data files — binpack or sort based on table role; apply deletes.
  4. 4.Rewrite manifests — align metadata with new layout.
  5. 5.Compute statistics — refresh Puffin / table stats for join-heavy models.
  6. 6.Observe and alert — file count, manifest depth, skew, delete ratio.

Manual Airflow DAGs across hundreds of tables rarely keep pace. A managed Iceberg control plane that sequences these operations per table — event-driven triggers, per-table tuning, and full audit logs — is how platform teams stop firefighting dashboard latency tickets without adding another nightly script. LakeOps implements this loop as a system: it observes which tables degrade between runs, which sort orders are stale relative to current query patterns, and which partitions have drifted from target file size — then acts in the correct sequence, per table, autonomously.

LakeOps Dashboard — lake-wide KPIs and recent maintenance operations
Lake-wide view: 12.4× query acceleration, $1.37M saved, table health tiers (786 tables — 70 critical, 105 warning, 566 healthy), and a Recent Operations log showing compaction, snapshot expiration, and orphan cleanup — evidence the loop is running, not just configured.

Analytics optimization checklist (copy for runbooks)

#ComponentActionExpected analytics impact
1PartitioningMatch spec to top filter columns; evolve when drill-down changes90–99% partition elimination
2File sizing128–512 MB targets by workload; fix small-file driftUp to 9× latency reduction on selective SQL
3Sort / Z-orderAlign layout to WHERE/JOIN telemetry51–95% fewer bytes scanned
4MetadataRewrite manifests; expire snapshots; refresh PuffinSub-second planning
5CompactionEvent-driven; sort on read-heavy factsSustains layers 2–3
6Delete filesMerge MOR deletes on update-heavy tablesRemoves per-query reconciliation tax
7RoutingCost/latency policies per workload classUp to 8× lower cost per query
8AutomationSequenced policies + health alertsPrevents regression between sprints

Layers compound: sorted partitions make routing to Trino/DuckDB viable; routing makes scan reduction financially visible; automation keeps dashboards fast on Monday morning after a weekend of ingest.

Getting started without boiling the ocean

Pick the five tables behind your slowest dashboards. For each: check file count, average size, manifest count, and top three filter columns from query logs. Run orphan cleanup and snapshot expiration first (cheap, immediate storage win), then sort compaction on the worst fact table, then manifest rewrite.

If you operate at lake scale, connect catalogs once and let policies run the sequence above continuously. LakeOps plugs into AWS Glue, REST catalogs (Polaris, Nessie, Gravitino, Lakekeeper), DynamoDB, and S3 Tables in about ten minutes — reads metadata and query telemetry only, no data copy. Manual mode lets you inspect and trigger; autonomous mode executes against your retention and compaction policies with full audit trails. Production results across customer deployments: 12× average query acceleration, up to 80% cost reduction, and 786+ tables managed autonomously across 112+ PB — with every operation logged, reversible, and auditable.

Connect catalogs and start autonomous or manual optimization
Typical rollout: connect catalogs → choose manual or autonomous → maintenance runs continuously with lake-wide observability from day one.
Platform walkthrough — catalog connection, table health analysis, and autonomous optimization for production Iceberg tables.

Related articles

Found this useful? Share it with your team.