
The data lakehouse promised the best of both worlds: the flexibility and low cost of a data lake with the reliability and performance of a data warehouse. Apache Iceberg delivered on that promise at the format level — ACID transactions, schema evolution, hidden partitioning, time travel, and snapshot isolation, all running on commodity object storage like S3, GCS, or ADLS.
But there is a gap between what Iceberg enables and what it automates. The format provides the primitives. It does not compact your files, expire your snapshots, clean up orphans, consolidate your manifests, or enforce governance policies. Every one of those responsibilities falls on your data platform team — and at production scale, the operational burden grows faster than most teams anticipate.
In the 2025 State of the Apache Iceberg Ecosystem survey (fielded January 2026), most respondents managing thousands to tens of thousands of Iceberg tables reported relying on custom scripts and internal tooling to handle compaction, metadata growth, snapshot lifecycle, and retention enforcement. That approach works at small scale. It breaks down at the scale where Iceberg's value is greatest.
This article breaks down every operational pillar of a production Iceberg lakehouse, explains what happens when each one is neglected, and walks through how LakeOps delivers autonomous optimization across all of them.
What makes a lakehouse different from a data lake
A data lake is object storage with data in it. Parquet files on S3, organized by convention, queried by whatever engine happens to be pointed at them. There is no transactional layer, no schema enforcement, and no metadata contract between writers and readers.
A data lakehouse adds structure. Apache Iceberg provides the structural layer: every write creates an immutable snapshot that references manifest files, which in turn reference the actual data files. This metadata chain gives you ACID guarantees across concurrent readers and writers, schema evolution without rewriting data, partition evolution without breaking queries, and time travel to any previous state.
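Time travel, for instance, is plain SQL in any engine that speaks Iceberg. A minimal Spark sketch, assuming a session already configured with an Iceberg catalog named `lake`; the table name is a hypothetical placeholder:

```python
# Querying current and historical table states via Spark SQL.
# The catalog ("lake") and table names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Current state of the table.
spark.sql("SELECT count(*) FROM lake.sales.orders").show()

# The table exactly as it existed at an earlier point in time.
spark.sql("""
    SELECT count(*) FROM lake.sales.orders
    TIMESTAMP AS OF '2026-01-01 00:00:00'
""").show()

# Every snapshot in the metadata chain is inspectable as a table.
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM lake.sales.orders.snapshots
""").show()
```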
The lakehouse model decouples storage from compute entirely. Your data stays on S3, GCS, or ADLS — in your account, under your control. Compute engines like Trino, Spark, Snowflake, Athena, DuckDB, and Flink connect to the same Iceberg tables through a shared catalog (AWS Glue, Polaris, Gravitino, Nessie, Lakekeeper, or any REST-compliant catalog). Different teams use different engines for different workloads, all against the same data.
This architecture eliminates vendor lock-in and dramatically reduces storage costs. But it introduces an operational contract that closed platforms used to handle for you: somebody has to keep the tables healthy.
Why Iceberg tables degrade without management
The degradation pattern is predictable and well-documented. A streaming pipeline appends rows every few minutes, creating thousands of small Parquet files per partition. Each commit adds new manifest files to track those data files. Snapshots accumulate because nobody configured expiration. Failed Spark jobs leave orphan files on storage that no snapshot references but that still cost money every month.
The symptoms compound from multiple directions. Query performance degrades — engines open hundreds of small files and fragmented manifests instead of a few optimally-sized ones, and because files are unsorted relative to actual query patterns, engines scan far more data than necessary for every read. A query that returned in 2 seconds last quarter now takes 15. Storage costs balloon — orphan files, expired snapshots still referencing data, and duplicate files from incomplete compaction runs can inflate the storage bill significantly (one documented case saw a 350 TB lake shrink to 230 TB — 34% — after removing stale data). Planning overhead grows — query engines read every manifest to build an execution plan, and at 200+ manifests per table, planning dominates execution time. Engineering time evaporates — someone has to write, maintain, debug, and be on-call for the scripts that hold it together.
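All of these symptoms are visible in Iceberg's own metadata tables long before a dashboard times out. A rough diagnostic sketch in PySpark, with hypothetical catalog and table names:

```python
# Quantifying small-file buildup, snapshot depth, and manifest
# fragmentation from Iceberg's metadata tables. Names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
TABLE = "lake.events.clicks"

# File count and average size: thousands of files far below the
# 128-512 MB sweet spot indicate a compaction backlog.
spark.sql(f"""
    SELECT count(*) AS data_files,
           round(avg(file_size_in_bytes) / 1048576, 1) AS avg_mb
    FROM {TABLE}.files
""").show()

# Snapshot depth: unbounded growth means expiration was never configured.
spark.sql(f"SELECT count(*) AS snapshots FROM {TABLE}.snapshots").show()

# Manifest count: hundreds per table inflate query planning time.
spark.sql(f"SELECT count(*) AS manifests FROM {TABLE}.manifests").show()
```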
At 50 tables, you can manage this manually. At 200 tables across multiple catalogs and engines, the scripts become the problem — brittle, uncoordinated, and blind to the interactions between operations. At 500+ tables, manual maintenance is not a viable operating model.
The pillars of a managed Iceberg lakehouse
A production Iceberg lakehouse needs a set of operational capabilities working together — from observability and compaction through lifecycle management and governance to multi-engine routing and AI readiness. Each pillar depends on the others: observability feeds compaction decisions, compaction output determines what snapshots can be expired, snapshot expiration produces orphan files for cleanup, and policies codify all of it at fleet scale. Neglecting any single pillar creates cascading degradation across the rest.
1. Lake-wide observability
You cannot manage what you cannot see. In a multi-engine lakehouse, observability is fragmented by default — storage metrics live in the cloud console, query metrics sit inside each engine's UI, and Iceberg metadata is accessible only through Spark shell commands or catalog APIs. By the time someone correlates a latency regression with a compaction backlog, the problem has been compounding for weeks.
A managed lakehouse needs unified telemetry across every layer: table structure (file counts, file sizes, partition health), metadata state (manifest count, snapshot depth, statistics freshness), query performance (latency, throughput, error rates per engine), and operational history (every compaction, expiration, and cleanup with its duration and impact).
LakeOps provides this in a single dashboard. The executive view shows total operations run, average query acceleration, cumulative cost savings, CPU and storage reduction, and total data optimized. Below, health cards classify every table as Critical, Warning, or Healthy based on file count, file size distribution, manifest fragmentation, snapshot depth, and orphan file accumulation.

For table-level investigation, the Explore view provides per-table metrics: total records, total size, active data files, average file size versus the target, and a records distribution chart over the last 60 snapshots. Position deletes, equality deletes, stale files — everything needed to diagnose structural health before users complain.

Drilling deeper, the Metrics tab exposes the time-series signals that drive autonomous optimization decisions: records distribution over time (showing how data files are added, deleted, and rewritten across snapshots), file size distribution (a healthy table clusters around the 128–512 MB target range), and table size growth trends over the last 30 days. These are the signals LakeOps uses to determine when compaction is needed and whether the current sort order still matches actual query patterns.
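The same signals can be pulled programmatically from any Iceberg table. A sketch using PyIceberg's inspect API, assuming a recent PyIceberg release with a configured catalog; the catalog name and table identifier are hypothetical:

```python
# Computing the file-size distribution that drives compaction
# decisions, via PyIceberg's metadata inspection API.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("lake")  # resolved from .pyiceberg.yaml or env vars
table = catalog.load_table("events.clicks")

files = table.inspect.files().to_pandas()  # requires pandas
sizes_mb = files["file_size_in_bytes"] / (1024 * 1024)

print(f"data files:  {len(files)}")
print(f"avg size:    {sizes_mb.mean():.1f} MB (target: 128-512 MB)")
print(f"under 16 MB: {(sizes_mb < 16).sum()} files")

snapshots = table.inspect.snapshots().to_pandas()
print(f"snapshots:   {len(snapshots)}")
```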

The Insights engine surfaces problems at four severity levels — CRITICAL, HIGH, WARNING, LOW — before users notice them. Each insight links directly to the affected table and the specific remediation action. This continuous health scoring is what enables the autonomous optimization described in the sections below: you cannot automate what you cannot measure.

2. Query-aware compaction
Compaction is the single highest-impact optimization in an Iceberg lakehouse — and how you compact matters far more than whether you compact at all. Most compaction tools treat it as a file-sizing exercise: merge small files into bigger ones and move on. That reduces file count, but it leaves the data physically unsorted, which means every query still scans far more bytes than necessary.
LakeOps takes a fundamentally different approach. The compaction engine observes your actual query patterns — which columns appear in WHERE clauses, which partitions are accessed most frequently, which tables serve latency-sensitive dashboards versus batch ETL — and uses that telemetry to determine how data files should be physically organized. When a dashboard filters on `event_date` and `region`, LakeOps sorts the data on those columns so engines can skip entire row groups using min/max statistics. A query that scanned gigabytes against unsorted files scans megabytes afterward. The result: every query runs faster and consumes less CPU, across every engine, across every consumer hitting those tables.
This is the difference between generic compaction and query-aware compaction. Binpack reduces file count. Sort-order compaction driven by observed access patterns reduces the amount of data every query touches — and that reduction translates directly into lower compute costs on Trino, Spark, Snowflake, Athena, and every other engine reading those tables. When thousands of queries per day each scan less data, the cumulative CPU and cost savings are substantial.
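In terms of Iceberg's built-in primitives, the technique corresponds to a sort-strategy rewrite. A sketch using Iceberg's documented `rewrite_data_files` Spark procedure; this illustrates the mechanism, not LakeOps' implementation, and the names and thresholds are hypothetical:

```python
# Sort-order compaction via Iceberg's rewrite_data_files procedure.
# LakeOps performs the equivalent rewrite in its own engine; this
# only demonstrates the underlying technique. Names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CALL lake.system.rewrite_data_files(
        table      => 'events.clicks',
        strategy   => 'sort',
        sort_order => 'event_date, region',
        options    => map(
            'target-file-size-bytes', '536870912',  -- 512 MB target
            'min-input-files', '5'
        )
    )
""").show()
```

Plain binpack compaction would use `strategy => 'binpack'` and skip the sort; the sort order is what turns a file-sizing exercise into data skipping.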
The engine itself is built in Rust with Apache DataFusion — not JVM-based Spark. No garbage collection pauses. No executor provisioning. No idle cluster costs. In published benchmarks (March 2026) on 10 production tables totaling 5.5 TB: 101,223 files consolidated to 19,170 (81.1% reduction), peak throughput of 2,522 MB/s, and on one streaming table, up to 99.8% file-count reduction (tens of thousands of small files down to dozens). On a ~1.2 TB table, Apache Spark failed with an out-of-memory error while LakeOps completed compaction in about eleven minutes on comparable hardware.

Every compaction commits atomically using Iceberg's optimistic concurrency control, without blocking concurrent reads or writes. And because the engine continuously learns — adapting sort order as query patterns shift over time — the performance gains compound rather than degrade.
3. Snapshot lifecycle management
Every write operation in Iceberg creates a new snapshot. Snapshots are what make time travel and rollback possible — but unbounded accumulation is expensive. Each snapshot references a manifest list, which references manifest files, which reference data files. At thousands of snapshots on a single table, the metadata tree becomes so deep that query planning overhead dominates execution time. And every expired-but-undeleted snapshot still references data files on storage, preventing them from being reclaimed.
LakeOps manages the full snapshot lifecycle. The Snapshots panel lists every snapshot with its ID, timestamp, operation type, and references. You can tag a snapshot, create a branch from it, roll back the table, or set it as the current version — all from the UI. Retention policies run on configurable cron schedules and respect both a time window (e.g., 30 days) and a minimum snapshot count to preserve time-travel capability. Expiration is conflict-aware: it never removes a snapshot that an active reader depends on.
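The equivalent operation in open-source terms is Iceberg's `expire_snapshots` procedure, which accepts both constraints directly. A sketch with hypothetical names and retention values:

```python
# Snapshot expiration honoring both a time window and a minimum
# retained count. LakeOps schedules the equivalent via retention
# policies; the table name and values here are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    CALL lake.system.expire_snapshots(
        table       => 'events.clicks',
        older_than  => TIMESTAMP '2026-02-01 00:00:00',
        retain_last => 30
    )
""").show()
```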

The impact at scale is dramatic. A single expiration run on one production table deleted 2,928 snapshots and 5,819 files, reclaiming 263.52 MB of manifest data in under 4 minutes. On another table, 22,034 snapshots and 675,510 files were expired, reclaiming 179.49 GB of storage in a single pass. These are not one-time savings — they recur continuously as new snapshots accumulate.

4. Manifest and metadata optimization
Iceberg's manifest files are the index layer between snapshots and data files. They store column-level statistics (min/max, null counts) that enable partition pruning and data skipping. After many compaction and append cycles, manifests fragment — a table might have 200 manifest files where 30 would suffice. The cost is invisible until it becomes the bottleneck: every query opens every manifest, reads its statistics, and evaluates which data files to scan. At 200+ manifests, planning overhead dominates execution.
LakeOps automates three distinct metadata operations. Manifest consolidation merges fragmented manifests so query planners open fewer files — reducing planning time from seconds to milliseconds on large tables. Position delete optimization consolidates small delete files generated by merge-on-read operations, directly improving read performance on tables with frequent updates. Puffin statistics computation generates column-level statistics (distinct values, min/max ranges, null ratios) that enable engines to skip row groups more aggressively.
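The first two operations have direct counterparts in Iceberg's Spark procedures (the position-delete rewrite requires a recent Iceberg release). A sketch with hypothetical names:

```python
# Manifest consolidation and position-delete cleanup via Iceberg's
# documented procedures; LakeOps automates the equivalent rewrites.
# Catalog and table names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Merge fragmented manifests so planners open fewer metadata files.
spark.sql("CALL lake.system.rewrite_manifests(table => 'events.clicks')").show()

# Consolidate small position-delete files produced by merge-on-read.
spark.sql(
    "CALL lake.system.rewrite_position_delete_files(table => 'events.clicks')"
).show()
```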

The Insights engine flags manifest fragmentation automatically — a HIGH alert fires when manifest count exceeds a configurable threshold (default: 50 per snapshot). In Auto mode, LakeOps triggers rewrites when metadata drift exceeds these thresholds, without human intervention.

5. Orphan file cleanup
Orphan files are data objects on S3, GCS, or ADLS that no live Iceberg snapshot references. They accumulate silently from aborted writes, failed Spark jobs, interrupted compaction runs, and incomplete table drops. They serve no analytical purpose, but object storage charges per byte regardless. In production lakes, orphan files routinely account for a significant portion of the storage bill.
The challenge is safety. A naive cleanup that compares storage against current metadata can accidentally delete files from in-flight jobs. LakeOps solves this with a policy-based approach: an age threshold (default: 7 days) ensures only files unreferenced for at least that long are candidates for removal. Cleanup schedules coordinate with snapshot expiration — since expiration itself can produce newly unreferenced files that should be captured in the same sweep.
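Iceberg ships the same safety valve in its `remove_orphan_files` procedure. A sketch applying the 7-day threshold described above, with a dry run first; the table name is hypothetical:

```python
# Orphan removal with an age threshold and a dry run.
# The 7-day window mirrors the default described above.
from datetime import datetime, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
cutoff = datetime.now() - timedelta(days=7)

# dry_run lists deletion candidates without touching storage;
# flip to false once the candidate list has been reviewed.
spark.sql(f"""
    CALL lake.system.remove_orphan_files(
        table      => 'events.clicks',
        older_than => TIMESTAMP '{cutoff:%Y-%m-%d %H:%M:%S}',
        dry_run    => true
    )
""").show()
```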

In one production run, orphan cleanup removed 59,831 files (74.8 GB) from a single table in 13 minutes. Across the fleet, multiple tables were cleaned in under a minute each — all tracked with per-operation status, duration, and impact.

6. Organization-wide policies and governance
Individual table optimization is necessary but insufficient. At fleet scale — hundreds of tables, multiple catalogs, teams with different SLAs — maintenance has to be codified into policies that enforce consistent standards while allowing targeted overrides.
LakeOps provides a centralized Policies dashboard where every policy is listed with its status toggle, type, schedule, and scope. Six policy types are available: Expire Snapshots, Remove Orphan Files, Compact Data Files, Rewrite Manifests, Rewrite Position Delete Files, and Rewrite Equality Delete Files.

Scope resolution follows a specificity hierarchy: a table-level policy overrides a namespace-level default, which overrides a catalog-wide baseline. All operations are conflict-aware — scheduling respects concurrent writers so maintenance never blocks production pipelines. The recommended practice is to stagger heavy operations off-peak and schedule snapshot expiration before orphan cleanup so the full chain of waste removal executes in the correct order.
The policy engine is what turns individual optimizations into a self-managing system. Define retention windows, compaction thresholds, file size targets, and manifest depth limits once — they apply across every table in every catalog. Every operation is logged, auditable, and reversible.
7. Multi-engine query routing
Production lakehouses rarely use a single engine. A typical deployment has Trino for interactive analytics, Snowflake for BI dashboards, Athena for ad-hoc exploration, DuckDB for local development, and Spark for batch ETL. Without a routing layer, each team picks the engine they know — regardless of whether it is the cheapest or fastest for that particular query shape.
LakeOps connects all engines to a unified routing layer that makes per-query decisions based on cost ceilings, latency targets, table health, and historical performance patterns. Three routing strategies — Cost (cheapest within latency bounds), Latency (fastest for the query shape), and Throughput (load-balanced across available capacity) — can be applied per routing group.
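To make the three strategies concrete, here is a purely illustrative decision function. It is not LakeOps' implementation; every engine name, cost, and latency estimate in it is invented for the example:

```python
# Toy per-query routing by strategy. Illustrative only; the numbers
# and engine names are invented, not measured.
from dataclasses import dataclass

@dataclass
class Engine:
    name: str
    est_cost_usd: float    # predicted cost for this query shape
    est_latency_ms: float  # predicted latency for this query shape
    load: float            # current utilization, 0.0-1.0

def route(engines: list[Engine], strategy: str, latency_budget_ms: float) -> Engine:
    if strategy == "cost":
        # Cheapest engine that still meets the latency bound.
        eligible = [e for e in engines if e.est_latency_ms <= latency_budget_ms]
        return min(eligible or engines, key=lambda e: e.est_cost_usd)
    if strategy == "latency":
        return min(engines, key=lambda e: e.est_latency_ms)
    # "throughput": spread work toward the least-loaded engine.
    return min(engines, key=lambda e: e.load)

pool = [
    Engine("trino", 0.004, 800, 0.6),
    Engine("athena", 0.002, 2400, 0.2),
    Engine("duckdb", 0.000, 300, 0.9),
]
print(route(pool, "cost", latency_budget_ms=1000).name)  # -> duckdb
```

In the real system these estimates come from historical performance patterns per query shape, not static numbers.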

Routing Endpoints give each application or agent a stable URL (e.g., `e1fa3c3c.lakeops.dev`), a defined engine pool, query-type scope, and priority level. An analytics endpoint, for example, routes SELECT and AGGREGATE queries to low-latency engines, while an ETL endpoint routes INSERT and MERGE statements to cost-optimized compute. Each endpoint is independently configurable and inherits the full policy stack.

Side-by-side engine comparison makes it straightforward to benchmark alternatives against the same Iceberg tables — comparing success rates, average runtimes, cost per query, and data scanned across Trino, Snowflake, DuckDB, and others.

8. Agentic AI readiness
AI agents are becoming primary consumers of lakehouse data. They issue SQL iteratively inside tool-use loops, generate unpredictable query shapes, and require sub-second latency from tables that were designed for batch workloads. This creates infrastructure requirements that most lakehouses are not prepared for: schema discovery so agents know what data exists, query guardrails that enforce cost ceilings and prevent PII leakage at machine speed, low-latency reads against well-compacted files, and stable routing endpoints that agents can call without per-session configuration.
LakeOps provides an agent-native MCP (Model Context Protocol) interface with schema-aware tools, async query support, and wire compatibility with PostgreSQL, MySQL, and Arrow Flight. Layered guardrails — ReadOnly, CostEstimate, PIIMask, and HumanApproval — ensure agents operate within safety boundaries. An intelligent routing stack (Adaptive + LLM + Semantic routers, with 0ms cached decisions) selects the optimal engine per query shape without agent-side configuration.
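Wire compatibility means an agent needs no custom SDK. A sketch of a connection over the PostgreSQL protocol, where the endpoint, credentials, and schema are all hypothetical placeholders:

```python
# Agent-side connection over the PostgreSQL wire protocol described
# above. Host, credentials, and table names are hypothetical.
import psycopg2

conn = psycopg2.connect(
    host="e1fa3c3c.lakeops.dev",  # a routing endpoint, as shown earlier
    port=5432,
    user="agent-analytics",
    password="...",
    dbname="lake",
)
with conn.cursor() as cur:
    # The routing layer picks the engine; the agent just issues SQL.
    cur.execute("SELECT region, count(*) FROM events.clicks GROUP BY region")
    for row in cur.fetchall():
        print(row)
```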
The system forms a closed loop: the compaction engine continuously reshapes file layout based on actual query patterns — including agent queries — so the lake self-optimizes as AI adoption scales. Hot tables are compacted first. Partitions adapt to access patterns. Routing weights adjust as data quality improves. It is the infrastructure layer that makes AI agents production-safe on your data lake.
Layout simulations: test before you commit
Choosing the wrong sort order or partitioning scheme can make things worse, not better. LakeOps addresses this with Layout Simulations: a safe, branch-based testing environment that evaluates layout changes before applying them to production. Simulations run on a real Iceberg branch created from the latest snapshot — layout changes are applied to the branch, query patterns are replayed, and results are compared against the current baseline. The branch is discarded afterward; no production data is modified.
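The mechanics rest on Iceberg's native branching, which recent Iceberg releases expose as Spark SQL DDL. A sketch of the underlying primitive, not the LakeOps simulation workflow itself; branch and table names are hypothetical:

```python
# Iceberg branch mechanics: create a branch from the latest snapshot,
# query against it, and discard it. Names are hypothetical; requires
# the Iceberg Spark SQL extensions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Branch from the latest snapshot; production readers are unaffected.
spark.sql("ALTER TABLE lake.events.clicks CREATE BRANCH layout_sim")

# Replay a representative query against the branch.
spark.sql("""
    SELECT count(*) FROM lake.events.clicks VERSION AS OF 'layout_sim'
    WHERE event_date = DATE '2026-03-01'
""").show()

# Discard the branch once the comparison is done.
spark.sql("ALTER TABLE lake.events.clicks DROP BRANCH layout_sim")
```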

Getting connected in minutes
LakeOps connects to your existing catalogs and object storage in approximately 10 minutes. There are no agents to install, no data to move, and no pipelines to modify. Your data stays in your account.

Supported catalog types include AWS Glue, DynamoDB-backed catalogs, REST catalogs (Polaris, Gravitino, Nessie, Lakekeeper), and S3 Tables. The onboarding flow: connect your catalog and storage, let the platform analyze table health from metadata and query patterns, enable automated optimization, and start monitoring. No vendor lock-in, no infrastructure changes, and no data leaves your environment.
Where the savings come from
A reduction of up to 80% in CPU and storage costs does not come from any single optimization. It is the compound effect of every layer working in the right sequence.
Storage costs drop because orphan files are continuously removed, expired snapshots release their referenced data, and compaction consolidates thousands of small files into optimally-sized ones — eliminating redundant storage and the API calls that come with it.
Compute costs drop from three directions. First, query-aware sort-order compaction reorganizes data by actual access patterns, so every query — across every engine — scans dramatically less data; less data scanned means less CPU consumed per query, and that saving multiplies across thousands of queries per day. Second, the Rust compaction engine replaces JVM-based Spark clusters, eliminating executor provisioning, garbage collection overhead, and idle cluster costs for the maintenance work itself. Third, intelligent query routing sends each query to the cheapest appropriate engine instead of a default.
Query latency improves by up to 12x through the combination of query-pattern-driven sort compaction (predicate pushdown skips entire row groups), file consolidation (fewer, larger files), manifest consolidation (faster planning), and Puffin statistics (more aggressive pruning). The improvement is measured across engines — Trino, Spark, Snowflake — after compaction and layout optimization.
Orchestrating operations in the right sequence adds further efficiency. Snapshot expiration runs before orphan cleanup so newly unreferenced files are captured in the same sweep. Compaction runs after expiration so it does not waste cycles merging files about to be garbage-collected. Manifest optimization runs after compaction so it operates on the final file set. Each operation's output becomes the next one's clean input.
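Expressed as a naive single-table script against Iceberg's Spark procedures, the ordering looks like this; LakeOps runs the equivalent chain fleet-wide with conflict awareness, and the names here are hypothetical:

```python
# The maintenance chain in its correct order: expire, sweep orphans,
# compact, then consolidate manifests. Names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
T = "events.clicks"

# 1. Expiration releases data files held only by old snapshots.
spark.sql(f"CALL lake.system.expire_snapshots(table => '{T}', retain_last => 30)")

# 2. Orphan cleanup sweeps the newly unreferenced files as well.
spark.sql(f"CALL lake.system.remove_orphan_files(table => '{T}')")

# 3. Compaction merges only the files that survived garbage collection.
spark.sql(
    f"CALL lake.system.rewrite_data_files(table => '{T}', "
    f"strategy => 'sort', sort_order => 'event_date')"
)

# 4. Manifest rewrite indexes the final file set.
spark.sql(f"CALL lake.system.rewrite_manifests(table => '{T}')")
```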
The operational reality
The Iceberg ecosystem is mature. The format works. Multi-engine support is broad and improving. The hard part is no longer choosing a table format or picking an engine — it is running the lakehouse in production: keeping hundreds of tables healthy, maintaining observability across engines, enforcing governance without a monolithic control point, and preparing infrastructure for AI agents that are already reshaping how data is consumed.
That operational layer is what separates a data lake from a managed lakehouse. It is the difference between Iceberg tables that degrade silently and Iceberg tables that self-optimize continuously. And it is exactly what LakeOps was built for: autonomous management for every table, every catalog, and every engine — from a single control plane.


