Back to blog

Apache Iceberg Cost Optimization in 2026

Your Iceberg lake is overcharging you from four directions at once — storage bloat, query compute waste, compaction overhead, and engineering time. This post breaks down exactly where each dollar goes and how autonomous table management eliminates the waste without touching your pipelines.

LakeOps measured results on real Iceberg workloads: 95% faster compaction, 12x query performance improvement, 80% cost reduction

Iceberg is the right table format. The ecosystem has converged — Snowflake, Databricks, AWS, Google Cloud, and every major query engine read and write it natively. That question is settled.

The cost question is not settled. Most data platform teams that have adopted Iceberg are significantly overpaying for their lake — not because of anything Iceberg does wrong, but because Iceberg gives you the primitives to run a healthy lake and then leaves the maintenance entirely to you. In practice, most organizations do partial maintenance at best: they run compaction on their hottest tables, maybe they have a snapshot expiration job that someone wrote two years ago and is afraid to touch, and orphan files accumulate indefinitely.

This post breaks down the four distinct cost components that compound in an unmaintained Iceberg lake, shows you exactly how to measure each one, and explains how autonomous table management eliminates them without code changes or pipeline disruption.

The four cost buckets in an Iceberg lake

Before diving into solutions, it helps to understand that Iceberg costs come from four distinct sources that interact with each other. Fixing one without the others delivers partial results. The full reduction — we see up to 80% total cost reduction in production — comes from addressing all four simultaneously.

Bucket 1: Object storage costs. This includes the raw bytes you pay to store on S3, GCS, or ADLS. The headline rate looks straightforward, but the actual bill is inflated by files that shouldn't exist at all: expired snapshots that still reference data objects, orphan files from aborted writes and failed compaction runs, and duplicate data from incomplete rewrites. One customer found 120 TB of deletable data in expired snapshots alone — $33K/year in pure waste that had been accumulating for months without anyone noticing.

Bucket 2: Query compute costs. Every time a query engine reads from your lake, it pays per byte scanned and per API call to object storage. A well-maintained Iceberg table with properly-sized files and a good sort order lets engines skip irrelevant data aggressively via predicate pushdown — they read a fraction of the physical bytes. An unmaintained table with thousands of small files and stale statistics forces engines to scan everything. Unsorted tables scan 51% more data per query than sorted equivalents — on every single query, across every engine, indefinitely.

Bucket 3: Compaction compute costs. Maintenance isn't free. Every time you run compaction via Spark, you're paying for JVM startup, executor provisioning, garbage collection overhead, and idle cluster time. The typical Spark-based compaction job is over-provisioned because under-provisioning causes failures, and failures mean 2 AM pages. The compute cost of keeping tables healthy often rivals the compute cost of querying them. A 200 GB binpack compaction on Spark costs roughly $1.54 in compute. The same job on a purpose-built Rust engine costs $0.21 — 86% cheaper, on identical hardware.

Bucket 4: Engineering time costs. Someone wrote those Spark compaction jobs. Someone maintains them, debugs them when they fail, monitors them, and is on-call when they break at 2 AM. As your lake grows from 50 tables to 500 tables, the engineering time to keep maintenance scripts working grows proportionally — but headcount doesn't. This is the cost that never shows up on a cloud bill but often exceeds the infrastructure spend when you factor in fully-loaded engineering time.

Bucket 1: Storage bloat — how it compounds

Storage cost inflation follows a predictable pattern. It starts small and accelerates.

When a Spark job fails mid-write, it leaves partial data files on object storage. When snapshot expiration isn't configured, every snapshot persists forever — and each snapshot holds references to the data files that existed at that point in time, preventing garbage collection. When tables get dropped or renamed, their data stays on disk indefinitely because no one set up orphan file cleanup. After six months of this, a production lake has layers of unreferenced data accumulating at scale.

The numbers from real deployments are striking. One production scan found approximately 200 TB of dead data across 324 tables — roughly 1.8 million orphan files from aborted transactions, failed Spark jobs, and stale tables from departed team members. That was $4,000 per month in storage for data the tables didn't even reference. Another customer had 120 TB removable from expired snapshots alone, representing $33,000 per year in waste. A single cleanup pass reclaimed that storage in 10 minutes.

LakeOps Orphan Files Cleanup results across multiple tables
Recent Operations log showing Remove Orphan Files across the fleet — ice_html5_sdk_events (1m 9s, 13.6 GB reclaimed), ice_desktop_sdk_events (13m 6s, 74.8 GB reclaimed), daily_metrics_reddit_test, cluster_metadata, prod_metrics_per_segmentation_from_ui, daily_metrics_old, and more — all completed with SUCCESS status. Total reclamation in a single fleet sweep: 88+ GB of storage waste eliminated.

The other insidious storage cost is manifest metadata. Every Iceberg snapshot contains a manifest list, which points to manifest files, which point to data files. After months of appends and compaction without manifest optimization, a table might accumulate 200+ manifest files where 30 would suffice. Manifests are not free — they take up storage, and more importantly, every query planner reads every manifest before deciding which data files to scan. At 200+ manifests, query planning overhead is often larger than the actual scan.

Bucket 2: Query compute — the small file tax

The small file problem is the most well-known Iceberg cost driver, but its true scale is routinely underestimated.

Each Parquet file in your lake requires at least one S3 GET request to read its metadata — the file footer containing row group statistics, column encodings, and schema information. For a table with 47,000 files covering 1 TB of data, every query that touches that table makes 47,000 S3 GET requests just for metadata before reading a single row. At $0.0004 per 1,000 GET requests, that adds up — but the bigger cost is compute time spent waiting for those round trips.

The production measurement: 47,000 files on a single table caused a query to take 52 seconds. After compacting to 280 files, the same query ran in 5.8 seconds. That is a 9× improvement — which translates directly to 9× less CPU time per query, and therefore 9× less compute cost per execution. Run that query 10,000 times per day across your fleet and the cost difference is enormous.

LakeOps Insights tab showing HIGH alert for 92 manifest files exceeding threshold
Table Insights for customer_orders. A HIGH alert shows 92 manifest files (threshold: 50) with 43 undersized, severely impacting query performance. A WARNING flags 6 partitions significantly larger than average indicating data skew. A LOW note highlights 3 partitions with small file accumulation.

Sort order matters as much as file count. When data is physically sorted by the columns your queries filter on, query engines can use min/max statistics in Parquet row group metadata to skip entire row groups without reading them. If your queries filter on `event_date` and the data is sorted by `ingestion_time`, every query has to scan every row group looking for matching dates. Sorted layouts cut cumulative scan size by 51% compared to unsorted equivalents — on every query, every day, across every engine.

Sorted data also compresses better. A 1 TB TPC-H Lineitem table compresses to 163 GB when sorted versus 178 GB unsorted — a 9% improvement that compounds into real cost reduction at scale.

Delete files are the hidden third component. Every merge-on-read update or delete operation in Iceberg generates a delete file alongside the original data file. When you query a table with position deletes, the engine has to read both the data files and the delete files and apply the deletions at read time — for every read, on every query, until those deletes are physically compacted away. One production table had 23,433 delete files accumulating over 551 million rows of deleted data. Every query against that table was paying the full cost of reading and filtering that entire delete history.

Bucket 3: Compaction compute — why Spark is the wrong tool

The dominant pattern in Iceberg compaction is Spark-based: a scheduled Airflow DAG spins up a Spark cluster, reads the small files, writes merged output, and commits to the catalog. This works. It also costs 7–10× more than it needs to, and at scale it becomes the single largest controllable cost in your data platform.

The core problem is JVM overhead. Spark's execution model was designed for general-purpose distributed computation — it handles arbitrary DAGs of transformations across heterogeneous workloads. Compaction is none of those things. Compaction is a narrow, IO-bound read-merge-write operation with predictable memory requirements and no complex computation. Running it on Spark means paying for JVM startup time, garbage collection pauses that stall executors at unpredictable intervals, and executor provisioning that has to be over-sized to avoid OOM failures.

The benchmark numbers are clear. On a 200 GB test table (approximately 1 TB uncompressed), binpack compaction takes 1,612 seconds on Apache Spark and costs approximately $1.54. The same job on LakeOps's Rust engine with Apache DataFusion takes 221 seconds and costs $0.21 — 86% cheaper, nearly 8× faster. AWS S3 Tables' built-in compaction takes 6,300 seconds on the same data.

Compaction Duration and Cost benchmarks comparing S3 Tables, Apache Spark, and LakeOps
Compaction Duration and Cost benchmarks: S3 Tables (6,300s), Apache Spark (1,612s), LakeOps binpack (221s), LakeOps sort (780s). LakeOps delivers the fastest compaction at a fraction of the cost across both strategies.

The gap is structural, not marginal. Rust eliminates GC pauses and JVM overhead. Apache DataFusion provides vectorized columnar execution with Arrow. Lock-free parallelism means worker threads never stall waiting on each other. Bounded memory per worker means the engine never OOMs regardless of table size — while Spark routinely OOMs on large tables and requires cluster resizing or job splitting to proceed.

Production evidence: a 1.192 TB table (`balance_snapshots`) caused Spark to run out of memory on the same hardware where LakeOps completed the job in 11 minutes at 1,572 MB/s throughput. A streaming table with 42,633 small files was compacted down to 69 files — 99.8% file reduction — in 138 seconds.

The self-improvement characteristic compounds the cost savings further. On three consecutive runs of the same 1.2 TB table, the compaction engine improved from 22 minutes to 18 minutes to 11 minutes as it learned the workload's access patterns and data distribution. No configuration changes between runs — the engine observes query telemetry and adjusts its optimization strategy automatically.

Bucket 4: Snapshot bloat — the metadata trap

Every write operation in Iceberg — every append, overwrite, delete, or compaction — creates a new snapshot. Snapshots are what make time travel and atomic rollback possible. They are also, without active management, a compounding cost problem.

Each snapshot contains a manifest list, which references a set of manifest files, which reference the actual data files. At 10,000+ snapshots on a single active table, the metadata tree becomes deeply nested. Query planning — the phase where the engine determines which data files to scan — reads every manifest to evaluate column statistics. When manifests number in the hundreds and reference stale or redundant entries, query planning time dominates total query time.

LakeOps Snapshots panel showing 83 snapshots for customer_orders
The Snapshots panel for customer_orders showing 83 snapshots with IDs, timestamps, and operations. LakeOps retention policies run on a configurable schedule, respecting both a time window and minimum snapshot count while staying conflict-aware.

The other snapshot cost is storage. Each expired snapshot that has not been explicitly deleted still holds references to data objects that cannot be garbage collected. The metadata itself accumulates — manifest lists, manifest files, statistics files — and the referenced data files remain on object storage even after the data is logically expired. One production expiration run on a table with 23,183 snapshots deleted 2,928 snapshots and 5,819 files, reclaiming 263.52 MB of manifest data in under 4 minutes. On another table, 22,034 snapshots and 675,510 files were expired, reclaiming 179.49 GB of storage.

Expire Snapshots event detail: 2,928 snapshots deleted, 263.52 MB reclaimed
Event detail for an Expire Snapshots operation: 2,928 snapshots deleted, 5,819 total files removed, 263.52 MB reclaimed, 2,891 manifests and 2,928 manifest lists deleted — completed in 3m 47s with SUCCESS status.

The correct retention policy balances two competing requirements: enough snapshots to support your time-travel and rollback SLAs, and aggressive enough expiration to prevent metadata bloat. For most production workloads, this means a time window of 7–30 days combined with a minimum snapshot count of 5–10. The expiration policy must also be conflict-aware — it should never expire a snapshot that an active reader currently depends on.

The compounding interaction between cost buckets

The four cost buckets do not operate independently. Each one amplifies the others.

Small files inflate query compute costs directly through API call overhead. They also inflate compaction compute costs because the merge operation has more file handles to manage and more footer reads to perform. Small files cause snapshot accumulation to accelerate because each high-frequency streaming write creates its own snapshot, and without aggressive expiration those snapshots pile up along with their manifests — which further increases query planning overhead.

Orphan files inflate storage costs directly but also degrade compaction efficiency. A compaction engine that must distinguish live files from orphan files during its scan has more work to do and may make suboptimal decisions about which files to merge. Expiring snapshots before running orphan cleanup is critical because expiration itself produces newly unreferenced files — if orphan cleanup runs first, it misses the files that expiration will subsequently dereference.

Manifest fragmentation interacts with small files in a particularly damaging way. A table with many small files will also tend to have many manifests, because each manifest entry corresponds to one or more data files. Fixing small files via compaction reduces manifest count over time, but manifest rewriting should run alongside compaction — not only after it — to capture the planning overhead reduction sooner.

The implication: running these operations in isolation yields partial results. Running them as a coordinated, sequenced maintenance loop yields the full 80% reduction.

What autonomous management actually does

The alternative to manual scripts is not just automation — it is a system that understands the relationships between these maintenance operations and sequences them correctly, at the right frequency, for each table in your fleet.

Query-aware compaction. The starting point is continuous analysis of actual query patterns against each table. Which columns appear in WHERE clauses? Which partitions are accessed most frequently? What is the file size distribution? LakeOps builds a per-table model from query telemetry and access heatmaps, then compacts and re-sorts data to match those patterns. Tables that receive high-frequency streaming writes get compacted more aggressively. Tables with stable batch workloads get less frequent but deeper compaction.

LakeOps Optimization tab showing Rewrite Manifests and Rewrite Position Delete Files
The Optimization tab for ice_desktop_sdk_events_v2 showing metadata maintenance operations: Safety Notice (files only removed if unreferenced and exceeding age threshold), Rewrite Manifests (consolidate and optimize manifest files for improved metadata performance), Rewrite Position Delete Files (optimize position delete files to improve query read performance), and Compute Table Statistics (Puffin) with column selector. Each toggle is independently configurable.

Snapshot lifecycle management with conflict awareness. Retention policies are straightforward to define — time window plus minimum count — but safe execution is not. A naive expiration run that removes a snapshot an active reader depends on causes data errors. LakeOps schedules expiration with full awareness of concurrent readers, and coordinates snapshot expiration before orphan file cleanup so the full chain of waste removal executes in the correct order.

Orphan file detection and cleanup. The default age threshold of 7 days ensures that files from in-progress writes are never accidentally removed. Scope can be set at the catalog level, namespace level, or per-table, with include and exclude patterns for fine-grained control. The impact at fleet scale is significant: 59,831 orphan files removing 74.8 GB from a single table in 13 minutes; 13.6 GB from another in under 90 seconds.

Manifest and metadata optimization. Rewrite Manifests consolidates fragmented manifest files so query planners open fewer — cutting planning time from seconds to milliseconds on large tables. Rewrite Position Deletes merges the delete files generated by merge-on-read operations. Compute Statistics generates Puffin files with column-level statistics that enable engines to skip row groups more aggressively. Each operation has its own configurable schedule and auto-trigger threshold.

Rewrite Manifests, Position Delete Files, and Puffin Statistics controls
The Optimization tab for ice_desktop_sdk_events_v2 showing Rewrite Manifests (toggle enabled — consolidate and optimize manifest files), Rewrite Position Delete Files, and Compute Table Statistics (Puffin) with column selector. The full LakeOps sidebar is visible with Catalogs, Explore, Tables, Insights, Events, Recommendations, Policies, and Routing navigation. In Auto mode, LakeOps triggers rewrites when metadata drift exceeds configurable thresholds.

Intelligent query routing. Storage optimization reduces the cost per query, but routing reduces the cost per query further by ensuring it runs on the cheapest appropriate engine. DuckDB for point lookups and narrow aggregations costs a fraction of Snowflake for the same workload. LakeOps profiles query shapes, partition access patterns, and engine cost profiles to route each workload to the cheapest compute path that meets its latency target. Production result: 65% compute cost reduction, 47% storage reduction across 100 tables, achieved through routing and compaction working in concert.

Organization-wide policies. At 500+ tables, no individual can manage maintenance decisions manually. The policy engine defines rules at catalog, namespace, or table level — and the specificity hierarchy ensures table-level overrides take precedence over namespace defaults, which override catalog-wide baselines. Operations are staggered off-peak and coordinated to avoid conflicts.

Policies dashboard showing active maintenance policies across catalogs
The Policies dashboard showing active maintenance policies: compaction for critical tables, orphan file removal every 7 days, snapshot expiration for high-write namespaces, and configuration policies. Each shows its status toggle, type, next run, last run, and last updated timestamp.

Creating a policy starts with selecting the maintenance operation to automate — Expire Snapshots, Remove Orphan Files, Compact Data Files, Rewrite Manifests, Rewrite Position Delete Files, or Rewrite Equality Delete Files — then configuring its scope, schedule, and parameters.

Select Maintenance Operation — six policy types available for automation
The Select Maintenance Operation wizard showing six automatable operations: Expire Snapshots (automatically remove old snapshots based on retention policies), Remove Orphan Files (clean up orphaned data files no longer referenced), Compact Data Files (merge small files into larger ones for query performance), Rewrite Manifests (optimize manifest files for metadata performance), Rewrite Position Delete Files, and Rewrite Equality Delete Files.

Layout simulations: optimizing before rewriting

Compaction and sort-order changes can deliver significant improvements — or make things worse if the wrong sort order is chosen. A sort key that accelerates one query pattern may slow down another. Before committing to a potentially expensive rewrite of a large table, it helps to know what the change will actually do.

Layout Simulations run proposed changes on a real Iceberg branch created from the current snapshot. The simulation applies the layout change to the branch, replays actual query patterns against it, and measures the impact on file sizes, scan efficiency, and planning overhead. The branch is discarded after measurement — no production data is touched.

Layout Simulations showing field access frequency and diff against baseline
Layout Simulations for customer_orders: three configurations tested — clusterByOrderDate, cluster.order_type.by.status, cluster.insert-time-line — with field access frequency analysis across SELECT, FILTER, and JOIN. The Layout Customization Diff compares each simulation against the baseline.

The field access frequency analysis — how often each column appears in SELECT, FILTER, and JOIN operations — is the foundation for choosing the right sort order. Running multiple simulations with different configurations before committing to a rewrite turns a guess into a measurement.

Full-stack visibility: knowing where your money goes

You cannot optimize what you cannot see. Most organizations operate with fragmented visibility: storage metrics in the cloud console, query metrics in each engine's UI, and Iceberg metadata accessible only through ad-hoc Spark commands. By the time someone correlates a latency regression with a compaction backlog, the problem has been compounding for weeks.

A unified observability layer changes the feedback loop. The table-level Insights tab surfaces problems at four severity levels before users notice them — a HIGH alert for excessive manifests, a WARNING for partition data skew, a LOW note for small file accumulation. One-click remediation lets you act on the insight immediately rather than filing a ticket.

Per-table metrics give you the inputs you need to understand cost drivers: total records, total size, stale file count, active data files, average file size, position deletes, equality deletes, records distribution over the last 60 snapshots. When average file size drops below your target threshold, that is the early indicator that a streaming table is accumulating small files and compaction should be triggered before query latency degrades.

LakeOps table Metrics tab for customer_orders
The Metrics tab for customer_orders: 9.5B total records, 379.83 GB total size, 0 stale files, 3.0K active data files, average file size 129.35 MB, zero position or equality deletes. The records distribution chart shows volume across the last 60 snapshots.

The Events tab provides a complete audit trail per table — every compaction, snapshot expiration, orphan removal, and manifest rewrite with its duration, impact, and status. This is what an SRE needs to answer "why did this query suddenly get slow" — a complete history of what the maintenance system did and when.

Table operations history showing events with file counts and sizes
Table Operations history for ice_desktop_sdk_events: a Remove Orphan Files operation (59,831 files, 74.81 GB freed, 13m 6.9s) and a series of Expire Snapshots runs — the largest deleting 22,034 snapshots and 675,510 files, reclaiming 179.49 GB in 32 minutes.

The executive dashboard rolls up the full impact across your lake: total operations run, average query acceleration, cumulative cost savings, CPU and storage reduction percentages, and total data optimized. One deployment measured $1,374,672 in cost savings across 90 days, 12.4× average query acceleration, and −76% CPU and storage reduction across 786 tables.

Getting to 80% cost reduction: what the math looks like

The up to 80% total cost reduction is not from any single optimization. It is the compound effect of the maintenance layers working together continuously.

Storage costs drop through three mechanisms. Orphan cleanup removes files that are consuming storage for no purpose. Snapshot expiration releases the data files that expired snapshots were preventing from being garbage-collected. Compaction consolidates thousands of small files into optimally-sized ones — and since sorted data compresses 9% better, the total bytes stored after compaction is smaller than the pre-compaction uncompressed equivalent.

Compute costs drop from three directions. The Rust compaction engine replaces JVM-based Spark clusters, reducing maintenance compute cost by 86%. Sort-order optimization means every query scans 51% less data — and less data scanned means less compute consumed across Trino, Spark, Snowflake, Athena, and every other engine. Intelligent query routing sends each query to the cheapest appropriate engine instead of defaulting to the most expensive one.

Engineering costs drop to near-zero as policy-based maintenance replaces script maintenance. Policies are defined once and execute continuously. Every action is logged, auditable, and reversible — so the on-call burden disappears along with the maintenance scripts.

Starting the cleanup: what to do first

For most production lakes, the highest-ROI starting point is orphan file cleanup followed by snapshot expiration. These two operations release storage that is being consumed with zero value — and unlike compaction, they do not require choosing sort orders or tuning parameters. You scan, you identify the waste, you remove it.

After the initial cleanup pass, the next priority is establishing retention policies and compaction schedules across your highest-traffic tables. Start with the tables that have the most complaints about latency — those are typically your highest-cost tables from a compute perspective and will show the clearest improvement signal.

The full closed-loop — compaction, snapshots, orphans, manifests, routing — runs continuously thereafter, keeping each table healthy as a baseline rather than as a periodic intervention.

LakeOps connects to your existing catalogs and object storage in approximately 10 minutes. Supported catalog types include AWS Glue, DynamoDB-backed catalogs, REST catalogs (Polaris, Gravitino, Nessie, Lakekeeper), and S3 Tables. The onboarding flow connects your catalog, analyzes table health from metadata and query patterns, enables automated optimization, and starts monitoring — with no agents to install, no data to move, and no pipelines to modify.

LakeOps connected catalogs and table explorer
LakeOps with 4 connected catalogs visible in the sidebar tree: ecommerce_prod, warehouse_analytics, marketing_events, and data_science_sandbox. After connecting via Glue, DynamoDB, REST (Polaris/Lakekeeper), or S3 Tables, the Explore view lets you browse tables, inspect schemas, and access all optimization tabs — Info, Snapshots, Metrics, Optimization, Insights, Simulations, Events, Policies, Query, and Settings.

If your lake is still running cron-based Spark compaction and manually expiring snapshots, the tooling has caught up with the problem. Apache Iceberg's table format is production-ready. Your maintenance layer should be too — and it should cost a fraction of what your current approach spends to keep the lake healthy.

Related articles

Found this useful? Share it with your team.