
Apache Iceberg is the de facto standard for open lakehouse table formats. Snowflake, Databricks, AWS, and every major query engine read and write Iceberg natively. The format question is settled — but the cost question is wide open.
Most production lakehouses overspend by 60–80%. Not because Iceberg is expensive, but because it exposes maintenance primitives — compaction, snapshot management, orphan cleanup — and leaves execution to you. Without continuous optimization, costs compound silently across storage, compute, and engineering time.
This article covers seven strategies that reduce Iceberg lakehouse costs by 60–80% in production. Each strategy compounds with the others — and all seven are capabilities of LakeOps, the autonomous control plane for Apache Iceberg.
1. Deploy an autonomous control plane
The cost problem: Every optimization in this list requires ongoing execution. Compaction after every write batch. Snapshot expiration on schedule. Orphan scans weekly. Sort orders evolving with query patterns. Without orchestration, each becomes a maintenance burden that degrades as team priorities shift — and costs creep back.
The solution: Replace cron jobs, Airflow DAGs, and manual scripts with a system that observes table state and acts on it continuously. An autonomous control plane sequences operations intelligently: expire snapshots → remove orphans → compact → optimize manifests. This ensures compaction never wastes CPU rewriting files about to be deleted.
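The ordering above can be sketched in a few lines. This is a minimal illustration, not LakeOps code: the `MaintenanceOp` enum and `plan_maintenance` helper are hypothetical names, and the only thing taken from the text is the order of the four operations.

```python
from enum import IntEnum


class MaintenanceOp(IntEnum):
    """Operations in the order the text recommends; lower values run first."""
    EXPIRE_SNAPSHOTS = 1
    REMOVE_ORPHANS = 2
    COMPACT = 3
    OPTIMIZE_MANIFESTS = 4


def plan_maintenance(requested):
    """Deduplicate the requested operations and put them in the safe order."""
    return sorted(set(requested))


# Operations may be requested in any order, and more than once.
plan = plan_maintenance([
    MaintenanceOp.COMPACT,
    MaintenanceOp.EXPIRE_SNAPSHOTS,
    MaintenanceOp.OPTIMIZE_MANIFESTS,
    MaintenanceOp.REMOVE_ORPHANS,
    MaintenanceOp.COMPACT,
])
print(" -> ".join(op.name.lower() for op in plan))
# expire_snapshots -> remove_orphans -> compact -> optimize_manifests
```

Encoding the order in the enum values keeps the rule in one place: whatever subset of operations a table needs, expiration and orphan removal always run before compaction.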

How LakeOps does it: Connects to your existing catalogs in ~10 minutes — no agents, no data movement, no pipeline changes. Begins analyzing table health immediately from metadata and query patterns. Production results: up to 80% total cost reduction, $1.37M saved in 3 months, 786+ tables across 112+ PB managed autonomously with full audit trail and one-click rollback.

2. Run compaction on a Rust engine at about one-seventh the cost of Spark
The cost problem: Small files multiply per-query overhead. A full scan of a table with 47,000 files issues at least 47,000 S3 GET requests. The traditional fix, Spark-based compaction, is itself expensive: JVM startup, garbage collection, over-provisioned clusters. A 200 GB compaction on Spark: $1.54, 1,612 seconds.
The solution: A purpose-built Rust compaction engine eliminates JVM overhead entirely. Same 200 GB job: $0.21, 221 seconds. That is 86% cheaper and about 7× faster, for the maintenance operation itself.
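The comparison can be recomputed directly from the two benchmark figures quoted above. The variable names are illustrative; the four numbers are the ones in the text.

```python
# Benchmark figures for the same 200 GB compaction job, as quoted in the text.
spark_cost_usd, spark_seconds = 1.54, 1612
rust_cost_usd, rust_seconds = 0.21, 221

cost_reduction = 1 - rust_cost_usd / spark_cost_usd  # fraction of cost saved
speedup = spark_seconds / rust_seconds               # wall-clock ratio

print(f"cost reduction: {cost_reduction:.0%}")  # cost reduction: 86%
print(f"speedup: {speedup:.1f}x")               # speedup: 7.3x
```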

How LakeOps does it: The Rust engine (built on Apache DataFusion) runs compaction on an event-driven basis, triggered by file-size degradation rather than blind cron schedules. It goes beyond binpack: it observes which columns appear in WHERE clauses and sorts data accordingly. Sorted tables scan 51% less data per query, which directly reduces per-query compute cost. The engine also learns from telemetry: three consecutive runs on a 1.2 TB table took 22, 18, and 11 minutes, with zero config changes.
The query cost impact: Compacting 47,000 files down to 280 reduced query time from 52 s to 5.8 s. That is roughly 9× less compute per query, multiplied by every query, every day. For a deeper dive into the compaction engine architecture, see Efficient Lakehouse Compaction at Scale.
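A trigger based on file-size degradation, as opposed to a cron schedule, might look like the sketch below. The thresholds and function names are assumptions chosen for illustration, not LakeOps defaults; in practice the file sizes would come from table metadata (for example, the `file_size_in_bytes` column of Iceberg's `files` metadata table).

```python
# All three thresholds are assumed values for illustration.
TARGET_FILE_BYTES = 512 * 1024 * 1024  # desired data file size
SMALL_FILE_RATIO = 0.25                # "small" = under 25% of the target
TRIGGER_FRACTION = 0.30                # compact once 30% of files are small


def small_file_fraction(file_sizes):
    """Fraction of files that fall below the small-file threshold."""
    if not file_sizes:
        return 0.0
    limit = TARGET_FILE_BYTES * SMALL_FILE_RATIO
    return sum(1 for size in file_sizes if size < limit) / len(file_sizes)


def should_compact(file_sizes):
    """True when enough of the table has degraded into small files."""
    return small_file_fraction(file_sizes) >= TRIGGER_FRACTION


MIB = 1024 * 1024
healthy = [500 * MIB] * 10
degraded = healthy + [4 * MIB] * 40  # 40 small files from streaming writes

print(should_compact(healthy))   # False
print(should_compact(degraded))  # True
```

Evaluating this check after each commit means a table that receives no writes is never compacted, while a table under heavy streaming ingest is compacted as soon as it degrades.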
3. Expire snapshots to reclaim storage
The cost problem: Every Iceberg write creates a snapshot. Without active expiration, snapshots accumulate indefinitely — holding references to data files and preventing garbage collection. One production table had 120 TB of reclaimable data from snapshot bloat alone: $33,000/year in pure waste.
The solution: Automated lifecycle management that balances time-travel SLAs (7–30 days retention) with aggressive expiration — conflict-aware, so active readers are never disrupted.

How LakeOps does it: Retention policies define the minimum snapshot count and time window, either per table or fleet-wide. Expiration runs on schedule and is sequenced before orphan cleanup, so dereferenced files are immediately eligible for removal. Production result: 22,034 snapshots and 675,510 files expired from a single table, reclaiming 179.49 GB. Another run: 2,928 snapshots cleared in under 4 minutes.
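A retention policy that combines a time window with a minimum snapshot count can be sketched as follows. The function name and the example policy (7 days, keep at least 5) are assumptions; the rule that a snapshot survives if either condition protects it mirrors how the two settings are described above.

```python
from datetime import datetime, timedelta, timezone


def snapshots_to_expire(snapshots, now, max_age, min_to_keep):
    """Return IDs of snapshots that are outside both protections.

    snapshots: list of (snapshot_id, committed_at) tuples.
    A snapshot is kept if it is among the newest `min_to_keep`
    or if it was committed within `max_age` of `now`.
    """
    newest_first = sorted(snapshots, key=lambda s: s[1], reverse=True)
    cutoff = now - max_age
    expired = []
    for rank, (snapshot_id, committed_at) in enumerate(newest_first):
        kept_by_count = rank < min_to_keep
        kept_by_age = committed_at >= cutoff
        if not (kept_by_count or kept_by_age):
            expired.append(snapshot_id)
    return expired


now = datetime(2024, 6, 30, tzinfo=timezone.utc)
# One snapshot per day; snapshot i is i days old.
snapshots = [(i, now - timedelta(days=i)) for i in range(1, 41)]

expired = snapshots_to_expire(
    snapshots, now, max_age=timedelta(days=7), min_to_keep=5
)
print(len(expired), min(expired), max(expired))  # 33 8 40
```

The minimum count matters for tables that are written rarely: without it, a table with one commit per month would lose all time-travel history under a 7-day window.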
4. Remove orphan files at near-zero compute cost
The cost problem: Orphan files — S3 objects no table references — accumulate from failed Spark jobs, aborted transactions, and dropped tables. They serve zero queries but cost real money. One fleet-wide scan found 200 TB of dead data across 324 tables: $4,000/month in wasted storage.
The solution: Automated detection and cleanup with a safety threshold (7 days) to protect in-progress writes.

How LakeOps does it: Scans catalog-wide, per-namespace, or per-table with include/exclude patterns. Runs after snapshot expiration in the maintenance sequence — catching newly dereferenced files that expiration released. ROI is immediate: storage savings begin on the next billing cycle. 59,831 orphan files (74.8 GB) removed from a single table in 13 minutes.
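Conceptually, orphan detection is a set difference with an age guard. The sketch below assumes two inputs that a real system would have to gather: a listing of objects under the table location with their last-modified times, and the set of paths reachable from table metadata. The names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

SAFETY_WINDOW = timedelta(days=7)  # the safety threshold from the text


def find_orphans(stored_objects, referenced_paths, now):
    """Return paths that no snapshot references and that are older than the window.

    stored_objects: dict mapping object path -> last-modified time.
    referenced_paths: set of paths reachable from table metadata.
    """
    cutoff = now - SAFETY_WINDOW
    return sorted(
        path
        for path, modified in stored_objects.items()
        if path not in referenced_paths and modified < cutoff
    )


now = datetime(2024, 6, 30, tzinfo=timezone.utc)
stored = {
    "data/a.parquet": now - timedelta(days=30),  # referenced: keep
    "data/b.parquet": now - timedelta(days=30),  # unreferenced and old: orphan
    "data/c.parquet": now - timedelta(days=1),   # unreferenced but recent: protected
}
referenced = {"data/a.parquet"}

print(find_orphans(stored, referenced, now))  # ['data/b.parquet']
```

The recent unreferenced file is left alone because it may belong to a write that has not committed yet, which is the case the 7-day threshold exists to protect.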
5. Sort data by query patterns to scan less
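The cost problem: When data is unsorted, the values a query filters on are scattered across every file, so file-level min/max ranges overlap and engines can skip very little: a selective query ends up reading most of the table. On Athena at $5/TB scanned, that is a direct cost per scan. On Trino/Spark billed per CPU-second, it means paying full compute for partial results. Multiply by thousands of daily queries and the waste is substantial.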
The solution: Sort data by the columns queries actually filter on. Parquet min/max statistics in sorted files let engines skip irrelevant data entirely — 51% less data scanned per query.

How [LakeOps](https://lakeops.dev) does it: Tracks which columns appear in WHERE, JOIN, and GROUP BY per table — automatically. During compaction, data is sorted by the columns delivering the highest aggregate skipping benefit. No manual sort-key selection. The sort strategy adapts as patterns evolve. For uncertain cases, layout simulations test proposed orders on a real Iceberg branch (real data, real queries replayed) before committing to expensive rewrites. Sorted data also compresses 9% better — real money at petabyte scale.
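The effect of sorting on min/max skipping can be reproduced with a toy model. This is a sketch under simplified assumptions (integer values, fixed rows per file, a single range filter); the reduction it shows is a property of this synthetic data and is not the 51% production figure.

```python
import random


def write_files(values, rows_per_file):
    """Split values into files and record each file's (min, max) statistics."""
    chunks = (
        values[i:i + rows_per_file]
        for i in range(0, len(values), rows_per_file)
    )
    return [(min(chunk), max(chunk)) for chunk in chunks]


def files_to_scan(file_stats, lo, hi):
    """Count files whose [min, max] range overlaps the filter range [lo, hi]."""
    return sum(1 for fmin, fmax in file_stats if fmax >= lo and fmin <= hi)


random.seed(0)
shuffled = list(range(10_000))
random.shuffle(shuffled)

unsorted_files = write_files(shuffled, rows_per_file=100)
sorted_files = write_files(sorted(shuffled), rows_per_file=100)

# Filter: WHERE value BETWEEN 2000 AND 2499
unsorted_hits = files_to_scan(unsorted_files, 2_000, 2_499)
sorted_hits = files_to_scan(sorted_files, 2_000, 2_499)
print(f"unsorted: {unsorted_hits} of {len(unsorted_files)} files")
print(f"sorted:   {sorted_hits} of {len(sorted_files)} files")  # sorted: 5 of 100 files
```

In the unsorted layout nearly every file spans almost the full value range, so its statistics cannot rule it out. After sorting, each file covers a narrow, disjoint range and only the files that actually contain matching rows are read.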
6. Route queries to the cheapest engine that fits
The cost problem: Most lakehouses run multiple engines (Trino, Spark, Snowflake, DuckDB, Athena), each with different pricing. Without routing, every query hits the default engine. A point lookup on DuckDB: $0.01. Same query on Snowflake: $0.08. At 10,000 such queries/day on the wrong engine, the $0.07 difference adds up to roughly $255,000/year in waste.
The solution: A routing layer that directs each query to the cheapest engine meeting its latency target — without changing application code.

How LakeOps does it: Three routing strategies — Cost (cheapest engine within latency bounds), Latency (fastest for interactive), Throughput (balanced load distribution). One endpoint for applications and AI agents, automatic SQL dialect translation, per-group concurrency limits. Per-agent, per-user, per-pipeline cost attribution shows exactly where spend originates. Compounding benefit: compacted, sorted tables make cheap engines viable for more workloads — more queries qualify for the lowest-cost tier.
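The Cost strategy reduces to "cheapest engine among those that meet the latency target". The sketch below is illustrative only: `EngineEstimate` and `route_by_cost` are hypothetical names, the latency figures are made up for the example, and the fallback to the fastest engine when nothing meets the target is an assumed policy. The two per-query costs are the ones quoted in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineEstimate:
    name: str
    cost_usd: float   # estimated cost of this query on this engine
    latency_s: float  # estimated latency (hypothetical values below)


def route_by_cost(estimates, max_latency_s):
    """Pick the cheapest engine whose estimated latency meets the target."""
    eligible = [e for e in estimates if e.latency_s <= max_latency_s]
    if not eligible:
        # Assumed fallback: nothing meets the target, so take the fastest.
        return min(estimates, key=lambda e: e.latency_s)
    return min(eligible, key=lambda e: e.cost_usd)


point_lookup = [
    EngineEstimate("duckdb", cost_usd=0.01, latency_s=0.4),
    EngineEstimate("snowflake", cost_usd=0.08, latency_s=0.9),
]
choice = route_by_cost(point_lookup, max_latency_s=2.0)
print(choice.name)  # duckdb

# Annual cost of sending 10,000 such queries per day to the pricier engine.
annual_waste = round((0.08 - 0.01) * 10_000 * 365)
print(annual_waste)  # 255500
```

Keeping the latency bound as a hard filter and cost as the tie-breaker means a cheaper engine is never chosen at the expense of the application's latency target.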
7. Leverage Iceberg partition evolution (zero-cost repartitioning)
The cost problem: Traditional Hive-style repartitioning requires a full table rewrite — reading every file, reorganizing into new structures. On petabyte tables: thousands of dollars and hours of compute. Most teams never repartition, even when the original choice no longer matches workloads.
The solution: Iceberg's partition evolution changes partitioning as a metadata-only operation. Old data stays in its original layout, new data uses the new spec, both coexist. No rewrites, no downtime, no compute cost.
How LakeOps helps: LakeOps observability surfaces partition effectiveness metrics — skew, file counts per partition, and access patterns — so you know exactly when and how to evolve. Hidden partitioning (supported natively by Iceberg) prevents accidental full scans from BI tools filtering on raw date columns instead of partition keys. Teams can start coarse (monthly), observe patterns via LakeOps metrics, then evolve to daily or hourly — adapting without a rewrite.
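How two partition layouts coexist after an evolution can be shown with a toy model. This is not Iceberg's implementation: it only mimics the idea that each data file is tied to the partition spec it was written under, and that pruning evaluates a filter against each file using that file's own spec. The spec names and file tuples are hypothetical.

```python
from datetime import date


def partition_value(spec, day):
    """Map a date to its partition value under the given spec."""
    if spec == "month":
        return (day.year, day.month)
    if spec == "day":
        return day
    raise ValueError(f"unknown spec: {spec}")


def prune(files, query_day):
    """Keep files whose partition matches query_day under the file's own spec."""
    return [
        name
        for name, spec, value in files
        if partition_value(spec, query_day) == value
    ]


# Files written before the evolution keep the monthly spec;
# files written afterwards use the daily spec. Nothing is rewritten.
files = [
    ("f1.parquet", "month", (2024, 1)),
    ("f2.parquet", "month", (2024, 2)),
    ("f3.parquet", "day", date(2024, 3, 1)),
    ("f4.parquet", "day", date(2024, 3, 2)),
]

print(prune(files, date(2024, 2, 14)))  # ['f2.parquet']
print(prune(files, date(2024, 3, 2)))   # ['f4.parquet']
```

Queries against old data still prune at month granularity, while queries against new data prune at day granularity, and the application's filter on the date column is the same in both cases.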
The compound effect
Each strategy delivers savings independently. Together they achieve 60–80% cost reduction because they compound:
The control plane (1) ensures compaction (2) runs continuously, producing sorted files that scan 51% less data. Snapshot expiration (3) and orphan cleanup (4) remove dead data before compaction, so compaction never wastes CPU rewriting files that are about to be deleted. Query-aware layout (5) means every engine scans roughly half the data. Routing (6) sends each query to the cheapest viable engine. Partition evolution (7) avoids expensive repartitioning projects.


Getting started
LakeOps connects to your existing catalogs — AWS Glue, REST catalogs (Polaris, Gravitino, Nessie, Lakekeeper), DynamoDB, and S3 Tables — in ~10 minutes. The initial scan identifies where your lake overspends and quantifies projected savings before any changes are made.

Run in manual mode (inspect and trigger yourself) or autonomous mode (continuous execution against your policies). Every operation is logged, auditable, and reversible. Your cloud bill reflects the improvement within the first billing period.
For a deeper look at how these strategies translate to query performance acceleration, read Optimizing Iceberg Lakehouse Performance.


