
Amazon S3 is where Apache Iceberg data lives. Every Parquet data file, every Avro manifest, every metadata.json pointer — all of it sits in S3 buckets billed per gigabyte-month and per API request. For a well-maintained 100 TB lakehouse the bill is predictable. For the majority of production Iceberg deployments — where streaming ingestion creates hundreds of thousands of small files, failed writes leave orphan objects behind, and snapshots retain data no query will read again — S3 becomes the single largest line item on the AWS invoice, often 2–4× what the same logical data should cost.
The root cause is structural: Iceberg's write patterns interact with S3's pricing model in ways that systematically inflate spend. Small files multiply API calls. Retained snapshots pin storage that cannot be reclaimed. Unsorted data forces full-file reads on scan-priced engines like Athena. None of this is visible in the Iceberg catalog or a standard S3 dashboard — it requires joining table-level metadata with storage billing to understand where money goes and why.
This guide quantifies five cost vectors specific to Iceberg on S3, maps each to AWS S3 pricing mechanics, and walks through five production strategies to address them. LakeOps appears throughout as the autonomous control plane that operationalizes all five strategies at lake scale, from identifying which tables waste S3 budget to executing correctly sequenced maintenance with a Rust engine that completes compaction at a fraction of Spark's cost.
How Iceberg inflates your S3 bill
Iceberg stores data as immutable Parquet files, tracks them through manifest files and snapshot metadata, and relies on explicit maintenance procedures to clean up after itself. When that maintenance does not keep pace with ingestion — or does not run at all — five cost vectors compound against S3's pricing model.
Small files from streaming ingestion. Streaming engines like Flink and Spark Structured Streaming checkpoint at fixed intervals, typically every 1–10 minutes. Each checkpoint commits new Parquet files, one per partition. A table with 100 partitions and 10-minute checkpoints produces 14,400 new files per day, most of them a few megabytes each. Every file incurs a PUT on write ($0.005 per 1,000 PUTs), a GET on every subsequent read ($0.0004 per 1,000 GETs), and appears in LIST operations during query planning ($0.005 per 1,000 LISTs). A table with 500,000 small files that is fully scanned 50 times a day generates roughly 25 million GET requests per day, or about 750 million per month. That is roughly $300 per month in GET costs alone for a single table, on top of LIST costs during planning that grow proportionally with file count.
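The request arithmetic for a small-file table can be reproduced in a few lines of Python. This is a rough sketch using the S3 Standard request prices quoted in this section. The function names are illustrative, and it assumes one GET per file read, which likely understates real traffic because Parquet readers often issue several ranged GETs per file.

```python
# Rough request-cost arithmetic for a small-file table.
# Prices are the S3 Standard figures quoted in this article (USD per 1,000 requests).
PUT_PER_1K = 0.005
GET_PER_1K = 0.0004


def files_per_day(partitions: int, checkpoint_minutes: int) -> int:
    """One new file per partition per checkpoint."""
    checkpoints = (24 * 60) // checkpoint_minutes
    return partitions * checkpoints


def monthly_get_cost(files_read_per_query: int, queries_per_day: int, days: int = 30) -> float:
    """Assumes one GET per file read; real readers often issue more."""
    gets = files_read_per_query * queries_per_day * days
    return gets / 1000 * GET_PER_1K


daily = files_per_day(partitions=100, checkpoint_minutes=10)
print(daily)  # 14400
print(round(daily * 30 / 1000 * PUT_PER_1K, 2))  # monthly PUT cost for new files
print(round(monthly_get_cost(500_000, 50), 2))   # worst case: every query reads every file
```

Lowering `files_read_per_query` models partition pruning; the cost scales linearly with whatever fraction of files a typical query actually opens.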
Orphan files from failed or partial writes. When a Spark job crashes mid-compaction, when a Flink checkpoint partially commits, or when a concurrent write conflict forces a retry, data files land in S3 but are never referenced in the committed metadata. These orphan files are invisible to Iceberg — they do not appear in any snapshot or manifest — but S3 bills for them at the full $0.023 per GB per month storage rate. On mature streaming lakes, orphan files routinely account for 25–40% of billable S3 storage on affected prefixes. Only Iceberg's `remove_orphan_files` procedure can reclaim them.
Retained snapshots pinning unreclaimable data. Every Iceberg commit creates a new snapshot referencing the current set of data files. By default, snapshots are never expired. Each retained snapshot pins references to data files that may have been logically superseded — a compaction that rewrites 1,000 small files into 10 large ones cannot delete the original 1,000 until every snapshot referencing them is expired. A streaming table with 10-minute commits accumulates over 4,300 snapshots per month. If snapshot expiration is not configured, every byte ever committed remains billable in S3 indefinitely.
Metadata overhead from manifest fragmentation. Query planning reads manifest files to determine which data files to scan. Fragmented manifests — hundreds of small manifest files from frequent commits — multiply the GET requests during planning. A table with 500 manifest files queried 100 times per day issues 50,000 manifest GETs per day. At $0.0004 per 1,000 GETs, manifest reads alone are modest — but the latency impact cascades: slow planning delays query start, holds engine resources longer, and inflates compute cost on time-priced clusters.
Scan amplification from unsorted data. When Parquet data files are not sorted on query predicate columns, row-group min/max statistics cannot skip irrelevant data. The engine reads entire files even when the query needs a fraction of the rows. On Athena at $5 per TB scanned, the difference is material: a well-sorted 1 TB table scanned with a selective filter might read 50 GB ($0.25); the same query on unsorted data reads the full terabyte ($5.00). Across a workload of 200 such queries per day, poor sort order costs roughly $30,000/month in Athena scan charges versus $1,500 on sorted data.
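The per-query scan figures follow directly from the per-TB price. A minimal sketch, assuming the $5/TB Athena price quoted here and decimal units (1 TB = 1,000 GB) to match the article's numbers; `scan_cost` is an illustrative helper, not an AWS API.

```python
ATHENA_USD_PER_TB = 5.0  # price quoted in this article


def scan_cost(gb_scanned: float, queries: int = 1) -> float:
    """Athena-style charge, using 1 TB = 1,000 GB to match the figures above."""
    return gb_scanned / 1000 * ATHENA_USD_PER_TB * queries


sorted_cost = scan_cost(50)      # selective filter on sorted data
unsorted_cost = scan_cost(1000)  # full scan of the 1 TB table
print(sorted_cost, unsorted_cost, unsorted_cost / sorted_cost)  # 0.25 5.0 20.0
```

Passing `queries=200 * 30` scales either figure to a month of 200 daily queries.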

These five vectors are not independent. Retained snapshots prevent orphan cleanup from reclaiming storage. Small files inflate scan amplification. Unsorted data forces full reads that multiply GET requests. Fixing one vector in isolation helps; addressing all five as a system is where production teams report 50–80% reductions in total S3 spend.
The S3 pricing model through an Iceberg lens
S3 Standard pricing has four dimensions, each mapped directly to Iceberg operations:
| Dimension | S3 Standard price | Iceberg operation |
|---|---|---|
| Storage | $0.023/GB/month | Data files, manifests, metadata, orphans, snapshot-pinned files |
| PUT/COPY/POST/LIST | $0.005 per 1,000 | Writes, compaction rewrites, manifest commits, query planning LISTs |
| GET/SELECT | $0.0004 per 1,000 | Data file reads, manifest reads, metadata fetches |
| Data retrieval | Free (Standard tier) | N/A for Standard; applies to IA/Glacier tiers |
Consider a concrete example. A 100 TB lakehouse with 500,000 files (average 200 MB each) stores data efficiently: large files, minimal API overhead. Storage cost: $2,300/month. If 200 queries per day each read 0.2% of files on average, that is 1,000 GETs per query × 200 queries = 200,000 GETs/day ≈ 6M GETs/month, costing roughly $2.40/month in GET fees.
Now consider the same 100 TB of logical data spread across 5,000,000 files (average 20 MB each), typical of a streaming-heavy lake without compaction. Storage is the same $2,300/month for live data, but add 20 TB of orphan files ($460/month) and 15 TB of snapshot-pinned dead data ($345/month). Every query now touches 10× more files: 10,000 GETs per query × 200 queries per day × 30 days = 60M GETs/month = $24/month in GET fees. LIST operations for planning against 5M objects cost proportionally more. And on Athena, scan amplification from unsorted small files turns a $500/month scan bill into $5,000/month. The total delta: roughly $5,300/month more for the same logical data, or about $64K/year in avoidable S3 and Athena spend.
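The two scenarios can be compared side by side with a small cost model. This is a sketch under the article's stated prices and inputs; the `Lake` class and its field names are invented for illustration, LIST costs are left out because the text does not quantify them, and decimal units (1 TB = 1,000 GB) are assumed.

```python
from dataclasses import dataclass

STORAGE_USD_PER_GB = 0.023
GET_USD_PER_1K = 0.0004


@dataclass
class Lake:
    live_tb: float
    orphan_tb: float
    pinned_tb: float
    gets_per_query: int
    queries_per_day: int
    athena_usd_per_month: float

    def storage_cost(self) -> float:
        total_gb = (self.live_tb + self.orphan_tb + self.pinned_tb) * 1000
        return total_gb * STORAGE_USD_PER_GB

    def get_cost(self, days: int = 30) -> float:
        gets = self.gets_per_query * self.queries_per_day * days
        return gets / 1000 * GET_USD_PER_1K

    def total(self) -> float:
        return self.storage_cost() + self.get_cost() + self.athena_usd_per_month


healthy = Lake(100, 0, 0, 1_000, 200, 500)
degraded = Lake(100, 20, 15, 10_000, 200, 5_000)
print(round(degraded.total() - healthy.total()))  # 5327
```

Most of the gap comes from Athena scan charges and dead storage; GET fees are a rounding error by comparison, which is consistent with the point that table structure rather than data volume drives the bill.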
The math is clear: the S3 bill does not reflect data volume — it reflects table structure. File count, file size, orphan volume, snapshot retention, and sort order determine cost far more than the raw terabytes of business data.
Five strategies to reduce S3 cost
1. Compact small files to target size
Compaction is the single highest-impact S3 cost reduction. Merging thousands of small files into fewer large files — targeting 256–512 MB per file for analytics workloads — reduces GET requests proportionally, shrinks LIST operations during planning, and enables better row-group statistics for predicate pushdown.
A streaming table with 1-minute checkpoints across 100 partitions creates 144,000 files per day. At 5 MB average file size, that is 720 GB/day of data in 144K objects. After compaction to 512 MB targets, the same data sits in roughly 1,400 files, a 99% reduction in object count that cuts GET costs, planning latency, and manifest size proportionally.
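The object-count reduction can be estimated from file count, average size, and target size. A back-of-the-envelope sketch: the helper name is illustrative, and it treats the result as an ideal lower bound that ignores partition boundaries (compaction does not merge files across partitions, so real counts will be somewhat higher).

```python
import math


def files_after_compaction(file_count: int, avg_file_mb: float, target_mb: float = 512) -> int:
    """Lower bound on output files if compaction packs perfectly to the target size."""
    return math.ceil(file_count * avg_file_mb / target_mb)


before = 144_000
after = files_after_compaction(before, avg_file_mb=5)
print(after, f"{1 - after / before:.1%}")  # 1407 99.0%
```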
Two compaction strategies apply:
- Binpack merges small files into target-size files without changing sort order. Fast, low-overhead, and the right first move for any table with a small-file problem.
- Sort merges and reorders data by specified columns. More expensive — the engine reads and rewrites all data — but the payoff extends beyond S3 API costs: sorted data enables row-group min/max pruning that directly reduces bytes scanned on Athena ($5/TB). For high-query tables, sort compaction pays for itself within days.
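Both strategies map onto Iceberg's `rewrite_data_files` Spark procedure. The sketch below only builds the SQL text so it can be reviewed before anything runs; the catalog and table names are placeholders, the helper is hypothetical, and it does no escaping of identifiers, so treat it as a starting point rather than a hardened implementation.

```python
from typing import Optional


def rewrite_data_files_sql(catalog: str, table: str, target_mb: int = 512,
                           sort_order: Optional[str] = None) -> str:
    """Build a Spark SQL CALL for Iceberg's rewrite_data_files procedure."""
    args = [f"table => '{table}'"]
    if sort_order:
        # Sort strategy: rewrites data ordered by the given columns.
        args.append("strategy => 'sort'")
        args.append(f"sort_order => '{sort_order}'")
    else:
        # Binpack strategy: merges small files without reordering rows.
        args.append("strategy => 'binpack'")
    target_bytes = target_mb * 1024 * 1024
    args.append(f"options => map('target-file-size-bytes', '{target_bytes}')")
    return f"CALL {catalog}.system.rewrite_data_files({', '.join(args)})"


print(rewrite_data_files_sql("my_catalog", "db.events"))
print(rewrite_data_files_sql("my_catalog", "db.events", sort_order="event_date, user_id"))
```

Each string can be passed to `spark.sql(...)` in a session configured with the Iceberg extensions. Starting with binpack and piloting sort on a few high-query tables matches the ordering suggested above.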
2. Expire snapshots and remove orphan files
Snapshot expiration and orphan cleanup are the fastest path to reclaiming wasted S3 storage — but sequence matters. Running `remove_orphan_files` before `expire_snapshots` is a no-op for snapshot-pinned data: those files are still referenced by unexpired snapshots and will not be identified as orphans. The correct sequence is always expire first, then clean orphans.
Snapshot expiration removes old snapshot metadata and, when file cleanup is enabled, deletes data files referenced only by the expired snapshots. Configure both time-based (older_than) and count-based (retain_last) retention. Streaming tables with 10-minute commits should retain 3–7 days of snapshots — enough for rollback safety, short enough to release the vast majority of superseded data files.
Orphan cleanup deletes storage objects that exist in S3 but are not referenced in any current table metadata — the remnants of failed writes, crashed compaction jobs, and partial commits. Use a conservative older_than threshold of 3–7 days (the Spark procedure defaults to 3 days) to protect in-flight writes and long-running queries that may still reference recently committed files.
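The expire-then-clean sequence can be expressed as two Spark SQL procedure calls generated in a fixed order. This is a minimal sketch: the helper and its defaults are illustrative, the timestamps assume a UTC Spark session, and the orphan step is emitted with `dry_run => true` so it only lists candidate files until someone deliberately removes that flag.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def cleanup_statements(catalog: str, table: str, retain_days: int = 7,
                       retain_last: int = 5, orphan_days: int = 3,
                       now: Optional[datetime] = None) -> List[str]:
    """Return Spark SQL CALLs in the required order: expire first, then orphans."""
    now = now or datetime.now(timezone.utc)
    fmt = "%Y-%m-%d %H:%M:%S"
    expire_before = (now - timedelta(days=retain_days)).strftime(fmt)
    orphan_before = (now - timedelta(days=orphan_days)).strftime(fmt)
    return [
        f"CALL {catalog}.system.expire_snapshots(table => '{table}', "
        f"older_than => TIMESTAMP '{expire_before}', retain_last => {retain_last})",
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}', "
        f"older_than => TIMESTAMP '{orphan_before}', dry_run => true)",
    ]


for stmt in cleanup_statements("my_catalog", "db.events"):
    print(stmt)
```

Returning the statements as an ordered list keeps the sequencing rule in one place, so a scheduler cannot accidentally run orphan cleanup first.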
The storage impact is substantial. On mature streaming lakes, orphans and snapshot-pinned files together represent 25–40% of billable S3 storage. A production deployment reported removing ~200 TB of orphan data across 324 tables in under 30 minutes — pure S3 storage reclaimed with no impact on active queries.
3. Optimize file layout for scan reduction
Compaction reduces API costs by shrinking file count. Layout optimization reduces scan costs by making predicate pushdown effective. The two are complementary — and on Athena at $5/TB scanned, layout optimization is often the higher-value lever.
Query-aware sort reorders data files around the columns queries actually filter on. When a 1 TB table is sorted on event_date and user_id, a query filtering WHERE event_date = '2026-05-01' AND user_id = 42 can skip 95%+ of row groups using Parquet min/max statistics. The same query on unsorted data reads the full terabyte. At Athena pricing, that is $0.25 versus $5.00 per query — a 20× cost difference on every execution.
The challenge is choosing the right sort columns. Sort order should reflect production query patterns, not schema intuition. Layout simulations — replaying historical SQL against candidate sort strategies — validate the performance gain before paying the I/O cost of a full data rewrite.
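The effect of sort order on min/max pruning can be seen in a toy simulation. This is not the layout-simulation tooling described above, only an illustration of the mechanism: synthetic values are split into fixed-size "row groups", and a group must be read whenever its min/max range overlaps the filter. All sizes and the filter range are arbitrary assumptions.

```python
import random


def row_groups_read(values, group_size, lo, hi):
    """Count row groups whose [min, max] range overlaps the filter [lo, hi]."""
    read = 0
    for i in range(0, len(values), group_size):
        group = values[i:i + group_size]
        if min(group) <= hi and max(group) >= lo:
            read += 1
    return read


random.seed(0)
user_ids = [random.randrange(1_000_000) for _ in range(100_000)]

# Same data, same filter, two layouts: arrival order versus sorted.
unsorted_reads = row_groups_read(user_ids, 1_000, 42_000, 43_000)
sorted_reads = row_groups_read(sorted(user_ids), 1_000, 42_000, 43_000)
print(unsorted_reads, sorted_reads)
```

In the unsorted layout nearly every group spans the full value range, so almost nothing can be skipped; in the sorted layout only the few groups covering the filter range are read.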
4. Apply S3 storage class policies
S3 Intelligent-Tiering moves objects between access tiers automatically with no retrieval fees and no performance impact for frequently accessed data:
| Tier | Activation | Savings vs Standard |
|---|---|---|
| Frequent Access | Default | Baseline |
| Infrequent Access | 30 days without access | ~40% cheaper |
| Archive Instant Access | 90 days without access | ~68% cheaper |
For Iceberg specifically, compact before tiering. S3 Standard-IA bills every object as at least 128 KB and applies a minimum storage duration of 30 days, so thousands of files below 128 KB are each billed as 128 KB once a lifecycle rule moves them there. Intelligent-Tiering, for its part, does not monitor or auto-tier objects smaller than 128 KB; they stay at the Frequent Access rate. Either way, compaction should run before data moves to cheaper tiers. Enable Intelligent-Tiering on lakehouse buckets; the monitoring fee ($0.0025 per 1,000 objects/month) is trivial relative to 40–68% storage savings on infrequently accessed data. For Iceberg-specific guidance, see the AWS Prescriptive Guidance.
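The 128 KB minimum can be made concrete with a short calculation. A sketch assuming the minimum billable object size described above; the file sizes are made up for illustration.

```python
MIN_BILLABLE_BYTES = 128 * 1024  # minimum billable object size in the IA tier


def billable_bytes(object_sizes):
    """Bytes billed when each object counts as at least 128 KB."""
    return sum(max(size, MIN_BILLABLE_BYTES) for size in object_sizes)


small = [16 * 1024] * 10_000  # 10,000 files of 16 KB each
compacted = [sum(small)]      # the same bytes in a single object
print(billable_bytes(small) / sum(small))          # 8.0
print(billable_bytes(compacted) / sum(compacted))  # 1.0
```

In this example the uncompacted layout is billed for eight times the bytes it actually holds, which can erase the discount the cheaper tier was supposed to provide.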
5. Route queries to scan-efficient engines
S3 cost is not just about storage — scan-priced engines turn every poorly optimized table into a recurring charge. Athena at $5/TB scanned is the clearest example: a 1 TB table queried 100 times/month costs $500 on Athena if scans are full-table, versus $0.01–$1.00 per query on DuckDB or Trino reading from well-compacted data.
The key insight: table structure determines which engines are viable, and engine choice determines scan cost. Compaction and sort are prerequisites; routing is the multiplier. Well-compacted tables unlock DuckDB (no cluster, no per-TB charge) and Trino (right-sized cluster, fraction of Athena's cost) as alternatives that fragmented tables cannot support.
Why these strategies fail without automation
Each strategy above is well-documented. The Spark procedures exist. The Athena cost model is public. Yet most production Iceberg lakehouses still overpay by 2–4×. Why?
Sequencing dependencies. Expiration must run before orphan cleanup. Compaction must complete before storage-class transitions make sense. Sort must align with current query patterns, not last quarter's. Running these in the wrong order wastes compute — or worse, leaves storage unreclaimable.
Scale breaks manual execution. A 50-table lake can be maintained with Airflow DAGs. A 500-table lake across multiple catalogs (Glue, REST, S3 Tables) cannot. Each table has different ingestion rates, query patterns, and degradation timelines. Fixed cron schedules compact healthy tables that do not need it and miss degraded tables between runs.
No feedback loop. Without observability that connects table structure to S3 billing, teams react to monthly invoice spikes rather than preventing them. By the time you notice the bill, months of orphan accumulation and snapshot sprawl are already billed.
Compaction cost itself. Spark-based compaction on EMR is the most common approach — but JVM startup, executor provisioning, garbage-collection overhead, and cluster idle time mean the maintenance itself is expensive. If compaction costs more than the S3 savings it produces, teams stop running it.
Addressing these gaps requires more than better scripts. It requires a system — specific operational components working together.
Components of an S3 cost reduction system
Solving Iceberg S3 cost at production scale requires four components working as a closed loop:
1. Observability layer — know which tables waste money
Before optimizing anything, you need visibility into which tables are structurally degraded and how that maps to S3 spend. This means health classification (file count, manifest depth, snapshot sprawl, orphan volume, partition skew) across every table in every catalog — surfaced as severity-ranked alerts that lead next month's invoice, not trailing indicators that explain last month's.
2. Execution engine — fast, affordable maintenance
Compaction, expiration, orphan removal, and manifest rewrites must be cheap enough to run continuously. If the engine costs more than the S3 savings, automation stalls. The execution layer needs to complete lifecycle operations at a fraction of Spark's cost — purpose-built engines (Rust, DataFusion) that finish binpack in minutes rather than hours and run sort compaction without provisioning full EMR clusters.
3. Orchestration logic — sequenced, health-driven, conflict-aware
Operations must execute in the correct order (expire → orphans → compact → manifest rewrite → statistics refresh), trigger from health signals rather than fixed schedules, and avoid conflicting with streaming writers or active readers. This is the control plane intelligence — it decides which tables need which operations, when to run them, and how to adapt when conditions change.
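A health-driven planner of this kind can be sketched in a few lines. Everything here is hypothetical: the metric names, the thresholds, and the operation labels are placeholders (the last label is not a procedure name), and a real control plane would also need conflict detection against active writers, which this sketch omits.

```python
from typing import Dict, List

# Fixed execution order described above.
ORDER = ["expire_snapshots", "remove_orphan_files", "rewrite_data_files",
         "rewrite_manifests", "refresh_statistics"]

# Hypothetical thresholds: operation -> (health metric, trigger level).
THRESHOLDS = {
    "expire_snapshots": ("snapshot_count", 500),
    "remove_orphan_files": ("orphan_gb", 100),
    "rewrite_data_files": ("small_file_count", 10_000),
    "rewrite_manifests": ("manifest_count", 200),
}


def plan(health: Dict[str, float]) -> List[str]:
    """Pick operations whose metric exceeds its threshold, then emit them in ORDER."""
    needed = {op for op, (metric, limit) in THRESHOLDS.items()
              if health.get(metric, 0) > limit}
    if "rewrite_data_files" in needed:
        needed.add("refresh_statistics")  # assumption: stats go stale after a rewrite
    return [op for op in ORDER if op in needed]


print(plan({"snapshot_count": 4_320, "small_file_count": 50_000}))
# ['expire_snapshots', 'rewrite_data_files', 'refresh_statistics']
```

Separating "what is needed" from "in what order" means a healthy table yields an empty plan, while a degraded one always gets its operations in the same safe sequence.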
4. Policy and routing framework — guardrails at lake scale
Retention windows, compaction thresholds, and cleanup rules need to be defined once and enforced from table scope up through namespace to catalog baselines — versioned, auditable, and overridable per workload. Query routing extends this: directing workloads to the cheapest viable engine based on table readiness, cost policies, and latency requirements.
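Scope-based inheritance of this sort reduces to a layered dictionary merge. The policy keys, namespace, and table names below are invented for illustration and do not reflect any particular product's configuration schema.

```python
from typing import Any, Dict

# Hypothetical policies at three scopes.
catalog_policy = {"snapshot_retain_days": 7, "target_file_mb": 512, "orphan_older_than_days": 3}
namespace_policies = {"streaming": {"snapshot_retain_days": 3}}
table_policies = {"streaming.events": {"target_file_mb": 256}}


def effective_policy(table: str) -> Dict[str, Any]:
    """Most specific scope wins: table overrides namespace, namespace overrides catalog."""
    namespace = table.rsplit(".", 1)[0]
    merged = dict(catalog_policy)
    merged.update(namespace_policies.get(namespace, {}))
    merged.update(table_policies.get(table, {}))
    return merged


print(effective_policy("streaming.events"))
# {'snapshot_retain_days': 3, 'target_file_mb': 256, 'orphan_older_than_days': 3}
```

A table with no overrides simply inherits the catalog baseline, so new tables are covered by default rather than by someone remembering to configure them.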
These four components form a loop: observe → decide → execute → measure → adapt. Any tool that covers only one component (e.g., just compaction, or just observability) leaves the others as manual work.
Tools that deliver these components
Amazon S3 Tables — embedded storage-layer automation
Amazon S3 Tables embeds lifecycle management directly into the storage layer. It automatically compacts data files (binpack, sort, or z-order), manages snapshot retention, removes unreferenced files, and applies Intelligent-Tiering — all without user-managed jobs. S3 Tables delivers up to 10× higher transactions per second and up to 3× faster query performance through continuous optimization.
S3 Tables pricing: $0.0265/GB storage (vs $0.023 for standard S3), $0.002 per 1,000 objects for compaction processing, and $0.005/GB processed during compaction. For tables that would otherwise require EMR clusters for maintenance, S3 Tables can be cheaper end-to-end despite the higher per-GB rate — the zero-ops model eliminates cluster costs entirely.
Covers: Execution engine (auto-compaction), partial orchestration (snapshot/orphan management), storage tiering. Does not cover: Cross-catalog observability, multi-engine query routing, policy enforcement across mixed catalog estates, or user-controlled maintenance sequencing. Tables must reside in the S3 Tables storage model — no retroactive conversion from general-purpose buckets without a rewrite.
Best for: Greenfield AWS Iceberg workloads standardized on Athena/EMR/Redshift Spectrum that want zero-config maintenance embedded at the storage layer.
LakeOps — autonomous lakehouse control plane
LakeOps is an autonomous lakehouse control plane built in Rust on Apache DataFusion. It connects to existing Iceberg catalogs — Glue, REST/Polaris, Nessie, Gravitino, Lakekeeper, S3 Tables — and delivers all four components as a unified system. Production deployments: $1.37M saved in 3 months, 46.8 PB optimized in 30 days.
Observability layer. Health tiers (Critical, Warning, Healthy) classify every table by file count, manifest depth, snapshot sprawl, orphan volume, and partition skew. The Insights engine surfaces severity-ranked alerts — leading indicators of next month's S3 spike — so teams act on structural problems before invoices arrive. See Iceberg lakehouse observability.

Execution engine. Rust-based compaction completes binpack in 221 seconds versus 1,612 seconds for Spark on 200 GB tables — 86% less wall-clock time, proportionally lower compute cost per terabyte. When lifecycle automation runs across hundreds of tables weekly, engine speed determines whether S3 cost reduction is self-funding. For a detailed comparison, see 9 Iceberg Table Compaction Tools Compared.

Orchestration logic. Sequenced maintenance runs the correct operation order on every table: expire snapshots → remove orphans → compact data files → rewrite manifests → refresh column statistics. Each step triggers from table-health signals — not fixed cron schedules. Conflict-aware execution ensures compaction does not collide with streaming writers or active readers. Query-aware sort analyzes cross-engine telemetry (filter/join column frequency from Athena, Trino, Spark, DuckDB) to determine optimal sort order per table, with Layout Simulations validating projected scan reduction before committing to a rewrite.

Policy and routing. Lake-wide policies enforce compaction thresholds, retention windows, and cleanup rules from table scope up through namespace to catalog baselines — versioned, auditable, and toggleable from a single dashboard. Multi-engine routing sends queries to the cheapest viable engine based on cost, latency, and throughput policies: ad hoc scans to DuckDB, interactive analytics to Trino, batch ETL to Spark. See managed Iceberg for how maintenance and routing work together.

Routing and policies sit on top of the execution layer. Once tables are compacted and sorted, the Events view tracks every maintenance operation across the lake — showing which ran, how long each took, and what changed.

The Insights engine identifies where the next dollar of waste will come from — specific tables with specific structural problems ranked by severity.

Covers: All four components (observability, execution, orchestration, policy/routing) as a closed loop across any catalog and engine.
Best for: Multi-catalog estates, mixed-engine workloads, teams running at lake scale (50+ tables) who need unified S3 cost reduction rather than per-table scripting. Complements S3 Tables and Glue by operating at the control-plane layer above storage. For the full autonomous maintenance model, see autonomous Iceberg table maintenance.
A practical S3 cost reduction runbook
S3 cost reduction is a sequenced project, not a one-time fix. This week-by-week runbook converts the strategies above into operational milestones.
Week 1 — Audit. Enable S3 Storage Lens on your lakehouse buckets. Analyze prefix-level metrics: object count, average object size, and storage by prefix. Cross-reference with Iceberg metadata to identify the top 20 tables by object count and the top 20 by orphan volume (compare `total-files-size` from the Iceberg snapshot summary to S3 billable storage on the same prefix). Document which tables have never had snapshots expired, which have unbounded snapshot growth, and which have the widest gap between logical and physical storage. This audit typically reveals that 3–5 tables account for the majority of avoidable S3 spend.
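The logical-versus-physical comparison in the audit can be scripted once both numbers are collected per table. A sketch with hypothetical inputs: it assumes you have already gathered, for each table, the bytes referenced by the current snapshot and the bytes stored under its S3 prefix. The gap includes orphans, files pinned by older snapshots, and metadata, so it is an upper bound on orphan volume rather than an exact figure.

```python
from typing import Dict, List, Tuple


def rank_by_unreferenced(tables: Dict[str, Tuple[int, int]], top: int = 20) -> List[Tuple[str, int]]:
    """tables maps name -> (bytes referenced by current snapshot, bytes stored under the prefix)."""
    gaps = [(name, max(stored - referenced, 0)) for name, (referenced, stored) in tables.items()]
    return sorted(gaps, key=lambda item: item[1], reverse=True)[:top]


# Made-up sample figures, in bytes.
sample = {
    "db.events": (40 * 10**12, 65 * 10**12),
    "db.orders": (5 * 10**12, 5 * 10**12 + 10**11),
    "db.clicks": (12 * 10**12, 20 * 10**12),
}
for name, gap in rank_by_unreferenced(sample, top=2):
    print(name, gap / 10**12, "TB")
```

Sorting by the absolute gap rather than the ratio keeps attention on the tables where cleanup reclaims the most storage.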
Week 2 — Quick wins. Run `expire_snapshots` on the worst offenders with older_than set to 7 days and retain_last set to a minimum safe count (e.g., 5). Follow immediately with `remove_orphan_files` on the same tables — always after expiration, never before. Measure the S3 storage delta on affected prefixes. Teams routinely reclaim 20–40% of storage on high-ingestion tables in this step alone.
Week 3 — Compaction. Binpack the top 10 tables by file count, targeting 256–512 MB per file. Use the fastest available compaction engine — the compute cost of compaction should be a fraction of the S3 and Athena savings it produces. Validate results: check file count reduction, average file size, and query latency on representative workloads before and after. If key fact tables have identifiable filter columns, pilot sort compaction on 1–2 tables and measure Athena scan reduction.
Week 4 — Systematize. Replace per-table cron with namespace-scoped policies that trigger maintenance from health signals. Enable S3 Intelligent-Tiering on lakehouse buckets for automatic storage class transitions. Establish routing policies for query engines — send ad hoc workloads to DuckDB or Trino where compacted tables make them viable, keep batch ETL on Spark, and reserve Athena for workloads that justify per-TB scan pricing. Publish a monthly dashboard showing S3 storage cost, API cost, and Athena scan cost per table and per team.

Summary
The S3 bill for an Iceberg lakehouse does not reflect how much data you have — it reflects how well your tables are structured. Small files multiply API costs. Orphans and retained snapshots inflate storage. Unsorted data amplifies scan charges on Athena. Fragmented manifests slow planning and hold compute resources longer. Every dollar of avoidable S3 spend traces back to a table maintenance operation that either ran in the wrong order, ran on the wrong schedule, or never ran at all.
Fix the structure and the bill follows. Expire snapshots before cleaning orphans. Compact small files to target sizes. Sort data around production query patterns. Apply Intelligent-Tiering for automatic storage class transitions. Route queries to the cheapest engine the table layout can support. Do this across every table — not just the ones someone remembers to maintain — and the S3 bill drops to what the data actually costs, not what neglect charges. For teams running at lake scale across multiple catalogs and engines, LakeOps operationalizes that entire loop autonomously: observe, decide, execute, adapt — with the cost savings measured in the same S3 dashboard that showed the problem.



