
Apache Iceberg on Amazon S3 is the default architecture for open lakehouses on AWS. S3 provides the storage layer — durable, scalable, pay-per-use — and Iceberg provides the table format that turns flat object storage into something that behaves like a database: ACID transactions, schema evolution, time travel, and partition pruning, all on top of immutable Parquet files in S3 buckets.
The combination works because Iceberg was designed for object stores. Every component of an Iceberg table — data files, manifest files, manifest lists, and metadata pointers — is a standalone S3 object. No HDFS dependencies, no rename operations, no file-system assumptions that break on eventually consistent storage. AWS has built native Iceberg support into Glue, Athena, EMR, Redshift, and S3 Tables, creating an integrated ecosystem where the table format is a first-class citizen across analytics services.
This guide is a checklist for running Iceberg on S3 successfully in production. It covers:
- Architecture — how Iceberg's metadata hierarchy maps to S3 objects
- Autonomous management — the control plane that makes Iceberg on S3 operationally viable at scale
- AWS services — Glue, Athena, EMR, Redshift, S3 Tables, and ingestion options
- S3 configuration — prefix distribution, FileIO, encryption, IAM
- Security and governance — Lake Formation, access control, encryption at rest
- Performance — partitioning, sort order, bloom filters, manifest optimization
- Maintenance and operations — the correct sequence, why it fails at scale, and how to automate it
How Iceberg works on S3
An Iceberg table on S3 is a hierarchy of immutable files. Understanding this hierarchy is essential for configuring S3 correctly and diagnosing performance issues.
Metadata file (`metadata.json`). The root pointer for every table state. Each commit creates a new metadata.json containing the current schema, partition spec, sort order, default properties, and a reference to the current snapshot. The catalog (Glue, REST, or Hive) stores the path to the latest metadata file — this is the entry point for every query.
Snapshot and manifest list. Each snapshot represents a complete, consistent view of the table at a point in time. The snapshot references a manifest list (snap-*.avro) — an Avro file that enumerates all manifest files belonging to that snapshot, along with partition-level summary statistics (min/max values, file counts, size totals) used for partition pruning during planning.
Manifest files. Each manifest (*.avro) tracks a subset of data files — their S3 paths, partition values, file-level statistics (column min/max, null counts, value counts), and file size. Query engines read manifests to determine which data files to scan, skipping entire manifests whose partition ranges do not match the query predicate. Manifest-level pruning is the first layer of scan reduction.
Data files. The actual Parquet (or ORC/Avro) files containing row data. On S3, these are immutable objects written once and never modified. Updates and deletes produce new data files (or delete files in Merge-on-Read mode); the old files remain until snapshot expiration and garbage collection remove them.
Every layer in this hierarchy is a separate S3 object — read independently, cached independently, and billed independently. A query against an Iceberg table on S3 issues GET requests for the metadata file, the manifest list, the relevant manifests, and the matching data files. The efficiency of this chain — how many objects are read, how large they are, and how effectively the engine prunes at each layer — determines both query latency and S3 API cost.
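The read chain described above can be sketched as a small model. This is an illustration only, not how any engine is implemented: the class names, the `plan_scan` helper, and the one-GET-per-object accounting are assumptions made for the example (real engines cache metadata and issue several range GETs per Parquet file).

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    path: str
    partition: str  # e.g. "2026-05-01"


@dataclass
class Manifest:
    path: str
    files: list[DataFile] = field(default_factory=list)

    def partitions(self) -> set[str]:
        # Stand-in for the partition summary stored in the manifest list.
        return {f.partition for f in self.files}


@dataclass
class Snapshot:
    manifest_list_path: str
    manifests: list[Manifest]


def plan_scan(snapshot: Snapshot, wanted_partition: str) -> tuple[list[DataFile], int]:
    """Return the data files to scan and a simplified count of S3 GETs."""
    gets = 2  # one for metadata.json, one for the manifest list
    selected: list[DataFile] = []
    for manifest in snapshot.manifests:
        # Pruned using the summary in the manifest list: no GET for this manifest.
        if wanted_partition not in manifest.partitions():
            continue
        gets += 1  # read the manifest itself
        selected.extend(f for f in manifest.files if f.partition == wanted_partition)
    gets += len(selected)  # simplified: one GET per matching data file
    return selected, gets
```

With three manifests of which only one covers the requested partition, the model reads one manifest instead of three, which is the manifest-level pruning the text describes.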

Autonomous management for Iceberg on S3
Iceberg provides the table format. AWS provides the storage and compute services. But neither manages the operational lifecycle of your tables. Every Iceberg table degrades over time — snapshots pile up, orphan files inflate S3 costs, small files multiply from streaming writes, and sort orders go stale as query patterns change. Across hundreds of tables with different ingestion rates, engines, and catalogs, this degradation is invisible until invoices spike or queries slow down. Manual Airflow DAGs and cron-based Spark scripts cannot keep up.
A lakehouse control plane is the infrastructure component that closes this gap. It sits between your catalogs and query engines — continuously monitoring every table, understanding cross-engine query patterns, and applying the right maintenance at the right time. Without it, an Iceberg-on-S3 deployment works at proof-of-concept scale but breaks operationally at production scale. With it, the lakehouse stays fast, lean, and correctly optimized as a baseline — not an aspiration.
LakeOps is the autonomous control plane for Apache Iceberg, built in Rust on Apache DataFusion. It connects to any Iceberg catalog (Glue, REST/Polaris, Nessie, S3 Tables, Gravitino) and orchestrates the full operational loop across all query engines (Athena, Trino, Spark, DuckDB, Flink, Snowflake). What it provides:
Full-lake observability. Every table across every catalog is classified by structural health — Critical, Warning, Healthy — based on file count, manifest depth, snapshot accumulation, orphan volume, and partition skew. Severity-ranked Insights surface degradation as leading indicators: teams fix problems before they appear on invoices or query dashboards. Cross-engine query telemetry shows which columns drive filters, which tables are hottest, and where latency is trending — informing both triage and optimization.

Autonomous compaction and maintenance. A Rust-based execution engine completes binpack in 221 seconds versus 1,612 seconds for Spark on 200 GB tables — making continuous maintenance across hundreds of tables economically viable rather than a second infrastructure cost. Query-aware sort analyzes real SQL from Athena, Trino, and Spark to physically organize data by actual access patterns — so every query across every engine benefits from row-group pruning. The full maintenance sequence (expire → orphans → compact → manifests → statistics) runs in correct order, triggered by health signals, not fixed schedules. Every operation is tracked with full auditability — duration, before/after file counts, and status.

Production deployments report $1.37M saved in 3 months and 46.8 PB optimized in 30 days — the compound effect of continuous maintenance running at native speed across the full lake.

Multi-engine query routing. Production estates query the same tables from Athena, Trino, Spark, and DuckDB — each optimal for different workloads. LakeOps provides a unified routing layer that dispatches queries to the best engine based on cost, latency, and table health. Routing endpoints give applications a stable URL with defined engine pools and priority levels — eliminating hardcoded engine-selection logic that becomes suboptimal as workloads evolve.

Policy governance. Compaction thresholds, retention windows, and cleanup rules are defined once and enforced from table scope through namespace to catalog baseline — versioned, auditable, and overridable per workload. This replaces per-team Airflow DAGs with centralized governance that scales from 10 to 10,000 tables without additional engineering effort.

AI enablement. An agent-native MCP interface with guardrails provides AI pipelines fast, consistent access to optimized table state and metadata — enabling agentic data workflows that read, query, and reason over the lakehouse autonomously.

LakeOps is not a replacement for AWS services — it orchestrates them. Athena still runs your queries, EMR still runs your Spark jobs, Glue still catalogs your tables. LakeOps ensures the tables those services depend on stay healthy, optimized, and governed — autonomously, across the full estate. For a deep dive into the control-plane approach, see Managed Iceberg in 2026.
AWS services for Iceberg
AWS provides a broad ecosystem of services that integrate natively with Iceberg on S3. Each service addresses a different part of the analytics stack — cataloging, querying, processing, ingestion, or managed storage.
S3 storage layer
S3 is the physical storage for all Iceberg table components. Two configuration choices have outsized impact on performance:
`S3FileIO` is Iceberg's native S3 client, optimized for object-store operations. It uses multipart upload with parallel part streaming for writes and range-based GETs for columnar reads, and it does not rely on renames or directory listings. It replaces HadoopFileIO as the recommended I/O implementation for S3: HadoopFileIO works but carries HDFS-oriented overhead (rename-based commits, directory listing assumptions) that is unnecessary on S3, which has provided strong read-after-write consistency since December 2020.
`ObjectStoreLocationProvider` distributes data file paths across randomized S3 prefixes rather than placing all files under a single partition-based directory tree. S3 throttles requests at the prefix level (5,500 GET/HEAD and 3,500 PUT/COPY/POST/DELETE requests per second per prefix). Without prefix distribution, high-throughput streaming tables hit 503 SlowDown errors on hot partitions. ObjectStoreLocationProvider avoids this by hashing file paths across the key space, so that writes to one logical partition spread over many prefixes.
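The idea of hashing paths across the key space can be sketched in a few lines. This is not Iceberg's implementation: the real provider uses its own hash function and path layout, while this sketch substitutes SHA-256 from the standard library, and the function name, bucket, and table root are hypothetical.

```python
import hashlib


def object_store_path(table_root: str, logical_path: str, hash_chars: int = 8) -> str:
    """Place a short hash of the logical path ahead of it in the S3 key.

    Illustrative only: SHA-256 is a stand-in for Iceberg's own hashing.
    """
    digest = hashlib.sha256(logical_path.encode("utf-8")).hexdigest()[:hash_chars]
    return f"{table_root}/data/{digest}/{logical_path}"


# Two files in the same date partition end up under different key prefixes.
root = "s3://example-bucket/warehouse/db/events"  # hypothetical location
for name in ("file-a.parquet", "file-b.parquet"):
    print(object_store_path(root, f"event_date=2026-05-01/{name}"))
```

Because the hash sits before the partition directory, a burst of writes to one hot partition no longer lands on a single prefix.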
Encryption. S3 supports server-side encryption with S3-managed keys (SSE-S3), AWS KMS keys (SSE-KMS), and customer-provided keys (SSE-C). For Iceberg workloads, SSE-S3 is the simplest option with no additional API overhead. SSE-KMS adds per-request KMS calls, which matters for high-throughput tables where KMS request limits may become a factor. Configure encryption at the bucket level or through Iceberg's S3FileIO catalog properties (s3.sse.type, s3.sse.key).
AWS Glue Data Catalog
AWS Glue Data Catalog serves as a native Iceberg catalog, tracking the current metadata pointer for each table. Since Glue 3.0, Iceberg tables are first-class objects in the catalog — CREATE TABLE ... USING iceberg registers the table in Glue, and all metadata updates (schema evolution, partition spec changes, snapshot commits) are reflected through Glue's API.
Glue exposes an Iceberg REST catalog endpoint, enabling any Iceberg-compatible engine to connect without vendor-specific catalog implementations. AWS Lake Formation integrates with Glue to provide column-level and row-level access control on Iceberg tables — policies defined in Lake Formation are enforced across Athena, EMR, and Redshift queries.
Glue also provides table optimizers that run compaction (binpack, sort, z-order), snapshot retention, and orphan file deletion as managed operations — no Spark cluster required.
Amazon Athena
Amazon Athena provides serverless SQL on Iceberg tables, priced at $5 per TB scanned. Athena supports CREATE TABLE with TBLPROPERTIES ('table_type' = 'ICEBERG'), CTAS (Create Table As Select), INSERT INTO, MERGE INTO, UPDATE, and DELETE: full DML on Iceberg tables without provisioning any infrastructure.
Key Iceberg features on Athena include time travel (FOR TIMESTAMP AS OF and FOR VERSION AS OF in Athena engine version 3), schema evolution, hidden partitioning, and the `OPTIMIZE` command for in-place compaction. Athena uses Iceberg's metadata for partition pruning and column projection, so well-structured tables with effective sort orders can reduce scan volume, and therefore cost, by an order of magnitude.
Amazon EMR
Amazon EMR runs Spark, Flink, and Trino with native Iceberg support. EMR provides full DML capabilities — including streaming writes via Flink and Spark Structured Streaming, stored procedures for maintenance operations (rewrite_data_files, expire_snapshots, remove_orphan_files), and all Iceberg table management operations.
EMR 7.x includes Iceberg extensions for materialized views, automatic compaction triggers, and integration with Glue Data Catalog. For production deployments, EMR is typically the workhorse for heavy ETL, streaming ingestion, and maintenance operations that require full programmatic control over Iceberg's table management API.
Amazon Redshift
Amazon Redshift supports federated queries on Iceberg tables through the Glue Data Catalog. Using Redshift Spectrum, analysts can join Iceberg tables in S3 with Redshift-managed tables in a single query — useful for combining hot data in Redshift with cold historical data in the lakehouse. Iceberg access from Redshift is read-only; writes go through EMR, Athena, or Glue.
Amazon S3 Tables
Amazon S3 Tables is AWS's fully managed Iceberg offering, embedding table semantics directly into the S3 storage layer. Tables are created in dedicated table buckets with a REST Catalog API — each bucket supports up to 10,000 tables.
S3 Tables provides automatic compaction (binpack, sort, or z-order), snapshot management, and orphan file cleanup with zero user-managed jobs. AWS reports up to 10× higher transactions per second compared to general-purpose S3 buckets and up to 3× faster query performance through continuous optimization. Intelligent-Tiering is supported natively, moving infrequently accessed table data to cheaper storage tiers automatically. S3 Tables also supports cross-region replication for disaster recovery.
S3 Tables is suited for greenfield Iceberg deployments on AWS that want embedded storage-layer automation. The trade-off: tables must reside in S3 table buckets (no retroactive conversion from general-purpose buckets without rewriting data), and user control over maintenance sequencing and trigger conditions is limited.
Data ingestion services
Amazon Data Firehose supports direct Iceberg table delivery from streaming sources — Kinesis Data Streams, MSK, and direct PUT — writing Parquet files to Iceberg tables in S3 with automatic schema handling. For more complex streaming pipelines, Apache Flink on EMR and AWS Glue streaming jobs provide full control over commit intervals, partitioning strategies, and write parallelism.
S3 configuration best practices
Default S3 settings work for small-scale Iceberg deployments. At production scale, several configuration choices prevent performance bottlenecks and operational issues.
Use `ObjectStoreLocationProvider`. Set write.object-storage.enabled=true on every table. This distributes data files across randomized S3 prefixes, preventing request throttling on high-throughput partitions. Without it, streaming tables writing to date-partitioned paths concentrate all PUTs on a single prefix — hitting S3's 3,500 PUT/s limit and triggering 503 SlowDown errors.
Use `S3FileIO`, not `HadoopFileIO`. Configure io-impl=org.apache.iceberg.aws.s3.S3FileIO in your catalog properties. S3FileIO handles multipart uploads, parallel reads, and S3-native consistency correctly. HadoopFileIO adds unnecessary overhead from HDFS-oriented operations (directory listings, rename-based commits) that S3 does not need.
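Configure retry strategies. S3 returns transient 503 errors under load. Set s3.retry.num-retries (default 5) high enough to ride out throttling, with exponential backoff between attempts. For write-heavy tables, increase s3.multipart.part-size-bytes (and, if needed, s3.multipart.threshold) so that each file is uploaded in fewer, larger parts, which reduces the number of API calls per file write.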
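The settings above might be collected for a Spark session roughly as follows. This is a sketch under assumptions: the catalog name `glue`, the bucket `example-bucket`, and the table `glue.db.events` are placeholders, and the helper names are invented for the example. The property keys are the ones named in this section plus the standard Iceberg Spark catalog keys.

```python
def spark_catalog_conf(catalog: str, warehouse: str, num_retries: int = 5) -> dict[str, str]:
    """Spark properties for a Glue-backed Iceberg catalog that uses S3FileIO."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.s3.retry.num-retries": str(num_retries),
    }


def render_conf_lines(conf: dict[str, str]) -> list[str]:
    """Render the properties as spark-submit style --conf arguments."""
    return [f"--conf {key}={value}" for key, value in conf.items()]


# Prefix distribution is a table property, so it is set per table.
ENABLE_OBJECT_STORAGE = (
    "ALTER TABLE glue.db.events "  # hypothetical table name
    "SET TBLPROPERTIES ('write.object-storage.enabled' = 'true')"
)

conf = spark_catalog_conf("glue", "s3://example-bucket/warehouse")
print("\n".join(render_conf_lines(conf)))
print(ENABLE_OBJECT_STORAGE)
```

Keeping the catalog properties in one function makes it easier to apply the same I/O and retry settings to every job that touches the tables.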
Apply S3 object tagging for lifecycle management. Use Iceberg's s3.write.tags.* properties to tag data files at write time. Tags enable S3 lifecycle rules that target specific file categories — for example, transitioning compacted files to Intelligent-Tiering while keeping recently written files in Standard for immediate re-compaction. For a deeper dive into S3 cost strategies, see 7 Iceberg lakehouse cost reduction strategies.
IAM policies. Iceberg operations require s3:GetObject, s3:PutObject, and s3:DeleteObject on the table's objects, plus s3:ListBucket on the data bucket. Maintenance operations (orphan cleanup in particular) list the table location, so scope s3:ListBucket to the warehouse prefix rather than granting it bucket-wide. For Glue catalog access, include glue:GetTable, glue:UpdateTable, glue:GetDatabase, and related Glue API permissions. Follow the AWS Prescriptive Guidance for least-privilege IAM templates.
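A prefix-scoped policy for the S3 actions listed above could be generated like this. It is a minimal sketch, not a vetted least-privilege template: the bucket and prefix are placeholders, the function name is invented, and the Glue permissions are left out for brevity.

```python
import json


def iceberg_s3_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy document covering the S3 actions Iceberg needs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "TableObjectAccess",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                # Object-level actions apply to keys under the warehouse prefix.
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {
                "Sid": "ListWarehousePrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                # ListBucket applies to the bucket; the condition narrows it to the prefix.
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
        ],
    }


print(json.dumps(iceberg_s3_policy("example-bucket", "warehouse"), indent=2))
```

Splitting object actions from the list action mirrors how IAM evaluates them: the former are checked against object ARNs, the latter against the bucket ARN.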
Security and governance
Iceberg on S3 inherits AWS's security model — IAM for access control, encryption for data protection, and Lake Formation for fine-grained governance. Getting this right at the start avoids retroactive permission rework.
AWS Lake Formation. Lake Formation provides column-level and row-level access control on Iceberg tables registered in Glue. Define permissions once in Lake Formation; they are enforced consistently across Athena, EMR, and Redshift queries. This replaces per-service IAM policy management with centralized, auditable governance — critical for multi-team environments where different groups need different levels of access to the same tables.
Encryption at rest. All Iceberg data files, manifests, and metadata should be encrypted on S3. SSE-S3 (AES-256 managed by S3) has zero API overhead and is the default for most workloads. SSE-KMS provides key rotation, audit trails via CloudTrail, and compliance-grade key management — but adds per-request KMS API calls that can throttle at high throughput. For S3 Tables, encryption is always on with S3-managed keys; customer-managed KMS keys are supported for table buckets.
Catalog-level governance. For estates spanning multiple catalogs (Glue + REST + S3 Tables), governance policies need to be consistent across all access paths. Lake Formation covers Glue-cataloged tables. For the full estate — including tables in REST catalogs, Nessie, or S3 Tables — a control-plane layer that enforces retention, cleanup, and access policies across all catalogs provides unified governance that no single AWS service delivers alone.
Performance optimization
Iceberg's performance on S3 depends on how data is organized within files, how files are organized within partitions, and how effectively the engine can skip irrelevant data at each layer.
Hidden partitioning. Unlike Hive-style partitioning, Iceberg hidden partitions are derived from source column transforms such as days(event_timestamp), bucket(16, user_id), and truncate(2, region). Users write queries against the source columns (WHERE event_timestamp > '2026-01-01'), and the engine automatically maps predicates to partition boundaries. This eliminates the class of bugs where queries forget to filter on the partition column and scan the entire table.
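To make the predicate mapping concrete, here is a sketch of two transforms and of how a timestamp range turns into a set of day partitions. The function names are invented for the example, and the bucket transform is omitted because it depends on a specific hash function that the standard library does not provide.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def days_transform(ts: datetime) -> int:
    """Day partition value: whole days since 1970-01-01 UTC."""
    return (ts - EPOCH).days


def truncate_transform(value: str, *, width: int) -> str:
    """String truncation: keep the first `width` characters."""
    return value[:width]


def matching_day_partitions(lower_inclusive: datetime, upper_exclusive: datetime) -> range:
    """Day partitions a planner must read for lower <= ts < upper."""
    first = days_transform(lower_inclusive)
    last = days_transform(upper_exclusive - timedelta(microseconds=1))
    return range(first, last + 1)


start = datetime(2026, 1, 1, tzinfo=timezone.utc)
end = datetime(2026, 1, 3, tzinfo=timezone.utc)
print(list(matching_day_partitions(start, end)))  # → [20454, 20455]
print(truncate_transform("us-east-1", width=2))   # → us
```

The query only mentions `event_timestamp`; the planner derives the two day partitions on its own, which is why a forgotten partition filter cannot happen.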
Partition evolution. Iceberg supports changing the partition scheme on existing tables as a metadata-only operation — no data rewrite required. Old data retains its original partitioning; new data is written with the updated spec. The engine handles mixed partition layouts transparently during query planning. This allows partition strategies to evolve with changing data volumes and query patterns without expensive migrations.
Sort order and predicate pushdown. Sorting data files on frequently filtered columns enables row-group min/max pruning in Parquet. When a 1 TB table is sorted by event_date, a query filtering WHERE event_date = '2026-05-01' reads only the row groups containing that date — potentially skipping 95%+ of the data. On Athena at $5/TB scanned, this is the difference between $5.00 and $0.25 per query. Define sort order at table creation or evolve it later; the sort is applied during compaction.
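The cost figures above follow from simple arithmetic, sketched here. The function name is invented for the example, and the 95% skip rate is the illustrative figure from the text, not a measured result.

```python
ATHENA_USD_PER_TB = 5.00  # price per TB scanned, as quoted in this guide


def athena_query_cost(table_tb: float, fraction_skipped: float) -> float:
    """Cost of one query when a given fraction of the table is pruned."""
    scanned_tb = table_tb * (1.0 - fraction_skipped)
    return scanned_tb * ATHENA_USD_PER_TB


unsorted_cost = athena_query_cost(1.0, 0.0)
sorted_cost = athena_query_cost(1.0, 0.95)
print(f"unsorted: ${unsorted_cost:.2f}, sorted: ${sorted_cost:.2f}")  # → unsorted: $5.00, sorted: $0.25
```

Multiplied by the number of times a dashboard or pipeline runs the query each day, the gap between the two figures is what pays for the compaction that applies the sort.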
The challenge is choosing the right sort columns. Sort order should reflect production query patterns, not schema intuition. Tools like LakeOps provide layout simulations that replay historical SQL against candidate sort strategies — quantifying projected data-skip improvement before committing to an expensive full-table rewrite. This is especially valuable on large fact tables where a wrong sort choice wastes hours of compute.

Bloom filters. For point-lookup queries (WHERE user_id = '12345'), Bloom filters on high-cardinality columns enable row-group skipping even when the column is not the primary sort key. Set write.parquet.bloom-filter-enabled.column.{col}=true and tune the false-positive rate (default 0.01) based on cardinality. Bloom filters add a small storage overhead per data file but eliminate full-file reads for selective point queries.
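The table properties involved could be generated per column as below. This is a sketch: the enable key is the one quoted above, while the per-column false-positive key (`write.parquet.bloom-filter-fpp.column.{col}`) is my understanding of the Iceberg property name and should be checked against the version in use. The table name and helper names are placeholders.

```python
def bloom_filter_properties(columns: list[str], fpp: float = 0.01) -> dict[str, str]:
    """Table properties that turn on Parquet Bloom filters for the given columns."""
    props: dict[str, str] = {}
    for col in columns:
        props[f"write.parquet.bloom-filter-enabled.column.{col}"] = "true"
        # Assumed key for the false-positive probability; verify before use.
        props[f"write.parquet.bloom-filter-fpp.column.{col}"] = str(fpp)
    return props


def alter_table_sql(table: str, props: dict[str, str]) -> str:
    """Render the properties as a single ALTER TABLE statement."""
    body = ",\n  ".join(f"'{key}' = '{value}'" for key, value in sorted(props.items()))
    return f"ALTER TABLE {table} SET TBLPROPERTIES (\n  {body}\n)"


print(alter_table_sql("glue.db.events", bloom_filter_properties(["user_id"])))
```

The properties only affect files written after the change, so existing data picks up the filters when it is next rewritten by compaction.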
Manifest optimization. Frequent commits (streaming, micro-batch) create many small manifest files. Each query must read all relevant manifests during planning. Running `rewrite_manifests` consolidates fragmented manifests into fewer, larger ones — reducing planning-time GET requests and improving query startup latency. A table with 500 manifests consolidated to 50 sees 10× fewer manifest reads per query.
File sizing. Target 128–512 MB per data file for analytics workloads. Files smaller than 128 MB multiply GET requests and inflate manifest size. Files larger than 512 MB reduce parallelism for engines that assign one task per file. Streaming tables naturally produce undersized files at each checkpoint — compaction corrects this by merging small files to the target size. For a comprehensive performance optimization guide, see optimizing Iceberg lakehouse performance.
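How compaction groups undersized files toward a target can be illustrated with a greedy packing pass. This is a simplified stand-in, not the algorithm any particular engine uses; the function name and the 128 MB target in the usage line are choices made for the example.

```python
MB = 1024 * 1024


def plan_binpack(file_sizes: list[int], target: int = 256 * MB) -> list[list[int]]:
    """Group undersized files into rewrite groups of at most `target` bytes."""
    groups: list[list[int]] = []
    current: list[int] = []
    current_size = 0
    for size in sorted(file_sizes, reverse=True):
        if size >= target:
            continue  # already large enough, leave the file alone
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    # A group of one file would be rewritten for no gain, so drop it.
    return [group for group in groups if len(group) > 1]


groups = plan_binpack([32 * MB] * 10, target=128 * MB)
print([len(group) for group in groups])  # → [4, 4, 2]
```

Ten 32 MB checkpoint files become three output files, which cuts both the GET count per scan and the number of entries each manifest has to carry.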
Table maintenance and production operations
Iceberg does not clean up after itself automatically. Every commit creates new immutable files; old files remain in S3 until explicitly removed. Without maintenance, tables degrade predictably — and at scale, manual maintenance breaks down.
Why tables degrade
Snapshots accumulate. A streaming table committing every 10 minutes creates about 4,320 snapshots per 30-day month (6 per hour × 24 hours × 30 days). Each snapshot pins references to data files that may have been superseded, and those files cannot be deleted from S3 until the referencing snapshot is expired. Unbounded snapshot growth means unbounded storage growth.
Orphan files pile up. Failed writes, crashed compaction jobs, and aborted transactions leave data files in S3 that are not referenced by any snapshot. These orphan files are invisible to Iceberg, but S3 still bills for them. On mature streaming lakes, orphan files can account for 25–40% of billable S3 storage on affected prefixes.
Small files multiply. Streaming engines checkpoint at fixed intervals, producing one file per partition per commit. A table with 100 partitions and 10-minute checkpoints creates 14,400 files per day — degrading query latency and inflating S3 API costs linearly.
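The growth rates quoted in this section come from the arithmetic below. The function names are invented for the example; the snapshot figure works out to 4,320 for a 30-day month, which the prose rounds.

```python
def commits_per_day(interval_minutes: int) -> int:
    """Number of commits a writer makes per day at a fixed interval."""
    return (24 * 60) // interval_minutes


def snapshots_per_month(interval_minutes: int, days: int = 30) -> int:
    """Each commit creates one snapshot."""
    return commits_per_day(interval_minutes) * days


def files_per_day(interval_minutes: int, partitions: int) -> int:
    """One file per partition per commit, as described above."""
    return commits_per_day(interval_minutes) * partitions


print(snapshots_per_month(10))  # → 4320
print(files_per_day(10, 100))   # → 14400
```

Halving the checkpoint interval doubles both numbers, which is why low-latency streaming tables are usually the first to need continuous maintenance.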

None of this degradation is visible in S3 dashboards or default CloudWatch metrics. Detecting it requires reading Iceberg metadata — file counts, manifest depth, snapshot history, partition distributions — and classifying each table by structural health. Tools that surface this as severity-ranked alerts (CRITICAL for extreme file proliferation, WARNING for snapshot sprawl) let teams act before queries regress or invoices spike.
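A health classifier of the kind described could look like the sketch below. Every threshold here is a hypothetical placeholder chosen for illustration, not a recommendation and not any tool's actual rule set; the inputs are assumed to have been read from Iceberg metadata beforehand.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    data_files: int
    avg_file_mb: float
    snapshots: int
    manifests: int


def classify(stats: TableStats) -> str:
    """Map structural metrics to a severity level (thresholds are hypothetical)."""
    # Extreme file proliferation: very many files that are also very small.
    if stats.data_files > 100_000 and stats.avg_file_mb < 16:
        return "CRITICAL"
    # Snapshot sprawl, fragmented manifests, or undersized files.
    if stats.snapshots > 1_000 or stats.manifests > 500 or stats.avg_file_mb < 64:
        return "WARNING"
    return "HEALTHY"


print(classify(TableStats(data_files=250_000, avg_file_mb=4.0, snapshots=5_000, manifests=900)))
print(classify(TableStats(data_files=2_000, avg_file_mb=200.0, snapshots=50, manifests=40)))
```

In practice the thresholds would differ per workload, which is one reason to drive them from policy rather than hardcode them.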

The correct maintenance sequence
Each operation depends on the output of the previous step:
- 1. Expire snapshots: dereference superseded data files
- 2. Remove orphan files: delete unreferenced S3 objects (must run after expiration)
- 3. Compact data files: merge small files into 128–512 MB targets
- 4. Rewrite manifests: consolidate against the compacted layout
- 5. Refresh column statistics: update min/max for the new file set
Running these out of order wastes compute. Compacting before expiring snapshots rewrites files about to be dereferenced. Cleaning orphans before expiring snapshots misses the largest category of reclaimable storage. The Spark procedures for each step exist — the challenge is executing them correctly, in sequence, at scale.
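The first four steps map onto Iceberg's Spark procedures, and the ordering can be encoded once so that it is not re-decided per job. The sketch below only builds the SQL strings; the catalog name, table name, retention window, and target size are placeholders, and step 5 is left out because this guide does not name a procedure for it.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def maintenance_calls(
    catalog: str,
    table: str,
    retain_days: int = 7,
    target_file_mb: int = 256,
    now: Optional[datetime] = None,
) -> list[str]:
    """Return the Spark SQL CALL statements in the order they should run."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=retain_days)).strftime("%Y-%m-%d %H:%M:%S")
    target_bytes = target_file_mb * 1024 * 1024
    system = f"{catalog}.system"
    return [
        # 1. Expire snapshots so superseded files are no longer referenced.
        f"CALL {system}.expire_snapshots(table => '{table}', older_than => TIMESTAMP '{cutoff}')",
        # 2. Remove files that no snapshot references any more.
        f"CALL {system}.remove_orphan_files(table => '{table}', older_than => TIMESTAMP '{cutoff}')",
        # 3. Compact the remaining small files.
        f"CALL {system}.rewrite_data_files(table => '{table}', strategy => 'binpack', "
        f"options => map('target-file-size-bytes', '{target_bytes}'))",
        # 4. Consolidate manifests against the compacted layout.
        f"CALL {system}.rewrite_manifests(table => '{table}')",
    ]


for statement in maintenance_calls("glue", "db.events"):
    print(statement)  # in a real job each statement would be passed to spark.sql(...)
```

Returning the statements as an ordered list keeps the sequence in one place, so a scheduler or a health-driven trigger can run them without knowing the dependencies between steps.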
Why manual maintenance fails
A 50-table lakehouse can be maintained with Airflow DAGs. A 500-table estate across Glue catalogs, REST catalogs, and S3 Tables — with streaming and batch tables at different ingestion rates — cannot. Each table has its own degradation timeline. Fixed schedules compact healthy tables that do not need it and miss degraded tables between runs. Without observability connecting table structure to S3 billing, teams react to monthly invoice spikes rather than preventing them. And Spark-based maintenance on EMR — JVM startup, executor provisioning, cluster idle time — can cost more than the savings it produces.
What a production deployment needs
The AWS services provide the building blocks (storage, cataloging, querying, compute). What they do not provide is the operational loop that keeps tables healthy. A production Iceberg-on-S3 deployment needs five additional capabilities:
Observability — health classification across every table: file count, manifest depth, snapshot sprawl, orphan volume, partition skew. Surfaced as severity-ranked alerts that lead invoices and query regressions, not trail them.
Execution engine — maintenance that is cheap enough to run continuously. If compaction costs more than the S3 savings, teams stop running it. Purpose-built engines (native code, no JVM overhead) complete the same work at a fraction of Spark's time and cost.
Orchestration — correct operation order, health-driven triggers (not fixed cron), and conflict-aware scheduling that avoids collisions with streaming writers. The control-plane intelligence.
Query routing — intelligent dispatch of queries to the engine best suited for each workload (Athena for ad-hoc, Trino for latency-sensitive dashboards, Spark for heavy ETL). Without this, teams hardcode engine choices that become suboptimal as tables and workloads evolve.
Policy framework — retention windows, compaction thresholds, and cleanup rules defined as versioned policies scoped from table through namespace to catalog baseline. Without this, every table's maintenance drifts independently.
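The scoping rule in the policy framework (table overrides namespace, which overrides the catalog baseline) can be sketched as a layered dictionary merge. The policy keys, values, and function name are hypothetical and only illustrate the resolution order.

```python
def resolve_policy(
    table: str,
    catalog_baseline: dict[str, int],
    namespace_policies: dict[str, dict[str, int]],
    table_policies: dict[str, dict[str, int]],
) -> dict[str, int]:
    """Merge policies from the broadest scope to the narrowest."""
    namespace = table.rsplit(".", 1)[0]
    effective = dict(catalog_baseline)                        # catalog-wide defaults
    effective.update(namespace_policies.get(namespace, {}))  # namespace overrides
    effective.update(table_policies.get(table, {}))          # table overrides win
    return effective


baseline = {"snapshot_retention_days": 7, "target_file_mb": 256}
by_namespace = {"db": {"snapshot_retention_days": 3}}
by_table = {"db.events": {"target_file_mb": 512}}

print(resolve_policy("db.events", baseline, by_namespace, by_table))
# → {'snapshot_retention_days': 3, 'target_file_mb': 512}
```

A table with no overrides simply inherits the baseline, so adding a table to the estate requires no new configuration.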
Amazon S3 Tables — storage-layer automation
Amazon S3 Tables embeds maintenance directly into the storage layer. Auto-compaction (binpack, sort, z-order), snapshot management, and orphan cleanup run continuously with zero configuration — covering the execution engine and partial orchestration for tables in S3 table buckets.
S3 Tables does not cover cross-catalog observability, query-aware sort optimization, multi-engine routing, or policy enforcement across mixed catalog estates. It is the right starting point for greenfield AWS Iceberg workloads with Athena and EMR as the primary engines.
LakeOps — the control plane in action
The control plane introduced at the start of this guide delivers all five capabilities. In the maintenance context, what matters most is compaction economics and governance enforcement:
The Rust engine's speed (221s vs Spark's 1,612s on 200 GB tables) makes continuous maintenance across the full estate self-funding rather than a second infrastructure cost. For a detailed comparison, see Iceberg compaction.

Policies enforce the correct maintenance sequence, retention windows, and compaction rules across every catalog — versioned and auditable. No per-team Airflow DAGs, no drifting configurations between tables. See how maintenance and governance work together.

S3 Tables and LakeOps are complementary: S3 Tables handles storage-layer automation for tables in table buckets, LakeOps handles the full operational loop — observability, orchestration, compaction, routing, and governance — across mixed-catalog estates with multiple engines. For the full maintenance model, see autonomous Iceberg table maintenance.
Summary — the production checklist
Running Iceberg on S3 successfully requires each layer to be addressed deliberately:
Storage and catalog: S3 with ObjectStoreLocationProvider and S3FileIO. Glue Data Catalog (or REST catalog) for metadata. Encryption at rest. Lake Formation for fine-grained access control.
Compute and querying: Athena for serverless SQL. EMR for Spark/Flink streaming and heavy ETL. Redshift for federated analytics. S3 Tables for managed Iceberg with embedded automation.
Performance: Hidden partitioning aligned to query patterns. Sort order validated against actual workloads (layout simulations). Bloom filters for point lookups. 128–512 MB target file size. Consolidated manifests.
Maintenance: Sequenced lifecycle (expire → orphans → compact → manifests → statistics). Execution engine fast enough to make automation self-funding. Health-driven triggers, not fixed cron.
Operations at scale: Observability across all catalogs. Policy governance versioned and scoped. Cross-engine query routing. S3 Tables handles storage-layer automation for table buckets. LakeOps handles the full operational loop — observability, orchestration, compaction, and governance — across mixed-catalog estates with multiple engines.
Each layer builds on the previous one. Skip governance and you cannot enforce maintenance. Skip maintenance and performance degrades. Skip observability and you discover problems from invoices rather than alerts. The guide above covers each layer; the tools at the end operationalize them.


