AWS Glue Iceberg Optimization: A Practical Guide


AWS Glue is the default entry point for Apache Iceberg on AWS. It serves as both the catalog — registering table metadata in the Glue Data Catalog — and the ETL engine — running Spark-based jobs that read and write Iceberg tables. Glue has offered built-in table optimizers since late 2023, starting with compaction and expanding through 2024 to include snapshot retention, orphan file cleanup, and — as of December 2024 — Merge-on-Read delete file compaction, partition evolution support, and partial progress commits. With Glue 5.0 shipping Iceberg 1.7.1, the integration is the most capable it has ever been. For teams already on AWS, the path is frictionless: create a table, write data, enable optimizers, and Glue handles the rest.

That description holds for a handful of tables with predictable batch workloads. At production scale — hundreds of tables, mixed streaming and batch ingestion, multiple query engines (Athena, Trino, Spark, Redshift), and teams with different SLAs — the built-in optimizers hit hard boundaries. Compaction runs on threshold-based triggers with no awareness of cross-engine query patterns. There is no telemetry to inform sort order decisions. Maintenance operations run independently with no sequencing guarantees. And observability is limited to CloudWatch metrics that tell you a job ran, not whether your table is healthy.

This guide covers the full Glue-Iceberg stack: catalog configuration, ETL job patterns, built-in optimizer tuning, common problems, and the architectural limits you will hit as you scale. Tools like LakeOps address these gaps by providing autonomous table maintenance, lake-wide observability, and query-aware optimization as a dedicated control plane for Iceberg — we'll cover how it complements Glue later in this guide.

AWS Glue Data Catalog for Iceberg

The AWS Glue Data Catalog acts as the Iceberg catalog implementation on AWS. When you register an Iceberg table in the Glue Data Catalog, it stores the pointer to the current metadata.json file in S3 — the root of Iceberg's metadata tree. Every query engine that connects to the Glue catalog (Athena, EMR, Redshift Spectrum, Spark on Glue) resolves table locations through this pointer.

Registering Iceberg tables

Tables can be registered through Spark SQL in a Glue ETL job, through Athena DDL, or via the AWS Glue API. The Spark approach is most common for production pipelines because it provides full control over table properties at creation time.

```sql
-- Create an Iceberg table registered in the Glue Data Catalog
CREATE TABLE glue_catalog.analytics.page_events (
  event_id STRING,
  user_id STRING,
  event_type STRING,
  page_url STRING,
  event_timestamp TIMESTAMP,
  properties MAP<STRING, STRING>
)
USING iceberg
PARTITIONED BY (days(event_timestamp))
LOCATION 's3://datalake-prod/analytics/page_events'
TBLPROPERTIES (
  'format-version' = '2',
  'write.parquet.compression-codec' = 'zstd',
  'write.target-file-size-bytes' = '134217728',
  'write.metadata.delete-after-commit.enabled' = 'true',
  'write.metadata.previous-versions-max' = '100'
);
```

Glue catalog configuration for Spark jobs

Glue ETL jobs need explicit Spark configuration to use the Glue Data Catalog as an Iceberg catalog. This is set in the job's --conf parameters or in a GlueContext initialization script.

```python
# Glue ETL job: Spark configuration for Iceberg with the Glue catalog
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# --datalake-formats iceberg only loads the Iceberg JARs. The catalog settings
# below are usually passed through the job's --conf parameter so they apply
# before the session starts. Setting them at runtime, as here, relies on the
# catalog not having been loaded yet in this session.
spark.conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
spark.conf.set("spark.sql.catalog.glue_catalog.warehouse", "s3://datalake-prod/")
spark.conf.set("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
spark.conf.set("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
```
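Catalog settings are often supplied as a job parameter rather than in the script. The sketch below builds that parameter value in the chained form Glue uses, where a single `--conf` key carries several settings separated by ` --conf `. The helper name `build_iceberg_conf` is hypothetical; the catalog name and warehouse path are the ones used in the example above.

```python
def build_iceberg_conf(catalog: str, warehouse: str) -> str:
    """Return the value for a Glue job's --conf parameter.

    Glue accepts a single --conf key, so additional settings are chained
    inside the value with " --conf " separators.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    settings = [
        # Enables Iceberg SQL extensions such as CALL procedures
        ("spark.sql.extensions",
         "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"),
        (prefix, "org.apache.iceberg.spark.SparkCatalog"),
        (f"{prefix}.warehouse", warehouse),
        (f"{prefix}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog"),
        (f"{prefix}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO"),
    ]
    return " --conf ".join(f"{key}={value}" for key, value in settings)


conf_value = build_iceberg_conf("glue_catalog", "s3://datalake-prod/")
print(conf_value)
```

The printed string would go in the job's parameters under the key `--conf`, next to `--datalake-formats iceberg`.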

Key properties to set at table creation:

`format-version = 2`: enables row-level deletes (position deletes and equality deletes), which are required for efficient MERGE INTO and DELETE operations
`write.parquet.compression-codec`: zstd offers a strong compression-to-speed ratio for analytical workloads; Iceberg 1.4 and later already default to zstd, but setting it explicitly keeps tables written by older versions consistent
`write.target-file-size-bytes`: controls the target size for data files written by Spark; 128 MB (134217728) is a reasonable default for mixed workloads
`write.metadata.delete-after-commit.enabled`: deletes the oldest tracked metadata files after each commit to prevent metadata directory bloat in S3
`write.metadata.previous-versions-max`: the number of previous metadata files to keep when delete-after-commit is enabled (100 in the example above)

IAM and Lake Formation considerations

The Glue catalog respects AWS IAM policies and, optionally, Lake Formation permissions. For Iceberg tables, the IAM role running the ETL job or query engine needs:

glue:GetTable, glue:GetDatabase, glue:UpdateTable on the catalog resources
s3:GetObject, s3:PutObject, s3:DeleteObject on the S3 location
s3:ListBucket on the bucket for metadata discovery
lakeformation:GetDataAccess if Lake Formation governance is enabled

A common production mistake is granting broad s3:* permissions instead of scoping to specific prefixes. This creates security risk and makes it harder to audit which jobs access which tables.
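As an illustration of prefix scoping, the following sketch generates a policy document limited to one table. It uses only the actions listed above; the helper name `build_table_policy` and the region `us-east-1` are assumptions made for the example, and a real policy may need additional actions (for instance to create tables).

```python
import json


def build_table_policy(region, account_id, database, table, bucket, prefix):
    """Build a least-privilege IAM policy document for one Iceberg table."""
    prefix = prefix.strip("/")
    glue_arn = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Catalog access limited to one database and table
                "Effect": "Allow",
                "Action": ["glue:GetDatabase", "glue:GetTable", "glue:UpdateTable"],
                "Resource": [
                    f"{glue_arn}:catalog",
                    f"{glue_arn}:database/{database}",
                    f"{glue_arn}:table/{database}/{table}",
                ],
            },
            {   # Object access limited to the table's S3 prefix
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
            {   # Listing limited to keys under the same prefix
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
        ],
    }


policy = build_table_policy(
    "us-east-1", "123456789012", "analytics", "page_events",
    "datalake-prod", "analytics/page_events",
)
print(json.dumps(policy, indent=2))
```

Generating one such document per table also gives auditors a direct mapping from role to table prefix.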

Glue ETL jobs writing to Iceberg

AWS Glue ETL jobs run Apache Spark under the hood, which means Iceberg write semantics follow the standard Spark-Iceberg integration. However, the Glue runtime adds its own constraints — worker types, job bookmarks, auto-scaling behavior — that affect how data lands in Iceberg tables.

Write patterns and best practices

Batch append workloads. The most straightforward pattern: a Glue job reads source data (S3, JDBC, Kafka via micro-batch), transforms it, and appends to an Iceberg table. Use INSERT INTO for simple appends and MERGE INTO when deduplication or upsert logic is required.

```python
# Batch append: straightforward INSERT INTO
df = spark.read.parquet("s3://raw-data/events/2026-05-25/")
df_transformed = df.select(
    "event_id", "user_id", "event_type",
    "page_url", "event_timestamp", "properties"
).filter("event_type IS NOT NULL")

df_transformed.writeTo("glue_catalog.analytics.page_events").append()
```

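```python
# Upsert pattern: MERGE INTO for deduplication
# staging_events is assumed to be a temp view registered earlier in the job,
# for example with df_transformed.createOrReplaceTempView("staging_events")
spark.sql("""
  MERGE INTO glue_catalog.analytics.page_events t
  USING staging_events s
  ON t.event_id = s.event_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```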

Controlling file output size. Glue auto-scaling adjusts the number of workers during a job, which changes the number of concurrent Spark tasks and therefore the number of output files. Each write task produces at least one file for every partition it holds rows for, so a stage with 100 tasks writing to 50 partitions can produce up to 5,000 files in a single run. To control this:

Set write.target-file-size-bytes on the table to guide the Spark writer
Repartition the DataFrame by the table's partition expression before writing, for example df.repartition(to_date("event_timestamp")) for a table partitioned by days(event_timestamp)
Use coalesce() to reduce output parallelism when the dataset is small relative to the cluster size

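```python
from pyspark.sql import functions as F

# Repartition by calendar day to match the table's days(event_timestamp)
# partitioning, so each day's rows are written by one task as fewer, larger files
df_transformed \
    .repartition(F.to_date("event_timestamp")) \
    .sortWithinPartitions("event_timestamp") \
    .writeTo("glue_catalog.analytics.page_events") \
    .append()
```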

Streaming micro-batch. Glue Streaming jobs (Spark Structured Streaming) commit to Iceberg tables once per trigger interval. Shorter intervals mean fresher data but more small files. A 60-second trigger writing to 20 active partitions produces at least 28,800 files per day across 1,440 commits, and every commit adds a new snapshot and new manifest entries. This is the primary driver of small-file problems in Glue-Iceberg architectures, and it is why Glue's December 2024 addition of partial progress commits and MoR compaction matters for streaming workloads.
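The arithmetic behind these file counts is easy to reproduce. The following sketch is a rough estimator, assuming every trigger commits and writes exactly one file into each active partition (the best case for a given interval); the function name `daily_write_volume` is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_write_volume(trigger_seconds, active_partitions, files_per_partition=1):
    """Estimate commits and data files created per day by a periodic writer.

    Assumes every trigger commits and writes `files_per_partition` files into
    each active partition, which is the best case for a given interval.
    """
    commits = SECONDS_PER_DAY // trigger_seconds
    files = commits * active_partitions * files_per_partition
    return {"commits": commits, "files": files}


streaming = daily_write_volume(trigger_seconds=60, active_partitions=20)
batch = daily_write_volume(trigger_seconds=15 * 60, active_partitions=10)
print(streaming)  # {'commits': 1440, 'files': 28800}
print(batch)      # {'commits': 96, 'files': 960}
```

Raising `files_per_partition` to the number of write tasks per partition shows how quickly an unrepartitioned job multiplies these figures.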

Glue job parameters that matter for Iceberg

Parameter	Recommendation	Why
`--datalake-formats`	`iceberg`	Loads Iceberg JARs into the Spark runtime
`--conf spark.sql.iceberg.handle-timestamp-without-timezone`	`true`	Prevents timestamp casting issues
Worker type	`G.2X` or `G.4X`	Iceberg writes are memory-intensive; `G.1X` causes spills
`--enable-auto-scaling`	`true`	Let Glue scale down for small datasets to reduce file count
`--job-bookmark-option`	`job-bookmark-enable`	Prevents reprocessing the same source data

Glue built-in table optimizers

AWS Glue offers managed table optimizers that run maintenance operations on Iceberg tables without requiring separate ETL jobs. Initially launched with compaction in late 2023, the optimizers expanded in December 2024 to include MoR (Merge-on-Read) delete file compaction, nested data type support, partial progress commits, and schema and partition evolution support. Optimizers are configured per-table — or at catalog level via the UpdateCatalog API — through the Glue console, CLI, or API.

Compaction

Glue compaction merges small data files into larger ones. Three strategies are available:

Binpack — merges files without changing sort order; fastest, lowest compute cost
Sort — rewrites data files sorted by specified columns; improves query performance for filtered reads
Z-order — interleaves multiple sort columns for multi-dimensional filtering; useful when queries filter on different column combinations

Since December 2024, the compaction optimizer also handles MoR (Merge-on-Read) delete files — it monitors partitions for positional and equality deletes, compacts them into the base data files, and commits partial progress to reduce conflicts with concurrent writers. This is critical for streaming workloads where delete files accumulate rapidly between compaction cycles.

Configuration is done via the AWS CLI or console:

```bash
# Enable the Glue compaction optimizer via the AWS CLI.
# update-table-optimizer changes an existing optimizer; use
# create-table-optimizer with the same arguments the first time.
# The output file size comes from the table's write.target-file-size-bytes property.
aws glue update-table-optimizer \
  --catalog-id 123456789012 \
  --database-name analytics \
  --table-name page_events \
  --type compaction \
  --table-optimizer-configuration '{
    "enabled": true,
    "roleArn": "arn:aws:iam::123456789012:role/GlueOptimizer",
    "compactionConfiguration": {
      "icebergConfiguration": {
        "strategy": "binpack"
      }
    }
  }'
```

Snapshot retention

Iceberg tables accumulate snapshots with every write operation. Each snapshot references manifest files and data files; old snapshots keep references alive, preventing garbage collection. Glue's snapshot retention optimizer expires snapshots older than a configured threshold and deletes the associated data files. The default configuration retains snapshots for 5 days with a minimum of 1 snapshot — which is often too aggressive for tables that need time-travel capabilities, and too lenient for high-write streaming tables.

```bash
# Enable snapshot retention: expire snapshots older than 5 days, keep at least 3
aws glue update-table-optimizer \
  --catalog-id 123456789012 \
  --database-name analytics \
  --table-name page_events \
  --type retention \
  --table-optimizer-configuration '{
    "enabled": true,
    "roleArn": "arn:aws:iam::123456789012:role/GlueOptimizer",
    "retentionConfiguration": {
      "icebergConfiguration": {
        "snapshotRetentionPeriodInDays": 5,
        "numberOfSnapshotsToRetain": 3,
        "cleanExpiredFiles": true
      }
    }
  }'
```

Orphan file deletion

Orphan files are data files in S3 that are no longer referenced by any Iceberg snapshot. They accumulate from failed writes, aborted jobs, and stale snapshots that were expired without cleanup. Glue's orphan file deletion optimizer identifies and removes these files on a daily schedule, with a default retention of 3 days before deletion.

```bash
# Enable orphan file deletion: remove files unreferenced for more than 7 days
aws glue update-table-optimizer \
  --catalog-id 123456789012 \
  --database-name analytics \
  --table-name page_events \
  --type orphan_file_deletion \
  --table-optimizer-configuration '{
    "enabled": true,
    "roleArn": "arn:aws:iam::123456789012:role/GlueOptimizer",
    "orphanFileDeletionConfiguration": {
      "icebergConfiguration": {
        "orphanFileRetentionPeriodInDays": 7
      }
    }
  }'
```

All three optimizers are useful for basic hygiene. They reduce storage cost, prevent unbounded metadata growth, and keep file counts manageable for query engines. For most teams, enabling all three on every production table is the minimum viable maintenance configuration.

Common problems with Glue-Iceberg workloads

Even with built-in optimizers enabled, production Glue-Iceberg deployments encounter recurring issues. These are structural problems driven by the interaction between Glue's execution model and Iceberg's file-based architecture.

Small files from frequent ETL jobs

The most pervasive problem. Every Glue job run writes new data files — one per Spark task per partition. A job running every 15 minutes with 10 active partitions produces 960+ files per day. Streaming jobs with 60-second triggers are worse by an order of magnitude. Small files degrade query performance because each file requires a separate S3 GET request during scan, and engines spend more time on I/O overhead than on actual data processing.

Glue's built-in compaction helps — it monitors partitions and fires when file count and size thresholds are crossed — but the fixed thresholds and lack of priority ordering mean it may lag behind file creation on high-velocity tables. The gap between file creation rate and compaction frequency means tables oscillate between degraded and optimized states.

Partition sprawl

High-cardinality partition columns (hour-level timestamps, user IDs, or composite partition keys) create thousands of partitions, each containing a small number of files. Iceberg handles partition pruning efficiently, but the metadata overhead grows with partition count — more manifests, more manifest entries, and longer planning times. Glue jobs that write to many partitions simultaneously exacerbate this because each partition receives one small file per task.

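The fix is partition strategy design: prefer coarser transforms such as days() or months() over hours(), and use bucket() rather than identity partitioning for high-cardinality keys such as user IDs. Because Iceberg partitioning is hidden from queries and supports partition evolution, the granularity can be changed later without rewriting queries.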

Stale snapshot accumulation

Without snapshot expiration enabled, every write operation adds a snapshot that is never removed. A table with 100 daily writes accumulates 36,500 snapshots per year. Each snapshot references a manifest list, and each manifest list references manifests, creating a metadata graph that grows linearly with commit count. Engines that only need the current snapshot still pay for this: the table's metadata.json records every retained snapshot, so it grows with each commit and has to be fetched and parsed whenever the table is loaded and rewritten on every write.

Glue's snapshot retention optimizer addresses this, but the default retention period may be too generous for high-write tables. A 7-day retention on a table with 100 daily writes means 700 live snapshots at any time — still enough to cause planning latency.
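To see how retention settings translate into live snapshot counts, the sketch below applies a simplified expiration rule to synthetic commit timestamps. It assumes a snapshot survives if it is younger than the maximum age or among the most recent `min_to_keep`; the function name `retained_snapshots` and the evenly spaced commits are illustrative only.

```python
from datetime import datetime, timedelta


def retained_snapshots(commit_times, now, max_age_days, min_to_keep):
    """Return the commit timestamps that survive an expiration pass.

    A snapshot is kept if it is younger than max_age_days, or if it is among
    the min_to_keep most recent snapshots.
    """
    cutoff = now - timedelta(days=max_age_days)
    newest_first = sorted(commit_times, reverse=True)
    protected = set(newest_first[:min_to_keep])
    return [t for t in newest_first if t >= cutoff or t in protected]


now = datetime(2026, 1, 31)
# 100 evenly spaced commits per day over the previous 30 days
step = timedelta(days=1) / 100
commits = [now - step * i for i in range(1, 30 * 100 + 1)]

print(len(retained_snapshots(commits, now, max_age_days=7, min_to_keep=1)))  # 700
print(len(retained_snapshots(commits, now, max_age_days=1, min_to_keep=1)))  # 100
```

Shortening the retention window from 7 days to 1 day cuts the live snapshot count sevenfold here, at the cost of a shorter time-travel window.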

Uncoordinated maintenance

The three Glue optimizers (compaction, snapshot retention, orphan cleanup) run independently. There is no guarantee that snapshot expiration completes before orphan cleanup runs, or that compaction targets the current set of live files. In the worst case, compaction rewrites files that snapshot expiration is about to dereference, wasting compute. Or orphan cleanup runs before expiration has finished, missing files that were just dereferenced.

The correct maintenance sequence is: (1) expire snapshots → (2) remove orphan files → (3) compact data files → (4) rewrite manifests. Glue does not enforce this ordering.
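Teams that want this ordering today typically run Iceberg's Spark procedures from a single job. The sketch below only builds the four `CALL` statements in the order given above; the function name and the default thresholds are assumptions, and the statements would be executed with `spark.sql(...)` in a job that has the Iceberg SQL extensions enabled.

```python
from datetime import datetime, timedelta, timezone


def maintenance_statements(catalog, table, now, snapshot_days=5,
                           keep_snapshots=3, orphan_days=7,
                           target_bytes=134217728):
    """Return Iceberg Spark procedure calls in the order given above."""
    def ts(days):
        cutoff = now - timedelta(days=days)
        return cutoff.strftime("%Y-%m-%d %H:%M:%S")

    system = f"{catalog}.system"
    return [
        # 1. Expire old snapshots, releasing files only they referenced
        f"CALL {system}.expire_snapshots(table => '{table}', "
        f"older_than => TIMESTAMP '{ts(snapshot_days)}', "
        f"retain_last => {keep_snapshots})",
        # 2. Remove files that no snapshot references any more
        f"CALL {system}.remove_orphan_files(table => '{table}', "
        f"older_than => TIMESTAMP '{ts(orphan_days)}')",
        # 3. Compact the remaining live data files
        f"CALL {system}.rewrite_data_files(table => '{table}', "
        f"strategy => 'binpack', "
        f"options => map('target-file-size-bytes', '{target_bytes}'))",
        # 4. Rewrite manifests to reflect the compacted layout
        f"CALL {system}.rewrite_manifests(table => '{table}')",
    ]


statements = maintenance_statements(
    "glue_catalog", "analytics.page_events",
    now=datetime(2026, 6, 1, tzinfo=timezone.utc),
)
for sql in statements:
    print(sql)  # in a Glue job: spark.sql(sql)
```

If the Glue optimizers are also enabled on the same table, the two mechanisms can compete for commits, so it is usually one or the other per table.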

Tuning Glue compaction

Glue's compaction optimizer provides configurable parameters that can be tuned to match your workload. Getting these right is the difference between compaction that keeps up with ingestion and compaction that burns DPU-hours without impact.

Target file size

Glue's compaction optimizer takes its output file size from the table's write.target-file-size-bytes property rather than from a separate optimizer setting. The right value depends on the query pattern:

Point lookups and small-range scans: 64–128 MB. Smaller files mean less data read per query, which matters when queries touch a single partition or a narrow key range.
Full-partition scans and aggregations: 256–512 MB. Larger files reduce the number of S3 GET requests and improve throughput for sequential reads.
Mixed workloads: 128 MB is a reasonable default that balances both access patterns.
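Since the table property is expressed in bytes, a small helper avoids unit mistakes. The sketch below picks one value from each range above (the specific choices of 64, 256, and 128 MB are illustrative, not recommendations) and emits the corresponding `ALTER TABLE` statement; `target_size_statement` is a hypothetical name.

```python
MB = 1024 * 1024

# One illustrative target per workload type, in MB, chosen from the ranges above
TARGETS_MB = {"point_lookup": 64, "full_scan": 256, "mixed": 128}


def target_size_statement(table, workload):
    """Return an ALTER TABLE statement that sets the table's target file size."""
    size_bytes = TARGETS_MB[workload] * MB
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES "
        f"('write.target-file-size-bytes' = '{size_bytes}')"
    )


stmt = target_size_statement("glue_catalog.analytics.page_events", "full_scan")
print(stmt)
```

The new target applies to files written after the change; existing files keep their size until they are rewritten by compaction.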

Compaction strategy selection

Binpack for tables where query performance is acceptable and you only need file consolidation. It is the cheapest strategy because it does not re-sort data.
Sort when queries consistently filter on one or two columns. Sort compaction physically orders data so engines can skip row groups using Parquet min/max statistics. For a table filtered by event_date and user_id, sorting on those columns can reduce scan volume by 50–80%.
Z-order when queries filter on varying column combinations. Z-order interleaves sort dimensions so no single column dominates, providing moderate pruning across multiple filter columns. The trade-off is that z-order produces worse pruning than single-column sort for any individual query pattern.
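The interleaving idea behind z-order can be shown with two small integer columns. This is a minimal sketch of bit interleaving (a Morton code), not Iceberg's implementation, which first normalizes arbitrary column types to comparable byte strings; `z_value` is a hypothetical name.

```python
def z_value(x, y, bits=8):
    """Interleave the bits of two non-negative integers into one Z-order key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return z


# Sort a small grid of (x, y) pairs by Z-order key
points = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(points, key=lambda p: z_value(*p))
print(ordered[:8])
```

The first four points in the sorted output all have x < 2 and y < 2, which is why files written in this order carry tight min/max ranges on both columns at once.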

For a detailed comparison of strategies and engines, see our compaction tools benchmark.

Compression codecs

Compression is set at the table level via write.parquet.compression-codec, not in the compaction optimizer. Compaction rewrites files using the table's configured codec:

Zstd: best compression ratio with acceptable decompression speed; reduces S3 storage cost and GET data transfer
Snappy: fastest decompression; Spark's default for plain Parquet writes; slightly larger files than zstd
LZ4: similar to snappy in speed, marginally better compression

For most analytical workloads, use zstd, and switch tables that were created with snappy. The storage savings compound across the entire table as compaction rewrites older files.

When compaction triggers

Glue's compaction optimizer continuously monitors table partitions and fires when internal thresholds are met — for example, when a partition exceeds 100 files each smaller than 75% of the target file size. This is better than a blind cron, but you cannot customize the thresholds or configure:

Custom file count or size thresholds per partition
Event-driven triggers (compact immediately after an ETL job completes)
Priority ordering (compact the most degraded tables first)
Frequency caps or concurrency limits across tables

This is the most significant tuning limitation. Teams that need precise control over compaction timing still rely on custom Spark jobs or external orchestration.
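For teams that do build external orchestration, the threshold rule described above is simple to express. The sketch below mirrors the example thresholds (more than 100 files, each under 75% of target) as defaults; it is not Glue's internal logic, which is not public, and `needs_compaction` is a hypothetical name.

```python
def needs_compaction(file_sizes, target_bytes, min_small_files=100, small_ratio=0.75):
    """Return True when a partition holds more than min_small_files small files.

    A file counts as small when it is below small_ratio of the target size.
    The defaults mirror the example thresholds quoted above.
    """
    limit = target_bytes * small_ratio
    small = sum(1 for size in file_sizes if size < limit)
    return small > min_small_files


TARGET = 128 * 1024 * 1024
streaming_partition = [4 * 1024 * 1024] * 150   # 150 files of 4 MB
compacted_partition = [120 * 1024 * 1024] * 12  # 12 files near target size

print(needs_compaction(streaming_partition, TARGET))  # True
print(needs_compaction(compacted_partition, TARGET))  # False
```

An orchestrator could call a check like this right after an ETL job finishes, with per-table thresholds, to get the event-driven behavior described in the list above.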

Glue limitations for Iceberg at scale

The built-in Glue optimizers are competent for individual table maintenance — especially after the December 2024 enhancements. For teams scaling beyond a few dozen tables or operating multi-engine environments, several architectural limitations surface.

Single-catalog scope

Glue optimizers operate only on tables registered in the AWS Glue Data Catalog. If your architecture includes REST catalogs (Polaris, Nessie, Gravitino), S3 Tables buckets, or Iceberg tables managed by other catalogs, Glue cannot maintain those tables. You need separate maintenance pipelines for each catalog — and no unified view of table health across the estate.

No cross-engine telemetry

Glue has no visibility into queries from Athena, Trino, Redshift, DuckDB, or Snowflake. Sort order decisions in Glue compaction are based on static configuration, not actual query patterns. If Athena queries filter by region and timestamp but the table is sorted by event_id, the sort compaction is wasted effort. Without cross-engine telemetry, there is no way to know which columns matter most for data skipping.

No coordinated maintenance sequencing

As discussed, the three optimizers run independently. Compaction may rewrite files that snapshot expiration is about to dereference. Orphan cleanup may run before expired snapshots release their file references. There is no dependency graph, no sequencing guarantee, and no way to configure one.

No manifest optimization

Glue does not offer a manifest rewrite optimizer. Over time, as compaction replaces data files, manifests accumulate stale entries and grow beyond optimal size. Query planning time increases because engines must read and parse more manifest files. The only way to rewrite manifests on Glue-cataloged tables is to run a custom Spark job calling rewrite_manifests().

Limited observability

Glue provides CloudWatch metrics for optimizer runs (success/failure, duration, DPU usage) but does not surface Iceberg-level health indicators: file count per partition, average file size, delete-file ratio, manifest depth, snapshot count trends, or storage cost attribution per table. Diagnosing why a table is slow requires manual inspection of Iceberg's metadata tables, for example running SELECT * FROM "page_events$snapshots" and SELECT * FROM "page_events$files" in Athena (or analytics.page_events.snapshots and analytics.page_events.files in Spark), then correlating results across tables.
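Once the rows of the `files` metadata table have been fetched, the indicators listed above are straightforward to derive. The sketch below works on plain dictionaries shaped like those rows (`partition`, `content`, `file_size_in_bytes`); the partition is simplified to a string here, and `partition_health` is a hypothetical helper.

```python
from collections import defaultdict


def partition_health(file_rows):
    """Summarize rows shaped like Iceberg's `files` metadata table.

    Each row needs `partition`, `content` (0 = data, 1 or 2 = delete file)
    and `file_size_in_bytes`.
    """
    stats = defaultdict(lambda: {"data_files": 0, "delete_files": 0, "data_bytes": 0})
    for row in file_rows:
        entry = stats[row["partition"]]
        if row["content"] == 0:
            entry["data_files"] += 1
            entry["data_bytes"] += row["file_size_in_bytes"]
        else:
            entry["delete_files"] += 1

    report = {}
    for partition, entry in stats.items():
        n = entry["data_files"]
        report[partition] = {
            "data_files": n,
            "avg_data_file_mb": round(entry["data_bytes"] / n / 2**20, 1) if n else 0.0,
            "delete_ratio": round(entry["delete_files"] / n, 2) if n else None,
        }
    return report


rows = [
    {"partition": "2026-05-25", "content": 0, "file_size_in_bytes": 8 * 2**20},
    {"partition": "2026-05-25", "content": 0, "file_size_in_bytes": 4 * 2**20},
    {"partition": "2026-05-25", "content": 1, "file_size_in_bytes": 1 * 2**20},
    {"partition": "2026-05-24", "content": 0, "file_size_in_bytes": 128 * 2**20},
]
report = partition_health(rows)
print(report)
```

Running a summary like this on a schedule and storing the results gives the per-partition trend data that CloudWatch does not provide.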

DPU cost for compaction

Glue compaction runs on Glue DPUs — the same Spark infrastructure used for ETL jobs. Compaction across hundreds of tables consumes significant DPU-hours, and the JVM-based Spark runtime carries overhead (garbage collection, executor provisioning, serialization) that makes file-rewriting operations slower and more expensive than they need to be.

How LakeOps complements AWS Glue

LakeOps is not a replacement for AWS Glue — Glue remains your catalog and ETL engine. LakeOps connects to the Glue Data Catalog as one of its catalog sources and adds the operational layer that Glue does not provide: autonomous maintenance with correct sequencing, cross-engine observability, query-aware optimization, and lake-wide governance. No data is moved or copied — LakeOps reads metadata and file-level statistics from S3.

Figure: Modern Lakehouse Architecture with LakeOps Control Plane. LakeOps sits alongside AWS Glue as a dedicated control plane, connecting to the Glue Data Catalog, ingesting telemetry from all query engines, and running autonomous maintenance across the full table estate.

Figure: LakeOps in action, connecting to an Iceberg catalog, analyzing table health, and running autonomous optimization.

Key capabilities for Glue-based lakehouses:

Catalog connectivity — connects to the Glue Data Catalog via standard IAM credentials. Also supports REST catalogs (Polaris, Nessie, Gravitino) and S3 Tables. Multi-catalog architectures get a unified view across all catalog sources — the single pane of glass that Glue cannot offer for non-Glue-cataloged tables.
Autonomous maintenance with correct sequencing — runs the full pipeline in order: snapshot expiration → orphan cleanup → compaction → manifest optimization. Each step's output feeds the next. Prioritized by health classification — Critical tables first, then Warning, then Healthy.
Rust execution engine — built on Apache DataFusion. Binpack compaction in 221s vs 1,612s for Spark on comparable datasets — making continuous maintenance across hundreds of Glue-cataloged tables economically viable. No JVM overhead, no GC pauses, no executor provisioning.
Cross-engine query-aware optimization — collects telemetry from Athena, Trino, Spark, Redshift, DuckDB, Snowflake, and Flink. Identifies which columns production queries filter, join, and group on per table, and applies sort orders accordingly. Layout adapts as access patterns shift.
Layout simulations — tests proposed sort changes on a real Iceberg branch, replaying production queries and comparing scan reduction before modifying any production data.
Lake-wide observability — every table classified as Critical, Warning, or Healthy. Severity-ranked Insights surface specific issues (excessive manifests, partition skew, small files, snapshot bloat) with root cause and remediation context. Every maintenance operation logged with before/after metrics for full auditability.
Policy governance — maintenance rules at catalog, namespace, or table scope replace per-table AWS CLI commands. Policies are versioned, auditable, and follow a specificity hierarchy.

Figure: The LakeOps Dashboard. 30-day optimization activity, cost savings, table health tiers (Critical/Warning/Healthy), and total data optimized, connecting maintenance to measurable outcomes for Glue-cataloged tables.

Figure: Connecting a catalog to LakeOps. Point it at your Glue Data Catalog, REST catalog, or S3 Tables bucket; discovery and health classification begin immediately.

Figure: Layout Simulations. Which columns Athena, Trino, and Spark actually filter on, candidate sort configurations tested on a real Iceberg branch, and projected scan reduction, before any production data is modified.

Figure: Production compaction benchmarks. LakeOps Rust engine vs Spark across file counts, data volumes, and strategies. Consistent 5–8× speed improvements translate directly to lower compute cost.

Figure: Table-level Events. Every maintenance step (Compact Data Files 970→87 files, Expire Snapshots, Rewrite Manifests) is logged with duration, impact, and status.

Figure: Measured cost impact. 75% CPU reduction and 55% storage reduction from autonomous compaction, snapshot management, and orphan cleanup across Glue-cataloged tables.

Practical optimization checklist

A summary of actions to optimize your Glue-Iceberg deployment, from table creation through production operations.

Table creation
- Use Iceberg format version 2 for row-level delete support
- Set `write.parquet.compression-codec` to `zstd`
- Set `write.target-file-size-bytes` to `134217728` (128 MB) as a baseline
- Enable `write.metadata.delete-after-commit.enabled` to prevent metadata file accumulation
- Use Iceberg partition transforms (`days()`, `months()`) instead of raw column partitioning
- Avoid high-cardinality partition columns that create thousands of small partitions

ETL job configuration
- Use `G.2X` or `G.4X` worker types, because Iceberg writes are memory-intensive
- Repartition DataFrames to match Iceberg partition columns before writing
- Use `coalesce()` to reduce file count when datasets are small relative to cluster size
- Enable job bookmarks to prevent reprocessing
- For streaming jobs, balance trigger interval against file count: longer intervals produce fewer, larger files

Glue optimizer setup
- Enable all three optimizers (compaction, snapshot retention, orphan cleanup) on every production table
- Set compaction strategy based on query patterns: binpack for consolidation, sort for single-column filters, z-order for multi-column filters
- Set snapshot retention to match your time-travel requirements: 3–5 days for most tables, 1 day for high-write streaming tables
- Set orphan file retention to 3–7 days to avoid deleting files from in-progress commits

Monitoring
- Set up CloudWatch alarms on Glue optimizer failures
- Periodically query the `$snapshots`, `$files`, and `$manifests` metadata tables in Athena to assess table health
- Track file count and average file size per partition over time
- Monitor S3 storage cost per table prefix to detect orphan file accumulation

Use a dedicated control plane for autonomous optimization
- Connect a control plane like [LakeOps](https://lakeops.dev) from the start; it complements Glue by adding sequenced maintenance, cross-engine awareness, and lake-wide observability that Glue's built-in optimizers cannot provide
- Prioritize sequenced maintenance (expire → orphans → compact → manifests) over independent optimizer runs; a control plane enforces this automatically
- Get query-aware sort optimization driven by actual Athena, Trino, and Spark access patterns, since static sort configuration goes stale as workloads evolve
- Centralize maintenance policy at catalog, namespace, or table scope instead of per-table CLI configurations
- Use the Rust-based compaction engine for cost-effective maintenance at scale: running 5–8× faster than Spark, it avoids JVM overhead across hundreds of daily runs

The Glue-Iceberg integration is a strong foundation. The built-in optimizers handle basic maintenance. But production lakehouses need more: coordinated sequencing, query-aware optimization, cross-catalog observability, and governance that scales with the estate. A dedicated control plane like LakeOps provides that layer — sitting alongside Glue, not replacing it — to keep every table healthy, fast, and cost-efficient as the lakehouse grows.
