Apache Iceberg articles

Deep dives on Apache Iceberg: table format internals, compaction, snapshots, metadata optimization, and production operations.

45 articles

Apache Iceberg | Table Maintenance | Compaction | Snapshot Expiration | Jun 7, 2026

Automating Apache Iceberg Table Maintenance

Apache Iceberg ships the maintenance primitives — compaction, snapshot expiration, orphan cleanup, and manifest rewriting — but none of them run themselves. This guide covers why each operation matters, the correct execution order, the limitations of scripts and cron jobs, and how to automate the full lifecycle with policies, observability, and a purpose-built control plane.

Chris P

21 min read

Kafka to Iceberg Compaction — Kafka events streaming into an Iceberg table, compacted through a gear process into optimized blocks.

Apache Iceberg | Compaction | Apache Kafka | Streaming | Jun 7, 2026

Kafka to Iceberg Compaction — Done Right

Streaming from Kafka into Apache Iceberg creates small files faster than any other write pattern. This guide covers why standard compaction approaches fail for streaming tables, how to measure compaction need, implement partition-aware compaction that avoids writer conflicts, tune rewriteDataFiles parameters, and run maintenance autonomously at scale.

Rob M

26 min read

Apache Iceberg | Apache Kafka | Kafka Connect | Streaming Ingestion | Jun 7, 2026

Kafka to Iceberg: Ingestion Guide

A practical guide to streaming data from Apache Kafka into Apache Iceberg tables — covering Kafka Connect, Apache Flink, Spark Structured Streaming, and CDC with Debezium. Includes configuration examples, schema management, partitioning strategies, production pitfalls, and how to keep streaming tables healthy at scale.

Rob M

28 min read

Apache Iceberg | Iceberg 1.11 | Deletion Vectors | Variant Type | Jun 5, 2026

Apache Iceberg 1.11.0 — What's New?

Apache Iceberg 1.11.0 lands V3 maturity with production-ready deletion vectors, a native Variant type for semi-structured data, server-side scan planning, built-in table encryption, and a pluggable File Format API that opens the door to next-generation storage formats.

Jonathan Saring

10 min read

Apache Iceberg | Lakehouse | LakeOps | Data Platforms | May 31, 2026

Iceberg Lakehouse with AI Agents: A Guide

AI agents are becoming primary consumers of Iceberg lakehouse data — querying tables iteratively, at high frequency, and without human review. This guide walks through the five components your infrastructure needs to support agentic workloads — MCP connectivity, guardrails, multi-engine routing, self-optimizing storage, and observability — and shows how LakeOps provides each one.

Jonathan Saring

24 min read

Apache Iceberg | Intelligent Lakehouse | LakeOps | Data Platforms | May 29, 2026

Intelligent Lakehouse: Build Like Netflix

Netflix spent years building an intelligent lakehouse — Polaris for catalog management, Autotune for compaction, janitors for cleanup, and Metacat for observability. LakeOps lets every team build the same — and go beyond — in minutes. Here is what an intelligent lakehouse actually requires, and how LakeOps provides each component.

Jonathan Saring

19 min read

AWS Glue Iceberg Optimization — an S3 bucket with scattered data objects funneled through an optimization lens into a geometric iceberg, with icons for Search, Analytics, and Tuning

Apache Iceberg | AWS Glue | Compaction | Table Maintenance | May 26, 2026

AWS Glue Iceberg Optimization: A Practical Guide

AWS Glue provides native Iceberg support for cataloging, ETL, and built-in table maintenance — but production lakehouses hit limitations fast. This guide covers Glue catalog configuration, ETL best practices, compaction tuning, common pitfalls, and how a dedicated control plane fills the operational gaps.

David W

20 min read

Databricks | Apache Iceberg | LakeOps | Delta Lake | May 26, 2026

Databricks to Iceberg Smooth Migration

A smooth Databricks-to-Iceberg migration opens up a multi-engine lakehouse, not a platform exit. Databricks stays central for ML and Spark; Iceberg adds Trino, Snowflake, and open catalogs. The guide covers five tools: LakeOps, UC managed Iceberg, Delta UniForm, Spark, and Lakehouse Federation.

David W

18 min read

Apache Iceberg | dbt | Incremental Models | Data Lakehouse | May 26, 2026

Apache Iceberg with dbt: Optimization Guide

dbt transforms your data — but who maintains the Iceberg tables underneath? A practical guide to dbt adapters, incremental strategies, table properties, and the maintenance gap that every dbt + Iceberg team hits in production.

Rob M

16 min read

Apache Iceberg with Flink Optimization — Flink squirrel mascot with streaming data flowing through an optimization ring into a geometric iceberg, with performance metric icons

Apache Iceberg | Apache Flink | Flink streaming | Iceberg compaction | May 26, 2026

Apache Iceberg with Flink: Streaming Optimization Guide

Flink streaming into Iceberg creates thousands of small files per hour. This guide covers checkpoint tuning, write distribution modes, Flink SQL patterns, and why external maintenance is essential for production streaming tables.

Chris P

15 min read

Apache Iceberg Delete Files — stacked data blocks with pink delete file markers funneled through compaction into clean, optimized data with a performance gauge showing improved read speed

Apache Iceberg | delete files | merge-on-read | position deletes | May 26, 2026

Apache Iceberg Delete Files: Reducing Merge-on-Read Overhead

Delete files let Iceberg avoid rewriting data on every UPDATE or DELETE — but every unresolved delete file forces readers to reconcile at query time. A deep guide to position deletes, equality deletes, measuring overhead, and resolving accumulation before it tanks performance.

David W

17 min read

Apache Iceberg | Partitioning | Hidden Partitioning | Partition Evolution | May 26, 2026

Apache Iceberg Table Partitioning Best Practices

Partitioning determines how much data every query must scan. Apache Iceberg's hidden partitioning and partition evolution change the game — but choosing the wrong strategy still creates performance cliffs. A practical guide to transforms, sizing, evolution, and avoiding the small-files trap.

Chris P

18 min read

Apache Iceberg Puffin Statistics — a puffin bird beside a statistics dashboard showing file counts, records, partitions, and data size, connected to a geometric iceberg

Apache Iceberg | Puffin statistics | NDV sketches | query optimization | May 26, 2026

Apache Iceberg Puffin Statistics: A Practical Guide

Puffin files store table-level statistics — NDV sketches and custom blobs — that query engines use for join ordering, split planning, and cost-based optimization. A practical guide to how they work, how to collect them, how they go stale, and how to keep them accurate at scale.

David W

18 min read

Fixing Small Files in Apache Iceberg — scattered small data cubes compacted into larger organized file blocks flowing toward a geometric iceberg

Apache Iceberg | Compaction | Small Files | Table Maintenance | May 26, 2026

Fixing Small Files in Apache Iceberg: A Practical Guide

Small files silently degrade every Apache Iceberg lakehouse — inflating S3 costs, slowing query planning, and bloating metadata. This guide covers root causes, measurement, manual and automated fixes, and how to eliminate the problem at scale.

Rob M

19 min read

Apache Iceberg Table Health and Maintenance — health score dashboard showing 92 Healthy with status indicators for Snapshots, Manifests, Delete Files, Orphan Files, and File Health beside a geometric iceberg

Apache Iceberg | Table Maintenance | Compaction | Snapshot Expiration | May 26, 2026

Apache Iceberg Table Health and Maintenance: A Complete Guide

Iceberg tables degrade silently in production — small files multiply, snapshots accumulate, orphans waste storage, and manifests fragment. A comprehensive guide to the five maintenance operations, why sequencing matters, the metrics that reveal problems early, and how to automate the full lifecycle.

David W

20 min read

Apache Iceberg | Trino | Iceberg optimization | predicate pushdown | May 26, 2026

Apache Iceberg with Trino: Performance Optimization Guide

A practical guide to optimizing Apache Iceberg queries and table maintenance with Trino — covering scan planning, predicate pushdown, file pruning, Trino-side tuning, maintenance procedures, physical layout optimization, and how a dedicated control plane eliminates JVM overhead while adding cross-engine intelligence.

Chris P

18 min read

Apache Iceberg on AWS S3 — architecture diagram showing Iceberg metadata layers, AWS services, and the data lakehouse stack

Apache Iceberg | AWS S3 | Data Lake | LakeOps | May 25, 2026

Apache Iceberg on AWS S3: A Guide

Apache Iceberg on AWS S3 is the standard architecture for open lakehouses. This guide covers how Iceberg's metadata hierarchy maps to S3 objects, the AWS services ecosystem (Glue, Athena, EMR, Redshift, S3 Tables), configuration best practices, performance optimization, table maintenance, and the operational components needed for production deployments.

Rob M

24 min read

Reducing AWS S3 cost with Apache Iceberg — diagram showing S3 storage and API cost vectors from Iceberg write patterns and the optimization strategies that address them

FinOps | AWS S3 | Apache Iceberg | LakeOps | May 25, 2026

Reducing AWS S3 Cost with Iceberg: A Guide

AWS S3 bills for Iceberg lakehouses are inflated by small files, orphan data, retained snapshots, metadata overhead, and scan amplification. This guide quantifies each cost vector with S3 pricing mechanics and walks through five strategies — compaction, expiration, layout optimization, storage tiering, and engine routing — to cut storage and query spend.

Rob M

20 min read

Snowflake to Iceberg migration — Snowflake tables flowing into an Apache Iceberg lakehouse, illustrating a hybrid multi-engine architecture where Snowflake remains a valued component

Snowflake | Apache Iceberg | LakeOps | Data Platforms | May 25, 2026

Snowflake to Iceberg Smooth Migration

A practical guide for senior data engineers expanding Snowflake into a multi-engine Iceberg lakehouse. Covers five production tools — LakeOps, managed Iceberg, Open Catalog sync, Spark, and AWS Glue — with migration patterns, operational trade-offs, and a phased rollout sequence.

David W

17 min read

Annual cloud bill infographic showing Iceberg lakehouse spend doubling year over year — FinOps and cost reduction framing for data platform teams in 2026

FinOps | Apache Iceberg | LakeOps | Cloud Cost | May 24, 2026

State of Iceberg FinOps and Cost Reduction in 2026

State of Iceberg FinOps in 2026: where lakehouse spend leaks, what to measure, how autonomous management and optimization are replacing manual maintenance — and a practical survey of tools from cloud optimizers to control planes.

David W

24 min read

Multiple Query Engines with Iceberg — Ferris the Rust crab routing queries to Trino, Snowflake, DataFusion, Databricks, Presto, ClickHouse, DuckDB, and Apache Spark over an Iceberg Lakehouse

Apache Iceberg | QueryFlux | query routing | Lakehouse | May 23, 2026

Routing Multiple Query Engines with Iceberg

How to route queries across Trino, Spark, DuckDB, Snowflake, Athena, and Flink on shared Iceberg tables — covering the architecture of a SQL routing proxy, dialect translation, routing strategies, table-aware optimization, and the tooling that makes it work.

Rob M

18 min read

Diagram showing seven Iceberg catalog options — Polaris, Nessie, Glue, Unity, Gravitino, Lakekeeper, and Hive — connected to a central Apache Iceberg symbol

Apache Iceberg | Iceberg catalog | Lakehouse | Data Lake | May 22, 2026

Best Catalog for Apache Iceberg? A Useful Comparison

A technical comparison of the seven major Apache Iceberg catalogs — Hive Metastore, AWS Glue, Apache Polaris, Project Nessie, Databricks Unity Catalog, Apache Gravitino, and Lakekeeper — across protocol support, access control, multi-engine interoperability, credential vending, and production readiness.

Chris P

21 min read

Apache Iceberg | Data Platforms | Data Lake | LakeOps | May 20, 2026

Iceberg Lake for Data Analytics: Optimization Guide

Eight optimization layers for data platform engineers running BI, ad-hoc SQL, and aggregation pipelines on Apache Iceberg — from partition design and file sizing through compaction, routing, and continuous maintenance.

Jonathan Saring

15 min read

LakeOps Data Lake Insights showing metadata health alerts across Iceberg tables — manifest fragmentation, snapshot accumulation, and partition skew

Apache Iceberg | Data Platforms | Data Lake | LakeOps | May 20, 2026

Iceberg Metadata Lifecycle: Maintenance and Optimization

A deep technical guide to managing the metadata layer that makes Apache Iceberg fast — snapshots, manifests, metadata.json files, and Puffin statistics — covering expiration, consolidation, orphan cleanup, and the sequencing that prevents production incidents.

Jonathan Saring

19 min read

Iceberg lakehouse cost reduction — cost waste flows through LakeOps autonomous operations to deliver 80% savings

Apache Iceberg | LakeOps | Cloud Cost | FinOps | May 19, 2026

7 Iceberg Lakehouse Cost Reduction Strategies

Iceberg lakehouses silently accumulate cost from small files, dead snapshots, orphan data, unoptimized layouts, and over-provisioned compute. Seven practical strategies — from deploying an autonomous control plane to leveraging partition evolution — that production data teams use to cut lakehouse spend by up to 80%.

Jonathan Saring

9 min read

Apache Iceberg | LakeOps | Query Performance | Data Platforms | May 19, 2026

Optimizing Iceberg Lakehouse Performance

Iceberg tables degrade silently — small files from streaming, unsorted data, fragmented manifests, accumulated delete files. Each one caps query speed regardless of engine. Six concrete optimization layers, how they interact, and how autonomous maintenance keeps every table at peak performance.

David W

11 min read

Data Platforms | Data Lake | Lakehouse | Apache Iceberg | May 17, 2026

Data Lake vs Lakehouse vs Warehouse: A Practical Guide

Data lakes, warehouses, and lakehouses are not interchangeable — each has hard limits the others cannot cover. A practical guide for platform leaders: where each architecture wins, where it fails, cost and governance trade-offs, and how to choose (or combine) them in 2026.

Chris P

22 min read

Iceberg Table Maintenance Solution Comparison — side-by-side feature matrix for LakeOps, AWS Glue, S3 Tables, Snowflake, BigLake, Cloudera, and Starburst

Apache Iceberg | Compaction | Lakehouse | Data Platforms | May 16, 2026

9 Iceberg Table Compaction Tools Compared for Production Lakehouses

Compaction keeps Apache Iceberg lakehouses fast and lean — but every tool approaches it differently. A side-by-side look at nine production options: LakeOps, AWS Glue, Amazon S3 Tables, Snowflake, Google BigLake, Cloudera, Starburst, Dremio, and Databricks.

Jonathan Saring

17 min read

LakeOps lakehouse control plane — connected to Iceberg catalogs on the left, query engines on the right, with observability, autonomous optimization, and cost management in the center

Apache Iceberg | LakeOps | Lakehouse | FinOps | May 14, 2026

Iceberg Lakehouse Optimization with LakeOps

A practical walkthrough of optimizing an Apache Iceberg lakehouse end to end — from connecting catalogs and diagnosing table health through autonomous compaction, lifecycle management, and multi-engine routing to measurable cost and performance outcomes.

Rob M

16 min read

Data Platforms | Data Swamp | Apache Iceberg | Lakehouse | May 13, 2026

From Data Swamp to Modern Iceberg Lakehouse

Every data lake starts with a promise of unlimited flexibility — and most end up as a swamp. Stale files, broken schemas, no observability, and engineers spending more time maintaining pipelines than analyzing data. Apache Iceberg fixed the reliability gap. A lakehouse control plane fixes everything else. A practical guide to the full transition — component by component.

Jonathan Saring

23 min read

Optimizing Iceberg Lake Compaction — scattered small data-block cubes funnel through a compaction machine onto a conveyor belt of optimized blocks, leading to a crystal-clear iceberg lakehouse

Apache Iceberg | Compaction | Lakehouse | LakeOps | May 13, 2026

Optimizing Iceberg Lake Compaction: A Guide

Compaction is the most impactful operation in an Apache Iceberg lakehouse — and the hardest to get right at scale. File merging is the easy part. Knowing when to trigger it, what sort strategy to apply per table, how to avoid conflicting with other maintenance, and how to do it without spinning up expensive JVM clusters — that is the real problem. A breakdown of what modern compaction actually requires.

Jonathan Saring

16 min read

Iceberg lakehouse optimization — multi-engine ecosystem (AWS, Databricks, Trino, DuckDB, Snowflake, Flink, and more) around a shared Iceberg lake, with observability and optimization above the waterline

Apache Iceberg | Lakehouse | LakeOps | lakehouse optimization | May 10, 2026

Iceberg Lakehouse Optimization — The Right Way

Apache Iceberg gives your lakehouse warehouse-grade reliability on object storage — but the format does not optimize itself. A practical guide to every operational pillar a production Iceberg lakehouse needs — from lake-wide observability and query-aware compaction to snapshot lifecycle, metadata health, and governance — and how LakeOps runs it all from a single control plane.

Jonathan Saring

21 min read

LakeOps table metrics showing records distribution, file size distribution, and table size growth over the last 30 days

Apache Iceberg | LakeOps | FinOps | Data Platforms | May 7, 2026

Autonomous Iceberg Table Maintenance for Data Lakes

Iceberg tables need continuous maintenance — compaction, snapshot expiration, manifest optimization, and orphan cleanup — but manual scripts break at scale. A deep look at what autonomous table maintenance means in practice: how telemetry-driven orchestration replaces reactive firefighting and keeps every table healthy without human intervention.

Rob M

16 min read

Modern lakehouse architecture: LakeOps control plane for autonomous management and optimization — observability, compaction, routing, AI guardrails, and governance above Iceberg on S3, with catalogs and multi-engine compute (Spark, Trino, Snowflake, Databricks, and more)

Data Platforms | Apache Iceberg | Snowflake | Databricks | May 7, 2026

From Databricks and Snowflake to an Open Data Platform

For a decade, Snowflake and Databricks defined enterprise data. Then the lakehouse emerged — open formats on open storage. What was missing was the operational layer to make it work at scale. An autonomous control plane turns a lakehouse into a managed open data platform — without the lock-in.

Jonathan Saring

18 min read

LakeOps measured results on real Iceberg workloads: 95% faster compaction, 12x query performance improvement, 80% cost reduction

Apache Iceberg | LakeOps | Cloud Cost | FinOps | May 5, 2026

Apache Iceberg Cost Optimization in 2026

Your Iceberg lake is overcharging you from four directions at once — storage bloat, query compute waste, compaction overhead, and engineering time. This post breaks down exactly where each dollar goes and how autonomous table management eliminates the waste without touching your pipelines.

David W

22 min read

LakeOps control plane for AI agents — MCP, guardrails, routing, storage optimization, observability, and workload policies above Iceberg tables on object storage

Apache Iceberg | LakeOps | QueryFlux | Data Platforms | May 5, 2026

Optimizing Apache Iceberg for Agentic AI: From Slow Tables to Sub-Second Agent Queries

AI agents issue SQL iteratively, repeat query templates at high frequency, and need sub-second responses from tables designed for batch workloads. This post covers what breaks when agents hit a production Iceberg lake — and the five infrastructure layers that fix it: MCP connectivity, guardrails, multi-engine routing, self-optimizing storage, and closed-loop feedback.

Chris P

18 min read

LakeOps dashboard showing optimization activity, key metrics, and recent operations across production Iceberg tables

Apache Iceberg | LakeOps | FinOps | Data Platforms | May 3, 2026

Managed Iceberg in 2026: Autonomous Data Lake

Iceberg tables degrade silently — small files pile up, snapshots bloat metadata, and query latency creeps higher. A breakdown of the nine components every production data lake needs to stay healthy — starting with observability and telemetry collection, through compaction, snapshot management, and lake-wide policies, to multi-engine routing and agentic AI enablement.

Jonathan Saring

23 min read

External

QueryFlux | Apache Iceberg | Data Platforms | Apr 11, 2026

Introducing QueryFlux: Open-Source Universal Multi-Engine Query Router and SQL Proxy

QueryFlux is a universal SQL proxy and multi-engine query router written in Rust: one access layer in front of Trino, DuckDB, StarRocks, and Athena, with routing, dialect translation, and observability.

Jonathan Saring

12 min read

External

Apache Iceberg | LakeOps | Data Platforms | Mar 18, 2026

Benchmarking LakeOps: A Production-Grade Compaction Engine for Apache Iceberg

How we compacted 4.5 TB across 10 real production tables, achieved up to 99.8% file reduction, and made Apache Spark OOM on a job we finished in 11 minutes.

Amit Gilad

9 min read

External

Apache Iceberg | LakeOps | Data Platforms | Jan 28, 2026

Building a Distributed Compaction Engine for Apache Iceberg with Rust + DataFusion

How we built a high-performance, distributed compaction engine for Apache Iceberg using Rust and DataFusion—architecture, design choices, and lessons learned.

Amit Gilad

9 min read

External

Apache Iceberg | Data Lake | LakeOps | Sep 23, 2025

From 350TB to 230TB in 10 Minutes: The Hidden Weight of Stale Data

See how a 350TB data lake shrank to 230TB in 10 minutes by removing stale data—saving 34% in AWS S3 costs and proving the need for a control plane.

Amit Gilad

5 min read

External

Apache Iceberg | Data Lake | LakeOps | Sep 11, 2025

Why Every Data Lake Needs a Control Plane: Lessons from Apache Iceberg

Apache Iceberg delivers speed, but without a control plane snapshots pile up, costs surge, and queries take longer. The case for a control plane starts with snapshot expiration.

Amit Gilad

8 min read

External

Apache Iceberg | Data Lake | Data Platforms | May 7, 2025

Cracking the Ice: The Battle Between Sort and Binpack in Apache Iceberg

Unlocking performance vs. optimizing storage — choosing the right compaction strategy for your data lake.

Amit Gilad

7 min read

External

Delta Lake | Apache Iceberg | Data Lake | Lakehouse | Jan 30, 2025

Delta Lake vs Apache Iceberg: Choosing the Right Table Format

A detailed comparison between Delta Lake and Apache Iceberg, exploring their architectures, performance characteristics, and ideal use cases to help you make the right choice.

Amit Gilad

10 min read

External

Apache Iceberg | Apache Spark | Data Platforms | Sep 17, 2024

Incremental Processing with Apache Iceberg & Spark: A Comprehensive Guide

Learn how to implement efficient incremental processing with Apache Iceberg and Spark, including best practices for data lake optimization and performance tuning.

Amit Gilad

9 min read