
Data Lake articles

Fundamentals of data lake architecture — storage layout, partitioning, format selection, and lifecycle management.

15 articles

AWS Glue Iceberg Optimization — an S3 bucket with scattered data objects funneled through an optimization lens into a geometric iceberg, with icons for Search, Analytics, and Tuning
Apache Iceberg, AWS Glue, Compaction, Table Maintenance

AWS Glue Iceberg Optimization: A Practical Guide

AWS Glue provides native Iceberg support for cataloging, ETL, and built-in table maintenance — but production lakehouses hit limitations fast. This guide covers Glue catalog configuration, ETL best practices, compaction tuning, common pitfalls, and how a dedicated control plane fills the operational gaps.

David W
20 min read
Apache Iceberg on AWS S3 — architecture diagram showing Iceberg metadata layers, AWS services, and the data lakehouse stack
Apache Iceberg, AWS S3, Data Lake, LakeOps

Apache Iceberg on AWS S3: A Guide

Apache Iceberg on AWS S3 is the standard architecture for open lakehouses. This guide covers how Iceberg's metadata hierarchy maps to S3 objects, the AWS services ecosystem (Glue, Athena, EMR, Redshift, S3 Tables), configuration best practices, performance optimization, table maintenance, and the operational components needed for production deployments.

Rob M
24 min read
Diagram showing seven Iceberg catalog options — Polaris, Nessie, Glue, Unity, Gravitino, Lakekeeper, and Hive — connected to a central Apache Iceberg symbol
Apache Iceberg, Iceberg catalog, Lakehouse, Data Lake

Best Catalog for Apache Iceberg? A Useful Comparison

A technical comparison of the seven major Apache Iceberg catalogs — Hive Metastore, AWS Glue, Apache Polaris, Project Nessie, Databricks Unity Catalog, Apache Gravitino, and Lakekeeper — across protocol support, access control, multi-engine interoperability, credential vending, and production readiness.

Chris P
21 min read
Iceberg Lake for Data Analytics: Optimization Guide — iceberg on water with analytics dashboard showing 9.4× query speed, 68% cost efficiency gain, and 82% less data scanned
Apache Iceberg, Data Platforms, Data Lake, LakeOps

Iceberg Lake for Data Analytics: Optimization Guide

Eight optimization layers for data platform engineers running BI, ad-hoc SQL, and aggregation pipelines on Apache Iceberg — from partition design and file sizing through compaction, routing, and continuous maintenance.

Jonathan Saring
15 min read
LakeOps Data Lake Insights showing metadata health alerts across Iceberg tables — manifest fragmentation, snapshot accumulation, and partition skew
Apache Iceberg, Data Platforms, Data Lake, LakeOps

Iceberg Metadata Lifecycle: Maintenance and Optimization

A deep technical guide to managing the metadata layer that makes Apache Iceberg fast — snapshots, manifests, metadata.json files, and Puffin statistics — covering expiration, consolidation, orphan cleanup, and the sequencing that prevents production incidents.

Jonathan Saring
19 min read
Optimizing Iceberg Lakehouse Performance — problems (small files, fragmented manifests, unsorted data, delete files) flow through autonomous maintenance into faster queries, lower costs, higher throughput, and healthier data
Apache Iceberg, LakeOps, Query Performance, Data Platforms

Optimizing Iceberg Lakehouse Performance

Iceberg tables degrade silently — small files from streaming, unsorted data, fragmented manifests, accumulated delete files. Each one caps query speed regardless of engine. Six concrete optimization layers, how they interact, and how autonomous maintenance keeps every table at peak performance.

David W
11 min read
Data Lake vs Lakehouse vs Warehouse: A Practical Guide — watercolor illustration comparing a natural data lake (raw flexible storage), a lakehouse (open storage with analytics on the water), and a data warehouse (structured BI building with charts in the windows)
Data Platforms, Data Lake, Lakehouse, Apache Iceberg

Data Lake vs Lakehouse vs Warehouse: A Practical Guide

Data lakes, warehouses, and lakehouses are not interchangeable — each has hard limits the others cannot cover. A practical guide for platform leaders: where each architecture wins, where it fails, cost and governance trade-offs, and how to choose (or combine) them in 2026.

Chris P
22 min read
Iceberg Table Maintenance Solution Comparison — side-by-side feature matrix for LakeOps, AWS Glue, S3 Tables, Snowflake, BigLake, Cloudera, and Starburst
Apache Iceberg, Compaction, Lakehouse, Data Platforms

9 Iceberg Table Compaction Tools Compared for Production Lakehouses

Compaction keeps Apache Iceberg lakehouses fast and lean — but every tool approaches it differently. A side-by-side look at nine production options: LakeOps, AWS Glue, Amazon S3 Tables, Snowflake, Google BigLake, Cloudera, Starburst, Dremio, and Databricks.

Jonathan Saring
17 min read
From data swamp to modern Iceberg lakehouse — illustrated journey from scattered files and broken schemas through Apache Iceberg to a managed lakehouse with a control plane
Data Platforms, Data Swamp, Apache Iceberg, Lakehouse

From Data Swamp to Modern Iceberg Lakehouse

Every data lake starts with a promise of unlimited flexibility — and most end up as a swamp. Stale files, broken schemas, no observability, and engineers spending more time maintaining pipelines than analyzing data. Apache Iceberg fixed the reliability gap. A lakehouse control plane fixes everything else. A practical guide to the full transition — component by component.

Jonathan Saring
23 min read
Optimizing Iceberg Lake Compaction — scattered small data-block cubes funnel through a compaction machine onto a conveyor belt of optimized blocks, leading to a crystal-clear iceberg lakehouse
Apache Iceberg, Compaction, Lakehouse, LakeOps

Optimizing Iceberg Lake Compaction: A Guide

Compaction is the most impactful operation in an Apache Iceberg lakehouse — and the hardest to get right at scale. File merging is the easy part. Knowing when to trigger it, what sort strategy to apply per table, how to avoid conflicting with other maintenance, and how to do it without spinning up expensive JVM clusters — that is the real problem. A breakdown of what modern compaction actually requires.

Jonathan Saring
16 min read
LakeOps control plane for AI agents — MCP, guardrails, routing, storage optimization, observability, and workload policies above Iceberg tables on object storage
Apache Iceberg, LakeOps, QueryFlux, Data Platforms

Optimizing Apache Iceberg for Agentic AI: From Slow Tables to Sub-Second Agent Queries

AI agents issue SQL iteratively, repeat query templates at high frequency, and need sub-second responses from tables designed for batch workloads. This post covers what breaks when agents hit a production Iceberg lake — and the five infrastructure layers that fix it: MCP connectivity, guardrails, multi-engine routing, self-optimizing storage, and closed-loop feedback.

Chris P
18 min read
External
Apache Iceberg, Data Lake, LakeOps

From 350TB to 230TB in 10 Minutes: The Hidden Weight of Stale Data

See how a 350TB data lake shrank to 230TB in 10 minutes by removing stale data—saving 34% in AWS S3 costs and proving the need for a control plane.

Amit Gilad
5 min read
External
Apache Iceberg, Data Lake, LakeOps

Why Every Data Lake Needs a Control Plane: Lessons from Apache Iceberg

Apache Iceberg delivers speed, but without a control plane snapshots pile up, costs surge, and queries take longer. The remedy starts with snapshot expiration.

Amit Gilad
8 min read
External
Apache Iceberg, Data Lake, Data Platforms

Cracking the Ice: The Battle Between Sort and Binpack in Apache Iceberg

Unlocking performance vs. optimizing storage — choosing the right compaction strategy for your data lake.

Amit Gilad
7 min read
External
Delta Lake, Apache Iceberg, Data Lake, Lakehouse

Delta Lake vs Apache Iceberg: Choosing the Right Table Format

A detailed comparison between Delta Lake and Apache Iceberg, exploring their architectures, performance characteristics, and ideal use cases to help you make the right choice.

Amit Gilad
10 min read