
Data lakes, data warehouses, and data lakehouses are not three flavors of the same thing. They are fundamentally different architectures that solve different problems, fail in different ways, and impose different operational costs. Treating them as interchangeable — or assuming one can fully replace another — leads to expensive migrations, production outages, and multi-year technical debt.
This guide is written for data platform engineers and engineering leaders who need to make architecture decisions that will hold for the next five years. It covers what each architecture actually is at the systems level, where each one is irreplaceable, where each one fails, and how to decide which combination your organization actually needs.
These architectures are not interchangeable
Before diving into the details, the most important thing to understand: these are not three points on a spectrum. Each architecture has hard limitations that the others do not share.
A data warehouse cannot do what a data lake does. You cannot store unstructured data (images, video, raw logs, model artifacts) in a warehouse. You cannot run PyTorch training directly against warehouse storage — the proprietary format is opaque to ML frameworks. You cannot route the same data to five different query engines without extracting it first. Over 80% of new enterprise data is unstructured; a warehouse literally cannot ingest it.
A data lake cannot do what a warehouse does. You cannot get ACID transactions on a raw lake — concurrent writes corrupt data silently. You cannot do time travel or rollback. You cannot enforce schemas — one bad pipeline writes malformed data, and every downstream consumer breaks. You cannot serve 200 concurrent dashboard users with predictable sub-second latency. A lake has no query optimizer, no result cache, no workload management.
A lakehouse cannot do what a warehouse does (without effort). A lakehouse does not auto-optimize itself. Without active maintenance, Iceberg tables degrade to lake-level performance within weeks. A warehouse gives you sub-second dashboards on day one with zero tuning. A lakehouse gives you the same performance only after you invest in compaction, sort-order optimization, manifest consolidation, and proper snapshot lifecycle management — or deploy tooling that handles it for you.
A lakehouse is not a lake. Adding Apache Iceberg to your S3 bucket does not automatically make it a lakehouse. You need a catalog, a compaction strategy, a snapshot retention policy, an orphan cleanup process, and a manifest optimization schedule. Without these, you have a lake with metadata overhead — strictly worse than a raw lake.
Understanding these hard boundaries is essential. The rest of this article explains each architecture in depth, with concrete scenarios where one succeeds and the others cannot.

Data warehouse: the vertically-integrated analytics engine
A data warehouse is a single-vendor system where storage format, query engine, metadata management, and access control are designed, optimized, and billed as one product.
Architecture internals. Snowflake stores data in proprietary micro-partitions on cloud object storage — but you cannot access those files directly. They are opaque to PyTorch, Spark, Trino, or anything that is not Snowflake's query engine. Redshift uses zone maps and sort keys on managed storage. BigQuery uses Capacitor columnar format with automatic clustering. In all cases, the data format is proprietary and readable only by the vendor's engine. Storage and compute may scale independently, but the data is locked.
Where warehouses are irreplaceable. Three scenarios where no other architecture matches a warehouse today:
(1) High-concurrency governed BI. A warehouse handles hundreds of concurrent dashboard queries with sub-second p95 latency, workload isolation, result caching, and automatic query queuing. Snowflake defaults to 8 concurrent queries per warehouse but scales through multi-cluster warehouses. BigQuery parallelizes across thousands of slots. For user-facing BI serving hundreds of analysts simultaneously, no lakehouse engine matches this out-of-the-box experience without significant architecture work.
(2) Zero-ops compliance. In regulated industries (banking, healthcare, government), a warehouse provides a single audit log, integrated column masking, row-level security, automatic data classification, and compliance certifications (SOC 2 Type II, HIPAA, FedRAMP) — all from one vendor, one contract, one audit scope. Assembling equivalent governance across a lakehouse with multiple engines requires months of engineering and continuous validation.
(3) Teams without data platform engineers. If your organization has analysts and analytics engineers but no infrastructure team, a warehouse is self-maintaining. There is no compaction, no manifest optimization, no orphan cleanup, no snapshot expiration. You pay more per byte, but you pay nothing in operational engineering time.
Where warehouses fail. These are hard limitations, not trade-offs you can engineer around:
(1) ML training is not viable. Training frameworks (PyTorch, TensorFlow, XGBoost, JAX) need direct file access — they read Parquet or Arrow from S3 into tensors. Warehouse proprietary formats are unreadable by these frameworks. You must export data, creating copies that double storage costs and introduce staleness. Sustained ML training workloads cost up to 9× more on Snowflake than on Databricks or a lakehouse with open formats.
(2) Unstructured data does not fit. Images, video, embeddings, raw JSON event streams, model checkpoints — a warehouse cannot store or query these. With 80%+ of enterprise data now being unstructured, this is not an edge case.
(3) Multi-engine access is impossible. Once data is in Snowflake's proprietary format, only Snowflake can read it. You cannot route a heavy batch job to Spark, a streaming join to Flink, or a local exploration to DuckDB. Every workload pays Snowflake's credit rate regardless of its actual compute requirement.
(4) Cost at scale breaks. At 100+ TB of data with heavy query volumes, warehouse bills become unpredictable and unsustainable. A mid-size organization running 100K queries/month against a 1 TB warehouse can pay $400K+ annually. At petabyte scale, the gap becomes an order of magnitude.
Data lake: the cheapest possible storage with the hardest possible operations
A data lake is an architecture pattern — not a product. You store files on object storage and query them with whatever engine you choose. There is no schema enforcement, no transaction layer, no built-in optimization, and no coordination between readers and writers.
Architecture internals. S3 or equivalent object storage holds Parquet, JSON, CSV, Avro, ORC, images, and model artifacts. A Hive Metastore or AWS Glue Data Catalog provides schema registration (but not enforcement). Partitioning is directory-based (year=2026/month=05/day=17/). There is no coordination between writers: two Spark jobs writing to the same partition simultaneously produce undefined results.
Where lakes are irreplaceable. Two scenarios where neither a warehouse nor a lakehouse is appropriate:
(1) Unprocessed staging and archival. Raw event streams, IoT sensor data, log aggregation, and compliance archives that must be stored exactly as received for future, undetermined use. No schema should be imposed because the data's future use is unknown. Cost per byte must be minimized — $0.023/GB/month on S3 Standard, $0.004/GB/month on Glacier. No query optimization is needed because the data is read rarely and in bulk.
(2) ML training data at petabyte scale. Training PyTorch or TensorFlow models against hundreds of TB of images, embeddings, or Parquet files. The frameworks need direct S3 file access, not SQL interfaces. The data does not need ACID semantics — it is append-only and immutable once written. The volume makes any per-query pricing model prohibitively expensive.
Where lakes fail catastrophically. These are not minor inconveniences — they are production incidents:
(1) Concurrent writes corrupt data. Two ingestion jobs writing to the same Hive partition simultaneously can produce partial files, duplicate records, or silently overwrite each other's output. There is no conflict detection, no retry logic, no rollback. The corruption is silent — you discover it days later when downstream reports show impossible numbers.
(2) No atomicity means broken reads. A reader scanning a partition while a writer is mid-append sees partial data. There is no snapshot isolation. Long-running analytical queries return inconsistent results because the underlying files changed during execution.
(3) Schema drift destroys pipelines. A source system adds a column, changes a type, or reorders fields. The lake accepts it silently. Downstream Spark jobs fail with cryptic deserialization errors — or worse, succeed with wrong data. There is no schema enforcement layer to reject the bad write.
(4) The small-file problem is not optional. A streaming pipeline committing every 60 seconds to 10 partitions creates 14,400 files per day — over 100,000 per week on a single table. At 50 partitions with 30-second commits: one million files weekly. Each file requires a separate S3 GET request. Hive-style directory listing throttles at scale (S3 LIST is limited to 5,500 requests/second per prefix). Queries that should take seconds take minutes because file-listing overhead dominates scan time. The data swamp problem is not hypothetical — it is the default outcome without active file management.
(5) No time travel, no rollback. A pipeline bug overwrites or deletes partition data. It is gone. There is no snapshot to roll back to, no audit of what changed, no way to recover without restoring from backup (if one exists). In a warehouse or lakehouse, you would point to the previous snapshot and restore in seconds.
Data lakehouse: warehouse guarantees on lake economics — with operational requirements
A data lakehouse adds a metadata and transaction layer on top of object storage. The enabling technology is an open table format — Apache Iceberg, Delta Lake, or Apache Hudi — that turns raw files into proper database tables with ACID semantics, schema enforcement, time travel, and multi-engine access.
How Iceberg works under the hood. Apache Iceberg organizes table state into a four-level metadata hierarchy. A catalog (REST catalog, AWS Glue, Nessie, or Polaris) stores a pointer to the current metadata.json file. That file contains the full table schema with unique column IDs, partition spec, sort order, and a list of snapshots. Each snapshot points to a manifest list (Avro) that catalogs all manifest files. Each manifest file contains explicit paths to individual Parquet data files with per-column min/max statistics. This structure enables O(1) file discovery without directory listing — avoiding the LIST throttling that cripples Hive-partitioned lakes.
ACID transactions. Every write creates a new snapshot atomically. The catalog pointer updates to the new metadata.json only after all data files are successfully written. Failed writes leave the previous snapshot valid. Concurrent writers use optimistic concurrency control (OCC): the first writer to update the catalog pointer wins; others retry against the new state. This eliminates the concurrent-write corruption problem of raw lakes.
Schema evolution without data rewrite. Iceberg tracks columns by unique integer IDs, not names. You can add, drop, rename, reorder, or widen column types without rewriting data files. Old files retain their original schema; engines reconcile at read time using column IDs. This is fundamentally different from Hive, where schema changes require rewriting the entire table.
Partition evolution. Partition strategies change as metadata operations — no data rewrite. If you start with daily partitioning and later switch to hourly, new writes use the new spec while old data remains untouched. The query engine handles both layouts transparently through split planning. Hidden partitioning prunes automatically from filter predicates, eliminating the class of bugs where analysts forget partition filters.
Multi-engine access. The same Iceberg table can be queried concurrently by Spark (batch), Trino (interactive), Flink (streaming), DuckDB (local), Snowflake (governed BI), and Athena (serverless). Each engine reads the same metadata tree and sees a consistent snapshot. No data duplication, no ETL between systems.

Where lakehouses are irreplaceable. Scenarios where neither a warehouse nor a lake works:
(1) BI and ML on the same data without duplication. A warehouse cannot serve ML training (proprietary format). A lake cannot serve governed BI (no ACID, no governance). A lakehouse serves both — Trino for dashboards, PyArrow for training — from the same physical files with consistent semantics.
(2) Multi-engine workloads. Your batch ETL needs Spark, your interactive queries need Trino, your streaming needs Flink, your local exploration needs DuckDB, and your governed BI needs Snowflake. A warehouse locks you to one engine. A lake offers multiple engines but without consistency guarantees. A lakehouse provides multi-engine access with full ACID isolation between them.
(3) Data volumes above 100 TB. At this scale, warehouse compute pricing becomes untenable. Storage at $30–50/TB/year vs. $500–2,000/TB/year (including compute) creates a six-to-seven-figure annual cost difference. The lakehouse achieves comparable query performance at a fraction of the cost — provided tables are properly maintained.
(4) Streaming + analytics on the same tables. Flink appends to Iceberg tables every 1–5 minutes. Trino queries those same tables for dashboards. Snapshot isolation ensures readers see consistent state. A warehouse needs premium streaming features for similar freshness. A lake cannot provide consistent reads during concurrent writes.
Where lakehouses fail without active operation. This is the critical distinction from a warehouse — and the reason they are not interchangeable for every workload:
(1) Performance degrades without maintenance. A table receiving streaming writes accumulates small files. Without compaction running every 15–30 minutes on active partitions, query planning time grows from milliseconds to minutes. Real-world example: an Ancestry production table reached 235,000+ delete files and 23,000+ tiny data files averaging 0.01 MB each — queries that should take seconds took minutes. A Dell Federal deployment reached 45 million files with metadata consuming 5 TB — larger than the data itself.
(2) Metadata bloat is a silent killer. After 30 days of streaming writes at 60-second intervals, a table references 43,200 snapshots. Each query must traverse the manifest tree from the latest snapshot. Without snapshot expiration, metadata grows linearly with every write, adding latency to every single query. A warehouse handles this automatically; a lakehouse requires explicit snapshot retention policies.
(3) There is no built-in workload management. A warehouse queues queries, isolates workloads, and auto-scales compute. A lakehouse using Trino requires you to configure memory limits, worker pools, query routing, and failure handling. For 200 concurrent BI users expecting sub-second latency, a bare lakehouse engine requires significant tuning that a warehouse provides by default.
(4) Orphan files are permanent waste. Every failed Spark job, every killed compaction, every aborted write leaves unreferenced files on S3. They are never automatically cleaned. Without orphan detection and removal — carefully coordinated to avoid deleting files referenced by in-progress writes — storage costs grow indefinitely. A warehouse has no concept of orphan files because it fully controls its own storage lifecycle.
Vendor convergence in 2026: the lines are blurring (but not disappearing)
Every major vendor is converging on lakehouse capabilities — but the hard differences remain.
Snowflake now supports Iceberg Tables on customer-owned S3 storage with full read/write. As of March 2026, Snowflake achieved GA for CTAS against Databricks Unity Catalog via catalog-linked databases. You can query Iceberg tables in Snowflake while Spark and Trino access the same files. But Snowflake's native tables still use a proprietary format. And compute pricing still follows the credit model — you are paying warehouse rates on lakehouse-stored data.
Databricks ships Unity Catalog with REST Catalog API support (public preview March 2026, Runtime 16.4+), exposing Delta and Iceberg tables through a standard interface. UniForm materializes Delta tables as Iceberg-readable metadata. But Unity Catalog's full governance still favors the Databricks ecosystem — column masking, row filters, and lineage work best within Databricks runtimes.
AWS offers Athena (queries Iceberg on S3), Glue Catalog (metadata), EMR (Spark/Trino), and S3 Tables (managed Iceberg with compaction). The modularity offers flexibility at the cost of integration — you assemble your own lakehouse from services that were not originally designed to work together.
What convergence does NOT eliminate: Even with Snowflake reading Iceberg, you still cannot run PyTorch against Snowflake's compute layer. Even with Databricks offering Iceberg, a raw data lake on S3 still has no transactions. Even with AWS S3 Tables providing compaction, you still need snapshot expiration, manifest optimization, sort-order tuning, and query routing. The architectural differences are real; vendor convergence narrows the gaps but does not erase them.
Cost comparison: what each architecture actually costs in production
Storage. S3 Standard: $23/TB/month. Lakehouse (100 TB): ~$27,600/year. Snowflake (100 TB, on-demand): $48,000/year storage alone. The storage gap is meaningful at scale but not transformative in isolation.
Compute is where costs diverge. Snowflake: $2–4/credit, medium warehouse = 4 credits/hour. Three concurrent warehouses running 8 hours/day = $35K–70K/year. BigQuery on-demand: $6.25/TB scanned — predictable but expensive for repeated queries. Redshift Serverless: $0.375/RPU-hour.
Lakehouse engines: Athena ($5/TB scanned — cheap for ad-hoc), Trino on EC2 (~$0.50/hour small cluster), DuckDB (free), Spark on EMR ($0.10–0.30/core-hour). The lakehouse advantage: route each workload to the cheapest engine that meets its latency requirement, instead of paying one vendor's rate for everything.
The hidden maintenance cost in lakehouses. Object storage does not optimize itself. Small files accumulate, snapshots pile up, orphan files grow, manifests fragment. Without active maintenance — compaction, snapshot expiration, orphan cleanup, manifest optimization — a lakehouse degrades to lake-level performance while still carrying table-format overhead. This is where an autonomous control plane like LakeOps closes the gap: it handles continuous maintenance so lakehouse economics actually deliver in practice.

The hidden cost in warehouses: lock-in. Migrating 100 TB out of Snowflake requires extracting data through their egress pipeline (Snowflake charges $0.02/GB for cross-region transfer), converting from proprietary format, rebuilding schemas, and re-validating every downstream dependency. For organizations that have centralized petabytes in a warehouse, the switching cost is often multi-quarter and multi-million-dollar — an invisible surcharge on every future infrastructure decision.
Governance: integrated vs. modular
Warehouse governance is integrated and auditable from day one. Snowflake provides column masking, row-level security, RBAC at every level, automatic audit logging, data classification, and compliance certifications — all in one system, one contract, one audit scope. For a bank passing a SOX audit, this matters enormously.
Lake governance is assembled and fragile. IAM policies control bucket access. Apache Ranger or Lake Formation adds table-level permissions. Enforcement depends on which engine the query runs through — a Spark job might bypass the governance layer that Trino enforces. The result: governance is possible but inconsistent and labor-intensive.
Lakehouse governance is modular and portable. REST catalogs (Polaris, Gravitino, Nessie) support namespace-level access control and audit. Snowflake applies its RBAC on Iceberg tables. Trino enforces its access control. The governance follows the data across engines — but you must configure and test it for each engine independently. This is more flexible than warehouse governance but requires more engineering to implement correctly.
The practical difference: If you need to pass a compliance audit in 30 days with a two-person team, use a warehouse. If you need consistent governance across Spark, Trino, Flink, and Snowflake querying the same data, invest in lakehouse catalog-level governance. These are not interchangeable scenarios.
Performance: automatic vs. earned
Warehouse performance is automatic. Snowflake's micro-partition pruning, automatic clustering, result caching, and materialized views deliver sub-second dashboard queries with zero tuning. You load data, write SQL, and it is fast. The optimization happens internally — you do not manage it.
Lakehouse performance is earned through maintenance. A well-maintained Iceberg table — properly sized files (128–512 MB), sorted on high-cardinality filter columns, with current Puffin statistics and consolidated manifests — delivers query performance comparable to a warehouse. But this state requires active work: compaction every 15–30 minutes on active tables, sort-order optimization aligned to actual query patterns, manifest merging to keep planning time bounded, and snapshot expiration to prevent metadata bloat.
A poorly-maintained Iceberg table is worse than a warehouse AND worse than a raw lake. You get the metadata overhead of the table format (manifest traversal, snapshot resolution) without the performance benefit (because files are fragmented and unsorted). In practice: 10–50× slower than a properly maintained lakehouse or a warehouse on the same data.
Lake performance is unpredictable. Without statistics, without compaction, without manifest optimization, query engines must do full partition scans with directory listing. Performance depends entirely on how data happened to land — file sizes, partitioning schemes, formats chosen per pipeline. It works for bulk reads but fails for interactive analytics.
The practical implication: For a dashboard that 200 analysts hit every morning, a warehouse delivers sub-second on day one. A lakehouse delivers the same after proper setup and continuous maintenance. A lake cannot serve this workload at all. These are not equivalent options for this use case.
Streaming and real-time: where freshness requirements dictate architecture
Warehouses support streaming through micro-batch: Snowpipe (Snowflake), streaming inserts (BigQuery), zero-ETL integrations. Typical end-to-end latency: 1–5 minutes. Sub-minute latency requires premium features. The warehouse self-manages the resulting data — no compaction needed from you.
Lakes support streaming natively — Flink or Spark Streaming writing directly to S3 achieves second-level latency. But every commit creates small files, and without transactions, readers see partial writes. The streaming-vs-read-performance trade-off is fundamental: you get fresh data that is slow to query, or batched data that queries well. You cannot have both on a raw lake.
Lakehouses resolve this trade-off. Flink commits Iceberg snapshots every 1–5 minutes with full ACID isolation. Readers always see consistent state. The resulting small files are merged by continuous compaction running between commits. The outcome: minute-level freshness with analytical query performance. This is the only architecture that cleanly solves both freshness and performance without duplicating data into separate systems.
ML and AI: where data format determines viability
Warehouses cannot serve ML training. Period. PyTorch, TensorFlow, JAX, and XGBoost read Parquet, Arrow, or raw files from S3/GCS. They cannot read Snowflake micro-partitions or BigQuery Capacitor format. You must export data — creating copies, adding hours of latency, and doubling storage costs. For iterative model development (where training runs hourly against fresh data), this export-train-deploy cycle is prohibitively slow and expensive.
Lakes serve ML training perfectly — frameworks read Parquet directly from S3. But training data quality matters: schema drift corrupts feature pipelines, missing records produce biased models, and without time travel you cannot reproduce historical training runs for audit or debugging.
Lakehouses provide ML-ready data with guarantees. Training pipelines read Iceberg tables as Parquet through Spark or PyArrow. Schema enforcement catches drift before it reaches training. Time travel reproduces any historical table state for experiment reproducibility. The same table serves both the BI dashboard (Trino) and the training job (PyArrow) without duplication — a capability neither a warehouse nor a raw lake provides.
Decision framework: matching architecture to workload
Start with your primary outcome — not the vendor pitch. The flowchart below maps the most common decision paths; the criteria after it add the nuance that a single diagram cannot capture.

Choose a warehouse when: (1) Workloads are exclusively structured SQL analytics and BI reporting. (2) Integrated compliance and governance must work from day one with a two-person team. (3) Query concurrency is high — hundreds of simultaneous users with sub-second SLAs. (4) Operational simplicity matters more than cost. (5) Data volumes are under 50 TB where compute pricing is sustainable. (6) You have no ML training workloads hitting this data.
Choose a raw lake when: (1) Data is archival — stored for compliance or future use, rarely queried interactively. (2) ML training needs direct file access at petabyte scale. (3) Sources are heterogeneous (logs, images, video, raw JSON) and cannot conform to a schema today. (4) Cost per byte is the only constraint and no governance is needed. (5) No concurrent writes hit the same partitions.
Choose a lakehouse when: (1) BI and ML workloads coexist on the same data. (2) Multiple engines must access the same tables — Spark, Trino, Flink, DuckDB, Snowflake. (3) Data volumes exceed 100 TB. (4) Streaming and batch target the same tables. (5) You need ACID semantics and time travel without vendor lock-in. (6) You are building a platform that must evolve independently of any single vendor. (7) Your team can operate table maintenance — or you deploy automation that handles it.
The hybrid pattern (most common in practice). Many organizations run a warehouse for governed BI (Snowflake reading Iceberg tables on S3) alongside a lakehouse for ML, batch processing, and streaming. This is not a compromise — it is the intended architecture. The warehouse provides concurrency and governance for the BI layer; the lakehouse provides economics and flexibility for everything else. Data lives once in open format on object storage; engines are chosen per workload.
The operational reality: why architecture choice is not enough
The most common lakehouse failure mode is treating it like a warehouse — expecting it to be self-maintaining. Production incidents from this assumption include: query planning time growing from 200ms to 45 seconds over four weeks because nobody ran compaction; storage costs doubling in 90 days from orphan files and expired-but-undeleted snapshots; a Trino cluster running out of memory during query planning because manifest files referenced 45 million data files.
Warehouses hide this complexity and charge for it. Lakehouses expose it — compaction schedules, snapshot retention policies, orphan detection, manifest consolidation, sort-order tuning, and multi-engine query routing all become your team's responsibility.
This operational gap is real but solvable. The Iceberg ecosystem now includes autonomous table maintenance systems that continuously monitor table health, trigger compaction based on actual file state, expire snapshots safely, clean orphan files, optimize manifests, and route queries to the cheapest viable engine. The gap is not a reason to avoid the lakehouse — it is a reason to choose the right operations layer alongside it.
Where the industry is heading
The data lakehouse market is projected to grow from $8.9 billion in 2023 to over $66 billion by 2033. Apache Iceberg has become the interoperability standard — the format that Snowflake, Databricks, AWS, Google, and the open-source ecosystem all agree on. Every major vendor ships Iceberg support. The format question is settled.
The warehouse is not disappearing — it is specializing as the governed, high-concurrency BI execution layer. But increasingly, warehouses are becoming query engines that read from a lakehouse rather than self-contained storage systems. Snowflake querying Iceberg tables on your S3 is architecturally a lakehouse with a warehouse engine as one consumer.
For engineering leaders, the question is no longer 'lake vs. warehouse vs. lakehouse.' The answer for most organizations is: lakehouse as the storage and format layer, warehouse as one of several query engines, and raw lake for archival and ML staging. The real question is operational: how do you keep lakehouse tables healthy enough to deliver on their economic promise? That is the gap LakeOps fills — an autonomous control plane that keeps Iceberg tables optimized without manual intervention, making the lakehouse perform like a warehouse at a fraction of the cost.



