Snowflake to Iceberg Smooth Migration

Snowflake to Iceberg migration — Snowflake tables flowing into an Apache Iceberg lakehouse, illustrating a hybrid multi-engine architecture where Snowflake remains a valued component

Snowflake teams moving to Apache Iceberg are not abandoning Snowflake. They are expanding into a multi-engine architecture where Snowflake keeps doing what it does best — governed SQL analytics, semantic layers, BI connectivity — while Iceberg decouples storage from compute and lets Trino, Spark, DuckDB, and Athena access the same tables at their own price points. Snowflake itself has leaned into this direction: managed Iceberg tables, external volumes, Open Catalog sync, and Horizon REST catalog access are all production-ready features in 2026.

The migration itself is rarely the hard part. A CREATE ICEBERG TABLE … AS SELECT finishes in hours. The hard part starts afterward: compaction stops being automatic, snapshot retention requires explicit policy, manifest files drift, query performance degrades silently as engines write in conflicting patterns, and nobody owns the operational loop Snowflake used to hide behind automatic storage optimization. Data engineers spend the first month migrating tables and the next two years fighting table maintenance.

This guide covers five production tools that address different layers of the Snowflake-to-Iceberg stack. LakeOps leads the list — not as a migration utility, but as the operations layer that makes hybrid Snowflake + Iceberg estates manageable at scale. The remaining four handle table conversion, catalog integration, bulk data movement, and in-place metadata registration.

Three migration shapes

Before choosing tools, identify which architecture your team is migrating toward. Most Snowflake-to-Iceberg projects land in one of three shapes:

Snowflake-managed Iceberg. Snowflake stays the catalog (CATALOG = 'SNOWFLAKE'), files land on your external volume (S3, GCS, Azure), and Snowflake handles compaction through warehouse credits. Other engines access tables via Horizon REST or Open Catalog sync. This is the lowest-friction path when Snowflake remains the primary write engine and your goal is open storage, not open compute.

External catalog with Snowflake as one reader. A Glue, Polaris, Nessie, or Unity catalog owns metadata. Snowflake reads through catalog-linked databases. Spark and Trino join for pipeline, ad hoc, or cost-sensitive workloads. Snowflake keeps governed analytics and BI. This shape suits teams that want catalog authority outside Snowflake while preserving Snowflake for what it does best.

Full multi-engine lakehouse. Iceberg on object storage with an open catalog as system of record. Snowflake is one of several engines — retained for BI, governed SQL, and workloads where warehouse ergonomics justify the credit cost. Trino, Spark, and DuckDB handle ETL, exploration, and high-volume scans. Routing policies pick the right engine per workload, not a forced single-platform cutover.

The five tools below map to different layers of that stack. Most production programs combine at least two — typically a format/conversion tool plus a control plane that runs maintenance, observability, and routing across whatever shape the architecture lands in.

1. LakeOps — the operations layer for Snowflake + Iceberg

LakeOps is an autonomous lakehouse control plane built in Rust on Apache DataFusion. It connects to your existing Iceberg catalogs — Glue, REST/Polaris, Nessie, Gravitino, Lakekeeper, S3 Tables — and registers Snowflake alongside Trino, Spark, DuckDB, and every other engine on the same tables. Your data never moves. LakeOps reads catalog metadata and cross-engine query telemetry, then runs the operational loop that Snowflake used to abstract away: compaction, snapshot management, health monitoring, engine routing, and policy enforcement — extended across the entire hybrid estate.

This is the tool that addresses the gap every team discovers after migrating their first twenty tables: open Iceberg separates the responsibilities that Snowflake bundled into one product. Compaction becomes your problem. Snapshot retention becomes your problem. Manifest tuning, file sizing, cross-engine query visibility, and chargeback all become your problem. LakeOps reassembles those operational signals into a single control plane — deployed before the hybrid middle phase turns unmaintained tables into silent performance liabilities.

LakeOps Dashboard — lake-wide health during hybrid migration — The LakeOps Dashboard during a Snowflake-to-Iceberg rollout — optimization activity, cost savings, CPU and storage reduction, and Critical / Warning / Healthy tiers across every catalog.

LakeOps product demo — catalog connection, table health analysis, and autonomous optimization in action.

LakeOps delivers six capabilities that together replace the operational discipline Snowflake users lose when tables move to open Iceberg:

Lakehouse observability

LakeOps provides continuous telemetry across every table, engine, and maintenance job — not a disconnected metrics pane per engine. The Tables view classifies every namespace by health tier (Critical, Warning, Healthy) with size, record count, and last-modified timestamp inline. Cross-engine telemetry means a Snowflake query scanning a badly compacted Iceberg table surfaces the same alert as a Trino scan on the same data.

LakeOps Tables view — lake-wide inventory with health status per table — The Tables view across a hybrid Snowflake + open-engine catalog — every Iceberg table classified Healthy, Warning, or Critical with size, record count, and last-modified timestamp inline. The chargeback-grade inventory most lakes lack natively after migration off Snowflake-native tables.

Health tiers tell you which tables are degraded; the Insights engine tells you why and what to do about it — surfacing alerts for manifest bloat, snapshot accumulation, small-file proliferation, and partition skew before Trino scans time out or Snowflake external-table reads regress. See Iceberg lakehouse observability for the full model.

LakeOps Insights — proactive table health alerts across the lake — Lake-wide Insights during migration: CRITICAL alerts for partition file overload (raw_clickstream), HIGH for excessive manifests and snapshots (search_query_logs), WARNING for partition skew and small files — surfaced before Trino timeouts or Snowflake external-table regressions hit production.

Autonomous maintenance

Maintenance runs as a sequenced pipeline — not independent cron jobs that collide. The sequence matters: expire snapshots first to release metadata references, then clean orphan files, then compact data files, then rewrite manifests, then refresh statistics. Each step depends on the previous one completing cleanly. LakeOps triggers this pipeline on event-driven thresholds — a file count crossing a compaction boundary, a snapshot count exceeding retention policy — rather than fixed schedules that fire regardless of need. Every operation is logged with duration, before/after file counts, and status for audit. This is the managed Iceberg discipline that replaces Snowflake's automatic storage optimization for open tables.

LakeOps table events — sequenced maintenance operations — Table-level Events showing sequenced operations — Compact Data Files, Expire Snapshots, Rewrite Manifests — each logged with duration and file counts, making migration cutovers verifiable.

Intelligent compaction

Compaction runs on a purpose-built Rust engine, not scheduled Spark clusters. In production benchmarks on 200 GB tables, LakeOps binpack completes in 221 seconds versus 1,612 seconds for Spark — 95% faster with lower compute cost per terabyte. But speed is only half the value. LakeOps performs query-aware sort: it analyzes cross-engine telemetry — including Snowflake query patterns — to determine which columns queries actually filter on, then reorders data files around those columns so predicate pushdown and min/max pruning are maximally effective. Layout Simulations let you replay production SQL against candidate sort strategies before committing to a data rewrite, so you validate the performance gain before paying the I/O cost.

Compaction benchmarks — LakeOps Rust engine vs Spark — Production benchmarks — LakeOps binpack at 221s vs Spark at 1,612s on 200 GB tables. Faster compaction means layout optimization keeps pace with concurrent writes from Snowflake and open engines during migration.

Multi-engine routing

Register Snowflake alongside Trino, Athena, DuckDB, and Spark in one engine directory. Then map workload groups — Analytics, BI, ETL, Exploration — to stable routing endpoints with cost, latency, and throughput policies. Ad hoc exploration routes to DuckDB where unit economics win. Governed BI stays on Snowflake. Heavy ETL goes to Spark. Each workload group gets a stable endpoint URL, so gradual migration does not require clients to reconfigure connection strings on every cutover. See multi-engine query routing for the full routing model.

Routing groups — workloads mapped to engine pools — Routing groups turn engine registrations into stable endpoints — Analytics to Trino/DuckDB, BI to Snowflake, ETL to Spark/Athena — each workload mapped to the engine where it fits best during gradual migration.

Governance and policies

Lake-wide policies for compaction thresholds, retention windows, and orphan cleanup rules are defined once and scoped through a hierarchy: table → namespace → catalog. A compaction policy set at the catalog level applies to every table beneath it; namespace or table overrides narrow the scope without duplicating configuration. Policies are versioned and auditable — every change is logged with who changed what and when. A single toggle enables or disables a policy across its scope. This replaces the drift that accumulates when maintenance rules live in scattered Airflow DAGs, Spark notebooks, and tribal knowledge that nobody updates after the engineer who wrote them leaves.

LakeOps Policies dashboard — lake-wide maintenance policies — The Policies dashboard: compaction for critical tables, orphan removal every 7 days, snapshot expiration for high-write namespaces, and configuration policies — each with a status toggle, type, next run, last run, and audit timestamp. One place replaces the drift of scattered Airflow DAGs and per-team Spark scripts.

Agentic AI readiness

AI agents issue SQL iteratively and need sub-second reads from tables originally sized for batch traffic. LakeOps exposes a native MCP (Model Context Protocol) interface with layered guardrails: read-only enforcement, row limits, PII masking, and cost caps. Agent queries route through the same optimized layout and engine policies as human workloads — no separate data copy, no uncontrolled table scans. Compaction driven by production telemetry includes agent query patterns, so the lake self-optimizes as AI adoption scales. The result is a lakehouse that serves both Snowflake BI dashboards and autonomous agents from the same governed, well-maintained Iceberg tables.

Guardrails for AI agents accessing the lakehouse — Agent types connect through LakeOps guardrails — read-only access, row limits, PII mask, cost caps, human approval — before queries reach Iceberg tables shared across Snowflake and open engines.

Strengths: Hybrid Snowflake + multi-engine visibility in one control plane. Autonomous maintenance replaces Spark cron jobs. Rust-based compaction at a fraction of Spark compute cost. Query-aware optimization uses Snowflake telemetry alongside open-engine patterns. Stable routing endpoints simplify gradual workload migration.

Trade-offs: LakeOps is not a bulk data export tool. It does not move bytes out of Snowflake-native tables. Pair it with Snowflake CTAS, Spark, or Glue for the initial table conversion — then hand off ongoing operations to LakeOps.

For deeper context on why control planes matter in multi-engine architectures, see From Databricks and Snowflake to an Open Data Platform.

2. Snowflake managed Iceberg

Snowflake's native Iceberg support is the lowest-friction on-ramp for teams that want Iceberg files without leaving Snowflake operations on day one. Create an external volume pointing at your cloud storage bucket, then create Iceberg tables with CATALOG = 'SNOWFLAKE' using the Snowflake-as-catalog documentation. Data and metadata land in your bucket. Snowflake handles catalog commits and automatic storage optimization through warehouse credits.

Common migration patterns:

`CREATE ICEBERG TABLE … AS SELECT` from an existing Snowflake-native table — a logical migration in SQL that produces Iceberg-format files on your external volume.
Incremental coexistence — new datasets land as Iceberg from the start; legacy native tables migrate table-by-table based on priority and read patterns.
`ALTER ICEBERG TABLE … CONVERT TO MANAGED` — promotes an externally cataloged Iceberg table (e.g. Glue-registered) to Snowflake-managed with full lifecycle maintenance (conversion guide). Requires ALLOW_WRITES = TRUE on the external volume.
Compaction controls — ENABLE_DATA_COMPACTION can disable automatic compaction at account through table scope; target file sizes (AUTO, 16–128 MB) tune interoperability with external engines. Track compaction spend in SNOWFLAKE.ACCOUNT_USAGE.ICEBERG_STORAGE_OPTIMIZATION_HISTORY (view docs).

Snowflake-managed Iceberg incurs no Snowflake storage charges for bytes on your external volume — you pay your cloud provider for object storage and Snowflake for compute and optimization credits.

Snowflake managed Iceberg architecture — Snowflake-managed Iceberg — files land on your external volume while Snowflake remains catalog and write path. External engines access tables through Horizon REST or Open Catalog sync.

Strengths: Minimal new infrastructure. Familiar Snowflake SQL and governance. Automatic compaction for Snowflake-centric teams. Data is open-format on your own storage from day one.

Trade-offs: Catalog authority stays with Snowflake. Multi-engine write access requires careful coordination. Compaction strategy is platform-controlled and billed separately. Cross-engine maintenance visibility requires a layer above Snowflake.

3. Open Catalog sync and Horizon REST

When Spark, Trino, or Flink need to read the same tables Snowflake writes, catalog sync beats repeated exports. Two paths exist, and choosing the right one matters:

Open Catalog sync (Polaris-based). Snowflake Open Catalog exposes a REST catalog built on Apache Polaris. With CATALOG_SYNC enabled at account, database, or schema level, Snowflake syncs managed Iceberg table metadata to Open Catalog so external engines discover the same tables through a standard REST endpoint. Changes propagate automatically — no ETL copy, no manual refresh. Prerequisites teams miss in pilots:

Set BASE_LOCATION_PREFIX at account, database, or schema level so Open Catalog discovers all synced table paths predictably.
If your engine only supports two namespace levels, create the database with CATALOG_SYNC_NAMESPACE_MODE = FLATTEN — this cannot be changed after creation.
Configure the Open Catalog external catalog, service connection, and catalog integration (CATALOG_SOURCE = POLARIS) before enabling sync.
See the full Polaris sync documentation.

Horizon Catalog REST (no separate Open Catalog account). Snowflake exposes the Iceberg REST API directly through Horizon Catalog for Spark, Trino, Flink, DuckDB, and other engines — without standing up a separate Polaris tenant. Choose Horizon REST when Snowflake stays the sole catalog authority and you want the simplest external-engine path. Choose Open Catalog sync when you already run Polaris for multi-vendor governance.

Strengths: One physical dataset, zero ETL copies. Snowflake remains the write path for many teams. Open engines gain standard REST catalog access without data duplication.

Trade-offs: Open Catalog sync adds Polaris operational scope. Horizon REST bills API and storage-request usage for external engine reads. Neither path provides lake-wide maintenance or health monitoring for tables queried outside Snowflake — that requires an operations layer.

4. Apache Spark + Iceberg

For large-scale exports, non-Snowflake-native sources, or destinations outside Snowflake's catalog, Apache Spark with the Iceberg runtime remains the default bulk migration engine. Teams use the Spark Iceberg procedures and DataFrame APIs to:

Read from Snowflake via the Spark Snowflake connector (or intermediate unload to cloud storage) and write Iceberg with explicit partition specs and sort orders.
Run `migrate` or `snapshot` procedures to convert Hive or Parquet tables to Iceberg in place on object storage (AWS enterprise migration guide).
Orchestrate cutover with Airflow or Dagster — validate row counts, wire the REST catalog, flip BI connections on a schedule.

Spark is the right tool when you need full control over file sizing, partition transforms, encryption, and write parallelism — or when Snowflake-native CTAS is too expensive for full-table rewrites at petabyte scale.

Strengths: Maximum flexibility. Proven at petabyte scale. Works with any catalog (Glue, Polaris, Nessie, Unity). Full control over file layout and partition strategy.

Trade-offs: You operate Spark clusters. Compaction, snapshot retention, and catalog registration are manual unless automated. FinOps visibility requires additional tooling. Maintenance scripts become long-lived infrastructure that drifts over time.

5. AWS Glue

AWS-native Snowflake customers often already have Parquet data on S3. Rewriting every file into Iceberg format is prohibitively expensive at warehouse scale. AWS Glue supports in-place metadata conversion using Iceberg procedures:

`migrate` registers existing Parquet tables as Iceberg with metadata-only changes — no data rewrite.
`add_files` imports existing file layouts without physical copy.
`register_table` adds pre-existing Iceberg metadata to the Glue catalog.
Snowflake reads the result through a catalog integration to Glue and an external volume over the same S3 prefixes.

After registration, Glue table optimizers (compaction, snapshot retention, orphan deletion) provide managed maintenance within AWS. For cross-engine telemetry and Snowflake + Trino coordination on the same Glue catalog, a control plane like LakeOps complements Glue's native optimizers.

Strengths: Avoids full data rewrite for existing Parquet lakes. Serverless Glue jobs. Native Snowflake-to-Glue catalog integration. Lowest cost path when data already lives on S3.

Trade-offs: AWS-scoped. Glue optimizers do not coordinate with Snowflake Open Catalog sync automatically. Cross-cloud estates need additional tooling. Limited control over compaction strategy compared to purpose-built engines.

Which stack for which scenario

Most teams pick two or three of the tools above and combine them. The format/conversion tool gets data into Iceberg shape; LakeOps runs the lake afterward. Use this matrix to assemble the right combination — every row pairs a migration path with the operations layer that keeps it healthy across Snowflake and open engines.

Scenario	Recommended stack
Stay Snowflake-centric, Iceberg files on your bucket	Snowflake managed Iceberg + external volume + LakeOps for cross-engine telemetry once Trino or Athena join
Snowflake writes, external engines read with no second copy	Open Catalog sync or Horizon REST + LakeOps for shared maintenance and health monitoring across all readers
Petabyte-scale export to an open catalog destination	Apache Spark + Iceberg procedures for bulk conversion + LakeOps for ongoing compaction, snapshots, and routing
Parquet already on S3 — minimize data rewrite	AWS Glue `migrate` / `add_files` for in-place metadata + Glue table optimizers + LakeOps for cross-cloud telemetry and policy
Multi-engine routing across Snowflake, Trino, DuckDB, and Athena	LakeOps routing layer over any combination of the migration paths above
Multi-catalog estate (Glue + Polaris + Snowflake)	LakeOps as the unified control plane — connects every catalog, registers every engine

The pattern repeats across every successful program: deploy a control plane before table twenty, not after table two hundred. Without it, the hybrid middle phase silently degrades while teams are still busy migrating tables.

A practical migration sequence

Phase 1 — Pilot catalog and volume. Stand up an external volume (Snowflake path) or a Glue/Polaris catalog (open path). Migrate one non-critical table with CREATE ICEBERG TABLE … AS SELECT or Spark. Validate Snowflake reads, one open-engine read, and row count parity. Confirm that file layout, partition strategy, and sort order match the query patterns you expect from each engine.

Phase 2 — Wire operations early. Connect a control plane before migrating table twenty — not after table two hundred. Enable observability on migrated namespaces, set snapshot retention policies, and establish compaction thresholds. The hybrid middle phase is where unmaintained tables degrade silently while BI still queries Snowflake, and by the time someone notices, the remediation cost is 10x what prevention would have been. For the maintenance model, see autonomous Iceberg table maintenance.

Phase 3 — Workload routing. Compare the same Iceberg query shapes across Snowflake, Trino, and DuckDB before codifying routing policy — cost per query and latency differ enough that defaults leave money on the table. Use routing endpoints to send ad hoc and cost-sensitive SQL to Trino or DuckDB where unit economics win; keep Snowflake for governed BI, semantic layers, and workloads where warehouse features justify the credit cost.

Engine cost and latency comparison — Same Iceberg table, different engines — cost and latency for identical SQL shapes differ across Snowflake, Trino, and DuckDB. Routing policies turn that spread into measurable savings.

Phase 4 — Catalog authority. If multi-engine is the permanent architecture, decide who owns catalog authority per table. Enable CATALOG_SYNC or Horizon REST if Snowflake stays the write path. Migrate catalog authority to Polaris, Nessie, or Glue if Snowflake becomes a consumer. Document which system owns writes per table — ambiguous write ownership is the single largest source of data corruption in multi-engine lakehouses.

Table layout and catalog choice affect every downstream engine — the Apache Iceberg catalog comparison is worth reading before locking your metadata architecture.

Summary

Snowflake to Iceberg migration is two distinct problems: getting tables into the format, and running the lake afterward. Snowflake managed Iceberg, Open Catalog sync, Horizon REST, Spark, and Glue address the first. LakeOps addresses the second — and makes the hybrid middle phase survivable — with observability, autonomous maintenance, intelligent compaction, multi-engine routing, governance, and AI readiness that Snowflake users expect, extended across every engine and catalog in the estate.

Snowflake remains a first-class component in most programs — not a casualty of migration. Iceberg is the shared format. Routing picks the right engine per workload. Deploy a control plane early so migration does not trade warehouse comfort for an operations vacuum. The teams that succeed are the ones who treat lake operations as a discipline from day one, not a cleanup project after the first outage.