A Smooth Migration from Databricks to Iceberg

Databricks built the modern lakehouse on Delta Lake and Unity Catalog, and for most enterprises that stack still runs production ML, streaming, and governed SQL analytics. The question in 2026 is not whether Delta works, but whether a single-format ecosystem can absorb every workload touching the same data. Apache Iceberg answers that question: Trino, Flink, Snowflake, Athena, DuckDB, and a growing roster of engines speak Iceberg natively. When your data scientists train models in Databricks notebooks, your analytics engineers query from Trino, and your BI layer reads from Snowflake, Iceberg becomes the format they share without forcing every team onto one platform's compute.

Databricks has responded aggressively. Unity Catalog managed Iceberg tables (Public Preview, Runtime 16.4 LTS+) let you write native Iceberg under UC governance. Delta UniForm generates Iceberg metadata over existing Delta Parquet files for external reads. Unity Catalog's Iceberg REST Catalog API gives Spark, Trino, and Snowflake a standard endpoint into UC-managed tables. These features make migration technically feasible — but migrating formats is the easy half. The hard half is operating hundreds of Iceberg tables across engines after migration, when Predictive Optimization covers only UC-managed workloads and Trino clusters have no equivalent.

This article walks through five tools that production teams combine in Databricks-to-Iceberg programs. Some convert formats; others expose metadata or federate catalogs; one provides the operational layer that makes open Iceberg as manageable as Predictive Optimization makes tables inside Databricks. LakeOps comes first, not because it rewrites Delta logs (it does not), but because multi-engine Iceberg estates need unified observability, autonomous maintenance, and workload routing while migration is still in flight and Delta pipelines still coexist.

Three migration shapes

Before choosing tools, decide which architecture fits your estate. Most Databricks-to-Iceberg projects land in one of three shapes:

Stay on Delta, expose Iceberg reads. Unity Catalog managed Delta with UniForm enabled. Same Parquet files, asynchronous Iceberg metadata for external readers, Databricks remains the write path. Lowest disruption when Trino or Snowflake need read access but ML teams stay on Databricks and Delta semantics are non-negotiable.

Unity Catalog managed Iceberg. Native Iceberg tables created with USING iceberg, governed by UC, optimized by Predictive Optimization, readable and writable via the Iceberg REST Catalog from Spark, Trino, and Snowflake. Databricks stays central but the table format is Iceberg end-to-end.

Open catalog as system of record. Iceberg on your object storage with Glue, Polaris, or Nessie as metadata authority. Databricks joins via Lakehouse Federation as one engine among Trino, Snowflake, and DuckDB. You route each workload to the engine where it fits — Databricks for ML, streaming, and Spark-native jobs; cheaper engines for ad hoc SQL and scans — without abandoning the platform.

Most production programs combine at least two tools — for example UniForm plus REST Catalog access plus a control plane for cross-engine maintenance outside Predictive Optimization's scope.

1. LakeOps — the operations layer for Databricks + Iceberg

LakeOps is not a Delta-to-Iceberg conversion utility. It does not rewrite Delta transaction logs or export Unity Catalog metadata. LakeOps is an autonomous lakehouse control plane built in Rust on Apache DataFusion — a layer that connects to your Iceberg catalogs (Glue, REST/Polaris, Nessie, Gravitino, Lakekeeper, S3 Tables) and registers Databricks as one engine among many on the same tables. Data never moves; LakeOps reads metadata and cross-engine query telemetry, then runs the operational loop that Predictive Optimization handles for UC-managed Delta — extended to every table queried from Trino, Snowflake, DuckDB, or Athena.

Databricks bundles Liquid Clustering, OPTIMIZE, VACUUM, lineage tracking, and DBU chargeback inside Unity Catalog for managed tables. Predictive Optimization automates those operations using Databricks workload signals. But when tables are exposed via UniForm or the REST Catalog, their primary readers are often engines Predictive Optimization cannot see. Post-migration teams inherit compaction jobs, snapshot policies, manifest tuning, and fragmented query telemetry across Trino, Snowflake, and self-managed Spark. Platform teams deploy LakeOps before table twenty — not after table two hundred — because UniForm and REST exposure do not replace cross-engine maintenance.

Figure: LakeOps Dashboard during a Databricks-to-Iceberg rollout, showing 30-day optimization activity, cost savings, and Critical / Warning / Healthy tiers across every catalog that Databricks and open engines share. It offers Predictive Optimization-like operability for the tables Predictive Optimization cannot reach.

Video: LakeOps product walkthrough covering catalog connection, health analysis, and autonomous optimization for Iceberg tables.

LakeOps delivers six capabilities, described below, which are the same pillars deployed in production control-plane environments across hybrid Databricks estates:

Lakehouse observability

Continuous telemetry across tables, engines, and maintenance jobs — not a separate dashboard per Databricks workspace or Trino cluster. The Tables view classifies every Iceberg table by health tier (Healthy, Warning, Critical) with size, record count, and last-modified timestamp inline — including UniForm-exposed and REST-accessible tables that Predictive Optimization cannot instrument when they are read from Trino, Snowflake, or Athena.

Figure: LakeOps Tables view across a hybrid Databricks + open-engine catalog. Every Iceberg table is classified Healthy, Warning, or Critical, with size, record count, and last-modified timestamp inline. This is the lake-wide inventory that Predictive Optimization stops providing once tables leave the Databricks compute boundary.

Health tiers tell you which tables are degraded; the Insights engine tells you why and what to do about it — surfacing alerts for manifest bloat, snapshot accumulation, small-file proliferation, and partition skew before Trino planning timeouts or Snowflake scan regressions hit production. See Iceberg lakehouse observability.

Figure: Lake-wide Insights, with CRITICAL alerts for partition data file issues, HIGH for excessive manifests and snapshot accumulation, and WARNING for partition skew and small files, surfaced before Trino planning timeouts or Snowflake scan regressions hit production on UniForm or REST-exposed tables.

Autonomous maintenance

Snapshot expiration, orphan cleanup, compaction, manifest rewrite, and statistics refresh run as a sequenced pipeline — complementing or replacing the hand-rolled Spark maintenance jobs teams schedule for tables queried outside Databricks. Event-driven triggers fire when structural thresholds are crossed: file count exceeding target, snapshot depth growing past retention policy, or manifest bloat degrading planning time. Lake-wide policies scope rules from individual table overrides up through namespace defaults to catalog baselines. That is the discipline behind managed Iceberg — maintenance sequenced the way Iceberg economics require, not independent OPTIMIZE crons per team.
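
The trigger-then-sequence idea can be sketched in a few lines of Python. This is an illustrative model only, not the LakeOps API: the `TableStats` and `Thresholds` types, the step names, and every threshold value are hypothetical. The step order follows the pipeline order named above, and refreshing statistics after compaction is an assumption of this sketch.

```python
from dataclasses import dataclass

# Step order mirrors the pipeline named in the text. Names are hypothetical.
PIPELINE_ORDER = [
    "expire_snapshots",
    "remove_orphan_files",  # no threshold in this sketch; assume it runs on a schedule
    "compact_data_files",
    "rewrite_manifests",
    "refresh_statistics",
]


@dataclass(frozen=True)
class TableStats:
    data_file_count: int
    snapshot_count: int
    manifest_count: int


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical defaults for illustration, not recommended values.
    max_data_files: int = 10_000
    max_snapshots: int = 100
    max_manifests: int = 500


def plan_maintenance(stats: TableStats, limits: Thresholds) -> list[str]:
    """Return the steps whose thresholds are crossed, in pipeline order."""
    triggered = set()
    if stats.snapshot_count > limits.max_snapshots:
        triggered.add("expire_snapshots")
    if stats.data_file_count > limits.max_data_files:
        triggered.add("compact_data_files")
        # Assumption: statistics are refreshed whenever data files are rewritten.
        triggered.add("refresh_statistics")
    if stats.manifest_count > limits.max_manifests:
        triggered.add("rewrite_manifests")
    return [step for step in PIPELINE_ORDER if step in triggered]
```

Returning an ordered plan rather than firing each step independently is the point of the sketch: two thresholds crossed at once still produce one sequenced run instead of competing jobs.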

Figure: Lake-wide Events view showing compaction, snapshot expiration, manifest rewrites, and orphan cleanup across every catalog in the estate, with each step logged with duration, before/after metrics, and status. Cross-engine maintenance keeps pace with Databricks and open-engine writes without per-team scripts.

Intelligent compaction

Compaction runs on a purpose-built Rust engine that LakeOps reports as 95% faster than equivalent Spark maintenance jobs in its production benchmarks. Binpack merges small files toward target sizes. Query-aware sort reorders data around the columns that Databricks SQL and external Trino queries actually filter on; cross-engine telemetry feeds the sort strategy so the first open-engine queries after migration do not scan poorly clustered UniForm exports. Layout Simulations replay production SQL against candidate sort strategies before rewriting a single file, letting teams validate layout changes against real workload patterns rather than guessing partition keys.
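
To make the binpack step concrete, here is a minimal Python sketch of grouping small files into rewrite batches that approach a target size. It uses first-fit decreasing, a standard bin-packing heuristic chosen here for illustration; it is not the LakeOps engine or Iceberg's own planner, and the 512 MB target is a hypothetical value.

```python
def plan_binpack(file_sizes_mb: list[int], target_mb: int = 512) -> list[list[int]]:
    """Group small files into rewrite batches of at most target_mb each.

    Files already at or above the target are left alone. Batches that would
    contain a single file are dropped, since rewriting them gains nothing.
    """
    small = sorted((s for s in file_sizes_mb if s < target_mb), reverse=True)
    batches: list[list[int]] = []
    for size in small:
        for batch in batches:
            # First fit: place the file in the first batch with enough room.
            if sum(batch) + size <= target_mb:
                batch.append(size)
                break
        else:
            batches.append([size])
    return [batch for batch in batches if len(batch) > 1]
```

For example, `plan_binpack([600, 300, 200, 100, 50, 10])` leaves the 600 MB file untouched and proposes two rewrites, `[300, 200, 10]` and `[100, 50]`.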

Figure: Layout Simulations replay production SQL from Databricks and Trino against candidate sort strategies, validating data-skipping improvements before committing to a full rewrite. The right sort order at migration time prevents months of scan amplification across engines.

Multi-engine routing

Register Databricks alongside Trino, Snowflake, DuckDB, and Athena in one engine directory — then map workloads to stable routing endpoints with cost, latency, and query-type policies. ML, streaming, and notebook workloads stay on Databricks clusters where they belong. Ad hoc SQL and cost-sensitive scans route to cheaper engines where unit economics are 3–10x better. Stable endpoints per workload group mean consumers never reconfigure connection strings on every migration cutover. See multi-engine query routing.
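
A routing policy of this shape can be sketched as follows. Everything in the block is hypothetical: the `Engine` type, the workload type names, and the cost figures are placeholders for illustration, not LakeOps configuration and not benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Engine:
    name: str
    cost_per_tb_scanned: float  # hypothetical relative cost units
    supports: frozenset[str]    # workload types this engine accepts


def route(workload_type: str, engines: list[Engine], pinned: dict[str, str]) -> str:
    """Pick an engine name for a workload type.

    Pinned workload types always go to their assigned engine. Everything
    else goes to the cheapest engine that accepts the workload type.
    """
    if workload_type in pinned:
        return pinned[workload_type]
    candidates = [e for e in engines if workload_type in e.supports]
    if not candidates:
        raise ValueError(f"no engine accepts workload type {workload_type!r}")
    return min(candidates, key=lambda e: e.cost_per_tb_scanned).name
```

Pinning ML and streaming to Databricks while letting cost decide the rest reflects the split described above. A production policy would also weigh latency and query type, which this sketch leaves out.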

Figure: Engine cost and latency comparison. The same Iceberg SQL shape runs across Databricks, Trino, and DuckDB; cost per query and latency differ enough that routing ad hoc workloads to cheaper engines saves meaningful DBU spend while ML and streaming stay on Databricks clusters.

Governance and policies

Lake-wide optimization policies span catalogs, not individual Databricks workspaces. Scope rules from table-level overrides through namespace defaults to catalog-wide baselines — compaction thresholds, snapshot retention windows, manifest rewrite triggers, and target file sizes. Cron scheduling sequences maintenance across time zones and workload windows. Conflict-aware execution ensures compaction does not collide with streaming writes or notebook jobs. The policy model replaces the per-team OPTIMIZE scripts that proliferate after migration with centralized, auditable configuration.
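
The scoping rule (table overrides beat namespace defaults, which beat the catalog baseline) amounts to a layered dictionary merge. The sketch below assumes policies are plain key/value settings and tables are named `namespace.table`; the setting names and values are hypothetical and do not reflect an actual LakeOps policy schema.

```python
def resolve_policy(
    table: str,
    catalog_baseline: dict,
    namespace_defaults: dict,
    table_overrides: dict,
) -> dict:
    """Return the effective policy for a 'namespace.table' name.

    The most specific scope wins, one setting at a time.
    """
    namespace = table.rsplit(".", 1)[0]
    effective = dict(catalog_baseline)                        # widest scope first
    effective.update(namespace_defaults.get(namespace, {}))   # then the namespace
    effective.update(table_overrides.get(table, {}))          # then the table itself
    return effective
```

Resolving per setting rather than per policy means a table can override one threshold and still inherit the rest from its namespace or catalog.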

Figure: LakeOps Policies dashboard showing active policies for compaction, orphan cleanup, snapshot expiration, and configuration governance, each with a single status toggle, scope, schedule, and last-run audit. Centralized policy replaces the per-team OPTIMIZE / VACUUM scripts that proliferate after migration to multi-engine Iceberg.

Agentic AI readiness

ML teams and AI agents increasingly query the same Iceberg tables as BI — with iterative SQL, unpredictable access patterns, and session-length workloads that batch-era layouts were never designed for. LakeOps exposes an agent-native MCP interface so AI agents connect alongside ML notebooks through a unified control plane. Layered guardrails — read-only enforcement, row limits, PII masking, cost caps, and human-approval gates — govern unsupervised agent SQL before queries reach shared Iceberg tables. Compaction driven by cross-engine telemetry includes agent query patterns, so the lake self-optimizes as agent adoption scales beside Databricks ML pipelines.

Figure: AI agents and ML notebooks connect through the LakeOps control plane with six governance layers, from authentication through cost limits to human approval, before queries reach Iceberg tables shared across Databricks and open engines.

For architecture context on why control planes matter in hybrid estates, see From Databricks and Snowflake to an Open Data Platform.

Strengths: Hybrid Databricks + multi-engine Iceberg operability; Predictive Optimization–grade dashboards, alerts, autonomous compaction, and routing extended to tables queried from engines UC cannot instrument — without giving up open storage or locking into a single vendor catalog.

Trade-offs: Not a bulk Delta export or format conversion tool — pair with Unity Catalog Iceberg features, UniForm, or Spark for the initial format and catalog work.

2. Unity Catalog managed Iceberg

Unity Catalog managed Iceberg tables (Public Preview, Databricks Runtime 16.4 LTS+) are the most direct path when you want native Iceberg inside Databricks governance. Tables are created with CREATE TABLE … USING iceberg; otherwise Databricks defaults to Delta Lake (managed tables documentation). Predictive Optimization runs automatic OPTIMIZE (compaction and Liquid Clustering), VACUUM, and ANALYZE on managed Iceberg using Databricks workload signals — the same autonomous maintenance Delta tables receive.

Common migration patterns:

`CREATE TABLE … USING iceberg AS SELECT` from an existing Delta table — logical migration in SQL with a full data rewrite into Iceberg layout on managed storage.
Incremental coexistence — new datasets land as managed Iceberg while legacy Delta tables migrate namespace-by-namespace, letting teams validate engine compatibility per workload.
Lakehouse Federation first — register foreign Glue or Hive tables as a stepping stone, then promote workloads to managed Iceberg when ready (see section 5).
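
The CTAS pattern, applied one namespace at a time, could be scripted roughly as below. The helper names and the catalog, schema, and table names are hypothetical; the generated SQL uses only the `CREATE TABLE … USING iceberg AS SELECT` form described above, and the statements would be run through whatever SQL client the team already uses.

```python
import re

# Three-part Unity Catalog name: catalog.schema.table
_NAME = re.compile(r"[A-Za-z_]\w*(\.[A-Za-z_]\w*){2}")


def ctas_to_managed_iceberg(source_table: str, target_table: str) -> str:
    """Render a CTAS that rewrites a Delta table into a managed Iceberg table."""
    for name in (source_table, target_table):
        if not _NAME.fullmatch(name):
            raise ValueError(f"expected catalog.schema.table, got {name!r}")
    return (
        f"CREATE TABLE {target_table}\n"
        "USING iceberg\n"
        f"AS SELECT * FROM {source_table}"
    )


def plan_namespace_migration(
    catalog: str, source_schema: str, target_schema: str, tables: list[str]
) -> list[str]:
    """One CTAS per table, so a namespace can be migrated and validated as a unit."""
    return [
        ctas_to_managed_iceberg(
            f"{catalog}.{source_schema}.{table}",
            f"{catalog}.{target_schema}.{table}",
        )
        for table in tables
    ]
```

Because CTAS is a full data rewrite, each statement incurs compute cost proportional to table size; generating the plan first makes that cost reviewable before anything runs.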

REST Catalog external access. Managed Iceberg tables are readable and writable from external engines through Unity Catalog's Iceberg REST Catalog API at endpoint /api/2.1/unity-catalog/iceberg-rest. External setup requires enabling metastore external data access, granting EXTERNAL USE SCHEMA, and authenticating with OAuth or a PAT. Spark, Flink, Trino, and Snowflake catalog integrations connect to the same endpoint; credential vending supplies short-lived storage credentials where supported. Databricks recommends Iceberg clients 1.9.2+; Iceberg v3 features (including deletion vectors) are available on supported runtimes such as DBR 18.0+.
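
For an external Spark cluster, the connection comes down to a handful of Iceberg REST catalog properties plus one grant. The sketch below builds both as plain Python values. It assumes PAT authentication, a Spark session with the Iceberg Spark runtime on its classpath, and hypothetical workspace URL, catalog, schema, and principal names; OAuth setups use different properties and are not shown.

```python
def uc_iceberg_rest_spark_conf(
    workspace_url: str, uc_catalog: str, token: str, spark_catalog: str = "uc"
) -> dict[str, str]:
    """Spark settings for reading a Unity Catalog catalog over Iceberg REST.

    spark_catalog is the local alias used in Spark SQL. The token is a PAT
    and should come from a secret store, never from source code.
    """
    prefix = f"spark.sql.catalog.{spark_catalog}"
    base = workspace_url.rstrip("/")
    return {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": f"{base}/api/2.1/unity-catalog/iceberg-rest",
        f"{prefix}.token": token,
        f"{prefix}.warehouse": uc_catalog,
    }


def grant_external_use(schema: str, principal: str) -> str:
    """SQL granting an external engine's principal access to one schema."""
    return f"GRANT EXTERNAL USE SCHEMA ON SCHEMA {schema} TO `{principal}`"
```

Each key/value pair can be applied with `SparkSession.builder.config(key, value)` before the session is created; after that, tables are addressed as `uc.<schema>.<table>` under the alias chosen above.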

There is no documented in-place Delta → managed Iceberg metadata conversion without UniForm (read interoperability) or a CTAS rewrite. Teams that need native Iceberg writes from Trino must land on managed Iceberg or an open catalog — UniForm alone does not provide external write paths.

Figure: Managed Iceberg under Unity Catalog keeps governance in Databricks while the Iceberg REST Catalog exposes the same tables to Trino, Snowflake, and Spark. A control plane above handles cross-engine maintenance beyond Predictive Optimization's scope.

Strengths: Minimal new infrastructure for Databricks-centric teams; unified governance, lineage, and automatic maintenance via Predictive Optimization; REST Catalog read/write path for managed Iceberg from external engines.

Trade-offs: Public Preview — feature gaps vs Delta remain (no Delta generated columns or certain Delta-only constraints on managed Iceberg); Predictive Optimization scope stays Databricks-only — cross-engine maintenance needs a separate layer; foreign Iceberg tables federated into UC require REFRESH FOREIGN TABLE for external REST readers.

3. Delta UniForm

When the goal is multi-engine read access without rewriting every Delta table to native Iceberg, Delta UniForm is the lowest-friction bridge. UniForm asynchronously generates Iceberg metadata over the same Parquet files Delta already writes — negligible write overhead because conversion happens after the Delta commit. External engines read through the UC Iceberg REST Catalog.

On Databricks, enabling Iceberg compatibility on a Unity Catalog Delta table requires the following:

Table registered in Unity Catalog (managed or external Delta).
Column mapping enabled (delta.columnMapping.mode = name).
Delta protocol minReaderVersion >= 2, minWriterVersion >= 7.
Writes from Databricks Runtime 14.3 LTS+.
Set delta.enableIcebergCompatV2 = true and delta.universalFormat.enabledFormats = iceberg. Runtime 15.4+ supports ALTER TABLE on existing tables; REORG TABLE … APPLY (UPGRADE UNIFORM) can rewrite files when upgrading protocol.
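
The checklist above can be turned into a small pre-flight check followed by the `ALTER TABLE` statement. The function names and the example table name are hypothetical; the table properties are the ones listed above. How the protocol versions are obtained (for example from `DESCRIBE DETAIL`) is left outside the sketch.

```python
def uniform_blockers(
    in_unity_catalog: bool, min_reader_version: int, min_writer_version: int
) -> list[str]:
    """List unmet prerequisites from the checklist; empty means ready."""
    blockers = []
    if not in_unity_catalog:
        blockers.append("table is not registered in Unity Catalog")
    if min_reader_version < 2:
        blockers.append("minReaderVersion must be >= 2")
    if min_writer_version < 7:
        blockers.append("minWriterVersion must be >= 7")
    return blockers


def enable_uniform_sql(table: str) -> str:
    """Render the ALTER TABLE that turns on Iceberg metadata generation."""
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES (\n"
        "  'delta.columnMapping.mode' = 'name',\n"
        "  'delta.enableIcebergCompatV2' = 'true',\n"
        "  'delta.universalFormat.enabledFormats' = 'iceberg'\n"
        ")"
    )
```

Running the check first keeps the failure mode explicit: a table that needs a protocol upgrade is flagged before the `ALTER TABLE` is attempted.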

External Iceberg clients read UniForm tables through the Unity Catalog Iceberg REST Catalog — read-only for Delta-backed tables. Writes still go through Delta on Databricks. Snowflake catalog-linked databases against the REST endpoint are a documented cross-vendor pattern for multi-engine read access without rewriting Delta files.

Plan around documented limitations: materialized views and streaming tables are unsupported; Iceberg v2 reads do not work on tables with deletion vectors enabled (Iceberg v3 on supported runtimes addresses this); tables must be accessed by name in Unity Catalog to trigger metadata generation; UniForm metadata generation is asynchronous — external readers may lag the latest Delta commit briefly.

Strengths: No data copy; keeps ML and streaming pipelines on Delta while Trino, Snowflake, or open Spark gain Iceberg catalog access — the fastest path to multi-engine reads.

Trade-offs: Read-only for external engines; table remains Delta under the hood; not a substitute for native Iceberg writes from Trino or Flink; layout inherited from Delta partitioning may not match open-engine query patterns.

4. Apache Spark + Iceberg

For petabyte-scale moves off Delta, or when the target catalog is outside Unity Catalog (Glue, Polaris, Nessie), Apache Spark with the Iceberg runtime remains the default bulk engine — on Databricks clusters or open Spark.

Teams use the Spark Iceberg procedures and SQL/DataFrame APIs to:

CTAS from Delta — CREATE TABLE glue_catalog.db.events USING iceberg AS SELECT * FROM unity_catalog.db.delta_events rewrites into Iceberg with explicit partition specs and sort orders.
Run `migrate` or `add_files` on existing Parquet tables in S3 for a metadata-only conversion where the file layout already matches Iceberg expectations, or `register_table` when Iceberg metadata files already exist (AWS enterprise migration guide).
Orchestrate cutover with Lakeflow Jobs, Airflow, or Dagster — row-count validation, REST catalog wiring, BI tool migration on a schedule.
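
The Iceberg Spark procedures are invoked with `CALL <catalog>.system.<procedure>(...)`. The sketch below renders three of them with named arguments. The catalog, table, and path values are hypothetical. Note the differing preconditions: `migrate` replaces an existing Hive or Spark table in place, `add_files` adds existing data files to an Iceberg table that already exists, and `register_table` needs an Iceberg metadata file that has already been written.

```python
def migrate_call(catalog: str, table: str) -> str:
    """In-place conversion of an existing Hive/Spark table to Iceberg."""
    return f"CALL {catalog}.system.migrate(table => '{table}')"


def add_files_call(catalog: str, table: str, source_table: str) -> str:
    """Add a source table's existing data files to an Iceberg table."""
    return (
        f"CALL {catalog}.system.add_files("
        f"table => '{table}', source_table => '{source_table}')"
    )


def register_table_call(catalog: str, table: str, metadata_file: str) -> str:
    """Register an existing Iceberg metadata file under a catalog name."""
    return (
        f"CALL {catalog}.system.register_table("
        f"table => '{table}', metadata_file => '{metadata_file}')"
    )
```

None of these calls rewrite data files, which is what makes them cheaper than CTAS, but it also means the existing file layout carries over unchanged.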

Spark is the right tool when CTAS DBU cost is acceptable, you need partition transforms Unity Catalog does not expose, or the destination catalog is deliberately not Unity Catalog. Loading external Iceberg JARs onto Databricks for non-UC catalog writes is unsupported — use Databricks SQL against UC managed Iceberg, or run open Spark against Glue/Polaris for open-catalog destinations.

Strengths: Maximum flexibility; works for any catalog; proven at warehouse scale; pairs with Glue migrate / add_files on AWS.

Trade-offs: You operate Spark clusters and migration runbooks; FinOps and ongoing maintenance are not included — without a control plane, compaction and snapshot management become per-team responsibilities.

5. Lakehouse Federation

Many Databricks estates still have production tables in AWS Glue or a legacy Hive metastore on S3. Lakehouse Federation (catalog federation) registers those external catalogs in Unity Catalog so Databricks queries object storage in place — a common stepping stone before native Iceberg migration.

Typical pattern:

1. Federate Glue or HMS into Unity Catalog as a foreign catalog with storage credentials and authorized paths.
2. Run Databricks workloads against foreign tables while planning managed Iceberg CTAS or UniForm on priority datasets.
3. For foreign Delta tables federated from Glue/HMS, `ALTER TABLE … SET MANAGED MOVE` (or COPY) promotes them to Unity Catalog managed Delta (Runtime 17.3+, Public Preview); then enable UniForm or CTAS to Iceberg on a schedule. Note: SET MANAGED currently requires the foreign table to be Delta format.
4. For Parquet lakes on S3, pair federation with Spark Iceberg migrate or add_files procedures (see the Spark section above).
5. Schedule REFRESH FOREIGN TABLE for namespaces accessed by external engines so metadata stays current.
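
Steps 3 and 5 reduce to two SQL statements that a scheduler can emit per table. The sketch below renders them; the function names and table names are hypothetical, and the `MOVE` / `COPY` options are taken from the description above rather than verified against a specific runtime.

```python
def refresh_statements(tables: list[str]) -> list[str]:
    """One REFRESH FOREIGN TABLE per federated table read by external engines."""
    return [f"REFRESH FOREIGN TABLE {table}" for table in tables]


def promote_statement(table: str, mode: str = "MOVE") -> str:
    """Promote a federated Delta table to a Unity Catalog managed table."""
    if mode not in ("MOVE", "COPY"):
        raise ValueError(f"mode must be MOVE or COPY, got {mode!r}")
    return f"ALTER TABLE {table} SET MANAGED {mode}"
```

A job scheduled with Lakeflow Jobs, Airflow, or a similar orchestrator could run the refresh list on an interval and the promotion statement once per table as it reaches the front of the migration queue.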

Catalog federation is optimized for incremental migration — Databricks recommends moving frequently queried production tables to UC managed tables for Predictive Optimization. Federation alone does not replace native Iceberg writes from Trino; it buys time while Spark or managed Iceberg CTAS runs on critical paths.

Strengths: No big-bang cutover; governed access to legacy Glue/HMS namespaces from Databricks; SET MANAGED path for federated Delta; pairs with in-place Parquet → Iceberg on AWS.

Trade-offs: Foreign tables lack full UC managed-table optimizations; cross-engine maintenance still needs a control plane outside Databricks; foreign table metadata may lag without scheduled refreshes.

Which stack for which scenario

Production programs almost always combine two or three of the tools above. The format/conversion tool gets data into Iceberg shape; LakeOps runs the lake afterward — observability, autonomous maintenance, and routing for the tables Predictive Optimization cannot reach. Use this matrix to assemble the right combination for your estate.

| Scenario | Recommended stack |
| --- | --- |
| Trino, Snowflake, or Athena need read access; Databricks keeps writing Delta | Delta UniForm + UC Iceberg REST Catalog + LakeOps for cross-engine maintenance and observability |
| Native Iceberg under Unity Catalog governance | UC managed Iceberg + Predictive Optimization + LakeOps for the tables exposed to non-UC engines |
| Petabyte-scale rewrite or non-UC catalog destination | Apache Spark + Iceberg procedures + open catalog (Glue/Polaris/Nessie) + LakeOps as the control plane |
| Legacy Glue or Hive Metastore on S3, incremental bridge | Lakehouse Federation → Spark `migrate` or UC `SET MANAGED MOVE` + LakeOps for ongoing ops |
| Multi-engine routing across Databricks, Trino, Snowflake, and DuckDB | LakeOps routing layer over any of the migration paths above |
| Hybrid Databricks + Snowflake + open engines on shared Iceberg tables | LakeOps as the unified control plane: registers every engine, runs lake-wide policy |

The pattern is consistent: deploy a control plane before table twenty, not after table two hundred. The hybrid middle phase — UniForm tables read by Trino, Snowflake catalog-linked databases, ML pipelines on Delta — is where degradation accumulates silently while everyone is still busy migrating.

A practical migration sequence

Phase 1 — pilot one namespace. Pick a non-critical schema with known query patterns. Either enable UniForm on a Delta table and validate Trino/Snowflake reads via the REST Catalog, or CTAS one table to managed Iceberg. Confirm row counts, snapshot visibility, and query performance from both Databricks and at least one external engine. Document which Delta features the table uses — generated columns, deletion vectors, streaming table semantics — to identify UniForm limitations early.
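
Row-count confirmation is simple to automate once the counts have been collected from each engine. The sketch below compares two mappings of table name to row count and reports every disagreement. How the counts are gathered (for example a `SELECT COUNT(*)` per table on Databricks and on Trino) is assumed and left outside the function; the table names are hypothetical.

```python
from typing import Optional


def compare_row_counts(
    source_counts: dict[str, int], target_counts: dict[str, int]
) -> dict[str, tuple[Optional[int], Optional[int]]]:
    """Return tables whose counts differ, or that exist on only one side.

    Each value is (source_count, target_count); None marks a missing table.
    """
    mismatches = {}
    for table in sorted(set(source_counts) | set(target_counts)):
        src = source_counts.get(table)
        tgt = target_counts.get(table)
        if src != tgt:
            mismatches[table] = (src, tgt)
    return mismatches
```

For UniForm tables, a small mismatch right after a write may simply reflect asynchronous metadata generation, so it is worth re-checking after a short delay before treating it as a failure.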

Phase 2 — wire operations. Connect LakeOps before migrating table twenty. Enable health monitoring, snapshot policies, and compaction on migrated catalogs early — especially for tables where Predictive Optimization does not run because the primary reader is Trino or Snowflake. Establish baseline metrics so degradation is detectable before it reaches BI dashboards.

Phase 3 — workload routing. Benchmark the same Iceberg SQL across Databricks, Trino, and DuckDB before locking routing rules. Send ad hoc and cost-sensitive SQL to engines where unit economics win; keep Databricks clusters for ML, streaming, notebooks, and Spark-native jobs. Routing endpoints stay stable so consumers never reconfigure connection strings.

Figure: Routing groups map Analytics to Trino/DuckDB, BI to Snowflake or Trino, and ETL to Spark or Athena. Databricks keeps ML and streaming workloads, with stable endpoints that consumers never reconfigure during migration.

Phase 4 — catalog authority. If multi-engine is permanent, decide whether Unity Catalog remains the Iceberg REST authority or catalog authority moves to Glue/Polaris/Nessie with Databricks as a consumer via Lakehouse Federation. Document write ownership per table. The Apache Iceberg catalog comparison covers Unity Catalog alongside Glue, Polaris, and Nessie before you lock metadata architecture.

Summary

Databricks to Iceberg migration is two problems: getting tables into an Iceberg-compatible shape, and running the lake afterward. Unity Catalog managed Iceberg, Delta UniForm, Spark, and Lakehouse Federation address the first. LakeOps addresses the second — and makes the hybrid middle phase survivable — with dashboards, autonomous maintenance, intelligent compaction, and routing that Databricks users expect from Predictive Optimization, extended across every engine and catalog in the estate.

Databricks remains central for ML, streaming, and notebook workflows in most hybrid estates; Iceberg is the shared format, not a platform replacement. Choose UniForm when Databricks stays the write path and external engines need read access without a rewrite. Choose managed Iceberg when Unity Catalog should govern native Iceberg end-to-end. Choose Spark when landing in an open catalog or rewriting at scale. Choose Lakehouse Federation when Glue or HMS namespaces need a governed bridge. And deploy a control plane early: autonomous table maintenance is the long-term discipline that keeps a move to open Iceberg from trading Databricks' operational comfort for an unmaintained lake.