
Delta Lake to Apache Iceberg: Zero-Copy Migration Without Moving Data

How to migrate Delta Lake tables to Apache Iceberg without copying or rewriting data files. Covers zero-copy metadata conversion, the mapping between Delta transaction logs and Iceberg manifest trees, Iceberg V3 spec compatibility, practical tooling (XTable, UniForm, Iceberg Delta module), and the post-migration operational discipline — compaction, sort optimization, statistics — that determines whether converted tables actually perform.


Delta Lake and Apache Iceberg store their data in the same physical format — columnar Parquet files on object storage. The difference between the two is entirely metadata: how each format tracks which files belong to the table, what the schema is, how partitions are defined, and what constitutes a committed transaction. That observation is the foundation of zero-copy migration. If the data files are already Parquet and Parquet is what both formats read at query time, then migrating from Delta to Iceberg does not require touching the data at all. The entire operation is a metadata translation — reading the Delta transaction log and producing an equivalent Iceberg metadata tree that points to the same physical files.

The idea sounds deceptively simple. In practice, the two metadata models are architecturally different enough that the translation requires careful mapping of schema representations, partition specs, file-level statistics, transaction semantics, and deletion tracking. The updated Apache Iceberg Delta migration module demonstrates that it is now possible to perform this translation reliably for modern Delta tables using the Delta Kernel library, including support for deletion vectors and VACUUM scenarios. This article examines how zero-copy migration works under the hood, what the tooling landscape looks like in 2026, and why the migration itself is the easy part — what comes after conversion determines whether your Iceberg tables actually perform.

Why teams migrate from Delta Lake to Iceberg

The technical merits of Delta Lake are real. It brought ACID transactions, schema enforcement, and time travel to the data lake. Databricks built a mature ecosystem around it — Unity Catalog for governance, Liquid Clustering for layout, Predictive Optimization for automated maintenance. For teams whose entire stack runs on Databricks, Delta Lake works well.

The migration pressure comes from everything outside Databricks. When analytics engineers want to query from Trino, when the data science team runs DuckDB locally, when the BI layer sits on Snowflake, when Flink handles real-time ingestion, when Athena serves serverless ad hoc queries, Delta Lake becomes a bottleneck. Write support outside Databricks is uneven: connectors exist for some engines, but they tend to lag behind the Delta protocol and its newer table features. Reading requires either UniForm (which generates Iceberg metadata for read-only access) or engine-specific Delta readers with the same lag. In practice, every engine that needs the same data either goes through Databricks or maintains a separate copy.

Apache Iceberg removes that constraint. More than a dozen query engines support Iceberg natively for both reads and writes — Spark, Trino, Flink, Snowflake, DuckDB, Athena, Dremio, StarRocks, BigQuery, and others. A single Iceberg table is accessible from any engine through any compliant catalog (Glue, Polaris, Nessie, REST Catalog, Gravitino) without format translation layers or read-only workarounds. The table format becomes infrastructure rather than a vendor feature.

LakeOps provides the control plane that makes this transition operationally viable — connecting to existing catalogs (Glue, Polaris, REST, S3 Tables), classifying table health from the moment of conversion, and running autonomous maintenance (compaction, sort optimization, statistics generation) so freshly migrated tables deliver multi-engine performance from day one.

LakeOps Connect Catalogs
LakeOps connects to existing Iceberg catalogs — Glue, Polaris, REST, S3 Tables — without moving data, providing immediate health classification and autonomous maintenance for migrated tables.

Three forces typically push teams toward migration in 2026. First, multi-engine access — the need to query and write the same tables from engines that do not speak Delta. Second, vendor neutrality — reducing dependency on a single platform's proprietary extensions, governance model, and pricing. Third, Iceberg V3 capabilities — deletion vectors with Roaring bitmap encoding, the native Variant type for semi-structured data, row lineage tracking, and geometry types that Delta does not offer. Teams running mixed workloads across streaming, batch, ML, and ad hoc analytics find that Iceberg's open ecosystem gives them the architectural flexibility Delta's single-vendor model cannot.

The CTAS problem: why full rewrites are expensive

The default migration approach for any table format conversion is Create Table As Select — read every row from the source, write every row into the target format. For Delta to Iceberg, that looks like a Spark job that scans the entire Delta table and writes new Parquet files registered under an Iceberg metadata tree.

```sql
CREATE TABLE iceberg_catalog.prod.events
USING iceberg
PARTITIONED BY (days(event_timestamp))
AS SELECT * FROM delta_catalog.prod.events;
```

CTAS works. It produces a clean Iceberg table with optimal file sizes, explicit sort orders, and fresh statistics. For a 100 GB table, it finishes in minutes and the compute cost is negligible. The problem is that production tables are rarely 100 GB.

A 10 TB table takes hours to rewrite. The Spark cluster reads every Parquet file from object storage, deserializes and re-serializes every row, writes new Parquet files back to storage, and builds the Iceberg metadata tree from scratch. During that window, the source table either freezes writes (unacceptable for production pipelines) or accepts writes that the CTAS output will miss (requiring a reconciliation step that adds complexity and risk). The compute bill for scanning and rewriting 10 TB of compressed Parquet across a multi-node Spark cluster runs into thousands of dollars. Storage costs temporarily double because both the source Delta files and the new Iceberg files exist simultaneously until the cutover completes and the old files are cleaned up.

For a 100 TB estate — twenty tables averaging 5 TB each — CTAS migration becomes a multi-week project with significant compute spend, operational risk during cutover windows, and pipeline disruptions that ripple through downstream consumers. Teams that have already invested in optimized Delta table layouts — carefully tuned file sizes, Liquid Clustering, Z-ordering — are rewriting data that is already in the right physical format. The new Iceberg files will contain exactly the same Parquet row groups with exactly the same column statistics. The only thing that actually needs to change is the metadata layer that tracks those files.

That insight — that the data is already where it needs to be and only the metadata needs translation — is what makes zero-copy migration compelling.

Zero-copy metadata conversion: how it works

Zero-copy migration reads the Delta Lake transaction log and produces an equivalent Iceberg metadata tree that references the same Parquet data files. No data is read, deserialized, moved, or rewritten. The operation touches only metadata, so its runtime scales with the number of files and commits in the log rather than with data volume. A 10 TB table and a 100 GB table with similar file counts take roughly the same time, typically minutes, because the bottleneck is metadata parsing, not data scanning.


The process follows a consistent pattern across all zero-copy tools. The converter reads the Delta transaction log — the sequence of JSON commit files and Parquet checkpoint files in the _delta_log/ directory — and reconstructs the table state: which files are active, what the schema looks like, how the table is partitioned, and what column-level statistics each file carries. It then maps that state to Iceberg's metadata model and writes the corresponding Iceberg metadata files alongside the existing Delta log.

Step 1: Parse the Delta transaction log

Delta Lake tracks table state through a sequential log of JSON files (_delta_log/00000000000000000000.json, 00000000000000000001.json, and so on) interspersed with Parquet checkpoint files that consolidate the log at periodic intervals. Each log entry records actions: Add (a new file is part of the table), Remove (a file is no longer part of the table), Metadata (schema or property changes), Protocol (reader/writer version requirements), and CommitInfo (audit metadata). Checkpoint files are Parquet-encoded snapshots of the accumulated table state at a specific log version, allowing readers to skip replaying the entire log history.

The converter reads the latest checkpoint and any subsequent JSON log entries to reconstruct the current table state. Modern converters — including the updated Iceberg Delta migration module — use the Delta Kernel library rather than the deprecated Delta Standalone library. Delta Kernel provides a stable, engine-independent API for reading Delta metadata without requiring a Spark runtime, which means the conversion can run as a lightweight standalone process.
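The replay logic can be illustrated with a minimal sketch. It assumes the commits have already been read from `_delta_log/` into lists of JSON lines; the function name `replay_delta_log` and the sample paths are hypothetical, and a real converter would go through Delta Kernel rather than parse the log by hand. Keying files by path alone is a simplification that ignores checkpoints and deletion vectors.

```python
import json


def replay_delta_log(commits):
    """Replay Delta commits in version order and return the active files.

    `commits` is a list of commits; each commit is a list of JSON strings,
    one action per line, as in a _delta_log/*.json file.
    """
    active = {}      # file path -> Add action for files currently in the table
    metadata = None  # most recent metaData action seen
    for commit in commits:
        for line in commit:
            action = json.loads(line)
            if "add" in action:
                active[action["add"]["path"]] = action["add"]
            elif "remove" in action:
                # A Remove cancels an earlier Add for the same path.
                active.pop(action["remove"]["path"], None)
            elif "metaData" in action:
                metadata = action["metaData"]
    return active, metadata


# Two hypothetical commits: the second one replaces a.parquet with c.parquet.
commits = [
    ['{"metaData": {"partitionColumns": ["event_date"]}}',
     '{"add": {"path": "event_date=2026-07-01/a.parquet", "size": 100}}',
     '{"add": {"path": "event_date=2026-07-01/b.parquet", "size": 200}}'],
    ['{"remove": {"path": "event_date=2026-07-01/a.parquet"}}',
     '{"add": {"path": "event_date=2026-07-01/c.parquet", "size": 150}}'],
]
active_files, table_metadata = replay_delta_log(commits)
print(sorted(active_files))
```

Only the files that survive the replay are candidates for the Iceberg manifests; everything else in the log is history.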

Step 2: Map schema and partition specs

Delta Lake schemas are stored as JSON-encoded StructType definitions within the Metadata action. Each field has a name, a data type (including nested structs, arrays, and maps), and a nullable flag. Iceberg schemas use a different representation — each field is assigned a unique integer ID, types use Iceberg-specific naming (e.g., timestamptz instead of Delta's TimestampType), and nested structures maintain their own ID assignments.

The converter maps every Delta type to its Iceberg equivalent. Primitive types translate directly: StringType to string, LongType to long, DoubleType to double, BooleanType to boolean, BinaryType to binary, DateType to date, TimestampType to timestamptz. Decimal types preserve their precision and scale. Complex types — structs, arrays, maps — are traversed recursively with field IDs assigned at each level. The resulting Iceberg schema carries an ID-based field mapping that allows Iceberg readers to locate columns by ID rather than by name, enabling safe schema evolution after migration.
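A rough sketch of that recursive mapping, working on the JSON form of both schemas. The function name `to_iceberg_type` and the sample schema are illustrative assumptions, and the depth-first ID assignment is a simplification: real converters may number fields in a different order, and must reuse existing IDs when Delta column mapping is enabled.

```python
import itertools

# Delta JSON type name -> Iceberg type name for the primitives listed above.
PRIMITIVES = {
    "string": "string", "long": "long", "integer": "int", "double": "double",
    "float": "float", "boolean": "boolean", "binary": "binary",
    "date": "date", "timestamp": "timestamptz",
}


def to_iceberg_type(delta_type, ids):
    """Recursively convert a Delta type (str or dict) to an Iceberg type."""
    if isinstance(delta_type, str):
        if delta_type.startswith("decimal"):
            return delta_type  # precision and scale carry over unchanged
        return PRIMITIVES[delta_type]
    kind = delta_type["type"]
    if kind == "struct":
        fields = []
        for f in delta_type["fields"]:
            field_id = next(ids)  # assign the field's own ID before its children
            fields.append({
                "id": field_id,
                "name": f["name"],
                "required": not f["nullable"],
                "type": to_iceberg_type(f["type"], ids),
            })
        return {"type": "struct", "fields": fields}
    if kind == "array":
        element_id = next(ids)
        return {"type": "list", "element-id": element_id,
                "element": to_iceberg_type(delta_type["elementType"], ids),
                "element-required": not delta_type["containsNull"]}
    if kind == "map":
        key_id, value_id = next(ids), next(ids)
        return {"type": "map", "key-id": key_id,
                "key": to_iceberg_type(delta_type["keyType"], ids),
                "value-id": value_id,
                "value": to_iceberg_type(delta_type["valueType"], ids),
                "value-required": not delta_type["valueContainsNull"]}
    raise ValueError(f"unsupported Delta type: {kind}")


# Hypothetical Delta schema as it would appear in the Metadata action.
delta_schema = {"type": "struct", "fields": [
    {"name": "user_id", "type": "long", "nullable": False, "metadata": {}},
    {"name": "event_timestamp", "type": "timestamp", "nullable": True, "metadata": {}},
    {"name": "tags", "nullable": True, "metadata": {},
     "type": {"type": "array", "elementType": "string", "containsNull": True}},
]}
iceberg_schema = to_iceberg_type(delta_schema, itertools.count(1))
print(iceberg_schema)
```

Once every field, list element, and map key/value has its own ID, later renames or reorderings in Iceberg no longer depend on column names in the Parquet files.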

Partition specs require a different mapping. Delta Lake uses partition columns that are physically separate in the directory layout (e.g., year=2026/month=07/) or identity-based partition columns tracked in the log. Iceberg partition specs use transform-based definitions: identity(column), bucket(column, N), truncate(column, W), year(column), month(column), day(column), hour(column). For most Delta tables, partition columns map to identity transforms in Iceberg. Tables using Delta's generated columns for time-based partitioning — such as GENERATED ALWAYS AS (CAST(event_timestamp AS DATE)) — require manual spec definition because the generation logic does not translate automatically.
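For the common case, the identity mapping can be sketched as below. The helper name `identity_partition_spec` and the sample schema are assumptions for illustration; the output mirrors the JSON shape of an Iceberg partition spec, and generated-column partitions are deliberately not handled.

```python
def identity_partition_spec(partition_columns, iceberg_schema, spec_id=0):
    """Build an Iceberg-style partition spec with one identity transform
    per Delta partition column."""
    ids_by_name = {f["name"]: f["id"] for f in iceberg_schema["fields"]}
    fields = []
    for offset, column in enumerate(partition_columns):
        if column not in ids_by_name:
            raise KeyError(f"partition column {column!r} not in schema")
        fields.append({
            "source-id": ids_by_name[column],  # schema field the partition derives from
            "field-id": 1000 + offset,         # Iceberg partition field IDs start at 1000
            "name": column,
            "transform": "identity",
        })
    return {"spec-id": spec_id, "fields": fields}


# Hypothetical converted schema and Delta partitionColumns ["year", "month"].
schema = {"type": "struct", "fields": [
    {"id": 1, "name": "user_id", "required": True, "type": "long"},
    {"id": 2, "name": "year", "required": False, "type": "int"},
    {"id": 3, "name": "month", "required": False, "type": "int"},
]}
spec = identity_partition_spec(["year", "month"], schema)
print(spec)
```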

Step 3: Build the Iceberg metadata tree

Iceberg metadata is a tree of four layers: the table metadata file, manifest lists, manifest files, and the data files themselves. At the top, metadata.json stores the table schema, partition spec, sort order, properties, and a pointer to the current snapshot. Each snapshot points to a manifest list, a file that enumerates all manifest files for that snapshot. Each manifest file tracks a set of data files with their partition values, file sizes, record counts, column-level statistics (min, max, null count, NaN count), and sort order IDs.

The converter creates this entire tree from the Delta log state. Every active Add action in the Delta log becomes a data file entry in an Iceberg manifest. The partition values from each file's Delta partition metadata are encoded into the manifest entry. Column-level statistics — if present in the Delta checkpoint or log entries — are translated into Iceberg's lower_bounds, upper_bounds, null_value_counts, and value_counts fields. The manifests are grouped into a manifest list, which is referenced by a snapshot, which is referenced by the metadata.json file written to the table's metadata/ directory.

The result is a complete Iceberg metadata tree that describes exactly the same set of Parquet files the Delta log describes — same files, same partitions, same statistics. An Iceberg reader can now plan and execute queries against the table without any awareness that the data was originally written by Delta Lake.
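The per-file translation might look like the following sketch. It assumes the Delta `stats` field is the usual JSON string with `numRecords`, `minValues`, `maxValues`, and `nullCount`; the function name `add_action_to_data_file` and the sample paths are hypothetical. Bounds are kept as plain Python values here, whereas real manifests binary-encode them, and nested-column statistics are skipped for brevity.

```python
import json


def add_action_to_data_file(add, table_root, field_ids):
    """Translate one Delta Add action into an Iceberg-style data file entry."""
    stats = json.loads(add["stats"]) if add.get("stats") else {}

    def by_field_id(values):
        # Re-key column name -> value as field ID -> value; skip nested stats.
        return {field_ids[name]: v for name, v in values.items()
                if name in field_ids and not isinstance(v, dict)}

    return {
        "file_path": table_root.rstrip("/") + "/" + add["path"],
        "file_format": "PARQUET",
        "partition": dict(add.get("partitionValues", {})),
        # None means the log had no stats; a converter would need another source.
        "record_count": stats.get("numRecords"),
        "file_size_in_bytes": add["size"],
        "lower_bounds": by_field_id(stats.get("minValues", {})),
        "upper_bounds": by_field_id(stats.get("maxValues", {})),
        "null_value_counts": by_field_id(stats.get("nullCount", {})),
    }


# Hypothetical Add action for a single file in a date-partitioned table.
add = {
    "path": "event_date=2026-07-01/part-0001.parquet",
    "partitionValues": {"event_date": "2026-07-01"},
    "size": 1048576,
    "stats": json.dumps({"numRecords": 500,
                         "minValues": {"user_id": 3},
                         "maxValues": {"user_id": 977},
                         "nullCount": {"user_id": 0}}),
}
entry = add_action_to_data_file(add, "s3://data-lake-prod/events/", {"user_id": 1})
print(entry)
```

The entry points at the original Parquet file; nothing about the file itself is read or changed.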

Step 4: Handle deletions and removal actions

Delta Lake tracks file removals through Remove actions in the transaction log. When rows are deleted from a Delta table, the original data file is logically removed (via Remove) and a new data file containing the surviving rows is added (via Add). In newer Delta versions with deletion vectors enabled, the original file stays in place and a separate deletion vector file (a Roaring bitmap of deleted row positions) marks which rows to skip during reads.

For zero-copy conversion, Remove actions mean the converter must exclude those files from the Iceberg metadata. Only files that are currently active — present in an Add action and not subsequently Removed — appear in the Iceberg manifests. Deletion vectors require special handling: the converter must either translate them into Iceberg V3 deletion vectors (Roaring bitmaps stored in Puffin files associated with the data file) or exclude files with active deletion vectors and only include the fully compacted files. The updated Iceberg Delta module supports deletion vector conversion directly, mapping Delta's DV format to Iceberg V3's binary deletion vector representation.

The VACUUM scenario adds another complexity layer. When VACUUM runs on a Delta table, it physically deletes data files that have been logically removed for longer than the retention period. After VACUUM, the transaction log still references those removed files in historical log entries, but the checkpoint reflects only the surviving files. The converter must handle this gracefully — building the Iceberg state from the post-VACUUM checkpoint rather than replaying the full log history, which would reference files that no longer exist on storage.
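The deletion vector decision can be sketched as a small planning step. This is an illustration under assumptions, not the module's actual logic: the function names are hypothetical, and the only Delta detail relied on is that an Add action with a deletion vector carries a `deletionVector` descriptor including a `cardinality` (number of deleted rows).

```python
def split_by_deletion_vector(active_adds):
    """Separate active Delta files into those without and with a deletion vector."""
    clean, with_dv = [], []
    for add in active_adds:
        (with_dv if add.get("deletionVector") else clean).append(add)
    return clean, with_dv


def plan_conversion(active_adds, target_format_version):
    """Decide how each active file enters the Iceberg table."""
    clean, with_dv = split_by_deletion_vector(active_adds)
    if with_dv and target_format_version < 3:
        # Dropping the DVs would make deleted rows reappear, so refuse instead.
        paths = [a["path"] for a in with_dv]
        raise ValueError(f"compact before converting to V{target_format_version}: {paths}")
    return {
        "data_files": [a["path"] for a in clean + with_dv],
        # path -> number of deleted rows that must be carried over as an Iceberg DV
        "deletion_vectors": {a["path"]: a["deletionVector"]["cardinality"] for a in with_dv},
    }


# Hypothetical active file set: one clean file, one with 42 deleted rows.
adds = [
    {"path": "part-0001.parquet"},
    {"path": "part-0002.parquet", "deletionVector": {"storageType": "u", "cardinality": 42}},
]
plan = plan_conversion(adds, target_format_version=3)
print(plan)
```

Targeting format version 2 with the same input raises an error, which matches the rule that tables with active deletion vectors need compaction before a pre-V3 conversion.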

Mapping incompatible metadata layers

The architectural differences between Delta and Iceberg metadata are significant enough that the mapping is not a simple one-to-one translation. Understanding these differences explains why zero-copy tools have limitations and why some Delta features cannot be preserved through conversion.

Transaction log vs manifest tree

Delta Lake uses a linear, append-only transaction log. Each commit appends a new JSON file to _delta_log/. The current table state is the result of replaying the log from the last checkpoint forward — an inherently sequential operation. To find which files belong to the table, a reader must load the checkpoint (if one exists) and apply all subsequent log entries. The log is also the audit trail — every commit is a permanent record of who changed what and when.

Iceberg uses a tree structure. The metadata.json points to the current snapshot. The snapshot points to a manifest list. The manifest list enumerates manifest files. Manifest files list data files. This tree can be traversed in parallel — manifest files are independent and can be read concurrently during query planning. The tree structure also enables efficient metadata pruning: a query that touches one partition only needs to read the manifest files relevant to that partition, skipping manifests that track other partitions entirely.
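Manifest pruning can be illustrated with a toy model. The `ManifestSummary` class and file names below are hypothetical; the sketch assumes a single partition field whose values are ISO date strings (which compare correctly as text), and it stands in for the partition summaries that a manifest list keeps for each manifest.

```python
from dataclasses import dataclass


@dataclass
class ManifestSummary:
    path: str
    lower_bound: str  # smallest partition value tracked by this manifest
    upper_bound: str  # largest partition value tracked by this manifest


def manifests_to_read(manifest_list, partition_value):
    """Return only the manifests whose partition range can contain the value."""
    return [m.path for m in manifest_list
            if m.lower_bound <= partition_value <= m.upper_bound]


manifest_list = [
    ManifestSummary("metadata/m0.avro", "2026-06-01", "2026-06-30"),
    ManifestSummary("metadata/m1.avro", "2026-07-01", "2026-07-31"),
    ManifestSummary("metadata/m2.avro", "2026-08-01", "2026-08-31"),
]
selected = manifests_to_read(manifest_list, "2026-07-15")
print(selected)
```

A Delta reader answering the same question has to materialize the full file list from checkpoint plus log before it can filter by partition.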

When converting, the entire Delta log state collapses into a single Iceberg snapshot. Delta's commit-by-commit history does not map to Iceberg's snapshot model — the converter produces one snapshot representing the current table state, not a snapshot per Delta commit. Teams that rely on Delta's transaction history for audit or lineage purposes lose that history in the conversion. The Parquet data is preserved, but the operational metadata — who committed what and when — lives only in the Delta log.

Statistics and column metadata

Delta Lake stores file-level statistics in the transaction log as JSON fields within Add actions — typically minimum and maximum values for the first N columns (32 by default), plus null counts. These statistics are used for data skipping during query planning. Iceberg stores equivalent statistics in manifest files as binary-encoded lower_bounds and upper_bounds maps keyed by field ID, along with null_value_counts, nan_value_counts, value_counts, and column_sizes.

The converter maps Delta statistics to Iceberg statistics where they exist. However, Delta tables frequently have incomplete statistics — older files written before statistics were enabled, files where the statistics columns were not configured, or files where complex types lack statistics. Iceberg manifests will reflect whatever statistics the converter can extract, but gaps remain until the table is rewritten or Puffin statistics files are generated post-migration.

Partition handling edge cases

Most Delta partition columns translate cleanly to Iceberg identity transforms. But several patterns create friction. Delta's generated partition columns — where the partition value is computed from another column via an expression like CAST(timestamp AS DATE) — have no direct Iceberg equivalent because Iceberg partition transforms are predefined (identity, bucket, truncate, time-based) rather than arbitrary expressions. Delta tables using Liquid Clustering (dynamic file co-location without fixed partition columns) have no partition spec at all — the physical layout is an optimization detail rather than a metadata contract. These tables convert to unpartitioned Iceberg tables, which may need partition specs added post-migration.

Delta's column mapping mode (name or id) also affects conversion. When column mapping is enabled, Delta uses internal column IDs rather than column names to reference schema fields. The converter must resolve these IDs to produce an Iceberg schema where field IDs are consistent and name-to-id mappings are correct. The updated Iceberg Delta module handles column mapping conversion, but edge cases with deeply nested schemas or multiple schema evolution steps can produce ID conflicts that require manual resolution.

Iceberg V3 spec compatibility with Delta Lake tables

The Iceberg V3 specification, which reached production maturity with the Iceberg 1.11.0 release, introduces features that directly improve compatibility with modern Delta Lake tables.

Deletion vectors. Iceberg V3 uses binary deletion vectors — Roaring bitmaps that mark deleted row positions within a specific data file. This is architecturally identical to Delta Lake's deletion vectors, which also use Roaring bitmaps to track deleted rows without rewriting data files. The V3 deletion vector format stores bitmaps in Puffin files associated with individual data files, replacing the positional delete files from Iceberg V2 that required expensive join operations at read time. For zero-copy migration, this means Delta tables with active deletion vectors can be converted to Iceberg V3 with their DVs preserved — the converter translates the bitmap format and writes corresponding Puffin files. Without V3, tables with deletion vectors either need to be fully compacted before conversion (eliminating the DVs) or the DVs must be dropped (which means deleted rows reappear in query results until compaction runs).

Variant type. Iceberg V3 introduces a native variant type for semi-structured data — analogous to Snowflake's VARIANT or Delta's support for JSON columns. Delta tables with string columns containing JSON payloads can be migrated to Iceberg V3 and the column type upgraded to variant post-migration, enabling predicate pushdown into the semi-structured data without flattening. This is not handled automatically by converters today but represents a post-migration optimization path.

Row lineage. V3 adds row lineage tracking: each row carries a stable row ID and the sequence number of the commit that last modified it. This is not mapped from Delta's commit history during conversion, but it enables post-migration change tracking that Delta's log-based audit provided differently. New writes after conversion can leverage row lineage, while the provenance of historical data remains recorded only in the Delta log.

Default values and type widening. V3 supports default column values and extends the set of safe type promotions (for example, date to timestamp) beyond the int to long, float to double, and decimal precision widening that earlier spec versions already allowed. These features reduce post-migration friction: new columns can have defaults that older files do not need to provide, and more type changes become metadata-only operations instead of full rewrites.

The convergence between Delta and Iceberg V3 — particularly the shared deletion vector model and the IcebergCompatV3 writer feature in Delta 4.3 — means that Delta tables written with V3 compatibility enabled are already structured for clean Iceberg conversion. The delta.enableIcebergCompatV3 property constrains Delta writers to produce Parquet files that are natively compatible with Iceberg V3 metadata, including proper nested field IDs, column mapping, and DV encoding.

Practical tooling: the migration landscape in 2026

Three categories of tools perform Delta-to-Iceberg metadata conversion. Each makes different trade-offs between completeness, operational complexity, and ecosystem integration.

The Iceberg Delta migration module

Apache Iceberg includes a built-in module for Delta Lake migration — the snapshotDeltaLakeTable action and related procedures. The original implementation used the Delta Standalone library to parse Delta logs. Ongoing work (Apache Iceberg PR #15407) modernizes this module to use the Delta Kernel library, which supports modern Delta protocol versions (reader version 3, writer version 7), deletion vectors, VACUUM scenarios, and column mapping.

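```java
// Snapshot a Delta table into Iceberg using the iceberg-delta-lake migration module.
// icebergCatalog (an Iceberg Catalog) and hadoopConf (a Hadoop Configuration that can
// reach the Delta table's storage) are assumed to be created elsewhere.
import org.apache.iceberg.delta.DeltaLakeToIcebergMigrationActionsProvider;

DeltaLakeToIcebergMigrationActionsProvider.defaultActions()
    .snapshotDeltaLakeTable(deltaTablePath)
    .as(icebergTableIdentifier)
    .icebergCatalog(icebergCatalog)
    .deltaLakeConfiguration(hadoopConf)
    .tableProperty("format-version", "3")
    .execute();
```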

The snapshotDeltaLakeTable action reads all Delta transactions and converts them to a new Iceberg table in a single Iceberg transaction. The original Delta table remains unchanged — the Iceberg table references the same data files but is an independent entity with its own metadata. Because the Iceberg table does not own the data files exclusively, expire_snapshots is prohibited on the converted table (it would physically delete files the Delta table still references). Iceberg metadata-only operations — schema evolution, partition evolution, metadata deletes — are allowed.

The updated module supports INSERT, UPDATE, DELETE, and VACUUM scenarios from Delta, converts all primitive and complex data types, maps partition specs, and handles deletion vector translation to Iceberg V3. Known limitations include incomplete support for Delta's generated columns, column mapping edge cases with deep nesting, and the absence of incremental conversion (converting only changes since the last sync rather than the full table state).

Apache XTable

Apache XTable (formerly OneTable, donated to the Apache Software Foundation in 2024) is a standalone metadata translation framework that supports bidirectional conversion among Delta Lake, Apache Iceberg, and Apache Hudi. XTable is format-agnostic by design — it parses the source format's metadata, maps it to an internal logical schema, and generates the target format's metadata files.

For Delta-to-Iceberg specifically, XTable reads the Delta transaction log and checkpoint files, constructs an internal representation of the table state (schema, partitions, active files, statistics), and writes Iceberg metadata.json, manifest lists, and manifest files into the table's metadata/ directory. The Parquet data files are never touched. After XTable runs, both Delta and Iceberg metadata coexist at the same storage path — the table is simultaneously readable as Delta (from _delta_log/) and as Iceberg (from metadata/).

XTable is configured through a YAML file that specifies the source table path, source format, and target formats. A single xtable-utilities sync command performs the conversion.

```yaml
sourceFormat: DELTA
targetFormats:
  - ICEBERG
datasets:
  - tableBasePath: s3://data-lake-prod/events/
    tableName: events
    namespace: prod_db
```

XTable's strength is simplicity — it runs as a standalone process without Spark, produces metadata alongside existing data, and supports continuous sync (re-running to pick up new Delta commits). Its limitations mirror the broader zero-copy challenge: no support for Delta identity columns, generated columns, or Liquid Clustering; Delta DML operation history does not map to Iceberg snapshots; complex Delta features (bloom filters, change data feed) do not translate. XTable also does not support Delta deletion vectors as of mid-2026, which means tables with active DVs must be compacted before conversion.

Delta UniForm

Delta UniForm is not a migration tool — it is a coexistence mechanism. When enabled on a Delta table, UniForm asynchronously generates Iceberg metadata after each Delta commit, allowing Iceberg clients to read the table without any separate conversion step. The Parquet files are shared; only Iceberg metadata is generated incrementally.

UniForm is the right tool when you are staying on Delta as the primary write format but need Iceberg read access for external engines. It is not a migration path to native Iceberg — the table remains Delta, writes must go through Delta, and the Iceberg metadata is read-only for external consumers. Delta 4.3 advances UniForm significantly: conversion is now atomic and incremental (regenerating only the changed log range rather than the full snapshot), and the experimental IcebergCompatV3 mode allows UniForm to coexist with deletion vectors on the same table.

For teams whose migration goal is full Iceberg adoption — where external engines write to Iceberg tables, where the catalog of record is Iceberg-native (Glue, Polaris, Nessie), and where Delta Lake is fully retired — UniForm is a stepping stone, not a destination. Use it during the transition period while Delta pipelines still run, then cut over to native Iceberg when the conversion tooling (Iceberg Delta module or XTable) handles your table features.

LakeOps product walkthrough — connecting catalogs, running health analysis, and autonomous optimization for Iceberg tables after migration.

Snowflake Delta Direct

Snowflake offers a proprietary path through Delta Direct — creating external Iceberg tables that generate Iceberg metadata over Delta storage locations. By configuring an external volume with write permissions and creating a table referencing the Delta path, Snowflake infers the schema from the Delta log and produces Iceberg metadata side-by-side with existing Delta files. This approach is tightly coupled to the Snowflake ecosystem and is most relevant for teams whose primary analytics engine is Snowflake and whose migration goal is Iceberg-on-Snowflake rather than open multi-engine Iceberg.

What stays the same vs what changes

Understanding what zero-copy migration preserves and what it replaces is critical for setting post-migration expectations.

What stays the same. The Parquet data files are completely untouched. Every row, every column, every row group, every page, every compression block, every Parquet footer — identical before and after conversion. File sizes do not change. File locations do not change. Storage costs do not change. If your Parquet files were written with Snappy compression, they remain Snappy-compressed. If they were sorted by user_id, they remain sorted by user_id. If they contain Parquet row group statistics, those statistics are still in the file footers. The physical data layer is a passthrough.

What changes. The metadata layer is entirely new. Instead of _delta_log/ with JSON commit files and Parquet checkpoints, there is a metadata/ directory with v1.metadata.json, manifest lists, and manifest files. The schema representation changes from Delta's JSON StructType to Iceberg's ID-based schema. Partition definitions change from Delta's column-based partitions to Iceberg's transform-based partition specs. Transaction history changes from Delta's linear commit log to Iceberg's snapshot tree. File-level statistics are re-encoded from Delta's JSON format into Iceberg's binary manifest format.

What is lost. Delta commit history — the sequence of individual commits with their timestamps, operation types, and notebook/job IDs — does not survive conversion. The Iceberg table starts with a single snapshot representing the current state. Delta-specific features that have no Iceberg equivalent — identity columns, generated columns, change data feed, column invariants — are dropped during conversion. If your pipelines depend on these features, zero-copy migration is not viable for those tables; CTAS with feature refactoring is the only option.

What is inherited (and problematic). Zero-copy migration preserves the existing file layout exactly — including its problems. Small files from streaming ingestion remain small. Oversized files from batch dumps remain oversized. Sort order is whatever the Delta writer chose, which may not match the query patterns of Iceberg-native engines like Trino that were not part of the original workload. Statistics gaps in old files persist. Partition granularity that was optimal for Databricks may not be optimal for DuckDB or Athena. The migration makes the table Iceberg-readable, but it does not make the table Iceberg-optimized.

Post-migration operations: what to do after conversion

The moment zero-copy migration completes, the table is technically an Iceberg table. It has valid metadata, it is queryable from any Iceberg-compatible engine, and it references the same data it always did. But a technically valid Iceberg table and a well-performing Iceberg table are different things. The post-migration operational sequence determines whether the converted table delivers the multi-engine performance teams migrated for.

LakeOps Dashboard — lake-wide health during migration
The LakeOps Dashboard during a Delta-to-Iceberg migration: 30-day optimization activity, cost savings, and Critical / Warning / Healthy tiers across every catalog — surfacing which freshly converted tables need immediate attention.

Compaction

Streaming-ingested Delta tables commonly have millions of small files — 5 MB, 10 MB, sometimes under 1 MB each. After conversion, every one of those small files appears in the Iceberg manifests. Query planning time degrades linearly with manifest size, and scan performance suffers because each small file requires a separate read operation with its own connection overhead, metadata parsing, and predicate evaluation. Compaction merges small files into larger ones (typically targeting 256 MB to 512 MB) while preserving the data content.

Iceberg compaction runs via rewrite_data_files in Spark or through dedicated compaction engines. The operation reads the small files, merges their contents, writes new larger files, and updates the metadata to reference the new files instead of the originals. This is a data rewrite — but it is a selective rewrite of the files that need it, not a full table CTAS. A table where 80% of the data is in well-sized files and 20% is in small files only rewrites the 20%.
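The selective nature of compaction can be sketched as a simple bin-packing plan. This is an illustrative greedy grouping, not the algorithm that rewrite_data_files uses; the function name `plan_compaction`, the `small_fraction` threshold, and the sample sizes are assumptions.

```python
def plan_compaction(file_sizes, target_bytes, small_fraction=0.75):
    """Group small files into rewrite bins of at most target_bytes each.

    Files at or above small_fraction * target_bytes are considered
    well-sized and are left out of the plan entirely.
    """
    threshold = small_fraction * target_bytes
    small = sorted((size, path) for path, size in file_sizes.items() if size < threshold)
    bins, current, current_size = [], [], 0
    for size, path in small:
        if current and current_size + size > target_bytes:
            bins.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        bins.append(current)
    # Rewriting a bin that holds a single file gains nothing, so drop those.
    return [b for b in bins if len(b) > 1]


MB = 1024 * 1024
# Hypothetical table: four small files plus one file that is already well-sized.
sizes = {"a": 5 * MB, "b": 10 * MB, "c": 300 * MB, "d": 400 * MB, "e": 1 * MB}
bins = plan_compaction(sizes, target_bytes=512 * MB)
print(bins)
```

In this example only the four small files are scheduled for rewrite; the 400 MB file is never read.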

LakeOps Optimization
Per-table compaction strategy, file sizes, and expiration policies — the kind of table-level configuration that freshly converted Delta tables need immediately after migration.

For tables being queried from Trino, DuckDB, or Athena immediately after migration, compaction is not optional. These engines do not have Databricks' internal optimizations for handling massive file counts, and planning queries against a table with millions of manifest entries can take minutes. Compaction needs to run before the first production query hits the converted table.

Sort order optimization

Delta tables are written with whatever sort order the original Spark writer used: often no explicit sort at all, or a Z-order that was optimized for Databricks SQL query patterns. After migration, the query engines accessing the table may have completely different access patterns. Trino queries filter on different columns than Databricks notebooks. DuckDB ad hoc queries apply different predicates than Spark batch jobs.

Defining a sort order on an Iceberg table affects only future writes; existing files are reordered when a sort-based rewrite (for example, rewrite_data_files with the sort strategy) runs over them, so that rows are physically ordered by the specified columns. Sorting lets min/max column statistics produce tight, mostly non-overlapping ranges, which in turn enables aggressive data skipping during query planning. On a table sorted by event_date, user_id, a query that filters on a narrow event_date range can skip the large majority of data files during planning (often 95% or more), whereas on an unsorted table most files' min/max ranges overlap the filter and few files can be pruned.
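The effect of sorting on min/max pruning can be reproduced with a toy model. The sketch below is an illustration under stated assumptions, not engine code: it represents event_date as an integer day number, splits rows into fixed-size "files" in write order, and counts how many files have a min/max range that overlaps a single-day filter. All function names and parameters are hypothetical.

```python
import random


def file_ranges(values, rows_per_file):
    """Split values into files in write order; return each file's (min, max)."""
    chunks = (
        values[i : i + rows_per_file]
        for i in range(0, len(values), rows_per_file)
    )
    return [(min(chunk), max(chunk)) for chunk in chunks]


def files_to_scan(ranges, lo, hi):
    """Count files whose min/max range overlaps the filter lo <= value <= hi."""
    return sum(1 for fmin, fmax in ranges if fmax >= lo and fmin <= hi)


def compare_layouts(num_rows=100_000, rows_per_file=1_000, days=100, seed=7):
    """Return (files scanned unsorted, files scanned sorted, total files)."""
    rng = random.Random(seed)
    event_day = [rng.randrange(days) for _ in range(num_rows)]
    unsorted_ranges = file_ranges(event_day, rows_per_file)
    sorted_ranges = file_ranges(sorted(event_day), rows_per_file)
    # Equivalent of: WHERE event_date = <day 42>
    return (
        files_to_scan(unsorted_ranges, 42, 42),
        files_to_scan(sorted_ranges, 42, 42),
        len(sorted_ranges),
    )


if __name__ == "__main__":
    unsorted_hits, sorted_hits, total = compare_layouts()
    print(f"unsorted: scan {unsorted_hits} of {total} files")
    print(f"sorted:   scan {sorted_hits} of {total} files")
```

In the unsorted layout nearly every file spans the whole date range, so the filter cannot eliminate it; in the sorted layout only the handful of files that actually contain the requested day remain. Real tables add partitioning and multi-column sorts on top, but the mechanism is the same.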

Choosing the right sort order requires understanding the actual query patterns across all engines that will access the table — not just the patterns from the Databricks era. LakeOps Layout Simulations replay production SQL from multiple engines against candidate sort strategies before any data is rewritten, ensuring the first sort pass produces the optimal layout for the actual multi-engine workload. See Iceberg table health maintenance for a deep dive on how sort strategy interacts with compaction scheduling.

Statistics generation

Parquet files carry row-group-level statistics in their footers: min, max, and null count per column per row group. Iceberg manifests store file-level statistics (aggregated across all row groups). But the richest statistics layer does not exist after zero-copy migration: Puffin files, whose standardized statistics payload today is NDV (number of distinct values) sketches. These statistics let query engines make better join ordering decisions and estimate selectivity from actual cardinality rather than guesses.

Generating Puffin statistics files does not modify the underlying data: the job reads column data to compute NDV sketches, then writes the results as Puffin blobs associated with a table snapshot. In Spark, the compute_table_stats procedure in recent Iceberg releases produces these files, and some engines write them as part of their own ANALYZE command.
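As a rough sketch of both halves of this step, the Python below builds a statistics call and shows why NDV matters to a planner. It assumes an Iceberg release whose Spark integration ships the `compute_table_stats` procedure with `table` and `columns` arguments; check your version before relying on it. The join estimate uses the classic textbook formula under a uniform-distribution assumption, and real optimizers differ in detail. The helper names are hypothetical.

```python
def compute_table_stats_sql(catalog, table, columns=None):
    """Build a Spark SQL CALL that computes NDV statistics for a table.

    Assumes the compute_table_stats procedure is available in the
    Iceberg version in use.
    """
    args = [f"table => '{table}'"]
    if columns:
        quoted = ", ".join(f"'{c}'" for c in columns)
        args.append(f"columns => array({quoted})")
    return f"CALL {catalog}.system.compute_table_stats({', '.join(args)})"


def estimate_join_rows(left_rows, left_ndv, right_rows, right_ndv):
    """Textbook equi-join cardinality estimate: |L| * |R| / max(ndv_L, ndv_R)."""
    return left_rows * right_rows / max(left_ndv, right_ndv)


if __name__ == "__main__":
    print(compute_table_stats_sql("my_catalog", "db.events", ["user_id"]))
    # Without NDV a planner has to guess; with it the estimate is grounded.
    print(estimate_join_rows(1_000_000, 50_000, 200_000, 50_000))
```

Restricting the `columns` argument to join and filter keys keeps the statistics job cheap on wide tables, since every listed column has to be read in full.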

Snapshot and manifest management

After zero-copy conversion, the Iceberg table starts with a single snapshot. As new writes land from Iceberg-native engines — and as compaction, sort rewrite, and statistics operations produce new snapshots — the snapshot count grows. Without explicit retention policies, old snapshots accumulate indefinitely, consuming storage for metadata and slowing query planning as engines must navigate the snapshot history.

Snapshot expiration removes old snapshots together with the files that only those snapshots reference: manifest lists, manifest files, and data files that no remaining snapshot points to, which after compaction includes the original small files that were replaced. Orphan file cleanup removes files under the table location that no table metadata references at all, such as leftovers from failed writes. Manifest rewriting consolidates fragmented manifests that accumulate from many small commits into fewer, larger manifests that are faster to parse during query planning.

These operations should run in a fixed order: expire snapshots first, then clean up orphan files, then rewrite manifests. Orphan cleanup also needs a conservative age threshold, because files written by commits that are still in flight are not yet referenced by any snapshot and look like orphans until the commit lands. For the migration specifically, orphan cleanup must exclude Delta log files and any shared infrastructure files that the Delta table (if it is still running in parallel) needs.
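A minimal Python sketch of that sequence follows, under several assumptions: the retention windows are placeholder values, the catalog and table names are supplied by the caller, and the `_delta_log` filter is a hypothetical helper applied to the paths reported by a dry run. The procedures named (`expire_snapshots`, `remove_orphan_files`, `rewrite_manifests`) are Iceberg's Spark procedures. Orphan cleanup is generated with `dry_run => true` so that nothing is deleted before the candidate list has been filtered.

```python
from datetime import datetime, timedelta, timezone


def _ts(now, days_ago):
    """Format a cutoff as a Spark SQL TIMESTAMP literal."""
    cutoff = now - timedelta(days=days_ago)
    return f"TIMESTAMP '{cutoff.strftime('%Y-%m-%d %H:%M:%S')}'"


def maintenance_calls(catalog, table, now, snapshot_days=7, orphan_days=3, keep_last=5):
    """Return the maintenance statements in the order they should run."""
    return [
        f"CALL {catalog}.system.expire_snapshots(table => '{table}', "
        f"older_than => {_ts(now, snapshot_days)}, retain_last => {keep_last})",
        # Dry run only: list orphan candidates without deleting anything.
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}', "
        f"older_than => {_ts(now, orphan_days)}, dry_run => true)",
        f"CALL {catalog}.system.rewrite_manifests(table => '{table}')",
    ]


def safe_to_delete(orphan_paths, protected_markers=("/_delta_log/",)):
    """Drop candidates that belong to a Delta table still running in parallel."""
    return [
        p for p in orphan_paths
        if not any(marker in p for marker in protected_markers)
    ]


if __name__ == "__main__":
    now = datetime(2026, 1, 10, tzinfo=timezone.utc)
    for statement in maintenance_calls("my_catalog", "db.events", now):
        print(statement)
    candidates = [
        "s3://bucket/events/_delta_log/00000000000000000042.json",
        "s3://bucket/events/data/part-failed-write.parquet",
    ]
    print(safe_to_delete(candidates))
```

Actually deleting the filtered candidates is left to the caller. Keeping the dry run and the deletion as separate steps means the Delta log can never be removed by a default-configured cleanup job.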

The operational gap

Every one of these post-migration operations — compaction, sort optimization, statistics generation, snapshot expiration, orphan cleanup, manifest rewriting — requires scheduling, monitoring, conflict avoidance (compaction must not collide with streaming writes), and policy management across tables, namespaces, and catalogs. Teams migrating from Databricks lose Predictive Optimization — the automated maintenance layer that handled compaction, VACUUM, and ANALYZE for Delta tables inside the Databricks ecosystem. Nothing replaces it automatically on the Iceberg side.

LakeOps Table Monitoring
Table health classification with file counts, manifests, snapshots, and partitions — surfacing exactly which converted tables need compaction, sort optimization, or manifest cleanup.

This is where LakeOps closes the gap. The moment a freshly converted Iceberg table is connected, LakeOps classifies its health — surfacing small-file proliferation, missing statistics, absent sort orders, and snapshot accumulation as actionable insights. The autonomous maintenance pipeline then runs the full optimization sequence: compaction with query-aware sort, statistics generation, snapshot expiration, orphan cleanup, and manifest rewriting — scheduled, conflict-aware, and sequenced correctly. The operational discipline that Predictive Optimization provided inside Databricks extends to every table across every engine and catalog in the estate. See Iceberg migration strategy for the complete post-migration operational playbook, and managed Iceberg for how LakeOps maintains table health autonomously at scale.

Migration decision framework

Not every Delta table should be migrated with zero-copy. The approach has clear strengths and clear boundaries.

Use zero-copy when the table is large enough that CTAS would take hours or cost thousands of dollars in compute; when the existing file layout is acceptable (or when you plan to optimize incrementally after conversion); when the table uses standard Delta features (Parquet data files with standard types, partition columns, and file-level statistics) without heavy reliance on Delta-specific features like identity columns, generated columns, or change data feed; when you need the migration to complete in minutes rather than hours; and when you cannot afford a write-freeze window on production tables.

Use CTAS when the table is small enough that a full rewrite is fast and cheap; when you want to change the partition strategy during migration; when the existing file layout is problematic (millions of tiny files, wrong sort order, wrong compression codec); when the table uses Delta features that zero-copy tools cannot translate; or when you want optimal Iceberg performance from day one without relying on post-migration compaction.

Use UniForm when you are not migrating away from Delta at all: you are staying on Delta as the write format and adding Iceberg read access for external engines. It fits when Databricks remains the primary compute platform and the goal is read interoperability rather than format conversion, and when you want zero operational effort on the conversion side and accept the trade-off of read-only Iceberg access.

For most production estates, the answer is a combination. Large tables that are straightforward candidates get zero-copy conversion followed by LakeOps-managed post-migration optimization. Small tables or tables needing layout changes get CTAS. Tables that must stay on Delta for now get UniForm. The migration is a table-by-table decision, not a one-size-fits-all switch.
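The framework above can be sketched as a small Python function applied per table. Everything here is illustrative: the `TableProfile` fields, the 1 TB cutoff for "large", and the order in which the rules are checked are assumptions chosen for the example, not thresholds from any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableProfile:
    # Hypothetical per-table inventory record.
    size_gb: float
    staying_on_delta: bool = False
    needs_repartition: bool = False
    layout_is_problematic: bool = False
    will_optimize_after_conversion: bool = True
    uses_untranslatable_features: bool = False


def choose_strategy(t, large_table_gb=1000.0):
    """Return 'uniform', 'ctas', or 'zero-copy' for one table."""
    if t.staying_on_delta:
        # Delta stays the write format; Iceberg is a read-only view.
        return "uniform"
    if t.uses_untranslatable_features or t.needs_repartition:
        # Metadata translation cannot express these changes.
        return "ctas"
    if t.layout_is_problematic and not t.will_optimize_after_conversion:
        return "ctas"
    if t.size_gb >= large_table_gb:
        # A full rewrite would be slow and expensive.
        return "zero-copy"
    return "ctas"


if __name__ == "__main__":
    inventory = {
        "events": TableProfile(size_gb=50_000),
        "dim_country": TableProfile(size_gb=0.2),
        "ml_features": TableProfile(size_gb=8_000, staying_on_delta=True),
    }
    for name, profile in inventory.items():
        print(name, "->", choose_strategy(profile))
```

Running a function like this over a catalog inventory turns the table-by-table decision into a reviewable plan before any conversion starts.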

Summary

Zero-copy migration from Delta Lake to Apache Iceberg works because the two formats share the same physical foundation — Parquet data files on object storage. The metadata translation is the migration: reading Delta's transaction log and producing Iceberg's manifest tree, without touching a single data file. The tooling in 2026 — the updated Iceberg Delta module with Delta Kernel support, Apache XTable for standalone conversion, and UniForm for coexistence — makes this translation reliable for the majority of production Delta tables, including those using deletion vectors and modern protocol versions.

But the migration is the easy half. Converting metadata takes minutes. Operating the resulting Iceberg tables — compaction, sort optimization, statistics generation, snapshot management, manifest hygiene — takes continuous discipline that most teams do not plan for. The operational gap left by Predictive Optimization when tables leave the Databricks ecosystem is real and measurable. LakeOps fills that gap with autonomous table health management from the moment conversion completes, ensuring that freshly migrated tables deliver the multi-engine performance that motivated the migration in the first place.
