Apache Iceberg Migration Strategy: From Hive, Parquet, or Delta to Production Iceberg

Every data platform team hits the same inflection point. The Hive tables that anchored the warehouse for five years are surrounded by a growing constellation of engines — Trino, Athena, Flink, DuckDB, Snowflake — that cannot read them efficiently or write back safely. The raw Parquet files dumped to S3 by a dozen microservices have no schema enforcement, no ACID guarantees, and no partition evolution. The Delta Lake tables that Databricks manages are invisible to non-Databricks engines without UniForm workarounds that only support read-only access. The convergence across the industry in 2026 is clear: Apache Iceberg is becoming the default open table format that decouples storage from compute and gives every engine a single, governed, transactional view of the same data.

Migrating to Iceberg is the easy half. Getting data into the format takes days — a bounded project with known tooling and documented procedures. Keeping those tables healthy, performant, and cost-efficient at scale takes continuous operational effort that most teams do not plan for. Compaction stops being automatic, snapshot retention requires explicit policy, manifest files drift as engines write in conflicting patterns, and nobody owns the operational loop that Hive or Databricks used to hide behind their native optimizers. Teams spend the first month migrating tables and the next two years fighting table maintenance. LakeOps is the autonomous control plane that makes life after migration manageable — from initial post-conversion optimization through ongoing autonomous maintenance. This guide covers the migration itself and the operational reality that follows.

Why teams migrate to Iceberg

The migration decision is rarely about a single capability. It is a compound of pressures that accumulate until the cost of staying on legacy formats exceeds the cost of moving.

Multi-engine access. Iceberg tables are readable and writable from Spark, Trino, Flink, Athena, DuckDB, Snowflake, Dremio, and StarRocks — all from the same physical files. Hive tables are locked to the Hive ecosystem. Delta Lake tables are readable from non-Databricks engines only through UniForm (read-only) or full rewrites. Raw Parquet has no metadata layer for concurrent writes or schema enforcement. Iceberg removes the engine lock-in that forces teams to copy data between systems.

ACID transactions. Every write to an Iceberg table is atomic and isolated. Concurrent writers coordinate through optimistic concurrency on metadata commits. Hive tables have no transactional guarantees — a failed write can leave partitions in an inconsistent state. Raw Parquet has no commit model at all. ACID semantics mean pipelines do not produce partial results and readers never see uncommitted data.

Schema evolution. Iceberg tracks schema changes as metadata operations — add columns, rename columns, widen types, reorder fields — without rewriting data files. Hive schema evolution requires ALTER TABLE statements that do not guarantee backward compatibility. Parquet schema changes mean new files with different schemas and readers that must reconcile differences at query time. Iceberg schema evolution is safe, backward-compatible, and tracked in the metadata history.

Time travel and audit. Every commit creates a snapshot. Query any previous version by snapshot ID or timestamp. Roll back a bad write without restoring from backup. Audit who changed what and when through the snapshot history. Hive has no native time travel. Parquet directories have no versioning. Delta Lake has time travel, but only within the Databricks ecosystem unless engines support the Delta protocol directly.

Cost reduction. Open Iceberg on object storage eliminates proprietary storage fees. Route workloads to the cheapest engine that meets latency requirements — DuckDB for ad hoc, Trino for interactive, Athena for serverless, Spark for batch — instead of paying warehouse credits for every query. Teams migrating from Snowflake or Databricks to multi-engine Iceberg report 40–70% compute cost reductions on workloads that were previously locked to a single expensive engine.

Partition evolution. Fix partitioning mistakes without rewriting the entire table. Iceberg partition specs evolve independently of the data — old files retain their original partitioning while new writes use the updated spec. Hive partition changes require full table rewrites. This alone justifies migration for teams stuck with year/month/day partitioning on tables that should use hourly or identity-based schemes.

Three source patterns

Before choosing a migration approach, identify which source format you are migrating from. Each has different constraints, available tooling, and optimal migration paths.

Source 1: Hive/HMS tables

Hive tables managed by a Hive Metastore (HMS) are the most common migration source in enterprise environments. These tables store data as Parquet or ORC files on HDFS or S3, with partition metadata tracked in a MySQL or PostgreSQL-backed metastore. The Hive ecosystem provides no ACID guarantees (unless ACID Hive tables with ORC are used), no snapshot history, and no schema evolution beyond ALTER TABLE statements that are not always backward-compatible.

Characteristics that affect migration strategy:

Partitions are physically manifested as directory paths (year=2025/month=06/day=15/).
File sizes vary wildly — streaming ingestion produces millions of small files; batch jobs produce oversized files.
Schema drift is common — columns added over time may not exist in older partitions.
Statistics are stored externally in the metastore and are often stale or missing.
Multiple tables may share the same underlying storage paths through views or external table definitions.

Source 2: Raw Parquet on S3

Raw Parquet files dumped to S3 by ETL pipelines, microservices, or data producers that do not use a table format at all. These files have no metastore registration, no transaction log, no partition metadata beyond directory naming conventions, and no schema enforcement across files written by different producers.

Characteristics that affect migration strategy:

No central schema — each file carries its own Parquet schema header and files written at different times may have different column sets.
Partition structure (if any) depends on the directory layout chosen by whoever wrote the files.
No file-level statistics beyond Parquet row group metadata.
No snapshot history — the current state is whatever files exist at a given S3 prefix.
File sizes depend entirely on the producing application — commonly either too small (streaming) or too large (daily batch dumps).

Source 3: Delta Lake tables

Delta Lake tables managed by Databricks or open-source Delta. These tables already have ACID semantics, schema enforcement, and time travel — the migration motivation is typically multi-engine access rather than capability gaps. Trino, Athena, Snowflake, and Flink do not natively read Delta unless UniForm or a full conversion provides Iceberg metadata.

Characteristics that affect migration strategy:

Transaction log (_delta_log/) tracks all commits, schema changes, and file additions/removals.
Data files are Parquet with Delta-specific metadata columns.
Liquid Clustering or traditional partitioning determines physical layout.
Deletion vectors (in newer Delta versions) create files that standard Parquet readers cannot interpret correctly.
Unity Catalog or Hive Metastore may govern access control that must be preserved post-migration.

Migration approach 1: In-place migration (metadata-only)

In-place migration creates Iceberg metadata over existing data files without copying or rewriting them. The Parquet (or ORC) files stay exactly where they are. Only new metadata files — metadata.json, manifest lists, and manifest files — are created to reference the existing data. This is the fastest migration path and the one with the lowest risk for large tables where a full rewrite would take hours or days.

When to use in-place migration

Use this approach when:

The existing file layout is acceptable (file sizes are reasonable, partition scheme works for your query patterns).
You cannot afford downtime or the compute cost of a full data rewrite.
The table is large enough that rewriting would take hours and create operational risk.
You plan to optimize file layout incrementally after migration using compaction.

The migrate procedure

The Spark Iceberg migrate procedure converts a Hive table to Iceberg by scanning its registered files and creating the full Iceberg metadata tree that references them. The original data files are untouched.

sql

1-- Migrate a Hive table to Iceberg in-place2CALL spark_catalog.system.migrate('analytics_db.page_views');

What migrate does internally:

1.Reads the existing Hive table metadata and partition information from HMS.
2.Creates Iceberg metadata files (metadata.json, manifest lists, manifest files) that reference the existing data files.
3.Updates the Hive Metastore entry to point to the Iceberg table format.
4.Preserves all existing data files in their current locations — zero data movement.
5.Inherits the existing partition spec from the Hive table definition.

For tables where you want to validate before committing, use snapshot instead — it creates a new Iceberg table that references the same files without modifying the original Hive table:

sql

1-- Create an Iceberg snapshot without modifying the source Hive table2CALL spark_catalog.system.snapshot('analytics_db.page_views', 'iceberg_db.page_views');

AWS teams running Glue should note that the Glue Data Catalog does not support the migrate procedure directly. The proven workaround is to use a Hive Metastore on an EMR cluster — run migrate against the HMS, then use register_table to surface the Iceberg table in the Glue catalog. This two-step approach is what Natural Intelligence used to migrate thousands of production tables on AWS without downtime.

The add_files procedure

When you need to register existing Parquet files into an Iceberg table without a Hive Metastore registration — or when you want to import files from a specific partition rather than the entire table — use add_files:

sql

1-- Create the target Iceberg table first2CREATE TABLE iceberg_catalog.analytics_db.web_logs (3  event_id STRING,4  user_id STRING,5  event_type STRING,6  event_timestamp TIMESTAMP,7  properties MAP<STRING, STRING>8) USING iceberg9PARTITIONED BY (days(event_timestamp));10 11-- Register existing Parquet files from S312CALL iceberg_catalog.system.add_files(13  table => 'analytics_db.web_logs',14  source_table => '`parquet`.`s3://data-lake-prod/web_logs/`'15);

For Hive-partitioned directory structures, add_files automatically discovers partition values from the path layout. You can also import specific partitions using a partition filter:

sql

1-- Import only 2026 data from a Hive-partitioned layout2CALL iceberg_catalog.system.add_files(3  table => 'analytics_db.web_logs',4  source_table => '`parquet`.`s3://data-lake-prod/web_logs/`',5  partition_filter => map('year', '2026')6);

Critical warning: add_files does not validate that the source file schemas match the target Iceberg table schema. If source files have different column types, missing columns, or additional columns, the metadata will register them anyway — and queries will fail or return incorrect results at read time. Always validate schema compatibility before running add_files on heterogeneous file sets.

In-place migration limitations

In-place migration preserves the existing file layout — which means it also preserves existing problems:

Small files remain small. If your Hive table has millions of 5 MB files from streaming ingestion, the Iceberg version inherits them. Plan for immediate compaction.
Sort order is whatever the original writer chose. Predicate pushdown and min/max pruning will not be effective until data is rewritten with an intentional sort.
Statistics are limited to what Parquet row groups provide. Column-level statistics for efficient planning require running rewrite_data_files or generating Puffin statistics files.
Partition specs are inherited as-is. Evolution can be applied after migration without rewriting data, but old files retain the original partitioning.

This is where the post-migration gap begins. In-place migration gets you to Iceberg format in minutes, but the resulting table typically needs immediate compaction, sort order application, and statistics generation before it performs well. LakeOps handles all of this autonomously — the moment tables are connected, it classifies their health and surfaces what needs fixing first, then runs the full optimization sequence without manual intervention.

Migration approach 2: CTAS (full rewrite with optimized layout)

A full rewrite reads all data from the source and writes it as a new Iceberg table with explicit partition specs, sort orders, and target file sizes. This produces the optimal physical layout from day one — at the cost of compute time, temporary storage, and a cutover window.

When to use CTAS

Use this approach when:

The existing file layout is problematic (millions of small files, wrong partition scheme, no sort order).
You want to change the partition strategy during migration (e.g., from year/month/day Hive partitions to Iceberg days(ts) hidden partitions).
The table is small enough that a full rewrite completes in an acceptable window.
You want optimal query performance from day one without relying on post-migration compaction.
You are migrating from a non-Parquet format (ORC, Avro) and want uniform Parquet with consistent compression.

CTAS from Hive

sql

1-- Full rewrite from Hive to Iceberg with new partition spec and sort order2CREATE TABLE iceberg_catalog.analytics_db.page_views3USING iceberg4PARTITIONED BY (days(event_timestamp))5TBLPROPERTIES (6  'write.target-file-size-bytes' = '536870912',7  'write.parquet.compression-codec' = 'zstd'8)9AS SELECT * FROM hive_catalog.analytics_db.page_views;10 11-- Apply sort order after creation12ALTER TABLE iceberg_catalog.analytics_db.page_views13WRITE ORDERED BY user_id, event_timestamp;

CTAS from Delta Lake

sql

1-- Read from Delta via Spark, write to Iceberg with optimized layout2CREATE TABLE iceberg_catalog.lakehouse.user_events3USING iceberg4PARTITIONED BY (days(created_at))5TBLPROPERTIES (6  'write.target-file-size-bytes' = '536870912',7  'write.metadata.previous-versions-max' = '10',8  'write.parquet.compression-codec' = 'zstd'9)10AS SELECT * FROM delta.`s3://databricks-prod/lakehouse/user_events`;

CTAS from raw Parquet

sql

1-- Register raw Parquet as a temporary view and rewrite to Iceberg2CREATE TEMPORARY VIEW raw_events AS3SELECT * FROM parquet.`s3://data-lake/raw/events/`;4 5CREATE TABLE iceberg_catalog.events_db.raw_events6USING iceberg7PARTITIONED BY (days(event_time))8TBLPROPERTIES (9  'write.target-file-size-bytes' = '536870912',10  'write.distribution-mode' = 'hash'11)12AS SELECT13  event_id,14  user_id,15  event_type,16  CAST(event_time AS TIMESTAMP) AS event_time,17  payload18FROM raw_events;

CTAS considerations

Full rewrites produce ideal layouts but come with operational costs:

Temporary double storage. Both source and target exist until you validate and drop the source.
Compute cost. Reading and writing terabytes requires substantial Spark cluster time.
Cutover window. Data written to the source after CTAS begins will not appear in the target. Plan for a freeze window or a delta sync after CTAS completes.
Partition transform changes. Changing from Hive-style year=X/month=Y/day=Z to Iceberg hidden partitions like days(ts) is only possible through CTAS — migrate preserves the original spec.

Migration approach 3: Shadow migration (dual-write, gradual cutover)

Shadow migration builds the Iceberg table alongside the existing source, runs both in parallel, validates parity, and then cuts consumers from old to new. This is the production-safe approach for critical tables where any data loss or inconsistency is unacceptable.

When to use shadow migration

Use this approach when:

The table serves production workloads that cannot tolerate downtime.
You need to validate Iceberg query results against the existing source before cutting over.
Multiple teams consume the table and cannot all switch simultaneously.
The table receives continuous writes and a freeze window is not possible.
You need a rollback path — the ability to revert to the original source if issues appear.

Shadow migration phases

Phase 1: Dual-write (30 days recommended). Configure ingestion pipelines to write to both the existing source (Hive, Delta, or Parquet) and the new Iceberg table simultaneously. Reads continue from the original source. Daily reconciliation jobs compare row counts, sums of numeric columns, and count(distinct primary_key). Set a divergence threshold — 0.01% is typical — that catches real bugs while ignoring floating-point rounding differences.

sql

1-- Reconciliation query: compare source and target2SELECT3  'hive' AS source,4  COUNT(*) AS row_count,5  SUM(amount) AS total_amount,6  COUNT(DISTINCT user_id) AS distinct_users7FROM hive_catalog.analytics_db.transactions8WHERE transaction_date >= DATE '2026-06-01'9 10UNION ALL11 12SELECT13  'iceberg' AS source,14  COUNT(*) AS row_count,15  SUM(amount) AS total_amount,16  COUNT(DISTINCT user_id) AS distinct_users17FROM iceberg_catalog.analytics_db.transactions18WHERE transaction_date >= DATE '2026-06-01';

Phase 2: Cut reads to Iceberg (2 weeks). Dashboards, reports, and downstream pipelines switch to consuming from the Iceberg table. Dual-write continues. The original source becomes the rollback target. Monitor query performance, result accuracy, and latency compared to baseline.

Phase 3: Stop dual-write. Once reads have been stable for two weeks with no divergence, stop writing to the original source. Keep the source table available for 30 days as a safety net. After 30 days with no issues, decommission the original.

Shadow migration with view swap

For environments where consumers reference table names rather than catalog paths, use a view swap pattern to make the cutover invisible to downstream queries:

sql

1-- Consumers query this view — not the underlying table directly2CREATE OR REPLACE VIEW analytics_db.transactions AS3SELECT * FROM hive_catalog.analytics_db.transactions_hive;4 5-- After validation, swap the view to point at Iceberg6CREATE OR REPLACE VIEW analytics_db.transactions AS7SELECT * FROM iceberg_catalog.analytics_db.transactions_iceberg;8 9-- Rollback if needed10CREATE OR REPLACE VIEW analytics_db.transactions AS11SELECT * FROM hive_catalog.analytics_db.transactions_hive;

Iceberg branches offer another safety mechanism during shadow migrations. Use the Write-Audit-Publish (WAP) pattern to write to an Iceberg branch, validate the data, and only merge to the main branch after confirmation — giving you transactional rollback at the table level rather than relying solely on view swaps.

From Hive: detailed migration playbook

Hive-to-Iceberg is the most common enterprise migration pattern. The tooling is mature, the procedures are well-documented, and the risks are understood. Here is the step-by-step playbook.

Step 1: Inventory and classify tables

Before migrating anything, classify every table by migration complexity:

Simple: Single partition column, consistent schema across all partitions, Parquet format, reasonable file sizes (100–500 MB). Use migrate directly.
Medium: Multiple partition columns, some schema drift across partitions, mixed file sizes. Use migrate followed by compaction and statistics generation.
Complex: Non-Parquet format (ORC, Avro), heavily skewed partitions, external tables with shared paths, views that reference multiple tables. Use CTAS with explicit schema definition.

Step 2: Run migrate with partition spec handling

sql

1-- Migrate table — partition spec is inherited from Hive2CALL spark_catalog.system.migrate('analytics_db.user_sessions');3 4-- Verify the inherited partition spec5SELECT * FROM spark_catalog.analytics_db.user_sessions.partitions;6 7-- Evolve partition spec if needed (does not rewrite existing data)8ALTER TABLE spark_catalog.analytics_db.user_sessions9ADD PARTITION FIELD hours(session_start);

Step 3: Handle partition evolution

Hive tables often have suboptimal partitioning — year/month/day when days(timestamp) would be more efficient, or identity partitions on high-cardinality columns that create thousands of directories. After migrate, evolve the partition spec:

sql

1-- Replace Hive year/month/day with Iceberg days() transform2ALTER TABLE spark_catalog.analytics_db.user_sessions3DROP PARTITION FIELD year;4ALTER TABLE spark_catalog.analytics_db.user_sessions5DROP PARTITION FIELD month;6ALTER TABLE spark_catalog.analytics_db.user_sessions7DROP PARTITION FIELD day;8ALTER TABLE spark_catalog.analytics_db.user_sessions9ADD PARTITION FIELD days(session_start);

Old files retain the original year/month/day partitioning. New writes use days(session_start). Iceberg handles both specs transparently during query planning — old data is still prunable by the engine using metadata from the original partition values.

Step 4: Validate

sql

1-- Row count validation2SELECT COUNT(*) FROM spark_catalog.analytics_db.user_sessions;3-- Compare against the count from the original Hive table (before migrate)4 5-- Schema validation6DESCRIBE TABLE EXTENDED spark_catalog.analytics_db.user_sessions;7 8-- Snapshot verification9SELECT * FROM spark_catalog.analytics_db.user_sessions.snapshots;10 11-- File inventory12SELECT13  COUNT(*) AS total_files,14  SUM(file_size_in_bytes) / (1024*1024*1024) AS total_gb,15  AVG(file_size_in_bytes) / (1024*1024) AS avg_file_mb,16  MIN(file_size_in_bytes) / (1024*1024) AS min_file_mb,17  MAX(file_size_in_bytes) / (1024*1024) AS max_file_mb18FROM spark_catalog.analytics_db.user_sessions.files;

From Parquet: registering existing files

Raw Parquet files on S3 lack a metastore entry, so migrate cannot be used directly. The workflow involves schema inference, target table creation, and add_files registration.

Step 1: Infer schema from existing files

python

1# Read a sample of Parquet files to infer the schema2df = spark.read.parquet('s3://data-lake-prod/events/year=2026/month=06/')3schema = df.schema4 5# Inspect the inferred schema6df.printSchema()7 8# Check for schema inconsistencies across partitions9df_old = spark.read.parquet('s3://data-lake-prod/events/year=2024/month=01/')10old_schema = df_old.schema11 12# Compare schemas13for field in schema:14    if field not in old_schema:15        print(f'Column {field.name} missing in older partitions')

Step 2: Create the target Iceberg table

sql

1-- Create the Iceberg table with the unified schema2CREATE TABLE iceberg_catalog.events_db.raw_events (3  event_id STRING,4  user_id STRING,5  event_type STRING,6  event_timestamp TIMESTAMP,7  session_id STRING,8  page_url STRING,9  referrer STRING,10  device_type STRING,11  properties MAP<STRING, STRING>12) USING iceberg13PARTITIONED BY (days(event_timestamp))14TBLPROPERTIES (15  'write.target-file-size-bytes' = '536870912',16  'write.parquet.compression-codec' = 'zstd'17);

Step 3: Register files with partition discovery

sql

1-- Register all Parquet files — partition values discovered from directory layout2CALL iceberg_catalog.system.add_files(3  table => 'events_db.raw_events',4  source_table => '`parquet`.`s3://data-lake-prod/events/`'5);6 7-- Verify registration8SELECT COUNT(*) AS file_count,9       SUM(file_size_in_bytes) / (1024*1024*1024) AS total_gb10FROM iceberg_catalog.events_db.raw_events.files;

Step 4: Handle schema inconsistencies

When files across different time periods have different schemas (columns added over time, type changes), you have two options:

Union schema: Define the Iceberg table with the superset of all columns. Files missing columns will return NULLs for those fields at query time — Iceberg handles this gracefully through schema evolution.
Rewrite problematic partitions: For partitions with incompatible type changes (e.g., INT to STRING), use CTAS to rewrite just those partitions with explicit casts.

From Delta: conversion approaches

Delta-to-Iceberg migration has the most options because Delta files are already Parquet with well-defined metadata. The choice depends on whether you want Databricks to remain involved or are fully decoupling.

Option 1: Delta UniForm (read-only Iceberg access)

UniForm is not a migration — it is a compatibility layer. Databricks generates Iceberg metadata over existing Delta Parquet files asynchronously after each Delta commit. External engines read through the Iceberg REST Catalog in read-only mode.

sql

1-- Enable UniForm on an existing Delta table (Databricks Runtime 15.4+)2ALTER TABLE delta_catalog.lakehouse.user_events3SET TBLPROPERTIES (4  'delta.enableIcebergCompatV2' = 'true',5  'delta.universalFormat.enabledFormats' = 'iceberg'6);7 8-- Or upgrade with file rewrite if protocol upgrade is needed9REORG TABLE delta_catalog.lakehouse.user_events10APPLY (UPGRADE UNIFORM(ICEBERG_COMPAT_VERSION = 2));

UniForm works when Databricks stays the write path and external engines only need reads. For full multi-engine writes to Iceberg, use CTAS or add_files.

Option 2: CTAS to native Iceberg

sql

1-- Full migration from Delta to native Iceberg2CREATE TABLE iceberg_catalog.lakehouse.user_events3USING iceberg4PARTITIONED BY (days(created_at))5TBLPROPERTIES (6  'write.target-file-size-bytes' = '536870912',7  'write.parquet.compression-codec' = 'zstd'8)9AS SELECT * FROM delta.`s3://databricks-bucket/lakehouse/user_events`;

Option 3: add_files from Delta's Parquet files

Since Delta tables store data as Parquet, you can register those files directly with Iceberg — but you must first ensure only current files are included (not files from previous snapshots that VACUUM would remove):

sql

1-- Run VACUUM on the Delta table first to remove old snapshots' files2VACUUM delta.`s3://databricks-bucket/lakehouse/user_events` RETAIN 0 HOURS;3 4-- Create target Iceberg table matching the Delta schema5CREATE TABLE iceberg_catalog.lakehouse.user_events (6  user_id STRING,7  event_type STRING,8  created_at TIMESTAMP,9  payload STRING10) USING iceberg11PARTITIONED BY (days(created_at));12 13-- Register the Delta table's Parquet files14CALL iceberg_catalog.system.add_files(15  table => 'lakehouse.user_events',16  source_table => '`parquet`.`s3://databricks-bucket/lakehouse/user_events/`'17);

Option 4: Spark SQL direct conversion

sql

1-- Read Delta through Spark's Delta connector, write to Iceberg catalog2Spark.read.format('delta')3  .load('s3://databricks-bucket/lakehouse/user_events')4  .writeTo('iceberg_catalog.lakehouse.user_events')5  .using('iceberg')6  .partitionedBy(col('created_at'))7  .tableProperty('write.target-file-size-bytes', '536870912')8  .createOrReplace();

Delta metadata mapping considerations

When migrating from Delta:

Deletion vectors — files with deletion vectors cannot be registered via add_files without first rewriting them. Run REORG TABLE ... APPLY (PURGE) in Databricks to materialize deletes before conversion.
Generated columns — Delta generated columns have no Iceberg equivalent. Compute them explicitly in the CTAS SELECT.
Column mapping — Delta columnMapping.mode = name maps to Iceberg's internal field IDs. Ensure your Iceberg client version (1.9.2+) handles this correctly.
Change Data Feed — Delta CDF metadata columns (_change_type, _commit_version, _commit_timestamp) are Delta-specific and should be excluded from Iceberg migration unless replicated as regular columns.

Post-migration operations: the part teams underestimate

The migration is complete. Tables are in Iceberg format. Queries work. The immediate reaction is to declare success and move to the next project. But migration is a point-in-time event — operations are forever. This section covers what must happen immediately after migration and what must continue indefinitely. It is the difference between tables that perform well on day one and tables that are still performant six months later.

After migration, tables typically need three things before they are production-ready: initial compaction (files are often suboptimal from the conversion process), sort order application (organize by real query patterns rather than whatever the source writer chose), and statistics generation (enable cost-based optimization across all engines). Teams that skip this phase see query performance 5–12x worse than what the same data would deliver with proper layout.

Initial compaction

In-place migrations inherit the source's file layout — often millions of small files from streaming ingestion or a handful of oversized files from batch dumps. Neither is optimal for Iceberg query performance. Run compaction immediately after migration:

sql

1-- Binpack compaction to target 512 MB files2CALL iceberg_catalog.system.rewrite_data_files(3  table => 'analytics_db.page_views',4  options => map(5    'target-file-size-bytes', '536870912',6    'min-file-size-bytes', '67108864',7    'max-file-size-bytes', '805306368'8  )9);

For large tables, compact partition-by-partition to avoid monopolizing cluster resources:

sql

1-- Compact only recent partitions first2CALL iceberg_catalog.system.rewrite_data_files(3  table => 'analytics_db.page_views',4  strategy => 'binpack',5  where => 'event_timestamp >= TIMESTAMP \'2026-01-01\'',6  options => map('target-file-size-bytes', '536870912')7);

Sort order application

After compaction, apply a sort order that matches your dominant query patterns. Sort order determines how effectively engines can prune files using min/max column statistics:

sql

1-- Set sort order based on common filter columns2ALTER TABLE iceberg_catalog.analytics_db.page_views3WRITE ORDERED BY user_id, event_timestamp;4 5-- Rewrite data files with the new sort order6CALL iceberg_catalog.system.rewrite_data_files(7  table => 'analytics_db.page_views',8  strategy => 'sort',9  sort_order => 'user_id ASC NULLS LAST, event_timestamp ASC NULLS LAST'10);

Choosing the wrong sort order is expensive — you rewrite terabytes only to discover that the query patterns favor different columns. LakeOps solves this with layout simulations: test candidate sort strategies on Iceberg branches against production query telemetry before committing to a full data rewrite. This avoids the blind rewrite-and-hope cycle that wastes compute on sort orders that do not match actual access patterns.

Statistics generation

Iceberg query planners use column-level statistics (min, max, null count, distinct count) to prune files during planning. After migration or compaction, ensure statistics are current:

sql

1-- Generate column statistics (Trino)2ANALYZE iceberg_catalog.analytics_db.page_views;3 4-- Or in Spark — statistics are generated as part of the rewrite5CALL iceberg_catalog.system.rewrite_data_files(6  table => 'analytics_db.page_views',7  strategy => 'binpack',8  options => map('target-file-size-bytes', '536870912')9);

Snapshot management

Every write creates a new snapshot. Without expiration policies, snapshot metadata grows unbounded — consuming storage and degrading planning performance as the metadata tree deepens:

sql

1-- Expire snapshots older than 7 days2CALL iceberg_catalog.system.expire_snapshots(3  table => 'analytics_db.page_views',4  older_than => TIMESTAMP '2026-06-19 00:00:00',5  retain_last => 106);7 8-- Remove orphan files (files not referenced by any snapshot)9CALL iceberg_catalog.system.remove_orphan_files(10  table => 'analytics_db.page_views',11  older_than => TIMESTAMP '2026-06-19 00:00:00'12);13 14-- Rewrite manifests to consolidate metadata15CALL iceberg_catalog.system.rewrite_manifests(16  table => 'analytics_db.page_views'17);

The sequence matters: expire snapshots first (releases metadata references), then remove orphan files (cleans unreferenced data), then compact data files, then rewrite manifests. Each step depends on the previous one completing cleanly.

How LakeOps handles post-migration optimization

LakeOps is the operations layer that fills the maintenance gap between migrating tables to Iceberg and running them in production. Whether you migrate to Glue, REST/Polaris, Nessie, Gravitino, Lakekeeper, or S3 Tables — LakeOps connects in approximately ten minutes and starts optimizing immediately. No scripts to write, no thresholds to guess, no Spark clusters to manage for maintenance workloads.

The moment tables are connected, LakeOps classifies their health — Critical, Warning, Healthy — based on file count, average file size, snapshot depth, manifest count, and orphan file accumulation. This health baseline tells you exactly which migrated tables need attention first, rather than discovering degradation weeks later when queries start timing out.

LakeOps walkthrough — connecting migrated catalogs and enabling autonomous table maintenance.

Autonomous maintenance from day one

The maintenance operations described above — compaction, snapshot expiration, orphan cleanup, manifest rewriting, statistics refresh — are exactly what LakeOps runs as a sequenced pipeline from the moment your catalogs connect. Triggers are event-driven: a file count crossing a compaction boundary, a snapshot count exceeding retention policy. Each operation is logged with duration, before/after metrics, and status for audit. This replaces the Airflow DAGs and cron-scheduled Spark jobs that work for the first ten tables but become their own infrastructure project by table fifty.

Query-aware optimization

Compaction alone is necessary but not sufficient. LakeOps analyzes cross-engine query telemetry — which columns are filtered, which tables are scanned together, what sort orders would maximize predicate pushdown — and rewrites data files around the access patterns your engines actually use. For tables migrated from Hive or Delta where the original sort order was chosen for a single engine, query-aware rewriting corrects the layout for multi-engine access within days of production traffic flowing through the new Iceberg tables. Teams typically see 12x query improvement after post-migration optimization with LakeOps.

Lake-wide policy

Instead of writing per-table maintenance scripts, define policies once and scope them through a hierarchy — table → namespace → catalog. A compaction policy set at the catalog level applies to every migrated table beneath it. Namespace overrides narrow the scope for high-write or latency-sensitive tables. Every operation is logged with duration, before/after metrics, and status for audit. This is the discipline behind managed Iceberg — maintenance sequenced the way Iceberg economics require, not scattered Airflow DAGs per team.

Validation checklist

Never declare a migration complete without running these validations. Silent data issues that surface weeks later are exponentially more expensive to fix.

Row count validation

sql

1-- Compare total row counts2SELECT 'source' AS origin, COUNT(*) AS cnt FROM hive_catalog.db.source_table3UNION ALL4SELECT 'iceberg' AS origin, COUNT(*) AS cnt FROM iceberg_catalog.db.target_table;

Schema match verification

sql

1-- Iceberg schema with field IDs2SELECT * FROM iceberg_catalog.db.target_table.metadata_log_entries;3 4-- Compare column names, types, and nullability against source5DESCRIBE TABLE EXTENDED iceberg_catalog.db.target_table;

Query result comparison

Run representative queries against both source and target and diff the results. Focus on:

Aggregations: SUM, AVG, COUNT DISTINCT on key columns.
Partition boundary queries: confirm partition pruning produces identical results.
NULL handling: verify NULL counts per column match.
Timestamp precision: compare MAX(ts) and MIN(ts) to confirm no precision loss.

sql

1-- Aggregate comparison2SELECT3  event_type,4  COUNT(*) AS cnt,5  SUM(amount) AS total,6  MIN(event_timestamp) AS earliest,7  MAX(event_timestamp) AS latest8FROM iceberg_catalog.db.target_table9GROUP BY event_type10ORDER BY event_type;11 12-- Compare against identical query on source

Metadata health check

sql

1-- Verify snapshot history exists2SELECT * FROM iceberg_catalog.db.target_table.snapshots;3 4-- Check file size distribution5SELECT6  COUNT(*) AS file_count,7  AVG(file_size_in_bytes) / (1024*1024) AS avg_mb,8  PERCENTILE_APPROX(file_size_in_bytes, 0.5) / (1024*1024) AS median_mb,9  PERCENTILE_APPROX(file_size_in_bytes, 0.95) / (1024*1024) AS p95_mb10FROM iceberg_catalog.db.target_table.files;11 12-- Verify partition spec13SELECT * FROM iceberg_catalog.db.target_table.partitions;

Common pitfalls

These issues appear in nearly every production migration. Plan for them explicitly rather than discovering them in post-migration debugging sessions.

Timestamp precision

Hive stores timestamps as TIMESTAMP with microsecond precision. Iceberg supports microsecond (timestamptz) and nanosecond precision. But Parquet files written by older Hive versions may store timestamps as INT96 (nanosecond) or as millisecond-precision INT64. When migrate or add_files registers these files, the Iceberg schema declares microsecond precision — but the underlying files may have different actual precision. Queries that compare timestamps at boundaries can return different results than they did on Hive.

Mitigation: After migration, run boundary-condition timestamp queries and compare results. If precision mismatches appear, rewrite affected partitions with explicit CAST to normalize precision.

Decimal handling

Hive and Spark handle decimal types with different precision/scale defaults. A Hive DECIMAL(38,18) column may be stored in Parquet as a fixed-length byte array. When registered in Iceberg, the declared precision must exactly match the Parquet file's metadata — if it does not, queries return incorrect values or fail with type mismatch errors.

Mitigation: Before migration, inspect Parquet file metadata (parquet-tools meta) to confirm decimal precision and scale match the intended Iceberg schema. Rewrite columns with inconsistent decimal metadata.

Nested types (structs, maps, arrays)

Iceberg has strict rules about nested type evolution — you can add fields to a struct but cannot remove or rename them without creating a new schema version. Hive tables with structs that evolved over time (fields added, renamed, or removed across different partition writes) create files where the internal Parquet schema differs per file. migrate and add_files register these files under a single Iceberg schema — reads may fail on files with incompatible struct layouts.

Mitigation: Identify tables with complex nested types that changed over time. For those tables, prefer CTAS with explicit struct definitions over in-place migration.

Partition mapping

Hive uses physical directory-based partitions (year=2026/month=06/). Iceberg uses logical partition transforms that can include hidden partitions. When migrating with migrate, the Hive partition columns become identity partitions in Iceberg. If the original Hive table used partition columns that were also data columns (common with year, month, day extracted from a timestamp), those columns exist both as data and as partition values — potentially causing confusion in engines that handle partition pruning differently.

Mitigation: After migrate, verify that partition columns are correctly reflected in both the schema and the partition spec. Consider evolving to Iceberg transform-based partitions (days(ts), hours(ts)) and rewriting data to eliminate redundant year/month/day columns.

File format version conflicts

Parquet files written by different engine versions may use different Parquet format versions (v1 vs v2), page encodings (plain, dictionary, delta), or compression codecs. Iceberg registers them all — but some query engines have incomplete support for older Parquet page encodings. Trino, for example, may fail on Parquet v1 files with certain dictionary encodings that Hive wrote.

Mitigation: After migration, run test queries from every engine that will access the table. If specific engines fail, rewrite the problematic partitions to normalize Parquet format version and encoding.

Orphaned files from failed migrations

A failed migrate or add_files can leave partial metadata in an inconsistent state. If you re-run the procedure, you may end up with duplicate file entries or orphaned metadata files.

Mitigation: If a migration fails mid-execution, drop the partially created Iceberg table (or roll back the snapshot) before retrying. Never re-run add_files on a table that already has some files registered from a previous attempt without first verifying what is already tracked.

Ongoing operational health

The migration is technically complete — but operationally, you have created a new category of infrastructure that requires continuous attention. This is the gap teams underestimate most severely.

Hive hid maintenance behind a combination of the metastore's partition management and whatever cleanup scripts your Hadoop admins maintained. Databricks hides it behind Predictive Optimization and automatic VACUUM. Snowflake hides it behind automatic storage optimization. When you move to open Iceberg, nobody hides it. You own:

Compaction scheduling — who runs rewrite_data_files and how often? Based on what threshold?
Snapshot expiration — what retention window? How many snapshots to keep?
Orphan file cleanup — how often? What safety window to avoid deleting files from in-flight queries?
Manifest rewriting — when do manifests grow large enough to degrade planning time?
Statistics freshness — how often do column statistics need regeneration after writes?
Cross-engine coordination — if Spark, Trino, and Flink all write to the same table, whose compaction job wins?

Most teams solve this initially with Airflow DAGs or cron-scheduled Spark jobs. That works for the first ten tables. At fifty tables with different access patterns, write frequencies, and query engines, the maintenance scripts become their own infrastructure project — one that drifts, breaks silently, and nobody owns after the engineer who wrote them leaves.

This is where the operational layer becomes non-negotiable. LakeOps replaces the per-table scripts with catalog-scoped policies, event-driven triggers, and sequenced maintenance pipelines that handle the full lifecycle — compaction, snapshot expiration, orphan cleanup, manifest optimization, and statistics refresh — across every catalog, engine, and table in your estate. Deploy it before table twenty, not after table two hundred.

For the full operational model, see automating Iceberg table maintenance. For lakehouse-wide performance tuning beyond individual tables, see Iceberg lakehouse performance.

Migration decision guide

Hive table, acceptable file layout, large data volume: In-place migrate → immediate compaction → connect LakeOps for ongoing maintenance.

Hive table, problematic file layout or partition scheme: CTAS with new partition spec and sort order → connect LakeOps.

Raw Parquet on S3, Hive-partitioned directories: add_files with partition discovery → schema validation → compaction → LakeOps.

Raw Parquet on S3, no directory structure: CTAS reading from parquet.\s3://path\`` with explicit partition spec.

Delta Lake, Databricks stays the write path: UniForm for immediate read access; CTAS or add_files for full multi-engine write independence.

Delta Lake, full decouple from Databricks: CTAS to open catalog (Glue, Polaris, Nessie) → connect LakeOps → route workloads to cheaper engines.

Any source, zero-downtime requirement: Shadow migration with dual-write → validation → view swap → decommission source.

Summary

Iceberg migration is two distinct problems: getting data into the format, and operating tables after conversion. The first is a bounded project — in-place migration for speed, CTAS for optimization, shadow migration for safety. The source format (Hive, Parquet, Delta) determines which procedures and tooling apply. The second is an ongoing discipline — compaction, snapshot management, manifest optimization, sort order tuning, statistics generation, and cross-engine coordination — that determines whether your Iceberg investment delivers sustained performance or slowly degrades into the same mess you migrated away from.

The teams that succeed are the ones who plan for operations before they finish migration — not after the first performance regression surfaces in production. Connect LakeOps as part of the migration plan, not as a remediation project six months later. Validate rigorously with the checklist above. Watch for the common pitfalls around timestamps, decimals, nested types, and partition mapping. And treat migration as the beginning of your Iceberg operational practice, not the end of a format conversion project.

Apache Iceberg Migration Strategy: From Hive, Parquet, or Delta to Production Iceberg

Why teams migrate to Iceberg

Three source patterns

Source 1: Hive/HMS tables

Source 2: Raw Parquet on S3

Source 3: Delta Lake tables

Migration approach 1: In-place migration (metadata-only)

When to use in-place migration

The migrate procedure

The add_files procedure

In-place migration limitations

Migration approach 2: CTAS (full rewrite with optimized layout)

When to use CTAS

CTAS from Hive

CTAS from Delta Lake

CTAS from raw Parquet

CTAS considerations

Migration approach 3: Shadow migration (dual-write, gradual cutover)

When to use shadow migration

Shadow migration phases

Shadow migration with view swap

From Hive: detailed migration playbook

Step 1: Inventory and classify tables

Step 2: Run migrate with partition spec handling

Step 3: Handle partition evolution

Step 4: Validate

From Parquet: registering existing files

Step 1: Infer schema from existing files

Step 2: Create the target Iceberg table

Step 3: Register files with partition discovery

Step 4: Handle schema inconsistencies

From Delta: conversion approaches

Option 1: Delta UniForm (read-only Iceberg access)

Option 2: CTAS to native Iceberg

Option 3: add_files from Delta's Parquet files

Option 4: Spark SQL direct conversion

Delta metadata mapping considerations

Post-migration operations: the part teams underestimate

Initial compaction

Sort order application

Statistics generation

Snapshot management

How LakeOps handles post-migration optimization

Autonomous maintenance from day one

Query-aware optimization

Lake-wide policy

Validation checklist

Row count validation

Schema match verification

Query result comparison

Metadata health check

Common pitfalls

Timestamp precision

Decimal handling

Nested types (structs, maps, arrays)

Partition mapping

File format version conflicts

Orphaned files from failed migrations

Ongoing operational health

Migration decision guide

Summary

Tags

Related articles

From Data Swamp to Modern Iceberg Lakehouse

Data Lake vs Lakehouse vs Warehouse: A Practical Guide

Apache Iceberg Data Quality and Table Health: Where Reliability Actually Breaks