Apache Iceberg Catalog Migration: Hive Metastore to REST, Polaris, Glue, or Nessie

Your data is not moving. Your catalog is.

Apache Iceberg catalog migration is the process of moving table registrations from one catalog implementation to another — Hive Metastore to Apache Polaris, HMS to AWS Glue, Glue to Nessie, JDBC to a REST catalog — without copying, moving, or rewriting a single data file. Because Iceberg catalogs store only metadata pointers (the path to the current metadata.json for each table), migration is a metadata-only operation. Register the existing metadata file location in the new catalog, and the new catalog immediately has access to the complete table — every snapshot, every schema evolution, every partition spec change.

This guide covers why catalog migration matters, which target catalogs to consider, two migration patterns (in-place registration vs. full recreation), step-by-step procedures for the two most common paths (HMS → REST and HMS → AWS Glue), dual-catalog access during transition, validation, rollback, and how LakeOps provides unified visibility across catalogs before, during, and after migration.

Catalog migration is a prerequisite for modern multi-engine Iceberg. But the migration itself is the easy part — what comes after (maintaining tables across the new catalog, ensuring health, coordinating engines) is where teams struggle. LakeOps connects to any catalog (Glue, REST/Polaris, Nessie, S3 Tables) in roughly ten minutes, providing immediate observability and autonomous maintenance from day one of your new catalog.

Why catalog migration matters

The table format question is settled — Iceberg won. The catalog question is where teams are now stuck. Most production Iceberg estates started on Hive Metastore because it was already running, every engine supported it, and adding Iceberg tables required zero new infrastructure. That was the right call in 2022. It is increasingly the wrong call in 2026.

Hive Metastore: why teams outgrow it

HMS served its purpose as the bootstrap catalog, but its limitations compound as lakehouse deployments grow:

Thrift protocol — every engine needs Hive client JARs on the classpath. No credential vending, no server-side commit deconflicting, no multi-table commits. Non-JVM engines (Python, Go, Rust) need a shim layer or proxy.
Scalability ceiling — listTables degrades as namespace size grows. At 8,000+ tables in a single namespace, operations that should take milliseconds take minutes because the getTableObjectsByName Thrift call fetches full table objects to filter Iceberg tables from Hive tables.
No native access control — fine-grained table permissions require bolting on Apache Ranger or a custom authorization layer.
No branching or versioning — no concept of catalog-level branches, tags, or time travel.
JVM dependency — the HMS server requires a JVM process. The transitive dependency tree includes Hadoop and Hive JARs that bloat container images and complicate upgrades.
Multi-catalog isolation issues — the HiveCatalog implementation ignores the catalog name parameter, so isolating metadata between logical catalogs on the same HMS instance is unreliable.

The REST catalog advantage

Modern REST-based catalogs address these limitations, although individual implementations differ in how much of the specification they cover. The Iceberg REST Catalog specification defines a standard HTTP API that any engine can call and any server can implement, collapsing the O(engines × catalogs) integration matrix into O(engines + catalogs). The REST protocol enables credential vending (short-lived, table-scoped storage tokens), server-side commit deconflicting, multi-table commits, and lazy snapshot loading. Most catalogs released since 2023 expose a REST endpoint.

The practical question is not whether to migrate from HMS, but when — and to which target catalog. For a detailed feature comparison of all seven major catalogs, see the Apache Iceberg catalog comparison.

Target catalog landscape

Before choosing a migration target, understand what each catalog brings. The right choice depends on your cloud provider, engine matrix, governance requirements, and operational appetite.

REST catalogs: Apache Polaris and Apache Gravitino

Apache Polaris is a full open-source implementation of the Iceberg REST Catalog specification — purpose-built for multi-engine access with fine-grained access control. It graduated from the Apache Incubator in February 2026. Polaris delivers credential vending with AWS STS session tags for CloudTrail correlation, built-in RBAC with OPA integration, multi-catalog management, and catalog federation to HMS, Glue, and other REST catalogs. A single Polaris instance can act as a routing layer for tables in other catalogs, enabling incremental adoption. Snowflake Open Catalog provides a managed hosting option built on the same codebase.

Apache Gravitino graduated as an Apache Top-Level Project in June 2025 and positions itself as a federated metadata lake: not just an Iceberg catalog, but a unified metadata layer for tables, files, models, Kafka topics, and UDFs across backend systems. Gravitino runs a native Iceberg REST endpoint and connects to Hive, MySQL, PostgreSQL, HDFS, S3, Iceberg, Hudi, Paimon, ClickHouse, and more through a unified API. Its breadth is both its strength and its source of complexity: for teams that only need an Iceberg catalog, Gravitino may be more infrastructure than necessary.

Both Polaris and Gravitino are strong migration targets when you need a vendor-neutral, open-source REST catalog with multi-engine interoperability. Polaris is the simpler, more focused choice; Gravitino is the right fit when you need to federate metadata across heterogeneous systems beyond Iceberg.

AWS Glue Data Catalog

AWS Glue is a fully managed, serverless metadata service with zero operational overhead. It integrates natively with IAM, Lake Formation, Athena, EMR, and Redshift Spectrum. In late 2024, AWS added a REST endpoint (https://glue.<region>.amazonaws.com/iceberg) that implements the Iceberg REST spec, letting external engines connect without Glue-specific SDKs. Glue is the path of least resistance for all-AWS teams — no servers to provision, no databases to manage, and deep Lake Formation integration for column-level security. The trade-offs are single-cloud lock-in, single-level namespaces, no branching, no multi-table commits, and REST API gaps (no UpdateTable for Iceberg, no credential vending through REST).

Project Nessie (git-like catalog)

Project Nessie brings Git-style semantics to catalog metadata — branches, tags, commits, cherry-picks, and merges over your entire catalog state. Create a dev branch, test a schema change or backfill job in isolation, then merge when ready. This is transformative for data CI/CD workflows, and Nessie is the only catalog that offers this capability. The trade-off: no built-in access control (pair with Polaris or OPA), no credential vending, and branch management overhead is only justified if your workflows actually benefit from data CI/CD. Nessie implements the Iceberg REST Catalog interface, so engine configuration is standard REST.

Databricks Unity Catalog

Unity Catalog is Databricks' governance layer, managing tables, volumes, ML models, and AI assets with row-level and column-level security, lineage tracking, and audit logging. Unity implements the Iceberg REST spec at /api/2.1/unity-catalog/iceberg-rest. For Databricks-centric teams, Unity is the natural target. For non-Databricks environments, the open-source version under Linux Foundation governance is an option, but expect feature gaps versus the managed platform. For deeper context on moving from Databricks to Iceberg, see the Databricks to Iceberg migration guide.

Migration patterns: registration vs. recreation

There are two fundamental approaches to catalog migration. The right choice depends on whether your existing tables are already in Iceberg format.

Pattern 1: In-place metadata registration (zero data movement)

This is the most common pattern and the one you should default to when migrating between Iceberg catalogs. It works because Iceberg's table state is defined entirely by its metadata.json file in object storage. The catalog is just a pointer to that file. Migration means creating a new pointer in the new catalog that points to the same metadata file — no data copies, no file rewrites, no downtime for reads.

The operation is simple: read the current metadata_location from the source catalog for each table, then call register_table (or the equivalent API) on the target catalog with that same location. The new catalog immediately sees the full table — all snapshots, all schema history, all partition specs. Table history is preserved because the metadata files themselves are untouched.

Use in-place registration when:

Your tables are already Iceberg format in the source catalog.
You are moving between Iceberg catalog implementations (HMS → Polaris, Glue → Nessie, JDBC → REST).
You want zero downtime and zero data movement.
You need to preserve full table history and snapshot lineage.
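To make the pointer idea concrete, here is a toy model of two catalogs. It is not an Iceberg API: ToyCatalog is a hypothetical class that only stores a mapping from table identifier to metadata file location, which is the part of catalog state that in-place registration copies.

```python
class ToyCatalog:
    """Hypothetical stand-in for a catalog: identifier -> metadata file location."""

    def __init__(self):
        self._pointers = {}

    def register_table(self, identifier, metadata_location):
        # Real catalogs also reject registering over an existing table.
        if identifier in self._pointers:
            raise ValueError(f"{identifier} already exists")
        self._pointers[identifier] = metadata_location

    def metadata_location(self, identifier):
        return self._pointers[identifier]


source = ToyCatalog()
source.register_table(
    "analytics.orders",
    "s3://my-warehouse/analytics/orders/metadata/v42.metadata.json",
)

# "Migration" is copying the pointer; nothing in object storage is touched.
target = ToyCatalog()
target.register_table("analytics.orders", source.metadata_location("analytics.orders"))
```

After this runs, both toy catalogs resolve analytics.orders to the same metadata file, which is why history and snapshots come along for free.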

Pattern 2: Full table recreation (data rewrite)

When your source tables are not Iceberg — raw Parquet registered in Hive, CSV files, or legacy Hive-format tables — you need to convert them to Iceberg before they can be registered in a new catalog. Two Spark procedures handle this:

`migrate` — converts an existing Hive/Parquet table to Iceberg in-place by reading Parquet footers and generating Iceberg metadata (manifests, metadata.json). The original data files stay. The old table is replaced with an Iceberg table in the catalog.
`snapshot` — creates a new Iceberg table from existing Hive/Parquet files without replacing the original. Both the old table and the new Iceberg table coexist, sharing the underlying data files.

For tables that need partition layout changes, sort order optimization, or schema restructuring, CREATE TABLE … AS SELECT (CTAS) rewrites data files entirely. This is more expensive but gives you full control over the target layout.

Use full recreation when:

Source tables are Hive-format, Parquet, or CSV — not Iceberg.
You want to change partition strategy or sort order during migration.
Source data quality requires a transformation pass.
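As a rough sketch, the two conversion procedures can be driven from Python by building the CALL statements and passing them to spark.sql. The helper names and the legacy.events table are hypothetical; the catalog name you pass depends on how your Spark session catalog is configured.

```python
def snapshot_sql(catalog: str, source_table: str, new_table: str) -> str:
    """Build a CALL for the snapshot procedure (the original table is kept)."""
    return (
        f"CALL {catalog}.system.snapshot("
        f"source_table => '{source_table}', table => '{new_table}')"
    )


def migrate_sql(catalog: str, table: str) -> str:
    """Build a CALL for the migrate procedure (the table is replaced in place)."""
    return f"CALL {catalog}.system.migrate(table => '{table}')"


# Hypothetical table names, for illustration only.
print(snapshot_sql("spark_catalog", "legacy.events", "legacy.events_iceberg"))
print(migrate_sql("spark_catalog", "legacy.events"))
```

A cautious sequence is to run snapshot first, validate the resulting Iceberg table, and only then run migrate on the original.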

Using add_files for external Parquet ingestion

A third option sits between registration and full recreation. The add_files procedure registers existing Parquet or Avro data files directly into an Iceberg table without rewriting them — similar to migrate, but designed for files that were never managed by any catalog. This is useful when you have Parquet files produced by external systems (data vendors, legacy pipelines, ad hoc exports) that you want to bring into Iceberg governance without a copy.

sql

CALL catalog.system.add_files(
    table => 'analytics.vendor_data',
    source_table => '`vendor_data_parquet`',
    partition_filter => map('year', '2026')
);

The files are added to the Iceberg manifest structure with appropriate metadata, making them visible to all engines connected to the target catalog. Unlike migrate, the source table (if any) is not replaced. Unlike CTAS, the data files are not rewritten.

Step-by-step: HMS → REST catalog migration

This is the most common migration path. You have Iceberg tables registered in Hive Metastore and want to move them to a REST catalog — Apache Polaris, Lakekeeper, Gravitino, or any other REST implementation. Data stays in place; only the catalog pointer moves.

Step 1: Inventory your HMS tables

Before touching anything, build a complete inventory of every Iceberg table in your HMS instance. You need the table identifier and its current metadata_location.

python

from pyiceberg.catalog import load_catalog

hms_catalog = load_catalog("hms", **{
    "type": "hive",
    "uri": "thrift://hms-host:9083",
    "s3.region": "us-east-1"
})

all_tables = []
for namespace in hms_catalog.list_namespaces():
    for table_id in hms_catalog.list_tables(namespace[0]):
        table = hms_catalog.load_table(table_id)
        all_tables.append({
            "identifier": table_id,
            "metadata_location": table.metadata_location,
            "schema": table.schema(),
            "snapshot_count": len(table.metadata.snapshots)
        })
        print(f"{table_id} → {table.metadata_location}")

print(f"\nTotal tables to migrate: {len(all_tables)}")

Save this inventory. It is your source of truth for validation and rollback.
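One way to persist the inventory is as JSON. The sketch below assumes rows shaped like the all_tables entries from the inventory script; inventory_to_json is a hypothetical helper, and the schema is stored only as a human-readable string because schema objects are not JSON-serializable.

```python
import json


def inventory_to_json(all_tables):
    """Serialize inventory rows to JSON text; tuple identifiers become dotted names."""
    rows = []
    for row in all_tables:
        ident = row["identifier"]
        name = ident if isinstance(ident, str) else ".".join(ident)
        rows.append({
            "identifier": name,
            "metadata_location": row["metadata_location"],
            "snapshot_count": row.get("snapshot_count"),
            "schema": str(row.get("schema")),  # for human review, not for reloading
        })
    return json.dumps(rows, indent=2)
```

Writing the returned text to a file under version control gives you a fixed reference point for validation and rollback.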

Step 2: Deploy the REST catalog

Stand up your target REST catalog. For Polaris:

Deploy the Polaris server (Quarkus-based JVM) with PostgreSQL as the persistence backend.
Create a catalog entity with a base-location matching your existing warehouse path (critical — the REST catalog must have storage access to the same paths where your metadata files live).
Configure IAM credentials or service account permissions so the REST catalog can read and write to the S3/GCS/ADLS paths where table metadata and data files reside.
Create the target namespaces to match your HMS database structure.
Set ALLOW_UNSTRUCTURED_TABLE_LOCATION on the Polaris server — HMS creates namespace folders with a .db extension that Polaris would otherwise reject.
Set allowedLocations in the catalog's storage_configuration_info to include the source catalog directory.

sql

-- Example: Polaris catalog creation via admin API
-- POST /api/management/v1/catalogs
-- {
--   "name": "production",
--   "type": "INTERNAL",
--   "properties": {
--     "default-base-location": "s3://my-warehouse/"
--   },
--   "storageConfigInfo": {
--     "storageType": "S3",
--     "allowedLocations": ["s3://my-warehouse/"]
--   }
-- }
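Before registering anything, it can help to check that every metadata location from the Step 1 inventory falls under the catalog's allowedLocations. The helper below is a hypothetical sketch that does a plain string-prefix comparison; it assumes the URI scheme matches exactly, so s3a:// and s3:// locations are treated as different.

```python
def outside_allowed_locations(metadata_locations, allowed_locations):
    """Return the metadata locations that are not under any allowed prefix."""
    # Normalize prefixes to end with "/" so "s3://my-warehouse" does not
    # accidentally match "s3://my-warehouse-2/...".
    prefixes = [p if p.endswith("/") else p + "/" for p in allowed_locations]
    return [
        loc for loc in metadata_locations
        if not any(loc.startswith(p) for p in prefixes)
    ]
```

Any location this returns would need either a wider allowedLocations list or a closer look at where that table actually lives.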

Step 3: Register tables in the new catalog

Use the register_table procedure to create catalog entries that point to existing metadata files. No data moves.

Option A: Spark SQL (table-by-table)

sql

-- Configure Spark to connect to both catalogs
-- spark.sql.catalog.hms_catalog = org.apache.iceberg.spark.SparkCatalog
-- spark.sql.catalog.hms_catalog.type = hive
-- spark.sql.catalog.hms_catalog.uri = thrift://hms-host:9083

-- spark.sql.catalog.rest_catalog = org.apache.iceberg.spark.SparkCatalog
-- spark.sql.catalog.rest_catalog.type = rest
-- spark.sql.catalog.rest_catalog.uri = https://polaris-host:8181/api/catalog
-- spark.sql.catalog.rest_catalog.credential = <client-id>:<client-secret>
-- spark.sql.catalog.rest_catalog.warehouse = production

-- Register each table using its metadata file location
CALL rest_catalog.system.register_table(
    table => 'analytics.orders',
    metadata_file => 's3://my-warehouse/analytics/orders/metadata/v42.metadata.json'
);

CALL rest_catalog.system.register_table(
    table => 'analytics.customers',
    metadata_file => 's3://my-warehouse/analytics/customers/metadata/v18.metadata.json'
);

Option B: PyIceberg (bulk registration)

python

from pyiceberg.catalog import load_catalog

rest_catalog = load_catalog("rest", **{
    "type": "rest",
    "uri": "https://polaris-host:8181/api/catalog",
    "credential": "<client-id>:<client-secret>",
    "warehouse": "production"
})

successes = []
failures = []

for table_info in all_tables:
    try:
        rest_catalog.register_table(
            identifier=table_info["identifier"],
            metadata_location=table_info["metadata_location"]
        )
        successes.append(table_info["identifier"])
        print(f"Registered {table_info['identifier']}")
    except Exception as e:
        failures.append({"id": table_info["identifier"], "error": str(e)})
        print(f"FAILED {table_info['identifier']}: {e}")

print(f"\nRegistered: {len(successes)}, Failed: {len(failures)}")

Option C: Iceberg Catalog Migrator CLI (bulk, automated)

The iceberg-catalog-migrator CLI (available under apache/polaris-tools on GitHub) automates bulk migration between any supported catalog pair. It supports both register (tables remain in both catalogs) and migrate (tables are removed from source after successful migration). Avoid running the migrator when there are in-progress commits for tables in the source catalog — concurrent writes during migration can cause missing updates and metadata corruption.

bash

# Register all tables from HMS into Polaris REST catalog
# Tables remain in both catalogs, which is safe for validation
java -jar iceberg-catalog-migrator-cli.jar register \
    --source-catalog-type HIVE \
    --source-catalog-properties \
        uri=thrift://hms-host:9083,warehouse=s3a://my-warehouse/ \
    --target-catalog-type REST \
    --target-catalog-properties \
        uri=https://polaris-host:8181/api/catalog,warehouse=production,token=$TOKEN

# Or register specific namespaces using a regex
# (quoted so the shell does not expand the pattern)
java -jar iceberg-catalog-migrator-cli.jar register \
    --source-catalog-type HIVE \
    --source-catalog-properties \
        uri=thrift://hms-host:9083,warehouse=s3a://my-warehouse/ \
    --target-catalog-type REST \
    --target-catalog-properties \
        uri=https://polaris-host:8181/api/catalog,warehouse=production,token=$TOKEN \
    --identifiers-regex '^analytics\..*'

Step 4: Validate registration

Do not skip validation. For every registered table, confirm that the new catalog resolves to the same metadata, schema, and row count.

sql

-- Schema comparison
DESCRIBE rest_catalog.analytics.orders;
DESCRIBE hms_catalog.analytics.orders;

-- Row count verification
SELECT count(*) FROM rest_catalog.analytics.orders;
SELECT count(*) FROM hms_catalog.analytics.orders;

-- Snapshot count comparison
SELECT count(*) FROM rest_catalog.analytics.orders.snapshots;
SELECT count(*) FROM hms_catalog.analytics.orders.snapshots;

-- Metadata location confirmation
SELECT * FROM rest_catalog.analytics.orders.metadata_log_entries
ORDER BY timestamp DESC LIMIT 1;

Run a representative set of production queries against the new catalog and compare results with the old catalog. Do not declare success based on row counts alone — verify actual query output on business-critical tables.

Step 5: Cut over engine configurations

Update engine configurations to point to the REST catalog. This is the moment engines start using the new catalog for metadata resolution.

properties

# Before (HMS)
spark.sql.catalog.prod = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.prod.type = hive
spark.sql.catalog.prod.uri = thrift://hms-host:9083

# After (REST)
spark.sql.catalog.prod = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.prod.type = rest
spark.sql.catalog.prod.uri = https://polaris-host:8181/api/catalog
spark.sql.catalog.prod.credential = <client-id>:<client-secret>
spark.sql.catalog.prod.warehouse = production

For Trino:

properties

# trino/etc/catalog/iceberg.properties
# Before (HMS)
connector.name=iceberg
iceberg.catalog.type=hive_metastore
hive.metastore.uri=thrift://hms-host:9083

# After (REST)
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=https://polaris-host:8181/api/catalog
iceberg.rest-catalog.warehouse=production

Roll out engine configuration changes incrementally — start with read-only consumers (BI tools, ad hoc queries), then move to write-path engines (ETL pipelines, streaming jobs) only after read validation is complete.

Step-by-step: HMS → AWS Glue migration

For all-AWS teams, migrating from a self-managed HMS to AWS Glue eliminates operational overhead and integrates with IAM, Lake Formation, Athena, and EMR natively. The pattern is similar — data stays in place, only catalog pointers move — but the tooling differs because Glue uses its own API alongside the newer REST endpoint.

Step 1: Verify storage access

Ensure the AWS Glue service role has read and write access to the S3 paths where your Iceberg metadata and data files live. Glue needs to reach the same metadata.json files that HMS currently points to.

json

{
    "Effect": "Allow",
    "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
    ],
    "Resource": [
        "arn:aws:s3:::my-warehouse",
        "arn:aws:s3:::my-warehouse/*"
    ]
}

Step 2: Create Glue databases

Map your HMS databases to Glue databases. Glue supports only single-level namespaces — if your HMS uses nested databases, flatten them with a naming convention.

python

import boto3

glue = boto3.client('glue', region_name='us-east-1')

# hms_namespaces is assumed to be a list of HMS database names (strings)
# collected beforehand, for example from an inventory of the source catalog.

# Create databases matching your HMS namespaces
for namespace in hms_namespaces:
    glue.create_database(
        DatabaseInput={
            'Name': namespace,
            'Description': f'Migrated from HMS - {namespace}',
            'LocationUri': f's3://my-warehouse/{namespace}/'
        }
    )
    print(f"Created Glue database: {namespace}")
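If the source does have multi-level namespaces, the flattening convention could look like the sketch below. The underscore separator and the lowercasing are assumptions chosen for illustration, and flatten_namespaces is a hypothetical helper; the collision check matters because two different nested namespaces can flatten to the same name.

```python
def flatten_namespaces(namespaces, separator="_"):
    """Map multi-level namespace tuples to single-level database names."""
    mapping = {}
    seen = {}
    for ns in namespaces:
        ns = tuple(ns)
        flat = separator.join(part.lower() for part in ns)
        # Refuse to silently merge two distinct namespaces into one database.
        if flat in seen and seen[flat] != ns:
            raise ValueError(f"{ns} and {seen[flat]} both flatten to {flat!r}")
        seen[flat] = ns
        mapping[ns] = flat
    return mapping
```

The returned mapping can then supply the database names used in the create_database loop, and it doubles as a record of how identifiers changed between catalogs.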

Step 3: Register tables via Spark

Use Spark with both HMS and Glue catalogs configured to register existing Iceberg tables in Glue.

sql

-- Configure Spark with both catalogs
-- spark.sql.catalog.hms_catalog = org.apache.iceberg.spark.SparkCatalog
-- spark.sql.catalog.hms_catalog.type = hive
-- spark.sql.catalog.hms_catalog.uri = thrift://hms-host:9083

-- spark.sql.catalog.glue_catalog = org.apache.iceberg.spark.SparkCatalog
-- spark.sql.catalog.glue_catalog.catalog-impl = org.apache.iceberg.aws.glue.GlueCatalog
-- spark.sql.catalog.glue_catalog.warehouse = s3://my-warehouse/
-- spark.sql.catalog.glue_catalog.io-impl = org.apache.iceberg.aws.s3.S3FileIO

-- Register each table
CALL glue_catalog.system.register_table(
    table => 'analytics.orders',
    metadata_file => 's3://my-warehouse/analytics/orders/metadata/v42.metadata.json'
);

Alternatively, use the Iceberg Catalog Migrator CLI:

bash

java -jar iceberg-catalog-migrator-cli.jar register \
    --source-catalog-type HIVE \
    --source-catalog-properties \
        uri=thrift://hms-host:9083,warehouse=s3a://my-warehouse/ \
    --target-catalog-type GLUE \
    --target-catalog-properties \
        warehouse=s3a://my-warehouse/,io-impl=org.apache.iceberg.aws.s3.S3FileIO

Step 4: Validate and configure Lake Formation

After registration, validate row counts and schemas as described in the REST migration section. Then layer on Lake Formation permissions to replace whatever access control layer you had on HMS (Ranger, custom authz, or none).

sql

-- Validate from Athena using the Glue catalog
SELECT count(*) FROM analytics.orders;

-- Compare with the HMS-sourced count
-- (run from an engine that has hms_catalog configured, such as Spark)
SELECT count(*) FROM hms_catalog.analytics.orders;

Step 5: Update engine configurations

Point engines to Glue. For Athena, Glue is the default catalog and no configuration is needed. On EMR, Glue is used once the cluster is configured to use the Glue Data Catalog as its metastore. For Spark and Trino:

properties

# Spark
spark.sql.catalog.prod = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.prod.catalog-impl = org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.prod.warehouse = s3://my-warehouse/
spark.sql.catalog.prod.io-impl = org.apache.iceberg.aws.s3.S3FileIO

# Trino
connector.name=iceberg
iceberg.catalog.type=glue
hive.metastore.glue.region=us-east-1

Dual-catalog access during migration

Production migrations are not instantaneous. You will have a transition period where some engines point to the old catalog and others to the new one. Managing this dual-catalog phase is where most migration failures happen.

The golden rule: read from both, write to one

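During transition, configure your environment so that both catalogs are accessible for reads, but all writes go to one catalog only: the new one. This prevents split-brain scenarios where concurrent writes to both catalogs cause metadata divergence that is extremely difficult to reconcile. Once writers commit through the new catalog, the old catalog's pointer stays at the last metadata file it was given, so reads through the old catalog return data as of registration time rather than the latest state.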

python

# Spark session with both catalogs for reads
# spark.sql.catalog.old_catalog.type = hive
# spark.sql.catalog.old_catalog.uri = thrift://hms-host:9083

# spark.sql.catalog.new_catalog.type = rest
# spark.sql.catalog.new_catalog.uri = https://polaris-host:8181/api/catalog

# Read from either catalog during validation
# SELECT * FROM old_catalog.analytics.orders WHERE ...
# SELECT * FROM new_catalog.analytics.orders WHERE ...

# All writes go exclusively to the new catalog
# INSERT INTO new_catalog.analytics.orders VALUES (...)
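During the dual-catalog phase, a simple way to see where each table stands is to compare the metadata location each catalog reports. The sketch below assumes you have already collected two plain dicts of identifier to metadata location (for example from load_table calls against each catalog); migration_status and its status labels are hypothetical.

```python
def migration_status(old_pointers, new_pointers):
    """Classify each source table by comparing metadata locations in both catalogs."""
    status = {}
    for table, old_loc in old_pointers.items():
        new_loc = new_pointers.get(table)
        if new_loc is None:
            status[table] = "pending"    # not yet registered in the new catalog
        elif new_loc == old_loc:
            status[table] = "in_sync"    # both catalogs point at the same file
        else:
            status[table] = "diverged"   # pointers differ; one side has new commits
    return status
```

Under the write-to-one rule, a diverged table is expected once writers move to the new catalog. A diverged table whose writers have not moved yet is a sign that something is still committing through the old catalog.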

LakeOps multi-catalog federation during transition

This dual-catalog phase is where LakeOps delivers the most immediate value. LakeOps connects to multiple catalogs simultaneously — you can run both old and new in parallel with unified observability. The moment you connect the new catalog, LakeOps classifies every table (Critical, Warning, Healthy) and surfaces problems. You get a single view of every table across both the source and target catalog with health status, size, record count, and last-modified timestamp. No guessing which tables have been migrated and which are still pending. No blind spots during transition.

Migration sequencing

Do not migrate everything at once. Sequence by domain and criticality:

1. Pilot: one non-critical namespace. Validate end-to-end: registration, engine configuration, query results, write path, compaction, snapshot expiration.
2. Read-heavy tables: tables consumed by BI and reporting but rarely written. These are low-risk because write-path conflicts are unlikely.
3. Write-heavy tables: ETL output tables, streaming sinks. These require a write freeze or coordinated cutover because the metadata pointer diverges on every commit.
4. High-governance tables: tables with strict access control, audit requirements, or regulatory sensitivity. Migrate last, after your access control model in the new catalog is fully validated.
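The sequencing above can be expressed as a small sort over a table list. This is only a sketch: the pilot, write_heavy, and high_governance flags are hypothetical labels you would assign yourself, not properties any catalog exposes.

```python
def migration_wave(table):
    """Return the wave number (1 to 4) for a table description dict."""
    if table.get("pilot"):
        return 1
    if table.get("high_governance"):
        return 4
    if table.get("write_heavy"):
        return 3
    return 2  # read-heavy by default


def plan_waves(tables):
    """Order tables by wave, then by name for a stable plan."""
    return sorted(tables, key=lambda t: (migration_wave(t), t["name"]))
```

A table that is both write-heavy and high-governance lands in wave 4 here, on the assumption that access control validation is the longer dependency.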

Write-freeze cutover for write-heavy tables

For tables with continuous writes (streaming ingestion, frequent ETL), the safest cutover pattern is:

1. Pause writers: stop ingestion jobs, streaming sinks, and ETL pipelines that write to the table.
2. Snapshot the current state: verify the latest metadata location in the source catalog.
3. Register in the new catalog: point the new catalog at the latest metadata file.
4. Validate: row count, schema, snapshot count, sample queries.
5. Reconfigure writers: point all write-path jobs at the new catalog.
6. Resume writers: restart ingestion. From this point, all commits go to the new catalog only.
7. Keep old catalog entry read-only: leave it as a fallback for one to two weeks before decommissioning.

The write-freeze window is typically minutes, not hours. Registration and validation are fast because no data moves. Reads continue throughout; only writers pause. LakeOps monitors both catalogs during this window, so you can verify parity before, during, and after cutover.
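The cutover steps can be sketched as one function per table. The catalog objects are assumed to expose load_table and register_table in the style of PyIceberg; pause_writers, resume_writers, and validate are hypothetical hooks into your own orchestration, not part of any library.

```python
def cutover_table(table_id, source, target, pause_writers, resume_writers, validate):
    """Run the write-freeze cutover for one table and return its metadata location."""
    pause_writers(table_id)
    try:
        # With writers paused, the source pointer is stable.
        location = source.load_table(table_id).metadata_location
        target.register_table(table_id, location)
        if not validate(table_id):
            raise RuntimeError(f"validation failed for {table_id}")
    except Exception:
        # Fall back: writers resume against the source catalog. Any entry
        # already created in the target is left in place for inspection.
        resume_writers(table_id, catalog="source")
        raise
    resume_writers(table_id, catalog="target")
    return location
```

Keeping the failure path explicit means a failed validation never leaves writers paused, and never leaves them pointed at a catalog that has not been verified.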

Data stays in place — only metadata moves

This point bears repeating because it is the most common misconception about catalog migration. No data files are copied, moved, or rewritten during an in-place catalog migration. The Parquet files, Avro manifest files, and metadata JSON files all stay exactly where they are in object storage. The only thing that changes is which catalog service holds the pointer to the current metadata.json.

This means:

Storage costs do not increase. You are not duplicating data. Both catalogs can point to the same metadata file simultaneously during transition.
Migration speed is independent of table size. A 10 TB table migrates just as fast as a 10 MB table — the operation is a single API call per table.
Table history is fully preserved. Every snapshot, every schema evolution step, every partition spec change is retained because the metadata files are untouched.
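Rollback is straightforward. The old catalog still holds its pointer to the metadata file it had at registration time, so switching engine configurations back restores the previous state. If writers have already committed through the new catalog, the old pointer is stale, and rolling back without first repointing the old catalog at the latest metadata file discards those commits.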

The one prerequisite: the new catalog must have storage credentials that can access every storage path referenced by your tables. For Polaris, this means configuring the catalog's storage integration with IAM credentials or role assumption. For Glue, the Glue service role needs S3 access to those paths. For Nessie, the engine's storage credentials must reach the same paths.

Testing and validation

Validation is not optional. It is the difference between a migration and a data incident. Run these checks for every table after registration.

Row count verification

sql

-- Compare row counts between old and new catalog
SELECT
    'old_catalog' AS source, count(*) AS row_count
FROM old_catalog.analytics.orders
UNION ALL
SELECT
    'new_catalog' AS source, count(*) AS row_count
FROM new_catalog.analytics.orders;

Row counts must match exactly. Any discrepancy means the metadata file was not correctly resolved or the catalogs are pointing to different snapshots.

Schema comparison

sql

-- Verify schema matches
DESCRIBE EXTENDED old_catalog.analytics.orders;
DESCRIBE EXTENDED new_catalog.analytics.orders;

Check column names, types, nullability, and field IDs. Iceberg field IDs are embedded in the metadata file and must match — if they do not, schema evolution operations in the new catalog will produce corrupted data.
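Field IDs may not be visible in DESCRIBE output, so a programmatic comparison can be useful. The sketch below assumes schema objects shaped like PyIceberg's, where schema.fields yields entries with field_id, name, field_type, and required attributes. It only compares top-level fields, and the function names are hypothetical.

```python
def field_signature(schema):
    """Top-level (id, name, type, required) tuples for a PyIceberg-style schema."""
    return sorted(
        (f.field_id, f.name, str(f.field_type), f.required) for f in schema.fields
    )


def field_id_mismatches(old_schema, new_schema):
    """Return field tuples present in one schema but not the other."""
    old = set(field_signature(old_schema))
    new = set(field_signature(new_schema))
    return sorted(old ^ new)
```

An empty result means the top-level fields agree on ID, name, type, and nullability. Nested struct fields would need a recursive walk that this sketch leaves out.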

Snapshot and history validation

sql

-- Compare snapshot lineage
SELECT snapshot_id, committed_at, operation
FROM new_catalog.analytics.orders.snapshots
ORDER BY committed_at DESC
LIMIT 10;

-- Verify partition spec
SELECT * FROM new_catalog.analytics.orders.partitions;

-- Verify manifest inventory
SELECT count(*) AS manifest_count
FROM new_catalog.analytics.orders.manifests;

Business-critical query replay

Run a representative set of production queries against both catalogs and compare outputs. Do not stop at aggregates — compare row-level results on business-critical reports. If your migration involves a query engine change (Hive to Trino, for example), account for semantic differences in SQL dialect and data type casting.

Automated validation script

python

def validate_table(old_catalog, new_catalog, table_id):
    old_table = old_catalog.load_table(table_id)
    new_table = new_catalog.load_table(table_id)

    checks = {
        "metadata_location_match": (
            old_table.metadata_location == new_table.metadata_location
        ),
        "schema_match": old_table.schema() == new_table.schema(),
        "snapshot_count_match": (
            len(old_table.metadata.snapshots) == len(new_table.metadata.snapshots)
        ),
        "current_snapshot_match": (
            old_table.metadata.current_snapshot_id
            == new_table.metadata.current_snapshot_id
        ),
        "partition_spec_match": (
            old_table.spec() == new_table.spec()
        ),
    }

    all_passed = all(checks.values())
    status = "PASS" if all_passed else "FAIL"
    print(f"{status}: {table_id}")
    for check, result in checks.items():
        if not result:
            print(f"  FAILED: {check}")

    return all_passed

# Run validation across all migrated tables
results = [validate_table(hms_catalog, rest_catalog, t["identifier"]) for t in all_tables]
print(f"\nPassed: {sum(results)}/{len(results)}")

Post-migration operations

Migration day is not the finish line — it is the starting line. The catalog change is complete, but the real operational challenge begins now. Tables that lived under HMS rarely received proper maintenance, and the new catalog does not change that. Compaction, snapshot retention, manifest optimization, orphan cleanup, and statistics generation still need to happen. This is where most teams stumble after a successful migration.

Initial compaction and sort order optimization

Tables migrated from HMS often carry years of accumulated small files, suboptimal sort orders, and missing partition-level statistics. The new catalog inherits whatever state the table was in — it does not magically fix physical layout problems. After migration, tables frequently need an initial round of compaction to consolidate small files, sort order optimization to align with actual query patterns, and statistics generation (Puffin files) to enable effective predicate pushdown. For a deep dive on statistics, see the Iceberg Puffin statistics guide.
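For teams running maintenance through Spark, an initial compaction pass can be issued with Iceberg's rewrite_data_files procedure. The helper below is a hypothetical sketch that only builds the CALL statement; the order_date column in the usage line is an assumed example, not something taken from your schema.

```python
def compaction_sql(catalog, table, sort_order=None):
    """Build a CALL for the rewrite_data_files Spark procedure."""
    args = [f"table => '{table}'"]
    if sort_order:
        # Sorting while compacting rewrites files in the requested order.
        args += ["strategy => 'sort'", f"sort_order => '{sort_order}'"]
    return f"CALL {catalog}.system.rewrite_data_files({', '.join(args)})"


# Hypothetical sort column, for illustration only.
print(compaction_sql("rest_catalog", "analytics.orders", "order_date DESC NULLS LAST"))
```

Passing the result to spark.sql runs the rewrite. Without a sort order the procedure falls back to its default bin-packing behavior.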

Snapshot expiration and orphan cleanup

HMS environments often let snapshot history accumulate indefinitely because there was no easy way to manage retention across hundreds of tables. After migration, establish retention policies immediately — a common default is 7 days of snapshot history for operational tables and 30 days for audit-sensitive ones. Run orphan file cleanup to reclaim storage from data files that are no longer referenced by any snapshot. These operations are catalog-agnostic; they work the same way regardless of whether you migrated to Polaris, Glue, or Nessie.
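A retention policy like the one described can be turned into Spark procedure calls. This sketch builds expire_snapshots and remove_orphan_files statements with a shared cutoff; retention_sql is a hypothetical helper, and it assumes the Spark session time zone is UTC so the timestamp literal means what the code intends.

```python
from datetime import datetime, timedelta, timezone


def retention_sql(catalog, table, retain_days, now=None):
    """Build expire_snapshots and remove_orphan_files CALLs for a retention window."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=retain_days)).strftime("%Y-%m-%d %H:%M:%S")
    return [
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{cutoff}')",
        f"CALL {catalog}.system.remove_orphan_files("
        f"table => '{table}', older_than => TIMESTAMP '{cutoff}')",
    ]
```

Running expiration before orphan cleanup matches the order in which files become unreferenced. Keep the orphan cutoff comfortably older than any in-flight write so that files from uncommitted jobs are not removed.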

Autonomous maintenance with LakeOps

After migration, LakeOps handles post-migration optimization autonomously. Tables that need initial compaction, sort order optimization, and statistics generation are identified and maintained without manual intervention. LakeOps applies the same policies, same maintenance, same observability regardless of which catalog you chose — catalog-agnostic operations that work identically across Glue, REST/Polaris, Nessie, and S3 Tables.

Maintenance runs as sequenced pipelines: compaction first, then manifest optimization, then snapshot expiration, then orphan cleanup. A purpose-built Rust engine runs these operations 95 percent faster than equivalent Spark maintenance jobs. Query-aware sort analyzes cross-engine telemetry to determine which columns queries filter on, then reorders data files for optimal predicate pushdown. See autonomous Iceberg table maintenance for the full operational model.

LakeOps catalog connection walkthrough — connecting REST, Glue, and S3 Tables catalogs in minutes.

Rollback strategy

Every production migration needs a rollback plan. Iceberg catalog migration makes rollback straightforward because the operation is metadata-only.

Keep old catalog entries during transition

Use the register command (not migrate) when moving tables. The register command creates entries in the new catalog while leaving entries in the old catalog intact. This means both catalogs point to the same metadata files simultaneously — a cost-free safety net.

bash

# register = tables exist in BOTH catalogs (safe)
java -jar iceberg-catalog-migrator-cli.jar register \
    --source-catalog-type HIVE \
    --target-catalog-type REST ...

# migrate = tables are REMOVED from source after migration (no rollback)
java -jar iceberg-catalog-migrator-cli.jar migrate \
    --source-catalog-type HIVE \
    --target-catalog-type REST ...

Only use migrate (which removes source entries) after you have validated every table in the new catalog and confirmed that all engines have been reconfigured. Recommended minimum transition period: two weeks with the old catalog entries preserved as read-only fallback.

Rollback procedure

If validation fails or you discover issues after cutover:

1. Revert engine configurations: point engines back to the old catalog. This is a configuration change, not a data operation.
2. Stop writes to the new catalog: if any writes went to the new catalog during the validation period, the old catalog will not have those commits. You may need to register the latest metadata file from the new catalog back into the old one.
3. Investigate: check whether the issue is catalog-level (permissions, connectivity, namespace mapping) or data-level (schema mismatch, corrupt metadata). Most rollbacks are caused by permission or namespace configuration errors, not data problems.
4. Re-attempt: fix the root cause and re-register. Registration only writes a pointer to an existing metadata file and never touches data, so re-running it is safe. If a table entry already exists in the target catalog, the call may be rejected as a duplicate; in that case skip the table or remove the stale entry without purging its files before retrying.

Post-cutover divergence

Once writes begin going to the new catalog, the metadata pointers in the two catalogs diverge — the new catalog's metadata_location advances with each commit while the old catalog's remains frozen at the pre-cutover state. After this point, rollback to the old catalog loses any writes committed to the new catalog. This is why the write-freeze cutover pattern exists: it minimizes the window where divergence can occur.
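
Divergence can be detected before a rollback decision by comparing the metadata pointer each catalog holds for every table. The sketch below assumes you have already collected a `table -> metadata_location` mapping from each catalog (for example, from the inventory step); `compare_pointers` and the sample paths are hypothetical.

```python
def compare_pointers(old_locations, new_locations):
    """Group tables by how their metadata pointers relate across two catalogs."""
    report = {"in_sync": [], "diverged": [], "only_in_old": [], "only_in_new": []}
    for table in sorted(set(old_locations) | set(new_locations)):
        old = old_locations.get(table)
        new = new_locations.get(table)
        if old is None:
            report["only_in_new"].append(table)
        elif new is None:
            report["only_in_old"].append(table)
        elif old == new:
            report["in_sync"].append(table)
        else:
            # The pointers differ: one catalog has commits the other lacks.
            report["diverged"].append(table)
    return report


old = {
    "analytics.events": "s3://bucket-a/events/metadata/00012.metadata.json",
    "analytics.users": "s3://bucket-a/users/metadata/00003.metadata.json",
}
new = {
    "analytics.events": "s3://bucket-a/events/metadata/00014.metadata.json",
    "analytics.users": "s3://bucket-a/users/metadata/00003.metadata.json",
}
report = compare_pointers(old, new)
print(report["diverged"])
```

Tables in the `diverged` group are the ones where reverting engines to the old catalog would drop commits unless the newer metadata file is registered back first.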

Migrating non-Iceberg tables alongside catalog migration

Many HMS instances contain a mix of Iceberg tables and legacy Hive/Parquet tables. A complete migration often includes converting these legacy tables to Iceberg format as part of the same program. For Parquet tables already on S3, the Spark migrate procedure converts them in place — no data rewrite — by reading Parquet footers and generating Iceberg metadata. For teams migrating from Snowflake-native or Databricks Delta tables, see the Snowflake to Iceberg and Databricks to Iceberg guides.

sql

-- Convert a Hive/Parquet table to Iceberg in place
CALL rest_catalog.system.migrate(
    table => 'analytics.legacy_events'
);

-- Or create a snapshot (non-destructive: original table preserved)
CALL rest_catalog.system.snapshot(
    source_table => 'hms_catalog.analytics.legacy_events',
    table => 'rest_catalog.analytics.legacy_events'
);

Common pitfalls

Catalog migration is conceptually simple — register a pointer — but production environments have edges. Here are the pitfalls that catch teams in practice.

Namespace mapping

HMS databases do not always map cleanly to target catalog namespaces. HMS allows characters in database names that some REST catalogs reject. Glue supports only single-level namespaces: if your source uses multi-level namespaces, or your HMS database names rely on dot-separated conventions to imply a hierarchy, you need a flattening strategy. Define your namespace mapping before registering the first table, and document it. Inconsistent mapping across engines causes tables to become invisible in some query contexts.
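
A flattening strategy can be as simple as a deterministic function plus a collision check. The sketch below is one possible convention, not a requirement of any catalog: it lowercases names, joins levels with an underscore, and replaces any character outside `a-z`, `0-9`, and `_`. The allowed character set is a conservative assumption, and `flatten_namespace` and `build_mapping` are hypothetical helpers.

```python
import re


def flatten_namespace(levels, separator="_"):
    """Collapse a multi-level namespace into one conservative single-level name."""
    joined = separator.join(levels).lower()
    return re.sub(r"[^a-z0-9_]", "_", joined)


def build_mapping(namespaces):
    """Map each source namespace to a flat name, failing loudly on collisions."""
    mapping = {}
    owners = {}
    for levels in namespaces:
        target = flatten_namespace(levels)
        if target in owners and owners[target] != tuple(levels):
            raise ValueError(
                f"{'.'.join(levels)} and {'.'.join(owners[target])} both map to {target}"
            )
        owners[target] = tuple(levels)
        mapping[".".join(levels)] = target
    return mapping


mapping = build_mapping([("Analytics", "events-raw"), ("finance",)])
print(mapping)
```

The collision check matters more than the exact naming rule: two source namespaces that silently flatten to the same target name would merge their tables in the new catalog.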

Permission translation

Access control does not migrate automatically. If you had Ranger policies on HMS, those policies do not transfer to Polaris RBAC, Glue IAM, or Nessie (where access control is configured separately, often through an external policy engine such as OPA). Map your existing permission model to the target catalog's authorization framework before cutover. This is often the most time-consuming part of migration, more so than the table registration itself.

Statistics loss

Some catalog migrations lose table-level and column-level statistics. HMS stores statistics in its own metadata tables; Iceberg stores statistics in Puffin files referenced from table metadata. When you register a table in a new catalog, the Iceberg-level statistics (stored in Puffin files) are preserved because they are part of the metadata tree. But catalog-level statistics (stored in HMS) may not transfer. If your query engine relies on catalog-provided statistics for query planning, you may need to re-analyze tables after migration. LakeOps can regenerate statistics autonomously as part of its post-migration optimization pass.

Storage credential scope

The new catalog must have credentials that can access every storage path referenced by your tables. This seems obvious but causes failures when tables span multiple buckets, regions, or storage accounts. Audit your metadata file locations before migration — if tables reference paths in s3://bucket-a/, s3://bucket-b/, and s3://bucket-c/, your new catalog needs access to all three. For Polaris, configure multiple storage integrations with appropriate allowedLocations. For Glue, ensure the service role's IAM policy covers all relevant buckets.
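
The audit described above can be sketched in a few lines: collect the distinct storage roots from the inventory and list every location that no configured prefix covers. This only inspects metadata file locations; data files can live under other paths (for example, when a table sets a custom data location), so treat the result as a lower bound. `storage_roots` and `uncovered_locations` are hypothetical helpers, and the sample paths are made up.

```python
from urllib.parse import urlparse


def storage_roots(locations):
    """Return the distinct scheme://bucket/ roots referenced by the locations."""
    roots = set()
    for location in locations:
        parsed = urlparse(location)
        roots.add(f"{parsed.scheme}://{parsed.netloc}/")
    return roots


def uncovered_locations(locations, allowed_prefixes):
    """Return locations that none of the new catalog's allowed prefixes cover."""
    return sorted(
        location
        for location in locations
        if not any(location.startswith(prefix) for prefix in allowed_prefixes)
    )


locations = [
    "s3://bucket-a/warehouse/events/metadata/00012.metadata.json",
    "s3://bucket-b/warehouse/users/metadata/00003.metadata.json",
]
print(sorted(storage_roots(locations)))
print(uncovered_locations(locations, ["s3://bucket-a/"]))
```

An empty result from `uncovered_locations` is a reasonable precondition for starting registration.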

Concurrent write conflicts during transition

If two catalogs both believe they own write authority for a table, concurrent commits create divergent metadata branches that cannot be automatically reconciled. This is the single most dangerous migration failure mode. The prevention is simple: during transition, exactly one catalog owns writes per table. Use the dual-catalog access pattern described above — read from both, write to one.
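
The "exactly one writer" rule can be made explicit as a small ownership record that cutover tooling consults before enabling writes. The class below is an in-memory illustration only: `WriteOwnership` is a hypothetical name, and a real deployment would need a shared, durable store plus enforcement in engine configuration or catalog permissions.

```python
class WriteOwnership:
    """Track which single catalog is allowed to accept writes for each table."""

    def __init__(self):
        self._owner = {}

    def assign(self, table, catalog):
        """Record the initial write owner; refuse to create a second owner."""
        current = self._owner.get(table)
        if current is not None and current != catalog:
            raise ValueError(f"{table} is already owned by {current}")
        self._owner[table] = catalog

    def transfer(self, table, from_catalog, to_catalog):
        """Move write authority at cutover, checking the expected current owner."""
        if self._owner.get(table) != from_catalog:
            raise ValueError(f"{table} is not owned by {from_catalog}")
        self._owner[table] = to_catalog

    def can_write(self, table, catalog):
        return self._owner.get(table) == catalog


ownership = WriteOwnership()
ownership.assign("analytics.events", "hms")
ownership.transfer("analytics.events", "hms", "rest")
print(ownership.can_write("analytics.events", "rest"))
print(ownership.can_write("analytics.events", "hms"))
```

Making the transfer an explicit step gives the write-freeze window a clear start and end per table.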

Hive table format confusion

Not every table in HMS is an Iceberg table. HMS also hosts Hive SerDe tables, external Parquet tables, and other formats. The register_table procedure only works for existing Iceberg tables — it expects a valid metadata.json file. For non-Iceberg tables, use the Spark migrate or snapshot procedures to convert them to Iceberg first, then register in the new catalog. Attempting to register a non-Iceberg table's metadata path will fail because no valid Iceberg metadata exists at that location.
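
During inventory, Iceberg tables can usually be told apart from other HMS tables by their table parameters: Iceberg's Hive catalog records a `table_type` of `ICEBERG` and a `metadata_location` on each table it manages. The sketch below assumes you have already fetched each table's parameters as a dictionary; `classify_hms_table` and the two labels are hypothetical.

```python
def classify_hms_table(parameters):
    """Return 'register' for Iceberg tables, 'convert' for everything else."""
    table_type = parameters.get("table_type", "").lower()
    if table_type == "iceberg" and parameters.get("metadata_location"):
        return "register"
    # Hive SerDe tables, external Parquet tables, and anything without a
    # metadata pointer need migrate or snapshot before they can be registered.
    return "convert"


iceberg_params = {
    "table_type": "ICEBERG",
    "metadata_location": "s3://bucket-a/events/metadata/00012.metadata.json",
}
hive_params = {"EXTERNAL": "TRUE"}
print(classify_hms_table(iceberg_params))
print(classify_hms_table(hive_params))
```

Splitting the inventory this way up front keeps the registration batch free of tables that would fail for lack of valid Iceberg metadata.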

Metadata file access after HMS decommission

After decommissioning HMS, ensure the metadata files that HMS referenced remain accessible. Iceberg tables store their complete history in metadata files on object storage — but if your HMS instance had specific IAM roles or network configurations that granted access to those paths, removing HMS might inadvertently remove those access grants. Verify that the new catalog's storage credentials are fully independent of HMS infrastructure.

Migration checklist

Before starting, work through this checklist:

Inventory complete — every Iceberg table identified with current metadata_location, schema, snapshot count, and storage paths.
Namespace mapping defined — HMS databases mapped to target catalog namespaces with a documented naming convention.
Storage credentials verified — the new catalog can access every S3/GCS/ADLS path referenced by table metadata.
Access control model designed — Ranger/HMS permissions translated to the target catalog's authorization framework (Polaris RBAC, Glue IAM + Lake Formation, Nessie + OPA).
Engine configuration templates ready — Spark, Trino, Flink, and other engine configurations tested against the new catalog in a staging environment.
Validation queries prepared — row count, schema comparison, snapshot lineage, and business-critical query outputs ready to run.
Rollback plan documented — using register (not migrate), old catalog entries preserved, engine configuration revert procedure documented.
Migration sequence planned — pilot namespace identified, tables sequenced by criticality, write-freeze windows scheduled for high-write tables.
Operations layer connected — LakeOps or equivalent monitoring wired to both old and new catalogs for health visibility during transition.
Communication plan — downstream consumers notified of cutover windows and any endpoint changes.

Multi-catalog management: the long-term approach

In practice, most enterprises do not end up with a single catalog. AWS workloads use Glue. Teams experimenting with branching adopt Nessie. Legacy Hive deployments still run HMS. Databricks estates use Unity Catalog. A full catalog migration rarely means "move everything to one catalog" — it means "adopt a better catalog for new workloads while old catalogs continue serving existing ones."

This is the reality that most migration guides ignore. You do not graduate from one catalog to another — you graduate from single-catalog thinking to multi-catalog operations. The migration you run today may consolidate HMS into Polaris, but next quarter a new team adopts Glue for their AWS-native workload, and the quarter after that a Databricks deployment brings Unity Catalog into the mix.

The teams that succeed treat catalog migration as the beginning of a multi-catalog management practice, not the end of a technical project. They invest in a catalog-agnostic operational layer that provides unified observability and maintenance regardless of how many catalogs are in play.

LakeOps is built for this reality. It connects to all of these catalogs simultaneously — Glue, HMS, REST catalogs (Polaris, Nessie, Gravitino, Lakekeeper), S3 Tables, and Unity Catalog — and provides a unified operational layer across the entire estate. Lake-wide policies for compaction thresholds, retention windows, and cleanup rules are defined once and scoped through a table → namespace → catalog hierarchy. Multi-engine routing registers Trino, Spark, DuckDB, Athena, and Snowflake in a single engine directory. Health tiers, Insights alerts for manifest bloat, snapshot accumulation, small-file proliferation, and partition skew apply identically across every catalog. For teams running a managed Iceberg practice, LakeOps is the control plane that makes multi-catalog estates manageable without requiring every team to become a catalog operations expert.

Summary

Apache Iceberg catalog migration is a metadata-only operation. Your Parquet files, manifest lists, and metadata JSON files stay exactly where they are. The migration registers existing metadata file locations in a new catalog — REST (Polaris, Gravitino, Lakekeeper), AWS Glue, Nessie, or Unity Catalog — giving you credential vending, fine-grained access control, REST protocol interoperability, and the operational capabilities HMS was never designed to provide.

The migration itself is straightforward — register_table calls, engine configuration updates, and validation queries. The real challenge is what comes after: compaction, snapshot retention, manifest optimization, statistics generation, and cross-engine health monitoring across whatever catalog mix you end up running. Those operational concerns do not come from the catalog. They require a dedicated operational layer that works across every catalog in your estate. Deploy that layer before table twenty, not after table two hundred. The teams that succeed are the ones who treat catalog migration as the beginning of a modern lakehouse practice — not the end of a technical project.
