Iceberg Multi-Table Transactions: Cross-Table Atomicity for Production Lakehouses


Apache Iceberg multi-table transactions are the last major gap between open lakehouses and traditional warehouses. Single-table ACID works flawlessly — every append, overwrite, compaction, and schema change commits atomically through optimistic concurrency control. But production pipelines rarely touch just one table.

The problem appears the moment a workload spans multiple tables. A star schema ETL that updates a fact table and three dimension tables. A CDC pipeline that applies changes across a customer table and an orders table in a single logical transaction. A data mesh team that publishes consistent views across multiple domain tables simultaneously. In each case, the question is the same: what happens if the commit succeeds on table A but fails on table B?

The answer, in most Iceberg deployments today, is partial writes — a state that traditional data warehouses eliminated decades ago with BEGIN/COMMIT semantics. The open lakehouse gained schema evolution, partition evolution, time travel, and engine independence. But it traded away the cross-table atomicity that warehouse users take for granted.

This gap is closing. The Apache Iceberg REST Catalog specification defines a multi-table commit endpoint, with active development tracked in GitHub issue #10617. The emerging four-pillar proposal — stateless coordination, TxnSnapshots, isolation levels, and retryable subtransactions — has been demonstrated on Apache Polaris. Catalogs like Project Nessie and Apache Gravitino (incubating) already implement variants of atomic multi-table commits.

This guide covers why multi-table transactions matter in production, how the current proposal works, which catalogs support it, and how teams operate cross-table workloads today.

Why single-table atomicity is not enough

Iceberg's single-table transaction model guarantees that each individual commit is atomic — the table moves from one consistent snapshot to another. But production data pipelines rarely operate on a single table in isolation.

Star schema ETL is the most common case. A nightly batch job updates the fact table and all related dimension tables. If the fact table commit succeeds but a dimension table commit fails — due to a concurrent compaction or a transient catalog error — downstream queries join the new fact records against stale dimensions. The result is silently incorrect data. No error is thrown, no alert fires. Analysts get wrong numbers.

Multi-table CDC presents the same problem with higher stakes. A CDC pipeline from a relational source applies changes across multiple Iceberg tables reflecting the source schema. If the source transaction updated both customers and orders (e.g., a customer merge that consolidates order history), applying the changes non-atomically to two Iceberg tables means a window exists where the customer is consolidated but the orders still reference the old customer ID.

Pipeline orchestration compounds the issue. Modern lakehouse architectures use multi-stage pipelines where stage N writes to multiple tables consumed by stage N+1. If stage N partially commits, stage N+1 sees inconsistent inputs. Airflow retry semantics help — but they introduce latency and cannot guarantee that a retry produces the same result if source data changed between attempts.

Snapshot isolation for reads is a related concern. When an analyst queries five tables that were all updated in the same logical batch, they expect to see the same consistent version across all five. Without multi-table snapshot coordination, they might read three tables at the new version and two at the old version — even if all five commits eventually succeed.

These are not edge cases. They are the default behavior in every multi-table Iceberg workload without explicit coordination.

LakeOps is a control plane for Apache Iceberg lakehouses that coordinates maintenance and observability across entire catalogs — sequencing compaction, snapshot expiration, and cleanup across related tables to prevent cross-table degradation, commit conflicts, and the silent inconsistency that multi-table pipelines are prone to.

How Iceberg transactions work today

To understand the multi-table proposal, you need to understand how single-table commits work in the REST Catalog protocol.

Every write operation produces a CommitTableRequest sent to the catalog's POST /v1/{prefix}/namespaces/{namespace}/tables/{table} endpoint. The request contains two components:

Requirements: assertions about the table's current state. For example, assert-ref-snapshot-id verifies that a named ref (typically the main branch) still points to the snapshot the writer read at planning time. If the assertion fails, the commit is rejected because another writer committed first.

Updates: the list of changes to apply, such as AddSnapshotUpdate, SetSnapshotRefUpdate, and AddSchemaUpdate.

The catalog validates all requirements, then atomically applies all updates. This is the compare-and-swap: requirements are the "compare," updates are the "swap." If validation passes, the metadata pointer advances. If it fails, the catalog responds with 409 Conflict, which the Java client surfaces as CommitFailedException, and the writer must refresh its view of the table and retry.
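The compare-and-swap can be illustrated with a minimal in-memory sketch. This is not the REST Catalog implementation: the class SingleTableCatalog and its methods are hypothetical names, a table is reduced to the snapshot ID its main ref points to, and the synchronized keyword stands in for whatever atomicity the real backing store provides.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a catalog, for illustration only.
final class SingleTableCatalog {
    // table name -> snapshot ID that the table's main ref currently points to
    private final Map<String, Long> mainRef = new HashMap<>();

    synchronized void createTable(String table, long initialSnapshotId) {
        mainRef.put(table, initialSnapshotId);
    }

    synchronized long currentSnapshot(String table) {
        return mainRef.get(table);
    }

    /**
     * Compare-and-swap on one table. The expected snapshot ID plays the role of
     * the requirement (the "compare"); the new snapshot ID plays the role of the
     * update (the "swap"). Returns false when the requirement does not hold.
     */
    synchronized boolean commit(String table, long expectedSnapshotId, long newSnapshotId) {
        Long current = mainRef.get(table);
        if (current == null || current != expectedSnapshotId) {
            return false; // another writer committed first; caller must refresh and retry
        }
        mainRef.put(table, newSnapshotId);
        return true;
    }
}
```

A second writer that planned against snapshot 1 after the first writer already moved the table to snapshot 2 gets false back, which corresponds to the rejected commit described above.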

java

// Single-table transaction in Iceberg Java API
Transaction txn = table.newTransaction();
txn.newAppend()
   .appendFile(dataFile1)
   .appendFile(dataFile2)
   .commit();  // stages within the transaction
txn.expireSnapshots()
   .expireOlderThan(cutoffMillis)
   .commit();  // stages within the transaction
txn.commitTransaction();  // atomic commit of ALL staged operations

This works perfectly for a single table. The problem is that the commitTransaction() call only spans operations within one table. There is no built-in mechanism to bundle commits across table A and table B into a single atomic unit.

The multi-table transaction proposal

The Iceberg community is actively developing multi-table transaction support (tracked in GitHub issue #10617). The design centers on the REST Catalog as the coordination layer — the only component in the architecture that already has a connection to every table.

The REST Catalog specification already defines the endpoint: POST /v1/{prefix}/transactions/commit. This endpoint accepts a CommitTransactionRequest whose table-changes field is a list of CommitTableRequest objects (UpdateTableRequest in the Java client), one per table participating in the transaction. Today this provides atomic multi-table commits, with all-or-nothing semantics across tables. The broader multi-table transaction semantics (isolation levels, consistent cross-table reads) are still being designed, as described below.

The proposal breaks down into four architectural pillars:

Stateless coordination

Rather than introducing a heavyweight transaction manager with locks and session state, the proposal delegates atomicity to the catalog's backing database. The catalog receives a batch of table commits, validates all requirements for all tables, and writes all metadata updates in a single database transaction. If any requirement fails on any table, the entire batch is rejected — zero tables are modified.

This is fundamentally simpler than distributed two-phase commit (2PC). There is no coordinator state to manage, no timeout to tune, no prepared-but-not-committed limbo state. The catalog's backing store (PostgreSQL, CockroachDB, DynamoDB, etc.) provides the atomicity guarantee through its own transaction mechanism.
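The validate-everything-then-apply-everything behavior can be sketched in a few lines. The names BatchCatalog and TableChange are hypothetical, tables are reduced to a single snapshot ID, and one synchronized method stands in for the single database transaction that a real catalog would run against its backing store.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One table's part of the batch: the snapshot the writer planned against
// and the snapshot it wants to publish. Hypothetical type for illustration.
final class TableChange {
    final String table;
    final long expectedSnapshotId;
    final long newSnapshotId;

    TableChange(String table, long expectedSnapshotId, long newSnapshotId) {
        this.table = table;
        this.expectedSnapshotId = expectedSnapshotId;
        this.newSnapshotId = newSnapshotId;
    }
}

// Hypothetical in-memory catalog; synchronized plays the role of the
// backing store's transaction boundary.
final class BatchCatalog {
    private final Map<String, Long> mainRef = new HashMap<>();

    synchronized void createTable(String table, long initialSnapshotId) {
        mainRef.put(table, initialSnapshotId);
    }

    synchronized long currentSnapshot(String table) {
        return mainRef.get(table);
    }

    /** All-or-nothing commit: returns false and changes nothing if any requirement fails. */
    synchronized boolean commitTransaction(List<TableChange> changes) {
        // Pass 1: validate every requirement before touching any table.
        for (TableChange c : changes) {
            Long current = mainRef.get(c.table);
            if (current == null || current != c.expectedSnapshotId) {
                return false;
            }
        }
        // Pass 2: every requirement held, so apply every update.
        for (TableChange c : changes) {
            mainRef.put(c.table, c.newSnapshotId);
        }
        return true;
    }
}
```

Because no state survives between calls, a rejected batch leaves nothing to clean up: the caller simply refreshes and submits a new batch.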

text

POST /v1/lakehouse/transactions/commit
{
  "table-changes": [
    {
      "identifier": {"namespace": ["analytics"], "name": "fact_orders"},
      "requirements": [{"type": "assert-ref-snapshot-id", ...}],
      "updates": [{"action": "add-snapshot", ...}, {"action": "set-snapshot-ref", ...}]
    },
    {
      "identifier": {"namespace": ["analytics"], "name": "dim_customers"},
      "requirements": [{"type": "assert-ref-snapshot-id", ...}],
      "updates": [{"action": "add-snapshot", ...}, {"action": "set-snapshot-ref", ...}]
    }
  ]
}

TxnSnapshots: consistent cross-table reads

Multi-table writes solve only half the problem. The other half is consistent reads — ensuring a query across multiple tables sees a logically consistent version of each.

TxnSnapshots are lightweight handles that represent the set of snapshot IDs that were produced by a single multi-table commit. A downstream consumer can request the catalog to resolve a TxnSnapshot into the specific snapshot ID for each table, guaranteeing that all reads reflect the same logical commit even if subsequent writes have advanced individual tables.

This is the Iceberg equivalent of a database's read-consistent snapshot isolation — extended across table boundaries.
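Since the TxnSnapshot design is still a proposal, the following is only a sketch of the idea rather than any published API. The class TxnSnapshotRegistry, its method names, and the use of a plain string as the handle are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry that remembers which snapshot IDs each
// multi-table commit produced. Not part of any Iceberg release.
final class TxnSnapshotRegistry {
    private final Map<String, Map<String, Long>> byHandle = new HashMap<>();
    private final Map<String, Long> latest = new HashMap<>();

    /** Records the per-table snapshot IDs produced by one multi-table commit. */
    synchronized void recordCommit(String handle, Map<String, Long> snapshotIds) {
        byHandle.put(handle, Map.copyOf(snapshotIds)); // immutable copy: the handle never changes
        latest.putAll(snapshotIds);
    }

    /** Simulates a later single-table write moving one table forward. */
    synchronized void advance(String table, long snapshotId) {
        latest.put(table, snapshotId);
    }

    /** Resolves a handle to the exact snapshot ID of every participating table. */
    synchronized Map<String, Long> resolve(String handle) {
        Map<String, Long> pinned = byHandle.get(handle);
        if (pinned == null) {
            throw new IllegalArgumentException("unknown handle: " + handle);
        }
        return pinned;
    }

    synchronized long latest(String table) {
        return latest.get(table);
    }
}
```

A reader that resolves the handle keeps seeing the snapshot IDs from that commit even after advance has moved an individual table, which is the property the text describes.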

Selectable isolation levels

Not every workload needs the same consistency guarantee. The proposal supports multiple isolation levels:

READ COMMITTED — queries see the latest committed data for each table independently. This is the default behavior today and remains valid for workloads that tolerate momentary cross-table inconsistency (e.g., append-only fact tables queried independently).

REPEATABLE READ — queries that begin with a TxnSnapshot handle see consistent versions across all participating tables for the duration of the read. Ideal for star schema queries, cross-table joins, and analytical dashboards.

SERIALIZABLE — the strongest level, where multi-table transactions are ordered globally. This requires pessimistic mechanisms and is primarily relevant for financial or regulatory workloads.
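How a reader might pick a snapshot under the first two levels can be sketched as follows. The enum, the class SnapshotChooser, and the two input maps are hypothetical; SERIALIZABLE is left out because it concerns the global ordering of writes rather than which snapshot a read resolves to.

```java
import java.util.Map;

// Hypothetical names; the proposal does not define a Java API for this.
enum IsolationLevel { READ_COMMITTED, REPEATABLE_READ }

final class SnapshotChooser {
    private SnapshotChooser() {}

    /**
     * latest: table -> most recently committed snapshot ID.
     * pinned: table -> snapshot ID resolved from a TxnSnapshot handle.
     */
    static long snapshotFor(String table, IsolationLevel level,
                            Map<String, Long> latest, Map<String, Long> pinned) {
        switch (level) {
            case READ_COMMITTED:
                // Each table is read independently at its newest snapshot.
                return latest.get(table);
            case REPEATABLE_READ:
                // Every table is read at the snapshot fixed by the handle.
                Long id = pinned.get(table);
                if (id == null) {
                    throw new IllegalStateException("table not covered by handle: " + table);
                }
                return id;
            default:
                throw new IllegalArgumentException("unsupported level: " + level);
        }
    }
}
```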

Retryable subtransactions

Optimistic concurrency means conflicts happen. In single-table mode, Iceberg handles this with automatic retries — read the new metadata, rebase the operation, try again. In multi-table mode, a conflict on any participating table would normally require rolling back and retrying the entire batch.

The proposal introduces subtransactions with retry semantics. Each table's commit within the larger transaction is a subtransaction. If a subtransaction fails due to OCC conflict, the coordinator can retry that specific subtransaction (re-reading the table's current metadata) without discarding the work done for other tables. Only if a subtransaction exceeds its retry budget does the entire multi-table transaction abort.

For pathological cases — tables under heavy concurrent write pressure — the proposal also supports escalation to pessimistic locking on specific tables. This prevents the livelock scenario where retries continuously fail because the table's metadata advances faster than the coordinator can converge.
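The retry budget can be sketched as a loop that rebases one table at a time. This is a simplification under stated assumptions: SubtransactionCoordinator and prepare are hypothetical names, "rebasing" is reduced to re-reading the table's current snapshot ID, and the final atomic commit of the prepared batch is omitted.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

// Hypothetical coordinator illustrating per-table retries with a budget.
final class SubtransactionCoordinator {
    private final int retryBudget;

    SubtransactionCoordinator(int retryBudget) {
        this.retryBudget = retryBudget;
    }

    /**
     * planned: table -> snapshot ID the writer planned against.
     * currentSnapshot: looks up a table's current snapshot ID.
     * Returns table -> base snapshot ID to commit against, or null when
     * any table exhausts its retry budget and the transaction must abort.
     */
    Map<String, Long> prepare(Map<String, Long> planned, ToLongFunction<String> currentSnapshot) {
        Map<String, Long> bases = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : planned.entrySet()) {
            long base = e.getValue();
            int attempts = 0;
            // A mismatch means another writer advanced this table: an OCC conflict.
            while (currentSnapshot.applyAsLong(e.getKey()) != base) {
                if (++attempts > retryBudget) {
                    return null; // budget exhausted: abort the whole transaction
                }
                // Retry only this table; work for the other tables is kept.
                base = currentSnapshot.applyAsLong(e.getKey());
            }
            bases.put(e.getKey(), base);
        }
        return bases;
    }
}
```

A table whose snapshot changes on every read never converges and returns null once the budget is spent, which is the livelock case that escalation to pessimistic locking is meant to address.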

Which catalogs support multi-table transactions

Not all catalogs are architecturally capable of multi-table atomicity. The dividing line — explored in detail in our Iceberg catalog comparison — is whether the catalog's backing store supports database transactions across multiple metadata records.

[Figure: Lakehouse control plane architecture, showing catalogs, engines, and the coordination layer that enables atomic operations across table boundaries.]

Apache Polaris (incubating) — Built as a REST-first catalog with a relational backing store. Multi-table commits map naturally to a single database transaction across the metadata rows for each participating table. Polaris is the primary target for the reference implementation.

Project Nessie — Implements multi-table atomicity natively through its Git-like versioning model. Every commit in Nessie is already a branch-level atomic operation that can span multiple tables. Multi-table transactions are essentially how Nessie always works — the REST transaction endpoint maps directly to a Nessie commit.

Apache Gravitino (incubating) — Added support for the commitTransaction endpoint (PR #10675) with a two-phase validate-then-commit approach: Phase 1 validates all requirements across all tables without modifying anything; Phase 2 commits all tables sequentially only if Phase 1 fully passes. This provides best-effort atomicity — no table is modified if any precondition fails. True cross-table rollback after a Phase 2 commit failure requires backend-level transaction support, which varies by storage layer.

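AWS Glue: Does not support multi-table transactions. Iceberg's Glue catalog integration commits each table through its own independent update call, so nothing groups updates to several tables into one atomic operation. The REST transaction endpoint is not available.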

Hive Metastore (HMS) — Limited by its MySQL/PostgreSQL-backed model where table metadata updates are individual operations. Cross-table atomicity would require custom extensions.

JDBC Catalog — Could theoretically support multi-table commits within a single JDBC transaction, but the current implementation commits tables independently.

The trend is clear: the ecosystem is moving toward REST-first catalogs that can support multi-table semantics. Teams choosing a catalog today should weigh multi-table transaction support as a key criterion. For a comprehensive comparison, see our guide on choosing the best Iceberg catalog.

Cross-table coordination in practice today

While the spec evolves, production teams still need cross-table consistency. Several patterns provide different levels of guarantee:

Orchestrator-level sequencing: Use Airflow, Dagster, or Prefect to sequence table writes and verify each commit before proceeding. If table B's commit fails, the orchestrator can roll table A back to a snapshot that was tagged before the run started. This provides operational atomicity but requires custom implementation for each pipeline, and readers can still observe table A's new data until the rollback lands.

Iceberg branching: Write all tables to a named branch (for example, staging), validate cross-table consistency, then fast-forward main to the staging branch on each table in rapid succession. The window of inconsistency is reduced to the time between fast-forward operations (typically seconds), but it is not eliminated.

Single-writer coordination: Funnel all writes for related tables through a single Spark/Flink job that commits them sequentially within a tight loop. If any commit fails, the job rolls the already-committed tables back to their previous snapshots. This works but creates a single point of failure.

Catalog-native transactions: Use Nessie or Polaris where multi-table atomicity is already available. This is the correct long-term solution but requires catalog migration for teams currently on Glue or HMS.

Each of these patterns has gaps. The orchestrator pattern leaves failure windows open. Branching requires careful branch management. Single-writer creates bottlenecks. Catalog-native transactions require the right catalog.
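The sequential commit with rollback that the orchestrator and single-writer patterns rely on can be sketched as follows. SequencedCommit and its Committer hook are hypothetical, and a table is again reduced to the snapshot ID its main ref points to. In a real pipeline the hook would call the engine or catalog, and the rollback would be a per-table rollback to the earlier snapshot.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

// Hypothetical helper: commits tables one by one and compensates on failure.
final class SequencedCommit {
    private SequencedCommit() {}

    /** Per-table commit hook; returns false when the commit is rejected. */
    interface Committer {
        boolean commit(String table, long newSnapshotId);
    }

    /**
     * currentRefs: table -> current snapshot ID (mutated in place).
     * newSnapshots: table -> snapshot ID to publish, in commit order.
     * Returns true if every table committed; otherwise restores the tables
     * that had already committed and returns false.
     */
    static boolean commitAll(Map<String, Long> currentRefs,
                             Map<String, Long> newSnapshots,
                             Committer committer) {
        Deque<Map.Entry<String, Long>> undo = new ArrayDeque<>();
        for (Map.Entry<String, Long> e : newSnapshots.entrySet()) {
            String table = e.getKey();
            long previous = currentRefs.get(table);
            if (!committer.commit(table, e.getValue())) {
                // Compensate in reverse order. Readers may already have seen
                // the earlier commits, so this is not true atomicity.
                while (!undo.isEmpty()) {
                    Map.Entry<String, Long> u = undo.pop();
                    currentRefs.put(u.getKey(), u.getValue());
                }
                return false;
            }
            currentRefs.put(table, e.getValue());
            undo.push(Map.entry(table, previous));
        }
        return true;
    }
}
```

The comment inside the failure branch marks the gap: between the first commit and the compensation, a reader can observe the partial state, which is why this only approximates what a catalog-level transaction provides.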

How LakeOps coordinates cross-table operations

Multi-table transaction support addresses the write path — ensuring that user-facing data mutations commit atomically. But there is an equally important maintenance path — ensuring that compaction, snapshot expiration, orphan cleanup, and manifest rewriting execute correctly across an entire catalog of tables without leaving any table in a degraded state.

[Figure: LakeOps lake events, a real-time feed of maintenance operations, commits, and health transitions across all tables, giving visibility into cross-table coordination and sequencing.]

[Figure: LakeOps Dashboard showing 30-day optimization activity, cost savings, and health distribution. Multi-table pipelines benefit from lake-wide visibility into which tables are degraded and need priority maintenance.]

LakeOps operates as a cross-table coordinator for the maintenance path. When managing hundreds of tables, the control plane maintains a global view of:

Table dependencies — which tables are written by the same pipeline, which share partitioning schemes, and which are queried together in cross-table joins.

Writer activity — which tables have active streaming writers, which partitions are being appended to, and when it is safe to run maintenance without triggering commit conflicts.

Health priority — which tables have the highest maintenance urgency based on file count, delete-to-data ratio, manifest bloat, or snapshot depth. LakeOps processes tables by priority, ensuring that the most degraded tables are maintained first.

[Figure: LakeOps Table Monitoring, showing lake-wide table health: file counts, manifest state, snapshot depth, and partition details. Priority-based maintenance ensures the most degraded tables across multi-table pipelines get attention first.]

Sequencing — for each table, maintenance follows the correct dependency order: expire snapshots → remove orphan files → compact data files → rewrite manifests → refresh statistics. For cross-table operations, LakeOps ensures that shared-pipeline tables are maintained in sequence, so no inconsistency window exists between a fact table and its dimensions.

This coordination prevents commit conflicts — LakeOps never schedules an operation that would collide with an active writer or another maintenance job on the same table.
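The combination of priority ordering, writer awareness, and a fixed step order can be sketched generically. This is not LakeOps code: TableHealth, MaintenancePlanner, and the urgency weighting are invented for illustration, and only the step order is taken from the description above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

// Hypothetical health summary for one table.
final class TableHealth {
    final String name;
    final int smallFileCount;
    final int snapshotDepth;

    TableHealth(String name, int smallFileCount, int snapshotDepth) {
        this.name = name;
        this.smallFileCount = smallFileCount;
        this.snapshotDepth = snapshotDepth;
    }

    // Made-up weighting; a real scorer would use more signals.
    int urgency() {
        return smallFileCount + 10 * snapshotDepth;
    }
}

final class MaintenancePlanner {
    private MaintenancePlanner() {}

    // Step order as described in the text.
    static final List<String> STEPS = List.of(
        "expire-snapshots", "remove-orphan-files", "compact-data-files",
        "rewrite-manifests", "refresh-statistics");

    /** Returns "table:step" entries, most urgent table first, skipping tables with active writers. */
    static List<String> plan(List<TableHealth> tables, Set<String> activeWriters) {
        List<TableHealth> eligible = new ArrayList<>();
        for (TableHealth t : tables) {
            if (!activeWriters.contains(t.name)) {
                eligible.add(t); // avoid colliding with a live writer
            }
        }
        eligible.sort(Comparator.comparingInt(TableHealth::urgency).reversed());
        List<String> plan = new ArrayList<>();
        for (TableHealth t : eligible) {
            for (String step : STEPS) {
                plan.add(t.name + ":" + step);
            }
        }
        return plan;
    }
}
```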

[Video: LakeOps platform walkthrough covering catalog coordination and conflict-free operations.]

Implications for multi-engine architectures

Multi-table transactions interact directly with the multi-engine nature of the open lakehouse. When Spark, Trino, Flink, and DuckDB all query the same tables, cross-table consistency becomes a cross-engine concern.

Consider a Flink job that continuously writes to raw_events and a Spark job that periodically transforms raw_events into fact_events and dim_users. Without multi-table transactions, there is no guarantee that a Trino analyst querying fact_events and dim_users sees a consistent transformation output. The Spark job might have committed fact_events but not yet committed dim_users.

With TxnSnapshots, the Spark job publishes a single transaction handle that represents the consistent output across both tables. Trino resolves this handle before querying, ensuring cross-table consistency regardless of when individual table commits completed.

For teams operating multi-engine lakehouses, LakeOps QueryFlux provides an additional coordination layer — routing queries to the appropriate engine based on workload type while ensuring that the underlying tables are maintained consistently across all reading and writing engines.

What this means for production teams today

Multi-table transactions in Iceberg are not a distant roadmap item. The REST endpoint exists in the spec. Polaris, Nessie, and Gravitino implement it. Spark and Flink are building client-side support. The question is when — not whether — cross-table atomicity becomes standard across the ecosystem.

For teams making decisions today:

Catalog choice matters more than ever. If cross-table consistency is a requirement, choose a REST-first catalog (Polaris, Nessie, Gravitino) over Glue or HMS. These catalogs also provide the governance capabilities — access control, audit logging, and policy enforcement — that production lakehouses need alongside transactional guarantees. The migration cost is real, but the alternative — building custom orchestration around per-table commits — is more expensive over the lifetime of the platform. For migration guidance, see our catalog migration guide.

Design pipelines for atomicity. Even before multi-table transactions are universally available, structure your pipelines so that related table writes originate from a single job. This makes the eventual adoption of multi-table commits a configuration change rather than an architectural rewrite.

Separate the write path from the maintenance path. User-facing writes need cross-table atomicity. Maintenance operations (compaction, expiration, cleanup) need cross-table coordination — sequencing, conflict avoidance, priority ordering. These are different problems with different solutions. Multi-table transactions solve the first. A control plane like LakeOps solves the second.

Monitor cross-table consistency. Even with atomic commits, operational issues (failed retries, timeout rollbacks, partial pipeline failures) can leave tables in inconsistent states. Build validation checks that compare snapshot timestamps across related tables. LakeOps's observability layer surfaces these inconsistencies as health alerts, giving teams visibility into cross-table drift before it affects downstream consumers.
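A validation check of the kind suggested under "Monitor cross-table consistency" can be as small as comparing the latest commit time of each related table. The class DriftCheck and the tolerance parameter are hypothetical, and the timestamps are assumed to have been read from each table's current snapshot beforehand.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Collections;
import java.util.Map;

// Hypothetical drift check across tables written by the same pipeline.
final class DriftCheck {
    private DriftCheck() {}

    /**
     * lastCommit: table -> commit time of its current snapshot.
     * Returns true when the oldest and newest commit times are within
     * the tolerance, meaning the tables plausibly belong to the same batch.
     */
    static boolean isConsistent(Map<String, Instant> lastCommit, Duration tolerance) {
        if (lastCommit.isEmpty()) {
            return true; // nothing to compare
        }
        Instant oldest = Collections.min(lastCommit.values());
        Instant newest = Collections.max(lastCommit.values());
        return Duration.between(oldest, newest).compareTo(tolerance) <= 0;
    }
}
```

Timestamps are only a heuristic: two tables can be close in time and still come from different batches, so a stricter check would compare a batch identifier stored in each snapshot's summary properties, if the pipeline writes one.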

The open lakehouse started as a bet on open formats over proprietary lock-in. Multi-table transactions close the last major capability gap against traditional warehouses. Combined with cross-table coordination from a control plane like LakeOps, production teams get warehouse-grade consistency without giving up the modularity, engine independence, and cost efficiency that made the open lakehouse worth building in the first place.