Apache Iceberg Commit Conflicts: Causes, Prevention, and Recovery

Commit conflicts are the single hardest operational challenge in multi-engine Apache Iceberg environments. A Flink pipeline streaming events, a Spark ETL backfilling historical data, and a compaction engine rewriting files — all targeting the same table, all competing for a single metadata pointer. When two of these writers race to commit, one wins and the other fails with CommitFailedException. At low concurrency, retries handle it transparently. At production scale — dozens of tables, multiple engines, continuous maintenance — commit conflicts become the dominant source of failed compaction jobs, stalled streaming pipelines, wasted compute, and on-call pages at 2 AM.

The root cause is structural: Iceberg's optimistic concurrency control (OCC) defers conflict detection to commit time, so every concurrent writer is a potential conflict. The solution is equally structural: a control plane like LakeOps that coordinates all operations — compaction, snapshot expiration, orphan cleanup, manifest rewriting — with full awareness of which engines are writing, which partitions are active, and when it is safe to run maintenance. LakeOps eliminates the most common source of commit conflicts — maintenance colliding with production writers — by never running an operation that would conflict with an active writer in the first place.

This guide explains why Iceberg's concurrency model produces conflicts, when they occur in practice, how to diagnose them from stack traces and logs, the strategies that prevent and recover from them, and how automated conflict-aware scheduling removes the problem entirely.

Why Iceberg conflicts happen: the optimistic concurrency model

Every Iceberg table has a single source of truth: a metadata file that points to the current snapshot. The snapshot references a manifest list, which references manifests, which reference data files. Every write operation — an append, an overwrite, a compaction, a delete — produces a new metadata file with a new snapshot. The commit is the act of atomically swapping the catalog's pointer from the old metadata file to the new one.

This is a compare-and-swap (CAS) operation. The writer reads the current metadata pointer, prepares its changes against that version, writes a new metadata file, and then asks the catalog to update the pointer — but only if the pointer still matches what the writer originally read. If another writer committed in between, the pointer has moved. The swap fails. This is optimistic concurrency control: rather than locking the table during the entire write phase, Iceberg allows every writer to proceed in isolation and detects collisions only at commit time.

The key properties of this model:

No locks during the write phase. Writers prepare data files, manifests, and metadata in isolation. Two Spark jobs, a Flink pipeline, and a compaction engine can all be writing simultaneously without blocking each other. Coordination only happens at commit time.
Conflicts are detected, not prevented. Unlike pessimistic locking (row-level locks, table-level locks), optimistic concurrency allows every writer to proceed and only rejects one if there is a genuine collision at commit time.
The commit is metadata-only. Data files are already written to object storage by the time the commit happens. The atomic swap updates only the catalog pointer. This means retrying a commit does not require rewriting data files — only refreshing metadata and re-validating.
The catalog is the arbiter. Whether it is a REST catalog (Polaris, Nessie, Lakekeeper), AWS Glue, or a Hive Metastore, the catalog provides the atomic CAS operation. The catalog's consistency guarantees and latency directly affect conflict behavior — slower catalogs widen the conflict window.

The commit sequence in detail

When a writer commits to an Iceberg table, the sequence is:

1.Read the current metadata. The writer loads the current metadata.json from the catalog, which identifies the current snapshot, manifest list, and all referenced data files.
2.Plan and execute the operation. The writer produces new data files (for appends/overwrites), delete files (for row-level deletes), or identifies files to remove (for compaction). All files are written to object storage.
3.Build new metadata. The writer creates a new snapshot that includes the changes — new manifest entries pointing to the new data files, updated statistics, and a new manifest list. It writes a new metadata.json that points to this snapshot.
4.Attempt the atomic swap. The writer asks the catalog to update the metadata pointer from the version it originally read to the new version. This is the CAS operation.
5.On success: The commit is complete. The new snapshot is now the current state of the table. Other readers and writers will see it on their next metadata refresh.
6.On failure: Another writer committed a different version between steps 1 and 4. The pointer has moved. The writer enters the retry loop.

The retry loop

When a CAS fails, Iceberg does not immediately give up. The writer enters a retry loop that is both efficient and safe:

1.Refresh metadata. The writer re-reads the current metadata from the catalog to get the latest committed state.
2.Validate assumptions. Iceberg structures commits as assumptions and actions. The retry checks whether the assumptions still hold against the new table state. For example, a compaction that rewrites file_a.parquet and file_b.parquet into merged.parquet assumes that both source files still exist in the table. If a concurrent commit deleted file_a.parquet, the assumption fails and the operation cannot be retried — it must be restarted entirely.
3.Re-apply actions. If the assumptions are still valid, the writer creates a new metadata file that incorporates both the concurrent writer's changes and its own. This is the rebase — analogous to a git rebase where your changes are replayed on top of the new base.
4.Retry the CAS. The writer attempts the atomic swap again. If another writer committed in the meantime, the cycle repeats.
5.Exhaust retries. If the maximum number of retries is reached without a successful commit, the writer throws CommitFailedException.

The critical insight is that retries only repeat the metadata commit, not the data write. Data files are already in object storage. The retry refreshes metadata, validates, rebuilds the snapshot, and re-attempts the swap. This makes retries cheap — milliseconds to seconds — compared to the minutes or hours spent writing data files. But cheapness does not mean unlimited: each retry adds latency to the commit, and under sustained contention the retry loop can exhaust without ever succeeding.

Two categories of conflict

Not all conflicts are equal. Understanding the distinction between the two categories determines whether a conflict is a minor hiccup or a catastrophic failure:

Catalog commit conflicts (metadata-only, retriable). Two writers append files to different partitions of the same table. Both attempt the CAS at roughly the same time. One succeeds, one fails. But the data does not overlap — the conflicting writer simply needs to refresh its metadata and replay its append on top of the new base snapshot. Iceberg handles this automatically through the retry loop. These conflicts are transient, cheap, and invisible to the end user when retry settings are adequate.

Data conflicts (non-retriable). Two writers modify the same physical files or make incompatible assumptions about the same data. A compaction job rewrites file_a.parquet and file_b.parquet into merged.parquet, but a concurrent snapshot expiration dereferences file_a.parquet. The compaction's assumption — that both source files still exist — is violated. No amount of metadata refresh can fix this. The retry loop aborts, throws CommitFailedException, and the entire operation must restart from scratch. These conflicts waste compute and are the primary operational concern.

When conflicts happen: common production scenarios

Commit conflicts in production almost always fall into one of four patterns. Each has different characteristics, different blast radii, and different solutions.

Streaming ingestion colliding with compaction

This is the most common and most damaging conflict pattern in production lakehouses. A Flink or Spark Structured Streaming job continuously appends data to the current partition. A compaction job runs periodically to consolidate small files from previous commits into larger, query-optimized files.

The conflict occurs when the compaction job targets a partition that is still receiving streaming writes. Compaction reads files from a snapshot, rewrites them into optimally sized targets, and then commits — removing the old files and adding the merged files. But between the read and the commit, the streaming writer has appended new files to the same partition and committed a new snapshot. The compaction's CAS fails because the metadata pointer has moved.

On a table with 60-second checkpoints, the streaming writer commits every minute. A compaction job that takes 10 minutes to process a partition will face at least 9 intervening commits. If the compaction targets the active partition, conflicts are nearly guaranteed.

The retry loop can resolve this if the conflict is metadata-only — the streaming writer added new files but did not modify the files that compaction is rewriting. Iceberg can rebase the compaction commit on top of the streaming writer's commit. But if the compaction job uses partial-progress.enabled = false (the default), a single conflict on one partition invalidates the entire compaction job — including work on partitions that had no conflicts at all. This is where the operational cost becomes severe: hours of compute wasted, tables remaining fragmented, query performance continuing to degrade. For a deep dive on partition-aware compaction that avoids this scenario, see Kafka to Iceberg Compaction — Done Right.

A control plane like LakeOps eliminates this conflict entirely. LakeOps continuously monitors per-partition commit activity and automatically excludes hot partitions — partitions that are actively receiving writes — from compaction scope. When today's partition becomes yesterday's and the streaming writer moves on, the next compaction cycle picks it up. No manual where clause management, no per-table cron configuration, no conflicts.

Multiple writers contending on the metadata pointer

Multiple independent writers — separate Spark jobs, Flink applications, or Kafka Connect tasks — appending to the same table but different partitions. In theory, these should not conflict because they touch different data. In practice, they conflict on the metadata pointer.

Every writer's commit updates the same catalog pointer. Even when Writer A writes to partition=2026-06-18 and Writer B writes to partition=2026-06-19, they both attempt to swap the same metadata pointer. If they commit within milliseconds of each other, one will fail.

These are metadata-only conflicts — the data files do not overlap. Iceberg's retry loop handles them cleanly by refreshing metadata and rebasing. But with the default retry configuration (4 retries, 100ms–60s exponential backoff), high-throughput multi-writer setups can exhaust retries before succeeding. This is especially common when 5+ writers commit to the same table within the same second.

LakeOps provides multi-engine awareness for this scenario: it understands which engines are writing to which tables and partitions, and schedules its own maintenance operations to avoid adding more concurrent commits to already-contended tables. When a table already has five writers competing for the metadata pointer, the last thing it needs is a compaction job adding a sixth.

Maintenance operations conflicting with each other

Snapshot expiration, orphan file cleanup, manifest rewriting, and compaction are all maintenance operations that commit metadata changes. When these run concurrently with active ingestion — or concurrently with each other — they compete for the same metadata pointer.

The most dangerous combination is snapshot expiration running concurrently with compaction. Snapshot expiration removes references to old snapshots and their associated data files. If compaction is simultaneously rewriting files referenced by those snapshots, the assumptions conflict. Compaction expects the source files to exist; expiration may have dereferenced them. The result is a non-retriable conflict — the compaction job must restart entirely, wasting all of its compute.

This is why maintenance operations must be sequenced, not parallelized. The correct order — expire snapshots → remove orphan files → compact data files → rewrite manifests — ensures each operation's output feeds the next without conflicts. Running them as independent cron jobs on separate schedules produces exactly these collisions. See Automating Apache Iceberg Table Maintenance for the full sequencing rationale.

LakeOps eliminates this class of conflict by design. Its sequenced maintenance pipeline runs operations in the correct order per table: expire → orphans → compact → manifests. Each step runs to completion before the next begins. There is no possibility of snapshot expiration conflicting with in-flight compaction, because they never run concurrently on the same table.

Schema and partition evolution during active writes

Schema changes (adding columns, renaming fields) and partition evolution (changing the partition spec) both produce metadata commits. If a streaming writer commits between the schema change and the next metadata refresh, one of them will fail.

These conflicts are typically transient — a single retry resolves them. But they can cascade if schema changes are applied during peak ingestion hours when the commit rate is high. Best practice is to apply schema and partition evolution during low-write windows, or pause ingestion briefly during structural changes.

How to diagnose commit conflicts

Recognizing CommitFailedException

When the retry loop exhausts without a successful commit, Iceberg throws org.apache.iceberg.exceptions.CommitFailedException. This is the primary signal that concurrent writers are colliding. The exception is a RuntimeException that implements CleanableFailure, meaning the calling engine can clean up intermediate state (delete orphaned data files) when it catches the exception.

In Flink task manager logs, the exception surfaces as:

text

1org.apache.iceberg.exceptions.CommitFailedException: 2  Cannot commit: Requirement failed: current table metadata is outdated3    at org.apache.iceberg.SnapshotProducer.commit(SnapshotProducer.java:421)4    at org.apache.iceberg.MergingSnapshotProducer.commit(MergingSnapshotProducer.java:130)5    at org.apache.iceberg.BaseOverwriteFiles.commit(BaseOverwriteFiles.java:61)6    at org.apache.iceberg.flink.sink.IcebergCommitter7      .commitOperation(IcebergCommitter.java:186)8    at org.apache.iceberg.flink.sink.IcebergCommitter9      .commit(IcebergCommitter.java:147)

In Spark driver logs during a compaction job:

text

1org.apache.iceberg.exceptions.CommitFailedException: 2  Cannot commit rewrite because files were removed by a concurrent operation: 3  [s3://lakehouse/warehouse/analytics/clickstream/data/event_date=2026-06-18/4   00001-4-8a7b3c9d-e1f2-4567-89ab-cdef01234567-00001.parquet]5    at org.apache.iceberg.RewriteFileGroup.validateRewrite(6      RewriteFileGroup.java:87)7    at org.apache.iceberg.actions.RewriteDataFilesAction8      .commitFileGroups(RewriteDataFilesAction.java:314)9    at org.apache.iceberg.spark.actions.SparkRewriteDataFilesAction10      .doExecute(SparkRewriteDataFilesAction.java:198)

The stack trace tells you the conflict type. When the message says "current table metadata is outdated," the conflict is metadata-only and should have been handled by the retry loop — the fact that it surfaced means retries exhausted, and increasing commit.retry.num-retries may help. When the message says "files were removed by a concurrent operation," the conflict is a data conflict that no amount of retries can resolve — the operation must be restarted, and the root cause (typically concurrent maintenance) must be addressed.

Reading conflict patterns from logs

Production monitoring for commit conflicts should track these patterns:

text

1# Count CommitFailedException per hour in Flink task manager logs2grep "CommitFailedException" /var/log/flink/taskmanager.log | \3  awk '{print $1" "$2}' | cut -d: -f1-2 | sort | uniq -c4 5# Identify conflict type from Spark driver logs6grep -A 3 "CommitFailedException" /var/log/spark/driver.log | \7  grep -E "(metadata is outdated|files were removed|Requirement failed)"8 9# Correlate conflicts with maintenance job timing10grep -E "(CommitFailedException|rewrite_data_files|expire_snapshots)" \11  /var/log/spark/driver.log | head -100

Instrument your monitoring stack (Datadog, Prometheus, CloudWatch) with custom metrics that increment on CommitFailedException. Alert when the conflict rate exceeds 5 per hour on any table. Flink's Iceberg sink exposes the elapsedSecondsSinceLastSuccessfulCommit metric — set up alerts when this exceeds 60 minutes to detect failed or missing commits.

Distinguishing retriable from non-retriable failures

CommitFailedException is thrown in two distinct situations:

Retriable conflicts exhausted. The writer detected a metadata conflict, entered the retry loop, and failed on every attempt. This happens when:

The table has a very high commit rate (multiple commits per second) and the retry window is too short
Multiple concurrent writers consistently race for the same metadata pointer
The catalog is slow to respond, making each retry take longer than the backoff window allows
The default 4 retries with 100ms–60s exponential backoff are insufficient for the concurrency level

Non-retriable conflicts. The retry validation found that the assumptions no longer hold. This happens when:

A compaction job's source files were deleted by a concurrent snapshot expiration
Two writers overwrote the same files (not just the same partition, but the same physical files)
A concurrent writer performed a row-level delete on files that another writer is rewriting
The base snapshot is no longer on the current branch (the commit has diverged beyond reconciliation)

A recent fix in Iceberg 1.11 (merged via PR #15126) addressed a critical bug in the REST catalog where sequence number validation failures during concurrent commits to different branches threw a non-retryable ValidationException instead of the expected retryable CommitFailedException. If you are running a REST catalog on Iceberg versions prior to 1.11, concurrent branch commits may fail with ValidationException when they should succeed on retry. Upgrading resolves this.

Isolation levels and false-positive conflicts

Each retry attempt checks whether the original operation's conflict detection filter overlaps with changes made by intervening commits. The isolation level — serializable or snapshot — determines how strictly conflicts are detected:

Serializable isolation (default for overwrites and deletes) rejects the commit if any new data was added that matches the operation's filter, even if the data files do not overlap. This is the safest but produces the most false-positive conflicts.
Snapshot isolation (default for appends) only rejects when the same files are modified. For most streaming append workloads, snapshot isolation is sufficient and produces fewer spurious failures.

For streaming tables where the dominant write pattern is append-only, ensure that the table's write operations use snapshot isolation to avoid rejecting commits unnecessarily.

How to prevent conflicts

Partition-level isolation for compaction

The single most effective strategy for preventing streaming + compaction conflicts is: never compact the partition currently being written to. This eliminates data conflicts entirely.

Iceberg's rewrite_data_files procedure accepts a where clause that restricts which partitions are rewritten. By filtering to only cold partitions — partitions that are no longer receiving new writes — compaction operates on stable data with no risk of conflict.

sql

1CALL catalog.system.rewrite_data_files(2  table => 'analytics.clickstream',3  strategy => 'sort',4  sort_order => 'event_type ASC NULLS LAST',5  where => 'event_date < current_date()',6  options => map(7    'target-file-size-bytes', '268435456',8    'min-input-files', '5',9    'partial-progress.enabled', 'true',10    'partial-progress.max-commits', '10'11  )12);

The where => 'event_date < current_date()' clause ensures compaction only touches yesterday's partitions and older. The streaming writer appends to today's partition without interference. When today becomes yesterday, the next compaction run picks it up.

For hourly partitioned tables, the equivalent is where => 'event_hour < date_trunc("hour", current_timestamp())' — always exclude the active hour.

This pattern eliminates data conflicts. Metadata-only conflicts (competing for the catalog pointer) can still occur, but these are always retriable. Combined with partial-progress.enabled = true, each file group commits independently — a conflict on one group does not invalidate work on others. For a full treatment of partition-aware compaction, see Kafka to Iceberg Compaction — Done Right.

LakeOps automates partition-level isolation at scale. Its conflict-aware execution engine continuously monitors per-partition commit activity and automatically excludes hot partitions from compaction scope. It knows not just which partitions received writes recently, but which partitions have active writers right now — and it will not touch them until the writer has moved on. This works across engines: if a Flink job is streaming into partition 2026-06-28 while a Spark ETL is backfilling 2026-06-15, LakeOps knows both partitions are hot and excludes both from compaction.

Write distribution mode

High file scatter across partitions increases both the number of files per commit and the probability of conflicts. When multiple writer subtasks each produce files across many partitions, the resulting commits touch a wide surface area — maximizing the chance that a concurrent writer's commit overlaps.

Setting write.distribution-mode to hash shuffles records by partition key before writing, so each partition is written by a single subtask. This reduces file count per commit (from subtasks × partitions to partitions) and narrows the commit's scope, reducing the conflict window.

sql

1ALTER TABLE analytics.clickstream SET TBLPROPERTIES (2  'write.distribution-mode' = 'hash'3);

For multi-writer setups, hash distribution also ensures that each writer's commit touches fewer partitions — making it easier for the retry loop to rebase non-overlapping commits. This is especially effective when combined with partition-level isolation for compaction. For a deeper comparison of distribution modes in streaming contexts, see Apache Iceberg with Flink: Streaming Optimization.

Partial progress for compaction

Without partial-progress.enabled, a compaction job commits all file group rewrites in a single atomic commit. If that commit conflicts, the entire job — potentially hours of work across hundreds of partitions — is lost.

With partial progress, each file group (or batch of file groups) commits independently. A conflict on one group retries only that group, while all other groups' commits have already succeeded. This is the difference between losing 10 minutes of work on one partition and losing 2 hours of work across the entire table.

sql

1CALL catalog.system.rewrite_data_files(2  table => 'analytics.clickstream',3  strategy => 'binpack',4  options => map(5    'partial-progress.enabled', 'true',6    'partial-progress.max-commits', '20',7    'partial-progress.max-failed-commits', '5'8  )9);

partial-progress.max-commits controls how many separate commits the compaction job produces. Higher values increase fault tolerance — each commit covers fewer file groups — at the cost of more snapshots (each commit creates a new snapshot). Set to 10–20 for tables with active concurrent writers. partial-progress.max-failed-commits controls how many commit failures are tolerated before the entire job fails. On streaming tables with active writers, set to 3–5. This allows a few hot-partition conflicts without aborting the entire maintenance run.

Sequencing maintenance operations

Most maintenance conflicts stem from a simple problem: operations that should run sequentially are scheduled independently. Snapshot expiration runs on one cron schedule, orphan cleanup on another, compaction on a third, manifest rewriting on a fourth. Each operation produces commits that can conflict with the others — and with active ingestion.

The conflict-free maintenance sequence is: expire snapshots → remove orphan files → compact data files → rewrite manifests. Each operation's output feeds the next:

1.Snapshot expiration dereferences old snapshots and their file pointers. Running it first ensures that compaction does not waste compute rewriting files that are about to be dereferenced.
2.Orphan file cleanup removes physical files that no snapshot references. Running it after expiration ensures it catches everything expiration just released.
3.Compaction rewrites the remaining small files into properly sized targets. Running it after cleanup ensures it operates on a clean, minimal file set.
4.Manifest rewriting consolidates the fragmented manifests from weeks of streaming commits. Running it last ensures it reflects the final compacted file layout.

Running these as a coordinated pipeline within a single Airflow DAG (or equivalent orchestrator) with task dependencies eliminates inter-operation conflicts entirely. See Automating Apache Iceberg Table Maintenance for the full implementation pattern.

LakeOps sequences maintenance operations automatically per table in exactly this order. The pipeline is event-driven rather than schedule-driven — it monitors structural signals (file count, average file size, snapshot velocity, delete file ratio, manifest depth) and triggers maintenance when a table crosses a threshold. A streaming table accumulating 1,000 files per hour compacts every 30 minutes. A batch table loaded once daily compacts once daily. Each table gets maintenance proportional to its actual need, not a fixed cron cadence shared with every other table in the lake.

Configuring retry behavior

Iceberg exposes four table properties that control the retry loop. These are critical for high-concurrency environments:

sql

1ALTER TABLE analytics.events SET TBLPROPERTIES (2  'commit.retry.num-retries' = '10',3  'commit.retry.min-wait-ms' = '200',4  'commit.retry.max-wait-ms' = '120000',5  'commit.retry.total-timeout-ms' = '1800000'6);

`commit.retry.num-retries` (default: 4) — number of retry attempts before throwing CommitFailedException. For tables with 3+ concurrent writers or streaming + compaction, increase to 8–10. Higher values tolerate more transient conflicts but extend the maximum time a commit can take. `commit.retry.min-wait-ms` (default: 100) — minimum backoff before the first retry. For streaming tables where commits happen every minute, 100ms is fine. For maintenance operations that run less frequently, increase to 500–1000ms to reduce catalog load. `commit.retry.max-wait-ms` (default: 60,000 / 1 minute) — maximum backoff between retries. Exponential backoff doubles the wait on each retry, capped at this value. For high-concurrency streaming tables, increase to 120,000ms (2 minutes) to give the retry loop more room. `commit.retry.total-timeout-ms` (default: 1,800,000 / 30 minutes) — total wall-clock time allowed for all retry attempts combined. If retries are still failing after 30 minutes, the operation is abandoned. This is a safety valve — if you hit this limit, the root cause is sustained high contention, not transient conflicts.

Recommended retry configurations by workload

For streaming ingestion (Flink, Spark Structured Streaming) where the writer commits every 1–5 minutes and compaction runs concurrently:

sql

1ALTER TABLE analytics.clickstream SET TBLPROPERTIES (2  'commit.retry.num-retries' = '10',3  'commit.retry.min-wait-ms' = '100',4  'commit.retry.max-wait-ms' = '10000',5  'commit.retry.total-timeout-ms' = '1800000'6);

For maintenance operations (compaction, snapshot expiration) running on batch schedules:

sql

1ALTER TABLE analytics.clickstream SET TBLPROPERTIES (2  'commit.retry.num-retries' = '4',3  'commit.retry.min-wait-ms' = '1000',4  'commit.retry.max-wait-ms' = '60000',5  'commit.retry.total-timeout-ms' = '1800000'6);

For multi-writer pipelines where 5+ jobs write concurrently to different partitions:

sql

1ALTER TABLE analytics.events SET TBLPROPERTIES (2  'commit.retry.num-retries' = '10',3  'commit.retry.min-wait-ms' = '200',4  'commit.retry.max-wait-ms' = '120000',5  'commit.retry.total-timeout-ms' = '1800000'6);

Tuning retry properties is a necessary first step, but it treats symptoms rather than causes. A table that requires 10 retries per commit has a concurrency design problem — the prevention strategies above address root causes. LakeOps provides retry-aware scheduling as a built-in capability: if a conflict is detected during a maintenance commit, LakeOps backs off intelligently, re-evaluates the partition state, and retries only when the writer has moved on. Unlike static retry configurations, this is dynamic — it adapts to the actual commit cadence of each table.

How streaming writers handle conflicts

Flink and Spark Structured Streaming interact with Iceberg's commit model in fundamentally different ways. Understanding these differences is critical for configuring conflict avoidance correctly.

Flink

Flink's Iceberg sink ties its commit cycle to Flink checkpoints. When a checkpoint triggers, each writer subtask flushes its current data file (regardless of size), and the Iceberg committer operator collects all file paths and issues a single Iceberg commit. The commit is part of Flink's two-phase commit protocol, which provides exactly-once semantics.

When a commit conflict occurs, Flink's behavior depends on the sink implementation:

`IcebergSink` (SinkV2 API, recommended): The committer retries according to the table's commit.retry.* properties. If retries exhaust, the sink raises an exception that triggers a Flink job restart from the last successful checkpoint. The checkpoint guarantees that no data is lost — the failed commit's data files are rewritten from the checkpoint state. As of Flink 1.20+, you can also configure sink.committer.failure-strategy to WARN to skip failed committables instead of triggering a full restart — useful for non-critical pipelines where losing one checkpoint interval of data is acceptable.
`FlinkSink` (legacy API, deprecated as of Iceberg 1.9+): Similar behavior, but with less granular control over retry semantics. Migration to IcebergSink is recommended for new projects.

The key Flink-specific conflict concern is the checkpoint interval. A 60-second checkpoint means the streaming writer commits every 60 seconds. If compaction targets the active partition, the compaction commit competes with the streaming commit roughly once per minute. Even with 10 retries, this cadence can exhaust the retry window. The solution is partition isolation — compact only cold partitions. For Flink-specific optimization strategies, see Apache Iceberg with Flink: Streaming Optimization.

java

1// Flink checkpoint configuration for conflict-aware streaming2env.enableCheckpointing(300_000); // 5-minute checkpoint interval3env.getCheckpointConfig().setMinPauseBetweenCheckpoints(60_000);4env.getCheckpointConfig().setCheckpointTimeout(600_000);

Increasing the checkpoint interval from 60 seconds to 5 minutes reduces the commit frequency by 5x, proportionally reducing the conflict window for concurrent operations.

Spark Structured Streaming

Spark Structured Streaming commits to Iceberg at the end of each micro-batch. The commit follows the same optimistic concurrency model, but Spark handles conflicts differently:

Micro-batch trigger interval controls commit frequency. A ProcessingTime("5 minutes") trigger produces one commit every 5 minutes — similar to Flink's checkpoint interval.
On conflict, Spark's commit retry loop executes within the micro-batch. If retries exhaust, the micro-batch fails and Spark retries the entire micro-batch from the beginning (re-reading from the source offset).
Exactly-once semantics are maintained through Spark's write-ahead log and source offset tracking. A failed commit does not produce duplicate data.

python

1# Spark Structured Streaming with conflict-aware configuration2df.writeStream \3  .format("iceberg") \4  .outputMode("append") \5  .trigger(processingTime="5 minutes") \6  .option("checkpointLocation", "s3://lakehouse/checkpoints/clickstream") \7  .toTable("catalog.analytics.clickstream")

Spark Structured Streaming is generally less sensitive to commit conflicts than Flink because micro-batch intervals are typically longer (5–15 minutes vs. 1–5 minutes for Flink checkpoints). But the same principles apply: isolate compaction to cold partitions, increase retry counts for high-concurrency tables, and use hash distribution to reduce per-commit file scatter.

Conflict recovery cost by engine

When a conflict causes a CommitFailedException that exhausts retries:

Flink restarts from the last checkpoint. Data files from the failed commit become orphans in object storage (cleaned up by orphan file removal). The recovery cost is replaying data from the checkpoint — typically 1–5 minutes of data.
Spark Structured Streaming retries the micro-batch. The recovery cost is re-reading source data for one micro-batch interval — typically 5–15 minutes of data.

In both cases, no data is lost. But frequent conflict-driven restarts waste compute and create orphan files that inflate storage costs until cleaned up. Prevention (partition isolation, retry tuning, conflict-aware scheduling) is always cheaper than recovery.

How to recover from conflicts

Branch-based writes for workload isolation

Iceberg branches (available since Iceberg 1.2) provide a powerful isolation mechanism: routing different workloads to separate branches so their commits never compete for the same metadata pointer.

A branch in Iceberg is a named reference to a snapshot, with its own independent lineage. Writes to a branch create new snapshots on that branch without affecting the main branch. The data files are shared — branching is a metadata-only operation with zero data duplication.

The Write-Audit-Publish (WAP) pattern uses branches to isolate maintenance writes from production reads:

sql

1-- Enable WAP on the table2ALTER TABLE analytics.clickstream SET TBLPROPERTIES (3  'write.wap.enabled' = 'true'4);5 6-- Create a maintenance branch7ALTER TABLE analytics.clickstream CREATE BRANCH maintenance_branch;

Configure the maintenance job (compaction, data quality checks, backfills) to write to the branch:

python

1# Spark session configured to write to the maintenance branch2spark.conf.set("spark.wap.branch", "maintenance_branch")3 4# Compaction runs against the branch — no conflict with main branch writers5spark.sql("""6  CALL catalog.system.rewrite_data_files(7    table => 'analytics.clickstream',8    strategy => 'sort',9    sort_order => 'event_type ASC',10    options => map(11      'target-file-size-bytes', '268435456',12      'min-input-files', '5'13    )14  )15""")

After validation, fast-forward the main branch to incorporate the maintenance work:

sql

1-- Publish the maintenance branch to main2CALL catalog.system.fast_forward(3  table => 'analytics.clickstream',4  branch => 'main',5  to => 'maintenance_branch'6);7 8-- Clean up the branch9ALTER TABLE analytics.clickstream DROP BRANCH maintenance_branch;

The fast-forward is a single atomic metadata operation. Downstream readers see the compacted data instantly, without any disruption to the streaming pipeline writing to the main branch.

Branch limitations

Branch-based isolation is powerful but has practical limitations:

Fast-forward only works when main has not diverged. If the main branch received new commits since the maintenance branch was created, fast-forward fails. You must cherry-pick or recreate the branch from the current main snapshot.
Streaming writers cannot easily switch branches. A Flink job writing to the main branch cannot be redirected to a branch without restarting the pipeline. Branches are better suited for batch maintenance than for streaming write isolation.
Branch proliferation adds complexity. Each branch maintains its own snapshot lineage. Expired branches leave orphan snapshots that must be cleaned up. Use branches for specific, bounded workflows (WAP, backfills, compaction) rather than as a permanent concurrency strategy.
Not all catalogs support branches equally. REST catalogs and Nessie provide full branch support. Hive Metastore and Hadoop catalogs have limited or no branch support.

For most production streaming workloads, partition-level isolation combined with retry tuning is simpler and more reliable than branch-based writes. Use branches for batch maintenance workflows where the write-audit-publish pattern provides clear value.

Monitoring for conflict rate and commit latency

You cannot fix what you cannot see. Monitoring commit conflicts is essential for detecting problems before they cascade into pipeline failures or maintenance backlogs.

Commit conflict rate. Track the number of CommitFailedException occurrences per table per hour. A healthy table should have near-zero conflicts. A table with 5+ conflicts per hour has a concurrency problem that retry tuning alone will not solve. Retry count per commit. Most commits should succeed on the first attempt (0 retries). If the median retry count exceeds 1, the table is under sustained contention. If the 99th percentile exceeds the configured num-retries, commits are failing. Commit latency. The time from starting the CAS operation to success or failure. On a healthy table with a REST catalog, commit latency should be 100–500ms. Latency spikes indicate catalog contention, slow metadata reads, or high retry counts. Orphan file creation rate. Every failed commit that wrote data files creates orphans. A spike in orphan files (visible through the table's all_data_files metadata or through storage auditing) indicates commit failures are happening silently.

Commit latency benchmarks by catalog type:

REST catalog (Polaris, Nessie, Lakekeeper): 50–200ms per commit. Best for high-concurrency streaming workloads because the CAS operation is fast and well-defined.
AWS Glue Data Catalog: 200–500ms per commit. Glue's DynamoDB-backed locking adds latency but provides strong consistency. Suitable for moderate concurrency.
Hive Metastore: 500ms–2s per commit. The slowest option due to RDBMS-backed metadata operations. Not recommended for tables with multiple concurrent writers.

Slower catalogs increase the conflict window — the longer a commit takes, the more likely another writer will commit in between. For high-concurrency streaming tables, a fast REST catalog is a prerequisite for low conflict rates.

Automated conflict avoidance with LakeOps

LakeOps is an autonomous control plane for Apache Iceberg that eliminates commit conflicts from the maintenance lifecycle. Instead of running maintenance as disconnected cron jobs that compete with each other and with active writers, LakeOps coordinates every operation with full awareness of table state, active writers, and partition activity.

Conflict-aware execution

LakeOps never runs compaction or maintenance that would conflict with active writers. It continuously monitors commit patterns across all connected engines and schedules operations in safe windows. Compaction targets only cold partitions — partitions that are no longer receiving active writes. When today's partition becomes yesterday's, the next compaction cycle picks it up. No manual where clause management, no per-table cron configuration.

The compaction engine itself — written in Rust on Apache DataFusion — executes with bounded memory, zero JVM overhead, and lock-free parallelism. Compaction completes 5–8x faster than Spark-based alternatives, which directly reduces the conflict window. A compaction job that finishes in 2 minutes has a far smaller chance of conflicting with a concurrent writer than one that runs for 20 minutes.

LakeOps Optimization — scheduled compaction — LakeOps Optimization tab — compaction strategy, target file sizes, and conflict-aware scheduling. Maintenance runs are coordinated to avoid overlap with active streaming writers.

Sequenced maintenance pipeline

LakeOps sequences maintenance operations per table in the correct order: expire snapshots → remove orphans → compact → rewrite manifests. Each step runs to completion before the next begins. There is no possibility of snapshot expiration conflicting with in-flight compaction, because they never run concurrently on the same table.

Partition-level coordination

LakeOps knows which partitions have active writers — across all connected engines — and avoids touching them. If a Flink pipeline is streaming into event_date=2026-06-28 and a Spark ETL is backfilling event_date=2026-06-01, both partitions are excluded from compaction until the writer moves on. This goes beyond simple time-based filtering (where event_date < current_date()): LakeOps monitors actual commit activity, so partitions that receive sporadic late-arriving data are also protected.

Retry-aware scheduling

If a conflict is detected despite preventive measures, LakeOps backs off intelligently. Rather than retrying on a static exponential backoff, it re-evaluates the partition state, checks whether the conflicting writer has completed, and retries only when the conditions are safe. This prevents the retry storm that occurs when a compaction job repeatedly conflicts with a high-frequency streaming writer.

Multi-engine awareness

LakeOps understands which engines are writing to which tables and partitions. In a multi-engine environment — Flink for streaming, Spark for batch ETL, Trino for interactive queries — it coordinates maintenance to avoid adding concurrent commits to already-contended tables. When a table already has five writers competing for the metadata pointer, LakeOps defers its maintenance operations rather than adding a sixth contender.

Event audit trail

Every conflict, retry, and maintenance operation is logged with full context: which table, which partition, which engine was writing, what the conflict type was, and how it was resolved. This audit trail eliminates the forensic investigation that typically follows a CommitFailedException cascade — instead of correlating Flink task manager logs, Spark driver logs, and cron job schedules, you can see the full timeline in one place.

LakeOps Lake Events — operations log — LakeOps Events log — every maintenance operation tracked with timing, status, and before/after metrics. Conflict-aware scheduling eliminates the CommitFailedException cascades that plague manual maintenance.

Cross-table coordination and policy-driven governance

When multiple tables need maintenance simultaneously, LakeOps prioritizes by degradation severity. A table with 200,000 small files and 5-second query planning time compacts before a table with 500 files and sub-second planning. This prevents the most degraded tables from waiting in a FIFO queue behind healthy ones.

At scale, per-table maintenance configuration does not survive. LakeOps policies apply maintenance rules at the catalog, namespace, or table level. Every table in scope inherits the policy — including new tables created after the policy is defined. A single Adaptive Maintenance policy bundles compaction, snapshot expiration, orphan cleanup, and manifest rewriting into a coordinated, conflict-aware workflow. Policies cascade with precedence: catalog-wide → namespace → table. A per-table override always wins.

The result is a maintenance system where commit conflicts are designed out, not debugged out. No CommitFailedException cascades, no orphan file spikes from failed compaction, no wasted compute from retried maintenance jobs. Explore the managed Iceberg solution for a deeper look at how LakeOps fits into the lakehouse architecture.

Prevention checklist

Writer-side configuration

Set `write.distribution-mode` = `hash` for partitioned tables to reduce file scatter and narrow commit scope
Increase checkpoint/commit intervals to 3–5 minutes for analytics workloads. Sub-minute intervals create 5x more commits and proportionally more conflict opportunities
Decouple source and sink parallelism in Flink. Read from Kafka at full partition count but write to Iceberg with 8–16 subtasks
Prefer `day(timestamp)` partitioning over hour(timestamp) unless volume exceeds 5 GB/hour. Fewer active partitions means fewer conflict-prone hot partitions

Compaction-side configuration

Always use partition-aware compaction with a where clause excluding the active partition
Enable `partial-progress.enabled = true` to commit each file group independently
Set `partial-progress.max-failed-commits` to 3–5 to tolerate occasional hot-partition conflicts without aborting the entire job
Set `min-input-files` to 5 to skip partitions that are already well-sized
Set `rewrite-job-order` to `files-desc` to compact the most fragmented partitions first — if the job is interrupted by a conflict, the worst partitions are already handled

Retry configuration

Set `commit.retry.num-retries` to 8–10 for tables with concurrent streaming + maintenance workloads
Set `commit.retry.min-wait-ms` to 100–200 for streaming, 500–1000 for batch maintenance
Set `commit.retry.max-wait-ms` to 60,000–120,000 depending on catalog latency
Monitor actual retry counts and adjust. If most commits succeed on attempt 1–2, your retry config is sufficient. If commits routinely hit attempt 8+, the problem is structural, not configuration-level

Maintenance sequencing

Run maintenance in order: expire → orphans → compact → manifests
Never schedule maintenance operations as independent cron jobs that can overlap
Use a single orchestrated pipeline (Airflow DAG, LakeOps policy) per table
Schedule maintenance during low-write windows when possible
For 50+ tables, use a dedicated control plane like LakeOps that coordinates per-table maintenance automatically

Monitoring

Track `CommitFailedException` count per table per hour — alert above 5
Monitor commit latency percentiles (p50, p95, p99) — alert when p99 exceeds 5 seconds
Track orphan file creation rate — spikes indicate silent commit failures
Monitor compaction job success rate — a drop below 95% indicates sustained conflicts
Correlate conflict spikes with maintenance schedules to identify overlapping operations

Summary

Commit conflicts are the price of optimistic concurrency — and optimistic concurrency is what makes Iceberg's multi-engine, multi-writer architecture possible. The goal is not to eliminate the mechanism, but to ensure conflicts are rare, retriable, and never disruptive.

For most production lakehouses, three changes eliminate 95% of commit conflicts: partition-level isolation for compaction (never compact the active partition), sequenced maintenance operations (expire → orphans → compact → manifests in a single pipeline), and tuned retry properties for high-concurrency tables.

The remaining 5% — multi-writer pointer contention, catalog-level bottlenecks, edge cases during schema evolution — are handled by increased retry counts, hash distribution mode, and monitoring that surfaces conflicts before they cascade.

LakeOps provides the full stack: conflict-aware compaction that never touches active partitions, sequenced maintenance pipelines, partition-level coordination across all connected engines, retry-aware scheduling that adapts to actual commit cadence, multi-engine awareness, and a complete event audit trail. Its Rust-based execution engine minimizes the conflict window by finishing compaction 5–8x faster than Spark. The result is a lakehouse where maintenance runs continuously alongside production writers — without CommitFailedException cascades, without orphan file spikes, and without engineers debugging cron job collisions at 2 AM.

Apache Iceberg Commit Conflicts: Causes, Prevention, and Recovery

Why Iceberg conflicts happen: the optimistic concurrency model

The commit sequence in detail

The retry loop

Two categories of conflict

When conflicts happen: common production scenarios

Streaming ingestion colliding with compaction

Multiple writers contending on the metadata pointer

Maintenance operations conflicting with each other

Schema and partition evolution during active writes

How to diagnose commit conflicts

Recognizing CommitFailedException

Reading conflict patterns from logs

Distinguishing retriable from non-retriable failures

Isolation levels and false-positive conflicts

How to prevent conflicts

Partition-level isolation for compaction

Write distribution mode

Partial progress for compaction

Sequencing maintenance operations

Configuring retry behavior

Recommended retry configurations by workload

How streaming writers handle conflicts

Flink

Spark Structured Streaming

Conflict recovery cost by engine

How to recover from conflicts

Branch-based writes for workload isolation

Branch limitations

Monitoring for conflict rate and commit latency

Automated conflict avoidance with LakeOps

Conflict-aware execution

Sequenced maintenance pipeline

Partition-level coordination

Retry-aware scheduling

Multi-engine awareness

Event audit trail

Cross-table coordination and policy-driven governance

Prevention checklist

Writer-side configuration

Compaction-side configuration

Retry configuration

Maintenance sequencing

Monitoring

Summary

Further reading

Tags

Related articles

Apache Iceberg with Flink: Streaming Optimization Guide

Apache Iceberg Delete Files: Reducing Merge-on-Read Overhead

Apache Iceberg Operational Runbook: Incidents, Symptoms, and Fixes