Back to blog

Apache Iceberg Delete Files: Reducing Merge-on-Read Overhead

Delete files let Iceberg avoid rewriting data on every UPDATE or DELETE — but every unresolved delete file forces readers to reconcile at query time. A deep guide to position deletes, equality deletes, measuring overhead, and resolving accumulation before it tanks performance.

Apache Iceberg Delete Files — stacked data blocks with pink delete file markers funneled through compaction into clean, optimized data with a performance gauge showing improved read speed

Apache Iceberg supports row-level mutations — UPDATEs, DELETEs, and MERGEs — without rewriting entire data files on every operation. The mechanism behind this is delete files: lightweight metadata files that mark rows as logically removed. Readers reconcile these markers at query time, filtering out deleted rows from the result set.

This merge-on-read approach is a deliberate tradeoff. Writes are fast because only a small delete file is appended rather than an entire Parquet file rewritten. But reads pay the cost: every query that touches affected data files must open the corresponding delete files, match rows, and exclude them. As delete files accumulate, that per-query overhead compounds — and in production lakehouses with frequent mutations, the tax becomes significant.

This guide covers how Iceberg delete files work internally, when they cause performance problems, how to measure their impact, and practical approaches to resolve them — including how LakeOps handles delete file cleanup autonomously as part of its compaction pipeline.

How deletes work in Iceberg

Iceberg offers two write modes for handling row-level changes:

Copy-on-write (COW) rewrites the entire data file containing the affected rows, producing a new file with the updated or remaining rows. The old file is dereferenced in the next snapshot. Reads are always clean — no reconciliation needed — but writes are expensive because even a single-row delete triggers a full file rewrite.

Merge-on-read (MOR) writes a small delete file that references the rows to be removed. The original data file stays intact. Reads must merge the data file with its associated delete files to produce a correct result set. Writes are cheap; reads carry reconciliation overhead proportional to the number of outstanding delete files.

Most production engines default to merge-on-read for DELETE and UPDATE operations because the write-side savings dominate when mutations are frequent. The cost is deferred to read time — and delete files accumulate until something physically applies them.

Position delete files

A position delete file identifies rows by their physical location: a combination of file_path (the data file containing the row) and pos (the zero-indexed row position within that file). The schema is minimal:

text
1| file_path                          | pos  |2|------------------------------------|------|3| s3://lake/table/data/00001.parquet | 142  |4| s3://lake/table/data/00001.parquet | 8903 |5| s3://lake/table/data/00003.parquet | 47   |

At read time, the engine opens the data file, loads the associated position delete file(s), and skips the referenced row offsets. Because position deletes use physical addresses, matching is a direct index lookup — fast per file, but the cost scales linearly with the number of delete files per data file.

Position deletes are the format Spark, Trino, and Flink produce when executing row-level DELETE or MERGE statements under merge-on-read mode. They are precise, low-overhead on the write side, and relatively efficient to apply — as long as they do not accumulate unbounded.

Equality delete files

An equality delete file identifies rows by column values rather than physical position. The file contains one or more columns that define the delete predicate — any row in any data file where those columns match is logically removed:

text
1| user_id    | event_date |2|------------|------------|3| usr_991204 | 2026-03-15 |4| usr_447821 | 2026-03-15 |5| usr_003847 | 2026-03-16 |

The engine must scan every data file in the partition (or table, if unpartitioned) and check each row against the equality predicates. Unlike position deletes, there is no shortcut — the reader cannot skip files without first evaluating whether they contain matching rows. This makes equality deletes significantly more expensive at read time.

Critically, Iceberg uses sequence numbers to scope equality deletes correctly. Each delete file carries a sequence number, and it only applies to data files with a lower sequence number. This prevents a delete from affecting rows that were inserted after the delete was written — but it also means the engine must track sequence ordering during reconciliation, adding further read-time complexity.

Equality deletes are common in CDC pipelines and GDPR/CCPA compliance workflows where the system knows what to delete (a user ID, a record key) but not where it lives physically. They are also the only option when the delete originates from a system that does not track file-level positions. Proper partitioning can limit the blast radius of equality deletes by scoping them to specific partition directories rather than the entire table.

Deletion vectors in Iceberg v3

Iceberg v3 introduces deletion vectors — a compact binary bitmap format that replaces position delete files. Instead of writing a separate Parquet file listing (file_path, pos) pairs, v3 engines write a Puffin file containing a bitmap where each bit represents a row in the referenced data file. A set bit means the row is deleted.

Deletion vectors consolidate all position deletes for a given data file into a single vector rather than accumulating multiple delete files over time. This eliminates the N-delete-files-per-data-file scaling problem that plagues v2 position deletes. At read time, the engine loads one bitmap per data file and filters rows via bitwise operations — significantly faster than joining against multiple delete files.

To enable deletion vectors, set format-version = 3 on the table and configure merge-on-read mode:

sql
1ALTER TABLE my_catalog.my_table SET TBLPROPERTIES (2  'format-version' = '3',3  'write.delete.mode' = 'merge-on-read',4  'write.update.mode' = 'merge-on-read',5  'write.merge.mode' = 'merge-on-read'6);

Deletion vectors address the per-file delete accumulation problem but do not eliminate the need for compaction. Rows marked as deleted still occupy storage, and heavily-deleted data files waste I/O reading dead rows. Periodic compaction remains necessary to physically remove deleted rows and reclaim space. Equality deletes also remain unchanged in v3 — they still use separate delete files and carry the same read-time cost.

The merge-on-read overhead problem

Every unresolved delete file forces work at query time. The cost model is straightforward:

  • Position deletes: For each data file, the reader opens N associated position delete files, builds a skip set, and filters rows during scan. Overhead: O(N) per data file, where N is the delete file count.
  • Equality deletes: For each data file in scope, the reader evaluates every row against every equality delete predicate. Overhead: O(D × R) where D is the number of equality delete entries and R is the row count per file.

In practice, this means a table with 50 position delete files per data file may add 200–500ms per file scan. A table with equality deletes across a partition of 1,000 files must evaluate the predicate against every row in every file — adding seconds or minutes to queries that previously ran in sub-second time.

The problem compounds because delete files are cumulative. Each mutation adds a new delete file. Without periodic cleanup, a table receiving 100 deletes per hour accumulates 2,400 delete files per day. After a week, readers reconcile 16,800 delete files on every query touching that partition.

This is the core tension: merge-on-read makes writes fast but shifts the cost to every subsequent reader. The more readers you have, the higher the aggregate cost of deferred cleanup. For high-performance lakehouse workloads, unresolved delete files are one of the most common sources of query regression.

Healthy vs Unhealthy Iceberg tables — the impact of unresolved delete files
Unresolved delete files create the unhealthy state (left): every reader reconciles deletes at query time, fragmented metadata, slow performance. Autonomous compaction restores the healthy state (right): clean data files, no pending deletes, fast queries.

LakeOps monitors delete file ratios across all tables and triggers compaction automatically when accumulation reaches configurable thresholds — eliminating the manual overhead of tracking and resolving delete files. We'll cover this approach in detail below.

When delete files accumulate

Delete file buildup is not uniform across tables. Specific patterns produce rapid accumulation:

Frequent updates and deletes. Tables receiving row-level mutations from streaming pipelines, event-driven systems, or operational databases accumulate delete files continuously. A Flink job issuing UPSERTs every 60 seconds can generate thousands of delete files per partition per day.

GDPR/CCPA compliance. Right-to-erasure requests target individual user records scattered across many partitions. Each erasure produces equality delete files because the system knows the user ID but not the physical file and row position. Compliance workflows that run hourly or daily produce steady delete file growth.

CDC pipelines. Change data capture from operational databases generates a stream of inserts, updates, and deletes. The update and delete events translate directly to Iceberg delete files. High-velocity CDC sources — order systems, user activity tables, inventory — produce the highest accumulation rates.

Slowly changing dimensions. SCD Type 2 patterns that mark previous records as inactive (setting an end_date or is_current flag) generate a delete file for the old version on every dimension change. Dimension tables with millions of entities and daily updates accumulate rapidly.

Late-arriving corrections. Data quality pipelines that retroactively fix records in historical partitions generate delete files in cold partitions that may not be compacted on the regular schedule.

Measuring delete file impact

Before resolving delete files, you need visibility into which tables are affected and how severely. Iceberg exposes metadata that lets you quantify the problem.

Inspect delete files via the metadata table:

sql
1-- Count delete files per partition2SELECT3  partition,4  content,5  COUNT(*) AS file_count,6  SUM(record_count) AS delete_records7FROM my_catalog.my_table.all_delete_files8GROUP BY partition, content9ORDER BY file_count DESC;

The content column distinguishes position deletes (value 1) from equality deletes (value 2). Partitions with hundreds of delete files are candidates for immediate compaction.

Measure the ratio of delete files to data files:

sql
1-- Delete-to-data file ratio by partition2WITH data_files AS (3  SELECT partition, COUNT(*) AS data_count4  FROM my_catalog.my_table.files5  GROUP BY partition6),7delete_files AS (8  SELECT partition, COUNT(*) AS delete_count9  FROM my_catalog.my_table.all_delete_files10  GROUP BY partition11)12SELECT13  d.partition,14  d.data_count,15  COALESCE(del.delete_count, 0) AS delete_count,16  ROUND(COALESCE(del.delete_count, 0) * 100.0 / d.data_count, 1) AS delete_ratio_pct17FROM data_files d18LEFT JOIN delete_files del ON d.partition = del.partition19ORDER BY delete_ratio_pct DESC;

Key thresholds to watch:

  • Delete-to-data ratio > 10%: compaction should be prioritized
  • Delete-to-data ratio > 50%: queries are paying severe reconciliation overhead
  • Equality delete files present in partitions scanned by latency-sensitive queries: immediate action
  • Delete file count growing faster than compaction resolves them: the table is in deficit

Profile query latency before and after cleanup. The clearest signal is a query that runs in 2 seconds on a clean table but takes 15 seconds on the same table with accumulated delete files. Track P50 and P99 latencies for known queries and correlate degradation with delete file growth.

Resolving delete files

The mechanism for resolving delete files is straightforward: rewrite the affected data files with the deletes physically applied. The output is a set of new data files with deleted rows omitted, and the old data files plus delete files are dereferenced.

Using `rewrite_data_files` with delete-file-threshold:

sql
1-- Rewrite data files that have associated delete files2CALL my_catalog.system.rewrite_data_files(3  table => 'my_db.my_table',4  options => map(5    'delete-file-threshold', '1',6    'partial-progress.enabled', 'true',7    'partial-progress.max-commits', '10'8  )9);

Setting delete-file-threshold to 1 means any data file with at least one associated delete file will be rewritten. This is aggressive — in practice, a threshold of 3–5 balances cleanup with rewrite cost. The partial-progress options allow the operation to commit incrementally rather than holding a single long transaction.

Removing dangling delete files: After compaction rewrites data files, some delete files may reference only the now-dereferenced original files and no longer apply to any live data. The remove-dangling-deletes option (available in recent Iceberg releases) generates an additional commit to clean up these orphaned delete files:

sql
1CALL my_catalog.system.rewrite_data_files(2  table => 'my_db.my_table',3  options => map(4    'delete-file-threshold', '3',5    'remove-dangling-deletes', 'true'6  )7);

Targeting specific partitions:

sql
1-- Rewrite only the most affected partition2CALL my_catalog.system.rewrite_data_files(3  table => 'my_db.my_table',4  strategy => 'sort',5  sort_order => 'event_date ASC, user_id ASC',6  where => 'partition_date = "2026-05-25"',7  options => map(8    'delete-file-threshold', '1',9    'target-file-size-bytes', '268435456'10  )11);

Combining delete file resolution with a sort strategy is efficient: you pay the rewrite cost once and get both clean files and optimal sort order for downstream queries.

Compaction strategies that handle delete files

The most effective approach is not to run delete file cleanup as a standalone operation but to fold it into your regular compaction pipeline. When compaction rewrites small files into larger ones, it can simultaneously apply pending delete files — resolving both the small-file and delete-file problems in a single pass.

Binpack with delete file application: Standard binpack compaction merges small files into target-sized files. When delete files are present, the compaction pass reads the data files, applies the delete filters, and writes only the surviving rows to the output files. Delete files are dereferenced along with the original data files.

Sort compaction with delete application: Sort-based compaction reorganizes data by specified columns. Because sort rewrites all files in scope, it inherently applies any pending delete files — the output is always clean. For tables with both layout and delete-file problems, sort compaction addresses both simultaneously.

The scheduling problem: If compaction runs nightly but delete files accumulate hourly, readers pay merge-on-read overhead for ~23 hours out of every 24. Event-driven compaction — triggered by delete file count thresholds rather than fixed schedules — keeps the window of exposure small. This is one of the core arguments for autonomous table maintenance: let the system react to the actual state of the data rather than a static schedule. For a broader look at cost-efficient optimization strategies, aligning compaction frequency to mutation rate is consistently the highest-leverage improvement.

Copy-on-write vs merge-on-read: choosing the right mode

The choice between COW and MOR is a write-amplification vs read-overhead tradeoff:

FactorCopy-on-WriteMerge-on-Read
Write costHigh (full file rewrite per mutation)Low (small delete file appended)
Read costZero overheadProportional to delete file count
Best forLow mutation frequency, read-heavyHigh mutation frequency, write-heavy
Delete files producedNoneYes — accumulates over time
Maintenance burdenLower (no delete cleanup needed)Higher (requires periodic compaction)

Choose copy-on-write when:

  • The table receives fewer than ~100 mutations per hour
  • Queries are latency-sensitive and cannot tolerate merge overhead
  • The table is read 100x more than it is written
  • Write amplification cost is acceptable given the mutation rate

Choose merge-on-read when:

  • The table receives frequent updates or deletes (CDC, streaming, UPSERT)
  • Write latency and throughput matter more than read latency
  • You have a compaction pipeline that periodically resolves delete files
  • The table is write-heavy relative to its read frequency

In practice, most production lakehouses use merge-on-read for mutable tables and accept the maintenance burden of periodic delete file cleanup. The key is ensuring that cleanup actually happens — which is where autonomous compaction becomes critical. For tables on v3, deletion vectors reduce per-query overhead from position deletes but do not eliminate the compaction requirement for reclaiming storage.

How LakeOps handles delete files

LakeOps treats delete file accumulation as a first-class optimization signal. The control plane continuously monitors delete file counts, types, and distribution across every table and partition in the lake — and resolves them autonomously as part of its compaction pipeline, not as a separate maintenance step.

LakeOps Compaction — autonomous delete file resolution as part of the optimization pipeline
LakeOps Optimization tab — compaction configuration that handles delete file cleanup as part of the regular maintenance pipeline, with per-table strategy and scheduling controls.
LakeOps demo — autonomous delete file compaction and table health monitoring.

Key capabilities for delete file management:

  • Detection and thresholds — tracks delete-file-to-data-file ratio per partition with configurable triggers; equality delete files use a lower threshold than position deletes due to their higher read-time cost
  • Applied during compaction — delete files are resolved during every compaction pass (whether triggered by small-file accumulation, ratio breach, or sort-order drift), not as a separate step; one pass resolves both problems with no separate rewrite_data_files call to schedule
  • Per-partition visibility — the Partitions tab shows exactly which partitions carry delete files, how many, and what type, answering the question operators always ask: which partitions are degraded right now?
  • Event-driven triggers — cleanup fires proportionally to mutation rate rather than on a cron schedule; high-velocity tables compact more frequently while quiet tables are left alone
  • Lake-wide policy control — delete file thresholds configured per catalog, namespace, or individual table via lake-wide policies; the policy hierarchy lets platform teams set defaults while allowing team-specific overrides
  • Proactive Insights — severity-ranked alerts surface delete file accumulation trends before query performance degrades, triggering compaction automatically when thresholds are breached
LakeOps Partitions tab — delete file visibility per partition
LakeOps Partitions tab: per-partition visibility into file counts, delete file distribution, and compaction status — identifying exactly where merge-on-read overhead concentrates.
LakeOps Table Insights — proactive alerts for delete file accumulation
Table Insights: severity-ranked alerts for file accumulation, partition skew, and maintenance drift — including delete file warnings that trigger compaction.
LakeOps Table Events — tracking delete file creation and compaction resolution over time
LakeOps Table Events: a time-series view of table operations including delete file creation, compaction runs, and snapshot commits — showing exactly when delete accumulation spikes and when cleanup resolves it.
LakeOps Layout Simulations — optimizing sort order while resolving delete files
Layout Simulations: LakeOps evaluates candidate sort configurations against actual query patterns — combining delete file resolution with optimal data layout in a single compaction pass.

Best practices for managing delete files

1. Automate delete file resolution with a control plane. Delete files accumulate silently and compound over time. A dedicated system like LakeOps monitors delete-to-data ratios across all tables, triggers compaction when thresholds are breached, and sequences cleanup (expire snapshots → compact → rewrite manifests) automatically. In autopilot mode, delete files never accumulate to the point of query degradation. In manual mode, the observability surfaces exactly which tables and partitions need attention. 2. Monitor delete file ratios continuously. Do not wait for query regressions to discover delete file accumulation. Track the delete-to-data ratio per table and partition, and alert when it exceeds 10%.

3. Match compaction frequency to mutation rate. Tables with streaming mutations need sub-hourly compaction. Tables with daily batch updates need daily compaction. Cron-based schedules that treat all tables equally will either waste compute or leave gaps.

4. Use lower thresholds for equality deletes. Equality deletes are more expensive to reconcile than position deletes. A table with 5 equality delete files per data file may degrade more than one with 20 position delete files. Set trigger thresholds accordingly.

5. Combine delete resolution with sort compaction. If a table needs both delete cleanup and layout optimization, run a single sort compaction pass rather than two separate operations. You pay the rewrite cost once and get both benefits.

6. Prioritize partitions serving latency-sensitive queries. Not all partitions matter equally. A hot partition serving real-time dashboards should have delete files resolved immediately. A cold partition accessed monthly can tolerate higher accumulation.

7. Consider COW for low-mutation tables. If a table receives fewer than 100 mutations per hour and is read frequently, copy-on-write mode eliminates the delete file problem entirely at the cost of slightly higher write latency.

8. Track the [metadata lifecycle](/blog/iceberg-metadata-lifecycle-maintenance-optimization) holistically. Delete files are one piece of the maintenance puzzle. Snapshot expiration, orphan cleanup, and manifest optimization all interact with delete file resolution. Run them in sequence — expire snapshots first, then compact — so compaction operates on the current dataset rather than rewriting files about to be garbage-collected.

LakeOps Data Lake Insights — lake-wide health and optimization status
LakeOps Data Lake Insights: a single view across all catalogs and namespaces showing table health scores, optimization opportunities, and delete file accumulation trends — enabling platform teams to manage hundreds of tables without manual inspection.

Conclusion

Delete files are the hidden cost of merge-on-read. They make writes fast but silently degrade every query that touches affected partitions. Position deletes scale linearly with file count; equality deletes scale with row count. Both accumulate without bound unless something resolves them.

The operational answer is compaction — but manual, schedule-driven compaction leaves gaps. Tables with high mutation rates need event-driven cleanup triggered by actual delete file ratios, not wall-clock time. LakeOps provides this: continuous monitoring, configurable thresholds, and autonomous compaction that resolves delete files as part of a unified table health maintenance pipeline.

For teams managing production Iceberg lakehouses, getting delete file management right is the difference between consistent sub-second queries and unpredictable multi-second regressions. Start by measuring your delete-to-data ratios, then move toward automated resolution — whether through custom pipelines or a managed optimization platform that handles it end-to-end.

Tags

Apache IcebergApache Icebergdelete filesmerge-on-readposition deletesequality deletescompaction

Related articles

Found this useful? Share it with your team.