
At Iceberg Summit 2026, LinkedIn's data ingestion team described how they operate Iceberg CDC at scale — covering equality vs position deletes, budgeted compaction, and WAP branching. These insights outline what production CDC on Apache Iceberg actually looks like when merge-on-read meets billions of daily upserts.
The published numbers are staggering. In their Summit talk, LinkedIn reports more than one exabyte in the data lake, ingestion from 10,000+ Kafka topics, and 10,000+ CDC tables handling upserts and deletes — with daily throughput above two petabytes and ten billion records. The lake is the analytical backbone of a company serving nearly a billion members.
What makes their story instructive is not the raw scale. It is the fact that at their level of throughput, every known limitation of Iceberg's merge-on-read architecture becomes a production incident waiting to happen. Delete file accumulation degrades read performance. Concurrent compaction and merge operations produce commit conflicts. HDFS read I/O amplification from scattered delete files overwhelms storage clusters. The tradeoff between data freshness and compute cost becomes a daily negotiation.
LinkedIn's journey illustrates what every team operating Iceberg at scale eventually discovers: the ingestion is the easy part. It is the delete files, the compaction conflicts, the I/O amplification, and the freshness-vs-cost tradeoff that consume engineering time.
This post unpacks LinkedIn's CDC architecture and optimization strategies — contextualizing each decision for teams operating CDC on Iceberg at any scale.
LinkedIn's data lake: scale as a forcing function
To appreciate why LinkedIn's CDC challenges matter, you need to understand the environment those challenges emerge from.
LinkedIn's migration from Hive to Iceberg, which is now well underway, used a view-driven strategy where queries were routed through abstraction views that could point to either Hive or Iceberg storage, enabling zero-downtime cutover table by table without forcing consumers to change anything. Views abstracted production dataset versions, applied read-time filters for hiding incomplete partitions and internal columns, and managed multiple public aliases per dataset — all without duplicating pipelines or data.
But the CDC tables are where the real complexity lives. Unlike append-only event logs or periodic dimension snapshots, CDC tables reflect the live state of upstream databases. Every INSERT, UPDATE, and DELETE in an OLTP source propagates as a change event through Kafka and lands in Iceberg. The result is not a write-once dataset — it is a continuously mutating table where rows are created, modified, and removed throughout the day.
At LinkedIn's scale, this means processing more than 10,000 CDC tables simultaneously, each receiving a continuous stream of upserts and deletes from upstream sources. The ingestion layer — built on their FastIngest streaming pipeline using Gobblin with Iceberg and ORC — pulls from more than 10,000 Kafka topics. The combined daily throughput — over two petabytes, over ten billion records — flows continuously, not in nightly batches.
Scale alone does not create interesting engineering problems. But scale combined with mutation does. Iceberg was designed around immutable data files. When you need to express that a row has been updated or deleted, the table format gives you two mechanisms — and choosing between them has profound implications for read performance, write latency, storage costs, and operational complexity.
CDC ingestion patterns: equality deletes vs position deletes
The core architectural decision for any CDC-on-Iceberg pipeline is how to represent row-level changes. Iceberg provides two mechanisms, and LinkedIn's experience illuminates the tradeoffs between them in ways that benchmark papers cannot.
Streaming with equality deletes
The first approach is the streaming pattern. As CDC events arrive from Kafka, they are written to Iceberg in near-real-time. When a row is updated, the ingestion engine writes an equality delete file — a file that says "delete any row where the primary key equals X." The new version of the row is written as a standard data file. This happens continuously, event by event or in micro-batches.
The advantage is freshness. Data is available in Iceberg within seconds or minutes of the upstream change. Downstream consumers — dashboards, ML feature stores, operational analytics — see current state without waiting for batch windows.
The cost is delete file accumulation. Every update produces at least one equality delete file. On a high-churn table — imagine a table tracking member session state, ad impression counts, or recommendation scores — thousands of equality delete files accumulate per hour. Each equality delete file must be evaluated against every data file during reads, because the reader cannot know in advance which data files contain the affected rows without scanning. This is the fundamental read amplification problem with merge-on-read (MoR) and equality deletes.
At LinkedIn's scale, a single high-churn CDC table could accumulate tens of thousands of equality delete files per day. A query that previously scanned 200 data files now also needs to evaluate 15,000 delete files against each of those 200 data files. Query latency does not degrade linearly — it degrades multiplicatively, because every data file must be checked against every delete file.
Batch merge with position deletes
The second approach is the batch merge pattern. Instead of streaming individual CDC events directly into Iceberg, changes are accumulated in a staging area (often a Kafka compacted topic or a temporary staging table) and periodically merged into the Iceberg table. The merge operation reads the current data files, applies all accumulated changes, and writes new data files with the changes incorporated. Delete files produced are position deletes — files that identify specific rows by their file path and row position rather than by primary key.
Position deletes are dramatically cheaper to evaluate at read time. The reader knows exactly which file and which row offset to skip. There is no need to scan primary keys across all data files. A position delete is a surgical pointer: "skip row 47,291 in file abc123.parquet." An equality delete is a broadcast: "skip any row where user_id = 47291, wherever it may be."
The tradeoff is freshness. Batch merges typically run on a schedule — every 15 minutes, every hour, every few hours. Data from the upstream database is not visible in Iceberg until the next merge window completes. For many use cases this is acceptable. For real-time operational dashboards or time-sensitive ML features, it is not.
LinkedIn operates both patterns depending on the table's requirements. Some tables — those feeding real-time applications — use the streaming equality-delete approach and accept the operational burden of managing delete file accumulation. Others use periodic batch merges to keep delete files under control at the cost of some latency.
This dual-mode operation is itself a source of complexity. The platform team must maintain two ingestion paths, two sets of monitoring, two failure modes. But the alternative — forcing a single pattern on all 10,000+ CDC tables — would either sacrifice freshness everywhere or drown in delete files everywhere.

The performance challenge: delete file accumulation on high-churn tables
The streaming equality-delete pattern, for all its freshness advantages, creates a problem that grows worse with every hour of operation. Delete files accumulate. Read performance degrades. And the degradation is not graceful.
Consider a concrete scenario from LinkedIn's architecture. A CDC table receives 500,000 upserts per hour from an upstream MySQL database tracking member activity. Each upsert produces a data file (or is batched into a micro-batch data file) and a corresponding equality delete file. After 24 hours, the table has accumulated roughly 12,000 new delete files on top of whatever existing data files were already present.
When a downstream analyst runs a query against this table, the Iceberg reader must perform the following: open each data file, open every equality delete file, and for each row in each data file, check whether that row's primary key appears in any delete file. The computational complexity is proportional to the product of data files and delete files, not their sum.
LinkedIn observed that on their highest-churn tables, query latency could increase by 10x or more within a single day if delete files were not aggressively managed. A query that completed in 30 seconds in the morning could take five minutes by evening — not because the data volume changed significantly, but because the delete file count exploded.
The HDFS read I/O amplification compound the problem further. Each equality delete file is a separate small file on HDFS. Reading 15,000 delete files means 15,000 separate file opens, 15,000 metadata reads, and 15,000 data reads — all hitting the HDFS NameNode and DataNodes. At LinkedIn's scale, the aggregate I/O from delete file reads across thousands of tables was measurable at the storage cluster level. Delete file I/O was competing with productive query I/O for storage bandwidth.
This is not a theoretical concern. LinkedIn's storage team saw HDFS read throughput degradation correlated directly with delete file accumulation across their CDC tables. The delete files were small — often just a few kilobytes each — but their sheer number created a metadata storm on the NameNode and scattered random reads across DataNodes.
LakeOps, a control plane for Apache Iceberg lakehouses, addresses this exact pattern by continuously monitoring the delete-to-data file ratio on every connected table. When the ratio crosses a configurable threshold, LakeOps triggers merge-on-read compaction — rewriting affected data files with deletes applied, producing clean data files and eliminating the accumulated delete files in a single atomic operation. This is event-driven, not scheduled. A table that suddenly receives a burst of CDC activity gets compacted when it needs it, not when a cron job happens to run.
Optimization strategies: frequent compaction, periodic rewrites, budgeted compaction
LinkedIn did not solve the delete file accumulation problem with a single strategy. Instead, they developed a layered approach that balances freshness, read performance, and compute cost.
Frequent delete-file compaction
The first layer is frequent, lightweight compaction focused specifically on delete files. Rather than rewriting entire data files, this operation targets partitions where delete file counts have grown past a threshold. The compaction job reads the data files and their associated delete files, applies the deletes, and writes new data files without the delete overhead. The old data files and delete files are then marked for removal in the next snapshot expiration cycle.
LinkedIn runs this type of compaction on aggressive schedules for their highest-churn tables — sometimes as frequently as every 30 minutes. The goal is not to produce perfectly optimized files. It is to prevent delete file counts from reaching the threshold where read performance degrades noticeably. Think of it as garbage collection: you run it often enough that the garbage never piles up.
The challenge is cost. Each compaction run consumes compute resources — reading data files, applying deletes, writing new files, committing metadata. On tables with hundreds of partitions and terabytes of data, even a lightweight compaction run can consume significant Spark cluster time. Multiply this by hundreds of high-churn tables, and the aggregate compaction compute cost becomes a line item that platform teams must manage carefully.
LakeOps takes this further with its purpose-built Rust-based compaction engine, which replaces Spark for delete-file compaction at roughly one-tenth the cost. Where LinkedIn provisions Spark clusters to compact delete files, LakeOps runs the same operation with Apache DataFusion at $5/TB versus $50/TB — making aggressive compaction schedules economically viable even for teams without LinkedIn's compute budget.
Periodic full rewrites
The second layer is periodic full-table or full-partition rewrites. Unlike the lightweight delete-file compaction that targets specific partitions with high delete counts, a full rewrite reads every file in the table (or partition), sorts the data optimally, and writes new, well-sized files from scratch.
Full rewrites address problems that incremental compaction cannot: suboptimal sort order from historical ingestion patterns, partition skew from changing data distributions, and file size drift from accumulated small compaction outputs. They are expensive — rewriting a multi-terabyte table takes hours and significant compute — but they reset the table to an optimal physical layout.
LinkedIn schedules full rewrites during off-peak hours, typically on weekends or during low-query-load periods. The frequency depends on the table: some high-value tables get weekly rewrites, while others are rewritten monthly or only when performance metrics trigger an alert.
Budgeted data compaction
The third and most interesting layer is what LinkedIn calls "budgeted data compaction" — a cost-constrained approach to compaction that acknowledges compute resources are finite, with prioritization built directly into their ingestion platform.
The insight behind budgeted compaction is simple but powerful: not all compaction work is equally valuable. Compacting a table where delete files have pushed read latency to 10x its baseline is far more valuable than compacting a table where reads are only 1.2x degraded. If your compaction budget is limited — and it always is — you should spend it where the impact is greatest.
LinkedIn's budgeted compaction system works by ranking tables and partitions by their estimated read performance degradation (derived from delete file counts, file size distributions, and query patterns) and then allocating compaction resources in priority order until the compute budget is exhausted. Tables with the worst degradation get compacted first. Tables with minor degradation wait until budget is available.
This is a fundamentally different approach from the typical strategy of running compaction on a fixed schedule for every table. Schedule-based compaction wastes compute on tables that do not need it and under-serves tables that do. Budgeted compaction directs resources where they produce the most improvement per compute dollar spent.
LakeOps implements a parallel concept through its adaptive maintenance policies. Instead of a fixed cron schedule, LakeOps continuously evaluates every table's health score — factoring in delete file ratio, file count, average file size, snapshot accumulation, and manifest fragmentation — and prioritizes maintenance operations on the tables with the worst scores. This is the same economic logic as LinkedIn's budgeted compaction, applied across all maintenance operations, not just compaction. The result is the same: limited compute resources are directed where they produce the most value.
Operational challenges at scale: I/O amplification and merge-compaction conflicts
Even with a layered compaction strategy, LinkedIn encountered operational challenges that required architectural solutions beyond simply running compaction more often.
HDFS read I/O amplification
The I/O amplification problem deserves deeper examination because it affects every team running merge-on-read Iceberg tables, not just those at LinkedIn's scale.
When an Iceberg reader processes a query against a table with equality delete files, the read path looks like this: first, read the table metadata and manifest list to identify which data files are relevant. Then, for each relevant data file, read all equality delete files that could contain matching primary keys. Finally, for each row in each data file, check against the delete file entries.
On HDFS, each of these reads is a separate RPC to the NameNode (for file metadata) followed by block reads from DataNodes. The NameNode is a single point of coordination in HDFS — it handles every file open, every block location lookup, every metadata query. When thousands of tables each have thousands of small delete files, the aggregate NameNode load from delete-file reads becomes a bottleneck for the entire storage cluster.
LinkedIn mitigated this through a combination of approaches. Aggressive compaction reduces the number of delete files. Larger micro-batch sizes during ingestion produce fewer, larger files. Caching delete file contents in the query engine reduces repeated reads. And careful scheduling ensures that compaction I/O does not coincide with peak query I/O.
For teams running on cloud object stores like S3 instead of HDFS, the I/O amplification takes a different form. There is no NameNode bottleneck, but each file read is an HTTP request with non-trivial latency. Thousands of small-file reads translate to thousands of GET requests, each with 50-100ms of round-trip time. The per-request cost model of S3 also means that reading 10,000 delete files costs 10,000 GET request charges — which adds up across thousands of tables.
LakeOps provides per-table observability into delete file counts, delete-to-data ratios, and estimated I/O amplification factors. This visibility is critical because I/O amplification from delete files is often invisible until queries slow down. By the time an analyst complains about a slow dashboard, the table may have accumulated thousands of delete files over days of unmonitored ingestion. LakeOps surfaces these metrics continuously, enabling teams to catch degradation before it impacts consumers.
Merge-compaction conflicts
The second major operational challenge at LinkedIn's scale is conflicts between concurrent merge (ingestion) and compaction operations.
Iceberg uses optimistic concurrency control for commits. When two operations modify a table simultaneously — say, a streaming ingestion job writing new data and equality deletes while a compaction job rewrites existing data files — both operations prepare their changes independently and then attempt to commit. If the second commit detects that the table state has changed since it started (because the first commit succeeded), it must either retry or fail.
At LinkedIn's throughput, these commit conflicts are not rare edge cases — they are a constant operational reality. A streaming ingestion job that writes every minute will inevitably overlap with a compaction job that takes 20 minutes to complete. The compaction job reads a snapshot of the table, spends 20 minutes rewriting files, and then attempts to commit. If the streaming job has committed new data during those 20 minutes, the compaction commit may conflict.
The conflict handling strategy matters enormously. If compaction simply retries from scratch on every conflict, it may never complete on a table with continuous ingestion — each retry takes 20 minutes, during which new data arrives, causing another conflict. LinkedIn addressed this by implementing partial compaction that operates on disjoint sets of files from the streaming ingestion, reducing the conflict surface.
They also implemented backoff strategies and conflict-aware scheduling — ensuring that compaction jobs target partitions that are not currently receiving active writes. This requires coordination between the ingestion layer and the compaction layer, adding operational complexity but reducing wasted compute from failed compaction attempts.
LakeOps solves this with conflict-aware compaction scheduling built into its compaction engine. The engine reads the table's current commit log before starting compaction, identifies which files are being actively written by streaming producers, and excludes those files from the compaction plan. This eliminates the most common class of commit conflicts — compaction trying to rewrite files that are simultaneously being modified by ingestion. The result is that compaction runs to completion on the first attempt in the vast majority of cases, even on tables with continuous streaming ingestion.

WAP branching for zero-downtime rebootstraps
One of the most operationally painful scenarios in CDC pipeline management is the full table rebootstrap — reloading an entire table from the source database because the incremental CDC stream has drifted, corrupted, or fallen too far behind to catch up.
At LinkedIn's scale, a rebootstrap of a large CDC table (tens of terabytes) takes hours. During a naive rebootstrap — truncate the Iceberg table, bulk-load from source — the table is unavailable or inconsistent for the entire duration. Downstream consumers see missing data, partial data, or errors. For tables that feed real-time applications, this is unacceptable.
LinkedIn solved this using Iceberg's Write-Audit-Publish (WAP) branching capability. The approach works like this: create a new Iceberg branch (effectively a named snapshot reference). Bulk-load the complete source data into the branch — this can take hours, but it does not affect the main branch that consumers are reading. Once the bulk load completes, validate the branch data against the source database for consistency. If validation passes, atomically swap the branch to become the new main — a metadata-only operation that takes milliseconds.
The key insight is that WAP branching separates the duration of the rebootstrap from its impact on consumers. The bulk load takes as long as it takes. The consumer-facing cutover is instantaneous. Zero downtime, zero inconsistency windows.
LinkedIn built custom tooling around WAP branching that automates the branch creation, bulk load coordination, validation, and atomic swap. The validation step is critical — without it, you might swap in a branch that has data inconsistencies introduced during the bulk load. Their validation compares row counts, primary key sets, and sampled column values between the source database and the Iceberg branch.
This pattern applies to any team running CDC at scale. Schema migrations in the source database, CDC connector failures, and accumulated drift all create scenarios where a full rebootstrap is the safest recovery path. Without WAP branching, that recovery is disruptive. With it, the recovery is invisible to downstream consumers.
Validation and consistency tooling
The validation tooling LinkedIn built for rebootstraps reflects a broader operational necessity: continuous consistency checking between source databases and Iceberg tables.
CDC pipelines can lose events. Kafka consumers can skip offsets. Serialization bugs can corrupt individual records. Schema evolution mismatches can silently drop columns. Each of these failure modes produces an Iceberg table that looks healthy — the files are there, the metadata is clean, queries return results — but the data does not match the source of truth.
LinkedIn's consistency validation operates at multiple levels. At the aggregate level, it compares row counts between the source database and the Iceberg table — a fast sanity check that catches bulk data loss. At the sampling level, it randomly selects primary keys from the source database and verifies that the corresponding rows in Iceberg match exactly. At the schema level, it verifies that the Iceberg table schema reflects the current source database schema, catching evolution mismatches.
This validation runs continuously, not just during rebootstraps. A CDC table that passed validation yesterday might drift today due to a transient Kafka consumer issue. Continuous validation catches drift early, when the gap is small enough to repair incrementally rather than requiring a full rebootstrap.
The validation results feed into alerting and automated remediation. Small drift (a few hundred missing rows) triggers an incremental repair — re-reading the affected primary keys from the source and applying them to Iceberg. Large drift (thousands of missing rows or schema mismatches) triggers an alert for human review and potentially a WAP-branched rebootstrap.
For teams building CDC pipelines on Iceberg, the lesson is clear: you need to validate, continuously, that your Iceberg tables actually match your source databases. The CDC pipeline is not a guarantee — it is a best-effort replication channel that requires active monitoring.

Balancing freshness and cost: the perpetual tradeoff
Every decision in LinkedIn's CDC architecture ultimately traces back to a single tension: data freshness versus operational cost.
Streaming equality deletes provide maximum freshness at maximum operational cost — frequent compaction, I/O amplification management, conflict resolution, and continuous monitoring. Batch position deletes provide lower operational cost at the expense of freshness — periodic merge windows, simpler compaction requirements, fewer conflicts, but data that is always at least one merge-window old.
LinkedIn's approach is pragmatic rather than dogmatic. They classify their CDC tables by consumer requirements and assign ingestion patterns accordingly. Tables feeding real-time applications get streaming ingestion with aggressive compaction. Tables feeding daily reports get batch merges on hourly schedules. Tables feeding weekly analytics get less frequent merges with minimal compaction.
The classification is not static. As consumer needs change — a weekly report becomes a real-time dashboard, an experimental ML feature becomes a production model — tables migrate between ingestion patterns. This requires the platform team to maintain operational flexibility and tooling that supports both patterns simultaneously.
The budgeted compaction approach ties this tradeoff together at the economic level. Given a fixed compute budget for compaction, how do you allocate it across tables with different freshness requirements and different levels of degradation? LinkedIn's answer — prioritize by impact rather than schedule — is elegant but requires the observability infrastructure to measure impact accurately and the scheduling infrastructure to allocate compute dynamically.
LakeOps encodes this same tradeoff in its adaptive maintenance engine. Tables are scored by health — a composite metric that includes delete file ratio, file count distribution, snapshot accumulation, and query performance trends — and maintenance operations are prioritized by score. Tables in critical health state get immediate attention. Tables in healthy state get maintenance only when compute headroom is available. The result is the same prioritization LinkedIn built manually, automated and continuously adjusted based on real-time table telemetry. Teams get to define the cost envelope; LakeOps decides how to spend it most effectively.
Key takeaways for teams operating CDC at scale
LinkedIn's experience with Iceberg CDC ingestion yields several principles that apply regardless of your scale.
Choose your delete strategy deliberately. Equality deletes versus position deletes is not a technical detail — it is an architectural decision with cascading implications for read performance, compaction cost, and operational complexity. Understand the read amplification characteristics of equality deletes before committing to a streaming CDC pattern. Position deletes via batch merge are operationally simpler for the majority of tables.
Delete file accumulation is the primary operational risk of merge-on-read CDC. If you are writing equality deletes, you must have a compaction strategy that prevents delete file counts from growing unbounded. Monitor delete-to-data file ratios continuously. Set alerting thresholds. Automate compaction triggers based on these metrics, not just on time-based schedules.
Compaction budgets are finite — spend them wisely. Not all compaction work produces equal value. A table with 10x read degradation from delete files benefits more from compaction than a table with 1.2x degradation. Prioritize compaction by impact, not by schedule. LinkedIn's budgeted compaction approach is a model worth studying.
Concurrent ingestion and compaction will conflict. At any meaningful throughput, optimistic concurrency conflicts between streaming writers and compaction jobs are inevitable. Build conflict-aware compaction that avoids rewriting files under active ingestion. Without this, compaction jobs on continuously-written tables will retry indefinitely and waste compute.
WAP branching makes rebootstraps non-disruptive. Full table reloads from source databases are a reality of operating CDC pipelines. Iceberg's WAP branching separates the duration of the reload from its impact on consumers. Build tooling around branch creation, validation, and atomic swap — you will need it.
Validate continuously, not just during migrations. Your Iceberg table is only as good as its consistency with the source database. CDC pipelines lose events, skip offsets, and introduce drift. Continuous validation at the aggregate, sampling, and schema levels catches problems before they compound into full rebootstrap scenarios.
Invest in observability before you need it. LinkedIn's ability to manage 10,000+ CDC tables depends on per-table visibility into delete file counts, I/O amplification, compaction history, and drift metrics. Without this observability, problems are invisible until they manifest as user-facing query failures — by which point recovery is expensive.
From LinkedIn-scale challenges to automated operations
LinkedIn's CDC architecture is the product of years of engineering iteration at a scale few organizations will match. But the problems they solve — delete file accumulation, compaction conflicts, I/O amplification, freshness-cost tradeoffs, rebootstrap disruption, and consistency drift — appear at every scale. A team operating 50 CDC tables on Iceberg encounters the same category of problems as a team operating 10,000. The difference is magnitude, not kind.

The real question for most teams is not whether they will encounter these problems — they will — but whether they build custom infrastructure to address each one, or adopt a control plane that handles them systematically.
LakeOps was designed for exactly this operational surface. Event-driven compaction that responds to delete file accumulation thresholds rather than fixed schedules. Conflict-aware scheduling that coordinates with streaming writers rather than colliding with them. Per-table health scoring that prioritizes maintenance by impact rather than running every table on the same cron. Delete file observability that surfaces I/O amplification before it degrades queries. And a Rust-based compaction engine that makes aggressive compaction economically feasible at a fraction of Spark's cost.
LinkedIn built these capabilities over years with dedicated teams. LakeOps delivers them as a connected service. The engineering principles are the same — the deployment model is different.
For teams currently ingesting CDC data into Iceberg, start with observability. Understand your delete file ratios. Know which tables are accumulating delete files fastest. Measure the read performance impact. Then build — or adopt — the compaction, conflict resolution, and validation tooling that keeps those tables healthy as throughput grows.
The ingestion pipeline will work. Keeping the tables healthy at scale is the engineering challenge that separates a data lake from a production-grade lakehouse.
Further reading
- Kafka to Iceberg: Compaction Best Practices — compaction strategies for streaming CDC pipelines
- Iceberg Delete Files and Merge-on-Read Explained — deep dive into equality vs position deletes and their read performance implications
- Iceberg Commit Conflicts: Prevention and Resolution — handling optimistic concurrency at scale
- Managed Iceberg Solutions — how LakeOps automates table maintenance for production lakehouse operations



