
Why Iceberg needs a runbook
Apache Iceberg gives you ACID semantics, schema evolution, partition evolution, and time travel — all in an open format that works across every major query engine. What it does not give you is operations. Iceberg ships primitives: rewrite_data_files, expire_snapshots, remove_orphan_files, rewrite_manifests. It does not ship scheduling, health detection, conflict handling, multi-table coordination, or observability. That gap between having the procedures and having healthy tables is where production incidents originate.
Every incident in this runbook traces back to the same root: maintenance that did not run, ran at the wrong time, ran in the wrong order, or ran without awareness of the table's current state. Small files accumulate because compaction was not triggered. Manifests fragment because no one rewrote them after compaction. Storage grows because orphan cleanup never ran after snapshot expiration. Write conflicts spike because compaction targeted hot partitions. The format is sound — the operational layer is missing.
A control plane like LakeOps fills that gap. It connects to your existing catalogs and engines, continuously monitors every table's health, and executes the correct maintenance operations autonomously — in the right order, at the right time, without conflicting with active writers. Teams running LakeOps report a 90%+ reduction in Iceberg-related incidents because the system prevents the conditions that cause them. Health classification catches degradation at the Warning stage — before it becomes a 2 AM page. Event-driven triggers fire based on actual table telemetry, not arbitrary cron schedules. Conflict-aware execution avoids hot partitions entirely.
But whether you run autonomous maintenance or manage tables manually, you need to understand how Iceberg fails. When the page fires, you need a path from symptom to fix that does not require guessing. This runbook gives you that path for the eight most common production incidents — ordered from most frequent to least frequent, each following the same structure: Symptom → Root Cause → Diagnosis → Immediate Fix → Long-term Prevention → How LakeOps prevents this.
Incident 1: Queries suddenly slow (10x+ degradation)
Symptom
Queries that ran in 2–5 seconds now take 30–60 seconds. No schema changes, no data volume spike. Dashboards time out. The degradation affects all engines (Trino, Spark, Athena), which points to a data-layer problem rather than an engine-specific issue.
Root cause
Small file explosion. A streaming pipeline — Flink, Spark Structured Streaming, or a CDC connector — has been writing files without compaction keeping pace. Each query must open every file individually: HTTP round trip per S3 GET, Parquet footer parse, reader initialization. A table with 200,000 files averaging 3 MB performs dramatically worse than the same data in 2,000 files averaging 300 MB. The I/O overhead scales linearly with file count — 100x more files means roughly 100x more planning and scheduling overhead.
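To make the linear scaling concrete, here is a rough back-of-the-envelope sketch. The per-file overhead figure (PER_FILE_OVERHEAD_MS) is a hypothetical number chosen only for illustration; the file counts are the ones quoted above. The result is cumulative overhead, which an engine spreads across parallel tasks.

```python
# Toy model: same data volume (~600 GB), different file counts.
# PER_FILE_OVERHEAD_MS is a hypothetical figure, not a measured value.
PER_FILE_OVERHEAD_MS = 20


def total_open_overhead_s(file_count: int, per_file_ms: float = PER_FILE_OVERHEAD_MS) -> float:
    """Cumulative per-file overhead (request, footer parse, reader init) in seconds."""
    return file_count * per_file_ms / 1000


small = total_open_overhead_s(200_000)  # 200,000 files of ~3 MB each
large = total_open_overhead_s(2_000)    # 2,000 files of ~300 MB each

print(f"200,000 files: {small:.0f} s cumulative overhead")
print(f"2,000 files:   {large:.0f} s cumulative overhead")
print(f"ratio: {small / large:.0f}x")
```

Whatever the real per-file cost is in your environment, the ratio between the two layouts stays the same because the overhead term is proportional to file count.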
Diagnosis
Confirm the file count and average size per partition:

    SELECT
      partition,
      COUNT(*) AS file_count,
      ROUND(AVG(file_size_in_bytes) / 1048576, 1) AS avg_size_mb,
      SUM(CASE WHEN file_size_in_bytes < 33554432 THEN 1 ELSE 0 END) AS small_files_under_32mb
    FROM catalog.db.affected_table.files
    GROUP BY partition
    ORDER BY file_count DESC
    LIMIT 20;

If multiple partitions show 1,000+ files with average sizes below 32 MB, you have a small file explosion. Cross-reference with the write history to confirm compaction stopped keeping up:

    SELECT
      committed_at,
      operation,
      summary['added-data-files'] AS files_added,
      summary['deleted-data-files'] AS files_deleted
    FROM catalog.db.affected_table.snapshots
    ORDER BY committed_at DESC
    LIMIT 50;

A sustained pattern of append operations without any replace operations (which indicate compaction ran) confirms the diagnosis.
Immediate fix
Run emergency binpack compaction on the worst partitions. Binpack merges files without re-sorting — fastest path to relief:

    CALL catalog.system.rewrite_data_files(
      table => 'db.affected_table',
      strategy => 'binpack',
      where => 'event_date >= current_date() - INTERVAL 7 DAYS',
      options => map(
        'target-file-size-bytes', '268435456',
        'min-input-files', '3',
        'partial-progress.enabled', 'true',
        'partial-progress.max-commits', '20',
        'max-concurrent-file-group-rewrites', '15'
      )
    );

Use partial-progress.enabled so the job commits incrementally; if it fails partway through, progress is retained. After compaction completes, rewrite manifests to reflect the new layout:

    CALL catalog.system.rewrite_manifests(
      table => 'db.affected_table'
    );

Long-term prevention
Schedule compaction proportionally to write frequency. Streaming tables need compaction every 1–4 hours; batch tables need it daily. Fixed nightly schedules leave streaming tables exposed for 23 hours. Event-driven triggers that fire when file count crosses a threshold (e.g., 500 files per partition) are more reliable than time-based schedules. For a deep dive on small file root causes and resolution strategies, see Fixing Small Files in Apache Iceberg.
How LakeOps prevents this
LakeOps monitors file count and average file size per partition continuously. When thresholds are breached, compaction triggers automatically — no cron schedule to miss the window. The compaction engine is written in Rust on Apache DataFusion, processing Parquet through Arrow columnar buffers with bounded memory and no JVM. Event-driven triggers based on file count thresholds mean a streaming table that accumulates 500 small files at 3 PM gets compacted at 3 PM — not at midnight when the nightly job runs. The health classification system flags the table as Warning at 500 files and Critical at 1,000, alerting the team before users notice degradation.
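The thresholds quoted above (Warning at 500 files, Critical at 1,000) can be expressed as a small classifier that any scheduler could call. This is an illustrative sketch under those stated thresholds, not LakeOps code; the function names and the "HEALTHY" label are hypothetical.

```python
WARNING_FILES = 500     # threshold quoted in the text
CRITICAL_FILES = 1_000  # threshold quoted in the text


def classify_partition(file_count: int) -> str:
    """Map a partition's file count to a health level ("HEALTHY" is an assumed label)."""
    if file_count >= CRITICAL_FILES:
        return "CRITICAL"
    if file_count >= WARNING_FILES:
        return "WARNING"
    return "HEALTHY"


def partitions_to_compact(file_counts: dict[str, int]) -> list[str]:
    """Return partitions at WARNING or worse, ordered worst first."""
    flagged = [p for p, n in file_counts.items() if classify_partition(n) != "HEALTHY"]
    return sorted(flagged, key=lambda p: file_counts[p], reverse=True)


counts = {"2026-06-01": 120, "2026-06-02": 640, "2026-06-03": 2_300}
print(partitions_to_compact(counts))
```

The input dictionary would typically come from the per-partition file count query in the Diagnosis step.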
Incident 2: Query planning takes minutes
Symptom
Queries hang for 30 seconds to several minutes before returning any rows. Engine logs show the delay is in the planning phase — before any data files are scanned. EXPLAIN returns slowly. This affects all queries against the table, not just complex ones.
Root cause
Manifest bloat combined with excessive snapshot retention. Every commit creates at least one new manifest file. A streaming table with 5-minute commits accumulates 8,640 manifests per month. Each manifest must be read and evaluated during query planning. Simultaneously, thousands of retained snapshots force the engine to traverse a deep snapshot chain to resolve the current file set.
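The 8,640 figure follows directly from the commit interval. A minimal sketch of the arithmetic, assuming a 30-day month and one new manifest per commit (the lower bound stated above):

```python
def commits_per_month(commit_interval_minutes: int, days: int = 30) -> int:
    """Number of commits (and at least that many new manifests) over the period."""
    commits_per_day = 24 * 60 // commit_interval_minutes
    return commits_per_day * days


print(commits_per_month(5))  # 5-minute commit interval
print(commits_per_month(1))  # 1-minute commit interval
```

Shortening the commit interval raises the manifest count proportionally, which is why sub-minute streaming tables hit planning problems sooner.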
Diagnosis
Check manifest count and fragmentation:

    SELECT
      COUNT(*) AS manifest_count,
      ROUND(AVG(length) / 1024, 1) AS avg_manifest_size_kb,
      SUM(added_data_files_count + existing_data_files_count) AS total_file_entries
    FROM catalog.db.affected_table.manifests;

Check snapshot accumulation:

    SELECT
      COUNT(*) AS snapshot_count,
      MIN(committed_at) AS oldest_snapshot,
      MAX(committed_at) AS latest_snapshot,
      DATEDIFF(DAY, MIN(committed_at), MAX(committed_at)) AS retention_days
    FROM catalog.db.affected_table.snapshots;

If manifest count exceeds 500 or snapshot count exceeds 2,000, you have a planning bottleneck. A healthy streaming table should have fewer than 100 manifests after regular rewriting.
Immediate fix
First, expire old snapshots to reduce the metadata tree depth. Then rewrite manifests to consolidate the fragmented remainder:

    CALL catalog.system.expire_snapshots(
      table => 'db.affected_table',
      older_than => TIMESTAMP '2026-06-15 00:00:00',
      retain_last => 50
    );

    CALL catalog.system.rewrite_manifests(
      table => 'db.affected_table'
    );

On tables with extreme manifest fragmentation (2,000+), manifest rewriting alone can reduce planning time from 30+ seconds to under 1 second.
Long-term prevention
Run manifest rewriting after every compaction cycle. Configure snapshot expiration to run at least daily with a retention window of 3–7 days for streaming tables. The correct maintenance sequence is: expire snapshots → remove orphans → compact → rewrite manifests. See Automating Iceberg Table Maintenance for the full sequencing logic.
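One way to keep the sequence from depending on memory is to encode it once and derive every maintenance plan from it. The sketch below is a hypothetical helper, not part of Iceberg; it only orders the procedure names and does not execute them.

```python
# Order stated in the text: expire snapshots, remove orphans, compact, rewrite manifests.
MAINTENANCE_ORDER = [
    "expire_snapshots",
    "remove_orphan_files",
    "rewrite_data_files",
    "rewrite_manifests",
]


def plan_maintenance(requested: set[str]) -> list[str]:
    """Return the requested operations in the required execution order."""
    unknown = requested - set(MAINTENANCE_ORDER)
    if unknown:
        raise ValueError(f"unknown operations: {sorted(unknown)}")
    return [op for op in MAINTENANCE_ORDER if op in requested]


print(plan_maintenance({"rewrite_manifests", "expire_snapshots", "rewrite_data_files"}))
```

A scheduler or DAG builder can then call the helper with whatever subset of operations a table needs, and the ordering stays consistent.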
How LakeOps prevents this
Manifest rewriting runs automatically after every compaction cycle as part of the sequenced maintenance pipeline. The correct order — expire → orphans → compact → rewrite manifests — is enforced by the system, not by DAG configuration or human memory. Snapshot expiration is the first operation in every maintenance cycle, with configurable retention policies at the catalog, namespace, or table level. Tables never accumulate thousands of stale snapshots because expiration runs continuously as part of the coordinated pipeline.
Incident 3: Write conflicts (CommitFailedException)
Symptom
Spark or Flink jobs fail intermittently with org.apache.iceberg.exceptions.CommitFailedException: Cannot commit changes based on stale table metadata. Compaction jobs fail alongside streaming writers. The failures are intermittent — sometimes the job succeeds on retry, sometimes it fails repeatedly.
Root cause
Iceberg uses optimistic concurrency control. Every commit validates against the table state at the time the operation started. If another writer committed in the interval between your operation's start and its commit attempt, the commit is rejected. This happens most often when compaction targets the same partitions that active writers are appending to, or when multiple writers target the same partition simultaneously.
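The mechanism can be illustrated with a toy model. This is not Iceberg's implementation: ToyTable, CommitFailed, and commit_with_retry are hypothetical names, and a single integer version stands in for the table metadata pointer.

```python
class CommitFailed(Exception):
    """Stand-in for Iceberg's CommitFailedException (toy model only)."""


class ToyTable:
    def __init__(self) -> None:
        self.version = 0

    def commit(self, base_version: int) -> int:
        """Accept the commit only if it was prepared against the current version."""
        if base_version != self.version:
            raise CommitFailed(
                f"stale metadata: based on v{base_version}, table is at v{self.version}"
            )
        self.version += 1
        return self.version


def commit_with_retry(table: ToyTable, max_retries: int = 3) -> int:
    """Refresh the base version and retry the commit when it is rejected."""
    for _ in range(max_retries + 1):
        base = table.version  # refresh metadata before each attempt
        try:
            return table.commit(base)
        except CommitFailed:
            continue
    raise CommitFailed("retries exhausted")


table = ToyTable()
compaction_base = table.version    # compaction plans against v0
table.commit(table.version)        # a streaming writer commits first (v0 -> v1)
try:
    table.commit(compaction_base)  # compaction tries to commit against stale v0
except CommitFailed as err:
    print(err)
print(commit_with_retry(table))    # refreshes to v1, then commits v2
```

The longer an operation runs between reading metadata and committing, the more likely another writer commits in between, which is why long compaction jobs on hot partitions are the usual losers.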
Diagnosis
Identify the conflicting operations by checking recent snapshots:

    SELECT
      committed_at,
      operation,
      summary['changed-partition-count'] AS partitions_affected,
      summary['added-data-files'] AS files_added,
      summary['deleted-data-files'] AS files_deleted
    FROM catalog.db.affected_table.snapshots
    ORDER BY committed_at DESC
    LIMIT 30;

Look for overlapping timestamps between append operations (writers) and replace operations (compaction). If both target the same partitions within seconds of each other, that is your conflict window.
Immediate fix
Increase retry configuration on the failing writer. Iceberg retries only replay the metadata commit, not the entire write — making retries cheap:

    ALTER TABLE db.affected_table SET TBLPROPERTIES (
      'commit.retry.num-retries' = '10',
      'commit.retry.min-wait-ms' = '200',
      'commit.retry.max-wait-ms' = '30000',
      'commit.retry.total-timeout-ms' = '600000'
    );

Scope compaction to cold partitions only and exclude the actively written partition:

    CALL catalog.system.rewrite_data_files(
      table => 'db.affected_table',
      strategy => 'binpack',
      where => 'event_date < current_date()',
      options => map(
        'partial-progress.enabled', 'true',
        'partial-progress.max-commits', '10'
      )
    );

Long-term prevention
Never compact the active write partition. Use partial-progress.enabled = true so a single conflict does not invalidate an entire compaction run. Change the write distribution mode to hash so each writer produces files for distinct partitions, reducing overlap. Schedule compaction to target partition < current_date() by default. For streaming tables with sub-minute commits, keep writers append-only where possible: appends only add new files, so a conflict between two appends costs a metadata retry rather than a rewrite. Avoid overlapping OVERWRITE or DELETE with appends on the same partition, because those operations can fail validation when files are added concurrently.
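To sanity-check the retry properties from the immediate fix, it helps to see what wait schedule they imply. The sketch below assumes a plain doubling backoff capped at the maximum wait, with no jitter; that schedule is an assumption for illustration, and Iceberg's actual retry timing may differ.

```python
def backoff_schedule_ms(num_retries: int, min_wait_ms: int, max_wait_ms: int) -> list[int]:
    """Wait before each retry: doubles from min_wait_ms, capped at max_wait_ms (assumed model)."""
    return [min(min_wait_ms * 2 ** attempt, max_wait_ms) for attempt in range(num_retries)]


# Values from the ALTER TABLE statement in the immediate fix.
waits = backoff_schedule_ms(num_retries=10, min_wait_ms=200, max_wait_ms=30_000)
total_timeout_ms = 600_000

print(waits)
print(f"total wait: {sum(waits)} ms, within timeout: {sum(waits) <= total_timeout_ms}")
```

Under this assumed model the ten retries fit comfortably inside the configured total timeout, so the retry count, not the timeout, is the limit that would be reached first.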
How LakeOps prevents this
LakeOps compaction is conflict-aware by design. It inspects active writer state and targets only cold partitions — partitions with no active streaming appends. If a conflict occurs despite this (e.g., a late-arriving batch write), the affected partition is retried on the next cycle automatically. The conflict window is minimized by design rather than by retry configuration. Because compaction uses partial-progress.enabled by default, a single conflict never invalidates an entire run — only the affected file group is retried.
Incident 4: S3 storage growing faster than data
Symptom
The S3 bill increases 30–50% month-over-month but the logical data volume (as reported by metadata) is flat or growing slowly. Storage audits show files in the table prefix that are not referenced by any snapshot. The gap between billed storage and logical data widens continuously.
Root cause
Two compounding factors: orphan files from failed writes, crashed compaction jobs, and aborted transactions; and snapshot retention keeping old data files referenced longer than needed. Orphan files are invisible to Iceberg — they exist on S3 but no manifest points to them. On mature lakes, orphans routinely account for 25–40% of billable storage on affected prefixes.
Diagnosis
Compare logical data size (from metadata) against physical storage:

    SELECT
      ROUND(SUM(file_size_in_bytes) / 1073741824, 2) AS logical_data_gb
    FROM catalog.db.affected_table.files;

Check how many snapshots are pinning old data:

    SELECT
      COUNT(*) AS total_snapshots,
      MIN(committed_at) AS oldest_retained,
      DATEDIFF(DAY, MIN(committed_at), CURRENT_TIMESTAMP()) AS retention_days
    FROM catalog.db.affected_table.snapshots;

If retention exceeds 14 days on a high-write table, expired data files are being held longer than necessary. Run a dry-run orphan cleanup to quantify the waste:

    CALL catalog.system.remove_orphan_files(
      table => 'db.affected_table',
      older_than => TIMESTAMP '2026-06-11 00:00:00',
      dry_run => true
    );

Immediate fix
Execute the maintenance sequence in order — expire snapshots first (to dereference old files), then remove orphans (to delete the physical files):

    CALL catalog.system.expire_snapshots(
      table => 'db.affected_table',
      older_than => TIMESTAMP '2026-06-13 00:00:00',
      retain_last => 100
    );

    CALL catalog.system.remove_orphan_files(
      table => 'db.affected_table',
      older_than => TIMESTAMP '2026-06-11 00:00:00'
    );

The older_than threshold for orphan cleanup must be at least 7 days in the past. Files from in-progress writes are temporarily orphaned until the writer commits, and deleting them prematurely corrupts the table. This is a hard safety rule.
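Because the cutoff is easy to get wrong when typed by hand, it can be derived from the current time with the 7-day floor enforced in code. A minimal sketch; orphan_cutoff and MIN_ORPHAN_AGE are hypothetical names, and the output is just the literal to paste into the older_than argument.

```python
from datetime import datetime, timedelta, timezone

MIN_ORPHAN_AGE = timedelta(days=7)  # safety window stated in the text


def orphan_cutoff(now: datetime, age: timedelta = MIN_ORPHAN_AGE) -> str:
    """Return a SQL TIMESTAMP literal for older_than, refusing windows under 7 days."""
    if age < MIN_ORPHAN_AGE:
        raise ValueError("older_than must be at least 7 days in the past")
    return (now - age).strftime("TIMESTAMP '%Y-%m-%d %H:%M:%S'")


print(orphan_cutoff(datetime.now(timezone.utc)))
```

Rejecting shorter windows outright, rather than warning, keeps an emergency cleanup from deleting files that belong to an in-progress write.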
Long-term prevention
Run orphan cleanup weekly (or daily for high-write tables), always after snapshot expiration. Set snapshot retention to the minimum window your team needs for rollback — typically 3–7 days for streaming, 14 days for batch. Enable metadata file auto-cleanup:

    ALTER TABLE db.affected_table SET TBLPROPERTIES (
      'write.metadata.delete-after-commit.enabled' = 'true',
      'write.metadata.previous-versions-max' = '100'
    );

How LakeOps prevents this
Orphan cleanup runs as part of the coordinated maintenance pipeline, after snapshot expiration releases file references. The 7+ day safety window is enforced by default — there is no risk of premature deletion regardless of who configures the policy. Continuous safe cleanup means orphan files never accumulate to 25–40% of storage because they are removed within days of becoming unreferenced, not weeks or months later when someone notices the bill. The full audit trail logs every orphan removal with file count and bytes reclaimed.
Incident 5: Engine OOM during compaction
Symptom
Spark compaction jobs crash with java.lang.OutOfMemoryError: Java heap space or Container killed by YARN for exceeding memory limits. The job processes for 20–40 minutes, then dies. Increasing executor memory delays but does not prevent the OOM. The failure is specific to certain partitions — smaller partitions compact successfully.
Root cause
Sort compaction on massive partitions. Sort-based compaction must read all files in a partition, sort them by the specified columns, and write the output. A partition with 500 GB of data and 50,000 files requires holding the sort state in memory. Spark's shuffle-based sort generates enormous intermediate data that exceeds executor memory limits. Z-order compaction is even more memory-intensive due to the interleaving computation.
Diagnosis
Identify which partitions are too large for in-memory sort:

    SELECT
      partition,
      COUNT(*) AS file_count,
      ROUND(SUM(file_size_in_bytes) / 1073741824, 2) AS partition_size_gb
    FROM catalog.db.affected_table.files
    GROUP BY partition
    ORDER BY partition_size_gb DESC
    LIMIT 10;

If any partition exceeds 100 GB with 10,000+ files and you are running sort compaction, that is the OOM source. Check the Spark job configuration for spark.executor.memory and the compaction strategy being used.
Immediate fix
Switch to binpack for the oversized partitions. Binpack does not require global sort — it merges files by size with bounded memory:

    CALL catalog.system.rewrite_data_files(
      table => 'db.affected_table',
      strategy => 'binpack',
      where => 'partition_date = "2026-06-01"',
      options => map(
        'target-file-size-bytes', '268435456',
        'min-input-files', '5',
        'partial-progress.enabled', 'true',
        'partial-progress.max-commits', '50',
        'max-file-group-size-bytes', '10737418240'
      )
    );

The max-file-group-size-bytes option (10 GB above) limits how much data is processed per rewrite group, bounding memory. Reduce this value if OOMs persist. After stabilizing with binpack, if sort optimization is required, narrow the scope to sub-partition ranges or break the partition into smaller chunks that fit in memory.
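A quick estimate shows how the group size cap turns one oversized job into many bounded ones. The sketch assumes groups are filled close to the cap; the real planner packs whole files into groups, so the actual count may be somewhat higher. The helper name is hypothetical.

```python
GIB = 1024 ** 3


def file_group_count(partition_bytes: int, max_group_bytes: int) -> int:
    """Lower-bound estimate of rewrite groups: ceiling of partition size over group cap."""
    return -(-partition_bytes // max_group_bytes)  # integer ceiling division


max_group = 10 * GIB  # equals 10737418240, the value used in the call above
print(file_group_count(500 * GIB, max_group))  # the 500 GB partition from the root cause
```

Each group is rewritten independently, so peak memory depends on the group cap rather than on the partition size, and with partial progress enabled the groups can be committed in batches.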
Long-term prevention
Never run unbounded sort compaction on partitions exceeding 100 GB without file group size limits. Use max-file-group-size-bytes to cap per-group memory consumption. For tables that require sorted data in large partitions, run binpack first to reduce file count, then sort in a second pass with bounded groups. Increase spark.sql.shuffle.partitions for sort jobs to distribute work across more executors.
How LakeOps prevents this
LakeOps runs compaction on a purpose-built Rust engine on Apache DataFusion. The engine uses streaming sort with bounded memory, lock-free parallelism, and no JVM — eliminating garbage collection, heap limits, and OOM entirely. Partitions that crash Spark with OOM complete in minutes on the Rust engine. A 1.2 TB partition that caused Spark to OOM completed in 11 minutes. No cluster resizing, no memory tuning, no shuffle partition configuration.
Incident 6: Time travel queries fail
Symptom
Queries specifying FOR SYSTEM_TIME AS OF or FOR SYSTEM_VERSION AS OF fail with errors like Cannot find snapshot older than <timestamp> or Snapshot ID <id> does not exist. Users relying on time travel for debugging, auditing, or rollback lose access to historical states.
Root cause
Snapshot expiration was configured too aggressively. Once a snapshot is expired, it is permanently gone — along with exclusive references to data files from that point in time. If expiration runs with older_than set to 3 days and a user needs data from 5 days ago, the snapshot no longer exists. This also happens when teams expire snapshots that long-running queries are actively using — the query fails mid-execution because its underlying data files get deleted.
Diagnosis
Check the current snapshot retention:

    SELECT
      snapshot_id,
      committed_at,
      operation
    FROM catalog.db.affected_table.snapshots
    ORDER BY committed_at ASC
    LIMIT 10;

If the oldest available snapshot is more recent than the time travel target, the data has been expired. Verify the table properties to see what retention is configured:

    SHOW TBLPROPERTIES db.affected_table;
    -- Look for:
    --   history.expire.max-snapshot-age-ms
    --   history.expire.min-snapshots-to-keep

Immediate fix
If the target snapshot is already expired, recovery is not possible through standard Iceberg APIs — the metadata has been permanently removed. For future protection, adjust retention immediately:

    ALTER TABLE db.affected_table SET TBLPROPERTIES (
      'history.expire.max-snapshot-age-ms' = '604800000',
      'history.expire.min-snapshots-to-keep' = '100'
    );

The above retains snapshots for 7 days and keeps at least 100 snapshots regardless of age. For compliance-critical tables, use tags to pin specific snapshots that must never expire:

    ALTER TABLE db.affected_table CREATE TAG `end_of_quarter_2026Q2`
      AS OF VERSION 847291035
      RETAIN 365 DAYS;

Long-term prevention
Set retain_last high enough that the oldest retained snapshot covers your incident response SLA. If your team takes 72 hours to detect bad data, retain at least 5 days of snapshots. Use tags for audit checkpoints. Document retention windows per table — streaming tables may need 3–7 days; compliance tables may need 90 days. Never expire snapshots with an older_than threshold shorter than your longest-running query's expected duration — otherwise active queries can fail mid-execution.
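Retention properties are expressed in milliseconds, which makes them easy to mistype. The sketch below converts a window in days to the property value and checks whether a time travel target falls inside that window. It considers only the age threshold and ignores snapshots kept alive by retain_last or tags; both function names are hypothetical.

```python
from datetime import datetime, timedelta


def retention_ms(days: int) -> int:
    """Value for history.expire.max-snapshot-age-ms given a window in days."""
    return days * 24 * 60 * 60 * 1000


def is_reachable(target: datetime, now: datetime, retention_days: int) -> bool:
    """True if the target time is still inside the age-based retention window."""
    return now - target <= timedelta(days=retention_days)


now = datetime(2026, 6, 18, 12, 0, 0)
five_days_ago = now - timedelta(days=5)

print(retention_ms(7))
print(is_reachable(five_days_ago, now, retention_days=3))
print(is_reachable(five_days_ago, now, retention_days=7))
```

Running the check against your slowest realistic detection time is a simple way to validate a proposed retention window before applying it.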
How LakeOps prevents this
Retention policies are configurable per table or via catalog-wide policies with enforced retain_last minimums. LakeOps allows tagged snapshots for audit points that are excluded from expiration regardless of the age threshold. The observability dashboard surfaces the current retention window for every table, making it visible when a table's configuration is too aggressive relative to its usage patterns. Insights flag tables where retention is shorter than the observed time-travel query history.
Incident 7: Delete file ratio degrading reads
Symptom
Queries on tables with frequent UPDATEs or DELETEs slow progressively over days or weeks. No corresponding increase in data volume — the table has the same logical row count. Engine profiles show excessive time in "delete file reconciliation" or "merge-on-read" phases. Individual queries that touched 10 files now reconcile against 200+ delete files.
Root cause
Merge-on-read mode accumulates delete files with every mutation. Position delete files mark specific rows by (file_path, position). Equality delete files mark rows by column values. Both require per-query reconciliation — every read must join data files against their associated delete files to filter out logically removed rows. Without compaction that physically applies the deletes, this overhead compounds linearly with mutation count.
Diagnosis
Measure the delete file ratio per partition:

    WITH data AS (
      SELECT partition, COUNT(*) AS data_files
      FROM catalog.db.affected_table.files
      GROUP BY partition
    ),
    deletes AS (
      SELECT partition, COUNT(*) AS delete_files, SUM(record_count) AS delete_records
      FROM catalog.db.affected_table.all_delete_files
      GROUP BY partition
    )
    SELECT
      d.partition,
      d.data_files,
      COALESCE(del.delete_files, 0) AS delete_files,
      ROUND(COALESCE(del.delete_files, 0) * 100.0 / d.data_files, 1) AS delete_ratio_pct
    FROM data d
    LEFT JOIN deletes del ON d.partition = del.partition
    WHERE COALESCE(del.delete_files, 0) > 0
    ORDER BY delete_ratio_pct DESC;

If any partition shows a delete-to-data ratio above 10%, compaction targeting delete files is overdue. Ratios above 50% indicate severe read degradation. For a comprehensive breakdown of delete file mechanics and thresholds, see Iceberg Delete Files Guide.
Immediate fix
Run targeted compaction with delete-file-threshold set to rewrite any data file with associated deletes:

    CALL catalog.system.rewrite_data_files(
      table => 'db.affected_table',
      strategy => 'binpack',
      where => 'partition_date >= current_date() - INTERVAL 14 DAYS',
      options => map(
        'delete-file-threshold', '1',
        'target-file-size-bytes', '268435456',
        'partial-progress.enabled', 'true',
        'remove-dangling-deletes', 'true'
      )
    );

The remove-dangling-deletes option generates a follow-up commit to clean up delete files that no longer reference any live data files. Focus on partitions with the highest delete ratios first; those deliver the biggest read performance improvement per compaction dollar.
Long-term prevention
Match compaction frequency to mutation rate. A table receiving 100 deletes per hour needs sub-hourly compaction on affected partitions — not a nightly batch job. Use delete-file-threshold as a trigger: compact any partition where the ratio exceeds your threshold (10% is a reasonable default). For tables with both small files and delete file accumulation, a single compaction pass resolves both simultaneously.
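A ratio-based trigger of the kind described above can be sketched in a few lines. The input shape (partition mapped to a pair of data file count and delete file count) and the function names are assumptions for illustration; the 10% default comes from the text.

```python
def delete_ratio_pct(data_files: int, delete_files: int) -> float:
    """Delete-to-data file ratio as a percentage, rounded to one decimal place."""
    if data_files <= 0:
        raise ValueError("data_files must be positive")
    return round(delete_files * 100.0 / data_files, 1)


def partitions_over_threshold(
    stats: dict[str, tuple[int, int]], threshold_pct: float = 10.0
) -> list[str]:
    """Partitions whose ratio exceeds the threshold, highest ratio first."""
    ratios = {p: delete_ratio_pct(d, x) for p, (d, x) in stats.items()}
    flagged = [p for p, r in ratios.items() if r > threshold_pct]
    return sorted(flagged, key=lambda p: ratios[p], reverse=True)


stats = {
    "2026-06-01": (200, 30),   # 15.0%
    "2026-06-02": (200, 10),   # 5.0%
    "2026-06-03": (100, 60),   # 60.0%
}
print(partitions_over_threshold(stats))
```

Sorting by ratio means the partitions with the heaviest reconciliation cost are compacted first when the compaction budget is limited.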
How LakeOps prevents this
Delete file ratios are tracked per partition with configurable triggers. When the ratio exceeds the threshold (default 10%), compaction fires automatically and physically applies pending deletes during the pass — one operation resolves both small files and delete files simultaneously. The Insights system surfaces partitions with rising delete ratios at WARNING severity before they reach CRITICAL, giving teams visibility into the trend even if automatic resolution handles it. The full audit trail shows delete files removed per compaction run.
Incident 8: Schema changes break downstream
Symptom
Downstream consumers — Trino queries, dbt models, BI dashboards — fail after a schema change is applied to the source Iceberg table. Errors include Column 'X' not found, type mismatch exceptions, or unexpected NULLs in previously non-null columns. The failure may not surface immediately if consumers cache schema metadata.
Root cause
Schema evolution applied directly to the production branch without compatibility testing. Iceberg supports schema evolution (add columns, rename columns, widen types, reorder columns) without rewriting data — but downstream consumers that reference columns by name or position break if they are not prepared for the change. Dropping or renaming a column that a consumer depends on causes immediate failures.
Diagnosis
Check the schema history to identify what changed:

    SELECT * FROM catalog.db.affected_table.metadata_log_entries
    ORDER BY timestamp DESC
    LIMIT 20;

Compare the current schema against what downstream consumers expect. Identify which columns were added, dropped, renamed, or had their types changed. Cross-reference with the consumer's query definitions to find the incompatibility.
Immediate fix
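If a column was dropped or renamed and consumers depend on it, roll back to the snapshot that preceded the change. Note that rollback resets the current snapshot rather than the table-level schema, so check the schema afterwards and reverse the DDL itself (for example, rename the column back) if the change is still in effect: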

    CALL catalog.system.rollback_to_snapshot(
      table => 'db.affected_table',
      snapshot_id => 847291034
    );

Then reapply using the branch-based approach for safe schema evolution:

    ALTER TABLE db.affected_table CREATE BRANCH schema_test_v2
      RETAIN 7 DAYS;

    SET spark.wap.branch = schema_test_v2;
    INSERT INTO db.affected_table SELECT * FROM test_data_with_new_schema;

    -- Validate downstream consumers against the branch
    -- SELECT * FROM db.affected_table VERSION AS OF 'schema_test_v2'
    -- Run integration tests here

    CALL catalog.system.fast_forward(
      table => 'db.affected_table',
      branch => 'main',
      to => 'schema_test_v2'
    );

Long-term prevention
Never apply schema changes directly to production without consumer validation. Use the Write-Audit-Publish pattern: make changes on a branch, validate with downstream consumers (run their queries against the branch), then fast-forward to main. Maintain a schema compatibility contract — document which columns are public APIs and which are internal. Only additive changes (new nullable columns) are safe to apply without coordination.
How LakeOps prevents this
While schema evolution is a pipeline concern rather than a maintenance operation, LakeOps's observability layer surfaces schema change events in the table event log. Every schema modification is logged with a timestamp and before/after state, making it immediately visible when changes happened and what was modified. Teams can set up alerts on schema change events to trigger validation pipelines before consumers encounter the incompatibility.

Incident response procedures
Severity classification
P1 — Production queries failing. Time travel errors, schema breaks, or CommitFailedExceptions blocking pipelines. Response: immediate. Fix within 30 minutes.
P2 — Performance degradation above 5x. Query latency 10x normal, planning takes minutes. Pipelines are slow but not failing. Response: within 1 hour. Fix within 4 hours.
P3 — Cost anomaly. Storage growing faster than expected, orphan accumulation, snapshot retention too long. Response: within 24 hours. Fix within 1 week.
P4 — Drift from targets. File sizes trending down, manifest count rising, delete ratio climbing. No user-visible impact yet. Response: next maintenance window.
LakeOps maps these severity levels to its Insights system: CRITICAL and HIGH correspond to P1/P2 incidents, WARNING to P3, and LOW to P4. The difference is detection — Insights surface conditions at P4 before they escalate to P1.
Escalation path
1. On-call engineer: diagnose using metadata queries above, apply immediate fix
2. Data platform team: review maintenance configuration, adjust policies, tune compaction schedules
3. Infrastructure team: cluster resizing for OOM, S3 prefix optimization for throttling, catalog capacity
4. Vendor support: engine-specific bugs (Spark, Trino, Flink), catalog issues (Glue, REST, Polaris)
Post-incident checklist
- Confirm fix is applied and metrics are trending toward healthy
- Identify the root cause prevention (configuration change, new policy, schedule adjustment)
- Update monitoring thresholds if the incident was not detected by automated alerts
- Document the incident in the operations log for pattern analysis
- If this incident would have been prevented by automated maintenance, evaluate whether a control plane should handle it going forward
Monitoring: detecting problems before they page
Every incident above is detectable from Iceberg metadata before it impacts users. The following queries form a minimum monitoring baseline. Run them on a schedule (hourly for streaming tables, daily for batch) and alert when thresholds are breached.
File health check

    SELECT
      partition,
      COUNT(*) AS file_count,
      ROUND(AVG(file_size_in_bytes) / 1048576, 1) AS avg_size_mb
    FROM catalog.db.target_table.files
    GROUP BY partition
    HAVING COUNT(*) > 500 OR AVG(file_size_in_bytes) < 67108864
    ORDER BY file_count DESC;

Snapshot accumulation check

    SELECT
      COUNT(*) AS snapshot_count,
      MIN(committed_at) AS oldest_snapshot
    FROM catalog.db.target_table.snapshots
    HAVING COUNT(*) > 1000;

Manifest fragmentation check

    SELECT
      COUNT(*) AS manifest_count,
      ROUND(AVG(added_data_files_count + existing_data_files_count), 1) AS avg_entries_per_manifest
    FROM catalog.db.target_table.manifests
    HAVING COUNT(*) > 200;

Delete file accumulation check

    WITH data AS (
      SELECT partition, COUNT(*) AS data_files
      FROM catalog.db.target_table.files
      GROUP BY partition
    ),
    deletes AS (
      SELECT partition, COUNT(*) AS delete_files
      FROM catalog.db.target_table.all_delete_files
      GROUP BY partition
    )
    SELECT
      d.partition,
      d.data_files,
      del.delete_files,
      ROUND(del.delete_files * 100.0 / d.data_files, 1) AS ratio_pct
    FROM data d
    JOIN deletes del ON d.partition = del.partition
    WHERE del.delete_files * 100.0 / d.data_files > 20
    ORDER BY ratio_pct DESC;

Write conflict frequency check
Monitor your streaming job logs for CommitFailedException frequency. If retry success rate drops below 95%, your conflict window is too large — scope compaction more narrowly or increase retry limits.
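The 95% check can be computed from two counters that most log pipelines can produce. The sketch assumes you count commits that hit CommitFailedException at least once and, of those, how many eventually succeeded after retrying; both counter names and function names are hypothetical.

```python
ALERT_BELOW = 0.95  # threshold stated in the text


def retry_success_rate(conflicted_commits: int, recovered_commits: int) -> float:
    """Fraction of conflicted commits that eventually succeeded after retrying."""
    if conflicted_commits == 0:
        return 1.0  # no conflicts observed in the window
    return recovered_commits / conflicted_commits


def should_alert(conflicted_commits: int, recovered_commits: int) -> bool:
    """True when the retry success rate falls below the alert threshold."""
    return retry_success_rate(conflicted_commits, recovered_commits) < ALERT_BELOW


print(should_alert(conflicted_commits=100, recovered_commits=97))
print(should_alert(conflicted_commits=100, recovered_commits=90))
```

Tracking the rate over a fixed window (for example, per hour) makes a sudden drop visible soon after a new compaction schedule or writer is deployed.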

From manual runbook to autonomous prevention
This runbook gives you the diagnostic path and fix for each incident. But the pattern is clear: every incident here is caused by maintenance that did not run or ran incorrectly. The reactive path — detect symptom, diagnose root cause, apply fix, configure prevention — works. The proactive path — prevent the conditions from occurring in the first place — is better.
LakeOps replaces manual runbook execution with a closed-loop system. Health classification catches problems at four severity levels (CRITICAL, HIGH, WARNING, LOW) and surfaces them in the Insights tab before users report symptoms. Event-driven maintenance triggers fire based on actual table telemetry — file count thresholds, delete ratios, snapshot depth — not arbitrary cron schedules. Conflict-aware execution never compacts hot partitions. Sequenced operations run in the correct order every time. The full audit trail logs every operation with duration, impact, and status.
The result: teams using LakeOps report 90%+ reduction in Iceberg-related incidents. Not because the incidents are fixed faster — because the conditions that cause them never develop.

Quick reference: incident → fix
Queries 10x slower → Check file count per partition → Binpack compaction on worst partitions → Schedule compaction proportional to write rate
Planning takes minutes → Check manifest count and snapshot count → Expire snapshots + rewrite manifests → Run manifest rewrite after every compaction
CommitFailedException → Check concurrent operations on same partitions → Increase retry config + scope compaction to cold partitions → Use hash write distribution, exclude active partitions from compaction
Storage growing faster than data → Compare logical vs physical storage → Expire snapshots + remove orphans with 7-day safety → Run orphan cleanup weekly after expiration
Compaction OOMs → Identify oversized partitions → Switch to binpack with max-file-group-size-bytes → Use Rust engine (LakeOps) or bound group size
Time travel fails → Check oldest available snapshot → Adjust retain_last and retention window → Tag critical snapshots, document retention per table
Delete ratio degrading reads → Measure delete-to-data ratio per partition → Compact with delete-file-threshold=1 → Match compaction frequency to mutation rate
Schema changes break consumers → Check schema history → Rollback to pre-change snapshot → Use branch-based evolution with WAP pattern
Further reading
- Automating Iceberg Table Maintenance — correct operation order, scheduling, and policy-driven automation
- Fixing Small Files in Apache Iceberg — root causes, measurement, and compaction strategies
- Iceberg Delete Files Guide — merge-on-read overhead, measurement, and resolution
- Iceberg Lakehouse Observability — continuous table health monitoring and alerting
- Managed Iceberg — autonomous maintenance control plane



