Apache Iceberg Operational Runbook: Incidents, Symptoms, and Fixes

A production-ready runbook for Iceberg incidents: queries suddenly slow, planning takes minutes, write conflicts spike, storage grows uncontrolled, compaction OOMs, time travel breaks, and delete files degrade reads. Each incident follows Symptom → Root Cause → Diagnosis → Fix → Prevention.

Apache Iceberg Operational Runbook — incidents, symptoms, and fixes with detect, diagnose, resolve, and verify workflow

Why Iceberg needs a runbook

Apache Iceberg gives you ACID semantics, schema evolution, partition evolution, and time travel — all in an open format that works across every major query engine. What it does not give you is operations. Iceberg ships primitives: rewrite_data_files, expire_snapshots, remove_orphan_files, rewrite_manifests. It does not ship scheduling, health detection, conflict handling, multi-table coordination, or observability. That gap between having the procedures and having healthy tables is where production incidents originate.

Every incident in this runbook traces back to the same root: maintenance that did not run, ran at the wrong time, ran in the wrong order, or ran without awareness of the table's current state. Small files accumulate because compaction was not triggered. Manifests fragment because no one rewrote them after compaction. Storage grows because orphan cleanup never ran after snapshot expiration. Write conflicts spike because compaction targeted hot partitions. The format is sound — the operational layer is missing.

A control plane like LakeOps fills that gap. It connects to your existing catalogs and engines, continuously monitors every table's health, and executes the correct maintenance operations autonomously — in the right order, at the right time, without conflicting with active writers. Teams running LakeOps report a 90%+ reduction in Iceberg-related incidents because the system prevents the conditions that cause them. Health classification catches degradation at the Warning stage — before it becomes a 2 AM page. Event-driven triggers fire based on actual table telemetry, not arbitrary cron schedules. Conflict-aware execution avoids hot partitions entirely.

But whether you run autonomous maintenance or manage tables manually, you need to understand how Iceberg fails. When the page fires, you need a path from symptom to fix that does not require guessing. This runbook gives you that path for the eight most common production incidents — ordered from most frequent to least frequent, each following the same structure: Symptom → Root Cause → Diagnosis → Immediate Fix → Long-term Prevention → How LakeOps prevents this.

Incident 1: Queries suddenly slow (10x+ degradation)

Symptom

Queries that ran in 2–5 seconds now take 30–60 seconds. No schema changes, no data volume spike. Dashboards time out. The degradation affects all engines (Trino, Spark, Athena), which points to a data-layer problem rather than an engine-specific issue.

Root cause

Small file explosion. A streaming pipeline (Flink, Spark Structured Streaming, or a CDC connector) has been writing files without compaction keeping pace. Each query must open every file individually: an HTTP round trip per S3 GET, a Parquet footer parse, a reader initialization. A table with 200,000 files averaging 3 MB performs dramatically worse than the same data in 2,000 files averaging 300 MB. This per-file overhead scales roughly linearly with file count: 100x more files means roughly 100x more file opens, footer reads, and scheduled tasks.
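
The arithmetic behind that comparison can be sketched in a few lines of Python. The 50 ms per-file open cost and the 32 parallel readers are illustrative assumptions, not measurements; only the 3 MB and 300 MB file sizes come from the example above.

```python
def file_count(total_bytes: int, avg_file_bytes: int) -> int:
    """Number of files needed to hold total_bytes at a given average size."""
    return -(-total_bytes // avg_file_bytes)  # ceiling division


def open_overhead_seconds(files: int, per_file_ms: float, parallelism: int) -> float:
    """Wall-clock time spent only on opening files, spread over parallel readers."""
    return files * per_file_ms / 1000.0 / parallelism


MB = 1_000_000
total = 600_000 * MB  # roughly 600 GB, i.e. 200,000 files x 3 MB

small = file_count(total, 3 * MB)
large = file_count(total, 300 * MB)

# ASSUMPTIONS: 50 ms fixed cost per file open, 32 concurrent readers.
small_s = open_overhead_seconds(small, per_file_ms=50, parallelism=32)
large_s = open_overhead_seconds(large, per_file_ms=50, parallelism=32)
print(small, large, small_s / large_s)
```

Under these assumptions the overhead ratio between the two layouts equals the ratio of file counts, because the cost modeled here is paid per file rather than per byte.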

Diagnosis

Confirm the file count and average size per partition:

sql
SELECT
  partition,
  COUNT(*) AS file_count,
  ROUND(AVG(file_size_in_bytes) / 1048576, 1) AS avg_size_mb,
  SUM(CASE WHEN file_size_in_bytes < 33554432 THEN 1 ELSE 0 END) AS small_files_under_32mb
FROM catalog.db.affected_table.files
GROUP BY partition
ORDER BY file_count DESC
LIMIT 20;

If multiple partitions show 1,000+ files with average sizes below 32 MB, you have a small file explosion. Cross-reference with the write history to confirm compaction stopped keeping up:

sql
SELECT
  committed_at,
  operation,
  summary['added-data-files'] AS files_added,
  summary['deleted-data-files'] AS files_deleted
FROM catalog.db.affected_table.snapshots
ORDER BY committed_at DESC
LIMIT 50;

A sustained pattern of append operations without any replace operations (which indicate compaction ran) confirms the diagnosis.
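
If the snapshot history is exported to a script, the same check can be automated. A minimal sketch, assuming the `operation` values are passed in newest first (the function name is hypothetical):

```python
def appends_since_last_compaction(operations: list[str]) -> int:
    """Count append commits, newest first, until the first 'replace' commit.

    `operations` is the operation column of the snapshots table ordered by
    committed_at DESC. A 'replace' snapshot indicates that compaction ran.
    """
    streak = 0
    for op in operations:
        if op == "replace":
            break
        if op == "append":
            streak += 1
    return streak


history = ["append", "append", "append", "replace", "append"]
print(appends_since_last_compaction(history))
```

A streak equal to the full length of the window you queried means no compaction ran in that window.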

Immediate fix

Run emergency binpack compaction on the worst partitions. Binpack merges files without re-sorting — fastest path to relief:

sql
CALL catalog.system.rewrite_data_files(
  table => 'db.affected_table',
  strategy => 'binpack',
  where => 'event_date >= current_date() - INTERVAL 7 DAYS',
  options => map(
    'target-file-size-bytes', '268435456',
    'min-input-files', '3',
    'partial-progress.enabled', 'true',
    'partial-progress.max-commits', '20',
    'max-concurrent-file-group-rewrites', '15'
  )
);

Use partial-progress.enabled so the job commits incrementally — if it fails partway through, progress is retained. After compaction completes, rewrite manifests to reflect the new layout:

sql
CALL catalog.system.rewrite_manifests(
  table => 'db.affected_table'
);

Long-term prevention

Schedule compaction proportionally to write frequency. Streaming tables need compaction every 1–4 hours; batch tables need it daily. Fixed nightly schedules leave streaming tables exposed for 23 hours. Event-driven triggers that fire when file count crosses a threshold (e.g., 500 files per partition) are more reliable than time-based schedules. For a deep dive on small file root causes and resolution strategies, see Fixing Small Files in Apache Iceberg.
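
A threshold trigger of this kind is simple to express. The sketch below assumes per-partition file counts have already been collected (for example from the files metadata table); the function name and the choice to order by file count are illustrative.

```python
def partitions_to_compact(file_counts: dict[str, int], threshold: int = 500) -> list[str]:
    """Return partitions whose file count has reached the threshold, worst first."""
    over = [(count, part) for part, count in file_counts.items() if count >= threshold]
    return [part for count, part in sorted(over, reverse=True)]


counts = {"2026-06-01": 120, "2026-06-02": 740, "2026-06-03": 1310}
print(partitions_to_compact(counts))
```

Ordering the result by file count means a job with a limited time budget spends it on the partitions that hurt queries most.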

How LakeOps prevents this

LakeOps monitors file count and average file size per partition continuously. When thresholds are breached, compaction triggers automatically — no cron schedule to miss the window. The compaction engine is written in Rust on Apache DataFusion, processing Parquet through Arrow columnar buffers with bounded memory and no JVM. Event-driven triggers based on file count thresholds mean a streaming table that accumulates 500 small files at 3 PM gets compacted at 3 PM — not at midnight when the nightly job runs. The health classification system flags the table as Warning at 500 files and Critical at 1,000, alerting the team before users notice degradation.

Incident 2: Query planning takes minutes

Symptom

Queries hang for 30 seconds to several minutes before returning any rows. Engine logs show the delay is in the planning phase — before any data files are scanned. EXPLAIN returns slowly. This affects all queries against the table, not just complex ones.

Root cause

Manifest bloat combined with excessive snapshot retention. Every commit creates at least one new manifest file. A streaming table with 5-minute commits accumulates 8,640 manifests per month. Each manifest that survives partition pruning must be read and evaluated during query planning. At the same time, thousands of retained snapshots inflate the table metadata that every engine loads before it can plan, and keep old manifests and manifest lists alive.
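
The monthly figure follows directly from the commit interval. A short sketch, assuming a 30-day month and exactly one new manifest per commit (the stated minimum, so the result is a lower bound):

```python
def commits_per_month(commit_interval_minutes: float, days: int = 30) -> int:
    """Commits in a month of `days` days at a fixed commit interval."""
    return int(days * 24 * 60 / commit_interval_minutes)


# One new manifest per commit is the minimum, so this is a lower bound.
print(commits_per_month(5))
```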

Diagnosis

Check manifest count and fragmentation:

sql
SELECT
  COUNT(*) AS manifest_count,
  ROUND(AVG(length) / 1024, 1) AS avg_manifest_size_kb,
  SUM(added_data_files_count + existing_data_files_count) AS total_file_entries
FROM catalog.db.affected_table.manifests;

Check snapshot accumulation:

sql
SELECT
  COUNT(*) AS snapshot_count,
  MIN(committed_at) AS oldest_snapshot,
  MAX(committed_at) AS latest_snapshot,
  DATEDIFF(DAY, MIN(committed_at), MAX(committed_at)) AS retention_days
FROM catalog.db.affected_table.snapshots;

If manifest count exceeds 500 or snapshot count exceeds 2,000, you have a planning bottleneck. A healthy streaming table should have fewer than 100 manifests after regular rewriting.

Immediate fix

First, expire old snapshots to reduce the metadata tree depth. Then rewrite manifests to consolidate the fragmented remainder:

sql
CALL catalog.system.expire_snapshots(
  table => 'db.affected_table',
  older_than => TIMESTAMP '2026-06-15 00:00:00',
  retain_last => 50
);

CALL catalog.system.rewrite_manifests(
  table => 'db.affected_table'
);

On tables with extreme manifest fragmentation (2,000+), manifest rewriting alone can reduce planning time from 30+ seconds to under 1 second.

Long-term prevention

Run manifest rewriting after every compaction cycle. Configure snapshot expiration to run at least daily with a retention window of 3–7 days for streaming tables. The correct maintenance sequence is: expire snapshots → remove orphans → compact → rewrite manifests. See Automating Iceberg Table Maintenance for the full sequencing logic.
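
If maintenance is driven from a script rather than a DAG, the sequence can be enforced in code instead of by convention. A minimal sketch; the operation names mirror the Iceberg procedures mentioned in this runbook, and the helper itself is hypothetical:

```python
MAINTENANCE_ORDER = [
    "expire_snapshots",
    "remove_orphan_files",
    "rewrite_data_files",   # compaction
    "rewrite_manifests",
]


def sequence(requested: set[str]) -> list[str]:
    """Return the requested operations in the order they should run."""
    unknown = requested - set(MAINTENANCE_ORDER)
    if unknown:
        raise ValueError(f"unknown operations: {sorted(unknown)}")
    return [op for op in MAINTENANCE_ORDER if op in requested]


print(sequence({"rewrite_manifests", "expire_snapshots", "rewrite_data_files"}))
```

Whatever subset a caller asks for, the operations come back in the same relative order, so a partial run cannot rewrite manifests before compaction.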

How LakeOps prevents this

Manifest rewriting runs automatically after every compaction cycle as part of the sequenced maintenance pipeline. The correct order — expire → orphans → compact → rewrite manifests — is enforced by the system, not by DAG configuration or human memory. Snapshot expiration is the first operation in every maintenance cycle, with configurable retention policies at the catalog, namespace, or table level. Tables never accumulate thousands of stale snapshots because expiration runs continuously as part of the coordinated pipeline.

Incident 3: Write conflicts (CommitFailedException)

Symptom

Spark or Flink jobs fail intermittently with org.apache.iceberg.exceptions.CommitFailedException: Cannot commit changes based on stale table metadata. Compaction jobs fail alongside streaming writers. The failures are intermittent — sometimes the job succeeds on retry, sometimes it fails repeatedly.

Root cause

Iceberg uses optimistic concurrency control. Every commit validates against the table state at the time the operation started. If another writer committed in the interval between your operation's start and its commit attempt, the commit is rejected. This happens most often when compaction targets the same partitions that active writers are appending to, or when multiple writers target the same partition simultaneously.
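
The mechanism can be illustrated with a toy model. This is not Iceberg's implementation: the `Table` class below is a hypothetical stand-in that reduces the catalog pointer to a single integer version and treats every concurrent commit as a conflict.

```python
class Table:
    """Toy stand-in for a catalog pointer: one integer metadata version."""

    def __init__(self) -> None:
        self.version = 0

    def commit(self, based_on: int) -> bool:
        """Compare-and-swap: succeed only if nobody committed since `based_on`."""
        if based_on != self.version:
            return False  # stale metadata, analogous to CommitFailedException
        self.version += 1
        return True


table = Table()
writer_base = table.version       # streaming writer reads metadata
compaction_base = table.version   # compaction starts from the same state

writer_ok = table.commit(writer_base)          # writer commits first
compaction_ok = table.commit(compaction_base)  # compaction is now stale
retry_ok = table.commit(table.version)         # retry against refreshed metadata
print(writer_ok, compaction_ok, retry_ok)  # True False True
```

The longer an operation runs between reading metadata and committing, the more likely another writer lands in between, which is why long compaction jobs on hot partitions lose this race so often.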

Diagnosis

Identify the conflicting operations by checking recent snapshots:

sql
SELECT
  committed_at,
  operation,
  summary['changed-partition-count'] AS partitions_affected,
  summary['added-data-files'] AS files_added,
  summary['deleted-data-files'] AS files_deleted
FROM catalog.db.affected_table.snapshots
ORDER BY committed_at DESC
LIMIT 30;

Look for overlapping timestamps between append operations (writers) and replace operations (compaction). If both target the same partitions within seconds of each other, that is your conflict window.

Immediate fix

Increase retry configuration on the failing writer. Iceberg retries only replay the metadata commit, not the entire write — making retries cheap:

sql
ALTER TABLE db.affected_table SET TBLPROPERTIES (
  'commit.retry.num-retries' = '10',
  'commit.retry.min-wait-ms' = '200',
  'commit.retry.max-wait-ms' = '30000',
  'commit.retry.total-timeout-ms' = '600000'
);
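
To get a feel for what these four values imply, the sketch below computes a wait schedule. It assumes plain exponential backoff that doubles from the minimum wait and is capped at the maximum, with no jitter; the exact backoff used by your Iceberg version may differ, so treat the numbers as an approximation.

```python
def backoff_schedule(num_retries: int, min_wait_ms: int, max_wait_ms: int,
                     total_timeout_ms: int) -> list[int]:
    """Waits (ms) before each retry under simple doubling, capped at max_wait_ms.

    Stops early once the cumulative wait would exceed total_timeout_ms.
    """
    waits: list[int] = []
    elapsed = 0
    for attempt in range(num_retries):
        wait = min(min_wait_ms * 2 ** attempt, max_wait_ms)
        if elapsed + wait > total_timeout_ms:
            break
        waits.append(wait)
        elapsed += wait
    return waits


schedule = backoff_schedule(10, 200, 30_000, 600_000)
print(schedule, sum(schedule))
```

Under these assumptions all ten retries fit in under two minutes of waiting, so the 10-minute total timeout acts as a safety net rather than the binding limit.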

Scope compaction to cold partitions only — exclude the actively-written partition:

sql
CALL catalog.system.rewrite_data_files(
  table => 'db.affected_table',
  strategy => 'binpack',
  where => 'event_date < current_date()',
  options => map(
    'partial-progress.enabled', 'true',
    'partial-progress.max-commits', '10'
  )
);

Long-term prevention

Never compact the active write partition. Use partial-progress.enabled = true so a single conflict does not invalidate an entire compaction run. Set the write distribution mode to hash so rows are clustered by partition before writing and each task touches fewer partitions, which reduces overlap. Schedule compaction to target partition < current_date() by default. For streaming tables with sub-minute commits, keep writers append-only where possible, since concurrent appends only add new files and need just the metadata commit retried, and avoid overlapping OVERWRITE or DELETE with appends on the same partition.

How LakeOps prevents this

LakeOps compaction is conflict-aware by design. It inspects active writer state and targets only cold partitions — partitions with no active streaming appends. If a conflict occurs despite this (e.g., a late-arriving batch write), the affected partition is retried on the next cycle automatically. The conflict window is minimized by design rather than by retry configuration. Because compaction uses partial-progress.enabled by default, a single conflict never invalidates an entire run — only the affected file group is retried.

Incident 4: S3 storage growing faster than data

Symptom

The S3 bill increases 30–50% month-over-month but the logical data volume (as reported by metadata) is flat or growing slowly. Storage audits show files in the table prefix that are not referenced by any snapshot. The gap between billed storage and logical data widens continuously.

Root cause

Two compounding factors: orphan files from failed writes, crashed compaction jobs, and aborted transactions; and snapshot retention keeping old data files referenced longer than needed. Orphan files are invisible to Iceberg — they exist on S3 but no manifest points to them. On mature lakes, orphans routinely account for 25–40% of billable storage on affected prefixes.

Diagnosis

Compare logical data size (from metadata) against physical storage:

sql
SELECT
  ROUND(SUM(file_size_in_bytes) / 1073741824, 2) AS logical_data_gb
FROM catalog.db.affected_table.files;
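
The physical side of the comparison has to come from outside Iceberg, for example from S3 storage metrics or an inventory report for the table prefix (an assumption about your setup). Given both numbers, the gap is a one-line calculation:

```python
def unreferenced_fraction(physical_bytes: int, logical_bytes: int) -> float:
    """Share of billed bytes not referenced by the current snapshot."""
    if physical_bytes <= 0:
        raise ValueError("physical_bytes must be positive")
    return max(physical_bytes - logical_bytes, 0) / physical_bytes


# Hypothetical figures: 10 TB billed under the prefix, 6.5 TB in current metadata.
TB = 10 ** 12
print(unreferenced_fraction(10 * TB, int(6.5 * TB)))
```

The gap is an upper bound on orphans, not an exact count: it also includes data files that are still referenced by older retained snapshots, plus metadata files.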

Check how many snapshots are pinning old data:

sql
SELECT
  COUNT(*) AS total_snapshots,
  MIN(committed_at) AS oldest_retained,
  DATEDIFF(DAY, MIN(committed_at), CURRENT_TIMESTAMP()) AS retention_days
FROM catalog.db.affected_table.snapshots;

If retention exceeds 14 days on a high-write table, expired data files are being held longer than necessary. Run a dry-run orphan cleanup to quantify the waste:

sql
CALL catalog.system.remove_orphan_files(
  table => 'db.affected_table',
  older_than => TIMESTAMP '2026-06-11 00:00:00',
  dry_run => true
);

Immediate fix

Execute the maintenance sequence in order — expire snapshots first (to dereference old files), then remove orphans (to delete the physical files):

sql
CALL catalog.system.expire_snapshots(
  table => 'db.affected_table',
  older_than => TIMESTAMP '2026-06-13 00:00:00',
  retain_last => 100
);

CALL catalog.system.remove_orphan_files(
  table => 'db.affected_table',
  older_than => TIMESTAMP '2026-06-11 00:00:00'
);

The older_than threshold for orphan cleanup must be at least 7 days in the past. Files from in-progress writes are temporarily orphaned until the writer commits — deleting them prematurely corrupts the table. This is a hard safety rule.
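
When the cutoff is generated by a script, the rule is easy to enforce before the procedure is ever called. A minimal sketch; the helper name is hypothetical and the 7-day floor is the rule stated above:

```python
from datetime import datetime, timedelta, timezone

MIN_ORPHAN_AGE = timedelta(days=7)


def safe_orphan_cutoff(now: datetime, age: timedelta = MIN_ORPHAN_AGE) -> datetime:
    """Return the older_than timestamp, refusing anything younger than 7 days."""
    if age < MIN_ORPHAN_AGE:
        raise ValueError("orphan cleanup cutoff must be at least 7 days in the past")
    return now - age


now = datetime(2026, 6, 18, tzinfo=timezone.utc)
cutoff = safe_orphan_cutoff(now)
print(cutoff.strftime("%Y-%m-%d %H:%M:%S"))  # 2026-06-11 00:00:00
```

Raising an error instead of silently clamping the value makes a misconfigured job fail loudly rather than run with a window nobody chose.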

Long-term prevention

Run orphan cleanup weekly (or daily for high-write tables), always after snapshot expiration. Set snapshot retention to the minimum window your team needs for rollback — typically 3–7 days for streaming, 14 days for batch. Enable metadata file auto-cleanup:

sql
ALTER TABLE db.affected_table SET TBLPROPERTIES (
  'write.metadata.delete-after-commit.enabled' = 'true',
  'write.metadata.previous-versions-max' = '100'
);

How LakeOps prevents this

Orphan cleanup runs as part of the coordinated maintenance pipeline, after snapshot expiration releases file references. The 7+ day safety window is enforced by default — there is no risk of premature deletion regardless of who configures the policy. Continuous safe cleanup means orphan files never accumulate to 25–40% of storage because they are removed within days of becoming unreferenced, not weeks or months later when someone notices the bill. The full audit trail logs every orphan removal with file count and bytes reclaimed.

Incident 5: Engine OOM during compaction

Symptom

Spark compaction jobs crash with java.lang.OutOfMemoryError: Java heap space or Container killed by YARN for exceeding memory limits. The job processes for 20–40 minutes, then dies. Increasing executor memory delays but does not prevent the OOM. The failure is specific to certain partitions — smaller partitions compact successfully.

Root cause

Sort compaction on massive partitions. Sort-based compaction must read all files in a partition, sort them by the specified columns, and write the output. A partition with 500 GB of data and 50,000 files requires holding the sort state in memory. Spark's shuffle-based sort generates enormous intermediate data that exceeds executor memory limits. Z-order compaction is even more memory-intensive due to the interleaving computation.

Diagnosis

Identify which partitions are too large for in-memory sort:

sql
SELECT
  partition,
  COUNT(*) AS file_count,
  ROUND(SUM(file_size_in_bytes) / 1073741824, 2) AS partition_size_gb
FROM catalog.db.affected_table.files
GROUP BY partition
ORDER BY partition_size_gb DESC
LIMIT 10;

If any partition exceeds 100 GB with 10,000+ files and you are running sort compaction, that is the OOM source. Check the Spark job configuration for spark.executor.memory and the compaction strategy being used.

Immediate fix

Switch to binpack for the oversized partitions. Binpack does not require global sort — it merges files by size with bounded memory:

sql
CALL catalog.system.rewrite_data_files(
  table => 'db.affected_table',
  strategy => 'binpack',
  where => 'partition_date = "2026-06-01"',
  options => map(
    'target-file-size-bytes', '268435456',
    'min-input-files', '5',
    'partial-progress.enabled', 'true',
    'partial-progress.max-commits', '50',
    'max-file-group-size-bytes', '10737418240'
  )
);

The max-file-group-size-bytes option (10 GB above) limits how much data is processed per rewrite group, bounding memory. Reduce this value if OOMs persist. After stabilizing with binpack, if sort optimization is required, narrow the scope to sub-partition ranges or break the partition into smaller chunks that fit in memory.
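
A quick estimate shows how the cap translates into work units. The sketch assumes every group is filled to the cap, so the result is a lower bound on the number of groups the planner actually produces:

```python
GIB = 1024 ** 3


def min_rewrite_groups(partition_bytes: int, max_group_bytes: int) -> int:
    """Fewest file groups needed if every group is filled to the cap."""
    return -(-partition_bytes // max_group_bytes)  # ceiling division


# A 500 GiB partition with the 10 GiB cap used above.
print(min_rewrite_groups(500 * GIB, 10_737_418_240))
```

Halving the cap roughly doubles the number of groups and halves the data each one handles, which is the lever to pull if OOMs persist.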

Long-term prevention

Never run unbounded sort compaction on partitions exceeding 100 GB without file group size limits. Use max-file-group-size-bytes to cap per-group memory consumption. For tables that require sorted data in large partitions, run binpack first to reduce file count, then sort in a second pass with bounded groups. Increase spark.sql.shuffle.partitions for sort jobs so the shuffle is split into more, smaller tasks that each need less memory.

How LakeOps prevents this

LakeOps runs compaction on a purpose-built Rust engine on Apache DataFusion. The engine uses streaming sort with bounded memory, lock-free parallelism, and no JVM — eliminating garbage collection, heap limits, and OOM entirely. Partitions that crash Spark with OOM complete in minutes on the Rust engine. A 1.2 TB partition that caused Spark to OOM completed in 11 minutes. No cluster resizing, no memory tuning, no shuffle partition configuration.

Incident 6: Time travel queries fail

Symptom

Queries specifying FOR SYSTEM_TIME AS OF or FOR SYSTEM_VERSION AS OF fail with errors like Cannot find snapshot older than <timestamp> or Snapshot ID <id> does not exist. Users relying on time travel for debugging, auditing, or rollback lose access to historical states.

Root cause

Snapshot expiration was configured too aggressively. Once a snapshot is expired, it is permanently gone — along with exclusive references to data files from that point in time. If expiration runs with older_than set to 3 days and a user needs data from 5 days ago, the snapshot no longer exists. This also happens when teams expire snapshots that long-running queries are actively using — the query fails mid-execution because its underlying data files get deleted.

Diagnosis

Check the current snapshot retention:

sql
SELECT
  snapshot_id,
  committed_at,
  operation
FROM catalog.db.affected_table.snapshots
ORDER BY committed_at ASC
LIMIT 10;

If the oldest available snapshot is more recent than the time travel target, the data has been expired. Verify the table properties to see what retention is configured:

sql
SHOW TBLPROPERTIES db.affected_table;
-- Look for:
-- history.expire.max-snapshot-age-ms
-- history.expire.min-snapshots-to-keep

Immediate fix

If the target snapshot is already expired, recovery is not possible through standard Iceberg APIs — the metadata has been permanently removed. For future protection, adjust retention immediately:

sql
ALTER TABLE db.affected_table SET TBLPROPERTIES (
  'history.expire.max-snapshot-age-ms' = '604800000',
  'history.expire.min-snapshots-to-keep' = '100'
);

The above retains snapshots for 7 days and keeps at least 100 snapshots regardless of age. For compliance-critical tables, use tags to pin specific snapshots that must never expire:

sql
ALTER TABLE db.affected_table CREATE TAG `end_of_quarter_2026Q2`
  AS OF VERSION 847291035
  RETAIN 365 DAYS;

Long-term prevention

Set retain_last high enough that the oldest retained snapshot covers your incident response SLA. If your team takes 72 hours to detect bad data, retain at least 5 days of snapshots. Use tags for audit checkpoints. Document retention windows per table — streaming tables may need 3–7 days; compliance tables may need 90 days. Never expire snapshots with an older_than threshold shorter than your longest-running query's expected duration — otherwise active queries can fail mid-execution.
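
Turning a detection window into a retention setting is a small calculation. In the sketch below the two-day safety margin is an assumption, chosen so that a 72-hour detection time yields the 5 days suggested above; the helper names are hypothetical.

```python
import math
from datetime import timedelta


def retention_days(detection_hours: float, margin_days: int = 2) -> int:
    """Days of snapshots to keep: detection time rounded up, plus a margin."""
    return math.ceil(detection_hours / 24) + margin_days


def to_millis(days: int) -> int:
    """Convert days to milliseconds for history.expire.max-snapshot-age-ms."""
    return timedelta(days=days) // timedelta(milliseconds=1)


days = retention_days(72)
print(days, to_millis(days), to_millis(7))
```

The last value is the 7-day setting expressed in milliseconds, which is the unit the max-snapshot-age-ms table property expects.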

How LakeOps prevents this

Retention policies are configurable per table or via catalog-wide policies with enforced retain_last minimums. LakeOps allows tagged snapshots for audit points that are excluded from expiration regardless of the age threshold. The observability dashboard surfaces the current retention window for every table, making it visible when a table's configuration is too aggressive relative to its usage patterns. Insights flag tables where retention is shorter than the observed time-travel query history.

Incident 7: Delete file ratio degrading reads

Symptom

Queries on tables with frequent UPDATEs or DELETEs slow progressively over days or weeks. No corresponding increase in data volume — the table has the same logical row count. Engine profiles show excessive time in "delete file reconciliation" or "merge-on-read" phases. Individual queries that touched 10 files now reconcile against 200+ delete files.

Root cause

Merge-on-read mode accumulates delete files with every mutation. Position delete files mark specific rows by (file_path, position). Equality delete files mark rows by column values. Both require per-query reconciliation — every read must join data files against their associated delete files to filter out logically removed rows. Without compaction that physically applies the deletes, this overhead compounds linearly with mutation count.
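
The per-read reconciliation can be illustrated with a simplified model. This sketch ignores sequence numbers (real Iceberg applies a delete only to data files written before it) and represents rows as plain dictionaries; all names are hypothetical.

```python
def apply_deletes(rows, file_path, position_deletes, equality_deletes):
    """Filter one data file's rows against its delete files.

    rows: list of dicts in file order.
    position_deletes: set of (file_path, position) pairs.
    equality_deletes: list of dicts; a row matching every column/value pair
    of any entry is treated as deleted.
    """
    live = []
    for pos, row in enumerate(rows):
        if (file_path, pos) in position_deletes:
            continue
        if any(all(row.get(col) == val for col, val in cond.items())
               for cond in equality_deletes):
            continue
        live.append(row)
    return live


rows = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]
result = apply_deletes(rows, "a.parquet", {("a.parquet", 0)}, [{"id": 3}])
print(result)  # [{'id': 2}, {'id': 4}]
```

Every row is checked against every applicable delete on every read, which is why the cost grows with the number of accumulated delete files until compaction rewrites the data without the deleted rows.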

Diagnosis

Measure the delete file ratio per partition:

sql
-- data_files and delete_files cover the current snapshot only;
-- files mixes both kinds, and all_delete_files spans every retained snapshot.
WITH data AS (
  SELECT partition, COUNT(*) AS data_files
  FROM catalog.db.affected_table.data_files
  GROUP BY partition
),
deletes AS (
  SELECT partition, COUNT(*) AS delete_files, SUM(record_count) AS delete_records
  FROM catalog.db.affected_table.delete_files
  GROUP BY partition
)
SELECT
  d.partition,
  d.data_files,
  COALESCE(del.delete_files, 0) AS delete_files,
  ROUND(COALESCE(del.delete_files, 0) * 100.0 / d.data_files, 1) AS delete_ratio_pct
FROM data d
LEFT JOIN deletes del ON d.partition = del.partition
WHERE COALESCE(del.delete_files, 0) > 0
ORDER BY delete_ratio_pct DESC;

If any partition shows a delete-to-data ratio above 10%, compaction targeting delete files is overdue. Ratios above 50% indicate severe read degradation. For a comprehensive breakdown of delete file mechanics and thresholds, see Iceberg Delete Files Guide.
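
These two thresholds map onto a small classifier that a monitoring script could apply to the query output. The status labels are illustrative, not Iceberg or LakeOps terminology:

```python
def delete_ratio_status(data_files: int, delete_files: int) -> str:
    """Classify a partition by its delete-to-data file ratio."""
    if data_files <= 0:
        raise ValueError("data_files must be positive")
    ratio_pct = delete_files * 100.0 / data_files
    if ratio_pct > 50:
        return "severe"
    if ratio_pct > 10:
        return "overdue"
    return "ok"


print(delete_ratio_status(200, 15), delete_ratio_status(200, 30), delete_ratio_status(200, 120))
```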

Immediate fix

Run targeted compaction with delete-file-threshold set to rewrite any data file with associated deletes:

sql
CALL catalog.system.rewrite_data_files(
  table => 'db.affected_table',
  strategy => 'binpack',
  where => 'partition_date >= current_date() - INTERVAL 14 DAYS',
  options => map(
    'delete-file-threshold', '1',
    'target-file-size-bytes', '268435456',
    'partial-progress.enabled', 'true',
    'remove-dangling-deletes', 'true'
  )
);

The remove-dangling-deletes option generates a follow-up commit to clean up delete files that no longer reference any live data files. Focus on partitions with the highest delete ratios first — those deliver the biggest read performance improvement per compaction dollar.

Long-term prevention

Match compaction frequency to mutation rate. A table receiving 100 deletes per hour needs sub-hourly compaction on affected partitions, not a nightly batch job. Use the delete ratio as a trigger: compact any partition where it exceeds your threshold (10% is a reasonable default). Note that the delete-file-threshold option itself is a per-data-file count of associated delete files, not a ratio. For tables with both small files and delete file accumulation, a single compaction pass resolves both simultaneously.

How LakeOps prevents this

Delete file ratios are tracked per partition with configurable triggers. When the ratio exceeds the threshold (default 10%), compaction fires automatically and physically applies pending deletes during the pass — one operation resolves both small files and delete files simultaneously. The Insights system surfaces partitions with rising delete ratios at WARNING severity before they reach CRITICAL, giving teams visibility into the trend even if automatic resolution handles it. The full audit trail shows delete files removed per compaction run.

Incident 8: Schema changes break downstream

Symptom

Downstream consumers — Trino queries, dbt models, BI dashboards — fail after a schema change is applied to the source Iceberg table. Errors include Column 'X' not found, type mismatch exceptions, or unexpected NULLs in previously non-null columns. The failure may not surface immediately if consumers cache schema metadata.

Root cause

Schema evolution applied directly to the production branch without compatibility testing. Iceberg supports schema evolution (add columns, rename columns, widen types, reorder columns) without rewriting data — but downstream consumers that reference columns by name or position break if they are not prepared for the change. Dropping or renaming a column that a consumer depends on causes immediate failures.

Diagnosis

Check the schema history to identify what changed:

sql
SELECT * FROM catalog.db.affected_table.metadata_log_entries
ORDER BY timestamp DESC
LIMIT 20;

Compare the current schema against what downstream consumers expect. Identify which columns were added, dropped, renamed, or had their types changed. Cross-reference with the consumer's query definitions to find the incompatibility.

Immediate fix

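If a column was dropped or renamed and consumers depend on it, reverse the change. Note that rollback_to_snapshot restores the table's data to an earlier snapshot but does not revert the current schema, which is table-level metadata: a rename has to be undone with ALTER TABLE ... RENAME COLUMN, and a dropped column that is re-added gets a new field ID, so its old values do not come back. If rows were already written under the new schema, also roll the data back to the pre-change snapshot: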

sql
CALL catalog.system.rollback_to_snapshot(
  table => 'db.affected_table',
  snapshot_id => 847291034
);

Then reapply using the branch-based approach for safe schema evolution:

sql
-- WAP writes via spark.wap.branch require this table property
ALTER TABLE db.affected_table SET TBLPROPERTIES ('write.wap.enabled' = 'true');

ALTER TABLE db.affected_table CREATE BRANCH schema_test_v2
  RETAIN 7 DAYS;

SET spark.wap.branch = schema_test_v2;
INSERT INTO db.affected_table SELECT * FROM test_data_with_new_schema;

-- Validate downstream consumers against the branch
-- SELECT * FROM db.affected_table VERSION AS OF 'schema_test_v2'
-- Run integration tests here

CALL catalog.system.fast_forward(
  table => 'db.affected_table',
  branch => 'main',
  to => 'schema_test_v2'
);

Long-term prevention

Never apply schema changes directly to production without consumer validation. Use the Write-Audit-Publish pattern: make changes on a branch, validate with downstream consumers (run their queries against the branch), then fast-forward to main. Maintain a schema compatibility contract — document which columns are public APIs and which are internal. Only additive changes (new nullable columns) are safe to apply without coordination.
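
A compatibility contract can be checked mechanically before a change is published. The sketch below compares two schemas represented as name-to-type dictionaries, a simplification chosen for illustration; it reports anything a consumer that reads columns by name would notice and ignores purely additive changes.

```python
def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List changes visible to a by-name consumer; added columns are not flagged."""
    problems = []
    for name, old_type in old.items():
        if name not in new:
            problems.append(f"missing column: {name}")  # dropped or renamed
        elif new[name] != old_type:
            problems.append(f"type changed: {name} {old_type} -> {new[name]}")
    return problems


before = {"id": "long", "amount": "int", "region": "string"}
after = {"id": "long", "amount": "long", "area": "string", "note": "string"}
print(breaking_changes(before, after))
```

A rename shows up as a missing column because that is how a by-name consumer experiences it, even though Iceberg tracks the column by field ID internally.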

How LakeOps prevents this

While schema evolution is a pipeline concern rather than a maintenance operation, LakeOps's observability layer surfaces schema change events in the table event log. Every schema modification is logged with a timestamp and before/after state, making it immediately visible when changes happened and what was modified. Teams can set up alerts on schema change events to trigger validation pipelines before consumers encounter the incompatibility.

LakeOps Tables — health classification
LakeOps health classification — every table continuously evaluated as Healthy, Warning, or Critical based on file count, file size, manifest fragmentation, snapshot depth, and delete file ratios. Problems surface at the Warning stage before they become production incidents.

Incident response procedures

Severity classification

P1 — Production queries failing. Time travel errors, schema breaks, or CommitFailedExceptions blocking pipelines. Response: immediate. Fix within 30 minutes.

P2 — Performance degradation above 5x. Query latency 10x normal, planning takes minutes. Pipelines are slow but not failing. Response: within 1 hour. Fix within 4 hours.

P3 — Cost anomaly. Storage growing faster than expected, orphan accumulation, snapshot retention too long. Response: within 24 hours. Fix within 1 week.

P4 — Drift from targets. File sizes trending down, manifest count rising, delete ratio climbing. No user-visible impact yet. Response: next maintenance window.
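
The four levels can be encoded as a simple decision function for alert routing. The input flags are hypothetical signals your monitoring would have to supply; the ordering and the 5x latency cut-off follow the definitions above.

```python
def classify_incident(queries_failing: bool, latency_multiplier: float,
                      cost_anomaly: bool, drifting: bool) -> str:
    """Map observed conditions to a severity level, most severe first."""
    if queries_failing:
        return "P1"
    if latency_multiplier > 5:
        return "P2"
    if cost_anomaly:
        return "P3"
    if drifting:
        return "P4"
    return "none"


print(classify_incident(False, 10.0, True, True))
```

Evaluating the most severe condition first ensures that a table with several problems is paged at the highest applicable level.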

LakeOps maps these severity levels to its Insights system: CRITICAL and HIGH correspond to P1/P2 incidents, WARNING to P3, and LOW to P4. The difference is detection — Insights surface conditions at P4 before they escalate to P1.

Escalation path

  1. On-call engineer: diagnose using the metadata queries above, apply the immediate fix
  2. Data platform team: review maintenance configuration, adjust policies, tune compaction schedules
  3. Infrastructure team: cluster resizing for OOM, S3 prefix optimization for throttling, catalog capacity
  4. Vendor support: engine-specific bugs (Spark, Trino, Flink), catalog issues (Glue, REST, Polaris)

Post-incident checklist

  • Confirm fix is applied and metrics are trending toward healthy
  • Identify the root cause prevention (configuration change, new policy, schedule adjustment)
  • Update monitoring thresholds if the incident was not detected by automated alerts
  • Document the incident in the operations log for pattern analysis
  • If this incident would have been prevented by automated maintenance, evaluate whether a control plane should handle it going forward

Monitoring: detecting problems before they page

Every incident above is detectable from Iceberg metadata before it impacts users. The following queries form a minimum monitoring baseline. Run them on a schedule (hourly for streaming tables, daily for batch) and alert when thresholds are breached.

File health check

sql
SELECT
  partition,
  COUNT(*) AS file_count,
  ROUND(AVG(file_size_in_bytes) / 1048576, 1) AS avg_size_mb
FROM catalog.db.target_table.files
GROUP BY partition
HAVING COUNT(*) > 500 OR AVG(file_size_in_bytes) < 67108864
ORDER BY file_count DESC;

Snapshot accumulation check

sql
SELECT
  COUNT(*) AS snapshot_count,
  MIN(committed_at) AS oldest_snapshot
FROM catalog.db.target_table.snapshots
HAVING COUNT(*) > 1000;

Manifest fragmentation check

sql
SELECT
  COUNT(*) AS manifest_count,
  ROUND(AVG(added_data_files_count + existing_data_files_count), 1) AS avg_entries_per_manifest
FROM catalog.db.target_table.manifests
HAVING COUNT(*) > 200;

Delete file accumulation check

sql
-- data_files and delete_files cover the current snapshot only
WITH data AS (
  SELECT partition, COUNT(*) AS data_files
  FROM catalog.db.target_table.data_files
  GROUP BY partition
),
deletes AS (
  SELECT partition, COUNT(*) AS delete_files
  FROM catalog.db.target_table.delete_files
  GROUP BY partition
)
SELECT
  d.partition,
  d.data_files,
  del.delete_files,
  ROUND(del.delete_files * 100.0 / d.data_files, 1) AS ratio_pct
FROM data d
JOIN deletes del ON d.partition = del.partition
WHERE del.delete_files * 100.0 / d.data_files > 20
ORDER BY ratio_pct DESC;

Write conflict frequency check

Monitor your streaming job logs for CommitFailedException frequency. If retry success rate drops below 95%, your conflict window is too large — scope compaction more narrowly or increase retry limits.
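
One way to compute that rate from log counts is sketched below. It assumes you can count, per time window, how many commits hit CommitFailedException at least once and how many of those still failed after exhausting retries; both inputs and the function name are hypothetical.

```python
def retry_success_rate(conflicted_commits: int, failed_after_retries: int) -> float:
    """Share of conflicted commits that eventually succeeded on retry."""
    if conflicted_commits == 0:
        return 1.0  # no conflicts in the window
    return (conflicted_commits - failed_after_retries) / conflicted_commits


rate = retry_success_rate(conflicted_commits=200, failed_after_retries=14)
needs_attention = rate < 0.95
print(rate, needs_attention)
```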

LakeOps Table Events — operations log
LakeOps Table Events — a complete audit trail showing every maintenance operation with duration, before/after metrics, and status. This is the operational log you reference during incident response instead of parsing Spark logs.

From manual runbook to autonomous prevention

This runbook gives you the diagnostic path and fix for each incident. But the pattern is clear: every incident here is caused by maintenance that did not run or ran incorrectly. The reactive path — detect symptom, diagnose root cause, apply fix, configure prevention — works. The proactive path — prevent the conditions from occurring in the first place — is better.

LakeOps replaces manual runbook execution with a closed-loop system. Health classification catches problems at four severity levels (CRITICAL, HIGH, WARNING, LOW) and surfaces them in the Insights tab before users report symptoms. Event-driven maintenance triggers fire based on actual table telemetry — file count thresholds, delete ratios, snapshot depth — not arbitrary cron schedules. Conflict-aware execution never compacts hot partitions. Sequenced operations run in the correct order every time. The full audit trail logs every operation with duration, impact, and status.

The result: teams using LakeOps report 90%+ reduction in Iceberg-related incidents. Not because the incidents are fixed faster — because the conditions that cause them never develop.

LakeOps Dashboard
LakeOps Dashboard — aggregate lake health: total operations, query acceleration, estimated cost savings, and data optimized across all connected catalogs. The executive view that answers whether the lake is getting healthier or worse.
LakeOps autonomous maintenance — detecting and resolving table health issues before they become incidents.

Quick reference: incident → fix

Queries 10x slower → Check file count per partition → Binpack compaction on worst partitions → Schedule compaction proportional to write rate

Planning takes minutes → Check manifest count and snapshot count → Expire snapshots + rewrite manifests → Run manifest rewrite after every compaction

CommitFailedException → Check concurrent operations on same partitions → Increase retry config + scope compaction to cold partitions → Use hash write distribution, exclude active partitions from compaction

Storage growing faster than data → Compare logical vs physical storage → Expire snapshots + remove orphans with 7-day safety → Run orphan cleanup weekly after expiration

Compaction OOMs → Identify oversized partitions → Switch to binpack with max-file-group-size-bytes → Use Rust engine (LakeOps) or bound group size

Time travel fails → Check oldest available snapshot → Adjust retain_last and retention window → Tag critical snapshots, document retention per table

Delete ratio degrading reads → Measure delete-to-data ratio per partition → Compact with delete-file-threshold=1 → Match compaction frequency to mutation rate

Schema changes break consumers → Check schema history → Revert the schema change with DDL and roll back data to the pre-change snapshot if needed → Use branch-based evolution with WAP pattern
