
Most teams discover table health problems the same way: a dashboard goes stale, a query that used to take 4 seconds now takes 90, or the storage bill doubles in a quarter with no corresponding increase in data volume. By the time someone notices, the damage has been compounding for weeks — sometimes months.
The root cause is almost never a single broken write or a schema change gone wrong. It is the gap between two kinds of reliability that production lakehouses need — and that most teams conflate until something fails spectacularly.
Data quality asks: Is the data correct? Table health asks: Is the table structurally sound? These are independent failure modes. You can have perfect data quality on a structurally degraded table — every row is correct, but queries scan 10x more data than necessary because 50,000 small files have accumulated across fragmented partitions. You can also have perfect table health on a table with severe data quality issues — files are compacted, manifests are clean, but half the records have null values in a required column because an upstream schema change dropped a field mapping.
Both failure modes are silent until they are not. Both compound daily. And monitoring hundreds of tables across multiple catalogs manually is impossible — by the time you notice a problem, queries are already slow and costs are already inflated. LakeOps provides continuous, automated table health classification and remediation across your entire lake, but understanding what it monitors — and why — requires knowing the dimensions of each failure mode in detail.
This guide covers the structural dimensions of table health, the data quality signals measurable from Iceberg metadata, the failure patterns that recur across production lakehouses, actionable monitoring queries, a practical classification framework, and how automated remediation closes the loop from detection to fix.
Table health dimensions: what structural signals to measure
Table health has six structural dimensions. Each one degrades independently, and each has distinct thresholds where degradation transitions from tolerable to performance-impacting to critical.
File count and size distribution
The most direct indicator of structural health. Iceberg stores data as immutable Parquet files, and every write creates new ones. A streaming pipeline checkpointing every 60 seconds across 100 partitions creates 144,000 new files per day — most of them a few megabytes each. For a full treatment of why this happens and how to fix it, see the small files guide.
What matters is not just the total file count but the distribution. A table with 5,000 files averaging 256 MB each is healthy. A table with 5,000 files where 4,800 are under 5 MB and 200 are over 500 MB has a bimodal distribution that compaction must address differently than uniform small files.
1-- File count and size distribution2SELECT3 COUNT(*) AS total_files,4 ROUND(AVG(file_size_in_bytes) / 1048576, 1) AS avg_file_mb,5 ROUND(PERCENTILE_APPROX(file_size_in_bytes, 0.5) / 1048576, 1) AS median_file_mb,6 ROUND(PERCENTILE_APPROX(file_size_in_bytes, 0.05) / 1048576, 1) AS p5_file_mb,7 ROUND(PERCENTILE_APPROX(file_size_in_bytes, 0.95) / 1048576, 1) AS p95_file_mb,8 ROUND(MIN(file_size_in_bytes) / 1048576, 1) AS min_file_mb,9 ROUND(MAX(file_size_in_bytes) / 1048576, 1) AS max_file_mb,10 SUM(CASE WHEN file_size_in_bytes < 33554432 THEN 1 ELSE 0 END) AS files_under_32mb11FROM catalog.db.my_table.files;The files_under_32mb count is the actionable metric. If it exceeds 20% of total files, compaction is overdue. If it exceeds 50%, the table is in a critical state where every query pays a substantial planning and I/O tax. LakeOps tracks this ratio continuously per table, flagging transitions from healthy to warning to critical without polling — health state updates with every new commit.
Manifest count and fragmentation
Manifests are the index layer between snapshots and data files. Each manifest tracks a set of files with their partition values, column statistics, and paths. When manifests fragment — too many small manifests, each tracking a handful of files — scan planning slows because the engine must open and parse every manifest before it can identify which data files to read.
1-- Manifest health check2SELECT3 COUNT(*) AS manifest_count,4 ROUND(AVG(length) / 1048576, 2) AS avg_manifest_mb,5 SUM(added_data_files_count + existing_data_files_count) AS total_tracked_files,6 ROUND(7 SUM(added_data_files_count + existing_data_files_count) * 1.0 / COUNT(*),8 19 ) AS avg_files_per_manifest10FROM catalog.db.my_table.manifests;Healthy tables have fewer than one manifest per 50–100 data files. If avg_files_per_manifest drops below 10, manifests need rewriting. A table with 800 manifests each tracking 3 files has the same data as a table with 8 manifests each tracking 300 files — but the first one requires 100x more metadata I/O during scan planning.
Snapshot depth and retention drift
Every write, append, overwrite, or delete creates a new snapshot. Snapshots enable time travel and rollback, but they also anchor data files in storage — as long as a snapshot exists, every file it references persists. A table with 15,000 retained snapshots has an enormous metadata tree that slows every planning operation.
1-- Snapshot depth and age2SELECT3 COUNT(*) AS total_snapshots,4 MIN(committed_at) AS oldest_snapshot,5 MAX(committed_at) AS newest_snapshot,6 DATEDIFF(day, MIN(committed_at), MAX(committed_at)) AS retention_span_days7FROM catalog.db.my_table.snapshots;If your retention policy is 7 days but the query shows a 45-day span, snapshot expiration is either not running or not keeping up with write volume. For the mechanics of snapshot expiration and why it must run before compaction, see the table health maintenance guide. LakeOps surfaces snapshot depth as a first-class metric in table-level Insights, flagging tables where retention drift exceeds policy bounds at CRITICAL or HIGH severity.
Delete file accumulation
Iceberg v2 supports row-level deletes through position delete files and equality delete files. These are efficient for writes — no data rewriting at write time — but expensive for reads. Every query must reconcile data files against pending delete files, filtering out deleted rows on the fly. As delete files accumulate, merge-on-read overhead grows linearly.
1-- Delete file ratio2SELECT3 SUM(CASE WHEN content = 0 THEN 1 ELSE 0 END) AS data_files,4 SUM(CASE WHEN content = 1 THEN 1 ELSE 0 END) AS position_delete_files,5 SUM(CASE WHEN content = 2 THEN 1 ELSE 0 END) AS equality_delete_files,6 ROUND(7 SUM(CASE WHEN content IN (1, 2) THEN 1 ELSE 0 END) * 1.0 /8 NULLIF(SUM(CASE WHEN content = 0 THEN 1 ELSE 0 END), 0),9 310 ) AS delete_to_data_ratio11FROM catalog.db.my_table.all_data_files;A delete-to-data ratio above 0.1 means accumulating deletes are starting to impact read performance. Above 0.3 is a warning. Above 0.5 is a read-performance emergency — every scan is doing significant reconciliation work, even for queries that touch a small fraction of the data. CDC tables ingesting change streams are the most common source. LakeOps tracks position deletes and equality deletes separately per table, displaying both in the table metrics view alongside records distribution over the last 60 snapshots.
Partition skew
Partition skew means some partitions have thousands of files while others have one. Lake-wide averages hide this completely. A table with an average of 50 files per partition sounds healthy — until you discover that one partition has 12,000 files and most partitions have 5.
1-- Partition skew analysis2SELECT3 partition,4 COUNT(*) AS file_count,5 ROUND(AVG(file_size_in_bytes) / 1048576, 1) AS avg_file_mb,6 ROUND(SUM(file_size_in_bytes) / 1073741824, 2) AS total_gb7FROM catalog.db.my_table.files8GROUP BY partition9ORDER BY file_count DESC10LIMIT 20;
The skew ratio — MAX(file_count) / MEDIAN(file_count) — quantifies the imbalance. A ratio above 10 indicates significant skew. Above 50 means some partitions are effectively degraded while the rest of the table appears fine. Targeted compaction on the worst partitions delivers the most performance improvement per compute dollar spent. LakeOps performs health classification at the partition level, not just the table level — hot partitions are flagged independently and compaction targets the worst ones first.
Sort order drift
Iceberg tables can define a sort order that physically organizes rows within data files. Sorted data maximizes data skipping — Parquet min/max statistics become tight per-file ranges instead of overlapping intervals. But sort order effectiveness degrades over time. New unsorted files from streaming appends are interleaved with sorted files from compaction. If compaction runs infrequently or only uses binpack (which preserves existing row order but does not re-sort), the table gradually loses its sort benefit.
Sort order drift is harder to measure directly from metadata tables alone. The proxy metric is data skipping effectiveness: if filtered queries scan a high percentage of files despite selective predicates, sort order has degraded. Puffin statistics — specifically the column min/max bounds tracked in manifests — can help quantify overlap between files on sort key columns. Maintaining sort alignment is one of the highest-leverage optimizations for lakehouse query performance.
LakeOps captures cross-engine telemetry from every connected query engine, identifying which columns production queries actually filter, join, and group on. When sort staleness is detected — the current sort order no longer aligns with dominant access patterns — it surfaces as a WARNING-level Insight with a recommended sort configuration.
Data quality on Iceberg: what to measure without scanning data
Table health monitoring tells you whether the physical structure is sound. Data quality monitoring tells you whether the data itself is trustworthy. On Iceberg, three dimensions are measurable directly from metadata — no full-table scans required.
Freshness: last commit time
The simplest and highest-signal data quality metric. If a table's SLA requires data no older than 15 minutes and the last commit was 3 hours ago, something is broken upstream — regardless of how healthy the table structure looks.
1-- Freshness check: time since last commit2SELECT3 MAX(committed_at) AS last_commit,4 CURRENT_TIMESTAMP - MAX(committed_at) AS staleness,5 COUNT(*) AS total_snapshots6FROM catalog.db.my_table.snapshots;Freshness monitoring is the single most effective data quality check for production tables. It catches pipeline failures, upstream outages, and ingestion backpressure — all without reading a single data file. Set per-table SLAs and alert when staleness exceeds the threshold. LakeOps monitors last commit time across every connected table, surfacing freshness SLA breaches as CRITICAL Insights the moment a table goes stale.
Completeness: null ratios from manifest statistics
Iceberg manifests store per-file, per-column null counts alongside value counts. This metadata is written at commit time and can be aggregated to compute null ratios across the entire table without scanning data — a technique validated at scale in LinkedIn's zero-scan data quality architecture across 200,000+ Iceberg tables.
1-- Null ratio for key columns (via file-level metadata)2SELECT3 SUM(null_value_counts['user_id']) AS user_id_nulls,4 SUM(value_counts['user_id']) AS user_id_total,5 ROUND(6 SUM(null_value_counts['user_id']) * 100.0 /7 NULLIF(SUM(value_counts['user_id']), 0),8 29 ) AS user_id_null_pct10FROM catalog.db.my_table.files;A column that was 0.1% null last week and is now 15% null indicates a schema mapping failure, a pipeline bug, or an upstream source change. The power of metadata-driven completeness checking is that it runs at zero marginal compute cost — the statistics already exist in the manifests.
Schema conformance: evolution tracking
Iceberg tracks schema evolution through field IDs, not column names. This means renames are safe, but adds, drops, and type changes are visible in the schema history. A column that changed from LONG to STRING or a required field that became optional can silently break downstream consumers.
1-- Schema evolution history2SELECT * FROM catalog.db.my_table.metadata_log_entries3ORDER BY timestamp DESC4LIMIT 10;Monitor schema changes as events. Any schema evolution on a production table should trigger a notification — not because schema changes are bad, but because downstream consumers need to know. A new column requires no action; a dropped column or type change requires verification that dependent queries and pipelines still produce correct results.
Where reliability breaks: real failure patterns
The metrics above tell you what to measure. But understanding where reliability actually breaks — the failure patterns that recur across production lakehouses — is what separates reactive firefighting from proactive prevention.
Silent degradation: the table that works but wastes
The most insidious failure mode. The table returns correct results. Queries complete successfully. No alerts fire. But behind the scenes, scan planning takes 8 seconds instead of 200 milliseconds because 1,200 manifests must be parsed. Queries read 40 GB instead of 4 GB because 80,000 small files each have overlapping min/max ranges that defeat data skipping. The S3 bill includes 2 million unnecessary GET requests per day.
Silent degradation is undetectable without structural monitoring. Users perceive slowness as normal. Storage cost growth is attributed to data growth. The problem compounds daily — every new write adds more small files, more manifests, more snapshots — until someone finally profiles a query and discovers the table is scanning its entire history on every read.
The fix is not a one-time compaction run. It is continuous monitoring of file count, average file size, and manifest ratio with alerting thresholds that catch degradation while it is cheap to fix — before the table has accumulated six months of structural debt. LakeOps classifies every table as Critical, Warning, or Healthy based on exactly these structural signals, updating classification with every commit rather than on a polling schedule.
Partition skew: hot and cold in the same table
A daily-partitioned events table with uniform volume across days looks straightforward. But in practice, Black Friday has 20x normal volume. A marketing campaign drives 50x traffic to one region. A retry storm after an outage floods a single hour's partition with duplicate events.
The result: one partition has 15,000 files averaging 2 MB each. Adjacent partitions have 20 files averaging 300 MB each. Lake-wide metrics show the table is healthy — average file count per partition is reasonable. But any query that touches the skewed partition pays the full small-file penalty.
Partition-level monitoring is non-negotiable. Track the P95 file count across partitions, not just the average. Alert when any single partition exceeds 5x the median file count. Target compaction at the worst partitions first — this delivers the highest performance improvement per compute dollar.
Statistics staleness: Puffin files not refreshed after compaction
Compaction rewrites data files and creates a new snapshot. But the Puffin statistics — NDV sketches used for join ordering and cost-based optimization — still reference the pre-compaction snapshot. Some engines will ignore stale statistics entirely, falling back to heuristic defaults. Others will use them but with degraded accuracy.
The impact is subtle and hard to attribute. Join orders become suboptimal. Split sizes are unbalanced. Queries that used to complete in seconds now take minutes because the optimizer chose a hash join instead of a broadcast join based on stale cardinality estimates.
The fix is to sequence statistics re-collection as part of the maintenance pipeline: expire snapshots → clean orphans → compact → rewrite manifests → refresh statistics. Statistics should be recomputed against the clean, post-compaction layout — never against the pre-compaction state.
Delete file accumulation on CDC tables
Change Data Capture pipelines generate a continuous stream of inserts, updates, and deletes. On Iceberg v2, updates and deletes produce position delete files or equality delete files rather than rewriting data files — this is efficient for writes but creates a growing read-time tax.
A CDC table processing 100,000 updates per hour accumulates delete files rapidly. Within a week, each data file may have 10–20 associated delete files. Every read must reconcile all of them — opening each delete file, building a filter bitmap, and applying it during scan. On a table with 5,000 data files and 50,000 delete files, this reconciliation can double or triple query latency.
1-- CDC table delete file health2SELECT3 partition,4 SUM(CASE WHEN content = 0 THEN 1 ELSE 0 END) AS data_files,5 SUM(CASE WHEN content IN (1, 2) THEN 1 ELSE 0 END) AS delete_files,6 ROUND(7 SUM(CASE WHEN content IN (1, 2) THEN 1 ELSE 0 END) * 1.0 /8 NULLIF(SUM(CASE WHEN content = 0 THEN 1 ELSE 0 END), 0),9 210 ) AS delete_ratio11FROM catalog.db.my_table.all_data_files12GROUP BY partition13HAVING SUM(CASE WHEN content IN (1, 2) THEN 1 ELSE 0 END) > 014ORDER BY delete_ratio DESC15LIMIT 20;CDC tables need aggressive delete-file compaction — rewriting data files to physically apply pending deletes and eliminate the reconciliation cost. A delete-file threshold of 3–5 per data file is a reasonable trigger. Tables with delete ratios above 0.5 need immediate attention.
Orphan file accumulation: the invisible storage leak
Orphan files — data files, metadata files, or manifests that exist on storage but are not referenced by any current table metadata — are the silent cost driver of production lakehouses. They accumulate from incomplete snapshot expiration, failed writes that never committed, interrupted compaction jobs, and schema evolution artifacts.
Orphans are invisible to query engines. They do not appear in any snapshot, manifest, or metadata file. But they consume storage and inflate object storage bills. In production deployments, orphan accumulation regularly reaches hundreds of terabytes across a catalog before anyone notices — the files look identical to active data in the storage layer.
Without a system that reconciles referenced files against physical storage, orphan accumulation is effectively undetectable. LakeOps tracks orphan volume as a per-table metric alongside stale file count and active data files, flagging tables where total storage significantly exceeds referenced data.
Monitoring queries: actionable SQL against Iceberg metadata
The queries above are per-dimension checks. For production monitoring at scale, you need composite queries that surface the overall health state of a table in a single pass.
Composite health scoring query
1-- Composite health check (run per table)2WITH file_stats AS (3 SELECT4 COUNT(*) AS total_files,5 AVG(file_size_in_bytes) / 1048576 AS avg_file_mb,6 SUM(CASE WHEN file_size_in_bytes < 33554432 THEN 1 ELSE 0 END) AS small_files,7 SUM(file_size_in_bytes) / 1073741824 AS total_size_gb8 FROM catalog.db.my_table.files9),10manifest_stats AS (11 SELECT12 COUNT(*) AS manifest_count13 FROM catalog.db.my_table.manifests14),15snapshot_stats AS (16 SELECT17 COUNT(*) AS snapshot_count,18 DATEDIFF(day, MIN(committed_at), MAX(committed_at)) AS retention_days,19 CURRENT_TIMESTAMP - MAX(committed_at) AS staleness20 FROM catalog.db.my_table.snapshots21),22delete_stats AS (23 SELECT24 SUM(CASE WHEN content = 0 THEN 1 ELSE 0 END) AS data_files,25 SUM(CASE WHEN content IN (1, 2) THEN 1 ELSE 0 END) AS delete_files26 FROM catalog.db.my_table.all_data_files27)28SELECT29 f.total_files,30 ROUND(f.avg_file_mb, 1) AS avg_file_mb,31 f.small_files,32 ROUND(f.total_size_gb, 2) AS total_size_gb,33 m.manifest_count,34 s.snapshot_count,35 s.retention_days,36 s.staleness,37 d.delete_files,38 ROUND(d.delete_files * 1.0 / NULLIF(d.data_files, 0), 3) AS delete_ratio,39 CASE40 WHEN f.avg_file_mb < 3241 OR d.delete_files * 1.0 / NULLIF(d.data_files, 0) > 0.542 OR m.manifest_count > f.total_files * 0.143 OR s.snapshot_count > 500044 THEN 'CRITICAL'45 WHEN f.avg_file_mb < 12846 OR d.delete_files * 1.0 / NULLIF(d.data_files, 0) > 0.147 OR m.manifest_count > f.total_files * 0.0248 OR s.snapshot_count > 200049 THEN 'WARNING'50 ELSE 'HEALTHY'51 END AS health_status52FROM file_stats f53CROSS JOIN manifest_stats m54CROSS JOIN snapshot_stats s55CROSS JOIN delete_stats d;This is a starting point, not a production system. Real health classification needs partition-level granularity, historical trending, per-table threshold overrides, and cross-engine telemetry. A streaming events table and a monthly aggregate table have fundamentally different expectations for file count and snapshot depth.
Per-partition health query
1-- Per-partition health breakdown2WITH partition_files AS (3 SELECT4 partition,5 COUNT(*) AS file_count,6 AVG(file_size_in_bytes) / 1048576 AS avg_mb,7 SUM(CASE WHEN file_size_in_bytes < 33554432 THEN 1 ELSE 0 END) AS small_files8 FROM catalog.db.my_table.files9 GROUP BY partition10),11partition_deletes AS (12 SELECT13 partition,14 SUM(CASE WHEN content IN (1, 2) THEN 1 ELSE 0 END) AS delete_files15 FROM catalog.db.my_table.all_data_files16 GROUP BY partition17)18SELECT19 pf.partition,20 pf.file_count,21 ROUND(pf.avg_mb, 1) AS avg_file_mb,22 pf.small_files,23 COALESCE(pd.delete_files, 0) AS delete_files,24 CASE25 WHEN pf.file_count > 5000 OR pf.avg_mb < 32 THEN 'CRITICAL'26 WHEN pf.file_count > 1000 OR pf.avg_mb < 128 THEN 'WARNING'27 ELSE 'HEALTHY'28 END AS partition_health29FROM partition_files pf30LEFT JOIN partition_deletes pd ON pf.partition = pd.partition31ORDER BY pf.file_count DESC32LIMIT 30;Partition-level queries reveal what table-level averages hide. A table classified as HEALTHY at the aggregate level may have individual partitions in CRITICAL state — and those are the partitions that degrade specific queries.
Health classification framework: from metrics to action
Monitoring individual metrics is necessary but not sufficient. With hundreds of tables, you need a classification framework that triages every table into a health state and directs attention to the ones that need it most.
Three-tier health classification
Healthy. All structural metrics within normal ranges. File count per partition is reasonable (under 500 for most workloads). Average file size is within 50–100% of the target (typically 128–512 MB). Manifest ratio is below 1:50 (manifests to data files). Delete-to-data ratio is below 0.1. Snapshot count is within retention policy bounds. No partition has a file count exceeding 5x the median.
Warning. One or more metrics in a degraded range but not yet critical. Average file size has dropped below 50% of target. Some partitions exceed 1,000 files. Delete-to-data ratio is between 0.1 and 0.5. Snapshot count exceeds retention policy by 2x. Manifest ratio is between 1:10 and 1:50. The table still performs acceptably but is trending toward critical if maintenance does not intervene.
Critical. At least one metric in a range that actively degrades query performance or inflates costs. Average file size is below 32 MB. Any partition exceeds 5,000 files. Delete-to-data ratio exceeds 0.5. Snapshot count exceeds retention policy by 5x or more. Manifest ratio exceeds 1:10. Queries on this table are measurably slower than they should be, and storage costs are inflated by orphan files or retained snapshots.

Four-severity Insights model
Beyond the three-tier classification, individual problems within a table need severity ranking to prioritize remediation. LakeOps surfaces table-level Insights at four severity levels:
CRITICAL — the table is actively broken for consumers. Examples: query latency exceeds 10x baseline, delete ratio above 0.8, partition with 20,000+ files, freshness SLA breached by more than 5x the allowed window. Requires immediate automated remediation.
HIGH — significant degradation that will become critical without intervention. Examples: average file size below 32 MB across most partitions, manifest count exceeding 10% of file count, snapshot retention exceeded by 3x. Automated remediation should trigger within the current maintenance window.
WARNING — measurable drift from optimal but not yet impacting consumers. Examples: some partitions exceeding 1,000 files, delete ratio between 0.1–0.3, sort order misaligned with 30%+ of recent query patterns. Schedule remediation in the next maintenance cycle.
LOW — minor sub-optimalities detected during routine health evaluation. Examples: a few partitions slightly above file count target, statistics age exceeding 48 hours post-compaction, manifest count trending upward but still within acceptable bounds. Informational — will be addressed in the next scheduled maintenance pass.
Automated remediation: closing the loop from detection to fix
Monitoring tells you something is wrong. Classification tells you how urgent it is. But without automated remediation, every health alert becomes a ticket for a human to investigate, diagnose, and fix — a cycle that does not scale past a few dozen tables.
The maintenance sequence
Automated remediation is not just about triggering individual operations. The order matters — and getting it wrong wastes compute or creates new problems. Running compaction before snapshot expiration rewrites files that would have been garbage-collected moments later. Rewriting manifests before compaction produces clean manifests that immediately become stale when compaction changes the file layout.
The correct sequence is always: (1) snapshot expiration → (2) orphan file cleanup → (3) data file compaction (including delete file application) → (4) manifest rewriting → (5) statistics refresh. Each step's output feeds the next. For the full rationale behind this sequencing, see the table health maintenance guide.
Event-driven triggers vs. fixed schedules
The simplest form of automated remediation: when a metric crosses a threshold, trigger the appropriate maintenance operation. No cron schedule, no manual intervention — the table's own state determines when maintenance runs.
1def check_and_remediate(spark, table_name, config):2 files = spark.sql(f"""3 SELECT4 COUNT(*) AS total_files,5 AVG(file_size_in_bytes) / 1048576 AS avg_mb6 FROM {table_name}.files7 """).collect()[0]8 9 delete_ratio = spark.sql(f"""10 SELECT11 COALESCE(12 SUM(CASE WHEN content IN (1,2) THEN 1 ELSE 0 END) * 1.0 /13 NULLIF(SUM(CASE WHEN content = 0 THEN 1 ELSE 0 END), 0),14 015 ) AS ratio16 FROM {table_name}.all_data_files17 """).collect()[0]['ratio']18 19 if files['avg_mb'] < config['min_avg_file_mb']:20 spark.sql(f"""21 CALL catalog.system.rewrite_data_files(22 table => '{table_name}',23 strategy => 'binpack',24 options => map(25 'target-file-size-bytes', '{config["target_file_bytes"]}',26 'min-file-size-bytes', '{config["min_file_bytes"]}'27 )28 )29 """)30 31 if delete_ratio > config['max_delete_ratio']:32 spark.sql(f"""33 CALL catalog.system.rewrite_data_files(34 table => '{table_name}',35 options => map('delete-file-threshold', '3')36 )37 """)38 39 snapshot_count = spark.sql(f"""40 SELECT COUNT(*) AS cnt FROM {table_name}.snapshots41 """).collect()[0]['cnt']42 43 if snapshot_count > config['max_snapshots']:44 spark.sql(f"""45 CALL catalog.system.expire_snapshots(46 table => '{table_name}',47 older_than => CURRENT_TIMESTAMP - INTERVAL {config['retention_days']} DAYS,48 retain_last => {config['retain_last']}49 )50 """)This approach works for a handful of tables. The limitations emerge at scale: you need a polling loop to check every table, a scheduler to orchestrate operations in the correct sequence, conflict handling for tables with concurrent writers, per-table configuration management, and observability into what ran, what failed, and what impact it had. At 50+ tables, the maintenance infrastructure itself becomes a significant engineering project.
Per-table policies
Not every table has the same health profile. A high-velocity streaming events table needs aggressive compaction triggers (compact when any partition exceeds 200 files), short snapshot retention (24–48 hours), and frequent delete-file application. A slowly-changing dimension table needs conservative thresholds — compact only when file sizes drift dramatically, retain snapshots for 30 days for audit purposes, and run maintenance weekly rather than hourly.
Effective automated remediation requires per-table (or per-namespace, or per-catalog) policy definitions: target file size, compaction strategy (binpack vs. sort), snapshot retention window, delete-file threshold, and maintenance frequency. A system that applies a single global policy to all tables will over-maintain some and under-maintain others.
How LakeOps delivers continuous health classification and remediation
LakeOps is a dedicated lakehouse control plane for Apache Iceberg that automates the full health monitoring and remediation lifecycle at scale. It connects to your existing catalogs (Glue, REST, Polaris, Nessie, S3 Tables) and query engines (Trino, Spark, Flink, Snowflake, Athena, DuckDB) without moving data or changing pipelines. Currently managing 786+ tables across 112+ PB autonomously, every operation is logged and reversible.
Health classification and table-level Insights
Every table is continuously classified as Critical, Warning, or Healthy based on structural signals: file count vs. target, manifest fragmentation ratio, snapshot depth relative to policy, orphan volume, sort staleness measured against query telemetry, and query latency trends. Health state updates with every new commit — not on a polling schedule.
Table-level Insights surface problems at four severity levels (CRITICAL, HIGH, WARNING, LOW) with one-click remediation. Each Insight includes the specific metric that triggered it, the current value, the threshold it violated, and the recommended action. Engineers can remediate with a single click or let LakeOps handle it autonomously.
Per-table metrics and observability
For every table, LakeOps displays: total records, total size, stale file count, active data files, average file size, position deletes, equality deletes, and records distribution over the last 60 snapshots. This gives a complete structural picture without running any queries — the metrics are computed from catalog metadata in real time.
Executive dashboard

The executive dashboard provides lake-wide visibility: total operations performed, average query acceleration achieved by maintenance, cumulative cost savings from eliminated orphans and reduced scan volume, CPU and storage reduction percentages. This is the view for platform engineering leadership — one glance answers whether the lake is healthy, how much maintenance is saving, and whether any segment is degraded.
Cross-engine telemetry
LakeOps provides a unified view of how each query engine performs against each table. Trino, Spark, Flink, Athena, Snowflake — each engine's access patterns, query latencies, and scan volumes are visible in a single pane. This telemetry drives intelligent sort order recommendations: if 80% of queries from Trino filter on event_date and user_id, but the table is sorted by created_at, LakeOps surfaces that misalignment as a sort staleness Insight.
Coordinated maintenance pipeline
When health degrades past configured thresholds, LakeOps triggers the appropriate maintenance automatically — running the full sequenced pipeline (expire → clean → compact → rewrite manifests → refresh statistics) as a single coordinated operation. Each step's output feeds the next. A streaming table may remediate hourly while a batch table remediates weekly — triggered by structural signals, not fixed cron schedules.
The execution engine is built on Apache DataFusion in Rust — no JVM startup, no GC pauses, no executor provisioning. Non-blocking commits ensure concurrent readers and writers are never interrupted during maintenance. In production benchmarks across 5.5 TB, it completes compaction 86% faster than Spark.
Event audit trail
Every maintenance action is logged with complete context: what operation ran, when it started and completed, duration, files processed, bytes written, before-and-after metrics, and outcome (success/failure/partial). The event history per table provides a complete audit trail — answering what was done, when, how long it took, and what impact it had on structural health. Every operation is reversible if needed.
Building a sustainable table health practice
Data quality and table health are two sides of lakehouse reliability. Ignoring either one creates blind spots that compound silently until they surface as performance regressions, cost overruns, or — worst case — incorrect business decisions based on stale or incomplete data.
The practical path forward has three stages:
Stage 1: Visibility. Run the monitoring queries in this guide against your production tables. Identify which tables are in Warning or Critical health states. Check freshness SLAs. Compute delete file ratios. Look at partition-level file counts, not just table-level averages. This takes hours, not weeks, and immediately reveals the tables that are silently degrading.
Stage 2: Classification. Implement a health scoring framework — Healthy/Warning/Critical — based on the thresholds in this guide, tuned to your workload profiles. A streaming events table and a monthly reporting table have different expectations. Classify every table and sort by urgency. Use per-table policies to encode what healthy looks like for each workload type.
Stage 3: Automation. Move from manual remediation to automated, event-driven maintenance that runs the full pipeline in the correct sequence, triggers based on actual table state, and logs every operation for observability. This is where a dedicated control plane like LakeOps eliminates the operational overhead — handling health classification, sequenced maintenance, per-table policies, cross-engine telemetry, and lake-wide observability across your entire catalog without custom scripting or Spark cluster management.
The teams that treat reliability as a compound problem — data quality and table health, monitored continuously, remediated automatically — are the ones running lakehouses that scale smoothly, cost predictably, and deliver trustworthy data to every consumer. The metrics are already in your table metadata. The tooling exists to act on them autonomously. Start with visibility, graduate to classification, and let automation handle the rest.



