Iceberg Lakehouse Observability: Monitor Table Health, Costs, and Query Performance


Apache Iceberg observability is not built into the table format. Production experience from managed multicloud, multi-engine Iceberg lakehouses shows what teams actually need to monitor: lineage, data quality, cost, performance, query patterns, access audits, and orphan file management. Data platform teams need that same cross-layer visibility before degradation shows up as failed dashboards or doubled cloud bills.

A lakehouse splits the stack into independent layers: object storage, table format, catalog, and query engines. That decomposition is the entire point of the architecture (vendor independence, engine flexibility, cost control), but it also means no single component has a complete picture. Each engine knows about its own queries. Each catalog knows about its own metadata. The storage layer knows about bytes and objects. Nobody correlates them.

The result is a monitoring blind spot that grows with every table you add, every engine you connect, and every pipeline you deploy. Tables degrade silently — small files accumulate, snapshots pile up, delete files multiply, orphans leak storage — and the first signal is often a Slack message asking why the dashboard is slow or why the cloud bill doubled.

Iceberg gives you a table format. It does not give you observability. You cannot monitor what you cannot measure — and without observability, tables degrade silently until queries fail or bills spike.

LakeOps fills this gap as a dedicated control plane for Apache Iceberg lakehouses — providing unified observability across catalogs and engines, health classification for every table, and automated remediation that closes the loop between monitoring and maintenance. No custom scripting, no cluster management, no blind spots.

This guide covers why lakehouse observability is fundamentally harder than warehouse observability, the seven pillars you need to monitor, the specific metrics at table and engine level that matter most, and how to build observability into the operational loop that keeps your lake healthy.

Why lakehouse observability is harder than warehouse observability

In a traditional data warehouse — Snowflake, BigQuery, Redshift — observability is a solved problem because the vendor solves it for you. They control the storage format, the query engine, the optimizer, the metadata store, and the billing model. They can instrument every layer because they own every layer. When a query scans more data than expected, they know. When a table has not been refreshed, they know. When costs spike, they attribute it to specific queries, users, or warehouses.

The lakehouse breaks this model in three fundamental ways.

Decomposed architecture means decomposed visibility. Each component — object storage, table format, catalog, query engines — generates its own telemetry. S3 CloudWatch metrics tell you about request counts and bytes transferred. Spark's UI tells you about stages and tasks. Trino's JMX metrics tell you about query planning time. Iceberg's metadata tells you about snapshots and manifests. But no component correlates these signals into a coherent picture. A query that scans 500 GB across 80,000 small files registers as high S3 GET request count in CloudWatch, long planning time in Trino, and high file count in Iceberg metadata — but correlating these three signals to identify that the root cause is a compaction backlog requires cross-system reasoning that no individual component provides.

Multiple engines mean multiple perspectives. When Spark, Trino, and Flink all query the same Iceberg table, each engine has a partial view of performance. Spark knows its own query latencies but not Trino's. Trino knows its own scan volumes but not Flink's. No engine can answer the question: across all consumers, what is the total cost of querying this table, and which structural improvements would reduce it the most? Without cross-engine telemetry, optimization is engine-local — you tune Spark queries without knowing that Trino queries on the same table are scanning 10x more data because they filter on different columns.

Open format means no built-in monitoring. Iceberg is a specification, not a service. It defines how metadata, manifests, and data files are organized — but it does not define how to monitor them. There is no built-in health dashboard, no alert framework, no cost attribution, no lineage tracking. These are left as exercises for the operator. For a handful of tables, you can run monitoring queries manually. For hundreds of tables across multiple catalogs, multiple clouds, and multiple engines, manual monitoring is impossible.

The consequence is that lakehouse observability is not a feature you get — it is an infrastructure you build. Or, increasingly, an infrastructure you adopt as a dedicated control plane.

The seven pillars of lakehouse observability

Production lakehouse observability spans seven domains. Each one answers a different operational question, and neglecting any one of them creates a blind spot that eventually surfaces as an incident.

1. Lineage: where does the data come from and where does it go?

Lineage tracks the flow of data from source to destination — which pipelines write to which tables, which tables feed which downstream consumers, and which transformations happen along the way. In a decomposed lakehouse, lineage is fragmented across pipeline orchestrators (Airflow, Dagster), streaming platforms (Kafka, Flink), and batch engines (Spark, dbt).

Without lineage, impact analysis is guesswork. When a source table changes schema, you cannot systematically identify which downstream tables, dashboards, and ML features are affected. When a pipeline fails, you cannot determine which consumers are now seeing stale data. When a table is decommissioned, you cannot verify that nothing still depends on it.

Lineage breaks are the most dangerous failure mode in a lakehouse — not because they cause immediate errors, but because they cause silent correctness issues that propagate through the dependency graph. A single schema change in a source table can produce subtly wrong aggregations in a downstream dashboard that nobody catches for weeks.

2. Data quality: is the data correct and complete?

Data quality monitoring validates that the data itself meets expectations: freshness (is the table being updated on schedule?), completeness (are required columns populated?), uniqueness (are there duplicate records?), and conformance (does the data match expected formats and ranges?). For a deep dive into quality monitoring at the Iceberg metadata level, see the data quality and table health guide.

On Iceberg, several quality signals are measurable directly from metadata without scanning data files. Freshness comes from the last snapshot commit time. Completeness comes from per-column null counts stored in manifests. Schema conformance comes from the evolution history. These zero-scan checks are the highest-leverage quality monitors because they run at negligible compute cost and catch the most common failure modes — pipeline failures (freshness), upstream schema changes (conformance), and ingestion bugs (completeness).
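As a minimal sketch of the zero-scan checks described above, the following assumes the latest snapshot's commit time (Iceberg records it in milliseconds since the epoch) and a column's null count have already been read from table metadata. The function names and sample values are illustrative and not part of any Iceberg API.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_commit_ms: int, max_age: timedelta, now: datetime) -> bool:
    """True when the newest snapshot is older than the freshness SLA."""
    last_commit = datetime.fromtimestamp(last_commit_ms / 1000, tz=timezone.utc)
    return now - last_commit > max_age


def null_fraction(null_count: int, record_count: int) -> float:
    """Share of null values in a column, computed from metadata counts."""
    return null_count / record_count if record_count else 0.0


# Hypothetical values: last commit at 09:00 UTC, check performed at 12:00 UTC.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
committed_ms = int(datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc).timestamp() * 1000)

print(is_stale(committed_ms, timedelta(hours=1), now))
print(null_fraction(50, 10_000))
```

Both checks work on a handful of integers per table, which is why they can run on every commit without meaningful compute cost.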

3. Cost: how much does each table, pipeline, and query cost?

Cloud lakehouse costs are notoriously hard to attribute. Object storage costs are per-bucket, not per-table. Compute costs are per-cluster or per-query, not per-table. Data transfer costs are invisible until the bill arrives. The result is that most teams know their total lake cost but cannot answer the question: which tables are the most expensive to operate, and why?

Cost observability requires correlating storage volume per table (including orphan files), compute consumption per query (attributed to the table scanned), and maintenance compute (compaction, manifest rewriting, snapshot expiration). Without this correlation, cost optimization is impossible — you cannot reduce what you cannot measure.

A single table with 500 TB of orphan files costs the same as 500 TB of active data in object storage. A compaction job that runs hourly on a table that changes weekly wastes compute. A query that scans 200 GB because small files defeat data skipping costs 10x more than it should. Each of these is invisible without table-level cost attribution.

4. Performance: how fast are queries and how efficient is the storage layout?

Performance observability has two layers. The engine layer tracks query latency, scan volume, planning time, and error rates per engine per table. The storage layer tracks file count, file size distribution, manifest fragmentation, delete file ratio, and sort order alignment — the structural properties that determine how efficiently any engine can read the data.

The critical insight is that engine-level performance is downstream of storage-level health. When queries get slower, the root cause is almost never the engine configuration — it is the table structure. Small files multiply planning time and I/O operations. Fragmented manifests increase metadata parsing. Accumulated delete files add merge-on-read overhead. Drifted sort order defeats data skipping. Fixing the table structure fixes performance across all engines simultaneously.

5. Query patterns: who queries what, how, and how often?

Query pattern analysis reveals how tables are actually used — which columns are filtered, which are joined, which are projected, and by which engines. This telemetry drives optimization decisions that would otherwise require manual query log analysis.

If 90% of Trino queries on a table filter on event_date and region, but the table is sorted by created_at, every query does a full scan instead of skipping 95% of the data. If a column is never filtered on by any engine, including it in the sort key wastes compaction compute for zero read benefit. Query pattern telemetry turns sort order and partitioning from guesswork into data-driven decisions.

6. Access audits: who accessed what and when?

For regulated industries — healthcare, finance, government — access auditing is not optional. Every query against a table containing sensitive data must be logged with the identity of the querier, the columns accessed, the timestamp, and the query intent. In a multi-engine lakehouse, access audit data is scattered across engine-specific logs that must be aggregated and normalized.

Beyond compliance, access audits reveal operational patterns: which teams depend on which tables, how query volume changes over time, and whether deprecated tables still have active consumers. This information is essential for safe table decommissioning, migration planning, and capacity forecasting.

7. Storage health: are files clean, compact, and properly managed?

Storage health is the foundation layer — the structural condition of the data files, manifests, snapshots, and metadata that comprise each table. It encompasses the metrics covered in detail in the table health maintenance guide: file count and size distribution, manifest fragmentation, snapshot depth, delete file accumulation, partition skew, orphan file volume, and sort order drift.

Storage health is the pillar most often neglected because it is the least visible. Queries still succeed on degraded tables. Dashboards still render. The degradation manifests as gradually increasing latency, gradually increasing scan volume, and gradually increasing storage cost — all of which are attributed to data growth rather than structural decay.

Table-level metrics that matter

Observability at the table level means tracking the structural signals that determine read performance, write efficiency, and storage cost. These are the metrics that should appear in every table's observability profile.

File count and average file size

The most fundamental health signal. Every write to an Iceberg table creates new immutable data files (typically Parquet). Streaming pipelines checkpointing every minute across hundreds of partitions generate thousands of small files per day. Without regular compaction, file counts grow linearly with write frequency.

The thresholds are workload-dependent, but general guidelines hold: average file size below 32 MB is critical, below 128 MB is a warning, and 128–512 MB is healthy. Total file count matters less than file count per partition — a table with 50,000 well-distributed files across 1,000 partitions is healthier than a table with 5,000 files concentrated in 3 partitions.
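The general guidelines above can be sketched as a small classifier. This is an illustration under stated assumptions: the function name is hypothetical, MB is taken as 1024 * 1024 bytes, and averages above 512 MB are reported as healthy because the guidelines do not classify them.

```python
MB = 1024 * 1024


def classify_avg_file_size(total_bytes: int, file_count: int) -> str:
    """Map a table's average data file size onto the guideline bands."""
    if file_count == 0:
        return "healthy"  # an empty table has nothing to compact
    avg = total_bytes / file_count
    if avg < 32 * MB:
        return "critical"
    if avg < 128 * MB:
        return "warning"
    return "healthy"


print(classify_avg_file_size(total_bytes=800 * MB, file_count=100))  # 8 MB average
print(classify_avg_file_size(total_bytes=1024 * MB, file_count=4))   # 256 MB average
```

Running the same function per partition rather than per table surfaces the concentrated small-file problems that a table-wide average hides.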

Snapshot depth and retention drift

Every commit creates a new snapshot. Snapshots enable time travel but anchor data files in storage: as long as a snapshot references a file, that file cannot be garbage-collected even if newer snapshots no longer need it. A table with 15,000 retained snapshots has an enormous metadata tree that slows scan planning and keeps obsolete data files pinned in storage.

The observability signal is the gap between intended retention and actual retention. If your policy says 7 days but the oldest snapshot is 45 days old, snapshot expiration is either not running or not keeping up. This drift is invisible unless you actively monitor it.
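One way to quantify that gap is a drift factor: the age of the oldest retained snapshot divided by the retention policy. The sketch below uses the 45-day versus 7-day example from the text; the function name is an assumption.

```python
from datetime import timedelta


def retention_drift(oldest_snapshot_age: timedelta, policy: timedelta) -> float:
    """How many times older the oldest snapshot is than the policy allows."""
    if policy <= timedelta(0):
        raise ValueError("retention policy must be positive")
    return oldest_snapshot_age / policy


drift = retention_drift(timedelta(days=45), timedelta(days=7))
print(f"oldest snapshot is {drift:.1f}x the retention policy")
```

A drift factor at or below 1.0 means expiration is keeping up; anything above it is a direct measure of how far behind it has fallen.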

Manifest count and files-per-manifest ratio

Manifests are Iceberg's index layer — each one tracks a set of data files with partition values and column statistics. When manifests fragment (too many small manifests each tracking a handful of files), scan planning degrades because the engine must parse every manifest before it knows which data files to read.

Healthy tables average at least 50–100 data files per manifest (a manifest-to-file ratio of 1:50 or better). When a table averages fewer than 10 data files per manifest, manifest rewriting is overdue. The cost compounds: more manifests mean more metadata I/O, which means longer planning time, so every query, regardless of selectivity, pays a fixed overhead that grows with fragmentation.
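To make the ratio concrete, the sketch below expresses it as data files per manifest. The two cut-offs (10 and 50) come from the text; the label for the band in between ("fragmenting") and the function name are assumptions added for illustration.

```python
def manifest_status(data_files: int, manifests: int) -> str:
    """Classify manifest fragmentation from the files-per-manifest average."""
    if manifests == 0:
        return "healthy"  # nothing to parse, nothing to rewrite
    files_per_manifest = data_files / manifests
    if files_per_manifest < 10:
        return "rewrite overdue"
    if files_per_manifest < 50:
        return "fragmenting"
    return "healthy"


print(manifest_status(data_files=20_000, manifests=4_000))  # 5 files per manifest
print(manifest_status(data_files=20_000, manifests=200))    # 100 files per manifest
```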

Delete file ratio

Iceberg v2 row-level deletes produce position delete files and equality delete files. These are efficient for writes but create a cumulative read tax — every query must reconcile data files against pending deletes. A delete-to-data ratio above 0.1 indicates growing overhead, above 0.3 is a warning, and above 0.5 is a read-performance emergency. CDC tables are the primary source.
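A hedged sketch of those three thresholds follows. It assumes the delete-to-data ratio is computed as delete file count over data file count, which the text does not spell out; a ratio based on deleted rows over total rows would use the same structure. The function name and status labels are illustrative.

```python
def delete_ratio_status(delete_files: int, data_files: int) -> str:
    """Classify read overhead from the delete-to-data file ratio."""
    if data_files == 0:
        return "ok"
    ratio = delete_files / data_files
    if ratio > 0.5:
        return "emergency"
    if ratio > 0.3:
        return "warning"
    if ratio > 0.1:
        return "growing overhead"
    return "ok"


print(delete_ratio_status(delete_files=4_200, data_files=10_000))  # ratio 0.42
```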

Partition skew

Lake-wide averages hide partition-level problems. A table with an average of 50 files per partition sounds healthy — until you discover that one partition has 12,000 files because of a traffic spike, a retry storm, or a backfill. Any query touching that partition pays the full small-file penalty regardless of how healthy the rest of the table looks.

The skew ratio — max file count divided by median file count — quantifies the imbalance. A ratio above 10 indicates significant skew. Above 50 means some partitions are effectively degraded while the table-level metrics look fine. Partition-level monitoring is non-negotiable for production observability.
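The skew ratio is straightforward to compute once per-partition file counts are available. In this sketch the partition names and counts are made up, and the function name is an assumption.

```python
from statistics import median


def skew_ratio(files_per_partition: dict[str, int]) -> float:
    """Max file count divided by median file count across partitions."""
    counts = list(files_per_partition.values())
    if not counts:
        return 0.0
    mid = median(counts)
    return max(counts) / mid if mid else float("inf")


# Hypothetical table: three ordinary partitions and one hit by a backfill.
counts = {
    "2024-01-01": 40,
    "2024-01-02": 50,
    "2024-01-03": 60,
    "2024-01-04": 12_000,
}
ratio = skew_ratio(counts)
print(f"skew ratio: {ratio:.0f}")
```

Using the median rather than the mean as the denominator matters: a single 12,000-file partition inflates the mean and would make the same table look far less skewed.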

LakeOps Tables health — Lake-wide table health classification in LakeOps: every table classified as Critical, Warning, or Healthy based on structural signals — file counts, average file sizes, snapshot depths, delete ratios, and partition details. Per-table Insights surface individual problems at four severity levels (CRITICAL, HIGH, WARNING, LOW) with specific metrics, thresholds, and remediation actions.

Engine-level metrics: the other half of the picture

Table-level metrics tell you about structural health. Engine-level metrics tell you about the operational impact of that structure on real workloads. Both are necessary — table metrics without engine metrics miss the user-facing impact, and engine metrics without table metrics miss the root cause.

Query latency distribution

Not just average latency — the distribution matters. A table where P50 latency is 2 seconds and P99 is 180 seconds has a tail latency problem that averages hide. The tail is often caused by queries that hit degraded partitions, trigger merge-on-read against accumulated delete files, or scan through fragmented manifests.

Track latency per engine per table over time. Sudden shifts indicate structural degradation (compaction backlog, delete file accumulation) or workload changes (new queries scanning more data). Gradual upward drift indicates structural decay that is compounding daily.
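The gap between the average and the tail is easy to demonstrate. The sketch below uses a nearest-rank percentile and a synthetic sample chosen to reproduce the P50 of 2 seconds and P99 of 180 seconds mentioned above; the sample and the helper name are assumptions.

```python
import math


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Synthetic latencies in seconds: 98 fast queries and 2 that hit a degraded partition.
latencies = [2.0] * 98 + [180.0] * 2
mean = sum(latencies) / len(latencies)

print(f"mean={mean:.2f}s p50={percentile(latencies, 50)}s p99={percentile(latencies, 99)}s")
```

The mean of this sample stays under 6 seconds, which looks tolerable on a dashboard while 1 in 50 queries takes three minutes.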

Scan volume per query

How much data does each query actually read versus how much it should read? If a query with a selective filter on a partitioned column scans 500 GB instead of 5 GB, something is wrong — small files with overlapping min/max ranges are defeating data skipping, the sort order is misaligned with the filter columns, or the partition spec does not match the query pattern.

Scan volume is the most direct measure of storage layout efficiency. Divided by the result set size, it gives you a read amplification factor — the ratio of bytes read to bytes returned. A read amplification factor of 100x on a query that should be selective is a clear signal that the table's physical layout needs attention.
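As a small illustration of the read amplification factor, the following divides bytes scanned by bytes returned. The function name and the decimal GB constant are assumptions; the byte counts would come from engine query logs.

```python
GB = 10**9


def read_amplification(bytes_scanned: int, bytes_returned: int) -> float:
    """Ratio of bytes read from storage to bytes returned to the client."""
    if bytes_returned == 0:
        return float("inf") if bytes_scanned else 0.0
    return bytes_scanned / bytes_returned


factor = read_amplification(bytes_scanned=500 * GB, bytes_returned=5 * GB)
print(f"read amplification: {factor:.0f}x")
```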

Cost per query

In cloud lakehouses, every byte scanned has a cost: compute time for the engine, GET requests for object storage, data transfer for cross-region reads. Cost per query is roughly the sum of those three components, each of which scales with scan volume and the applicable compute and I/O pricing. Attributing this cost to specific tables reveals which tables are the most expensive to query and where structural improvements would yield the highest ROI.

A table where the average query costs $0.50 because small files force full scans might cost $0.05 per query after compaction. Across 10,000 queries per day, that is $4,500 per day in unnecessary spend. Without cost-per-query attribution, this waste is invisible — it is buried in aggregate compute bills that are attributed to clusters, not tables.
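The arithmetic behind that figure, worked through with the numbers from the paragraph above. Decimal is used only to keep the currency math exact; the function name is illustrative.

```python
from decimal import Decimal


def daily_waste(cost_before: Decimal, cost_after: Decimal, queries_per_day: int) -> Decimal:
    """Spend per day attributable to the gap between current and achievable cost."""
    return (cost_before - cost_after) * queries_per_day


waste = daily_waste(Decimal("0.50"), Decimal("0.05"), 10_000)
print(f"${waste} per day")
```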

Error rates and failure patterns

Query failures are the loudest observability signal — but by the time queries fail, the underlying problem has usually been compounding for a long time. Common failure patterns: planning timeouts from manifest overload, OOM errors from small-file fan-out, concurrent modification exceptions from compaction conflicting with writes, and metadata corruption from incomplete operations.

Error rates should be tracked per engine per table with automatic classification: transient (retry-safe), structural (table maintenance needed), or systemic (infrastructure issue). A table that fails 5% of Trino queries due to planning timeouts is a structural problem — the fix is manifest rewriting and compaction, not Trino configuration.

Cross-system signals: where observability domains intersect

The highest-value observability signals come from correlating across domains — combining table health, engine telemetry, cost data, and lineage into signals that no single domain can produce alone.

Lineage breaks and freshness SLAs

A freshness SLA violation is an observability event. But the operational question is not just "is the table stale?" It is "why is the table stale, what is affected downstream, and how do we fix it?" Answering this requires correlating freshness (table-level) with lineage (cross-system) and pipeline status (orchestrator-level).

If table A feeds table B feeds dashboard C, and table A goes stale, you need to know immediately that dashboard C is at risk — not after dashboard C's SLA breaches and a user reports it. This is lineage-aware freshness monitoring: propagating staleness signals through the dependency graph rather than monitoring each table in isolation.
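Propagating staleness through the dependency graph amounts to a graph traversal. The sketch below assumes lineage is available as a simple adjacency map from each node to its direct consumers; the map, node names, and function name are hypothetical.

```python
from collections import deque


def downstream_at_risk(edges: dict[str, list[str]], stale: str) -> list[str]:
    """Breadth-first walk returning every transitive consumer of a stale node."""
    seen = {stale}
    at_risk = []
    queue = deque(edges.get(stale, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited; also guards against cycles
        seen.add(node)
        at_risk.append(node)
        queue.extend(edges.get(node, []))
    return at_risk


lineage = {"table_a": ["table_b"], "table_b": ["dashboard_c"]}
print(downstream_at_risk(lineage, "table_a"))
```

Breadth-first order lists the nearest consumers first, which is usually the order in which their own freshness SLAs will breach.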

Quality checks driven by structural signals

A sudden spike in null counts for a column that was previously 99.9% populated is a quality signal. But correlating it with a schema evolution event on the upstream table transforms it from an anomaly into a diagnosed root cause. The upstream table added a new column with the same name but different semantics, and the downstream pipeline's column mapping broke silently.

Similarly, a sudden change in record count per snapshot — 10x more records than usual — could be a legitimate traffic spike or a pipeline bug that duplicated a batch. Correlating it with partition-level record counts, upstream lineage events, and historical patterns distinguishes the two without manual investigation.

Cost attribution across the stack

True cost attribution requires combining storage cost (object storage bytes per table, including orphans), compute cost (query engine time per table), maintenance cost (compaction and cleanup compute per table), and data transfer cost (cross-region reads per table). Each of these comes from a different system — S3 billing, EMR/Dataproc usage, maintenance job logs, and network transfer logs.

The outcome is a per-table total cost of ownership that answers: what does it cost to store, query, and maintain this table? Tables with high TCO relative to their business value are candidates for archival, schema optimization, or partition redesign. Tables with rapidly growing TCO are candidates for immediate structural remediation.
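A minimal sketch of the per-table roll-up, assuming the four cost components have already been attributed to each table from their respective source systems. The table names and dollar amounts are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class TableCost:
    table: str
    storage: int      # object storage, including orphans
    compute: int      # query engine time
    maintenance: int  # compaction and cleanup compute
    transfer: int     # cross-region reads

    @property
    def tco(self) -> int:
        """Total monthly cost of storing, querying, and maintaining the table."""
        return self.storage + self.compute + self.maintenance + self.transfer


# Hypothetical monthly costs in dollars.
costs = [
    TableCost("events", storage=1_200, compute=9_500, maintenance=300, transfer=400),
    TableCost("dim_customer", storage=40, compute=600, maintenance=20, transfer=0),
    TableCost("cdc_orders", storage=700, compute=4_100, maintenance=1_900, transfer=150),
]

ranked = sorted(costs, key=lambda c: c.tco, reverse=True)
for c in ranked:
    print(c.table, c.tco)
```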

Real-time monitors vs periodic audits: which should page you at 2 AM?

Not every observability signal needs real-time alerting. Paging an on-call engineer for a sub-optimal manifest ratio is as harmful as not alerting on a freshness SLA breach. The distinction between real-time monitors and periodic audits determines operational sanity.

Real-time monitors: page-worthy signals

These signals indicate an active incident or imminent failure and should trigger immediate notification:

Freshness SLA breach. If a table's data is stale beyond its defined threshold, downstream consumers are seeing outdated data. This is a live correctness issue that affects business decisions. Page immediately.

Query failure rate spike. If a table's query error rate jumps from baseline 0.1% to 15%, something is structurally broken — planning timeouts, metadata corruption, or storage access failures. This is a live availability issue. Page immediately.

Lineage break with downstream impact. If a critical pipeline fails and the lineage graph shows that three tier-1 dashboards depend on its output, the blast radius is known and time-sensitive. Page immediately with the dependency context.

Storage anomaly. If a table's storage volume doubles in an hour with no corresponding increase in record count, something is wrong — runaway orphan generation, failed compaction creating duplicate files, or a write amplification bug. Page immediately because the cost is accumulating by the minute.

Periodic audits: important but not urgent

These signals indicate degradation that should be addressed in the next business day or maintenance window:

Structural health drift. A table transitioning from Healthy to Warning — file sizes declining, manifest count increasing, delete ratio creeping up. Important to address before it becomes Critical, but not an incident. Surface in a daily digest or weekly health report.

Cost trend anomalies. A table whose query cost has increased 30% over the past month. Important for FinOps review and capacity planning, but not actionable at 2 AM. Include in a weekly cost report with attribution details.

Sort order misalignment. Query telemetry reveals that the dominant filter columns have shifted but the sort order has not been updated. Performance is sub-optimal but not broken. Flag as a WARNING-level Insight for the next maintenance cycle.

Orphan file accumulation. Orphans accumulate gradually — they are invisible to queries and do not affect performance. They inflate storage cost, but the cost growth is slow enough that daily or weekly cleanup suffices. Include in periodic storage health audits.

The principle is straightforward: page on correctness and availability, report on efficiency and cost. LakeOps implements this distinction natively through its four-severity Insights model — CRITICAL and HIGH Insights trigger immediate automated remediation, while WARNING and LOW Insights are surfaced for review without paging anyone.
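The page-versus-report split above can be written down as a routing table. This is an illustrative sketch only: the signal identifiers are made-up names for the eight signals listed in this section, and it does not represent how LakeOps implements its Insights model.

```python
# Signals that indicate a live correctness or availability problem.
PAGE_SIGNALS = {
    "freshness_sla_breach",
    "query_failure_spike",
    "lineage_break",
    "storage_anomaly",
}

# Signals that indicate efficiency or cost degradation.
REPORT_SIGNALS = {
    "structural_health_drift",
    "cost_trend_anomaly",
    "sort_order_misalignment",
    "orphan_accumulation",
}


def route(signal: str) -> str:
    """Return 'page' for incident-grade signals and 'report' for audit-grade ones."""
    if signal in PAGE_SIGNALS:
        return "page"
    if signal in REPORT_SIGNALS:
        return "report"
    raise ValueError(f"unknown signal: {signal}")


print(route("freshness_sla_breach"), route("orphan_accumulation"))
```

Raising on unknown signals is a deliberate choice in this sketch: a new signal type should be classified explicitly rather than silently defaulting to either channel.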

Building observability into the maintenance loop

Observability without action is just monitoring — you know something is wrong but you still have to fix it manually. The operational leverage comes from closing the loop: observe → classify → act → measure.

Observe: continuous metric collection

The observe phase collects structural metrics from every table in every connected catalog. File counts, size distributions, manifest ratios, snapshot depths, delete file ratios, partition-level statistics, and freshness timestamps — all computed from catalog metadata without scanning data files. For a lake with hundreds of tables, this must be automated and continuous. Polling-based approaches miss fast-moving degradation; event-driven approaches that update metrics with every commit provide real-time health state.

Classify: health scoring and severity ranking

Raw metrics are necessary but insufficient. With hundreds of tables, you need classification that triages every table into a health state — Critical, Warning, or Healthy — and ranks individual problems by severity. Classification transforms a wall of metrics into an actionable priority list: fix these tables first, watch these tables next, and leave these tables alone.

LakeOps implements this classification with three health states and a four-severity Insights model that provides the granularity production operations demand:

Health states classify the overall table condition. Critical means the table has structural problems severe enough to impact query performance or correctness: average file size below 32 MB, delete-to-data ratio above 0.5, fewer than 10 data files per manifest, or snapshot retention exceeding policy by more than 5x. Warning means the table is degrading and will reach Critical without intervention: file sizes declining below 128 MB, delete ratios creeping above 0.1, or snapshot depth drifting beyond policy. Healthy means all structural signals are within acceptable bounds: file sizes between 128 and 512 MB, manifests well-consolidated, snapshots within retention policy, and sort order aligned with query patterns.

Four-severity Insights provide per-issue granularity within each table. A single Critical table might have multiple Insights: a CRITICAL-severity Insight for 85,000 small files in a single partition, a HIGH-severity Insight for 4,200 pending position delete files, a WARNING-severity Insight for sort order misalignment with dominant query predicates, and a LOW-severity Insight for 12 orphan files consuming 340 MB. Each Insight includes the specific metric, its current value, the threshold violated, and a remediation action. CRITICAL and HIGH Insights trigger automated remediation immediately. WARNING and LOW Insights surface for review without paging anyone.

The classification must be context-aware. A streaming events table with 500 small files per partition is in worse shape than a monthly reporting table with the same count — because the streaming table is queried thousands of times per day and the reporting table is queried once per month. Per-table policies that encode workload expectations — compaction frequency, target file size, acceptable delete ratio, snapshot retention — are essential for accurate classification. LakeOps applies these policies per table, so a CDC table with aggressive write throughput is classified against different thresholds than a daily-batch dimension table.

LakeOps Insights — proactive alerts for table health — Severity-ranked Insights: CRITICAL for partition file issues, HIGH for excessive manifests, WARNING for partition skew and small files.

Act: automated, sequenced remediation

When classification identifies a table in Critical or Warning state, remediation should trigger automatically — running the maintenance operations in the correct sequence (expire → clean → compact → rewrite manifests → refresh statistics) without human intervention. The operations must be conflict-safe (concurrent readers and writers are not interrupted), incremental (only the degraded partitions are targeted), and reversible (every operation can be rolled back if it produces unexpected results).

The alternative — opening a ticket, assigning it to an engineer, waiting for investigation, scheduling maintenance, and verifying the fix — takes days to weeks. During that time, the degradation compounds. Automated remediation closes the loop in minutes to hours.

Measure: before-and-after impact tracking

Every remediation action should be measured against the health state that triggered it. If compaction ran because average file size was 8 MB, did it achieve the target 256 MB? If manifest rewriting ran because the table averaged fewer than 5 data files per manifest, does it now average more than 50? If snapshot expiration ran because retention was 45 days against a 7-day policy, is the oldest snapshot now within bounds?

Impact measurement serves two purposes: it validates that remediation worked, and it tunes the classification thresholds over time. If a WARNING-level threshold consistently triggers remediation that has no measurable impact, the threshold is too sensitive. If a table regularly reaches CRITICAL before remediation triggers, the WARNING threshold is not sensitive enough.

LakeOps Table Events — LakeOps event audit trail: every maintenance operation — compaction, snapshot expiration, orphan cleanup, manifest rewriting — logged with start time, duration, files processed, bytes before and after, health score change, and outcome. This is the operational history that closes the observe-classify-act-measure loop and answers 'what happened to this table' at any point in time.

Cross-engine telemetry: seeing the whole picture

In a multi-engine lakehouse, the most valuable observability signals come from aggregating telemetry across all engines that touch the same tables. No single engine has this view. StarRocks knows about its federation queries but not about the Spark jobs that write to the same tables. Trino knows about its scan volumes but not about the Flink checkpoints that create the small files it struggles to read. Athena knows about its costs but not about the compaction jobs that would reduce them.

Cross-engine telemetry solves three problems that engine-local monitoring cannot. First, conflicting access patterns: if Trino queries filter on event_date and region while Spark queries filter on customer_id and event_type, the optimal sort order is a compromise that no single engine can determine alone. Second, total cost attribution: the true cost of a table is the sum of all engine queries, all maintenance compute, and all storage — scattered across different billing systems that no individual engine aggregates. Third, write-read coordination: knowing that Flink commits to a table every 30 seconds while StarRocks federation queries hit it every 5 minutes tells you that compaction should run between federation query cycles, not between Flink checkpoints.
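To illustrate the first of those problems, the sketch below counts how often each column appears in a filter across engines. It assumes query logs have already been normalized into records with an engine name and a list of filter columns; that record shape, the sample log, and the function name are all hypothetical.

```python
from collections import Counter


def filter_column_counts(query_log: list[dict]) -> Counter:
    """Count queries per filter column, across every engine in the log."""
    counts = Counter()
    for entry in query_log:
        # A set avoids double-counting a column referenced twice in one query.
        counts.update(set(entry["filter_columns"]))
    return counts


log = [
    {"engine": "trino", "filter_columns": ["event_date", "region"]},
    {"engine": "trino", "filter_columns": ["event_date", "region"]},
    {"engine": "spark", "filter_columns": ["customer_id", "event_type"]},
    {"engine": "trino", "filter_columns": ["event_date"]},
]

counts = filter_column_counts(log)
print(counts.most_common(2))
```

A raw count like this is only a starting point. Weighting each query by bytes scanned or by cost would favor the columns whose misalignment is most expensive rather than merely most frequent.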

LakeOps collects this cross-engine telemetry by connecting to each engine's query logs and correlating them per table. The result is a unified access profile per table that shows: which engines read it (with frequency and volume), which engines write to it (with commit cadence and file sizes), which columns are most frequently filtered (across all engines), and what the read amplification factor is per engine. This profile drives every optimization decision — sort order selection, compaction scheduling, partition strategy, and query routing — because it reflects the actual combined workload rather than any single engine's partial view.

The control plane approach to observability

The seven pillars described above are individually tractable. Any team can write monitoring queries for file count and snapshot depth. Any team can set up freshness alerts. The challenge is doing all of it, across all tables, across all engines, continuously, at scale — while also handling remediation, cost attribution, lineage correlation, and access auditing.

This is the control plane thesis: observability, health classification, maintenance orchestration, and operational intelligence should be a single dedicated system — not a patchwork of scripts, cron jobs, and custom dashboards stitched together by the platform team.

What a control plane provides

Unified visibility across catalogs and engines. Connect Glue, REST, Polaris, Nessie, S3 Tables — every table in every catalog appears in a single view with health classification, structural metrics, and operational history. Connect Trino, Spark, Flink, Athena, Snowflake — every engine's telemetry is correlated per table.

Health classification with four-severity Insights. Every table is classified as Critical, Warning, or Healthy based on structural signals that update with every commit. Individual problems surface as Insights at four severity levels (CRITICAL, HIGH, WARNING, LOW), each with the specific metric that triggered it, the current value, the threshold violated, and one-click remediation.
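A minimal sketch of this kind of classification is shown here. The thresholds, the rule table, and the mapping from Insight severity to table health state are all assumptions made for illustration; the source does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Insight:
    metric: str
    severity: str     # CRITICAL, HIGH, WARNING, or LOW
    value: float      # current value of the metric
    threshold: float  # threshold that was violated


# Illustrative rules only. For a given metric, list the most severe rule
# first: the first violated rule per metric wins.
RULES = [
    # (metric, severity, threshold, violated when value is ...)
    ("avg_file_size_mb", "CRITICAL", 8.0, "below"),
    ("avg_file_size_mb", "WARNING", 64.0, "below"),
    ("snapshot_count", "HIGH", 1000.0, "above"),
    ("delete_to_data_ratio", "WARNING", 0.2, "above"),
    ("files_per_manifest", "LOW", 10.0, "below"),
]


def evaluate(metrics):
    """Return one Insight per metric that violates a rule."""
    insights, flagged = [], set()
    for metric, severity, threshold, violated_when in RULES:
        if metric in flagged or metric not in metrics:
            continue
        value = metrics[metric]
        violated = value < threshold if violated_when == "below" else value > threshold
        if violated:
            insights.append(Insight(metric, severity, value, threshold))
            flagged.add(metric)
    return insights


def classify(insights):
    """Assumed mapping: CRITICAL/HIGH -> Critical, WARNING -> Warning."""
    severities = {insight.severity for insight in insights}
    if severities & {"CRITICAL", "HIGH"}:
        return "Critical"
    if "WARNING" in severities:
        return "Warning"
    return "Healthy"


table_metrics = {
    "avg_file_size_mb": 32.0,
    "snapshot_count": 120,
    "delete_to_data_ratio": 0.05,
    "files_per_manifest": 4,
}
insights = evaluate(table_metrics)
state = classify(insights)
print(state, [(i.metric, i.severity) for i in insights])
```

Each Insight carries the metric, current value, and violated threshold, which is the context an operator needs to decide whether remediation is warranted.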

Per-table detail view without running queries. Every table gets a dedicated observability profile computed entirely from catalog metadata — no data scanning, no compute cost. The profile surfaces: total records and physical size, active data files and average file size, stale files pending cleanup, position delete files and equality delete files (with the delete-to-data ratio), partition count and per-partition file distribution, snapshot count and retention drift, manifest count and files-per-manifest ratio, and a records-over-time chart showing distribution across the last 60 snapshots. This per-table view is the starting point for every investigation — when a query slows down, the table's observability profile shows whether the cause is structural (small files, manifests, deletes) or data-related (volume spike, partition skew) before anyone opens a query plan.
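Several of these profile fields are simple ratios over metadata counts. The sketch below assumes the data file sizes, delete file counts, and manifest count have already been read from catalog metadata; `build_profile` is a hypothetical helper, and defining the delete-to-data ratio as a file-count ratio is an assumption.

```python
MB = 1024 * 1024


def build_profile(data_file_sizes, position_deletes, equality_deletes, manifest_count):
    """Derive profile metrics from metadata counts; no data files are read."""
    data_files = len(data_file_sizes)
    delete_files = position_deletes + equality_deletes
    return {
        "data_files": data_files,
        "total_bytes": sum(data_file_sizes),
        "avg_file_bytes": sum(data_file_sizes) / data_files if data_files else 0.0,
        # Assumption: ratio of delete file count to data file count.
        "delete_to_data_ratio": delete_files / data_files if data_files else 0.0,
        "files_per_manifest": data_files / manifest_count if manifest_count else 0.0,
    }


profile = build_profile(
    data_file_sizes=[4 * MB, 12 * MB, 256 * MB, 8 * MB],
    position_deletes=1,
    equality_deletes=1,
    manifest_count=2,
)
print(profile)
```

Here three of the four files are well below a typical target size, which points to a structural cause (small files) before any query plan is opened.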

Cross-engine telemetry driving optimization. LakeOps collects query telemetry from every connected engine — Trino, Spark, Flink, Athena, StarRocks, Snowflake — and correlates it per table. The telemetry reveals which columns are filtered by which engines, which sort orders align with actual access patterns, which tables are queried by multiple engines with conflicting scan patterns, and which tables have the highest read amplification (bytes scanned vs. bytes returned). This cross-engine view is something no single engine can provide — Trino knows about Trino queries, but not that Spark queries on the same table filter on completely different columns. LakeOps uses this telemetry to recommend (and automatically apply) sort orders and compaction strategies that optimize for the combined workload across all engines.

Coordinated maintenance pipeline. Observe → classify → act → measure as a single automated loop. Expire snapshots, clean orphans, compact data files, rewrite manifests, refresh statistics — in the correct sequence, triggered by structural signals, with every operation logged and reversible. Each maintenance operation is tracked in a comprehensive event audit trail: start time, duration, files processed, bytes before and after, impact on health score, and outcome (success, partial, conflict). The audit trail provides the operational history that answers 'what changed, when, and what was the impact' for any table at any point in time — essential for debugging regressions and demonstrating compliance.
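The sequencing and audit trail can be sketched as follows. The operation order comes from the text above; the `AuditEvent` fields are a subset of those listed (health score impact is omitted), the stub operations are hypothetical, and stopping the run on a conflict is an assumed policy.

```python
import time
from dataclasses import dataclass

# Order taken from the text: expire, clean, compact, rewrite, refresh.
SEQUENCE = [
    "expire_snapshots",
    "clean_orphans",
    "compact_data_files",
    "rewrite_manifests",
    "refresh_statistics",
]


@dataclass
class AuditEvent:
    operation: str
    started_at: float
    duration_s: float
    files_processed: int
    bytes_before: int
    bytes_after: int
    outcome: str  # "success", "partial", or "conflict"


def run_pipeline(table_state, operations, clock=time.time):
    """Run configured operations in SEQUENCE order and return the audit trail."""
    trail = []
    for name in SEQUENCE:
        operation = operations.get(name)
        if operation is None:
            continue  # step not configured for this table
        started = clock()
        bytes_before = table_state["bytes"]
        files_processed, outcome = operation(table_state)
        trail.append(AuditEvent(name, started, clock() - started, files_processed,
                                bytes_before, table_state["bytes"], outcome))
        if outcome == "conflict":
            break  # assumed policy: leave later steps for the next cycle
    return trail


# Hypothetical stand-ins for real maintenance operations.
def expire_snapshots(state):
    state["bytes"] -= 100  # pretend expired snapshots released 100 bytes
    return 5, "success"


def compact_data_files(state):
    return 0, "conflict"  # pretend a concurrent writer won the commit


def rewrite_manifests(state):
    return 3, "success"


state = {"bytes": 1000}
trail = run_pipeline(state, {
    "rewrite_manifests": rewrite_manifests,
    "compact_data_files": compact_data_files,
    "expire_snapshots": expire_snapshots,
})
for event in trail:
    print(event.operation, event.outcome, event.bytes_before, event.bytes_after)
```

The dictionary order of the configured operations does not matter: `SEQUENCE` decides execution order, so snapshots are always expired before compaction runs.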

Executive dashboard for platform leadership. The executive view surfaces lake-wide operational health in a single screen: total tables under management, tables by health state (Critical, Warning, Healthy), total maintenance operations run (with trend), average query acceleration achieved from maintenance, cumulative cost savings from orphan cleanup and scan reduction, total CPU hours and storage reduced, and month-over-month optimization trends. These are the metrics that platform engineering leadership needs to answer two questions in a monthly review: is the lake healthy, and is the investment in observability paying off? The dashboard provides both the current state and the historical trajectory.

LakeOps executive dashboard: tables by health state, total maintenance operations (with trend), average query acceleration from compaction and sort optimization, cumulative cost savings from orphan cleanup and scan reduction, and CPU hours and storage reduced. This is the single view that platform engineering leadership uses to verify lakehouse health and justify the observability investment.

LakeOps as the observability control plane

LakeOps implements the control plane approach for Apache Iceberg lakehouses. It connects to your existing catalogs and query engines without moving data or changing pipelines, providing the full observability stack described in this guide — from table-level structural metrics to cross-engine telemetry to executive dashboards.

The observability layer processes thousands of structural health checks per hour — every commit to every table triggers a health reclassification, every engine query contributes to the cross-engine telemetry corpus, and every maintenance operation feeds the closed-loop measurement cycle. Every operation is logged with complete context — what ran, when, duration, files processed, bytes before and after, health score impact, and outcome — providing the event audit trail that production lakehouses require for both operational debugging and compliance.

LakeOps observability walkthrough — health classification, insights, and cross-engine telemetry.

The operational model is adaptive: observability signals drive the maintenance loop rather than fixed schedules. A table that receives 10,000 writes per hour gets compacted hourly. A table that changes weekly gets compacted weekly. A table in Critical health state gets immediate attention; a Healthy table is left alone. The system observes, classifies, acts, and measures — continuously, across every table, without human intervention for routine operations.
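One way to picture signal-driven scheduling is a function that maps health state and write rate to a delay before the next compaction. This is a rough sketch under stated assumptions: the `writes_per_run` budget of 10,000, the one-hour floor, and the one-week ceiling are hypothetical values chosen to reproduce the examples above, not documented LakeOps behavior.

```python
HOURS_PER_WEEK = 168.0


def next_compaction_delay_hours(health_state, writes_per_hour, writes_per_run=10_000):
    """Return hours until the next compaction.

    0.0 means run now; None means leave the table alone.
    """
    if health_state == "Critical":
        return 0.0   # immediate attention
    if health_state == "Healthy":
        return None  # no maintenance needed
    if writes_per_hour <= 0:
        return HOURS_PER_WEEK
    # Time for the table to accumulate one run's worth of writes,
    # clamped between hourly and weekly.
    delay = writes_per_run / writes_per_hour
    return min(max(delay, 1.0), HOURS_PER_WEEK)


print(next_compaction_delay_hours("Warning", 10_000))   # busy table
print(next_compaction_delay_hours("Warning", 1 / 168))  # about one write per week
print(next_compaction_delay_hours("Critical", 50))
print(next_compaction_delay_hours("Healthy", 50))
```

The point of the design is that the schedule is an output of observed signals rather than a fixed cron expression, so it adjusts on its own when a table's write pattern changes.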

The execution engine is built on Apache DataFusion in Rust — no JVM startup, no GC pauses, no executor provisioning. Non-blocking commits ensure concurrent readers and writers are never interrupted during maintenance. In production benchmarks across 5.5 TB, it completes compaction 86% faster than Spark.

From monitoring gaps to operational intelligence

Lakehouse observability is not a single metric or a single tool. It is the systematic practice of measuring, classifying, acting on, and verifying the health of every table, engine, pipeline, and cost center in your lake. The seven pillars — lineage, data quality, cost, performance, query patterns, access audits, and storage health — each cover an independent failure domain. Neglecting any one of them creates a blind spot that eventually surfaces as an incident.

The practical path forward has three stages:

Stage 1: Instrument the basics. Start with table-level structural metrics — file count, average file size, snapshot depth, manifest ratio, delete file ratio — and freshness monitoring for your most critical tables. These signals catch the most common failure modes (silent structural degradation, pipeline failures) with minimal instrumentation effort. Run the monitoring queries in this guide against your production tables today.

Stage 2: Correlate across systems. Add engine-level telemetry (query latency, scan volume, error rates per table), cost attribution (storage + compute + maintenance per table), and lineage (upstream dependencies and downstream consumers). This is where the cross-system signals emerge — lineage-aware freshness, cost-attributed health classification, and query-pattern-driven optimization.

Stage 3: Close the loop. Move from monitoring to the full observe → classify → act → measure cycle with automated remediation, per-table policies, and impact tracking. This is where a dedicated control plane like LakeOps eliminates the operational overhead — handling health classification, sequenced maintenance, cross-engine telemetry, and lake-wide observability across your entire catalog without custom scripting or cluster management.
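The cost attribution step in Stage 2 amounts to joining billing records from separate systems on the table name. A minimal sketch, assuming the records have already been normalized to `(table, category, usd)` tuples; the record shape, category labels, and figures are hypothetical.

```python
from collections import defaultdict

# Hypothetical normalized billing records from storage, engine,
# and maintenance billing systems: (table, category, usd).
RECORDS = [
    ("events", "storage", 120.0),
    ("events", "compute:trino", 300.0),
    ("events", "compute:spark", 80.0),
    ("events", "maintenance", 25.0),
    ("users", "storage", 10.0),
]


def attribute_costs(records):
    """Sum cost per table and category, and add a per-table total."""
    per_table = defaultdict(lambda: defaultdict(float))
    for table, category, usd in records:
        per_table[table][category] += usd
    return {
        table: {**categories, "total": sum(categories.values())}
        for table, categories in per_table.items()
    }


costs = attribute_costs(RECORDS)
print(costs["events"]["total"], costs["users"]["total"])
```

The hard part in practice is the normalization itself, since each billing system reports usage at a different granularity; once records share a table key, the aggregation is trivial.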

The teams running lakehouses at scale — hundreds of tables, multiple engines, petabytes of data, multi-cloud deployments — are the ones who have learned that Iceberg's operational reality in 2026 demands observability as a first-class concern. The table format gives you open, vendor-neutral storage. Observability gives you the ability to keep it healthy. And a control plane gives you the ability to do both at scale, continuously, without the operational overhead consuming your entire platform team.

For related patterns: multi-engine architecture covers how multiple engines interact with shared Iceberg tables, query routing with Iceberg covers directing queries to the optimal engine, hot/cold data tiering with StarRocks covers the observability challenges specific to tiered architectures, and the multi-engine routing solution shows how QueryFlux uses observability signals for routing decisions.