Iceberg Lake Observability

Know everything.
Control your Lakehouse.

LakeOps unifies Iceberg metadata, table health, proactive insights, and cross-engine query telemetry in one control plane — so platform teams know which tables are degrading, why, and what to do next.

Last 30 days Optimization Activity

  • Total Operations: 12,211 (last 90 days)
  • Query Speed: 12.4× (avg. acceleration across engines)
  • Cost Savings: $1,374,672 (saved in last 3 months)
  • CPU & Storage: -76% (last 90 days)
  • Data Optimized: 46.8 PB (last 30 days)

Key Metrics

  • Total Tables: 786 (tables in all catalogs)
  • Critical Tables: 70 (require immediate attention)
  • Warning Tables: 105 (should be addressed or auto-piloted)
  • Healthy Tables: 566 (tables in optimal state)
  • Total Data: 112.4 PB (total lake data size)

Recent Operations

Last 10 operations
Operation | Table | Duration | Impact | Time | Status
Compact Data Files | orders.customer_orders | 4s | 1.24 TB, 16 → 1 files | 57 minutes ago | SUCCESS
Expire Snapshots | payments.payment_transactions | 27s | 8.2 TB | 4 hours ago | SUCCESS
Rewrite Manifests | analytics.raw_clickstream | 1.9s | 3 → 1 manifests | 5 hours ago | SUCCESS
Compact Data Files | products.product_catalog | 6m 11.3s | 3,008 → 1,256 files | 6 hours ago | SUCCESS
Remove Orphan Files | analytics.user_sessions | 13m 6.9s | 59,831 files, 74.81 GB freed | 7 hours ago | SUCCESS

Table Status Distribution

  • Critical: 70 (9%)
  • Warning: 105 (13%)
  • Healthy: 566 (72%)

Top 5 Tables Needing Optimization

By Size
Table Name | Table Size | Status | Last Scan
analytics.raw_clickstream | 4.6 TB | CRITICAL | 2 hours ago
analytics.search_query_logs | 3.2 TB | CRITICAL | 3 hours ago
analytics.user_sessions | 1.9 TB | CRITICAL | 4 hours ago
orders.customer_orders | 1.24 TB | CRITICAL | 1 hour ago
payments.payment_transactions | 860 GB | CRITICAL | 2 hours ago

Telemetry from across your stack

AWS
Azure
Google Cloud
Snowflake
Databricks
Apache Flink
Apache Hadoop
Apache Iceberg
Delta Lake
Spark
Lakekeeper
StarRocks

In the product

Full lake visibility with one-click actions

From an executive health summary to per-table metrics and engine telemetry, each layer answers a different question platform teams ask every day.

Lake-wide dashboard

One screen for the health of your entire lake

Once catalogs are connected, LakeOps discovers every table and continuously reads its Iceberg metadata. The dashboard summarizes health tiers, active insights, and lake scope, so platform teams can start triage without writing SQL.

  • Critical, Warning, and Healthy counts across all connected catalogs
  • Table inventory with records, size, status, and last-modified at a glance
  • Refreshes as schemas and namespaces evolve — no custom instrumentation
LakeOps Dashboard — lake-wide health tiers, optimization activity, and table inventory

Table health classification

Structural scoring for every registered table

Health is computed from Iceberg signals teams already care about: file count and size distribution, manifest depth, snapshot accumulation, delete-file ratio, partition skew, and sort-order alignment with real query patterns.

  • Critical — severe fragmentation or metadata bloat; planning or scans at risk
  • Warning — degradation underway; likely to reach Critical without action
  • Healthy — structural indicators within bounds you define per environment
LakeOps table health grid — Critical, Warning, and Healthy Iceberg tables with status, size, and records
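
As a rough illustration of how structural signals could map to the three tiers above, the sketch below classifies a table by its worst-scoring signal. The `TableSignals` fields, the `THRESHOLDS` values, and the `classify` function are all hypothetical and are not LakeOps's actual scoring logic; real bounds would be defined per environment.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass
class TableSignals:
    """Hypothetical structural counters derived from Iceberg metadata."""
    data_files: int
    total_bytes: int
    manifests: int
    snapshots: int
    delete_files: int


# Illustrative (warning_bound, critical_bound) pairs; assumptions, not defaults.
THRESHOLDS = {
    "avg_file_mb": (64.0, 16.0),   # lower average file size is worse
    "manifests": (100, 500),       # higher is worse
    "snapshots": (200, 1000),      # higher is worse
    "delete_ratio": (0.10, 0.30),  # delete files per data file
}


def classify(s: TableSignals) -> str:
    """Return 'Critical', 'Warning', or 'Healthy' based on the worst signal."""
    if s.data_files == 0:
        return "Healthy"  # an empty table has nothing to fragment

    avg_mb = s.total_bytes / s.data_files / MB
    delete_ratio = s.delete_files / s.data_files

    # Average file size is the only signal where smaller values are worse.
    warn_mb, crit_mb = THRESHOLDS["avg_file_mb"]
    levels = [2 if avg_mb < crit_mb else 1 if avg_mb < warn_mb else 0]

    for name, value in (
        ("manifests", s.manifests),
        ("snapshots", s.snapshots),
        ("delete_ratio", delete_ratio),
    ):
        warn, crit = THRESHOLDS[name]
        levels.append(2 if value >= crit else 1 if value >= warn else 0)

    return ("Healthy", "Warning", "Critical")[max(levels)]
```

Taking the worst signal rather than an average keeps a single severe problem, such as manifest bloat, from being masked by otherwise healthy counters.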

Events & audit trail

Complete history of what changed and why

Every maintenance operation — compaction, expiration, orphan removal, manifest rewrite — is logged lake-wide and per table with duration, impact, and status. Observability closes the loop: see the problem, act, verify the outcome.

  • Filter by catalog, operation type, and success or failure
  • Before/after file and manifest counts on every event
  • Compliance-ready trail for platform and data-governance teams
LakeOps events — lake-wide audit trail for compaction, snapshot expiration, and maintenance operations

Actionable insights

Actionable alerts before users file tickets

Insights evaluate tables on a schedule and raise prioritized findings — each tied to a table, severity, and recommended next step. Remediate manually or let policies act on the same signal.

  • CRITICAL — partition file explosions and runaway write patterns
  • HIGH — manifest counts above threshold, snapshot backlog
  • WARNING — partition skew, emerging small-file clusters
  • LOW — early drift you can fix before the next compaction window
LakeOps Insights — proactive Iceberg table health alerts ranked by severity
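
To make the idea of prioritized findings concrete, here is a minimal sketch that orders findings by the four severity levels listed above. The `Finding` record and its field names are assumptions chosen for illustration; the source does not describe how LakeOps stores or ranks insights internally.

```python
from dataclasses import dataclass

# Severity levels as named in the list above, most urgent first.
SEVERITY_ORDER = {"CRITICAL": 0, "HIGH": 1, "WARNING": 2, "LOW": 3}


@dataclass
class Finding:
    """Hypothetical insight record: one table, one severity, one next step."""
    table: str
    severity: str
    message: str
    recommended_action: str


def prioritize(findings):
    """Sort findings by severity, then by table name for a stable order."""
    return sorted(findings, key=lambda f: (SEVERITY_ORDER[f.severity], f.table))
```

A manual workflow and an automated policy could both consume the same sorted list, which is the sense in which they act on the same signal.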

Table-level insights

Drill into the signals driving each table's health

Beyond lake-wide alerts, every table surfaces its own insights — manifest count vs. threshold, partition skew, small-file accumulation, and more. Each finding links directly to the affected table with severity and recommended action.

  • Per-table severity breakdown — from CRITICAL to LOW
  • Manifest fragmentation alerts with undersized-file counts
  • Partition skew and small-file warnings before they compound
LakeOps table insights — per-table alerts for manifests, partition skew, and small files

Explore & metrics

Per-table investigation without leaving the control plane

The Explore view and Metrics tab expose the full structural picture: records over time, active files, stale files, delete files, file-size histograms, and snapshot-level growth — the same data you would pull from Iceberg metadata tables, pre-joined and charted.

  • Records distribution across recent snapshots — spot write-pattern changes
  • File size histogram — % of files in optimal range vs. undersized
  • Position and equality delete tracking for merge-on-read tables
LakeOps table metrics — records distribution, file counts, and structural indicators per table
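
The file-size histogram can be approximated from a plain list of file sizes, for example the `file_size_in_bytes` values in Iceberg's `files` metadata table. The sketch below is illustrative only: the bucket edges and labels are assumptions, not the ranges LakeOps uses.

```python
from collections import Counter

MB = 1024 * 1024

# Hypothetical bucket edges as (label, lower bound inclusive, upper bound exclusive).
BUCKETS = [
    ("undersized", 0, 32 * MB),
    ("small", 32 * MB, 128 * MB),
    ("optimal", 128 * MB, 512 * MB),
    ("oversized", 512 * MB, float("inf")),
]


def size_histogram(file_sizes):
    """Return the percentage of files falling into each bucket."""
    sizes = list(file_sizes)
    counts = Counter()
    for size in sizes:
        for label, lo, hi in BUCKETS:
            if lo <= size < hi:
                counts[label] += 1
                break
    total = len(sizes)
    return {
        label: (100.0 * counts[label] / total if total else 0.0)
        for label, _, _ in BUCKETS
    }
```

For a table with two 1 MB files and two 200 MB files, this reports half the files as undersized and half as optimal.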

Partition drill-down

See skew and hotspots before they become outages

The Partitions view breaks down file counts, byte distribution, and delete-file concentration per partition key. Spot the partitions driving planning timeouts or runaway compaction jobs — and act before they escalate.

  • Per-partition file count and byte distribution at a glance
  • Delete-file hotspots highlighted across partition keys
  • Identify streaming-write explosions in individual partitions
LakeOps partition drill-down — per-partition file counts, byte distribution, and skew analysis
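
One simple way to quantify the skew described above is to compare the busiest partition against the median partition. The sketch below assumes the input is a list of `(partition_key, size_bytes)` pairs, one per data file; the function name and the max-over-median ratio are illustrative choices, not LakeOps's documented metric.

```python
from collections import defaultdict
from statistics import median


def partition_skew(files):
    """Aggregate per-partition file counts and bytes, plus a skew ratio.

    files: iterable of (partition_key, size_bytes) tuples, one per data file.
    Returns ({partition: (file_count, total_bytes)}, skew), where skew is
    the largest file count divided by the median file count.
    """
    stats = defaultdict(lambda: [0, 0])
    for partition, size in files:
        stats[partition][0] += 1
        stats[partition][1] += size

    if not stats:
        return {}, 0.0

    counts = [count for count, _ in stats.values()]
    # Every partition seen has at least one file, so the median is >= 1.
    skew = max(counts) / median(counts)
    return {p: (c, b) for p, (c, b) in stats.items()}, skew
```

A ratio near 1 suggests evenly loaded partitions, while a large ratio points at a hotspot, such as a single partition receiving a burst of streaming writes.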

Cross-engine telemetry

See how every engine uses your tables

LakeOps ingests query telemetry from the engines in your stack. Field-access analysis shows which columns appear in filters and joins; engine-level views show latency and load — so observability informs both triage and downstream optimization.

  • SELECT, FILTER, and JOIN frequency per column
  • Per-engine query volume and latency trends
  • Hot tables and cold tables — prioritize maintenance where it matters
LakeOps cross-engine telemetry — field access frequency from queries and layout simulation results
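
Field-access analysis of this kind boils down to counting how often each column appears in each role. The sketch below assumes telemetry has already been parsed into events with `table`, `column`, and `usage` keys, where `usage` is one of SELECT, FILTER, or JOIN; that event shape and both function names are hypothetical.

```python
from collections import Counter, defaultdict


def column_usage(events):
    """Count SELECT / FILTER / JOIN occurrences per (table, column)."""
    usage = defaultdict(Counter)
    for event in events:
        usage[(event["table"], event["column"])][event["usage"]] += 1
    return usage


def top_filter_columns(usage, n=3):
    """Return up to n (table, column) keys most often used in filters."""
    ranked = sorted(usage.items(), key=lambda kv: kv[1]["FILTER"], reverse=True)
    return [
        (key, counts["FILTER"])
        for key, counts in ranked[:n]
        if counts["FILTER"] > 0
    ]
```

Columns that rank high for FILTER or JOIN are natural candidates to consider when evaluating sort order or partitioning, which is how usage telemetry can inform layout decisions.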

What LakeOps gives you

Iceberg observability, end to end

Not another metrics bolt-on — a purpose-built view of table structure, metadata health, engine usage, and maintenance history for every table in your lake.

Lake-wide health at a glance

One dashboard for every catalog and namespace: how many tables are healthy, warning, or critical, total lake size, and where attention is needed first.

Continuous table health scoring

Every Iceberg table is classified from metadata — file fragmentation, manifest depth, snapshot age, delete-file ratio, and sort-order drift — not from guesswork or ad-hoc SQL.

Proactive Insights alerts

Structural problems surface at four severity levels — CRITICAL through LOW — with direct links to the affected table and optional one-click remediation.

Per-table metrics & history

Records, file counts, average file size, delete files, snapshot growth, and file-size distributions — the same signals Iceberg metadata exposes, unified in one UI.

Partition-level drill-down

See skew, per-partition file counts, and delete-file hotspots before they turn into planning timeouts or runaway compaction jobs.

Cross-engine query telemetry

Understand which engines hit which tables, which columns drive filters and joins, and where latency or scan volume is trending — without stitching together six separate UIs.

Full operations audit trail

Every compaction, snapshot expiration, manifest rewrite, and orphan cleanup — lake-wide or per-table — with duration, before/after file counts, and success status.

Foundation for autonomous maintenance

Observability signals feed compaction, expiration, and policy decisions — so maintenance runs on measured degradation, not fixed schedules.

The gap

Iceberg gives you metadata,
not a monitoring layer

Without continuous observability, tables degrade silently. Platform teams discover manifest bloat and small-file sprawl only after query latency spikes or storage bills jump.

Iceberg has metadata, not monitoring

Metadata tables expose snapshots, files, and manifests — but no native health scores, alerting, or lake-wide dashboards. Platform teams run one-off Spark SQL until something breaks.

Visibility is split across silos

Object-storage metrics, engine query UIs, and Iceberg catalog APIs each tell part of the story. Correlating a slow dashboard with manifest bloat takes days of manual investigation.

Degradation is silent until queries hurt

Small files, manifest sprawl, and snapshot buildup compound over weeks. Planning time grows, scans widen, and nobody notices until analysts or agents report failures.

Multi-engine lakes multiply blind spots

Trino, Spark, Snowflake, Athena, DuckDB, and Flink each see the same tables differently. Without unified telemetry, you cannot tell which engine or table is actually driving cost and latency.

Connect your catalogs.
See everything in minutes.

LakeOps reads Iceberg metadata from every connected catalog, classifies table health, and surfaces insights — no agents, no pipeline changes, no manual SQL to get started.

1. Connect & collect telemetry
Apache Iceberg, AWS, Snowflake, Trino

2. Manual or autonomous management
Manual, Autonomous

3. Operations run & optimize
Compaction, Snapshots, Orphan cleanup, Manifests & metadata

4. Observability & governance
Metrics, Health, Agents, Routing, Logs, Policies

No vendor lock-in
No code / infra changes
No data changes