Full-Stack Observability

LakeOps provides continuous analysis of table structure, file health, and optimization opportunities. Monitor active engines, query latency, throughput, and error rates. Cross-system telemetry from S3, GCS, ADLS, and every connected engine lets you view, alert, and act from one place.

Dashboard

The main Dashboard is the first screen you see when logging in. It provides a real-time overview of your entire lake's optimization activity and health status.

Optimization activity

The top row of stat cards shows aggregate metrics for recent optimization activity:

Metric | Description
Total Operations | Count of all completed optimization runs (compaction, snapshot expiry, manifest rewrites, orphan cleanup) over the selected time window.
Query Speed | Average query acceleration factor across all connected engines, comparing pre-optimization and post-optimization query latencies.
Cost Savings | Estimated dollar savings from reduced storage footprint and compute hours.
CPU & Storage | Percentage reduction in resource usage compared to the unoptimized baseline.
Data Optimized | Total volume of data processed by LakeOps optimizations in the time window.
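
As a rough illustration of the arithmetic behind the Query Speed and CPU & Storage cards, the sketch below computes an acceleration factor and a percentage reduction from before/after measurements. The function names and the exact formulas are assumptions made for this example; the text above does not specify how LakeOps calculates these values.

```python
def acceleration_factor(pre_latency_ms: float, post_latency_ms: float) -> float:
    """Speed-up factor: pre-optimization latency divided by post-optimization latency."""
    if pre_latency_ms <= 0 or post_latency_ms <= 0:
        raise ValueError("latencies must be positive")
    return pre_latency_ms / post_latency_ms


def percent_reduction(baseline: float, optimized: float) -> float:
    """Percentage reduction in a resource relative to the unoptimized baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - optimized) * 100.0 / baseline


if __name__ == "__main__":
    # Hypothetical numbers: a query that took 1200 ms before and 400 ms after.
    print(acceleration_factor(1200.0, 400.0))
    # Hypothetical numbers: 800 CPU-hours before, 600 after.
    print(percent_reduction(800.0, 600.0))
```

An average across engines could then be taken over the per-engine factors, although whether LakeOps weights engines by query volume is not stated.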

Table health overview

Below the activity metrics, a set of health cards provides an instant picture of your lake's state:

  • Total Tables — number of tables discovered across all connected catalogs.
  • Critical — tables requiring immediate attention (severe file fragmentation, excessive manifests, high orphan count).
  • Warning — tables that should be addressed or put on autopilot to prevent degradation.
  • Healthy — tables in optimal state with no action required.
  • Total Data — aggregate data size across all catalogs.

Recent operations

A live-updating table at the bottom of the Dashboard shows the most recent optimization operations with their type, target table, duration, impact (files merged / data reclaimed), time, and status (Success, Running, or Failed).

Tables (global view)

The Tables screen lists every table across all catalogs in one searchable, filterable view. Each row shows:

  • Table name and namespace
  • Records and total size
  • Health status (Critical / Warning / Healthy)
  • Last updated timestamp

Filters: Use the dropdown selectors at the top to filter by catalog, namespace, or status. The search bar supports fuzzy matching on table names.
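
The matching algorithm behind the search bar is not documented. As a stand-in, the sketch below shows one common way to do fuzzy matching on table names using Python's standard `difflib` module. The function name `fuzzy_search` and the `limit` and `cutoff` defaults are hypothetical.

```python
import difflib


def fuzzy_search(query: str, table_names: list[str],
                 limit: int = 5, cutoff: float = 0.6) -> list[str]:
    """Return up to `limit` table names that approximately match `query`.

    Matching is case-insensitive. Names that differ only by case collapse
    to a single entry in this simplified sketch.
    """
    by_lower = {name.lower(): name for name in table_names}
    hits = difflib.get_close_matches(
        query.lower(), list(by_lower), n=limit, cutoff=cutoff
    )
    # get_close_matches returns the best matches first.
    return [by_lower[hit] for hit in hits]


if __name__ == "__main__":
    names = ["orders", "customers", "order_items"]
    print(fuzzy_search("ordrs", names))
```

Raising `cutoff` makes the search stricter; lowering it returns more distant matches.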

Click any table name to jump into the Explore deep-dive for that table.

Table Explorer (per-table)

The Explore screen is the deep-dive interface for any individual table. The left pane shows a tree view organized by catalog > namespace > table. Select a table to load its detail view with the following tabs:

Info

Full schema (column name, type, required flag), partition specs with current/historical spec IDs, sort orders, table UUID, storage location, file format, and custom properties.

Snapshots

Complete snapshot history with actions: Tag, Branch, Rollback, Set Current Snapshot, Compare, and Time Travel. Shows snapshot ID, timestamp, operation type, manifest count, and branch/tag references.

Metrics

Total records, total data size, stale file count, active data file count, average file size, average records per file, and a records distribution chart over recent snapshots.

Optimization

Per-table controls for File Compaction, Snapshot Retention, Orphan Files Cleanup, and Rewrite Manifests. Each has Auto/Manual toggles, target values, a cron schedule, and Simulate/Save buttons.

Insights

AI-generated recommendations categorized by severity (CRITICAL, HIGH, WARNING, LOW) with issue type and actionable description. Types include Partition Data Files, Excessive Manifests, Excessive Snapshots.

Simulations

Run and compare layout simulations. Includes field access frequency chart (SELECT, FILTER, JOIN breakdown) and Layout Customization Diff table for side-by-side strategy comparison.

Events

Full operation audit log for this table: operation type, status, start time, duration, and impact. Filterable by operation type.

Policies

All policies assigned to this table with enable/disable toggle, type badge, next scheduled run, and last execution time. Assign additional policies with + Assign Policy.

Query

Built-in SQL editor to run queries directly against this table. Write SQL, click Run Query, and view results inline. Useful for verifying optimization results.

Settings

Table-level configuration: table format (Iceberg v2), file format (Parquet), compression (ZSTD), target file size, snapshot retention, orphan file age, and write mode.

Table Metrics

The Explore > Metrics tab provides detailed file-level metrics for any table:

Metric | What it tells you
totalRecords | Total number of records in the table, which indicates the current size of the dataset.
positionDeletes | Position-based deletes that have not yet been compacted and are applied at read time (merge-on-read). Non-zero values indicate read amplification.
equalityDeletes | Equality-based deletes that haven't been compacted away yet.
totalFiles | Total number of data files in the current snapshot.
totalFileSize | Aggregate size of all data files.
avgFileSize | Average size per file. Files much smaller than the target (e.g. 512 MB) indicate that compaction is needed.
avgRecordsPerFile | Average records per file. Helps gauge file density and layout efficiency.
deletedFiles | Number of files deleted in the latest snapshot (tracked for compaction and cleanup visibility).

A Records Distribution Over Time chart visualizes how record counts change across recent snapshots, helping you spot ingestion anomalies or compaction gaps.
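
To make the relationship between these metrics concrete, the sketch below derives the aggregate values from a list of data files and flags a table whose average file size falls well below the target. The `DataFile` type, the `summarize` and `needs_compaction` functions, and the `small_ratio` cutoff are hypothetical; only the metric names and the 512 MB example target come from the table above.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    size_bytes: int
    record_count: int


def summarize(files: list[DataFile]) -> dict[str, float]:
    """Compute the aggregate file metrics for one snapshot."""
    total_files = len(files)
    total_size = sum(f.size_bytes for f in files)
    total_records = sum(f.record_count for f in files)
    return {
        "totalRecords": total_records,
        "totalFiles": total_files,
        "totalFileSize": total_size,
        "avgFileSize": total_size / total_files if total_files else 0.0,
        "avgRecordsPerFile": total_records / total_files if total_files else 0.0,
    }


def needs_compaction(avg_file_size: float,
                     target_file_size: int = 512 * 1024 * 1024,
                     small_ratio: float = 0.5) -> bool:
    """True when files are, on average, much smaller than the target.

    `small_ratio` is an assumed cutoff: below half the target counts as small.
    """
    return 0 < avg_file_size < target_file_size * small_ratio


if __name__ == "__main__":
    mb = 1024 * 1024
    files = [DataFile(64 * mb, 100_000) for _ in range(4)]
    stats = summarize(files)
    print(stats)
    print(needs_compaction(stats["avgFileSize"]))
```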

Insights

LakeOps continuously analyzes every table and generates actionable insights. Insights surface issues that affect performance, cost, or reliability and suggest specific remediation steps.

Severity levels

Severity | Meaning
CRITICAL | Immediate action required
HIGH | Should be resolved soon
WARNING | Address proactively
LOW | Informational / minor
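
If you export insights and want to triage them outside the console, the four severity levels map naturally onto an ordered enumeration. The sketch below is one way to sort insights so that the most severe come first; the `Severity` enum, its numeric values, and the `triage` function are assumptions made for illustration, not part of LakeOps.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Numeric values are arbitrary; only their order matters.
    LOW = 1
    WARNING = 2
    HIGH = 3
    CRITICAL = 4


def triage(insights: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Sort (table_name, severity_name) pairs with the most severe first.

    The sort is stable, so insights of equal severity keep their input order.
    """
    return sorted(insights, key=lambda item: Severity[item[1]], reverse=True)


if __name__ == "__main__":
    sample = [
        ("sales.orders", "WARNING"),
        ("sales.events", "CRITICAL"),
        ("ops.logs", "LOW"),
        ("sales.items", "HIGH"),
    ]
    for table, severity in triage(sample):
        print(severity, table)
```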

Insight types

  • Partition Data Files — detects partition skew, oversized partitions, and small-file accumulation within partitions.
  • Excessive Manifests — flags tables where manifest count exceeds the threshold (default: 50) or where manifests are undersized relative to target.
  • Excessive Snapshots — identifies tables with high obsolete snapshot ratios and estimates potential storage savings from expiration.
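
The checks behind two of these insight types can be sketched as follows. The manifest threshold of 50 comes from the description above. Everything else is an assumption made for illustration: the function names, and the definition of "obsolete" as older than a retention window.

```python
from datetime import datetime, timedelta

DEFAULT_MANIFEST_THRESHOLD = 50  # default stated in the insight description


def has_excessive_manifests(manifest_count: int,
                            threshold: int = DEFAULT_MANIFEST_THRESHOLD) -> bool:
    """Flag a table whose manifest count exceeds the threshold."""
    return manifest_count > threshold


def obsolete_snapshot_ratio(snapshot_times: list[datetime],
                            now: datetime,
                            retention: timedelta) -> float:
    """Fraction of snapshots older than the retention window.

    Assumption: a snapshot counts as obsolete once it is older than
    `now - retention`. LakeOps may use a different definition.
    """
    if not snapshot_times:
        return 0.0
    cutoff = now - retention
    obsolete = sum(1 for ts in snapshot_times if ts < cutoff)
    return obsolete / len(snapshot_times)


if __name__ == "__main__":
    now = datetime(2024, 6, 30)
    snapshots = [now - timedelta(days=d) for d in (1, 3, 10, 20)]
    print(has_excessive_manifests(72))
    print(obsolete_snapshot_ratio(snapshots, now, timedelta(days=7)))
```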

Global vs. per-table insights

The Insights screen in the sidebar shows insights across all tables with filters for catalog, namespace, severity, and type. The per-table Explore > Insights tab shows insights for a specific table only.

Events & Operations log

Every optimization operation is logged with full audit detail. The Events system provides both a global view and per-table history.

Event record fields

Field | Description
Operation type | Compact Data Files, Expire Snapshots, Rewrite Manifests, or Remove Orphan Files.
Status | Success, Running, or Failed.
Start time | When the operation began.
Duration | How long the operation took.
Impact | Quantifiable result (e.g. “24 → 16 files”, “3 → 1 manifests”, “1 snapshot expired”).

Global Events screen: Use the catalog, type, and status dropdowns plus the search bar to find operations across all tables.

Per-table Events tab: Shows only operations for the selected table in the Explore view.
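
If event records are copied out of the console for further analysis, the Impact field can be parsed into numbers. The sketch below models a record with a hypothetical `Event` dataclass and extracts before/after counts from impact strings in the "24 → 16 files" form shown above. The class, the function names, and the regular expression are assumptions; no export format is documented here.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches before/after impacts such as "24 → 16 files" or "3 → 1 manifests".
_BEFORE_AFTER = re.compile(r"(\d+)\s*→\s*(\d+)\s+(\w+)")


@dataclass
class Event:
    operation_type: str
    status: str  # "Success", "Running", or "Failed"
    impact: str


def parse_impact(impact: str) -> Optional[tuple[int, int, str]]:
    """Return (before, after, unit), or None if the impact has another form."""
    match = _BEFORE_AFTER.search(impact)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)


def by_status(events: list[Event], status: str) -> list[Event]:
    """Keep only events with the given status, like the status dropdown does."""
    return [event for event in events if event.status == status]


if __name__ == "__main__":
    events = [
        Event("Compact Data Files", "Success", "24 → 16 files"),
        Event("Expire Snapshots", "Failed", "1 snapshot expired"),
    ]
    print(parse_impact(events[0].impact))
    print([e.operation_type for e in by_status(events, "Failed")])
```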

Running queries

The Explore > Query tab provides an inline SQL editor connected to your table. Write SQL, click Run Query, and view results immediately. Use this to:

  • Verify data correctness after compaction or rollback operations.
  • Test query performance against specific tables before and after optimization.
  • Debug data issues by inspecting specific partitions or time ranges.

System Monitoring

The Monitoring screen provides system-level observability for the platform and all connected engines.

System overview cards

  • Active Engines — how many registered engines are currently active and responding.
  • Queries Total — total queries processed across all engines with trend comparison.
  • Avg Query Time — mean query duration across engines with week-over-week delta.
  • System Alerts — count of active high-priority alerts requiring attention.

Engine performance table

A real-time table shows each engine with its health status, total queries, average duration, CPU usage (percentage bar), and memory usage (percentage bar). Engines with utilization above threshold are highlighted in amber.
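
The highlighting rule can be pictured as a simple threshold check on each engine's CPU and memory utilization. In the sketch below, the `EngineStats` type, the `engines_to_highlight` function, and the 80% default are hypothetical; the actual threshold value is not stated in this document.

```python
from dataclasses import dataclass


@dataclass
class EngineStats:
    name: str
    cpu_pct: float
    memory_pct: float


def engines_to_highlight(engines: list[EngineStats],
                         threshold_pct: float = 80.0) -> list[str]:
    """Names of engines whose CPU or memory utilization exceeds the threshold."""
    return [
        engine.name
        for engine in engines
        if max(engine.cpu_pct, engine.memory_pct) > threshold_pct
    ]


if __name__ == "__main__":
    fleet = [
        EngineStats("trino-prod", cpu_pct=91.0, memory_pct=62.0),
        EngineStats("spark-batch", cpu_pct=45.0, memory_pct=50.0),
        EngineStats("duckdb-adhoc", cpu_pct=30.0, memory_pct=88.5),
    ]
    print(engines_to_highlight(fleet))
```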

Optimization suggestions

The monitoring system surfaces table-level optimization suggestions derived from engine telemetry. Each suggestion includes the target table, engine, issue description, impact severity, and a specific recommendation (e.g. “Add date_created as partition key”, “Enable Snappy compression”).

System alerts

Real-time alerts are categorized as:

  • Error — connectivity failures, data source unreachable, critical system issues.
  • Warning — high resource utilization, approaching thresholds.
  • Info — scheduled maintenance, configuration changes.
  • Success — completed operations, resolved issues.

Recent queries

A live table of the most recent queries across all engines, showing the SQL snippet, user/service identity, engine used, duration, row count, and timestamp. Useful for debugging slow queries or identifying unexpected access patterns.

Storage metrics

Cards showing storage breakdown by data category (e.g. Raw Data Lake, Processed Analytics, BI Aggregates, ML Features) with current size and monthly growth rate. Helps identify which layers are growing fastest and may need attention.
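
A monthly growth rate of this kind is usually the percentage change in size between two consecutive months. The sketch below computes it and ranks categories by growth; the function names and the sample figures are hypothetical, and LakeOps may calculate the rate differently.

```python
def monthly_growth_rate(previous_bytes: int, current_bytes: int) -> float:
    """Percentage change in size from the previous month to the current one."""
    if previous_bytes <= 0:
        raise ValueError("previous size must be positive")
    return (current_bytes - previous_bytes) * 100.0 / previous_bytes


def fastest_growing(sizes: dict[str, tuple[int, int]]) -> list[tuple[str, float]]:
    """Rank categories by growth rate, highest first.

    `sizes` maps a category name to (previous_bytes, current_bytes).
    """
    rates = [
        (name, monthly_growth_rate(prev, cur))
        for name, (prev, cur) in sizes.items()
    ]
    return sorted(rates, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    tb = 1024 ** 4
    sample = {
        "Raw Data Lake": (100 * tb, 112 * tb),
        "Processed Analytics": (40 * tb, 42 * tb),
        "ML Features": (10 * tb, 13 * tb),
    }
    for category, rate in fastest_growing(sample):
        print(f"{category}: {rate:.1f}% per month")
```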

Catalog observability

The Catalogs screen shows all connected catalogs with their type (Glue, REST, DynamoDB, S3 Tables, etc.), table count, total data size, region, and health status. This provides a top-down view of your data estate across cloud environments.

Console navigation

The left sidebar organizes all observability surfaces:

Section | Screens
Dashboard | Optimization activity, health overview, recent operations
Data | Catalogs, Explore (per-table), Tables, Insights, Events
Manage | Policies (organization-wide rules)
Routing | Overview, Endpoints, Metrics, Settings
Engines | Overview, Compare, Health, Add Engine
Monitoring | System status, engine performance, alerts, queries, storage
Settings | Organization config: default retention, compaction threshold, max concurrent ops, notification channel, API key

Global screens each include relevant filter dropdowns: Tables (catalog, namespace, status), Insights (catalog, namespace, severity, type), Events (catalog, type, status), and Policies (type, status) — plus a search bar for quick lookup.

Cross-system telemetry

LakeOps aggregates telemetry from multiple sources into a single pane:

  • Object storage — S3, GCS, and ADLS file counts, sizes, and access patterns.
  • Query engines — Trino, Spark, Snowflake, Athena, DuckDB, Flink query metrics.
  • Iceberg metadata — snapshot counts, manifest health, partition statistics.
  • Optimization pipeline — compaction throughput, queue depth, scheduling state.

This unified view eliminates the need to switch between cloud consoles, engine UIs, and custom monitoring tools to understand your lake's state.