Full-Stack Observability

LakeOps continuously analyzes table structure, file health, and optimization opportunities, and monitors active engines, query latency, throughput, and error rates. It combines cross-system telemetry from S3, GCS, ADLS, and every connected engine, so you can view, alert, and act from one place.

Dashboard

The main Dashboard is the first screen you see when logging in. It provides a real-time overview of your entire lake's optimization activity and health status.

Optimization activity

The top row of stat cards shows aggregate metrics for recent optimization activity:

Metric	Description
Total Operations	Count of all completed optimization runs (compaction, snapshot expiry, manifest rewrites, orphan cleanup) over the selected time window.
Query Speed	Average query acceleration factor across all connected engines, comparing pre-optimization and post-optimization query latencies.
Cost Savings	Estimated dollar savings from reduced storage footprint and compute hours.
CPU & Storage	Percentage reduction in resource usage compared to the unoptimized baseline.
Data Optimized	Total volume of data processed by LakeOps optimizations in the time window.
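
The text above does not give the exact formula behind Query Speed. As a minimal sketch, assuming the factor is the ratio of mean pre-optimization latency to mean post-optimization latency, it could be computed as follows. The function name and the sample latencies are illustrative, not part of LakeOps.

```python
from statistics import mean

def acceleration_factor(pre_ms: list[float], post_ms: list[float]) -> float:
    """Ratio of mean pre-optimization latency to mean post-optimization latency.

    Assumption: LakeOps may weight or aggregate latencies differently.
    """
    if not pre_ms or not post_ms:
        raise ValueError("both latency samples must be non-empty")
    post_avg = mean(post_ms)
    if post_avg <= 0:
        raise ValueError("post-optimization latency must be positive")
    return mean(pre_ms) / post_avg

# Illustrative latencies in milliseconds for the same queries before and after.
factor = acceleration_factor([1200.0, 900.0, 1500.0], [400.0, 300.0, 500.0])
print(f"{factor:.1f}x")  # prints "3.0x"
```

A factor above 1 means queries became faster after optimization; a factor below 1 means they became slower.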

Table health overview

Below the activity metrics, a set of health cards provides an instant picture of your lake's state:

•Total Tables — number of tables discovered across all connected catalogs.
•Critical — tables requiring immediate attention (severe file fragmentation, excessive manifests, high orphan count).
•Warning — tables that should be addressed or put on autopilot to prevent degradation.
•Healthy — tables in optimal state with no action required.
•Total Data — aggregate data size across all catalogs.

Recent operations

A live-updating table at the bottom of the Dashboard shows the most recent optimization operations with their type, target table, duration, impact (files merged / data reclaimed), time, and status (Success, Running, or Failed).

Tables (global view)

The Tables screen lists every table across all catalogs in one searchable, filterable view. Each row shows:

•Table name and namespace
•Records and total size
•Health status (Critical / Warning / Healthy)
•Last updated timestamp

Filters: Use the dropdown selectors at the top to filter by catalog, namespace, or status. The search bar supports fuzzy matching on table names.
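
The matching algorithm behind the search bar is not documented here. To illustrate what fuzzy matching on table names means in practice, the sketch below uses Python's standard difflib module as a stand-in; the function name, cutoff, and table names are hypothetical.

```python
from difflib import get_close_matches

def search_tables(query: str, table_names: list[str], limit: int = 5) -> list[str]:
    """Return table names that approximately match the query, best match first."""
    # Compare case-insensitively but return the original spelling.
    lowered = {name.lower(): name for name in table_names}
    hits = get_close_matches(query.lower(), list(lowered), n=limit, cutoff=0.6)
    return [lowered[h] for h in hits]

tables = ["orders", "order_items", "customers", "web_events"]
# A query with a typo still ranks the intended table first.
print(search_tables("ordrs", tables))
```

A lower cutoff returns more, looser matches; a higher cutoff behaves closer to exact search.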

Click any table name to jump into the Explore deep-dive for that table.

Table Explorer (per-table)

The Explore screen is the deep-dive interface for any individual table. The left pane shows a tree view organized by catalog > namespace > table. Select a table to load its detail view with the following tabs:

Info

Full schema (column name, type, required flag), partition specs with current/historical spec IDs, sort orders, table UUID, storage location, file format, and custom properties.

Snapshots

Complete snapshot history with actions: Tag, Branch, Rollback, Set Current Snapshot, Compare, and Time Travel. Shows snapshot ID, timestamp, operation type, manifest count, and branch/tag references.

Metrics

Total records, total data size, stale file count, active data file count, average file size, average records per file, and a records distribution chart over recent snapshots.

Optimization

Per-table controls for Adaptive Maintenance, File Compaction, Snapshot Retention, Orphan Files Cleanup, and Rewrite Manifests. Each section has an Enabled toggle, a cron schedule, and Execute/Save buttons. When Adaptive Maintenance is active, the individual sections are locked.

Insights

AI-generated recommendations categorized by severity (CRITICAL, HIGH, WARNING, LOW), each with an issue type and an actionable description. Types include Partition Data Files, Excessive Manifests, and Excessive Snapshots.

Simulations

Run and compare layout simulations. Includes a field access frequency chart (SELECT, FILTER, JOIN breakdown) and a Layout Customization Diff table for side-by-side strategy comparison.

Events

Full operation audit log for this table: operation type, status, start time, duration, and impact. Filterable by operation type.

Policies

All policies assigned to this table with enable/disable toggle, type badge, next scheduled run, and last execution time. Assign additional policies with + Assign Policy.

Query

Built-in SQL editor to run queries directly against this table. Choose between Spark and DataFusion engines, write SQL, click Run Query, and view results inline. Also supports a Compaction query mode for direct DataFusion-powered compaction.

Settings

Table-level configuration: table format (Iceberg v2), file format (Parquet), compression (ZSTD), target file size, snapshot retention, orphan file age, and write mode.

Table Metrics

The Explore > Metrics tab provides detailed file-level metrics for any table:

Metric	What it tells you
totalRecords	Total number of records in the current snapshot, i.e. the current size of the dataset.
positionDeletes	Position-based deletes that are still applied at read time (merge-on-read). Non-zero values indicate read amplification.
equalityDeletes	Equality-based deletions that haven't been compacted away yet.
totalFiles	Total number of data files in the current snapshot.
totalFileSize	Aggregate size of all data files.
avgFileSize	Average size per file. Files much smaller than target (e.g. 512 MB) indicate compaction is needed.
avgRecordsPerFile	Average records per file. Helps gauge file density and layout efficiency.
deletedFiles	Number of files deleted in the latest snapshot (tracked for compaction and cleanup visibility).

A Records Distribution Over Time chart visualizes how record counts change across recent snapshots, helping you spot ingestion anomalies or compaction gaps.
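
As a rough illustration of how the aggregate metrics in the table relate to one another, the sketch below derives them from a list of per-file sizes and record counts. The DataFile type, the needsCompaction flag, and the "under a quarter of target" rule are assumptions made for the example; only the 512 MB target comes from the table above.

```python
from dataclasses import dataclass

@dataclass
class DataFile:
    size_bytes: int
    record_count: int

def file_metrics(files: list[DataFile], target_bytes: int = 512 * 1024**2) -> dict:
    """Derive table-level metrics from per-file statistics."""
    total_files = len(files)
    total_size = sum(f.size_bytes for f in files)
    total_records = sum(f.record_count for f in files)
    avg_size = total_size / total_files if total_files else 0.0
    return {
        "totalFiles": total_files,
        "totalFileSize": total_size,
        "totalRecords": total_records,
        "avgFileSize": avg_size,
        "avgRecordsPerFile": total_records / total_files if total_files else 0.0,
        # Hypothetical rule of thumb, not a LakeOps metric:
        # flag when the average file is under a quarter of the target size.
        "needsCompaction": total_files > 1 and avg_size < 0.25 * target_bytes,
    }

# Four 16 MiB files with 100,000 records each.
small_files = [DataFile(16 * 1024**2, 100_000) for _ in range(4)]
metrics = file_metrics(small_files)
print(metrics["avgFileSize"], metrics["avgRecordsPerFile"], metrics["needsCompaction"])
```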

Insights

LakeOps continuously analyzes every table and generates actionable insights. Insights surface issues that affect performance, cost, or reliability and suggest specific remediation steps.

Severity levels

Severity	Meaning
CRITICAL	Immediate action required
HIGH	Should be resolved soon
WARNING	Address proactively
LOW	Informational / minor

Insight types

•Partition Data Files — detects partition skew, oversized partitions, and small-file accumulation within partitions.
•Excessive Manifests — flags tables where manifest count exceeds the threshold (default: 50) or where manifests are undersized relative to target.
•Excessive Snapshots — identifies tables with high obsolete snapshot ratios and estimates potential storage savings from expiration.
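
The checks behind these insight types can be pictured as simple threshold rules. The sketch below is a simplified illustration rather than the LakeOps implementation: it uses the default manifest threshold of 50 stated above, and it assumes that "obsolete" means snapshots that fall outside the retention window. The function names are hypothetical.

```python
from typing import Optional

DEFAULT_MANIFEST_THRESHOLD = 50  # default stated for Excessive Manifests

def check_manifests(manifest_count: int,
                    threshold: int = DEFAULT_MANIFEST_THRESHOLD) -> Optional[str]:
    """Return an insight message when the manifest count exceeds the threshold."""
    if manifest_count > threshold:
        return f"Excessive Manifests: {manifest_count} manifests (threshold {threshold})"
    return None

def obsolete_snapshot_ratio(total_snapshots: int, retained_snapshots: int) -> float:
    """Fraction of snapshots that could be expired (assumed definition)."""
    if total_snapshots <= 0:
        return 0.0
    return (total_snapshots - retained_snapshots) / total_snapshots

print(check_manifests(120))  # an insight message
print(check_manifests(12))   # None, below the threshold
print(obsolete_snapshot_ratio(200, 20))
```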

Global vs. per-table insights

The Insights screen in the sidebar shows insights across all tables with filters for catalog, namespace, severity, and type. The per-table Explore > Insights tab shows insights for a specific table only.

Events & Operations log

Every optimization operation is logged with full audit detail. The Events system provides both a global view and per-table history.

Event record fields

Field	Description
Operation type	Compact Data Files, Expire Snapshots, Rewrite Manifests, or Remove Orphan Files.
Status	Success, Running, or Failed.
Start time	When the operation began.
Duration	How long the operation took.
Impact	Quantifiable result (e.g. “24 → 16 files”, “3 → 1 manifests”, “1 snapshot expired”).
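
If you post-process event records outside the console, the Impact strings shown above can be turned into numbers. The sketch below assumes impact values follow the "before → after unit" shape of the examples in the table; whether LakeOps exposes them in exactly this form is an assumption, and parse_impact is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

# Matches strings such as "24 → 16 files" or "3 → 1 manifests".
IMPACT_PATTERN = re.compile(r"(\d+)\s*→\s*(\d+)\s+(\w+)")

def parse_impact(impact: str) -> Optional[Tuple[int, int, str]]:
    """Return (before, after, unit), or None if the string has another shape."""
    match = IMPACT_PATTERN.search(impact)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)

print(parse_impact("24 → 16 files"))       # (24, 16, 'files')
print(parse_impact("1 snapshot expired"))  # None
```

Strings without a before/after pair, such as "1 snapshot expired", are returned as None and need separate handling.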

Global Events screen: Use the catalog, type, and status dropdowns plus the search bar to find operations across all tables.

Per-table Events tab: Shows only operations for the selected table in the Explore view.

Running queries

The Explore > Query tab provides an inline SQL editor connected to your table. Choose between Spark and DataFusion as the query engine, write SQL, click Run Query, and view results immediately. A separate Compaction query mode lets you run DataFusion-powered compaction directly. Use this to:

•Verify data correctness after compaction or rollback operations.
•Test query performance against specific tables before and after optimization.
•Debug data issues by inspecting specific partitions or time ranges.

System Monitoring

The Monitoring screen provides system-level observability for the platform and all connected engines.

System overview cards

•Active Engines — how many registered engines are currently active and responding.
•Queries Total — total queries processed across all engines with trend comparison.
•Avg Query Time — mean query duration across engines with week-over-week delta.
•System Alerts — count of active high-priority alerts requiring attention.
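
The week-over-week delta on the Avg Query Time card can be read as a percentage change against the previous week. A minimal sketch, assuming that definition (the exact calculation is not documented here, and the function name and sample values are illustrative):

```python
def week_over_week_delta(current_avg_ms: float, previous_avg_ms: float) -> float:
    """Percentage change of this week's mean query time versus last week's."""
    if previous_avg_ms <= 0:
        raise ValueError("previous average must be positive")
    return (current_avg_ms - previous_avg_ms) * 100 / previous_avg_ms

# A drop from 500 ms to 420 ms is a 16% improvement.
print(week_over_week_delta(420.0, 500.0))  # -16.0
```

A negative delta means queries ran faster than in the previous week.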

Engine performance table

A real-time table shows each engine with its health status, total queries, average duration, CPU usage (percentage bar), and memory usage (percentage bar). Engines with utilization above threshold are highlighted in amber.

Optimization suggestions

The monitoring system surfaces table-level optimization suggestions derived from engine telemetry. Each suggestion includes the target table, engine, issue description, impact severity, and a specific recommendation (e.g. “Add date_created as partition key”, “Enable Snappy compression”).

System alerts

Real-time alerts are categorized as:

•Error — connectivity failures, data source unreachable, critical system issues.
•Warning — high resource utilization, approaching thresholds.
•Info — scheduled maintenance, configuration changes.
•Success — completed operations, resolved issues.

Recent queries

A live table of the most recent queries across all engines, showing the SQL snippet, user/service identity, engine used, duration, row count, and timestamp. Useful for debugging slow queries or identifying unexpected access patterns.

Storage metrics

Cards showing storage breakdown by data category (e.g. Raw Data Lake, Processed Analytics, BI Aggregates, ML Features) with current size and monthly growth rate. Helps identify which layers are growing fastest and may need attention.

Catalog observability

The Catalogs screen shows all connected catalogs with their type (Glue + S3, REST + S3, DynamoDB + S3, S3 Tables, Custom), table count, total data size, region, and health status. This provides a top-down view of your data estate across cloud environments.

Console navigation

The left sidebar organizes all observability surfaces:

Section	Screens
Dashboard	Optimization activity, health overview, recent operations
Data	Catalogs, Explore (per-table deep-dive), Tables, Insights, Events
Manage	Policies (maintenance and governance rules)
Routing	Overview, Endpoints, Metrics, Settings
Engines	Overview, Compare, Health, Add Engine
Monitoring	System status, engine performance, alerts, queries, storage

Routing and Engines sections are visible to admin users. Organization settings (user management, roles) are accessible from the user menu.

Global screens each include relevant filter dropdowns: Tables (catalog, namespace, status), Insights (catalog, namespace, severity, type), Events (catalog, type, status), and Policies (type, status) — plus a search bar for quick lookup.

Cross-system telemetry

LakeOps aggregates telemetry from multiple sources into a single pane:

•Object storage — S3, GCS, and ADLS file counts, sizes, and access patterns.
•Query engines — Trino, Snowflake, Athena, DuckDB, StarRocks, ClickHouse query metrics.
•Iceberg metadata — snapshot counts, manifest health, partition statistics.
•Optimization pipeline — compaction throughput, queue depth, scheduling state.

This unified view eliminates the need to switch between cloud consoles, engine UIs, and custom monitoring tools to understand your lake's state.
