Full-Stack Observability
LakeOps continuously analyzes table structure, file health, and optimization opportunities, and monitors active engines, query latency, throughput, and error rates. It collects cross-system telemetry from S3, GCS, ADLS, and every connected engine, so you can view, alert, and act from one place.
Dashboard
The main Dashboard is the first screen you see when logging in. It provides a real-time overview of your entire lake's optimization activity and health status.
Optimization activity
The top row of stat cards shows aggregate metrics for recent optimization activity:
| Metric | Description |
|---|---|
| Total Operations | Count of all completed optimization runs (compaction, snapshot expiry, manifest rewrites, orphan cleanup) over the selected time window. |
| Query Speed | Average query acceleration factor across all connected engines, comparing pre-optimization and post-optimization query latencies. |
| Cost Savings | Estimated dollar savings from reduced storage footprint and compute hours. |
| CPU & Storage | Percentage reduction in resource usage compared to the unoptimized baseline. |
| Data Optimized | Total volume of data processed by LakeOps optimizations in the time window. |
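To make the Query Speed and CPU & Storage cards concrete, here is a minimal sketch of how an acceleration factor and a percentage reduction could be computed. The function names, the per-engine latencies, and the plain averaging across engines are assumptions for illustration, not the formulas LakeOps is documented to use.

```python
from statistics import mean


def acceleration_factor(pre_ms: float, post_ms: float) -> float:
    """Speed-up ratio: 2.0 means queries run twice as fast after optimization."""
    if pre_ms <= 0 or post_ms <= 0:
        raise ValueError("latencies must be positive")
    return pre_ms / post_ms


def percent_reduction(baseline: float, current: float) -> float:
    """Reduction relative to the unoptimized baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - current) / baseline * 100.0


# Hypothetical (pre, post) average latencies in milliseconds per engine.
latencies = {"trino": (1200.0, 400.0), "spark": (3000.0, 1500.0)}

# Assumed aggregation: unweighted mean of the per-engine factors.
avg_speedup = mean(acceleration_factor(pre, post) for pre, post in latencies.values())
```

An unweighted mean treats every engine equally; weighting by query volume would be a reasonable alternative if one engine dominates traffic.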
Table health overview
Below the activity metrics, a set of health cards provides an instant picture of your lake's state:
- Total Tables: number of tables discovered across all connected catalogs.
- Critical: tables requiring immediate attention (severe file fragmentation, excessive manifests, high orphan count).
- Warning: tables that should be addressed or put on autopilot to prevent degradation.
- Healthy: tables in optimal state with no action required.
- Total Data: aggregate data size across all catalogs.
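The health cards are a rollup over per-table statuses. The sketch below shows one way such a rollup could be computed. The `TableSummary` record and `health_overview` function are hypothetical names, and how LakeOps assigns a status to each table is not covered here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TableSummary:
    """Hypothetical per-table record; field names are assumptions."""
    name: str
    status: str  # "Critical", "Warning", or "Healthy"
    size_bytes: int


def health_overview(tables: list[TableSummary]) -> dict[str, int]:
    """Aggregate per-table statuses into the five card values."""
    counts = Counter(t.status for t in tables)
    return {
        "Total Tables": len(tables),
        "Critical": counts["Critical"],
        "Warning": counts["Warning"],
        "Healthy": counts["Healthy"],
        "Total Data": sum(t.size_bytes for t in tables),
    }


# Made-up sample data for illustration.
sample = [
    TableSummary("sales.orders", "Critical", 500),
    TableSummary("sales.items", "Healthy", 300),
    TableSummary("web.clicks", "Healthy", 200),
]
overview = health_overview(sample)
```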
Recent operations
A live-updating table at the bottom of the Dashboard shows the most recent optimization operations with their type, target table, duration, impact (files merged / data reclaimed), time, and status (Success, Running, or Failed).
Tables (global view)
The Tables screen lists every table across all catalogs in one searchable, filterable view. Each row shows:
- Table name and namespace
- Records and total size
- Health status (Critical / Warning / Healthy)
- Last updated timestamp
Filters: Use the dropdown selectors at the top to filter by catalog, namespace, or status. The search bar supports fuzzy matching on table names.
Click any table name to jump into the Explore deep-dive for that table.
Table Explorer (per-table)
The Explore screen is the deep-dive interface for any individual table. The left pane shows a tree view organized by catalog > namespace > table. Select a table to load its detail view with the following tabs:
Info
Full schema (column name, type, required flag), partition specs with current/historical spec IDs, sort orders, table UUID, storage location, file format, and custom properties.
Snapshots
Complete snapshot history with actions: Tag, Branch, Rollback, Set Current Snapshot, Compare, and Time Travel. Shows snapshot ID, timestamp, operation type, manifest count, and branch/tag references.
Metrics
Total records, total data size, stale file count, active data file count, average file size, average records per file, and a records distribution chart over recent snapshots.
Optimization
Per-table controls for File Compaction, Snapshot Retention, Orphan Files Cleanup, and Rewrite Manifests. Each control has an Auto/Manual toggle, target values, a cron schedule, and Simulate/Save buttons.
Insights
AI-generated recommendations categorized by severity (CRITICAL, HIGH, WARNING, LOW) with issue type and actionable description. Types include Partition Data Files, Excessive Manifests, Excessive Snapshots.
Simulations
Run and compare layout simulations. Includes a field access frequency chart (SELECT, FILTER, JOIN breakdown) and a Layout Customization Diff table for side-by-side strategy comparison.
Events
Full operation audit log for this table: operation type, status, start time, duration, and impact. Filterable by operation type.
Policies
All policies assigned to this table with enable/disable toggle, type badge, next scheduled run, and last execution time. Assign additional policies with + Assign Policy.
Query
Built-in SQL editor to run queries directly against this table. Write SQL, click Run Query, and view results inline. Useful for verifying optimization results.
Settings
Table-level configuration: table format (Iceberg v2), file format (Parquet), compression (ZSTD), target file size, snapshot retention, orphan file age, and write mode.
Table Metrics
The Explore > Metrics tab provides detailed file-level metrics for any table:
| Metric | What it tells you |
|---|---|
| totalRecords | Total number of records in the table, i.e. the current size of the dataset. |
| positionDeletes | Position-based deletes pending merge-on-read. Non-zero values indicate read amplification. |
| equalityDeletes | Equality-based deletions that haven't been compacted away yet. |
| totalFiles | Total number of data files in the current snapshot. |
| totalFileSize | Aggregate size of all data files. |
| avgFileSize | Average size per file. Files much smaller than target (e.g. 512 MB) indicate compaction is needed. |
| avgRecordsPerFile | Average records per file. Helps gauge file density and layout efficiency. |
| deletedFiles | Number of deleted files in the latest snapshot (tracked for compaction and cleanup visibility). |
A Records Distribution Over Time chart visualizes how record counts change across recent snapshots, helping you spot ingestion anomalies or compaction gaps.
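The two averages in the table follow directly from the totals. The sketch below derives them and flags a table whose average file size falls well below the target. The 512 MB target comes from the example above; the 25% cutoff and the function names are assumptions, not documented LakeOps behavior.

```python
def file_metrics(total_records: int, total_files: int, total_file_size: int) -> dict[str, float]:
    """Derive avgFileSize (bytes) and avgRecordsPerFile from the totals."""
    if total_files == 0:
        return {"avgFileSize": 0.0, "avgRecordsPerFile": 0.0}
    return {
        "avgFileSize": total_file_size / total_files,
        "avgRecordsPerFile": total_records / total_files,
    }


def needs_compaction(
    avg_file_size: float,
    target_file_size: int = 512 * 1024**2,  # 512 MB target, as in the example above
    small_file_ratio: float = 0.25,         # assumed cutoff for "much smaller than target"
) -> bool:
    """True when there are files and their average size is far below the target."""
    return 0 < avg_file_size < target_file_size * small_file_ratio


# Hypothetical table: 100 files of 32 MiB each holding 1,000,000 records.
metrics = file_metrics(1_000_000, 100, 100 * 32 * 1024**2)
compact = needs_compaction(metrics["avgFileSize"])
```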
Insights
LakeOps continuously analyzes every table and generates actionable insights. Insights surface issues that affect performance, cost, or reliability and suggest specific remediation steps.
Severity levels
Each insight is assigned one of four severity levels: CRITICAL, HIGH, WARNING, or LOW.
Insight types
- Partition Data Files: detects partition skew, oversized partitions, and small-file accumulation within partitions.
- Excessive Manifests: flags tables where manifest count exceeds the threshold (default: 50) or where manifests are undersized relative to target.
- Excessive Snapshots: identifies tables with high obsolete snapshot ratios and estimates potential storage savings from expiration.
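As a rough illustration of two of these checks, the sketch below tests a manifest count against the default threshold of 50 and computes an obsolete snapshot ratio. The function names and the definition of the ratio (obsolete snapshots divided by total snapshots) are assumptions. The actual detection logic is not described here.

```python
def has_excessive_manifests(manifest_count: int, threshold: int = 50) -> bool:
    """Flag a table whose manifest count exceeds the threshold (default: 50)."""
    return manifest_count > threshold


def obsolete_snapshot_ratio(total_snapshots: int, obsolete_snapshots: int) -> float:
    """Assumed definition: share of snapshots that are obsolete and could be expired."""
    if total_snapshots == 0:
        return 0.0
    if not 0 <= obsolete_snapshots <= total_snapshots:
        raise ValueError("obsolete_snapshots must be between 0 and total_snapshots")
    return obsolete_snapshots / total_snapshots


# Hypothetical table with 72 manifests and 40 snapshots, 30 of them obsolete.
flagged = has_excessive_manifests(72)
ratio = obsolete_snapshot_ratio(40, 30)
```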
Global vs. per-table insights
The Insights screen in the sidebar shows insights across all tables with filters for catalog, namespace, severity, and type. The per-table Explore > Insights tab shows insights for a specific table only.
Events & Operations log
Every optimization operation is logged with full audit detail. The Events system provides both a global view and per-table history.
Event record fields
| Field | Description |
|---|---|
| Operation type | Compact Data Files, Expire Snapshots, Rewrite Manifests, or Remove Orphan Files. |
| Status | Success, Running, or Failed. |
| Start time | When the operation began. |
| Duration | How long the operation took. |
| Impact | Quantifiable result (e.g. “24 → 16 files”, “3 → 1 manifests”, “1 snapshot expired”). |
Global Events screen: Use the catalog, type, and status dropdowns plus the search bar to find operations across all tables.
Per-table Events tab: Shows only operations for the selected table in the Explore view.
Running queries
The Explore > Query tab provides an inline SQL editor connected to your table. Write SQL, click Run Query, and view results immediately. Use this to:
- Verify data correctness after compaction or rollback operations.
- Test query performance against specific tables before and after optimization.
- Debug data issues by inspecting specific partitions or time ranges.
System Monitoring
The Monitoring screen provides system-level observability for the platform and all connected engines.
System overview cards
- Active Engines: how many registered engines are currently active and responding.
- Queries Total: total queries processed across all engines with trend comparison.
- Avg Query Time: mean query duration across engines with week-over-week delta.
- System Alerts: count of active high-priority alerts requiring attention.
Engine performance table
A real-time table shows each engine with its health status, total queries, average duration, CPU usage (percentage bar), and memory usage (percentage bar). Engines with utilization above threshold are highlighted in amber.
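The amber highlighting amounts to a threshold check on utilization. The sketch below is one possible reading: the 80% threshold, the `EngineStats` record, and the rule that either CPU or memory can trigger the highlight are all assumptions, since the actual threshold is not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineStats:
    """Hypothetical row of the engine performance table."""
    name: str
    cpu_pct: float
    memory_pct: float


def highlighted_engines(engines: list[EngineStats], threshold_pct: float = 80.0) -> list[str]:
    """Names of engines whose CPU or memory usage exceeds the threshold."""
    return [e.name for e in engines if max(e.cpu_pct, e.memory_pct) > threshold_pct]


# Made-up utilization figures for illustration.
engines = [
    EngineStats("trino-prod", cpu_pct=91.0, memory_pct=60.0),
    EngineStats("spark-batch", cpu_pct=45.0, memory_pct=50.0),
    EngineStats("athena", cpu_pct=30.0, memory_pct=85.5),
]
amber = highlighted_engines(engines)
```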
Optimization suggestions
The monitoring system surfaces table-level optimization suggestions derived from engine telemetry. Each suggestion includes the target table, engine, issue description, impact severity, and a specific recommendation (e.g. “Add date_created as partition key”, “Enable Snappy compression”).
System alerts
Real-time alerts are categorized as:
- Error: connectivity failures, data source unreachable, critical system issues.
- Warning: high resource utilization, approaching thresholds.
- Info: scheduled maintenance, configuration changes.
- Success: completed operations, resolved issues.
Recent queries
A live table of the most recent queries across all engines, showing the SQL snippet, user/service identity, engine used, duration, row count, and timestamp. Useful for debugging slow queries or identifying unexpected access patterns.
Storage metrics
Cards showing storage breakdown by data category (e.g. Raw Data Lake, Processed Analytics, BI Aggregates, ML Features) with current size and monthly growth rate. Helps identify which layers are growing fastest and may need attention.
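To show how a monthly growth rate can be used to rank storage layers, here is a minimal sketch. The formula (change relative to the previous month's size) and the sample sizes are assumptions for illustration. The category names are taken from the examples above.

```python
def monthly_growth_pct(previous_bytes: int, current_bytes: int) -> float:
    """Growth relative to the previous month's size, in percent."""
    if previous_bytes <= 0:
        raise ValueError("previous_bytes must be positive")
    return (current_bytes - previous_bytes) / previous_bytes * 100.0


def fastest_growing(layers: dict[str, tuple[int, int]]) -> list[str]:
    """Layer names ordered from fastest to slowest monthly growth."""
    return sorted(layers, key=lambda name: monthly_growth_pct(*layers[name]), reverse=True)


# Made-up (previous month, current month) sizes in GB.
layers = {
    "Raw Data Lake": (100, 110),
    "ML Features": (40, 60),
    "BI Aggregates": (10, 10),
}
ranking = fastest_growing(layers)
```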
Catalog observability
The Catalogs screen shows all connected catalogs with their type (Glue, REST, DynamoDB, S3 Tables, etc.), table count, total data size, region, and health status. This provides a top-down view of your data estate across cloud environments.
Console navigation
The left sidebar organizes all observability surfaces:
| Section | Screens |
|---|---|
| Dashboard | Optimization activity, health overview, recent operations |
| Data | Catalogs, Explore (per-table), Tables, Insights, Events |
| Manage | Policies (organization-wide rules) |
| Routing | Overview, Endpoints, Metrics, Settings |
| Engines | Overview, Compare, Health, Add Engine |
| Monitoring | System status, engine performance, alerts, queries, storage |
| Settings | Organization config: default retention, compaction threshold, max concurrent ops, notification channel, API key |
Global screens each include relevant filter dropdowns: Tables (catalog, namespace, status), Insights (catalog, namespace, severity, type), Events (catalog, type, status), and Policies (type, status) — plus a search bar for quick lookup.
Cross-system telemetry
LakeOps aggregates telemetry from multiple sources into a single pane:
- Object storage: S3, GCS, and ADLS file counts, sizes, and access patterns.
- Query engines: Trino, Spark, Snowflake, Athena, DuckDB, Flink query metrics.
- Iceberg metadata: snapshot counts, manifest health, partition statistics.
- Optimization pipeline: compaction throughput, queue depth, scheduling state.
This unified view eliminates the need to switch between cloud consoles, engine UIs, and custom monitoring tools to understand your lake's state.
