
Managed Iceberg in 2026: Autonomous Data Lake
Iceberg tables degrade silently — small files pile up, snapshots bloat metadata, and query latency creeps higher. A breakdown of the nine components every production data lake needs to stay healthy.
Managed Apache Iceberg
LakeOps continuously runs compaction, data layout optimization, and autonomous table maintenance (snapshots, manifests, metadata, and orphan files) across every engine and cloud, so your Iceberg tables stay fast, lean, and production-ready.
The challenge
Small files, stale snapshots, and orphaned files compound quietly, driving up compute and storage cost while query latency keeps drifting.
Spark, Trino, Athena, Snowflake, and Databricks optimize differently. Teams juggle per-engine scripts, configs, and schedules that do not scale.
Manifests bloat, partitions skew, and layouts drift from real workloads, degrading scan efficiency, cache locality, and query planning.
Ad-hoc scripts and reactive firefighting create operational debt. Retention, DR, and GDPR policies stay manual instead of control-plane enforced.
Results
Benchmarks from production-grade tables across multiple engines and cloud providers.
Compaction speed: vs. Apache Spark on identical datasets
Query performance: after compaction + layout optimization
Cost savings: in compute & storage spend
Table health: autonomous maintenance keeps every table optimized
Capabilities
Every layer of your lakehouse — from compaction and metadata to engines, observability, and policy enforcement — managed from one control plane.
Compaction
Not just file merging — LakeOps analyzes which columns your queries actually filter, join, and group on, then organizes data files accordingly. The result: predicate pushdown and column pruning skip entire file groups, reducing I/O, query time, and compute cost across every engine reading the table. Powered by a Rust-based engine with Apache DataFusion — 95% faster and ~10x cheaper than Spark.
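As a rough illustration of choosing a layout from query behavior, the sketch below ranks columns by how often queries filter, join, or group on them. `QueryProfile` and `rank_sort_columns` are hypothetical names used only for this example, not LakeOps APIs, and the counting heuristic is an assumption about one simple way to do it.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class QueryProfile:
    """Hypothetical summary of one observed query (not a LakeOps type)."""
    filter_cols: list = field(default_factory=list)
    join_cols: list = field(default_factory=list)
    group_cols: list = field(default_factory=list)


def rank_sort_columns(profiles, top_n=2):
    """Return the columns used most often in filters, joins, and group-bys."""
    usage = Counter()
    for p in profiles:
        # Count each column once per query, however many clauses mention it.
        usage.update(set(p.filter_cols) | set(p.join_cols) | set(p.group_cols))
    # Sort by descending count, then by name, so ties resolve deterministically.
    ranked = sorted(usage.items(), key=lambda kv: (-kv[1], kv[0]))
    return [col for col, _ in ranked[:top_n]]


profiles = [
    QueryProfile(filter_cols=["event_date", "region"]),
    QueryProfile(filter_cols=["event_date"], group_cols=["region"]),
    QueryProfile(filter_cols=["event_date"], join_cols=["user_id"]),
]
top_columns = rank_sort_columns(profiles)
print(top_columns)
```

A production system would likely weight columns by selectivity and scan volume rather than raw counts; the point here is only that sort order can be derived from observed workload.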
Compaction: 38% small files, merging 970 → 87 at 512 MB target
Expire Snapshots: 154 snapshots, 62 past 30-day retention
Rewrite Manifests: 12 manifests, below threshold, waiting for compaction
Orphan Cleanup: 847 MB unreferenced, scheduled after expiration
Query patterns: event_date, region (top sort columns, Trino + Spark)
Improvement: 12.4× faster (avg query speed after optimization)
Cycle: self-tuning (sort orders adapt as patterns change)
Maintenance
LakeOps continuously collects telemetry — file counts, partition health, snapshot velocity, delete ratios, manifest growth, and query patterns — and uses that signal to decide what to run, when, and in what order. Each operation's outcome feeds back into the next decision. The result is a coordinated maintenance loop that eliminates redundant work, adapts to changing workloads, and keeps every table in optimal shape without human intervention.
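One way to picture "what to run, and in what order" is a small planner that selects operations from telemetry and then orders them by prerequisites. This is a sketch under stated assumptions: the thresholds, the telemetry keys, and the prerequisite map (manifest rewrites after compaction, orphan cleanup after snapshot expiration) are illustrative choices, not LakeOps internals.

```python
from graphlib import TopologicalSorter

# Assumed prerequisites: each operation maps to what must finish first.
PREREQS = {
    "compaction": set(),
    "expire_snapshots": set(),
    "rewrite_manifests": {"compaction"},
    "orphan_cleanup": {"expire_snapshots"},
}


def select_operations(telemetry):
    """Pick operations whose (hypothetical) thresholds are exceeded."""
    needed = set()
    if telemetry["small_file_ratio"] > 0.20:
        needed.add("compaction")
    if telemetry["snapshots_past_retention"] > 0:
        needed.add("expire_snapshots")
    if telemetry["manifest_count"] > 100:
        needed.add("rewrite_manifests")
    if telemetry["orphan_bytes"] > 0:
        needed.add("orphan_cleanup")
    return needed


def plan(telemetry):
    """Order the selected operations so prerequisites always run first."""
    needed = select_operations(telemetry)
    # Drop prerequisites that are not needed in this cycle.
    graph = {op: PREREQS[op] & needed for op in sorted(needed)}
    return list(TopologicalSorter(graph).static_order())


telemetry = {
    "small_file_ratio": 0.38,
    "snapshots_past_retention": 62,
    "manifest_count": 487,
    "orphan_bytes": 847_000_000,
}
order = plan(telemetry)
print(order)
```

Feeding each run's outcome back into `telemetry` before the next call is what would turn this one-shot plan into a loop.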
Total Snapshots: 154
Retention: 30 days
Expired Today: 12
Storage Freed: 18.4 GB
Automated retention, expiration, and version history for every table. Set policies once — LakeOps expires old snapshots safely with full awareness of concurrent readers. Time-travel to any point, compare snapshots, and roll back without manual intervention.
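The retention rule can be sketched as a filter over snapshot metadata. The `Snapshot` record, the `in_use` set standing in for "concurrent readers", and the rule that the latest snapshot is never expired are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical snapshot record: an id plus its commit time."""
    snapshot_id: int
    committed_at: datetime


def snapshots_to_expire(snapshots, now, retention=timedelta(days=30), in_use=frozenset()):
    """Return ids older than the retention window that are safe to expire."""
    if not snapshots:
        return []
    latest = max(snapshots, key=lambda s: s.committed_at)
    cutoff = now - retention
    return [
        s.snapshot_id
        for s in snapshots
        if s.committed_at < cutoff
        and s.snapshot_id != latest.snapshot_id  # always keep the current state
        and s.snapshot_id not in in_use          # skip snapshots still being read
    ]


now = datetime(2026, 1, 31, tzinfo=timezone.utc)
snapshots = [
    Snapshot(1, now - timedelta(days=45)),
    Snapshot(2, now - timedelta(days=40)),
    Snapshot(3, now - timedelta(days=10)),
]
expired = snapshots_to_expire(snapshots, now, in_use={2})
print(expired)
```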
Manifests: 487 → 12 (97.5% reduced)
Planner Latency: −2.1s (3.4s → 1.3s)
Puffin Stats: 100% (all columns indexed)
Rewrite Manifests: consolidate manifest files for faster query planning
Rewrite Position Deletes: optimize position delete files to improve read performance
Compute Statistics (Puffin): calculate column stats to optimize query planning and pruning
Consolidate and rewrite manifest files so query planning stays fast at any scale. Smaller manifests mean faster planning and fewer metadata scans for Trino, Spark, Flink, and every engine that touches your lake. Includes position delete file optimization and Puffin statistics computation.
Unreferenced: 847 MB (59,831 files)
Age Threshold: 7 days (safety window)
Last Cleanup: 74.8 GB (reclaimed 3 hrs ago)
Detect and safely remove files no longer referenced by any table. Eliminate storage drift from failed jobs, aborted commits, and legacy tables. Configurable retention thresholds, catalog-wide or per-table scope, and scheduled execution — reclaim capacity without risking data integrity.
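Conceptually, orphan detection is a set difference with an age guard. In this minimal sketch, `storage_files` (path to last-modified time) and `referenced` (paths reachable from table metadata) are assumed inputs that a real system would gather from object storage listings and the catalog.

```python
from datetime import datetime, timedelta, timezone


def find_orphans(storage_files, referenced, now, min_age=timedelta(days=7)):
    """Return unreferenced paths older than the safety window."""
    cutoff = now - min_age
    return sorted(
        path
        for path, modified_at in storage_files.items()
        # Young files may belong to an in-flight commit, so leave them alone.
        if path not in referenced and modified_at < cutoff
    )


now = datetime(2026, 1, 31, tzinfo=timezone.utc)
storage_files = {
    "data/a.parquet": now - timedelta(days=30),  # referenced by the table
    "data/b.parquet": now - timedelta(days=30),  # left behind by a failed job
    "data/c.parquet": now - timedelta(hours=2),  # possibly mid-commit
}
referenced = {"data/a.parquet"}
orphans = find_orphans(storage_files, referenced, now)
print(orphans)
```

The age threshold is what makes deletion safe: a file written minutes ago may simply not be committed yet.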
Queries Today: 12,485 (+12% from yesterday)
Avg Latency: 1.2s (−0.3s from last week)
Active Engines: 4 / 6 (all critical online)
Active Alerts: 3 (1 critical)
Alert: 312 partitions exceed file threshold (query scan amplified 8×)
Alert: excessive manifests (487), planning overhead (planner latency +2.1s)
Alert: small file ratio 38%, compaction recommended (S3 GET costs elevated)
Observability
Continuous analysis of table structure, file health, and optimization opportunities. Monitor active engines, query latency, throughput, and error rates. Cross-system telemetry from S3, GCS, ADLS, and every engine — view, alert, and act from one place.
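A table-health check of this kind can be sketched in a few lines. The definition of a "small" file (under a quarter of the 512 MB target) and both alert thresholds are assumptions chosen for illustration; the function names are hypothetical.

```python
MB = 1024 * 1024


def small_file_ratio(file_sizes, target_bytes=512 * MB, small_fraction=0.25):
    """Share of files below small_fraction * target (an assumed definition)."""
    if not file_sizes:
        return 0.0
    limit = target_bytes * small_fraction
    return sum(1 for size in file_sizes if size < limit) / len(file_sizes)


def check_alerts(file_sizes, manifest_count, ratio_threshold=0.30, manifest_threshold=100):
    """Return human-readable alerts for thresholds that are exceeded."""
    alerts = []
    ratio = small_file_ratio(file_sizes)
    if ratio > ratio_threshold:
        alerts.append(f"Small file ratio {ratio:.0%}: compaction recommended")
    if manifest_count > manifest_threshold:
        alerts.append(f"Excessive manifests ({manifest_count}): planning overhead")
    return alerts


# 38 tiny files and 62 healthy ones, with 487 manifests.
file_sizes = [8 * MB] * 38 + [400 * MB] * 62
alerts = check_alerts(file_sizes, manifest_count=487)
for alert in alerts:
    print(alert)
```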
Active Groups: 2 / 3 (routing traffic)
Engines in Use: 7 (8 registered)
Routed Volume: 7,285 (this period)
Query Routing
Connect Trino, Spark, Snowflake, Athena, DuckDB, and Flink to one routing layer. Intelligent query routing optimizes for cost, latency, or throughput automatically. Compare engine performance, monitor health, and add new engines — all without engine-specific scripts or duplicate tooling.
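To make "optimizes for cost, latency, or throughput" concrete, here is a minimal objective-based selector. `EngineStats`, `route`, and every number in the example are hypothetical placeholders, not measured figures or LakeOps APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineStats:
    """Hypothetical per-engine metrics a router might track."""
    name: str
    healthy: bool
    cost_per_tb: float      # relative cost to scan 1 TB
    p50_latency_s: float    # median query latency in seconds
    throughput_qps: float   # sustained queries per second


def route(engines, objective):
    """Pick a healthy engine that best satisfies the given objective."""
    candidates = [e for e in engines if e.healthy]
    if not candidates:
        raise RuntimeError("no healthy engine available")
    if objective == "cost":
        return min(candidates, key=lambda e: e.cost_per_tb).name
    if objective == "latency":
        return min(candidates, key=lambda e: e.p50_latency_s).name
    if objective == "throughput":
        return max(candidates, key=lambda e: e.throughput_qps).name
    raise ValueError(f"unknown objective: {objective}")


# Illustrative numbers only.
engines = [
    EngineStats("trino", True, cost_per_tb=5.0, p50_latency_s=1.2, throughput_qps=40.0),
    EngineStats("spark", True, cost_per_tb=3.0, p50_latency_s=8.0, throughput_qps=90.0),
    EngineStats("duckdb", False, cost_per_tb=0.5, p50_latency_s=0.4, throughput_qps=10.0),
]
print(route(engines, "cost"), route(engines, "latency"), route(engines, "throughput"))
```

Filtering on health first means an unhealthy engine never wins, even when it scores best on the chosen objective.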
Catalogs: 4
Tables: 127
Columns: 1,842
ReadOnly: blocks DDL and DML from agent sessions
CostEstimate: rejects queries exceeding scan thresholds
PIIMask: hashes sensitive columns before results reach the model
HumanApproval: pauses high-stakes operations for review
Agentic AI
Built for AI and ML pipelines — optimized metadata, layout, and table structure for agents, feature stores, and autonomous data workflows. Run simulations on file layout changes before applying them. Fast, consistent access to table state and history so AI pipelines get the data they need without extra glue.
Total Policies: 5
Maintenance: 4
Configuration: 1
Governance
Define and enforce compaction, retention, orphan cleanup, and maintenance policies across catalogs and tables. Set schedules, priorities, and target scopes — then let LakeOps execute continuously. Every policy is auditable, versioned, and controllable with one toggle.
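A policy with a schedule, priority, target scope, and an on/off toggle could be modeled roughly as follows. The `Policy` fields, the glob-style scope over `catalog.schema.table` names, and the cron strings are assumptions for illustration rather than the LakeOps policy schema.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Policy:
    """Hypothetical maintenance policy record."""
    name: str
    operation: str        # e.g. "compaction", "expire_snapshots"
    scope: str            # glob over "catalog.schema.table"
    schedule: str         # cron expression
    priority: int         # lower runs first
    enabled: bool = True  # the single toggle


def policies_for(table, policies):
    """Return enabled policies whose scope matches the table, by priority."""
    matched = [p for p in policies if p.enabled and fnmatch(table, p.scope)]
    return sorted(matched, key=lambda p: p.priority)


policies = [
    Policy("nightly-compaction", "compaction", "prod.*", "0 2 * * *", priority=1),
    Policy("retention-30d", "expire_snapshots", "prod.analytics.*", "0 3 * * *", priority=2),
    Policy("orphan-sweep", "orphan_cleanup", "*", "0 4 * * 0", priority=3, enabled=False),
]
active = [p.name for p in policies_for("prod.analytics.events", policies)]
print(active)
```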

Netflix spent years building an intelligent lakehouse — Polaris, Autotune, janitors, and Metacat. LakeOps lets every team build the same — and go beyond — in minutes.

How to route queries across Trino, Spark, DuckDB, Snowflake, Athena, and Flink on shared Iceberg tables — SQL routing proxy, dialect translation, and table-aware optimization.
Works with your stack
LakeOps connects to your existing infrastructure. No vendor lock-in — your data, metadata, and execution stay under your control.
LakeOps Control Plane (connects, analyzes, optimizes): engines, catalogs, clouds & on-prem
Agentic AI readiness
AI agents are becoming primary consumers of SQL infrastructure. LakeOps is the control plane that makes your lake intelligent — agent-native interface, built-in guardrails, self-optimizing storage, and a closed-loop feedback system that learns from every query.
AI Agents (Claude, LangChain, custom MCP agents) → LakeOps → Iceberg Lake (tables, metadata, engines, catalogs)
Native MCP server connects any compatible agent — Claude, LangChain, or custom — with zero integration code. Schema-aware tools, async queries with SSE streaming, and Postgres/MySQL/Arrow Flight wire compatibility.
Layered guardrails for unsupervised execution — ReadOnlyGuard blocks DDL, CostEstimateGuard rejects expensive scans, PIIMaskGuard scrubs sensitive columns, HumanApprovalGuard pauses high-stakes queries.
Three-router stack: the Adaptive router routes on query history, the LLM router reasons over new query templates using live table stats, and the Semantic router matches intent. Cached decisions resolve in 0ms, and routing is data-quality-aware, enriched by IceProbe.
Agents querying uncompacted tables pay a 5–10× latency penalty. The workload analyst feeds agent query signals to the Rust compaction engine, and the feedback loop automatically updates routing as tables improve.
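The layered guardrails described above can be pictured as a chain of checks around query execution. This is a simplified sketch, not the LakeOps implementation: the keyword-based write detection is deliberately naive, the function names and the 10 GiB scan limit are hypothetical, and the human-approval step is omitted.

```python
import hashlib
import re


class GuardrailViolation(Exception):
    """Raised when a guard refuses a query."""


# Naive write detection for illustration; a real guard would parse the SQL.
_WRITE_RE = re.compile(
    r"^\s*(insert|update|delete|merge|create|alter|drop|truncate)\b",
    re.IGNORECASE,
)


def read_only_guard(sql):
    """Block DDL and DML statements."""
    if _WRITE_RE.match(sql):
        raise GuardrailViolation("DDL/DML is blocked for agent sessions")


def cost_estimate_guard(estimated_scan_bytes, limit_bytes):
    """Reject queries whose estimated scan exceeds the limit."""
    if estimated_scan_bytes > limit_bytes:
        raise GuardrailViolation("estimated scan exceeds threshold")


def pii_mask(rows, pii_columns):
    """Replace sensitive values with a truncated SHA-256 digest."""
    def mask(value):
        return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]
    return [
        {key: mask(val) if key in pii_columns else val for key, val in row.items()}
        for row in rows
    ]


def run_guarded(sql, estimated_scan_bytes, execute, *,
                limit_bytes=10 * 1024**3, pii_columns=frozenset({"email"})):
    """Apply guards in order, then mask results before returning them."""
    read_only_guard(sql)
    cost_estimate_guard(estimated_scan_bytes, limit_bytes)
    return pii_mask(execute(sql), pii_columns)


def fake_execute(sql):
    """Stand-in for a real engine call."""
    return [{"user_id": 1, "email": "a@example.com"}]


result = run_guarded("SELECT user_id, email FROM users", 5 * 1024**2, fake_execute)
print(result)
```

Running the cheap checks before execution and masking afterwards means a rejected query never touches the engine, and raw sensitive values never reach the caller.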
Production benchmarks
Real workloads. Real data. Batch, streaming, delete-heavy, multi-writer, and terabyte-scale tables — all on the same engine, same hardware.
| Table | Size | Workload | Files (before → after) | Throughput | Time | Notes |
|---|---|---|---|---|---|---|
| balance_snapshots | 1,192 GB | TB-scale batch | 11,957 → 3,270 | 1,572 MB/s | 11m | Spark OOM on same hardware |
| user_accounts | 174 GB | Batch | 878 → 400 | 2,269 MB/s | 1m 14s | Single node |
| events_analytics | 484 GB | Delete-heavy | 16,128 → 7,198 | 729 MB/s | 11m 21s | 23,433 delete files; 551M rows removed |
| raw_sdk_events | 8 GB | Streaming | 42,633 → 69 | 167 MB/s | 2m 18s | 99.8% file reduction |
| site_traffic | 292 GB | Multi-writer | 2,740 → 754 | 1,465 MB/s | 3m 25s | Single partition |
| cluster_registry | 322 GB | Batch | 998 → 440 | 2,522 MB/s | 2m | Peak throughput |
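The file-count reduction for each table follows directly from the before and after columns. The short sketch below recomputes it from the table's own numbers; `file_reduction` is just a helper name for this example.

```python
# (files before, files after) taken from the benchmark table above.
FILE_COUNTS = {
    "balance_snapshots": (11_957, 3_270),
    "user_accounts": (878, 400),
    "events_analytics": (16_128, 7_198),
    "raw_sdk_events": (42_633, 69),
    "site_traffic": (2_740, 754),
    "cluster_registry": (998, 440),
}


def file_reduction(before, after):
    """Fraction of files eliminated by compaction."""
    return 1 - after / before


for table, (before, after) in FILE_COUNTS.items():
    print(f"{table}: {file_reduction(before, after):.1%} fewer files")
```

For `raw_sdk_events` this gives 99.8%, matching the figure in the Notes column.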
Normalized to Spark = 100%
Source: 200 GB (~1 TB uncompressed) benchmark. Spark cost index 100 vs LakeOps 10.
balance_snapshots — 1.192 TB across consecutive runs
Same data and hardware; the planner learns from workload telemetry and cuts runtime from 22 to 11 minutes.
Get a personalized walkthrough on your own Iceberg tables — see the impact in minutes.