Iceberg Lakehouse with AI Agents: A Guide

AI agents are becoming primary consumers of lakehouse data. An analyst assistant answering natural-language questions. A customer-facing chatbot pulling real-time order status. An ML pipeline agent discovering features by exploring schemas. A data quality agent validating freshness and distribution metrics across hundreds of tables every hour.

These agents query Apache Iceberg tables iteratively, at high frequency, and without human review — and they interact with your data in ways that batch-oriented infrastructure was not designed for.

This guide walks through the five infrastructure components required to connect AI agents to your Iceberg lakehouse safely and performantly — and shows how LakeOps provides each one as a managed control plane.

Why agents need different infrastructure

Human analysts run a handful of queries per session. AI agents run dozens — sometimes hundreds — per reasoning step. A single user question like "Which product categories had declining margins last quarter?" can trigger a chain of 15–30 SQL statements: schema discovery to find the right tables, sampling to understand column distributions, aggregation to compute margins by category, drill-down to identify the declining ones, validation queries to cross-check against a different data source, and formatting queries to prepare the final answer.

A single user interaction can produce 30–50 SQL statements. Multiply that across hundreds of concurrent agent sessions and the workload profile looks nothing like the BI traffic your lake was sized for.

This creates four problems that traditional Iceberg infrastructure is not built for.

Query volume spikes. An agent in a tool-use loop can issue 50+ queries per minute. Infrastructure sized for 100 concurrent dashboard users suddenly sees 5,000 agent-generated queries per minute. Without routing, every agent query competes with ETL and dashboards for the same compute cluster. The I/O subsystem, metadata catalog, and query engine concurrency limits were never designed for this fanout.

Unpredictable query shapes. BI dashboards have stable query shapes — the same SELECT with different date filters. Agents improvise. They generate novel SQL based on conversation context, explore unfamiliar tables, and access columns no one anticipated. Partition strategies and sort orders tuned for known dashboard filters may be irrelevant to agent access patterns — or worse, actively harmful if they cause full scans on agent-typical predicates. Agents also generate structurally dangerous queries: cartesian joins, SELECT * on billion-row tables, nested CTEs wrapping mutations.

No built-in safety. A human analyst who accidentally writes a dangerous query usually catches it before executing. Agents have no such filter. Without guardrails, a single misconfigured agent can scan petabytes of data, expose PII in results passed back to an LLM context window, run DDL against production tables, or generate unbounded compute spend — all within minutes, with no human in the loop to notice. And unlike a human who learns from a mistake, an agent in a loop repeats it indefinitely, compounding costs every iteration.

Latency compounds through reasoning chains. When an agent runs 12 sequential queries in a reasoning loop, each 3-second query delay adds up to 36 seconds of wall-clock time for a single interaction. Users perceive this as the agent being broken. On uncompacted Iceberg tables with thousands of small files and fragmented manifests, even a simple partition-pruned scan can take 5–10 seconds due to S3 GET amplification and manifest listing overhead. Sub-second p95 is the target for agent SQL, not a nice-to-have.

Solving these problems requires a control plane between agents and your Iceberg tables — handling connectivity, safety, performance, and optimization. The architecture has five components.

1. Agent-native MCP connectivity

What's needed

Agents cannot just get a JDBC connection string and figure it out. They need structured tools for discovering what data exists, understanding table schemas, and executing queries safely — without custom integration code per agent framework.

The Model Context Protocol (MCP) is the emerging standard for this: an open interface that exposes data operations as tool calls any LLM can invoke natively. It fills the same role for agent-to-infrastructure communication that REST catalogs fill for engine-to-catalog communication — a standardized, schema-aware interface that any compliant client can connect to without custom integration code.

A well-designed MCP interface for Iceberg access exposes discrete, purpose-built tools rather than a single "run anything" endpoint. This design constraint is intentional. Agents that get a generic execute_sql tool tend to skip schema discovery and hallucinate table names. Forcing discovery through separate tools improves query accuracy.

How LakeOps provides it

LakeOps exposes an agent-native MCP server that any MCP-compatible agent — Claude, LangChain, LlamaIndex, or custom — can connect to with zero integration code. The MCP server provides four purpose-built tools:

`list_schemas` enumerates all catalogs and namespaces across connected engines. Agents call this first to understand what data exists before constructing queries — eliminating the hallucinated table names that plague agents with generic SQL access.

`describe_table` returns column names, types, partition specs, sort orders, and optionally sample rows in a single call. This eliminates the multiple round-trips agents typically make to understand a table's shape — a single tool call replaces three to five sequential queries (SHOW COLUMNS, SELECT * LIMIT 5, DESCRIBE PARTITIONS), reducing both latency and token cost.

`execute_query` runs SQL through the full routing and guardrail pipeline. It accepts an optional engine_hint for cases where the agent has a preference, and enforces max_rows by default to prevent unbounded result sets from flooding the LLM context window. The query passes through the same three-router stack and guard chain that protects all traffic.

`explain_query` returns estimated cost, row count, and execution plan without running the query. Agents use this for pre-flight cost checks — a research agent exploring an unfamiliar table can estimate whether a query will scan 100 MB or 100 TB before committing to execution.

Each routing endpoint gets a stable URL — for example, storefront-analytics.lakeops.dev. The endpoint defines which engines are available, what query types are allowed, and what guardrails apply. Agents inherit the full policy stack of their endpoint without per-agent configuration. This means you do not configure guardrails on individual agents — you configure them on the endpoint, and every agent connecting to that endpoint inherits the same trust boundary.

An often-overlooked requirement: async query support. Agents run in tool-use loops. A synchronous execute call that blocks for 10 minutes on a large scan ties up the entire reasoning chain and risks gateway timeouts. The MCP layer supports submit/poll/cancel semantics with SSE streaming so agents can monitor progress and handle long-running queries gracefully.

Wire compatibility extends to PostgreSQL, MySQL, and Arrow Flight SQL protocols, so agents that do not use MCP can still connect through standard database drivers and benefit from the full routing and guardrail stack.

json

1{2  "mcpServers": {3    "iceberg-lakehouse": {4      "command": "uvx",5      "args": ["iceberg-mcp"],6      "env": {7        "ICEBERG_REST_URI": "https://catalog.example.com/v1",8        "ICEBERG_WAREHOUSE": "s3://my-lakehouse/warehouse"9      }10    }11  }12}

Compatible agent frameworks include Claude, LangChain, LlamaIndex, and any custom MCP client. The key property is zero integration code — agents connect via standard MCP protocol, call list_schemas to discover what is available, and start querying.

2. Query guardrails

What's needed

Giving agents SQL access to production data requires enforcement at the query level, not at the application level. Agents have no instinct to catch dangerous queries. The guard chain must sit between routing and dispatch — every query passes through it regardless of which frontend or agent issued it.

The risk is not theoretical. A research agent exploring an unfamiliar dataset will happily run SELECT * FROM events on a 4 TB table, producing a result set that overflows the LLM context window and generates thousands of dollars in egress costs. A coding agent asked to "clean up old data" will issue DELETE statements against production tables. A customer support agent with access to user tables will return email addresses, SSNs, and credit card numbers in plain text — all passed directly into the LLM context where they become part of the training signal.

How LakeOps provides it

LakeOps provides five composable guardrails that sit between routing and engine dispatch. Every agent query passes through the guard chain — guards are evaluated sequentially, and the first non-Allow result wins. The ordering matters: cheaper checks gate expensive ones.

ReadOnlyGuard blocks DDL and DML statements — INSERT, UPDATE, DELETE, DROP, CREATE, ALTER, TRUNCATE — for agent-originated sessions. Implementation uses SQL parsing (sqlparser-rs) rather than string matching, catching CTEs that wrap mutations, function calls with side effects, and other circumvention patterns that naive blocklists miss. This is the first guard in every agent-facing chain.

RowLimitGuard injects LIMIT N when a query lacks one. Agents in reasoning loops often forget to limit results, producing multi-gigabyte result sets that overflow LLM context windows and generate massive egress costs. The guard wraps queries automatically: SELECT * FROM ({original_sql}) __limited LIMIT {max_rows}. A customer-facing chatbot might set max_rows: 100. A research agent might allow max_rows: 5000. Configurable per routing group.

CostEstimateGuard issues EXPLAIN before execution and rejects queries where estimated scanned bytes exceed a configurable threshold. This is the primary defense against the cartesian join and missing-WHERE patterns that agents in tool-use loops frequently generate — queries that pass syntax checks but would scan terabytes at runtime. For example, an agent that runs SELECT * FROM orders JOIN products without a join condition would be blocked before scanning a single byte. The guard activates for engines that return structured EXPLAIN output (Trino, DuckDB); for others it falls back to warn rather than hard-reject.

PIIMaskGuard rewrites queries to protect sensitive columns tagged in catalog configuration. Three masking strategies are available. Exclude removes the column from SELECT * entirely — the agent never sees it. Hash wraps the column in SHA256(CAST(col AS VARCHAR)) — the agent gets a consistent pseudonym without the real value. NullOut replaces the column with NULL AS column_name — the agent sees the column exists but gets no data. This prevents PII from ever entering the LLM context window — a compliance requirement (GDPR, CCPA) for any agent with access to tables containing user data.

HumanApprovalGuard pauses high-stakes queries, sends a webhook notification (Slack, email, or custom), and waits for human approval or timeout. The trigger can be pattern-based (any DDL), cost-threshold-based (queries estimating > 500 GB scanned), or table-sensitivity-based (any query touching users.financial_records). Timeout behavior is configurable: auto-reject after 5 minutes, auto-approve after review window, or queue indefinitely.

Guards are stacked per routing group — the endpoint defines the trust boundary, not the individual agent. A typical agent-facing configuration:

yaml

1groups:2  - name: agent-analytics3    guards:4      - type: read_only5      - type: row_limit6        max_rows: 50007        inject_if_missing: true8      - type: cost_estimate9        max_scanned_bytes: 10_000_000_00010      - type: pii_mask11        sensitive_columns:12          "users.email": hash13          "users.ssn": exclude14          "payments.card_number": null_out

Every guard action is logged — what fired, what was rewritten, what was rejected — providing full auditability over what agents attempted and what the system allowed. A data engineering agent might get a different stack: cost_estimate → human_approval for DDL operations, with write access to designated staging tables.

3. Multi-engine routing

What's needed

Production Iceberg deployments rarely use a single engine. A typical setup has Trino for interactive analytics, DuckDB for lightweight lookups, Snowflake for BI, Athena for ad-hoc exploration, and Spark for batch ETL. The wrong routing decision at agent scale is expensive: a simple metadata lookup that costs $0.001 on DuckDB costs $0.05 on Snowflake. Across thousands of agent interactions per day, that 50x difference in per-query cost determines whether the agent workload is economically viable.

The routing challenge for agents is that their query shapes are both repetitive (roughly 80% are repeated templates with different literal values) and unpredictable (20% novel SQL generated from conversation context). A single router type cannot handle both efficiently.

How LakeOps provides it

LakeOps provides a three-router stack that dispatches each agent query to the cheapest viable engine based on query shape, table health, and historical performance. The routing layer is powered by QueryFlux — an open-source, Rust-based SQL proxy with ~0.35ms P50 overhead.

1. Adaptive router — pure statistics over query history. For any parameterized_hash with 20+ observations across multiple engines, it routes to the engine with the best P50 latency if the gap exceeds 25%. No LLM, no embeddings — just percentile math over execution history. Decision cost: 0ms (in-memory cache). Handles the ~80% of agent traffic that consists of repeated templates. The same SELECT * FROM orders WHERE customer_id = ? AND status = ? that runs thousands of times per day is routed instantly based on accumulated performance data.

2. LLM router — for novel query shapes the adaptive router has never seen. Uses a language model (Claude Haiku or equivalent) with live table statistics and engine capabilities to reason about optimal placement. The LLM call is cached by parameterized_hash, so subsequent executions of the same template shape skip the LLM entirely. A complex multi-table join that the adaptive router has no history for gets reasoned about once — then cached for every future execution of the same shape.

3. Semantic router — local embedding similarity against a library of known query shapes. When the LLM router's confidence is below threshold (< 0.60), the semantic router finds the most similar previously-routed query and uses its routing decision. Decision cost: ~1ms. Handles the long tail of rare query shapes.

4. Default group — fallback for queries that no router claims with sufficient confidence.

Routing groups organize engines by workload type — each with its own stable endpoint URL, engine pool, query-type scope, priority level, and guardrail stack. Agents connect to their endpoint and inherit everything automatically:

Routing group	Engines	Query types	Priority
Storefront Analytics	Trino, DuckDB	SELECT, AGGREGATE	High
Checkout Transactions	Snowflake, StarRocks	INSERT, UPDATE, MERGE	Medium
Executive Reporting	Snowflake, Trino	SELECT, JOIN	Medium
Catalog ETL	AWS Athena, Spark	INSERT, DELETE	Low

LakeOps Routing Endpoints — stable URLs per workload type with engine assignment and query scope — Routing groups in the LakeOps console — each group has its own stable endpoint URL, assigned engines, allowed query types, and priority level. Agents connect to their endpoint and inherit the full policy stack automatically.

The routing layer supports SQL dialect translation via sqlglot across Trino HTTP, PostgreSQL wire, MySQL wire, Snowflake HTTP, and Arrow Flight SQL — so agents connect through whichever protocol their framework supports, and QueryFlux translates the SQL to the target engine's dialect automatically.

The system is self-improving: as agents run queries and the feedback collector records execution metrics, the adaptive router builds confidence in routing decisions for repeated templates. New query shapes start at the LLM router, get cached, and eventually accumulate enough history to move to the adaptive router — converging toward optimal routing with zero manual configuration.

Per-group concurrency limits and proxy-side queuing protect engines from agent-driven burst pressure. When a cluster is saturated, queries queue at the proxy rather than hammering the backend. Overflow spills to a secondary group via fallback routing. Automatic failover reroutes queries when an engine goes down — no client changes, no manual intervention.

LakeOps Engine Comparison — cost and latency side by side across engines — Engine comparison in LakeOps — Trino at $0.03/query and 1.8s average, Snowflake at $0.08/query and 2.1s, DuckDB at $0.01/query and 0.5s. Success rates, query counts, and data scanned per engine — all in one view.

The measurable impact on production deployments: per-query compute cost drops 60–70% on mixed agent workloads. An agent issuing 50 queries per interaction at $0.05 per query on a mis-routed engine costs $2.50 per interaction. Routed optimally, the same interaction costs $0.15. At 1,000 interactions per day, that is $2,350 saved daily.

4. Self-optimizing storage

What's needed

Routing gets queries to the right engine — but the engine still reads from object storage. If the underlying Iceberg table has thousands of small files (common after streaming ingestion), fragmented manifests (common after months of appends without maintenance), or sort orders misaligned with agent query patterns, even the best engine choice produces slow queries.

The numbers are stark. Agent query P95 latency on uncompacted tables with 47,000 small files runs 52 seconds due to S3 GET amplification and manifest listing overhead. The same query on a compacted, sorted table with 280 files returns in 5.8 seconds — a 9x improvement. With further optimization, sub-500ms is achievable.

This is not just a performance problem. Agents that get slow responses issue more queries — retries, timeouts, and recovery queries that compound the load. A slow lake makes the agent problem worse, not just slower.

How LakeOps provides it

LakeOps runs a Rust-based compaction engine with Apache DataFusion that continuously optimizes tables for the queries actually hitting them — including agent queries. The engine operates at $5/TB versus Spark's $50/TB and gets smarter over time as it observes access patterns.

Four operations run in a coordinated sequence — each operation's output becomes the next one's clean input:

Query-aware compaction analyzes field access frequency from actual queries (including agent queries) and adjusts sort order to match. If agents predominantly filter on event_timestamp and user_id, data is re-sorted on those columns so Parquet row group min/max statistics enable predicate pushdown — eliminating irrelevant row groups before any S3 read. Two strategies are available: Binpack merges small files targeting optimal sizes (fast, effective for most tables), and Sort reorders data by query-relevant columns for maximum data skipping. Benchmarks show 221 seconds for LakeOps versus 1,612 seconds (Spark) versus 6,300 seconds (S3 Tables) on identical 200 GB datasets. Peak throughput reaches 2,522 MB/s.

Manifest consolidation merges fragmented manifest files and computes Puffin column statistics (NDV, min/max, null counts). A table with 500 manifests forces 500 S3 GET requests just for query planning — adding seconds of latency before any data is scanned. Consolidation keeps planning fast regardless of table age. Fresh Puffin statistics enable engines to make smarter skip decisions, reducing data scanned by up to 51%.

Snapshot expiration removes old snapshots while respecting retention windows and concurrent readers. Without expiration, streaming tables accumulate thousands of snapshots per day — each one adding metadata overhead. Configurable retention period and minimum snapshot count balance time-travel depth against metadata performance.

Orphan file cleanup removes unreferenced files from object storage — data from failed writes, aborted jobs, and expired snapshots. Configurable age thresholds prevent touching in-progress writes. Production runs have recovered 200 TB of orphan data across 324 tables at near-zero compute cost.

The operations are sequenced: expire snapshots first → clean orphans → compact → rewrite manifests. Running them independently as disconnected cron jobs produces conflicts and redundant work — compacting files that are about to be expired, rewriting manifests before compaction changes the file layout. LakeOps coordinates them as a single pipeline where each step feeds the next.

LakeOps Optimization tab — compaction strategy, snapshot retention, and scheduling per table — Per-table optimization controls — configure compaction strategy (Binpack or Sort), target file size, snapshot retention, and cron schedule. Start manual, then toggle Enabled for continuous automation.

Tables receiving heavy agent traffic get compacted more frequently — the system monitors query latency and data freshness per table and automatically elevates compaction priority where agents are experiencing degraded performance. The engine is also self-improving: the same 1.192 TB table compacts from 22 minutes on the first run to 11 minutes on the third as the planner learns workload telemetry.

The compound effect is what makes this powerful for agent workloads. As storage optimizes for agent patterns, more engines become viable for each query shape (a query that required Trino before compaction might run on DuckDB afterward). This expands the routing options, which generates more performance data, which further improves both routing and compaction decisions. The lake gets faster the more agents use it.

LakeOps Lake-Wide Events — compaction, snapshot expiration, and manifest rewrites running across tables — LakeOps Events showing autonomous operations across the lake — Compact Data Files (customer_orders 1.24 TB, 16→1 files in 4s), Expire Snapshots, Rewrite Manifests — running continuously with full audit trail.

5. Per-agent observability

What's needed

Standard query monitoring is not sufficient for agent workloads. When a dashboard user runs a slow query, you see one entry in your logs. When an agent runs a slow query, it is one of 40 queries in a reasoning chain — and the chain might retry, branch, or escalate based on each result. You need to answer: which agent session produced this query? What step in its reasoning chain? Which tool call? What was the conversation context? Without this, debugging agent behavior and tuning guardrails is guesswork.

How LakeOps provides it

LakeOps tracks agent activity through four observability dimensions, with agent context propagated automatically via MCP.

Agent context propagation carries structured metadata through the entire query pipeline:

yaml

1agent_context:2  agent_id: "customer-support-agent-v3"3  conversation_id: "conv_8f3a2b1c"4  step_index: 75  tool_call_id: "tc_9d4e5f6a"6  agent_framework: "langchain"

This context attaches to every QueryRecord in the persistence layer, enabling four observability dimensions:

Routing metrics — query volume, latency percentiles, and engine utilization broken down by agent_id. See which agents are heavy consumers, which engines they favor, and where routing decisions are suboptimal. A customer support agent that consistently routes to Snowflake when DuckDB would be 4x faster and 8x cheaper is immediately visible.

Guardrail audit logs — every guard evaluation recorded: what fired, what was rejected, what was rewritten (PII masking, row limit injection). When a ReadOnlyGuard blocks a DELETE attempt or a CostEstimateGuard rejects a 500 GB scan, the full context — agent ID, conversation, step index, original query — is logged. Full audit trail for GDPR, CCPA, and SOC 2 compliance.

Query shape analysis — parameterized template patterns per agent type. Understand whether agents run the same 5 queries with different parameters (optimize for caching and adaptive routing) or generate novel SQL every time (optimize for LLM and semantic routing). This directly informs which router handles most of the traffic and where to invest in optimization.

Cost attribution — per-agent and per-conversation compute spend. Aggregate by agent_id and conversation_id to identify expensive reasoning chains and where optimization has the highest ROI. An agent that costs $2.50 per conversation on average versus one that costs $0.15 represents an immediate optimization target.

Session replay is the capability that ties everything together. Given a conversation_id, you can reconstruct the entire sequence of SQL queries an agent issued during a single interaction, ordered by step_index. This turns query history into an agent reasoning debugger — you can see exactly why an agent reached a particular conclusion by inspecting the data it queried at each step.

The feedback loop closes here: observability metrics feed back into routing decisions (the adaptive router learns from per-agent performance data), compaction priorities (hot tables get elevated), and guardrail tuning (adjust thresholds based on what actually fires).

LakeOps Insights — proactive table health alerts with severity classification — LakeOps Insights page — CRITICAL alerts for partition data file issues, HIGH alerts for excessive manifests and snapshots, WARNING for small file accumulation and partition skew. Each insight links directly to the affected table for one-click remediation.

Agentic query patterns worth optimizing for

Beyond the five infrastructure layers, certain query patterns unique to agent workloads recur across every deployment and deserve explicit attention.

Temporal queries dominate. Agents frequently ask "what did I know about X as of time T?" rather than "what is the current state of X?" Iceberg's time-travel semantics support this natively, but the storage layer must be optimized for it — snapshot retention policies that balance time-travel depth against metadata overhead, and compaction strategies that preserve temporal access efficiency.

Discovery-heavy workloads. Agents spend significant query budget on schema exploration — listing tables, describing columns, sampling data. These queries are lightweight individually but occur at high frequency. The MCP layer handles them from metadata cache without hitting the query engine at all, reducing both latency and cost to near-zero for this traffic.

Cross-agent joins. Multi-agent systems produce queries like "What did Agent A know when it delegated to Agent B?" — temporal joins across two agents' activity records. These require sub-100ms latency because the calling agent waits on the result before taking its next action. Sort order on (agent_id, timestamp) and aggressive compaction of activity log tables are prerequisites for acceptable performance.

Budget attribution queries. "How much did each agent spend during this task?" requires fast aggregation by (agent_id, task_id, time_window) over cost records. This is a well-defined, high-frequency template that the adaptive router recognizes quickly and routes to the cheapest analytical engine — typically DuckDB for this shape.

Workload profiles and routing configuration

Different agent types produce fundamentally different workload shapes. Configuring routing groups, guardrails, and compaction priorities per workload type — rather than treating all agent traffic as homogeneous — is where the largest performance and cost gains materialize.

Interactive analytics agents (customer support, sales intelligence, conversational BI). These agents operate in human-facing loops where response time directly affects user experience. They issue 10–30 queries per interaction, mostly point lookups and narrow aggregations with tight filters. Latency target: sub-500ms p95. Routing: DuckDB for single-table lookups, Trino for multi-table joins. Guards: read_only + row_limit(1000) + pii_mask. Compaction priority: high, with sort orders aligned to common filter columns (customer_id, timestamp, status).

Data pipeline agents (ETL orchestration, data quality validation, schema evolution). These agents run in background loops without latency pressure but generate high query volume — schema introspection, row counts, distribution checks, and validation queries across many tables. Latency target: sub-5s p95. Routing: Athena or Spark for wide scans, DuckDB for metadata queries. Guards: cost_estimate(50GB) + human_approval for DDL. Compaction priority: medium.

Feature store agents (ML feature retrieval, embedding generation, model training data). These agents run highly repetitive parameterized queries — the same template shape executed millions of times with different entity IDs. The adaptive router converges quickly on these patterns. Latency target: sub-200ms p95. Routing: DuckDB or StarRocks for point lookups at scale. Guards: read_only + row_limit(500) + cost_estimate(1GB). Compaction priority: critical — these tables must be perfectly sorted on the entity key with small target file sizes for fast point reads.

Research and exploration agents (data scientists' assistants, report generation, ad-hoc analysis). These agents generate novel, diverse SQL shapes — joining unfamiliar tables, running aggregations at unusual granularity. The LLM and semantic routers handle most of this traffic since the adaptive router lacks history for novel shapes. Latency target: sub-10s p95. Routing: Trino for complex joins, Snowflake for wide aggregations, DuckDB for exploratory sampling. Guards: read_only + row_limit(5000) + cost_estimate(100GB) + pii_mask. Compaction priority: standard.

Agent type	Guard stack	Routing strategy	Compaction priority
BI assistant	read_only + row_limit(1000) + cost_estimate	Latency (DuckDB, Trino)	High
Customer-facing chatbot	read_only + row_limit(100) + pii_mask + cost_estimate	Latency (DuckDB, cached)	Critical
Data exploration agent	read_only + row_limit(5000) + pii_mask	Cost (DuckDB metadata, Trino scans)	Medium
ML feature discovery	read_only + cost_estimate	Cost (Spark for wide scans)	Low
ETL orchestrator	human_approval + cost_estimate	Throughput (Spark, Athena)	Medium

The closed loop

The five components are not independent tools. They form a closed-loop system that improves autonomously.

1. Agents connect via MCP — schema-aware tools auto-discovered, stable endpoints per workload, structured access with zero integration code.

2. Guardrails enforce safety — every query passes through composable guards before execution. ReadOnly, RowLimit, CostEstimate, PIIMask, and HumanApproval stack per routing group.

3. Routing optimizes cost and latency — the three-router stack dispatches each query to the cheapest viable engine. Adaptive handles 80% of traffic at 0ms decision cost. LLM and Semantic handle novel patterns.

4. Storage self-optimizes — compaction, manifests, snapshots, and cleanup run continuously as a coordinated pipeline, informed by agent query telemetry. Hot tables get priority. Sort orders align to agent predicates.

5. Observability closes the loop — per-agent metrics feed back into routing decisions, compaction priorities, and guardrail tuning. Every query is attributed, every guard action is logged, every cost is tracked.

The measurable outcomes on production deployments: agent query P95 latency drops from 5–10 seconds to under 500ms. Per-query compute cost drops 65% through intelligent routing. Guardrail interventions prevent an average of 3.2 petabyte-scale scans per day that agents would otherwise have executed. And the system improves autonomously — no manual query log analysis, no sort-order tuning by hand, no routing rule maintenance.

Getting started

LakeOps connects to your existing catalogs and object storage in approximately 10 minutes. No agents to install, no data to move, no pipelines to modify. Your data stays in your account.

Step 1: Get your tables healthy. An agent hitting a table with 200,000+ small files will timeout regardless of MCP sophistication. Enable continuous compaction, snapshot expiration, and manifest consolidation. This is a prerequisite, not an optimization — uncompacted tables pay a 5–10x latency penalty that no amount of routing can compensate for.

Step 2: Deploy guardrails before the first agent query. Start with ReadOnlyGuard and RowLimitGuard on every agent session. Add CostEstimateGuard with a conservative threshold (10 GB scanned). The cost of deploying guards too early is zero. The cost of deploying them too late is one bad query that scans your entire lake.

Step 3: Enable routing and observability. Connect your engines, configure routing groups per agent workload type, and enable per-agent cost attribution. Within a week you will have the data to understand which agents are expensive, which tables are slow, and where to optimize next.

Supported catalog types include AWS Glue, DynamoDB-backed catalogs, REST catalogs (Polaris, Gravitino, Nessie, Lakekeeper), and S3 Tables. Three guarantees apply: no vendor lock-in, no infrastructure changes, no data leaves your environment.

Watch demo·LakeOps product demoYouTube ↗

Watch LakeOps in action — connecting catalogs, analyzing table health, and running autonomous optimization across your Iceberg lake.

Iceberg Lakehouse with AI Agents: A Guide

Why agents need different infrastructure

1. Agent-native MCP connectivity

What's needed

How LakeOps provides it

2. Query guardrails

What's needed

How LakeOps provides it

3. Multi-engine routing

What's needed

How LakeOps provides it

4. Self-optimizing storage

What's needed

How LakeOps provides it

5. Per-agent observability

What's needed

How LakeOps provides it

Agentic query patterns worth optimizing for

Workload profiles and routing configuration

The closed loop

Getting started

Further reading

Tags

Related articles

Iceberg for AI Agents: Turning Lakehouse Data Into AI-Ready Context

Apache Iceberg Lakehouse Governance: Separation of Concerns with Polaris and Policy Engines

Apache Iceberg Schema Evolution in Production: Best Practices and Pitfalls