Routing Multiple Query Engines with Iceberg

Multiple Query Engines with Iceberg — Ferris the Rust crab routing queries to Trino, Snowflake, DataFusion, Databricks, Presto, ClickHouse, DuckDB, and Apache Spark over an Iceberg Lakehouse

One of the key promises of Apache Iceberg is multi-engine access: Spark writes, Trino queries, Flink streams, DuckDB explores, Athena scans — all on the same tables in the same object storage. The table format handles it. The catalog resolves it. But nothing in the stack decides which engine should run which query.

That decision — query routing — is the gap between having multiple engines and actually benefiting from them. Without it, every team picks the engine they know, regardless of whether it is the cheapest or fastest for the query shape at hand. A point lookup that costs $0.01 on DuckDB costs $0.08 on Snowflake. Across thousands of queries per day, the wrong routing decisions add up to significant overspend and unnecessary latency.

This article covers why multi-engine Iceberg deployments need a routing layer, how routing works architecturally, what QueryFlux provides as an open-source SQL proxy, how LakeOps extends it into an intelligent control plane, and the practical patterns that tie it all together.

Why multiple engines?

Same Data, Different Engines — open lakehouse layered architecture with compute engines above shared Iceberg tables on object storage — Same physical data. No duplication across engines. Compute engines connect to the same Iceberg tables through a shared catalog — each engine brings different strengths to different workloads.

Each engine in a modern lakehouse has a distinct operational sweet spot.

Spark — batch ETL, heavy transformations, streaming ingestion, and all Iceberg maintenance operations (compaction, snapshot expiration, orphan cleanup). The most complete write-path support.
Trino — interactive SQL, low-latency analytics, ad-hoc exploration. Stateless architecture scales reads independently.
Flink — continuous streaming writes with exactly-once semantics. Checkpoint-based ingestion into Iceberg tables.
DuckDB — single-node, in-process analytics. Sub-second response for selective queries on compacted tables. Zero infrastructure.
Athena — serverless, scan-priced. No clusters to manage. Suited for infrequent or unpredictable query volumes where per-query pricing beats reserved compute.
Snowflake — managed warehouse with BI tool integrations. Reads Iceberg tables via external catalogs or Iceberg REST.
StarRocks — high-concurrency, low-latency MPP engine. Strong for BI dashboard workloads with many concurrent users.

No single engine handles everything well. Trino is poor at large-scale ETL writes. Spark is slow at interactive point lookups. DuckDB cannot distribute across nodes. Athena charges per byte scanned, which is expensive for compute-heavy joins. The right engine depends on the query shape, the latency requirement, and the cost model.

The problem without routing

Without a routing layer, a multi-engine Iceberg deployment creates four problems that compound at scale.

Every query hits the same engine regardless of shape. Interactive dashboards, heavy ETL, and ad-hoc exploration all land on one backend. CPU-heavy joins pay scan-pricing on Athena. Selective lookups wait behind batch jobs on a shared Trino cluster.

N×M driver configurations fragment the platform. Each engine requires its own connection strings, credentials, and client libraries. Each team maintains a separate config per engine. Adding or removing a backend means coordinating across every consumer.

No cost awareness in query dispatch. Compute-priced engines (Trino, StarRocks) charge for CPU-seconds or cluster uptime. Scan-priced engines (Athena, BigQuery) charge for bytes read. Without cost-aware routing, queries pay the wrong pricing model. A complex join that costs $0.03 on Trino costs $0.08 on Snowflake. A simple filter that costs $0.01 on DuckDB costs $0.05 on Athena when it scans the full partition.

SLA violations when workloads compete on a shared cluster. A batch ETL job competing with an interactive dashboard on the same Trino cluster degrades both. Without workload isolation at the routing layer, latency-sensitive queries suffer unpredictably.

Architecture of a routing layer

A SQL routing proxy sits between clients and engines. Clients connect using whatever protocol they already use — Trino HTTP, PostgreSQL wire, MySQL wire, or Arrow Flight SQL. The proxy accepts the connection, evaluates routing rules, selects a backend engine, optionally translates the SQL dialect, dispatches the query, and returns results. To the client, it looks like a regular database.

The routing decision happens in two stages.

1.Group selection — routing rules evaluate the query (protocol, SQL text, client tags, headers, or custom logic) and select a cluster group: a named pool of engines configured for a specific workload type.
2.Member selection — within that group, a selection strategy (round-robin, least-loaded, failover, or weighted) picks the specific engine instance that is healthy, enabled, and under its concurrency limit.

If no member is available — all at capacity or unhealthy — the query queues at the proxy until capacity opens, or spills to a fallback group. This is proxy-side throttling: controlled backpressure instead of overloading backends.

QueryFlux: open-source SQL routing proxy

QueryFlux is an open-source, Rust-based SQL proxy built for exactly this problem. It provides multi-engine query routing with protocol translation, capacity management, and observability — designed for the same pattern that proxies like ProxySQL solve for MySQL/PostgreSQL fleets, but applied to heterogeneous analytical engines and lakehouse workloads. The source is on GitHub.

Three-stage pipeline

Every query passes through three stages.

1. Protocol ingestion. QueryFlux listens on multiple frontends simultaneously: Trino HTTP (port 8080), PostgreSQL wire (5432), MySQL wire (3306), Snowflake HTTP, and Arrow Flight SQL. Clients connect using whatever driver they already have — no new SDKs, no code changes.

2. Routing. Rules are evaluated in order. Six router types are available: protocol-based (route by connection type), header-based (inspect HTTP headers), query regex (match SQL patterns), client tags (route by session metadata), compound conditions (AND/OR logic across multiple signals), and Python scripts for fully custom routing logic. First match wins. A routingFallback group catches everything else.

3. Dispatch and translation. The proxy picks a healthy cluster in the selected group using the configured selection strategy (round-robin, least-loaded, failover, or weighted), translates SQL via sqlglot if the source and target dialects differ, and dispatches the query. Results stream back to the client in the original protocol.

Key capabilities

~0.35ms p50 proxy overhead — written in Rust. The measured overhead on the Trino HTTP path is 0.36ms p50, 0.52ms p95. For typical analytical workloads, the proxy is not on the critical path.
Per-group concurrency limits and queuing — maxRunningQueries per cluster group with queue-based overflow and fallback routing when a group is full.
sqlglot-backed dialect translation — automatic SQL conversion across 30+ dialects. Compatible dialects (e.g. MySQL ↔ StarRocks) skip translation entirely.
Python script routers — embed custom routing logic in a route(query, ctx) function. The context exposes protocol, user identity, headers, and database hints.
Custom translation scripts — post-translation Python transforms that mutate the parsed AST for edge cases: stripping catalog prefixes when targeting Athena, renaming functions, or applying environment-specific rewrites.
Prometheus metrics and Grafana dashboards — query count, latency, queue depth, and routing decisions per group — one observability surface across all engines.
[QueryFlux Studio](https://queryflux.dev/docs/studio) — web UI for managing clusters, groups, routing rules, scripts, and query history.
Admin API — full REST API with OpenAPI spec for programmatic cluster and routing management.

Routing configuration

A minimal QueryFlux configuration defines clusters, cluster groups, routers, and a fallback.

yaml

1clusters:2  trino-1:3    engine: trino4    endpoint: http://trino:80805  duckdb-1:6    engine: duckDb7    databasePath: /tmp/queryflux.duckdb8 9clusterGroups:10  analytics:11    members: [trino-1, duckdb-1]12    maxRunningQueries: 10013    strategy:14      type: leastLoaded15  etl:16    members: [trino-1]17    maxRunningQueries: 2018 19routers:20  - type: queryRegex21    rules:22      - regex: "^INSERT|^MERGE|^CREATE"23        targetGroup: etl24  - type: protocolBased25    postgresWire: analytics26    trinoHttp: analytics27 28routingFallback: analytics

Clients connecting via PostgreSQL wire or Trino HTTP are routed to the analytics group. DML statements (INSERT, MERGE, CREATE) match the regex router first and land on etl. Within each group, the leastLoaded strategy picks the cluster with the fewest running queries. To try it locally, the getting started guide has Docker Compose examples including a full multi-engine stack with Trino, StarRocks, and Iceberg.

SQL dialect translation

Multi-engine routing requires SQL compatibility across dialects. A client connected via PostgreSQL wire sending Trino-flavored SQL to a DuckDB backend will fail without translation.

QueryFlux handles this with sqlglot-backed automatic dialect conversion.

1.Detect the source dialect from the frontend protocol (Trino HTTP → Trino SQL).
2.Detect the target dialect from the engine type of the selected cluster (DuckDB → DuckDB SQL).
3.If source and target are compatible (MySQL → StarRocks), skip translation entirely.
4.Otherwise, parse the SQL, transform the AST, and re-serialize for the target dialect.

Common translation cases:

Catalog-qualified names — Trino uses three-part names (catalog.schema.table). Athena has no catalog layer and expects schema.table. A post-translation Python script strips the catalog prefix.
Function mapping — date_diff in Trino vs datediff in DuckDB vs TIMESTAMPDIFF in MySQL.
Type casting — CAST(x AS VARCHAR) vs CAST(x AS TEXT) vs x::text.
Window function syntax — minor variations across dialects in frame clauses and null handling.

For edge cases, custom Python transform scripts run after sqlglot translation.

yaml

1translation:2  pythonScripts:3    - |4      import sqlglot.expressions as exp5      def transform(ast, src: str, dst: str) -> None:6          if dst == "athena":7              for table in ast.find_all(exp.Table):8                  table.set("catalog", None)

Per-group scripts can be attached via Studio for engine-specific rewrites that only apply to certain cluster groups.

LakeOps: intelligent routing on top of QueryFlux

QueryFlux handles the proxy layer — protocols, rule-based routing, dialect translation, capacity management. LakeOps extends that foundation with table-level telemetry, engine-level observability, and optimization that ties routing decisions to table health and cost data.

Routing Performance Metrics — queries routed, average response time, engine load distribution — LakeOps Routing Metrics: 7,285 queries routed this week, 1.4s average response time, 68% engine utilization. Engine load distribution — Trino 35%, Snowflake 25%, Athena 20%, DuckDB 15%, StarRocks 5% — with query shape breakdowns across SELECT, JOIN, AGGREGATE, INSERT, UPDATE, and DELETE.

What LakeOps adds

Table-health-aware routing. LakeOps reads Iceberg metadata continuously — file counts, manifest structure, delete file ratios, partition skew. When a table has fragmented manifests or high delete file counts, routing directs queries on that table to engines that handle degraded state better (Trino or Spark over DuckDB, for example) until compaction resolves the issue.

Query-pattern learning. LakeOps telemetry feeds historical latency and cost data per query shape back into routing decisions. Over time, the system learns which engine performs well for each recurring template — and routing improves automatically without manual rule updates.

Optimization-driven engine expansion. As LakeOps compacts and sorts tables, more engines become viable for each query shape. A table that previously required Trino for a 10-second join might, after compaction, run the same query in 0.5 seconds on DuckDB. The routing layer picks up this change as engine-level metrics shift.

Unified cost model. LakeOps combines query routing cost data (cost per query per engine, total spend per group) with compaction compute savings and storage optimization for full lake-wide cost visibility.

Policy-driven automation. Cost ceilings, latency targets, and priority rules are defined once. LakeOps enforces them across both routing decisions and table optimization — so the routing layer and the maintenance layer work toward the same objectives.

Engine management

Query Engines directory — registered engines with status, metrics, and health — LakeOps engine directory: six engines registered with status (Active, Inactive, Maintenance), query success rate, average runtime, cost per query, total queries, and data scanned. Side-by-side comparison to inform routing rules and capacity planning.

The engine management console provides a unified view: register engines, compare performance side by side, monitor health (uptime, CPU, memory, disk, network), and trigger maintenance mode — which automatically excludes the engine from routing and fails over in-flight queries. Adding or removing engines updates routing groups automatically.

Routing strategies

QueryFlux and LakeOps support three routing strategies, each optimized for a different workload profile.

Cost routing sends each query to the cheapest engine that can execute it within acceptable latency bounds. Compute-heavy joins go to compute-priced backends (Trino, StarRocks where CPU-seconds are cheap). Scan-heavy reads go to scan-priced backends (Athena, where selective queries with good partitioning pay very little). Cost routing works well for batch ETL, scheduled reports, and background processing.

Latency routing sends each query to the fastest engine for the given query pattern and data shape. It uses historical P50 latency per engine for similar query shapes. Interactive dashboards, BI tools, and AI agent queries that need near-instant responses use latency routing.

Throughput (balanced) routing distributes queries across engines to maximize overall throughput and prevent any single engine from becoming a bottleneck. It considers current engine utilization and remaining capacity. Mixed workloads with varying query patterns use throughput routing.

Routing Endpoints — stable URLs per workload type with engine pools, query types, and priority levels — Routing groups in practice: Analytics routes SELECT and AGGREGATE queries to Trino + DuckDB at high priority. BI routes to Snowflake + Trino. ETL routes INSERT and MERGE to Athena + StarRocks. Reports route to Snowflake + ClickHouse. Each group has its own stable endpoint URL, engine pool, and priority level.

In practice, you assign a strategy per routing group rather than globally. Analytics dashboards get latency routing on a Trino + DuckDB pool. ETL pipelines get cost routing on an Athena + Spark pool. Each group has its own concurrency limits, priority level, and fallback rules.

Practical routing patterns

Five patterns that work in production multi-engine Iceberg deployments.

1. Workload-scoped endpoints. Create a routing group per workload type — analytics, etl, bi, reports — each with its own endpoint URL, engine pool, query-type scope (SELECT, INSERT, etc.), and priority level. Applications connect to their endpoint and inherit all routing rules. Adding or removing engines from a group requires no client changes.

2. Cost-aware dispatch. Steer CPU-heavy joins to compute-priced engines and scan-heavy Iceberg reads to scan-priced backends. Encode the cost model once in the routing configuration. Every client inherits the same dispatch automatically. In the queryflux-bench benchmark suite, workload-aware routing reduced total query cost by up to 56%, with individual queries sometimes dropping by 90% compared with always using a single default engine.

3. Dashboard SLA protection. Set a maxRunningQueries cap on the StarRocks or Trino pool that serves dashboards. When the pool is full, ad-hoc queries queue at the proxy or spill to a fallback group. Dashboard traffic always has headroom. Analysts wait transparently rather than degrading everyone.

4. Transparent engine migration. The weighted selection strategy runs old and new engines together in the same cluster group. Ramp the new engine from 10% to 100% of traffic over time, comparing latency and cost side by side in Studio query history. Zero client changes, no flag day, no coordinated cutover.

5. Health-based failover. The failover selection strategy uses YAML ordering as priority. Background health checks monitor each cluster. When an engine becomes unhealthy, it is automatically excluded and traffic routes to remaining healthy members. When it recovers, it is re-admitted.

Engine comparison: what to measure

Engine Comparison — side-by-side cost and latency metrics per engine — Engine comparison: Trino $0.03/query at 1.8s average, Snowflake $0.08/query at 2.1s, DuckDB $0.01/query at 0.5s — with success rates of 99.5%, 99.8%, and 100% respectively. These numbers drive routing rules.

Before configuring routing rules, you need data on how each engine performs on your actual workloads. The metrics that matter:

Cost per query — varies by engine pricing model and query shape. Track per-engine, per-query-type.
Average runtime — P50 and P99 latency per engine for comparable queries.
Success rate — failed queries, timeouts, and retries per engine.
Data scanned — bytes read per query. Critical for scan-priced engines.
Concurrency capacity — how many concurrent queries each engine handles before degradation.
Engine health — uptime, CPU/memory/disk utilization, and response time trends.

Run the same representative queries across all engines for a week. The results will show clear patterns: DuckDB dominates selective lookups, Trino dominates complex joins on compacted data, Athena dominates infrequent wide scans where the scan-pricing model wins. Those patterns become your routing rules.

Table optimization unlocks better routing

Routing decides where a query runs. But the set of engines eligible for any given query depends on the table's physical layout.

A table with 50,000 small files, fragmented manifests, and no sort order forces even simple queries onto heavy distributed engines. The same table compacted into a few hundred sorted files with lean manifests and current Puffin statistics can be queried by DuckDB in sub-second time. Compaction does not just speed up one engine — it expands the set of engines that are viable for each query shape.

This creates a compounding effect between table maintenance and routing.

Compacted tables reduce the minimum viable engine per query shape. More queries become eligible for the cheapest engine tier.
Query-aware sort orders match file layout to actual access patterns, enabling more aggressive file-level pruning across all engines.
Puffin statistics (min/max column stats, bloom filters) allow engines to skip entire files during scan planning, reducing bytes read and favoring scan-priced backends.
Manifest optimization keeps metadata lean, so engines with limited scan-planning capacity (DuckDB, StarRocks) can still open tables quickly.

The practical takeaway: optimizing your Iceberg tables is a prerequisite for getting full value from multi-engine routing. Routing alone improves dispatch — but compacted, sorted tables unlock more engines for more queries.

Self-improving routing for agentic workloads

AI agents accessing Iceberg tables create a specific routing challenge. Their query shapes are both repetitive (80% parameterized templates) and unpredictable (20% novel queries the agent constructs on the fly). A single router type cannot handle both efficiently.

The QueryFlux design document for agentic workloads describes a three-router stack where each router handles what it does well, and they fall through in sequence.

Adaptive router — pure statistics over query history. For any parameterized query shape with 20+ observations across multiple engines, it routes to the engine with the lowest P50 latency if the gap exceeds 25%. Decision cost: 0ms (in-memory cache lookup). Handles the 80% of traffic that consists of repeated templates.
LLM router — for novel query shapes the adaptive router has never seen. A language model reasons about optimal placement using live table statistics, engine capabilities, and current cluster load. Cached by parameterized hash — subsequent runs of the same shape skip the LLM entirely. Decision cost: ~300ms first call, 0ms on cache hit.
Semantic router — local embedding similarity against a library of known query shapes. When the LLM router's confidence is below threshold, the semantic router finds the most similar previously-routed query and uses its routing decision. Decision cost: ~1ms with local embeddings.
Default group — fallback for queries that no router claims.

The system improves automatically. As agents run queries and the feedback collector records execution metrics, the adaptive router builds confidence for repeated templates. The LLM is only called once per unique query shape — every subsequent execution hits the in-memory cache. Novel patterns become known patterns over time.

The design document also describes guardrails that sit between routing and dispatch — ReadOnlyGuard, RowLimitGuard, CostEstimateGuard, PIIMaskGuard, and HumanApprovalGuard — enforcing safety policies per agent session before any query reaches an engine. Combined with an MCP server frontend, any MCP-compatible agent gets QueryFlux routing, translation, guardrails, and observability with zero custom integration code.

Getting started

If you run multiple engines on shared Iceberg tables, you can start with a straightforward setup.

1.Deploy the proxy. QueryFlux runs as a Docker container. Clone the repo, run docker compose up in the examples directory, and you have a working proxy with Trino in minutes.
2.Define cluster groups. One group per workload type: analytics with Trino + DuckDB, etl with Spark + Athena, bi with StarRocks.
3.Set routing rules. Start with protocol-based routing (PostgreSQL wire → analytics group, Trino HTTP → analytics group). Add query regex rules as you learn workload patterns.
4.Set concurrency limits. Cap each group based on engine capacity. Enable queue-based overflow.
5.Monitor and iterate. Use Prometheus metrics to track cost per query, latency per engine, and routing decisions per group. Adjust rules based on data.

Point applications and BI tools at the proxy endpoint instead of directly at engines. The proxy handles routing, translation, failover, and observability. Clients connect once; the backend topology is configuration, not code.

Over time, add cost-aware routing, health-based failover, and — if your tables are well-compacted — expand the set of engines eligible for each workload tier. Connect LakeOps to gain table-health-aware routing, query-pattern learning, engine comparison, and a unified view of how routing, optimization, and cost fit together across your lake. The better your table layout, the more engines become viable, and the more value the routing layer delivers.