Query Performance

Optimize Your Lakehouse Performance

LakeOps continuously reshapes data layout based on actual query patterns — sort order, file sizes, manifests, and delete files all adapt to how your tables are really used. The result: up to 12× faster queries across Trino, Spark, and Snowflake without changing a single query or pipeline.

Get a Demo


12× faster queries

51% less data scanned

99.8% max file reduction

95% faster compaction

Results

Measured impact on real Iceberg workloads

Benchmarks from production-grade tables across multiple engines and cloud providers.

Compaction speed

95% faster

vs. Apache Spark on identical datasets

Chart series: Spark, LakeOps, LakeOps + Sort

Query performance

12× faster

After compaction + layout optimization

Cost savings

80% reduction

In compute & storage spend

Table health

100% healthy

Autonomous maintenance keeps every table optimized

TPC-DS benchmark suite · Production Iceberg tables · Multi-cloud, multi-engine

How LakeOps Supercharges Performance

Six layers of performance optimization.

Each layer amplifies the others. Fewer files + sorted data + lean manifests + clean deletes + right engine = compound acceleration on every query.

Query-aware compaction

Data sorted by how it's actually queried

LakeOps tracks which columns appear in WHERE, JOIN, and GROUP BY clauses for every table. During compaction, data is physically sorted by those columns — so Parquet row group statistics enable engines to skip irrelevant data without reading it.

51% less data scanned — sorted by real filter columns, per table
47,000 → 280 files: same data, same query — 52s drops to 5.8s
Self-improving: sort strategy adapts as query patterns evolve
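
To illustrate why sort order matters for data skipping, here is a minimal Python sketch. It is not LakeOps code: it models row groups as plain lists, and the column name `customer_id`, the group size, and the filter range are all assumptions chosen for the example. It counts how many row groups a range filter must read when min/max statistics are computed over unsorted versus sorted data.

```python
import random


def row_group_stats(values, group_size):
    """Split values into fixed-size row groups and record each group's (min, max)."""
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    return [(min(g), max(g)) for g in groups]


def groups_scanned(stats, lo, hi):
    """Count row groups whose [min, max] range overlaps the filter lo <= x <= hi."""
    return sum(1 for g_min, g_max in stats if g_max >= lo and g_min <= hi)


random.seed(7)
customer_ids = list(range(10_000))   # hypothetical filter column
random.shuffle(customer_ids)         # arrival order, effectively unsorted

unsorted_stats = row_group_stats(customer_ids, group_size=500)
sorted_stats = row_group_stats(sorted(customer_ids), group_size=500)

# Predicate: WHERE customer_id BETWEEN 1200 AND 1399
scanned_unsorted = groups_scanned(unsorted_stats, 1200, 1399)
scanned_sorted = groups_scanned(sorted_stats, 1200, 1399)
print(f"row groups read: unsorted={scanned_unsorted}, sorted={scanned_sorted}")
```

With shuffled data nearly every row group spans the whole value range, so the statistics rule almost nothing out. After sorting, the filter range falls inside a single row group.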

Query Acceleration: 12× faster

Files before: 47,000
Files after: 280
Query before: 52s
Query after: 5.8s

Scan volume reduced 51%

Query-aware sort + optimized file layout

95% faster Rust engine

Tables stay optimized because compaction is fast enough to run continuously

A purpose-built Rust engine with Apache DataFusion eliminates JVM/GC overhead. Compaction completes in minutes instead of hours — so tables never degrade between maintenance windows.

LakeOps 221s vs. Spark 1,612s vs. S3 Tables 6,300s on an identical 200 GB dataset
2,522 MB/s peak throughput — TB-scale tables compacted in minutes
Bounded memory: no OOM crashes regardless of table size

Compaction Speed: 95% faster

S3 Tables: 6,300s
Spark: 1,612s
LakeOps: 221s
LakeOps Sort: 780s

Manifest & metadata optimization

Query planning stays fast at any table scale

LakeOps consolidates fragmented manifests and computes Puffin column statistics (NDV, min/max, null counts). Planners read fewer manifests and make smarter skip decisions — planning drops from seconds to milliseconds.

Manifest consolidation: 200+ manifests → ~30 in a single atomic rewrite
Puffin statistics enable aggressive file-level pruning across all engines
Auto-triggered after compaction cycles — manifests never drift

Metadata Optimization: 3 operations

Rewrite Manifests (Planning): Consolidate for faster planning
Rewrite Position Deletes (Reads): Eliminate read-time overhead
Compute Puffin Statistics (Skipping): Enable aggressive file pruning

Delete file optimization

Eliminate read-time reconciliation overhead

Position delete files from merge-on-read workloads accumulate and force every query to reconcile deletions at scan time. LakeOps consolidates and physically applies delete files so reads are always clean.

Rewrite Position Deletes: consolidate without full table rewrite
Full compaction: physically merge deletes — zero read-time overhead
23,433 delete files (551M rows) cleaned in one compaction cycle
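
The difference between reconciling deletes at read time and applying them during compaction can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the Iceberg or LakeOps implementation: data files are dictionaries of row lists, position deletes are `(file path, row position)` pairs, and all file names and rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical data files: path -> rows (row position = list index).
data_files = {
    "data/a.parquet": ["r0", "r1", "r2", "r3"],
    "data/b.parquet": ["r4", "r5", "r6"],
}

# Position deletes as (data file path, row position) pairs, spread over
# several small delete files as merge-on-read writers would leave them.
delete_files = [
    [("data/a.parquet", 1)],
    [("data/b.parquet", 0), ("data/a.parquet", 3)],
]


def index_deletes(delete_files):
    """Load every delete file and group deleted positions by data file."""
    deleted = defaultdict(set)
    for delete_file in delete_files:
        for path, pos in delete_file:
            deleted[path].add(pos)
    return deleted


def scan(data_files, delete_files):
    """Merge-on-read scan: each query loads all delete files and filters rows."""
    deleted = index_deletes(delete_files)
    return [row for path, rows in data_files.items()
            for pos, row in enumerate(rows) if pos not in deleted[path]]


def compact(data_files, delete_files):
    """Rewrite data files without deleted rows; no delete files remain afterwards."""
    deleted = index_deletes(delete_files)
    rewritten = {path: [row for pos, row in enumerate(rows) if pos not in deleted[path]]
                 for path, rows in data_files.items()}
    return rewritten, []


rows_before = scan(data_files, delete_files)
compacted_files, remaining_deletes = compact(data_files, delete_files)
rows_after = scan(compacted_files, remaining_deletes)
print(rows_before == rows_after, len(remaining_deletes))
```

Both scans return the same rows. The difference is that after compaction the scan has no delete files left to load or reconcile.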

Delete File Cleanup: Zero overhead

Delete files: 23,433

After cleanup

Rows affected: 551M
Read overhead: Eliminated

Multi-engine query routing

Every query on the fastest engine for its shape

LakeOps routes queries across Trino, Spark, DuckDB, Snowflake, Athena, and Flink based on latency profile, query shape, and engine availability. Interactive queries hit sub-second engines. Heavy scans go where compute is strongest.

DuckDB: 0.5s point lookups vs 2.3s on Athena for the same query
Three strategies: latency, cost, throughput — per routing group
Optimized tables unlock faster engines for more workload shapes
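
A minimal sketch of strategy-based routing, assuming a much simpler model than the product describes. The latency figures come from the routing panel on this page. The `cost_index` and `scan_gbps` values, the `Engine` class, and the `route` function are hypothetical and exist only for illustration. Query shape is ignored here: the sketch only picks the best available engine for one of the three named strategies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Engine:
    name: str
    latency_s: float     # from the routing panel on this page
    cost_index: float    # hypothetical relative cost per query
    scan_gbps: float     # hypothetical sustained scan throughput
    available: bool = True


ENGINES = [
    Engine("DuckDB", 0.5, 1.0, 1.0),
    Engine("Trino", 1.8, 3.0, 8.0),
    Engine("Snowflake", 2.1, 5.0, 10.0),
    Engine("Spark", 4.2, 4.0, 20.0),
]


def route(strategy, engines=ENGINES):
    """Pick the best available engine for a routing strategy."""
    candidates = [e for e in engines if e.available]
    if not candidates:
        raise RuntimeError("no engine available")
    if strategy == "latency":
        return min(candidates, key=lambda e: e.latency_s)
    if strategy == "cost":
        return min(candidates, key=lambda e: e.cost_index)
    if strategy == "throughput":
        return max(candidates, key=lambda e: e.scan_gbps)
    raise ValueError(f"unknown strategy: {strategy}")


print(route("latency").name, route("throughput").name)
```

If an engine is marked unavailable, the same call falls back to the next best candidate for that strategy.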

Engine Routing: Latency-aware

DuckDB: Point lookups, 0.5s
Trino: Interactive SQL, 1.8s
Snowflake: High concurrency, 2.1s
Spark: Large distributed scans, 4.2s

Right engine for every query shape

Layout simulations

Test sort strategies on real data before rewriting anything

Run proposed layout changes on an isolated Iceberg branch — real data, real query patterns replayed. Compare scan reduction and planning overhead across multiple strategies. Discard the branch. Zero production risk.

Field access frequency analysis: which columns appear in FILTER, SELECT, and JOIN
Side-by-side comparison of file sizes, strategies, and sort keys
Predicted vs actual: measurable before committing to a rewrite

Layout Simulation: Branch-based

1. Analyze field access frequency (FILTER, SELECT, JOIN)

2. Create isolated Iceberg branch

3. Apply proposed sort order to real data

4. Replay production queries, measure scan reduction

Branch discarded after analysis: zero production impact
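
The comparison step can be sketched as a small offline experiment. The Python below is an illustration under stated assumptions, not the LakeOps simulator. It uses synthetic rows in place of an Iceberg branch, two hypothetical columns (`region`, `day`), and a made-up set of replayed equality filters. For each candidate sort key it counts how many row groups the replayed filters would have to read.

```python
import random


def row_groups_scanned(rows, sort_key, filters, group_size):
    """Sort rows by sort_key, cut them into row groups, and count the
    groups whose min/max range cannot rule out each `column = value` filter."""
    ordered = sorted(rows, key=lambda r: r[sort_key])
    groups = [ordered[i:i + group_size] for i in range(0, len(ordered), group_size)]
    scanned = 0
    for column, value in filters:
        for group in groups:
            col_values = [r[column] for r in group]
            if min(col_values) <= value <= max(col_values):
                scanned += 1
    return scanned


random.seed(11)
rows = [{"region": random.randrange(20), "day": random.randrange(365)}
        for _ in range(4_000)]

# Hypothetical replayed workload: mostly filters on `day`, one on `region`.
replayed_filters = [("day", d) for d in (3, 90, 181, 300)] + [("region", 5)]

results = {key: row_groups_scanned(rows, key, replayed_filters, group_size=200)
           for key in ("region", "day")}
best_key = min(results, key=results.get)
print(results, "-> best sort key:", best_key)
```

Because the replayed workload filters mostly on `day`, sorting by `day` reads far fewer row groups. A workload dominated by `region` filters would favor the other key.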

The problem

Why Iceberg tables get slow

Iceberg's metadata architecture is built for fast queries. But without active maintenance, physical table state degrades — and every query pays the penalty.

Small files multiply per-query overhead

Streaming ingestion creates thousands of tiny files. Each file costs an S3 GET request, a metadata read, and a connection — query time scales with file count, not data volume.

Unsorted data defeats data skipping

Without sort order aligned to query patterns, Parquet min/max statistics are useless. Engines scan every row group regardless of predicate filters.

Fragmented manifests bloat planning time

Hundreds of small manifests force the query planner to read excessive metadata. Planning often dominates total query time at 200+ manifests per table.

Delete files compound read amplification

Merge-on-read tables accumulate position delete files. Every query reconciles deletes at read time — performance degrades linearly with delete file count.

Runs on your stack

Go deeper

Optimizing Iceberg Lakehouse Performance

Six layers that determine query speed and how to optimize each one.

Optimizing Iceberg Lake Compaction: A Guide

Event-driven triggers, query-aware sort, Rust engine, and coordinated maintenance.

Iceberg Lakehouse Observability

Table health, insights, metrics, and cross-engine telemetry in one control plane.

Minutes to value with no risk

Connect & collect telemetry

Manual or autonomous management: Manual, Autonomous

Operations run & optimize: Compaction, Snapshots, Orphan cleanup, Manifests & metadata

Observability & governance: Metrics, Health, Agents, Routing, Logs, Policies

No vendor lock-in · No code / infra changes · No data changes


Set up in 10 minutes · Works with your existing stack

Production benchmarks

5.5 TB across 10 production tables

Real workloads. Real data. Batch, streaming, delete-heavy, multi-writer, and terabyte-scale tables — all on the same engine, same hardware.

101K → 19K files (81% reduction)

2,522 MB/s peak throughput

99.8% max file reduction

551M deleted rows cleaned

Table	Size	Workload	Files (before → after)	Throughput	Time	Notes
balance_snapshots	1,192 GB	TB-Scale batch	11,957 → 3,270	1,572 MB/s	11 min	Spark OOM on same hardware
user_accounts	174 GB	Batch	878 → 400	2,269 MB/s	74s	Single Node
events_analytics	484 GB	Delete-Heavy	16,128 → 7,198	729 MB/s	11m 21s	23,433 delete files; 551M rows removed
raw_sdk_events	8 GB	Streaming	42,633 → 69	167 MB/s	138s	99.8% file reduction
site_traffic	292 GB	Multi-Writer	2,740 → 754	1,465 MB/s	3m 25s	Single partition
cluster_registry	322 GB	Batch	998 → 440	2,522 MB/s	2m	Peak throughput
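
The file-reduction percentages can be recomputed from the before and after file counts in the table above. The short Python sketch below does only that arithmetic. The counts are copied from the table, and the function and variable names are illustrative.

```python
# (files before, files after), copied from the benchmark table above.
file_counts = {
    "balance_snapshots": (11_957, 3_270),
    "user_accounts": (878, 400),
    "events_analytics": (16_128, 7_198),
    "raw_sdk_events": (42_633, 69),
    "site_traffic": (2_740, 754),
    "cluster_registry": (998, 440),
}


def reduction_pct(before, after):
    """Percentage of files removed by compaction."""
    return 100.0 * (before - after) / before


reductions = {table: round(reduction_pct(before, after), 1)
              for table, (before, after) in file_counts.items()}
best_table = max(reductions, key=reductions.get)

for table, pct in reductions.items():
    print(f"{table}: {pct}% fewer files")
```

The largest reduction is on the streaming table `raw_sdk_events`, which matches the 99.8% maximum quoted on this page.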

Compaction cost per TB

Normalized to Spark = 100%

Apache Spark: 100%

AWS S3 Tables / Databricks: 100%

LakeOps: 10%

Source: 200 GB (~1 TB uncompressed) benchmark. Spark cost index 100 vs LakeOps 10.

Self-improving: same table, zero config changes

balance_snapshots — 1.192 TB across consecutive runs

Run 1: 22 min · 925 MB/s

Run 2: 18 min · 1,100 MB/s

Run 3 (learned): 11 min · 1,572 MB/s

Same data and hardware; the planner learns from workload telemetry and cuts runtime from 22 to 11 minutes.

See your projected acceleration

Connect your catalog and get a free performance analysis in 10 minutes — see exactly where your tables are degraded and how much LakeOps can accelerate them.

