Apache Iceberg with Trino: Performance Optimization Guide

Trino has become one of the most popular engines for querying Apache Iceberg tables in production lakehouses. Its stateless, distributed SQL architecture makes it a natural fit for interactive analytics, ad-hoc exploration, and BI workloads over open table formats. But getting fast, cost-efficient queries out of Trino on Iceberg is not automatic — it requires understanding how Trino plans scans against Iceberg metadata, how file layout affects pruning, and where Trino-native maintenance falls short at scale.

This guide covers the full optimization surface: how Trino resolves Iceberg metadata and plans queries, the configuration knobs that matter, Trino-native maintenance procedures and their limitations, physical layout strategies, and how a dedicated control plane like LakeOps closes the gaps that Trino cannot address on its own. For broader context on Iceberg performance tuning across engines, see our Iceberg lake analytics optimization guide.

How Trino queries Iceberg tables

Understanding what happens between SELECT and the first byte of results is essential for optimization. Trino's Iceberg connector follows a multi-stage process to translate SQL into efficient object storage reads.

Metadata resolution

When a query references an Iceberg table, Trino contacts the Iceberg catalog (Hive Metastore, AWS Glue, Nessie, or Iceberg REST) to locate the current metadata pointer. It reads the latest metadata.json file, which references the current snapshot. From the snapshot, Trino walks the manifest list to identify which manifest files describe the table's data files. Each manifest contains per-file metadata: file path, partition values, column-level min/max statistics, row counts, and file size.

This metadata walk is the foundation of Iceberg's query planning advantage over Hive-style tables. Instead of listing directories in object storage (which is slow and eventually consistent on S3), Trino reads a structured metadata tree that tells it exactly which files could contain matching rows.

Scan planning and predicate pushdown

Once Trino has the manifest entries, the coordinator evaluates query predicates against the metadata to eliminate files before any data is read. This happens in three layers:

1.Partition pruning — if the table is partitioned and the query filters on a partition column, Trino eliminates entire partitions by comparing the predicate against partition values in the manifest. A query filtering WHERE event_date = '2026-05-01' on a day-partitioned table skips every partition except that date.
2.Min/max file pruning — for each surviving partition, Trino checks per-file column statistics. If a file's max value for user_id is 500 and the query filters WHERE user_id > 1000, that file is skipped entirely. This works on any column with statistics, not just partition columns.
3.Row group skipping — within individual Parquet files, Trino uses row group (page) statistics to skip blocks of rows that cannot match the predicate. This is the finest-grained pruning level and depends on data being physically sorted so that value ranges within row groups are tight.

The effectiveness of these layers compounds. Partition pruning might eliminate 95% of files. Min/max pruning removes another 80% of what remains. Row group skipping reduces the bytes read within each surviving file. The combined effect can be orders of magnitude less I/O than a full scan — but only if the table's physical layout cooperates. Dedicated control planes like LakeOps can optimize this layout automatically based on real query patterns — we'll cover how later in this guide.

Query performance tuning

Trino's Iceberg connector exposes several mechanisms for reducing scan volume and query latency. Most of them depend on metadata quality and file layout rather than Trino configuration alone.

Partition pruning

Partition pruning is the highest-impact optimization. When a query filters on a partition column, Trino can skip entire sets of files without reading any data. Iceberg supports hidden partitioning transforms — day(ts), month(ts), bucket(id, 16), truncate(region, 3) — which means partitioning does not require users to add synthetic columns to their data.

To verify that partition pruning is active, use EXPLAIN and check for partitionConstraint in the output:

sql

1EXPLAIN SELECT * FROM lakehouse.events2WHERE event_date = DATE '2026-05-01';

If the constraint appears, Trino is pruning at the partition level. If it does not, the predicate may not align with the partition spec — a common issue when timestamp filters do not match the partition transform granularity. For detailed partition design guidance, see our Iceberg partitioning best practices guide.

Min/max column statistics and file pruning

Beyond partitions, Trino uses per-file column statistics stored in Iceberg manifests to skip files whose value ranges do not overlap with query predicates. This is sometimes called "data skipping" or "file pruning" and it works on any column, not just partition columns.

The effectiveness depends on how well the data is physically sorted. If a column is used frequently in WHERE clauses but the data is written in arrival order, the min/max ranges per file will span nearly the entire domain, and no files can be pruned. Sorting the data on that column during compaction tightens the ranges dramatically — each file covers a narrow slice of the value space, and most files can be skipped.

Consider a transactions table with 10,000 files. Without sorting on account_id, a query filtering WHERE account_id = 42 may need to read all 10,000 files because every file's min/max range includes 42. After sorting on account_id, the same query might read only 3 files — a 3,000x reduction in I/O.

Row group skipping

Parquet files are internally organized into row groups, each with its own column statistics. When Trino reads a file, it checks row group stats before decompressing data. If a row group's statistics show that no rows can match the predicate, Trino skips that row group entirely.

Row group skipping is most effective when the data within each file is sorted on the filtered column. Unsorted files have row groups with wide, overlapping value ranges — sorting produces row groups with tight, non-overlapping ranges. The Parquet writer's row-group-size (default 128 MB in most configurations) interacts with sort order to determine how many row groups exist per file and how selective they can be.

Dynamic filtering

Trino's dynamic filtering feature collects filter values from the build side of a join and pushes them into the probe-side scan at runtime. For Iceberg tables, this means that a join on a high-cardinality column can reduce the scan on the probe side to only the files and row groups that contain matching values — even though the original query did not have an explicit filter on that column.

Enable dynamic filtering in Trino's configuration:

text

1# config.properties2enable-dynamic-filtering=true3dynamic-filtering.large.max-distinct-values-per-driver=100004dynamic-filtering.large.range-row-limit-per-driver=100000

Dynamic filtering is particularly effective when joining a large fact table against a small dimension table. The dimension-side values are collected during the build phase and pushed down as predicates into the Iceberg scan, potentially eliminating the majority of fact-table files.

Trino-side configuration

Beyond scan planning, several Trino configuration parameters directly affect query performance on Iceberg tables.

Worker memory and split sizes

Trino divides each query into splits — units of work assigned to workers. For Iceberg, each split typically corresponds to a file or a range within a file. The split size and worker memory determine how many splits a worker can process concurrently.

text

1# config.properties2query.max-memory-per-node=16GB3query.max-total-memory-per-node=20GB4node-scheduler.max-splits-per-node=256

Smaller splits increase parallelism but add scheduling overhead. Larger splits reduce overhead but may cause memory pressure on workers with limited RAM. For most Iceberg workloads, 16 GB per-node memory with reasonable split concurrency provides a good balance. Tune node-scheduler.max-splits-per-node based on your worker count and object storage throughput.

Task concurrency

The task.concurrency property controls how many threads each worker uses for processing splits. Higher values increase throughput for I/O-bound scans but can cause contention for CPU-bound operations like aggregations.

text

1# config.properties2task.concurrency=163task.writer-count=4

For scan-heavy Iceberg workloads reading from S3, increasing task.concurrency to 16–32 often improves throughput because workers spend much of their time waiting for object storage responses.

Iceberg-specific connector properties

The Iceberg connector itself has tuning knobs that affect metadata resolution and scan planning:

text

1# iceberg.properties2iceberg.table-statistics-enabled=true3iceberg.projection-pushdown-enabled=true4iceberg.query-partition-filter-required=true5iceberg.dynamic-filtering.wait-timeout=1s

Table statistics feed Trino's cost-based optimizer for better join ordering and aggregation strategies. Projection pushdown eliminates unnecessary columns early in the scan. Requiring partition filters prevents accidental full-table scans on partitioned tables — a common source of runaway queries in production. Dynamic filtering wait timeout controls how long the scan side waits for the build-side filter in join queries.

Table maintenance with Trino

Trino provides built-in procedures for Iceberg table maintenance. These are useful for small-to-medium deployments but come with significant limitations at scale.

Compaction — rewriting data files

Trino's Iceberg connector supports the optimize statement for compacting small files into larger ones. This is Trino's native compaction mechanism:

sql

1-- Compact files with a size threshold (Trino 4xx+)2ALTER TABLE lakehouse.analytics.events EXECUTE optimize3  WHERE event_date > CURRENT_DATE - INTERVAL '7' DAY;4 5-- With explicit file size threshold6ALTER TABLE lakehouse.analytics.events7  EXECUTE optimize(file_size_threshold => '128MB')8  WHERE event_date > CURRENT_DATE - INTERVAL '7' DAY;

The optimize statement supports partition-level filtering with WHERE, allowing you to target recent partitions without scanning the entire table. The file_size_threshold parameter controls which files are eligible for compaction — files smaller than this value are candidates for merging.

Snapshot expiration

Iceberg retains snapshots for time travel and rollback. Over time, old snapshots accumulate metadata that slows down planning. Trino can expire snapshots using the expire_snapshots procedure:

sql

1ALTER TABLE lakehouse.analytics.events2  EXECUTE expire_snapshots(retention_threshold => '7d');

This removes snapshot metadata older than the retention threshold and marks the associated data files as candidates for cleanup — but does not delete the files themselves.

Orphan file removal

After snapshot expiration, data files that are no longer referenced by any snapshot become orphans. These files consume storage but serve no purpose. Trino can remove them:

sql

1ALTER TABLE lakehouse.analytics.events2  EXECUTE remove_orphan_files(retention_threshold => '3d');

The retention_threshold is a safety measure to avoid deleting files from in-progress writes. Setting it to at least 3 days is common practice.

Manifest optimization

Over time, tables accumulate many small manifest files — especially under frequent streaming or micro-batch ingestion. Trino's optimize_manifests procedure rewrites manifests to cluster them by partition, improving scan planning performance:

sql

1ALTER TABLE lakehouse.analytics.events2  EXECUTE optimize_manifests;

This is particularly useful when queries include partition filters but manifest files are not grouped by partition — the optimizer must read every manifest to determine which partitions they cover.

Limitations of Trino-native maintenance

Running maintenance through Trino works for small deployments. At scale — hundreds of tables, mixed streaming and batch ingestion, multiple concurrent engines — Trino's built-in procedures become a bottleneck.

JVM overhead. Trino runs on the JVM. Compaction is a memory-intensive, I/O-heavy operation that competes with query workloads for heap space and garbage collection cycles. Running large compaction jobs on the same Trino cluster that serves interactive queries degrades both workloads. Dedicating a separate Trino cluster for maintenance solves contention but doubles infrastructure cost — a significant factor in Iceberg lakehouse cost optimization.

No cross-engine awareness. Trino only sees queries that Trino executes. If Spark, Flink, or Athena also query the same tables, Trino has no visibility into their access patterns. Sort order decisions based only on Trino query patterns may be suboptimal for the overall workload. Similarly, a compaction job triggered by Trino may conflict with a Spark write that just started — Trino has no mechanism to coordinate across engines.

No coordinated sequencing. The three maintenance operations — snapshot expiration, orphan cleanup, compaction — need to run in a specific order. Expiring snapshots first frees files that are no longer needed, so compaction does not waste time rewriting files that will be deleted. Trino provides the individual procedures but no orchestration layer to sequence them correctly. Teams end up building custom cron jobs or Airflow DAGs to coordinate, adding operational overhead.

No event-driven triggers. Trino maintenance is schedule-driven — you run optimize when a cron job fires. There is no mechanism to detect that a table has accumulated enough small files or delete files to warrant compaction. Cron-based scheduling either wastes compute on tables that do not need maintenance or misses tables that need it urgently. For a deep dive into how event-driven compaction works at scale, see efficient lakehouse compaction at scale.

No per-table tuning at lake scale. Each optimize call requires explicit parameters. Across hundreds of tables with different file size targets, partition schemes, and SLA requirements, maintaining per-table configuration in scheduler scripts becomes unwieldy.

Healthy vs Unhealthy Iceberg tables — the impact of maintenance on query performance — The contrast between unmaintained and well-maintained Iceberg tables — fragmented files and slow queries vs. compacted data with fast, consistent performance.

Physical layout optimization

The physical layout of data within Iceberg files — how rows are sorted across and within files — is the single most impactful optimization for Trino query performance. Layout determines whether min/max statistics enable file pruning or are useless, whether row group skipping works or does not, and how much data Trino must read to answer a query.

Sort orders

Iceberg tables support a sorted_by table property that determines how data is arranged within partitions. When data is compacted, the compaction engine re-sorts files according to this order.

For Trino workloads where queries predominantly filter on one column (e.g., customer_id), a single-column sort order is optimal:

sql

1ALTER TABLE lakehouse.analytics.events2  SET PROPERTIES sorted_by = ARRAY['customer_id'];

After setting the sort order, the next compaction pass will physically re-sort the data. Each file will cover a contiguous range of customer_id values, and min/max statistics will reflect tight, non-overlapping ranges. A query filtering on customer_id will prune the vast majority of files.

Multi-column sort orders

When queries filter on multiple columns, a composite sort order prioritizes the most selective column first:

sql

1ALTER TABLE lakehouse.analytics.events2  SET PROPERTIES sorted_by = ARRAY['customer_id', 'event_type'];

The sort hierarchy matters — filtering on customer_id alone still benefits from pruning. Filtering on event_type alone does not, because the secondary sort key is only ordered within each primary key group. Choose the column that appears in the most queries as the primary sort key.

Z-order for multi-dimensional queries

When queries filter on multiple columns with roughly equal selectivity, Z-ordering interleaves the sort dimensions so that filtering on any subset of the Z-ordered columns produces spatial locality. This is useful when there is no single dominant filter column — for example, tables queried by both region and timestamp in different dashboards.

Z-order trades optimal single-column pruning for good multi-column pruning. If one column dominates 90% of queries, a linear sort on that column will outperform Z-order. If two or three columns each appear in 30–40% of queries, Z-order is the better choice.

Choosing the right strategy

The optimal layout depends on actual query patterns — which columns appear in WHERE, JOIN, and GROUP BY clauses, and how frequently. In practice, most teams set a sort order at table creation and never revisit it, even as query patterns evolve. The gap between the current layout and the optimal layout grows silently, manifesting as gradual query slowdowns that are hard to diagnose because the data volume is also growing.

This is where production telemetry becomes essential. Without visibility into how each table is actually queried — across all engines, not just Trino — layout decisions are guesswork.

How LakeOps enhances Trino-based Iceberg lakehouses

LakeOps is a dedicated lakehouse control plane built in Rust on Apache DataFusion. It connects to your existing Iceberg catalogs and query engines — including Trino — without code changes, infrastructure provisioning, or data movement. For Trino-centric lakehouses, it addresses every maintenance limitation outlined above while adding capabilities no engine-embedded approach can provide.

Modern Lakehouse Architecture — multi-engine routing through LakeOps — LakeOps as a dedicated control plane: Spark, Trino, Flink, Snowflake, Athena, and DuckDB all connect through the platform, which coordinates maintenance and optimization with full workload visibility across every engine.

LakeOps in action — catalog connection, table health analysis, and autonomous optimization.

Key capabilities for Trino-based lakehouses:

Cross-engine query telemetry — collects query patterns from Trino, Spark, Flink, Athena, Snowflake, and DuckDB. Tracks which columns appear in WHERE, JOIN, and GROUP BY clauses per table. Sort orders reflect how Trino actually queries tables — not how a data engineer guessed six months ago.
Query-aware sort optimization — evaluates single-column sort, multi-column sort, and Z-order strategies independently per table. Selects the approach that maximizes data skipping for the actual query mix from all engines, not a global default.
Layout simulations — tests proposed sort changes on a real Iceberg branch, replaying production queries and comparing scan reduction before modifying any production data.
Autonomous maintenance off-cluster — runs compaction, snapshot expiration, orphan cleanup, and manifest optimization on its own Rust engine, not on your Trino cluster. Interactive query performance is unaffected by background maintenance.
Rust execution engine — Parquet data processed through Arrow columnar buffers with bounded memory and lock-free parallelism. Binpack compaction 86% faster than Spark in production benchmarks — no JVM overhead, no GC pauses.
Coordinated maintenance pipeline — snapshot expiration → orphan cleanup → compaction → manifest optimization, sequenced per table. Triggered by structural signals (file count, delete-file ratio, manifest depth), not cron schedules.
Lake-wide health classification — every table scored as Critical, Warning, or Healthy. Severity-ranked Insights surface issues (excessive manifests, partition skew, small files) before Trino queries degrade. See table health maintenance and observability.
Multi-engine conflict avoidance — maintenance scheduling respects active writes from Flink, Spark, and other engines, preventing conflicts when Trino's optimize would collide with concurrent operations.

LakeOps Dashboard — lake-wide optimization and table health — The LakeOps Dashboard: 30-day optimization activity, query acceleration, cost savings, and table health — Critical, Warning, and Healthy — with recent maintenance operations at a glance.

LakeOps Layout Simulations — testing sort strategies against real query patterns — Layout Simulations: field access frequency from real queries per column, candidate sort configurations compared against the baseline, and projected impact for each approach.

LakeOps compaction benchmarks — Rust engine vs Spark performance — Compaction benchmarks: LakeOps Rust engine vs Spark — binpack and sort times, throughput, and file reduction metrics.

LakeOps Insights — proactive alerts driven by table and query signals — Lake-wide Insights: severity-ranked alerts for partition file issues, excessive manifests, snapshot sprawl, and small file accumulation — signals that Trino cannot surface on its own.

Best practices checklist

Use this checklist to systematically optimize your Trino-on-Iceberg deployment:

Query planning and pruning:

Partition tables on the highest-cardinality column used in WHERE clauses (typically a date or timestamp)
Verify partition pruning is active with EXPLAIN for your most frequent queries
Ensure Iceberg column statistics are enabled and up-to-date — stale or missing stats disable file pruning
Sort data on the columns most frequently used in predicates to tighten min/max ranges
Choose between single-column sort (dominant filter) and Z-order (multiple roughly-equal filters) based on actual query patterns

Trino configuration:

Enable dynamic filtering for join-heavy workloads
Set task.concurrency to 16–32 for I/O-bound S3 scans
Enable iceberg.table-statistics-enabled and iceberg.projection-pushdown-enabled for cost-based optimization
Set iceberg.query-partition-filter-required=true to prevent accidental full-table scans
Size worker memory to support your target split concurrency — 16 GB per node with node-scheduler.max-splits-per-node=256 is a solid baseline
Monitor query plans for full-table scans that bypass pruning

Table maintenance:

Run snapshot expiration before compaction — free unreferenced files first so compaction does not rewrite data that will be deleted
Run orphan cleanup after snapshot expiration to reclaim storage from dereferenced files
Target file sizes of 128–256 MB for point-lookup tables, 256–512 MB for full-scan analytics
Do not run optimize on the same Trino cluster serving interactive queries — use a dedicated cluster or offload to a purpose-built engine
Revisit sort orders quarterly, or use telemetry-driven tooling to detect when access patterns have shifted

At scale — use a dedicated control plane:

Replace cron-scheduled maintenance with event-driven triggers based on table structural signals
Use a dedicated maintenance engine (like LakeOps) to avoid JVM overhead and query contention — maintenance should not compete with Trino for cluster resources
Collect query telemetry across all engines — not just Trino — to inform sort order decisions. A control plane that sees Spark, Flink, Athena, and Trino traffic makes layout decisions no single engine can
Classify table health lake-wide (Critical → Warning → Healthy) and address the most degraded tables first
Automate the full maintenance pipeline — snapshot expiration, orphan cleanup, compaction, manifest optimization — as a coordinated sequence, not independent cron jobs

Conclusion

Trino is a powerful query engine for Apache Iceberg lakehouses, but getting the best performance requires work beyond writing SQL. Partition pruning, min/max statistics, sorted file layouts, and proper Trino configuration form the foundation. Trino-native maintenance procedures handle basic compaction and cleanup but break down at scale — JVM overhead, no cross-engine awareness, no coordinated sequencing, no event-driven triggers.

For production lakehouses with hundreds of tables and multiple query engines, a dedicated control plane bridges the gap. LakeOps collects query telemetry from Trino and every other engine, applies sort orders based on actual access patterns, runs maintenance on a purpose-built Rust engine without touching your Trino cluster, and coordinates the full maintenance pipeline autonomously. The result: Trino queries run faster, storage costs drop, and platform teams stop maintaining maintenance infrastructure.

Apache Iceberg with Trino: Performance Optimization Guide

How Trino queries Iceberg tables

Metadata resolution

Scan planning and predicate pushdown

Query performance tuning

Partition pruning

Min/max column statistics and file pruning

Row group skipping

Dynamic filtering

Trino-side configuration

Worker memory and split sizes

Task concurrency

Iceberg-specific connector properties

Table maintenance with Trino

Compaction — rewriting data files

Snapshot expiration

Orphan file removal

Manifest optimization

Limitations of Trino-native maintenance

Physical layout optimization

Sort orders

Multi-column sort orders

Z-order for multi-dimensional queries

Choosing the right strategy

How LakeOps enhances Trino-based Iceberg lakehouses

Best practices checklist

Conclusion

Tags

Related articles

Apache Iceberg at Scale: Infrastructure, Performance, and Enterprise Lessons

Apache Iceberg Operational Runbook: Incidents, Symptoms, and Fixes

Automating Apache Iceberg Table Maintenance