Layout Simulations

LakeOps lets you run layout simulations on any table to preview the impact of different compaction strategies, partition schemes, and sort orders before applying them. Test changes safely, compare results, and apply only when you're confident.

Why simulate before applying?

Changing a table's file layout (partition scheme, sort order, compaction strategy) can dramatically improve or degrade query performance depending on your workload. Simulations let you run proposed changes on a real branch of your data and see actual results — without affecting production.

•Risk-free: runs on an isolated branch, so no production data is modified
•Real results: operates on actual data and query patterns, not theoretical models or sampling
•Comparable: run multiple strategies side by side to find the best one
•Fast feedback: results are typically available in seconds to minutes, not hours

How simulations work

LakeOps simulations operate on a real Iceberg branch of your table data — not a theoretical model. When you run a simulation, LakeOps:

•Creates an isolated Iceberg branch from the current table snapshot, so the simulation has access to real metadata, file statistics, and partition structures
•Applies the proposed layout change (partition scheme, sort order, cluster keys) to the branch, producing real output metrics — actual file counts, actual average sizes, actual execution times
•Discards the branch after analysis — no changes are committed to the main table, and no data files are permanently written

Because the simulation runs against real data (not sampling or estimation), the results reflect exactly what would happen if you applied the change to production.
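The branch lifecycle described above can be sketched roughly as follows. This is a minimal illustration, not LakeOps's actual implementation: it assumes the branch is managed with Apache Iceberg's Spark SQL branching statements (ALTER TABLE ... CREATE BRANCH and DROP BRANCH), and the table name db.orders, the branch name, and the run_sql callable are all hypothetical. A plain list stands in for a SQL engine so the sketch runs on its own.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def simulation_branch(
    table: str, branch: str, run_sql: Callable[[str], None]
) -> Iterator[str]:
    """Create an isolated branch, yield its name, and always drop it afterwards."""
    run_sql(f"ALTER TABLE {table} CREATE BRANCH `{branch}`")
    try:
        yield branch
    finally:
        # The branch is discarded even if the analysis step raises.
        run_sql(f"ALTER TABLE {table} DROP BRANCH `{branch}`")


# Stand-in executor (assumption): records statements instead of running them.
executed: list[str] = []

with simulation_branch("db.orders", "sim_cluster_by_order_date", executed.append) as branch:
    # Placeholder for the real work: rewrite the layout on the branch, collect metrics.
    executed.append(f"-- rewrite layout on branch {branch} and collect metrics")
```

Wrapping the lifecycle in a context manager mirrors the guarantee stated above: the branch is dropped in every case, so the main table is never touched.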

Query-aware simulation

What makes LakeOps simulations unique is that they understand how your data is actually queried. LakeOps continuously tracks every query that touches a table and builds a field access frequency profile that breaks down each column by operation type:

Operation	What it measures	Why it matters for layout
SELECT	How often each column appears in query projections	Identifies hot columns that benefit from efficient file layout
FILTER	How often each column is used in WHERE clauses and predicates	High-filter columns are strong candidates for sort order (enables data skipping)
JOIN Rows	How often each column is used as a join key across tables	High-join columns benefit from clustering to co-locate related rows

The chart visualizes access counts per column (up to millions of accesses), so you can immediately see which columns dominate your workload. For example, if created_at has the highest SELECT and FILTER frequency, partitioning by that column and sorting on it within each partition is likely to yield the largest performance gain.
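To make the idea of a field access frequency profile concrete, here is a small sketch of how such a profile could be built from a parsed query log. The log format, the column names, and the build_access_profile function are assumptions made for illustration; the source does not describe how LakeOps collects or stores this data.

```python
from collections import Counter, defaultdict

# Hypothetical parsed query log: for each query, the columns it projected,
# filtered on, and joined on.
query_log = [
    {"select": ["order_id", "created_at"], "filter": ["created_at"], "join": []},
    {"select": ["customer_id", "created_at"], "filter": ["created_at", "order_status"], "join": ["customer_id"]},
    {"select": ["order_id"], "filter": ["order_status"], "join": ["customer_id"]},
    {"select": ["created_at"], "filter": ["created_at"], "join": []},
]


def build_access_profile(log):
    """Return {column: Counter({operation: number of queries using it})}."""
    profile = defaultdict(Counter)
    for query in log:
        for operation, columns in query.items():
            for column in set(columns):  # count each column once per query
                profile[column][operation] += 1
    return profile


profile = build_access_profile(query_log)

# The column with the highest FILTER count is the first candidate to inspect.
top_filter = max(profile, key=lambda col: profile[col]["filter"])
```

In this toy log, created_at is filtered on in three of the four queries, so it comes out as top_filter.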

Query pattern sidebar

Next to the field access chart, sidebar cards show real query patterns from your workload with query counts (e.g. “17 queries”, “41 queries”). These include:

•Sort key configurations — showing the partition, cluster, and sort columns with their relationships
•Sample queries — actual SQL patterns like time-range scans with specific predicates
•Policy contracts — optimization scope and target contract for the table

This means simulations don't just predict file sizes — they predict how the layout change will affect your actual queries based on real access patterns.

What simulations produce

Each simulation produces concrete, measurable results:

File layout impact

• Expected output file count
• Average file size after compaction
• Execution time

Query performance impact

• Data skipping effectiveness
• Scan reduction per query pattern
• Planning latency improvement
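As a rough illustration of what data skipping and scan reduction mean, the sketch below counts how many files a range predicate must read, based on per-file minimum and maximum values for the filter column. The file statistics and the scan-reduction formula are assumptions for illustration only; the source does not define how LakeOps computes these metrics.

```python
def files_scanned(file_ranges, lo, hi):
    """Count files whose [min, max] range overlaps the predicate range [lo, hi]."""
    return sum(1 for f_min, f_max in file_ranges if f_max >= lo and f_min <= hi)


# Hypothetical per-file (min, max) statistics for a filter column with values 0..99.
unsorted_files = [(0, 99), (3, 97), (1, 95), (5, 98)]   # each file spans almost every value
sorted_files = [(0, 24), (25, 49), (50, 74), (75, 99)]  # sorting yields disjoint ranges

predicate = (30, 40)  # e.g. WHERE col BETWEEN 30 AND 40

before = files_scanned(unsorted_files, *predicate)
after = files_scanned(sorted_files, *predicate)

# Fraction of file reads avoided by the sorted layout for this query pattern.
scan_reduction = 1 - after / before
```

With the unsorted layout every file overlaps the predicate, while the sorted layout needs only one file, so three of the four file reads are skipped.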

Data relationships

• Partition key effectiveness
• Cluster column correlation
• Sort order alignment with filters

Field access frequency

• Which columns are used in SELECT
• Which columns are used in FILTER
• Which columns are used in JOIN

Simulations vs. manual execution

Simulations tab

Navigate to Data > Explore > select a table > Simulations tab. Run full layout simulations with custom partition, cluster, and sort configurations on an isolated branch. Compare multiple strategies side by side — no production data is modified.

Optimization tab — Execute

The Execute button on each operation card in the Optimization tab runs the actual operation once on production data. Use it to verify results before enabling a cron schedule. This is not a simulation — changes are applied immediately.

Running a simulation

From the Simulations tab:

1. Click Run Sim to create a new simulation.

2. Configure the layout: choose partition keys (e.g. order_date daily), cluster columns (e.g. customer_id, order_status), and sort order.

3. Give the simulation a descriptive name (e.g. clusterByOrderDate).

4. Run the simulation. LakeOps analyzes the table's current state, your query patterns, and the proposed layout to produce results.

5. Review the simulation card showing real execution time, expected file size, and data relationship analysis.

Click Play Back on any completed simulation to replay its analysis and review results again.
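It can help to think of the choices made in steps 2 and 3 as one small configuration object. The sketch below is purely illustrative: SimulationConfig and its fields are hypothetical names, not a LakeOps API, and the simulation itself is configured through the UI as described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulationConfig:
    """Hypothetical container for the layout choices made in the Simulations tab."""

    name: str
    partition_column: str
    partition_granularity: str  # e.g. "daily" or "hourly"
    cluster_columns: tuple[str, ...] = ()
    sort_columns: tuple[str, ...] = ()

    def describe(self) -> str:
        """Build a one-line summary of the partition, cluster, and sort choices."""
        parts = [f"Partition: {self.partition_column} → {self.partition_granularity} buckets."]
        if self.cluster_columns:
            parts.append(f"Cluster by {', '.join(self.cluster_columns)}.")
        if self.sort_columns:
            parts.append(f"Sort by {', '.join(self.sort_columns)}.")
        return " ".join(parts)


config = SimulationConfig(
    name="clusterByOrderDate",
    partition_column="order_date",
    partition_granularity="daily",
    cluster_columns=("customer_id", "order_status"),
)
summary = config.describe()
```

Writing the configuration down this way makes it easy to keep several candidate layouts side by side before running them.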

Simulation results

Each completed simulation appears as a card (marked with a ✓ indicator) showing:

Field	Description
Simulation name	Your descriptive name for the strategy (e.g. clusterByOrderDate)
Description	Partition, cluster, and sort configuration breakdown (e.g. “Partition: order_date → daily buckets. Cluster by customer_id, order_status.”)
Strategy columns	The primary data relationship columns used (e.g. customer_id, order_status)
Execution time	Real measured time from executing the layout change on the branch (e.g. 141.6s)

Simulations are tagged with labels (meta, layout) for organization. The tag badges and a simulation count appear above the cards.

Reading the frequency chart

The field access frequency chart (described in detail in the Query-aware simulation section above) is displayed below the simulation cards. Use it to validate your strategy choices:

•If a column has the longest FILTER bar, it's the strongest candidate for sort order — sorting on it maximizes data skipping
•If a column has high JOIN Rows frequency, clustering by it will co-locate related rows and improve join performance
•Columns with minimal bars across all three categories can usually be ignored when choosing layout keys

The subtitle beneath the chart reads “How this table's fields correlate — the foundation for choosing the right layout strategy.” This is the core input that makes simulations query-aware: the chart tells you what matters, and the simulation tells you how much it helps.
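The three rules of thumb for reading the chart can be written down as a short heuristic. This is a sketch under stated assumptions: the access counts are invented, and the suggest_layout_keys function and its 5% cutoff for "minimal" columns are hypothetical choices, not part of LakeOps.

```python
def _top(columns, operation):
    """Return the column with the highest count for an operation, or None if all are zero."""
    best = max(columns, key=lambda c: columns[c].get(operation, 0), default=None)
    if best is None or columns[best].get(operation, 0) == 0:
        return None
    return best


def suggest_layout_keys(profile, min_share=0.05):
    """profile maps column -> {"select": n, "filter": n, "join": n}."""
    total = sum(sum(counts.values()) for counts in profile.values()) or 1
    # Ignore columns whose bars are minimal across all three categories.
    active = {c: v for c, v in profile.items() if sum(v.values()) / total >= min_share}
    return {
        "sort": _top(active, "filter"),   # longest FILTER bar
        "cluster": _top(active, "join"),  # highest JOIN frequency
        "ignored": sorted(set(profile) - set(active)),
    }


# Hypothetical access counts, as they might be read off the chart.
profile = {
    "created_at": {"select": 900, "filter": 1200, "join": 0},
    "customer_id": {"select": 400, "filter": 150, "join": 800},
    "order_status": {"select": 300, "filter": 500, "join": 20},
    "notes": {"select": 15, "filter": 0, "join": 0},
}
suggestion = suggest_layout_keys(profile)
```

A heuristic like this only produces candidates; the simulation is still what measures how much each candidate actually helps.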

Comparing simulations

The Layout Customization Diff table at the bottom of the Simulations tab lets you compare multiple simulations side by side. Each row shows a tag, simulation name, data_rel (data relationships), strategy, and average file size.

Example comparison

Tag	Simulation	data_rel	Strategy	Avg Size
meta	clusterByOrderDate	customer_id, order_status	order_date (day)	343 MB / file
meta	cluster.order_type.by.status	order_status, payment_method	order_status, payment_method	511 MB / file
layout	cluster.insert-time-line	customer_id, store_id	created_at (hour)	128 MB / file
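One simple way to read a comparison like this is to rank the rows by how close their average file size is to a target. The sketch below uses the three rows from the example table; the 512 MB target is an assumption (it matches Apache Iceberg's default write.target-file-size-bytes), and rank_by_target_size is a hypothetical helper. File size is only one signal, so treat the ranking as a starting point rather than a verdict.

```python
# Rows copied from the example comparison table above.
simulations = [
    {"tag": "meta", "name": "clusterByOrderDate",
     "data_rel": ["customer_id", "order_status"],
     "strategy": "order_date (day)", "avg_size_mb": 343},
    {"tag": "meta", "name": "cluster.order_type.by.status",
     "data_rel": ["order_status", "payment_method"],
     "strategy": "order_status, payment_method", "avg_size_mb": 511},
    {"tag": "layout", "name": "cluster.insert-time-line",
     "data_rel": ["customer_id", "store_id"],
     "strategy": "created_at (hour)", "avg_size_mb": 128},
]

TARGET_MB = 512  # assumption: desired average file size


def rank_by_target_size(rows, target_mb):
    """Sort simulations by distance between average file size and the target."""
    return sorted(rows, key=lambda row: abs(row["avg_size_mb"] - target_mb))


ranked = rank_by_target_size(simulations, TARGET_MB)
ranked_names = [row["name"] for row in ranked]
```

Under this criterion, cluster.order_type.by.status ranks first because its 511 MB average is closest to the target, while the hourly strategy produces the smallest files and ranks last.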

Applying simulation results

Once you've identified the optimal layout strategy through simulation:

•Switch to the Optimization tab and configure compaction with the winning strategy
•Set the partition scheme and sort order to match your simulation
•Enable scheduled compaction, or click Execute for an immediate run
•Monitor results in the Events and Metrics tabs
