Layout Simulations

LakeOps lets you run layout simulations on any table to preview the impact of different compaction strategies, partition schemes, and sort orders before applying them. Test changes safely, compare results, and apply only when you're confident.

Why simulate before applying?

Changing a table's file layout (partition scheme, sort order, compaction strategy) can dramatically improve or degrade query performance depending on your workload. Simulations let you run proposed changes on a real branch of your data and see actual results — without affecting production.

  • Risk-free: runs on an isolated branch, so no production data is modified
  • Real results: operates on actual data and query patterns, not theoretical models or sampling
  • Comparable: run multiple strategies side by side to find the best one
  • Fast feedback: results are typically available in seconds to minutes, not hours

How simulations work

LakeOps simulations operate on a real Iceberg branch of your table data — not a theoretical model. When you run a simulation, LakeOps:

  • Creates an isolated Iceberg branch from the current table snapshot, so the simulation has access to real metadata, file statistics, and partition structures
  • Applies the proposed layout change (partition scheme, sort order, cluster keys) to the branch, producing real output metrics — actual file counts, actual average sizes, actual execution times
  • Discards the branch after analysis — no changes are committed to the main table, and no data files are permanently written

Because the simulation runs against real data (not sampling or estimation), the results reflect exactly what would happen if you applied the change to production.
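The branch, apply, discard lifecycle described above can be sketched with a toy in-memory model. This is only an illustration: ToyTable, simulation_branch, and compact are hypothetical names, not the LakeOps or Iceberg API, and file sizes are plain integers in MB.

```python
from contextlib import contextmanager

class ToyTable:
    """In-memory stand-in for a table with named branches (hypothetical)."""
    def __init__(self, files):
        self.branches = {"main": list(files)}

@contextmanager
def simulation_branch(table, name="sim"):
    # Branch from the current main state: copy the file list, not the data.
    table.branches[name] = list(table.branches["main"])
    try:
        yield table.branches[name]
    finally:
        # Always discard the branch, even if the simulation fails.
        del table.branches[name]

def compact(files, target_mb):
    """Greedily pack file sizes (MB) into outputs of at most target_mb."""
    out, current = [], 0
    for size in sorted(files, reverse=True):
        if current and current + size > target_mb:
            out.append(current)
            current = 0
        current += size
    if current:
        out.append(current)
    return out

table = ToyTable([64, 32, 16, 8, 200, 120])
with simulation_branch(table) as branch:
    # Apply the proposed change on the branch only, then collect metrics.
    branch[:] = compact(branch, target_mb=256)
    metrics = {"file_count": len(branch), "avg_mb": sum(branch) / len(branch)}
```

After the with block exits, only the main branch remains and its file list is unchanged. The metrics dictionary is the only thing that survives the simulation.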

Query-aware simulation

What makes LakeOps simulations unique is that they understand how your data is actually queried. LakeOps continuously tracks every query that touches a table and builds a field access frequency profile that breaks down each column by operation type:

Operation | What it measures | Why it matters for layout
SELECT | How often each column appears in query projections | Identifies hot columns that benefit from efficient file layout
FILTER | How often each column is used in WHERE clauses and predicates | High-filter columns are strong candidates for sort order (enables data skipping)
JOIN Rows | How often each column is used as a join key across tables | High-join columns benefit from clustering to co-locate related rows

The field access frequency chart shows access counts per column (up to millions of accesses), so you can immediately see which columns dominate your workload. For example, if created_at has the highest SELECT and FILTER frequency, partitioning by that column and sorting within partitions is likely to yield the largest performance gain.
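To make the idea of a field access frequency profile concrete, here is a minimal sketch that counts column usage per operation type. The query_log structure and the build_profile helper are assumptions made for illustration. This page does not describe how LakeOps collects or stores this data.

```python
from collections import Counter

# Hypothetical query log: each entry lists the columns a query projected,
# filtered on, and joined on.
query_log = [
    {"select": ["created_at", "amount"], "filter": ["created_at"], "join": []},
    {"select": ["customer_id"], "filter": ["created_at", "order_status"],
     "join": ["customer_id"]},
    {"select": ["created_at"], "filter": ["created_at"], "join": ["customer_id"]},
]

def build_profile(log):
    """Count how often each column appears per operation type."""
    profile = {"select": Counter(), "filter": Counter(), "join": Counter()}
    for query in log:
        for op, counter in profile.items():
            counter.update(query.get(op, []))
    return profile

profile = build_profile(query_log)
top_filter = profile["filter"].most_common(1)[0]  # strongest sort candidate
top_join = profile["join"].most_common(1)[0]      # strongest cluster candidate
```

In this toy log, created_at is the most filtered column and customer_id the most joined, which mirrors the kind of reading the chart supports.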

Query pattern sidebar

Next to the field access chart, sidebar cards show real query patterns from your workload with query counts (e.g. “17 queries”, “41 queries”). These include:

  • Sort key configurations — showing the partition, cluster, and sort columns with their relationships
  • Sample queries — actual SQL patterns like time-range scans with specific predicates
  • Policy contracts — optimization scope and target contract for the table

This means simulations don't just predict file sizes — they predict how the layout change will affect your actual queries based on real access patterns.

What simulations produce

Each simulation produces concrete, measurable results:

File layout impact

  • Expected output file count
  • Average file size after compaction
  • Execution time
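File count and average file size are linked by simple arithmetic, which can serve as a sanity check on the reported numbers. The sketch below is a back-of-envelope estimate, not how LakeOps computes its results. The estimate_layout function and the 512 MB target are assumptions, and real output also depends on partition boundaries.

```python
import math

def estimate_layout(total_bytes, target_file_bytes):
    """Rough estimate of file count and average size for a given target."""
    file_count = math.ceil(total_bytes / target_file_bytes)
    avg_bytes = total_bytes / file_count
    return file_count, avg_bytes

MB = 1024 ** 2
# Assumed inputs: 10,000 MB of data compacted toward 512 MB files.
count, avg = estimate_layout(total_bytes=10_000 * MB, target_file_bytes=512 * MB)
```

If a simulation reports a file count far above such an estimate, small partitions are a likely cause, since each partition produces at least one file.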

Query performance impact

  • Data skipping effectiveness
  • Scan reduction per query pattern
  • Planning latency improvement

Data relationships

  • Partition key effectiveness
  • Cluster column correlation
  • Sort order alignment with filters

Field access frequency

  • Which columns are used in SELECT
  • Which columns are used in FILTER
  • Which columns are used in JOIN

Two ways to run simulations

Simulations tab

Navigate to Explore > select a table > Simulations tab. Run full layout simulations with custom partition, cluster, and sort configurations. Compare multiple strategies side by side.

Simulate button in the Optimization tab

In the Optimization tab, every configuration card (File Compaction, Snapshot Retention, Orphan Files Cleanup, Rewrite Manifests) has a Simulate button. Click it to preview the impact of that specific operation before saving.

Running a simulation

From the Simulations tab:

1. Click Run Sim to create a new simulation.
2. Configure the layout: choose partition keys (e.g. order_date daily), cluster columns (e.g. customer_id, order_status), and sort order.
3. Give the simulation a descriptive name (e.g. clusterByOrderDate).
4. Run the simulation. LakeOps analyzes the table's current state, your query patterns, and the proposed layout to produce results.
5. Review the simulation card showing real execution time, expected file size, and data relationship analysis.

Click Play Back on any completed simulation to replay its analysis and review results again.

Simulation results

Each completed simulation appears as a card (marked with a ✓ indicator) showing:

Field | Description
Simulation name | Your descriptive name for the strategy (e.g. clusterByOrderDate)
Description | Partition, cluster, and sort configuration breakdown (e.g. “Partition: order_date → daily buckets. Cluster by customer_id, order_status.”)
Strategy columns | The primary data relationship columns used (e.g. customer_id, order_status)
Execution time | Real measured time from executing the layout change on the branch (e.g. 141.6s)

Simulations are tagged with labels (meta, layout) for organization. The tag badges and a simulation count appear above the cards.

Reading the frequency chart

The field access frequency chart (described in detail in the Query-aware simulation section above) is displayed below the simulation cards. Use it to validate your strategy choices:

  • If a column has the longest FILTER bar, it's the strongest candidate for sort order — sorting on it maximizes data skipping
  • If a column has high JOIN Rows frequency, clustering by it will co-locate related rows and improve join performance
  • Columns with minimal bars across all three categories can usually be ignored when choosing layout keys
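Why sorting on a heavily filtered column improves data skipping can be shown with a toy model of per-file min/max statistics. This is a simplified sketch, not LakeOps or Iceberg code. The files_scanned function is hypothetical, and each "file" is just a slice of ten values.

```python
def files_scanned(values, rows_per_file, lo, hi):
    """Count files whose [min, max] range overlaps the predicate lo <= x <= hi."""
    files = [values[i:i + rows_per_file]
             for i in range(0, len(values), rows_per_file)]
    return sum(1 for f in files if min(f) <= hi and max(f) >= lo)

# Deterministic, interleaved values stand in for an unsorted column.
unsorted_vals = [(i * 37) % 100 for i in range(100)]
sorted_vals = sorted(unsorted_vals)

# Same data, same predicate (40 <= x <= 49), different physical order.
unsorted_scan = files_scanned(unsorted_vals, rows_per_file=10, lo=40, hi=49)
sorted_scan = files_scanned(sorted_vals, rows_per_file=10, lo=40, hi=49)
```

With the unsorted layout every file's value range overlaps the predicate, so all 10 files must be read. With the sorted layout only 1 file qualifies, and the other 9 can be skipped using file statistics alone.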

The subtitle beneath the chart reads “How this table's fields correlate — the foundation for choosing the right layout strategy.” This is the core input that makes simulations query-aware: the chart tells you what matters, and the simulation tells you how much it helps.

Comparing simulations

The Layout Customization Diff table at the bottom of the Simulations tab lets you compare multiple simulations side by side. Each row shows a tag, simulation name, data_rel (data relationships), strategy, and average file size.

Example comparison

Tag | Simulation | data_rel | Strategy | Avg Size
meta | clusterByOrderDate | customer_id, order_status | order_date (day) | 343 MB / file
meta | cluster.order_type.by.status | order_status, payment_method | order_status, payment_method | 511 MB / file
layout | cluster.insert-time-line | customer_id, store_id | created_at (hour) | 128 MB / file
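One way to reason about a comparison like this is to rank simulations by how close their average file size is to a target. The sketch below uses the values from the example table. The SimResult class, the rank_by_target helper, and the 512 MB target are assumptions for illustration, not LakeOps features or defaults.

```python
from dataclasses import dataclass

@dataclass
class SimResult:
    tag: str
    name: str
    strategy: str
    avg_file_mb: int

results = [
    SimResult("meta", "clusterByOrderDate", "order_date (day)", 343),
    SimResult("meta", "cluster.order_type.by.status",
              "order_status, payment_method", 511),
    SimResult("layout", "cluster.insert-time-line", "created_at (hour)", 128),
]

TARGET_MB = 512  # assumed target; use the value your compaction policy sets

def rank_by_target(sims, target_mb):
    """Order simulations by distance of average file size from the target."""
    return sorted(sims, key=lambda s: abs(s.avg_file_mb - target_mb))

ranked = rank_by_target(results, TARGET_MB)
best = ranked[0].name
```

File size is only one signal. A layout whose file sizes are close to the target can still be a poor fit if its strategy columns do not match the columns that dominate the FILTER and JOIN bars in the frequency chart.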

Applying simulation results

Once you've identified the optimal layout strategy through simulation:

  • Switch to the Optimization tab and configure compaction with the winning strategy
  • Set the partition scheme and sort order to match your simulation
  • Enable Auto mode for continuous optimization, or trigger a Manual run
  • Monitor results in the Events and Metrics tabs