Compaction

LakeOps includes a Rust-based compaction engine for Apache Iceberg that analyzes query patterns and access frequency to optimize file layout at scale. It runs more compactions in less time with a minimal resource footprint, so your lake stays performant without blocking writes or queries.

Why compaction matters

Apache Iceberg tables accumulate small files over time from streaming ingestion, frequent appends, and partial updates. These small files degrade query performance because engines must open and scan each file individually, increasing I/O overhead and planning time.

Compaction merges these files into optimally sized data files. The result: fewer files to open and scan, less metadata to plan over, and faster queries across every connected engine.

Symptoms of poor compaction

• Thousands of small files per partition (well below target file size)
• Query planning time growing with each ingestion cycle
• High S3/GCS request costs from excessive LIST and GET operations
• LakeOps Insights flagging “Partition Data Files” warnings

How LakeOps compaction works

LakeOps compaction operates directly on your Iceberg metadata and storage. It:

• Reads table metadata to identify partitions with suboptimal file counts or sizes.
• Analyzes query patterns to determine which columns are filtered and joined on most, informing sort-order decisions.
• Rewrites data files using the Rust compaction engine, merging small files into target-sized outputs without blocking concurrent readers or writers.
• Commits atomically via Iceberg's optimistic concurrency, so the table is never in an inconsistent state.
• Logs every operation in the Events tab with before/after file counts, duration, and data volume.
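
The first step above, finding partitions with too many undersized files, can be sketched as follows. This is a minimal illustration, not LakeOps's implementation: the DataFile record, the small_ratio threshold, and the min_small_files cutoff are hypothetical names and values chosen for the example.

```python
from collections import defaultdict
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for one data file entry in table metadata."""
    partition: str
    size_bytes: int


def partitions_needing_compaction(files, target_bytes=512 * MB,
                                  small_ratio=0.5, min_small_files=2):
    """Return partitions holding at least `min_small_files` undersized files.

    A file counts as undersized when it is smaller than
    small_ratio * target_bytes (an assumed threshold).
    """
    small_counts = defaultdict(int)
    for f in files:
        if f.size_bytes < small_ratio * target_bytes:
            small_counts[f.partition] += 1
    return sorted(p for p, n in small_counts.items() if n >= min_small_files)


files = [
    DataFile("day=1", 10 * MB),
    DataFile("day=1", 20 * MB),
    DataFile("day=2", 500 * MB),
    DataFile("day=3", 5 * MB),
]
candidates = partitions_needing_compaction(files)  # ["day=1"]
```

Only day=1 qualifies here: day=2 is already near the target, and day=3 has a single small file with nothing to merge it with.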

Compaction strategies

LakeOps supports two compaction strategies, configurable per table or via organization-wide policies:

Binpack

Optimize for size

Combines small files into larger files targeting the configured file size (default: 512 MB). Does not reorder data within files.

Best for

• Tables with heavy append workloads (streaming ingestion)
• Reducing file count as the primary goal
• Fast execution with minimal compute overhead
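
To make the idea behind Binpack concrete, here is a minimal sketch that groups files by size alone using first-fit decreasing, a standard bin-packing heuristic. It is an assumption for illustration; the grouping algorithm LakeOps actually uses is not described here, and the function name and MB-based inputs are hypothetical.

```python
def binpack(sizes_mb, target_mb=512):
    """Group file sizes into bins whose totals do not exceed target_mb.

    Uses first-fit decreasing: place each file, largest first, into the
    first bin that still has room. A file larger than the target ends up
    alone in its own bin.
    """
    bins = []    # list of lists of file sizes
    totals = []  # running total per bin
    for size in sorted(sizes_mb, reverse=True):
        for i, total in enumerate(totals):
            if total + size <= target_mb:
                bins[i].append(size)
                totals[i] += size
                break
        else:
            # No existing bin has room, so start a new one.
            bins.append([size])
            totals.append(size)
    return bins


groups = binpack([300, 200, 200, 100, 12])
# [[300, 200, 12], [200, 100]]
```

Each group would become one output file. Rows are copied as they are, which is why this approach needs no sorting and stays cheap to run.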

Sort

Optimize for queries

Rewrites data files sorted by the columns your queries filter on most. Sorting narrows the value range each file covers, which enables data skipping and reduces the data scanned per query.

Best for

• Query-heavy tables with predictable filter patterns
• Large analytical tables where scan reduction matters
• Tables where you know the most-queried columns
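
The effect of sorting on data skipping can be shown with a toy model. The sketch below assumes an engine prunes a file whenever the filter range falls outside that file's minimum and maximum value for the column; the function name and the fixed rows-per-file split are hypothetical simplifications.

```python
def files_scanned(values, rows_per_file, lo, hi):
    """Count files whose [min, max] range overlaps the filter range [lo, hi].

    `values` is one column laid out in storage order and split into
    consecutive files of `rows_per_file` rows each.
    """
    scanned = 0
    for start in range(0, len(values), rows_per_file):
        chunk = values[start:start + rows_per_file]
        if min(chunk) <= hi and max(chunk) >= lo:
            scanned += 1
    return scanned


# 100 rows where the values 0..9 repeat, so every file spans the full range.
values = [i % 10 for i in range(100)]

unsorted_scan = files_scanned(values, 10, lo=3, hi=3)        # 10
sorted_scan = files_scanned(sorted(values), 10, lo=3, hi=3)  # 1
```

With the unsorted layout, every file might contain the value 3, so none can be skipped. After sorting, only one file's range includes it.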

Choosing a strategy

Start with Binpack to eliminate small-file pressure quickly. Then evaluate whether Sort would benefit query-heavy tables by using the Simulations feature to preview the impact before applying.

Using the compaction UI

Navigate to Explore in the sidebar, select a table from the tree view, then open the Optimization tab. The File Compaction card is the first section.

Step-by-step

1. Choose your execution mode using the Auto and Manual toggles at the top of the card. Auto runs compaction on the cron schedule without intervention. Manual lets you trigger runs on demand.

2. Set the Target File Size (MB) (default: 512). Larger targets reduce file count; smaller targets allow finer-grained parallelism.

3. Select a Compaction Strategy: Binpack (optimize for size) or Sort (optimize for queries).

4. Configure the Cron Expression to control when compaction runs (e.g. 0 2 * * * * for daily at 2:00 AM). The schedule display shows a human-readable interpretation.

5. Click Simulate to preview the expected output (file count, average size, estimated duration) without modifying any data.

6. Click Save to persist your configuration. If Auto is enabled, compaction will begin running on the next cron trigger.

Auto vs. Manual mode

Auto (autopilot)

LakeOps runs compaction autonomously on the configured cron schedule. Ideal for production tables where consistent optimization is critical and you want zero manual overhead.

Conflict-aware: LakeOps detects concurrent writes and retries safely.

Manual (on-demand)

You control when compaction runs. Good for development/staging environments, or when you want to verify simulation results before committing.

The configuration is saved; you simply trigger execution when ready.

Configuration reference

Compaction settings can be configured per table or applied globally via Policies:

Setting	Default	Description
Target file size	512 MB	The target output file size after compaction. Files will be merged until this size is reached.
Strategy	Binpack	Binpack (merge by size only) or Sort (reorder by query-relevant columns).
Schedule (cron)	`0 2 * * * *`	Cron expression controlling execution frequency. Default: daily at 2:00 AM.
Auto / Manual	Manual	Auto: runs on schedule. Manual: triggered on demand.
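
If you track these settings outside the UI, for example in a script or a review checklist, the table above maps naturally onto a small typed record. The class and field names below are hypothetical and do not correspond to a LakeOps API; only the default values come from the table.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    BINPACK = "binpack"  # merge by size only
    SORT = "sort"        # reorder by query-relevant columns


class Mode(Enum):
    AUTO = "auto"      # runs on schedule
    MANUAL = "manual"  # triggered on demand


@dataclass(frozen=True)
class CompactionConfig:
    """Illustrative container mirroring the documented defaults."""
    target_file_size_mb: int = 512
    strategy: Strategy = Strategy.BINPACK
    cron: str = "0 2 * * * *"
    mode: Mode = Mode.MANUAL

    def __post_init__(self):
        # Basic sanity check; the limits LakeOps enforces are not documented here.
        if self.target_file_size_mb <= 0:
            raise ValueError("target_file_size_mb must be positive")


defaults = CompactionConfig()
streaming_table = CompactionConfig(target_file_size_mb=256, mode=Mode.AUTO)
```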

Conflict handling

LakeOps uses Iceberg's optimistic concurrency control. If a concurrent write occurs during compaction:

• The compaction commit detects the conflict via metadata version check.
• LakeOps retries the affected partition(s) on the next scheduled or manual run.
• No data is lost or corrupted. The worst case is a delayed compaction cycle.
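
The behavior described above can be modeled with a toy in-memory table. This is a simplified sketch of optimistic concurrency in general, not Iceberg's or LakeOps's actual commit protocol; the Table class, CommitConflict exception, and try_compact helper are all hypothetical.

```python
class CommitConflict(Exception):
    """Raised when the table changed since the writer read it."""


class Table:
    def __init__(self, files):
        self.version = 0
        self.files = list(files)

    def commit(self, expected_version, new_files):
        # The metadata version check: succeed only if nobody else
        # committed since `expected_version` was read.
        if expected_version != self.version:
            raise CommitConflict(
                f"expected v{expected_version}, table is at v{self.version}")
        self.files = list(new_files)
        self.version += 1


def try_compact(table, rewrite, concurrent_write=None):
    """Attempt one compaction; return True if it committed.

    `concurrent_write` simulates another writer committing while the
    rewrite is in progress.
    """
    base_version = table.version
    new_files = rewrite(list(table.files))
    if concurrent_write is not None:
        concurrent_write(table)
    try:
        table.commit(base_version, new_files)
        return True
    except CommitConflict:
        # Nothing was applied; the work is deferred to the next run.
        return False


table = Table([1, 2, 3])

def merge_all(files):
    return [sum(files)]

def appender(t):
    t.commit(t.version, t.files + [4])

first = try_compact(table, merge_all, concurrent_write=appender)  # False
second = try_compact(table, merge_all)                            # True
```

The first attempt loses the race, and the appended file is preserved untouched. The second attempt sees the new state and merges all four files.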

Monitoring compaction

After compaction runs, verify results through multiple observability surfaces:

• Events tab: shows “Compact Data Files” operations with before/after file counts, duration, and data volume.
• Metrics tab: check avgFileSize and totalFiles to confirm convergence toward target.
• Insights tab: previously flagged “Partition Data Files” insights should resolve after successful compaction.
• Dashboard: aggregated operations count and query speed improvement reflect compaction benefits across the lake.
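
If you record avgFileSize and totalFiles before and after a run, a small helper can summarize whether the table moved toward the target. The sketch below assumes avgFileSize is expressed in MB and that the metrics are available as plain dictionaries; both are assumptions for illustration, not a documented export format.

```python
def compaction_progress(before, after, target_mb=512):
    """Compare two metric snapshots taken around a compaction run."""
    gap_before = abs(target_mb - before["avgFileSize"])
    gap_after = abs(target_mb - after["avgFileSize"])
    return {
        "files_removed": before["totalFiles"] - after["totalFiles"],
        "avg_size_ratio": after["avgFileSize"] / target_mb,
        "moved_toward_target": gap_after < gap_before,
    }


before = {"avgFileSize": 8, "totalFiles": 1000}
after = {"avgFileSize": 500, "totalFiles": 16}
summary = compaction_progress(before, after)
```

A ratio close to 1.0 together with a falling file count suggests the table is converging. A ratio that stays low across runs may mean ingestion is outpacing the schedule.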

Simulate before you apply

Every compaction configuration includes a Simulate button. Click it to preview:

• How many files will be rewritten
• Expected output file count and average size
• Estimated execution duration

For more advanced layout analysis (field access frequency, partition strategy comparison), use the dedicated Simulations tab.
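
For intuition about the numbers in a preview, the file counts can be approximated with simple arithmetic. The sketch below is a back-of-the-envelope estimate, not the logic behind the Simulate button: it assumes only files smaller than the target are rewritten and that they pack perfectly, and it leaves out duration. The Preview record and function name are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Preview:
    files_rewritten: int
    output_files: int
    avg_output_mb: float


def preview_compaction(sizes_mb, target_mb=512):
    """Estimate the outcome of merging all files smaller than the target."""
    candidates = [s for s in sizes_mb if s < target_mb]
    if len(candidates) < 2:
        # Nothing to merge with, so nothing would be rewritten.
        return Preview(0, 0, 0.0)
    total = sum(candidates)
    output_files = math.ceil(total / target_mb)
    return Preview(len(candidates), output_files, total / output_files)


# 1000 files of 8 MB from streaming ingestion plus 3 files already at target.
sizes = [8] * 1000 + [512] * 3
estimate = preview_compaction(sizes)
# Preview(files_rewritten=1000, output_files=16, avg_output_mb=500.0)
```

Rerunning the estimate with a smaller target, such as 128 MB, yields more output files (63 in this example), which illustrates the trade-off between file count and parallelism.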

Compaction via policies

Instead of configuring each table individually, create a Compact Data Files policy to apply compaction rules across an entire catalog or organization. Policies support:

• Scoping per table, per namespace, per catalog, or organization-wide
• Independent cron schedules and enable/disable toggles
• Version history and audit trail for every policy change
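
When policies exist at several scopes, it helps to reason about which one applies to a given table. The sketch below assumes that the most specific enabled policy wins (table, then namespace, then catalog, then organization). That precedence rule is an assumption made for illustration and is not stated in this page; the policy dictionaries and resolve_policy function are likewise hypothetical.

```python
# Assumed precedence: most specific scope first.
SCOPE_ORDER = ["table", "namespace", "catalog", "organization"]


def resolve_policy(policies, catalog, namespace, table):
    """Return the most specific enabled policy matching the table, or None."""
    keys = {
        "table": (catalog, namespace, table),
        "namespace": (catalog, namespace),
        "catalog": (catalog,),
        "organization": (),
    }
    for scope in SCOPE_ORDER:
        for policy in policies:
            if (policy["enabled"]
                    and policy["scope"] == scope
                    and tuple(policy["target"]) == keys[scope]):
                return policy
    return None


policies = [
    {"name": "org-default", "scope": "organization", "target": (), "enabled": True},
    {"name": "sales-nightly", "scope": "namespace",
     "target": ("prod", "sales"), "enabled": True},
    {"name": "orders-hourly", "scope": "table",
     "target": ("prod", "sales", "orders"), "enabled": False},
]

orders = resolve_policy(policies, "prod", "sales", "orders")  # sales-nightly
people = resolve_policy(policies, "prod", "hr", "people")     # org-default
```

Because the table-level policy is disabled, the orders table falls back to the namespace policy under this assumed rule.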
