Compaction
LakeOps includes a Rust-based compaction engine for Apache Iceberg that analyzes query patterns and access frequency to optimize file layout at scale. It completes more compaction work in less time with a minimal resource footprint, so your lake stays performant without blocking writes or queries.
Why compaction matters
Apache Iceberg tables accumulate small files over time from streaming ingestion, frequent appends, and partial updates. These small files degrade query performance because engines must open and scan each file individually, increasing I/O overhead and planning time.
Compaction merges these files into optimally sized data files. The result: fewer files to scan, less metadata to plan over, and dramatically faster queries across every connected engine.
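The arithmetic behind this is straightforward. A quick sketch with invented numbers (4,000 files of roughly 8 MB each; the 512 MB target matches the LakeOps default) shows how sharply the file count drops:

```python
import math

# Hypothetical partition: 4,000 small files of ~8 MB each.
# These numbers are illustrative, not measured from a real table.
small_file_count = 4_000
small_file_mb = 8
target_mb = 512  # LakeOps default target file size

total_mb = small_file_count * small_file_mb        # 32,000 MB of data
compacted_count = math.ceil(total_mb / target_mb)  # files after compaction

print(compacted_count)  # 63 files instead of 4,000 to open per scan
```

The data volume is unchanged; only the number of file handles, GET requests, and planning entries shrinks.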
Symptoms of poor compaction
- Thousands of small files per partition (well below the target file size)
- Query planning time growing with each ingestion cycle
- High S3/GCS request costs from excessive LIST and GET operations
- LakeOps Insights flagging “Partition Data Files” warnings
How LakeOps compaction works
LakeOps compaction operates directly on your Iceberg metadata and storage. It:
- Reads table metadata to identify partitions with suboptimal file counts or sizes.
- Analyzes query patterns to determine which columns are filtered and joined on most, informing sort-order decisions.
- Rewrites data files using the Rust compaction engine, merging small files into target-sized outputs without blocking concurrent readers or writers.
- Commits atomically via Iceberg's optimistic concurrency, so the table is never in an inconsistent state.
- Logs every operation in the Events tab with before/after file counts, duration, and data volume.
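The first step, picking which partitions to rewrite, can be sketched as a simple filter over per-partition file statistics. The field names and the 75% threshold below are assumptions for illustration, not LakeOps internals:

```python
TARGET_BYTES = 512 * 1024 * 1024
MIN_RATIO = 0.75  # files under 75% of target count as "small" (assumed threshold)

# Hypothetical per-partition file size listings (bytes).
partitions = {
    "date=2024-01-01": [5 * 1024 * 1024] * 800,   # 800 tiny files
    "date=2024-01-02": [500 * 1024 * 1024] * 4,   # already near target
}

def needs_compaction(file_sizes):
    """A partition qualifies when it holds at least two mergeable small files."""
    small = [s for s in file_sizes if s < TARGET_BYTES * MIN_RATIO]
    return len(small) > 1

candidates = [p for p, sizes in partitions.items() if needs_compaction(sizes)]
print(candidates)  # ['date=2024-01-01']
```

Partitions already at or near the target are left untouched, which keeps repeated runs cheap.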
Compaction strategies
LakeOps supports two compaction strategies, configurable per table or via organization-wide policies:
Binpack
Optimize for size
Combines small files into larger files targeting the configured file size (default: 512 MB). Does not reorder data within files.
Best for
- Tables with heavy append workloads (streaming ingestion)
- Reducing file count as the primary goal
- Fast execution with minimal compute overhead
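The idea behind binpack can be sketched as a first-fit grouping: files are assigned to output groups that stay at or under the target size. This is a simplified illustration, not the Rust engine's actual algorithm:

```python
TARGET_MB = 512  # matches the LakeOps default target file size

def binpack(file_sizes_mb, target=TARGET_MB):
    """First-fit decreasing: pack files into groups no larger than `target`.
    Each resulting group becomes one rewritten output file."""
    groups = []
    for size in sorted(file_sizes_mb, reverse=True):
        for g in groups:
            if sum(g) + size <= target:
                g.append(size)
                break
        else:
            groups.append([size])  # no group has room; start a new one
    return groups

groups = binpack([120, 90, 300, 60, 200, 45])
print(len(groups), [sum(g) for g in groups])  # 2 [500, 315]
```

Because no data is reordered, the rewrite is little more than a streamed merge, which is why binpack is the cheap, fast option.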
Sort
Optimize for queries
Rewrites data files sorted by the columns your queries filter on most. Enables data skipping and dramatically reduces data scanned per query.
Best for
- Query-heavy tables with predictable filter patterns
- Large analytical tables where scan reduction matters
- Tables where you know the most-queried columns
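The reason Sort enables data skipping is that, once files are sorted on a filter column, each file covers a narrow range, and per-file min/max statistics let the engine prune files before reading them. A sketch with an invented `user_id` sort column and made-up stats:

```python
# After a Sort compaction on `user_id`, each file spans a disjoint range.
# File names and ranges are illustrative.
files = [
    {"path": "f1.parquet", "min_user_id": 0,      "max_user_id": 9_999},
    {"path": "f2.parquet", "min_user_id": 10_000, "max_user_id": 19_999},
    {"path": "f3.parquet", "min_user_id": 20_000, "max_user_id": 29_999},
]

def prune(files, value):
    """Keep only files whose [min, max] range could contain `value`."""
    return [f["path"] for f in files
            if f["min_user_id"] <= value <= f["max_user_id"]]

print(prune(files, 12_345))  # ['f2.parquet'] — two of three files skipped
```

On unsorted data the same stats overlap heavily and prune almost nothing, which is why the strategy pays off only when queries filter on predictable columns.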
Choosing a strategy
Start with Binpack to eliminate small-file pressure quickly. Then evaluate whether Sort would benefit query-heavy tables by using the Simulations feature to preview the impact before applying.
Using the compaction UI
Navigate to Explore in the sidebar, select a table from the tree view, then open the Optimization tab. The File Compaction card is the first section.
Step-by-step
Set the cron schedule (e.g. 0 2 * * * for daily at 2:00 AM). The schedule display shows a human-readable interpretation.
Auto vs. Manual mode
Auto (autopilot)
LakeOps runs compaction autonomously on the configured cron schedule. Ideal for production tables where consistent optimization is critical and you want zero manual overhead.
Conflict-aware: LakeOps detects concurrent writes and retries safely.
Manual (on-demand)
You control when compaction runs. Good for development/staging environments, or when you want to verify simulation results before committing.
The configuration is saved; you simply trigger execution when ready.
Configuration reference
Compaction settings can be configured per table or applied globally via Policies:
| Setting | Default | Description |
|---|---|---|
| Target file size | 512 MB | The target output file size after compaction. Files will be merged until this size is reached. |
| Strategy | Binpack | Binpack (merge by size only) or Sort (reorder by query-relevant columns). |
| Schedule (cron) | 0 2 * * * | Cron expression controlling execution frequency. Default: daily at 2:00 AM. |
| Auto / Manual | Manual | Auto: runs on schedule. Manual: triggered on demand. |
Conflict handling
LakeOps uses Iceberg's optimistic concurrency control. If a concurrent write occurs during compaction:
- The compaction commit detects the conflict via a metadata version check.
- LakeOps retries the affected partition(s) on the next scheduled or manual run.
- No data is lost or corrupted; the worst case is a delayed compaction cycle.
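The check-and-retry pattern above can be sketched as a compare-and-swap on the table's metadata version. Class and method names here are illustrative, not the LakeOps or Iceberg API:

```python
class Table:
    """Minimal stand-in for a table's metadata pointer."""
    def __init__(self):
        self.version = 7

    def try_commit(self, expected_version):
        """Commit only if nothing changed since the rewrite began."""
        if self.version != expected_version:
            return False       # a concurrent write bumped the version
        self.version += 1      # atomic swap to the new snapshot
        return True

table = Table()
base = table.version           # version observed when compaction started
table.version += 1             # simulate a concurrent writer committing first

committed = table.try_commit(base)
if not committed:
    # The rewrite is discarded; the partition is retried on the next run.
    # Readers only ever see the old snapshot or the new one, never a mix.
    print("conflict detected; compaction deferred to next run")
```

Because the rewritten files are only made visible by the commit, a failed commit wastes some compute but can never leave the table inconsistent.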
Monitoring compaction
After compaction runs, verify results through multiple observability surfaces:
- Events tab: shows “Compact Data Files” operations with before/after file counts, duration, and data volume.
- Metrics tab: check avgFileSize and totalFiles to confirm convergence toward target.
- Insights tab: previously flagged “Partition Data Files” insights should resolve after successful compaction.
- Dashboard: aggregated operations count and query speed improvement reflect compaction benefits across the lake.
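A quick way to read those metrics: compare avgFileSize against the target and watch totalFiles collapse. The numbers and the 90% tolerance below are invented for illustration; the metric names mirror the Metrics tab:

```python
TARGET_MB = 512

# Hypothetical Metrics tab values before and after a run.
metrics_before = {"avgFileSize": 12, "totalFiles": 4_000}
metrics_after = {"avgFileSize": 498, "totalFiles": 97}

# Assumed rule of thumb: consider the table converged once the average
# file size reaches 90% of the target.
converged = metrics_after["avgFileSize"] >= 0.9 * TARGET_MB
reduction = metrics_before["totalFiles"] // metrics_after["totalFiles"]

print(converged, f"~{reduction}x fewer files")
```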
Simulate before you apply
Every compaction configuration includes a Simulate button. Click it to preview:
- How many files will be rewritten
- Expected output file count and average size
- Estimated execution duration
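The first two preview numbers follow from the file listing alone. A rough sketch of that arithmetic (the selection rule and helper are assumptions, not how the simulator actually works):

```python
import math

TARGET_MB = 512

def preview(file_sizes_mb, target=TARGET_MB):
    """Estimate a compaction run from a partition's file sizes (MB)."""
    rewritten = [s for s in file_sizes_mb if s < target]  # files below target get merged
    total = sum(rewritten)
    out_count = math.ceil(total / target) if rewritten else 0
    avg = total / out_count if out_count else 0
    return {
        "files_rewritten": len(rewritten),
        "expected_output_files": out_count,
        "expected_avg_mb": round(avg, 1),
    }

# 1,000 small 8 MB files plus one already-full file (which is left alone).
result = preview([8] * 1000 + [512])
print(result)  # {'files_rewritten': 1000, 'expected_output_files': 16, 'expected_avg_mb': 500.0}
```

Duration, by contrast, depends on cluster and storage throughput, which is why it is an estimate rather than a derived number.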
For more advanced layout analysis (field access frequency, partition strategy comparison), use the dedicated Simulations tab.
Compaction via policies
Instead of configuring each table individually, create a Compact Data Files policy to apply compaction rules across an entire catalog or organization. Policies support:
- Scoping per table, per namespace, per catalog, or organization-wide
- Independent cron schedules and enable/disable toggles
- Version history and an audit trail for every policy change
