Compaction
LakeOps includes a compaction engine for Apache Iceberg, built in Rust on Apache DataFusion, that analyzes query patterns and access frequency to optimize file layout at scale. It is designed to run compactions quickly with a small resource footprint, so your lake stays performant without blocking writes or queries.
Why compaction matters
Apache Iceberg tables accumulate small files over time from streaming ingestion, frequent appends, and partial updates. These small files degrade query performance because engines must open and scan each file individually, increasing I/O overhead and planning time.
Compaction merges these files into data files close to the target size. The result: fewer files to open and scan, less metadata to plan over, and faster queries across every connected engine.
Symptoms of poor compaction
- Thousands of small files per partition (well below target file size)
- Query planning time growing with each ingestion cycle
- High S3/GCS request costs from excessive LIST and GET operations
- LakeOps Insights flagging “Partition Data Files” warnings
How LakeOps compaction works
LakeOps compaction operates directly on your Iceberg metadata and storage. It:
- Reads table metadata to identify partitions with suboptimal file counts or sizes.
- Analyzes query patterns to determine which columns are filtered and joined on most, informing sort-order decisions.
- Rewrites data files using the Rust compaction engine, merging small files into target-sized outputs without blocking concurrent readers or writers.
- Commits atomically via Iceberg's optimistic concurrency, so the table is never in an inconsistent state.
- Logs every operation in the Events tab with before/after file counts, duration, and data volume.
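The first step above, identifying partitions with too many undersized files, can be sketched in a few lines of Python. This is an illustration only, not the LakeOps engine: the `DataFile` record, the 75% "small file" ratio, and the minimum of two files per partition are all assumptions made for the example. Only the 512 MB target comes from this page.

```python
from collections import defaultdict
from dataclasses import dataclass

MB = 1024 * 1024
TARGET_FILE_SIZE = 512 * MB   # default target file size stated on this page
SMALL_FILE_RATIO = 0.75       # assumption: "small" means below 75% of target
MIN_SMALL_FILES = 2           # assumption: merging needs at least two small files


@dataclass(frozen=True)
class DataFile:
    """Hypothetical, simplified view of a data file entry in table metadata."""
    path: str
    partition: str
    size_bytes: int


def partitions_needing_compaction(files):
    """Return {partition: small_file_count} for partitions worth compacting."""
    small_counts = defaultdict(int)
    for f in files:
        if f.size_bytes < TARGET_FILE_SIZE * SMALL_FILE_RATIO:
            small_counts[f.partition] += 1
    return {p: n for p, n in small_counts.items() if n >= MIN_SMALL_FILES}


files = [
    DataFile("a1.parquet", "day=2024-01-01", 8 * MB),
    DataFile("a2.parquet", "day=2024-01-01", 12 * MB),
    DataFile("a3.parquet", "day=2024-01-01", 5 * MB),
    DataFile("b1.parquet", "day=2024-01-02", 500 * MB),
]
candidates = partitions_needing_compaction(files)
print(candidates)
```

Here only the first partition qualifies: it holds three files far below the target, while the second partition's single 500 MB file is already close to it.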
Compaction strategies
LakeOps supports two compaction strategies, configurable per table or via policies:
Binpack
Optimize for size
Combines small files into larger files targeting the configured file size (default: 512 MB). Does not reorder data within files.
Best for
- Tables with heavy append workloads (streaming ingestion)
- Reducing file count as the primary goal
- Fast execution with minimal compute overhead
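To make the Binpack idea concrete, here is a minimal sketch that groups file sizes into output files without exceeding the target size. It uses a first-fit-decreasing heuristic, which is an assumption chosen for illustration; this page does not say which packing algorithm LakeOps uses.

```python
MB = 1024 * 1024
TARGET = 512 * MB  # default target file size stated on this page


def binpack(sizes, target=TARGET):
    """Group file sizes into bins whose totals stay at or below target.

    First-fit decreasing: take files largest first and place each one in
    the first bin that still has room, opening a new bin when none does.
    """
    bins = []
    for size in sorted(sizes, reverse=True):
        for b in bins:
            if sum(b) + size <= target:
                b.append(size)
                break
        else:
            bins.append([size])  # no existing bin had room
    return bins


sizes = [200 * MB, 300 * MB, 100 * MB, 400 * MB, 12 * MB]
groups = binpack(sizes)
for g in groups:
    print([s // MB for s in g], "->", sum(g) // MB, "MB")
```

The five input files collapse into two output groups, each at or under 512 MB. Row order inside each file is untouched, which is why Binpack is cheap compared with Sort.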
Sort
Optimize for queries
Rewrites data files sorted by the columns your queries filter on most. Sorting narrows each file's min/max value range for those columns, which enables data skipping and reduces the data scanned per query.
Best for
- Query-heavy tables with predictable filter patterns
- Large analytical tables where scan reduction matters
- Tables where you know the most-queried columns
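The effect of Sort on data skipping can be illustrated with a toy model. The sketch below assumes an engine that skips any file whose min/max range for the filter column does not overlap the query range; the data, the file size of 10 rows, and the helper names are hypothetical.

```python
def chunk(values, rows_per_file):
    """Split a list of column values into fixed-size 'files'."""
    return [values[i:i + rows_per_file] for i in range(0, len(values), rows_per_file)]


def files_to_scan(files, lo, hi):
    """Count files whose [min, max] range overlaps the filter range [lo, hi]."""
    return sum(1 for f in files if min(f) <= hi and max(f) >= lo)


# 100 distinct values (0..99) in a scrambled but deterministic order.
values = [(i * 37) % 100 for i in range(100)]

unsorted_files = chunk(values, 10)        # layout as ingested
sorted_files = chunk(sorted(values), 10)  # layout after a sort rewrite

# Query: WHERE col BETWEEN 40 AND 49
before = files_to_scan(unsorted_files, 40, 49)
after = files_to_scan(sorted_files, 40, 49)
print(before, after)
```

In the scrambled layout every file's range spans the filter, so all 10 files must be read. After sorting, only the one file covering 40 to 49 overlaps.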
Choosing a strategy
Start with Binpack to eliminate small-file pressure quickly. Then evaluate whether Sort would benefit query-heavy tables by using the Simulations feature to preview the impact before applying.
Using the compaction UI
Navigate to Data > Explore in the sidebar, select a table, then open the Optimization tab. The File Compaction card provides compaction controls for the selected table.
Step-by-step
- (Optional) Enter a predicate to scope compaction to specific files. Supported fields are file_size, record_count, delete_file_count, sequence_number, and partition['col']. For example: file_size < 134217728 OR delete_file_count > 0.
- Set the schedule as a cron expression (for example, 0 2 * * * for daily at 2:00 AM). The schedule display shows a human-readable interpretation.
If Adaptive Maintenance is enabled for the table, the File Compaction card is automatically managed and locked; compaction runs as part of the adaptive bundle.
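As a rough illustration of what the example predicate selects, the sketch below reads it as `file_size < 134217728 OR delete_file_count > 0` and applies it to a few hypothetical file records in plain Python. The record layout is an assumption for the example; LakeOps evaluates predicates itself, and this is not its implementation.

```python
MB = 1024 * 1024

# Hypothetical per-file metadata records.
files = [
    {"path": "f1.parquet", "file_size": 64 * MB, "delete_file_count": 0},
    {"path": "f2.parquet", "file_size": 512 * MB, "delete_file_count": 2},
    {"path": "f3.parquet", "file_size": 512 * MB, "delete_file_count": 0},
]


def matches_example_predicate(f):
    """file_size < 134217728 OR delete_file_count > 0 (134217728 bytes = 128 MiB)."""
    return f["file_size"] < 134217728 or f["delete_file_count"] > 0


selected = [f["path"] for f in files if matches_example_predicate(f)]
print(selected)
```

The small file and the file with attached delete files are selected; the full-size file with no deletes is left alone.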
Configuration reference
Compaction settings can be set per table or applied globally via Policies:
| Setting | Default | Description |
|---|---|---|
| Predicate | (none) | Optional WHERE clause to scope compaction to specific files. Supports file_size, record_count, delete_file_count, sequence_number, and partition['col']. |
| Schedule (cron) | 0 2 * * * | 5-field Unix cron expression controlling execution frequency. Default: daily at 2:00 AM. |
| Enabled | On | Toggle to enable or disable scheduled compaction for this table. |
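To show how a 5-field cron expression such as the default 0 2 * * * maps to run times, here is a deliberately small matcher. It is a sketch under stated simplifications: it supports only `*`, single integers, and comma lists (no ranges or steps), and it requires day-of-month and day-of-week to both match, which differs from standard cron when both fields are restricted. It says nothing about how LakeOps schedules jobs internally.

```python
from datetime import datetime, timedelta


def field_matches(field, value):
    """Match one cron field: '*', a single integer, or a comma list."""
    if field == "*":
        return True
    return value in {int(part) for part in field.split(",")}


def cron_matches(expr, dt):
    """Check whether dt falls on a minute selected by the 5-field expression."""
    minute, hour, dom, month, dow = expr.split()
    cron_dow = (dt.weekday() + 1) % 7  # cron: 0 = Sunday; Python: 0 = Monday
    return (
        field_matches(minute, dt.minute)
        and field_matches(hour, dt.hour)
        and field_matches(dom, dt.day)
        and field_matches(month, dt.month)
        and field_matches(dow, cron_dow)
    )


def next_run(expr, after):
    """Scan forward minute by minute (up to about a year) for the next match."""
    t = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    for _ in range(366 * 24 * 60):
        if cron_matches(expr, t):
            return t
        t += timedelta(minutes=1)
    raise ValueError("no matching time within a year")


print(next_run("0 2 * * *", datetime(2024, 5, 1, 13, 30)))
```

Starting from 1:30 PM on 1 May 2024, the next run of the default schedule is 2:00 AM on 2 May.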
Conflict handling
LakeOps uses Iceberg's optimistic concurrency control. If a concurrent write occurs during compaction:
- The compaction commit detects the conflict via metadata version check.
- LakeOps retries the affected partition(s) on the next scheduled or manual run.
- No data is lost or corrupted; the worst case is a delayed compaction cycle.
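The conflict flow above can be sketched with a toy in-memory table. The `Table` class, its integer version counter, and the `compact` helper are hypothetical stand-ins for illustration; real Iceberg commits swap a metadata pointer in the catalog, and this sketch models only the "check version, defer on conflict" behavior described here.

```python
class CommitConflict(Exception):
    """Raised when the table changed since the operation started."""


class Table:
    """Toy stand-in for a table whose metadata carries a version number."""

    def __init__(self):
        self.version = 0
        self.files = []

    def commit(self, base_version, new_files):
        # Optimistic check: succeed only if nobody committed in the meantime.
        if base_version != self.version:
            raise CommitConflict(f"expected v{base_version}, found v{self.version}")
        self.files = list(new_files)
        self.version += 1


def compact(table, concurrent_write=None):
    """Rewrite the table's files into one; defer if a conflict is detected."""
    base_version = table.version
    merged = ["merged.parquet"] if table.files else []  # pretend rewrite
    if concurrent_write:
        concurrent_write(table)  # simulate a writer committing mid-compaction
    try:
        table.commit(base_version, merged)
        return "committed"
    except CommitConflict:
        return "deferred"  # table is untouched; retry on the next run


def writer(t):
    t.commit(t.version, t.files + ["c.parquet"])


table = Table()
table.commit(0, ["a.parquet", "b.parquet"])
first = compact(table, concurrent_write=writer)
files_after_conflict = list(table.files)
second = compact(table)
print(first, files_after_conflict, second, table.files)
```

The first attempt is deferred because the writer's commit bumped the version, and the writer's new file is preserved. The second attempt, with no concurrent write, commits cleanly.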
Monitoring compaction
After compaction runs, verify results through multiple observability surfaces:
- Events tab: shows “Compact Data Files” operations with before/after file counts, duration, and data volume.
- Metrics tab: check avgFileSize and totalFiles to confirm convergence toward target.
- Insights tab: previously flagged “Partition Data Files” insights should resolve after successful compaction.
- Dashboard: aggregated operations count and query speed improvement reflect compaction benefits across the lake.
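As a worked example of what "convergence toward target" looks like in the avgFileSize and totalFiles metrics, the sketch below computes both for a hypothetical partition before and after compaction. The metric names come from this page; the file sizes and the `file_metrics` helper are made up for illustration.

```python
MB = 1024 * 1024


def file_metrics(sizes):
    """Compute totalFiles and avgFileSize (bytes) for a list of file sizes."""
    total_files = len(sizes)
    avg = sum(sizes) / total_files if total_files else 0.0
    return {"totalFiles": total_files, "avgFileSize": avg}


# Same 1 GiB of data, laid out as 64 small files and then as 2 target-sized files.
before = file_metrics([16 * MB] * 64)
after = file_metrics([512 * MB] * 2)
print(before["totalFiles"], before["avgFileSize"] / MB)
print(after["totalFiles"], after["avgFileSize"] / MB)
```

The data volume is unchanged, but totalFiles drops from 64 to 2 and avgFileSize rises from 16 MB to the 512 MB target.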
Simulate before you apply
Use the dedicated Simulations tab to preview the impact of different compaction strategies before applying them. Layout simulations run on a real Iceberg branch and show:
- How many files will be rewritten
- Expected output file count and average size
- Field access frequency analysis for query-aware optimization
- Side-by-side comparison of multiple strategies
Compaction via policies
Instead of configuring each table individually, use Adaptive Maintenance to bundle compaction with other operations across your catalog. A standalone Compact Data Files policy is coming soon. Policies support:
- Scoping per table, per namespace, or per catalog
- Independent cron schedules and enable/disable toggles
- Version history and audit trail for every policy change
