Orphan File Detection & Cleanup
LakeOps detects and safely removes data files no longer referenced by any table snapshot. Eliminate storage drift from failed jobs, aborted commits, and legacy tables — reclaim capacity without risking data integrity.
What are orphan files?
Orphan files are data files that exist in your object storage (S3, GCS, ADLS) but are no longer referenced by any Iceberg table snapshot. They are invisible to query engines but still consume storage budget.
Common causes
- Failed writes: write jobs that produced files but crashed before committing the metadata update.
- Aborted commits: optimistic concurrency conflicts where files were written but the commit was rejected.
- Snapshot expiration: old snapshots were removed without the “clean associated files” option, so their data files remain in storage.
- Table drops & migrations: catalog entries removed without cleaning up the underlying storage.
- Compaction remnants: pre-compaction files left behind because the snapshots that reference them have not been expired yet.
Impact of orphan files
- Wasted storage costs: paying for files that no query will ever read.
- Misleading storage metrics: total lake size appears larger than the actual active dataset.
- Slower LIST operations: object storage directories with millions of files take longer to enumerate.
How cleanup works
LakeOps performs orphan detection by:
- Listing all files in the table's storage location.
- Comparing against metadata: checking which files are referenced by any snapshot (current or historical within the retention window).
- Applying an age threshold: only files older than the configured retention period are candidates for removal.
- Safely deleting unreferenced files that exceed the age threshold.
Safety guarantee: files are only removed if they are not referenced in any snapshot and exceed the age threshold. This prevents accidental deletion of files from in-progress writes or recently committed data.
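The detection steps above can be sketched in a few lines of Python. This is an illustrative model only, not LakeOps's implementation: `StoredFile`, `find_orphans`, and the sample paths are hypothetical names, and a real run would list files from object storage and read the referenced set from Iceberg metadata.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class StoredFile:
    """Hypothetical record for one object found when listing the table location."""
    path: str
    last_modified: datetime


def find_orphans(stored_files, referenced_paths, retention, now):
    """Return paths that are unreferenced AND older than the retention threshold."""
    cutoff = now - retention
    return sorted(
        f.path
        for f in stored_files
        if f.path not in referenced_paths and f.last_modified < cutoff
    )


now = datetime(2024, 1, 15, tzinfo=timezone.utc)
stored = [
    StoredFile("data/a.parquet", now - timedelta(days=30)),  # referenced by a snapshot
    StoredFile("data/b.parquet", now - timedelta(days=30)),  # unreferenced and old
    StoredFile("data/c.parquet", now - timedelta(hours=2)),  # unreferenced but recent
]
referenced = {"data/a.parquet"}

orphans = find_orphans(stored, referenced, timedelta(days=7), now)
print(orphans)  # ['data/b.parquet']
```

Note that `data/c.parquet` is kept even though nothing references it yet: it is younger than the threshold, so it may belong to a write that has not committed.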
Using the orphan cleanup UI
Navigate to Data > Explore, select a table, then open the Optimization tab. Scroll to the Orphan Files Cleanup card.
Schedule (cron): when cleanup runs (default: 0 3 * * *, daily at 3:00 AM).
Execution control
Orphan cleanup runs on the configured cron schedule when Enabled is on. You can also click Execute to trigger cleanup immediately. For the first run on a table you haven't cleaned before, executing manually lets you verify the scope before enabling scheduled execution.
If Adaptive Maintenance is active for the table, orphan cleanup is managed automatically and the individual section is locked.
Configuration reference
Orphan cleanup can be applied at scale via Policies (Remove Orphan Files type).
| Setting | Default | Description |
|---|---|---|
| Retention threshold | 7 days | Only remove files older than this. Protects in-progress writes. |
| Scope | Per-table | Can be scoped per-table, per-namespace, or catalog-wide via policies. |
| Schedule (cron) | 0 3 * * * | When cleanup runs. Default: daily at 3:00 AM. |
| Enabled | On | Toggle to enable or disable scheduled cleanup for this table. |
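To read the default schedule, it can help to label its five fields. The sketch below assumes the standard five-field cron layout (minute, hour, day of month, month, day of week); `describe_cron` is a hypothetical helper, and the time zone LakeOps applies to the schedule is not specified here.

```python
# Field order of a standard five-field cron expression.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression):
    """Split a five-field cron expression into named fields."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


schedule = describe_cron("0 3 * * *")
print(schedule["hour"], schedule["minute"])  # 3 0
```

Minute 0 of hour 3, with wildcards in the remaining fields, means every day at 3:00 AM.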
Relationship to snapshot expiration
Orphan cleanup and snapshot expiration work together but serve different purposes:
Snapshot expiration
Removes old snapshot metadata entries. Optionally removes associated metadata files, but may leave data files if they're shared with other snapshots.
Orphan cleanup
Removes actual data files from storage that are not referenced by any remaining snapshot. Catches files missed by expiration.
For complete hygiene, enable both: snapshot expiration to keep metadata lean, and orphan cleanup to reclaim storage from unreferenced files.
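The interaction between the two operations can be illustrated with plain sets. This is a toy model under stated assumptions: the snapshot IDs and file names are made up, each snapshot is reduced to the set of data files it references, and expiration is assumed to run without deleting data files itself.

```python
def referenced_files(snapshots):
    """Union of the data files referenced by the given snapshots."""
    files = set()
    for snapshot_files in snapshots.values():
        files |= snapshot_files
    return files


# Hypothetical table state: s1 is older, s2 is current.
snapshots = {
    "s1": {"f1.parquet", "f2.parquet"},
    "s2": {"f2.parquet", "f3.parquet"},
}
storage = {"f1.parquet", "f2.parquet", "f3.parquet"}

# Before expiration every file is still referenced, so nothing is orphaned.
orphans_before = storage - referenced_files(snapshots)

# Expire s1 (metadata only); f1 is now unreferenced, f2 is still shared with s2.
remaining = {sid: files for sid, files in snapshots.items() if sid != "s1"}
orphans_after = storage - referenced_files(remaining)

print(orphans_before)  # set()
print(orphans_after)   # {'f1.parquet'}
```

In this model, `f1.parquet` only becomes detectable as an orphan once `s1` is expired, which is why the two operations complement each other.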
Monitoring cleanup operations
- Events tab: shows “Remove Orphan Files” operations with file count and total size removed.
- Metrics tab: Stale Files count should drop to zero after successful cleanup.
- Dashboard: Cost Savings metric reflects storage reclaimed.
- Monitoring > Storage Metrics: total storage should decrease after orphan removal.
Best practices
- Set the retention threshold to at least 2× the duration of your longest-running write job.
- Run orphan cleanup after snapshot expiration, not before, so that files referenced only by expired snapshots become detectable.
- Use the Simulations tab first on tables you haven't cleaned before to understand the scope.
- For high-ingestion tables, keep Enabled on with a cron schedule so orphan files do not accumulate.
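As a worked example of the first rule, the threshold can be derived from the longest write job. The helper below is a sketch: `recommended_retention` is a hypothetical name, and never going below the 7-day default is an assumption made here, not a documented LakeOps rule.

```python
from datetime import timedelta

DEFAULT_RETENTION = timedelta(days=7)  # default from the configuration reference


def recommended_retention(longest_write_job, default=DEFAULT_RETENTION):
    """Return at least 2x the longest write job, and never less than the default."""
    return max(default, 2 * longest_write_job)


print(recommended_retention(timedelta(hours=6)))  # 7 days, 0:00:00
print(recommended_retention(timedelta(days=5)))   # 10 days, 0:00:00
```

A 6-hour job is comfortably covered by the default, while a 5-day backfill pushes the threshold to 10 days.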
