Orphan File Detection & Cleanup
LakeOps detects and safely removes data files no longer referenced by any table snapshot. Eliminate storage drift from failed jobs, aborted commits, and legacy tables — reclaim capacity without risking data integrity.
What are orphan files?
Orphan files are data files that exist in your object storage (S3, GCS, ADLS) but are no longer referenced by any Iceberg table snapshot. They are invisible to query engines but still consume storage budget.
Common causes
- Failed writes: Spark/Flink jobs that wrote files but crashed before committing the metadata update.
- Aborted commits: optimistic concurrency conflicts where files were written but the commit was rejected.
- Snapshot expiration: old snapshots were removed without the “clean associated files” option, so their data files remain in storage.
- Table drops & migrations: catalog entries removed without cleaning up the underlying storage.
- Compaction remnants: files from the pre-compaction state that stay behind until expiration removes the snapshots that still reference them.
Impact of orphan files
- Wasted storage costs: paying for files that no query will ever read.
- Misleading storage metrics: total lake size appears larger than the actual active dataset.
- Slower LIST operations: object storage directories with millions of files take longer to enumerate.
How cleanup works
LakeOps performs orphan detection by:
- Listing all files in the table's storage location.
- Comparing against metadata: checking which files are referenced by any snapshot (current or historical within the retention window).
- Applying an age threshold: only files older than the configured retention period are candidates for removal.
- Safely deleting unreferenced files that exceed the age threshold.
Safety guarantee: files are only removed if they are not referenced in any snapshot and exceed the age threshold. This prevents accidental deletion of files from in-progress writes or recently committed data.
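The detection steps and the safety guarantee above can be sketched in a few lines of Python. This is an illustrative model only, not LakeOps's implementation: `StoredFile`, `find_orphans`, and the sample paths are hypothetical names, and the listing and the set of referenced paths are passed in as plain data rather than read from object storage or table metadata.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class StoredFile:
    """One object returned by listing the table's storage location (hypothetical type)."""
    path: str
    last_modified: datetime


def find_orphans(listed, referenced_paths, retention, now):
    """Return files that are both unreferenced and older than the retention threshold."""
    cutoff = now - retention
    return [
        f for f in listed
        if f.path not in referenced_paths and f.last_modified < cutoff
    ]


now = datetime(2024, 6, 15, tzinfo=timezone.utc)
listed = [
    StoredFile("data/a.parquet", now - timedelta(days=30)),  # referenced by a snapshot
    StoredFile("data/b.parquet", now - timedelta(days=30)),  # unreferenced and old
    StoredFile("data/c.parquet", now - timedelta(hours=2)),  # unreferenced but recent
]
referenced = {"data/a.parquet"}

orphans = find_orphans(listed, referenced, timedelta(days=7), now)
print([f.path for f in orphans])  # ['data/b.parquet']
```

The recent file `data/c.parquet` is kept even though nothing references it yet, which is the case the age threshold exists for: it could belong to a write that has not committed.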
Using the orphan cleanup UI
Navigate to Explore, select a table, then open the Optimization tab. Scroll to the Orphan Files Cleanup card.
Set the cleanup schedule as a cron expression (default: 0 3 * * * *, daily at 3:00 AM).
Auto vs. Manual mode
Orphan cleanup supports two execution modes, toggled at the top of the Orphan Files Cleanup card in the Optimization tab:
Auto (autopilot)
LakeOps scans for and removes orphan files autonomously on the configured cron schedule. Ideal for high-ingestion tables where orphans accumulate continuously.
Safety-first: always respects the retention threshold to protect in-progress writes.
Manual (on-demand)
You control when cleanup runs. Recommended for the first run on a table you haven't cleaned before — simulate first, then trigger manually.
The configuration is saved; trigger execution whenever you're ready.
Configuration reference
Orphan cleanup can be applied at scale via Policies (Remove Orphan Files type).
| Setting | Default | Description |
|---|---|---|
| Retention threshold | 7 days | Only remove files older than this. Protects in-progress writes. |
| Scope | Per-table | Can be scoped per-table, per-namespace, or catalog-wide via policies. |
| Schedule (cron) | 0 3 * * * * | When cleanup runs. Default: daily at 3:00 AM. |
| Auto / Manual | Manual | Auto: runs on schedule. Manual: triggered on demand. |
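For readers who track these settings in code, the table can be mirrored as a small validated structure. This is a sketch under assumptions: `OrphanCleanupSettings` and its field names are hypothetical and do not correspond to a documented LakeOps API; only the default values come from the table above.

```python
from dataclasses import dataclass

VALID_SCOPES = {"table", "namespace", "catalog"}
VALID_MODES = {"auto", "manual"}


@dataclass(frozen=True)
class OrphanCleanupSettings:
    """Hypothetical mirror of the settings table; defaults match the documented ones."""
    retention_days: int = 7          # only remove files older than this
    scope: str = "table"             # per-table, per-namespace, or catalog-wide
    schedule: str = "0 3 * * * *"    # cron expression, as shown in the table
    mode: str = "manual"             # "auto" runs on schedule, "manual" on demand

    def __post_init__(self):
        # Reject values the table does not describe.
        if self.retention_days <= 0:
            raise ValueError("retention_days must be positive")
        if self.scope not in VALID_SCOPES:
            raise ValueError(f"unknown scope: {self.scope!r}")
        if self.mode not in VALID_MODES:
            raise ValueError(f"unknown mode: {self.mode!r}")


defaults = OrphanCleanupSettings()
nightly = OrphanCleanupSettings(scope="namespace", mode="auto")
print(defaults)
print(nightly)
```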
Relationship to snapshot expiration
Orphan cleanup and snapshot expiration work together but serve different purposes:
Snapshot expiration
Removes old snapshot metadata entries. Optionally removes associated metadata files, but may leave data files if they're shared with other snapshots.
Orphan cleanup
Removes actual data files from storage that are not referenced by any remaining snapshot. Catches files missed by expiration.
For complete hygiene, enable both: snapshot expiration to keep metadata lean, and orphan cleanup to reclaim storage from unreferenced files.
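A toy model may help show why the two operations complement each other. The sketch below assumes expiration runs without cleaning associated files; the snapshot IDs, file paths, and helper functions are hypothetical, and real tables track references through manifests rather than plain sets.

```python
# Each snapshot maps to the set of data files it references (hypothetical paths).
snapshots = {
    "snap-1": {"data/a.parquet", "data/b.parquet"},
    "snap-2": {"data/b.parquet", "data/c.parquet"},
}
files_in_storage = {"data/a.parquet", "data/b.parquet", "data/c.parquet"}


def referenced_files(snaps):
    """Union of all files referenced by the remaining snapshots."""
    return set().union(*snaps.values())


def expire(snaps, snapshot_id):
    """Drop one snapshot's metadata entry; data files in storage are left untouched."""
    return {sid: files for sid, files in snaps.items() if sid != snapshot_id}


unreferenced_before = files_in_storage - referenced_files(snapshots)
remaining = expire(snapshots, "snap-1")
unreferenced_after = files_in_storage - referenced_files(remaining)

print(unreferenced_before)  # set()
print(unreferenced_after)   # {'data/a.parquet'}
```

Before expiration nothing is unreferenced, so orphan cleanup has nothing to remove. After `snap-1` expires, `data/b.parquet` stays referenced through `snap-2`, while `data/a.parquet` becomes a candidate for orphan cleanup once it also passes the age threshold.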
Monitoring cleanup operations
- Events tab: shows “Remove Orphan Files” operations with file count and total size removed.
- Metrics tab: Stale Files count should drop to zero after successful cleanup.
- Dashboard: Cost Savings metric reflects storage reclaimed.
- Monitoring > Storage Metrics: total storage should decrease after orphan removal.
Best practices
- Set retention threshold to at least 2× your longest-running write job duration.
- Run orphan cleanup after snapshot expiration, not before (so expired data files become detectable).
- Use Simulate first on tables you haven't cleaned before to understand the scope.
- For high-ingestion tables, consider enabling Auto mode to prevent accumulation.
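The 2× rule for the retention threshold is simple arithmetic, sketched here with Python's `timedelta`. The function names are hypothetical and the job durations are made-up examples; only the 7-day default and the 2× factor come from this page.

```python
from datetime import timedelta

DEFAULT_RETENTION = timedelta(days=7)  # default from the configuration reference


def minimum_retention(longest_write_job: timedelta) -> timedelta:
    """Smallest threshold that satisfies the 2x rule of thumb."""
    return 2 * longest_write_job


def is_retention_safe(retention: timedelta, longest_write_job: timedelta) -> bool:
    """True if the configured threshold is at least 2x the longest write job."""
    return retention >= minimum_retention(longest_write_job)


# A 6-hour job needs at least 12 hours, so the 7-day default is comfortably safe.
print(is_retention_safe(DEFAULT_RETENTION, timedelta(hours=6)))  # True

# A 4-day backfill needs at least 8 days, so the default should be raised.
print(is_retention_safe(DEFAULT_RETENTION, timedelta(days=4)))   # False
```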
