Back to blog

Apache Iceberg with dbt: Optimization Guide

dbt transforms your data — but who maintains the Iceberg tables underneath? A practical guide to dbt adapters, incremental strategies, table properties, and the maintenance gap that every dbt + Iceberg team hits in production.

Apache Iceberg with dbt Optimization — dbt logo above SQL model cards flowing through a transformation pipeline into a geometric iceberg, with chart and analytics icons

dbt has become the standard for SQL-based data transformation — version-controlled models, dependency graphs, testing, and documentation. Apache Iceberg has become the standard for open lakehouse storage — ACID transactions, schema evolution, time travel, and engine-agnostic access. Naturally, teams are combining them: dbt for the transformation logic, Iceberg for the physical tables.

The combination works. But it introduces a gap that most teams discover only after months in production: dbt is a transformation tool, not a table maintenance tool. It writes data into Iceberg tables but does not compact files, expire snapshots, clean orphans, or optimize manifests. Over time, the tables dbt builds degrade — queries slow down, storage costs rise, and someone has to build the maintenance layer that dbt intentionally does not provide.

This is where LakeOps comes in — it fills the maintenance gap by autonomously compacting files, expiring snapshots, cleaning up orphans, and rewriting manifests as a dedicated control plane for Iceberg tables. We'll cover this integration in detail below.

This guide covers everything you need to run dbt on Iceberg effectively: adapter setup, materialization strategies, incremental model patterns, table property configuration, common pitfalls, and how LakeOps fills the maintenance gap for production lakehouses.

dbt adapters for Iceberg

dbt does not interact with Iceberg directly. It generates SQL that an execution engine runs against an Iceberg catalog. The adapter determines which engine and which Iceberg features are available:

dbt-spark — the most mature adapter for Iceberg. It generates Spark SQL and supports the full range of Iceberg table properties, partition transforms, and write modes. Use it when your lakehouse runs on Spark (EMR, Databricks, Glue, or self-managed). Configuration goes in profiles.yml:

yaml
1my_iceberg_lakehouse:2  target: prod3  outputs:4    prod:5      type: spark6      method: thrift7      schema: analytics8      host: spark-thrift.internal9      port: 1000110      properties:11        spark.sql.catalog.iceberg: org.apache.iceberg.spark.SparkCatalog12        spark.sql.catalog.iceberg.type: glue

dbt-trino — connects to Trino (or Starburst), which reads and writes Iceberg through its Iceberg connector. Supports CREATE TABLE AS SELECT, INSERT INTO, MERGE INTO, and DELETE FROM. Trino's Iceberg connector exposes table properties via WITH clauses. Ideal for teams already using Trino as their primary query engine.

dbt-athena — targets AWS Athena, which uses Iceberg as a first-class table format via the AWS Glue catalog. Athena supports Iceberg-specific features like OPTIMIZE (compaction), VACUUM (snapshot expiration), and CTAS/INSERT. The adapter handles Athena's async query model and S3 staging.

dbt-databricks — supports Iceberg through Databricks Unity Catalog via catalog integrations (dbt-databricks 1.9+). Supports both managed Iceberg tables and Iceberg-enabled Delta tables (UniForm). All incremental strategies are available including merge, append, insert_overwrite, delete+insert, and microbatch. Configuration uses table_format: 'iceberg' in the catalog integration rather than file_format.

Other adapters with Iceberg support include dbt-glue (AWS Glue ETL jobs) and dbt-duckdb (local development with Iceberg extensions). The choice depends on your execution engine — dbt itself is engine-agnostic.

Materializations and Iceberg

dbt materializations control how a model's SQL translates to physical table operations. Each interacts differently with Iceberg's storage layer:

Table materialization — runs CREATE OR REPLACE TABLE ... AS SELECT. Every dbt run drops and recreates the table, producing a clean set of data files. Simple and correct, but expensive for large tables and produces orphan files from the previous version (Iceberg retains old data files until snapshot expiration).

Incremental materialization — the workhorse for production Iceberg models. On the first run it creates the table; on subsequent runs it applies only new or changed data. The strategy parameter controls how:

sql
1-- models/orders_enriched.sql2{{3  config(4    materialized='incremental',5    incremental_strategy='merge',6    unique_key='order_id',7    file_format='iceberg',8    partition_by=['date(order_date)'],9    table_properties={10      'write.distribution-mode': 'hash',11      'write.merge.mode': 'merge-on-read'12    }13  )14}}15 16SELECT17  o.order_id,18  o.order_date,19  o.amount,20  c.segment21FROM {{ ref('stg_orders') }} o22JOIN {{ ref('stg_customers') }} c ON o.customer_id = c.customer_id23{% if is_incremental() %}24WHERE o.updated_at > (SELECT max(updated_at) FROM {{ this }})25{% endif %}

View materialization — creates a catalog view pointing at Iceberg tables. No data files are written. Useful for lightweight abstractions over Iceberg tables, but provides no performance benefit — queries still scan the underlying tables.

Incremental strategies on Iceberg

The incremental strategy determines how dbt applies new data to an existing Iceberg table. Each strategy has different implications for write amplification, read performance, and maintenance burden.

Merge strategy

The merge strategy generates a MERGE INTO statement that matches source rows against the target table on the unique_key. Matching rows are updated; non-matching rows are inserted. On Iceberg v2, this uses merge-on-read (MoR) — the engine writes position delete files that mark rows as logically deleted, then appends new data files for the updated values. The original data files remain unchanged.

sql
1MERGE INTO analytics.orders_enriched t2USING (3  SELECT * FROM __dbt_tmp_orders_enriched4) s ON t.order_id = s.order_id5WHEN MATCHED THEN UPDATE SET *6WHEN NOT MATCHED THEN INSERT *

The advantage: writes are fast because no existing data files are rewritten. The disadvantage: delete files accumulate over time. Every read must reconcile data files against delete files, degrading query performance progressively. This is the most common source of performance degradation in dbt-managed Iceberg tables.

Insert overwrite strategy

The insert_overwrite strategy replaces entire partitions that contain new data. It is the cleanest approach for partition-aligned workloads — each run produces a fresh set of data files for affected partitions with no delete files:

sql
1{{2  config(3    materialized='incremental',4    incremental_strategy='insert_overwrite',5    partition_by=['date(event_date)'],6    file_format='iceberg'7  )8}}9 10SELECT *11FROM {{ ref('stg_events') }}12{% if is_incremental() %}13WHERE event_date >= date_add(current_date(), -3)14{% endif %}

Best for daily or hourly batch models where new data always arrives aligned to a partition boundary (date, hour). Does not handle updates to rows in arbitrary partitions well — it rewrites the entire partition even for a single changed row.

Append strategy

The append strategy is the simplest: it inserts all new rows without checking for duplicates or matching existing records. No MERGE, no DELETE, no partition overwrite — just INSERT INTO:

sql
1{{2  config(3    materialized='incremental',4    incremental_strategy='append',5    file_format='iceberg'6  )7}}8 9SELECT *10FROM {{ ref('stg_clickstream') }}11WHERE event_timestamp > (SELECT max(event_timestamp) FROM {{ this }})

Ideal for immutable event streams, logs, and append-only fact tables. No delete files, no rewriting — but no deduplication either. If your upstream can produce duplicates, you need a downstream dedup layer.

Delete+insert strategy

The delete+insert strategy first deletes all rows matching the incremental predicate (typically a partition range), then inserts the fresh data. Unlike merge, this uses copy-on-write — the affected data files are rewritten entirely:

sql
1{{2  config(3    materialized='incremental',4    incremental_strategy='delete+insert',5    unique_key='order_id',6    partition_by=['date(order_date)'],7    file_format='iceberg'8  )9}}

This avoids delete file accumulation at the cost of higher write amplification. Suitable when you want merge semantics (upsert) without the MoR overhead — particularly effective when changes cluster in recent partitions.

Microbatch strategy

The microbatch strategy (dbt Core 1.9+) splits large time-series datasets into time-based batches — daily, hourly, or by custom grain. Each batch is processed independently, enabling parallel execution and automatic retry of failed batches without reprocessing the entire model:

sql
1{{2  config(3    materialized='incremental',4    incremental_strategy='microbatch',5    event_time='event_timestamp',6    begin='2024-01-01',7    batch_size='day',8    file_format='iceberg'9  )10}}11 12SELECT *13FROM {{ ref('stg_events') }}

Unlike other strategies, microbatch does not require a manual {% if is_incremental() %} block — dbt automatically filters each batch by event_time. Supported by dbt-spark, dbt-databricks, dbt-trino, dbt-athena, and dbt-snowflake. Ideal for high-volume event tables where daily insert-overwrite is too coarse and append produces too many small files per run.

Configuring Iceberg table properties in dbt

Iceberg tables expose dozens of properties that control file format, layout, compaction behavior, and write semantics. dbt passes these through the table_properties config:

sql
1{{2  config(3    materialized='incremental',4    file_format='iceberg',5    table_properties={6      'format-version': '2',7      'write.parquet.compression-codec': 'zstd',8      'write.target-file-size-bytes': '134217728',9      'write.distribution-mode': 'hash',10      'write.metadata.delete-after-commit.enabled': 'true',11      'write.metadata.previous-versions-max': '10'12    },13    partition_by=['bucket(16, user_id)', 'date(event_date)']14  )15}}

Note: Iceberg sort orders are not configured through a dbt model config parameter. Set the sort order via DDL (ALTER TABLE ... WRITE ORDERED BY) or through the write.sort-order table property after table creation. Sort-order compaction is a maintenance operation — tools like LakeOps can apply it automatically during compaction passes.

Key properties to set: | Property | Purpose | Recommended | |----------|---------|-------------| | write.target-file-size-bytes | Target data file size | 128–256 MB for scan workloads | | write.distribution-mode | How writers distribute data | hash for partitioned, range for sorted | | write.parquet.compression-codec | Compression algorithm | zstd (best ratio/speed) | | write.merge.mode | MoR vs CoW for merges (Spark) | merge-on-read for fast writes | | commit.retry.num-retries | Concurrent write retries | 4 for multi-writer tables | | write.metadata.delete-after-commit.enabled | Auto-clean old metadata | true |

Partition transforms deserve special attention. Iceberg supports hidden partitioning — transforms like day(ts), bucket(N, col), truncate(N, col) — which decouple the physical layout from the query syntax. dbt exposes these through partition_by:

yaml
1# schema.yml2models:3  - name: events_enriched4    config:5      partition_by:6        - "day(event_timestamp)"7        - "bucket(32, tenant_id)"

Choose partition granularity based on query patterns. Over-partitioning (too many small partitions) creates a small-file problem that compounds with every dbt run.

Common problems with dbt + Iceberg

Incremental merge creates delete files

Every merge strategy run on Iceberg v2 produces position delete files. After weeks of daily runs, a table may have hundreds of delete files — one per affected data file per run. Queries must reconcile these against data files at read time. A table with 50 data files and 200 delete files may scan 3–5x more data than the same table after compaction. This is the single most common performance issue in dbt + Iceberg deployments.

Full refreshes create orphan files

dbt run --full-refresh drops and recreates the table, but Iceberg's metadata still references old data files until snapshot expiration runs. If expiration never runs (and dbt does not run it), those files remain in storage indefinitely. A single full refresh of a 500 GB table orphans 500 GB of Parquet files. Do this monthly across 50 models and you accumulate terabytes of waste that inflates your cloud storage costs.

No built-in compaction or maintenance

dbt has no mechanism to run rewrite_data_files, expire_snapshots, remove_orphan_files, or rewrite_manifests. These are Iceberg maintenance procedures that require a separate execution path. Teams typically discover this after 2–3 months when queries start timing out on previously fast models.

Schema evolution gotchas

dbt's on_schema_change config (sync_all_columns, append_new_columns, ignore, fail) works differently with Iceberg than with traditional warehouses. Iceberg supports column addition, renaming, reordering, and type promotion natively — but dbt adapters may not expose all of these. Adding a column via sync_all_columns works; renaming a column may require a full refresh depending on the adapter. Test schema changes in development before applying to production incremental models.

Performance tuning

Beyond choosing the right strategy, several model-level configurations improve dbt + Iceberg performance:

Partition pruning in incremental models — ensure your WHERE clause in the {% if is_incremental() %} block aligns with partition columns. If your table is partitioned by day(event_date), filter on event_date — not on an unrelated timestamp column. Proper partition pruning can reduce scan volume by 95% or more.

Write distribution mode — set write.distribution-mode to hash when writing to partitioned tables. This ensures each writer task produces files for a single partition, avoiding small files scattered across many partitions. For sorted tables, use range to maintain sort order within files:

sql
1{{2  config(3    table_properties={4      'write.distribution-mode': 'hash'5    }6  )7}}

Batch size control — for high-frequency incremental models, control how much data each run processes. Processing too little data per run creates many small files; processing too much in a single MERGE creates excessive delete files. Find the sweet spot based on your table's write pattern.

Concurrent model execution — dbt's threads config controls how many models run in parallel. When multiple incremental models write to the same Iceberg catalog concurrently, commit conflicts can occur. Set commit.retry.num-retries to handle transient conflicts, or stagger dependent models using DAG ordering and tag-based selectors.

Testing and data quality

dbt's testing framework works on Iceberg tables without modification — unique, not_null, accepted_values, and relationships tests all execute as SELECT queries against the Iceberg table:

yaml
1# schema.yml2models:3  - name: orders_enriched4    columns:5      - name: order_id6        tests:7          - unique8          - not_null9      - name: amount10        tests:11          - not_null12          - dbt_utils.accepted_range:13              min_value: 014      - name: segment15        tests:16          - accepted_values:17              values: ['enterprise', 'mid-market', 'smb']18    tests:19      - dbt_utils.recency:20          datepart: hour21          field: updated_at22          interval: 6

Freshness checksdbt source freshness queries the source table's most recent timestamp to validate data is arriving on schedule. This works with Iceberg sources but depends on the adapter's ability to push down MAX() aggregations efficiently. On large uncompacted tables, freshness checks can be slow because they scan delete files alongside data files — another reason compaction matters.

Row count assertions — custom tests that validate row counts between source and target catch silent data loss that incremental models can introduce during schema changes or partition misalignment.

The maintenance gap

Here is what dbt does after each run: nothing. It commits the data and moves on. There is no post-run hook for compaction, no snapshot expiration, no orphan cleanup. This is by design — dbt is a transformation tool, and table maintenance is a platform concern.

But someone has to handle it. In practice, teams fill this gap in one of three ways:

Option 1: Custom scripts — a cron job that runs CALL iceberg.system.rewrite_data_files(...) and CALL iceberg.system.expire_snapshots(...) on a schedule. This works until you have 50+ tables with different maintenance needs, and the cron schedule cannot adapt to write patterns that change daily.

Option 2: Engine-native maintenance — Athena OPTIMIZE, Spark maintenance procedures, or Trino ALTER TABLE EXECUTE optimize. These are manual, per-table, and require the same engine (and cost) that runs your queries. Running compaction through Spark means paying for a Spark cluster purely for maintenance.

Option 3: A dedicated maintenance layer — a system that monitors dbt-managed tables and autonomously runs the right maintenance operations at the right time, without engine overhead or manual scheduling. This is what LakeOps provides.

How LakeOps fills the maintenance gap

LakeOps is an autonomous table maintenance layer for Apache Iceberg. It connects to your catalog, continuously monitors every table dbt touches, and runs the right maintenance operations — compaction, snapshot expiration, orphan cleanup, manifest optimization, sort optimization — based on actual table state rather than fixed schedules. When dbt commits data to an Iceberg table, LakeOps detects the change through catalog monitoring and evaluates whether maintenance is needed. The result: dbt stays focused on transformation logic, and the tables it produces stay fast, compact, and cost-efficient without manual intervention.

Modern Lakehouse Architecture — multi-engine access to dbt-managed Iceberg tables
Multi-engine architecture: Spark, Trino, Flink, Snowflake, Athena, and DuckDB all query the same dbt-managed Iceberg tables. LakeOps coordinates maintenance across every engine with full workload visibility.
LakeOps in action — autonomous table maintenance that fills the gap dbt leaves behind.

Key capabilities for dbt-managed Iceberg tables:

  • Event-driven compaction after dbt runs — LakeOps detects table state changes from dbt commits through continuous catalog monitoring, not cron. When the delete-file-to-data-file ratio crosses the configured threshold, compaction fires automatically within minutes.
  • Delete file resolution from incremental merges — every dbt merge strategy run creates position delete files. LakeOps triggers targeted compaction that physically applies the deletes — not a full table rewrite, but a surgical pass on the affected partitions. Read performance returns to baseline without manual intervention.
  • Snapshot expiration for full refreshes — configurable retention policies (typically 3–7 days) replace Iceberg's infinite default. When dbt runs --full-refresh, old data files are dereferenced and orphan cleanup removes them from storage.
  • [Manifest optimization](/blog/iceberg-metadata-lifecycle-maintenance-optimization) — as dbt runs accumulate commits, the manifest list grows. LakeOps rewrites manifests after compaction to keep Iceberg's metadata tree shallow and reduce query planning time.
  • Query-aware sort optimization — LakeOps collects query telemetry from Trino, Spark, Athena, and every connected engine, then applies sort compaction during maintenance that maximizes data skipping for the actual query mix hitting each table.
  • Rust execution engine — maintenance runs on Apache DataFusion, not Spark or Trino. No JVM, no cluster to provision, no cost per query. Compaction across hundreds of dbt-managed tables runs at a fraction of engine-based cost.
  • Lake-wide policies — set maintenance rules at the namespace or catalog level rather than per table. All tables in analytics get compaction at 50 delete files; all tables in staging get 24-hour snapshot retention. Policies follow dbt's project structure naturally.
LakeOps Dashboard — real-time health and optimization metrics across all dbt-managed Iceberg tables
LakeOps Dashboard: real-time visibility into table health, storage trends, and optimization status across every Iceberg table in your dbt project.
LakeOps Optimization — compaction and file management
LakeOps Optimization tab: compaction history, file size distribution, and optimization metrics for dbt-managed Iceberg tables.
LakeOps Layout Simulations — query-aware sort optimization for dbt-managed tables
Layout Simulations: LakeOps analyzes which columns dbt-managed tables are actually queried on — across Trino, Spark, Athena, and every connected engine — and applies sort compaction that maximizes data skipping for the real query mix.
LakeOps Cost Savings — storage and compute savings from autonomous maintenance
LakeOps cost impact: measurable reductions in storage and compute costs from automated compaction, snapshot expiration, and orphan file cleanup across dbt-managed tables.

Best practices checklist

A practical checklist for running dbt on Iceberg in production:

  • Fill the maintenance gap with a dedicated control plane — dbt transforms data but does not maintain tables. Connect a system like LakeOps to autonomously compact files, expire snapshots, clean orphans, and optimize sort orders based on actual query patterns. This is the single most impactful step — without it, every dbt model creates maintenance debt that accumulates silently
  • Choose the right incremental strategy — use insert_overwrite for partition-aligned batch loads, merge for upserts with low update rates, append for immutable events, delete+insert when you need upsert without MoR overhead
  • Set `write.distribution-mode`hash for partitioned tables, range for sorted tables
  • Align partition columns with incremental predicates — your WHERE clause in {% if is_incremental() %} should prune partitions, not force full scans
  • Set `write.target-file-size-bytes` — 128 MB for mixed workloads, 256 MB for scan-heavy analytics
  • Use Iceberg format version 2 — required for row-level deletes, position delete files, and merge-on-read
  • [Run compaction](/blog/iceberg-lake-compaction) after merge models — either through a dedicated maintenance system or scheduled procedures, but never skip it
  • Expire snapshots — configure retention based on your time-travel needs (typically 3–7 days), not the infinite default
  • Clean orphan files after full refreshes — the data dbt replaces does not delete itself
  • Monitor [delete file accumulation](/blog/iceberg-delete-files-merge-on-read) — track the delete-file-to-data-file ratio per table as a health metric
  • Test on Iceberg, not just the source — run dbt test against the Iceberg table to catch issues that incremental logic introduces

Conclusion

dbt and Iceberg are a strong pairing — SQL-based transformations landing in an open, engine-agnostic table format with ACID guarantees. But the pairing works best when teams acknowledge the boundary: dbt transforms data, it does not maintain tables. Incremental merges create delete files. Full refreshes orphan data. Manifests grow unbounded. None of these are dbt bugs — they are table maintenance responsibilities that dbt intentionally leaves to the platform layer.

The teams that run dbt on Iceberg successfully in production are the ones that close this gap — either with manual scripts, engine-native procedures, or a dedicated system like LakeOps that autonomously maintains every table dbt touches. The transformation logic stays clean, the tables stay fast, and the storage bill stays predictable.

Tags

Apache IcebergApache IcebergdbtIncremental ModelsData LakehouseTable MaintenanceCompaction

Related articles

Found this useful? Share it with your team.