Iceberg Hot and Cold Data Tiering: StarRocks + Iceberg for Real-Time Analytics

Hot and cold data tiering on Apache Iceberg with StarRocks — real-time analytics architecture

Hot and cold data tiering on Apache Iceberg solves a problem every analytics platform hits: dashboards need sub-second latency on fresh data, while historical analysis needs cheap petabyte scans. This pattern is proven in production — StarRocks as the hot tier, Iceberg as cold history, one SQL surface via federation. Fresha's engineering blog documents the architecture in detail.

The cold tier — Apache Iceberg on object storage — requires continuous maintenance to keep federation queries performant. Without compaction, sort optimization, and snapshot management, the cold tier degrades until queries that should take seconds take minutes.

This article covers the full architecture: why tiering is necessary, how StarRocks and Iceberg work together through external catalog federation, ingestion patterns for both tiers, tier transition mechanics that avoid gaps and duplicates, deduplication strategies across tiers, operational challenges with late-arriving data, performance tuning through caching and predicate pushdown, and when this pattern is the right choice versus alternatives.

The tiering problem

Same Data, Different Engines — Hot data lives in a real-time engine for sub-second queries. Cold data lives on object storage in Iceberg format. Federation makes them appear as one dataset to consumers.

The core tension is not about storage cost alone — it is about query engine optimization. A real-time analytics engine like StarRocks keeps recent data in memory-mapped columnar storage with primary-key indexes, bitmap indexes, and pre-aggregated rollups. Queries against this hot data return in milliseconds. But maintaining this level of optimization for years of historical data is economically absurd — the storage cost alone would exceed the value of the queries against it.

Conversely, Iceberg on object storage costs pennies per terabyte-month — and with proper S3 cost optimization, even petabyte-scale cold tiers remain remarkably affordable. You can store five years of event data for less than what a single month of hot-tier storage costs. But scanning cold data through a generic query engine — even with Iceberg's metadata pruning — adds latency that dashboard users will not tolerate for their real-time metrics.

The naive approaches all fail. A single engine on hot storage cannot hold enough history. A single engine on cold storage cannot deliver real-time latency. Running two completely separate systems with different schemas and query interfaces fragments the analytics experience — users must know which system to query for which time range, and cross-tier queries (comparing today's metrics to last year's baseline) become multi-step manual operations.

The tiering pattern solves this by combining three elements: a hot tier optimized for real-time queries on recent data, a cold tier optimized for cost-efficient storage of historical data, and a federation layer that presents both tiers as a single queryable surface. The user writes one SQL query. The engine handles the routing internally — fetching recent rows from hot storage and historical rows from cold storage, combining them transparently.

Organizations that have adopted this pattern report replacing traditional data warehouses entirely. Fresha's data engineering team published their production implementation: they exported historical data from Snowflake into Iceberg via Spark, stream CDC events from Kafka through Flink into StarRocks primary-key tables for the hot tier, and use StarRocks federation to present both tiers as a single SQL surface — replacing their Snowflake dependency. Their deduplication runs through dbt, tier transitions are orchestrated by Airflow, and late-arriving events are routed to a dedicated staging table to prevent data loss during partition expiration. The economics are compelling: real-time query performance on fresh data (which warehouses struggle with due to ingestion lag), petabyte-scale historical storage at object-storage pricing (which warehouses charge orders of magnitude more for), and a single SQL interface for both (which separate systems cannot provide). The architecture eliminates the warehouse as the expensive middle ground that does neither job optimally.

Architecture pattern: hot tier plus cold tier

The architecture places StarRocks as the hot tier and Apache Iceberg on object storage as the cold tier. StarRocks is a vectorized MPP analytics engine designed for real-time workloads — it stores data in its own columnar format with primary-key tables that support upsert semantics, efficient point lookups, and pre-aggregated materialized views. Iceberg provides the cold tier with ACID transactions, schema evolution, time travel, and partition pruning — all on commodity object storage.

The key architectural insight is that StarRocks can attach Iceberg as an external catalog, creating a federation layer that spans both tiers. From the SQL perspective, hot-tier tables and cold-tier tables exist in the same namespace. A single query can JOIN a hot-tier primary-key table with a cold-tier Iceberg table. StarRocks handles the execution planning — reading local columnar data for hot rows and pushing down filters to Iceberg for cold rows.

The freshness model typically operates across three tiers. The hot tier delivers data with seconds of latency — events flow from Kafka through Flink into StarRocks primary-key tables, where they are immediately queryable. The warm tier delivers data with minutes of latency — StarRocks reads recent Iceberg partitions through its external catalog, benefiting from Iceberg's metadata caching and StarRocks' own data cache. The cold tier stores historical data in Iceberg with periodic Spark backfills handling corrections, late arrivals, and schema migrations.

This three-tier freshness model maps naturally to how organizations consume analytics. Executive dashboards showing today's revenue need the hot tier. Operational reports comparing this week to last week query across hot and warm tiers through federation. Annual planning models scanning three years of data hit the cold tier exclusively. Each consumer gets appropriate freshness and performance without knowing which tier serves their query.

How StarRocks and Iceberg federation works

StarRocks' external catalog mechanism is the architectural foundation that makes tiering practical. Rather than requiring ETL pipelines to move data between systems, StarRocks reads Iceberg tables directly — the same physical Parquet files on the same object storage, accessed through the same Iceberg catalog metadata.

Attaching Iceberg as an external catalog

StarRocks registers an Iceberg catalog as an external data source. The configuration points to either a Hive Metastore, AWS Glue, or an Iceberg REST catalog endpoint. Once registered, every Iceberg namespace and table appears as a queryable entity within StarRocks — no table-by-table registration, no schema mapping, no ongoing sync.

Configuration for attaching an Iceberg REST catalog. StarRocks 4.0+ supports JWT identity passthrough and vended credentials — the catalog issues short-lived storage tokens rather than requiring long-lived S3 keys in the engine configuration:

sql

1-- REST catalog with vended credentials (recommended)2CREATE EXTERNAL CATALOG iceberg_cold3PROPERTIES (4    'type' = 'iceberg',5    'iceberg.catalog.type' = 'rest',6    'iceberg.catalog.uri' = 'https://catalog.example.com/api/v1',7    'iceberg.catalog.warehouse' = 's3://lakehouse/warehouse',8    'iceberg.catalog.security' = 'jwt',9    'iceberg.catalog.vended-credentials-enabled' = 'true'10);11 12-- Hive Metastore variant13CREATE EXTERNAL CATALOG iceberg_cold14PROPERTIES (15    'type' = 'iceberg',16    'iceberg.catalog.type' = 'hive',17    'hive.metastore.uris' = 'thrift://metastore:9083',18    'aws.s3.region' = 'us-east-1',19    'aws.s3.access_key' = '<access-key>',20    'aws.s3.secret_key' = '<secret-key>'21);

Once attached, queries against the external catalog look identical to queries against internal tables:

sql

1-- Query hot tier (internal StarRocks table)2SELECT customer_id, SUM(revenue)3FROM hot_db.orders4WHERE order_date >= CURRENT_DATE - INTERVAL 7 DAY5GROUP BY customer_id;6 7-- Query cold tier (Iceberg via external catalog)8SELECT customer_id, SUM(revenue)9FROM iceberg_cold.analytics.orders10WHERE order_date >= '2024-01-01' AND order_date < '2026-01-01'11GROUP BY customer_id;12 13-- Federation: single query spanning both tiers14SELECT customer_id, SUM(revenue)15FROM (16    SELECT customer_id, revenue, order_date17    FROM hot_db.orders18    WHERE order_date >= CURRENT_DATE - INTERVAL 7 DAY19    UNION ALL20    SELECT customer_id, revenue, order_date21    FROM iceberg_cold.analytics.orders22    WHERE order_date >= '2024-01-01'23      AND order_date < CURRENT_DATE - INTERVAL 7 DAY24) combined25GROUP BY customer_id;

Unified SQL surface

The federation model means consumers do not need to know about tiering. A dbt model or a view can encapsulate the UNION logic, presenting a single orders entity to downstream dashboards and reports. The tier boundary (which date range comes from hot vs. cold) is a configuration detail hidden behind the abstraction. When the boundary shifts — say, from 7 days to 14 days — only the view definition changes. No downstream consumer is affected.

This is architecturally significant because it decouples the tiering strategy from the consumption layer. Platform engineers can tune tier boundaries based on workload patterns, storage costs, and freshness SLAs without coordinating with every dashboard and report owner. The unified SQL surface absorbs the complexity.

Predicate pushdown into Iceberg

When StarRocks queries Iceberg tables through the external catalog, it does not naively scan all files. StarRocks pushes filter predicates down into the Iceberg metadata layer, leveraging Iceberg's partition pruning and per-file column statistics to eliminate data files before any I/O occurs. For well-partitioned, well-sorted Iceberg tables, this can eliminate 95%+ of files from a scan.

The optimization chain works as follows: StarRocks evaluates the WHERE clause and identifies predicates applicable to partition transforms (eliminating entire partitions), then checks per-file min/max statistics in manifests (eliminating files within partitions), then applies row-level filters during Parquet read (eliminating row groups within files). Each stage reduces the data volume for the next stage.

This is where cold-tier maintenance becomes critical. If Iceberg tables are fragmented — thousands of small files with overlapping key ranges and no sort order — predicate pushdown eliminates very little. StarRocks must read far more data than necessary. LakeOps — a control plane for Apache Iceberg lakehouses — addresses this directly: continuous compaction reduces file counts, sort optimization ensures data is physically ordered by the columns StarRocks filters on most, and the result is healthier tables, faster federation queries, and lower storage costs without manual maintenance. LakeOps' query-aware sort optimization is particularly valuable here — it observes which predicates StarRocks pushes down most frequently and ensures the Iceberg sort order maximizes elimination for those specific filter patterns.

Ingestion patterns

The tiering architecture requires two parallel ingestion paths — one feeding the hot tier with sub-second latency, and one populating the cold tier for long-term storage. Getting both paths right, without data loss or duplication, is the primary engineering challenge.

Hot path: Kafka to Flink to StarRocks

The hot ingestion path delivers events from source systems to queryable state in StarRocks within seconds. The typical pipeline is: source systems emit events to Kafka topics, Flink consumes from Kafka with exactly-once semantics, and Flink writes to StarRocks primary-key tables via the Stream Load API or Routine Load connector.

StarRocks primary-key tables are essential for the hot tier because they support upsert semantics. When an order status changes from 'pending' to 'shipped', the update overwrites the previous row in-place rather than appending a new version. This means dashboard queries always see the current state without running deduplication logic at query time — a critical property for real-time dashboards that cannot afford the latency of query-time MERGE.

Flink's role is transformation and enrichment between Kafka and StarRocks. Raw events are parsed, validated, enriched with dimension data (customer segments, product categories), and written to StarRocks in micro-batches aligned with Flink checkpoints. A typical checkpoint interval of 10–30 seconds balances latency against write amplification — shorter intervals mean more frequent but smaller writes to StarRocks, which increases write overhead but reduces query staleness.

java

1// Flink job writing to StarRocks via JDBC sink2DataStream<OrderEvent> enrichedOrders = kafkaSource3    .map(new OrderEnrichmentFunction())4    .filter(new ValidationFilter());5 6enrichedOrders.addSink(7    JdbcSink.sink(8        "INSERT INTO hot_db.orders (order_id, customer_id, revenue, status, order_date) " +9        "VALUES (?, ?, ?, ?, ?) " +10        "ON DUPLICATE KEY UPDATE status = VALUES(status), revenue = VALUES(revenue)",11        (ps, order) -> {12            ps.setString(1, order.getOrderId());13            ps.setLong(2, order.getCustomerId());14            ps.setBigDecimal(3, order.getRevenue());15            ps.setString(4, order.getStatus());16            ps.setTimestamp(5, order.getOrderDate());17        },18        JdbcExecutionOptions.builder()19            .withBatchSize(5000)20            .withBatchIntervalMs(10000)21            .build(),22        new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()23            .withUrl("jdbc:mysql://starrocks-fe:9030/hot_db")24            .build()25    )26);

Cold path: Flink and Spark to Iceberg

The cold ingestion path serves two purposes: it provides a durable, append-only record of all events (the source of truth), and it enables historical queries at object-storage cost. Two strategies populate the cold tier, and production architectures often use both.

Strategy 1: Flink dual-write. The same Flink job that feeds the hot tier also commits to Iceberg on every checkpoint — creating an append-only log of all events in Iceberg format. This dual-write ensures that the cold tier stays current without a separate ingestion pipeline. The cold tier lags the hot tier only by the Flink checkpoint interval (seconds to minutes).

sql

1-- Flink SQL writing to Iceberg cold tier2INSERT INTO iceberg_cold.analytics.orders3SELECT4    order_id,5    customer_id,6    revenue,7    status,8    order_date,9    processing_time10FROM kafka_orders

Strategy 2: Warehouse export plus Spark. Organizations migrating from a traditional data warehouse (Snowflake, BigQuery, Redshift) already have years of historical data in the warehouse. Rather than replaying history through Flink, they export historical data to Parquet on object storage and use Spark to create properly partitioned Iceberg tables from those files. This is the approach Fresha used — exporting history from Snowflake, then using Spark to write Iceberg tables with the correct partition spec and table properties. Ongoing tiering from the hot tier to cold also uses Spark MERGE INTO to move aged partitions from StarRocks into Iceberg.

Spark also handles backfills, corrections, and heavy transformations. When a schema migration requires rewriting historical partitions, when a data quality issue requires patching months of records, or when a new derived table must be populated from raw events — Spark runs batch jobs against the cold tier. These jobs are infrequent but large, often rewriting entire partitions with corrected data.

The dual-write pattern (Flink writing to both hot and cold simultaneously) is valuable when starting from scratch. It means the cold tier is not populated by extracting from the hot tier — a pattern that would create a dependency where hot-tier failures cascade to cold-tier staleness. Instead, both tiers are populated independently from the same Kafka source, and each can operate, fail, and recover independently. When migrating from a warehouse, the export-plus-Spark approach seeds the cold tier with historical data, and either dual-write or periodic Spark merges keep it current going forward.

The file proliferation problem

Flink's streaming writes to Iceberg create a predictable operational challenge: many small files. Each Flink checkpoint produces a new set of Parquet files — with 30-second checkpoints on a high-volume table, that is 2,880 file commits per day, each containing only the rows accumulated during that 30-second window. After a month, a single table can have hundreds of thousands of small files.

Small files devastate query performance across every engine that reads the cold tier. StarRocks federation queries must open more files, read more metadata, and coordinate more I/O operations. The Iceberg metadata itself grows — manifest files become large, planning time increases, and the per-file overhead dominates actual data processing.

This is where automated compaction becomes non-negotiable. LakeOps continuously monitors file sizes and counts across all Iceberg tables, triggering compaction when tables cross configurable thresholds. For tables receiving streaming writes from Flink, LakeOps schedules compaction during checkpoint gaps — avoiding conflicts with active writes while ensuring that downstream federation queries always hit reasonably-sized files. The compaction runs on LakeOps' own Rust-based execution engine, independent of StarRocks, Spark, or Flink resources — no capacity contention with production workloads.

LakeOps Table Monitoring — health classification across the lake — Lake-wide table health: file counts, manifest state, snapshot depth, and partition details — cold-tier tables degrading from streaming writes surface immediately as Warning or Critical.

Tier transition mechanics

Moving data from the hot tier to the cold tier is conceptually simple — at some cutoff point, data ages out of StarRocks and exists only in Iceberg. In practice, this transition is the most operationally complex part of the architecture because it must guarantee three properties simultaneously: no data loss (every record exists in at least one tier at all times), no duplicates visible to consumers (records that exist in both tiers during transition are deduplicated at query time), and no query gaps (the transition does not create a window where recent data is missing from both tiers).

Time-based cutoff strategy

The simplest tier transition strategy uses a time-based cutoff. Data younger than N days lives in the hot tier. Data older than N days exists only in the cold tier. The overlap period — where data exists in both tiers — provides safety margin for the transition.

A typical configuration: hot tier retains 14 days of data in StarRocks primary-key tables. Cold tier receives all data continuously via Flink's dual-write. The cutoff view exposes hot data for the most recent 7 days and cold data for everything older than 7 days — leaving a 7-day overlap where data exists in both tiers but is served from hot.

The retention job runs daily via Airflow, deleting records from StarRocks that are older than 14 days. Because the cold tier already has these records (written by Flink at ingest time), no data is lost. The 7-day gap between the query cutoff (7 days) and the retention cutoff (14 days) provides safety: even if the cold-tier Flink writer fails for several days, data remains queryable from the hot tier until the issue is resolved.

python

1# Airflow DAG for tier transition2from airflow import DAG3from airflow.operators.python import PythonOperator4from datetime import datetime, timedelta5 6def expire_hot_tier_data():7    """Remove data from StarRocks that has aged past the retention window."""8    cutoff_date = datetime.now() - timedelta(days=14)9    starrocks_conn.execute(f"""10        DELETE FROM hot_db.orders11        WHERE order_date < '{cutoff_date.strftime('%Y-%m-%d')}'12    """)13 14def verify_cold_tier_completeness():15    """Ensure cold tier has all records before hot tier expires them."""16    hot_count = starrocks_conn.execute("""17        SELECT COUNT(*) FROM hot_db.orders18        WHERE order_date BETWEEN CURRENT_DATE - INTERVAL 15 DAY19          AND CURRENT_DATE - INTERVAL 14 DAY20    """).fetchone()[0]21    cold_count = starrocks_conn.execute("""22        SELECT COUNT(*) FROM iceberg_cold.analytics.orders23        WHERE order_date BETWEEN CURRENT_DATE - INTERVAL 15 DAY24          AND CURRENT_DATE - INTERVAL 14 DAY25    """).fetchone()[0]26    if cold_count < hot_count * 0.99:27        raise Exception(f"Cold tier incomplete: {cold_count} vs {hot_count} in hot")28 29with DAG('tier_transition', schedule_interval='@daily') as dag:30    verify = PythonOperator(task_id='verify_cold', python_callable=verify_cold_tier_completeness)31    expire = PythonOperator(task_id='expire_hot', python_callable=expire_hot_tier_data)32    verify >> expire

Workload-aware cutoff management

A static N-day cutoff is rarely optimal. Some tables are queried heavily for 30 days (monthly reporting), others are only relevant for 48 hours (real-time alerting). Airflow orchestrates tier transitions based on workload patterns and freshness SLAs per table — moving alerting tables out of hot tier after 2 days while keeping monthly-reporting tables hot for 35 days.

The decision logic considers query patterns (how frequently is data at age N queried from the hot tier?), cost (what is the marginal cost of keeping another day of data in StarRocks vs. serving it from Iceberg?), and latency requirements (does the consuming dashboard tolerate the added latency of Iceberg federation for data at this age?).

Production deployments often discover that the optimal cutoff varies not just by table but by time of month. Month-end reporting queries scan 30+ days of data intensively for a few days, then access drops off. Dynamic cutoff management responds to these patterns — extending hot retention before month-end close and contracting it after, automatically balancing cost against query performance.

Deduplication across tiers

Because the hot tier (StarRocks) and cold tier (Iceberg) overlap during the transition window, consumer-facing queries must deduplicate records that appear in both tiers. This is the most nuanced aspect of the tiering architecture — getting deduplication wrong means either double-counting (inflating metrics) or missing records (understating metrics).

The dbt UNION with dedup pattern

The standard approach uses dbt to model a unified view that combines both tiers with explicit deduplication logic. The model UNIONs hot and cold data with a cutoff boundary, and uses primary-key-based deduplication to resolve overlapping records:

sql

1-- dbt model: unified_orders.sql2WITH hot_data AS (3    SELECT4        order_id,5        customer_id,6        revenue,7        status,8        order_date,9        'hot' AS source_tier10    FROM {{ source('starrocks', 'orders') }}11    WHERE order_date >= CURRENT_DATE - INTERVAL 7 DAY12),13cold_data AS (14    SELECT15        order_id,16        customer_id,17        revenue,18        status,19        order_date,20        'cold' AS source_tier21    FROM {{ source('iceberg_cold', 'orders') }}22    WHERE order_date < CURRENT_DATE - INTERVAL 7 DAY23),24combined AS (25    SELECT * FROM hot_data26    UNION ALL27    SELECT * FROM cold_data28),29deduped AS (30    SELECT *,31        ROW_NUMBER() OVER (32            PARTITION BY order_id33            ORDER BY34                CASE source_tier WHEN 'hot' THEN 1 ELSE 2 END35        ) AS rn36    FROM combined37)38SELECT * FROM deduped WHERE rn = 1

The deduplication logic always prefers the hot tier when a record exists in both, because the hot tier has the most current state (StarRocks primary-key tables reflect the latest upsert). Cold-tier records in the overlap window may be stale if updates arrived after the cold-tier write.

Dedup without the overlap

A cleaner alternative eliminates the overlap entirely in the consumer view — the hot tier serves everything newer than the cutoff, the cold tier serves everything older, and there is no PARTITION BY deduplication needed:

sql

1-- Non-overlapping tier boundary2SELECT order_id, customer_id, revenue, status, order_date3FROM hot_db.orders4WHERE order_date >= CURRENT_DATE - INTERVAL 7 DAY5UNION ALL6SELECT order_id, customer_id, revenue, status, order_date7FROM iceberg_cold.analytics.orders8WHERE order_date < CURRENT_DATE - INTERVAL 7 DAY

This approach is simpler and faster (no window function), but it requires absolute confidence that the cold tier is complete and accurate up to the cutoff date. If Flink's cold-tier writer lagged or failed, records near the boundary may be missing from the cold tier. The verification step in the Airflow DAG guards against this — never moving the cutoff past data that has not been confirmed in the cold tier.

Handling updates that cross tiers

The hardest deduplication scenario occurs when an update arrives for a record that has already transitioned to the cold tier. An order placed 10 days ago (now cold-only) receives a status update today. The hot tier's Flink writer processes the update and writes it to StarRocks — but StarRocks may have already expired that record. The cold tier has the old state.

Three strategies handle cross-tier updates. First, extend hot retention long enough that updates rarely arrive after expiration — if 99.9% of updates arrive within 7 days, a 14-day hot retention handles them. Second, route late updates to a correction pipeline that applies them directly to the Iceberg cold tier via a Spark MERGE operation. Third, maintain a small 'late arrivals' table in StarRocks that holds cross-tier updates and is included in the UNION logic — this table is periodically flushed to Iceberg by a Spark job.

Hard deletes present a subtler cross-tier problem. If a record that has already transitioned to the cold tier gets deleted in the source system, removing it from the hot tier alone causes the cold-tier copy to resurface in query results. Fresha's solution is to model hard deletes as soft deletes at the ingestion layer — setting a _hard_deleted_at timestamp in Flink rather than passing the delete through. The dbt deduplication model filters on this flag, and a periodic Spark job applies the delete to the cold tier. This pattern ensures that deletes propagate correctly across tiers without requiring Iceberg's row-level delete files for every incoming CDC delete event.

Operational challenges

Late-arriving events

Late-arriving events — records with timestamps in the past that arrive after their time partition has transitioned to cold — are the most common operational headache in tiered architectures. A mobile app that was offline for three days, a partner system that batches events weekly, or a retry queue that replays failures from last week all produce late arrivals.

The operational challenge is twofold. First, the late event must land in the correct time partition in the cold tier (Iceberg), not just in the current partition — otherwise time-based queries will miss it. Second, if a consumer view uses strict cutoff boundaries, the late event might land in the cold tier's current partition but be excluded by the cold-tier WHERE clause that only includes data older than the cutoff.

The solution combines Iceberg's ability to write to historical partitions with explicit late-arrival handling in the ingestion pipeline. The most robust approach — used in production at Fresha — routes late events to a dedicated late_events table in StarRocks rather than attempting to write them to an expired hot-tier partition. Flink SQL's EXECUTE STATEMENT SET block enables this routing: events with timestamps within the retention window go to the main hot-tier table, while events with timestamps beyond the cutoff go to the late-events table. A periodic Spark job then merges the late-events table into the correct Iceberg partitions via MERGE INTO. This three-layer approach (hot, cold, and late) keeps the deduplication model clean — the dbt UNION includes all three sources, and the late-events table remains small because it only accumulates records that would otherwise be lost during partition expiration.

sql

1-- Flink SQL routing late events to a dedicated table2EXECUTE STATEMENT SET3BEGIN4INSERT INTO hot_db.orders5SELECT * FROM orders_cdc_view6WHERE order_date >= TIMESTAMPADD(7    MONTH, -${retention_months},8    CAST(DATE_FORMAT(CURRENT_TIMESTAMP, 'yyyy-MM-01 00:00:00') AS TIMESTAMP)9);10 11INSERT INTO hot_db.late_orders12SELECT * FROM orders_cdc_view13WHERE order_date < TIMESTAMPADD(14    MONTH, -${retention_months},15    CAST(DATE_FORMAT(CURRENT_TIMESTAMP, 'yyyy-MM-01 00:00:00') AS TIMESTAMP)16);17END;

Cutoff consistency under failure

When the hot-tier Flink writer fails, the cutoff boundary becomes dangerous. If the cold-tier writer is also affected (same Flink job writing to both), there is a window where neither tier has fresh data. If only the hot-tier writer fails, the cold tier continues accumulating but the hot tier falls behind — when it recovers, it may replay events that the cold tier already has, creating duplicates.

Airflow health checks monitor both writers and adjust the cutoff boundary dynamically. If the hot-tier writer is down for more than one checkpoint interval, the cutoff shifts to include cold-tier data for the affected window. When the hot-tier writer recovers and catches up, the cutoff shifts back. This dynamic adjustment prevents gaps but requires careful coordination to avoid briefly exposing duplicates during the transition.

Schema evolution across tiers

Schema changes must propagate to both tiers consistently. Adding a column to the hot-tier StarRocks table without adding it to the Iceberg cold tier breaks the UNION in the dbt model. Iceberg's schema evolution handles this gracefully — columns can be added, renamed, or widened without rewriting data — but the StarRocks schema must evolve in sync.

The recommended pattern: schema changes are applied to the Iceberg catalog first (Iceberg handles evolution natively), then applied to StarRocks (ALTER TABLE ADD COLUMN), then the dbt model is updated to include the new column. Applying changes in this order ensures that the cold tier can always receive data with the new schema, even if the hot-tier change is briefly delayed.

Performance considerations

Data caching in StarRocks

StarRocks implements a multi-level cache hierarchy for external catalog queries. When a federation query accesses Iceberg data, the physical Parquet files are read from object storage and cached locally on StarRocks BE (backend) nodes. Subsequent queries against the same files hit the local cache instead of object storage — reducing latency from hundreds of milliseconds to single-digit milliseconds.

The cache operates at the data-file level with an LRU eviction policy. Frequently-accessed Iceberg files (recent partitions, popular dimension tables) stay cached while rarely-accessed historical data is evicted. The effective result is that warm-tier queries (recent Iceberg data that is accessed frequently) perform nearly as fast as hot-tier queries — the first access pays the cold-storage penalty, but subsequent accesses are served from local SSD.

sql

1-- Configure StarRocks data cache for external catalog queries2ALTER SYSTEM SET enable_datacache = true;3ALTER SYSTEM SET datacache_mem_size = '32G';4ALTER SYSTEM SET datacache_disk_size = '500G';

Predicate pushdown optimization

The effectiveness of StarRocks' predicate pushdown into Iceberg depends entirely on how well the Iceberg table's physical layout matches the query predicates. Three factors determine pushdown effectiveness.

Partition pruning is the first and most impactful optimization — its effectiveness depends entirely on choosing the right partition strategy. If the Iceberg table is partitioned by day(order_date), a query filtering WHERE order_date >= '2026-06-01' eliminates all partitions before June 2026 without reading any data files. This works regardless of sort order or file size — partition pruning operates on metadata alone.

File-level min/max statistics provide the second level of elimination. Iceberg stores per-column min/max values for every data file in its manifest files. A query filtering WHERE customer_id = 12345 on a table sorted by customer_id can eliminate 99% of files because each file's min/max range covers a narrow, non-overlapping slice of the customer_id space. On an unsorted table, the same filter eliminates very few files because every file's range potentially overlaps with the target value.

Row-group pruning within Parquet files provides the third level. Parquet stores per-row-group statistics that StarRocks can evaluate before decompressing data. Well-sorted data means row groups within a file also have tight ranges, enabling sub-file elimination.

LakeOps' sort optimization is specifically designed to maximize predicate pushdown effectiveness. By analyzing which columns appear most frequently in StarRocks' pushed-down predicates — observable through LakeOps' cross-engine telemetry — LakeOps can automatically adjust Iceberg sort orders to align with actual query patterns. If StarRocks federation queries consistently filter on customer_id and event_type, LakeOps sorts the cold-tier data accordingly, ensuring that each subsequent compaction produces files with tight, non-overlapping ranges on those columns. The result compounds: better sort → better pushdown → fewer files read → faster federation queries.

Materialized views for cross-tier queries

StarRocks materialized views can pre-compute cross-tier aggregations, eliminating repetitive federation queries for common access patterns:

sql

1CREATE MATERIALIZED VIEW mv_daily_revenue2DISTRIBUTED BY HASH(customer_id)3REFRESH ASYNC EVERY (INTERVAL 10 MINUTE)4AS5SELECT6    customer_id,7    DATE_TRUNC('day', order_date) AS day,8    SUM(revenue) AS daily_revenue,9    COUNT(*) AS order_count10FROM (11    SELECT customer_id, order_date, revenue12    FROM hot_db.orders13    UNION ALL14    SELECT customer_id, order_date, revenue15    FROM iceberg_cold.analytics.orders16    WHERE order_date >= CURRENT_DATE - INTERVAL 90 DAY17) combined18GROUP BY customer_id, DATE_TRUNC('day', order_date);

The materialized view refreshes every 10 minutes, scanning the hot tier (fast) and the recent cold tier (cached) to maintain a pre-aggregated summary. Dashboard queries against mv_daily_revenue return in milliseconds regardless of the underlying data volume. The refresh cost is amortized across all consumers — one federation scan serves hundreds of dashboard queries.

This is where cold-tier file optimization has a direct impact on refresh performance. Each materialized view refresh scans Iceberg partitions. If those partitions contain thousands of small, unsorted files, the refresh takes minutes and consumes significant StarRocks cluster resources. With LakeOps maintaining compact, sorted Iceberg files, the same refresh completes in seconds — and the materialized view stays fresher because refreshes can run more frequently without resource contention.

Multi-engine awareness and query routing

LakeOps multi-engine routing — LakeOps provides unified observability and routing across both the hot tier (StarRocks) and cold tier (Iceberg) — understanding the writing engines, reading engines, and optimizing the physical layout for both.

The hot/cold tiering architecture inherently involves multiple engines: StarRocks for hot queries and federation, Flink for streaming ingestion, Spark for batch corrections and backfills, and potentially additional engines (Trino, DuckDB) for specific cold-tier workloads. LakeOps understands all of these engines simultaneously — a capability that no single engine possesses.

Understanding writers and readers

LakeOps observes both the writing engines (Flink streaming to Iceberg, Spark running backfills) and the reading engines (StarRocks federation queries, Trino ad-hoc analysis). This dual awareness enables intelligent maintenance scheduling: compaction runs after Flink checkpoints complete but before StarRocks' materialized view refreshes, ensuring that each refresh encounters freshly-compacted files rather than the raw small-file output of streaming writes.

Without this multi-engine awareness, teams resort to static cron schedules that inevitably conflict with either writers or readers. A compaction job scheduled at midnight that overlaps with a Flink checkpoint creates commit conflicts. A compaction that runs just before a StarRocks MV refresh means the refresh reads stale file layouts. LakeOps eliminates these conflicts by observing the actual cadence of all engines and scheduling maintenance in the gaps.

Query routing across tiers

LakeOps QueryFlux can route queries intelligently based on the data's age and the optimal engine for each time range. A query scanning today's data routes to StarRocks (hot tier, millisecond latency). A query scanning last year's data routes to Trino or Spark (cold tier, cost-optimized compute). A query spanning both routes to StarRocks with federation (leveraging its cache for recent cold data and pushdown for historical cold data).

This routing intelligence goes beyond simple time-based rules. QueryFlux considers table health — if a cold-tier table is currently fragmented (pending compaction), queries that would normally route to a cold-tier engine are temporarily routed to StarRocks federation instead, because StarRocks' caching layer absorbs the fragmentation penalty better than a cold-read engine. Once LakeOps completes compaction and the table health improves, routing reverts to the cost-optimal path. For the full routing architecture, see the multi-engine routing solution.

Unified observability

A tiered architecture without unified observability is a tiered architecture that will degrade silently. The hot tier might be healthy while the cold tier accumulates thousands of small files. StarRocks federation queries might be slowing down gradually — but because the degradation is in Iceberg, not in StarRocks' internal metrics, it does not trigger StarRocks alerts.

LakeOps Dashboard — lake-wide health and optimization activity — LakeOps Dashboard: 30-day optimization activity, cost savings, and health distribution across Critical, Warning, and Healthy tables — unified visibility across both the hot-tier ingestion path and the cold-tier storage layer.

LakeOps provides a unified health view across both tiers. Every Iceberg table is classified as Critical, Warning, or Healthy based on continuously-updated structural signals — file count, average file size, sort order effectiveness, snapshot accumulation, and delete file ratio. Individual problems surface as Insights at four severity levels (CRITICAL, HIGH, WARNING, LOW), each with the specific metric that triggered it, the threshold violated, and one-click remediation. When a table's health degrades below Warning threshold, LakeOps triggers compaction automatically and — if QueryFlux is deployed — adjusts routing to route queries away from the degraded table until health recovers. This closed-loop optimization ensures that cold-tier degradation never silently impacts hot-tier query performance. For the full observability architecture, see the lakehouse observability guide.

When to use this pattern vs alternatives

The hot/cold tiering pattern with StarRocks and Iceberg is not universally optimal. It solves a specific set of requirements and introduces a specific set of operational costs. Understanding when it is the right choice — and when simpler alternatives suffice — prevents over-engineering.

This pattern is right when:

You need sub-second query latency on real-time data AND petabyte-scale historical analysis on the same logical tables. The tiering pattern is the only architecture that delivers both without duplicating data into separate systems with separate query interfaces.

Your workload has a clear temporal access pattern — recent data is queried frequently and intensively, while historical data is queried occasionally or in batch. This pattern is well-served by the hot/cold split because each tier is optimized for its access frequency.

You are replacing a traditional data warehouse and need to maintain its query ergonomics (single SQL surface, sub-second dashboards) while escaping its economics (credit-based pricing, compute-storage coupling, vendor lock-in). The StarRocks + Iceberg architecture provides warehouse-like UX at lakehouse economics.

Your data volumes exceed what a single-tier hot engine can economically hold. If you have 500TB of historical data and 2TB of recent data, storing all 500TB in StarRocks is absurd. The tiering pattern stores 500TB at $10/TB/month in Iceberg and 2TB at high performance in StarRocks — a natural FinOps optimization that aligns storage costs with data access frequency.

Consider alternatives when:

All your data is recent and query latency requirements are uniform. If you only need 30 days of data and every query needs sub-second performance, a single StarRocks instance with no federation may be simpler and sufficient. Adding Iceberg and the tier transition machinery is unnecessary overhead.

Your workload is purely batch and nobody needs real-time dashboards. If all consumers tolerate minutes-to-hours latency, a single Iceberg lakehouse with Spark/Trino is simpler, cheaper, and operationally lighter. StarRocks adds value only when sub-second latency is a requirement.

Your data volume is small enough to fit entirely in a single engine's hot storage. If your entire dataset is 100GB, the complexity of tiering is not justified. Put it all in StarRocks (or Trino, or DuckDB) and skip the architectural overhead.

You lack the engineering capacity to operate the tier transition machinery. The Airflow DAGs, deduplication models, late-arrival handling, and cutoff management require ongoing attention. If your team is two data engineers supporting fifty dashboards, the operational burden of tiering may exceed the value it delivers. In this case, a managed solution that handles tier transitions automatically — or simply a single-tier architecture with accepted limitations — is more practical.

The middle ground

For organizations that need tiering but want to minimize operational burden, the architecture can be simplified by leaning on automation. LakeOps handles cold-tier maintenance (compaction, sort optimization, snapshot management) automatically. Airflow handles tier transitions with pre-built DAGs. dbt handles deduplication logic with standard models. StarRocks handles federation transparently. The platform team configures policies (retention windows, freshness SLAs, sort orders) and the automation handles execution.

The result is an architecture that delivers warehouse-like query performance on real-time data, lakehouse economics on historical data, and a single SQL surface for consumers — without requiring a large platform team to operate it manually. The cold tier stays healthy because LakeOps compacts and sorts it continuously. The hot tier stays fresh because Flink streams into it directly. The boundary between them is managed by Airflow based on configurable policies. And QueryFlux routes queries to the optimal engine based on real workload patterns.

Conclusion

Hot/cold data tiering with StarRocks and Apache Iceberg is the architecture that resolves the fundamental tension between real-time analytics and historical analysis. StarRocks provides the sub-second query performance that dashboards demand, with primary-key tables that reflect the current state of every record. Iceberg provides petabyte-scale storage at object-storage cost, with ACID guarantees and engine-agnostic access. Federation through StarRocks' external catalog makes both tiers appear as a single dataset — consumers write one query and the engine handles the rest.

The operational complexity is real but manageable: ingestion pipelines that feed both tiers simultaneously, tier transition logic that prevents gaps and duplicates, deduplication models that handle the overlap window, and late-arrival processing that routes events to the correct historical partition. Each of these concerns has a well-understood solution — Flink for dual-write ingestion, Airflow for orchestrated transitions, dbt for deduplication modeling, and Spark for corrections.

The cold tier requires continuous attention. Without maintenance, Iceberg tables degrade — files proliferate, sort orders drift, snapshots accumulate — and federation queries slow down gradually until the architecture's latency promises are broken. LakeOps provides the control plane that keeps the cold tier healthy: autonomous compaction coordinated with streaming writes, sort optimization aligned with StarRocks' predicate pushdown patterns, snapshot management that balances time-travel needs against metadata bloat, and unified observability across both tiers. The result is a tiered architecture that maintains its performance characteristics over time — not just at initial deployment but months and years later as data volumes grow, query patterns shift, and the cold tier accumulates the petabytes of history that justify its existence.

For more on the underlying patterns: multi-engine architecture, query routing with Iceberg, the multi-engine routing solution, and the lakehouse observability guide.