Apache Iceberg at Scale: Infrastructure, Performance, and Enterprise Lessons

Running Apache Iceberg at 10 tables is configuration. Running it at 10,000 is infrastructure. The distance between those two states is where most data platform teams discover that Iceberg's power comes with operational demands that scale nonlinearly — demands that no amount of documentation or default settings can prepare you for.

This article synthesizes lessons from production Iceberg deployments at enterprise scale. It covers infrastructure evolution, performance engineering, catalog scaling, enterprise requirements, and observability — the topics that matter once Iceberg stops being a proof-of-concept and starts being a production system.

The scaling trajectory: what changes at each milestone

Every Iceberg deployment follows a recognizable trajectory. The challenges at each stage are predictable, but teams encounter them sequentially — which means each stage feels novel even though the patterns repeat across organizations. Understanding the trajectory helps teams prepare for the next stage before it becomes a crisis.

1–100 tables: the configuration era

At this scale, Iceberg works out of the box. Teams configure a catalog (typically AWS Glue or a Hive Metastore), connect Spark or Trino, create tables, and run queries. Maintenance is manual and infrequent — someone runs compaction when queries slow down, snapshot expiration happens when storage bills arrive, and manifest rewriting is something the team has read about but never needed. The catalog is a single instance. The object storage layout is simple. Schema evolution is rare. Life is manageable.

The operational model is reactive. Problems are noticed when users complain. Fixes are applied manually. There is no monitoring beyond basic query latency. This works because the surface area is small — 100 tables produce a manageable number of alerts, and one engineer can keep track of the state of every table in their head.

100–1,000 tables: the automation era

Between 100 and 1,000 tables, manual maintenance breaks down. An engineer who could mentally track 100 tables cannot track 500. Compaction backlogs grow. Snapshot chains lengthen on tables that receive frequent writes. Small files accumulate on streaming tables faster than anyone notices. Query performance degrades on some tables while others remain healthy, and the inconsistency makes it hard to identify the pattern.

This is when teams build their first automation. Cron jobs for compaction. Scripts for snapshot expiration. Airflow DAGs that run maintenance operations on a schedule. The automation helps, but it introduces new problems: fixed schedules do not adapt to table state. A table that receives 10,000 commits per day needs compaction hourly; a table that receives 10 commits per day needs it weekly. A uniform daily schedule over-compacts quiet tables (wasting compute) and under-compacts busy ones (degrading performance).

At this stage, teams also discover that maintenance operations interact. Compacting before expiring snapshots wastes compute on files that expiration would have garbage-collected. Rewriting manifests before compaction produces manifests that become stale immediately. The correct sequence — expire → orphan cleanup → compact → rewrite manifests → refresh statistics — is not obvious, and getting it wrong wastes resources without improving table health.

1,000–10,000 tables: the infrastructure era

Beyond 1,000 tables, the platform itself becomes the problem. The catalog becomes a bottleneck — a single REST catalog handling metadata operations for 5,000 tables from dozens of concurrent engines starts returning latency spikes and occasional timeouts. Object storage layouts that worked at small scale produce hot prefixes and S3 throttling at large scale. The team that built the automation scripts has moved on, and the scripts are fragile, undocumented, and running on infrastructure that nobody owns.

This is the stage that production experience addresses directly. At 10,000 tables, every infrastructure decision from the early stages compounds. A catalog that was provisioned for 100 concurrent connections now handles 2,000. A storage prefix scheme that was fine for 50 tables creates hot spots with 5,000 tables writing to the same prefix hierarchy. Maintenance automation that ran sequentially now takes 18 hours to complete a single pass across all tables — which means some tables go days without compaction.

The infrastructure evolution required at this stage is not incremental. It requires rearchitecting the catalog layer, redesigning storage layouts, replacing sequential maintenance with parallel and condition-based systems, and building observability that can surface problems across thousands of tables simultaneously. This is where Iceberg transitions from a format choice to an infrastructure discipline.

LakeOps is a control plane for Apache Iceberg lakehouses built for this inflection point — it provides autonomous table maintenance, lake-wide observability, and adaptive optimization policies that keep thousands of tables healthy, queries fast, and costs controlled without requiring teams to build custom infrastructure from scratch.

Infrastructure evolution: from single catalog to federated architecture

The catalog is the single most critical piece of Iceberg infrastructure at scale, and it is the component most likely to become a bottleneck. Every query, every commit, and every metadata operation routes through the catalog. When the catalog slows down, everything slows down.

Single catalog limitations

Most Iceberg deployments start with a single catalog — AWS Glue, a Hive Metastore, or a REST catalog like Polaris or Nessie. At small scale, a single catalog is simple and sufficient. At scale, a single catalog creates three categories of problems.

Connection exhaustion. A REST catalog has a finite connection pool. When 50 Spark executors, 30 Trino workers, 10 Flink task managers, and assorted maintenance jobs all maintain connections to the same catalog, the connection pool saturates. New connections queue. Metadata operations that should complete in 50ms take 5 seconds. Commits retry and conflict. The catalog becomes a serialization point for the entire lakehouse.

Metadata contention. Concurrent writes to the same table require the catalog to serialize commits — checking that each commit's base snapshot has not been superseded since the writer read it. On high-write tables, this creates contention that manifests as commit retry storms. A table receiving 100 concurrent write tasks may see 90% of commits retry at least once, with some retrying five or six times before succeeding. Each retry reads the latest metadata, recomputes the commit, and attempts again — multiplying catalog load.

Blast radius. A single catalog means a single failure domain. A catalog outage — whether from a bug, a deployment, or resource exhaustion — takes down every table in the lakehouse simultaneously. At 10,000 tables, this means every production pipeline, every dashboard, and every ad-hoc query fails at once. The blast radius of a single catalog is the entire data platform.

Federated catalog architecture

The solution at scale is federation — multiple catalogs, each responsible for a subset of tables, with a routing layer that directs operations to the appropriate catalog. Federation reduces connection pressure on any single catalog, isolates failure domains, and allows different catalogs to be tuned for different workload patterns.

A common federation pattern separates catalogs by workload type. Streaming tables — which generate high commit frequencies and benefit from low-latency metadata operations — route to a dedicated catalog optimized for throughput. Batch analytics tables — which have lower commit frequencies but larger metadata payloads — route to a catalog optimized for large metadata reads. Compliance-sensitive tables — which require audit logging and access control — route to a catalog with those capabilities enabled.

Federation introduces its own complexity. Cross-catalog queries require the engine to resolve table references across multiple catalog endpoints. Schema management must be coordinated across catalogs to prevent drift. And monitoring must aggregate health metrics from multiple catalog instances to provide a unified view of the platform.

Connection pooling and caching

Regardless of whether the catalog is single or federated, connection pooling and metadata caching are essential at scale.

Connection pooling prevents individual engines from monopolizing catalog connections. A Spark application with 200 executors should not open 200 concurrent catalog connections. Instead, a connection pool on the driver side multiplexes executor requests across a smaller number of persistent connections — typically 10–20 per application. This reduces catalog connection count by 10x while adding minimal latency through connection reuse.

Metadata caching reduces the number of catalog round-trips for frequently accessed tables. When a Trino coordinator plans a query, it reads the table's current metadata from the catalog. If 50 queries hit the same table within a minute, the coordinator can cache the metadata and serve subsequent planning requests from the cache, refreshing only when the cache TTL expires. Typical cache TTLs range from 30 seconds to 5 minutes — short enough to reflect recent commits, long enough to absorb query bursts.

The tradeoff with caching is freshness. A 5-minute cache means queries may not see data committed in the last 5 minutes. For streaming tables where freshness matters, shorter TTLs (30 seconds) or invalidation-based caching (where the catalog notifies clients of changes) is necessary. For batch analytics tables queried by dashboards, 5-minute caching is typically acceptable and dramatically reduces catalog load.

Performance engineering: tuning Parquet for Iceberg workloads

Iceberg's storage layer is Parquet (in most deployments), and Parquet performance is not one-size-fits-all. The default Parquet settings that ship with Spark, Trino, and Flink are reasonable for moderate workloads but leave significant performance on the table at scale. Production experience highlights three tuning dimensions that deliver measurable improvements.

Row group size

A Parquet file is organized into row groups — horizontal slices of the data that contain a configurable number of rows. Each row group stores column chunks independently, and each column chunk has its own statistics (min, max, null count). The query engine uses these statistics for predicate pushdown — skipping entire row groups whose statistics indicate they contain no matching data.

The default row group size is typically 128 MB (in Spark's parquet.block.size). At scale, this default creates two problems in opposite directions.

Too-small row groups on wide tables. A table with 500 columns and 128 MB row groups may contain only 10,000–50,000 rows per group. Each row group has its own column chunk headers and statistics, creating per-group overhead that compounds across millions of files. Increasing row group size to 256 MB or 512 MB on wide tables reduces the number of groups per file, cuts per-group overhead, and improves sequential scan performance.

Too-large row groups on heavily filtered tables. A table that is always queried with selective filters (WHERE status = 'error' AND region = 'us-east-1') benefits from smaller row groups — 64 MB or even 32 MB. Smaller groups mean tighter statistics ranges per group, which means more groups can be skipped entirely. A 512 MB row group on a table with 100 distinct values for region has wide min/max ranges that prevent skipping. A 32 MB row group on the same table has narrow ranges that enable skipping — reducing data scanned by 10x or more.

The optimal row group size depends on the query pattern. Tables that serve full scans (aggregations, ML training) benefit from larger groups. Tables that serve selective lookups benefit from smaller groups. Tables that serve both need a compromise — or sort-order optimization that makes statistics tight regardless of group size.

Page size

Within each column chunk of a row group, data is further divided into pages — the smallest unit of compression and encoding. The default page size is typically 1 MB. Pages are relevant for two performance dimensions.

Compression ratio. Larger pages (2–4 MB) give the compressor more data to work with, which generally improves compression ratios for dictionary-encoded and run-length-encoded columns. On tables with repetitive string columns (status codes, region names, event types), increasing page size from 1 MB to 4 MB can improve compression by 15–25%.

Random access latency. Smaller pages enable finer-grained random access within a column chunk. When a query reads a specific row range within a column chunk, it must decompress entire pages to access the data. Smaller pages mean less unnecessary decompression. For point-lookup workloads, 512 KB or 256 KB pages reduce wasted I/O.

At enterprise scale, the page size tuning is secondary to row group size and compression codec selection, but on tables with billions of rows queried by thousands of concurrent users, the marginal improvement from page size tuning compounds into material gains.

Compression codec selection

The choice between Snappy, ZSTD, LZ4, and Gzip affects both storage cost and query performance. Each codec sits at a different point on the compression-ratio-versus-speed curve.

Snappy is the default in most Spark deployments. It compresses and decompresses fast with moderate compression ratios. For tables where query performance is the primary concern and storage cost is secondary, Snappy is a safe default.

ZSTD delivers 20–40% better compression than Snappy with slightly slower decompression. For cold storage tables (historical data queried infrequently), ZSTD reduces storage costs significantly. ZSTD also supports configurable compression levels (1–22), allowing fine-grained tradeoffs between compression ratio and speed. Level 3 (the typical default) offers a good balance; level 1 approaches Snappy speed with better compression; levels 9+ are appropriate only for archival data.

LZ4 is the fastest codec — faster than Snappy for decompression — with slightly lower compression ratios. For tables where query latency is critical and storage cost is not a concern, LZ4 provides the lowest decompression overhead.

At scale, the compression choice is not uniform across all tables. Hot tables queried thousands of times per day benefit from Snappy or LZ4 (fast decompression). Cold tables storing petabytes of historical data benefit from ZSTD (smaller footprint). Tiered compression — Snappy for recent partitions, ZSTD for older partitions — gives the best of both, though it requires partition-aware write configuration.

Spark configuration for Iceberg workloads

Spark is the most common engine for writing to Iceberg tables and running maintenance operations. Its default configuration is designed for general-purpose data processing, not for Iceberg-specific workloads. Production deployments highlight several Spark tuning areas that deliver significant improvements at scale.

Adaptive Query Execution (AQE)

Spark's Adaptive Query Execution dynamically adjusts query plans based on runtime statistics. For Iceberg workloads, AQE provides three critical optimizations.

Dynamic partition coalescing. AQE detects when shuffle partitions produce very small outputs and coalesces them into fewer, larger partitions. On Iceberg writes, this prevents the creation of many small files. Without AQE, a Spark job with 200 shuffle partitions writing to an Iceberg table creates 200 files — many of which may be only a few megabytes. With AQE enabled and spark.sql.adaptive.coalescePartitions.enabled set to true, Spark coalesces those 200 partitions into 20–30 larger partitions, producing 20–30 appropriately sized files instead.

Dynamic join optimization. AQE detects when one side of a join is small enough for a broadcast join and switches from a shuffle join to a broadcast join at runtime. For Iceberg workloads that join large fact tables against small dimension tables, this eliminates expensive shuffles — reducing both execution time and the risk of producing skewed file sizes.

Skew handling. AQE detects data skew in shuffle partitions and splits skewed partitions into smaller sub-partitions. For Iceberg tables partitioned by columns with skewed distributions (e.g., a country column where 40% of data is from the US), skew handling prevents individual tasks from becoming bottlenecks that delay the entire job.

The key AQE settings for Iceberg workloads are spark.sql.adaptive.enabled (true), spark.sql.adaptive.coalescePartitions.enabled (true), and spark.sql.adaptive.advisoryPartitionSizeInBytes (256 MB for most Iceberg workloads — matching the target file size).

Shuffle and task sizing

The relationship between Spark's shuffle configuration and Iceberg's file output is direct and often misunderstood.

Shuffle partition count (spark.sql.shuffle.partitions) determines how many tasks process shuffled data, which directly controls how many output files Iceberg writes. The default of 200 is too high for small tables (creating too many small files) and too low for very large tables (creating tasks that are too large and risk OOM). With AQE enabled, this setting becomes a ceiling rather than a fixed value — AQE coalesces partitions downward but cannot split them upward from this initial value.

Advisory partition size (spark.sql.adaptive.advisoryPartitionSizeInBytes) tells AQE what size each partition should target after coalescing. Setting this to match the target Iceberg file size (128–256 MB) ensures that AQE produces partitions whose output aligns with optimal Iceberg file sizes. This single setting is the most important Spark-Iceberg tuning parameter — it bridges Spark's internal partitioning with Iceberg's file layout.

Task memory (spark.executor.memory and spark.executor.memoryOverhead) must account for Iceberg's metadata operations. Commit operations on large tables require the driver to hold manifest metadata in memory. On tables with 10,000+ files per partition, the driver's memory overhead for a single commit can exceed 2 GB. Under-provisioning driver memory leads to OOM errors during commit — not during data processing — which is confusing because the job appears to complete successfully until the final commit step fails.

Write distribution mode

Iceberg's write.distribution-mode property controls how Spark distributes data across output files. The three options — none, hash, and range — have dramatically different performance characteristics at scale.

None writes data in the order it arrives, with no additional shuffle. This is fastest but produces files with overlapping partition and sort ranges, which limits data skipping effectiveness. Use for append-only tables where query patterns do not benefit from sorted data.

Hash shuffles data by partition key before writing. Each task writes data for a single partition, producing files with clean partition boundaries. This is the recommended default for partitioned tables — it prevents files from spanning multiple partitions and ensures partition pruning is effective.

Range sorts data globally by partition key and sort order before writing. This produces optimally organized files with tight column statistics, enabling maximum data skipping. But the global sort requires a full shuffle — which is expensive for large writes. Use for tables where query performance on sorted columns justifies the write-time cost.

At scale, the distribution mode choice is a write-speed versus read-speed tradeoff. Tables that are written once and read thousands of times benefit from range distribution. Tables that are written frequently and read occasionally benefit from none or hash. Tables in between — written daily, read hundreds of times — typically use hash distribution with sort-within-partition for a balanced approach.

Enterprise requirements at scale

Enterprise Iceberg deployments introduce requirements that do not exist in proof-of-concept or small-scale environments. These requirements are often non-negotiable — compliance, security, and audit teams have veto power over architectural decisions.

Encryption at rest

Iceberg's Parquet files inherit Parquet's encryption capabilities. Parquet Modular Encryption (PME) supports column-level encryption — different columns can be encrypted with different keys, allowing fine-grained access control at the file format level. A table with PII columns (email, phone, SSN) can encrypt those columns with restricted keys while leaving non-sensitive columns readable by broader audiences.

At scale, encryption key management becomes the challenge. Each table — or each column within each table — requires its own encryption key. A deployment with 10,000 tables and column-level encryption across 50 tables generates thousands of key management operations per day as files are written, compacted, and read. Integration with enterprise key management systems (AWS KMS, HashiCorp Vault, Azure Key Vault) must handle this volume without becoming a latency bottleneck.

Compaction adds complexity to encryption. When files are compacted, the compaction engine must decrypt the input files and re-encrypt the output files. If keys have rotated since the original files were written, the compaction engine must handle both old and new keys. This key-lifecycle awareness must be built into any maintenance automation — a compaction job that cannot decrypt input files because it does not have access to the original keys will fail silently or corrupt data.

Audit trails

Enterprise environments require audit trails that record who accessed what data and when. Iceberg's snapshot metadata provides a partial audit trail — each snapshot records who created it and when — but it does not capture read access. Queries that read a table leave no trace in Iceberg's metadata.

Complete audit trails require integration with the query engine's audit logging (Spark's event log, Trino's query log) and the catalog's access log. At scale, correlating these logs across thousands of tables and dozens of engine instances requires a centralized audit pipeline — typically streaming engine logs into a separate Iceberg table (or equivalent immutable store) for compliance queries.

The audit pipeline must handle volume. A deployment with 500 concurrent users running queries across 10,000 tables generates millions of audit events per day. The audit store itself becomes a significant Iceberg table that requires its own maintenance — compaction, snapshot expiration, and retention management. Neglecting the audit table's health is a common oversight that creates compliance risk when audit queries slow to minutes or fail entirely.

Data classification and access control

Data classification — tagging tables and columns with sensitivity levels (public, internal, confidential, restricted) — is a prerequisite for meaningful access control at scale. Without classification, access policies devolve into per-table ACLs that are impossible to manage across 10,000 tables.

Iceberg's table properties can store classification metadata, but enforcement happens at the catalog and engine level. A REST catalog can enforce that only users with 'restricted' clearance can access tables tagged as restricted. A Trino coordinator can enforce column-level masking based on column-level classification tags. But these enforcement mechanisms must be consistent across all engines accessing the same tables — a policy enforced in Trino but not in Spark is not a policy.

At scale, classification must be automated. Manual classification of 10,000 tables with 50–500 columns each is not feasible. Automated classification tools scan column names, sample data, and apply heuristic rules (columns named 'email' or 'ssn' are classified as PII; columns containing patterns matching credit card numbers are classified as restricted). These tools must run continuously as new tables are created and schemas evolve.

LakeOps Dashboard: lake-wide health overview showing optimization activity, cost savings, and health distribution — surfacing infrastructure issues across thousands of tables simultaneously.

Observability and metrics collection

At 10,000 tables, you cannot manage what you cannot see. Observability is not a nice-to-have — it is the foundation on which every other scaling strategy depends. Without comprehensive metrics, every tuning decision is guesswork and every incident is a mystery.

Table-level health metrics

Every Iceberg table emits signals that indicate its health. The critical metrics are file count (total and per partition), average file size, delete file count and ratio, manifest count per snapshot, snapshot chain depth, and metadata file size. These metrics, tracked over time, reveal the trajectory of each table's health — whether it is improving (maintenance is working), stable (maintenance is keeping pace with writes), or degrading (maintenance is falling behind).

Collecting these metrics requires reading each table's metadata — which, at 10,000 tables, is itself a significant operation. A naive approach that reads every table's metadata sequentially takes hours. A parallel approach that reads all tables simultaneously overwhelms the catalog. The right approach is batched collection — reading metadata for 50–100 tables at a time, with rate limiting to avoid catalog pressure — producing a complete health snapshot every 15–30 minutes.

LakeOps collects these metrics continuously across all connected tables and catalogs. Rather than requiring teams to build custom metric collection pipelines, it provides a unified health dashboard that classifies every table as Healthy, Warning, or Critical based on configurable thresholds. When a table transitions from Healthy to Warning — say, because its manifest count crossed 200 or its delete file ratio exceeded 10% — LakeOps triggers the appropriate maintenance action automatically.

Query performance correlation

Table health metrics become actionable when correlated with query performance. A table with 5,000 manifests is a data point. A table with 5,000 manifests whose P95 planning time increased 4x this week is an incident. The correlation between structural metrics (file count, manifest count, statistics) and performance metrics (planning time, scan time, total query time) is what transforms monitoring from passive observation to proactive optimization.

Building this correlation requires integrating table health metrics with engine performance metrics — Spark's query execution timeline, Trino's query statistics, Flink's job metrics. Each engine reports performance differently, and correlating across engines (e.g., a table whose Spark writes create small files that degrade Trino query performance) requires a unified metrics layer.

LakeOps Insights: per-table health signals with severity classification, specific thresholds violated, and remediation recommendations — correlating structural metrics with performance impact across the lake.

ML-driven query optimization

At sufficient scale, the observability data itself becomes a training set for ML-driven optimization. With months of metrics across thousands of tables — file layouts, partition schemes, sort orders, compression codecs, query patterns, planning times, scan volumes — ML models can identify correlations that humans cannot.

Predictive maintenance. A model trained on the relationship between write patterns and table health can predict when a table will cross a health threshold — triggering maintenance proactively rather than reactively. If a table's manifest count has grown by 50 per day for the past week, the model can predict when it will cross the 500-manifest threshold that triggers planning degradation and schedule a manifest rewrite before queries are impacted.

Layout optimization. A model trained on query patterns and file statistics can recommend optimal sort orders and partition schemes. If 80% of queries on a table filter by event_type and region, but the table is sorted by timestamp, the model can recommend re-sorting by (event_type, region, timestamp) — and estimate the expected improvement in data skipping effectiveness.

Resource allocation. A model trained on compaction costs (time, compute, I/O) and compaction inputs (file count, file sizes, delete file ratios) can predict the resources needed for each compaction job — enabling precise resource allocation instead of fixed-size provisioning.

These ML-driven optimizations are not theoretical. Production deployments managing thousands of tables generate enough observability data within weeks to train useful models. The key requirement is structured, consistent metrics collection — which is why the observability foundation must be in place before ML optimization becomes feasible.

Common pitfalls at enterprise scale

Production experience across enterprise deployments reveals recurring pitfalls that teams encounter when scaling Iceberg. These are not edge cases — they are the default outcome when infrastructure does not evolve with the deployment.

The maintenance gap

The most common pitfall is the gap between the maintenance a table needs and the maintenance it receives. At 100 tables, manual maintenance keeps pace. At 1,000 tables, cron-based automation keeps pace for most tables. At 10,000 tables, the maintenance backlog grows faster than any fixed-schedule automation can clear it. Tables that need hourly compaction get daily compaction. Tables that need daily manifest rewriting get weekly rewriting. And tables that need immediate attention — because a streaming pipeline just committed 50,000 small files — wait in the queue behind tables that do not need maintenance at all.

The solution is condition-based maintenance that prioritizes tables by urgency. A table with 20,000 small files and 500 manifests gets compacted before a table with 200 appropriately sized files and 10 manifests. A table whose planning time just doubled gets manifest rewriting before a table whose planning time has been stable for weeks. LakeOps implements this prioritization natively — its scheduler evaluates table health continuously and allocates maintenance compute where it delivers the most impact.

Catalog as the bottleneck

The second most common pitfall is treating the catalog as a simple metadata store rather than critical infrastructure. At scale, the catalog handles more operations per second than any individual engine. Every query plans through the catalog. Every commit writes through the catalog. Every maintenance operation reads and writes through the catalog. When the catalog degrades, every downstream operation degrades — and the failure mode is not a clean error but a gradual slowdown that is difficult to diagnose.

Teams that invest in catalog infrastructure — connection pooling, read replicas, caching, federation — see order-of-magnitude improvements in platform reliability. Teams that treat the catalog as an afterthought spend significant time debugging mysterious performance regressions that trace back to catalog pressure.

Storage layout neglect

Object storage is not a filesystem, and treating it like one creates performance problems at scale. S3 partitions objects by key prefix, and a single prefix can handle approximately 3,500 PUT and 5,500 GET requests per second. An Iceberg table with all files under a single prefix — s3://bucket/warehouse/db/table/data/ — can saturate this limit during compaction, when hundreds of files are being read and written simultaneously.

The solution is prefix distribution. Iceberg supports configurable storage layouts that spread files across multiple prefixes — by partition, by hash, or by a combination. Distributing files across 100 prefixes raises the effective throughput limit from 3,500/5,500 to 350,000/550,000 requests per second. This is not a micro-optimization — it is the difference between a compaction job completing in 10 minutes and one failing with S3 throttling errors.

Ignoring the write path

Teams optimize the read path obsessively — better partition pruning, tighter statistics, more aggressive data skipping — while neglecting the write path. But at scale, how data is written determines how well it can be read. Files written without sort order have wide statistics ranges that defeat data skipping. Files written with too many small partitions create manifest explosion. Files written without size targets create small-file problems that compound over time.

Investing in write-path optimization — target file sizes, distribution modes, sort orders, AQE tuning — pays dividends on every subsequent read. A table written well requires less maintenance, plans faster, and scans less data than a table written carelessly. The write path is the foundation that everything else builds on.

LakeOps Tables — LakeOps table monitoring: health classification across the lake with file counts, manifest state, snapshot depth, and partition details — surfacing the structural issues that create performance problems at scale.

From batch to real-time: infrastructure implications

The transition from batch-only to real-time Iceberg workloads is an infrastructure inflection point. Batch workloads produce large, well-organized files at predictable intervals. Streaming workloads produce small, fragmented files continuously. The infrastructure requirements are fundamentally different.

Streaming commit frequency

A streaming pipeline committing every 60 seconds creates 1,440 snapshots per day per table. Each commit creates at least one manifest and one or more data files. After a week without maintenance, a single streaming table has 10,080 snapshots, 10,000+ manifests, and tens of thousands of small files. The metadata tree grows linearly with commit frequency, and the operational overhead of managing it grows super-linearly — because planning time, manifest count, and snapshot chain depth all interact to compound the degradation.

Infrastructure for streaming Iceberg requires aggressive maintenance: snapshot expiration running continuously (not daily), compaction triggered by file count thresholds (not time intervals), and manifest rewriting after every compaction pass. The maintenance cadence for streaming tables is hours, not days.

Mixed workload isolation

Production environments rarely run pure batch or pure streaming. They run both — streaming tables ingesting real-time events alongside batch tables receiving nightly ETL loads. The infrastructure must handle both without one degrading the other.

Resource isolation is critical. Compaction jobs running on streaming tables should not compete for catalog connections with batch ETL pipelines. Manifest rewriting on batch tables should not block snapshot expiration on streaming tables. Maintenance operations for different table types should run in separate resource pools with independent scheduling.

LakeOps manages this isolation through per-table maintenance policies that account for write pattern, table criticality, and resource availability. Streaming tables get dedicated maintenance cycles with aggressive thresholds. Batch tables get scheduled maintenance windows that align with ETL cadence. Each table gets the maintenance it needs without contending for shared resources.

The infrastructure maturity model

Scaling Iceberg from 10 tables to 10,000 is a journey through four maturity levels. Each level requires infrastructure investments that the previous level did not.

Level 1: Manual. Everything runs by hand. Maintenance happens when someone remembers. Monitoring is ad-hoc. This works for fewer than 50 tables and breaks immediately at 100.

Level 2: Automated. Cron jobs and Airflow DAGs automate maintenance. Basic monitoring tracks file counts and query times. This works for 100–500 tables and breaks when tables have diverse maintenance needs.

Level 3: Adaptive. Condition-based triggers replace fixed schedules. Per-table policies replace uniform configuration. Observability covers table health, query performance, and maintenance effectiveness. This works for 500–5,000 tables and requires dedicated platform engineering investment.

Level 4: Autonomous. ML-driven optimization recommends layout changes. Predictive maintenance prevents degradation before it occurs. Self-tuning policies adapt to changing workload patterns without human intervention. This is the target state for deployments beyond 5,000 tables.

Most enterprise deployments operate at Level 2 and struggle to reach Level 3. The investment required to build Level 3 infrastructure internally — condition-based maintenance, per-table policies, comprehensive observability — is significant. LakeOps provides Level 3 and early Level 4 capabilities as a managed service, enabling teams to operate at the adaptive/autonomous level without building the infrastructure from scratch.

LakeOps Control Plane — catalogs, engines, and autonomous optimization — LakeOps connects to your existing Iceberg catalogs and query engines as a dedicated control plane — no data movement, no pipeline changes.

LakeOps walkthrough — autonomous table maintenance, health monitoring, and infrastructure-scale observability for production Iceberg deployments.

Putting it all together

Apache Iceberg at scale is an infrastructure challenge masquerading as a format choice. The format is elegant — immutable files, snapshot isolation, schema evolution, column statistics. The infrastructure required to keep that format performing well at 10,000 tables is anything but simple.

The lessons from production deployments managing petabytes across thousands of tables converge on a consistent theme: proactive infrastructure beats reactive troubleshooting at every scale milestone.

Tune Parquet settings for your workload. Row group size, page size, and compression codec selection are not one-size-fits-all. Tables serving selective queries benefit from smaller row groups. Tables serving full scans benefit from larger ones. Compression choices should vary by table temperature — Snappy or LZ4 for hot tables, ZSTD for cold storage.

Configure Spark for Iceberg. AQE, advisory partition size, write distribution mode, and driver memory are the four most impactful Spark settings for Iceberg workloads. Getting them right prevents small-file creation at the source — which is cheaper than cleaning up small files after the fact.

Invest in catalog infrastructure. The catalog is the most critical piece of your Iceberg deployment. Connection pooling, caching, and federation are not premature optimizations — they are prerequisites for operating beyond a few hundred tables.

Meet enterprise requirements early. Encryption, audit trails, and data classification are harder to retrofit than to build in. Teams that defer these requirements until compliance asks for them face expensive re-architecture.

Build observability first. You cannot optimize what you cannot measure. Table-level health metrics, query performance correlation, and trend analysis are the foundation on which every tuning decision rests.

Automate adaptively. Fixed-schedule maintenance breaks at scale. Condition-based triggers that prioritize maintenance by urgency and allocate resources by impact are the only approach that scales to thousands of tables.

The distance between a 10-table proof-of-concept and a 10,000-table production platform is not measured in tables — it is measured in infrastructure maturity. Every table you add increases the surface area of problems that can occur. Every day without maintenance increases the debt that accumulates. The teams that scale Iceberg successfully are the ones that invest in infrastructure proportional to their ambition — not the ones that hope the defaults will hold. If you are approaching the infrastructure era, LakeOps provides the autonomous maintenance, observability, and adaptive policies that make 10,000-table deployments operationally sustainable — explore the production readiness checklist, the table health maintenance guide, or the lakehouse performance solution.