Apache Iceberg Puffin Statistics: A Practical Guide

Apache Iceberg Puffin Statistics — a puffin bird beside a statistics dashboard showing file counts, records, partitions, and data size, connected to a geometric iceberg

Query engines make thousands of decisions before a single row is read. Which partitions to scan. How to order joins. How many splits to allocate per task. Every one of those decisions depends on statistics — and in Apache Iceberg, the statistics that matter most for table-level planning live in Puffin files.

Parquet footer statistics — per-column min/max values, null counts, row counts — are well understood. They drive predicate pushdown and row-group skipping at the file level. But file-level statistics cannot answer table-level questions: How many distinct values does `customer_id` have across the entire table? What is the cardinality of `region` across all partitions? Is this column high-cardinality or low-cardinality for join planning purposes? Those answers require a different layer of metadata. That layer is Puffin.

This guide covers the Puffin file format, the types of statistics it stores, how query engines produce and consume them, when statistics go stale, and how to maintain them in production.

What are Puffin files?

Puffin is a statistics file format defined in the Apache Iceberg spec. Introduced alongside the Iceberg table format, it provides a standard container for storing statistical metadata that is too large or too complex to embed directly in Iceberg manifest files or table metadata JSON.

The core idea is separation of concerns. Manifest files track which data files belong to each snapshot. Table metadata tracks schemas, partition specs, and sort orders. Puffin files carry the heavyweight statistical summaries — today, primarily NDV sketches, plus any custom blobs an engine chooses to write — referenced from table metadata but stored as independent objects in the same storage layer (S3, GCS, ADLS) as the data files.

A table's metadata JSON includes a statistics field that points to one or more Puffin files by path. Each Puffin file is tied to a specific snapshot ID, which means statistics are versioned alongside the data — when a new snapshot is committed, the engine can write new statistics that reflect the updated state.

Puffin file structure

A Puffin file has a well-defined binary layout designed for efficient reading and selective blob retrieval.

text

1┌──────────────────────────┐2│  Magic bytes (4 bytes)   │  0x50, 0x46, 0x41, 0x31 ("PFA1")3├──────────────────────────┤4│  Blob data               │  One or more blobs, concatenated5│  ┌────────────────────┐  │6│  │ Blob 1 (compressed)│  │7│  ├────────────────────┤  │8│  │ Blob 2 (compressed)│  │9│  ├────────────────────┤  │10│  │ ...                │  │11│  └────────────────────┘  │12├──────────────────────────┤13│  Footer                  │  JSON FileMetadata object14│  Footer payload length   │  4 bytes15│  Flags                   │  4 bytes16│  Magic bytes (4 bytes)   │  0x50, 0x46, 0x41, 0x31 ("PFA1")17└──────────────────────────┘

Magic bytes. The file starts and ends with the four-byte sequence PFA1 (Puffin Format, version A1). This allows readers to quickly validate that a file is a valid Puffin container.

Blob data. The body contains one or more blobs, concatenated sequentially. Each blob is a self-contained payload — a serialized sketch, a histogram, or any binary data the producer chose to write. Blobs can be individually compressed using LZ4, Zstandard, or left uncompressed. The compression codec is recorded in the footer metadata, not in the blob itself.

Footer. The footer is a JSON object with a blobs field containing an array of blob metadata descriptors, and an optional top-level properties field for file-level metadata. Each blob descriptor includes:

type — a string identifying the blob type (e.g., apache-datasketches-theta-v1 for NDV sketches)
fields — which column IDs the blob describes
snapshot-id — the snapshot the statistics were computed from
sequence-number — the Iceberg sequence number at computation time
offset and length — the byte range of the blob in the file body
compression-codec — lz4, zstd, or null (uncompressed)
properties — an optional key-value map for custom metadata

This design means a reader can parse only the footer to discover what statistics are available, then seek directly to the relevant blob without reading the entire file. For tables with many statistical blobs, this random-access pattern avoids unnecessary I/O.

Types of statistics stored in Puffin files

NDV — Number of Distinct Values

The most important statistic Puffin carries is NDV (Number of Distinct Values), stored as Apache DataSketches Theta sketches. The blob type identifier is apache-datasketches-theta-v1.

A Theta sketch is a probabilistic data structure that estimates cardinality with bounded error and fixed memory. A serialized sketch is typically 8–32 KB depending on the configured nominal entries (default 4096 in DataSketches), regardless of the actual cardinality — whether the column has 100 distinct values or 100 billion. Sketches are also mergeable: partition-level sketches can be combined to produce a table-level NDV estimate without re-scanning the data.

NDV estimates are critical for query planning:

Join ordering. The optimizer picks join order based on estimated output cardinality. A join on a column with 10M distinct values produces far more intermediate rows than a join on a column with 100 distinct values. Without NDV, the optimizer falls back to heuristics or defaults — often producing suboptimal plans.
Aggregation strategy. Knowing whether a GROUP BY column is low-cardinality (hundreds) or high-cardinality (millions) determines whether the engine uses a hash aggregate or a sort-based aggregate.
Filter selectivity. A WHERE status = 'active' on a column with 3 distinct values has ~33% selectivity. On a column with 10,000 distinct values, the same pattern has ~0.01% selectivity. NDV makes that difference visible to the planner.

LakeOps uses Puffin statistics as a core input for its optimization engine — using NDV data and query patterns to determine optimal sort orders, detect stale statistics, and trigger re-analysis after compaction. We'll explore this later.

Column bounds live in manifests, not Puffin

A common misconception is that Puffin files store column min/max bounds. They do not — at least not today. Per-file column bounds (lower_bounds, upper_bounds, null counts, row counts) are stored in Iceberg manifest files and are the primary source of column-range pruning and data skipping in all engines. Puffin's only standardized blob type is NDV via Theta sketches (apache-datasketches-theta-v1).

That said, the Puffin format is designed for extensibility. As engines evolve, aggregated range statistics — global min/max across all files, value distribution histograms — could be stored as Puffin blobs to accelerate planning for non-partition columns without scanning every manifest entry. But as of the current spec, column bounds belong to manifests.

Custom statistics blobs

The Puffin format is extensible by design. Any producer can write custom blob types with arbitrary payloads, as long as the type field in the footer descriptor uniquely identifies the format. This extensibility supports future statistical types — bloom filters for membership testing, histograms for value distribution — without changes to the Puffin spec itself.

How Puffin statistics improve query planning

Without table-level statistics, query engines rely on two fallback strategies: heuristic defaults (assume every table has 1 million rows, every column has 10,000 distinct values) or runtime sampling (read a fraction of files during planning to estimate cardinality). Both are unreliable at scale.

Puffin statistics eliminate these guesses for three key planning decisions.

Partition pruning. When statistics include per-partition NDV and bounds, the planner can skip entire partitions that cannot contain matching rows — not just based on partition key equality, but based on value ranges and cardinality of non-partition columns within each partition.

Cost-based join ordering. Accurate NDV enables cost-based optimization (CBO) to estimate the output cardinality of each join. For a three-way join, the difference between the optimal and worst join order can be 100x or more in intermediate data volume. With Puffin NDV, the optimizer picks the order that minimizes intermediate result sizes. This is one of the highest-impact levers for lakehouse query performance.

Split planning. Engines like Spark and Trino divide table scans into splits (tasks) distributed across workers. Accurate row count and column statistics let the planner size splits to balance work evenly. Without statistics, splits may be skewed — some tasks process 10x more data than others, and the query's wall time is bounded by the slowest task.

Collecting statistics

Spark

Spark produces Puffin statistics through the ANALYZE TABLE command.

sql

1-- Collect statistics for specific columns2ANALYZE TABLE catalog.db.events3  COMPUTE STATISTICS FOR COLUMNS4  user_id, event_type, region;5 6-- Collect statistics for all columns7ANALYZE TABLE catalog.db.events8  COMPUTE STATISTICS FOR ALL COLUMNS;

When executed, Spark reads the table's data files, computes DataSketches Theta sketches for each specified column, serializes them into a Puffin file, writes that file to the table's metadata directory in object storage, and updates the table metadata to reference the new statistics file with the current snapshot ID.

The process is a full table scan — every data file in the current snapshot is read to produce accurate sketches. For large tables, this is expensive. A 10 TB table with 50 columns takes significant compute to analyze fully. In practice, teams analyze only the columns that appear in join conditions and high-selectivity filters.

Puffin statistics support was added in the Iceberg 1.1.0+ library. When using the iceberg-spark-runtime jar (version 1.1.0+) with Spark 3.3 or later, Puffin statistics collection is available via ANALYZE TABLE.

Trino

Trino collects Iceberg statistics with a similar SQL interface.

sql

1-- Analyze specific columns2ANALYZE catalog.db.events3  WITH (columns = ARRAY['user_id', 'event_type', 'region']);4 5-- Analyze with a specific number of partitions6ANALYZE catalog.db.events7  WITH (columns = ARRAY['user_id']);

Trino's Iceberg connector reads the table's data files, computes NDV sketches and null counts, writes them to a Puffin file, and registers the statistics in the table metadata. The connector then uses these statistics for subsequent query planning — predicate pushdown decisions, join reordering, and dynamic filtering all benefit from the presence of Puffin stats.

Trino also reads Puffin statistics produced by other engines (e.g., Spark), which means statistics collected once are shared across the entire multi-engine stack — a key advantage of Iceberg's open format.

Other engines

Engine support for Puffin statistics varies across the ecosystem:

Flink — reads Puffin statistics for planning but does not produce them. Streaming ingestion workloads should use Spark or Trino to periodically refresh statistics.
DuckDB — the Iceberg extension reads table metadata but does not currently consume or produce Puffin statistics. DuckDB relies on Parquet footer statistics and its own adaptive sampling.
Athena — uses Iceberg metadata for partition pruning but does not currently produce Puffin files. Tables analyzed via Spark or Trino will have their statistics available but Athena's planner may not fully leverage NDV sketches.
Snowflake — Snowflake's Iceberg table support uses internal statistics infrastructure. For managed Iceberg tables, Snowflake collects its own statistics; for externally cataloged tables, it may read Puffin files but does not produce them.
Databricks — supports ANALYZE TABLE for Delta and Iceberg tables. Statistics are stored in the table metadata and used for CBO.

Multi-engine lakehouse — Puffin statistics shared across all engines — Multi-engine architecture: statistics collected once are shared across Spark, Trino, Flink, Snowflake, Athena, and DuckDB — a key advantage of Iceberg's open format. LakeOps coordinates statistics collection and maintenance across all engines.

When statistics go stale

Statistics are computed against a specific snapshot. Any operation that changes the data invalidates the accuracy of existing statistics to some degree.

Data ingestion. New data files are added but the NDV sketch still reflects the old snapshot. If the new data introduces previously unseen values (new regions, new user IDs), the NDV underestimates cardinality. If the new data is within existing value ranges, the impact is smaller but row counts are still inaccurate.

Compaction. File rewrites change the physical layout without changing logical data, but compaction triggers a new snapshot. If the statistics reference a pre-compaction snapshot ID, some engines may ignore them entirely — treating the table as unanalyzed. This is especially common in tables suffering from small file proliferation, where frequent compaction cycles rapidly advance the snapshot count.

Schema evolution. Adding, renaming, or reordering columns can invalidate column ID mappings. Puffin statistics reference columns by field ID, not name, so column renames are safe. But adding new columns means no statistics exist for them — queries filtering on the new column fall back to defaults.

Partition evolution. Changing the partition spec creates new partitions that have no per-partition statistics. The table-level NDV may still be approximately correct, but partition-level pruning decisions degrade.

Delete operations. Equality deletes and position deletes logically remove rows but do not update the Theta sketch. NDV remains unchanged even though some distinct values may have been entirely deleted. Until compaction physically removes the deleted data and statistics are recomputed, the planner overestimates cardinality. For a deeper look at how delete files affect table health, see our companion guide.

The result is predictable: without re-collection, statistics drift from reality. The drift is gradual for append-only tables and sharp for tables with frequent updates, deletes, or schema changes. Stale statistics also inflate Iceberg compute costs — suboptimal join orders and poor split sizing mean queries burn more CPU and memory than necessary.

Healthy vs Unhealthy Iceberg tables — the impact of stale statistics on query planning — Stale statistics contribute to the unhealthy state: engines make suboptimal join orders, split sizes, and pruning decisions. Accurate, up-to-date statistics are part of what makes a table healthy — alongside compacted files and clean metadata.

Best practices for maintaining statistics

Use a control plane to automate statistics lifecycle. Statistics collection should not be a separate, manually scheduled job. A platform like LakeOps uses Puffin statistics as a core input — feeding NDV data into sort order optimization, detecting stale statistics across the lake, and triggering re-analysis after compaction automatically. Whether in autopilot mode or manual mode, the observability surfaces which tables have drifted statistics and by how many snapshots — eliminating guesswork about when to re-analyze. For teams pursuing fully autonomous Iceberg table maintenance, statistics refresh is a first-class stage alongside compaction and cleanup.

Re-analyze after compaction. Compaction is the natural trigger for re-collecting statistics. The data has just been rewritten — the table is in a clean state with no delete files, fresh manifests, and a new snapshot. Running ANALYZE TABLE immediately after compaction captures the most accurate statistics at the lowest marginal cost.

sql

1-- Maintenance pipeline: compact, then analyze2CALL catalog.system.rewrite_data_files('db.events');3ANALYZE TABLE catalog.db.events4  COMPUTE STATISTICS FOR COLUMNS user_id, event_type, region;

Prioritize high-impact columns. Analyzing all columns on a wide table is expensive and often unnecessary. Focus on columns that appear in:

JOIN ON conditions — NDV directly determines join order
WHERE clauses with high selectivity — cardinality drives filter estimation
GROUP BY clauses — aggregation strategy depends on cardinality

Skip columns used only in SELECT projections — their statistics do not affect planning.

Monitor for staleness. Track the gap between the statistics snapshot ID and the current table snapshot. If the table has advanced 50+ snapshots since the last ANALYZE, the statistics are likely significantly stale. Alert when this gap exceeds a threshold. Lakehouse observability tooling can surface these gaps automatically across hundreds of tables.

Use incremental strategies where available. Some engines support incremental statistics updates — computing sketches only for new data files and merging them with existing sketches. This is far cheaper than a full re-analyze. Theta sketches support this natively via the merge operation, though not all engines expose incremental collection yet.

Puffin vs Parquet footer statistics

Puffin and Parquet footer statistics are complementary, not competing.

Parquet footers store per-file, per-column statistics: min value, max value, null count, and row count for each column chunk and row group. These are written automatically by every Parquet writer. They enable file-level predicate pushdown — the engine reads the footer and skips row groups whose ranges do not overlap the filter predicate. No explicit collection step is needed.

Puffin files store table-level or partition-level aggregated statistics: primarily NDV sketches today, with extensibility for custom blobs. These require explicit collection (ANALYZE TABLE) because they summarize data across all files. They enable plan-level optimizations — join ordering, split sizing, and cross-partition pruning — that file-level statistics cannot support.

Aspect	Parquet Footer	Puffin Statistics
Scope	Per-file, per-row-group	Table-level or partition-level
Content	Min/max, null count, row count	NDV sketches, custom blobs (extensible)
Collection	Automatic at write time	Explicit (`ANALYZE TABLE`)
Used for	Row-group skipping, predicate pushdown	Join ordering, split planning, partition pruning
Size	Embedded in each Parquet file	Separate file in metadata directory
Staleness	Always current for existing files	Stale after new data or compaction

The best query performance comes from having both: Parquet footers for fine-grained I/O elimination and Puffin statistics for intelligent planning decisions. For a broader look at the levers that drive Iceberg query speed, see Optimizing Iceberg Lakehouse Performance.

How LakeOps leverages statistics for optimization

LakeOps is a dedicated lakehouse control plane for Apache Iceberg that uses Puffin statistics as a core input for its optimization engine. By combining NDV data from Puffin files with cross-engine query telemetry and table health signals, LakeOps automates maintenance decisions, optimizes physical layout, and identifies tables where stale or missing statistics degrade query performance — across Spark, Trino, Snowflake, Athena, DuckDB, and Flink.

LakeOps Dashboard — lake-wide health, optimization activity, and query acceleration — The LakeOps Dashboard: 30-day optimization activity, query acceleration, cost savings, and table health — surfacing which tables have stale or missing statistics alongside compaction and maintenance status.

LakeOps demo — how statistics drive autonomous sort optimization and table health monitoring.

Key capabilities for statistics-driven optimization:

Statistics-informed maintenance triggers — Continuously monitors the gap between a table's current snapshot and the snapshot its statistics were collected against. When this gap exceeds configurable thresholds — or when compaction, partition evolution, or schema changes invalidate existing statistics — the system flags the table for re-analysis, eliminating the common failure mode where statistics were collected once and never refreshed.

Query-aware sort using NDV data — Combines Puffin NDV estimates with cross-engine telemetry (which columns appear in WHERE, JOIN, and GROUP BY clauses) to determine optimal sort orders per table. A high-NDV column appearing frequently in equality filters becomes a strong sort key candidate; a low-NDV column is better suited for partitioning. The planner evaluates single-column sort, multi-column sort, and z-order strategies independently.

Identifying degraded query planning — The Insights engine surfaces tables where missing or stale statistics are actively degrading performance. If engines fall back to heuristic defaults for join ordering, LakeOps raises a proactive alert. If statistics are 100+ snapshots behind, the system quantifies estimated NDV drift and recommends re-analysis.

Statistics as part of the maintenance pipeline — Statistics re-collection is sequenced within the full maintenance pipeline: snapshot expiration → orphan cleanup → compaction → manifest optimization → statistics refresh. This ensures statistics are computed on the final, clean layout — after dead files are removed, small files merged, and sort orders applied.

Per-table statistics policies — Policies control statistics refresh frequency at table, namespace, and catalog levels. A high-velocity streaming table refreshes every 6 hours; a slowly changing dimension table refreshes weekly. The dashboard shows which tables have current statistics, which are stale, and which have never been analyzed.

LakeOps Layout Simulations — field access patterns and sort strategy comparison — Layout Simulations: field access frequency from real queries (SELECT, FILTER, JOIN) per column, candidate sort configurations compared against the baseline — using NDV and access patterns to determine the optimal physical layout.

LakeOps Insights — proactive alerts for table health and performance issues — Lake-wide Insights: severity-ranked signals surfacing tables with stale statistics, excessive manifests, partition skew, and small-file buildup — each linked to actionable recommendations.

The future of Puffin

The Puffin format was designed to be extensible, and several statistical types are under active discussion in the Iceberg community.

Column-level histograms. NDV tells you how many distinct values a column has. Histograms tell you how those values are distributed — frequency of each value (equi-height) or value ranges (equi-width). Histograms would enable far more accurate selectivity estimates for range predicates and inequality filters. A WHERE price > 1000 on a column where 95% of values are below 100 has very different selectivity than the same predicate on a uniformly distributed column. Histograms make that distinction visible to the planner.

Bloom filters. A bloom filter is a probabilistic data structure that answers membership queries: Is value X in this set? with false positives but no false negatives. Table-level or partition-level bloom filters stored in Puffin files would accelerate point lookups — WHERE order_id = '12345' — by letting the planner skip partitions that provably do not contain the value. Parquet already supports per-file bloom filters; Puffin would extend this to partition-level and table-level membership tests.

Correlation statistics. Current statistics treat columns independently. In practice, columns are often correlated — city and state are not independent, and filtering on one dramatically narrows the other. Correlation-aware statistics would prevent the planner from overestimating the selectivity of multi-column predicates, which is a common source of suboptimal join orders in production.

Incremental sketch maintenance. The most impactful near-term improvement. Rather than requiring a full table scan for every ANALYZE TABLE, engines would maintain Theta sketches incrementally — merging sketches from newly ingested files with existing table-level sketches. This reduces the cost of statistics maintenance from O(table size) to O(new data size), making it practical to keep statistics current on tables that ingest continuously.

Conclusion

Puffin statistics are the layer between raw data files and intelligent query planning in Apache Iceberg. Without them, engines guess. With them, engines make informed decisions about join ordering, partition pruning, and resource allocation — decisions that compound across thousands of queries per day into significant performance and cost differences.

The practical challenge is not collecting statistics once — it is keeping them accurate as tables evolve. Ingestion, compaction, schema changes, and deletes all erode statistical accuracy over time. The teams that benefit most from Puffin statistics are the ones that treat statistics maintenance as part of the table lifecycle — sequenced with compaction, triggered by data changes, and monitored for staleness.

For teams managing many tables across multiple engines, a managed Iceberg control plane like LakeOps automates this lifecycle — from identifying which tables need statistics, to sequencing collection with compaction and cleanup, to surfacing when staleness degrades query performance. The result is a lakehouse where query planning is consistently informed by accurate metadata, without manual intervention.