
The engineering team at STACKIT, which runs one of Europe's first sovereign managed lakehouse offerings, has built an architecture that turns Apache Iceberg into a fully managed, multi-tenant cloud service on Kubernetes. For platform teams evaluating whether to build or buy lakehouse infrastructure, their approach offers a rare inside look at the engineering required to deliver Iceberg-as-a-Service, from custom resource definitions to tenant isolation to upgrade strategies.
Building a Lakehouse-as-a-Service means automating everything — provisioning, scaling, monitoring, upgrades, and crucially, table maintenance. The gap between provisioning infrastructure and keeping it healthy is where most managed lakehouse services struggle: compaction, snapshot expiration, orphan cleanup, and observability must run continuously for every tenant, adapting to each tenant's write patterns without per-customer engineering.
This article synthesizes the architecture patterns behind managed lakehouse services on Kubernetes — drawing from STACKIT's experience building their sovereign Dremio-based lakehouse, the broader Kubernetes operator ecosystem for data platforms, and the engineering challenges that every team faces when turning an open table format into a managed cloud product.

What "Lakehouse as a Service" actually means
The term "managed lakehouse" gets used loosely. Vendors apply it to everything from a hosted query engine with Iceberg support to a fully integrated platform that provisions storage, catalogs, compute, and governance as a single product. To understand what building a real Lakehouse-as-a-Service requires, it helps to define the components that must be managed.
A managed lakehouse service provisions and operates, at minimum, four layers for each tenant: object storage where Iceberg data files and metadata live, a catalog service that tracks table state and enforces access control, compute engines that read and write data through the catalog, and a maintenance plane that keeps tables healthy over time — compaction, snapshot expiration, orphan cleanup, manifest optimization.
The critical distinction between "running Iceberg" and "offering Iceberg as a service" is who owns the operational burden. When a team runs their own Iceberg deployment, they own provisioning, monitoring, upgrades, scaling, maintenance, and incident response. When a platform team offers Iceberg as a service, the platform absorbs that burden — multiplied by the number of tenants. A problem that is manageable for one team becomes an engineering challenge at 50 tenants and an architectural constraint at 500.
STACKIT's approach — building a managed Dremio-based lakehouse on their sovereign European cloud — illustrates this clearly. Their service provisions Iceberg-native analytics with Apache Polaris catalogs, STACKIT Object Storage, and elastically scaled compute engines, all operated within European data sovereignty boundaries. The customer sees a simple provisioning flow: click, configure, connect. The engineering team sees a complex Kubernetes control plane that must reliably orchestrate dozens of interdependent components for every tenant.
This is the fundamental tension of any managed data service: the product must feel simple while the implementation handles complexity at scale. Kubernetes is the platform that makes this possible — and the platform that introduces its own category of challenges.
Why Kubernetes is the natural deployment target
Kubernetes has become the default orchestration layer for managed data services, and for good reason. The platform provides exactly the primitives that lakehouse services need: declarative resource management, automated scaling, namespace-based isolation, rolling upgrades, and a robust API for building custom controllers.
For a managed lakehouse, Kubernetes offers several specific advantages that map directly to service requirements.
Declarative infrastructure. Every component of a tenant's lakehouse environment — the catalog instance, the query engine cluster, the storage configuration, the maintenance jobs — can be described as a Kubernetes resource with a desired state specification. The platform does not imperatively create and configure components; it declares what should exist, and controllers continuously reconcile actual state to match. This is essential for reliability at scale. When a catalog pod crashes, Kubernetes restarts it. When a compute node is evicted, the scheduler places a replacement. The platform team does not write recovery scripts — the declarative model handles recovery natively.
Namespace isolation. Kubernetes namespaces provide a natural boundary for tenant isolation. Each tenant's lakehouse components can run in a dedicated namespace with resource quotas, network policies, and RBAC rules that prevent cross-tenant interference. This is not perfect isolation, because workloads in different namespaces still share node kernels and the cluster's API server, but it provides the resource and access boundaries that most multi-tenant services require without the overhead of dedicated clusters per tenant.
The operator pattern. Kubernetes operators — custom controllers that watch for changes to custom resource definitions (CRDs) and reconcile the cluster state accordingly — are the core mechanism for building managed services. An operator encodes domain-specific operational knowledge into software: how to provision a new catalog instance, how to scale a query engine, how to perform a rolling upgrade, how to respond to a health check failure. Without operators, managed services require manual runbooks executed by on-call engineers. With operators, the operational knowledge runs continuously as code.
Elastic scaling. Lakehouse workloads are inherently bursty. A tenant might run a handful of lightweight queries all day, then submit a complex analytical workload that needs 10x the compute for an hour. Kubernetes Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA), combined with cluster autoscalers that add or remove nodes, provide the elastic scaling that makes this economically viable. The platform does not provision peak capacity for every tenant — it provisions baseline capacity and scales dynamically, keeping costs proportional to actual usage.
Ecosystem integration. The Kubernetes ecosystem provides battle-tested solutions for the cross-cutting concerns every managed service needs: Prometheus for metrics, cert-manager for TLS, external-dns for DNS record management, Helm for packaging, ArgoCD or Flux for GitOps deployment. Building on Kubernetes means adopting these solutions rather than building them from scratch.
STACKIT's Kubernetes Engine (SKE) — their managed Kubernetes offering — demonstrates this at the infrastructure level. SKE provides CNCF-compliant clusters with managed control planes, auto-updates, node autoscaling, and automatic repair. The lakehouse service builds on top of this foundation, inheriting the reliability and scaling characteristics of the underlying Kubernetes platform without managing the control plane itself.
Custom Kubernetes operators for Iceberg resource provisioning
The heart of any managed lakehouse service on Kubernetes is its operator architecture. Operators encode the entire lifecycle of lakehouse resources — creation, configuration, scaling, upgrading, and teardown — as reconciliation loops that continuously drive the cluster toward the desired state.
For an Iceberg-based lakehouse, the operator architecture typically manages several distinct resource types, each represented as a Custom Resource Definition (CRD).
Catalog CRD. The catalog is the most critical component of an Iceberg deployment. It stores table metadata, manages concurrent access, and enforces governance policies. A CatalogInstance CRD might specify the catalog type (Polaris, REST, Nessie, Gravitino), the storage backend, authentication configuration, and high-availability settings. The operator that watches this CRD provisions the catalog pods, creates the backing database (if needed), configures TLS, registers the catalog endpoint in service discovery, and continuously monitors health.
Compute CRD. Query engines like Dremio, Trino, or Spark need their own CRDs that specify cluster topology (coordinators, workers), resource allocations, Iceberg catalog connections, and autoscaling policies. A TrinoCluster CRD, for example, might declare a coordinator with 8 CPUs and 32 GB of RAM (Trino runs a single coordinator per cluster), a worker pool of 4 to 16 nodes that autoscales based on query queue depth, and connector configurations pointing to the tenant's Polaris catalog. The Stackable project's Trino operator demonstrates this pattern: it manages TrinoCluster and TrinoCatalog resources with full lifecycle automation.
Storage CRD. Object storage configuration — bucket provisioning, access credentials, encryption settings, lifecycle policies — can be managed through a storage CRD. The operator creates storage buckets, configures IAM policies, distributes credentials to the catalog and compute components, and enforces retention and encryption requirements. For STACKIT, this means provisioning buckets in STACKIT Object Storage with European data residency guarantees.
Maintenance CRD. This is the resource type that most managed lakehouse implementations underestimate. Table maintenance — compaction, snapshot expiration, orphan file cleanup, manifest optimization — must run continuously for every tenant's tables. A maintenance CRD might define per-tenant maintenance policies: compaction thresholds, snapshot retention periods, cleanup schedules. The operator translates these policies into scheduled jobs or event-driven workflows that keep tables healthy.
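To make the maintenance policy idea concrete, here is a minimal sketch of how an operator might turn a per-tenant policy into a list of due actions for one table. The field names, default thresholds, and the `due_actions` helper are all hypothetical; they illustrate the shape of the logic, not any particular operator's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaintenancePolicy:
    """Hypothetical per-tenant policy, mirroring what a maintenance CRD spec might hold."""
    max_small_files: int = 500            # compact once a table exceeds this many small files
    max_snapshots: int = 100              # expire snapshots beyond this count
    orphan_cleanup_interval_days: int = 30


@dataclass(frozen=True)
class TableStats:
    """Hypothetical observed state for one Iceberg table."""
    small_files: int
    snapshots: int
    days_since_orphan_cleanup: int


def due_actions(policy: MaintenancePolicy, stats: TableStats) -> list[str]:
    """Return the maintenance actions the policy says are due for this table."""
    actions = []
    if stats.small_files > policy.max_small_files:
        actions.append("compact")
    if stats.snapshots > policy.max_snapshots:
        actions.append("expire_snapshots")
    if stats.days_since_orphan_cleanup >= policy.orphan_cleanup_interval_days:
        actions.append("remove_orphans")
    return actions
```

A real operator would feed `TableStats` from table metadata and schedule a job per returned action; static thresholds like these are the simplest possible policy.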
Building and maintaining this maintenance operator is a significant engineering investment — one that LakeOps eliminates. Rather than encoding maintenance logic into custom Kubernetes operators, platform teams connect LakeOps as a ready-made control plane that handles autonomous compaction, snapshot expiration, orphan cleanup, and health monitoring across all tenants. Built in Rust on DataFusion, LakeOps connects to existing catalogs (Glue, Polaris, Nessie) and engines (Spark, Trino, Flink) without data movement, adapting maintenance per table based on observed state rather than static CRD configurations.

The interaction between these CRDs follows a predictable pattern that mirrors the tenant provisioning lifecycle.
The deployment flow: from API call to service ready
When a new tenant provisions a lakehouse instance, the system follows a well-defined sequence from API interaction through CRD creation to fully operational service.
Step 1: API interaction. The tenant requests a new lakehouse instance through the platform's API or console. This might be a REST API call, a click in a web portal, or a Terraform resource creation. The request specifies the desired configuration: engine type, cluster size, storage location, catalog settings.
Step 2: CRD creation. The platform's provisioning service translates the API request into one or more custom resources. A single tenant provisioning request typically creates a CatalogInstance, a ComputeCluster, a StorageBucket, and a MaintenancePolicy. These resources are submitted to the Kubernetes API server, where they are persisted in etcd.
Step 3: Operator reconciliation. Each operator watches for its respective CRD and begins reconciling. The storage operator creates the object storage bucket and configures access policies. The catalog operator provisions the Polaris or REST catalog instance and connects it to the storage backend. The compute operator deploys the query engine cluster and configures it to connect to the catalog. The maintenance operator sets up the scheduled jobs that will keep the tenant's tables healthy.
Step 4: Health verification. Once all operators have reconciled their resources, a health check verifies that the entire stack is operational. The catalog is reachable. The compute engine can connect to the catalog and read table metadata. The storage backend is accessible. Health status is propagated back through the CRD status fields, and the tenant's dashboard shows the service as ready.
Step 5: Service ready. The tenant receives their connection endpoint — a SQL endpoint, a catalog URL, or both — and begins using the lakehouse. From this point forward, the operators continue running, continuously monitoring and reconciling the tenant's resources.
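Step 2 of the flow above can be sketched as a pure function that turns one provisioning request into four custom resource manifests. The API group `lakehouse.example.com/v1alpha1`, the request keys, and the spec fields are assumptions made for illustration; only the four kind names come from the description above.

```python
def build_custom_resources(tenant: str, request: dict) -> list[dict]:
    """Translate one provisioning request into four custom resource manifests."""
    namespace = f"tenant-{tenant}"
    api_version = "lakehouse.example.com/v1alpha1"  # hypothetical API group

    def resource(kind: str, spec: dict) -> dict:
        return {
            "apiVersion": api_version,
            "kind": kind,
            "metadata": {"name": f"{tenant}-{kind.lower()}", "namespace": namespace},
            "spec": spec,
        }

    # Order follows the dependency chain: storage, then catalog, then compute.
    return [
        resource("StorageBucket", {"location": request["storage_location"]}),
        resource("CatalogInstance", {"type": request.get("catalog_type", "polaris")}),
        resource("ComputeCluster", {"engine": request["engine"], "size": request["cluster_size"]}),
        resource("MaintenancePolicy", {"profile": request.get("maintenance_profile", "default")}),
    ]
```

Keeping the translation free of side effects makes it easy to unit test; submitting the manifests to the API server is a separate step.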
This flow is deceptively simple to describe and enormously complex to implement reliably. Each step has failure modes that must be handled gracefully. What happens if the storage bucket creation fails due to a quota limit? What if the catalog pod crashes during initialization? What if the compute engine cannot reach the catalog due to a network policy misconfiguration? The operator must handle every failure mode — retry transient errors, report permanent failures, clean up partial resources, and leave the system in a consistent state regardless of where the provisioning process fails.
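One way to structure that failure handling is sketched below, assuming each provisioning step comes with a matching delete function. The exception classes, the `provision` helper, and the status strings are hypothetical; the point is the split between retrying transient errors and rolling back on permanent ones.

```python
import time


class TransientError(Exception):
    """An error worth retrying, such as a timeout."""


class PermanentError(Exception):
    """An error that retrying will not fix, such as an exceeded quota."""


def _rollback(completed):
    """Delete already-created resources in reverse order."""
    for _name, delete in reversed(completed):
        delete()


def provision(steps, max_attempts=3, base_delay=1.0):
    """Run (name, create, delete) steps in order; undo completed steps on failure."""
    completed = []
    for name, create, delete in steps:
        for attempt in range(1, max_attempts + 1):
            try:
                create()
                completed.append((name, delete))
                break
            except TransientError:
                if attempt == max_attempts:
                    _rollback(completed)
                    return f"failed: {name} (retries exhausted)"
                time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
            except PermanentError as exc:
                _rollback(completed)
                return f"failed: {name} ({exc})"
    return "ready"
```

Rolling back in reverse order means the compute cluster is removed before the catalog and the catalog before the bucket, so nothing is left pointing at a deleted dependency.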
The Flink Kubernetes Operator illustrates this reconciliation pattern clearly: the user submits a custom resource via kubectl, the operator observes current status, validates the resource change, and reconciles any required changes — a continuous loop that adjusts until the current state matches the desired state.
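The observe, validate, reconcile cycle can be reduced to a few lines. The sketch below is a simplified illustration, not the Flink operator's code: `observe` and `apply` are hypothetical callables standing in for cluster reads and writes, and the `workers` rule is an invented validation check.

```python
def validate(spec: dict) -> list[str]:
    """Reject specs the operator cannot act on (hypothetical rule)."""
    errors = []
    if spec.get("workers", 0) < 1:
        errors.append("workers must be >= 1")
    return errors


def reconcile_once(desired: dict, observe, apply) -> dict:
    """One pass of observe -> validate -> reconcile; returns a status dict."""
    current = observe()
    errors = validate(desired)
    if errors:
        return {"state": "Invalid", "errors": errors}
    # Only the fields that differ from the observed state need to change.
    changes = {k: v for k, v in desired.items() if current.get(k) != v}
    if not changes:
        return {"state": "Ready", "changes": {}}
    apply(changes)
    return {"state": "Reconciling", "changes": changes}
```

Calling `reconcile_once` repeatedly is idempotent: once the observed state matches the desired state, further passes report `Ready` and change nothing.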
Multi-tenancy challenges: isolation without isolation overhead
Multi-tenancy is the defining engineering challenge of any managed service. A single-tenant deployment can be messy, over-provisioned, and manually managed. A multi-tenant service must be clean, right-sized, and automated — because every architectural decision gets multiplied by the number of tenants.
For a managed Iceberg lakehouse, multi-tenancy introduces specific challenges at every layer of the stack.
Namespace isolation and resource quotas
The first question is the isolation boundary. Kubernetes offers several options, each with different trade-offs.
Namespace-per-tenant is the most common approach. Each tenant gets a dedicated namespace with ResourceQuota objects that limit CPU, memory, storage, and object counts. Network policies restrict cross-namespace traffic. RBAC rules prevent tenants from accessing each other's resources. This provides good isolation with reasonable overhead — adding a new tenant means creating a new namespace, not a new cluster.
Virtual clusters (tools like vcluster) provide stronger isolation by running a lightweight Kubernetes control plane per tenant inside a shared host cluster. Each tenant sees their own API server, their own backing data store, and their own namespace hierarchy, but the workloads run on shared nodes. This adds overhead but provides near-complete API-level isolation.
Dedicated clusters provide the strongest isolation but the highest overhead. Each tenant runs on their own Kubernetes cluster with a dedicated control plane, dedicated nodes, and dedicated networking. This is appropriate for high-security or regulatory requirements but does not scale economically to hundreds of tenants.
Most managed lakehouse services use namespace-per-tenant with network policies and RBAC, supplementing with node affinity or taints where workload isolation is critical. STACKIT's Kubernetes Engine supports all three models, with the managed control plane abstracting the cluster-level complexity.
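As an illustration of the namespace-per-tenant model, the following sketch builds the three manifests a provisioning service might create for each tenant. The naming scheme, label, and quota values are assumptions; the resource kinds and fields are standard Kubernetes.

```python
def tenant_namespace_manifests(tenant: str, cpu: str, memory: str) -> list[dict]:
    """Build Namespace, ResourceQuota, and NetworkPolicy manifests for one tenant."""
    ns = f"tenant-{tenant}"  # hypothetical naming scheme
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": ns, "labels": {"tenant": tenant}},
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "tenant-quota", "namespace": ns},
        "spec": {"hard": {"requests.cpu": cpu, "requests.memory": memory, "pods": "50"}},
    }
    # An empty podSelector selects every pod in the namespace; the single
    # ingress rule admits traffic only from pods in the same namespace.
    network_policy = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "same-namespace-only", "namespace": ns},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Ingress"],
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }
    return [namespace, quota, network_policy]
```

A production policy would also need rules that admit traffic from shared components such as an ingress controller or monitoring agents, which this sketch leaves out.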
Catalog separation
Catalog isolation is arguably more important than compute isolation for lakehouse services. The catalog stores table metadata — schema definitions, partition specs, snapshot history, access policies. If tenants share a catalog instance, a misconfigured access policy could expose one tenant's metadata to another. If they share a catalog database, a runaway query against the metadata store could impact all tenants.
The Polaris catalog, which STACKIT uses, supports multi-tenant isolation through its hierarchy of realms and catalogs. Each tenant can get their own Polaris catalog within a shared Polaris deployment, with role-based access control scoped to that catalog ensuring metadata isolation. Alternatively, each tenant can get their own Polaris instance running in their namespace, providing stronger isolation at the cost of more resources per tenant.
The Iceberg REST Catalog API has emerged as the standard interface for catalog interoperability. Regardless of the backing catalog implementation — Polaris, Nessie, Gravitino, Lakekeeper — the REST API provides a consistent interface that compute engines connect to. This means the catalog operator can swap implementations without changing how compute engines are configured — enabling seamless catalog migration for managed services that need to evolve their catalog infrastructure without disrupting tenants.
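The practical effect of a common REST interface shows up in engine configuration. The sketch below renders a Trino catalog properties file for an Iceberg REST catalog; the property names are Trino's Iceberg connector settings, while the helper name and the example URLs are hypothetical. Authentication settings are omitted for brevity.

```python
def trino_iceberg_rest_properties(catalog_uri: str, warehouse: str) -> str:
    """Render a Trino catalog properties file for an Iceberg REST catalog."""
    props = {
        "connector.name": "iceberg",
        "iceberg.catalog.type": "rest",
        "iceberg.rest-catalog.uri": catalog_uri,
        "iceberg.rest-catalog.warehouse": warehouse,
    }
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


# Swapping the backing catalog implementation changes only the URI line.
polaris = trino_iceberg_rest_properties("https://polaris.example.internal/api/catalog", "acme")
other = trino_iceberg_rest_properties("https://lakekeeper.example.internal/catalog", "acme")
```

Because the connector type stays `rest` in both cases, a catalog operator could regenerate this file during a migration without touching the rest of the engine configuration.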

Compute isolation
Query engines present a different isolation challenge. Trino and Dremio are typically deployed as dedicated clusters, where each tenant gets their own coordinator and worker pool. This provides strong isolation, since one tenant's expensive query cannot starve another tenant of resources, but it means provisioning a minimum cluster size per tenant, even if most tenants are idle most of the time.
The alternative is shared compute with query queuing and resource management. A single Trino or Dremio cluster serves multiple tenants, with resource groups or workload management policies ensuring fair allocation. This is more efficient for tenants with light or intermittent workloads but introduces noisy-neighbor risks that must be carefully managed.
STACKIT's approach with Dremio uses isolated running engines: each business unit or tenant gets their own compute engine that scales independently. This ensures predictable performance but requires sophisticated autoscaling to avoid over-provisioning. Autoscaling driven by Dremio's engine-level metrics, for example through the Kubernetes HPA, lets engines grow to dozens of workers based on query demand. Scaling all the way down to zero workers generally needs an additional mechanism, because the stock HPA keeps at least one replica by default.
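For intuition about how metric-driven scaling behaves, here is the replica calculation documented for the Kubernetes HPA, written as a small function. The metric (queued queries per worker) and the default bounds are assumptions chosen for this example, not STACKIT's or Dremio's actual settings.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=16, tolerance=0.1):
    """HPA-style calculation: ceil(current * currentMetric / targetMetric), clamped."""
    ratio = current_metric / target_metric
    # Within the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 4 workers, 30 queued queries per worker observed, target of 10 per worker.
scaled = desired_replicas(4, 30, 10)  # 12
```

Note that the formula multiplies by the current replica count, so starting from zero replicas it always yields zero. Waking an idle engine therefore needs a separate trigger, such as an incoming query.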
Engineering hurdles: state, upgrades, and compatibility
Building the initial version of a managed lakehouse service is a significant engineering effort. Keeping it running, evolving it, and maintaining backward compatibility across hundreds of tenants is harder.
State management
Kubernetes was designed with stateless workloads in mind. Containers are ephemeral. Pods are replaceable. Nodes are cattle, not pets. But lakehouse components are inherently stateful. Catalogs have metadata databases. Query engines have query plan caches and intermediate results. Storage configurations have credentials and policies.
The operator must manage this state carefully. Catalog metadata databases need persistent volume claims that survive pod restarts. Database migrations must run during upgrades without corrupting existing data. Credentials must be rotated without disrupting active connections. Configuration changes must be applied atomically — a half-updated configuration where the catalog points to a new storage backend but the compute engine still uses the old credentials is a production incident.
The standard Kubernetes pattern for stateful workloads — StatefulSets with persistent volume claims — provides the foundation. But the operator must layer application-specific state management on top: backup and restore for catalog databases, graceful drain for query engines during rolling restarts, and consistent configuration distribution using ConfigMaps or Secrets.
Upgrade strategies
Upgrading a managed service across hundreds of tenants is one of the most complex operational challenges in cloud engineering. Every upgrade must be:
Safe. The upgrade must not corrupt data, drop connections, or break queries. Tenants expect zero-downtime upgrades, which means rolling restarts at minimum and ideally blue-green deployments for major version changes.
Reversible. If an upgrade introduces a regression, the platform must be able to roll back to the previous version without data loss. This requires careful management of schema migrations, configuration changes, and API versions.
Incremental. Not all tenants should upgrade simultaneously. Canary deployments — upgrading a small percentage of tenants first, monitoring for issues, then rolling out to the rest — are essential for catching problems before they affect the entire fleet.
Compatible. The platform must support multiple versions simultaneously during the rollout period. A tenant on version N must coexist with tenants on version N+1, sharing the same cluster infrastructure and control plane.
Kubernetes makes rolling updates native to Deployments and StatefulSets, but the application-level upgrade logic — database migrations, configuration version negotiation, API backward compatibility — must be implemented in the operator. A well-designed operator exposes the current version and target version in the CRD status, performs pre-upgrade validation, executes the upgrade in stages, and provides rollback commands if health checks fail.
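The incremental requirement can be sketched as a wave planner plus a rollout loop that halts on the first unhealthy wave. The function names, the canary size, and the growth factor are hypothetical, and `upgrade` and `healthy` stand in for whatever the operator actually does per tenant.

```python
def plan_waves(tenants: list[str], canary_size: int = 5, growth: int = 4) -> list[list[str]]:
    """Split tenants into rollout waves: a small canary first, then growing batches."""
    waves, start, size = [], 0, max(1, canary_size)
    while start < len(tenants):
        waves.append(tenants[start:start + size])
        start += size
        size *= growth
    return waves


def roll_out(waves, upgrade, healthy) -> dict:
    """Upgrade wave by wave; stop at the first wave that fails its health check."""
    upgraded = []
    for wave in waves:
        for tenant in wave:
            upgrade(tenant)
        upgraded.extend(wave)
        if not all(healthy(tenant) for tenant in wave):
            # The failing wave is the rollback candidate; later waves stay untouched.
            return {"status": "halted", "rollback": wave, "upgraded": upgraded}
    return {"status": "complete", "rollback": [], "upgraded": upgraded}
```

With 100 tenants and the defaults, the waves contain 5, 20, and 75 tenants, so a regression caught in the canary wave affects 5% of the fleet at most.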
Backward compatibility
Apache Iceberg itself has strong backward compatibility guarantees — newer readers can read tables written by older writers, and the format version evolution is carefully managed. But a managed lakehouse service adds layers above Iceberg that have their own compatibility surfaces: catalog APIs, query engine versions, authentication protocols, client libraries.
When the service upgrades from Polaris 0.x to Polaris 1.0, does every tenant's client library still work? When the Dremio engine is upgraded from version 25 to version 26, do existing JDBC connections maintain their session state? When the Kubernetes API server is upgraded, do the CRDs need migration? Each compatibility surface is a potential breaking change, and each breaking change is a tenant-facing incident.
The mitigation is versioned APIs with deprecation policies, compatibility testing matrices that cover every supported client version, and phased rollouts that catch incompatibilities before they affect production tenants.
Operationalizing: monitoring, alerting, and SLA management
A managed service is only as good as its operational posture. Provisioning infrastructure is the easy part — keeping it healthy, performant, and reliable is the ongoing challenge.
Monitoring
A managed lakehouse service requires monitoring at multiple layers, each with different metrics and alerting thresholds.
Infrastructure layer. Kubernetes cluster health, node resource utilization, pod scheduling latency, persistent volume capacity. These are the foundation — if the infrastructure is unhealthy, nothing above it works correctly. Prometheus and the Kubernetes metrics API provide the standard monitoring stack, with Grafana dashboards showing cluster-level health across the fleet.
Service layer. Catalog response latency, query engine throughput, storage API error rates, operator reconciliation duration. These metrics reflect the health of the managed service components. A spike in catalog response latency might indicate database contention. A drop in query throughput might indicate resource exhaustion. An increase in operator reconciliation duration might indicate a backlog of pending changes.
Application layer. Query performance per tenant, table health metrics, data freshness, cost allocation. These are the metrics that tenants care about — and the metrics that drive SLA compliance. Is the tenant's P99 query latency within the committed SLA? Are their tables being compacted on schedule? Is their data fresh within the committed latency target?
Table health layer. This is where most managed lakehouse services have a blind spot. Tables that are not compacted accumulate small files and delete file overhead. Snapshots that are not expired bloat metadata. Orphan files from failed writes consume storage. These issues do not show up in infrastructure or service metrics — they manifest as gradual query performance degradation that tenants notice before the platform team does.

LakeOps fills this gap with real-time table health monitoring that classifies every table as Healthy, Warning, or Critical based on file count, delete file ratios, snapshot accumulation, and manifest structure. For a managed service with hundreds of tenants and thousands of tables, this lake-wide health view is the difference between reactive firefighting and proactive maintenance. The platform team sees which tenants have degrading tables before those tenants open support tickets — and LakeOps's automated maintenance fixes the issues without manual intervention.
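To show what a three-level classification of this kind can look like, here is a minimal threshold-based sketch. It is not LakeOps's actual logic: the thresholds, the `TableMetrics` fields, and the `classify` function are assumptions, and manifest structure is left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableMetrics:
    """Hypothetical per-table health inputs."""
    data_files: int
    delete_files: int
    snapshots: int


# Hypothetical thresholds; real values depend on engine, file sizes, and workload.
WARNING = {"data_files": 1_000, "delete_ratio": 0.10, "snapshots": 200}
CRITICAL = {"data_files": 10_000, "delete_ratio": 0.30, "snapshots": 1_000}


def classify(metrics: TableMetrics) -> str:
    """Return Healthy, Warning, or Critical based on the worst observed signal."""
    delete_ratio = metrics.delete_files / metrics.data_files if metrics.data_files else 0.0
    observed = {
        "data_files": metrics.data_files,
        "delete_ratio": delete_ratio,
        "snapshots": metrics.snapshots,
    }
    if any(observed[key] >= CRITICAL[key] for key in CRITICAL):
        return "Critical"
    if any(observed[key] >= WARNING[key] for key in WARNING):
        return "Warning"
    return "Healthy"
```

Taking the worst signal rather than an average keeps a single severe problem, such as a high delete file ratio, from being hidden by otherwise healthy numbers.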
Alerting and incident response
Alerting in a multi-tenant service requires careful tuning to avoid alert fatigue. A single shared cluster serving 100 tenants can generate thousands of alerts if every metric threshold fires independently. Effective alerting requires aggregation — alerting on fleet-level anomalies rather than individual metric crossings — and tiering — distinguishing between P1 incidents (data loss risk, service unavailability) and P3 issues (elevated latency, approaching quota limits).
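Aggregation and tiering can be combined in one pass over the raw alerts. In the sketch below, the alert dictionary shape, the 20% fleet threshold, and the use of P2 for fleet-wide anomalies are assumptions; the text above only defines P1 and P3.

```python
from collections import defaultdict


def aggregate_alerts(alerts, total_tenants, fleet_threshold=0.2):
    """Collapse per-tenant alerts into one entry per metric, with a priority tier."""
    tenants_by_metric = defaultdict(set)
    severe = set()
    for alert in alerts:
        tenants_by_metric[alert["metric"]].add(alert["tenant"])
        if alert.get("data_loss_risk") or alert.get("unavailable"):
            severe.add(alert["metric"])

    incidents = []
    for metric, tenants in sorted(tenants_by_metric.items()):
        fleet_wide = len(tenants) / total_tenants >= fleet_threshold
        if metric in severe:
            priority = "P1"   # data loss risk or unavailability
        elif fleet_wide:
            priority = "P2"   # assumed tier for fleet-level anomalies
        else:
            priority = "P3"   # isolated latency or quota warnings
        incidents.append({"metric": metric, "tenants": len(tenants), "priority": priority})
    return incidents
```

Thirty tenants reporting the same latency alert become a single incident with a tenant count, which is what an on-call engineer needs in order to judge blast radius.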
Incident response follows a standard framework adapted for multi-tenant services: detect the issue (monitoring), determine the blast radius (which tenants are affected), mitigate the immediate impact (failover, scaling, traffic shifting), identify the root cause, implement a fix, and conduct a post-mortem. The multi-tenant dimension adds complexity to every step — a root cause analysis must consider whether the issue is tenant-specific (a runaway query), service-wide (a catalog database problem), or infrastructure-wide (a Kubernetes node failure).
SLA management
Defining and enforcing SLAs for a managed lakehouse service requires specific metrics tied to the tenant's experience, not just infrastructure health.
Availability SLA. The service endpoint is reachable and accepts queries. This is typically measured as the percentage of time the query endpoint returns successful health checks. A 99.9% availability SLA allows roughly 8.8 hours of downtime per year (0.1% of 8,760 hours). That is achievable for single-region deployments with proper redundancy, but challenging for services that depend on multiple stateful components (catalog database, query engine, storage backend).
Query performance SLA. P50 and P99 query latencies stay within committed bounds. This is harder to enforce because query performance depends on table state (file layout, delete file accumulation) as much as infrastructure health. A tenant with a poorly maintained table will see degraded query performance even if the infrastructure is healthy — which is why automated table maintenance is an SLA enabler, not just an optimization.
Data freshness SLA. Ingested data becomes queryable within a committed time window. This depends on the ingestion pipeline, the catalog commit latency, and the query engine's metadata refresh interval. Each component in the chain adds latency, and the SLA is only met if the end-to-end latency stays within bounds.
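The arithmetic behind these SLAs is simple enough to check directly. The sketch below computes a yearly downtime budget, the combined availability of components that must all be up, and an end-to-end freshness check; the function names and the example stage latencies are illustrative assumptions.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760


def downtime_budget_hours(availability: float) -> float:
    """Yearly downtime allowed by an availability target such as 0.999."""
    return (1.0 - availability) * HOURS_PER_YEAR


def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up (independent failures)."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def freshness_met(stage_latencies_s: list[float], target_s: float) -> bool:
    """End-to-end freshness holds only if the summed stage latencies fit the target."""
    return sum(stage_latencies_s) <= target_s


budget = downtime_budget_hours(0.999)               # about 8.76 hours
chain = serial_availability(0.999, 0.999, 0.999)    # about 0.997
```

Three components at 99.9% each, all required, combine to roughly 99.7%, which is about 26 hours of downtime a year. That is why a 99.9% service-level target demands better than 99.9% from each stateful dependency.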
Automated table maintenance from LakeOps directly supports all three SLA dimensions. Availability improves because well-maintained tables do not trigger out-of-memory errors or metadata timeouts that crash query engines. Query performance stays within bounds because compaction keeps file counts low and delete file ratios manageable. Data freshness is maintained because snapshot management prevents metadata bloat that slows down catalog operations. For managed service providers, LakeOps's per-tenant policy engine ensures that every tenant's tables meet the maintenance standards required by the service SLA.
Lessons from STACKIT: bringing a managed lakehouse to market
STACKIT's experience building and launching their managed Dremio-based lakehouse service — from internal development through public preview to planned general availability in summer 2026 — offers several lessons for teams building similar offerings.
Start with open standards and stay there
STACKIT built their lakehouse on Apache Iceberg, Apache Polaris, and Apache Arrow, open Apache projects that prevent vendor lock-in for their customers and for themselves. This is not just a marketing position; it is an architectural decision that pays dividends throughout the service lifecycle. Open standards mean the catalog can be swapped without rewriting connectors. The table format is portable to other platforms. Client libraries are maintained by the community, not by the service team. For European customers in regulated industries, open standards also support sovereignty requirements: data stored in Iceberg on STACKIT Object Storage can be migrated to any other Iceberg-compatible platform without format conversion.
Invest in the operator early
The Kubernetes operator is the most critical piece of software in a managed service. It encodes every operational decision: how to provision, how to scale, how to upgrade, how to recover. Teams that underinvest in the operator end up with manual runbooks that on-call engineers execute at 3 AM. Teams that invest early in a comprehensive operator find that the operational burden per tenant decreases as the tenant count increases, because the operator handles the work that would otherwise scale linearly with tenants.
STACKIT's engineering team builds and optimizes Kubernetes operators that automate the lifecycle of cloud services. Their presence at KubeCon 2026 with over 20 team members — demonstrating enterprise-grade Kubernetes solutions with digital sovereignty — reflects the organizational investment required to build operators that work reliably at scale.
Design for the maintenance gap
Every managed lakehouse service launches with provisioning and query execution working well. The gap that emerges in production is maintenance. Tables accumulate small files. Snapshots pile up. Delete files from upsert workloads degrade query performance. Orphan files consume storage. These issues appear gradually — weeks or months after launch — and they affect tenants unevenly based on their write patterns.
Teams that build their own maintenance automation face the same challenge Akamai described in their Egnatia platform: evaluating daily whether tables need file rewrites, managing sort orders and maintenance parameters through table properties, running snapshot expiration daily, manifest rewrites daily, and orphan cleanup monthly. Building this for a single deployment is manageable. Building it for a multi-tenant service where each tenant has different write patterns, different table counts, and different performance requirements is a substantial engineering effort.

LakeOps eliminates this engineering effort for managed service providers. Per-tenant maintenance policies adapt automatically to each tenant's actual workload. Tenants with streaming ingestion get frequent small-file compaction. Tenants with batch upserts get merge-on-read reconciliation. Tenants with append-only workloads get lighter-touch maintenance. The platform provider configures lake-wide policies once, and LakeOps enforces them across every tenant's tables — no per-customer engineering, no maintenance scripts, no manual monitoring.
Plan for the long tail of edge cases
Managed services fail on edge cases. There is always an outlier tenant whose workload looks nothing like the rest of the fleet: the tenant who creates 10,000 tables, the tenant who writes 500 MB files when the platform expects 256 MB, the tenant who runs 8-hour analytical queries that hold locks across maintenance windows. Every edge case is a support ticket, and every support ticket is an operator enhancement or a policy adjustment.
The teams that succeed are the ones that instrument everything, review every incident, and feed lessons back into the operator code. Over time, the operator becomes a codified repository of operational knowledge — each reconciliation loop encodes a lesson learned from production.
Sovereignty is an architectural decision, not a compliance checkbox
STACKIT's positioning as a sovereign European cloud is not just about data residency. It affects every layer of the architecture: which cloud services can be used (only those operated within European legal frameworks), which third-party dependencies are acceptable (open source preferred over proprietary services), how data flows are designed (no data leaves the European perimeter), and how operations are staffed (European team, European support hours). For platform teams building managed services in regulated environments, sovereignty constraints shape the architecture from the earliest design decisions.
The managed lakehouse maturity curve
Building a Lakehouse-as-a-Service follows a predictable maturity curve that every platform team traverses.
Level 1: Manual provisioning. Tenants request lakehouse instances, and engineers provision them manually using scripts, Terraform, or kubectl commands. Each provisioning takes hours. Upgrades are weekend projects. Maintenance runs when someone remembers.
Level 2: Operator-driven provisioning. Custom operators automate provisioning, scaling, and basic lifecycle management. Tenants self-serve through an API or portal. Provisioning takes minutes. Upgrades are operator-managed rolling restarts. Maintenance is scheduled but not adaptive.
Level 3: Fully automated operations. The operator handles the full lifecycle including upgrades, failure recovery, and adaptive scaling. Maintenance is event-driven and workload-aware. Monitoring covers all four layers (infrastructure, service, application, table health). SLAs are defined and enforced. Incidents are detected and mitigated automatically before tenants notice.
Level 4: Self-optimizing platform. The platform learns from tenant workload patterns and optimizes proactively. Partition schemes evolve based on query patterns. Compaction policies adapt based on write frequency and read patterns. Resource allocations adjust based on historical usage. Cost optimization happens continuously. This is the horizon — few platforms have reached it, but it is the direction every managed service is heading.
Most teams launching managed lakehouse services today are somewhere between Level 2 and Level 3. The gap between these levels — from scheduled maintenance to adaptive maintenance, from basic monitoring to table-health-aware observability, from manual SLA tracking to automated SLA enforcement — is where tools like LakeOps accelerate the journey. Instead of building adaptive maintenance, workload-aware compaction, and real-time table health monitoring from scratch, platform teams integrate LakeOps and jump from Level 2 to Level 3 in weeks rather than quarters.
The broader pattern: Kubernetes as the managed service platform
The pattern STACKIT demonstrates, using Kubernetes operators, CRDs, and reconciliation loops to build managed data services, extends far beyond lakehouses. The same architecture powers managed databases (Crunchy Data's Postgres operator, Percona's MySQL operator), managed streaming (Strimzi's Kafka operator), managed analytics (Stackable's Trino and Spark operators), and managed AI infrastructure (Kubeflow's training operators).
What makes the lakehouse case unique is the breadth of components that must be orchestrated. A managed database is one component — a single stateful workload with well-understood scaling, backup, and recovery patterns. A managed lakehouse is a system of components — catalog, compute, storage, maintenance, governance — each with its own lifecycle, scaling characteristics, and failure modes. The operator architecture must handle not just the lifecycle of individual components but the interactions between them: the compute engine depends on the catalog, the catalog depends on storage, maintenance depends on all three.
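The dependency chain described above can be modelled as a small graph, which gives an operator both a safe reconciliation order and a way to work out which components are blocked when one is unhealthy. This is a minimal sketch using Python's standard `graphlib`. The component names and the `blocked_by` helper are illustrative assumptions, not part of any specific operator.

```python
from graphlib import TopologicalSorter

# component -> the components it depends on, as described in the text
DEPENDS_ON = {
    "storage": set(),
    "catalog": {"storage"},
    "compute": {"catalog"},
    "maintenance": {"storage", "catalog", "compute"},
}


def reconcile_order(depends_on=DEPENDS_ON) -> list:
    """Order components so each is reconciled after its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


def blocked_by(failed: str, depends_on=DEPENDS_ON) -> set:
    """Components that directly or transitively depend on `failed`."""
    blocked = set()
    changed = True
    while changed:
        changed = False
        for component, deps in depends_on.items():
            if component not in blocked and (failed in deps or deps & blocked):
                blocked.add(component)
                changed = True
    return blocked


print(reconcile_order())
# → ['storage', 'catalog', 'compute', 'maintenance']
print(sorted(blocked_by("catalog")))
# → ['compute', 'maintenance']
```

A single-component operator never needs this graph. For a lakehouse operator it determines startup order, upgrade order, and which status conditions to surface when a lower layer degrades.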
This complexity is also the opportunity. Teams that build reliable managed lakehouse services on Kubernetes — or leverage existing platforms like LakeOps for the maintenance and observability layers — are providing a product that most organizations cannot build internally. The engineering investment required to operate a multi-tenant Iceberg lakehouse at Level 3 maturity is beyond what most data teams can justify. Managed services amortize that investment across tenants, making production-grade Iceberg operations accessible to organizations of every size.
For teams evaluating whether to build or consume a managed lakehouse: the architecture patterns are well-established, the Kubernetes primitives are mature, and the operator ecosystem provides proven building blocks. The engineering challenge is real — state management, multi-tenancy, upgrade compatibility, and adaptive maintenance are genuinely hard problems. But the tools exist, the patterns are documented, and the community is actively sharing lessons.
The question is not whether managed lakehouses on Kubernetes are feasible — STACKIT, Dremio Cloud, Tabular (now part of Databricks), and others have proven they are. The question is how quickly your team can move from Level 1 to Level 3 — and how much of that journey you build yourself versus adopt from the ecosystem.
Further reading
- Automating Iceberg Table Maintenance — how LakeOps automates compaction, snapshot expiration, and orphan cleanup
- Iceberg Production Readiness Checklist — the operational requirements for running Iceberg in production
- Managed Iceberg Solutions — how LakeOps provides per-tenant maintenance for managed lakehouse services
- Iceberg Compaction: The Complete Guide — Rust-based compaction at $5/TB versus $50/TB for Spark
- Iceberg Lakehouse Observability Guide — monitoring and observability patterns for Iceberg deployments



