An AI agent is only as smart as the data underneath it. The data lakehouse is what most Indian enterprises are converging on as the foundation — one architecture that supports BI, machine learning, and agentic AI from the same governed estate, on cheap object storage, with the management features of a warehouse and the openness of a lake.
This guide covers what a lakehouse is, the bronze-silver-gold pattern in practical terms, the open table formats (Apache Iceberg, Delta Lake, Apache Hudi) that make it work, the serving layer that exposes data to consumers, how the architecture supports DPDP compliance for Indian enterprises, and a reference blueprint for a BFSI lakehouse on AWS.
Why the Lakehouse Replaced the Warehouse-Plus-Lake
The classical pattern was a data warehouse for structured analytics and a separate data lake for everything else — semi-structured logs, images, audio, ML training data. Two storage systems, two governance regimes, two cost profiles, and a steady tax of pipelines moving data between them. The lakehouse collapses the two into one architecture: object storage for cost, an open table format for warehouse-grade management, and any compute engine you want sitting on top.
The shift accelerated when three open table formats — Apache Iceberg, Delta Lake, Apache Hudi — matured to the point that ACID transactions, schema evolution, time-travel, and partition pruning worked reliably on Parquet files in S3, Azure Blob, or GCS. With those guarantees, the warehouse-only argument lost most of its strength.
The Bronze-Silver-Gold Pattern (Medallion Architecture)
The medallion architecture is how most production lakehouses organise the layers between raw ingestion and consumption. The pattern is conceptually identical across vendors; the names sometimes differ.
| Layer | What lives here | Typical reads |
|---|---|---|
| Bronze | Raw ingestion. Source data landed exactly as received, immutable. Recoverable on demand. | Replays, audits, regulatory recovery. Not for analytics. |
| Silver | Cleaned, conformed, deduplicated, joined to reference data. The trustworthy operational layer. | Operational analytics, ad-hoc data science, ML feature engineering. |
| Gold | Purpose-built data products. Dimensional models, denormalised features, business metrics. | BI dashboards, ML serving, AI agent retrieval, exec reporting. |
Two practical rules. First, every consumer reads from gold or silver, never bronze. Bronze exists for replay and audit. Second, transformations between layers must be idempotent and rerunnable — the lakehouse's time-travel feature only saves you if the pipelines that built each layer can be reproduced.
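To make the second rule concrete, here is a minimal sketch of an idempotent bronze-to-silver step, assuming Spark with an Iceberg catalog and its SQL extensions enabled. The table and column names (lakehouse.bronze.payments, payment_id, and so on) are illustrative, not prescriptive.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bronze-to-silver-payments").getOrCreate()

# Read one logical slice of bronze; the same input always produces the same
# cleaned output, so the job can be rerun without side effects.
batch = (
    spark.table("lakehouse.bronze.payments")
    .where(F.col("ingest_date") == "2026-02-03")
    .dropDuplicates(["payment_id"])
    .withColumn("amount_inr", F.col("amount").cast("decimal(18,2)"))
)
batch.createOrReplaceTempView("bronze_batch")

# MERGE keyed on payment_id: a rerun updates the same rows instead of
# appending duplicates, which is what makes the step idempotent.
spark.sql("""
    MERGE INTO lakehouse.silver.payments AS s
    USING bronze_batch AS b
    ON s.payment_id = b.payment_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```

Where the silver table has no natural key, overwriting the day's partition with INSERT OVERWRITE is an equally valid idempotent pattern.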
The Open Table Formats — Iceberg, Delta, Hudi
The table format is the metadata layer that turns Parquet files in object storage into tables with ACID guarantees, schema evolution, partition pruning, and time-travel. Three formats dominate in 2026:
Apache Iceberg
Created at Netflix, donated to Apache. The most engine-neutral of the three: works with Spark, Trino, Presto, Flink, Snowflake, BigQuery, Athena, Dremio, and DuckDB. Strong support for hidden partitioning and schema evolution. Adopted as the default open format by every major hyperscaler. The pragmatic choice for multi-engine, vendor-neutral architectures.
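A brief sketch of what those guarantees look like in practice, assuming Spark 3.3 or later with an Iceberg catalog configured; the catalog and table names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-basics").getOrCreate()

# Hidden partitioning: the table is partitioned by day of txn_ts, but queries
# filter on txn_ts directly and Iceberg prunes partitions on their behalf.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.silver.transactions (
        txn_id      BIGINT,
        customer_id BIGINT,
        amount      DECIMAL(18,2),
        txn_ts      TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(txn_ts))
""")

# Schema evolution is a metadata-only change; no data files are rewritten.
spark.sql("ALTER TABLE lakehouse.silver.transactions ADD COLUMNS (channel STRING)")

# Time-travel: read the table exactly as it stood at an earlier point.
spark.sql("""
    SELECT * FROM lakehouse.silver.transactions
    TIMESTAMP AS OF '2026-02-03 00:00:00'
""").show()
```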
Delta Lake
Created at Databricks, open-sourced under the Linux Foundation. Mature, deeply optimised inside the Databricks ecosystem, with strong streaming support. Delta UniForm allows the same data to be read as Iceberg by external engines. The right choice if Databricks is your primary compute and you want first-class platform integration.
Apache Hudi
Created at Uber, donated to Apache. Strongest at high-velocity upserts and CDC-heavy ingestion. Two storage modes — copy-on-write for read-optimised workloads, merge-on-read for write-optimised. The right choice when your bronze layer is dominated by streaming change-data-capture from operational databases.
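A hedged sketch of that CDC-heavy write path with the Hudi Spark datasource; the option keys follow Hudi's documented names, but the source path, record key, and table here are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-to-hudi-bronze").getOrCreate()

# A micro-batch of change events staged by the ingestion layer (placeholder path).
cdc_batch = spark.read.parquet("s3a://landing/crm/customers/")

hudi_options = {
    "hoodie.table.name": "bronze_customers",
    "hoodie.datasource.write.recordkey.field": "customer_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
    # MERGE_ON_READ favours write-heavy CDC streams; COPY_ON_WRITE favours readers.
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
}

(cdc_batch.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3a://lakehouse/bronze/customers/"))
```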
For most Indian enterprises starting fresh in 2026, Iceberg is the safe default. The engine-neutrality matters when your BI team uses Trino, your ML team uses Spark, your finance team uses Snowflake, and your AI agents use DuckDB — all reading from the same gold tables.
Compute — The Engine Layer
Storage is shared; compute is workload-specific. A typical Indian enterprise lakehouse has multiple engines reading the same Iceberg or Delta tables:
- Apache Spark — batch ETL between layers, ML training pipelines, large transformations
- Trino / Presto — interactive analytics, federated queries, the engine behind most BI
- DuckDB — local fast exploration, embedded analytics, single-node ML feature serving
- Snowflake / BigQuery — serving layer for high-concurrency BI on top of gold tables
- Vector databases — embeddings for RAG, often persisted alongside the lakehouse
- Feature stores — online and offline ML feature serving from gold tables
The discipline is governance: one access control plane (Lake Formation, Unity Catalog, or an open equivalent like Apache Polaris) that every engine respects, and one lineage system that tracks data flow regardless of which engine performed which step.
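To make the shared-storage point concrete, here is a sketch of one of those gold Iceberg tables being read from DuckDB for local exploration. It assumes the DuckDB iceberg and httpfs extensions are available and that object-store credentials are already configured; the bucket path and columns are illustrative.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg; LOAD iceberg;")
con.execute("INSTALL httpfs; LOAD httpfs;")

# iceberg_scan reads the table's current snapshot straight from object storage;
# nothing is copied into a warehouse first.
df = con.execute("""
    SELECT customer_segment, SUM(amount_inr) AS total_inr
    FROM iceberg_scan('s3://lakehouse/gold/daily_payments')
    GROUP BY customer_segment
""").df()
print(df)
```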
Ingestion — What Feeds Bronze
The ingestion stack determines the freshness ceiling for every downstream layer. Common components in production Indian lakehouses:
- Apache Kafka — the streaming spine for change-data-capture, IoT signals, application events
- Debezium / Apache Flink CDC — CDC from operational databases into Kafka, then into bronze
- Apache Airflow — batch orchestration for ELT jobs and reverse-ETL
- Spark Structured Streaming or Flink — the compute that lands streaming data into bronze and silver tables
- Fivetran, Airbyte, or custom connectors — SaaS source ingestion
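As an illustration of the CDC item above, here is a hedged sketch of registering a Debezium Postgres connector through Kafka Connect's REST API, so change events from a core-banking schema land on Kafka topics and then bronze. Hostnames, credentials, and table lists are placeholders, and the exact config keys vary slightly by Debezium version.

```python
import json
import requests

connector = {
    "name": "corebanking-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "corebanking-db.internal",
        "database.port": "5432",
        "database.user": "cdc_reader",
        "database.password": "change-me",   # placeholder; use a secrets provider
        "database.dbname": "corebanking",
        "topic.prefix": "corebanking",       # Debezium 2.x topic naming
        "table.include.list": "public.accounts,public.transactions",
    },
}

# Kafka Connect exposes a REST API; POST /connectors registers the connector.
resp = requests.post(
    "http://kafka-connect:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```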
DPDP Compliance Built Into the Lakehouse
For Indian enterprises, the lakehouse is also where DPDP Act 2023 controls sit. Five capabilities the architecture should provide natively:
- Consent state as a column. Every row that contains personal data carries the consent state, scope, and timestamp. Downstream queries filter on it. Withdrawals propagate.
- Column-level access controls. Engines see only the columns the requester is authorised for. Sensitive columns can be tokenised at silver and detokenised only by privileged consumers.
- Retention enforcement. Table-format retention policies expire data automatically. Gold tables that derive from expired bronze are recompacted.
- Lineage and time-travel. Every dataset is reproducible to a point in time, satisfying audit requests for "show me the data as it stood on 3 February."
- Erasure-on-request. Data principal deletion requests trigger targeted compaction operations across bronze, silver, and gold, with cryptographic evidence of completion.
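A minimal sketch of the first and last controls above, assuming Iceberg tables queried through Spark SQL. Table, column, and catalogue names are illustrative, and where the consent-filtered view lives depends on your metastore.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dpdp-controls").getOrCreate()

# Consent as a queryable column: consumers read through a view that filters on
# the current consent state rather than hitting the base gold table directly.
spark.sql("""
    CREATE OR REPLACE VIEW marketing_customers_consented AS
    SELECT customer_id, segment, city
    FROM lakehouse.gold.marketing_customers
    WHERE consent_status = 'granted'
      AND consent_scope LIKE '%marketing%'
""")

# Erasure-on-request: delete the principal's rows, then expire old snapshots so
# the rows are no longer reachable via time-travel. Full physical removal may
# also need rewrite_data_files / remove_orphan_files maintenance runs.
spark.sql("DELETE FROM lakehouse.silver.customers WHERE customer_id = 4211097")
spark.sql("""
    CALL lakehouse.system.expire_snapshots(
        table => 'silver.customers',
        older_than => TIMESTAMP '2026-02-03 00:00:00'
    )
""")
```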
For the broader DPDP context, see our DPDP Act AI compliance guide.
A BFSI Reference Blueprint on AWS
A pragmatic Indian BFSI lakehouse in 2026 looks roughly like this:
- Storage: Amazon S3 with Iceberg as the table format; AWS Glue Data Catalog or Apache Polaris for metadata
- Ingestion: Kafka (MSK) for streaming, Debezium CDC from core banking and CRM, Airflow on MWAA for batch
- Bronze → Silver → Gold: Spark on EMR or Glue for the heavy lifting between layers, with dbt on Trino for SQL-based transformations
- Serving: Trino on EMR for interactive analytics, Snowflake or Athena for BI, OpenSearch / Pinecone for vector retrieval, a feature store (Feast or SageMaker Feature Store) for ML
- Governance: Lake Formation for access control, OpenLineage for lineage, Apache Atlas or DataHub for the data catalogue
- Observability: CloudWatch + Datadog for pipeline monitoring; data quality checks via Great Expectations or Soda
- AI workloads: SageMaker for training, Bedrock or self-hosted LLMs on EKS for inference, with retrieval drawing from gold tables and the vector store
The same blueprint adapts cleanly to Azure (ADLS Gen2 + Synapse + Fabric) or GCP (Cloud Storage + BigQuery + Vertex AI). The point is the architecture, not the SKU.
Anti-Patterns to Avoid
- One giant gold table for the whole business. Gold should be many purpose-built data products, not one monolith.
- Skipping silver. Reading directly from bronze for analytics is how dashboards lie.
- Engine sprawl without governance. Five compute engines, five access control planes, and inconsistent permissions add up to something worse than one warehouse.
- Vector store drift from gold. Embeddings are downstream artefacts; if gold changes, the vectors must be reindexed. Skipping that pipeline step leads to silently stale RAG (a reindex sketch follows this list).
- Treating consent as metadata only. Consent must be a queryable column that filters every read, not a sticker on a documentation page.
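A rough sketch of that reindex step, assuming an append-mostly gold table in Iceberg (incremental reads cover append snapshots) and a hypothetical embed_and_upsert helper that writes to whichever vector store you run; the snapshot IDs would come from the pipeline's own state.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gold-to-vectors").getOrCreate()

# Snapshot IDs are illustrative; in practice they come from the pipeline's
# state store and from the gold table's current metadata.
last_indexed_snapshot = 5471338761289731124
current_snapshot = 8951463478213874521

# Incremental read: only the rows appended between the two snapshots.
changed = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", str(last_indexed_snapshot))
    .option("end-snapshot-id", str(current_snapshot))
    .load("lakehouse.gold.product_docs")
)

def embed_and_upsert(rows):
    # Hypothetical helper: compute embeddings for each row and upsert them into
    # the vector store, keyed by the gold table's primary key so stale vectors
    # are replaced rather than duplicated.
    for row in rows:
        pass  # embedding + upsert logic goes here

changed.foreachPartition(embed_and_upsert)
```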
What Comes Next
Through 2026, expect three shifts. Open catalogue convergence (Apache Polaris, Unity Catalog open-sourced, AWS Glue's Iceberg REST endpoint) will make multi-engine governance materially easier. Vector and structured data will converge in the same lakehouse: Iceberg's recent vector index work and Hudi's vector support move RAG corpora into the same governance plane as everything else. And streaming and batch will unify further as Flink, Spark Structured Streaming, and Iceberg's streaming reads mature.
For Indian enterprises serving BI, ML, and agentic AI from one source of truth, the lakehouse is no longer a 2027 plan. It is the 2026 default.