An AI agent is only as smart as the data underneath it. The data lakehouse is what most Indian enterprises are converging on as the foundation — one architecture that supports BI, machine learning, and agentic AI from the same governed estate, on cheap object storage, with the management features of a warehouse and the openness of a lake.
This guide covers what a lakehouse is, the bronze-silver-gold pattern in practical terms, the open table formats (Apache Iceberg, Delta Lake, Apache Hudi) that make it work, the serving layer that exposes data to consumers, how the architecture supports DPDP compliance for Indian enterprises, and a reference blueprint for a BFSI lakehouse on AWS.
Why the Lakehouse Replaced the Warehouse-Plus-Lake
The classical pattern was a data warehouse for structured analytics and a separate data lake for everything else — semi-structured logs, images, audio, ML training data. Two storage systems, two governance regimes, two cost profiles, and a steady tax of pipelines moving data between them. The lakehouse collapses the two into one architecture: object storage for cost, an open table format for warehouse-grade management, and any compute engine you want sitting on top.
The shift accelerated when three open table formats — Apache Iceberg, Delta Lake, Apache Hudi — matured to the point that ACID transactions, schema evolution, time-travel, and partition pruning worked reliably on Parquet files in S3, Azure Blob, or GCS. With those guarantees, the warehouse-only argument lost most of its strength.
The Bronze-Silver-Gold Pattern (Medallion Architecture)
The medallion architecture is how most production lakehouses organise the layers between raw ingestion and consumption. The pattern is conceptually identical across vendors; the names sometimes differ.
| Layer | What lives here | Typical reads |
|---|---|---|
| Bronze | Raw ingestion. Source data landed exactly as received, immutable. Recoverable on demand. | Replays, audits, regulatory recovery. Not for analytics. |
| Silver | Cleaned, conformed, deduplicated, joined to reference data. The trustworthy operational layer. | Operational analytics, ad-hoc data science, ML feature engineering. |
| Gold | Purpose-built data products. Dimensional models, denormalised features, business metrics. | BI dashboards, ML serving, AI agent retrieval, exec reporting. |
Two practical rules. First, every consumer reads from gold or silver, never bronze. Bronze exists for replay and audit. Second, transformations between layers must be idempotent and rerunnable — the lakehouse's time-travel feature only saves you if the pipelines that built each layer can be reproduced.
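To make the second rule concrete, here is a minimal sketch of an idempotent bronze-to-silver step, assuming Spark with an Iceberg catalog and its SQL extensions enabled. The table and column names (lakehouse.bronze.payments, payment_id, and so on) are illustrative, not prescriptive.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bronze-to-silver-payments").getOrCreate()

# Read one logical slice of bronze; the same input always produces the same
# cleaned output, so the job can be rerun without side effects.
batch = (
    spark.table("lakehouse.bronze.payments")
    .where(F.col("ingest_date") == "2026-02-03")
    .dropDuplicates(["payment_id"])
    .withColumn("amount_inr", F.col("amount").cast("decimal(18,2)"))
)
batch.createOrReplaceTempView("bronze_batch")

# MERGE keyed on payment_id: a rerun updates the same rows instead of
# appending duplicates, which is what makes the step idempotent.
spark.sql("""
    MERGE INTO lakehouse.silver.payments AS s
    USING bronze_batch AS b
    ON s.payment_id = b.payment_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```

Where the silver table has no natural key, overwriting the day's partition with INSERT OVERWRITE is an equally valid idempotent pattern.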
The Open Table Formats — Iceberg, Delta, Hudi
The table format is the metadata layer that turns Parquet files in object storage into tables with ACID guarantees, schema evolution, partition pruning, and time-travel. Three formats dominate in 2026:
Apache Iceberg
Created at Netflix, donated to Apache. The most engine-neutral of the three: works with Spark, Trino, Presto, Flink, Snowflake, BigQuery, Athena, Dremio, and DuckDB. Strong support for hidden partitioning and schema evolution. Adopted as the default open format by every major hyperscaler. The pragmatic choice for multi-engine, vendor-neutral architectures.
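A brief sketch of what those guarantees look like in practice, assuming Spark 3.3 or later with an Iceberg catalog configured; the catalog and table names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-basics").getOrCreate()

# Hidden partitioning: the table is partitioned by day of txn_ts, but queries
# filter on txn_ts directly and Iceberg prunes partitions on their behalf.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.silver.transactions (
        txn_id      BIGINT,
        customer_id BIGINT,
        amount      DECIMAL(18,2),
        txn_ts      TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(txn_ts))
""")

# Schema evolution is a metadata-only change; no data files are rewritten.
spark.sql("ALTER TABLE lakehouse.silver.transactions ADD COLUMNS (channel STRING)")

# Time-travel: read the table exactly as it stood at an earlier point.
spark.sql("""
    SELECT * FROM lakehouse.silver.transactions
    TIMESTAMP AS OF '2026-02-03 00:00:00'
""").show()
```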
Delta Lake
Created at Databricks, open-sourced under the Linux Foundation. Mature, deeply optimised inside the Databricks ecosystem, with strong streaming support. Delta UniForm allows the same data to be read as Iceberg by external engines. The right choice if Databricks is your primary compute and you want first-class platform integration.
Apache Hudi
Created at Uber, donated to Apache. Strongest at high-velocity upserts and CDC-heavy ingestion. Two storage modes — copy-on-write for read-optimised workloads, merge-on-read for write-optimised. The right choice when your bronze layer is dominated by streaming change-data-capture from operational databases.
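A hedged sketch of that CDC-heavy write path with the Hudi Spark datasource; the option keys follow Hudi's documented names, but the source path, record key, and table here are illustrative.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-to-hudi-bronze").getOrCreate()

# A micro-batch of change events staged by the ingestion layer (placeholder path).
cdc_batch = spark.read.parquet("s3a://landing/crm/customers/")

hudi_options = {
    "hoodie.table.name": "bronze_customers",
    "hoodie.datasource.write.recordkey.field": "customer_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
    # MERGE_ON_READ favours write-heavy CDC streams; COPY_ON_WRITE favours readers.
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
}

(cdc_batch.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3a://lakehouse/bronze/customers/"))
```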
For most Indian enterprises starting fresh in 2026, Iceberg is the safe default. The engine-neutrality matters when your BI team uses Trino, your ML team uses Spark, your finance team uses Snowflake, and your AI agents use DuckDB — all reading from the same gold tables.
Compute — The Engine Layer
Storage is shared; compute is workload-specific. A typical Indian enterprise lakehouse has multiple engines reading the same Iceberg or Delta tables:
- Apache Spark — batch ETL between layers, ML training pipelines, large transformations
- Trino / Presto — interactive analytics, federated queries, the engine behind most BI
- DuckDB — local fast exploration, embedded analytics, single-node ML feature serving
- Snowflake / BigQuery — serving layer for high-concurrency BI on top of gold tables
- Vector databases — embeddings for RAG, often persisted alongside the lakehouse
- Feature stores — online and offline ML feature serving from gold tables
The discipline is governance: one access control plane (Lake Formation, Unity Catalog, or an open equivalent like Apache Polaris) that every engine respects, and one lineage system that tracks data flow regardless of which engine performed which step.
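To make the shared-storage point concrete, here is a sketch of one of those gold Iceberg tables being read from DuckDB for local exploration. It assumes the DuckDB iceberg and httpfs extensions are available and that object-store credentials are already configured; the bucket path and columns are illustrative.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg; LOAD iceberg;")
con.execute("INSTALL httpfs; LOAD httpfs;")

# iceberg_scan reads the table's current snapshot straight from object storage;
# nothing is copied into a warehouse first.
df = con.execute("""
    SELECT customer_segment, SUM(amount_inr) AS total_inr
    FROM iceberg_scan('s3://lakehouse/gold/daily_payments')
    GROUP BY customer_segment
""").df()
print(df)
```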
Ingestion — What Feeds Bronze
The ingestion stack determines the freshness ceiling for every downstream layer. Common components in production Indian lakehouses:
- Apache Kafka — the streaming spine for change-data-capture, IoT signals, application events
- Debezium / Apache Flink CDC — CDC from operational databases into Kafka, then into bronze
- Apache Airflow — batch orchestration for ELT jobs and reverse-ETL
- Spark Structured Streaming or Flink — the compute that lands streaming data into bronze and silver tables
- Fivetran, Airbyte, or custom connectors — SaaS source ingestion
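As an illustration of the CDC item above, here is a hedged sketch of registering a Debezium Postgres connector through Kafka Connect's REST API, so change events from a core-banking schema land on Kafka topics and then bronze. Hostnames, credentials, and table lists are placeholders, and the exact config keys vary slightly by Debezium version.

```python
import json
import requests

connector = {
    "name": "corebanking-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "corebanking-db.internal",
        "database.port": "5432",
        "database.user": "cdc_reader",
        "database.password": "change-me",   # placeholder; use a secrets provider
        "database.dbname": "corebanking",
        "topic.prefix": "corebanking",       # Debezium 2.x topic naming
        "table.include.list": "public.accounts,public.transactions",
    },
}

# Kafka Connect exposes a REST API; POST /connectors registers the connector.
resp = requests.post(
    "http://kafka-connect:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```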
DPDP Compliance Built Into the Lakehouse
For Indian enterprises, the lakehouse is also where DPDP Act 2023 controls sit. Five capabilities the architecture should provide natively:
- Consent state as a column. Every row that contains personal data carries the consent state, scope, and timestamp. Downstream queries filter on it. Withdrawals propagate.
- Column-level access controls. Engines see only the columns the requester is authorised for. Sensitive columns can be tokenised at silver and detokenised only by privileged consumers.
- Retention enforcement. Table-format retention policies expire data automatically. Gold tables that derive from expired bronze are recompacted.
- Lineage and time-travel. Every dataset is reproducible to a point in time, satisfying audit requests for "show me the data as it stood on 3 February."
- Erasure-on-request. Data principal deletion requests trigger targeted compaction operations across bronze, silver, and gold, with cryptographic evidence of completion.
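A minimal sketch of the first and last controls above, assuming Iceberg tables queried through Spark SQL. Table, column, and catalogue names are illustrative, and where the consent-filtered view lives depends on your metastore.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dpdp-controls").getOrCreate()

# Consent as a queryable column: consumers read through a view that filters on
# the current consent state rather than hitting the base gold table directly.
spark.sql("""
    CREATE OR REPLACE VIEW marketing_customers_consented AS
    SELECT customer_id, segment, city
    FROM lakehouse.gold.marketing_customers
    WHERE consent_status = 'granted'
      AND consent_scope LIKE '%marketing%'
""")

# Erasure-on-request: delete the principal's rows, then expire old snapshots so
# the rows are no longer reachable via time-travel. Full physical removal may
# also need rewrite_data_files / remove_orphan_files maintenance runs.
spark.sql("DELETE FROM lakehouse.silver.customers WHERE customer_id = 4211097")
spark.sql("""
    CALL lakehouse.system.expire_snapshots(
        table => 'silver.customers',
        older_than => TIMESTAMP '2026-02-03 00:00:00'
    )
""")
```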
For the broader DPDP context, see our DPDP Act AI compliance guide.
A BFSI Reference Blueprint on AWS
A pragmatic Indian BFSI lakehouse in 2026 looks roughly like this:
- Storage: Amazon S3 with Iceberg as the table format; AWS Glue Data Catalog or Apache Polaris for metadata
- Ingestion: Kafka (MSK) for streaming, Debezium CDC from core banking and CRM, Airflow on MWAA for batch
- Bronze → Silver → Gold: Spark on EMR or Glue for the heavy lifting between layers, with dbt on Trino for SQL-based transformations
- Serving: Trino on EMR for interactive analytics, Snowflake or Athena for BI, OpenSearch / Pinecone for vector retrieval, a feature store (Feast or SageMaker Feature Store) for ML
- Governance: Lake Formation for access control, OpenLineage for lineage, Apache Atlas or DataHub for the data catalogue
- Observability: CloudWatch + Datadog for pipeline monitoring; data quality checks via Great Expectations or Soda
- AI workloads: SageMaker for training, Bedrock or self-hosted LLMs on EKS for inference, with retrieval drawing from gold tables and the vector store
The same blueprint adapts cleanly to Azure (ADLS Gen2 + Synapse + Fabric) or GCP (Cloud Storage + BigQuery + Vertex AI). The point is the architecture, not the SKU.
Anti-Patterns to Avoid
- One giant gold table for the whole business. Gold should be many purpose-built data products, not one monolith.
- Skipping silver. Reading directly from bronze for analytics is how dashboards lie.
- Engine sprawl without governance. Five compute engines, five access control planes, and inconsistent permissions add up to something worse than one warehouse.
- Vector store drift from gold. Embeddings are downstream artefacts; if gold changes, the vectors must be reindexed. Skipping that pipeline step leads to silently stale RAG (a reindex sketch follows this list).
- Treating consent as metadata only. Consent must be a queryable column that filters every read, not a sticker on a documentation page.
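A rough sketch of that reindex step, assuming an append-mostly gold table in Iceberg (incremental reads cover append snapshots) and a hypothetical embed_and_upsert helper that writes to whichever vector store you run; the snapshot IDs would come from the pipeline's own state.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gold-to-vectors").getOrCreate()

# Snapshot IDs are illustrative; in practice they come from the pipeline's
# state store and from the gold table's current metadata.
last_indexed_snapshot = 5471338761289731124
current_snapshot = 8951463478213874521

# Incremental read: only the rows appended between the two snapshots.
changed = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", str(last_indexed_snapshot))
    .option("end-snapshot-id", str(current_snapshot))
    .load("lakehouse.gold.product_docs")
)

def embed_and_upsert(rows):
    # Hypothetical helper: compute embeddings for each row and upsert them into
    # the vector store, keyed by the gold table's primary key so stale vectors
    # are replaced rather than duplicated.
    for row in rows:
        pass  # embedding + upsert logic goes here

changed.foreachPartition(embed_and_upsert)
```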
What Comes Next
Through 2026, expect three shifts. Open catalogue convergence (Apache Polaris, Unity Catalog open-sourced, AWS Glue's Iceberg REST endpoint) will make multi-engine governance materially easier. Vector and structured data will converge in the same lakehouse: Iceberg's recent vector index work and Hudi's vector support move RAG corpora into the same governance plane as everything else. And streaming and batch will unify further as Flink, Spark Structured Streaming, and Iceberg's streaming reads mature.
For Indian enterprises serving BI, ML, and agentic AI from one source of truth, the lakehouse is no longer a 2027 plan. It is the 2026 default.