The term "modern data stack" became ubiquitous in the early 2020s, describing a category of cloud-native, composable analytics infrastructure. Five years later, the landscape has evolved in ways that both validate the original vision and reveal its limitations. Some components that were marketed as essential have been commoditized or replaced. Others that seemed peripheral have proved foundational. And entirely new categories have emerged that did not exist when the term was coined.
This article is an honest assessment for data leaders evaluating their current infrastructure. It distinguishes between what has genuinely changed and what the marketing suggests has changed. Understanding this distinction is essential for making sound technology investments in an industry that moves fast and hypes faster.
The ingestion layer is largely commoditized. In 2020, choosing and implementing a data ingestion tool was a meaningful engineering challenge. Today, Fivetran, Airbyte, and their competitors offer point-and-click connectors to virtually every data source. The pricing has dropped dramatically, and the connectors have matured. This is good news: teams no longer need to invest significant engineering time in ingestion. The bad news is that everyone has access to the same connected data, so the competitive advantage from having more data connections is minimal.
ELT has definitively won over ETL. The extract-transform-load pattern that dominated on-premises data warehousing has been superseded by extract-load-transform in cloud environments. Modern cloud warehouses (Snowflake, BigQuery, Redshift, Databricks) are powerful enough to run SQL transformations at scale, so it is faster, cheaper, and more maintainable to load raw data and transform it inside the warehouse using SQL and dbt than to transform it before loading. The shift has made data pipelines dramatically more debuggable because the raw data is always available for inspection.
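The pattern is easy to see in miniature. A minimal sketch, using SQLite as a stand-in for a cloud warehouse (the table and column names here are invented for illustration):

```python
import sqlite3

# Toy ELT flow: SQLite stands in for a cloud warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount TEXT, status TEXT)")

# "Load" step: raw data lands as-is, messy types and all.
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "19.99", "complete"), (2, "5.00", "refunded"), (3, "oops", "complete")],
)

# "Transform" step: cleaning happens inside the warehouse, in SQL,
# while raw_orders stays untouched and available for debugging.
conn.execute("""
    CREATE TABLE orders_clean AS
    SELECT id, CAST(amount AS REAL) AS amount
    FROM raw_orders
    WHERE status = 'complete'
      AND amount GLOB '[0-9]*.[0-9]*'
""")
rows = conn.execute("SELECT id, amount FROM orders_clean ORDER BY id").fetchall()
print(rows)  # raw row 3 is filtered out of the model, but not lost
```

The key property is the last comment: the malformed row is excluded from the clean model, yet still inspectable in the raw table, which is what makes ELT pipelines easier to debug than their ETL predecessors.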
dbt has become the transformation standard. dbt (data build tool) has achieved something rare in the data engineering world: near-universal adoption as the standard for SQL-based data transformation. Its version-controlled, tested, documented approach to building data models has raised the quality floor for data engineering significantly. dbt's testing framework, combined with data contracts and expectations, makes production data models substantially more reliable than the hand-crafted SQL scripts they replaced.
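dbt's schema tests compile to simple SQL queries that count violating rows. A sketch of the idea, with the table and checks invented for illustration (dbt generates comparable queries from YAML configuration):

```python
import sqlite3

# dbt-style schema tests (not_null, unique) expressed as plain SQL checks.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO dim_customer VALUES (?, ?)",
    [(1, "a@example.com"), (2, "b@example.com"), (2, None)],
)

def not_null(table, column):
    # A test passes when the query returns zero failing rows.
    sql = f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    return conn.execute(sql).fetchone()[0]

def unique(table, column):
    sql = f"""SELECT COUNT(*) FROM (
        SELECT {column} FROM {table} GROUP BY {column} HAVING COUNT(*) > 1
    )"""
    return conn.execute(sql).fetchone()[0]

failures = {
    "email_not_null": not_null("dim_customer", "email"),
    "customer_id_unique": unique("dim_customer", "customer_id"),
}
print(failures)  # non-zero counts flag rows that violate the contract
```

Running checks like these on every build, in version control, is what raised the quality floor relative to hand-crafted scripts.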
The data lakehouse has arrived. The theoretical vision of a unified platform that combines the cost economics of data lakes with the query performance and ACID semantics of data warehouses is now production reality. Open table formats such as Apache Iceberg, Delta Lake (from Databricks), and Apache Hudi, layered over object storage and queried with Spark, Trino, or Flink, deliver genuinely warehouse-quality analytics at data lake cost. Many organizations that maintained separate data lake (for data science) and data warehouse (for BI) stacks are consolidating onto a single lakehouse architecture.
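The core trick behind these table formats is snapshot-based metadata: a table is a pointer to an immutable list of data files, so readers always see a consistent version. A deliberately simplified sketch of that idea (real formats add manifests, schemas, and atomic commit protocols; the file layout here is invented):

```python
import json
import os
import tempfile

# Toy illustration of Iceberg-style snapshots: each commit writes a new
# metadata file listing the table's data files at that version.
root = tempfile.mkdtemp()

def commit(snapshot_id, files):
    # Writing the snapshot file "publishes" this version of the table.
    with open(os.path.join(root, f"snap-{snapshot_id}.json"), "w") as f:
        json.dump({"files": files}, f)

def read(snapshot_id):
    with open(os.path.join(root, f"snap-{snapshot_id}.json")) as f:
        return json.load(f)["files"]

commit(1, ["data-001.parquet"])
commit(2, ["data-001.parquet", "data-002.parquet"])  # append = new snapshot
print(read(1), read(2))  # the old snapshot stays readable (time travel)
```

Because data files are immutable and only the metadata pointer changes, concurrent readers get consistent results and old versions remain queryable, which is how warehouse-style ACID semantics land on plain object storage.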
Orchestration remains genuinely hard. Airflow has been the orchestration standard for years, and despite constant criticism of its complexity and numerous challengers (Prefect, Dagster, Mage), Airflow remains dominant because the alternatives all introduce different trade-offs rather than straightforwardly solving Airflow's problems. Orchestrating complex, dependent data pipelines with good observability, retry logic, and dependency management remains one of the harder practical problems in data engineering. Any vendor claiming to have "solved" orchestration deserves skepticism.
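To see why this stays hard, it helps to look at what an orchestrator does at its core: order dependent tasks and retry failures. A minimal sketch (task names and the simulated flaky task are invented; real orchestrators add scheduling, backfills, observability, and distributed execution, which is where the difficulty lives):

```python
from graphlib import TopologicalSorter

attempts = {"load": 0}

def extract():
    return "raw"

def load():
    attempts["load"] += 1
    if attempts["load"] < 2:        # fail once to exercise the retry path
        raise RuntimeError("transient failure")

def transform():
    return "modeled"

dag = {"load": {"extract"}, "transform": {"load"}}  # task -> dependencies
tasks = {"extract": extract, "load": load, "transform": transform}

order = list(TopologicalSorter(dag).static_order())
for name in order:
    for attempt in range(3):        # simple fixed retry budget per task
        try:
            tasks[name]()
            break
        except RuntimeError:
            if attempt == 2:
                raise
print(order)  # dependency-respecting execution order
```

The twenty-line version is easy; the hard parts are everything this sketch omits, which is why every challenger to Airflow ends up trading one set of problems for another.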
Data modeling expertise is still scarce and valuable. Despite the improvements in tooling, the skill of designing a good dimensional data model — a Kimball star schema or an OBT (one big table) for specific use cases — remains genuinely difficult and rare. Bad data models produce dashboards that are fast but wrong, or slow but correct, or correct but unusable by business analysts without SQL expertise. No tool has automated this judgment. Organizations continue to underinvest in data modeling expertise and continue to pay for it in downstream analytics quality.
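The star-schema-versus-OBT trade-off is concrete enough to show in a few lines. A sketch with invented tables, again using SQLite as a stand-in warehouse:

```python
import sqlite3

# The same question answered from a star schema (fact + dimension)
# and from a denormalized one-big-table (OBT).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER, category TEXT);
    CREATE TABLE fct_sales (product_id INTEGER, amount REAL);
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO fct_sales VALUES (1, 10.0), (1, 5.0), (2, 20.0);
    -- OBT: one wide table, no joins at query time, category repeated per row
    CREATE TABLE obt_sales AS
    SELECT f.amount, d.category
    FROM fct_sales f JOIN dim_product d USING (product_id);
""")
star = conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fct_sales f JOIN dim_product d USING (product_id)
    GROUP BY d.category ORDER BY d.category
""").fetchall()
obt = conn.execute(
    "SELECT category, SUM(amount) FROM obt_sales GROUP BY category ORDER BY category"
).fetchall()
print(star == obt)  # same answer either way
```

Both designs produce the same numbers; the modeling judgment is about everything else, such as whether analysts can write the join, whether the dimension can change without rebuilding the wide table, and which shape the BI tool handles well. No tool makes that call for you.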
Governance and master data management are still organizational problems. The data catalog vendors have delivered real improvements in metadata management, data discovery, and lineage documentation. What they have not solved is the organizational problem: getting the humans who create data to document it consistently, agree on common definitions, and maintain data quality as a shared responsibility rather than delegating it to a separate data team. The technology is ahead of the organizational practice.
The semantic layer has become a distinct category. For most of the modern data stack era, metrics definitions lived in BI tool configurations that were invisible to other tools and not reusable across the organization. The emergence of dedicated semantic layer platforms (dbt Semantic Layer, Cube, AtScale) that define metrics once and expose them consistently to any BI tool, data product, or API is a genuine architectural advance. The ability to define "revenue" as a certified, tested, documented metric that every tool in the organization uses consistently eliminates one of the most persistent sources of analytics disagreement.
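A toy sketch of the idea: the metric is defined once in a registry, and every consumer asks for it by name instead of re-deriving the SQL. The metric definition and table here are invented for illustration; products like the dbt Semantic Layer and Cube implement the same pattern with far richer metadata:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount REAL, status TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(100.0, "complete", "EU"), (50.0, "refunded", "EU"), (70.0, "complete", "US")],
)

METRICS = {
    # One certified definition: "revenue" excludes refunds, everywhere.
    "revenue": {"expr": "SUM(amount)", "filter": "status = 'complete'"},
}

def query_metric(name, group_by):
    m = METRICS[name]
    sql = (f"SELECT {group_by}, {m['expr']} FROM orders "
           f"WHERE {m['filter']} GROUP BY {group_by} ORDER BY {group_by}")
    return conn.execute(sql).fetchall()

by_region = query_metric("revenue", "region")
print(by_region)  # every tool asking for "revenue" gets the same definition
```

When the finance dashboard and the sales dashboard both call `query_metric("revenue", ...)`, they cannot disagree about what revenue means, which is precisely the disagreement the semantic layer exists to eliminate.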
Reverse ETL has gone mainstream. The modern data stack originally flowed data one way: from operational systems into the warehouse for analytics. Reverse ETL tools (Census, Hightouch) close the loop by pushing analytics output back to operational tools: syncing lead scores calculated in the warehouse to Salesforce, updating marketing automation with segment assignments computed from behavioral data, pushing personalization signals to product teams. This bidirectional data flow enables data-activated operations at a scale that was previously only possible for organizations with large data engineering teams.
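Structurally, a reverse ETL sync is a query out of the warehouse followed by upserts into an operational tool's API. A sketch with invented names, where `crm_upsert` stands in for a vendor's REST endpoint:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_segments (user_id INTEGER, segment TEXT)")
conn.executemany(
    "INSERT INTO user_segments VALUES (?, ?)",
    [(1, "power_user"), (2, "at_risk")],
)

crm = {}  # stand-in for the operational system's state

def crm_upsert(user_id, fields):
    # In a real sync this would be an API call to the CRM.
    crm.setdefault(user_id, {}).update(fields)

# The sync itself: warehouse query out, API upserts in.
for user_id, segment in conn.execute("SELECT user_id, segment FROM user_segments"):
    crm_upsert(user_id, {"segment": segment})

print(crm)
```

The hard parts the tools sell are the parts this sketch skips: incremental diffing so only changed rows sync, rate limiting against vendor APIs, and recovering cleanly from partial failures.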
AI/ML has moved from a separate stack to an integrated layer. In 2020, machine learning in the enterprise required a separate data science infrastructure: MLflow for experiment tracking, feature stores, model registries, separate serving infrastructure. Today, cloud platforms support model training and inference directly alongside the warehouse (BigQuery ML, Snowflake Cortex, Databricks's ML Runtime), and the feature store is increasingly just a well-designed data mart. The convergence of the analytics and ML stacks is reducing the organizational friction between data engineering and data science.
The most significant macro trend in the modern data stack is consolidation. The explosion of point solutions that characterized 2019-2022 is giving way to consolidation around a smaller number of comprehensive platforms. Databricks has built a broad ecosystem extending from data ingestion to ML serving. Snowflake has expanded from data warehousing into data engineering, data apps, and native ML. The cloud providers themselves (AWS, GCP, Azure) continue to fill gaps with managed services that compete with the point solution vendors.
This consolidation creates tension between the composable, best-of-breed philosophy of the original modern data stack vision and the operational simplicity of fewer vendor relationships and more integrated tool experiences. The right answer for most organizations is not fully one or the other: maintain composability where the value of best-of-breed tools is genuinely high (semantic layer, specialized visualization), and consolidate where operational simplicity outweighs marginal capability differences (orchestration, compute, storage).
For data leaders making technology investment decisions, three questions determine the highest-value focus areas. First, is your data accessible? If business users cannot find, understand, and trust the data they need, no amount of processing sophistication matters. Investments in data catalog, data quality, and semantic layer that make data trustworthy and accessible often have higher ROI than investments in processing capability.
Second, is your transformation layer maintainable? Teams operating without dbt or equivalent testing and documentation practices are accumulating technical debt that becomes increasingly expensive to pay down. Modernizing the transformation layer is usually the highest-leverage infrastructure investment for teams still operating with legacy SQL scripts.
Third, are you getting business value from your data? The ultimate measure of a data stack is not its technical elegance but the quality and frequency of decisions it enables. Organizations that can clearly connect their data infrastructure to improved business decisions have built the right stack regardless of which tools they used.
The modern data stack has delivered on its core promise: cloud-native, scalable analytics infrastructure is accessible to organizations of all sizes, not just those with hundreds of data engineers. What it has not delivered is automatic value. The organizations that have gotten the most from their data stack investments are those that paired technical investments with organizational investments in data literacy, governance practices, and clear measurement of data's contribution to business outcomes.
See how Dataova's analytics intelligence layer sits on top of your existing modern data stack to deliver the insight layer that converts your data infrastructure investment into measurable business value.