Abstract

An enterprise data warehouse (EDW) must do both of the following:

· Integrate data from various sources (into enterprise form).

· Present data in a form that is easily used by the consumers.

An age-old question in DW circles is whether the same set of assets (tables) should be used for both.

In this article, I will explain (briefly) why I believe EDWs should be implemented with separate tiers and describe (even more briefly) why the presentation tier should be virtual.

Integration vs Presentation

Bill Inmon pioneered an approach in which the two fundamental needs of an EDW are addressed in separate tiers and that approach has been used for 40+ years (and counting). Data is integrated first (in the lower tier), typically in a normalized form, and then presented to the consumers in whatever form they require (in the higher tier).

Diagram

Description automatically generated

Advantages of this approach include:

· Consumers can also leverage data in normalized, non-dimensional form which may be more appropriate for their needs.

· Presentation rules and business rules are decoupled, simplifying the architecture as well as its use and maintenance.

· Business rules are implemented during integration and, therefore, are not repeated.

· Faster solution delivery later in the EDW’s lifespan.

· Simpler ETLs (due to segregation of integration and presentation duties).

· Easier implementation and maintenance of dimensional solutions (because they are virtual).

Various practitioners considered this expensive and excessive and developed EDWs in which there was only one tier and the assets served both purposes. This, of course, meant that one had to accept trade-offs. If the asset was tailored more for integration, it might be hard to consume. And vice versa.

Ralph Kimball pioneered the most widely accepted single-tier approach, known as dimensional modeling, and that approach has been used for 30+ years (and counting).

Diagram

Description automatically generated

Advantages of this approach include:

· Faster solution delivery early in the EDW’s lifespan.

· Fewer ETLs to maintain.

· Shorter ETL windows required (because most jobs run in parallel).

· Compatibility with powerful analytical tools.

Note that he took the approach one step further by advocating for a purely dimensional presentation of data (reasoning that it was the most useful form data assets could take). The fundamental constructs are stars with fact tables surrounded by related dimensions:

Diagram

Description automatically generated

Physical vs Virtual

As computing platforms have become more powerful, it has become more reasonable to implement the presentation tier in a virtual form (which eliminates many of the costs of a separate presentation tier). Over time, it has become best practice to implement as much of the presentation tier in a virtual form as is possible. Here, for example, is the pertinent documentation from the Data Vault Guide:

As noted the data marts are not required to be persisted. Ideally the data marts exist only in-memory using some form of data virtualization.

The agility requirements of the modern data warehouse have spurred the development of agile modeling techniques (Ensemble methods like Data Vault), agile project management approaches (SCRUM, BEAM and others), automation and tooling (Varigence, Data Vault Builder, WhereScape, etc.) and other agility enabling features in the broader DWBI program (testing, production move schedules, etc.).

Because dimensional models do not adapt easily to change, persisting data mart layers compromises this program agility that we are otherwise addressing. For this reason, the ultimate data warehouse architecture includes a virtualized data mart / data presentation layer.

Conclusion

During the early life of the EDW, solution delivery in a dimensional warehouse is comparatively rapid because only one set of ETLs is required (those to load the fact table and related dimension tables). In a normalized warehouse, two sets of ETLs might need to be written. The first set loads the normalized tables. The second set loads the dimensional tables. The second set, however, may not need to be written at all. It is preferable that the dimensional solution be virtual (implemented with database views) in which case there is not a distinguishable difference between the approaches (with respect to solution delivery time). As the EDW ages, however, the pendulum swings in favor of the normalized warehouse because the addition or change of a new normalized asset is typically easier than adding or changing assets in a dimensional warehouse. In a dimensional warehouse, you must do one of the following to satisfy various analytical use cases:

· Create redundant facts, dims, and stars (one per solution).

· Create conformed (non-redundant) facts, dims, and stars.

The former approach results, in the long run, in much redundancy both of data and of the business rules. The risk of discrepancies between those solutions grows over time.

The latter approach results, in the long run, in highly complex stars that are intended to solve multiple problems. As they grow more complex, they become harder to maintain and harder to utilize.

Therefore, despite the advantages of a dimensional warehouse in the early life of the EDW, I recommend implementation of a multi-tier EDW with a virtual presentation tier.