Functional areas of an EDW

Chip Hartney (The Datamology Company)

First written: July 26, 2022

Last updated: January 18, 2024

Contents

Abstract

Types of asset

Big picture

Reserved data areas

Overview

Eng Area

Admin Area

Consumable data areas

Overview

Collection Area

Core Area

Mart Area

Dataset Area

Files

Distribution Area

Abstract

An enterprise data warehouse (EDW) comprises myriad data assets.  A large EDW can easily contain 10,000 or more such assets, so managing them effectively is a paramount concern.  Every data asset is created to serve a specific purpose.  I believe it is best to segregate those assets into function-specific areas and to apply the standards and processes that are appropriate to each area.

In this article, I identify the functional areas of an EDW, identify the flows of data between those areas, and describe the purpose of each area.

Types of asset

An EDW may contain many types of data assets.  Each type of asset may require its own storage implementation (such as file directories and relational databases or schemas).  Each functional area described in this article is assumed to include every type-specific storage implementation it needs.

Big picture

[Figure: diagram of the EDW's functional areas and the data flows between them]

Notes:

1.      Arrows show data flows (only allowed in the indicated direction):

a.      Green arrows are ingestion flows.

b.      Black arrows are preparation flows.

i.      Dashed arrows are allowed as exceptions, but discouraged.

c.      Red arrows are consumption flows.

i.      Solid arrows represent standard (analytical) consumption via DQL.

ii.     Dashed arrows represent non-standard (operational) consumption via DQL.  This is allowed, but generally only when the pertinent source system does not already accommodate the need.

iii.    Dotted arrows represent standard (analytical) consumption via file reads.

d.      Purple arrows are egestion flows.

2.      Green ellipses show reserved data areas (each has its own purpose and rules and is explained in the following section):

a.      Eng Area: Optional data assets established to support ETL flows.

b.      Admin Area: Data assets established to manage and administer an EDW.

3.      Black ellipses show consumable data areas (each has its own purpose and rules and is explained in the “Consumable data areas” section):

a.      Collection Area: Raw data that has been collected from sources.

b.      Core Area: Definitive business data that has been established for use in other areas of an EDW and across the enterprise.

c.      Mart Area: Dimensional data that has been established for specific consumption needs.

d.      Dataset Area: Stand-alone (flattened) datasets that have been established for specific consumption needs.

e.      Distribution Area: Data files that have been prepared for specific consumption needs.

Reserved data areas

Overview

This section documents those areas of an EDW that are reserved for the management of the EDW.  They are to be accessed only by the technical staff.

Eng Area

Purpose: Work area containing whatever data assets the engineering team needs to support its efforts, including data assets that support any given ETL.  (ETLs prepare data for use in the other areas of an EDW.)

Users of the data:

·        ETLs

Principles:

·        Completely under control of the engineering team.

·        Data assets can be persistent or temporary.

·        No consumer should ever access this area.

Some of the important consequences of the above principles are:

·        Standards are set by the engineering team.

·        Data assets utilized to manage and administer ETLs should be established in the Admin Area (not here).

Admin Area

Purpose: Data assets that are used to manage and administer an EDW.

Users of the data:

·        EDW administrators

Principles:

·        Support all admin tasks required of the data teams.

·        Data assets can be persistent or temporary.

·        No consumer should ever access this area.

Some of the important consequences of the above principles are:

·        Data assets utilized to manage and administer ETLs, including the following, should be established here (not in the Eng Area):

o   ETL control list.

o   ETL execution log.

o   ETL message log.
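As a rough illustration of the three ETL administration assets listed above, the following Python sketch models them as simple record types and shows how an ETL run might be registered against them.  This article does not prescribe a structure for these assets, so every class and field name here (such as EtlControlEntry, target_area, and status) is a hypothetical choice.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class EtlControlEntry:
    """One row of a hypothetical ETL control list."""
    etl_name: str
    target_area: str          # e.g. "Core" or "Mart" (assumed field)
    enabled: bool = True


@dataclass
class EtlExecution:
    """One row of a hypothetical ETL execution log."""
    etl_name: str
    started_at: datetime
    finished_at: Optional[datetime] = None
    status: str = "RUNNING"


@dataclass
class EtlMessage:
    """One row of a hypothetical ETL message log."""
    etl_name: str
    logged_at: datetime
    severity: str
    text: str


@dataclass
class AdminArea:
    """In-memory stand-in for the three Admin Area assets."""
    control_list: List[EtlControlEntry] = field(default_factory=list)
    execution_log: List[EtlExecution] = field(default_factory=list)
    message_log: List[EtlMessage] = field(default_factory=list)

    def start(self, etl_name: str) -> EtlExecution:
        """Record the start of a run; reject ETLs that are unknown or disabled."""
        entry = next((e for e in self.control_list if e.etl_name == etl_name), None)
        if entry is None or not entry.enabled:
            raise ValueError(f"ETL {etl_name!r} is not enabled in the control list")
        run = EtlExecution(etl_name, datetime.now(timezone.utc))
        self.execution_log.append(run)
        return run

    def finish(self, run: EtlExecution, status: str, message: str = "") -> None:
        """Close a run and, if a message is given, add it to the message log."""
        run.finished_at = datetime.now(timezone.utc)
        run.status = status
        if message:
            severity = "ERROR" if status == "FAILED" else "INFO"
            self.message_log.append(
                EtlMessage(run.etl_name, run.finished_at, severity, message)
            )
```

Keeping these records outside the Eng Area means an ETL's working tables can be dropped or rebuilt without losing the record of what ran and when.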

Consumable data areas

Overview

This section documents those areas of an EDW that contain consumable data, meaning that they are exposed to end users.

Collection Area

Purpose: Raw data as collected from source(s).  By “raw data”, we mean datasets obtained from the source at a specific point-in-time per the contract dictating the content and structure of that data. It is our intent to retain the data exactly as fed to an EDW for later reference and use in establishing integrated data in the other consumable areas of an EDW.

Justification:

·        All data feeds are kept in the EDW … fully under EDW control.

·        All other portions of the EDW can be rebuilt at any time from this data.

Users of the data:

·        Consumers who need access to historical versions (in business time) of operational data.

·        Consumers with operational reporting needs that are not supported by the operational system itself.

·        Processes that populate the rest of an EDW!

Principles:

·        Accurately reflects the source data, over time, as it is and was known to the business.

·        It is the foundation on which the rest of the EDW is built.

·        Contains the most granular form of that source data.  No summarizations or aggregations.

·        Contains the raw form of the source data.  No transformations.

·        Data is ingested from source systems in the form of point-in-time datasets.

·        Historical record of the datasets (source data) fed to the EDW.

Some of the important consequences of the above principles are:

·        Though we wish the data to reflect the source, we can only be sure that it reflects the feeds.  The designer of the feed is expected to ensure that the content accurately reflects the source.

·        Immutable copy of every dataset obtained from the pertinent source systems.

·        May contain incorrect information because it must accurately reflect the feeds (and, therefore, the source) and the feeds (or even the source) could be incorrect!

·        Not expected to provide integrated and/or cleansed data.

·        Volume will be quite high compared to other areas of an EDW.

·        Loading needs to be as efficient/quick as possible so that data can (if needed) be ingested in near-real-time.

·        Loading needs to be fail-safe so that it does not become a burden to the production support teams.

·        As a result of the above consequences, we prefer direct ETL-less file drops.

·        Because we need DQL access to data in an EDW, we need dynamic DQL access to the content of the collected files.

·        Loading needs to support late-arriving data so that it does not become a burden to the production support teams.

·        Because the consumer is interested in the effect of the loads but the loads themselves may include records from prior loads, the views through which the consumers see the data must resolve the duplication.
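The last consequence above, a consumer-facing view that resolves duplication across loads, can be sketched as follows.  This is a minimal illustration using Python's built-in sqlite3 module, not a description of any particular EDW platform.  It assumes the collected files have already been exposed as a table, and the table, view, and column names (collected_customer, customer_v, load_id, loaded_at) are hypothetical.  It also assumes that "duplication" means an identical record delivered in more than one load.

```python
import sqlite3


def create_collection_assets(conn: sqlite3.Connection) -> None:
    """Create a hypothetical collected-data table and its consumer-facing view."""
    conn.executescript(
        """
        CREATE TABLE collected_customer (
            load_id     INTEGER NOT NULL,  -- which feed delivery the row came from
            loaded_at   TEXT    NOT NULL,  -- when that delivery was received
            customer_id INTEGER NOT NULL,
            name        TEXT    NOT NULL
        );

        -- Identical records repeated across loads collapse to one row,
        -- stamped with the first time the EDW received that version.
        CREATE VIEW customer_v AS
        SELECT customer_id, name, MIN(loaded_at) AS first_loaded_at
        FROM collected_customer
        GROUP BY customer_id, name;
        """
    )


def append_load(conn, load_id, loaded_at, records):
    """Append one load as delivered; existing rows are never updated or deleted."""
    conn.executemany(
        "INSERT INTO collected_customer VALUES (?, ?, ?, ?)",
        [(load_id, loaded_at, cid, name) for cid, name in records],
    )


conn = sqlite3.connect(":memory:")
create_collection_assets(conn)

# The second load repeats customer 1 unchanged and delivers a new version of customer 2.
append_load(conn, 1, "2024-01-01", [(1, "Ann"), (2, "Bob")])
append_load(conn, 2, "2024-01-02", [(1, "Ann"), (2, "Robert")])

rows = conn.execute(
    "SELECT customer_id, name, first_loaded_at FROM customer_v "
    "ORDER BY customer_id, first_loaded_at"
).fetchall()
# rows == [(1, 'Ann', '2024-01-01'), (2, 'Bob', '2024-01-01'), (2, 'Robert', '2024-01-02')]
```

The underlying table keeps all four delivered rows, so the immutable record of each feed is preserved, while the view shows the repeated record once and retains both versions of the changed one.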

Core Area

Purpose: Business data that has been integrated, cleansed, normalized, etc., to provide a basis from which other enterprise use cases can be satisfied (including the establishment of other enterprise data assets within the EDW).

Users of the data:

·        Consumers with enterprise reporting and analytic needs.

Principles:

·        Golden data (basis of org’s system of reference).

·        Basis for other enterprise data assets in the EDW.

·        Data is integrated from various sources.

·        Data is not duplicated.  (I.e., it is normalized.)

·        Data assets are subject-oriented.

·        History of data is available.

Some of the important consequences of the above principles are:

·        The most complicated ETLs are utilized to populate this area.

·        All other ETLs (populating other areas of the EDW) should utilize these data assets.

Mart Area

Purpose: Data that has been reformatted to satisfy dimensional needs of consumers.  Lean as much as possible on virtual assets (views of other data in an EDW) to prevent duplication of data and reduce ETL processing.  Generally, the data should already be prepared in the Core Area and exposed here through such views.  I.e., the transformation occurs in the Core Area, not here.  Exceptions are likely to be made, allowing data assets (such as conformed dimensions) to be created directly from source data in this area, typically for expedience, but this practice should be rare.  It is better to implement the transformations in the Core Area and expose the transformed data here with simple views.

Users of the data:

·        Consumers with enterprise reporting and analytic needs who specifically require the solution in dimensional form.

Principles:

·        Exposes enterprise data (from Core Area) in a dimensional form that is conducive to use with dimensional data analysis and reporting tools.

Some of the important consequences of the above principles are:

·        Strict (dimensional) standards are enforced to ensure consistency of the area to facilitate use of its data assets.
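To illustrate the preference for virtual assets, here is a small sketch (again using Python's sqlite3 module) in which the Mart Area consists only of views over Core Area tables.  The table and view names (core_customer, core_order, mart_dim_customer, mart_fact_order) and the sample data are assumptions made for the example.  A real EDW would typically place each area in its own schema or database rather than rely on name prefixes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Core Area (hypothetical): normalized data, already integrated and cleansed.
    CREATE TABLE core_customer (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE core_order (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);

    -- Mart Area (hypothetical): simple views that rename columns into
    -- dimensional form.  No data is copied and no ETL is required.
    CREATE VIEW mart_dim_customer AS
        SELECT customer_id AS customer_key, name AS customer_name, region
        FROM core_customer;
    CREATE VIEW mart_fact_order AS
        SELECT order_id, customer_id AS customer_key, amount
        FROM core_order;

    INSERT INTO core_customer VALUES (1, 'Ann', 'West'), (2, 'Bob', 'East');
    INSERT INTO core_order VALUES (10, 1, 25.0), (11, 1, 15.0), (12, 2, 40.0);
    """
)


def sales_by_region(connection: sqlite3.Connection):
    """A typical dimensional query: join the fact view to a dimension view."""
    return connection.execute(
        """
        SELECT d.region, SUM(f.amount)
        FROM mart_fact_order AS f
        JOIN mart_dim_customer AS d ON d.customer_key = f.customer_key
        GROUP BY d.region
        ORDER BY d.region
        """
    ).fetchall()
```

Because the mart objects are views, a row added to core_order is visible through mart_fact_order immediately, with no additional load step.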

Dataset Area

Purpose: Data which has been reformatted (typically flattened/denormalized) to satisfy needs of consumers that are not using an access method that supports joining of the relational assets found elsewhere in the EDW.  I.e., these consumers are expecting stand-alone datasets.  Lean as much as possible on virtual assets (views of other data in the EDW) to prevent duplication of data and reduce ETL processing.

Users of the data:

·        Consumers with analytic needs for flattened datasets.

·        Partners to whom we need to expose prepared datasets.

Principles:

·        Exposes enterprise data (typically from Core Area) in a stand-alone form that is conducive to use with analytical tools and programming languages.

Some of the important consequences of the above principles are:

·        Care must be taken to prevent excessive redundancy; otherwise, the number of data assets in this area can explode.  (Everyone thinks they have a unique need that requires its own unique data asset.)

Files

Some will advocate for the inclusion of mutable files in this area to support analytic tools like the R programming language.  But care should be taken in this regard.  This area contains datasets that are continually and automatically maintained by an EDW's ETLs.  It is unlikely that an R programmer wants to point their solution at a dynamic dataset of this type because the results obtained would change every time an ETL ran (which is out of the programmer’s control).  More likely, that programmer will want to do one of the following:

·        Create the file themselves at a location of their choosing by running a query that they deem appropriate at the time.

·        Have the data team create a fixed, point-in-time file per pre-supplied specifications which, per these standards, should be created in the Distribution Area.

Distribution Area

Purpose: Prepared data as distributed to internal consumers and/or external orgs.  By “prepared data”, we mean datasets containing specific point-in-time data that satisfies a pre-defined need.  These are provided in the form of immutable data files that can be transmitted to the pertinent party.  Examples include reports to oversight agencies and periodic reports required by a department within the org.

Users of the data:

·        Processes (or staff) that require the files.

Principles:

·        Provisioning area for fixed (immutable) files.

·        Satisfies consumers (internal or external) who require a specific file for a specific purpose at a specific point-in-time.

·        Historical record of the files (published data) provided from the EDW.

Some of the important consequences of the above principles are:

·        Immutable copy of every file provided to the pertinent consumers.

·        For those cases where the consumer cannot access the files directly, we need to push the files toward them either directly through an API or indirectly through a handoff area.
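One way to approximate the "immutable copy of every file" principle on an ordinary file system is sketched below in Python.  The function name publish_file, the file-naming pattern, and the choice of CSV are assumptions for illustration only.  Object stores and managed platforms usually offer stronger immutability guarantees than file permissions do.

```python
import csv
import os
import stat
from datetime import date
from pathlib import Path


def publish_file(distribution_dir: Path, name: str, as_of: date, header, rows) -> Path:
    """Write a point-in-time CSV exactly once, then mark it read-only."""
    distribution_dir.mkdir(parents=True, exist_ok=True)
    # The as-of date in the file name keeps each point-in-time file distinct.
    path = distribution_dir / f"{name}_{as_of.isoformat()}.csv"

    # Mode "x" raises FileExistsError if the file already exists,
    # so a published file is never silently overwritten.
    with open(path, "x", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

    # Remove write permission as a guard against accidental edits.
    os.chmod(path, stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return path
```

A second call with the same name and as-of date fails instead of replacing the file, which preserves the historical record of what was actually provided.  A corrected file would need a new name or as-of date.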