
Data collection in a data warehouse

Author: Chip Hartney (The Datamology Company)

First written: Circa 2017

Last updated: June 16, 2023

Contents

Abstract
Batch processing
Basics
    Diagram
    Explanation (including subjects)
Details
    Direct (1-step)
        Diagram
        Explanation (including feeds)
    Indirect (2-step)
        Diagram
        Explanation (including feeds)
    Indirect (3-step)
        Diagram
        Explanation (including feeds)
Subject types
    Immutable
    Append
        Sequenced
        Unsequenced
    Mutable
        Keyed
        Grouped
        Unkeyed
Feed types
    Snapshot feed
    Delta feed
    Other
Frequency of batches
    Latency of data
    Content of feed
Organization of feeds
    File splitting
Collection patterns
    Relational tables in Collection Area
        Table type
        Table structure
        Relational organization of Collection Area
        Microsoft SQL Server info
    Decision tree
    Collection pattern notes
        Approach
        Virtual table
        Physical table
        Consumer view
    Collection pattern 0
        Approach
        Virtual table
        Physical table
        Consumer view
    Collection pattern 11
        Approach
        Virtual table
        Physical table
        Consumer view
    Collection pattern 12
        Approach
        Virtual table
        Physical table
        Consumer view
    Collection pattern 21
        Approach
        Virtual table
        Physical table
        Consumer view
    Collection pattern 22
        Approach
        Virtual table
        Physical table
        Consumer view
Source evolution

 

Abstract

In this article, I analyze the process of data collection (especially on behalf of a data warehouse) and define the distinct patterns by which such collection may be implemented.  This is an aspect of data analytics that I consider critical to successful and efficient solutions, yet is widely overlooked in the industry.  Specifics are included as to how such collection might be facilitated in an Azure SQL Server environment, but the concepts are general.

Batch processing

This entire section about data collection restricts itself to a batch processing (high latency) approach to data warehousing.  If low latency solutions (real-time or near real-time) are required, an entirely different approach to data collection would be needed.  No attempt is (as of yet) made to describe such solutions here.  Suffice it to say that change data capture (CDC) technology would need to be implemented in the source systems and those changes would have to be communicated, and applied, to the data warehouse very rapidly (possibly through asynchronous message queues or some sort of synchronous communication/replication mechanism).

Note that, though often confused with each other, CDC processing differs from batch processing of delta feeds in ways including the following:

 

                    CDC processing                                    Batch processing

Effective time      Specified individually for each record in feed    (Usually) specified once for all records in feed
Granularity         Timestamp based on system clock                   Chosen by designer based on use case
Processing          Continuous                                        Periodic

 

Basics

Diagram

[Diagram: data for selected subjects is collected from Sources A and B into subject-specific assets in the EDW Collection Area.]

Notes:

1.      Source “A” stores data for 4 subjects (named Sub1-Sub4).  A decision has been made to collect the data about subjects “Sub2” and “Sub4” from that source into the EDW.

2.      Source “B” stores data for 3 subjects (named Sub1-Sub3).  A decision has been made to collect the data about subject “Sub2” from that source into the EDW.

3.      Note that a given subject (such as “Sub2”) may be described in multiple sources (Source A and Source B in this case).

4.      The data for each source subject of interest is stored in the Collection Area of the EDW. 

5.      Each data asset in the Collection Area is named in a manner to make clear which source subject it represents.

6.      Each arrow reflects an EtL in which data about a given subject is moved from the pertinent source to the EDW.

Explanation (including subjects)

The general act of data collection is driven by a decision that one, or more, subjects are of analytical interest.  For each subject of interest, a determination is made as to the source (internal or external) in which data about that subject is stored.  An ETL is established to collect data about the subject from that source and store it in the Collection Area of the EDW.

By “subject”, we mean any business concept of interest.  Examples include people (e.g., employees), places (e.g. service locations), things (e.g., insurance policies), events (e.g., phone calls), and the like.

The pertinent source is usually the system of record for the subject of interest.  Typical sources include the organization’s HR, Finance, and Service Delivery applications.  The source subject may be described by a physical asset in the application’s data store.  But, often, those physical assets are not utilized directly (as would be the case if we read data from that data asset).  Instead, the ETL might collect data about the subject through an API or it might be provided the data in the form of a file (independent of the physical data store).  So, while we may understand the source subject to represent something very specific (such as an employee), we do not assume that the subject’s data is stored in any particular form or location in the source application.

The collected data for a subject is stored in a data asset in the EDW (the “target” of the particular ETL).  Given that we need to persist that data, the target is a physical data asset (such as a relational table or a new data file in a file group or directory).  One target asset is established for each source subject and is dedicated to that source subject.  To keep clear that 1:1 relationship, each target asset is named in a manner that makes clear the source from which the data were collected and the subject (within the source) described by that data.

The process that collects the data from the source (and stores it in the target) is a standard ETL process.  However, in data collection, we desire an ELT paradigm, meaning that we load the data in the EDW (in the Collection Area) before doing any transformations.  So, our hope is to use an EL process (without any transformation).  But practicalities (such as the difference in platforms and operating systems) sometimes require a bit of transformation before the data can be stored in the EDW.  Therefore, we use “EtL” (with a very small “t”) to describe the process that loads a data asset in the Collection Area.
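To make the “EtL” idea concrete, here is a minimal T-SQL-style sketch (all table, column, and path names are hypothetical) that loads a dropped feed file into an all-string Collection Area table.  The only transformation is decoding the file’s character set and stamping the load time:

-- Stage the file exactly as provided (staging table assumed to mirror the file's columns).
BULK INSERT "SourceB"."Sub2_Staging"
    FROM '/handoff/SourceB/Sub2/2023-06-15/Sub2.csv'   -- hypothetical feed file
    WITH (FIELDTERMINATOR = ',',
          ROWTERMINATOR   = '\n',
          FIRSTROW        = 2,            -- skip the header row
          CODEPAGE        = '65001');     -- the small "t": decode UTF-8 on the way in

-- Append to the Collection Area table, adding only the load timestamp.
INSERT INTO "SourceB"."Sub2_T"
SELECT "S".*, CURRENT_TIMESTAMP AS "Load_Time"
FROM "SourceB"."Sub2_Staging" AS "S";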

Details

Here we take a more detailed look at data collection with respect to the ingestion of data for a specific subject of interest (Subject 2 in Source B, in this case).

Direct (1-step)

Here we consider a direct feed in which the EDW pulls data directly from the source (presumably via a published API).

Diagram

[Diagram: direct (1-step) collection in which EtL1 moves Subject 2 data from Source B directly into the EDW.]

Notes:

1.      Source “B” stores data for Subject “2” (which we wish to collect into the EDW).

2.      EtL1 extracts the feed from the source and loads it into the EDW.

3.      The arrow reflects an EtL in which data about the subject is moved.

Explanation (including feeds)

The overall process is a single EtL that runs periodically (sometimes manually, but usually on an automatic pre-scheduled basis).  Each run of the EtL adds data to the EDW.  Importantly, each run processes a specific and unique set of source data.  We think of that data as being fed to the EDW.  We describe each such dataset as a “feed”.  (See below for a closer look at feeds.)

Regarding EtL1, the feed can be:

·        Pushed from the source by a process managed by the source team; or

·        Pulled from the source by a process managed by the EDW team.

The former is not recommended because the EDW team must control and manage the data in the EDW.  It cannot be left to outside orgs to insert, update, or delete data in the EDW.

Given that the EtL will pull data from the source, that pull might come from:

·        A published API for the source.

·        The data store underlying the source.

The latter is not recommended because:

·        The EDW team has no official knowledge of that data store (given that it is not published by the source, else it would be an “API”).  Even if the EDW team were to reverse engineer that data store and find a way to tap into it, that effort is well beyond what can/should be expected of the team.

·        The EDW team cannot control the structure of that data store.  It can be changed at any time by the source (via a simple software upgrade) which would break the EtL.

·        The EDW team lacks permission to read data from that data store.

·        This establishes a tight and unwanted dependency between the source and EDW systems.

·        The dependency is not known to the source team and, therefore, not accounted for in its planning activities.

Therefore:

·        This approach should be used only if there is a published API to support it.

·        EtL1 should be developed and managed by the EDW team and consume the feed via that API.

The feed may be temporary, existing only for the duration of this process.  But it is often best practice to preserve that feed in the EDW for future reference and debugging.

Indirect (2-step)

Here we consider an indirect feed in which the data is passed through an intermediate handoff area.

Diagram

[Diagram: indirect (2-step) collection in which EtL1 writes the feed to a handoff area and EtL2 loads it into the EDW.]

Notes:

1.      Source “B” stores data for Subject “2” (which we wish to collect into the EDW).

2.      EtL1 extracts the feed from the source.

3.      EtL2 loads the feed into the EDW.

4.      A handoff area is used to transfer the feed from the source to the EDW.

5.      Each arrow reflects an EtL in which data about the subject is moved.

Explanation (including feeds)

The overall process is a pair of programmatic EtLs that run periodically (sometimes manually, but usually on an automatic pre-scheduled basis).  Each run of the EtL pair adds data to the EDW.  Importantly, each run processes a specific and unique set of source data.  We think of that data as being fed to the EDW.  We describe each such dataset as a “feed”.  (See below for a closer look at feeds.)

Regarding the handoff area, it is critical for stability and maintainability that the handoff area be under the control and management of the EDW team.  That team must control access to the handoff area as well as dictate its content.  This is especially important in the cases where the source system is another organization.  In such cases, the source team must write the feed to the EDW team’s handoff area.  (See following 3-step alternative for those cases where the source team cannot write to the handoff area.)

The overall process is comprised of two different EtLs:

·        The first (EtL1) extracts the data from the source and saves that feed in a handoff area. 

·        The second (EtL2) processes that feed later, loading it into the EDW.

Regarding EtL1, the feed can be:

·        Pushed from the source by a process managed by the source team; or

·        Pulled from the source by a process managed by the EDW team.

The latter is not recommended because:

·        The EDW team does not generally have the knowledge or capability or permission to read data from the source system.

·        The EtL, not being the responsibility of the source team, could fail for unexpected reasons (such as the source team taking the source system down for maintenance).

·        The dependency is not known to the source team and, therefore, not accounted for in its planning activities.

Therefore:

·        EtL1 should be developed and managed by the source team and produce the feed in the handoff area.

·        EtL2 should be developed and managed by the EDW team and consume the feed from the handoff area.

As a general practice, for simplicity’s sake, the dataset in a feed is provided in a single file.  This is good practice as it ensures atomicity in all processes dealing with that feed.  But it is not a requirement.  A feed may be a set of files in which a proper subset of the dataset is provided in each file.  In such scenarios, it is critical to ensure that all related files are processed transactionally, meaning that they are processed together as a single unit-of-work in which all or none of those operations complete successfully, never some.

Indirect (3-step)

Here we consider an indirect feed in which the data is passed through two intermediate handoff areas.

In 2-step indirect collection (described above), the handoff area is necessarily within your data environment and under control of your data team.  In some indirect cases, however, the source system is external to your org and unable to push the feed into your system.  In such cases, the source system will establish a handoff area on their system, push the feed into that handoff area, and give you permissions to obtain the feed from their handoff area.

Diagram

[Diagram: indirect (3-step) collection in which EtL1 writes the feed to the source handoff area, EtL2 copies it to the target handoff area, and EtL3 loads it into the EDW.]

Notes:

1.      Source “B” stores data for Subject “2” (which we wish to collect into the EDW).

2.      EtL1 extracts the feed from the source and places it in the source handoff area.

3.      EtL2 copies the feed from the source handoff area to the target handoff area.

4.      EtL3 loads the feed from the target handoff area into the EDW.

5.      Handoff areas are used to transfer the feed from the source to the EDW.

6.      Each arrow reflects an EtL in which data about the subject is moved.

Explanation (including feeds)

The overall process is a set of programmatic EtLs that run periodically (sometimes manually, but usually on an automatic pre-scheduled basis).  Each run of the EtL set adds data to the EDW.  Importantly, each run processes a specific and unique set of source data.  We think of that data as being fed to the EDW.  We describe each such dataset as a “feed”.  (See below for a closer look at feeds.)

The overall process is comprised of three different EtLs:

·        The first (EtL1) extracts the data from the source and saves that feed in a handoff area in the source system.

·        The second (EtL2) then copies the feed from the source system’s handoff area to the target system’s handoff area. 

·        The third (EtL3) finally processes that feed, loading it into the EDW.

Regarding EtL1, because this scenario arises when the EDW team lacks access to the source data, EtL1 must be developed and managed by the source team and produce the feed in their handoff area.

Regarding EtL2, because this scenario arises when the source system lacks access to the target handoff area, EtL2 must be developed and managed by the target team.

As a general practice, for simplicity’s sake, the dataset in a feed is provided in a single file.  This is good practice as it ensures atomicity in all processes dealing with that feed.  But it is not a requirement.  A feed may be a set of files in which a proper subset of the dataset is provided in each file.  In such scenarios, it is critical to ensure that all related files are processed transactionally, meaning that they are processed together as a single unit-of-work in which all or none of those operations complete successfully, never some.

Subject types

For the purposes of selecting and implementing the applicable collection pattern (below), each subject in a source system (that is to say, the information that describes that subject) is classified as one, and only one, of the following types (each defined further below):

·        Immutable

·        Append

·        Mutable

And each of those types can be further broken down into discrete sub-types such that, in the end, every subject is one, and only one, of the following:

·        Immutable

·        Sequenced (append)

·        Unsequenced (append)

·        Keyed (mutable)

·        Grouped (mutable)

·        Unkeyed (mutable)

Understanding the type of subject you are dealing with guides you to how the EtL collection process should be designed and how the content of the feed is to be interpreted.  In the following discussion, it is useful to think of a subject as being described by a dataset comprised of instances (each described by a single record) and attributes in those records.

Note that we do not account for historical source systems directly (as might be the case if we collected data from another EDW or from a temporal database that maintains history, itself).  Such scenarios are rare to start with.  And, even when they arise, they are usually handled in one of the following manners:

·        Collect only “current” data from such sources.

·        Collect the data as if it is a grouped subject (where the history records for a given keyed instance are the pertinent group).

Any attempt to have our EDW (which builds its own history) also keep track of the history of a source system is needlessly (and even hopelessly) complex, let alone redundant and confusing to consumers.

Immutable

An immutable subject is a subject whose information does not change.

A simple example might be the following dataset of truth codes:

Code    Description
T       True
F       False

There will always be 2 instances and their properties (Code and Description) will never change.

Append

An append subject is a subject whose information can only be supplemented with new instances. 

An example might be the following dataset of hourly temperature readings at a weather station in Highlands Ranch:

Time                Temp (F)
2022-08-12 12:00    91
2022-08-12 13:00    94
2022-08-12 14:00    96
2022-08-12 15:00    97

The properties (Time and Temp) of the recorded instances will never change, but additional instances (for 16:00, 17:00, etc) can be added.

Append subjects are further classified as follows.

Sequenced

A sequenced append (or simply sequenced) subject is an append subject in which one or more properties indicate the sequence in which the instances have been recorded.

See above for an example (in which the Time property describes the sequence).

The majority of append subjects are sequenced.

Unsequenced

An unsequenced append (or simply unsequenced) subject is an append subject in which the sequence in which the instances have been recorded cannot be determined.

Note that this is quite rare, to the point where it is likely you need not support this type of information in the EDW.

Mutable

A mutable subject is a subject whose information can change. 

An example might be the following dataset of offices:

Name     Type      Address
Main     HQ        123 Oak St., Denver
North    Branch    456 Maple Ave, Ft. Collins
South    Branch    789 Pine Blvd, Colorado Springs

New instances (offices) may be added.  Existing instances (offices) may be removed.  The properties (Name, Type, Address) of a given office may be changed.

Mutable subjects are further classified as follows.

Keyed

A keyed mutable (or simply keyed) subject is a mutable subject in which one or more properties act as a key that uniquely identifies a single record describing a specific instance of the subject. 

See above for an example (in which the Name property uniquely identifies each office).

The vast majority of mutable subjects have a key and, therefore, fall into this class.

Grouped

A grouped mutable (or simply grouped) subject is a mutable subject in which one or more properties act as a key that uniquely identifies a set of records describing a specific instance of the subject. 

Note that this is extremely rare, to the point where it is likely you need not support this type of information in the EDW.

Unkeyed

An unkeyed mutable (or simply unkeyed) subject is a mutable subject that lacks a key.  Instances cannot necessarily be distinguished from each other.

Note that this is extremely rare, to the point where it is likely you need not support this type of information in the EDW.

Feed types

For the purposes of selecting and implementing the applicable collection pattern (below), each feed (that is to say, the dataset that is conveyed to the EDW) is classified as one, and only one, of the following types (each defined further below):

·        Snapshot

·        Delta

·        Other

Note that we do not account for feeds for historical subjects because we expect to only ever obtain the current data from such sources.

Snapshot feed

A snapshot (snap) feed is a feed that contains a complete copy of the data describing every instance of the subject in the source.

This type of feed can be used for all types of subjects.

Delta feed

A delta (difference) feed is a feed that contains a complete copy of the data describing each instance of the subject in the source that has been changed (inserted, updated, or deleted) since the feed for the prior batch.

In the case of an append subject (in which only new instances can be added), every record in the feed is presumed to represent a new record that is to be inserted into the target asset.

In the case of a mutable subject, the general expectation is that the feed will contain an action indicator (to distinguish inserts, updates, and deletes) and:

·        For each inserted instance: The image of the instance after it was inserted.

·        For each updated instance: The image of the instance after it was updated.

·        For each deleted instance: The image of the instance before it was deleted.

The most common, as well as simplest, case is one in which there is only one change record for any given source record.  I.e., if the source record is updated multiple times in the pertinent period, only one “update” record is provided in the feed.

However, a variation is one in which every intra-period change is provided in the feed.  I.e., if the source record is updated 3 times in the pertinent period, 3 “update” records are provided in the feed and each has a timestamp to indicate when the update occurred during that pertinent period.  This is the case when a change data capture (CDC) mechanism is utilized in the source system to capture every change as it occurs.  Those changes might then be fed to the DW in a single batch, leaving it to the DW to interpret the intra-period changes to establish the overall change made to the matching record in the DW.

Another variation of interest is a minimalist one in which only the necessary properties are reported.  For example:

·        For each deleted instance: Only the key of the deleted instance is provided.

·        For each updated instance: Only the key of that instance and the updated properties are provided.

This type of feed can only be used for keyed subjects.

Other

There are, of course, many possible forms a feed may take.  The standard types noted above are common and can generally be supported by standardized ETL development templates and processes.  If a non-standard type of feed is utilized, it is expected that the ETL will need to be developed with custom code.

Some alternative types of feed might become, if used frequently enough, standard feeds.  An example of a non-standard feed that might be utilized is one that contains a complete copy of the data describing only selected instances of the subject in the source.  Such a feed would apply only to keyed subjects for which physical deletes are not supported.

In general, this type of (non-standard) feed may apply to any subject type.

Frequency of batches

Batches of data will be collected from the source system on a periodic basis that is agreed to in the contract between the source and the DW.  For example, it might be agreed that a batch is to be transmitted every weekday other than holidays. 

Latency of data

This schedule has implications for the precision of the DW with regards to latency.  The more frequent the transmission, the more aligned the DW will be with the reality represented by the source.  Consider, for example, an ETL for Client data that is run every Friday evening.  The DW will not know about a client added to the source on Monday until after the Friday evening batch has been processed.  If the ETL ran every night, then the DW would know about that new client immediately after processing Monday’s batch. 

In addition to latency (described above), there is also a concern about instances that are missed altogether.  Consider the case in which the client added on Monday was removed on Thursday.  If the ETL runs every Friday, the client would never be communicated to the DW at all because it was not in the source for either batch (the Friday before it was inserted and the Friday after it was deleted).  In general, by choosing to use a weekly schedule, the DW accepts such “inaccuracies”.  If the DW wanted to be more accurate, the ETL would need to run more frequently.  But, regardless of how frequent it is run, the possibility of missed instances still exists (though less likely).

Content of feed

This schedule also has implications for the content of the feed.  In general, assuming the ETL runs when scheduled and passes correct data, there is no chance that information will be lost (other than the intra-batch inaccuracies mentioned above).  But care must be taken to ensure that the batch includes all of the expected data. 

Because a snapshot feed contains a full definition of the source, there is little debate about what is to be included in the feed and it doesn’t matter how much time has passed between batches.  Each snapshot is a complete copy of the source.  Therefore, regardless of the time that has passed between batches, the collection of a snapshot effectively brings the DW back to the current state.

Additional care is, however, required for delta feeds.  By definition, each batch must define all changes that have occurred since the prior batch.  Therefore, the ETL must:

·        Know when the prior batch ran (i.e., what changes it communicated).

·        Include all changes that occurred during the period between the prior batch and this one.

·        Summarize the changes to a particular instance in a single record in the batch.  I.e.:

o   If the instance was inserted and then updated, the batch would simply indicate an INSERT (because, from the DW’s point of view, it is a new record).

o   If the instance was updated and then deleted, the batch would simply indicate a DELETE (because, from the DW’s point of view, it is no longer present).

Note: It is possible that an agreement is made to include each individual intra-batch change to an instance in the batch and leave it to the DW to resolve the multiple records into a single action.  In fact, this is the type of feed one would require if the DW is recording intra-day changes!  But, most commonly, the source is expected to provide a single summary record for each “changed” instance.

It is frequently asked if a delta ETL might use the planned schedule, instead of the actual runs, to determine the period over which changes are to be summarized.  I.e., instead of recording the actual datetime that the ETL last ran and then using that date time to summarize records, might the ETL instead use the knowledge that it is scheduled to run every 24 hours (for example) and simply presume that it ran 24 hours previously.  It should be obvious, with a little thought, that we cannot depend on the ETL to have actually run when scheduled.  If it did not, the ETL would use the wrong period and include the wrong summary of changes.  That, in turn, would cause inaccuracies in the DW … which may never get corrected.  I.e., if we miss the INSERT of a record in one batch and that record is never updated (or deleted) in the source, the DW would never learn of that record.  There is no “catching up” without manual intervention!  So, extra care is required for delta feeds.
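One way to make that dependable is for the delta extract to keep its own watermark of the last successful run rather than trusting the schedule.  A minimal T-SQL-style sketch (hypothetical table and column names) follows:

-- Control table recording, per subject, when changes were last extracted.
CREATE TABLE "ETL_Control"
    ("Subject_Name"   VARCHAR(128) NOT NULL PRIMARY KEY,
     "Last_Extracted" DATETIME2    NOT NULL);

-- The delta extract covers exactly the period since the recorded watermark...
DECLARE @Since DATETIME2 =
    (SELECT "Last_Extracted" FROM "ETL_Control" WHERE "Subject_Name" = 'Client');
DECLARE @Now   DATETIME2 = SYSUTCDATETIME();

SELECT *                               -- one summary record per changed instance
FROM "Source"."Client_Changes"
WHERE "Change_Time" >  @Since
  AND "Change_Time" <= @Now;

-- ...and the watermark is advanced only after the feed has been written successfully.
UPDATE "ETL_Control"
SET "Last_Extracted" = @Now
WHERE "Subject_Name" = 'Client';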

Organization of feeds

The feeds must be organized in a meaningful manner in the Collection Area so that they can be properly understood, segregated from each other, and utilized properly.

A recommended approach to such organization is to utilize a hierarchical file structure such as:

Collection Area feeds
    Source
        Subject
            Effective date of feed
                File(s)

Here is an example (illustrative names): the feed file(s) for Subject 2 of Source B, effective 2023-06-15, would be stored under “Collection Area feeds/SourceB/Sub2/2023-06-15/”.

The collection process must be informed of the business time at which the data in the file became effective (i.e., metadata) and that value must be recorded in the EDW.  Note that the business time at which the data expires is not yet known.  That will be communicated in a future feed.  We are concerned here only with the starting point of the effective period.

One common method for communicating the value is to embed it in the file system path of the feed.  In the approach shown above, the business time value is recorded as the directory name.  This is a simple mechanism by which the business time for a particular feed can be recorded in the file system.  The value can later be referred to when processing the file, effectively keeping the metadata (business time) with the data in the files, themselves.  In some cases, this approach is not sufficient due to the limit on the number of directories that the file system allows.
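For example, a collection process might recover the business time from the feed’s directory path with nothing more than string handling (T-SQL-style sketch; the path and file names are hypothetical):

DECLARE @Path VARCHAR(400) = '/collection/SourceB/Sub2/2023-06-15/Sub2.csv';  -- hypothetical feed file

-- Strip the file name, then take the last directory segment as the business time.
DECLARE @Dir VARCHAR(400) =
    LEFT(@Path, LEN(@Path) - CHARINDEX('/', REVERSE(@Path)));
DECLARE @BusinessTime DATE =
    CAST(RIGHT(@Dir, CHARINDEX('/', REVERSE(@Dir)) - 1) AS DATE);   -- 2023-06-15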

An alternative approach is to embed the business time in the name of the file, itself.  This approach can be used only when the feeds are guaranteed to be single files (which is a limitation we do not wish to impose).

The collection process must also take care to ensure uniqueness of the feeds, that the values are appropriate, and that they are processed in the proper sequence. 

Advanced data warehouses may code their collection processes to handle (in an automatic manner) unusual cases such as:

·        Feeds that come out of sequence.

·        Feeds that have the same business time (which might be considered corrections, for example).

We assume in this document that such sophistications will not be implemented programmatically.  It is, instead, expected that the engineering team will need to intervene manually when a collection process detects an unexpected feed.

File splitting

A feed is a conceptual dataset.  It may be recorded in a single file.  But it may also be recorded in multiple files (especially when the volume of data is especially large).

Handling of multi-file feeds is naturally more complex than single-file feeds.  To assure consistency, the collection process must utilize a unit-of-work that encompasses all the files in the feed and that can be difficult to do in file-based systems.  Therefore, it is generally encouraged that organizations avoid using multi-file feeds.  But, if multi-file feeds must be supported, accommodations are needed in various aspects of the analytical system.  For example, each file cannot be named with its business time because they all have the same business time.
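Where the platform allows it, the unit-of-work can be as simple as wrapping the per-file loads in one transaction.  A T-SQL-style sketch (hypothetical staging table and file names; the staging table is assumed to mirror the file layout) of loading a two-file feed atomically:

BEGIN TRANSACTION;
BEGIN TRY
    -- Both files belong to the same feed (same business time) and must land together.
    BULK INSERT "SourceB"."Sub2_Staging"
        FROM '/handoff/SourceB/Sub2/2023-06-15/part1.csv' WITH (FIELDTERMINATOR = ',', FIRSTROW = 2);
    BULK INSERT "SourceB"."Sub2_Staging"
        FROM '/handoff/SourceB/Sub2/2023-06-15/part2.csv' WITH (FIELDTERMINATOR = ',', FIRSTROW = 2);
    COMMIT TRANSACTION;        -- all of the feed's files are loaded ...
END TRY
BEGIN CATCH
    ROLLBACK TRANSACTION;      -- ... or none of them
    THROW;
END CATCH;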

Collection patterns

Fundamentally, data about a subject is collected, over time via separate feeds, in the EDW as a set of data files.  Understanding the type of subject and type of feed enables one to interpret that collected data accurately.  In this section, we describe the patterns by which those collected files are to be interpreted.

Relational tables in Collection Area

Table type

Though we collect the data in the Collection Area via simple file drops, our goal is to expose that collected data as a relational table that can be queried with DQL.  Simply dropping the files in the Collection Area does not accomplish that goal.  Each table must be established through one of the following means:

·        Virtually, as an external table, through use of a DQL-based data virtualization capability.

·        Physically, as an internal table, through use of a standard CREATE TABLE command and an ETL that populates the table.

The first approach is preferred as it reduces the development and maintenance costs of the EDW and makes the data available for consumption via DQL as soon as the file has been dropped.  The latter approach is to be used when one or more of the following conditions related to data virtualization exist:

·        The capability is not available.

·        The capability is insufficient.

·        The performance of the implementation is inadequate and cannot be resolved.

Microsoft SQL Server info

Microsoft PolyBase is a data virtualization service available (in Preview mode as of June 2022) as part of an Azure SQL Managed Instance.  It enables the execution of Transact-SQL (T-SQL) queries against data from files stored in Azure Data Lake Storage Gen2 (ADLSG2) or Azure Blob Storage.  It is envisioned that PolyBase could be used to expose the collected data for a given subject as a relational asset (external table).

Microsoft informed us in Dec 2022 that:

·        PolyBase is SQL Server functionality.

·        Data virtualization (the cloud equivalent in Azure SQL) went GA for SQL Managed Instance in Sept 2022.

·        Data virtualization only supports files in Azure Blob Storage or ADLSG2, not Azure File Storage.

So the service is now generally available (for files in Azure Blob Storage or ADLSG2), but we do not yet know its full capabilities.  For example, an assumption below is that the virtualization service can expose the name of the file directory, as well as the file creation time, as columns in the external table.  (Such capability is available in some competing technologies, though perhaps only when such directories are used for partitioning.)  If directories don’t work, we might consider using the file name to indicate the business time.  (Again, some competing technologies support this.)
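For orientation, the general shape of the DDL involved is sketched below in T-SQL style.  The storage prefix, wildcard support, format options, and (especially) any ability to expose directory names or file times as columns are assumptions that must be verified against the deployed version of the service:

-- Point at the storage account/container holding the Collection Area files (prefix assumed).
CREATE EXTERNAL DATA SOURCE "CollectionLake"
    WITH (LOCATION = 'adls://<storage_account>.dfs.core.windows.net/<container>');

-- Describe the feed files (CSV with a header row assumed).
CREATE EXTERNAL FILE FORMAT "CsvFormat"
    WITH (FORMAT_TYPE = DELIMITEDTEXT,
          FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2));

-- Expose the collected files for one subject as a queryable external table.
CREATE EXTERNAL TABLE "SourceB"."Sub2_T"
    ("<property_1_name>" VARCHAR(4000) NULL,
     "<property_2_name>" VARCHAR(4000) NULL)
    WITH (DATA_SOURCE = "CollectionLake",
          LOCATION    = '/SourceB/Sub2/**',      -- all feed sub-directories (wildcard support assumed)
          FILE_FORMAT = "CsvFormat");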

Table structure

Because our goal in the Collection Area is to expose the source data in its purest, unadulterated form, we expose most of that data as character strings.  The only exceptions are the business times, which are provided as metadata and must be valid in order for the collection processes to work.

Relational organization of Collection Area

Per the “Organization of feeds” section above, the feeds (themselves data files) are organized in the Collection Area by source and subject.

It is strongly recommended that the relational assets in the same data store be clearly associated with the matching directories in that file structure.  Therefore, a matching hierarchical relational organization should be used to organize the relational tables that expose the fed data:

Collection-specific database
    Source-specific schema
        Subject-specific table

Microsoft SQL Server info

Azure SQL supports 3-part table names which we utilize in the following manner:

·        All data assets are defined in the “Collection” database.

·        A schema is defined, in that database, for each source and is named to uniquely identify the source.

·        The table name, itself, is the name of the subject, from the pertinent source, for which data has been collected.
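For illustration (hypothetical source “SourceB” and subject “Client”), the organization amounts to nothing more than:

USE "Collection";                                  -- collection-specific database
GO
CREATE SCHEMA "SourceB";                           -- source-specific schema
GO
-- The collected data for the subject is then addressable with a 3-part name:
SELECT TOP (10) * FROM "Collection"."SourceB"."Client";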

Decision tree

Given type of subject and feed, choose the pertinent collection pattern per the following decision tree:

If (subject=Immutable) then

               Collection Pattern 0 -- Only pattern for Immutable subjects

Elseif (subject=Append) then

               If (feed=Snapshot) then

                              Collection Pattern 11

               Elseif (feed=Delta) then

                              Collection Pattern 12 -- Preferred pattern for Append subjects

               Else

                              No pattern ... utilize custom ETL

               Endif

Else  -- (subject=Mutable)

               If (feed=Snapshot) then

                              Collection Pattern 21

               Elseif (subject=Keyed) then

                              If (feed=Delta) then

                                             Collection Pattern 22 -- Preferred pattern for Mutable subjects

                              Else

                                             No pattern... utilize custom ETL

                              Endif

               Else

                              No pattern ... utilize custom ETL

               Endif

Endif

Note that the number of patterns is finite and, therefore, manageable.  It is expected that a template be generated for each pattern and reused for each new collection of data (for a given subject).  Each new collection process should be supported in a nearly automatic manner based on only a few inputs (from the decision tree) and a list of the attributes in the feed (which, it is hoped, can be discovered automatically).

Also note that, in a well-defined system, each collection pattern also defines the pertinent relational asset from which the data is consumed.  Exceptions might be made, of course, but they should generally be avoided.

Collection pattern notes

Each collection pattern is described in detail below.  Here we explain the format of those notes.

The relational assets described for the pattern are always:

·        A relational table (virtual or physical) that directly exposes all the collected data.  The table is not expected to be exposed to consumers.

·        A relational view that exposes the collected data in the form that the EDW consumers expect.

Columns describing the effective period are shown here, for simplicity, as dates.  A different grain (such as time in microseconds) might be applicable for the subject in question.  Remember, effective periods are derived from the metadata that accompanies the feed.

Columns describing the actual raw data from the feeds are shown generically as “Properties”.  Because the Collection Area makes no attempt to transform the raw data and because the feeds may contain bad data, all properties are exposed as STRINGs (or the equivalent in the selected data platform) so that the actual fed value can be seen.

Approach

The “Approach” subsection describes the pattern in general terms.

Virtual table

The “Virtual table” subsection describes the approach, in pseudocode, that would be utilized to establish the relational table (as an “external” table) using data virtualization.  Remember, this is the preferred approach because it avoids the need for a second ETL in the collection process.

Note that, being pseudocode, the documented code must be implemented in the language of the pertinent database and data virtualization tool.

Note that it may be possible to use virtual tables with the exception that the load timestamp of the file is not available.  If that is the only shortcoming preventing use of virtual tables, consider obtaining the load time from the ETL execution log and exposing that in the consumer view.

Physical table

The “Physical table” subsection describes the approach, in pseudocode, that would be utilized to establish the relational table (as an “internal” table) and populate it with a traditional ETL.  This approach would be utilized only if the Virtual table is not possible or deemed unacceptable.

Note that, being pseudocode, the documented code must be implemented in the language of the pertinent database and ETL tool.

Consumer view

The “Consumer view” subsection describes, in pseudocode, the view (of the relational table, regardless of whether it is virtual or physical) that exposes the collected data to the consumers in the expected form.  This view consolidates the raw data into a meaningful form by doing things like elimination of duplicate feeds, etc.

Note that, being pseudocode, the documented code must be implemented in the language of the pertinent database.  Note, too, that we use the ROW_NUMBER function (with JOINs) to determine effective periods in this pseudocode.  It may well be that use of the LEAD function, or even some other technique, would be more performant and sufficient.

Collection pattern 0

Collection pattern 0, described here, is the standard collection pattern in which data about an immutable subject is collected.

Note that this pattern is the only approach for immutable subjects and does not involve ETL.  The data is loaded once (manually).

Approach

In this pattern, there is only one “feed” and the data in that feed is loaded into the EDW manually (meaning through whatever mechanism is deemed easiest at the time).  The feed is not processed through the handoff area, nor is it stored in the file system.  The data is simply loaded into the physical table in the EDW and then discarded.

Virtual table

Not applicable.

Physical table

CREATE TABLE "<source_name>"."<subject_name>_T"

    ("<property_1_name>" STRING NULL,

     "<property_2_name>" STRING NULL,

     ...,

     "Load_Time" TIMESTAMP NOT NULL);

Load the table directly from the provided data “feed” (using the CURRENT_TIMESTAMP for Load_Time).
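Using the truth-code subject from the “Immutable” section as an example (source name hypothetical), the entire load might be a single statement such as:

INSERT INTO "RefSource"."Truth_Code_T" ("Code", "Description", "Load_Time")
VALUES ('T', 'True',  CURRENT_TIMESTAMP),
       ('F', 'False', CURRENT_TIMESTAMP);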

Consumer view

The view exposes all records from that “feed” (which never change).

CREATE VIEW "Collection"."<source_name>"."<subject_name>"

    ("<property_1_name>" STRING NULL,

     "<property_2_name>" STRING NULL,

     ...,

     "Load_Time" TIMESTAMP NOT NULL)

    AS

    SELECT * FROM "Collection"."<source_name>"."<subject_name>_T";

Collection pattern 11

Collection pattern 11, described here, is the standard collection pattern in which data about an append subject (sequenced or unsequenced) is collected with snapshot feeds.

Note that collection pattern 12 is favored over this pattern.  See that pattern for the reasons.

Approach

In this pattern, a snapshot of the subject is obtained from the source in every feed.  Being an append subject, each feed will contain all records that were provided in the prior feed as well as all new records.  All the redundant records are exposed in the relational table.  Only the unique records are exposed in the relational view.  For the first feed, all records in the dataset are included.  For every subsequent feed, only new records are included.

Virtual table

CREATE EXTERNAL TABLE "Collection"."<source_name>"."<subject_name>_T"

    ("Effective_Date" DATE NOT NULL,

     "<property_1_name>" STRING NULL,

     "<property_2_name>" STRING NULL,

     ...,

     "Load_Time" TIMESTAMP NOT NULL)

    WITH (

        LOCATION = '<subject_directory_path_spec>',

        FILE_FORMAT = <external_file_format_name>,

        PARTITION "Effective_Date" ON '<sub_directory_name>' AS DATE NOT NULL,

        PROPERTY "Load_Time" ON <file_creation_time> AS TIMESTAMP NOT NULL

    );

Physical table

CREATE TABLE "<source_name>"."<subject_name>_T"

    ("Effective_Date" DATE NOT NULL,

     "<property_1_name>" STRING NULL,

     "<property_2_name>" STRING NULL,

     ...,

     "Load_Time" TIMESTAMP NOT NULL);

Create and use a simple ETL that reads each feed (where the directory name is the effective date) and appends the records to the table (using the CURRENT_TIMESTAMP for Load_Time).

Consumer view

Because the pattern depends on feeds that contain redundant data records, care must be taken to remove them before exposure to the consumer.  The following solution is for a sequenced subject (which is, by far, the most common type of append subject).  For such, the sequence property is utilized to distinguish the records in any given feed that have been added since the prior feed.  The view exposes all records in the first feed but only those records from later feeds in which the sequence values are higher than the highest sequence value used in the prior feed.  Note that there may be a more efficient way to accomplish the same outcome.  This is simply the first solution we thought of!

CREATE VIEW "Collection"."<source_name>"."<subject_name>"

    ("Effective_Date" DATE NOT NULL,

     "<property_1_name>" STRING NULL,

     "<property_2_name>" STRING NULL,

     ...,

     "Load_Time" TIMESTAMP NOT NULL)

    AS

    WITH "RawData" AS (

         SELECT * FROM "Collection"."<source_name>"."<subject_name>_T"

    ), "MaxSeq" AS (

         SELECT ROW_NUMBER() OVER (ORDER BY "Effective_Date" ASC) AS "RowNum",

                "Effective_Date",

                MAX("<seq_prop>") AS "MaxSeq"

         FROM "RawData"

         GROUP BY "Effective_Date"

    ) , "Control" AS (

         SELECT "A"."RowNum " AS "RowA",

                "A"."Effective_Date" AS "EffA",

                "A"."MaxSeq" AS "SeqA",

                "B"."RowNum " AS "RowB",

                "B"."Effective_Date" AS "EffB",

                "B"."MaxSeq" AS "SeqB"

         FROM "MaxSeq" AS "A"

         RIGHT JOIN "MaxSeq" AS "B"

           ON "A"."RowNum" + 1 = "B"."RowNum"


    )

    SELECT "RawData".*

    FROM "RawData"

    INNER JOIN "Control"

      ON "RawData"."Effective Date" = "Control"."EffB"

      AND ("Control"."SeqA" IS NULL OR "RawData"."<seq_prop>" > "Control"."SeqA");

Note that, for unsequenced subjects (which are very rare), different logic would need to be utilized to limit the dataset to only the unique records.  It is expected that one would need to:

·        Sort the records in each feed and assign a unique row number to each record in the feed.

·        Expose only those records with a row number higher than the highest row number in the prior feed.

In fact, this latter approach (for unsequenced data) will work for sequenced data as well (as long as the sequence columns are used as the sorting columns).
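A sketch of that row-number approach (pseudocode in the same style as above; the sort columns are assumed to be chosen so that newly added records sort after previously fed ones, i.e., the sequence columns for a sequenced subject):

    WITH "Numbered" AS (
        SELECT ROW_NUMBER() OVER (
                   PARTITION BY "Effective_Date"
                   ORDER BY "<sort_prop_1>", "<sort_prop_2>", ...
               ) AS "RowNum",
               X.*
        FROM "Collection"."<source_name>"."<subject_name>_T" AS X
    ), "PriorMax" AS (
        -- Highest row number in the immediately preceding feed (NULL for the first feed)
        SELECT "Effective_Date",
               LAG(MAX("RowNum")) OVER (ORDER BY "Effective_Date") AS "PriorMaxRowNum"
        FROM "Numbered"
        GROUP BY "Effective_Date"
    )
    SELECT "N".*
    FROM "Numbered" AS "N"
    INNER JOIN "PriorMax" AS "P"
      ON "N"."Effective_Date" = "P"."Effective_Date"
    WHERE "N"."RowNum" > COALESCE("P"."PriorMaxRowNum", 0);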

Collection pattern 12

Collection pattern 12, described here, is the standard collection pattern in which data about an append subject is collected with delta feeds.

Note that this pattern is the preferred approach for append subjects.  It is better than collection pattern 11 for the following reasons:

·        The relational view is far less complex (and, therefore, easier to maintain).

·        The relational view, because of its simplicity, is more performant.

·        The amount of data stored on disk is significantly less (and, therefore, less costly).

It is, therefore, recommended that this pattern be utilized for append subjects whenever possible.

Approach

In this pattern, each feed includes only records that have been added to the subject since the prior feed.  It is exactly those records that we wish to expose to the consumer.  The relational view, therefore, is a simple view of all fed data.

Virtual table

CREATE EXTERNAL TABLE "Collection"."<source_name>"."<subject_name>_T"

    ("Effective_Date" DATE NOT NULL,

     "<property_1_name>" STRING NULL,

     "<property_2_name>" STRING NULL,

     ...,

     "Load_Time" TIMESTAMP NOT NULL)

    WITH (  

        LOCATION = '<subject_directory_path_spec>',

        FILE_FORMAT = <external_file_format_name>,

        PARTITION "Effective_Date" ON '<sub_directory_name>' AS DATE NOT NULL,

        PROPERTY "Load_Time" ON <file_creation_time> AS TIMESTAMP NOT NULL

    );

Physical table

CREATE TABLE "<source_name>"."<subject_name>_T"

    ("Effective_Date" DATE NOT NULL,

     "<property_1_name>" STRING NULL,

     "<property_2_name>" STRING NULL,

     ...,

     "Load_Time" TIMESTAMP NOT NULL);

Create and use a simple ETL that reads each feed (where the directory name is the effective date) and appends the records to the table (using the CURRENT_TIMESTAMP for Load_Time).

Consumer view

The view exposes all records in all feeds.

CREATE VIEW "Collection"."<source_name>"."<subject_name>"

    ("Effective_Date" DATE NOT NULL,

     "<property_1_name>" STRING NULL,

     "<property_2_name>" STRING NULL,

     ...,

     "Load_Time" TIMESTAMP NOT NULL)

    AS

    SELECT * FROM "Collection"."<source_name>"."<subject_name>_T";

Collection pattern 21

Collection pattern 21, described here, is the standard collection pattern in which data about a mutable subject (keyed, grouped, or unkeyed) is collected with snapshot feeds.

Note that collection pattern 22 is favored over this pattern.  See that pattern for the reasons.

Approach

In this pattern, a snapshot of the subject is obtained from the source in every feed.  Being a snapshot of the source data, the data only describes subject instances that exist.  We must infer the instances that existed previously but no longer exist (i.e., they were deleted).  Theoretically, this involves comparing the prior feed to the current feed to determine what was deleted.  That is complex and only applicable to keyed or grouped subjects.  We choose, instead, to assume that the effective period for any such record is the period between the current feed and the following feed.

Further, the fed data will be highly redundant.  Each feed will include all the data from the prior feed that did not change in the interim.  For a data record that does not change during a week, the daily feed would repeat the same record Monday, Tuesday, Wednesday, Thursday, and Friday.  Theoretically, we would like to eliminate such redundancy.  But that is not required for accuracy, only for efficiency.  So we choose to leave the redundancy.  Querying of such data by effective period will return accurate results.  Determining the end of the effective period for a given state (across all properties) will be more difficult (requiring the merging of identical records), but that need is rare.

Virtual table

CREATE EXTERNAL TABLE "Collection"."<source_name>"."<subject_name>_T"

    ("Effective_Start_Date" DATE NOT NULL,

     "<property_1_name>" STRING NULL,

     "<property_2_name>" STRING NULL,

     ...,

     "Load_Time" TIMESTAMP NOT NULL)

    WITH (  

        LOCATION = '<subject_directory_path_spec>',

        FILE_FORMAT = <external_file_format_name>,

        PARTITION "Effective_Start_Date" ON '<sub_directory_name>' AS DATE NOT NULL,

        PROPERTY "Load_Time" ON <file_creation_time> AS TIMESTAMP NOT NULL

    );

Physical table

CREATE TABLE "<source_name>"."<subject_name>_T"

    ("Effective_Start_Date" DATE NOT NULL,

     "<property_1_name>" STRING NULL,

     "<property_2_name>" STRING NULL,

     ...,

     "Load_Time" TIMESTAMP NOT NULL);

Create and use a simple ETL that reads each feed (where the directory name is the effective start date) and appends the records to the table (using the CURRENT_TIMESTAMP for Load_Time).

Consumer view

The table exposes the effective start date (which is the name of the directory containing the feed file).  To make the view usable, we add the effective end date (which matches the effective start date of the following feed).

CREATE VIEW "Collection"."<source_name>"."<subject_name>"

    ("Effective_Start_Date" DATE NOT NULL,

     "<property_1_name>" STRING NULL,

     "<property_2_name>" STRING NULL,

     ...,

     "Load_Time" TIMESTAMP NOT NULL)

    AS

    WITH "RawData" AS (

        SELECT * FROM "Collection"."<source_name>"."<subject_name>_T"

    ), "Feed" AS (

        SELECT ROW_NUMBER() OVER (ORDER BY "Effective_Start_Date" ASC) AS "RowNum",

               "Effective_Start_Date"

        FROM (SELECT DISTINCT "Effective_Start_Date" FROM "RawData")

    ), "Period" AS (

        SELECT "A"."Effective_Start_Date" AS "Effective_Start_Date",

               "B"."Effective_Start_Date" AS "Effective_End_Date"

        FROM "Feed" AS "A"

        LEFT JOIN "Feed" AS "B"

          ON "A"."RowNum" + 1 = "B"."RowNum"

    )

    SELECT "Period"."Effective_Start_Date",

           "Period"."Effective_End_Date",

           "RawData"."<property_1_name>",

           "RawData"."<property_2_name>",

           ...,

           "RawData"."Load_Time" TIMESTAMP NOT NULL

    FROM "RawData"

    INNER JOIN "Period"

      ON "RawData"."Effective_Start_Date" = "Period"."Effective_Start_Date";

Collection pattern 22

Collection pattern 22, described here, is the standard collection pattern in which data about a keyed (mutable) subject is collected with delta feeds.

Note that this pattern is the preferred approach for mutable subjects (but it can only be used for keyed subjects).  It is better than collection pattern 21 for the following reasons:

·        Any given state of a specific instance of the subject is recorded only once.

·        Because each state is recorded only once, a consumer can directly determine the end date of that particular state.  (I.e., the data is easier to use.)

·        Because each state is recorded only once, the amount of data stored on disk is significantly less (and, therefore, less costly and more performant).

It is, therefore, recommended that this pattern be utilized for keyed subjects whenever possible.

Approach

In this pattern, which only applies to keyed subjects, the feed includes only the changes to the source data.  As always, we expose those feeds in the table.  In the view, we translate the changes (based on the included action indicator) into the desired form (with an explicit effective period).  For insert and update actions, the record is understood to be effective from the business time of the pertinent feed until the next feed including that same record (as identified by the key).  For delete actions, the record is understood to have been deleted as of the business time of the pertinent feed.  For update and delete actions, the business time of the new feed is used as the expiration point of the prior record (with the same key).

Because the first feed acts as a data load, it is expected that the action indicator for every record in that feed indicates an insert action.  In subsequent feeds, the insert action is to be utilized only for new instances (new key) whereas the update and delete actions are utilized for existing instances.  Note that the insert action can be used again for a particular instance only after a delete action for the same instance.

It is possible that the feeds will be incorrect (possibly missing feeds or feeds that utilize the wrong indicators).  To keep the system as resilient as possible, it is recommended that the following rules be observed when processing a feed:

·        An update action for a given key that is not already defined in the data should be handled as an insert.

·        An insert action for a given key that is already defined in the data should be handled as an update.

·        A delete action for a given key that is not already defined in the data should be ignored.

·        If multiple actions are provided for a given key, only one is utilized.

It is also recommended that a supplemental DQ process be created that checks for the above exceptional situations.  Typically, you would use a simple query of the collected data and report such exceptions to the applicable staff for investigation and resolution.
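Such a DQ check can usually be a simple query over the collected table.  For example (pseudocode in the same style as above), the following reports any key that received more than one action record in a single feed:

    SELECT "Effective_Start_Date",
           "<key_property_1_name>", ...,
           COUNT(*) AS "Action_Count"
    FROM "Collection"."<source_name>"."<subject_name>_T"
    GROUP BY "Effective_Start_Date", "<key_property_1_name>", ...
    HAVING COUNT(*) > 1;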

Virtual table

CREATE EXTERNAL TABLE "Collection"."<source_name>"."<subject_name>_T"

    ("Effective_Start_Date" DATE NOT NULL,

     "Action_Indicator" STRING NOT NULL,

     "<key_property_1_name>" STRING NULL,

     "<key_property_2_name>" STRING NULL,

     ...,

     "<non-key_property_1_name>" STRING NULL,

     "<non-key_property_2_name>" STRING NULL,

     ...,

     "Load_Time" TIMESTAMP NOT NULL)

    WITH (  

        LOCATION = '<subject_directory_path_spec>',

        FILE_FORMAT = <external_file_format_name>,

        PARTITION "Effective_Start_Date" ON '<sub_directory_name>' AS DATE NOT NULL,

        PROPERTY "Load_Time" ON <file_creation_time> AS TIMESTAMP NOT NULL

    );

Physical table

CREATE TABLE "<source_name>"."<subject_name>_T"

    ("Effective_Start_Date" DATE NOT NULL,

     "Action_Indicator" STRING NOT NULL,

     "<key_property_1_name>" STRING NULL,

     "<key_property_2_name>" STRING NULL,

     ...,

     "<non-key_property_1_name>" STRING NULL,

     "<non-key_property_2_name>" STRING NULL,

     ...,

     "Load_Time" TIMESTAMP NOT NULL);

Create and use a simple ETL that reads each feed (where the directory name is the effective start date) and appends the records to the table (using the CURRENT_TIMESTAMP for Load_Time).

Consumer view

The table exposes the effective start date (which is the name of the directory containing the feed file).  To make the view usable, we add the effective end date.  That effective end date is the effective start date of the subsequent feed (if any) that defines the same instance (as identified by the key).

CREATE VIEW "Collection"."<source_name>"."<subject_name>"

    ("Effective_Start_Date" DATE NOT NULL,

     "Action_Indicator" STRING NOT NULL,

     "<key_property_1_name>" STRING NULL,

     "<key_property_2_name>" STRING NULL,

     ...,

     "<non-key_property_1_name>" STRING NULL,

     "<non-key_property_2_name>" STRING NULL,

     ...,

     "Load_Time" TIMESTAMP NOT NULL)

    AS

    WITH "UniqRawData" AS (

        SELECT ROW_NUMBER() OVER (

                   PARTITION BY "<key_property_1_name>", ...

                   ORDER BY "Effective_Start_Date" ASC

               ) AS "RowNum",

               X.*

        FROM "Collection"."<source_name>"."<subject_name>_T" AS X

        -- Eliminate redundant rows from each feed

        WHERE ROW_NUMBER() OVER (

                  PARTITION BY "Effective_Start_Date", "<key_property_1_name>", ...

              ) = 1

    )

    SELECT "A".*, -- Except RowNum

           "B"."Effective_Start_Date" AS "Effective_End_Date"

    FROM "UniqRawData" AS "A"

    LEFT JOIN "UniqRawData" AS "B"

      ON "A"."UniqRawData"."<key_property_1_name>" = "B"."UniqRawData"."<key_property_1_name>"

      AND "A"."UniqRawData"."<key_property_2_name>" = "B"."UniqRawData"."<key_property_2_name>"

      AND ...

      AND "A"."UniqRawData"."RowNum" + 1 = "B"."UniqRawData"."RowNum"

    WHERE "A"."UniqRawData"."Action_Indicator" IN ('INSERT', 'UPDATE');

Source evolution

A data collection ETL is expected to run repeatedly, consistently, and without error.  By definition, however, a data collection ETL has a dependency on the source system which is outside of the control of the DW team. The source system may be down, for instance, a circumstance that we can do little to prevent.  One of the most common problems that will occur over the lifetime of the DW is that the source will evolve.  An ETL that works with one version of the source will not necessarily work with later versions of the same source.

In limiting data collection to direct and indirect feeds described earlier, we buffer ourselves from such problems reasonably well:

·        In a direct feed, we are pulling data from a source-provided API.  Being a published API, the source will generally keep the API stable regardless of any internal changes to the software and its data store. 

·        In an indirect feed, the source agreed to feed the data to the DW in an agreed-upon form.  Being agreed to, the source will generally adjust their “push” ETL to accommodate any internal changes to the software and its data store.

So, though it is rare that an ETL will break due to source evolution, it can (and will) happen, in which case:

·        For a direct feed, the source will announce the API change and the DW team will have to adjust its ETL.

·        For an indirect feed, the source will announce that they can no longer provide the agreed-upon feed, and the teams will have to negotiate a new agreement (and, therefore, adjust the ETL).

We think of these ETL adjustments as “versions”.  The first version of the ETL works until there is an evolution.  The ETL is updated (to a new version) to accommodate the evolution and then works until there is another evolution.  And so on.

It is, therefore, incumbent on the DW team to accommodate each version of the ETL.

One approach to handling such evolution (and accompanying versions) is to handle each version normally and to add a mechanism (such as a relational view) that combines the various versions of the fed data into one expected perspective of all of the data.  I.e., the view could UNION the various versions of the data together (and make any adjustments needed to synchronize those versions):

[Diagram: version-specific tables in the Collection Area combined by a single consolidating view.]
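A sketch of such a consolidating view (pseudocode in the same style as above; the version-specific table names and the property added in version 2 are hypothetical):

    CREATE VIEW "Collection"."<source_name>"."<subject_name>"
        AS
        SELECT "Effective_Date",
               "<property_1_name>",
               CAST(NULL AS STRING) AS "<property_added_in_v2>",   -- not captured by version 1 feeds
               "Load_Time"
        FROM "Collection"."<source_name>"."<subject_name>_V1_T"
        UNION ALL
        SELECT "Effective_Date",
               "<property_1_name>",
               "<property_added_in_v2>",
               "Load_Time"
        FROM "Collection"."<source_name>"."<subject_name>_V2_T";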

Another approach is to load each version of the data into a single table and distinguish the different versions at the row level so that the consumer can know how to interpret the versioned rows.  This is often made simpler by creating, once again, a relational view that implements the interpretation logic:

[Diagram: a single table holding all versions of the fed data, with a view interpreting the versioned rows.]

This latter approach can be difficult to implement in some situations because of limitations on the structure of the physical table, itself.  I.e., it cannot always accommodate the differences between versions.  Hence this is not a recommended practice as a general solution.

Another approach is similar to the previous one, but the subject table is made extremely generic such that it can accommodate any possible difference between versions.  This typically entails the use of an array of generic character strings of maximum length.  But that approach is limited by the size of the array and the max size of each string.

Another approach still, addressing the array limits of the prior approach, is to use a generic key-value pair table to land each of the attributes as a separate row in the subject table.  This is certainly a flexible scheme that should accommodate any version differences.  But it also strays very far from our desire to model the data in a form that is easily interpreted by the consumer (whether it be a real person or an ETL process).
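A sketch of such a generic key-value landing table (pseudocode; all names hypothetical):

    CREATE TABLE "<source_name>"."<subject_name>_KV_T"
        ("Effective_Date"  DATE      NOT NULL,
         "Record_Id"       STRING    NOT NULL,   -- identifies the source record within the feed
         "Property_Name"   STRING    NOT NULL,   -- e.g., 'Address'
         "Property_Value"  STRING    NULL,       -- the fed value, as a string
         "Load_Time"       TIMESTAMP NOT NULL);

Each fed record then lands as one row per property, so a property added or renamed in a later source version simply appears as new "Property_Name" values rather than requiring a structural change.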