Posts: 192
Joined: Wed Aug 15, 2012 8:18 am

Data Warehouse – Slowly Changing Dimensions SCD (Type 1 vs Type 2)

Postby slechtaj » Thu Jun 18, 2015 12:24 am

Data warehouse (DW) structure may differ depending on what Slowly Changing Dimension (SCD) model we choose. Among all SCD approaches there are two that are the most frequent: so called SCD type 1 and SCD type 2. I will discuss how to implement these two SCD types from ETL perspective in CloverETL.

SCD Types
The difference between type 1 and type 2 is usually only in history tracking. While in the first type (SCD type 1) DW does not preserve historical states of dimensional data at all, in the second one DW holds the entire history of dimensional data (records are enriched by additional flags, such as date ranges that describes the period when the record is valid for).

For those who are not familiar with SCD types, I recommend reading the Wikipedia article at first.

Data Warehouse Structure
In this example, there are the following tables:
  • fact_item (Fact table with quantitative sales data: Units Sold, Units Paid, Units Shipped)
  • dim_date (Date dimension – Type 0 SCD type used)
  • dim_item (Product dimension that were sold)
  • dim_store (Dimension of places where aforementioned products were sold)
  • dim_customer (Dimension of customers that bought above products)
Fact table that holds information on how many items were sold, paid and shipped is connected with dimensional tables using so called surrogate keys – these are PRIMARY KEYS of dimensional entries.

Depending on what SCD type is used dim tables may contain other fields. This is the case of SCD type 2 and Item, Store and Customer dimensions, which besides the regular data (surrogate keys and dimensional values) contains also two other columns (start_date and end_date). These two fields describe the period when a particular entry was used.

ETL Process
Regular ETL process may usually be divided into three steps:
  • Extraction
  • Transformation
  • Loading
Loading is the last phase that is subject of this example. The previous two phases are not described in this example. I have already prepared source data – the files are located in data-in folder in both projects.

02-dim-update.png (24.71 KiB) Viewed 36201 times

As you can see on above picture, into the loading phase data goes already divided into separate streams (separate records for each dimension and fact table). This is even more obvious if we drill down to the subgraph (DimensionUpdate).

RunETL.grf - general ETL graph
01-etl-general.png (14.49 KiB) Viewed 36201 times

Although each dimension is treated separately they use the same subgraph for the update. The only difference is in input data (and metadata). The subgraph receives the new source data on input port #0 and data from data warehouse on input port #1. These two record sets are compared to each other using DataIntersection and depending on whether the compared records matches or not data is inserted, updated or left without a change. The way this is done depends on the SCD type. So I will go through these two types in separate chapters.

SCD Type 1
As I have already mentioned above, SCD type 1 does store one record for each dimension value. Example will tell this better: Let’s say a customer replaced their contact person, but we already have their previous contact person in the dimension entry. In this case, the new contact person simply replaces the former one. That means we always keep the latest value only. Now, let’s take a look how to implement this in CloverETL.

LoadData.sgrf (SCD type 2)
04-scd2.png (58.33 KiB) Viewed 36201 times

As you probably know, DataIntersection has two input and three output ports. As I said above, the two input ports are occupied by the source ($in.0) and data warehouse data ($in.1). The three output ports return the following records:
  • $out.0 – not-changed records (contained in $in.0 only)
  • $out.1 – changed records (contained in records on both input ports)
  • $out.2 – data found only in data warehouse
From this it is obvious, the data from the $out.0 needs to be inserted as new into dimensional table. While the second output port ($out.1) contains data that needs to be updated in dimensional table. The third data should not be updated at all as such data is only stored in data warehouse.

Since the first two output ports contain data that needs to be somehow imported into database, we need to compose DML statements for them and run them using DBExecutecomponent. Data from the first port needs to be inserted, so we compose the INSERT statement for them (see the graph for more information). And because we do not need to preserve the historical data for existing records, we can simply create UPDATE statement for data from the second output port.

SCD Type 2
The process is almost the same. The only difference is that data that needs to be updated cannot be simply updated because we need to preserve the former record. Due to this fact we need to update the former record at first and after this is done, we ca insert the new record (e.g. a new contact person for the same customer).

LoadData.sgrf (SCD type 1)
03-scd1.png (44.49 KiB) Viewed 36201 times

As you can see, the difference from the ETL perspective is not that big. An interesting part of this ETL part is the DML statement composition, which is done using CTL transformation in Reformat components. In SCD Type 2 example, pay a special attention to aforementioned additional fields “start_date” and “end_date”.

Fact table loading
At this point, all dimensions are updated and we can continue to loading data into the fact table. Fact data of course cannot be inserted before all relevant dimensions are updated. In order to do this, we can place the FactUpdate subgraph into the later phase than DimensionUpdate subgraph. The FactUpdate itself may look like as follows:

05-fact.png (54.82 KiB) Viewed 36201 times

Readers should bear these graphs as just an example of how the basic data warehouse ETL processes may look like. For detailed information, both mentioned projects are attached.

Project installation steps
It is important to do the below steps in the given order.
  • CREATE DATABASE: the default configuration uses DB name "example_dwh_scd1" / "example_dwh_scd2" (depending on the project); example has been designed for MySQL database.
  • CREATE TABLES: Run the SQL statements from db-scd-type-1.sql and db-scd-type-2.sql files on previously created database. These files can be found in data-init folder in each project.
  • UPDATE CONNECTION FILE: Connection configuration file needs to be set the way to match your database configuration (host, credentials, DB) configuration. Each project contains one configuration file (DWH-SCD-1.cfg or DWH-SCD-2.cfg) in "conn" folder. More information on how to configure the database connection can be found in our documentation.
  • INITIALIZE DATA DIMENSION: Date dimension is of SCD Type 0 - therefore data in this table will not be changed in future. Due to this fact, and in order to keep the example as simple as possible, we will insert required data into the database table upfront. In order to do this run the UpdateDateDimension.grf graph (located in graph/InitProject folder). It is important to run this graph in both projects since each inserts data just into the database table used by the particular project.
  • If you have done all aforementioned steps, you area read to start RunETL.grf
Attached data
Both projects contain the following files:
  • README.txt - contains steps you need to follow in order to successfully run the ETL process
  • data-init/db-scd-type-2.sql, or data-init/db-scd-type-2.sql (DB definition files – CREATE TABLE statements)
  • graph/InitProject/UpdateDateDimension.grf inserts data into Date dimension – should be run before running RunETL.grf
ETL Project (SCD Type 1)
(2.34 MiB) Downloaded 1109 times
ETL Project (SCD Type 2)
(2.34 MiB) Downloaded 1037 times