What  is an SCD table?

An SCD (Slowly Changing Dimension) is an attribute that stores and manages both current and historical data changes over time. 

It is often the case that (for the sake of auditing purposes) an organization may need to see the “state” of the data as it was at a certain point in time. For example, products can change price, cost and composition over time. Employees are promoted, get fired, change departments or receive new titles. These changes occur unpredictably and sometimes frequently too. Based on the need for a business to understand and analyze both “as was” and “as is” states, various SCD types need to be created in dimensional modeling.

There are many types of SCD tables, but for the purpose of our discussion, we will be focusing on SCD Type 2 tables: Creating a new row in the dimensional table every time a change occurs in an attribute that we wish to track. This technique keeps track of all the historical and new data with proper versioning using status flags (is_active) and timestamps (effective_from and effective_to).

Implementing SCD Tables in Matillion and dbt 

In this blog post,we will show you how to implement SCD Type 2 in Matillion (using the Detect Changes Component) and in dbt (using Snapshots). To demonstrate SCD concepts, we will create a sample use case leveraging a subset of customer data. In this example, let’s assume one of our customers changes their phone number. This is what our customer data looks like at the source level:

Matillion

To create an SCD in Matillion, we used the Detect Changes Component in a transformation. Assuming you have a fully set-up Matillion instance, you can follow the documentation here 

Step 1: Compare

Determine your primary key (customer_id in our example) and columns for which you want to track changes.

The Detect Changes component scans two separate inputs, and creates a new column indicating if data has been inserted (new), deleted, changed, or unchanged (identical).

This is reflected in the indicator column as follows:

Indicator Description
C Changed: the record is present in both tables, with different values, but with the same ID.
D Deleted: the record is present in the master table, but not in the second table
I Identical: the same record is present in both tables with no changes.
N New: the record is not present in the master table, but is present in the second table.

In Detect Changes, there are 3 main properties:

  1. Master Table – The target table you want to compare your new records with
  2. Match Keys – Column(s) that act as a primary key that is/are present in both input tables 
  3. Compare Columns – Columns in which you want to track changes 

This component by default adds a column called indicator which shows the status of the record (C, D, I or N)

For initial load, this is what the dimension would look like keeping in mind the job ran on 01-02-2022.

Step 2: Update

This identifies changed records. In this step we retire the old records by making the following changes. Assuming this job ran on 01-04-2022: 

  1. Effective_from – The date when record is inserted
  2. Effective_to – (effective_from – 1) date of the new active record   
  3. Is_active – old record will change is_active to false

Step 3: Insert

This step inserts new and changed data into the target_table dimension. 

  1. Effective_from – The date when record is inserted
  2. Effective_to – This will be a future date for new/changed records which are active
  3. Is_active – new/changed records will have is_active set to true

Note: In this case we have set the SCD fields to a date format. ence, the logic to retire an old record is based on the previous day. You can also change this to use timestamp logic based on your business requirements. 

 

Once this table is set-up, you can get all current active customer records by simply using the following query:

And, to select historical data for a particular customer you can query:

Scheduling Jobs:

Matillion allows you to run a job automatically at a predefined time interval with the flexibility to set up multiple schedules. You can find more information on Manage Schedules within the Matillion documentation. Matillion also offers notification about the job status using the SNS Component.

dbt

The information below assumes you have your dbt implementation up and running. If you need help in setting up dbt, you might find this link useful. 

Step 1: 

Determine the primary key in your dimension model and decide on the strategy you want to use to track your dimension. Options include: 

  1. Timestamp – uses a timestamp row to determine if the row has changed. In this method we are filtering on the time stamp and check for records that have changed/ added since last run  
  2. Check – uses a list of columns to check for updated data. If you want to just track changes for selective columns you can use this strategy. It will only check against the columns mentioned and monitor for changed data.

Step 2: 

Config the model as a Snapshot. This will be a sql file in the snapshot folder.

Step 3: 

Create our initial load using run dbt snapshot.

On the first run it will build the target_table with 4 columns that are created by dbt: 

  1. dbt_scd_id – A unique key generated  by dbt for each snapshotted record
  2. dbt_updated_at – The updated_at timestamp of the source record when this snapshot row was inserted
  3. dbt_valid_from – The timestamp when this snapshot row was first inserted
  4. dbt_valid_to – The timestamp when this row became invalidated.Active or current record will have null value for this field 

Step 4: 

Run an incremental load using run dbt snapshot

dbt will scan the underlying data and append new records based on the configuration in your snapshot model. When you run dbt snapshot, dbt creates a set of records that have been updated since the last time the snapshot was run on that table and then merges it. It does not duplicate data if the data hasn’t changed. 

In order to track changes to your dimension automatically, you will want to schedule these jobs. Both technologies have the ability to do so and offer different ways to get notified about the status of your job run.

Scheduling jobs:

Run dbt snapshot on a schedule by following the documentation Running jobs in production.

It is considered best practice to monitor and set up alerts for job failure. dbt Cloud option has built-in email and Slack notifications as well for alerting such incidents. 

More Info: 

If you are considering either Matillion or dbt for data transformation or would like additional assistance with setting up SCD tables, we would be happy to assist.  Please use our contact us form or reach out to sales@dataclymer.com


 

 

Jui is a cloud data engineer with a passion for full stack analytics. Her expertise include data modeling, integration and visualizations. She approaches complex projects with curiosity, organization, and excellent time management.