Introduction

This article will showcase a previous experience of mine implementing a continuous integration pipeline for a Snowflake dbt project that was used to build confidence when merging code changes into production.  I will talk about what continuous integration is, why it is important for data teams, implementation steps in dbt, dbt Cloud and Snowflake and finally some nuances to consider throughout the process.

What is Continuous Integration?

Continuous integration (CI) is a DevOps practice that automates the integration of code changes into a central repository by running builds and tests to validate the new or modified code’s functionality.  There are a number of popular tools that can be used to facilitate CI including but not limited to: CircleCI, Jenkins and GitHub Actions.

A typical CI workflow might start with code changes being pushed to a remote repository or a pull request being opened against a production branch (using GitHub as an example).  From there, whatever facilitates the CI can kick off building and testing the project with the new or modified code – developers can then get feedback on things like whether or not their project built successfully and their tests passed and then feed those results into status updates, code reviews or other processes.

Why is Continuous Integration important for data teams?

Data is often used to drive important business decisions.  If stakeholders are presented with incorrect data, that could then lead to incorrect decisions being made.  CI provides an added layer that enables data teams to be confident that the changes they are making to their analytics codebase meet their expectations in terms of results, both with tests and the actual data.

Imagine a scenario where a rogue join condition produces cardinality in a table that reports revenue generated by marketing campaigns.  This error overstates revenue and causes the marketing team to adjust their quarterly budget based on incorrect results.  An event such as this could cause stakeholders to lose trust in the data they are supplied.  A CI process could have caught this error by running an automated action that validates results through test cases or expected data values.  Seemingly-simple sanity checks that validate expectations in the data can prevent huge headaches down the road.   

The Continuous Integration Process

Before jumping into the details, here’s a high-level overview of the process:

  • Developer makes changes to existing dbt models/tests or adds new ones
  • Changes are pushed to GitHub and a pull request is opened which triggers a special CI job in dbt Cloud
  • A dbt macro runs which clones the production database to a staging database in Snowflake
  • CI job in dbt Cloud builds new or modified resources contained in the pull request against the production-like data now in the staging environment
    • If everything is successful the code changes are merged into the production branch
    • If there are any failures they can be addressed and once fixed, pushed back to the remote branch which re-triggers the CI process from the top

The stack that was used comprised dbt (Core/CLI), dbt Cloud,and Snowflake.  In the following sections I’ll break down individual processes and how they were set up throughout the stack – I’ll also mention some nuances that were discovered throughout this implementation.

Snowflake

The Snowflake architecture was made up of 3 “environments” (databases) for analytics purposes – note that the names are generalized here to demonstrate flexibility:

  • “DEV”: where developers built models, ran tests etc with dbt installed on their local machines.  This was a sandbox environment for all development efforts.
  • “PROD”: where production deployment jobs ran which were orchestrated by dbt Cloud.  This was what BI tools and analytics resources connected to.
  • “STAGING”: where the CI pipeline ran which was also orchestrated by dbt Cloud.  This database was a production-like environment.

The key in being able to confidently merge code changes was ensuring that models, tests etc. all ran successfully against data that were as close to production as possible.  This was an extra layer on top of everything running successfully in development in the event that the development database had gotten out of sync with production.

Initially, a dbt Cloud deployment job pointed at the staging database would execute standard commands such as dbt run, dbt test etc. in order to refresh staging.  This worked but meant waiting for potentially long-running jobs to complete (especially with complex incremental models) before a developer was able to see the results of their code changes.  

The solution adopted to solve for this was cloning.  Rather than rebuild the dbt project to populate the staging environment, Snowflake’s zero-copy cloning feature provided a convenient and fast way to take a snapshot of the production database and create a derived copy as the staging environment.  I will talk more about how cloning was facilitated in the following dbt section.

Nuances

  • A popular architecture for dbt projects built on Snowflake is to use separate schemas for “dev” and “prod” rather than separate databases.  This CI process can still work for schemas but will require some modifications.
  • Snowflake outlines a number of cloning considerations, however here are a couple notes on things that jumped out throughout the implementation and usage of this process:
    • Cloned objects do not retain granted privileges from the source object.  Depending on how permissions are structured, this could mean handling grants after cloning to make sure your CI process has the correct permissions.
    • Zero-copy cloning means the cloned object shares the underlying storage of the “parent” object until changes are made.  Additional storage costs could be incurred depending on the changes made to the cloned database and whether or not the clone persists.
  • Cloning speed varies depending on the amount of metadata Snowflake has on the object being cloned.  Snowflake’s cloning process does not copy the object’s contents and instead copies the object’s metadata.  Further reading can be found here.

dbt

In order to automate the cloning process described above a dbt macro was created.  This macro handled the execution of the clone command that refreshed the staging database with data from the production database.  Since macros are reusable pieces of code, this was tucked into the dbt Cloud CI job as an operation and removed the overhead of anyone having to manage manual execution of the cloning process.

Nuances

  • As additional storage costs were mentioned in the previous section when cloned objects are modified, the usage of macros could be extended to include a “teardown” process that drops the staging database after the CI job is completed.

 

dbt Cloud

CI is built into dbt Cloud and enabling it is fairly straightforward.  Prior to implementing CI here, dbt Cloud had one deployment environment that orchestrated production jobs.  Once CI was adopted, a separate deployment environment was created that orchestrated a special job that ran against the staging database in Snowflake.  This new “staging” deployment environment had one orchestration job that was responsible for running the CI flow.  This job had a webhook trigger enabled that ran it once a new pull request was opened on GitHub.

Once a pull request was opened and this job was triggered, the first operation ran the dbt macro discussed above and cloned the production database into the staging database.  This ensured that each CI build was running against data that were as close to production as possible.  The next operations executed typical dbt commands such as dbt run and dbt test that built the project and ran tests against the staging environment.

Cloning was one part of the equation in the CI process that enabled developers to get quick feedback on run results – the other part was running only new and modified resources as part of the CI job to prevent irrelevant parts of the dbt project from slowing down the run.  Slim CI is a feature that helped achieve this.  By setting the CI job in dbt Cloud to defer to the previous run state of one of the production deployments jobs, dbt Cloud would use the deferred job’s artifacts to compare state against the CI job and determine the new and/or modified resources.  This meant it would only run models/tests that were relevant to the respective pull request that triggered the CI job which greatly reduced the amount of time it took before developers could review results.

If all models and tests that were part of the CI job were successful, the developers had an extra layer of confidence that the results were as expected.  If any models or tests failed, the developers quickly identified them and began working on fixes without these failures reaching the production data and being exposed to end-users.  Once code changes were made they were then pushed back up to the remote branch used in the respective pull request and dbt Cloud would pick up the new commit(s) and kick off the CI process from the top to build and test the new changes.  

Nuances

  • Depending on the size of the data team or how frequently pull requests are made/updated, a team may not want to run the cloning operation with every CI job execution.  Some thoughts on alternative solutions are:
    • Tuck in a dbt macro that runs the clone operation into other dbt Cloud jobs such as those that build production deployments.  For example, if dbt Cloud runs a production job twice daily, the clone macro could be added at the end of that job to clone production into staging after production is refreshed.
    • Create a separate dbt Cloud job that only runs a clone macro and schedule this job to the appropriate frequency that is needed.
  • By populating staging with a clone of production rather than using dbt commands to rebuild the staging database, incremental models can be run and tested without necessarily having to rebuild them from scratch.  However in the event that a CI job contains changes to an incremental model, adjustments may need to be made to support the full refresh.
  • Slim CI does a pretty good job at detecting new and/or modified models.  Depending on development frequency and how often the production job that the CI job defers to runs, it is possible that other resources in a dbt project not directly relevant to a pull request might get detected.  

Conclusion

Once this process was put into action, the number of failures and bugs that made it into the production database decreased significantly.  CI in dbt Cloud provided a mechanism that saved countless headaches down the road by avoiding data quality issues and ultimately increased the organization’s trust in the data to a level unparalleled to that of before.

Unfortunately, not all data tools integrate well with version control or provide simple mechanisms to configure a CI pipeline.  As seen with this process, dbt Cloud provides the basic features out of the box and some extensions that leveraged other dbt and Snowflake functionality created an even more robust process that can be tweaked to meet an organization’s use case.  For dbt projects not orchestrated via dbt Cloud (using something like Airflow for example) there are even more CI possibilities.


 

 

Zak is a cloud data engineer focused on using his experience with SQL and Python to build analytics tools that help solve business problems.