dbt

dbt is a free and open source python-based command line tool that transforms data inside a data warehouse. It renders and runs SQL command for the user. To write dbt code, you only really need to know how to write SQL select statements.

dbt's documentation is the best place to get started learning about dbt's concepts and how to use it.

In the EDU code, we use dbt to manage our data transformations inside Snowflake. We have airflow issue a command to run a dbt job, and dbt then issues SQL commands that build our data warehouse in ordered steps.

Some things to know about dbt and why we are using it in the EDU code:

  • All transformation of the data happens using dbt and happens inside the data warehouse. This allows us to have a clear lineage of data from start to finish, and to encode all business rules and assumptions in version-controlled code

  • dbt calls a dbt select statement a "model". Sometimes models are materialized as tables, but they can also be views.

  • dbt infers the dependencies of the database build logic by detecting the dependencies of models, so we do not need to instruct dbt about which models to run before others.

  • dbt uses version controlled seed tables to capture things like the mapping of categorical values of data to standard values. We make the use of seed tables in EDU to do things like avoid storing specific Ed-Fi descriptor codes in the dbt models logic.

  • dbt allows us to re-use logic through macros and create loops and conditional logic in our data transformation SQL

  • dbt has a framework to write tests on data quality and warn or error to stop a run if pre-set assumptions we have made about the data are not met on a scheduled run

  • dbt allows us to abstract away from a particular database's flavor of SQL. The code currently only supports Snowflake SQL but we intend to support other databases in time.