Skip to content

dbt Style Guide

Model Naming#

See below for more information on how EDU organizes it's data flow.

  • Base and Stage models should have prefixes that specify the step and the source, separated from the table name by a double underscore. For example: base_canvas__users, stg_canvas__users.
  • Base and Stage models should be easy to track back to a single source system (or system type) and a single resource within that system. Someone reading the code should not have to guess where a table comes from -- keep names clear and consistent.

  • Analytics tables are divided into three basic categories:

    • dim_ for dimensions
    • fct_ for fact tables
    • msr_ for measure tables
  • Analytics tables should be singular, e.g. dim_student, not dim_students
  • Use dimension table names in primary and foreign key naming. Keys should all start with k_ and clearly indicate which dimension they refer to, e.g. k_student for dim_student
  • All dimensions must have:
    • A hashed primary key
    • A natural unique key, potentially composed of multiple columns
  • All fact and measure models must have:
    • A declared and tested natural key, which may be composed of multiple columns
  • All models should have tests and documentation in a yaml file with the same name and in the same folder as the model.

Organizing Columns#

Try to maintain a logical ordering of columns in dbt models. We encourage these 4 basic groupings:

  • Primary data: the key information describing the table. This includes the primary keys (in descending hierarchical order), followed by other relevant unique attributes, like name
  • Foreign keys: (natural or surrogate) keys meant to facilitate linking to other data.
  • Logical data: any remaining descriptive or factual elements, grouped in a sensible way. (e.g. more salient elements first, groups of related elements together, etc)
  • Metadata: data about the loading process, like timestamps

Configuration#

  • Prefer to set materialization levels and target schemas in the yml files. Only override within the model for special cases. (If you have a lot of {{ config(materialization=...)}} at the top of your models, move this out to a yaml file)

Macros#

  • File name must match the macro name
  • Macros should be documented, including their parameters and usage

Data Cleaning#

ID Systems#

Schools typically have multiple ways of identifying people and places because those entities are defined in multiple source systems. For instance: a student may have a local_student_id, state_student_id, sis_student_id, and lms_student_id.

For that reason, a column called student_id is meaningless. It is critical to distinguish between the various IDs coming from different source systems so that there's a clear way to integrate data across systems.

If we have a data source with an inconsistent spec (e.g. an assessment vendor provides a column called student_id with varying interpretations by district or row), a critical component of data cleaning is to split this out into separate fields so there's no ambiguity on how to join it.

An exception to this rule is the student_unique_id: this is the unique ID used by Ed-Fi, which is it's own semantically meaningful type of ID. Under the hood this may be any of the other common ID types, but the Ed-Fi ID becomes the lingua franca against which all other IDs are joined.

Seeds#

Seeds are a way to load data from CSV files into the data warehouse. Because these CSV files are kept in the Github repo, they can be reviewed and version controlled. This is most suited to loading relatively static data that changes infrequently, such as simple lookup tables and configuration.

Anything with more than a few hundred rows and a handful of columns should likely not be a seed table.

EDU Data Organization#

Understanding dbt Schemas#

Within a dbt project, you can configure the schema in which an individual model or folder full of models will be built. For example, we build the core warehouse dimensional tables in a schema called wh. However the actual schema in the database will be different: it will combine environment information with the configured schema name to produce the final schema name. So in production, wh becomes prod_wh, but in other environments it might be dev_wh or test_wh.

This is a great feature for isolating multiple code environments from one another, but it can be confusing at first when the database doesn't match your expectations. Just remember: in the database, schemas are named with the format {environment}_{dbt-schema-name}.

Seeds#

dbt schema

seed

Much of the configuration EDU is handled through seed files. Because Ed-Fi uses so many customizable descriptors, we often need to map these strings to a machine-interpretable simplification. For example, this crosswalk controls how CalendarEventDescriptors are mapped to the simple boolean is_school_day. This allows us interpret arbitrary calendar descriptors programmatically.

Seeds in EDU are generally used in this way: as helpers, translators, lookups, etc. They are loaded to a seed schema and used by tables that need this kind of lookup or translation, but rarely do they pass forward to the warehouse as stand-alone tables.

Staging Layer#

The staging layer in EDU has a very rigid structure, which makes it very easy to navigate: there is little room for creativity in naming, and little ambiguity on where something should go.

The rules below are general and should apply to any source data, but we have more rules we apply to Ed-FI data, which are documented in the readme.

Base Models#

dbt schema

stage

Base models are transparent views on the raw data: they apply no rules that would remove records. They show the data as it was loaded incrementally, which means uniqueness constraints and deletes are not enforced. These go in the stage schema as views.

Base models enforce the following rules:

  • Table names are structured as base_{source_system_abbrev}__{resource_name}
    • The Source System abbreviation uniquely identifies a source system and aligns to a schema in raw
    • The resource name follows the names from a source API or database, converted to snake_case
  • Base tables only source data from raw tables via the source macro
  • Base tables only represent one raw object, and do not perform joins
  • Metadata from the loading process is preserved
  • Flat JSON dictionaries are unpacked to columns, nested lists are passed forward untouched
  • Data types are set
  • Column names are converted to snake case

Stage Models#

dbt schema

stage

Stage models are cleaned up: they apply deduplication rules to enforce a particular grain, remove deleted records, and otherwise prepare the data for analytical use. They are materialized as tables for efficiency of downstream queries. These go in the stage schema as tables.

  • Table names are structured as stg_{source_system_abbrev}__{resource_name}
  • Nested lists are unpacked into sub-tables with a further suffix: resource_name__list_name
  • Stage tables have a one-to-one relationship with base tables, except for unpacked lists.
  • Deleted records are removed
  • Uniqueness constraints are enforced

Build Layer#

dbt schema

build

Many data models can go straight from staging to the warehouse, but when there are particularly complex pieces of logic that need to be factored out either for code cleanliness or to be re-used by multiple warehouse models, these go in the build schema.

These are essentially build artifacts of the warehouse's construction that may not be directly useful to end users unless they are interrogating the build process itself. You can think of them as intermediate models.

Joins are allowed in build models, and the naming scheme is less strict -- they are named for clarity of their purpose rather than clarity of origin, as they may combine data from multiple sources.

Warehouse Layer#

dbt schema

wh

The core warehouse is a Kimball-style dimensional model, so its tables are dimensions, facts, and measures. These tables are organized for consumption by analysts, so their naming convention prioritizes the analytical meaning of data over its origin. Ideally an analyst should be able to use tables in the warehouse without even knowing what source system the data came from, as the naming conventions and keys all follow the same conventions.

Dimensional models seek to simplify data structures from the highly normalized form of an operational system to a format optimized for ease of analytical use: fewer tables, fewer joins, more consistency.

The core dimensional warehouse tables go in the wh schema.

Metrics and Marts#

dbt schema

Varies by use case: marts can be named by the team that uses/owns the mart, by the product that uses it, etc. Prefixing with mrt can keep these organized.

Data marts are separate schemas built on top of the core warehouse layer which collect data relevant to a particular working group, team, or product. They often consist of 'curated views', or simplified pre-joined tables that require less SQL sophistication to work with than a dimensional model. If a dimensional model is a simplified denormalization of operational data, curated views are even simpler denormalizations of a dimensional model.

EDU does not come with Data Marts pre-built, as these are generally very specific to local context. Marts should be built in their own schemas, primarily sourcing from the wh models.

In our managed Stadium product, we use a mix of approaches for these things:

  • Models that may be reused or shared across projects are collected into dbt packages which can be installed in any EDU or Stadium implementation
  • Models that are specific to a single partner (including local metrics and reports, or team-specific data marts) are built directly in the project repository.

In the long run you may want to have multiple teams governing separate parts of your warehouse structure. dbt Packages are a great way to accomplish this. For example: if the assessment team owns the structure of their own data mart, this can be factored out into a standalone package that they can maintain, and the core data team can review, approve, and deploy new releases to production.