Skip to content

SQL Style Guide#

This is the style guide we try to follow in our packages and in projects we manage.

While adopters of the EDU framework need not adopt our exact style guide, it is important to have a style guide, as this eases the cognitive burden of hopping around the codebase. Generally one should not be able to tell who wrote a script just by looking at it.

General Code Practices#

  • Always think very carefully about the grain of each table - what makes an observation unique? Make sure that grain is enforced and tested.
  • Avoid changing the grain of any table unless absolutely necessary, and be prepared to handle any downstream consequences of such a change (including talking to other teams).
  • Make sure to read any documentation of source data - it's important to understand what data actually means, not what it looks like it might mean. When source data is poorly named, give it better names.
  • Do not optimize for fewer lines of code. Clarity is more important than concision.
  • Be consistent.
  • Be explicit.

Commenting#

  • Comment anything that might be surprising or confusing, and to explain business logic.
  • Comments should explain why or in certain cases how code does what it does, but never explain what simple code is doing. Don't make comments like join the student dimension. The code already says that. Do make comments like left join because we need to include missing values in the count
  • Commenting is largely for expressing developer intent - if code needs to be complex to work around an edge case, explain that. It can sometimes be difficult to determine what code is load bearing - is a quirk there by chance, or is it critical? Even comments that express a lack of intent (like: "there's probably a better way to do this") can be helpful.

Field Naming and Reference Conventions#

  • Field names should be snake_case
  • Avoid SQL keywords as column or table names
  • In the staging process, name for clarity of origin. In the analytics warehouse, name for clarity of meaning.
  • Primary or foreign keys should all start with k_ , e.g. k_student
    • The piece following k_ should uniquely and clearly identify the table it references. If you see k_student in a fct table, you should be able to infer the existence of a table dim_student which defines it
    • If a fact table requires multiple foreign keys to the same dimension, use double underscores to clarify these uses. For instance, we could have an enrollment table with k_user__student and k_user__teacher : both join to the same generic user-dimension (dim_user), but with different meanings. Without the double-underscore a user may infer the existence of a dim_user_student , which does not exist.
  • Primary keys on dimensions should generally be machine-generated; either auto-incrementing integers or hashes of the underlying natural key.
  • Fields which comprise the grain of the table should always be the first columns in the table
  • Any columns with ambiguous meanings like id , name , type should always be prefixed by what it is identifying or naming, i.e. account_id , school_name , user_type
  • Avoid overly short or obscure aliases. dim_student doesn't take long to type, and is much easier to interpret than s . Generally prefer the full table name.
  • Boolean field names should start with has_ , is_ , or does_
  • To distinguish between source-oriented fields and process-oriented fields (e.g. updated_at from the source vs updated_at from the ELT process, consider prefixing internal fields with an underscore: _updated_at . At the very least, be consistent in how these things are distinguished.

Dates#

  • Timestamps should end with _at , e.g. last_updated_at
  • Dates should end with _date , e.g. enrollment_start_date

Use CTEs (Common Table Expressions), not subqueries#

  • CTEs read linearly, subqueries read inside out. CTEs also help you break up logic into discrete steps.
  • In modern databases CTEs are pass-throughs, so they're pretty efficient.
  • Use CTEs to reference other tables (at the top of your query). Think of these as import statements: you are declaring the data objects you intend to reference. These imports should not contain logic - they are declarations, not transformations.
    • Exception: simple convenience views that are intended for end users are easier to interpret if they're expressed flat. refs can be stated in the from /join clauses for such views.
    • Exception: in certain cases it's more clear to express simple filters in the 'import' of a table than in a separate CTE.
  • Where performance permits, CTEs should perform a single, logical unit of work
  • CTE names should be as concise as possible while still being clear
  • If the logic in a CTE is confusing or surprising, first try to find a way to break it up and rewrite it more clearly. If you can't, add comments to explain what is going on.
  • CTEs that are duplicated across models should be pulled out into their own models

Indentation#

  • Indent the entire query one level within CTEs
  • When selecting, give each column its own row, indented one level. With the exception of select * , which can be on one line

    select
      school_code,
      state_student_id,
      sum(discipline_events)
    from some_table
    group by 1,2
    
  • The primary keywords (select, from, join, where, group by, having, etc) should be aligned on the left.

  • Each join condition should be on its own line, indented one level from the join (likewise for where and having clauses with multiple conditions)

    from some_table
    join other_table
      on some_table.x = other_table.x
      and some_table.y = other_table.y
    where some_table.x > 10
      and some_table.y = 5
    

General SQL Style#

  • Indents should be four spaces (not tabs), line-breaks should be Linux-style \n , not \r\n (both configurable in your editor)
  • The as keyword should be used when aliasing a field or table
  • Aggregations or window functions should generally be the last columns specified
  • Grouping and ordering should be by number rather than name, and the grouped columns should be sequential and on the left. e.g. group by 1,2,...,n not group by 4, 7, 2
    • Note: now that Snowflake supports group by all, it is generally preferable to use this over explicit numbering.
  • If you're grouping by more than a handful of columns, something has usually gone wrong. Likely you should have aggregated sooner.
  • Know the difference between union all and union. You probably do not need union.
  • Prefer lower-case SQL keywords
  • distinct should be on the same line as select
  • While distinct has valid use cases, prefer an explicit dbt_utils.deduplicate for deduplication.
  • Prefer != to <>
  • Prefer :: to cast
  • Strive to control important strings or parameters in a central, easy-to-find place, and definitely don't duplicate them all over the codebase.
  • Avoid long case statements that just recode values. Find a way to do a programmatic transformation or use a crosswalk in a seed table.
  • Prefer iff to a single-line case statement
  • Align column aliases where practical
  • Avoid double-quotes on columns unless absolutely necessary.

Joins#

  • Specify join conditions using the FROM table first, JOINed table second:

    -- good
    from table_a
    left join table_b
      on table_a.school_id = table_b.school_id
    
    -- bad
    from table_a
    left join table_b
      on table_b.school_id = table_a.school_id
    
  • Always prefix column names with the source table when joining

  • Avoid using, which has bad behavior in Snowflake. Specify joins explicitly.
  • Avoid right join

Other SQL Style Guides#

Gitlab Data Team

dbt labs Style Guide