SQL Style Guide#
This is the style guide we try to follow in our packages and in projects we manage.
While adopters of the EDU framework need not adopt our exact style guide, it is important to have a style guide, as this eases the cognitive burden of hopping around the codebase. Generally one should not be able to tell who wrote a script just by looking at it.
General Code Practices#
- Always think very carefully about the grain of each table - what makes an observation unique? Make sure that grain is enforced and tested.
- Avoid changing the grain of any table unless absolutely necessary, and be prepared to handle any downstream consequences of such a change (including talking to other teams).
- Make sure to read any documentation of source data - it's important to understand what data actually means, not what it looks like it might mean. When source data is poorly named, give it better names.
- Do not optimize for fewer lines of code. Clarity is more important than concision.
- Be consistent.
- Be explicit.
Commenting#
- Comment anything that might be surprising or confusing, and to explain business logic.
- Comments should explain why or in certain cases how code does what it does, but never explain what simple code is doing. Don't make comments like
join the student dimension
. The code already says that. Do make comments likeleft join because we need to include missing values in the count
- Commenting is largely for expressing developer intent - if code needs to be complex to work around an edge case, explain that. It can sometimes be difficult to determine what code is load bearing - is a quirk there by chance, or is it critical? Even comments that express a lack of intent (like: "there's probably a better way to do this") can be helpful.
Field Naming and Reference Conventions#
- Field names should be
snake_case
- Avoid SQL keywords as column or table names
- In the staging process, name for clarity of origin. In the analytics warehouse, name for clarity of meaning.
- Primary or foreign keys should all start with
k_
, e.g.k_student
- The piece following
k_
should uniquely and clearly identify the table it references. If you seek_student
in afct
table, you should be able to infer the existence of a tabledim_student
which defines it - If a fact table requires multiple foreign keys to the same dimension, use double underscores to clarify these uses. For instance, we could have an enrollment table with
k_user__student
andk_user__teacher
: both join to the same generic user-dimension (dim_user
), but with different meanings. Without the double-underscore a user may infer the existence of adim_user_student
, which does not exist.
- The piece following
- Primary keys on dimensions should generally be machine-generated; either auto-incrementing integers or hashes of the underlying natural key.
- Fields which comprise the grain of the table should always be the first columns in the table
- Any columns with ambiguous meanings like
id
,name
,type
should always be prefixed by what it is identifying or naming, i.e.account_id
,school_name
,user_type
- Avoid overly short or obscure aliases.
dim_student
doesn't take long to type, and is much easier to interpret thans
. Generally prefer the full table name. - Boolean field names should start with
has_
,is_
, ordoes_
- To distinguish between source-oriented fields and process-oriented fields (e.g.
updated_at
from the source vsupdated_at
from the ELT process, consider prefixing internal fields with an underscore:_updated_at
. At the very least, be consistent in how these things are distinguished.
Dates#
- Timestamps should end with
_at
, e.g.last_updated_at
- Dates should end with
_date
, e.g.enrollment_start_date
Use CTEs (Common Table Expressions), not subqueries#
- CTEs read linearly, subqueries read inside out. CTEs also help you break up logic into discrete steps.
- In modern databases CTEs are pass-throughs, so they're pretty efficient.
- Use CTEs to reference other tables (at the top of your query). Think of these as import statements: you are declaring the data objects you intend to reference. These imports should not contain logic - they are declarations, not transformations.
- Exception: simple convenience views that are intended for end users are easier to interpret if they're expressed flat.
refs
can be stated in thefrom
/join
clauses for such views. - Exception: in certain cases it's more clear to express simple filters in the 'import' of a table than in a separate CTE.
- Exception: simple convenience views that are intended for end users are easier to interpret if they're expressed flat.
- Where performance permits, CTEs should perform a single, logical unit of work
- CTE names should be as concise as possible while still being clear
- If the logic in a CTE is confusing or surprising, first try to find a way to break it up and rewrite it more clearly. If you can't, add comments to explain what is going on.
- CTEs that are duplicated across models should be pulled out into their own models
Indentation#
- Indent the entire query one level within CTEs
-
When selecting, give each column its own row, indented one level. With the exception of
select *
, which can be on one line -
The primary keywords (select, from, join, where, group by, having, etc) should be aligned on the left.
-
Each join condition should be on its own line, indented one level from the join (likewise for
where
andhaving
clauses with multiple conditions)
General SQL Style#
- Indents should be four spaces (not tabs), line-breaks should be Linux-style
\n
, not\r\n
(both configurable in your editor) - The
as
keyword should be used when aliasing a field or table - Aggregations or window functions should generally be the last columns specified
- Grouping and ordering should be by number rather than name, and the grouped columns should be sequential and on the left. e.g.
group by 1,2,...,n
notgroup by 4, 7, 2
- Note: now that Snowflake supports
group by all
, it is generally preferable to use this over explicit numbering.
- Note: now that Snowflake supports
- If you're grouping by more than a handful of columns, something has usually gone wrong. Likely you should have aggregated sooner.
- Know the difference between
union all
andunion
. You probably do not needunion
. - Prefer lower-case SQL keywords
distinct
should be on the same line asselect
- While
distinct
has valid use cases, prefer an explicitdbt_utils.deduplicate
for deduplication. - Prefer
!=
to<>
- Prefer
::
tocast
- Strive to control important strings or parameters in a central, easy-to-find place, and definitely don't duplicate them all over the codebase.
- Avoid long
case
statements that just recode values. Find a way to do a programmatic transformation or use a crosswalk in a seed table. - Prefer
iff
to a single-linecase
statement - Align column aliases where practical
- Avoid double-quotes on columns unless absolutely necessary.
Joins#
-
Specify join conditions using the FROM table first, JOINed table second:
-
Always prefix column names with the source table when joining
- Avoid
using
, which has bad behavior in Snowflake. Specify joins explicitly. - Avoid
right join