How we structure our raw tables

Structure of the raw database#

The EDU data model is divided into two primary components: the raw database and the analytics database. The raw database is where data from outside systems lands in its pristine raw format, and it is governed by the Extract-Load process. The analytics database is where all stages of data processing happen, and is governed by dbt.

Within the raw database we create one schema per source system. Out of the box, EDU has a schema called raw.edfi3, which is for storing raw data from Suite 3 Ed-Fi systems, but more schemas can be created as sources are added. See our guide on bringing in external data for more details.

While it's best to have one schema per source, what constitutes a source is up for some interpretation. If you're pulling tables from a vendor's database, that should probably have a schema all to itself. But if you have an array of miscellaneous flat files from various sources, this could be a single flat_files schema.

Within these schemas, we create one raw table for each table or file in the data source, named as closely as possible to the name of the table or file that will be loaded there to avoid ambiguity.

Structure of raw tables#

When loading new data sources to the raw database we nearly always convert it to JSON first, regardless of whether the source is a database, CSV, or API. This may seem odd, given that we're going to split it back up into a tabular format in dbt, but doing this has a few advantages:

All raw tables have essentially the same structure, which lowers the burden of creating and maintaining these tables
We don't have to worry about slight variations in structure breaking the loading process: the data will load no matter what, and we can clean it up downstream.
No migrations: we can handle changes to the source data structure in the dbt code downstream without changing tables or migrating data between versions
We don't have to determine the types of every column or maintain lots of unique DDL

Our raw file format can vary a bit by source, but is largely the same: metadata about the loading process, followed by a single variant column containing the actual data. We create a template table first, and then each new data source can be created as:

create table raw.{data_source}.{table_name} like raw.edfi3._template_table;

Here's an example of a generic template table:

create or replace table raw.public._template_table (
    filename varchar,
    file_row_number int,
    tenant_code varchar,
    school_year int,
    pull_date date,
    pull_timestamp timestamp,
    v variant

);

Our Ed-Fi template table contains a few extra Ed-Fi specific columns:

create or replace table raw.edfi3._template_table (
    filename varchar,
    file_row_number int, 
    api_year varchar,
    pull_date date,
    pull_timestamp timestamp,
    name varchar,
    ods_version varchar,
    data_model_version varchar,
    v variant
);

filename: the full path to the raw file in the data lake
file_row_number: the row within the raw file that this row came from
school_year/api_year: The school year that the file is about, or the year of the Ed-Fi ODS/API. Can be omitted if not relevant for a given source.
tenant_code: Only relevant in multi-tenant systems (such as states or collaboratives) -- keeps track of which district owns the data
pull_date: The date at which the data was extracted from the source system.
pull_timestamp: The full timestamp at which the data was extracted from the source system
v: The column containing the full JSON data for each row from the original source
ods_version: The software version of the Ed-Fi ODS/API platform
data_model_version: The version of the Ed-Fi data model that the ODS was using