Understanding the codebase#

edu is a framework for open analytics composed of many independent code modules. This can be confusing to parse at first, but there's a reason for it: EDU is designed to be modular, such that individual components can be modified or swapped out as necessary.

These modules are loosely coupled, which means if you only wanted to use the dbt packages you could substitute your own extract-load and orchestration tools, provided the data landed in the database the way the dbt code expects. Likewise you could use only the extract-load components, swapping in your own data transformation code.

To use the framework as a whole, we've provided a project template repository as a starting point. This project template imports all the other packages and sets up the default configurations. It also contains code generation tools for scraping the set of Ed-Fi resources in your ODS, as these can vary by Ed-Fi version and extensions.

You can fork the project template by clicking the Use this template button in Github, which will create a new repository you own where you can configure and extend the framework.

This diagram shows how each module fits into the whole picture:

System Diagram

Once the project template is forked, it becomes the place where configurations and extensions for your implementation live, pictured as the blue box at the top of the diagram. This manages the installation and composition of the remaining components, which are described below.

Transformation: The dbt packages#

Starting from the right, we have the transformation layer. This is what builds the warehouse data structure. There are two core packages, but we also have several extension packages that can be layered on.

edu_edfi_source #

This package creates the staging layers of the warehouse, or what the Medallion Architecture refers to as the Bronze layer.

The goal of this package is to parse Ed-Fi JSON into a convenient tabular structure in a relatively unopinionated way, to prepare it for a more opinionated analytics structure downstream. This package could be used on its own to create a data warehouse entirely different than the one that comes with EDU.

The models in this package perform tasks like:

Create a table for each resource in Ed-Fi
Unpack the JSON into columns, apply nice column names and data types
Unpack nested lists into sub-tables
Enforce deduplication rules and handle Ed-Fi deletes
Make descriptors easier to work with
Create surrogate keys

edu_wh #

This package uses the data structures from edu_edfi_source to create a dimensional data model on Ed-Fi data, analogous to a Silver layer in Medallion architecture.

These warehouse structures can be used directly by analysts or BI tools, or you can add models to your project to create report-specific models and metrics, or create domain-specific data marts and curated views.

Orchestration and Extract-load#

EDU uses Apache Airflow to schedule, orchestrate, and perform the work of data pipelining. Airflow is the overall supervisor that runs extract-load pipelines and dbt jobs, and it runs a service that lets you monitor the health and history of your pipelines.

The project template is where DAGs are actually defined and configured, but the packages below contain most of the logic and helper functions that make this work.

edfi_api_client #

The Ed-Fi API client is an open source Python tool for interacting with Ed-Fi APIs. The Airflow code uses this client to extract data from Ed-Fi APIs.

edu_edfi_airflow #

This is the code that wraps the edfi_api_client in Airflow handlers. It provides Airflow Hooks and Operators for interacting with Ed-Fi itself, the storage layers of EDU, and tools like earthmover.

It also defines the Ed-Fi DAG factory which the project repo then uses to create particular workflows: on this schedule extract these resources from this Ed-Fi API and put the resulting data into the warehouse.

This defines the core extract-load functionality of the data pipeline.

ea_airflow_util #

This package defines various Airflow helper functions and DAG factories that are not specific to Ed-Fi. This includes tooling for orchestrating dbt runs, interacting with other tools like Google Sheets, SFTPs, AWS, etc.

Infrastructure#

edfi_airflow_cloudformation#

This package sets up the AWS version of the EDU infrastructure. This deploys a handful of AWS components, including:

A Ubuntu Linux server (EC2) to run Airflow
RDS Postgres for Airflow's metadata
Blob storage (S3) for the data lake
Parameter Store and Secret Manager for secret management
A VPC and network stack to contain this infrastructure and manage connectivity

This infrastructure can be deployed multiple times to create environments for development and testing.

Note that we are only using commodity cloud products, of which every cloud vendor offers a similar variant. This can be ported to Azure, Google, or other cloud providers relatively straightforwardly.

edu_db_setup#

This is the codebase for deploying the basic Snowflake infrastructure, including:

A role hierarchy for managing access to data
Databases, schemas, and table shells for the basic objects
Network rules and security structures
Warehouses (compute nodes) for a variety of use-cases
Service accounts for Airflow and dbt

Further modifications for custom team roles, user management, BI service accounts, sandbox environments and the like can then be layered on top.

Understanding the codebase#

Transformation: The dbt packages#

edu_edfi_source#

edu_wh#