This article is targeted toward individuals managing, maintaining, and developing in an edu system. We assume some familiarity with Python development, Apache Airflow, and dbt. This article should be helpful in understanding the system as a whole and how it was designed to work; for step-by-step instructions please see articles in the Guides section.
Overall ELT strategy#
The edu framework uses an extract-load-transform strategy to move data from Ed-Fi into Snowflake. This means that the data is moved out of Ed-Fi and into the Snowflake database exactly as it comes from the Ed-Fi API, and then is tranformed into analytics data through a series of steps within the Snowflake database.
Succesful setup of an edu project will create an Airflow instance with DAGs running on a schedule to extract and load data from the Ed-Fi API, land them in an AWS S3 bucket for storage, and then load them into unstructured raw JSON tables in Snowflake. This aligns with an extract-load strategy to move the data out of the source system without modifying or tranforming it in any way. Because of this, using Airflow as our orchestration engine is not a critical aspect of the system design -- the goal is to get the data extracted from the Ed-Fi API and loaded into the database in its raw form.
From there, we use dbt to tranform the raw JSON data from the Ed-Fi API into a dimensional data warehouse organized for analytics. This allows the full lineage of the data transformation to happen in the database and under the control of dbt, which allows for a more cohesive approach to organizing the steps of the transformations and testing the quality of the data. The dbt process also manages complexities like dealing with deleted data, creates quality control and inventory tables, and runs tests for data quality.
We intend to share as much common code as possible in edu , while creating space for relevant configuration parameters to adapt to the needs of education agencies. The goal is to not create forks of the primary package repos but to store any configurations and extensions in templated implementation repos. This means that changes to the package repos are the responsibility of all users of edu , while changes to implementation repos are under the control of the implementation team for an agency. This configurability applies to both the extract-load setup as well as the dbt data transformation. See code repositories for a full list of the relevant repositories.
The package repos will produce DAGs that extract data from Ed-Fi APIs and trigger a dbt run to process the data inside the Snowflake data warehouse. In the implementation repo, the implementation team can configure components of those DAGs, like the schedule they run on, or add new DAGs for additional data sources.
The implementation repo of an edu project includes the shell of a typical dbt project that imports package repositories. The package repositories contain the dbt models that transform the data. The /models/ folder if the implementation repo is empty at the start of a new project. The implementation repository is where an implementation adds custom models, customizes configuration & settings, and adds seed tables.
Ed-Fi allows for configuration of descriptors, which are usually categorical codes representing observations of data. For example, the edu code knows where to find the recorded instances of attendance events for students in Ed-Fi, but those codes could contain anything from a simple Attended or Absent to much more granular ways of capturing this information. In other cases descriptors might describe meaningful concepts for identification like "state student id" that we want to capture and standardize in the edu data model.
edu attempts to define a system for this interpretation of the underlying data without needing to write custom code containing these values. The idea is to take categorical information like attendance codes or performance levels on an assessment and turn them into standardized analytically meaningful concepts like "attended school" or "proficient on exam". We store this translation in dbt seed tables. This allows us to keep logic out of the code base that is specific to customizable descriptor codes in Ed-Fi. To get these analytics indicators to show up correctly in the data warehouse, you need to update the seed tables in the dbt implementation repo to align with the descriptors in your Ed-Fi ODS. There are also other configurable thresholds in dbt project yml files.
For a more detailed list of the configuration options in the projects see the reference guide for data warehouse configuration.
Developing airflow code for extract-load#
The edu framework contains a place for the implementation team to add additional DAGs to extract and load data from other sources. A developer can write python code to interact with an outside data source, use Airflow to manage credentials, and add code to create additional DAGs in the Airflow environment. For more information see the guide to customizing with external data
Developing dbt code for transformation of data#
framework allows for the layering on of additional dbt models inside the implementation repository. A developer can reference tables in the data warehouse, or custom sources added to the data
warehouse, to the
/models folder of the implementation repository. The dbt code is structured as a series of package dependencies, where the implementation repository imports the
warehouse package, which imports the source package. For details on setting up dbt development, see the local dbt setup guide.
Security management within Snowflake#
The overall security of the cloud system is discussed in the setup and infrastructure section of the documentation. Snowflake allows for a lot of flexibility in managing and granting access to users of the data warehouse. For information about the access roles created in a default implementation, see access roles.