Overview#
This article outlines the overall technical structure of edu. It assumes familiarity with AWS services and Snowflake, and it should help you understand the system and the decisions that went into it; for step-by-step instructions, please see "Guides".
Problem space#
edu facilitates reporting and analysis of education data. It brings Ed-Fi and (optionally) other data sources together in a standardized data warehouse which can then be used to generate reports, feed BI dashboards, conduct ad-hoc queries, fulfill research data requests, and more. It simplifies the work of gathering data and provides a single source of truth for education data and analytic metrics.
Overall architecture#
In edu, our intent is that everything is code — from infrastructure decisions (like server sizes) and security configurations (like SSO implementation) to business rules (like the definition of attendance thresholds) and access permissions. This means there is no ambiguity due to configuration drift, and common definitions are highly reusable.
Requirements#
edu builds on the Ed-Fi data standard. At least one Ed-Fi API is required as input to the system, from which data is extracted and the data warehouse is built.
Ed-Fi is an education data standard which supports interoperability. Many members of the edu community are also members of the Ed-Fi community; Enable Data Union is dependent on Ed-Fi.
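To illustrate what extraction from an Ed-Fi API looks like in practice, here is a minimal Python sketch that authenticates and pages through one resource. The base URL, credentials, resource path, and paging parameters are hypothetical placeholders; the actual endpoints depend on the Ed-Fi ODS/API version and how your installation is configured.

```python
import requests

# Hypothetical values -- substitute your Ed-Fi ODS/API base URL and credentials.
EDFI_BASE_URL = "https://edfi.example.org/v5.3/api"
CLIENT_ID = "your-client-id"
CLIENT_SECRET = "your-client-secret"

# Ed-Fi APIs use OAuth2 client credentials; the token path can vary by version.
token_resp = requests.post(
    f"{EDFI_BASE_URL}/oauth/token",
    data={"grant_type": "client_credentials"},
    auth=(CLIENT_ID, CLIENT_SECRET),
)
token_resp.raise_for_status()
access_token = token_resp.json()["access_token"]

# Page through one resource (students is just an example) using limit/offset.
headers = {"Authorization": f"Bearer {access_token}"}
offset, limit, records = 0, 500, []
while True:
    page = requests.get(
        f"{EDFI_BASE_URL}/data/v3/ed-fi/students",
        headers=headers,
        params={"limit": limit, "offset": offset},
    )
    page.raise_for_status()
    batch = page.json()
    if not batch:
        break
    records.extend(batch)
    offset += limit

print(f"Extracted {len(records)} student records")
```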
AWS Components#
The EDU code includes templates to set up an AWS environment containing the infrastructure that orchestrates the scheduled jobs that move data out of Ed-Fi. Currently, this is implemented as a CloudFormation template.
Networking and security in this AWS environment are up to the education agency using EDU; deploying this infrastructure in a standard AWS VPC works well.
When deployed, the following AWS services are in use (a minimal orchestration sketch follows the list):
- CloudFormation to manage the setup of the infrastructure components
- An EC2 Linux machine running a Python environment and Apache Airflow
- AWS Systems Manager Parameter Store to manage the secure credentials that Airflow needs access to
- An S3 bucket to stage the raw data to load into Snowflake
- An RDS PostgreSQL database to serve as Airflow's metadata backend
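To make the orchestration concrete, here is a minimal Airflow DAG sketch of the flow these components support: extract from an Ed-Fi API, stage the raw data in S3, then load it into Snowflake. The DAG name, task names, and callables are hypothetical placeholders, not the actual EDU implementation.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical callables standing in for the real extract/stage/load logic.
def extract_from_edfi(**context):
    """Pull records from the Ed-Fi API (see the sketch in Requirements)."""
    ...

def stage_to_s3(**context):
    """Write the extracted records as files to the S3 staging bucket."""
    ...

def load_into_snowflake(**context):
    """Copy the staged files from S3 into Snowflake raw tables."""
    ...

with DAG(
    dag_id="edfi_to_snowflake",   # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",            # Airflow 2.4+ keyword; older versions use schedule_interval
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_from_edfi", python_callable=extract_from_edfi)
    stage = PythonOperator(task_id="stage_to_s3", python_callable=stage_to_s3)
    load = PythonOperator(task_id="load_into_snowflake", python_callable=load_into_snowflake)

    extract >> stage >> load
```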
Snowflake#
The EDU code uses a series of scripts to set up infrastructure inside Snowflake; we intend to manage this code with Terraform in a future release. The code creates four databases and sets up default roles for managing the system. The default role definitions and hierarchy are described in the management doc.
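To illustrate the kind of setup these scripts perform, here is a minimal Python sketch that issues Snowflake DDL through the snowflake-connector-python package. The database names, role names, and grant are hypothetical placeholders; the actual objects and the full role hierarchy are described in the management doc.

```python
import snowflake.connector

# Hypothetical object names -- the real scripts define their own databases and roles.
DATABASES = ["raw", "analytics", "dev_raw", "dev_analytics"]
ROLES = ["loader", "transformer", "reporter"]

conn = snowflake.connector.connect(
    account="your_account",    # placeholder
    user="your_admin_user",    # placeholder
    password="your_password",  # use key-pair or SSO auth in practice
    role="ACCOUNTADMIN",       # an administrative role that can create databases and roles
)

cur = conn.cursor()
try:
    for db in DATABASES:
        cur.execute(f"CREATE DATABASE IF NOT EXISTS {db}")
    for role in ROLES:
        cur.execute(f"CREATE ROLE IF NOT EXISTS {role}")
    # Example grant: allow the loader role to use the raw database.
    cur.execute("GRANT USAGE ON DATABASE raw TO ROLE loader")
finally:
    cur.close()
    conn.close()
```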
Code structure#
edu code lives in several GitHub repositories. The goals of this structure are modularity of function, the ability to change components more easily, and separation of fixed packages from templated configuration and implementation repositories. A list of the relevant repositories is located in the Code tab.
Security#
An implementation of edu includes sensible security defaults and options for further configuration. Responsibility for managing security lies with the agency implementing the code. Many more details about security, including some of the reasoning and architecture, are available in this article.
By default, Snowflake is set up to be inaccessible over the open internet; as part of the configuration, you can add IPs that are allowed to connect to the databases. It is also recommended to link SSO to Snowflake for database access.
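As an example of what the IP allowlisting looks like, here is a minimal sketch of creating and applying a Snowflake network policy, again via snowflake-connector-python. The policy name and IP ranges are hypothetical; SSO configuration is separate and depends on your identity provider.

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account",    # placeholder
    user="your_admin_user",    # placeholder
    password="your_password",  # placeholder
    role="SECURITYADMIN",      # an administrative role is needed for network policies
)

cur = conn.cursor()
try:
    # Hypothetical policy allowing only the agency's office and VPN ranges.
    cur.execute(
        """
        CREATE NETWORK POLICY agency_allowlist
          ALLOWED_IP_LIST = ('203.0.113.0/24', '198.51.100.17')
        """
    )
    # Apply the policy account-wide so Snowflake is not reachable from other IPs.
    cur.execute("ALTER ACCOUNT SET NETWORK_POLICY = agency_allowlist")
finally:
    cur.close()
    conn.close()
```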
Costs#
The software is free, but running the cloud infrastructure involves costs. These costs vary depending on setup choices and on the usage and data volume of the system. If you have questions about costs, please contact us at edu@edanalytics.org.
There are also implicit costs involved in managing and customizing the system, such as personnel costs for technical staff who can maintain and customize it. An alternative that minimizes such costs is Stadium, EA's hosted and managed edu service.