Overview#

This article is targeted toward individuals using the edu data warehouse for analytics, research, or building dashboards. We assume some familiarity with SQL-based databases and dimensional data models / star schemas. More information on these topics is available in the explanation section, so if you are unfamiliar with these concepts please start there. This article should help you understand the system as a whole and the design principles behind it; for step-by-step instructions please see "Guides".

Lineage / transformation steps#

edu loads data into the analytics database exactly as it comes out of the Ed-Fi API, a shape that is not well organized for analytics use. edu then executes a series of transformation steps inside the data warehouse to arrive at its destination: a dimensional model, or star schema. Users will typically interact with the final dimensional model, but it may be useful to understand the steps the data takes along the way.

  • base is the first step of the process in which the data is pulled out of the nested data structures that come from Ed-Fi.
  • stage is a processing step that allows us to organize tables around a desired level of granularity. You can think of this as cleaned-up raw data organized into tables.
  • warehouse is where we pull together the various tables into a denormalized dimensional model.

Each step's output is organized into a separate schema, and most users will interact primarily with the warehouse schema, called prod_wh.
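
The three steps above can be sketched in miniature. This is an illustrative toy in Python, not edu's actual transformation code, and every name in it (the JSON fields, table names, and the example school) is hypothetical:

```python
import json

# Hypothetical raw record as it might arrive from an Ed-Fi-style API:
# nested structures keyed the way the source system organizes them.
raw = json.loads("""
{
  "studentUniqueId": "S-1",
  "schoolReference": {"schoolId": 101},
  "attendanceEvents": [
    {"date": "2024-09-03", "categoryDescriptor": "Absent"},
    {"date": "2024-09-04", "categoryDescriptor": "Present"}
  ]
}
""")

# "base": pull rows out of the nested structure, one per attendance event.
base_rows = [
    {
        "student_id": raw["studentUniqueId"],
        "school_id": raw["schoolReference"]["schoolId"],
        "date": ev["date"],
        "code": ev["categoryDescriptor"],
    }
    for ev in raw["attendanceEvents"]
]

# "stage": organize around the desired grain -- here, one row per
# student per day.
stage = {(r["student_id"], r["date"]): r for r in base_rows}

# "warehouse": denormalize by joining in dimension attributes.
dim_school = {101: {"school_name": "Lincoln Elementary"}}
warehouse = [{**r, **dim_school[r["school_id"]]} for r in stage.values()]
```

The real pipeline does this work in SQL inside the warehouse, but the shape of the motion is the same: un-nest, settle on a grain, then denormalize.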

Dimensional data model principles#

The edu data warehouse follows a design philosophy intended to make table relationships easy and intuitive to understand. Surrogate keys are always prefixed with k_ and named after their corresponding dimension table. For example, you might encounter k_student and k_school in a fact table about student attendance; from those names alone you can assume that the corresponding dimension tables are called dim_student and dim_school, and that you can join to them on the correspondingly named keys k_student and k_school.

You can also assume that when joining fact tables to dimension tables, the named key is unique in the dimension table and will produce a many-to-one join. In the example above, if you are joining fct_student_daily_attendance to dim_student on k_student to add contextual information about each student, you can assume each student appears exactly once in dim_student, so the join will not duplicate fact rows.

Making simplifying assumptions#

In Ed-Fi, data can often be represented flexibly in nested structures. In designing the analytics data structures, we often face a tradeoff between making a simplifying decision or assumption for convenience and preserving the full flexibility of the source data. For example, student demographic and program data can be represented in many ways in Ed-Fi: we may have records associated with different enrollments in schools, or program start and end dates. Or, to take a simpler example, Ed-Fi can store arbitrarily many addresses for a school. Should we pick one to show in the school summary table dim_school, or should we refer out to a separate table that associates schools with addresses?

In general, our philosophy in organizing the edu data warehouse is to make an assumption for convenience of use, while keeping the underlying data available in a convenient format for more sophisticated inquiry if desired. Simplifications like a binary indicator for a student's program participation in a year may be sufficient for many cases, but some analyses might require going deeper to source program start and end dates.
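
The binary-indicator example can be sketched as follows. The span records, field names, and school-year dates here are all hypothetical, not edu's actual schema; the point is only how the detailed representation collapses into the convenient one:

```python
from datetime import date

# Hypothetical program-participation spans for one student: the detailed
# representation keeps start and end dates per enrollment.
spans = [
    {"program": "ESL", "begin": date(2023, 9, 1), "end": date(2024, 1, 15)},
    {"program": "ESL", "begin": date(2024, 3, 1), "end": None},  # still active
]

# The simplifying assumption: collapse the spans into a single yes/no flag
# for "participated at any point during the school year".
year_start, year_end = date(2023, 8, 15), date(2024, 6, 15)

def overlaps_year(span):
    begin = span["begin"]
    end = span["end"] or year_end  # open-ended span treated as still active
    return begin <= year_end and end >= year_start

participated_in_esl = any(overlaps_year(s) for s in spans)
```

Most dashboards only need the flag; an analysis of program dosage would go back to the spans themselves, which is why both representations stay available.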

Configuring measures#

Ed-Fi allows for customization of descriptors, which are usually categorical codes representing observations in the data. For example, edu knows where to find the recorded attendance events for students in Ed-Fi, but the codes on those events could contain anything from a simple Attended or Absent to a much more granular scheme for capturing the same information.

edu attempts to define a framework for analytic interpretation of the underlying data. We generally call any analytic aggregation of data a measure, stored in a table prefixed with msr_. The idea is to take categorical information, like attendance codes or performance levels on an assessment, and turn it into standardized, analytically meaningful concepts like attended school or proficient on exam. We store this translation in "seed tables", crosswalks stored in the database. (This concept comes from dbt, which you can read more about in the dbt documentation.) This allows us to keep logic that is specific to customizable Ed-Fi descriptor codes out of the code base. When you see summary analytic measures in the data warehouse, they are driven by the configuration specified in these tables.
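
To make the crosswalk idea concrete, here is a hypothetical seed-style mapping, expressed as a Python dictionary rather than a database table. The codes and the policy choices in it are invented for illustration and are not edu's actual seed data:

```python
# A hypothetical seed-style crosswalk: it maps whatever descriptor codes a
# district happens to use onto one standardized analytic concept, here a
# boolean "attended". Changing local codes means editing configuration,
# not transformation logic.
attendance_code_xwalk = {
    "Attended": True,
    "Present": True,
    "Tardy": True,       # a local policy decision encoded as configuration
    "Absent": False,
    "Excused Absence": False,
}

# Raw events carry local codes; the measure layer resolves them via the seed.
events = ["Present", "Absent", "Tardy", "Present"]
days_attended = sum(attendance_code_xwalk[code] for code in events)
```

In the real system this mapping lives in a dbt seed table and the aggregation happens in SQL, but the separation is the same: the code base defines the measure, and the seed defines what the local codes mean.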

Documentation#

edu comes with a documentation site built by dbt. This site outlines the data model and the dependencies of the tables that are built in the system, which allows for a deep inspection of the data lineage and the processing rules that have been applied. It also shows the rendered SQL that produces each table in the data warehouse.