Slowly Changing Dimensions#
We are often asked whether the EDU warehouse uses Slowly Changing Dimensions, and the answer is "Kind of, but not really, though in principle they can be added where needed." This can be confusing, so this article will try to explain our position on this issue. Incidentally, it may be worth noting that this question is generally asked by IT or Data Warehousing folks, but rarely by analysts or users of data.
What are Slowly Changing Dimensions (SCDs)#
As discussed in the article on Dimensional Models, a dimensional data model consists of 'facts', which are events that occur in time (like attendance or grades), and 'dimensions', which are the various entities that are involved in those events (like students, schools, and dates). The dimension records all the information we know about these entities, but this information can change over time. An SCD is a way to think about tracking those changes.
There are many variants, but by far the most popular is called SCD Type 2. In this approach, every row in a dimension table has a start and end date, and every time the information about the entity in that dimension changes, we 'close' the open record and create a new one.
Let's say the information we know about a course looks like this:
k_course | course_number | course_name | course_subject | begin_date | end_date |
---|---|---|---|---|---|
1 | 123 | Algebra 1 | Math | 2024-08-01 | Null |
If this information changed, we would update this to look as follows:
k_course | course_number | course_name | course_subject | begin_date | end_date |
---|---|---|---|---|---|
1 | 123 | Calculus 1 | Math | 2024-08-01 | 2024-10-01 |
2 | 123 | Calculus 1 | Advanced Math | 2024-10-01 | Null |
Notice that we now have two records: one that tells us what we knew about the course from August 1st through October 1st, and one that tells us what we know about it from October 1st through today.
Why can this be problematic in the Education domain?#
First, notice that this makes things a bit more complicated. When we record a fact event that involves this dimension, like a course enrollment or a grade, we record that in terms of the key of the course dimension, which here is k_course
. But we now have two keys for the same course. An enrollment doesn't occur on a single day, it spans a period of time, which in this case will likely include both versions of the course record. Which should we use to describe this enrollment? Plausibly the most recent, but actually implementing this introduces some complexity, not only in dim_course
itself, but in every table that references it.
And consider an event that does occur on a single day. If we were recording section attendance events, and say, calculating an attendance rate by subject, we would be forced to say that a student's attendance on 8/21 was in 'Math', but their attendance on 10/15 was in 'Advanced Math', which splits the attendance of a single student in a single course across two subjects. This is odd, and not what analysts would tend to expect.
Second, consider: is it actually true that Calculus 1 was in a different academic subject for the first two months of the year? Or is it more likely that this was a correction of the record, and continuing to report the old subject would simply be a mistake?
Third, if the source system from which we get this course information is not tracking how this data changes over time, then only the warehouse is. This is fine in principal, but it has a couple implications.
- This change data capture cannot be reproduced for historic data, only future data. This means we have change tracking only from when the warehouse project started, and only know the final state of any dimensional records from historic years.
- The warehouse can no longer be fully reproduced on demand from source data -- it now is a source of truth itself about how this data has changed over time.
- If any corrupted or incorrect data has made its way to the warehouse, we can no longer just refresh the warehouse and clean it out, we have to manually fix each SCD we have -- not only removing the incorrect records, but re-aligning the start and end dates of the correct data that came before or after the bad data came in. If we don't do this, we'll hold on to the incorrect data and continue reporting it forever.
What does EDU do instead?#
EDU has a few different strategies for situations like this.
One important note here is the intended purpose of EDU as it is constructed. Warehouses in education often serve one of two primary purposes (or attempt to serve both at the same time, to mixed outcomes): one of analytics, that is reporting our current understanding of the world to stakeholders like teachers, school and district leaders, parents, etc; or one of compliance, wherein it may be critical to report history as we understood it at the time, even if the constituent records have been corrected. Both of these are valid, but they are at cross-purposes: in a compliance mindset, it may be critical to report that we understood a student's test score to be 150 as of some date, even if that was incorrect at the time, because that's how we reported it externally and we need to be able to reproduce that understanding forever. In an analytics context, what matters is our best understanding of the truth: it is actively counterproductive to report a number we no longer believe to be correct simply because we once did.
Annualization#
The first change-capture strategy is what we call annualization. This stems from an observation: things like courses don't tend to change arbitrarily throughout the year, they change exactly and only at the boundaries between years, when the new course catalog comes out. Any changes to course records during the year are probably corrections to this data, rather than analytically meaningful events. For dimensions of this type, we annualize them rather than tracking all changes.
Considering our example from above, this means our course dimension might actually look more like this:
k_course | school_year | course_number | course_name | course_subject |
---|---|---|---|---|
1 | 2023 | 123 | Calculus 1 | Math |
2 | 2024 | 123 | Calculus 1 | Advanced Math |
What this means is that we consider the course to have meaningfully different metadata cross school years, and only across school years. This is very simple to compute, as we know exactly what school year any particular definition of a course came from, and it eliminates many of the unintuitive outcomes described above.
Flattening#
Other dimensions we fully flatten, meaning we always and only maintain the most recent metadata about the dimension. An example of this would be schools. For things like BI and dashboards, it can be clumsy to have to remember that a school changed names in 2022, so we have to choose different elements from the drop down as we move through time. It may even make it difficult or impossible to show an over-time graph of this school, if your interface functions on the school's name. For reasons like this, we keep only our most recent understanding of a school's metadata in the school dimension, even if that has in fact changed over time.
It may be perfectly valid for other use-cases to maintain a separate school dimension that tracks changes over time (for instance: if a school changed locations, it may be important to your analysis where it was located at the time in question, rather than where it is located now). But even then it's worth considering whether it's more useful to think of this in terms of annualized or continuous change tracking.
Fact-derived characteristics#
Some elements that can be treated as dimensional characteristics are also somewhat fact-like. For example, consider enrollment in a special education program. These are captured in the fact table fct_student_special_education_program_associations
, which records the begin and end dates of such enrollments, and any metadata associated with that enrollment. This provides a great deal of detail in case it is needed, but it is often useful to have a simpler summary of this data as well. EDU can add columns to dim_student
from this fact table with a few different definitions, for example is_special_ed_active
for enrollments that are active today, is_special_ed_annual
for enrollments that occurred at some point during the school year, and is_special_ed_ever
to capture if the student has ever enrolled in such a program.
EDU's strategy here is to summarize where useful, but preserve the underlying detail for when it is needed. These summaries contain some information about how a student's special education enrollments have changed over time, but you can also go to the full fact table for more detailed information.
Note though that while the fact table can store many enrollments over time and in different types of programs, any individual enrollment is not change-captured, so any changes to the enrollment are considered corrections to that enrollment's metadata rather than analytically meaningful events.
What about students?#
The place where full SCD seems potentially most relevant is the student dimension: whereas courses typically only change at specific points in time, students mostly do not. However the set of attributes a student possesses actually break down into a few different categories that are worth examining.
First, there are 'immutable' characteristics, such as Name, Birthdate, Gender, or Race. For such fields it may be problematic to show even the annualized history of how they have changed, because the old records are likely either incorrect, inappropriate to display, or both. These fields would be problematic under an SCD paradigm. EDU can either treat them like normal annualized characteristics, or treat them as 'immutable', in the sense that we always apply our current understanding (from the most recent school year) to all years of history. This is controlled through the edu:stu_demos:make_demos_immutable
parameter.
Then there are the fact-derived characteristics discussed above, such as summaries of program enrollments or a student's grade-level. Since these are derived from dated events already, their history is already preserved, just outside the context of the dimension.
Finally there are stand-alone, undated fields (such as studentCharacteristics
) which may change meaningfully throughout the school year, and whose changes are not possible to track any other way. We discourage the use of such fields in favor of a more time-tracked table (such as the program tables) if tracking changes is important.
How would one add SCDs or other change-data capture to EDU?#
In general EDU does preserve how all records have changed over time, but it only does this in the very earliest parts of the warehouse: raw
and base
models. One could use this data in downstream models where history tracking is important, but the existing warehouse code removes this in order to stay in sync with the current state of Ed-Fi itself.
To preserve history more directly, one can make use of dbt Snapshots, which natively implement SCD Type 2. While these are not built into the core EDU code, we have often implemented them on top of data where history tracking is particularly important. In practice this has typically been metrics whose history we want to preserve, rather than dimensional data, but the principal is the same.
In short, it is possible to layer a more thorough change-data capture strategy onto EDU, but the choice not to do this by default is a deliberate one.