We take the security of student data very seriously and follow the principle of least access: we lock access down as much as possible and grant only the level of access strictly required for each job function.
edu is, however, an open framework: if you are self-hosting rather than using the managed Stadium product, it is entirely possible to configure it less securely, or to grant overly broad permissions to too many people. This article describes our default and recommended approaches, providing a high-level overview rather than deeply technical guidance.
## Where is data stored?
Secret values, such as tokens and passwords, are stored as encrypted strings in AWS Parameter Store so they can be made available to the environments that need them. Access to these secrets is governed by IAM policies.
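As an illustration, an IAM policy can scope secret reads to a single parameter path. The path, Sid, and wildcards below are hypothetical, not edu's actual naming:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadEduSecretsExample",
      "Effect": "Allow",
      "Action": ["ssm:GetParameter", "ssm:GetParameters"],
      "Resource": "arn:aws:ssm:*:*:parameter/edu/*"
    }
  ]
}
```

A policy like this would be attached only to the roles assumed by the environments that need those secrets (for example, the Airflow compute role), keeping other principals out by default.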
Data is stored in two places permanently, and one additional place temporarily. All data is encrypted both in transit and at rest.
- **The Data Lake:** Data coming from the Ed-Fi API or other source systems is stored encrypted in an S3 bucket. Access to this bucket is scoped to the AWS services that need it (such as the compute servers where Airflow runs), the Snowflake environment, and AWS administrators.
- **The Data Warehouse:** Data is loaded from the Data Lake into the warehouse (Snowflake), where it is encrypted and governed by normal database access policies.
- **Temporary server storage:** While Airflow is running, it uses an AWS Elastic File System (EFS) mounted drive to store the files it is operating on. When Airflow is finished with these files, it moves them to S3 and deletes them from EFS. EFS drives are encrypted.
Logs are sanitized to avoid containing sensitive data such as secret values or student data. Log data is written to AWS CloudWatch for monitoring the operation of the system. Metadata about the success or failure of tasks or DAGs may be sent to alerting services like Slack or Email for monitoring purposes, but such alerting never contains sensitive data.
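Log sanitization of this kind is often implemented as a filter applied before records are emitted. The sketch below is illustrative only; the patterns and class name are hypothetical, not edu's actual sanitization rules:

```python
import logging
import re

# Hypothetical redaction pattern: masks values that look like secrets
# (e.g. "password=...", "token=...") before log records are written.
SECRET_PATTERN = re.compile(r"(password|token|secret|api_key)=\S+", re.IGNORECASE)

class RedactSecretsFilter(logging.Filter):
    """Mask secret-looking values in log messages before they are emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = SECRET_PATTERN.sub(r"\1=[REDACTED]", str(record.msg))
        return True  # keep the record, just with secrets masked
```

Attaching such a filter to the root logger (`logging.getLogger().addFilter(RedactSecretsFilter())`) ensures redaction happens centrally, regardless of which handler ships records to CloudWatch.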
Our recommended approach is to utilize network policies to restrict access from the open internet, explicitly adding IP ranges to the allow-list such that connections can only be made from the following places:
- The networks of authorized education agencies
- The compute environments of the orchestration engine (e.g. Airflow)
- Any BI services or other visualization frameworks
- The network ranges of any supporting contractors with permission to access student data
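In practice the allow-list lives in the network policy of the warehouse or VPC, but the membership check it performs can be sketched with Python's standard `ipaddress` module. The CIDR ranges below are documentation-reserved placeholders, not real agency or vendor networks:

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), standing in for the
# categories of allowed networks described above.
ALLOWED_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),   # authorized education agency
    ipaddress.ip_network("198.51.100.0/24"),  # orchestration (Airflow) compute
    ipaddress.ip_network("192.0.2.0/24"),     # BI / visualization services
]

def is_allowed(client_ip: str) -> bool:
    """Return True if client_ip falls inside any allow-listed range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in ALLOWED_RANGES)
```

Anything not explicitly on the list is denied, which is the deny-by-default posture the recommendation above describes.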
Human accounts on the database should be limited by policy to the people who need direct query access to data. This is typically a smaller group than the full set of data consumers, since much of that access is mediated by a BI or other visualization layer.
Human accounts should use an SSO integration to leverage existing organizational policies, such as password strength requirements, password rotation, MFA, and deprovisioning when staff leave.
Service accounts are those used by machines rather than humans: the processes that load or transform data, and possibly data visualization layers. Such accounts should never be used by humans directly. Because these accounts are not tied to any particular person, they are not set up through SSO. They should, however, enforce very high password complexity and regular rotation, and their credentials should be visible only to a small set of administrators.
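Since service-account passwords are never typed by a human, there is no reason to keep them memorable. A high-entropy credential can be generated with Python's standard `secrets` module; the length chosen here is an illustrative assumption, not a stated edu requirement:

```python
import secrets
import string

# Full printable alphabet: letters, digits, and punctuation.
ALPHABET = string.ascii_letters + string.digits + string.punctuation

def generate_service_password(length: int = 48) -> str:
    """Generate a cryptographically random password for a machine account."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

The generated value would be written directly into the secret store (e.g. Parameter Store) rather than ever being displayed or shared.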
The edu framework is governed by code, which means human accounts never have write access to production data. Changes to production data are handled by merging and deploying code to production environments, giving full transparency and auditability of such changes. As such, requests for write access to production code-governed spaces should always be denied by policy.
Developers do have write access to development servers and development regions of the warehouse. Read access to such areas should be limited to avoid confusion between production and development workspaces.
In single-tenant environments, row-level security is generally handled by the visualization layer, and people with direct database access are assumed to have permission to view all rows (for instance: district staff). If this is not the case, more fine-grained row access policies can be layered onto the framework.
In multi-tenant environments, row-level security is applied at the database level to recreate a single-tenant experience for direct database users. Every table in the database contains a `tenant_code` column to simplify the application of such policies.
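A Snowflake row access policy keyed on that column could look like the following sketch. The policy name, mapping table, and schema names are hypothetical, not edu's actual objects:

```sql
-- Hypothetical mapping table linking warehouse roles to the tenants
-- they are permitted to see.
CREATE ROW ACCESS POLICY tenant_rls AS (tenant_code VARCHAR)
RETURNS BOOLEAN ->
  EXISTS (
    SELECT 1
    FROM admin.tenant_role_map m
    WHERE m.tenant_code = tenant_code
      AND m.role_name = CURRENT_ROLE()
  );

-- Attach the policy to a table on its tenant_code column.
ALTER TABLE analytics.some_table
  ADD ROW ACCESS POLICY tenant_rls ON (tenant_code);
```

With a policy like this in place, a direct database user querying `analytics.some_table` sees only rows for tenants mapped to their current role, which is the single-tenant experience described above.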
While Airflow orchestrates work involving student data, it does not store any such data, or expose secret values to users. The Airflow interface doesn't strictly require tight visibility controls, but it is nonetheless an administrative layer that is not generally useful to a wide audience.
Admin access to production Airflow should be tightly controlled: while student data is not shown in the interface, many destructive or disruptive actions are possible, such as triggering DAG runs, deleting DAGs, or deleting/overwriting connections.
Most users should have view-only permissions to observe task statuses and logs, or should view those logs through another interface such as CloudWatch. End users need not access the Airflow interface at all. If necessary, monitoring and alerting can be surfaced to a wider audience outside the Airflow interface.
## AWS, VPC, EC2 permissions
Access to the AWS VPC is not necessary for anyone except AWS administrators and data engineers. The VPC should be locked down to allow access only from allow-listed IP ranges or VPN connections.
Administrative access to the AWS management console should be highly locked down, while view access to monitoring dashboards can be granted to technical staff tasked with monitoring such components.
EC2 access should be managed via SSH keys only (no password-based logins).
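For OpenSSH-based instances, disabling password logins is typically done in the SSH daemon configuration; the exact file location may vary by distribution:

```
# /etc/ssh/sshd_config: allow key-based authentication only
PasswordAuthentication no
ChallengeResponseAuthentication no
```

After changing these settings, the SSH daemon must be reloaded for them to take effect, and key pairs should be distributed per engineer rather than shared.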