Table of Contents generated with DocToc
A repository of data sets used in the scuole project. When updating the scuole database, this is where you download, clean and save the data. The scripts found in scuole will then upload each dataset directly from this folder into the database.
Before you start the update process in the scuole, make sure all of the data files needed for the update are clean and ready to go.
We try to update every year in January after the latest TAPR data is released at the end of the previous year. However, some data, like cohorts data, are updated much earlier in the year so you can update that data independently from the rest of the database.
TEA provides district boundaries and campus coordinates on their open data site, which you can get to via the School District Locator page. The latest district boundaries are listed under "Current Districts" and campus coordinates under "Current Campuses". Depending on what year you are updating for, you'll go to "Archived Schools Data" and click the school year you are updating for.
Put the GeoJSON boundaries and coordinates in the respective folder: tapr/reference/district/shapes/
or tapr/reference/campus/shapes/
.
This was last updated for the 2022-23 school year. TEA provided the files as GeoJSONs. We don't display the actual shapes on the page because they're not accurate enough and may be misleading. They are useful for determining nearby districts and geolocating.
This was last updated for the 2021-22 school year. TEA provided the files as GeoJSONs.
The latest shapefile campuses_03-10-2020.shp
(fetched March 10, 2020) is for the 2018-2019 school year. It was downloaded as an .sd
file, and unzipped with Unarchiver (a Mac program). The unzipped version yields a schools.gdb
folder, which can be opened in QGIS (instructions).
We convert the TEA provided shapefile into a geojson file using ogr2ogr
with the command:
For school districts:
$ ogr2ogr -f GeoJSON -t_srs crs:84 districts.geojson [tea-provided-file-name].shp
For campuses:
$ ogr2ogr -f GeoJSON -t_srs crs:84 campuses.geojson [tea-provided-file-name].shp
Each year, there's a possibility that campuses and districts change names, are added, or are removed. We rely on the reference.csv
in each year's TAPR folder to create a entities.csv
file that will create models for districts and campuses.
Instructions on how to take the reference.csv
and create a new entities.csv
are in the format_new_entities
Jupyter Notebook — we should be doing this every year.
We do some district and campus name re-formatting in the Jupyter Notebook (i.e. Cayuga H S --> Cayuga High School). Abbreviations, the Regex for those abbreviations, and the string to replace them with are in campus_name_abbrev_guidelines.xlsx
.
When you update entitites in the scoule database, you will be erasing the existing district and campus models and then re-adding every district and campus based on that entities.csv
. This should take care of districts and schools that get renamed/removed/added.
Released: as information is updated
AskTED provides superintendents, principals and directory information for all schools and districts. The scuole
repo downloads data from AskTED directly and updates them in our database, so there's no need to manually download and add them to scuole-data
.
Instructions on what commands to run to update AskTED are in the scuole
README.
As of 2023, the AskTED data is found on their site if you click on Download School and District File with Site Address
. This ensures that we have school and district address for the site of the school and district office rather than the mailing addresses (which can be P.O. Boxes).
If the links above don't lead you to the correct data, then AskTED might have changed the link to the data. If so, then the commands to update askTED data in the scuole database will fail, so make sure the links are updated with the correct headers names.
Released: Annually in late November/early December
All stats are collected by the Texas Education Agency.
To download TAPR data, go to the Texas Academic Performance Report homepage and find the most recent release. Click the Data Download
link and go to the Advanced TAPR Data (Numerators, Denominators & Rates)
option.
This app requires sheets for College, Career, and Military Readiness (CCMR), TSIA, College Prep
, AP/IB, SAT/ACT
, Attendance, Chronic Absenteeism, Graduation (RHSP/DAP & FHSP), and Dropout Rates
, Longitudinal Rate (4-Year, 5-Year, & 6-Year)
, Staff & Student Information
for Campus, District, Region and State. The Reference Information, Accountability Rating and Special Education Determination Status
sheet is only required for Campus, District and Region.
For districts and campuses we need to download one extra file that contains the full A-F rankings. Go back one page and select 20xx Accountability instead. (The year will change depending on what year you're working on!). Download the Accountability Summary
for Campus and Districts.
A detailed walkthrough on how to download and format the TAPR data is available on this Confluence page.
After downloading each file, you will save it in their respective folders (Campus, District, Region, State) in their respective years in the tapr
directory as the following spreadsheets. An example of the directory for Campus from 2021-22 found here.
TAPR data file | File name | What it contains |
---|---|---|
Accountability Summary | accountability.csv | A - F scores |
Reference Information, Accountability Rating and Special Education Determination Status | reference.csv | District and Campus Identifiers |
Staff, Student, and Annual Graduates | staff-and-student-information.csv | Student and Teacher data |
Attendance, Chronic Absenteeism, Graduation (RHSP/DAP & FHSP), and Dropout Rates | attendance.csv | Absenteeism and Dropout rates |
Longitudinal Rate (4-Year, 5-Year, & 6-Year) | longitudinal-rate.csv | Four year graduation rates |
College, Career, and Military Readiness (CCMR), TSIA, College Prep | postsecondary-readiness-and-non-staar-performance-indicators.csv | Texas Success Initiative Assessment (TSIA) scores |
AP/IB, SAT/ACT | ap-ib-sat-act.csv | AP/IB, SAT, ACT scores |
The TAPR data usually needs a cleaning before we run it in the scuole database. During the last couple of years, this is the cleaning we have needed to do:
- The campus, district and region codes should have a set number of digits, usually padded with leading zeroes. They are:
- campus (9 digits)
- district (6 digits)
- region (2 digits)
- county (3 digits)
- There's an unnecessary apostrophe added in the "DISTRICT", "COUNTY", "REGION" and "CAMPUS" columns
You have the option this Jupyter notebook in this repository that uses the zfill() function to fill in all of the leading zeroes and also removes unnecessary apostrophes in their respective columns. Be sure to run it for the district, campus, region and state datasets. In addition, there is this Jupyter notebook that you only have to run once (although not sure if it runs the zfill() function for campuses and districts).
Also, for the 2021-22 TAPR data, the SAT and ACT headers had random letters that were lowercased. Make sure the column headers are capitalized. This step is written in both Python notebooks.
If you're unlucky to run into any other formatting errors, first of all (sorry!), second of all, try to write a solution in a Python notebook and add it to this README so it can be reproduced the following year and properly documented.
The headers for the data should match the schema found in schema_v2.py
which is what we use to map the data. If after uploading all of the data into the scuole database, you notice there are fields missing. It could be because the header in the spreadsheet do not match the schema found in schema_v2.py
. There are a lot of headers and columns so it might get tedious to check each and every one, especially since they don't tend to change year-to-year. But it might be worth checking if you see data missing. The scuole database just skips those headers if it doesn't see a matching header and shows N/A in your local database.
Every year, TEA publishes a Master reference of TAPR elements
like this one from 2022. It's usually found in the TAPR Advanced Data Download for that year that you use to download the data and called Master Reference (HTML)
. You can also download it in an Excel format for campuses, districts, regions and state.
It's also good to remember that for some datasets, TAPR releases the latest data while others are a year behind. For example, if we were updating the 2021-22 TAPR data, the latest data would be for 2021-22 (or Class of 2022) and the previous year would be 2020-21 (or Class of 2021) Here's is a handy breakdown:
Data | TAPR Data File | Year |
---|---|---|
A-F scores | Reference Information, Accountability Rating and Special Education Determination | Latest year |
Student Demographics | Staff, Student, and Annual Graduates | Latest year |
At-risk, economically disadvantages and limited English proficiency students | Staff, Student, and Annual Graduates | Latest year |
Enrollments in special programs (bilingual/ESL, gifted & talented, special ed) | Staff, Student, and Annual Graduates | Latest year |
Four-year graduation rates | Longitudinal Rate (4-Year, 5-Year, & 6-Year) | Previous year |
Dropout rates | Attendance, Graduation (RHSP/DAP & FHSP), and Dropout Rates | Previous year |
Chronic absenteeism | Attendance, Graduation (RHSP/DAP & FHSP), and Dropout Rates | Previous year |
AP/IB participation | AP/IB, SAT/ACT | Previous year |
AP/IB performance | AP/IB, SAT/ACT | Previous year |
SAT score | AP/IB, SAT/ACT | Previous year |
ACT score | AP/IB, SAT/ACT | Previous year |
College-ready graduates (TSIA scores) | College, Career, and Military Readiness (CCMR); and Other Postsecondary Indicators | Previous year |
Teacher demographics | Staff, Student, and Annual Graduates | Latest year |
Degree held by teachers | Staff, Student, and Annual Graduates | Latest year |
Students per teacher | Staff, Student, and Annual Graduates | Latest year |
Teacher salaries | Staff, Student, and Annual Graduates | Latest year |
Released: Annually in late April/early May
The Texas Higher Ed Coordinating Board (THECB) provides data for the Higher Ed Outcomes section of the app. To obtain and clean the data:
-
First, download the latest year of the data from THECB. The latest year should be 11 years from the current year.
-
Then, create a folder in the
cohorts/
folder. Name it the year (YYYY) to which the data corresponds. -
Open the spreadsheet. Unhide the
Master Raw Data
worksheet in the.xlsx
file from THECB by going toFormat -> Sheet -> Unhide --> Master Raw Data
. -
For the last two years, however, the
Master Raw Data
worksheet has not been for the corresponding year. Check to see if it matches the year. If not, contact the Texas Higher Ed Coordinating Board right away so they can get you a spreadsheet of thatMaster Raw Data
. Warning: They tend to take their sweet time when it comes to communication. -
Copy and paste that data into a new spreadsheet. Change the headers to match the list of fields in the loader, or copy and paste the headers from a previous year's data. Be sure the data matches the header, and save it as
regionState.csv
. -
Copy and paste the data found in
Region Cty Gender
,Region Cty Econ
andRegion Cty Ethnicity
into individual csv files. Name themcountyGender.csv
,countyEcon.csv
andcountyEthnicity.csv
, respectively. Change the headers to match the list of fields in the loader, or copy and paste the headers from a previous year's data. -
Be sure to remove any notes found at the top and bottom of all
.xlsx
files. -
Make sure all counts are integers and all percents are floats (which requires changing them in Excel 😬).
-
When updating the latest cohorts data, I noticed that they're adding asterisk (*) and N/As into the datasets. We don't want that! We want them to be empty. I created a Jupyter notebook called
clean_cohorts_data.ipynb
to help with any cleanup of that.
Note: According to a spokesperson at THECB, they are planning on building a cohorts dashboard in the near future, and they don't know how they will make the spreadsheets available to download (if at all). That's why we should probably be on the lookout for any changes in the THECB website and plan for any changes in our updating process in the future depending on how the data is released. Contact Mike Eddleman at [email protected]
or any other spokesperson at THECB when the time comes for any details.