Verify consistency across notebooks, specifically Cross Program and CWA Violations #22
|
A somewhat separate issue, but in the ECHO-CWA-Violations notebook, we are only pulling data from 2005 through 2019. No 2020 data.
|
Ok, I've run both notebooks and I have confirmed 864 violations for

In the Cross-Program notebook, we take the NPDES_IDs for the facilities in ECHO_EXPORTER that have a CWA_FLAG and are in the congressional district of interest. Then we look those NPDES_IDs up in the NPDES_QNCR_HISTORY table. (This is my understanding.)

In the CWA-violations notebook, we pull all of the NPDES_QNCR_HISTORY based on a match for the state (e.g. "LA" is in the NPDES_ID). Then, we filter that based on the NPDES_IDs for the congressional district as identified in ECHO_EXPORTER.

In other words, the operations are reversed between the two notebooks. My best guess at this moment is that the issue is in how the join is done.

The upshot is we've given ourselves a useful calibration exercise in creating two notebooks whose functions overlap a little. |
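To make the comparison above concrete, here is a minimal pandas sketch of the two orders of operation. The data is made up, and the column names `STATE`, `DISTRICT` and `YEARQTR`, the helper names, and the use of `startswith` for the state match are assumptions for illustration, not the notebooks' actual code. If both paths are implemented consistently, they should return the same rows.

```python
import pandas as pd

# Hypothetical stand-ins for ECHO_EXPORTER and NPDES_QNCR_HISTORY.
exporter = pd.DataFrame({
    "NPDES_ID": ["LA0000001", "LA0000002", "LA0000003", "TX0000001"],
    "CWA_FLAG": ["Y", "Y", "N", "Y"],
    "STATE": ["LA", "LA", "LA", "TX"],   # assumed column name
    "DISTRICT": [2, 2, 2, 7],            # assumed column name
})
history = pd.DataFrame({
    "NPDES_ID": ["LA0000001", "LA0000001", "LA0000002", "LA0000009", "TX0000001"],
    "YEARQTR": [20191, 20192, 20191, 20191, 20191],  # assumed column name
})

def district_ids(exporter, state, district):
    """NPDES_IDs of CWA-flagged facilities in one congressional district."""
    mask = (
        (exporter["CWA_FLAG"] == "Y")
        & (exporter["STATE"] == state)
        & (exporter["DISTRICT"] == district)
    )
    return set(exporter.loc[mask, "NPDES_ID"])

def cross_program_order(exporter, history, state, district):
    """District IDs first, then look them up in the history table."""
    ids = district_ids(exporter, state, district)
    return history[history["NPDES_ID"].isin(ids)]

def cwa_violations_order(exporter, history, state, district):
    """State-wide history first, then filter down to the district IDs."""
    state_rows = history[history["NPDES_ID"].str.startswith(state)]
    ids = district_ids(exporter, state, district)
    return state_rows[state_rows["NPDES_ID"].isin(ids)]

a = cross_program_order(exporter, history, "LA", 2)
b = cwa_violations_order(exporter, history, "LA", 2)
same = a.reset_index(drop=True).equals(b.reset_index(drop=True))
```

A mismatch between `a` and `b` on real data would point at how the IDs are extracted or joined rather than at the order of the steps.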
The facility-all-programs.ipynb notebook gets 22,858 records in LA #2.
It was an early notebook where I was trying to attach the facility's id to the dataframe. I suspect that may be duplicating some lines. |
I counted 1,293 unique NPDES_IDs for LA-2 in Cross-Program and 1,744 in CWA-Violations. That makes sense, since there are also more violations in CWA-Violations. Yes, I think this is echoing what you're seeing, Steve. |
Taking out the code lines that are attaching the facility id to the program data doesn't account for the difference. The difference is only 11 records, at my_cd_npdes2=30,529 versus my_cd_npdes=30,518. I'll have to export and diff the csv files to see more.
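One way to diff the two exports, sketched under assumptions: the key columns `NPDES_ID` and `YEARQTR` and the helper name `count_diff` are hypothetical, and the small frames below stand in for the exported csv files (which could be loaded with `pd.read_csv`). Counting rows per key in each frame surfaces both missing records and duplicated ones.

```python
import pandas as pd

def count_diff(left, right, keys):
    """Return the keys whose row counts differ between two dataframes."""
    n_left = left.groupby(keys).size().rename("n_left")
    n_right = right.groupby(keys).size().rename("n_right")
    # Outer-align on the key index; a key absent from one side counts as 0.
    both = pd.concat([n_left, n_right], axis=1).fillna(0).astype(int)
    return both[both["n_left"] != both["n_right"]]

# Made-up data: key (A, 1) is duplicated on the left, (C, 1) exists only on the right.
left = pd.DataFrame({"NPDES_ID": ["A", "A", "B"], "YEARQTR": [1, 1, 2]})
right = pd.DataFrame({"NPDES_ID": ["A", "B", "C"], "YEARQTR": [1, 2, 1]})

mismatch = count_diff(left, right, ["NPDES_ID", "YEARQTR"])
```

Each row of `mismatch` is a key where the two notebooks disagree, with the per-notebook counts side by side.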
|
Sounds good, @shansen5! I am wondering how much time we need to spend on this if the idea is to move away from the separate notebooks and use Cross-Program for everything. I am also going to propose a more structured verification exercise, which would help us make sure Cross-Program is accurate. |
Hi @shansen5, sorry for all the traffic here. I wonder if the issue is that in Cross-Program, it does not appear as if we are splitting ID strings. In other words, the SQL query might look like this:
but there would be no match for In other words, I think we might need another step before in I very well might be wrong about this if I'm not reading the data class right. |
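As an illustration of the extra step being suggested, here is a sketch that splits each space-separated ID string before building the query. The query text and the helper names `split_ids` and `build_in_clause` are assumptions; the original query is not shown in this thread.

```python
def split_ids(cell):
    """Split one space-separated ID string into individual IDs."""
    if not isinstance(cell, str):
        return []  # skip missing values such as NaN
    return cell.split()

def build_in_clause(cells):
    """Build a lookup query from ID cells (hypothetical query shape)."""
    ids = sorted({i for cell in cells for i in split_ids(cell)})
    quoted = ", ".join("'" + i + "'" for i in ids)
    return f"select * from NPDES_QNCR_HISTORY where NPDES_ID in ({quoted})"

# One cell holding two IDs and one holding a single (made-up) ID.
query = build_in_clause(["LA0005266 LAG670191", "LA0000001"])
```

Without the split, the whole string "LA0005266 LAG670191" would be quoted as a single value and could never equal an individual NPDES_ID.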
This code in the facility-all-programs notebook was skipping all ids in every string that has more than one id. In this string "LA0005266 LAG670191 LAJ650044 LAR10E952 LAR10G428 LAR10L158" none of these ids get included. There were 574 ids in cwa-compliance-violations that weren't being included in facility-all-programs. I changed the code to this:
This now gives results like those Cole came up with in his report, which used cwa-compliance-violations. |
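The effect described above can be reproduced with a small pandas sketch. This is not the notebook's actual code: the second ID is made up, and the "skip" logic is an assumed reconstruction of the old behaviour.

```python
import pandas as pd

# Cells as they might appear in the exporter: one multi-ID string from the
# comment above and one single-ID cell (the latter is hypothetical).
cells = pd.Series([
    "LA0005266 LAG670191 LAJ650044 LAR10E952 LAR10G428 LAR10L158",
    "LA0000001",
])

# Assumed old behaviour: only cells holding exactly one ID are kept.
single_only = set(cells[~cells.str.contains(" ")])

# Split every cell on whitespace and flatten, so each ID is kept.
all_ids = set(cells.str.split().explode().dropna())

# IDs that the skipping behaviour would have dropped.
missing = all_ids - single_only
```

On the real data, the size of a set like `missing` would be the number to compare against the 574 ids reported above.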
Thanks for correcting this @shansen5 - I think I forgot to come back to |