Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Verify consistency across notebooks, specifically Cross Program and CWA Violations #22

Closed
ericnost opened this issue Jun 3, 2020 · 10 comments
Assignees
Labels
invalid This doesn't seem right question Further information is requested

Comments

@ericnost
Copy link
Member

ericnost commented Jun 3, 2020

No description provided.

Copy link
Member Author

ericnost commented Jun 3, 2020

<@UTL9TARK2> just following up on the updated CSV for the polished congressional district report. Are the numbers generated by the single facility report actually different than the numbers generated by the CWA congressional district report? Also drawing in <@UFGN2FJ85> in here. Cole would you be able to help with making a new chart for slide 10 here of all the CWA violations base on the data from the single facility notebook? https://github.com/edgi-govdata-archiving/ECHO-Cross-Program/blob/master/facility-all-programs.ipynb?
View in Slack

@ericnost ericnost added invalid This doesn't seem right question Further information is requested labels Jun 3, 2020
@ericnost
Copy link
Member Author

ericnost commented Jun 3, 2020

A somewhat separate issue, but in the ECHO-CWA-Violations notebook, we are only pulling data from 2005 through 2019. No 2020 data.

from NPDES_QNCR_HISTORY where YEARQTR > 20050 and YEARQTR < 20200

@ericnost
Copy link
Member Author

ericnost commented Jun 3, 2020

Ok, I've run both notebooks and I have confirmed 864 violations for NUME90Q in Cross-Program and 1014 for NUME90Q in CWA-Violations. Both Louisiana CD 2.

In the Cross-Program notebook, we take the NPDES_IDs for the facilities in ECHO_EXPORTER that have a CWA_FLAG and are in the congressional district of interest. Then we look those NPDES_IDs up in the NPDES_QNCR_HISTORY table. (This is my understanding)

In the CWA-violations notebook, we pull all of the NPDES_QNCR_HISTORY based on a match for the state (e.g. "LA" is in the NPDES_ID). Then, we filter that based on the NPDES_IDs for the congressional district as identified in ECHO_EXPORTER.

In other words, the operations are reversed between the two notebooks. My best guess at this moment is that the issue is in how the join is done.

The upshot is we've given ourselves a useful calibration exercise in creating two notebooks whose functions overlap a little.

@shansen5
Copy link
Collaborator

shansen5 commented Jun 3, 2020

The facility-all-programs.ipynb get 22,858 records in LA #2.
This cell in echo-cwa-compliance-violations.ipynb gets 30,518 records in my_cd_npdes.

my_cd_npdes = pd.DataFrame()
no_data_ids = []
for fac in my_cd_facs.itertuples():
    ids = fac.NPDES_IDS
    for npdes_id in ids.split():
        try:
            npdes_data = qncr_data.loc[ npdes_id ].copy()
            # Add the facility's index number to air_data, to refer to it.
            n = npdes_data.shape[0]
            fac_list = [fac.Index] * n
            npdes_data['facility'] = fac_list
            frames = [my_cd_npdes, npdes_data]
            my_cd_npdes = pd.concat( frames )
        except KeyError:
            no_data_ids.append( npdes_id )

`

It was an early notebook where I am trying to attach the facility's id to the dataframe. I'm suspecting that may be duplicating some lines.

@ericnost
Copy link
Member Author

ericnost commented Jun 3, 2020

I counted 1,293 unique NPDES_IDs for LA-2 in Cross-Program and 1,744 in CWA- Violations. Makes sense since violations are also more in CWA-Violations.

Yes, I think this is echoing what you're seeing Steve.

@shansen5
Copy link
Collaborator

shansen5 commented Jun 3, 2020

Taking out the code lines that are attaching the facility id to the program data doesn't account for the difference. The difference is only 11 records, at my_cd_npdes2=30,529 versus my_cd_npdes=30,518. I'll have to export and diff the csv files to see more.

my_cd_npdes2 = pd.DataFrame()
no_data_ids = []
for fac in my_cd_facs.itertuples():
    ids = fac.NPDES_IDS
    for npdes_id in ids.split():
        try:
            npdes_data = qncr_data.loc[ npdes_id ].copy()
            frames = [my_cd_npdes, npdes_data]
            my_cd_npdes2 = pd.concat( frames )
        except KeyError:
            no_data_ids.append( npdes_id )

@ericnost
Copy link
Member Author

ericnost commented Jun 3, 2020

Sounds good, @shansen5! I am wondering how much time we need to spend on this if the idea is to move away from the separate notebooks and use Cross-Program for everything.

I am also going to propose a more structured verification exercise. Which would helps us make sure Cross-Program is accurate.

@ericnost
Copy link
Member Author

ericnost commented Jun 4, 2020

Hi @shansen5, sorry for all the traffic here. I wonder if the issue is that in Cross-Program, it does not appear as if we are splitting ID strings. In other words, the SQL query might look like this:

...where NPDES_ID in ('LA1234', 'LA8888 LA9999 LA2323, LA4567', 'LA0101')

but there would be no match for 'LA8888 LA9999 LA2323'. That would lead to fewer records returned (so CWA-Violations might actually be more accurate...)

In other words, I think we might need another step before
for pos,row in enumerate( ee_ids ): id_string = id_string + "'"+str(row)+"',"

in def get_data(), to first separate out any multiple ids in the row of ee_ids.

I very well might be wrong about this if I'm not reading the data class right.

@shansen5
Copy link
Collaborator

shansen5 commented Jun 7, 2020

ids_string = program.echo_type + '_IDS'
ids = echo_data.loc[echo_data[ids_string].str.len() >0]    # just give all _IDs and let sql deal with it?
ids = ids.loc[:,ids_string].unique()

This code in the facility-all-programs notebook was skipping all ids in every string that has more than one id. In this string "LA0005266 LAG670191 LAJ650044 LAR10E952 LAR10G428 LAR10L158" none of these ids get included. There were 574 ids in cwa-compliance-violations that weren't being included in facility-all-programs. I changed the code to this:

ids_string = program.echo_type + '_IDS'
ids = list()
for index, value in echo_data[ ids_string ].items():
    try:
        for npdes_id in value.split():
            ids.append( npdes_id )
    except ( KeyError, AttributeError ) as e:
        pass

This now gives results like those Cole came up in his report, which used cwa-compliance-violations.

@ericnost
Copy link
Member Author

ericnost commented Jun 7, 2020

Thanks for correcting this @shansen5 - I think I forgot to come back to ids = echo_data.loc[echo_data[ids_string].str.len() >0] when getting this notebook ready for Chris's class last month.

@ericnost ericnost closed this as completed Jun 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
invalid This doesn't seem right question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants