Verify consistency across notebooks, specifically Cross Program and CWA Violations #22

ericnost · 2020-06-03T17:57:10Z

No description provided.

ericnost · 2020-06-03T17:57:18Z

<@UTL9TARK2> just following up on the updated CSV for the polished congressional district report. Are the numbers generated by the single facility report actually different than the numbers generated by the CWA congressional district report? Also drawing in <@UFGN2FJ85> in here. Cole would you be able to help with making a new chart for slide 10 here of all the CWA violations base on the data from the single facility notebook? https://github.com/edgi-govdata-archiving/ECHO-Cross-Program/blob/master/facility-all-programs.ipynb?
_{View in Slack}

ericnost · 2020-06-03T18:20:20Z

A somewhat separate issue, but in the ECHO-CWA-Violations notebook, we are only pulling data from 2005 through 2019. No 2020 data.

from NPDES_QNCR_HISTORY where YEARQTR > 20050 and YEARQTR < 20200

ericnost · 2020-06-03T18:41:27Z

Ok, I've run both notebooks and I have confirmed 864 violations for NUME90Q in Cross-Program and 1014 for NUME90Q in CWA-Violations. Both Louisiana CD 2.

In the Cross-Program notebook, we take the NPDES_IDs for the facilities in ECHO_EXPORTER that have a CWA_FLAG and are in the congressional district of interest. Then we look those NPDES_IDs up in the NPDES_QNCR_HISTORY table. (This is my understanding)

In the CWA-violations notebook, we pull all of the NPDES_QNCR_HISTORY based on a match for the state (e.g. "LA" is in the NPDES_ID). Then, we filter that based on the NPDES_IDs for the congressional district as identified in ECHO_EXPORTER.

In other words, the operations are reversed between the two notebooks. My best guess at this moment is that the issue is in how the join is done.

The upshot is we've given ourselves a useful calibration exercise in creating two notebooks whose functions overlap a little.

shansen5 · 2020-06-03T18:45:35Z

The facility-all-programs.ipynb get 22,858 records in LA #2.
This cell in echo-cwa-compliance-violations.ipynb gets 30,518 records in my_cd_npdes.

my_cd_npdes = pd.DataFrame()
no_data_ids = []
for fac in my_cd_facs.itertuples():
    ids = fac.NPDES_IDS
    for npdes_id in ids.split():
        try:
            npdes_data = qncr_data.loc[ npdes_id ].copy()
            # Add the facility's index number to air_data, to refer to it.
            n = npdes_data.shape[0]
            fac_list = [fac.Index] * n
            npdes_data['facility'] = fac_list
            frames = [my_cd_npdes, npdes_data]
            my_cd_npdes = pd.concat( frames )
        except KeyError:
            no_data_ids.append( npdes_id )

`

It was an early notebook where I am trying to attach the facility's id to the dataframe. I'm suspecting that may be duplicating some lines.

ericnost · 2020-06-03T18:48:11Z

I counted 1,293 unique NPDES_IDs for LA-2 in Cross-Program and 1,744 in CWA- Violations. Makes sense since violations are also more in CWA-Violations.

Yes, I think this is echoing what you're seeing Steve.

shansen5 · 2020-06-03T18:56:02Z

Taking out the code lines that are attaching the facility id to the program data doesn't account for the difference. The difference is only 11 records, at my_cd_npdes2=30,529 versus my_cd_npdes=30,518. I'll have to export and diff the csv files to see more.

my_cd_npdes2 = pd.DataFrame()
no_data_ids = []
for fac in my_cd_facs.itertuples():
    ids = fac.NPDES_IDS
    for npdes_id in ids.split():
        try:
            npdes_data = qncr_data.loc[ npdes_id ].copy()
            frames = [my_cd_npdes, npdes_data]
            my_cd_npdes2 = pd.concat( frames )
        except KeyError:
            no_data_ids.append( npdes_id )

ericnost · 2020-06-03T19:04:47Z

Sounds good, @shansen5! I am wondering how much time we need to spend on this if the idea is to move away from the separate notebooks and use Cross-Program for everything.

I am also going to propose a more structured verification exercise. Which would helps us make sure Cross-Program is accurate.

ericnost · 2020-06-04T00:57:56Z

Hi @shansen5, sorry for all the traffic here. I wonder if the issue is that in Cross-Program, it does not appear as if we are splitting ID strings. In other words, the SQL query might look like this:

...where NPDES_ID in ('LA1234', 'LA8888 LA9999 LA2323, LA4567', 'LA0101')

but there would be no match for 'LA8888 LA9999 LA2323'. That would lead to fewer records returned (so CWA-Violations might actually be more accurate...)

In other words, I think we might need another step before
for pos,row in enumerate( ee_ids ): id_string = id_string + "'"+str(row)+"',"

in def get_data(), to first separate out any multiple ids in the row of ee_ids.

I very well might be wrong about this if I'm not reading the data class right.

shansen5 · 2020-06-07T04:41:46Z

ids_string = program.echo_type + '_IDS'
ids = echo_data.loc[echo_data[ids_string].str.len() >0]    # just give all _IDs and let sql deal with it?
ids = ids.loc[:,ids_string].unique()

This code in the facility-all-programs notebook was skipping all ids in every string that has more than one id. In this string "LA0005266 LAG670191 LAJ650044 LAR10E952 LAR10G428 LAR10L158" none of these ids get included. There were 574 ids in cwa-compliance-violations that weren't being included in facility-all-programs. I changed the code to this:

ids_string = program.echo_type + '_IDS'
ids = list()
for index, value in echo_data[ ids_string ].items():
    try:
        for npdes_id in value.split():
            ids.append( npdes_id )
    except ( KeyError, AttributeError ) as e:
        pass

This now gives results like those Cole came up in his report, which used cwa-compliance-violations.

ericnost · 2020-06-07T15:04:39Z

Thanks for correcting this @shansen5 - I think I forgot to come back to ids = echo_data.loc[echo_data[ids_string].str.len() >0] when getting this notebook ready for Chris's class last month.

ericnost assigned ericnost, shansen5 and calderEDJ Jun 3, 2020

ericnost added invalid This doesn't seem right question Further information is requested labels Jun 3, 2020

ericnost mentioned this issue Jun 4, 2020

Add state vs federal actions #25

Open

ericnost closed this as completed Jun 7, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Verify consistency across notebooks, specifically Cross Program and CWA Violations #22

Verify consistency across notebooks, specifically Cross Program and CWA Violations #22

ericnost commented Jun 3, 2020

ericnost commented Jun 3, 2020

ericnost commented Jun 3, 2020

ericnost commented Jun 3, 2020

shansen5 commented Jun 3, 2020 •

edited

Loading

ericnost commented Jun 3, 2020

shansen5 commented Jun 3, 2020

ericnost commented Jun 3, 2020 •

edited

Loading

ericnost commented Jun 4, 2020 •

edited

Loading

shansen5 commented Jun 7, 2020

ericnost commented Jun 7, 2020

Verify consistency across notebooks, specifically Cross Program and CWA Violations #22

Verify consistency across notebooks, specifically Cross Program and CWA Violations #22

Comments

ericnost commented Jun 3, 2020

ericnost commented Jun 3, 2020

ericnost commented Jun 3, 2020

ericnost commented Jun 3, 2020

shansen5 commented Jun 3, 2020 • edited Loading

ericnost commented Jun 3, 2020

shansen5 commented Jun 3, 2020

ericnost commented Jun 3, 2020 • edited Loading

ericnost commented Jun 4, 2020 • edited Loading

shansen5 commented Jun 7, 2020

ericnost commented Jun 7, 2020

shansen5 commented Jun 3, 2020 •

edited

Loading

ericnost commented Jun 3, 2020 •

edited

Loading

ericnost commented Jun 4, 2020 •

edited

Loading