v0.1.0 #50

Merged
merged 45 commits into from
Jan 30, 2024

Conversation

ericnost
Member

@ericnost ericnost commented Jan 24, 2024

Fixes #53, fixes #57, fixes #48, fixes #49, fixes #55, fixes #27, fixes #51, fixes #60, fixes #58

I'm proposing to repackage ECHO_modules so that it can be distributed through PyPI and installed with pip.

I don't think this will break much^ because the repo will still be installable as a package through pip install git+url

^ replacing state_choropleth_mapper() with choropleth() will break the missing data notebook, but I can fix that afterwards.

Other updates:

  • adds `aggregate_by_facility()` function to support `point_mapper()`
  • replaces state_choropleth_mapper() with choropleth()
  • expands the README
  • provides a notebook full of tutorials/example code and queries
  • basically incorporates useful tools from the work I did on the reorganization branch

After merging, I will set this up with pypi (#49) and also change some settings here on GitHub to automate the process of sending packages to pypi.

  Aggregate a set of records by facility IDs, using sum or count operations.
  Enables point symbol mapping.
  Other facilities in the selection (e.g. facilities in Snohomish County *without*
  reported CWA violations) can be identified and retrieved when other_records is True.

  Parameters
  ----------
  records : DataSetResults object
      The records to aggregate. records should be a DataSetResults object created from
      a database query. For example:
      ds = make_data_sets(["CWA Violations"]) # Create a DataSet for handling the data
      snohomish_cwa_violations = ds["CWA Violations"].store_results(region_type="County", region_value=("SNOHOMISH",), state="WA") # Store results for this DataSet as a DataSetResults object
  program : String
      The name of the program, usually available from records.dataset.name
  other_records : Boolean
      When True, will retrieve other facilities in the selection
      (e.g. facilities in Snohomish County *without* reported CWA violations)

  Returns
  -------
  A dictionary containing:
    the aggregated results
    active facilities regulated under this program but without records for the relevant metric (e.g. violations or inspections)
    the name of the new field that counts or sums up the relevant metric (e.g. violations)
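
To make the sum-versus-count distinction in the docstring concrete, here is a minimal sketch using plain pandas. It is not the package's implementation: the table, the `FAC_ID` and `AMOUNT` column names, and the values are all hypothetical stand-ins for what a DataSetResults object would hold.

```python
import pandas as pd

# Hypothetical records, one row per reported event.
# FAC_ID and AMOUNT are illustrative names, not fields from ECHO.
records = pd.DataFrame({
    "FAC_ID": ["F1", "F1", "F2"],
    "FAC_NAME": ["Plant One", "Plant One", "Plant Two"],
    "FAC_LAT": [47.9, 47.9, 48.0],
    "FAC_LONG": [-122.1, -122.1, -122.2],
    "AMOUNT": [10.0, 5.0, 2.5],
})

keys = ["FAC_ID", "FAC_NAME", "FAC_LAT", "FAC_LONG"]

# Count-style aggregation: how many records each facility has.
counts = records.groupby(keys).size().reset_index(name="count")

# Sum-style aggregation: total of a numeric column per facility.
sums = (
    records.groupby(keys)
    .agg({"AMOUNT": "sum"})
    .rename(columns={"AMOUNT": "sum"})
    .reset_index()
)
```

Either result has one row per facility with its coordinates, which is the shape a point map needs.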
minor fixes to variable names in example code
more minor fixes to example code
more minor changes to example code
ericnost and others added 22 commits January 24, 2024 18:15
add air emissions to `aggregate_by_facility`

  # Air emissions
  elif (program == "Greenhouse Gas Emissions" or program == "Toxic Releases"):
    data = data.groupby([records.dataset.idx_field, "FAC_NAME", "FAC_LAT", "FAC_LONG"]).agg({records.dataset.agg_col:'sum'})
    data['sum'] = data[records.dataset.agg_col]
    data = data.reset_index()
    aggregator = "sum" # keep track of which field we use to aggregate data, which may differ from the preset
add `agg_col`, `agg_type` and `units` to DMRs
attempt to deal with tabs/spaces issue
fix indentation?
debug sql and test choropleth (#55)
add aggregation to toxic releases table (this is tricky because of differing pollutants - but if the data are filtered to a specific pollutant, this makes sense)

fix a problem with the units of GHG and TRI charts
fix error where unit was units in DMR data set presets
fix map in `choropleth()`
packaging branch is deleted

point to main instead
add key_id argument to bring back together data separated for the choropleth
delete old choropleth variables
Currently, when the region value is a list, we smush it together to make it appear on charts. But that has some downstream consequences for DataSet.region_value. Instead, only smush together the multi-selections when we make charts or store DataSet.results
fix `get_active_facilities()` format string
add the Jupyter Notebook for the tutorials
Comment on lines +417 to +433
def differ(input, program):
  '''
  Helper function to sort facilities in this program (input) from the full list of facilities regulated under the program (active)
  '''
  active = get_active_facilities(records.state, records.region_type, records.region_value)

  diff = list(
    set(active[records.dataset.echo_type + "_IDS"]) - set(input[records.dataset.idx_field])
  )

  # get rid of NaNs - probably no program IDs
  diff = [x for x in diff if str(x) != 'nan']

  # ^ Not perfect given that some facilities have multiple NPDES_IDs
  # Below return the full ECHO_EXPORTER details for facilities without violations, penalties, or inspections
  diff = active.loc[active[records.dataset.echo_type + "_IDS"].isin(diff)]
  return diff
Member Author

This isn't a robust way of differentiating facilities because of the way EPA stores program IDs.

The differ function is meant to take data like:
all_facilities = [A, B, C, D, E]
facilities_with_inspections = [C, D]
and calculate:
facilities_without_inspections = [A, B, E]

But in reality, facility/program IDs are more like:
all_facilities = [A X, B Y, C, D Z, E]
facilities_with_inspections = [C, D]
So the resulting list of facilities without inspections would incorrectly be:
A X, B Y, D Z
even though D does have an inspection.

Just need a way to parse apart program IDs, ideally without having to call up the database to look at the EXP_PGM table.
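
One possible direction, sketched here in plain Python rather than against the real tables: split each combined ID string into its individual IDs and drop a facility if any of them appears in the records. This assumes the IDs inside one field are separated by whitespace, as in the "A X" example above. The helper names `split_program_ids` and `facilities_without_records` are hypothetical and not part of ECHO_modules.

```python
def split_program_ids(raw):
    """Split one ID field into a set of individual IDs.

    Assumes IDs within a field are whitespace-separated.
    Missing values (None or NaN) yield an empty set.
    """
    if raw is None:
        return set()
    text = str(raw)
    if text == "nan":
        return set()
    return set(text.split())


def facilities_without_records(active_ids, ids_with_records):
    """Return the entries of active_ids that share no ID with ids_with_records."""
    seen = set(ids_with_records)
    result = []
    for raw in active_ids:
        ids = split_program_ids(raw)
        # Skip facilities with no program ID; keep those with no overlap.
        if ids and ids.isdisjoint(seen):
            result.append(raw)
    return result


all_facilities = ["A X", "B Y", "C", "D Z", "E"]
facilities_with_inspections = ["C", "D"]
facilities_without_inspections = facilities_without_records(
    all_facilities, facilities_with_inspections
)
```

With the example data, "D Z" is excluded because D has an inspection, while "A X", "B Y" and "E" remain. This avoids a database call, but it only works if the separator assumption holds for every program's ID field.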

add detailed notes about data interpretation
formatting quote/bullets
format example code
update tutorial notebook with interpretation and background sections of README
deletes `selector()`

fixes #58
creates `choropleth()`

fixes #55

This will break the missing data notebook, but we will make a note to fix that later
update zip code definition to match what's in the database
correct zip codes fields
change logic in `choropleth()`
fix `choropleth()` parameters
add tooltip to `choropleth()`
quick fix on `choropleth()` tooltip
change numpy dependency to 1.23.5, which is Google Colab's default
@ericnost ericnost marked this pull request as ready for review January 28, 2024 19:02
@shansen5
Collaborator

These all look like good and important improvements to the package.

@shansen5 shansen5 merged commit 200c9bd into main Jan 30, 2024