Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add docstrings for make_data_sets module #11

Merged
merged 1 commit into from
Oct 8, 2020

Conversation

Mr0grog
Copy link
Member

@Mr0grog Mr0grog commented Oct 7, 2020

This adds a docstring for the module (really just turns the existing comment into a docstring) and one for the make_data_sets() function.

Will add other docstrings as a separate PR. I wanted to do this one by itself so we can hash out any necessary discussion over the style (these use the Numpy style, which is a little verbose, but reads nicely in both code form and generated/parsed docs, and is what we use in all other EDGI python projects). Also, some open questions for discussion about language below.

I didn’t want to change the function or argument names, but looking at the implementation, I realized I had misunderstood what this was supposed to do based on the name and on the video tutorial I’d watched through a bit too quickly. To clarify, I described this as working with “preset configurations,” which seemed clearer to me. Happy to adjust if that doesn’t seem good to you.

In Google Colab, this displays like:

colab-docstring-popup


Separate but related: I didn’t make any implementation changes since I wanted to keep the scope narrowly focused on documentation here, but I noticed there’s a lot of repeated boilerplate here. I think it might be useful to rewrite this using a pattern like the following:

# List of presets and their associated options.
#
# These could also be actual `DataSet` objects instead of dicts since
# DataSets don't seem to do anything when constructed and therefore
# aren't too expensive (probably only slightly moreso than these dicts).
# On the other hand, I don't know if there's an expectation that
# repeated calls to `make_data_sets()` should construct new DataSet
# instances. It doesn't look like repeated calling is really intended,
# but I'm not familiar enough with all the notebooks.
# (Also, the dicts are safer for protecting yourself in case a future
# change makes DataSet actually do something at construction time.)
PRESETS = {
    "RCRA Violations": dict(
        idx_field="ID_NUMBER", 
        base_table="RCRA_VIOLATIONS",
        table_name="RCRA_VIOLATIONS_MVIEW",
        echo_type="RCRA",
        date_field="DATE_VIOLATION_DETERMINED",
        date_format="%m/%d/%Y",
        agg_type="count", 
        agg_col="VIOL_DETERMINED_BY_AGENCY", 
        unit="violations"
    ),
    
    "RCRA Inspections": dict(
        idx_field="ID_NUMBER", 
        base_table="RCRA_EVALUATIONS",
        table_name="RCRA_EVALUATIONS_MVIEW",
        echo_type="RCRA",
        date_field="EVALUATION_START_DATE",
        date_format="%m/%d/%Y", 
        agg_type="count",
        agg_col="EVALUATION_AGENCY", 
        unit="inspections"
    ),
    # etc.
}

def make_data_sets(dataset_names=None):
    return {name: DataSet(name=name, **PRESETS[name])
            for name in dataset_names or PRESETS.keys()}

Copy link
Member

@ericnost ericnost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve of these changes @Mr0grog I am going to go ahead and merge in order to make some edits needed for a workshop tomorrow.

I like the idea of using Numpy documentation style, since it matches with the rest of EDGI's approach. Verbose is fine by me!

Yes, I think the way you've phrased it "creates objects from a list of preset configurations" is exactly right. I am not familiar with Python's DataSet object actually, hence dictionaries. They are also, I think?, easier to turn into Pandas dataframes, which is ultimately what happens in the notebooks.

As for reorganizing the function, I guess we haven't fully thought the use case of repeated calls through. I may not entirely understand the potential change (@shansen5 has worked more closely here). My concern is that a user could run CAA Violations for MA-4 then CAA Violations for MA-5. We would need an update - a new DataSet instance should be created.

@ericnost ericnost merged commit 383e75c into master Oct 8, 2020
@Mr0grog
Copy link
Member Author

Mr0grog commented Oct 9, 2020

I am not familiar with Python's DataSet object actually, hence dictionaries. They are also, I think?, easier to turn into Pandas dataframes, which is ultimately what happens in the notebooks.

Oh! DataSet isn’t a Python built-in, it’s a class defined in this repo, here:

ECHO_modules/DataSet.py

Lines 49 to 55 in 383e75c

# This class represents the data set and the fields and methods it requires
# to retrieve data from the database.
class DataSet:
def __init__( self, name, base_table, table_name, echo_type=None,
idx_field=None, date_field=None, date_format=None,
sql=None, agg_type=None, agg_col=None, unit=None):

Pandas definitely has built-ins for turning lists and dictionaries into dataframes, but it looks like DataSet has a few methods that create or do things with dataframes.

My concern is that a user could run CAA Violations for MA-4 then CAA Violations for MA-5. We would need an update - a new DataSet instance should be created.

Makes sense! Then I think the better approach is one that matches the example I wrote out, with a dictionary of dictionaries listing the constructor arguments for DataSet (i.e. the “presets”). I’ll make a PR for that.

@Mr0grog Mr0grog deleted the docstringify-make_data_sets branch October 9, 2020 01:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants