Add docstrings for `make_data_sets` module #11

Mr0grog · 2020-10-07T07:51:50Z

This adds a docstring for the module (really just turns the existing comment into a docstring) and one for the make_data_sets() function.

Will add other docstrings as a separate PR. I wanted to do this one by itself so we can hash out any necessary discussion over the style (these use the Numpy style, which is a little verbose, but reads nicely in both code form and generated/parsed docs, and is what we use in all other EDGI python projects). Also, some open questions for discussion about language below.

I didn’t want to change the function or argument names, but looking at the implementation, I realized I had misunderstood what this was supposed to do based on the name and on the video tutorial I’d watched through a bit too quickly. To clarify, I described this as working with “preset configurations,” which seemed clearer to me. Happy to adjust if that doesn’t seem good to you.

In Google Colab, this displays like:

Separate but related: I didn’t make any implementation changes since I wanted to keep the scope narrowly focused on documentation here, but I noticed there’s a lot of repeated boilerplate here. I think it might be useful to rewrite this using a pattern like the following:

# List of presets and their associated options.
#
# These could also be actual `DataSet` objects instead of dicts since
# DataSets don't seem to do anything when constructed and therefore
# aren't too expensive (probably only slightly moreso than these dicts).
# On the other hand, I don't know if there's an expectation that
# repeated calls to `make_data_sets()` should construct new DataSet
# instances. It doesn't look like repeated calling is really intended,
# but I'm not familiar enough with all the notebooks.
# (Also, the dicts are safer for protecting yourself in case a future
# change makes DataSet actually do something at construction time.)
PRESETS = {
    "RCRA Violations": dict(
        idx_field="ID_NUMBER", 
        base_table="RCRA_VIOLATIONS",
        table_name="RCRA_VIOLATIONS_MVIEW",
        echo_type="RCRA",
        date_field="DATE_VIOLATION_DETERMINED",
        date_format="%m/%d/%Y",
        agg_type="count", 
        agg_col="VIOL_DETERMINED_BY_AGENCY", 
        unit="violations"
    ),
    
    "RCRA Inspections": dict(
        idx_field="ID_NUMBER", 
        base_table="RCRA_EVALUATIONS",
        table_name="RCRA_EVALUATIONS_MVIEW",
        echo_type="RCRA",
        date_field="EVALUATION_START_DATE",
        date_format="%m/%d/%Y", 
        agg_type="count",
        agg_col="EVALUATION_AGENCY", 
        unit="inspections"
    ),
    # etc.
}

def make_data_sets(dataset_names=None):
    return {name: DataSet(name=name, **PRESETS[name])
            for name in dataset_names or PRESETS.keys()}

ericnost

I approve of these changes @Mr0grog I am going to go ahead and merge in order to make some edits needed for a workshop tomorrow.

I like the idea of using Numpy documentation style, since it matches with the rest of EDGI's approach. Verbose is fine by me!

Yes, I think the way you've phrased it "creates objects from a list of preset configurations" is exactly right. I am not familiar with Python's DataSet object actually, hence dictionaries. They are also, I think?, easier to turn into Pandas dataframes, which is ultimately what happens in the notebooks.

As for reorganizing the function, I guess we haven't fully thought the use case of repeated calls through. I may not entirely understand the potential change (@shansen5 has worked more closely here). My concern is that a user could run CAA Violations for MA-4 then CAA Violations for MA-5. We would need an update - a new DataSet instance should be created.

Mr0grog · 2020-10-09T01:05:50Z

I am not familiar with Python's DataSet object actually, hence dictionaries. They are also, I think?, easier to turn into Pandas dataframes, which is ultimately what happens in the notebooks.

Oh! DataSet isn’t a Python built-in, it’s a class defined in this repo, here:

ECHO_modules/DataSet.py

Lines 49 to 55 in 383e75c

    
           # This class represents the data set and the fields and methods it requires  
        
           # to retrieve data from the database. 
        
           class DataSet: 
        
               def __init__( self, name, base_table, table_name, echo_type=None, 
        
                            idx_field=None, date_field=None, date_format=None, 
        
                            sql=None, agg_type=None, agg_col=None, unit=None):

Pandas definitely has built-ins for turning lists and dictionaries into dataframes, but it looks like DataSet has a few methods that create or do things with dataframes.

My concern is that a user could run CAA Violations for MA-4 then CAA Violations for MA-5. We would need an update - a new DataSet instance should be created.

Makes sense! Then I think the better approach is one that matches the example I wrote out, with a dictionary of dictionaries listing the constructor arguments for DataSet (i.e. the “presets”). I’ll make a PR for that.

Add docstrings for make_data_sets module

8e4bf07

Mr0grog requested review from shansen5 and ericnost October 7, 2020 07:51

ericnost approved these changes Oct 8, 2020

View reviewed changes

ericnost merged commit 383e75c into master Oct 8, 2020

Mr0grog deleted the docstringify-make_data_sets branch October 9, 2020 01:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add docstrings for `make_data_sets` module #11

Add docstrings for `make_data_sets` module #11

Mr0grog commented Oct 7, 2020 •

edited

Loading

ericnost left a comment •

edited

Loading

Mr0grog commented Oct 9, 2020

Add docstrings for make_data_sets module #11

Add docstrings for make_data_sets module #11

Conversation

Mr0grog commented Oct 7, 2020 • edited Loading

ericnost left a comment • edited Loading

Choose a reason for hiding this comment

Mr0grog commented Oct 9, 2020

Add docstrings for `make_data_sets` module #11

Add docstrings for `make_data_sets` module #11

Mr0grog commented Oct 7, 2020 •

edited

Loading

ericnost left a comment •

edited

Loading