Endpoint.get_data_frame() does not support hierarchical/MultiIndexed data structures #211

Closed
quepasapedro opened this issue Mar 31, 2021 · 0 comments
Labels: bug (Something isn't working), enhancement (New feature or request)

This came up in the Slack channel, and I figured it'd be worth reporting here. See the discussion here: https://nbaapi.slack.com/archives/C012E7UH022/p1617120838011400

Issue:

When the API response includes nested (hierarchical) column headers, Endpoint.get_data_frames() raises an error:

>>> response = endpoints.LeagueDashPlayerShotLocations(distance_range='5ft Range', per_mode_detailed='PerGame')
>>> response.get_data_frames()
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py in _list_to_arrays(data, columns, coerce_float, dtype)
    496         result = _convert_object_array(
--> 497             content, columns, dtype=dtype, coerce_float=coerce_float
    498         )

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py in _convert_object_array(content, columns, coerce_float, dtype)
    580             raise AssertionError(
--> 581                 f"{len(columns)} columns passed, passed data had "
    582                 f"{len(content)} columns"

AssertionError: 2 columns passed, passed data had 32 columns

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-729-d4156e49aea5> in <module>
      1 response = endpoints.LeagueDashPlayerShotLocations(distance_range='5ft Range', per_mode_detailed='PerGame')
----> 2 response.get_data_frames()

/Library/Python/3.7/site-packages/nba_api/stats/endpoints/_base.py in get_data_frames(self)
     50 
     51     def get_data_frames(self):
---> 52         return [data_set.get_data_frame() for data_set in self.data_sets]

/Library/Python/3.7/site-packages/nba_api/stats/endpoints/_base.py in <listcomp>(.0)
     50 
     51     def get_data_frames(self):
---> 52         return [data_set.get_data_frame() for data_set in self.data_sets]

/Library/Python/3.7/site-packages/nba_api/stats/endpoints/_base.py in get_data_frame(self)
     26             if not PANDAS:
     27                 raise Exception('Import Missing - Failed to import DataFrame from pandas.')
---> 28             return DataFrame(self.data['data'], columns=self.data['headers'])
     29 
     30     def get_request_url(self):

~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    472                     if is_named_tuple(data[0]) and columns is None:
    473                         columns = data[0]._fields
--> 474                     arrays, columns = to_arrays(data, columns, dtype=dtype)
    475                     columns = ensure_index(columns)
    476 

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py in to_arrays(data, columns, coerce_float, dtype)
    459         return [], []  # columns if columns is not None else []
    460     if isinstance(data[0], (list, tuple)):
--> 461         return _list_to_arrays(data, columns, coerce_float=coerce_float, dtype=dtype)
    462     elif isinstance(data[0], abc.Mapping):
    463         return _list_of_dict_to_arrays(

~/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py in _list_to_arrays(data, columns, coerce_float, dtype)
    498         )
    499     except AssertionError as e:
--> 500         raise ValueError(e) from e
    501     return result
    502 

ValueError: 2 columns passed, passed data had 32 columns

Details:

The root of the issue is how get_data_frame (which get_data_frames calls for each data set) constructs a DataFrame from the response data. It assumes that self.data['headers'] is a 1d list of column names, but this isn't always the case:

        def get_data_frame(self):
            if not PANDAS:
                raise Exception('Import Missing - Failed to import DataFrame from pandas.')
            return DataFrame(self.data['data'], columns=self.data['headers'])

For example, the LeagueDashPlayerShotLocations endpoint returns a hierarchical column structure, as you can see in this screenshot pulled from the Slack discussion:

[Screenshot from the Slack discussion: hierarchical column headers for LeagueDashPlayerShotLocations]

When we look at headers, we see that in this case, it's actually a list of dicts, each with a columnNames key which points to a list of column names.

>>> response = endpoints.LeagueDashPlayerShotLocations(distance_range='5ft Range', per_mode_detailed='PerGame')
>>> response.get_dict()['resultSets']['headers']
[{'name': 'SHOT_CATEGORY',
  'columnsToSkip': 5,
  'columnSpan': 3,
  'columnNames': ['Less Than 5 ft.',
   '5-9 ft.',
   '10-14 ft.',
   '15-19 ft.',
   '20-24 ft.',
   '25-29 ft.',
   '30-34 ft.',
   '35-39 ft.',
   '40+ ft.']},
 {'name': 'columns',
  'columnSpan': 1,
  'columnNames': ['PLAYER_ID',
   'PLAYER_NAME',
   'TEAM_ID',
   'TEAM_ABBREVIATION',
   'AGE',
   'FGM',
   'FGA',
   'FG_PCT',
   'FGM',
   'FGA',
   'FG_PCT',
   'FGM',
   'FGA',
   'FG_PCT',
   'FGM',
   'FGA',
   'FG_PCT',
   'FGM',
   'FGA',
   'FG_PCT',
   'FGM',
   'FGA',
   'FG_PCT',
   'FGM',
   'FGA',
   'FG_PCT',
   'FGM',
   'FGA',
   'FG_PCT',
   'FGM',
   'FGA',
   'FG_PCT']}]
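
To make the numbers in the error message concrete, here is a small sketch that counts the names in a trimmed copy of the headers above. It assumes that `columnsToSkip` is the number of leading ungrouped columns and that `columnSpan` is how many leaf columns sit under each group name. That is my reading of the payload, not something I have seen documented.

```python
# Trimmed copy of the headers shown above (only the fields used here).
headers = [
    {"name": "SHOT_CATEGORY", "columnsToSkip": 5, "columnSpan": 3,
     "columnNames": ["Less Than 5 ft.", "5-9 ft.", "10-14 ft.", "15-19 ft.",
                     "20-24 ft.", "25-29 ft.", "30-34 ft.", "35-39 ft.",
                     "40+ ft."]},
    {"name": "columns", "columnSpan": 1,
     "columnNames": ["PLAYER_ID", "PLAYER_NAME", "TEAM_ID",
                     "TEAM_ABBREVIATION", "AGE"] + ["FGM", "FGA", "FG_PCT"] * 9},
]

group, leaf = headers

# DataFrame treats each dict as one column label: "2 columns passed".
n_passed = len(headers)

# Assumed meaning of the fields: skipped columns + span * number of groups.
n_expected = group["columnsToSkip"] + group["columnSpan"] * len(group["columnNames"])

# The leaf names line up with the row width: "passed data had 32 columns".
n_leaf = len(leaf["columnNames"])

print(n_passed, n_expected, n_leaf)  # prints: 2 32 32
```

So the row data is 32 values wide, but pandas is handed only 2 "column names" (the two dicts).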

How to fix?

I don't have a great recommendation for how to fix this in a stable, extensible way. I was able to put together a hacky one-off fix by manually creating a MultiIndex object and passing it in as the columns argument when creating the DataFrame. That solution depends on manually mapping the hierarchy between columns, so it won't work for responses with a different number or set of columns.

import pandas as pd
from nba_api.stats import endpoints

response = endpoints.LeagueDashPlayerShotLocations(distance_range='5ft Range', per_mode_detailed='PerGame')
result_sets = response.get_dict()['resultSets']
zone_names = result_sets['headers'][0]['columnNames']
column_names = result_sets['headers'][1]['columnNames']

# There are 3 columns for each floor zone (FGM, FGA, FG_PCT),
# so repeat each zone name 3 times.
# We don't want to nest the first 5 columns under a "floor zone" node,
# so manually add 5 'player_data' labels as a custom top-level index.
all_columns = ['player_data'] * 5 + [zone for zone in zone_names for _ in range(3)]

# Sanity check: one top-level label for every column name.
assert len(column_names) == len(all_columns)

# From the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
#   The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects.
#   You can think of MultiIndex as an array of tuples where each tuple is unique.
col = pd.MultiIndex.from_arrays([all_columns, column_names])

player_shooting_df = pd.DataFrame(data=result_sets['rowSet'], columns=col)

The structure of the response data makes this a little tricky to fix, since the first 5 columns in the columns object ('PLAYER_ID', 'PLAYER_NAME', 'TEAM_ID', 'TEAM_ABBREVIATION', 'AGE') don't actually belong to any of the column hierarchies in the SHOT_CATEGORY list.
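
One direction that might generalize is to read the padding and repeat counts from the headers themselves instead of hard-coding 5 and 3. The sketch below is untested against the live API and rests on assumptions: `build_columns` is a hypothetical helper (not part of nba_api), the last dict in `headers` is taken to hold the leaf column names, and `columnsToSkip` / `columnSpan` are taken to mean "leading ungrouped columns" and "leaf columns per group name". The sample payload is made up, in the same shape as the real one.

```python
import pandas as pd


def build_columns(headers):
    """Hypothetical helper: turn a result set's headers into DataFrame columns."""
    # Flat case: a plain list of names, which get_data_frame already handles.
    if not headers or not isinstance(headers[0], dict):
        return list(headers)

    # Assumption: the last dict holds the leaf names and every earlier
    # dict describes one grouping level above them.
    *groups, leaf = headers
    leaf_names = list(leaf["columnNames"])

    levels = []
    for group in groups:
        # Assumption: the first columnsToSkip columns are ungrouped, and each
        # group name covers columnSpan consecutive leaf columns.
        labels = [""] * group.get("columnsToSkip", 0)
        for name in group["columnNames"]:
            labels.extend([name] * group.get("columnSpan", 1))
        if len(labels) != len(leaf_names):
            raise ValueError(
                f"group level has {len(labels)} labels, expected {len(leaf_names)}"
            )
        levels.append(labels)

    levels.append(leaf_names)
    return pd.MultiIndex.from_arrays(levels)


# Made-up miniature payload in the same shape as the real response.
headers = [
    {"name": "SHOT_CATEGORY", "columnsToSkip": 2, "columnSpan": 2,
     "columnNames": ["Less Than 5 ft.", "5-9 ft."]},
    {"name": "columns", "columnSpan": 1,
     "columnNames": ["PLAYER_ID", "PLAYER_NAME", "FGM", "FGA", "FGM", "FGA"]},
]
rows = [[1, "Player A", 3.0, 5.0, 1.0, 2.0]]

df = pd.DataFrame(rows, columns=build_columns(headers))
print(df["5-9 ft."])  # the FGM/FGA pair for one zone
```

Ungrouped columns get an empty top-level label here rather than a placeholder like 'player_data'; either works, it is just a naming choice. Whether other endpoints follow the same header layout is something I haven't checked.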

I'm happy to help out however I can; I just don't have a great sense of how we should fix this.

rsforbes added the bug and enhancement labels on Nov 14, 2021