Skip to content

Conversation

jknox13
Copy link
Contributor

@jknox13 jknox13 commented Jun 28, 2018

Looks to be a plug and play change between pd.DataFrame.from_csv and
pd.read_csv since I don't see where anything more than the path
argument has been supplied to the function call.

Looks to be a plug and play change between `pd.DataFrame.from_csv` and
`pd.read_csv` since I don't see where anything more than the path
argument has been supplied to the function call.
@codecov-io
Copy link

codecov-io commented Jun 28, 2018

Codecov Report

Merging #180 into master will not change coverage.
The diff coverage is 0%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #180   +/-   ##
=======================================
  Coverage   43.93%   43.93%           
=======================================
  Files          91       91           
  Lines       11666    11666           
=======================================
  Hits         5125     5125           
  Misses       6541     6541
Impacted Files Coverage Δ
allensdk/core/mouse_connectivity_cache.py 92.94% <ø> (ø) ⬆️
allensdk/api/cache.py 80.45% <0%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b2d0bac...6744bd4. Read the comment docs.

@jknox13
Copy link
Contributor Author

jknox13 commented Jun 28, 2018

It might be worth adding a threshold argument to codecov.yml to prevent a 0% change in coverage to fail the build (e.g. scikit-learn's)

@dyf
Copy link
Contributor

dyf commented Jul 2, 2018

@jknox13 I have read that pd.read_csv and pd.from_csv have different defaults related to the use of the index column: https://stackoverflow.com/a/26495839.

I'm a little concerned that blanket replacement will change the format of the files we're caching. Would you mind checking that some of the methods you changed don't adversely affect anything? For example: CellTypesCache.get_ephys_features uses cache_csv_json.

@jknox13
Copy link
Contributor Author

jknox13 commented Jul 2, 2018

Good catch! Here is the full difference between the default kwargs for the two:

kwarg pd.DataFrame.from_csv() pd.read_csv
header 0 'infer'
sep ',' ','
index_col 0 None
parse_dates True False
infer_datetime_format False False

The docs for pd.read_csv state that the default value for header (infer) is identical to header=0 if no names are passed. So truly the difference between default behavior is handling the index and parsing dates.

If a fully compatible calls to pd.read_csv are wanted, we could include the utility function

def compatible_read_csv(path, **kwargs):
    if 'index_col' not in kwargs:
        kwargs['index_col'] = 0
    if 'parse_dates' not in kwargs:
        kwargs['parse_dates'] = True

    return pd.read_csv(path, **kwargs)

A brief testing:

try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO
import pandas as pd


testcsv_index = """
0,A
1,B
2,C
3,D
4,E
"""

testcsv_noindex = """
A,apple
B,banana
C,carrot
D,dog
E,elephant
"""

testcsv_offset = """
2,C
3,D
4,E
5,F
6,G
"""

testcsv_alpha = """
A,0
B,1
C,2
D,3
E,4
"""

testcsv_header = """
letter, fuit-animal
A,apple
B,banana
C,carrot
D,dog
E,elephant
"""

testcsv_allnum = """
0.5, 0.2
0.4, 0.3
0.1, 0.2
0.4, 0.0
0.9, 0.3
"""
tests = (testcsv_index, testcsv_noindex, testcsv_offset, testcsv_alpha,
         testcsv_header, testcsv_allnum,)

def compatible_read_csv(path, **kwargs):
    if 'index_col' not in kwargs:
        kwargs['index_col'] = 0
    if 'parse_dates' not in kwargs:
        kwargs['parse_dates'] = True

    return pd.read_csv(path, **kwargs)

def test_read_from_csv(s):
    _from = pd.DataFrame.from_csv(StringIO(s))
    _read = compatible_read_csv(StringIO(s))

    pd.testing.assert_frame_equal(_from, _read)


if __name__ == '__main__':
    for s in tests:
        test_read_from_csv(s)

@jknox13
Copy link
Contributor Author

jknox13 commented Jul 2, 2018

@dyf What do you think: add a utility function read_csv in the sdk similar to allensdk.core.json_utilities.read(), or supply index=0 to each call of pd.read_csv and parse_dates=True where needed?

@dyf
Copy link
Contributor

dyf commented Jul 25, 2018

I think I prefer the latter. Do you have time to make a pass through @jknox13 ?

@NileGraddis
Copy link
Contributor

Closing as #208 covered the same territory. Thanks @jknox13 !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants