Skip to content

Conversation

@twoertwein
Copy link
Member

@twoertwein twoertwein commented Oct 9, 2020

REF/BUG: de-duplicate all file handling in TextReader by calling get_handle in CParserWrapper. When TextReader gets a string it uses memory mapping (it is given a file object in all other cases).

REF/TYP: The second commit adds a new return value to get_handle (whether the buffer is wrapped inside a TextIOWrapper: in that case we cannot close it, we need to detach it (and flush it if we wrote to it)). I made get_handle return a typed dataclass HandleArgs and made sure that all created handles are in HandleArgs.created_handles there is no need to close HandleArgs.handle (unless it is created by get_filename_or_buffer).

I used asserts for mypy when I'm 100% certain about the type, otherwise I added mypy ignore statements.

In the future it might be good to merge get_handle and get_filename_or_buffer.

@jreback jreback added the IO CSV read_csv, to_csv label Oct 10, 2020
@jreback
Copy link
Contributor

jreback commented Oct 10, 2020

cc @gfyoung

@twoertwein
Copy link
Member Author

twoertwein commented Oct 12, 2020

the failure on windows when reading ./pandas/tests/io/sas/data/test_sas7bdat_2.csv was caused when encoding was None. TextIOWrapper on windows seems to default to charmap instead of utf-8.

@pep8speaks
Copy link

pep8speaks commented Oct 16, 2020

Hello @twoertwein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-11-04 02:11:22 UTC

@twoertwein twoertwein changed the title BUG/REF: read_csv shouldn't close user-provided file handles REF/BUG/TYP: read_csv shouldn't close user-provided file handles Oct 25, 2020
@twoertwein twoertwein requested a review from jreback October 25, 2020 23:59
@jbrockmendel
Copy link
Member

@twoertwein can you merge master and ill take another look

@twoertwein
Copy link
Member Author

@jbrockmendel rebased. Sorry for the large diff. Most code changes are from changing the return value of get_handle.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good a bunch of small comments.

I think if you can make IOHandleArgs fully functional with methods then the io routines become simpler

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks really good @twoertwein if you'd merge master and ping on green.

@pandas-dev/pandas-core if any comments

mode="wt",
storage_options=storage_options,
)
handle_args = get_handle(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, add some comments to indicate the ifferences in the ioargs and handle_args

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe think about having a handle arg IN IOArgs (that you get by calling ioargs.get_handle() but this might be too complicated / nested.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

combining get_filepath_or_buffer (mostly used to open URLs) and get_handle (compression, opening files, wrapping bytes, and memory mapping) in some way would make a lot of sense. I feel that calling get_filepath_or_buffer inside get_handle would be a good solution. But I would need to first re-visit all places that call get_filepath_or_buffer and get_handle to make sure that this satisfies everyone.

@jreback Do you prefer to have this in this PR or in a followup? If adding more changes to this PR is okay from a review perspective, I'm tempted do add it to this PR (instead of touching IOArgs twice). I'll probably have time to combine them by the end of the next weekend. Is there a deadline/feature freeze for 1.2 upcoming?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think a follow on or would be easier to review

1.2 is schedule for end of nov so have a little time


if isinstance(path_or_buf, (str, bytes)):
self.path_or_buf = open(path_or_buf, "rb")
self.ioargs = get_filepath_or_buffer(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @bashtage if any comments here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change is more for convenience: get_filepath_or_buffer will make only meaningful operations on strings/paths (open fsspec resources) but in all cases it creates an IOArgs. Without this, we would need to instatiate an IOArgs if path_or_buf is not a string.

@jreback jreback added this to the 1.2 milestone Nov 2, 2020
created_handles: List[Buffer] = dataclasses.field(default_factory=list)
is_wrapped: bool = False

def close(self) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is actual functionality in here, shouldn't we rather move this to io/common.py (or somewhere in the io module) ?
I would expect to have pure typing-related things in this file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, I will move it to io/common. Do you think it is preferable to then import IOHandles (and IOArgs) in _typing so that other modules can import IOHandles from _typing (there are only a few places that will need to import them) or should they directly import it from io/common?
@jorisvandenbossche @jbrockmendel

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes ok to move it to io/common.py

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very small comments, ping on green.

def close(self):
for f in self.handles:
f.close()
self.handles.close()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this not need self.ioargs.close()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_read calls get_filepath_or_buffer and closes the handle afterwards itself. It then passes down the file handle to TextFileReader and ParserBase.

super().close()

# close additional handles opened by C parser (for compression)
# close additional handles opened by C parser (for memory_map)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in theory you could add this to ioargs right? (and then add the try/except on the ioargs close); certainly its fine here unless that suggestion is simpler

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know, I'm not familiar enough with resources in the c/cython part. I would prefer if this close call is called by the c-engine itself (or its destructor). I honestly don't like that resources are closed by a different class/function that didn't created it.

@jreback
Copy link
Contributor

jreback commented Nov 4, 2020

pandas/_typing.py:1:1: F401 'dataclasses' imported but unused
pandas/_typing.py:6:1: F401 'typing.Generic' imported but unused

from pre-commit checks

@jreback
Copy link
Contributor

jreback commented Nov 4, 2020

lgtm ping on green.

@twoertwein
Copy link
Member Author

@jreback green'ish

@jreback
Copy link
Contributor

jreback commented Nov 4, 2020

i restarted the pre commit

@twoertwein
Copy link
Member Author

@jreback pre-commit is green

@jreback jreback merged commit a648fb2 into pandas-dev:master Nov 4, 2020
@jreback
Copy link
Contributor

jreback commented Nov 4, 2020

thanks @twoertwein really nice!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

IO CSV read_csv, to_csv

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: Pandas closes user-provided file handles that it doesn't own.

6 participants