REF/BUG/TYP: read_csv shouldn't close user-provided file handles #36997
cc @gfyoung
the failure on windows when reading ./pandas/tests/io/sas/data/test_sas7bdat_2.csv was caused when
Hello @twoertwein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-11-04 02:11:22 UTC
@twoertwein can you merge master and I'll take another look
@jbrockmendel rebased. Sorry for the large diff. Most code changes are from changing the return value of
jreback
left a comment
looks good a bunch of small comments.
I think if you can make IOHandleArgs fully functional with methods then the io routines become simpler
jreback
left a comment
looks really good @twoertwein if you'd merge master and ping on green.
@pandas-dev/pandas-core if any comments
pandas/io/json/_json.py
Outdated
mode="wt",
storage_options=storage_options,
)
handle_args = get_handle(
same here, add some comments to indicate the differences between the ioargs and handle_args
maybe think about having a handle arg in IOArgs (that you get by calling ioargs.get_handle()), but this might be too complicated / nested.
combining get_filepath_or_buffer (mostly used to open URLs) and get_handle (compression, opening files, wrapping bytes, and memory mapping) in some way would make a lot of sense. I feel that calling get_filepath_or_buffer inside get_handle would be a good solution. But I would need to first re-visit all places that call get_filepath_or_buffer and get_handle to make sure that this satisfies everyone.
@jreback Do you prefer to have this in this PR or in a followup? If adding more changes to this PR is okay from a review perspective, I'm tempted to add it to this PR (instead of touching IOArgs twice). I'll probably have time to combine them by the end of next weekend. Is there a deadline/feature freeze for 1.2 upcoming?
I think a follow-on PR would be easier to review
1.2 is scheduled for the end of Nov, so we have a little time
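The idea discussed in this thread, calling the path/URL-resolving step from inside the handle-opening step, could look roughly like the sketch below. It is not pandas code: `resolve_source`, `open_handle`, and the `memory://` prefix are hypothetical stand-ins for get_filepath_or_buffer, get_handle, and a remote URL, and only the standard library is used.

```python
import io
from typing import IO, Any, List, Tuple, Union

Source = Union[str, IO[Any]]


def resolve_source(source: Source) -> Tuple[Source, bool]:
    """Hypothetical stand-in for the get_filepath_or_buffer step.

    Returns the (possibly replaced) source and whether this step opened it.
    """
    prefix = "memory://"  # assumption: pretend this prefix marks a remote resource
    if isinstance(source, str) and source.startswith(prefix):
        return io.BytesIO(source[len(prefix):].encode("utf-8")), True
    return source, False


def open_handle(source: Source, mode: str = "rb") -> Tuple[IO[Any], List[IO[Any]]]:
    """Hypothetical stand-in for get_handle that performs the resolve step itself.

    Returns the handle to read from and the list of handles created here,
    so the caller knows exactly what it is responsible for closing.
    """
    resolved, opened_by_resolve = resolve_source(source)
    created: List[IO[Any]] = [resolved] if opened_by_resolve else []
    if isinstance(resolved, str):
        handle = open(resolved, mode)  # a local path: we open it, so we own it
        created.append(handle)
    else:
        handle = resolved  # already a buffer (user-provided or resolved above)
    return handle, created
```

With a single entry point, callers would no longer need to track two separate result objects; whether that fits every existing call site is the open question raised above.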
if isinstance(path_or_buf, (str, bytes)):
self.path_or_buf = open(path_or_buf, "rb")
self.ioargs = get_filepath_or_buffer(
cc @bashtage if any comments here
this change is more for convenience: get_filepath_or_buffer only performs meaningful operations on strings/paths (opening fsspec resources), but in all cases it creates an IOArgs. Without this, we would need to instantiate an IOArgs ourselves if path_or_buf is not a string.
pandas/_typing.py
Outdated
created_handles: List[Buffer] = dataclasses.field(default_factory=list)
is_wrapped: bool = False
def close(self) -> None:
If there is actual functionality in here, shouldn't we rather move this to io/common.py (or somewhere in the io module) ?
I would expect to have pure typing-related things in this file.
okay, I will move it to io/common. Do you think it is preferable to then import IOHandles (and IOArgs) in _typing so that other modules can import IOHandles from _typing (there are only a few places that will need to import them) or should they directly import it from io/common?
@jorisvandenbossche @jbrockmendel
yes ok to move it to io/common.py
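For readers following this thread, a dataclass with `created_handles`, `is_wrapped`, and a `close` method might behave roughly as sketched below. This is an illustration built from the fields visible in the diff and the detach-instead-of-close rule in the PR description, not the code that was merged; the class name `HandlesSketch` is hypothetical.

```python
import dataclasses
import io
from typing import IO, Any, List


@dataclasses.dataclass
class HandlesSketch:
    """Hypothetical container: the handle to use plus everything we created."""

    handle: IO[Any]
    created_handles: List[IO[Any]] = dataclasses.field(default_factory=list)
    is_wrapped: bool = False

    def close(self) -> None:
        if self.is_wrapped:
            # The handle is a TextIOWrapper around a buffer we do not own:
            # flush pending writes and detach so the buffer stays open.
            self.handle.flush()
            self.handle.detach()
            # A detached wrapper can no longer be closed, so drop it.
            self.created_handles = [
                h for h in self.created_handles if h is not self.handle
            ]
            self.is_wrapped = False
        # Close only the handles created on the user's behalf.
        for created in self.created_handles:
            created.close()
        self.created_handles = []


# Usage: a user-provided binary buffer survives close() with its data intact.
user_buffer = io.BytesIO()
wrapper = io.TextIOWrapper(user_buffer, encoding="utf-8")
handles = HandlesSketch(handle=wrapper, created_handles=[wrapper], is_wrapped=True)
handles.handle.write("a,b")
handles.close()
```

Because the class carries behaviour rather than only type information, it also illustrates why the reviewers preferred it to live in io/common.py.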
…that all created handlers are returned
jreback
left a comment
very small comments, ping on green.
def close(self):
for f in self.handles:
f.close()
self.handles.close()
does this not need self.ioargs.close()?
_read calls get_filepath_or_buffer and closes the handle afterwards itself. It then passes down the file handle to TextFileReader and ParserBase.
pandas/io/parsers.py
Outdated
super().close()
# close additional handles opened by C parser (for compression)
# close additional handles opened by C parser (for memory_map)
in theory you could add this to ioargs, right? (and then add the try/except on the ioargs close); certainly it's fine here unless that suggestion is simpler
I don't know, I'm not familiar enough with resources in the C/Cython part. I would prefer if this close call were made by the C engine itself (or its destructor). I honestly don't like that resources are closed by a different class/function than the one that created them.
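The ownership preference stated here (whoever opens a resource should also close it) can be illustrated with a small standard-library sketch. `MappedReaderSketch` is a hypothetical class, not the C parser: it opens a file and a memory map itself and releases both in its own `close`.

```python
import mmap
import os
import tempfile


class MappedReaderSketch:
    """Hypothetical reader that owns the file and memory map it creates."""

    def __init__(self, path: str) -> None:
        self._file = open(path, "rb")
        # Length 0 maps the whole file; the file must not be empty.
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def read(self) -> bytes:
        return self._map[:]

    def close(self) -> None:
        # Release resources in reverse order of acquisition.
        self._map.close()
        self._file.close()

    def __enter__(self) -> "MappedReaderSketch":
        return self

    def __exit__(self, *exc_info: object) -> None:
        self.close()


# Usage: the caller never touches the map or the file object directly.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "data.csv")
    with open(csv_path, "wb") as out:
        out.write(b"a,b\n1,2\n")
    with MappedReaderSketch(csv_path) as reader:
        content = reader.read()
```

Keeping acquisition and release in one class means no outer wrapper has to know that a memory map exists at all.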
from pre-commit checks
lgtm ping on green.
@jreback green'ish
i restarted the pre commit
@jreback pre-commit is green
thanks @twoertwein really nice!
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
REF/BUG: de-duplicate all file handling in TextReader by calling get_handle in CParserWrapper. When TextReader gets a string it uses memory mapping (it is given a file object in all other cases).
REF/TYP: The second commit adds a new return value to get_handle: whether the buffer is wrapped inside a TextIOWrapper. In that case we cannot close it; we need to detach it (and flush it if we wrote to it). I made get_handle return a typed dataclass HandleArgs and made sure that all created handles are in HandleArgs.created_handles, so there is no need to close HandleArgs.handle (unless it is created by get_filepath_or_buffer).
I used asserts for mypy when I'm 100% certain about the type; otherwise I added mypy ignore statements.
In the future it might be good to merge get_handle and get_filepath_or_buffer.
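The behaviour named in the PR title can be shown with a minimal standard-library sketch. `read_rows` is a hypothetical function, not read_csv; it only demonstrates the rule that a reader closes a handle only when it opened that handle itself.

```python
import csv
import io
import os
from typing import IO, List, Union


def read_rows(path_or_buffer: Union[str, "os.PathLike[str]", IO[str]]) -> List[List[str]]:
    """Hypothetical reader: closes the file only if it opened the file."""
    opened_here = isinstance(path_or_buffer, (str, os.PathLike))
    if opened_here:
        handle = open(path_or_buffer, newline="", encoding="utf-8")
    else:
        handle = path_or_buffer  # user-provided: the user decides when to close
    try:
        return list(csv.reader(handle))
    finally:
        if opened_here:
            handle.close()


# Usage: the caller's buffer is still open and can be rewound and reused.
user_buffer = io.StringIO("a,b\n1,2\n")
rows = read_rows(user_buffer)
user_buffer.seek(0)
rows_again = read_rows(user_buffer)
```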