-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Adding InputStream and OutputStream protocols for open and create return types #4021
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
emkornfield
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable to me, thanks for doing this. I'm still a little bit unsure that the fine-grained separation of Input/Output is pythonic but that ship has probably sailed.
Just to clarify we expect to wrap these objects in higher level constructs if needed (e.g. if we want to pass a JSON input stream to the python JSON parser are these methods enough or is there a wrapper necessary to do read lines?).
python/src/iceberg/io/base.py
Outdated
| @runtime_checkable | ||
| class InputStream(Protocol): | ||
| def read(self, n: int) -> bytes: | ||
| ... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs for methods? Or maybe a top level doc that these should have the same semantics as IOBase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these copied from IOBase? I'm not sure what should be here for Python. Are these the methods needed for Arrow to read from this stream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the equivalent of SeekableInputStream and PositionOutputStream in the legacy python client. It's just less strict here in that any class with the same structure is automatically a subclass.
I think this highlights the behavior:
from typing import Protocol, runtime_checkable
@runtime_checkable
class A(Protocol):
def foo(self):
...
def bar(self):
...
class B:
def foo(self):
...
def bar(self):
...
class C:
def foo(self):
...
def bar(self):
...
def baz(self):
...
class D:
def foo(self):
...
isinstance(B(), A) # True
isinstance(C(), A) # True
isinstance(D(), A) # FalseSo an input file or output file class could have more methods, as long as it has the required methods set in the protocol which should be the case for any file-like object that's based on RawIOBase (inherits from IOBase and adds a read, readall, readinto, and write method).
The reason I renamed a few of the methods from the legacy client was also to be in-line with RawIOBase. Here's an example of how urllibs HTTPResponse object matches this protocol:
import urllib.request
response = urllib.request.urlopen('http://python.org/')
isinstance(response, InputStream) # True
print(dir(response))output:
[...
'begin',
'chunk_left',
'chunked',
'close',
'closed',
'code',
'debuglevel',
'detach',
'fileno',
'flush',
'fp',
'getcode',
'getheader',
'getheaders',
'geturl',
'headers',
'info',
'isatty',
'isclosed',
'length',
'msg',
'peek',
'read',
'read1',
'readable',
'readinto',
'readinto1',
'readline',
'readlines',
'reason',
'seek',
'seekable',
'status',
'tell',
'truncate',
'url',
'version',
'will_close',
'writable',
'write',
'writelines']
One of the remaining questions though is where do we set this isinstance check, or do we do it at all? The rest of the library will be assuming this interface so do we leave it to every InputFile and OutputFile implementations to do an isinstance check? Or do we do the isinstance check in other parts of the library wherever FileIO.new_input_file(...) is called, for example. This feels like it might be a tradeoff to this approach in addition to lack of python 3.7 support.
(sorry for the super long comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm much less concerned about running an isinstance check than about whether anything that conforms to this protocol can be passed into Arrow to read a Parquet file, or the equivalent on the write path. I'm not familiar with that requirement... so how do you know that this is going to work and we don't require more methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see. This section of the pyarrow docs suggests that an instance of pyarrow.NativeFile will provide the best performance and can be provided directly to pyarrow.parquet.read_table. It has many more methods but includes the methods defined in both protocols here so it would be implicitly considered a subtype of both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TL;DR; My best guess is:
read(...)write(...)seek(...)tell(...)writable()close()closed()
I don't know if other libraries (e.g. json) might require more.
I think the safe thing is to use IOBase methods as the protocol. The code linked above mostly seems to only require "writeable()". The C++ code that does the adaptation is here it looks like it accesses most IOBase methods and read(...)/write(...) methods (these I think should take there signatures from RawIOBase. The methods that don't look like they are necessary are 'isattyandfileno. Some of the other methods are just straight pass-through to the object (truncate, readline, readlines`) but even if Arrow won't blow-up I think the question is what other libraries to we expect to interact with and if those require the methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for that we just need to try it out until we have a minimal set of functions.
Great idea @rdblue, I tried this out. @emkornfield it looks like your guess was right--although writable was the only method that could be removed without causing any error. I found that interesting since the PythonFile source code you linked to seems to need that to infer the handler mode. I might be missing where that's inferred some other way, maybe duck-typing.
Read requires:
- read(...)
- seek(...)
- tell(...)
- closed()
- close()
Write requires:
- write(...)
- closed()
- close()
This was the code I used to test this out:
from pyarrow import parquet as pq
class InputFileImpl:
def __init__(self, filepath):
file_object = open(filepath, "rb")
# Required methods for use in pq.read_table(...)
self.read = file_object.read
self.seek = file_object.seek
self.tell = file_object.tell
self.closed = file_object.closed
self.close = file_object.close
class OutputFileImpl:
def __init__(self, filepath):
file_object = open(filepath, "wb")
# Required methods for use in pq.write_table(...)
self.write = file_object.write
self.closed = file_object.closed
self.close = file_object.close
input_file = InputFileImpl("example.parquet")
table = pq.read_table(input_file)
output_file = OutputFileImpl("example_output.parquet")
pq.write_table(table, output_file)
output_file.close()
table_reread = pq.read_table("example_output.parquet")
table.equals(table_reread) # returns TrueThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit surprised that tell was required for read, but not for write... in the JVM implementation we only need the position in the write path, not the read path. Maybe that's tracked internally by the Arrow Parquet writer, though. (@xhochy might know)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd guess for parquet the mode is passed through as a parameter so that code path isn't hit.
Tell is used in determining file size I believe on the read path (capturing and restoring current position). For write path I think parquet does a fair amount of internal state tracking. This could be an artifact of the file abstractions in parquet being initially separate from Arrow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated this PR to only include the minimally required pyarrow methods for the protocols. I also dropped 3.7 from the tox configuration and python-ci workflow.
python/tests/io/test_base.py
Outdated
| output_file = open(self.parsed_location.path, "wb" if overwrite else "xb") | ||
| if not isinstance(output_file, OutputStream): | ||
| raise RuntimeError( | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: the formatting on this looks weird, is there some way to get black to make a better choice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've been planning on increasing the auto formatter's line length setting for a while so I took this opportunity to open PR #4025 for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merged the line length update.
python/tox.ini
Outdated
|
|
||
| [tox] | ||
| envlist = py37,py38,py39,linters | ||
| envlist = py38,py39,linters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 year 4 months seems like quite a long time from now, do we really want to drop support?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, but Protocol was added in 3.8. If we're really aiming to use this for documentation purposes only and don't actually need the functionality of structural subtyping, I can change these to normal classes and include "These are Protocols for documentation purposes only" in the docstrings. How does that sound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the backport not useable to keep 3.7?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're absolutely right, I was getting an import error and thought it wasn't in the backport, but I was expecting an override of the normal typing import line. I'll update this to use from typing_extensions and add back 3.7. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated!
python/tests/io/test_base.py
Outdated
| def open(self) -> InputStream: | ||
| input_file = open(self.parsed_location.path, "rb") | ||
| if not isinstance(input_file, InputStream): | ||
| raise RuntimeError("""Object returned from LocalInputFile.open does not match the OutputStream protocol.""") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small nit: TypeError is better here but since it is a test it is minor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated
python/src/iceberg/io/base.py
Outdated
| from abc import ABC, abstractmethod | ||
| from typing import Union | ||
|
|
||
| from typing_extensions import Protocol, runtime_checkable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might pay to put a conditional import here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i.e. try to import from the native 3.8+ version and fallback to extension? I think this could make the typing_extensions optional for 3.8 (not sure if we consider this a big deal).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a great catch, I updated this to:
try:
from typing import Protocol, runtime_checkable
except ImportError: # pragma: no cover
from typing_extensions import Protocol # type: ignore
from typing_extensions import runtime_checkable
python/tox.ini
Outdated
| coverage | ||
| mock | ||
| pytest | ||
| typing-extensions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not very familiar with tox but is there a way to specify this is required for python 3.7?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually think python 3.7 may come with typing-extensions by default. Let me confirm that and we can just remove this dep here entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed and updated!
|
Merging this. Thanks for the update, @samredai! Good to know exactly what we need to implement for files to work. |
This incorporates some of the post-merge feedback for PR #3691, namely, it adds an
InputStreamandOutputStreamprotocol which are used as the return types forInputFile.open()andOutputFile.create().Using a
typing.Protocol(thanks for the suggestion @emkornfield) allows for flexibility of these implementations. Validation is essentially structural subtyping (aka "static duck typing"). This will be very handy when developing FileIO implementations that heavily rely on third party libraries. It should allow us to inform the implementer of the interface without requiring them to generate a wrapper class that inherits from an ABC and wraps the file-like objects provided by the third party library.To demonstrate, an object that matches the protocol will return
Truefor an isinstance check, even if the type of the object does not directly inherit from that class:The purpose of the
runtime_checkabledecorator is to allow for the runtimeisinstancecheck. This gives an implementer the option to validate thatInputStreamorOutputStreamis fully implemented (see an example of how I added this to theLocalFileIOimplementation in commit 2003f73).This would be a replacement for having to do something like...
...which wraps the file-like object returned by smartopen's open function.
Python 3.7 support
It's important to note that
typing.Protocolwas introduced in Python 3.8 (PEP 544) but we're currently supporting Python 3.7+. Python 3.7's end of life is scheduled for about a year and half from now. Btw, this is the explanation for the tests failing for this PR. :)Curious to get everyone's thoughts!