Skip to content

Conversation

@samredai
Copy link
Contributor

@samredai samredai commented Feb 1, 2022

This incorporates some of the post-merge feedback for PR #3691, namely, it adds an InputStream and OutputStream protocol which are used as the return types for InputFile.open() and OutputFile.create().

Using a typing.Protocol (thanks for the suggestion @emkornfield) allows for flexibility of these implementations. Validation is essentially structural subtyping (aka "static duck typing"). This will be very handy when developing FileIO implementations that heavily rely on third party libraries. It should allow us to inform the implementer of the interface without requiring them to generate a wrapper class that inherits from an ABC and wraps the file-like objects provided by the third party library.

To demonstrate, an object that matches the protocol will return True for an isinstance check, even if the type of the object does not directly inherit from that class:

from typing import Protocol, runtime_checkable


@runtime_checkable
class Speaker(Protocol):
    def speak(self) -> str:
        ...

class Greeter:
    def speak(self):
        return "Hello!"
        
greeter = Greeter()
isinstance(greeter, Speaker)  # True

The purpose of the runtime_checkable decorator is to allow for the runtime isinstance check. This gives an implementer the option to validate that InputStream or OutputStream is fully implemented (see an example of how I added this to the LocalFileIO implementation in commit 2003f73).

This would be a replacement for having to do something like...

class SmartOpenInputFileStream(InputFileStream):

    def __init__(self, smart_open_file_stream):
        self._file_stream = smart_open_file_stream

    @property
    def file_stream(self):
        return self._file_stream

    def readable(self) -> bool:
        return self.file_stream.readable()

...which wraps the file-like object returned by smartopen's open function.

Python 3.7 support

It's important to note that typing.Protocol was introduced in Python 3.8 (PEP 544) but we're currently supporting Python 3.7+. Python 3.7's end of life is scheduled for about a year and half from now. Btw, this is the explanation for the tests failing for this PR. :)

Curious to get everyone's thoughts!

@github-actions github-actions bot added the python label Feb 1, 2022
@samredai samredai requested review from jun-he and rdblue February 1, 2022 16:53
Copy link
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable to me, thanks for doing this. I'm still a little bit unsure that the fine-grained separation of Input/Output is pythonic but that ship has probably sailed.

Just to clarify we expect to wrap these objects in higher level constructs if needed (e.g. if we want to pass a JSON input stream to the python JSON parser are these methods enough or is there a wrapper necessary to do read lines?).

@runtime_checkable
class InputStream(Protocol):
def read(self, n: int) -> bytes:
...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs for methods? Or maybe a top level doc that these should have the same semantics as IOBase?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these copied from IOBase? I'm not sure what should be here for Python. Are these the methods needed for Arrow to read from this stream?

Copy link
Contributor Author

@samredai samredai Feb 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the equivalent of SeekableInputStream and PositionOutputStream in the legacy python client. It's just less strict here in that any class with the same structure is automatically a subclass.

I think this highlights the behavior:

from typing import Protocol, runtime_checkable

@runtime_checkable
class A(Protocol):
    def foo(self):
        ...
    def bar(self):
        ...

class B:
    def foo(self):
        ...
    def bar(self):
        ...

class C:
    def foo(self):
        ...
    def bar(self):
        ...
    def baz(self):
        ...

class D:
    def foo(self):
        ...

isinstance(B(), A)  # True
isinstance(C(), A)  # True
isinstance(D(), A)  # False

So an input file or output file class could have more methods, as long as it has the required methods set in the protocol which should be the case for any file-like object that's based on RawIOBase (inherits from IOBase and adds a read, readall, readinto, and write method).

The reason I renamed a few of the methods from the legacy client was also to be in-line with RawIOBase. Here's an example of how urllibs HTTPResponse object matches this protocol:

import urllib.request

response = urllib.request.urlopen('http://python.org/')
isinstance(response, InputStream)  # True

print(dir(response))

output:

[...
 'begin',
 'chunk_left',
 'chunked',
 'close',
 'closed',
 'code',
 'debuglevel',
 'detach',
 'fileno',
 'flush',
 'fp',
 'getcode',
 'getheader',
 'getheaders',
 'geturl',
 'headers',
 'info',
 'isatty',
 'isclosed',
 'length',
 'msg',
 'peek',
 'read',
 'read1',
 'readable',
 'readinto',
 'readinto1',
 'readline',
 'readlines',
 'reason',
 'seek',
 'seekable',
 'status',
 'tell',
 'truncate',
 'url',
 'version',
 'will_close',
 'writable',
 'write',
 'writelines']

One of the remaining questions though is where do we set this isinstance check, or do we do it at all? The rest of the library will be assuming this interface so do we leave it to every InputFile and OutputFile implementations to do an isinstance check? Or do we do the isinstance check in other parts of the library wherever FileIO.new_input_file(...) is called, for example. This feels like it might be a tradeoff to this approach in addition to lack of python 3.7 support.

(sorry for the super long comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm much less concerned about running an isinstance check than about whether anything that conforms to this protocol can be passed into Arrow to read a Parquet file, or the equivalent on the write path. I'm not familiar with that requirement... so how do you know that this is going to work and we don't require more methods?

Copy link
Contributor Author

@samredai samredai Feb 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. This section of the pyarrow docs suggests that an instance of pyarrow.NativeFile will provide the best performance and can be provided directly to pyarrow.parquet.read_table. It has many more methods but includes the methods defined in both protocols here so it would be implicitly considered a subtype of both.

Copy link
Contributor

@emkornfield emkornfield Feb 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TL;DR; My best guess is:

  • read(...)
  • write(...)
  • seek(...)
  • tell(...)
  • writable()
  • close()
  • closed()

I don't know if other libraries (e.g. json) might require more.

I think the safe thing is to use IOBase methods as the protocol. The code linked above mostly seems to only require "writeable()". The C++ code that does the adaptation is here it looks like it accesses most IOBase methods and read(...)/write(...) methods (these I think should take there signatures from RawIOBase. The methods that don't look like they are necessary are 'isattyandfileno. Some of the other methods are just straight pass-through to the object (truncate, readline, readlines`) but even if Arrow won't blow-up I think the question is what other libraries to we expect to interact with and if those require the methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for that we just need to try it out until we have a minimal set of functions.

Great idea @rdblue, I tried this out. @emkornfield it looks like your guess was right--although writable was the only method that could be removed without causing any error. I found that interesting since the PythonFile source code you linked to seems to need that to infer the handler mode. I might be missing where that's inferred some other way, maybe duck-typing.

Read requires:

  • read(...)
  • seek(...)
  • tell(...)
  • closed()
  • close()

Write requires:

  • write(...)
  • closed()
  • close()

This was the code I used to test this out:

from pyarrow import parquet as pq


class InputFileImpl:
    def __init__(self, filepath):
        file_object = open(filepath, "rb")

        # Required methods for use in pq.read_table(...)
        self.read = file_object.read
        self.seek = file_object.seek
        self.tell = file_object.tell
        self.closed = file_object.closed
        self.close = file_object.close

class OutputFileImpl:
    def __init__(self, filepath):
        file_object = open(filepath, "wb")

        # Required methods for use in pq.write_table(...)
        self.write = file_object.write
        self.closed = file_object.closed
        self.close = file_object.close


input_file = InputFileImpl("example.parquet")
table = pq.read_table(input_file)

output_file = OutputFileImpl("example_output.parquet")
pq.write_table(table, output_file)
output_file.close()
table_reread = pq.read_table("example_output.parquet")

table.equals(table_reread)  # returns True

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit surprised that tell was required for read, but not for write... in the JVM implementation we only need the position in the write path, not the read path. Maybe that's tracked internally by the Arrow Parquet writer, though. (@xhochy might know)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd guess for parquet the mode is passed through as a parameter so that code path isn't hit.

Tell is used in determining file size I believe on the read path (capturing and restoring current position). For write path I think parquet does a fair amount of internal state tracking. This could be an artifact of the file abstractions in parquet being initially separate from Arrow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this PR to only include the minimally required pyarrow methods for the protocols. I also dropped 3.7 from the tox configuration and python-ci workflow.

output_file = open(self.parsed_location.path, "wb" if overwrite else "xb")
if not isinstance(output_file, OutputStream):
raise RuntimeError(
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the formatting on this looks weird, is there some way to get black to make a better choice?

Copy link
Contributor Author

@samredai samredai Feb 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've been planning on increasing the auto formatter's line length setting for a while so I took this opportunity to open PR #4025 for that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merged the line length update.

python/tox.ini Outdated

[tox]
envlist = py37,py38,py39,linters
envlist = py38,py39,linters
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 year 4 months seems like quite a long time from now, do we really want to drop support?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but Protocol was added in 3.8. If we're really aiming to use this for documentation purposes only and don't actually need the functionality of structural subtyping, I can change these to normal classes and include "These are Protocols for documentation purposes only" in the docstrings. How does that sound?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the backport not useable to keep 3.7?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're absolutely right, I was getting an import error and thought it wasn't in the backport, but I was expecting an override of the normal typing import line. I'll update this to use from typing_extensions and add back 3.7. Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated!

def open(self) -> InputStream:
input_file = open(self.parsed_location.path, "rb")
if not isinstance(input_file, InputStream):
raise RuntimeError("""Object returned from LocalInputFile.open does not match the OutputStream protocol.""")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small nit: TypeError is better here but since it is a test it is minor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

from abc import ABC, abstractmethod
from typing import Union

from typing_extensions import Protocol, runtime_checkable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might pay to put a conditional import here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i.e. try to import from the native 3.8+ version and fallback to extension? I think this could make the typing_extensions optional for 3.8 (not sure if we consider this a big deal).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a great catch, I updated this to:

try:
    from typing import Protocol, runtime_checkable
except ImportError:  # pragma: no cover
    from typing_extensions import Protocol  # type: ignore
    from typing_extensions import runtime_checkable

python/tox.ini Outdated
coverage
mock
pytest
typing-extensions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not very familiar with tox but is there a way to specify this is required for python 3.7?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think python 3.7 may come with typing-extensions by default. Let me confirm that and we can just remove this dep here entirely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed and updated!

@rdblue
Copy link
Contributor

rdblue commented Feb 10, 2022

Merging this. Thanks for the update, @samredai! Good to know exactly what we need to implement for files to work.

@rdblue rdblue merged commit 9b6b5e0 into apache:master Feb 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants