fix(pyspark): fix a bug in spark's to_parquet_dir method #9790

Closed


chloeh13q
Contributor

@chloeh13q chloeh13q commented Aug 7, 2024

Description of changes

Fix a bug in the PySpark backend's to_parquet_dir method. Spark's write method only accepts strings as paths, so passing a pathlib.Path to the current implementation raises something like:

AttributeError: 'PosixPath' object has no attribute '_get_object_id'

I was already converting the path to a string in streaming mode; I just forgot to do the same in batch mode.
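
A minimal sketch of the conversion described above, assuming the fix amounts to passing the path through os.fspath before handing it to Spark (the traceback later in this thread shows df.save(os.fspath(path))). The helper name to_spark_path is hypothetical and is not part of ibis or PySpark.

```python
import os
from pathlib import Path


def to_spark_path(path):
    """Return `path` as a plain string (hypothetical helper, not ibis API)."""
    # os.fspath returns a str unchanged and calls __fspath__ on
    # path-like objects such as pathlib.Path.
    return os.fspath(path)


# A pathlib.Path is converted to str; a str passes through untouched.
as_str = to_spark_path(Path("out") / "parquet_dir")
already_str = to_spark_path("out/parquet_dir")
```

Unlike str(path), os.fspath raises TypeError for objects that are not path-like, so a wrong argument type fails on the Python side rather than inside the JVM call.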

xref: #9781

@chloeh13q chloeh13q requested a review from ncclementi August 7, 2024 22:09
@chloeh13q chloeh13q added the bug (Incorrect behavior inside of ibis) and pyspark (The Apache PySpark backend) labels Aug 7, 2024
@chloeh13q chloeh13q force-pushed the fix/pyspark-to-parquet branch from 777d57d to 848e900 Compare August 8, 2024 05:41
@cpcloud
Member

cpcloud commented Aug 8, 2024

Do we have a test for this?

@chloeh13q
Contributor Author

@cpcloud @ncclementi is adding one in #9781, so I didn't add one.

@ncclementi
Contributor

@chloeh13q when I run the test with the change you suggested, I now get a different error.

    def test_table_to_parquet_dir(tmp_path, backend, awards_players):
        outparquet_dir = tmp_path
>       awards_players.to_parquet_dir(outparquet_dir)

ibis/backends/tests/test_export.py:221: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
ibis/expr/types/core.py:636: in to_parquet_dir
    self._find_backend(use_default=True).to_parquet_dir(self, path, **kwargs)
ibis/backends/pyspark/__init__.py:1329: in to_parquet_dir
    return self._to_filesystem_output(expr, "parquet", path, params, limit, options)
ibis/backends/pyspark/__init__.py:1289: in _to_filesystem_output
    df.save(os.fspath(path)) #df.save(path)
/Users/naty/mambaforge/envs/ibis-dev/lib/python3.11/site-packages/pyspark/sql/readwriter.py:1463: in save
    self._jwrite.save(path)
/Users/naty/mambaforge/envs/ibis-dev/lib/python3.11/site-packages/py4j/java_gateway.py:1322: in __call__
    return_value = get_return_value(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

a = ('xro392', <py4j.clientserver.JavaClient object at 0x13015a9d0>, 'o391', 'save'), kw = {}, converted = AnalysisException()

    def deco(*a: Any, **kw: Any) -> Any:
        try:
            return f(*a, **kw)
        except Py4JJavaError as e:
            converted = convert_exception(e.java_exception)
            if not isinstance(converted, UnknownException):
                # Hide where the exception came from that shows a non-Pythonic
                # JVM exception message.
>               raise converted from None
E               pyspark.errors.exceptions.captured.AnalysisException: [PATH_ALREADY_EXISTS] Path file:/private/var/folders/ky/_zw0jkyd7hl3f9hfy6c994zw0000gn/T/pytest-of-naty/pytest-27/test_table_to_parquet_dir_pysp0 already exists. Set mode as "overwrite" to overwrite the existing path.

/Users/naty/mambaforge/envs/ibis-dev/lib/python3.11/site-packages/pyspark/errors/exceptions/captured.py:185: AnalysisException

I tried adding .mode("overwrite") to the line df = df.write.format(format), which bypasses the error, although I don't know if that's the desired behavior. The test still fails anyway because of assertion errors. I'll push this to the original PR and we can discuss it there, since the test failure will show the problems.
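
One possible alternative to overwrite mode, sketched under assumptions: pytest's tmp_path fixture is a directory that already exists, and the error above shows Spark refusing to write to an existing path, so the test could target a child directory that has not been created yet. The helper name fresh_output_dir and the child name "out" are hypothetical, and tempfile.TemporaryDirectory only stands in for tmp_path here.

```python
import tempfile
from pathlib import Path


def fresh_output_dir(base, name="out"):
    """Return a child path of `base` that does not exist yet (hypothetical helper)."""
    target = Path(base) / name
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    return target


with tempfile.TemporaryDirectory() as tmp:  # stands in for pytest's tmp_path
    outparquet_dir = fresh_output_dir(tmp)
    base_exists = Path(tmp).exists()
    target_exists = outparquet_dir.exists()
    # In the test, this would be followed by:
    # awards_players.to_parquet_dir(outparquet_dir)
```

This leaves the writer's save mode untouched, so the question of whether to_parquet_dir should overwrite by default stays separate from getting the test to run.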

@cpcloud
Member

cpcloud commented Aug 8, 2024

I think this was fixed in @ncclementi's PR, closing!

@cpcloud cpcloud closed this Aug 8, 2024
3 participants