
Using write_parquet with partition_by breaks writing to S3 #20502

Closed
2 tasks done
davidia opened this issue Dec 30, 2024 · 0 comments · Fixed by #20590
Assignees
Labels
A-io-cloud Area: reading/writing to cloud storage accepted Ready for implementation enhancement New feature or an improvement of an existing feature P-medium Priority: medium python Related to Python Polars

Comments


davidia commented Dec 30, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3, 4], "b": ["left", "left", "right", "right"]})

# writes the file to S3 as expected
df.write_parquet("s3://my-bucket/a.parquet")

# instead of writing to S3, this creates a local directory literally named 's3://my-bucket/hive/'
# and writes the partitioned dataset there
df.write_parquet("s3://my-bucket/hive/", partition_by=["b"])

Log output

Auto-selected credential provider: CredentialProviderAWS
try_get_writeable: cloud: s3://my-bucket/test/a.parquet
Async thread count: 4
object store cache key: s3://my-bucket/test/a.parquet S { url_base: "s3://my-bucket", cloud_options: Some(C { max_retries: 2, file_cache_ttl: 3600, config: Some(Aws([])), credential_provider: 140141317067952 }) }
async upload_chunk_size: 67108864
[FetchedCredentialsCache]: Call update_func: current_time = 1735556996, last_fetched_expiry = 0
[FetchedCredentialsCache]: Finish update_func: new expiry = (never expires)

Issue description

Calling write_parquet with an S3 URI writes to a local path instead of S3 when partition_by is specified. It works as expected when partition_by is omitted.

Expected behavior

df.write_parquet('s3://my-bucket/hive/', partition_by=['b']) should write a partitioned dataset to S3.
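Until this is fixed, a possible stopgap (an untested sketch, not a Polars API: `hive_path` and `write_partitioned` are illustrative helper names I'm making up here) is to split the frame with `DataFrame.partition_by` and issue one single-file `write_parquet` per group, since the report above shows single-file writes to S3 do work:

```python
def hive_path(prefix: str, column: str, value) -> str:
    """Build a hive-style partition key, e.g. prefix/b=left/data.parquet."""
    return f"{prefix.rstrip('/')}/{column}={value}/data.parquet"

def write_partitioned(df, prefix: str, column: str) -> None:
    """Write one parquet file per partition value under a hive-style prefix.

    df is a polars DataFrame; partition_by(as_dict=True) keys are tuples
    of the partition values.
    """
    for (value,), part in df.partition_by(column, as_dict=True).items():
        # Each call here is a plain single-file write, which (per the
        # reproducible example above) does reach S3 correctly.
        part.drop(column).write_parquet(hive_path(prefix, column, value))

# Usage (not tested against a real bucket):
# import polars as pl
# df = pl.DataFrame({"a": [1, 2, 3, 4], "b": ["left", "left", "right", "right"]})
# write_partitioned(df, "s3://my-bucket/hive", "b")
```

This loses write_parquet's atomic handling of the whole dataset and writes a fixed file name per partition, so it is only a sketch of the layout the built-in partition_by would produce.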

Installed versions

--------Version info---------
Polars:              1.18.0
Index type:          UInt32
Platform:            Linux-5.15.150.1-microsoft-standard-WSL2-x86_64-with-glibc2.39
Python:              3.12.5 | packaged by conda-forge | (main, Aug  8 2024, 18:36:51) [GCC 12.4.0]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               5.5.0
azure.identity       <not installed>
boto3                1.35.22
cloudpickle          3.0.0
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2024.9.0
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           3.9.2
nest_asyncio         1.6.0
numpy                1.26.4
openpyxl             3.1.5
pandas               2.1.4
pyarrow              15.0.2
pydantic             2.8.2
pyiceberg            <not installed>
sqlalchemy           1.4.49
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@davidia davidia added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Dec 30, 2024
@nameexhaustion nameexhaustion added enhancement New feature or an improvement of an existing feature P-medium Priority: medium A-io-cloud Area: reading/writing to cloud storage and removed bug Something isn't working needs triage Awaiting prioritization by a maintainer labels Jan 7, 2025
@github-project-automation github-project-automation bot moved this to Ready in Backlog Jan 7, 2025
@github-project-automation github-project-automation bot moved this from Ready to Done in Backlog Jan 8, 2025
@c-peters c-peters added the accepted Ready for implementation label Jan 13, 2025