
[Python] ImportError calling pyarrow from_pandas within ThreadPool #27815


Description


From dask/dask#7334

The referenced dask issue reports an ImportError on Python 3.9 (and I can reproduce it). As far as I understand how dask works, it is basically calling pa.Table.from_pandas within a ThreadPool, and inside from_pandas we open a with futures.ThreadPoolExecutor block, which then fails with this error:


>>> df2.to_parquet('test99.parquet', engine='pyarrow-dataset')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/dataframe/core.py", line 4127, in to_parquet
    return to_parquet(self, path, *args, **kwargs)
  File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/dataframe/io/parquet/core.py", line 671, in to_parquet
    out = out.compute(**compute_kwargs)
  File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/base.py", line 283, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/base.py", line 565, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/threaded.py", line 76, in get
    results = get_async(
  File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/local.py", line 487, in get_async
    raise_exception(exc, tb)
  File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/local.py", line 317, in reraise
    raise exc
  File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/local.py", line 222, in execute_task
    result = _execute_task(task, data)
  File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/core.py", line 121, in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
  File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/utils.py", line 35, in apply
    return func(*args, **kwargs)
  File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py", line 841, in write_partition
    t = cls._pandas_to_arrow_table(df, preserve_index=preserve_index, schema=schema)
  File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/dask/dataframe/io/parquet/arrow.py", line 814, in _pandas_to_arrow_table
    table = pa.Table.from_pandas(df, preserve_index=preserve_index, schema=schema)
  File "pyarrow/table.pxi", line 1479, in pyarrow.lib.Table.from_pandas
  File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 596, in dataframe_to_arrays
    with futures.ThreadPoolExecutor(nthreads) as executor:
  File "/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/concurrent/futures/__init__.py", line 49, in __getattr__
    from .thread import ThreadPoolExecutor as te
ImportError: cannot import name 'ThreadPoolExecutor' from partially initialized module 'concurrent.futures.thread' (most likely due to a circular import) (/home/joris/miniconda3/envs/test-dask-pyarrow-bug/lib/python3.9/concurrent/futures/thread.py)
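For reference, here is a hypothetical standalone sketch of the pattern that hits this (my construction, not taken from the dask report): call pa.Table.from_pandas with nthreads > 1 from worker threads of a multiprocessing.pool.ThreadPool (what dask's threaded scheduler uses), so that the inline "from concurrent import futures" inside dataframe_to_arrays executes on a worker thread. Since the failure is a timing-dependent import race, this may not reproduce deterministically:

# Hypothetical reproduction sketch, mimicking dask's threaded scheduler.
# Using multiprocessing.pool.ThreadPool here (rather than
# concurrent.futures) avoids importing concurrent.futures.thread on the
# main thread first, keeping the lazy import inside the workers.
from multiprocessing.pool import ThreadPool

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": range(1000), "b": range(1000)})

def convert(_):
    # nthreads > 1 sends from_pandas down the path that does
    # `from concurrent import futures` inline, inside this worker thread
    return pa.Table.from_pandas(df, nthreads=2)

with ThreadPool(8) as pool:
    tables = pool.map(convert, range(32))
print(tables[0].num_rows)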

We can probably avoid that by moving the import to the top level of the module (rather than doing it inline inside dataframe_to_arrays).
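As a hedged sketch of that fix pattern (a simplified stand-in, not the actual pyarrow code), hoisting the import to module scope means concurrent.futures.thread is fully initialized once, at import time, on the main thread:

# Sketch of the proposed fix pattern: hoist the import to module scope.
from concurrent import futures  # previously imported inline, inside the function


def dataframe_to_arrays_sketch(columns, nthreads):
    # Simplified stand-in for pyarrow.pandas_compat.dataframe_to_arrays;
    # the real function converts pandas columns to Arrow arrays.
    def convert(col):
        return list(col)  # placeholder for the real conversion

    if nthreads == 1:
        return [convert(col) for col in columns]
    # The inline `from concurrent import futures` used to live right
    # here, which is what raced when this ran in a dask worker thread.
    with futures.ThreadPoolExecutor(nthreads) as executor:
        return list(executor.map(convert, columns))


print(dataframe_to_arrays_sketch([range(3), "abc"], nthreads=2))

Because the import then runs before any worker threads exist, a partially initialized concurrent.futures.thread module can no longer be observed.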

Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Joris Van den Bossche / @jorisvandenbossche

PRs and other links:

Note: This issue was originally created as ARROW-11983. Please see the migration documentation for further details.
