Make it easier to create a Pandas dataframe from DataFusion query results #139

andygrove · 2023-01-19T14:52:28Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
DataFrame.collect returns a list of PyArrow record batches. Each batch can be turned into a Pandas dataframe, but I do not know how to efficiently create a single Pandas dataframe that contains the data from all of the batches.

Describe the solution you'd like
Either an example for this, or new features to help with this. Perhaps a DataFrame.collect_single_batch could work.

Describe alternatives you've considered
None

Additional context
None


krzysztof-kwitt · 2023-02-17T17:40:48Z

What do you think about the following snippet?

from pyarrow import Table
df = Table.from_batches(batches).to_pandas()

batches can be a sequence or iterator of RecordBatch.

simicd · 2023-02-19T21:23:51Z

Thanks for sharing @krzysztof-kwitt! I think that works. I tried to implement a to_pandas() method on the DataFusion dataframe that collects the record batches and turns them into a single pandas dataframe; see #197. Is that what you had in mind @andygrove?

Example:

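import pyarrow as pa
from datafusion import SessionContext

ctx = SessionContext()
batch_1 = pa.RecordBatch.from_arrays(
    [pa.array([0.1, -0.7, 0.55])], names=["value"]
)
batch_2 = pa.RecordBatch.from_arrays(
    [pa.array([0.5, -0.6, 0.8])], names=["value"]
)
df = ctx.create_dataframe([[batch_1, batch_2]])
# Collect all record batches into a single pandas dataframe (see #197)
pandas_df = df.to_pandas()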

Result:

simicd · 2023-02-19T21:43:17Z

I saw in the documentation that pyarrow Table has a few methods that might be helpful when constructing dataframes or getting out the results:

Do you think it would be useful to implement those methods as well?

krzysztof-kwitt · 2023-02-20T16:23:53Z

I don't think we need the other ArrowTable.from_* methods, but I would consider adding Polars support too. Whether that is valuable is a question for the project maintainers.

This is how it has been implemented in DuckDB:

PolarsDataFrame DuckDBPyRelation::ToPolars(idx_t batch_size) {
	auto arrow = ToArrowTable(batch_size);
	return py::cast<PolarsDataFrame>(pybind11::module_::import("polars").attr("DataFrame")(arrow));
}

https://github.com/duckdb/duckdb/pull/6181/files
Here is the guide on Polars support in DuckDB: https://duckdb.org/docs/guides/python/polars
If we add this, we should also update the "DataFusion with DuckDB" guide:
https://duckdb.org/docs/guides/python/datafusion

simicd · 2023-02-25T14:59:56Z

@krzysztof-kwitt thanks for the suggestion. I've created a PR with additional export functions that would, among other things, simplify export to Polars DataFrames; see #236.

You can convert datafusion dataframes to polars like this:

polars_df = df.to_polars()

andygrove added the enhancement New feature or request label Jan 19, 2023

simicd mentioned this issue Feb 19, 2023

Implement to_pandas() #197

Merged

andygrove closed this as completed in #197 Feb 22, 2023

simicd mentioned this issue Feb 25, 2023

Additional dataframe export functions #235

Closed
