Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make it easier to create a Pandas dataframe from DataFusion query results #139

Closed
andygrove opened this issue Jan 19, 2023 · 5 comments · Fixed by #197
Closed

Make it easier to create a Pandas dataframe from DataFusion query results #139

andygrove opened this issue Jan 19, 2023 · 5 comments · Fixed by #197
Labels
enhancement New feature or request

Comments

@andygrove
Copy link
Member

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
DataFrame.collect returns a list of PyArrow record batches. Each batch can be turned into a Pandas datraframe but I do not know how to create a Pandas dataframe that contains data from all of the batches in an efficient way.

Describe the solution you'd like
Either an example for this, or new features to help with this. Perhaps a DataFrame.collect_single_batch could work.

Describe alternatives you've considered
None

Additional context
None

@andygrove andygrove added the enhancement New feature or request label Jan 19, 2023
@krzysztof-kwitt
Copy link

krzysztof-kwitt commented Feb 17, 2023

What do you think about following snippet?

df = Table.from_batches(batches).to_pandas()

batches can be sequence or iteratorr of RecordBatch.

@simicd
Copy link
Contributor

simicd commented Feb 19, 2023

Thanks for sharing @krzysztof-kwitt! I think that works - I tried to implement a to_pandas() method on the datafusion dataframe that collects recordbatches and turns them into a single pandas dataframe - see #197. Is that what you had in mind @andygrove?

Example:

batch_1 = pa.RecordBatch.from_arrays(
    [pa.array([0.1, -0.7, 0.55])], names=["value"]
)
batch_2 = pa.RecordBatch.from_arrays(
    [pa.array([0.5, -0.6, 0.8])], names=["value"]
)
df = ctx.create_dataframe([[batch_1, batch_2]])

Result:
image

@simicd
Copy link
Contributor

simicd commented Feb 19, 2023

I saw in the documentation that pyarrow Table has a few methods that might be helpful when constructing dataframes or getting out the results:

Do you think it would be useful to implement those methods as well?

@krzysztof-kwitt
Copy link

krzysztof-kwitt commented Feb 20, 2023

I don't think we need to use other ArrowTable.from_* methods, but I would consider adding polars support too, but this is a question for project maintainers, is it valuable for them.

This is how it has been implemented in DuckDB

PolarsDataFrame DuckDBPyRelation::ToPolars(idx_t batch_size) {
	auto arrow = ToArrowTable(batch_size);
	return py::cast<PolarsDataFrame>(pybind11::module_::import("polars").attr("DataFrame")(arrow));
}

https://github.com/duckdb/duckdb/pull/6181/files
Here is guide about Polars support in DuckDB: https://duckdb.org/docs/guides/python/polars
Then we should also update the DataFusion with DuckDB guide.
https://duckdb.org/docs/guides/python/datafusion

@simicd
Copy link
Contributor

simicd commented Feb 25, 2023

@krzysztof-kwitt thanks for the suggestion, I've created a PR with additional export functions that would among others simplify export to polars DataFrames, see #236.

You can convert datafusion dataframes to polars like this:

polars_df = df.to_polars()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants