Skip to content

Commit

Permalink
[branch-0.8] Merge from main (#201)
Browse files Browse the repository at this point in the history
* changelog (#188)

* Add Python wrapper for LogicalPlan::Sort (#196)

* Add Python wrapper for LogicalPlan::Aggregate (#195)

* Add Python wrapper for LogicalPlan::Limit (#193)

* Add Python wrapper for LogicalPlan::Filter (#192)

* Add Python wrapper for LogicalPlan::Filter

* clippy

* clippy

* Update src/expr/filter.rs

Co-authored-by: Liang-Chi Hsieh <[email protected]>

---------

Co-authored-by: Liang-Chi Hsieh <[email protected]>

* Add tests for recently added functionality (#199)

* Add experimental support for executing SQL with Polars and Pandas (#190)

* Run `maturin develop` instead of `cargo build` in verification script (#200)

* Implement `to_pandas()` (#197)

* Implement to_pandas()

* Update documentation

* Write unit test

* Add support for cudf as a physical execution engine (#205)

* Update README in preparation for 0.8 release (#206)

* Analyze table bindings (#204)

* method for getting the internal LogicalPlan instance

* Add explain plan method

* Add bindings for analyze table

* Add to_variant

* cargo fmt

* blake and flake formatting

* changelog (#209)

---------

Co-authored-by: Liang-Chi Hsieh <[email protected]>
Co-authored-by: Dejan Simic <[email protected]>
Co-authored-by: Jeremy Dyer <[email protected]>
  • Loading branch information
4 people authored Feb 22, 2023
1 parent d7471cd commit 12347d9
Show file tree
Hide file tree
Showing 35 changed files with 1,484 additions and 151 deletions.
94 changes: 63 additions & 31 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,50 +19,36 @@

# Changelog

## [0.8.0](https://github.com/apache/arrow-datafusion-python/tree/0.8.0) (2023-02-17)
## [0.8.0](https://github.com/apache/arrow-datafusion-python/tree/0.8.0) (2023-02-22)

[Full Changelog](https://github.com/apache/arrow-datafusion-python/compare/0.7.0...0.8.0)
[Full Changelog](https://github.com/apache/arrow-datafusion-python/compare/0.8.0-rc1...0.8.0)

**Implemented enhancements:**

- Add bindings for datafusion\_common::DFField [\#184](https://github.com/apache/arrow-datafusion-python/issues/184)
- Add bindings for DFSchema/DFSchemaRef [\#181](https://github.com/apache/arrow-datafusion-python/issues/181)
- Add bindings for datafusion\_expr Projection [\#179](https://github.com/apache/arrow-datafusion-python/issues/179)
- Add bindings for `TableScan` struct from `datafusion_expr::TableScan` [\#177](https://github.com/apache/arrow-datafusion-python/issues/177)
- Add a "mapping" struct for types [\#172](https://github.com/apache/arrow-datafusion-python/issues/172)
- Improve string representation of datafusion classes \(dataframe, context, expression, ...\) [\#158](https://github.com/apache/arrow-datafusion-python/issues/158)
- Add DataFrame count method [\#151](https://github.com/apache/arrow-datafusion-python/issues/151)
- \[REQUEST\] Github Actions Improvements [\#146](https://github.com/apache/arrow-datafusion-python/issues/146)
- Change default branch name from master to main [\#144](https://github.com/apache/arrow-datafusion-python/issues/144)
- Bump pyo3 to 0.18.0 [\#140](https://github.com/apache/arrow-datafusion-python/issues/140)
- Add script for Python linting [\#134](https://github.com/apache/arrow-datafusion-python/issues/134)
- Add Python bindings for substrait module [\#132](https://github.com/apache/arrow-datafusion-python/issues/132)
- Expand unit tests for built-in functions [\#128](https://github.com/apache/arrow-datafusion-python/issues/128)
- support creating arrow-datafusion-python conda environment [\#122](https://github.com/apache/arrow-datafusion-python/issues/122)
- Build Python source distribution in GitHub workflow [\#81](https://github.com/apache/arrow-datafusion-python/issues/81)
- EPIC: Add all functions to python binding `functions` [\#72](https://github.com/apache/arrow-datafusion-python/issues/72)
- Add support for cuDF physical execution engine [\#202](https://github.com/apache/arrow-datafusion-python/issues/202)
- Make it easier to create a Pandas dataframe from DataFusion query results [\#139](https://github.com/apache/arrow-datafusion-python/issues/139)

**Fixed bugs:**

- Build is broken [\#161](https://github.com/apache/arrow-datafusion-python/issues/161)
- Out of memory when sorting [\#157](https://github.com/apache/arrow-datafusion-python/issues/157)
- window\_lead test appears to be non-deterministic [\#135](https://github.com/apache/arrow-datafusion-python/issues/135)
- Reading csv does not work [\#130](https://github.com/apache/arrow-datafusion-python/issues/130)
- Github actions produce a lot of warnings [\#94](https://github.com/apache/arrow-datafusion-python/issues/94)
- ASF source release tarball has wrong directory name [\#90](https://github.com/apache/arrow-datafusion-python/issues/90)
- Python Release Build failing after upgrading to maturin 14.2 [\#87](https://github.com/apache/arrow-datafusion-python/issues/87)
- Maturin build hangs on Linux ARM64 [\#84](https://github.com/apache/arrow-datafusion-python/issues/84)
- Cannot install on Mac M1 from source tarball from testpypi [\#82](https://github.com/apache/arrow-datafusion-python/issues/82)
- ImportPathMismatchError when running pytest locally [\#77](https://github.com/apache/arrow-datafusion-python/issues/77)
- Build error: could not compile `thiserror` due to 2 previous errors [\#69](https://github.com/apache/arrow-datafusion-python/issues/69)

**Closed issues:**

- Publish documentation for Python bindings [\#39](https://github.com/apache/arrow-datafusion-python/issues/39)
- Add Python binding for `approx_median` [\#32](https://github.com/apache/arrow-datafusion-python/issues/32)
- Release version 0.7.0 [\#7](https://github.com/apache/arrow-datafusion-python/issues/7)
- Integrate with the new `object_store` crate [\#22](https://github.com/apache/arrow-datafusion-python/issues/22)

**Merged pull requests:**

- Update README in preparation for 0.8 release [\#206](https://github.com/apache/arrow-datafusion-python/pull/206) ([andygrove](https://github.com/andygrove))
- Add support for cudf as a physical execution engine [\#205](https://github.com/apache/arrow-datafusion-python/pull/205) ([jdye64](https://github.com/jdye64))
- Run `maturin develop` instead of `cargo build` in verification script [\#200](https://github.com/apache/arrow-datafusion-python/pull/200) ([andygrove](https://github.com/andygrove))
- Add tests for recently added functionality [\#199](https://github.com/apache/arrow-datafusion-python/pull/199) ([andygrove](https://github.com/andygrove))
- Implement `to_pandas()` [\#197](https://github.com/apache/arrow-datafusion-python/pull/197) ([simicd](https://github.com/simicd))
- Add Python wrapper for LogicalPlan::Sort [\#196](https://github.com/apache/arrow-datafusion-python/pull/196) ([andygrove](https://github.com/andygrove))
- Add Python wrapper for LogicalPlan::Aggregate [\#195](https://github.com/apache/arrow-datafusion-python/pull/195) ([andygrove](https://github.com/andygrove))
- Add Python wrapper for LogicalPlan::Limit [\#193](https://github.com/apache/arrow-datafusion-python/pull/193) ([andygrove](https://github.com/andygrove))
- Add Python wrapper for LogicalPlan::Filter [\#192](https://github.com/apache/arrow-datafusion-python/pull/192) ([andygrove](https://github.com/andygrove))
- Add experimental support for executing SQL with Polars and Pandas [\#190](https://github.com/apache/arrow-datafusion-python/pull/190) ([andygrove](https://github.com/andygrove))
- Update changelog for 0.8 release [\#188](https://github.com/apache/arrow-datafusion-python/pull/188) ([andygrove](https://github.com/andygrove))
- Add ability to execute ExecutionPlan and get a stream of RecordBatch [\#186](https://github.com/apache/arrow-datafusion-python/pull/186) ([andygrove](https://github.com/andygrove))
- Dffield bindings [\#185](https://github.com/apache/arrow-datafusion-python/pull/185) ([jdye64](https://github.com/jdye64))
- Add bindings for DFSchema [\#183](https://github.com/apache/arrow-datafusion-python/pull/183) ([jdye64](https://github.com/jdye64))
Expand Down Expand Up @@ -118,6 +104,52 @@
- Update release instructions [\#83](https://github.com/apache/arrow-datafusion-python/pull/83) ([andygrove](https://github.com/andygrove))
- \[Functions\] - Add python function binding to `functions` [\#73](https://github.com/apache/arrow-datafusion-python/pull/73) ([francis-du](https://github.com/francis-du))

## [0.8.0-rc1](https://github.com/apache/arrow-datafusion-python/tree/0.8.0-rc1) (2023-02-17)

[Full Changelog](https://github.com/apache/arrow-datafusion-python/compare/0.7.0-rc2...0.8.0-rc1)

**Implemented enhancements:**

- Add bindings for datafusion\_common::DFField [\#184](https://github.com/apache/arrow-datafusion-python/issues/184)
- Add bindings for DFSchema/DFSchemaRef [\#181](https://github.com/apache/arrow-datafusion-python/issues/181)
- Add bindings for datafusion\_expr Projection [\#179](https://github.com/apache/arrow-datafusion-python/issues/179)
- Add bindings for `TableScan` struct from `datafusion_expr::TableScan` [\#177](https://github.com/apache/arrow-datafusion-python/issues/177)
- Add a "mapping" struct for types [\#172](https://github.com/apache/arrow-datafusion-python/issues/172)
- Improve string representation of datafusion classes \(dataframe, context, expression, ...\) [\#158](https://github.com/apache/arrow-datafusion-python/issues/158)
- Add DataFrame count method [\#151](https://github.com/apache/arrow-datafusion-python/issues/151)
- \[REQUEST\] Github Actions Improvements [\#146](https://github.com/apache/arrow-datafusion-python/issues/146)
- Change default branch name from master to main [\#144](https://github.com/apache/arrow-datafusion-python/issues/144)
- Bump pyo3 to 0.18.0 [\#140](https://github.com/apache/arrow-datafusion-python/issues/140)
- Add script for Python linting [\#134](https://github.com/apache/arrow-datafusion-python/issues/134)
- Add Python bindings for substrait module [\#132](https://github.com/apache/arrow-datafusion-python/issues/132)
- Expand unit tests for built-in functions [\#128](https://github.com/apache/arrow-datafusion-python/issues/128)
- support creating arrow-datafusion-python conda environment [\#122](https://github.com/apache/arrow-datafusion-python/issues/122)
- Build Python source distribution in GitHub workflow [\#81](https://github.com/apache/arrow-datafusion-python/issues/81)
- EPIC: Add all functions to python binding `functions` [\#72](https://github.com/apache/arrow-datafusion-python/issues/72)

**Fixed bugs:**

- Build is broken [\#161](https://github.com/apache/arrow-datafusion-python/issues/161)
- Out of memory when sorting [\#157](https://github.com/apache/arrow-datafusion-python/issues/157)
- window\_lead test appears to be non-deterministic [\#135](https://github.com/apache/arrow-datafusion-python/issues/135)
- Reading csv does not work [\#130](https://github.com/apache/arrow-datafusion-python/issues/130)
- Github actions produce a lot of warnings [\#94](https://github.com/apache/arrow-datafusion-python/issues/94)
- ASF source release tarball has wrong directory name [\#90](https://github.com/apache/arrow-datafusion-python/issues/90)
- Python Release Build failing after upgrading to maturin 14.2 [\#87](https://github.com/apache/arrow-datafusion-python/issues/87)
- Maturin build hangs on Linux ARM64 [\#84](https://github.com/apache/arrow-datafusion-python/issues/84)
- Cannot install on Mac M1 from source tarball from testpypi [\#82](https://github.com/apache/arrow-datafusion-python/issues/82)
- ImportPathMismatchError when running pytest locally [\#77](https://github.com/apache/arrow-datafusion-python/issues/77)

**Closed issues:**

- Publish documentation for Python bindings [\#39](https://github.com/apache/arrow-datafusion-python/issues/39)
- Add Python binding for `approx_median` [\#32](https://github.com/apache/arrow-datafusion-python/issues/32)
- Release version 0.7.0 [\#7](https://github.com/apache/arrow-datafusion-python/issues/7)

## [0.7.0-rc2](https://github.com/apache/arrow-datafusion-python/tree/0.7.0-rc2) (2022-11-26)

[Full Changelog](https://github.com/apache/arrow-datafusion-python/compare/0.7.0...0.7.0-rc2)


## [Unreleased](https://github.com/datafusion-contrib/datafusion-python/tree/HEAD)

Expand Down
20 changes: 10 additions & 10 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

85 changes: 37 additions & 48 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,19 +24,30 @@

This is a Python library that binds to [Apache Arrow](https://arrow.apache.org/) in-memory query engine [DataFusion](https://github.com/apache/arrow-datafusion).

Like pyspark, it allows you to build a plan through SQL or a DataFrame API against in-memory data, parquet or CSV
files, run it in a multi-threaded environment, and obtain the result back in Python.
DataFusion's Python bindings can be used as an end-user tool as well as providing a foundation for building new systems.

It also allows you to use UDFs and UDAFs for complex operations.
## Features

The major advantage of this library over other execution engines is that this library achieves zero-copy between
Python and its execution engine: there is no cost in using UDFs, UDAFs, and collecting the results to Python apart
from having to lock the GIL when running those operations.
- Execute queries using SQL or DataFrames against CSV, Parquet, and JSON data sources
- Queries are optimized using DataFusion's query optimizer
- Execute user-defined Python code from SQL
- Exchange data with Pandas and other DataFrame libraries that support PyArrow
- Serialize and deserialize query plans in Substrait format
- Experimental support for executing SQL queries against Polars, Pandas and cuDF

Its query engine, DataFusion, is written in [Rust](https://www.rust-lang.org/), which makes strong assumptions
about thread safety and lack of memory leaks.
## Comparison with other projects

Technically, zero-copy is achieved via the [c data interface](https://arrow.apache.org/docs/format/CDataInterface.html).
Here is a comparison with similar projects that may help understand when DataFusion might be suitable and unsuitable
for your needs:

- [DuckDB](http://www.duckdb.org/) is an open source, in-process analytic database. Like DataFusion, it supports
very fast execution, both from its custom file format and directly from Parquet files. Unlike DataFusion, it is
written in C/C++ and it is primarily used directly by users as a serverless database and query system rather than
as a library for building such database systems.

- [Polars](http://pola.rs/) is one of the fastest DataFrame libraries at the time of writing. Like DataFusion, it
is also written in Rust and uses the Apache Arrow memory model, but unlike DataFusion it does not provide full SQL
support, nor as many extension points.

## Example Usage

Expand All @@ -47,12 +58,8 @@ The Parquet file used in this example can be downloaded from the following page:

- https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

See the [examples](examples) directory for more examples.

```python
from datafusion import SessionContext
import pandas as pd
import pyarrow as pa

# Create a DataFusion context
ctx = SessionContext()
Expand All @@ -67,60 +74,42 @@ df = ctx.sql("select passenger_count, count(*) "
"group by passenger_count "
"order by passenger_count")

# collect as list of pyarrow.RecordBatch
results = df.collect()

# get first batch
batch = results[0]

# convert to Pandas
df = batch.to_pandas()
pandas_df = df.to_pandas()

# create a chart
fig = df.plot(kind="bar", title="Trip Count by Number of Passengers").get_figure()
fig = pandas_df.plot(kind="bar", title="Trip Count by Number of Passengers").get_figure()
fig.savefig('chart.png')
```

This produces the following chart:

![Chart](examples/chart.png)

## Substrait Support
## More Examples

`arrow-datafusion-python` has bindings which allow for serializing a SQL query to substrait protobuf format and deserializing substrait protobuf bytes to a DataFusion `LogicalPlan`, `PyLogicalPlan` in a Python context, which can then be executed.
See [examples](examples/README.md) for more information.

### Example of Serializing/Deserializing Substrait Plans
### Executing Queries with DataFusion

```python
from datafusion import SessionContext
from datafusion import substrait as ss
- [Query a Parquet file using SQL](./examples/sql-parquet.py)
- [Query a Parquet file using the DataFrame API](./examples/dataframe-parquet.py)
- [Run a SQL query and store the results in a Pandas DataFrame](./examples/sql-to-pandas.py)
- [Query PyArrow Data](./examples/query-pyarrow-data.py)

# Create a DataFusion context
ctx = SessionContext()

# Register table with context
ctx.register_parquet('aggregate_test_data', './testing/data/csv/aggregate_test_100.csv')

substrait_plan = ss.substrait.serde.serialize_to_plan("SELECT * FROM aggregate_test_data", ctx)
# type(substrait_plan) -> <class 'datafusion.substrait.plan'>
### Running User-Defined Python Code

# Alternative serialization approaches
# type(substrait_bytes) -> <class 'list'>, at this point the bytes can be distributed to file, network, etc safely
# where they could subsequently be deserialized on the receiving end.
substrait_bytes = ss.substrait.serde.serialize_bytes("SELECT * FROM aggregate_test_data", ctx)
- [Register a Python UDF with DataFusion](./examples/python-udf.py)
- [Register a Python UDAF with DataFusion](./examples/python-udaf.py)

# Imagine here bytes would be read from network, file, etc ... for example brevity this is omitted and variable is simply reused
# type(substrait_plan) -> <class 'datafusion.substrait.plan'>
substrait_plan = ss.substrait.serde.deserialize_bytes(substrait_bytes)
### Substrait Support

# type(df_logical_plan) -> <class 'substrait.LogicalPlan'>
df_logical_plan = ss.substrait.consumer.from_substrait_plan(ctx, substrait_plan)
- [Serialize query plans using Substrait](./examples/substrait.py)

# Back to Substrait Plan just for demonstration purposes
# type(substrait_plan) -> <class 'datafusion.substrait.plan'>
substrait_plan = ss.substrait.producer.to_substrait_plan(df_logical_plan)
### Executing SQL against DataFrame Libraries (Experimental)

```
- [Executing SQL on Polars](./examples/sql-on-polars.py)
- [Executing SQL on Pandas](./examples/sql-on-pandas.py)

## How to install (from pip)

Expand Down
5 changes: 4 additions & 1 deletion conda/environments/datafusion-dev.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@ dependencies:
- pytest
- toml
- importlib_metadata
- python>=3.7,<3.11
- python>=3.10
# Packages useful for building distributions and releasing
- mamba
- conda-build
Expand All @@ -38,4 +38,7 @@ dependencies:
- pydata-sphinx-theme==0.8.0
- myst-parser
- jinja2
# GPU packages
- cudf
- cudatoolkit=11.8
name: datafusion-dev
8 changes: 8 additions & 0 deletions datafusion/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -41,8 +41,12 @@
)

from .expr import (
Analyze,
Expr,
Filter,
Limit,
Projection,
Sort,
TableScan,
)

Expand All @@ -63,6 +67,10 @@
"Projection",
"DFSchema",
"DFField",
"Analyze",
"Sort",
"Limit",
"Filter",
]


Expand Down
Loading

0 comments on commit 12347d9

Please sign in to comment.