Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build(deps): update Arrow/Parquet to 52.0, object-store to 0.10 #10765

Merged
merged 19 commits into from
Jun 7, 2024

Conversation

waynexia
Copy link
Member

@waynexia waynexia commented Jun 2, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added optimizer Optimizer rules core Core DataFusion crate labels Jun 3, 2024
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jun 4, 2024
Signed-off-by: Ruihang Xia <[email protected]>
Signed-off-by: Ruihang Xia <[email protected]>
@github-actions github-actions bot added the sql SQL Planner label Jun 5, 2024
@waynexia
Copy link
Member Author

waynexia commented Jun 5, 2024

Okay, the one last thing is to update Cargo.toml after arrow 52.0 is published. This PR itself is ready for review.

@waynexia waynexia marked this pull request as ready for review June 5, 2024 09:30
@alamb
Copy link
Contributor

alamb commented Jun 5, 2024

Epic!

@andygrove
Copy link
Member

@waynexia It looks like you need to run cargo update in datafusion-cli to fix CI

common::collect(merged_aggregate.execute(0, task_ctx.clone())?).await?;
// enlarge memory limit in spill mode
let task_ctx = if spill {
new_spill_ctx(2, 2600)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be configurable? Also could you add a comment explaining what is happening here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya could you also review?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change has two stages to my understanding. The first partial aggr stage has a smaller limit, to meet the expectation that the partial plan has early emit.

let expected = if spill {
vec![
"+---+---------------+-------------+",
"| a | AVG(b)[count] | AVG(b)[sum] |",
"+---+---------------+-------------+",
"| 2 | 1 | 1.0 |",
"| 2 | 1 | 1.0 |",
"| 3 | 1 | 2.0 |",
"| 3 | 2 | 5.0 |",
"| 4 | 3 | 11.0 |",
"+---+---------------+-------------+",
]
} else {

And the new lines enlarge the limit after that "early emit" check. Otherwise the merge aggr would fall because of insufficient memory. (at line 1554)

let result = common::collect(merged_aggregate.execute(0, task_ctx)?).await?;

The root cause of this change is somehow the memory requirement becomes larger. From my investigation, the biggest change compared to the previous is from the RawTable in GroupValuesPrimitive -- it needs about 1000 bytes more:

pub struct GroupValuesPrimitive<T: ArrowPrimitiveType> {
/// The data type of the output array
data_type: DataType,
/// Stores the group index based on the hash of its value
///
/// We don't store the hashes as hashing fixed width primitives
/// is fast enough for this not to benefit performance
map: RawTable<usize>,
/// The group index of the null value if any
null_group: Option<usize>,
/// The values for each group index
values: Vec<T::Native>,
/// The random state used to generate hashes
random_state: RandomState,
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't realized that these literal values were just in tests ... thanks for the explanation

@@ -2032,7 +2037,7 @@ mod tests {
spill: bool,
) -> Result<()> {
let task_ctx = if spill {
new_spill_ctx(2, 3200)
new_spill_ctx(2, 4200)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, seems like we need to make something configurable here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be misunderstanding, do you mean the change in 3a33755 🤔 ?

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but I would like to understand the aggregate spilling changes more.

Thanks @waynexia

@alamb
Copy link
Contributor

alamb commented Jun 6, 2024

I merged up from main and updated the lock file

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @waynexia -- I agree with @andygrove 's concern about the changes to spilling -- I think we should address them before merging but otherwise this looks really nice to me

🙏

datafusion/optimizer/src/analyzer/type_coercion.rs Outdated Show resolved Hide resolved
@andygrove andygrove merged commit cb9068c into apache:main Jun 7, 2024
25 checks passed
@waynexia waynexia deleted the prepare-arrow-update branch June 7, 2024 13:18
@waynexia
Copy link
Member Author

waynexia commented Jun 7, 2024

Thanks for reviewing!

@abhiaagarwal
Copy link

@waynexia I was inspecting the dependency tree for delta-rs and it seems the dependency didn't get bumped for object-store in proto-common fyi, so 0.9.1 gets pulled.

object_store v0.9.1
└── datafusion-proto-common v39.0.0 (https://github.com/apache/datafusion?tag=39.0.0-rc1#6a4a280e)
    └── datafusion-proto v39.0.0 (https://github.com/apache/datafusion?tag=39.0.0-rc1#6a4a280e)
        └── deltalake-core v0.18.0 (/Users/abhiagarwal/Developer/delta-rs/crates/core)
            ├── deltalake v0.18.0 (/Users/abhiagarwal/Developer/delta-rs/crates/deltalake)
            │   └── deltalake-python v0.18.0 (/Users/abhiagarwal/Developer/delta-rs/python)
            ├── deltalake-aws v0.1.2 (/Users/abhiagarwal/Developer/delta-rs/crates/aws)
            │   └── deltalake v0.18.0 (/Users/abhiagarwal/Developer/delta-rs/crates/deltalake) (*)
            ├── deltalake-azure v0.1.2 (/Users/abhiagarwal/Developer/delta-rs/crates/azure)
            │   └── deltalake v0.18.0 (/Users/abhiagarwal/Developer/delta-rs/crates/deltalake) (*)
            ├── deltalake-catalog-glue v0.1.0 (/Users/abhiagarwal/Developer/delta-rs/crates/catalog-glue)
            │   └── deltalake v0.18.0 (/Users/abhiagarwal/Developer/delta-rs/crates/deltalake) (*)
            ├── deltalake-gcp v0.2.1 (/Users/abhiagarwal/Developer/delta-rs/crates/gcp)
            │   └── deltalake v0.18.0 (/Users/abhiagarwal/Developer/delta-rs/crates/deltalake) (*)
            └── deltalake-mount v0.1.0 (/Users/abhiagarwal/Developer/delta-rs/crates/mount)
                └── deltalake-python v0.18.0 (/Users/abhiagarwal/Developer/delta-rs/python)

@waynexia
Copy link
Member Author

Sorry for causing this! Opened #10848 to fix it.

@abhiaagarwal
Copy link

All good, just wanted to bring it to your attention before 39 formally drops :)

@alamb
Copy link
Contributor

alamb commented Jun 10, 2024

All good, just wanted to bring it to your attention before 39 formally drops :)

FYI @andygrove -- maybe this means we should make another RC for 39.0.0 🤔

We could also potentially make a 39.1.0 release too with this dependency upgrade fix #10848

findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…pache#10765)

* fix compile on default feature config

Signed-off-by: Ruihang Xia <[email protected]>

* fix test of common, functions, optimizer and physical-expr

Signed-off-by: Ruihang Xia <[email protected]>

* fix other tests

Signed-off-by: Ruihang Xia <[email protected]>

* fix one last test

Signed-off-by: Ruihang Xia <[email protected]>

* fix clippy warnings

Signed-off-by: Ruihang Xia <[email protected]>

* fix datafusion-cli

Signed-off-by: Ruihang Xia <[email protected]>

* switch to git deps

Signed-off-by: Ruihang Xia <[email protected]>

* regen proto file

Signed-off-by: Ruihang Xia <[email protected]>

* fix pyo3 feature

Signed-off-by: Ruihang Xia <[email protected]>

* fix slt

Signed-off-by: Ruihang Xia <[email protected]>

* fix symmetric hash join cases

Signed-off-by: Ruihang Xia <[email protected]>

* update integration result

Signed-off-by: Ruihang Xia <[email protected]>

* fix up spill test

Signed-off-by: Ruihang Xia <[email protected]>

* shift to the released packages

Signed-off-by: Ruihang Xia <[email protected]>

* Update cargo.lock

* Update datafusion/optimizer/src/analyzer/type_coercion.rs

Co-authored-by: Andrew Lamb <[email protected]>

* update document

Signed-off-by: Ruihang Xia <[email protected]>

* move memory limit to parameter pos

Signed-off-by: Ruihang Xia <[email protected]>

---------

Signed-off-by: Ruihang Xia <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt) substrait
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants