
Fix compute_record_batch_statistics wrong with projection #8489

Merged (40 commits) on Dec 16, 2023

Conversation

Asura7969 (Contributor):

Which issue does this PR close?

Closes #8406.

Rationale for this change

The projected size is not calculated correctly.

What changes are included in this PR?

Sum only the projected columns instead of all columns.

Are these changes tested?

test_compute_record_batch_statistics

Are there any user-facing changes?

# Conflicts:
#	datafusion/physical-plan/src/joins/hash_join_utils.rs
        .iter()
        .flatten()
        .map(|b| {
            b.columns()
Asura7969 (Contributor, author):

The RecordBatch::project method is not used here because it clones internally, so there is no need to generate a new RecordBatch.

      num_rows: Precision::Exact(3),
    - total_byte_size: Precision::Exact(464), // this might change a bit if the way we compute the size changes
    + total_byte_size: Precision::Exact(byte_size),
Asura7969 (Contributor, author):

I'm not sure whether this is appropriate; if you have any suggestions, please leave a comment.

Contributor:

I think this is ok and a nice way to make the code less brittle to future changes in arrow's layout

Member:

I'm curious as to why the previous code was Precision::Exact(464).

Contributor:

I think it happens to be the (current) size of the record batch in the test:

        let batch = RecordBatch::try_new(
            Arc::clone(&schema),
            vec![
                Arc::new(Float32Array::from(vec![1., 2., 3.])),
                Arc::new(Float64Array::from(vec![9., 8., 7.])),
                Arc::new(UInt64Array::from(vec![4, 5, 6])),
            ],
        )?;

github-actions bot added the sqllogictest (SQL Logic Tests, .slt) label on Dec 11, 2023
alamb (Contributor) left a comment:

This makes sense to me -- thank you @Asura7969 and I apologize for the delay in reviewing. cc @Dandandan


--------------------CoalesceBatchesExec: target_batch_size=8192
----------------------RepartitionExec: partitioning=Hash([col0@0], 4), input_partitions=1
------------------------MemoryExec: partitions=1, partition_sizes=[3]
----------------ProjectionExec: expr=[col0@2 as col0, col1@3 as col1, col2@4 as col2, col0@0 as col0, col1@1 as col1]
Contributor:

It looks like the change is due to the join inputs being reordered; this projection puts the columns back in the expected order.

The same applies to the projection below.

jackwener (Member) left a comment:

Thanks @Asura7969

Comment on lines +148 to +157
    let total_byte_size = batches
        .iter()
        .flatten()
        .map(|b| {
            projection
                .iter()
                .map(|index| b.column(*index).get_array_memory_size())
                .sum::<usize>()
        })
        .sum();
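The logic of the fix can be illustrated with a self-contained sketch that uses only the standard library. Here each batch is modeled as a Vec of per-column byte sizes (a hypothetical stand-in for arrow's RecordBatch and get_array_memory_size, not DataFusion's actual types), which shows that only the projected columns contribute to the total:

```rust
// Minimal sketch of the projected byte-size computation.
// `partitions` models partitions -> batches -> per-column byte sizes;
// `projection` lists the column indices that survive the projection.
fn projected_byte_size(partitions: &[Vec<Vec<usize>>], projection: &[usize]) -> usize {
    partitions
        .iter()
        .flatten() // iterate over every batch in every partition
        .map(|column_sizes| {
            projection
                .iter()
                .map(|&index| column_sizes[index]) // only the projected columns
                .sum::<usize>()
        })
        .sum()
}

fn main() {
    // One partition with one batch of three columns sized 100, 200, and 50 bytes.
    let partitions = vec![vec![vec![100usize, 200, 50]]];
    // Projecting columns 0 and 2 counts 100 + 50 = 150, not the full 350.
    assert_eq!(projected_byte_size(&partitions, &[0, 2]), 150);
    assert_eq!(projected_byte_size(&partitions, &[0, 1, 2]), 350);
}
```

Before the fix, the equivalent computation summed every column of each batch, so a plan that projected a subset of columns still reported the full batch size.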
Member:

👍


alamb merged commit 0fcd077 into apache:main on Dec 16, 2023; 22 checks passed.
alamb (Contributor) commented on Dec 16, 2023:

Thanks again @Asura7969

@Asura7969 Asura7969 deleted the fix_total_byte_size branch December 16, 2023 12:50
Dandandan (Contributor):

Thanks @Asura7969 and @alamb @jackwener for the review
