Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add string aggregate grouping fuzz test, add MemTable::with_sort_exprs #9190

Merged
merged 1 commit into from
Feb 12, 2024

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Feb 10, 2024

Which issue does this PR close?

Part of #7064

Rationale for this change

I want fuzz test coverage for the grouping by single string columns so no regressions are introduced

What changes are included in this PR?

  1. Add fuzz test coverage for both sorted (streaming) and non streaming
  2. Add MemTable::with_sort_exprs which is needed to write this test

Are these changes tested?

Yes

Are there any user-facing changes?

  1. New API MemTable::with_sort_exprs

@github-actions github-actions bot added the core Core DataFusion crate label Feb 10, 2024
@@ -58,6 +60,9 @@ pub struct MemTable {
pub(crate) batches: Vec<PartitionData>,
constraints: Constraints,
column_defaults: HashMap<String, Expr>,
/// Optional pre-known sort order(s). Must be `SortExpr`s.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I needed this feature to write a test that ran with sorted aggregates

@alamb alamb marked this pull request as ready for review February 10, 2024 16:39
@alamb alamb changed the title Add string aggregagte grouping fuzz test Add string aggregate grouping fuzz test Feb 10, 2024

/// Return a string of random characters of length 1..=max_len
fn random_string(rng: &mut StdRng, max_len: usize) -> String {
// pick characters at random (not just ascii)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thats nice....

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks @alamb
this makes me think that fuzz tests require a lot of code to maintain the tests, so we probably may want to think about having a separate crate for fuzz testing like you did for array funtions in #9113

Copy link
Contributor Author

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks @alamb
this makes me think that fuzz tests require a lot of code to maintain the tests, so we probably may want to think about having a separate crate for fuzz testing like you did for array funtions in #9113

Thanks @comphead -- I agree keeping the fuzz testing code in a central place is a good ideea. In this case, some non trivial chunk of the code is in the test-utils crate (not published to crates.io, etc):

https://github.com/apache/arrow-datafusion/tree/main/test-utils

And then is used as a dev dependency

https://github.com/apache/arrow-datafusion/blob/d7dcb121a55c9d5e7a7701e57a790f4a6da27a9b/datafusion/core/Cargo.toml#L117

I think longer term consolidating fuzz testing into its own crate (like we did for sqllogictests) would be very helpful

async fn aggregate_test() {
/// Tests that streaming aggregate and batch (non streaming) aggregate produce
/// same results
#[tokio::test(flavor = "multi_thread")]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no reason to limit this to 8 threads that I know of. Using multi-thread uses more cores if available

@@ -50,18 +61,18 @@ async fn aggregate_test() {
let n = 300;
let distincts = vec![10, 20];
for distinct in distincts {
let mut handles = Vec::new();
let mut join_set = JoinSet::new();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using JoinSet for consistency -- which automatically cancels outstanding tasks on panic

/// to test the streaming group by case
///
/// if large is true, the input batches will be LargeStringArray
async fn group_by_string_test(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here is the new test

};
join_set.spawn(async move { run_distinct_count_test(generator).await });
}
for generator in StringBatchGenerator::interesting_cases() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored StringBatchGenerator into a common location to share

@alamb alamb changed the title Add string aggregate grouping fuzz test Add string aggregate grouping fuzz test, add MemTable::with_sort_exprs Feb 12, 2024
@alamb alamb merged commit 3c2b542 into apache:main Feb 12, 2024
24 checks passed
@alamb
Copy link
Contributor Author

alamb commented Feb 12, 2024

Thanks again for the review @comphead

@alamb alamb deleted the alamb/gby_fuzz branch February 12, 2024 14:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants