Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

doc: why nullable of list item is set to true #11626

Merged
merged 11 commits into from
Jul 26, 2024

Conversation

jcsherin
Copy link
Contributor

@jcsherin jcsherin commented Jul 23, 2024

Which issue does this PR close?

Closes #11625.

Rationale for this change

When working on issues related to #8708 there have been multiple PRs which dealt with the nullability of list item in accumulator state. This doc patch makes the reasoning of existing code explicit.

What changes are included in this PR?

Only doc comments are added. There are no code changes.

Aggregate functions which use data type of first argument:

  1. ArrayAgg
  2. NthValueAgg
  3. Count

Aggregate functions which use data type of returned value:

  1. BitwiseOperation
  2. Sum

Are these changes tested?

Are there any user-facing changes?

@jcsherin
Copy link
Contributor Author

Notes:

  1. I felt inline comments made more sense here rather than adding this as part of docs for state_fields. This is primarily intended for the reader of code rather than the user of the library.
  2. There is repetition of comment text in aggregate functions.

Please recommend changes to copy or critique this approach.

@jcsherin jcsherin marked this pull request as ready for review July 23, 2024 18:32
Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tbh I would be putting this to trait object instead of copying through methods.

@alamb
Copy link
Contributor

alamb commented Jul 23, 2024

tbh I would be putting this to trait object instead of copying through methods.

Do you mean add the documentation to https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.AggregateUDFImpl.html#method.state_fields ?

@comphead
Copy link
Contributor

tbh I would be putting this to trait object instead of copying through methods.

Do you mean add the documentation to https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.AggregateUDFImpl.html#method.state_fields ?

yep, it might be a better place imho. Especially if it related to all function implementations

jayzhan211

This comment was marked as outdated.

@jayzhan211 jayzhan211 dismissed their stale review July 24, 2024 00:17

review again

@jcsherin
Copy link
Contributor Author

tbh I would be putting this to trait object instead of copying through methods.

This does not belong in trait object because this affects only a few aggregate functions, not all of them.

For some aggregate functions the intermediate accumulator state often has:

  1. a list of items and,
  2. the type of the item is same as the first argument or the returned value
Field::new_list( 
    format_state_name(args.name, "distinct_array_agg"), 
    Field::new("item", args.input_type.clone(), true), // [1] should always be true
    true,  // [2] or false
)

At first glance it looked like nullable of the list item should be configurable. There were also multiple PRs in this direction before we realized it was all unnecessary.

To get rid of the comment duplication, I'll instead move them to a markdown doc within functions-aggregate and link to it from the comments.

@jcsherin jcsherin marked this pull request as draft July 24, 2024 10:41
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like an improvement to me -- thank you @jcsherin and @jayzhan211 and @comphead

I also marked this PR ready for review as it looks good to me

@@ -203,6 +203,7 @@ impl AggregateUDFImpl for BitwiseOperation {
args.name,
format!("{} distinct", self.name()).as_str(),
),
// See COMMENTS.md to understand why nullable is set to true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


## Computing Intermediate State

By setting `nullable` to be always `true` like this we ensure that the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is another rationale that the intermediate results need to be able to represent "saw no rows" (e.g that partition had no values)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For nth_value accumulator, when now rows are present in the partition, then no values are added to the intermediate state.

I haven't checked the other aggregates though. So I don't know for certain if this is the case always. I'll verify and make a follow-on PR if any differences exist. I think we've only looked deeper into nth_value and array_agg (by @jayzhan211) at the moment.

Copy link
Contributor

@alamb alamb Jul 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense -- I vaguely remember that the null was needed in one of the aggregators to distinguish between

  1. only empty lists had been seen []
  2. No lists at all had been seen NULL

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb Thanks for the pointer. I'll keep this in mind while making pass through the aggregates next time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a minor copy change to disambiguate that the "Computing Intermediate State" section is talking about the nullability of the list item rather than the nullability of the list container.

Sorry for the confusion. I was not clear earlier.

@alamb alamb added the documentation Improvements or additions to documentation label Jul 24, 2024
@alamb alamb marked this pull request as ready for review July 24, 2024 23:54
@github-actions github-actions bot removed the documentation Improvements or additions to documentation label Jul 25, 2024
@jcsherin
Copy link
Contributor Author

I pushed a minor copy edit and CI failed. Looking at the error logs it looks to me like it is not related to this change.

@jcsherin
Copy link
Contributor Author

I pushed a minor copy edit and CI failed. Looking at the error logs it looks to me like it is not related to this change.

The clippy errors in CI are being tracked here - #11651.

@alamb
Copy link
Contributor

alamb commented Jul 25, 2024

I merged up from main to get the fix for the clippy errors

@jcsherin
Copy link
Contributor Author

I merged up from main to get the fix for the clippy errors

@alamb Thank you.

In `array_agg` the list is nullable, so changed the example to
`nth_value` where the list is not nullable to be correct.
@alamb
Copy link
Contributor

alamb commented Jul 25, 2024

Thanks @jcsherin

@alamb
Copy link
Contributor

alamb commented Jul 26, 2024

Thanks again - we can iterate on the docs in follow on PRs if there is more to do

@alamb alamb merged commit d6e016e into apache:main Jul 26, 2024
24 checks passed
@jcsherin
Copy link
Contributor Author

Thanks for the review feedback - @alamb, @comphead and for prior discussions @jayzhan211.

@jcsherin jcsherin deleted the doc-list-nullable-in-state-fields branch July 26, 2024 15:05
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Document why nullable of list item does not map to schema of first argument
4 participants