Skip to content
This repository was archived by the owner on May 9, 2024. It is now read-only.

Add quantile aggregate. #636

Merged
merged 4 commits into from
Aug 24, 2023
Merged

Add quantile aggregate. #636

merged 4 commits into from
Aug 24, 2023

Conversation

ienkovich
Copy link
Contributor

This patch introduces QUANTILE aggregate to compute precise quantiles.

Quantile supports multiple interpolations. This is to match pandas' quantile options for Modin.

Quantile implementation is similar to approx quantile implementation except it collects all values and computes the precise quantile value. Quantile objects simply collect all values into vectors of raw buffers, memory is allocated via RowSetMemoryOwner in 1-64KB chunks on demand. On reduction, all buffers are collected together. At the finalization, a special chunked iterator is used to call std::nth_element to get values affecting the quantile result in proper places. It might seem from the code that we don't finalize quantiles until data is fetched from the ResutSet but in fact, it is implicitly finalized on rowCount call when we put the ResultSet into the registry (that's why I added a new timer there). In the future, I want to add an explicit finalization stage to be used for quantiles and topK and also avoid TargetValue materialization on rowCount.

For the quantile result type, I decided to keep the original argument type unless midpoint or linear interpolation is used for integer values. In the latter case, the result is fp64.

In the builder, I don't expand the string parser to support interpolations, and the quantile method is supposed to be used for non-default interpolations.

@ienkovich
Copy link
Contributor Author

@kurapov-peter @alexbaden This finally passes Windows tests and is ready for review. I left most of the quantile runtime in the static code to avoid all those missing std:: symbols errors on for LLVM JIT on Windows. It shouldn't affect performance much and for GPU it is not available anyway.

maybeCloneFunctionRecursive(call.getCalledFunction(), is_l0);
// Ignore indirect calls (e.g. virtual function calls).
if (call.getCalledFunction()) {
maybeCloneFunctionRecursive(call.getCalledFunction(), is_l0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having this always called was a nice way to catch incorrect signatures or missing functions. I wonder if we could use something like isIndirectCall instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From LLVM docs: getCalledFunction Returns the function called, or null if this is an indirect function invocation or the function signature does not match the call signature.
So it's mostly the same as isIndirectCall but will also avoid SEGFAULT on wrong LLVM IR causing runtime exception instead later on LLVM check. Wrong LLVM appears from time to time in working branches when new features are added and having a proper error with a description is much more preferable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a check in the maybeCloneFunctionRecursive that catches exactly that the function is not present in the module, but expected to be. Adding the if masks the problem, so we will detect it only on IR validation (worst case, after linking). I don't see how it is more preferable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the case. maybeCloneFunctionRecursive is not an IR integrity checker, it gets a declaration and checks if we have a definition of the function in addition to the declaration and copies this definition. At least for CPU, function definitions are optional in the runtime module (and we don't have them for quantile *_impl methods).

Introduced if-statement checks that we have a declaration for a call to be passed to maybeCloneFunctionRecursive. Otherwise, we would simply get a SEGFAULT. It would skip both indirect and incorrect calls, but it is OK because LLVM IR checkers would issue nice verbose errors about wrong calls, we don't need to construct our own checkers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is. Function definitions are optional, yes. What is not optional is the presence of the LLVM function in the module that we clone from. It can be a declaration as in your *_impl example. A missing declaration or implementation is an internal error, a bug, so the function should fail on assertion (SEGFAULT). This is precisely what happens since the first thing maybeCloneFunctionRecursive does is check if the function is null. It does not have a nice description, but that's easily fixable.

Integrity checking is not enough. Later stages of code generation do not distinguish between a declaration of a *_impl function (without definition) we should have provided and an intrinsic declaration provided by a GPU driver. Even though the IR is "correct", PTX/SPIRV generation will fail. This leads to CPU fallback, masking a real bug. Cloning a function from the runtime module should not silently ignore missing functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why you want to duplicate such checks. If a function is not present then you won't be able to emit call instruction. And existence of declarations for runtime module code is checked on runtime module compilation. And the problem with a function declaration not provided by GPU driver cannot be solved in this code anyway.

I'm OK with replacing the check with isIndirectCall though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason is simply that I had such cases with GPU (both L0 and CUDA). The functions missing are not those that we emit a call to, but those that are copied from the runtime module recursively. It doesn't solve the problem, but not having an assertion failure for those functions that are not present makes it even worse.


size_t res = cur_idx_;
for (size_t i = 0; i < chunks_.size() - 1; ++i) {
res += chunks_[i].size;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The size does not match the number of pushed elements here, right? Looks like vector's capacity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but all chunks except the last one have to be full.

Signed-off-by: ienkovich <[email protected]>
@ienkovich ienkovich force-pushed the ienkovich/quntile-3 branch from d9dcf5b to e0abac4 Compare August 24, 2023 16:11
@ienkovich ienkovich merged commit c603402 into main Aug 24, 2023
@ienkovich ienkovich deleted the ienkovich/quntile-3 branch August 24, 2023 18:35
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants