Conversation
There was a problem hiding this comment.
Do PTFs support overloading? Would it be possible to later add a sequence function for dates and timestamps?
There was a problem hiding this comment.
Table functions don't support overloading other than resulting from automatic type coercion. For dates or timestamps we'd have to add separate functions. Alternatively, we could model overloading by specifying optional arguments.
There was a problem hiding this comment.
We would have to have exclusive sets of optional arguments, and it sounds too complicated. Maybe just separate functions then. I think it's ok to have sequence() work with long integers and use other names for other data types.
There was a problem hiding this comment.
That's in line with what I was thinking. I chose the name sequence for the most popular sequence function, and considered adding date_sequence as a follow-up.
Overloading with optional arguments is just a theoretical possibility. I wouldn't do it unless the overloads share most arguments.
4ccc1b8 to
be9aa91
Compare
martint
left a comment
There was a problem hiding this comment.
I'm pondering whether it makes sense to allow negative values for step given that the resulting table is unordered. A negative step conveys that the order matters to some extent.
If we do allow it, however, we cannot guarantee that values will come out in descending order. We may even want to exploit that to simplify the implementation by making the iteration always from smaller to larger.
| */ | ||
| package io.trino.spi.ptf; | ||
|
|
||
| public enum EmptyTableFunctionHandle |
There was a problem hiding this comment.
Are there any cases where a connector would need to use this handle directly? If not, let's make it a package-protected class, or even an inner class of TableFunctionAnalysis.
There was a problem hiding this comment.
The intention behind this class was to provide a default empty implementation that functions can use if they find it convenient, instead of defining their own "empty handle" class. It simplifies the analyze() method, as you can skip passing the handle.
There was a problem hiding this comment.
Not sure I understand "skip passing the handle". The handle represents the identity of the function, so what does it mean to skip it?
There was a problem hiding this comment.
The function can be identified by the name, so the ConnectorTableFunctionHandle's role can be limited to carrying the information necessary for execution. With this approach, I thought it would be reasonable to allow the function author skip declaring the handle class, and passing it in analysis if there's no information to pass.
We can change the approach so that functions are identified by the class of the handle. Then we'd have to make the handle required in the analysis. It would be a backward-incompatible change, but it wouldn't break anyone's existing code (only need to adjust the exclude_columns functions). Oh, actually, if we changed the approach to identifying functions, we might want to change the spi method FunctionProvider.getTableFunctionProcessorProvider().
There was a problem hiding this comment.
This is a random value which was convenient for testing the processor when I was using fixed page size.
I'm open to suggestions on what the DEFAULT_SPLIT_SIZE should be, or if we want to set it another way (property?).
There was a problem hiding this comment.
No need to make it configurable. However, I think 2023 is too small -- that's a handful of pages. I would pick a number on the order of tens or hundreds of thousands, maybe even low millions.
core/trino-main/src/main/java/io/trino/operator/table/Sequence.java
Outdated
Show resolved
Hide resolved
testing/trino-tests/src/test/java/io/trino/tests/TestSequenceFunction.java
Outdated
Show resolved
Hide resolved
testing/trino-tests/src/test/java/io/trino/tests/TestSequenceFunction.java
Outdated
Show resolved
Hide resolved
When working on this, I considered allowing only positive steps. However, I found that it would make the function difficult to use in some cases. When the user specifies |
In my opinion, a sequence with positive step also has a notion of ordering. I added a note in the documentation to underline that the result should not be considered ordered. @martint could you please check the docs, and maybe suggest some other ways to make it clear to the user that the result is unordered? |
The difference in implementation between ascending and descending sequence is limited to a short helper method used only during split generation, so that wouldn't ba a big gain. |
That's a good point.
It would if we wanted the core logic to avoid using exceptions for control flow in lieu of testing boundary conditions directly. For instance, instead of doing if (start > Long.MAX_VALUE - step) {
... handle boundary
}
else {
... handle normal condition
}That only works if step is positive. Otherwise you have to invert it and do the comparison against Long.MIN_VALUE. |
Are you suggesting we should do it? It would make the code more complicated, especially that we need to handle both ascending and descending sequences. |
|
I'm thinking something along these lines for the code that generates splits: long splitSize = DEFAULT_SPLIT_SIZE;
ImmutableList.Builder<SequenceFunctionSplit> builder = ImmutableList.builder();
while (!done(start, stop, step)) {
if (step > 0) {
if (start > Long.MAX_VALUE - splitSize) {
// at the edge of the long range, so limit the splitSize to avoid overflow
splitSize = Long.MAX_VALUE - start;
}
if (start + splitSize > stop) {
// last segment. Limit split size the the remaining range
// Equivalent to splitSize = Math.min(splitSize, stop - start) when we know that the subtraction
// won't overflow
splitSize = stop - start;
}
// align splitSize to step boundary
splitSize = splitSize / step * step;
builder.add(new SequenceFunctionSplit(start, start + splitSize));
start += splitSize;
}
else {
// ... similar logic for negative step
}
}
return new FixedSplitSource(builder.build());Alternatively, we can use BigInteger for this part of the code if it simplifies things, since it's not in the main data processing loop. |
|
I tried using BigInteger. The code is much more compact and clear. I added a comment why BigInteger is used. |
There was a problem hiding this comment.
If you pre-align stop to a multiple of step, you don't need this check. All you have to do is pick the smallest (based on absolute value) between splitStop and stop.
There was a problem hiding this comment.
One artifact of this logic is that all but one split will be unbalanced. For instance, if there are 1_000_001 elements in the sequence, one will produce 1_000_000 and the other one will produce 1. It may not matter in general, but it could for some specific workloads.
One possible solution to that would be to calculate the number of splits based on a max number of elements per split, and then divide the elements in the sequence equally among that number of splits. So, in the example above, we'd need 2 splits, and the number of elements in each split would be 500_001 and 500_000, respectively.
There was a problem hiding this comment.
Here's how I'd do it (you'll need to convert it to use BigInteger):
long start = handle.start();
long stop = handle.stop();
long step = handle.step();
stop = (stop - start) / step * step + start; // align to step
long range = stop - start + 1;
long splitCount = range / (maxSplitSize * step) + 1;
long targetRange = (range / splitCount);
targetRange = (targetRange + step) / step * step; // align to ceil(targetRange / step) * step
for (int i = 0; i < splitCount; i++) {
long splitStart = start + i * targetRange;
long splitStop = start + (i + 1) * targetRange - step;
splitStop = step > 0 ? Math.min(stop, splitStop) : Math.max(stop, splitStop);
... emit SequenceFunctionSplit(splitStart, splitStop)
}There was a problem hiding this comment.
I managed to simplify it more with operating on steps instead of ranges.
core/trino-main/src/main/java/io/trino/operator/table/Sequence.java
Outdated
Show resolved
Hide resolved
| The argument ``input`` is a table or a query. | ||
| The argument ``columns`` is a descriptor without types. | ||
|
|
||
| .. function:: sequence(start => bigint, stop => bigint, step => bigint) -> table(sequential_number bigint) |
There was a problem hiding this comment.
@mosabua @colebow can you please add this to https://trino.io/docs/current/functions/list.html#s?
same for exclude columns above
| The argument ``input`` is a table or a query. | ||
| The argument ``columns`` is a descriptor without types. | ||
|
|
||
| .. function:: sequence(start => bigint, stop => bigint, step => bigint) -> table(sequential_number bigint) |
There was a problem hiding this comment.
@mosabua @colebow can you please add this to https://trino.io/docs/current/functions/list.html#s?
same for exclude columns above
Description
Adds a table function
sequencefor generating a column of sequential bigint values.Additional context and related issues
Based on #16584 for collocated changes in connector.
Release notes
( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(X) Release notes are required, with the following suggested text:
Documentation is included.