Add table function sequence by kasiafi · Pull Request #16716 · trinodb/trino

kasiafi · 2023-03-24T15:24:05Z

Description

Adds a table function sequence for generating a column of sequential bigint values.

        SELECT *
        FROM TABLE(sequence(
                        start => 1000000,
                        stop => -2000000,
                        step => -3))

Additional context and related issues

Based on #16584 for collocated changes in connector.

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(X) Release notes are required, with the following suggested text:

# General
* Add a built-in table function sequence().

Documentation is included.

nineinchnick · 2023-03-24T15:32:36Z

core/trino-main/src/main/java/io/trino/operator/table/Sequence.java

Do PTFs support overloading? Would it be possible to later add a sequence function for dates and timestamps?

Table functions don't support overloading other than resulting from automatic type coercion. For dates or timestamps we'd have to add separate functions. Alternatively, we could model overloading by specifying optional arguments.

We would have to have exclusive sets of optional arguments, and it sounds too complicated. Maybe just separate functions then. I think it's ok to have sequence() work with long integers and use other names for other data types.

That's in line with what I was thinking. I chose the name sequence for the most popular sequence function, and considered adding date_sequence as a follow-up.

Overloading with optional arguments is just a theoretical possibility. I wouldn't do it unless the overloads share most arguments.

martint

I'm pondering whether it makes sense to allow negative values for step given that the resulting table is unordered. A negative step conveys that the order matters to some extent.

If we do allow it, however, we cannot guarantee that values will come out in descending order. We may even want to exploit that to simplify the implementation by making the iteration always from smaller to larger.

martint · 2023-04-03T23:46:58Z

core/trino-spi/src/main/java/io/trino/spi/ptf/EmptyTableFunctionHandle.java

+ */
+package io.trino.spi.ptf;
+
+public enum EmptyTableFunctionHandle


Are there any cases where a connector would need to use this handle directly? If not, let's make it a package-protected class, or even an inner class of TableFunctionAnalysis.

The intention behind this class was to provide a default empty implementation that functions can use if they find it convenient, instead of defining their own "empty handle" class. It simplifies the analyze() method, as you can skip passing the handle.

Not sure I understand "skip passing the handle". The handle represents the identity of the function, so what does it mean to skip it?

The function can be identified by the name, so the ConnectorTableFunctionHandle's role can be limited to carrying the information necessary for execution. With this approach, I thought it would be reasonable to allow the function author skip declaring the handle class, and passing it in analysis if there's no information to pass.

We can change the approach so that functions are identified by the class of the handle. Then we'd have to make the handle required in the analysis. It would be a backward-incompatible change, but it wouldn't break anyone's existing code (only need to adjust the exclude_columns functions). Oh, actually, if we changed the approach to identifying functions, we might want to change the spi method FunctionProvider.getTableFunctionProcessorProvider().

martint · 2023-04-03T23:50:08Z

core/trino-main/src/main/java/io/trino/operator/table/Sequence.java

This is a random value which was convenient for testing the processor when I was using fixed page size.

I'm open to suggestions on what the DEFAULT_SPLIT_SIZE should be, or if we want to set it another way (property?).

No need to make it configurable. However, I think 2023 is too small -- that's a handful of pages. I would pick a number on the order of tens or hundreds of thousands, maybe even low millions.

changed to 1 million

core/trino-main/src/main/java/io/trino/operator/table/Sequence.java

testing/trino-tests/src/test/java/io/trino/tests/TestSequenceFunction.java

kasiafi · 2023-04-04T08:28:30Z

I'm pondering whether it makes sense to allow negative values for step given that the resulting table is unordered.

When working on this, I considered allowing only positive steps. However, I found that it would make the function difficult to use in some cases. When the user specifies (start, stop, step), it is guaranteed that the start value is in the produced sequence, but the stop value might not be in the sequence (if the step skips over it). If we forced the user to specify an ascending sequence instead of descending, they'd have to compute the start value on their own instead of just reversing the bounds. Otherwise their sequence would be shifted by the remainder.

kasiafi · 2023-04-04T08:34:20Z

A negative step conveys that the order matters to some extent.

In my opinion, a sequence with positive step also has a notion of ordering. I added a note in the documentation to underline that the result should not be considered ordered. @martint could you please check the docs, and maybe suggest some other ways to make it clear to the user that the result is unordered?

kasiafi · 2023-04-04T08:38:21Z

We may even want to exploit that to simplify the implementation by making the iteration always from smaller to larger

The difference in implementation between ascending and descending sequence is limited to a short helper method used only during split generation, so that wouldn't ba a big gain.

martint · 2023-04-04T20:40:54Z

it is guaranteed that the start value is in the produced sequence, but the stop value might not be in the sequence (if the step skips over it)

That's a good point.

The difference in implementation between ascending and descending sequence is limited to a short helper method used only during split generation, so that wouldn't ba a big gain.

It would if we wanted the core logic to avoid using exceptions for control flow in lieu of testing boundary conditions directly. For instance, instead of doing addExact(start, step) and catching the exception, you could do:

if (start > Long.MAX_VALUE - step) {
    ... handle boundary
}
else {
    ... handle normal condition
}

That only works if step is positive. Otherwise you have to invert it and do the comparison against Long.MIN_VALUE.

kasiafi · 2023-04-05T08:25:22Z

It would if we wanted the core logic to avoid using exceptions for control flow in lieu of testing boundary conditions directly.

Are you suggesting we should do it? It would make the code more complicated, especially that we need to handle both ascending and descending sequences.
Please note that exceptions might only occur during split generation, and the number of try-catch blocks is O(number of splits) or O(number of steps) for huge steps. The Processor is free of handling overflow.

martint · 2023-04-07T04:56:16Z

I'm thinking something along these lines for the code that generates splits:

long splitSize = DEFAULT_SPLIT_SIZE;

ImmutableList.Builder<SequenceFunctionSplit> builder = ImmutableList.builder();
while (!done(start, stop, step)) {
    if (step > 0) {
        if (start > Long.MAX_VALUE - splitSize) {
            // at the edge of the long range, so limit the splitSize to avoid overflow
            splitSize = Long.MAX_VALUE - start;
        }

        if (start + splitSize > stop) {
            // last segment. Limit split size the the remaining range
            // Equivalent to splitSize = Math.min(splitSize, stop - start) when we know that the subtraction
            // won't overflow
            splitSize = stop - start;
        }

        // align splitSize to step boundary
        splitSize = splitSize / step * step;

        builder.add(new SequenceFunctionSplit(start, start + splitSize));
        start += splitSize;
    }
    else {
        // ... similar logic for negative step 
    }
}

return new FixedSplitSource(builder.build());

Alternatively, we can use BigInteger for this part of the code if it simplifies things, since it's not in the main data processing loop.

kasiafi · 2023-04-07T14:21:02Z

I tried using BigInteger. The code is much more compact and clear. I added a comment why BigInteger is used.

martint · 2023-04-08T15:27:27Z

core/trino-main/src/main/java/io/trino/operator/table/Sequence.java

If you pre-align stop to a multiple of step, you don't need this check. All you have to do is pick the smallest (based on absolute value) between splitStop and stop.

martint · 2023-04-08T15:31:49Z

core/trino-main/src/main/java/io/trino/operator/table/Sequence.java

One artifact of this logic is that all but one split will be unbalanced. For instance, if there are 1_000_001 elements in the sequence, one will produce 1_000_000 and the other one will produce 1. It may not matter in general, but it could for some specific workloads.

One possible solution to that would be to calculate the number of splits based on a max number of elements per split, and then divide the elements in the sequence equally among that number of splits. So, in the example above, we'd need 2 splits, and the number of elements in each split would be 500_001 and 500_000, respectively.

Here's how I'd do it (you'll need to convert it to use BigInteger):

long start = handle.start(); long stop = handle.stop(); long step = handle.step(); stop = (stop - start) / step * step + start; // align to step long range = stop - start + 1; long splitCount = range / (maxSplitSize * step) + 1; long targetRange = (range / splitCount); targetRange = (targetRange + step) / step * step; // align to ceil(targetRange / step) * step for (int i = 0; i < splitCount; i++) { long splitStart = start + i * targetRange; long splitStop = start + (i + 1) * targetRange - step; splitStop = step > 0 ? Math.min(stop, splitStop) : Math.max(stop, splitStop); ... emit SequenceFunctionSplit(splitStart, splitStop) }

I managed to simplify it more with operating on steps instead of ranges.

core/trino-main/src/main/java/io/trino/operator/table/Sequence.java

findepi · 2023-05-05T06:58:07Z

docs/src/main/sphinx/functions/table.rst

    The argument ``input`` is a table or a query.
    The argument ``columns`` is a descriptor without types.

+.. function:: sequence(start => bigint, stop => bigint, step => bigint) -> table(sequential_number bigint)


@mosabua @colebow can you please add this to https://trino.io/docs/current/functions/list.html#s?

same for exclude columns above

findepi · 2023-05-05T06:58:07Z

docs/src/main/sphinx/functions/table.rst

    The argument ``input`` is a table or a query.
    The argument ``columns`` is a descriptor without types.

+.. function:: sequence(start => bigint, stop => bigint, step => bigint) -> table(sequential_number bigint)


@mosabua @colebow can you please add this to https://trino.io/docs/current/functions/list.html#s?

same for exclude columns above

cla-bot bot added the cla-signed label Mar 24, 2023

nineinchnick reviewed Mar 24, 2023

View reviewed changes

github-actions bot added the docs label Mar 24, 2023

kasiafi force-pushed the 470Sequence branch from edcacf9 to 73886fd Compare March 27, 2023 07:22

kasiafi mentioned this pull request Mar 27, 2023

The scalar function sequence does not handle long overflow correctly #16742

Closed

kasiafi force-pushed the 470Sequence branch 3 times, most recently from 4ccc1b8 to be9aa91 Compare March 30, 2023 07:36

martint reviewed Apr 4, 2023

View reviewed changes

kasiafi force-pushed the 470Sequence branch from be9aa91 to 1a13a1f Compare April 4, 2023 09:47

kasiafi force-pushed the 470Sequence branch from 1a13a1f to 271e697 Compare April 5, 2023 09:36

kasiafi force-pushed the 470Sequence branch from 271e697 to 3186c7b Compare April 7, 2023 14:19

martint reviewed Apr 8, 2023

View reviewed changes

kasiafi force-pushed the 470Sequence branch from 3186c7b to 7a24bc9 Compare April 13, 2023 12:06

martint approved these changes Apr 13, 2023

View reviewed changes

core/trino-main/src/main/java/io/trino/operator/table/Sequence.java Outdated Show resolved Hide resolved

kasiafi force-pushed the 470Sequence branch from 7a24bc9 to 43aafcc Compare April 13, 2023 18:15

Add table function sequence

3c58b58

kasiafi force-pushed the 470Sequence branch from 43aafcc to 3c58b58 Compare April 14, 2023 07:07

kasiafi merged commit f1a7ed5 into trinodb:master Apr 14, 2023

github-actions bot added this to the 414 milestone Apr 14, 2023

colebow mentioned this pull request Apr 19, 2023

Add Trino 414 release notes #17128

Closed

findepi reviewed May 5, 2023

View reviewed changes

colebow mentioned this pull request May 8, 2023

Add table functions to function list #17395

Merged

findepi reviewed May 5, 2023

View reviewed changes

colebow mentioned this pull request May 8, 2023

Add table functions to function list #17395

Merged

Conversation

kasiafi commented Mar 24, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Additional context and related issues

Release notes

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

martint left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kasiafi Apr 5, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kasiafi commented Apr 4, 2023

Uh oh!

kasiafi commented Apr 4, 2023

Uh oh!

kasiafi commented Apr 4, 2023

Uh oh!

martint commented Apr 4, 2023

Uh oh!

kasiafi commented Apr 5, 2023

Uh oh!

martint commented Apr 7, 2023

Uh oh!

kasiafi commented Apr 7, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

4 participants

kasiafi commented Mar 24, 2023 •

edited

Loading

kasiafi Apr 5, 2023 •

edited

Loading