More transducers #49735

MasonProtter · 2023-05-10T20:53:17Z

This has been mentioned in a few scattered places (e.g. here #15648 (comment), #33526, various slack and discourse threads, etc.), but I think it'd be good to have a general tracking issue for this as a large scale goal.

Currently, the majority of our infrastructure is built on the paradigm of iterators. These are essentially state machines that tell you how to progress from one state to the next.

Something @tkf was rightfully a big champion of was the use of transducers instead of iterators where possible, with https://github.com/JuliaFolds/Transducers.jl being his gigantic playground for those ideas.

Fundamentally, the idea with transducers is that you replace iterate as your fundamental operation for traversing data, with foldl, and this enables a lot of interesting things. To borrow from the docs in Transducers.jl, this is what

sum(Iterators.filter(iseven, Iterators.map(x -> 2x, xs)))

is essentially doing:

function map_filter_iterators(xs, init)
    ret = iterate(xs)
    ret === nothing && return init
    acc = init
    @goto filter
    local state, x
    while true
        while true                                    # input
            ret = iterate(xs, state)                  #
            ret === nothing && return acc             #
            @label filter                             #
            x, state = ret                            #
            iseven(x) && break             # filter   :
        end                                #          :
        y = 2x              # imap         :          :
        acc += y    # +     :              :          :
    end             # :     :              :          :
    #                 + <-- imap <-------- filter <-- input
end

the Transducers.jl equivalent

foldxl(+, xs |> Map(x -> 2x) |> Filter(iseven))

does this:

function map_filter_transducers(xs, init)
    acc = init
    #              input -> Filter --> Map --> +
    for x in xs  # input    :          :       :
        if iseven(x)  #     Filter     :       :
            y = 2x    #                Map     :
            acc += y  #                        +
        end
    end
    return acc
end

and this is obtained not though clever compiler magic, but just algorithm design. The difference is that with iterators, one writes a loop and pulls values out of the iterator. The loop owns the iterator. With transducers, the transducer owns the loop and pushes out values.

An important practical benefit of transducers is the space of parallelism. Transducers.jl and things built on it like https://github.com/tkf/ThreadsX.jl give really nice APIs for many parallel workflows because whether an algorithm is amenable to parallelism is built into the representation of a transducer. With iterators, many of these things are quite opaque since the fundamental paradigm of iteration is sequential.

Finally, I'll also mention that because our intermediate representations (IR) represent loops in terms of while loops ( gotos) on iterators, this makes it a real pain to take lowered julia code and find out what the structure and intent of the original code was, and a lot of IR level transformations on Julia code need to do a lot of work to rediscover what the original loop layout was.

When represented in terms of folds though, we could preserve a lot more structured information at the IR level which could make compiler level loop optimizations easier.

The text was updated successfully, but these errors were encountered:

Tokazama · 2023-05-11T06:37:07Z

Are you suggesting that IR itself should better represent transducers, or simply that transducers generates better IR as is?

MasonProtter · 2023-05-12T21:00:16Z

Both actually. At least in theory, if we moved from representing loops with foldl instead of goto, it would make some things better / easier, especially when say, trying to interface with MLIR or LoopModels.

In practice, I wouldn't be at all surprised though if it was simply WAY too big a task to change how we internally represent loops, especially given that we've already developed the infrastructure for rediscovering loop flow information inside the compiler with our current goto-based representation.

I just mentioned that as a very far off possible milestone.

oscardssmith · 2023-05-12T21:04:06Z

The main thing that would potentially improve with a foldl based loop IR is that it might allow us to prove that most loops terminate which would make concrete eval a lot more powerful.

MasonProtter · 2023-05-12T21:08:06Z

Especially for statically sized objects like Tuples

stevengj · 2023-12-05T13:40:14Z

Fundamentally, the idea with transducers is that you replace iterate as your fundamental operation for traversing data, with foldl

But what if you don't want to do foldl, e.g. you want to do pairwise mapreduce for accuracy in floating-point reductions (#52397)?

MasonProtter · 2023-12-05T16:16:54Z

foldl is how you'd represent a regular sequential loop. What's nice is that if you have only have transducers that are amenable to re-ordering, and your reducing function is associative, you can then freely switch the foldl out for a parallel_reduce or whatever

MasonProtter · 2023-12-06T09:45:53Z

Maybe put another way @stevengj, what's nice about working with this different style is that all the logic and operations of the loop are encoded in the transducer and the reducing operator, and then the details of how it's scheduled and run (i.e. sequential or pairwise, or parallel or whatever) can be changed by simply swapping foldl for whichever other reduction you may want.

This lets you re-use more code. i.e. in the example I showed above:

foldxl(+, xs |> Map(x -> 2x) |> Filter(iseven))

the reducing operator is +, the data is xs and the transducer is Filter(iseven) ∘ Map(x -> 2x). These different components are modular and can be re-used, we could swap foldxl and replace it with foldxt for a multithreaded reduction, or foldxd for a distributed reduction, or add custom backend executors like those in FoldsCUDA.jl or FoldsThreads.jl.

Because iterators are inherent sequential, turning the iterator equivalent of this operation into a parallel or pairwise reduction is pretty non-trivial.

stevengj · 2023-12-06T15:09:48Z

I guess I got confused by this description:

Fundamentally, the idea with transducers is that you replace iterate as your fundamental operation for traversing data, with foldl

Because what you are describing is not foldl, it is reduce. The foldl operation specifies an associativity, so it is inherently sequential.

ericphanson · 2023-12-09T11:56:42Z

Maybe it could say:

Fundamentally, the idea with transducers is that you replace iterate as your fundamental operation for traversing data, with reductions. foldl provides an ordered sequential reduction analogous to iterate, but the paradigm generalizes better as other reductions (e.g. reduce) can be used as well.

giordano added speculative Whether the change will be implemented is speculative feature Indicates new feature / enhancement requests fold sum, maximum, reduce, foldl, etc. labels May 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

More transducers #49735

More transducers #49735

MasonProtter commented May 10, 2023 •

edited

Loading

Tokazama commented May 11, 2023

MasonProtter commented May 12, 2023

oscardssmith commented May 12, 2023

MasonProtter commented May 12, 2023

stevengj commented Dec 5, 2023

MasonProtter commented Dec 5, 2023

MasonProtter commented Dec 6, 2023

stevengj commented Dec 6, 2023

ericphanson commented Dec 9, 2023 •

edited

Loading

More transducers #49735

More transducers #49735

Comments

MasonProtter commented May 10, 2023 • edited Loading

Tokazama commented May 11, 2023

MasonProtter commented May 12, 2023

oscardssmith commented May 12, 2023

MasonProtter commented May 12, 2023

stevengj commented Dec 5, 2023

MasonProtter commented Dec 5, 2023

MasonProtter commented Dec 6, 2023

stevengj commented Dec 6, 2023

ericphanson commented Dec 9, 2023 • edited Loading

MasonProtter commented May 10, 2023 •

edited

Loading

ericphanson commented Dec 9, 2023 •

edited

Loading