A builtin converting the outer `Data.Constr` to a SOP for fast `case`?

## Intro

There exist multiple ways to create data types in Plutus Core:

1. Scott-encoding, long retired by now
2. sums-of-products (SOPs)
3. via the `Data` built-in type

SOPs have by far more [efficient](https://github.com/michaelpj/CIPs/blob/192e14988bdccb600450e6db09977d9ed801ebd2/CIP-0085/README.md#datatypes-using-data) pattern matching than `Data` (even with the latter properly [sped-up](https://github.com/IntersectMBO/plutus/issues/5711)). However at the script interface level we have `Data` and hence to get that fast pattern matching on SOPs we have to convert `Data` to a SOP first and this is such an expensive operation that in a lot of cases it completely defeats the purpose as the cost of conversion may significantly outweigh the benefit of having faster pattern matching.

A prime example of `Data` being generally preferable over SOPs is `ScriptContext`, since values of this type are huge and a script doesn't need to access all of them. This problem is discussed in detail in [this](https://github.com/michaelpj/CIPs/blob/mpj/script-context-sop/CIP-%3F%3F%3F%3F/README.md) CIP, which proposes to make `ScriptContext` into a SOP instead of the current `Data`. This is however problematic in a number of ways:

1. it's a breaking change to the API
2. alternative toolchains may not (and some likely do not) deal with SOPs in any way at all as they may find only using `Data` sufficient, particularly since using `Data` is simply faster in many cases

What if instead of trying to force SOPs upon everybody whereever possible we simply make pattern matching on `Data`-encoded data types more efficient, making it competitive with pattern matching on SOPs?

## High-level idea

There's one deceptively simple way of making pattern matching on `Data`-encoded data types faster: just convert a `Data.Constr` node into the respective `Term.Constr` node and case on the latter instead using the fast `case` primitive.

This is how `Constr` is defined for `Data`:

```haskell
data Data
    = Constr Integer [Data]
    | <...>
```

This is how `Constr` is defined for UPLC `Term`:

```haskell
data Term name uni fun ann
    = Constr !ann !Word64 ![Term name uni fun ann]
    | <...>
```

Any `Data.Constr` is embeddable into `Term` via

1. converting the `Integer` value into `Word64` when possible (this doesn't work with certain interpretations of the `Integer` value such as [this](https://github.com/cardano-foundation/CIPs/pull/608) one, but we ignore those within the scope of this discussion and assume that each `Data` object represents a well-defined SOP term with `Integer` value being indices of constructors of the encoded data type)
2. converting the `[Data]` value into `[Term name uni fun ann]` by wrapping each `Data` with the `Constant` constructor of `Term` (plus some additional plumbing that we won't get into)

For example

```haskell
Data.Constr 0
    [ Data.Constr 1 []
    , Data.I 3
    ]
```

will become

```haskell
Term.Constr () 0
    [ Term.Constant () (Some (ValueOf <...> (Data.Constr 1 [])))
    , Term.Constant () (Some (ValueOf <...> (Data.I 3)))
    ]
```

(where `Some` and `ValueOf` are the additional plumbing that we don't get into, since it's largely irrelevant and will likely change anyway)

I.e. the outer `Data.Constr` becomes a `Term.Constr` and the contents of the former get embedded as-is, allowing us to `case` on the new value.

We're going to call the built-in function implementing this behavior `constrTermFromConstrData`.

## Differences with faster pattern matching for built-in types

#5711 elaborates on the approaches for speeding up pattern matching for built-in types, which may sound directly related to the matters described in this issue, however the two discussions are entirely tangential. Making pattern matching for built-in types faster impacts not only `Data`, but also lists and pairs and potentially other built-in types (maybe even `Bool`). However without `constrTermFromConstrData` there's no way to access all the arguments of a constructor at once in a single `case` expression. I.e. without that builtin we'll have to keep doing what we do now, namely recurse on `ds` from `Constr i ds` and extract arguments one-by-one using pattern matching on built-in lists. That recursion happens on the Plutus side, which is very inefficient, plus it requires a linear amount of builtin calls (regardless of how pattern matching for built-in types is implemented) while with `constrTermFromConstrData` only one builtin call is required (we'll still need a linear amount of work to feed the arguments of the constructor to the function from the respective branch, but that is supposed to be much more efficient than doing builtin calls).

## Implementation details

Feel free to skip this section, it's here for completeness.

In practice, we don't embed `Data.Constr` into `Term.Constr`, because at runtime built-in functions receives values, not terms, so we need to embed `Data.Constr` into the type of values that the specific evaluator uses. It's the same logic as with constants and we use a very similar machinery.

So a more accurate name would be `constrValueFromConstrData` or something, but the users don't really care about these trivialities, so we're going to stick to `constrTermFromConstrData`, since it's a decent enough and more specific name, plus it's easier to explain the high-level details this way and `constrTermFromConstrData` kinda looks better in a program ("value" is way too overloaded).

The Haskell denotation of the Plutus builtin looks like this:

```haskell
constrTermFromConstrDataDenotation
    :: KnownTypeAst TyName DefaultUni sop => Data -> BuiltinResult (Opaque val sop)
constrTermFromConstrDataDenotation (Constr iInteger ds) = do
    iWord64 <- integerToWord64BuiltinResult $ fromValue @val iInteger
    pure . toConstr iWord64 $ map fromValue ds
constrTermFromConstrDataDenotation _ = evaluationFailure
```

which follows the logic outlined above: the `Integer` value becomes a `Word64` and the `Data` arguments get turned into values using `fromValue` (where `Value` means a Haskell value, not what we've been discussing so far, told you it was way too overloaded).

A partial implementation can be found in [this](https://github.com/IntersectMBO/plutus/pull/5776) PR.

## Typing issue

The runtime semantics of this builtin are trivial, but what type is that builtin supposed to have? Well, a good start is

```
constrTermFromConstrData : all (sop :: *). data -> sop
```

which says that the builtin converts `data` values to an arbitrary type that is named `sop`, but can be really anything. But this is an overly permissive type signature: we don't want this builtin to claim that it can produce anything, we only want it to claim to produce SOPs. We could try thinking of somehow insisting that `sop` is indeed a SOP (I've no idea how we could achieve that), but even then, we don't want it to be any SOP, we want it to be some very specific SOP, for example:

```
constrTermFromConstrDataExample : data -> SOP [[data, data], [], [data]]
```

If our builtin was limited to `[[data, data], [], [data]]`, then we could check at runtime that

1. `ds` in `Constr 0 ds` has two elements
2. `ds` in `Constr 1 ds` has zero elements
3. `ds` in `Constr 2 ds` has one element

and that builtin would be reasonably typed and safe to have.

But how could we possibly generalize this example version of the builtin without the ability to quantify over the shape of SOPs and constrain those shapes to have a runtime representation, so that it's possible to check the `Data.Constr`s coming in against that representation? In Haskell we could write something like

```haskell
constrTermFromConstrDataExample :: forall shape. KnownShape shape => Data -> SOP shape
```

but Plutus isn't Haskell and we can't turn such a Haskell function into a built-in one as we can't quantify over `shape`, let alone constrain it with a type class.

Maybe with dependent typing we could somehow connect the given `data` value with its expected shape, but this doesn't sound particularly nice either and we don't have dependent types in the general Plutus any way, let alone in its builtins.

## Builtin families

It is clear that enumerating all practical (let alone possible) instantiations of `constrTermFromConstrData` as individual builtins isn't gonna work: we're not adding hundreds/thousands of new built-in functions. What we can try however is making `constrTermFromConstrData` a family of built-in functions rather than a single builtin, so that `constrTermFromConstrDataExample` is just one member of the family. I.e. we'd have a new constructor of `DefaultFun` that unlike all other constructors takes an argument with the argument being the shape of the final SOP:

```haskell
-- | Each element represents the number of 'Data's in a product from a sum of products. E.g.
--
-- > [2, 0, 1]
--
-- represents
--
-- > SOP [[data, data], [], [data]]
type SopShape = [Word64]

data DefaultFun
    -- Integers
    = AddInteger
    | SubtractInteger
    <...>
    | ConstrTermFromConstrDataExample SopShape
```

This may rule out some use cases (perhaps if one decides to compile an extensional type theory with [W-types](https://ncatlab.org/nlab/show/W-type) to Plutus or something), but this seems so insignificant that it's probably not even worth any additional discussion. How often does the programmer not know the arity of a constructor or the number of constructors of a data type?

The real problem is that adding builtin families invalidates all of our assumptions about builtins, which are, in the language of Haskell constraints, `Enum`, `Bounded` and `Ix`. If you have an infinite number of builtins, you can no longer put something for each of them in an array. You can't put their type signatures in golden files or print them to the user. And we do these kinds of things all over the place.

So while retrofitting builtin families into Plutus seems possible in principle, the costs of such an endeavor would be huge and so would be the risk of screwing something up along the way.


## Singletons

Introducing builtin families sorta amounts to allowing for type applications to be unerasable for builtins. Which is hard in our existing system, but there's a common way to emulate unerasable types: singletons. Normally singletons require GADTs (which are hard and unergonomic in Plutus Core (see [examples](https://github.com/IntersectMBO/plutus/tree/ac7176a80e78cde0985763e51572a64ba7ddb7b1/doc/notes/fomega/gadts)) to the point that we prefer to think of them as entirely unsupported), however in our case we might get away with only using a GADT on the Haskell side.

The idea is that we implement the `constrTermFromConstrData` builtin such that it has the following Plutus type signature:

```
constrTermFromConstrData : sopOfData sop -> data -> sop
```

where `sopOfData` is a built-in type that restricts `sop` to be a sum-of-products-of-data (such as the `SOP [[data, data], [], [data]]` example from the above). Each value of a built-in type comes with a runtime type tag, which in this case we can use to establish the connection between the type information and the runtime content of `sopOfData` thereby essentially propagating type info to the runtime where it can be used to ensure the safety of the conversion from `Data` to `sop`.

Assuming it's going to work out, we'll still run into a problem: doing checks at runtime is expensive. Not only will we need to construct and pass around that `sopOfData` singleton, we'll also need to match on it to ensure that it encodes the exact shape of the given `data` object. All for the sake of making _our_ type system better. Would, say, Aiken users care about those checks slowing their smart contracts down? I doubt it.

## What exactly breaks if we add the unsafe `constrTermFromConstrData`?

Adding the unsafe `constrTermFromConstrData` will make it possible for a well-typed `case` expression to fail with an error message like "expected `n` arguments to constructor `C` but got `m`" (I'm shooting for the unreachable ideal error message here). Now of course UPLC doesn't have any types and so such an error is already possible in UPLC anyway, however this isn't the case in TPLC or PIR and by extension in UPLC that we get by erasing TPLC.

Is this a big deal? I don't know. What I know is this is kinda a not-too-unsafe form of `unsafeCoerce`. Right now one can implement a rather boring `safeCoerce` function that returns its argument back but with a different type:

```haskell
safeCoerce :: (ToData a, FromData b) => a -> b
safeCoerce = fromData . toData
```

One might use if `a` a `type` synonym to `b` for example. If it so happens at runtime that `a` isn't definitionally equal to `b`, then it'll be an immediate evaluation failure (unless we have some really weird `ToData`/`FromData` instances, which I hope we don't).

Adding the unsafe builtin will allow us to define

```haskell
unsafeCoerce :: ToData a => a -> b
unsafeCoerce = constrTermFromConstrData . toData
```

that does not immediately fail with an error at runtime in case `a` and `b` aren't definitionally equal. Instead, it'll return a term that fails being `case`d on (if the other parts of the `case`-expression are well-typed themselves).

So how bad is this? Quite bad actually! Consider the following UPLC expression (pseudo-syntax):

```
case constrTermFromConstrData d of
  \x y -> y
```

It converts a `Data` value `d` to a `Term`, i.e. a `Constr` with a bunch of fields, and then applies the `\x y -> y` function to the two fields (recall that in UPLC branches of case expressions are merely functions). But what if `constrTermFromConstrData d` returns a `Constr` with three fields rather than two? Well, then the function gets applied to too many arguments and depending on what `y` is we'll likely get an error sooner or later. Likely, but not necessarily: what if `y` is a function? Then it gets fed the extra argument that appears last in the overapplied `Constr` and the semantics of the term completely derails from what was originally intended by the programmer, meaning at that point they're at the mercy of undefined behavior.

The opposite situation when `Constr` has only one argument is equally bad. In that case the `\x y -> y` gets applied to only one argument and the whole expression reduces to `\y -> y` and if evaluation stops here, then that's evaluation success! Certainly something that the programmer did not intend to end up with. This doesn't apply to Plutus V3+, because those versions make it necessary for scripts to return `()`, everything else is a failure, but still, maybe the branch intended to run a check and fail, which is now not gonna happen because the `Constr` node turned out to be invalid. Failing to perform a check because of unexpected user input is quite bad.

Note that even if we knew how to give a type to `constrTermFromConstrData`, we'd still have all these issues because the on-chain language is untyped. Even if we can guarantee that our toolchain never produces such mess, we cannot guarantee that other toolchains don't, particularly those that target UPLC directly without using any type system.

## Conclusions

The unsafe `constrTermFromConstrData` is a horrible footgun that we should not add to the language, I believe.
Maybe we can make singletons perform reasonably fast, this should be investigated.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

A builtin converting the outer `Data.Constr` to a SOP for fast `case`? #5777

Intro

High-level idea

Differences with faster pattern matching for built-in types

Implementation details

Typing issue

Builtin families

Singletons

What exactly breaks if we add the unsafe `constrTermFromConstrData`?

Conclusions

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

A builtin converting the outer Data.Constr to a SOP for fast case? #5777

Description

Intro

High-level idea

Differences with faster pattern matching for built-in types

Implementation details

Typing issue

Builtin families

Singletons

What exactly breaks if we add the unsafe constrTermFromConstrData?

Conclusions

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

A builtin converting the outer `Data.Constr` to a SOP for fast `case`? #5777

What exactly breaks if we add the unsafe `constrTermFromConstrData`?