-
Notifications
You must be signed in to change notification settings - Fork 482
Description
Intro
There exist multiple ways to create data types in Plutus Core:
- Scott-encoding, long retired by now
- sums-of-products (SOPs)
- via the
Databuilt-in type
SOPs have by far more efficient pattern matching than Data (even with the latter properly sped-up). However at the script interface level we have Data and hence to get that fast pattern matching on SOPs we have to convert Data to a SOP first and this is such an expensive operation that in a lot of cases it completely defeats the purpose as the cost of conversion may significantly outweigh the benefit of having faster pattern matching.
A prime example of Data being generally preferable over SOPs is ScriptContext, since values of this type are huge and a script doesn't need to access all of them. This problem is discussed in detail in this CIP, which proposes to make ScriptContext into a SOP instead of the current Data. This is however problematic in a number of ways:
- it's a breaking change to the API
- alternative toolchains may not (and some likely do not) deal with SOPs in any way at all as they may find only using
Datasufficient, particularly since usingDatais simply faster in many cases
What if instead of trying to force SOPs upon everybody whereever possible we simply make pattern matching on Data-encoded data types more efficient, making it competitive with pattern matching on SOPs?
High-level idea
There's one deceptively simple way of making pattern matching on Data-encoded data types faster: just convert a Data.Constr node into the respective Term.Constr node and case on the latter instead using the fast case primitive.
This is how Constr is defined for Data:
data Data
= Constr Integer [Data]
| <...>This is how Constr is defined for UPLC Term:
data Term name uni fun ann
= Constr !ann !Word64 ![Term name uni fun ann]
| <...>Any Data.Constr is embeddable into Term via
- converting the
Integervalue intoWord64when possible (this doesn't work with certain interpretations of theIntegervalue such as this one, but we ignore those within the scope of this discussion and assume that eachDataobject represents a well-defined SOP term withIntegervalue being indices of constructors of the encoded data type) - converting the
[Data]value into[Term name uni fun ann]by wrapping eachDatawith theConstantconstructor ofTerm(plus some additional plumbing that we won't get into)
For example
Data.Constr 0
[ Data.Constr 1 []
, Data.I 3
]will become
Term.Constr () 0
[ Term.Constant () (Some (ValueOf <...> (Data.Constr 1 [])))
, Term.Constant () (Some (ValueOf <...> (Data.I 3)))
](where Some and ValueOf are the additional plumbing that we don't get into, since it's largely irrelevant and will likely change anyway)
I.e. the outer Data.Constr becomes a Term.Constr and the contents of the former get embedded as-is, allowing us to case on the new value.
We're going to call the built-in function implementing this behavior constrTermFromConstrData.
Differences with faster pattern matching for built-in types
#5711 elaborates on the approaches for speeding up pattern matching for built-in types, which may sound directly related to the matters described in this issue, however the two discussions are entirely tangential. Making pattern matching for built-in types faster impacts not only Data, but also lists and pairs and potentially other built-in types (maybe even Bool). However without constrTermFromConstrData there's no way to access all the arguments of a constructor at once in a single case expression. I.e. without that builtin we'll have to keep doing what we do now, namely recurse on ds from Constr i ds and extract arguments one-by-one using pattern matching on built-in lists. That recursion happens on the Plutus side, which is very inefficient, plus it requires a linear amount of builtin calls (regardless of how pattern matching for built-in types is implemented) while with constrTermFromConstrData only one builtin call is required (we'll still need a linear amount of work to feed the arguments of the constructor to the function from the respective branch, but that is supposed to be much more efficient than doing builtin calls).
Implementation details
Feel free to skip this section, it's here for completeness.
In practice, we don't embed Data.Constr into Term.Constr, because at runtime built-in functions receives values, not terms, so we need to embed Data.Constr into the type of values that the specific evaluator uses. It's the same logic as with constants and we use a very similar machinery.
So a more accurate name would be constrValueFromConstrData or something, but the users don't really care about these trivialities, so we're going to stick to constrTermFromConstrData, since it's a decent enough and more specific name, plus it's easier to explain the high-level details this way and constrTermFromConstrData kinda looks better in a program ("value" is way too overloaded).
The Haskell denotation of the Plutus builtin looks like this:
constrTermFromConstrDataDenotation
:: KnownTypeAst TyName DefaultUni sop => Data -> BuiltinResult (Opaque val sop)
constrTermFromConstrDataDenotation (Constr iInteger ds) = do
iWord64 <- integerToWord64BuiltinResult $ fromValue @val iInteger
pure . toConstr iWord64 $ map fromValue ds
constrTermFromConstrDataDenotation _ = evaluationFailurewhich follows the logic outlined above: the Integer value becomes a Word64 and the Data arguments get turned into values using fromValue (where Value means a Haskell value, not what we've been discussing so far, told you it was way too overloaded).
A partial implementation can be found in this PR.
Typing issue
The runtime semantics of this builtin are trivial, but what type is that builtin supposed to have? Well, a good start is
constrTermFromConstrData : all (sop :: *). data -> sop
which says that the builtin converts data values to an arbitrary type that is named sop, but can be really anything. But this is an overly permissive type signature: we don't want this builtin to claim that it can produce anything, we only want it to claim to produce SOPs. We could try thinking of somehow insisting that sop is indeed a SOP (I've no idea how we could achieve that), but even then, we don't want it to be any SOP, we want it to be some very specific SOP, for example:
constrTermFromConstrDataExample : data -> SOP [[data, data], [], [data]]
If our builtin was limited to [[data, data], [], [data]], then we could check at runtime that
dsinConstr 0 dshas two elementsdsinConstr 1 dshas zero elementsdsinConstr 2 dshas one element
and that builtin would be reasonably typed and safe to have.
But how could we possibly generalize this example version of the builtin without the ability to quantify over the shape of SOPs and constrain those shapes to have a runtime representation, so that it's possible to check the Data.Constrs coming in against that representation? In Haskell we could write something like
constrTermFromConstrDataExample :: forall shape. KnownShape shape => Data -> SOP shapebut Plutus isn't Haskell and we can't turn such a Haskell function into a built-in one as we can't quantify over shape, let alone constrain it with a type class.
Maybe with dependent typing we could somehow connect the given data value with its expected shape, but this doesn't sound particularly nice either and we don't have dependent types in the general Plutus any way, let alone in its builtins.
Builtin families
It is clear that enumerating all practical (let alone possible) instantiations of constrTermFromConstrData as individual builtins isn't gonna work: we're not adding hundreds/thousands of new built-in functions. What we can try however is making constrTermFromConstrData a family of built-in functions rather than a single builtin, so that constrTermFromConstrDataExample is just one member of the family. I.e. we'd have a new constructor of DefaultFun that unlike all other constructors takes an argument with the argument being the shape of the final SOP:
-- | Each element represents the number of 'Data's in a product from a sum of products. E.g.
--
-- > [2, 0, 1]
--
-- represents
--
-- > SOP [[data, data], [], [data]]
type SopShape = [Word64]
data DefaultFun
-- Integers
= AddInteger
| SubtractInteger
<...>
| ConstrTermFromConstrDataExample SopShapeThis may rule out some use cases (perhaps if one decides to compile an extensional type theory with W-types to Plutus or something), but this seems so insignificant that it's probably not even worth any additional discussion. How often does the programmer not know the arity of a constructor or the number of constructors of a data type?
The real problem is that adding builtin families invalidates all of our assumptions about builtins, which are, in the language of Haskell constraints, Enum, Bounded and Ix. If you have an infinite number of builtins, you can no longer put something for each of them in an array. You can't put their type signatures in golden files or print them to the user. And we do these kinds of things all over the place.
So while retrofitting builtin families into Plutus seems possible in principle, the costs of such an endeavor would be huge and so would be the risk of screwing something up along the way.
Singletons
Introducing builtin families sorta amounts to allowing for type applications to be unerasable for builtins. Which is hard in our existing system, but there's a common way to emulate unerasable types: singletons. Normally singletons require GADTs (which are hard and unergonomic in Plutus Core (see examples) to the point that we prefer to think of them as entirely unsupported), however in our case we might get away with only using a GADT on the Haskell side.
The idea is that we implement the constrTermFromConstrData builtin such that it has the following Plutus type signature:
constrTermFromConstrData : sopOfData sop -> data -> sop
where sopOfData is a built-in type that restricts sop to be a sum-of-products-of-data (such as the SOP [[data, data], [], [data]] example from the above). Each value of a built-in type comes with a runtime type tag, which in this case we can use to establish the connection between the type information and the runtime content of sopOfData thereby essentially propagating type info to the runtime where it can be used to ensure the safety of the conversion from Data to sop.
Assuming it's going to work out, we'll still run into a problem: doing checks at runtime is expensive. Not only will we need to construct and pass around that sopOfData singleton, we'll also need to match on it to ensure that it encodes the exact shape of the given data object. All for the sake of making our type system better. Would, say, Aiken users care about those checks slowing their smart contracts down? I doubt it.
What exactly breaks if we add the unsafe constrTermFromConstrData?
Adding the unsafe constrTermFromConstrData will make it possible for a well-typed case expression to fail with an error message like "expected n arguments to constructor C but got m" (I'm shooting for the unreachable ideal error message here). Now of course UPLC doesn't have any types and so such an error is already possible in UPLC anyway, however this isn't the case in TPLC or PIR and by extension in UPLC that we get by erasing TPLC.
Is this a big deal? I don't know. What I know is this is kinda a not-too-unsafe form of unsafeCoerce. Right now one can implement a rather boring safeCoerce function that returns its argument back but with a different type:
safeCoerce :: (ToData a, FromData b) => a -> b
safeCoerce = fromData . toDataOne might use if a a type synonym to b for example. If it so happens at runtime that a isn't definitionally equal to b, then it'll be an immediate evaluation failure (unless we have some really weird ToData/FromData instances, which I hope we don't).
Adding the unsafe builtin will allow us to define
unsafeCoerce :: ToData a => a -> b
unsafeCoerce = constrTermFromConstrData . toDatathat does not immediately fail with an error at runtime in case a and b aren't definitionally equal. Instead, it'll return a term that fails being cased on (if the other parts of the case-expression are well-typed themselves).
So how bad is this? Quite bad actually! Consider the following UPLC expression (pseudo-syntax):
case constrTermFromConstrData d of
\x y -> y
It converts a Data value d to a Term, i.e. a Constr with a bunch of fields, and then applies the \x y -> y function to the two fields (recall that in UPLC branches of case expressions are merely functions). But what if constrTermFromConstrData d returns a Constr with three fields rather than two? Well, then the function gets applied to too many arguments and depending on what y is we'll likely get an error sooner or later. Likely, but not necessarily: what if y is a function? Then it gets fed the extra argument that appears last in the overapplied Constr and the semantics of the term completely derails from what was originally intended by the programmer, meaning at that point they're at the mercy of undefined behavior.
The opposite situation when Constr has only one argument is equally bad. In that case the \x y -> y gets applied to only one argument and the whole expression reduces to \y -> y and if evaluation stops here, then that's evaluation success! Certainly something that the programmer did not intend to end up with. This doesn't apply to Plutus V3+, because those versions make it necessary for scripts to return (), everything else is a failure, but still, maybe the branch intended to run a check and fail, which is now not gonna happen because the Constr node turned out to be invalid. Failing to perform a check because of unexpected user input is quite bad.
Note that even if we knew how to give a type to constrTermFromConstrData, we'd still have all these issues because the on-chain language is untyped. Even if we can guarantee that our toolchain never produces such mess, we cannot guarantee that other toolchains don't, particularly those that target UPLC directly without using any type system.
Conclusions
The unsafe constrTermFromConstrData is a horrible footgun that we should not add to the language, I believe.
Maybe we can make singletons perform reasonably fast, this should be investigated.