Closed
Changes from 3 commits
113 changes: 113 additions & 0 deletions CIP-????/README.md
@@ -0,0 +1,113 @@
---
CIP: ?
Title: Deterministic universal almost-unique Plutus Constructors
Category: Plutus
Status: Proposed
Authors:
- Niels Mündler <[email protected]>
Implementors: [Niels Mündler <[email protected]>]
Discussions:
- https://github.com/cardano-foundation/CIPs/pull/608
Created: 2023-10-20
License: CC-BY-4.0
---

<!-- Existing categories:

- Meta | For meta-CIPs which typically serves another category or group of categories.
- Wallets | For standardisation across wallets (hardware, full-node or light).
- Tokens | About tokens (fungible or non-fungible) and minting policies in general.
- Metadata | For proposals around metadata (on-chain or off-chain).
- Tools | A broad category for ecosystem tools not falling into any other category.
- Plutus | Changes or additions to Plutus
- Ledger | For proposals regarding the Cardano ledger (including Reward Sharing Schemes)
- Catalyst | For proposals affecting Project Catalyst / the Jörmungandr project

-->

Note: I will use the terms record and Plutus Data interchangeably throughout the document.

## Abstract
Plutus Constructor IDs are currently heavily focused around their origin in Haskell. They are usually used to distinguish different constructors of a single declared datatype.
Contributor

This section lacks an elaborate example, like an actual Data object. It took me a while to figure out what you mean by a constructor ID and I'm a Plutus developer.

In contrast, one may introduce universally recognized datatypes that are identified by a unique constructor id and can be expected to behave in a specified way (i.e. contain specific fields with specific types).
For this purpose, we introduce a generic way to compute an almost unique, deterministic and universal constructor id for objects based on their name and field types.
Note that it is not expected that every language adopts this standard as a default (i.e. for Haskell-like languages there might not be much use of it).
Contributor

How is Haskell different here?

Contributor Author (@nielstron, Feb 8, 2024)

In Python it is common to declare Sum Types / Unions after the declaration of the specific types i.e.

class A

class B

class C

AB = Union[A, B]
ABC = Union[A, B, C]

In Haskell the definition of the Sum Types / Union at the same time declares the involved alternatives, hence all involved alternatives are known at the time of declaration (and known to be distinct)

data AB = A | B
data ABC = A | B | C -- I guess this would throw an error for redeclaring the constructors A and B?

Rather, it is a recommendation for cases where interoperable datatypes with unique constructor ids are useful to an application (e.g. oracles) or to a language design (e.g. imperative languages).

## Motivation: why is this CIP necessary?

The current approach to constructor ids is heavily focused on the Haskell-ish way of defining record types.
An object can be one of a predefined set of entities, distinguished by constructor ids. For example, the optional Redeemer type is either `Some Redeemer` or `None`.
Because we know that anything of this optional type can only be one of these two, only two numbers (0/1) are required to distinguish them.
If we introduce a third constructor (e.g. `Some Datum`), potentially all other constructor ids change and the two implementations are no longer compatible.

Moreover, there are other Plutus language frontends that allow freely declaring objects and mixing them into Union types (such as OpShin), which is akin to the imperative style of declaring classes.
This allows, for example, declaring a universally accepted type `Nothing` that can be freely mixed with `Redeemer` and `Datum` into `Union[Nothing, Redeemer, Datum]`.
The only requirement to ensure that this works properly is that all records that are mixed into the Union have distinct constructor ids.
This is currently implemented manually, which is tedious and a potential source of errors.

## Specification
<!-- The technical specification should describe the proposed improvement in sufficient technical detail. In particular, it should provide enough information that an implementation can be performed solely on the basis of the design in the CIP. This is necessary to facilitate multiple, interoperable implementations. This must include how the CIP should be versioned. If a proposal defines structure of on-chain data it must include a CDDL schema in it's specification.-->
The deterministic, universal and almost-unique Plutus constructors are computed recursively based on the type definition of a record.
We first compute a string `ustr(X)` based on the type definition of X. Then we perform a sha256 hash on the UTF8 encoding of this string and interpret the resulting hex digest as a big endian encoded integer.
The integer is taken modulo 2^32. The resulting integer is the almost-unique, universal, deterministic constructor id of the plutus datum.
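
A minimal sketch of the hashing step described above, assuming Python's standard `hashlib`. The function name `constr_id_from_ustr` and the example input string are illustrative only and not part of this CIP.

```python
import hashlib


def constr_id_from_ustr(u: str, modulus: int = 2**32) -> int:
    """Map a ustr string to a constructor id.

    Steps: UTF-8 encode, SHA-256, read the hex digest as a
    big-endian integer, then reduce modulo `modulus`.
    """
    digest_hex = hashlib.sha256(u.encode("utf-8")).hexdigest()
    return int(digest_hex, 16) % modulus


# Hypothetical input; any ustr string works the same way.
example_id = constr_id_from_ustr("cons[A](_;c:int)")
```

Because the digest is read as a big-endian integer, reducing it modulo 2^32 is the same as keeping the last four bytes of the digest.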
Contributor

Any discussion on what happens in case of collision?

Contributor

Yes collision seems bad here. I don't think it could lead to an attack, but I'm not 100% sure.

Contributor Author

In case of collision during declaration of Sum Type, the compiler has to deny compilation and ask the user to manually declare a constructor id for the involved types. I think this is rare enough to have practically no impact on computation. Regarding attacks I don't think this schema is more vulnerable than any other schema. In the current Plutus schema constructor id overlaps are practically omnipresent (though never in any sum types that occur in the compiled contract)
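
The collision check described in this reply could be sketched roughly as follows. This is an illustration under assumptions, not the behaviour of any existing compiler: `find_id_collisions` is a hypothetical helper that takes the names and constructor ids of the records mixed into one Union and reports ids that are shared.

```python
from collections import defaultdict
from typing import Dict, List


def find_id_collisions(members: Dict[str, int]) -> Dict[int, List[str]]:
    """Group union members by constructor id and keep only shared ids."""
    by_id: Dict[int, List[str]] = defaultdict(list)
    for name, constr_id in members.items():
        by_id[constr_id].append(name)
    return {cid: names for cid, names in by_id.items() if len(names) > 1}


def check_union(members: Dict[str, int]) -> None:
    """Reject a union whose members do not have distinct constructor ids."""
    collisions = find_id_collisions(members)
    if collisions:
        raise ValueError(
            f"constructor id collision, set ids manually: {collisions}"
        )
```

A compiler following this approach would call something like `check_union` once per declared Union, so the check costs nothing at script execution time.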


The following function describes how to compute `ustr(X)` for a type recursively.

```
ustr(bytes) := "bytes"
ustr(integer) := "int"
// This covers the case where the structure of the object is not known from the perspective of the class, i.e. when any BuiltinData is allowed
ustr(PlutusData) := "any"
Contributor

why not "data"?

Contributor Author

Probably an oversight but not too relevant
cf #608 (comment)

// This covers the case where the type of the elements in the list is not known in advance
ustr(list) := "list"
Contributor

why isn't that list<data>?

Contributor

More generally, this CIP is committing to a type-definition language that might not be appropriate for everyone, as witnessed by quirks like this.

Moreover, we already have at least two type-definition languages that we could use:

  1. CDDL (since Data is a subset of CBOR)
  2. CIP-57

Why not use one of those?

Contributor

> why isn't that list<data>?

Hm, I had this comment as well, but either I failed to hit "comment" or GitHub lost it (it's been glitchy lately for me).

Contributor Author

I think this is a valid point. I am currently looking into making this compatible with the CIP57 definitions. However, CIP57 does not put much weight on reproducibility (e.g. the concrete ordering of JSON map elements usually does not matter), whereas here it is relevant: a re-definition or at least a specification of a "canonical" blueprint from which to hash is unavoidable.


ustr(list<X>) := "list<" + ustr(X) + ">"
Contributor

We're not parsing these so I think it's fine, but probably worth clarifying that it's not a problem if e.g. type names contain < or other special characters.


ustr(map<X,Y>) := "map<" + ustr(X) + "," + ustr(Y) + ">"

ustr(union<X,Y,...,Z>) := "union<" + ustr(X) + "," + ustr(Y) + "," + ... + "," + ustr(Z) + ">"

ustr(constr(name)<id, fields[f1:X,f2:Y,...,fn:Z]>) :=
Contributor

So ustr is used to encode both types and terms?

Contributor Author

No, ustr converts only types into strings, concrete values are not relevant.

"cons[" + name + "](" + str(id) + ";"
+ f1 + ":" + ustr(X) + "," + f2 + ":" + ustr(Y) + "," + ... + "," + fn + ":" + ustr(Z) +
")"
```

Where `name` and `f1` to `fn` refer to the name of the record and the names of its fields respectively.
Contributor

Why include names? If one group of developers has data Option a = Some a | None and another has data Maybe a = Just a | Nothing why not let those two data types to be considered the same, if they mean the same thing and are encoded the same way via PlutusData anyway?

Contributor

Commented later, but: you have to have either the name or the id of the constructor, otherwise you can't distinguish two constructors with the same fields. And the whole point of this proposal is to not set the ids, so it has to be names.

Contributor

> Commented later, but: you have to have either the name or the id of the constructor, otherwise you can't distinguish two constructors with the same fields. And the whole point of this proposal is to not set the ids, so it has to be names.

I mean, you can get the hash of a constructor from the structure of the data type (\a -> Either a () for both the example types) and the id of the constructor. You don't need to globally specify the id this way, just for the purpose of computing a hash (and the way hashes are computed can be arbitrary as long as they're almost unique). Or am I misunderstanding it?

Since the constructor id of a record is not yet known while it is being computed, the constructor id string of that record is set to `_` for this computation.
Contributor

If it's not known, why include it in there in the first place? The entire concept feels awkwardly circular even though you get out of the infinite recursion with that wildcard.

Contributor Author

I agree it's a bit weird... but I do want to distinguish between classes with the same names and fields but different constructor ids, to avoid nasty surprises for the user.

class A:
    CONSTR_ID = 0

B = A

class A:
    CONSTR_ID = 1

class X:
    x: Union[A, B]

If it is pulled out of the ustr for constructors then we lose modularity of the function 🤔

As an example, the constructor id of record `A` with fields `b` (record `B`, constructor id 5 with one integer field `i`) and `c` (integer) is computed from `ustr(A) = ustr(constr(A)<_,fields[b:B,c:integer]>) = "cons[A](_;b:" + ustr(constr(B)<5,fields[i:integer]>) + ",c:int)" = "cons[A](_;b:cons[B](5;i:int),c:int)"`.
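
The recursive definition of `ustr` can be sketched in Python as below. The type classes (`Prim`, `ListOf`, `MapOf`, `UnionOf`, `Constr`) are hypothetical names introduced only for this illustration; they are not taken from pycardano or OpShin. An unknown constructor id is modelled as `None` and rendered as `_`. Applying the `cons[...]` rule literally, a nested record with a known id carries that id in its string (here `5;` for `B`).

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Prim:
    name: str  # one of "bytes", "int", "any", "list"


@dataclass(frozen=True)
class ListOf:
    elem: "Ty"


@dataclass(frozen=True)
class MapOf:
    key: "Ty"
    value: "Ty"


@dataclass(frozen=True)
class UnionOf:
    members: Tuple["Ty", ...]


@dataclass(frozen=True)
class Constr:
    name: str
    constr_id: Optional[int]  # None means "not known yet", rendered as "_"
    fields: Tuple[Tuple[str, "Ty"], ...]


Ty = Union[Prim, ListOf, MapOf, UnionOf, Constr]
INT = Prim("int")


def ustr(t: Ty) -> str:
    """Render a type definition as its ustr string (types only, no values)."""
    if isinstance(t, Prim):
        return t.name
    if isinstance(t, ListOf):
        return "list<" + ustr(t.elem) + ">"
    if isinstance(t, MapOf):
        return "map<" + ustr(t.key) + "," + ustr(t.value) + ">"
    if isinstance(t, UnionOf):
        return "union<" + ",".join(ustr(m) for m in t.members) + ">"
    if isinstance(t, Constr):
        cid = "_" if t.constr_id is None else str(t.constr_id)
        fields = ",".join(name + ":" + ustr(ft) for name, ft in t.fields)
        return "cons[" + t.name + "](" + cid + ";" + fields + ")"
    raise TypeError(f"unsupported type description: {t!r}")


# Record B has the fixed id 5; the id of A is what we want to compute.
B = Constr("B", 5, (("i", INT),))
A = Constr("A", None, (("b", B), ("c", INT)))
print(ustr(A))  # cons[A](_;b:cons[B](5;i:int),c:int)
```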

## Rationale: how does this CIP achieve its goals?
<!-- The rationale fleshes out the specification by describing what motivated the design and what led to particular design decisions. It should describe alternate designs considered and related work. The rationale should provide evidence of consensus within the community and discuss significant objections or concerns raised during the discussion.

It must also explain how the proposal affects the backward compatibility of existing solutions when applicable. If the proposal responds to a CPS, the 'Rationale' section should explain how it addresses the CPS, and answer any questions that the CPS poses for potential solutions.
-->
We definitely want the CONSTR_IDs to have a few properties:

- _small_: ideally the constr_id integer should be as small as possible, as smaller integers are encoded more efficiently in CBOR and save the end user minutxo and txfees (constr_ids are encoded as the cbor tag up to 7 bit size, after that encoded as generic integer)
Contributor

Have you run any experiments on whether using your version makes scripts more expensive (including deserialization time)? I'd expect them to become, but not sure about the scale, perhaps not by a lot.

Contributor

If anything this proposal seems vanishingly unlikely to generate tags that are small? We're taking the result mod 2^32, so I'd expect to probably get uniform numbers over that range, which are going to be way higher than 2^7.

More generally, since this proposal wants global ids, there can only be 7 types globally that get the small ids. So I think this will definitely perform worse on space, but that might not matter.

Contributor Author

Types will definitely perform worse on space, I assume most (i.e. roughly 50%) of tags will have size around 2^32. Small refers to these tags being smaller than i.e. 64 bytes (like a script or datum hash). Likely doing modulo 2^64 would not make a big difference on size/cost either but improve uniqueness, so I am looking into adding this as a change.

- _unique_: There should be as little overlap with other values as possible, so that we can group together classes in unions without having to worry about setting/overwriting the constr id. This is reflected by the unique choice of identifiers in `ustr`.
- _deterministic_: Datatypes that are defined in libraries may be imported in arbitrary contexts. The constr_id must therefore not depend on e.g. what other Unions the datatype is being used in or what other datatypes are declared in its surroundings. This rules out the Haskell approach and any automatically incrementing global counters.
Contributor

I think it's the uniqueness that rules out the Haskell approach, not determinism.

Contributor Author

Yes thats correct.
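
To put the space discussion under the _small_ property into numbers, the following sketch estimates how many bytes the constructor id itself costs in CBOR. It rests on an assumption that should be checked against the ledger CDDL: constructor ids 0 to 6 are encoded as tags 121 to 127, ids 7 to 127 as tags 1280 to 1400, and anything larger as tag 102 wrapping a two-element array `[id, fields]`. The function names are illustrative.

```python
def cbor_uint_size(n: int) -> int:
    """Bytes needed for a CBOR unsigned integer (major type 0)."""
    if n < 0:
        raise ValueError("negative values are not unsigned integers")
    if n < 24:
        return 1
    if n < 2**8:
        return 2
    if n < 2**16:
        return 3
    if n < 2**32:
        return 5
    if n < 2**64:
        return 9
    raise ValueError("value does not fit into 64 bits")


def constr_id_overhead(constr_id: int) -> int:
    """Bytes spent on the constructor id, excluding the fields list."""
    if constr_id < 0:
        raise ValueError("constructor ids are non-negative")
    if constr_id <= 6:
        return 2  # tags 121..127: one tag byte plus a 1-byte tag number
    if constr_id <= 127:
        return 3  # tags 1280..1400: one tag byte plus a 2-byte tag number
    # General form: tag 102 (2 bytes), array-of-2 header (1 byte), the id.
    return 2 + 1 + cbor_uint_size(constr_id)
```

Under these assumptions an id drawn uniformly from below 2^32 almost always costs 8 bytes, compared with 2 bytes for the small index-style ids, and reducing modulo 2^64 instead would raise this to at most 12 bytes.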


Note that the implementation first computes a `ustr` in human readable form and then transforms it into an integer. This is intentional, since the alternatives (directly computing a large unique number or similar approaches) are much more difficult to debug.

To ensure that this does not only take the structural definition but also the intended usage into account, names of records are taken into account for the computation.
Contributor

Aha, I see. It is rather strange in that whoever introduces a data type first decides for everybody else how they should name its constructor and fields. I'm also not sure how much safety it adds, Names are not type safety. I do normally believe that nominal > structural, but the underlying PlutusData is structural anyway and it seems potentially irritating to enforce the same names for all parties, plus names don't guarantee specific semantics anyway. Dunno.

Contributor

I think you have to have the names if you're not including an id. Otherwise you can't distinguish the constructors of

data Foo = A Int Int | B Int Int

I don't understand the stated reasoning (why does it matter to "take the intended usage into account"?), but I think you do need it.


There is no issue with backwards compatibility when adopting this implementation as an opt-in choice for users.
PlutusTx and most other languages allow explicitly setting the constructor id of objects anyway.
Contributor

Yes, but I feel like we've always viewed constructors ids as constructor indices. We're discussing the possibility of converting Data objects to SOPs via a builtin and this can only work if constructor ids are interpreted as indices. I'll ask the team about the perspective that you bring, it's certainly new to me.

Contributor

There is nothing that necessitates any particular interpretation of the integers in a Constr. We have generally assumed they will be indices: in particular, favouring small numbers is a reflection of that (also see https://www.ietf.org/id/draft-bormann-cbor-notable-tags-09.html#name-enumerated-alternative-data).

The point about conversion to SOPs is a good one. If we are able to offer a fast conversion from Data to SOPs, then you really will want to use indices rather than arbitrary ids (since if you want to case analyze the constructor with tag n, you need to provide alternatives for all of the n-1 previous tags too!).

Contributor Author

This sounds like it can become a major issue for compatibility with native SOP.

Note that due to determinism, types defined this way can be supported in third party languages as well by hard coding the computed constructor id and overwriting the default of the implementation language.


## Path to Active

### Acceptance Criteria
- Implementation in at least one Smart Contract Language

### Implementation Plan
- Implementation in pycardano / OpShin. See the reference implementation [here](https://github.com/Python-Cardano/pycardano/pull/272).

## Copyright

[CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode)