@nielstron nielstron commented Oct 20, 2023

OpShin has recently redefined the way it uses constructor ids internally to distinguish types. Since the implementation is language-agnostic, we provide the specification here for other languages or applications to adopt as they see fit.


(rendered proposal in branch)

@nielstron nielstron changed the title Add CIP for unique, deterministic constructor IDs Almost-unique, deterministic constructor IDs Oct 20, 2023
@rphair rphair changed the title Almost-unique, deterministic constructor IDs CIP-???? | Deterministic universal almost-unique Plutus Constructors Oct 21, 2023
@rphair rphair added the Category: Plutus Proposals belonging to the 'Plutus' category. label Oct 21, 2023
Collaborator

@rphair rphair left a comment

Thanks @nielstron - this seems detailed & well presented enough to merit technical consideration.

First can you please review these guidelines for Plutus proposals - since they have to adhere to some process considerations to facilitate Plutus acceptance & implementation? To begin with, there are a few items missing from Path to Active (edit: confirmed no core changes)

Meanwhile maybe @michaelpj @L-as @KtorZ can review the idea itself, plus anyone else they can tag from the Plutus teams (I may not be current on best people to ask; it's been a while since someone submitted a Plutus CIP).

Collaborator

rphair commented Oct 24, 2023

@nielstron I've put this on the agenda for initial review at next CIP meeting: https://hackmd.io/@cip-editors/76

Contributor

@effectfully effectfully left a comment

This subverts my entire intuition on what a Data object means. Which is fine, evidently my intuition wasn't covering some of the ways people use Data objects. I'll bring this CIP to the team and get back to you once we've discussed it internally. Thank you for submitting it!


ustr(union<X,Y,...,Z>) := "union<" + ustr(X) + "," + ustr(Y) + "," + ... + "," + ustr(Z) + ">"

ustr(constr(name)<id, fields[f1:X,f2:Y,...,fn:Z]>) :=
Contributor

So ustr is used to encode both types and terms?

Contributor Author

No, ustr converts only types into strings, concrete values are not relevant.

```

Where `name` and `f1` to `fn` refer to the name of the record and the names of its fields respectively.
Since the constructor id of a record is not known when computing its constructor id, the constructor id string is set to `_` for this computation.
Contributor

If it's not known, why include it in there in the first place? The entire concept feels awkwardly circular even though you get out of the infinite recursion with that wildcard.

Contributor Author

I agree it's a bit weird... but I do want to distinguish between classes with the same names and fields but different constructor ids to avoid nasty surprises for the user.

class A:
    CONSTR_ID = 0

B = A

class A:
    CONSTR_ID = 1

class X:
    x: Union[A, B]

If it is pulled out of the ustr for constructors then we lose modularity of the function 🤔

")"
```

Where `name` and `f1` to `fn` refer to the name of the record and the names of its fields respectively.
Contributor

Why include names? If one group of developers has data Option a = Some a | None and another has data Maybe a = Just a | Nothing, why not let those two data types be considered the same, if they mean the same thing and are encoded the same way via PlutusData anyway?

Contributor

Commented later, but: you have to have either the name or the id of the constructor, otherwise you can't distinguish two constructors with the same fields. And the whole point of this proposal is to not set the ids, so it has to be names.

Contributor

Commented later, but: you have to have either the name or the id of the constructor, otherwise you can't distinguish two constructors with the same fields. And the whole point of this proposal is to not set the ids, so it has to be names.

I mean, you can get the hash of a constructor from the structure of the data type (\a -> Either a () for both the example types) and the id of the constructor. You don't need to globally specify the id this way, just for the purpose of computing a hash (and the way hashes are computed can be arbitrary as long as they're almost unique). Or am I misunderstanding it?


Note that the implementation first computes a `ustr` in human readable form and then transforms it into an integer. This is intentional, since the alternatives (directly computing a large unique number or similar approaches) are much more difficult to debug.

To ensure that this does not only take the structural definition but also the intended usage into account, names of records are taken into account for the computation.
Contributor

Aha, I see. It is rather strange in that whoever introduces a data type first decides for everybody else how they should name its constructor and fields. I'm also not sure how much safety it adds; names are not type safety. I do normally believe that nominal > structural, but the underlying PlutusData is structural anyway and it seems potentially irritating to enforce the same names for all parties, plus names don't guarantee specific semantics anyway. Dunno.

Contributor

I think you have to have the names if you're not including an id. Otherwise you can't distinguish the constructors of

data Foo = A Int Int | B Int Int

I don't understand the stated reasoning (why does it matter to "take the intended usage into account"?), but I think you do need it.

-->
We definitely want a few properties on the CONSTR_IDs

- _small_: ideally the constr_id integer should be as small as possible, as smaller integers are encoded more efficiently in CBOR and save the end user minutxo and txfees (constr_ids are encoded as the cbor tag up to 7 bit size, after that encoded as generic integer)
Contributor

Have you run any experiments on whether using your version makes scripts more expensive (including deserialization time)? I'd expect them to become more expensive, but I'm not sure about the scale, perhaps not by a lot.

Contributor

If anything this proposal seems vanishingly unlikely to generate tags that are small? We're taking the result mod 2^32, so I'd expect to probably get uniform numbers over that range, which are going to be way higher than 2^7.

More generally, since this proposal wants global ids, there can only be 7 types globally that get the small ids. So I think this will definitely perform worse on space, but that might not matter.

Contributor Author

Types will definitely perform worse on space; I assume most (i.e. roughly 50%) of tags will have a size around 2^32. "Small" refers to these tags being smaller than e.g. 64 bytes (like a script or datum hash). Likely doing modulo 2^64 would not make a big difference on size/cost either but would improve uniqueness, so I am looking into adding this as a change.
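
To put rough numbers on the size discussion above, here is a small sketch of how many bytes the constructor tag itself costs in CBOR. It assumes the usual Plutus Data encoding as I understand it (CBOR tags 121-127 for ids 0-6, tags 1280-1400 for ids 7-127, and tag 102 wrapping a two-element array `[id, fields]` for everything else); those ranges are stated from memory, not taken from this thread, and the helper names `cbor_uint_size` and `constr_tag_overhead` are made up for illustration.

```python
def cbor_uint_size(n: int) -> int:
    """Bytes needed for the head of a CBOR unsigned integer or tag number."""
    if n < 0:
        raise ValueError("only non-negative values are supported")
    if n < 24:
        return 1
    if n < 2**8:
        return 2
    if n < 2**16:
        return 3
    if n < 2**32:
        return 5
    if n < 2**64:
        return 9
    raise ValueError("value would need a bignum encoding")


def constr_tag_overhead(constr_id: int) -> int:
    """Bytes spent on identifying the constructor, excluding the fields list.

    Assumption: the compact/general tag scheme described in the lead-in.
    """
    if 0 <= constr_id <= 6:
        return cbor_uint_size(121 + constr_id)
    if 7 <= constr_id <= 127:
        return cbor_uint_size(1280 + (constr_id - 7))
    # General form: tag 102, then an array header for [id, fields], then the id.
    return cbor_uint_size(102) + 1 + cbor_uint_size(constr_id)


for cid in (0, 6, 7, 127, 128, 2**31, 2**63):
    print(cid, constr_tag_overhead(cid))
```

Under these assumptions an id drawn from the 2^32 range typically costs 8 bytes against 2 bytes for an index below 7, and an id from the 2^64 range typically costs 12 bytes.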

Plutus Constructor IDs are currently heavily focused around their origin in Haskell. They are usually used to distinguish different constructors of a single declared datatype.
In contrast, one may introduce universally recognized datatypes that are identified by a unique constructor id and can be expected to behave in a specified way (i.e. contain specific fields with specific types).
For this purpose, we introduce a generic way to compute an almost unique, deterministic and universal constructor id for objects based on their name and field types.
Note that it is not expected that every language adopts this standard as a default (i.e. for Haskell-like languages there might not be much use of it).
Contributor

How is Haskell different here?

Contributor Author

@nielstron nielstron Feb 8, 2024

In Python it is common to declare Sum Types / Unions after the declaration of the specific types, e.g.

class A: ...

class B: ...

class C: ...

AB = Union[A, B]
ABC = Union[A, B, C]

In Haskell the definition of the Sum Types / Union at the same time declares the involved alternatives, hence all involved alternatives are known at the time of declaration (and known to be distinct)

data AB = A | B
data ABC = A | B | C -- I guess this would throw an error for redeclaring the constructors A and B?

<!-- The technical specification should describe the proposed improvement in sufficient technical detail. In particular, it should provide enough information that an implementation can be performed solely on the basis of the design in the CIP. This is necessary to facilitate multiple, interoperable implementations. This must include how the CIP should be versioned. If a proposal defines structure of on-chain data it must include a CDDL schema in it's specification.-->
The deterministic, universal and almost-unique Plutus constructors are computed recursively based on the type definition of a record.
We first compute a string `ustr(X)` based on the type definition of X. Then we perform a sha256 hash on the UTF8 encoding of this string and interpret the resulting hex digest as a big endian encoded integer.
The integer is taken modulo 2^32. The resulting integer is the almost-unique, universal, deterministic constructor id of the plutus datum.
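
The hashing step quoted above can be sketched in a few lines of Python. This is only an illustration of the described procedure (SHA-256 over the UTF-8 bytes, hex digest read as a big-endian integer, reduced modulo 2^32); the function name `constr_id_from_ustr` and the `modulus` parameter are hypothetical, and the input string in the example is an arbitrary placeholder rather than a real `ustr` of a record.

```python
import hashlib


def constr_id_from_ustr(ustr: str, modulus: int = 2**32) -> int:
    """Map a ustr string to a constructor id as described in the quoted spec text."""
    hex_digest = hashlib.sha256(ustr.encode("utf-8")).hexdigest()
    # Reading the hex digest as a base-16 number is the big-endian interpretation.
    return int(hex_digest, 16) % modulus


# Arbitrary placeholder input, only to show that the result is stable and in range.
example_id = constr_id_from_ustr("example")
print(example_id, 0 <= example_id < 2**32)
```

Passing `modulus=2**64` gives the wider variant discussed elsewhere in this thread.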
Contributor

Any discussion on what happens in case of collision?

Contributor

Yes collision seems bad here. I don't think it could lead to an attack, but I'm not 100% sure.

Contributor Author

In case of a collision during the declaration of a Sum Type, the compiler has to deny compilation and ask the user to manually declare a constructor id for the involved types. I think this is rare enough to have practically no impact on computation. Regarding attacks, I don't think this scheme is more vulnerable than any other scheme. In the current Plutus scheme, constructor id overlaps are practically omnipresent (though never in any sum types that occur in the compiled contract).
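
A minimal sketch of the compile-time check described in this reply, plus a back-of-the-envelope estimate of how rare collisions would be. Both function names are hypothetical, the error message is invented, and the probability estimate assumes ids are uniformly distributed over the id space (the standard birthday approximation), which is an assumption of this sketch rather than something shown in the thread.

```python
import math
from collections import defaultdict


def check_union_constr_ids(members: dict[str, int]) -> None:
    """Raise if two members of one union share a constructor id."""
    names_by_id = defaultdict(list)
    for name, constr_id in members.items():
        names_by_id[constr_id].append(name)
    clashes = {cid: names for cid, names in names_by_id.items() if len(names) > 1}
    if clashes:
        raise ValueError(
            f"constructor id collision, please set ids manually: {clashes}"
        )


def approx_collision_probability(n_types: int, id_space: int = 2**32) -> float:
    """Birthday approximation: 1 - exp(-n(n-1) / (2 * id_space))."""
    return -math.expm1(-n_types * (n_types - 1) / (2 * id_space))


check_union_constr_ids({"A": 17, "B": 42})   # distinct ids, passes silently
print(approx_collision_probability(10))      # ten types in a single union
```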

Note: I will use record / Plutus Data interchangeably throughout the document.

## Abstract
Plutus Constructor IDs are currently heavily focused around their origin in Haskell. They are usually used to distinguish different constructors of a single declared datatype.
Contributor

This section lacks an elaborate example, like an actual Data object. It took me a while to figure out what you mean by a constructor ID and I'm a Plutus developer.

To ensure that this does not only take the structural definition but also the intended usage into account, names of records are taken into account for the computation.

There is no issue with backwards compatibility when adopting this implementation as an opt-in choice for users.
PlutusTx and most other languages allow explicitly setting the constructor id of objects anyways.
Contributor

Yes, but I feel like we've always viewed constructor ids as constructor indices. We're discussing the possibility of converting Data objects to SOPs via a builtin and this can only work if constructor ids are interpreted as indices. I'll ask the team about the perspective that you bring, it's certainly new to me.

Contributor

There is nothing that necessitates any particular interpretation of the integers in a Constr. We have generally assumed they will be indices: in particular, favouring small numbers is a reflection of that (also see https://www.ietf.org/id/draft-bormann-cbor-notable-tags-09.html#name-enumerated-alternative-data).

The point about conversion to SOPs is a good one. If we are able to offer a fast conversion from Data to SOPs, then you really will want to use indices rather than arbitrary ids (since if you want to case analyze the constructor with tag n, you need to provide alternatives for all of the n-1 previous tags too!).

Contributor Author

This sounds like it can become a major issue for the compatibility with native SOP.

Contributor

@michaelpj michaelpj left a comment

This seems fine, if it's what you want. It doesn't seem massively appealing to me, but that's okay if other people want to use it. Naming data types and constructors via their transitive structural hash a la unison isn't a crazy idea.

I think the proposal could do with discussing what (to me) is the obvious alternative: wrapping. In the Haskell world, if I want to have either a type T or a type U, I write

data TorU = ItsT T | ItsU U

This corresponds to adding another layer of constructor tagging to tell me which one I'm looking at, so there's no problem if T and U use the same tags (and indeed, the actual implementation of Haskell very much does use constructor tags in this way). This doesn't seem much worse to me than anonymous unions, and it avoids the problem of tag clashes entirely.

So the usefulness of this proposal is limited to cases where you really want to use values of various overlapping types interchangeably, and wrapping isn't acceptable. The use cases aren't really clear enough for me to say whether or not that's common.
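
For readers coming from the Python side, the wrapping alternative described in this comment could look roughly like the sketch below. `Constr` is a stand-in for a Plutus Data constructor value and, like `wrap_t`, `wrap_u` and `which`, is a hypothetical name used only for illustration. The point is that the outer tag (0 or 1) is local to the wrapper, so it does not matter that the inner values of T and U both use tag 0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constr:
    """Stand-in for a Plutus Data constructor: an integer tag plus fields."""
    tag: int
    fields: tuple


def wrap_t(value: Constr) -> Constr:
    """Corresponds to ItsT in the Haskell example: outer tag 0."""
    return Constr(0, (value,))


def wrap_u(value: Constr) -> Constr:
    """Corresponds to ItsU in the Haskell example: outer tag 1."""
    return Constr(1, (value,))


def which(wrapped: Constr) -> str:
    """Decide which alternative is present by looking only at the outer tag."""
    return {0: "T", 1: "U"}[wrapped.tag]


t_value = Constr(0, (1,))      # some T, using inner tag 0
u_value = Constr(0, (b"x",))   # some U, also using inner tag 0
print(which(wrap_t(t_value)), which(wrap_u(u_value)))
```

The cost is one extra constructor layer per value, which is the trade-off against anonymous unions that the comment weighs.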


ustr(bytes) := "bytes"
ustr(integer) := "int"
// This covers the case where the structure of the object is not known from the perspective of the class, i.e. when any BuiltinData is allowed
ustr(PlutusData) := "any"
Contributor

why not "data"?

Contributor Author

Probably an oversight but not too relevant
cf #608 (comment)

// This covers the case where the type of the elements in the list are not known in advance
ustr(list) := "list"
Contributor

why isn't that list<data>?

Contributor

More generally, this CIP is committing to a type-definition language that might not be appropriate for everyone, as witnessed by quirks like this.

Moreover, we already have at least two type-definition languages that we could use:

  1. CDDL (since Data is a subset of CBOR)
  2. CIP-57

Why not use one of those?

Contributor

why isn't that list<data>?

Hm, I had this comment as well, but either I failed to hit "comment" or GitHub lost it (it's been glitchy lately for me).

Contributor Author

I think this is a valid point. I am currently looking towards making this compatible with the CIP57 definitions. However, CIP57 does not make reproducibility a big thing (i.e. the concrete ordering of JSON map elements usually does not matter), whereas here it is relevant: a re-definition or at least a specification of a "canonical" blueprint from which to hash is unavoidable.

ustr(list<X>) := "list<" + ustr(X) + ">"
Contributor

We're not parsing these so I think it's fine, but probably worth clarifying that it's not a problem if e.g. type names contain < or other special characters.
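
As a rough illustration of the `ustr` rules quoted in this thread for `bytes`, `integer`, `PlutusData`, `list`, `list<X>` and `union<...>`, here is a small Python sketch. The type classes (`Bytes`, `Integer`, `AnyData`, `ListOf`, `UnionOf`) are hypothetical names invented for the example, union members are emitted in the order given (the quoted rule does not say whether they are sorted), and the rule for records is left out because its right-hand side is not shown in full here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Bytes:
    pass


@dataclass(frozen=True)
class Integer:
    pass


@dataclass(frozen=True)
class AnyData:
    """Arbitrary PlutusData whose structure is not known."""


@dataclass(frozen=True)
class ListOf:
    element: Optional["Type"] = None  # None models an untyped list


@dataclass(frozen=True)
class UnionOf:
    members: Tuple["Type", ...]


Type = Union[Bytes, Integer, AnyData, ListOf, UnionOf]


def ustr(t: Type) -> str:
    """Build the human-readable type string; only types, never values."""
    if isinstance(t, Bytes):
        return "bytes"
    if isinstance(t, Integer):
        return "int"
    if isinstance(t, AnyData):
        return "any"
    if isinstance(t, ListOf):
        return "list" if t.element is None else "list<" + ustr(t.element) + ">"
    if isinstance(t, UnionOf):
        return "union<" + ",".join(ustr(m) for m in t.members) + ">"
    raise TypeError(f"unsupported type description: {t!r}")


print(ustr(UnionOf((Bytes(), ListOf(Integer())))))
```

Because the string is only built and hashed, never parsed back, special characters inside names would not confuse this kind of implementation, although two different type descriptions could in principle render to the same string.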

- _small_: ideally the constr_id integer should be as small as possible, as smaller integers are encoded more efficiently in CBOR and save the end user minutxo and txfees (constr_ids are encoded as the cbor tag up to 7 bit size, after that encoded as generic integer)
- _unique_: There should be as little overlap with other values as possible, so that we can group together classes in unions without having to worry about setting/overwriting the constr id. This is reflected by the unique choice of identifiers in `ustr`.
- _deterministic_: Datatypes that are defined in libraries may be imported in arbitrary contexts. The constr_id must therefore not depend on e.g. what other Unions the datatype is being used in or what other datatypes are declared in its surroundings. This rules out the Haskell approach and any automatically incrementing global counters.
Contributor

I think it's the uniqueness that rules out the Haskell approach, not determinism.

Contributor Author

Yes, that's correct.

@michaelpj
Contributor

Generally I'd like to see some more discussion of usecases, and maybe some indication that someone other than OpShin is interested in this.

Collaborator

rphair commented Nov 15, 2023

Adding to #608 (comment) from CIP Editors' meeting today, where it was also brought up, that @nielstron we would be interested in seeing responses to the reviews already presented before we can more fully consider this as a CIP. Until then it does seem like it could instead be a useful "best practice" document with a good idea that might not be compelling for others to adopt.

Collaborator

rphair commented Aug 20, 2024

@nielstron as far as I can tell, the last 2 commits 5d7861c & 1723c1b haven't addressed the feedback that I tried to summarise in #608 (comment) 9 months ago.

We are tagging some proposals Abandoned but we never tagged this one Waiting for Author first (I'm trying this week to overhaul our editing process with tagging stale PR's) so I'm applying that tag now. I expect that you can address the review points above (please make some notes in the relevant conversations) or explain why the points have all been addressed.

It's just a matter of explaining your points clearly I believe... once we see that this is done we can put it back on the CIP meeting agenda to look at promoting it to a candidate; if there is no further progress in the next few weeks it will probably be moved onto the Abandoned list & closed soon afterward.

@rphair rphair added the State: Waiting for Author Proposal showing lack of documented progress by authors. label Aug 20, 2024
Collaborator

rphair commented Sep 24, 2024

@nielstron let's make one last call for updates before closing this as "abandoned" but please if you are interested in pursuing this then respond to the last comment & we'll plug it back into the review process.

@rphair rphair added State: Likely Abandoned Close if confirmed abandoned (long waiting). and removed State: Waiting for Author Proposal showing lack of documented progress by authors. labels Sep 24, 2024
@nielstron
Contributor Author

Hi @rphair ,
I don't have time to work on this CIP anymore, so feel free to tag it as abandoned.
