Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Codegen of unions, and their keyed representations #60

Merged
merged 12 commits into from
Jul 8, 2020
Merged
4 changes: 3 additions & 1 deletion nodeBuilder.go
Original file line number Diff line number Diff line change
Expand Up @@ -84,7 +84,9 @@ type MapAssembler interface {
// For plain maps (that is, not structs or unions masquerading as maps),
// the empty string can be used as a parameter, and the returned NodePrototype
// can be assumed applicable for all values.
// Using an empty string for a struct or union will return a nil NodePrototype.
// Using an empty string for a struct or union will return nil,
// as will using any string which isn't a field or member of those types.
//
// (Design note: a string is sufficient for the parameter here rather than
// a full Node, because the only cases where the value types vary are also
// cases where the keys may not be complex.)
Expand Down
36 changes: 36 additions & 0 deletions schema/errors.go
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,21 @@ import (
"fmt"
)

// TODO: errors in this package remain somewhat slapdash.
//
// - ipld.ErrUnmatchable is used as a catch-all in some places, and contains who-knows-what values wrapped in the Reason field.
// - sometimes this wraps things like strconv errors... and on the one hand, i'm kinda okay with that; on the other, maybe saying a bit more with types before getting to that kind of shrug would be nice.
// - we probably want to use `Type` values, right?
// - or do we: because then we probably need a `Repr bool` next to it, or lots of messages would be nonsensical.
// - this is *currently* problematic because we don't actually generate type info consts yet. Hopefully soon; but the pain, meanwhile, is... substantial.
// - "substantial" is an understatement. it makes incremental development almost impossible because stringifying error reports turn into nil pointer crashes!
// - other ipld-wide errors like `ipld.ErrWrongKind` *sometimes* refer to a TypeName... but don't *have* to, because they also arise at the merely-datamodel level; what would we do with these?
// - it's undesirable (not to mention intensely forbidden for import cycle reasons) for those error types to refer to schema.Type.
// - if we must have TypeName treated stringily in some cases, is it really useful to use full type info in other cases -- inconsistently?
// - regardless of where we end up with this, some sort of an embed for helping deal with munging and printing this would probably be wise.
// - generally, whether you should expect an "ipld.Err*" or a "schema.Err*" from various methods is quite unclear.
// - it's possible that we should wrap *all* schema-level errors in a single "ipld.ErrSchemaNoMatch" error of some kind, to fix the above. as yet undecided.

// ErrNoSuchField may be returned from lookup functions on the Node
// interface when a field is requested which doesn't exist, or from Insert
// on a MapBuilder when a key doesn't match a field name in the structure.
Expand All @@ -19,3 +34,24 @@ func (e ErrNoSuchField) Error() string {
}
return fmt.Sprintf("no such field: %s.%s", e.Type.Name(), e.FieldName)
}

// ErrNotUnionStructure means data was fed into a union assembler that can't match the union.
//
// This could have one of several reasons, which are explained in the detail text:
//
// - there are too many entries in the map;
// - the keys of critical entries aren't found;
// - keys are found that aren't any of the expected critical keys;
// - etc.
//
// TypeName is currently a string... see comments at the top of this file for
// remarks on the issues we need to address about these identifiers in errors in general.
type ErrNotUnionStructure struct {
TypeName string

Detail string
}

func (e ErrNotUnionStructure) Error() string {
return fmt.Sprintf("cannot match schema: union structure constraints for %s caused rejection: %s", e.TypeName, e.Detail)
}
118 changes: 118 additions & 0 deletions schema/gen/go/HACKME_wip.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,3 +31,121 @@ the existing allowNull), and another for both (for "nullable optional" fields).
Every NodeAssembler would then have to support that, just as they each support allowNull now.

I think the above design is valid, but it's not implemented nor tested yet.


### AssignNode optimality

The AssignNode methods we generate currently do pretty blithe things with large structures:
they iterate over the given node, and hurl entries into the assembler's AssignKey and AssignValue methods.

This isn't always optimal.
For any structure that is more efficient when fed info in an ideal order, we might want to take account of that.

For example, unions with representation mode "inline" are a stellar example of this:
if the discriminant key comes first, they can work *much, much* more efficiently.
By contrast, if the discriminant key shows up late in the object, it is necessary to
have buffered *all the other* data, then backtrack to handle it once the discriminant is found and parsed.

At best, this probably means iterating once, plucking out the discriminant entry,
and then *getting a new iterator* that starts from the beginning (which shifts
the buffer problem to the Node we're consuming data from).

Even more irritatingly: since NodeAssembler has to accept entries in any order
if it is to accept information streamingly from codecs, the NodeAssembler
*also* has to be ready to do the buffering work...
TODO ouch what are the ValueAssembler going to yield for dealing with children?
TODO we have to hand out dummy ValueAssembler types that buffer... a crazy amount of stuff. (Reinvent refmt.Tok?? argh.) cannot avoid???
TODO this means where errors arise from will be nuts: you cant say if anything is wrong until you figure out the discriminant. then we replay everything? your errors for deeper stuff will appear... uh... midway, from a random AssembleValue finishing that happens to be for the discriminant. that is not pleasant.

... let's leave that thought aside: suffice to say, some assemblers are *really*
not happy or performant if they have to accept things in unpleasant orderings.

So.

We should flip all this on its head. The AssignNode methods should lean in
on the knowledge they have about the structure they're building, and assume
that the Node we're copying content from supports random access:
pluck the fields that we care most about out first with direct lookups,
and only use iteration to cover the remaining data that the new structure
doesn't care about the ordering of.

Perhaps this only matters for certain styles of unions.


### sidenote about codec interfaces

Perhaps we should get used to the idea of codec packages offering two styles of methods:

- `UnmarshalIntoAssembler(io.Reader, ipld.NodeAssembler) error`
- this is for when you have opinions about what kind of in-memory format should be used
- `Unmarshal(io.Reader) (ipld.Node, error)`
- this is for when you want to let the codec pick.

We might actually end up preferring the latter in a fair number of cases.

Looking at this inline union ordering situation described above:
the best path through that (other than saying "don't fking use inline unions,
and if you do, put the discriminant in the first fking entry or gtfo") would probably be
to do a cbor (or whatever) unmarshal that produces the half-deserialized skip-list nodes
(which are specialized to the cbor format rather than general purpose, but we want that in this story)...
and those can then claim to do random access, thereby letting them take on the "buffering".
This approach would let the serialization-specialized nodes take on the work,
rather than forcing the union's NodeAssembler to do buffer at a higher level...
which is good because doing that buffering in a structured way at a higher level
is actually more work and causes more memory fragmentation and allocations.

Whew.

I have not worked out what this implies for multicodecs or other muxes that do compositions of codecs.


### enums of union keys

It's extremely common to have an enum that is the discrimant values of a union.

We should make a schema syntax for that.

We tend to generate such an enum in codegen anyway, for various purposes.
Might as well let people name it outright too, if they have the slightest desire to do so.

(Doesn't apply to kinded unions.)


### can reset methods be replaced with duff's device?

Yes. Well, sort of. Okay, no.

It's close! Assemblers were all written such that their zero values are ready to go.

However, there's a couple of situations where you *wouldn't* want to blithely zero everything:
for example, if an assembler has to do some allocations, but they're reusable,
you wouldn't want to turn those other objects into garbage by zeroing the pointer to them.
See the following section about new-alloc child assemblers for an example of this.


### what's up with new-alloc child assemblers?

Mostly, child assemblers are embedded in the assembler for the type that contains them;
this is part of our allocation amortization strategy and important to performance.
However, it doesn't always apply:
Sometimes we *need* independently allocated assemblers, even when they're child assemblers:
recursive structures need this (otherwise, how big would the slab be? infinite? no; halt).
Sometimes we also just *want* them, somewhat more mildly: if a union is one of several things,
and some of them are uncommonly used but huuuuge, then maybe we'd rather allocate the child assemblers
individually on demand rather than pay a large resident memory cost to embed all the possibilities.

There's a couple things to think about with these:

- resetting assemblers with a duff's device strategy wouldn't recursively reset these;
it would just orphan them. While possibly leaving them pointed into some parts of memory in the parent slab ('cm' in particular comes to mind).
This could be a significant correctness issue.
- But who's responsibility is it to "safe" this? Nilling 'w' proactively should also make this pretty innocuous, as one option (but we don't currently do this).

- if the parent assembler is being used in some highly reusable situation (e.g. it's a list value or map value),
is the parent able to hold onto and re-use the child assembler? We probably usually still want to do this, even if it's in a separate piece of heap.
- For unions, there's a question of if we should hold onto each child assembler, or just the most recent; that's a choice we could make and tune.
If the answer is "most recent only", we could even crank down the resident size by use of more interfaces instead of concrete types (at the cost of some other runtime performance debufs, most likely).

We've chosen to discard the possibility of duff's device as an assembler resetting implementation.
As a result, we don't have to do proactive 'w'-nil'ing in places we might otherwise have to.
And union assemblers hold on to all child assembler types they've ever needed.
4 changes: 2 additions & 2 deletions schema/gen/go/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -92,8 +92,8 @@ Legend:
| feature | accessors | builders |
|:-------------------------------|:---------:|:--------:|
| unions | ... | ... |
| ... type level | | |
| ... keyed representation | | |
| ... type level | | |
| ... keyed representation | | |
| ... envelope representation | ✘ | ✘ |
| ... kinded representation | ✘ | ✘ |
| ... inline representation | ✘ | ✘ |
Expand Down
33 changes: 32 additions & 1 deletion schema/gen/go/adjunctCfg.go
Original file line number Diff line number Diff line change
@@ -1,11 +1,18 @@
package gengo

import (
"fmt"
"strings"

"github.com/ipld/go-ipld-prime/schema"
)

// This entire file is placeholder-quality implementations.
//
// The AdjunctCfg struct should be replaced with an IPLD Schema-specified thing!
// The values in the unionMemlayout field should be an enum;
// etcetera!

type FieldTuple struct {
TypeName schema.TypeName
FieldName string
Expand All @@ -15,10 +22,13 @@ type AdjunctCfg struct {
typeSymbolOverrides map[schema.TypeName]string
fieldSymbolLowerOverrides map[FieldTuple]string
fieldSymbolUpperOverrides map[FieldTuple]string
maybeUsesPtr map[schema.TypeName]bool // treat absent as true
maybeUsesPtr map[schema.TypeName]bool // treat absent as true
unionMemlayout map[schema.TypeName]string // "embedAll"|"interface"; maybe more options later, unclear for now.

// note: PkgName doesn't appear in here, because it's...
// not adjunct data. it's a generation invocation parameter.
// ... this might not hold up in the future though.
// There are unanswered questions about how (also, tbf, *if*) we'll handle generation of multiple packages which use each other's types.
}

// TypeSymbol returns the symbol for a type;
Expand Down Expand Up @@ -64,3 +74,24 @@ func (cfg *AdjunctCfg) MaybeUsesPtr(t schema.Type) bool {
// Perhaps structs and unions are the only things likely to benefit from pointers.
return true
}

// UnionMemlayout returns a plain string at present;
// there's a case-switch in the templates that processes it.
// We validate that it's a known string when this method is called.
// This should probably be improved in type-safety,
// and validated more aggressively up front when adjcfg is loaded.
func (cfg *AdjunctCfg) UnionMemlayout(t schema.Type) string {
if t.Kind() != schema.Kind_Union {
panic(fmt.Errorf("%s is not a union!", t.Name()))
}
v, ok := cfg.unionMemlayout[t.Name()]
if !ok {
return "embedAll"
}
switch v {
case "embedAll", "interface":
return v
default:
panic(fmt.Errorf("invalid config: unionMemlayout values must be either \"embedAll\" or \"interface\", not %q", v))
}
}
23 changes: 1 addition & 22 deletions schema/gen/go/genStruct.go
Original file line number Diff line number Diff line change
Expand Up @@ -272,28 +272,7 @@ func (g structBuilderGenerator) EmitNodeAssemblerType(w io.Writer) {
`, w, g.AdjCfg, g)
}
func (g structBuilderGenerator) EmitNodeAssemblerMethodBeginMap(w io.Writer) {
// We currently disregard sizeHint. It's not relevant to us.
// We could check it strictly and emit errors; presently, we don't.
// This method contains a branch to support MaybeUsesPtr because new memory may need to be allocated.
// This allocation only happens if the 'w' ptr is nil, which means we're being used on a Maybe;
// otherwise, the 'w' ptr should already be set, and we fill that memory location without allocating, as usual.
doTemplate(`
func (na *_{{ .Type | TypeSymbol }}__Assembler) BeginMap(int) (ipld.MapAssembler, error) {
switch *na.m {
case schema.Maybe_Value, schema.Maybe_Null:
panic("invalid state: cannot assign into assembler that's already finished")
case midvalue:
panic("invalid state: it makes no sense to 'begin' twice on the same assembler!")
}
*na.m = midvalue
{{- if .Type | MaybeUsesPtr }}
if na.w == nil {
na.w = &_{{ .Type | TypeSymbol }}{}
}
{{- end}}
return na, nil
}
`, w, g.AdjCfg, g)
emitNodeAssemblerMethodBeginMap_strictoid(w, g.AdjCfg, g)
}
func (g structBuilderGenerator) EmitNodeAssemblerMethodAssignNull(w io.Writer) {
emitNodeAssemblerMethodAssignNull_recursive(w, g.AdjCfg, g)
Expand Down
23 changes: 1 addition & 22 deletions schema/gen/go/genStructReprMap.go
Original file line number Diff line number Diff line change
Expand Up @@ -324,28 +324,7 @@ func (g structReprMapReprBuilderGenerator) EmitNodeAssemblerType(w io.Writer) {
`, w, g.AdjCfg, g)
}
func (g structReprMapReprBuilderGenerator) EmitNodeAssemblerMethodBeginMap(w io.Writer) {
// We currently disregard sizeHint. It's not relevant to us.
// We could check it strictly and emit errors; presently, we don't.
// This method contains a branch to support MaybeUsesPtr because new memory may need to be allocated.
// This allocation only happens if the 'w' ptr is nil, which means we're being used on a Maybe;
// otherwise, the 'w' ptr should already be set, and we fill that memory location without allocating, as usual.
doTemplate(`
func (na *_{{ .Type | TypeSymbol }}__ReprAssembler) BeginMap(int) (ipld.MapAssembler, error) {
switch *na.m {
case schema.Maybe_Value, schema.Maybe_Null:
panic("invalid state: cannot assign into assembler that's already finished")
case midvalue:
panic("invalid state: it makes no sense to 'begin' twice on the same assembler!")
}
*na.m = midvalue
{{- if .Type | MaybeUsesPtr }}
if na.w == nil {
na.w = &_{{ .Type | TypeSymbol }}{}
}
{{- end}}
return na, nil
}
`, w, g.AdjCfg, g)
emitNodeAssemblerMethodBeginMap_strictoid(w, g.AdjCfg, g)
}
func (g structReprMapReprBuilderGenerator) EmitNodeAssemblerMethodAssignNull(w io.Writer) {
emitNodeAssemblerMethodAssignNull_recursive(w, g.AdjCfg, g)
Expand Down
Loading