Schema types rebased to use codegen types for the data #107

warpfork · 2020-11-02T21:17:14Z

We have here a new schema package which is based on the codegen'd golang types which recognize documents matching the schema-schema.

Meaning we're one step quite drastically closer to self-hosting tools for handling IPLD Schemas, among other things.

This comes in two packages:

the "schemadmt" package -- where "dmt" stands for "Data Model Tree" (think: like AST, except this one is divorced from syntax) -- is 99% codegen, and a few helper methods manually attached to that.
This package (like any codegen output) can be combined with your choice of IPLD codec in order to parse a schema!
the "schema2" package -- temporary name: this will replace the current schema package when done -- which has types created from the schemadmt data, but further validating it, and adding even more helpful accessor methods.
This is necessary to do as a separate package because some of the validations it performs are actually graph properties (meaning they're unsuitable for validation by the schema rules alone). We also add a bunch of methods which do cyclic lookups, so, again this means pointers are suitable and the data model tree alone needs decorators to enable this.

This is a draft at present. However, it's already very close to ready to replace the current schema package.
When it's ready to do so, the rest of the codegen shouldn't change very noticably at all -- we're still advertising all the same access methods as before.

Forgive me for tooting my own horn a bit here, but... rewriting this against the codegen'd types was just a beautiful experience.
The autocompletion for navigating the data was helpful, and the type checking helped steer things to correct completion on several occasions.
Being able to write a type switch for the union members, for example, and getting nice compile-time errors if you got a wrong type in there? Brilliant.
The code quality increased drastically.

In one case, the increased strictness also guided directly to the creation of a TypeReference type, which is now the thing to use for map keys whenever dealing with a mixture of named and anonymous types --
and now all of the new schema handling code works with this type, and in doing so, did almost all of the work necessary to actually get us anonymous types (previously not implemented); and all of the places where type names vs inline defintions are converted to type references are super clear.

Caveat to reviewers: I did burn out on trying to DRY in some of this code. I know.

The BuildTypeSystem function and all of its validation logic is fairly alarming (in particular, some of the deeper cases in unions which require looking at member type info).
These can almost certainly be refactored in the future -- but I think I'd like to go about this by getting more test coverage first; merging a checkpoint; and then tearing at that in separate PRs later, with those test cases to insure the simplifications actually keep all the details correct.
Some parts of the logic probably can be DRY'd be a little application of cleverness; some parts probably can't, but even those might be better to break down into smaller functions just to keep variable scopes smaller and readability improved.
For this first pass, though, grinding things out in the most skullduggerish way possible seemed best. I figure lexical linearity is easier to review than cleverness.

(We may also want to benchmark the BuildTypeSystem function and/or read its assembly to make sure it's reasonably fast.
We'll probably have codegen outputs invoking this function in the future at init time, if we want codegen packages to have their full type info available.
This isn't done yet, but is probably a feature in our near future.)

Some variable naming is also admittedly atrocious.
I tried to keep some kind of convention going with "dmt" suffixes for any variables that are the raw (well, relatively raw; they're still schema-lensed) data trees, but it's not always consistent, especially around typenames.
There's also an awful lot of "something2" variables where a type switch destructures something and I couldn't immediately think of a better name as I was hacking along.
These could also get better in future refactors (and maybe this will be particularly opportune to do at the same time as some of the refactors suggested in the previous paragraph, which would presumably lead to smaller functions).

Some other things still in the "future work" category:

anonymous type creation. It's much closer now, due to the TypeReference thing making it clear where these branches need to show up, but they're still stubbed.
"copy" types will cause a panic. However, I think the solution here is that we should actually remove these from the schema spec entirely. Writing this code made it very clear what a bizarre exception they would be to almost every rule; they're really not good. (Maybe something like this can exist in the schema DSL porcelain, but it really does not belong much farther than that.)
lots of the errors around schema validation are very stringy at the moment. This might be okay because I can't really imagine wanting to use them programmatically, but I'd say about 95% of the time I think that, I'm eventually wrong, so maybe we'd better just assume there's work to do here.
some other things I've probably forgot, and will rediscover on the next re-read of this :)

I expect to rebase this several times before it reaches a finished state. (Hopefully, we might also cram some codegen output size reduction patches in before this as well. But we'll see.) So, early review welcome at this point, but do expect changes.

warpfork · 2020-11-30T16:38:31Z

... This is probably about to get plowed under. We're planning significant interface changes for a v0.7 release, and that will affect the codegen outputs in this PR significantly. I don't think saving the snapshot of the old (current) codegen will provide any significant utility, so, this will get rebased and so in the future, the git history will only show the new codegen (after v0.7).

I'll just leave this PR up as it stands for a while, but it'll become unmergable in the near future, and I might not get back to it for a while.

…r the fully validated data which is implemented by retaining and accessing into the raw data.

Move the existing setup from the schema-schema "demo" dir to here; and rig it up with go generate conventions that I'm hoisting back from mvdan's https://github.com/ipld/go-ipld-adl-hamt/blob/master/gen.go . Move the parse tests with it.

filenames change due to #105 . gofmt also applied for the first time. from here on out: `go generate` should just cause these files to be automagically updated and formatted.

warpfork · 2020-12-13T21:22:06Z

Takebacksies: I'm going to try merge this shortly after all.

I reevaluated my concern that this was going to be an annoyingly large hunk to carry around in git: it is not. Measurements indicated it's roughly a hundred kilobytes in .git/objects when all is said and done. I daresay we can live with that. (We also reevaluated how big the interface changes for v0.7 will be, and it's less than originally planned, meaning the delta we expect in the near future for that is also much less than feared, even further reducing the concern about git history bloat.)

warpfork · 2020-12-13T21:24:24Z

I've updated the code to actually be nicely regenerable using standard go generate, so that should make it much more reasonable to keep this in sync and maintained going forward. Other logical bits might still deserve review over time, but... this is such a big area of functionality it's going to be a "subsequent PRs" area no matter what, so might as well lean into that.

Since #121, presence of fields is actually checked... but that code also doesn't understand implicit fields yet, which makes us need a lot of filler. Also the lack of the "members" field for unions? That was just plain wrong. Good think we're catching things like that now.

Cannot quite wire that up yet because of some other still incomplete features.

warpfork added 3 commits December 4, 2020 17:38

draft of schema types using codegen for data model, with a package fo…

2e68fd3

…r the fully validated data which is implemented by retaining and accessing into the raw data.

regenerate.

8681ab7

filenames change due to #105 . gofmt also applied for the first time. from here on out: `go generate` should just cause these files to be automagically updated and formatted.

warpfork force-pushed the schema-selfhost branch from a123674 to 8681ab7 Compare December 13, 2020 21:17

warpfork added 2 commits December 13, 2020 23:15

comments on how to connect this test to full logical validation.

ad7aa3e

Cannot quite wire that up yet because of some other still incomplete features.

warpfork force-pushed the schema-selfhost branch from e47543b to ad7aa3e Compare December 13, 2020 22:16

warpfork merged commit ae54a57 into master Dec 13, 2020

warpfork deleted the schema-selfhost branch December 13, 2020 22:22

aschmahmann mentioned this pull request May 14, 2021

Release v0.9.0 ipfs/kubo#8058

Closed

71 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Schema types rebased to use codegen types for the data #107

Schema types rebased to use codegen types for the data #107

warpfork commented Nov 2, 2020

warpfork commented Nov 30, 2020

warpfork commented Dec 13, 2020

warpfork commented Dec 13, 2020

Schema types rebased to use codegen types for the data #107

Schema types rebased to use codegen types for the data #107

Conversation

warpfork commented Nov 2, 2020

warpfork commented Nov 30, 2020

warpfork commented Dec 13, 2020

warpfork commented Dec 13, 2020