Skip to content
This repository has been archived by the owner on Aug 14, 2019. It is now read-only.

Individual breaches, Clay edition #1169

Merged
merged 56 commits into from
May 17, 2019
Merged

Individual breaches, Clay edition #1169

merged 56 commits into from
May 17, 2019

Conversation

philipcmonk
Copy link
Contributor

This rather a large PR; I'm happy to walk reviewers through it in person if that's easier. I think most people have some context for these changes.

A few changes here, most of which are due to a single issue. There was a 1-line fix or a 2000-line fix. I chose the latter, of course. The most important changes are these:

%drip

(only the changes in sys/vane/behn.hoon)

While debugging an indivdual breach bug, it turned out the issue was that clay was getting the sunk notification before ames, so clay's subscribers were trying to restart syncs before ames cleared its state -- which clears the attempted restarts. One solution would be to reverse the order of the sunk notifications (the 1-line fix), but the real issue is that when clay gives a subscription update, it shouldn't do that in the same event.

This is a textbook case for what we've informally called "toss". In reality, there are two related things we've called "toss". I've named one of them "drip", and that's the one I've implemented. "toss" means to "pass" in a later event (often the next event, but not necessarily if other stuff is in the queue). "drip" means to "give" in a later event.

Eventually, these should be implemented in arvo alongside give and pass. Then, in places where you want to give in the next event, you change:

[duct %give %writ ~]

For

[duct %drip %writ ~]

For now, I've implemented it in behn, with a "slip" calling convention, so it becomes:

[duct %slip %b %drip !>([%writ ~])]

Note that the sign will come into the receiving vane from behn now. If all instances of that gift will be dripped through behn, the the receiving vane can switch the associated sign from the sending vane to behn. If not all will be dripped, then it must include a sign entry for both vanes. This wrinkle will go away when drip is implemented in arvo.

Clay drip

All subscription updates from clay (%writ and %wris gifts) are now dripped through behn. Thus, a commit can't fail because a subscriber crashed. Additionally, this make it much more likely that a commit or a merge will take more than one arvo event (since both involve several round-trips to ford, involving drips). It has always been the case that a commit could take more than one event, but it was pretty rare so some code assumed it didn't.

Relevant to individual breaches, this makes it so that any clay changes resulting from a sunk notification trigger kiln in the next event, so Ames has already had time to handle the sunk notification.

Transactions in clay

Several parts of clay assumed that a commit would take exactly one arvo event. Since that's no longer the case, we needed a way to keep an in-progress clay commit and maintain a queue of pending commits or merges which will be started after the current one finishes. In other words, we needed "macrotransactions" -- transactions which take more than one arvo event.

We use the "clad" monad, which is very similar to the "ph" monad, to provide this transaction layer. All effects are stored until the transaction is completed, when they are all emitted at once. In case of upgrade, if you don't write any special upgrade logic in +load, the current write transaction will continue un-upgraded until it finishes (so a copy of the old kernel will stay until then). If you changed enough that you prefer to kill and restart the current transaction, you may do so with confidence that you won't leave it in an inconsistent state.

Overall, I believe this makes clay drastically less fragile. In particular, the data flow in the code is much more readable, and I'm no longer afraid that small changes will introduces subtle bugs in far-away sections of the code. Also, I was able to deleted hundreds of lines of comments describing the invariants required in the code because they were no longer necessary.

This is the vast majority of the changes in clay.hoon.

Individual breaches for clay

This implements the "forget they ever existed" interpretation of breaches for clay. In particular, we cancel all clay subscriptions to that ship, wipe our clay memory of it, and clear the ford cache entirely. This code is in clay.hoon in +handle-task under case %sunk (you can just grep for "sunk"). There are some related changes in lib/hood/kiln.hoon to properly restart syncs when the other ship sinks.

The main thing I've tested is that when a parent breaches, the child stays synced to their new instance.

Limitations:

This doesn't fix the ames/jael bootstrapping issues. In practice, this means that to breach a parent, you must either (1) make sure the old parent keeps running until it pushes the jael breach notification to all its children or (2) have all its children switch to another authority for jael, eg with |eth-manage %look. The first is not possible if you destroyed your ship too badly, and second isn't 100-year-compatible. This will be my next task.

@jtobin jtobin changed the base branch from next to master May 15, 2019 11:28
@jtobin
Copy link
Contributor

jtobin commented May 15, 2019

@philipcmonk I've retargeted this at master, which apparently conflicts in a few places. I took a quick peek to see if I could just handle said conflicts outright, but they look substantial enough that I didn't touch anything for the time being.

@philipcmonk
Copy link
Contributor Author

Transactions can only produce effects and not direct state changes, because once you're in clad the clay state is COW-frozen at what it was at the beginning of the transaction, right?

They can produce both effects and values. In this case, for example, we produce an updated dome (desk state) and rang (global object store). The code that evaluates the monad (+take-commit/+take-merge) can then integrate that with your state. In this case, we replace the dome (which means we need to have a write lock on the dome) and we +uni:by the rang (which means if we want to garbage-collect the rang, we should probably do it between transactions).

(I'm still side-eyeing the opaque wireless continuation-pile model of a side-effecting asynchronous computation, but it does seem reasonable for something morally equivalent to a single ford build).

Wireless computing is the future. Well, at least when you're inside the monad (the monad interpreter uses wires, of course). Although, I think I might add a path field to the notes we can produce, which will be appended to the wire for convenient |verb. But our old habit of including state in wires was very bad not only because it was slow, but because it left all the non-upgradable state lying around the system. With this approach, at least all the non-upgradable state is in the vane responsible for it, which can deal with it appropriately.

@philipcmonk
Copy link
Contributor Author

@jtobin whoops, I'll merge from there.

@ohAitch
Copy link
Contributor

ohAitch commented May 15, 2019

They can produce both effects and values. In this case, for example, we produce an updated dome (desk state) and rang (global object store). The code that evaluates the monad (+take-commit/+take-merge) can then integrate that with your state. In this case, we replace the dome (which means we need to have a write lock on the dome) and we +uni:by the rang (which means if we want to garbage-collect the rang, we should probably do it between transactions).

"Produces an effect which is evaluated by clay itself" is what I was gesturing at wrt indirect state changes, nod

But our old habit of including state in wires was very bad not only because it was slow, but because it left all the non-upgradable state lying around the system. With this approach, at least all the non-upgradable state is in the vane responsible for it, which can deal with it appropriately.

My concern is not upgradable state, but upgradable code: it used to be possible to add a printf in the middle of a transaction(or expand a step's input type according to an interface change), now it is not. I agree that the /next-state-name/<@p>/<@p>/<@ud>/<@asdf> pattern is undesirable once there are more than a couple steps, and should be replaced with /[%next-state-name]/<transaction-id> (map transaction-id formal-data) pair; but clad also gets rid of explicitly named states, leaving no affordances to rearrange the state flow graph or even the code handling each transition.

@philipcmonk
Copy link
Contributor Author

Adding printfs without restarting a transaction is an interesting use I hadn't thought of, but in general upgrading code isn't always possible in the middle of a transaction, and our usual approach has been "ignore any ships in the middle of a transaction and hope they don't sink", which is worse. My guess is that in practice restarting the transaction on upgrade will be fine for debugging.

@belisarius222
Copy link
Contributor

belisarius222 commented May 15, 2019

I think I introduced the unix sync bug when I redid the mime cache a few months ago. Sorry! I didn't realize it broke syncing to Unix.

The reason I messed with the mime cache was to reduce restart time. When you restart Urbit (at least on master), I believe it rebuilds every mounted file in the filesystem.

This meant restarting a ship could take in the tens of seconds, if I recall correctly. I feel lile there should be a way to unbork unix syncs without losing the caching.

@joemfb
Copy link
Member

joemfb commented May 15, 2019

It should be noted that this PR is now also a proposal to merge next into master, as this work was done on top of next.

@joemfb joemfb mentioned this pull request May 15, 2019
@philipcmonk
Copy link
Contributor Author

The reason I messed with the mime cache was to reduce restart time. When you restart Urbit (at least on master), I believe it rebuilds every file in the filesystem.

Ah, you're correct. It looks like sync portion of restarts take about 5-20 seconds without the mime cache, and about 2 seconds with it. I re-added it, updating it on every %ergo as described above. This fixes the bug, gives us the better start times, and does so more consistently (there were times before when the cache wouldn't be full unless you changed files, so startup would be very slow). Fun fact: hoon files are read in as /text/plain but written as /text/x-hoon, which broke the cache lookup often.

As an aside, I dislike how ford's build times seem to be extremely variable. Maybe it's missing caches for spurious reasons?

@philipcmonk
Copy link
Contributor Author

It should be noted that this PR is now also a proposal to merge next into master, as this work was done on top of next.

Is next deprecated? Should everything branch off of and PR to master?

@jtobin
Copy link
Contributor

jtobin commented May 16, 2019

It should be noted that this PR is now also a proposal to merge next into master, as this work was done on top of next.

Is next deprecated? Should everything branch off of and PR to master?

next is officially deprecated. I'll actually go ahead and delete the branch soon enough.

@ohAitch
Copy link
Contributor

ohAitch commented May 16, 2019 via email

@belisarius222
Copy link
Contributor

Some of Clay's Ford builds have around 1500 sub-builds, which has a tendency to flush the Ford cache. This is a common problem in caching, and systems that can handle this sort of thing well are called scan-resistant. Ford's cache replacement is a basic clock algorithm and is not scan-resistant.

@jtobin
Copy link
Contributor

jtobin commented May 16, 2019

There seems to be a fairly frequent oscillation between master being stable+deployed and having a "devel" branch, and deploys being tied to stable releases with master containing whatever code is desired for deploy sometime soon. Is there a collection anywhere of arguments for the respective development models?

Not to derail the conversation around the actual contribution here too much, but the idea is to move to the latter scheme you've mentioned -- i.e. to scrap the non-master branches that have semantic operational meaning ascribed to them (so: next, hotfix) and simply target PRs at master. Master should always build and generally work, but it may not at any given moment correspond to a proper release, or what is deployed. Releases are to be accomplished by tags, as they are at present.

(Hotfixes, which are really just "unscheduled releases," can be accomplished by applying commits to the relevant release tag in a distinct branch or what have you.)

This generally just feels simpler than the "dev branch" model, which has felt superfluous to me in all projects I've worked on that use it, but it arguably requires a technical release manager of sorts to make sure things go smoothly. I've volunteered for that job, so hopefully things go smoothly.. 😄

Copy link
Member

@joemfb joemfb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Sorry for the delayed review. FWIW, I have no objections to merging next into master now. The only issue I can see is that %crud error printing in %dill now assumes the tweaked event-replacement log from cc-release, and that's a minor inconvenience at most.

A tangent on restart and caching: it should be generally knowledge that all the vere caches are completely cleared on restart. This is done out of an abundance of caution after the livenet memory issues from earlier this year.

These isolated clay transactions seem like a big improvement. From a style standpoint, it feels like more of the data transformation could be factored out of the bind invocations, which would bring them closer together and make the I/O and state transitions even clearer. And it would be nice to bundle up +clad with its interpreter(s) (or to factor out some kind of generic interpreter). But these are just nits ... Nice work!

@philipcmonk philipcmonk mentioned this pull request May 16, 2019
@philipcmonk
Copy link
Contributor Author

And it would be nice to bundle up +clad with its interpreter(s) (or to factor out some kind of generic interpreter).

Addressed to some extent in #1174.

@jtobin jtobin merged commit ec1b5a2 into master May 17, 2019
@jtobin jtobin deleted the philip/individual-breaches branch May 17, 2019 21:31
@philipcmonk philipcmonk mentioned this pull request May 23, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants