Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race condition causing SymbolBucket isEmpty errors #3681

Merged
merged 11 commits into from
Dec 5, 2016
Merged

Fix race condition causing SymbolBucket isEmpty errors #3681

merged 11 commits into from
Dec 5, 2016

Conversation

lbud
Copy link
Contributor

@lbud lbud commented Nov 22, 2016

Fixes #3663 which was introduced after recent optimizations.

Unfortunately since was caused by a race condition I don't think we could create a useful regression test…

edited to summarize the below discussion + better reflect what this PR does:

Fixes #3663:

  • There was a race condition wherein two calls to VectorTileWorkerSource#reloadTile would interleave calls to parse for the same tile, which would cause its SymbolBucket members to reset themselves while assumed to be complete for the purpose of transference. SymbolBucket.prototype.isEmpty throws TypeErrors on startup #3663 then spewed errors because in serializing buckets, some symbol buckets didn't have an arrays member with its array groups.
    • To fix, VectorTileWorkerSource#reloadTile now checks to see if its WorkerTile is still parsing (using the existing WorkerTile.status), if so, defers its next parse until it receives a response from the first: vector_tile_worker_source.js#L89-L102
  • This was masking a second race condition, which is also fixed in this PR: calls to WorkerTile.redoPlacement got deferred if a previous parse call to the same tile wasn't finished (with WorkerTile.redoPlacementAfterDone); and then were called in WorkerTile#parse's done function, after receiving glyph/icon assets and re-parsing symbol buckets but before serializing buckets to return via the parse callback. This meant the subset of buckets that were re-placed in redoPlacement — all symbol buckets — were already serialized and tripping assertion errors here
    • However, the only times we would ever interleave WorkerTile#parse and WorkerTile#redoPlacement would be when WorkerTile#parse is waiting on glyph/icon assets from the main thread. WorkerTile#parse only places symbols after this, meaning we'd never need to re-call a deferred redoPlacement after this as it would return the same result. To fix, I removed the redoPlacementAfterDone concept altogether here; those interleaved redoPlacement calls set this.pitch and then return.

@jfirebaugh
Copy link
Contributor

Can you describe what the race was?

@lbud
Copy link
Contributor Author

lbud commented Nov 22, 2016

@jfirebaugh sure — here's what I gather:

  • When a WorkerTile is responsible for collecting icons and/or glyphs, it waits on callbacks from the main thread to indicate that it has the glyph stacks/sprites it needs (here)
  • Only once it has received two done messages (either from real asset dependency callbacks or, if the tile doesn't require glyphs/icons, simple passthrough function calls), it prepares a SymbolBucket (here),
  • which is when ArrayGroups are created (here) based on the symbol interfaces (here).
  • We don't wait on bucket.prepare callbacks before finishing the loop and calling the worker's final callback (here) to transfer data back to the main thread, so sometimes when serializing the buckets when checking for isEmpty the symbol bucket's arrays object was not yet created, throwing an error when searching for its individual ArrayGroup members (here).

(In all other layer cases, ArrayGroups are created in the bucket's constructor unless already present (here))

I guess the alternate fix here might be something like: create another nested function inside gotDependencies that waits on a bucket.prepare callback (passed through to createArrays) to be called n=symbolBuckets.length times before calling done to transfer data back to the main thread. I'm not familiar enough with the threads' responsibilities to be sure which is "right?"

(Also in looking into this I realized that SymbolBucket.place also calls createArrays, which seems like we're unnecessarily redefining this.arrays. Should this be removed?)

@jfirebaugh
Copy link
Contributor

What do you mean by bucket.prepare callback?

The behavior as I understand it is:

  • If there are no symbol buckets, then done is called immediately, without waiting on any dependencies. In this situation, the SymbolBucket.isEmpty error shouldn't be possible, because there aren't any symbol buckets.
  • If there are symbol buckets, then we wait for all dependencies. When they are complete, then bucket.prepare and bucket.place are called for all symbol buckets before calling done. This should result in all symbol buckets having an arrays member.

So I'm still not clear on how we can wind up with symbol buckets without arrays. I might be missing something and it is an expected state some other way, or the bug might be elsewhere.

(Also in looking into this I realized that SymbolBucket.place also calls createArrays, which seems like we're unnecessarily redefining this.arrays. Should this be removed?)

SymbolBucket#place is called on it's own in WorkerTile#redoPlacement, so the createArrays there is necessary. I think it might be safe to remove from SymbolBucket#prepare though.

@anandthakker
Copy link
Contributor

anandthakker commented Nov 23, 2016

@lbud @jfirebaugh I hit a similar race/exception a few weeks ago during the WaPo project, but I had thought it was because of our hacky custom source, and so didn't report it here. Possibly unrelated to what's happening here, but I'll describe in case it's helpful.

Essentially, the issue I had was inconsistent state in the glyph/icon callbacks. I think was happening for us looked something like:

  • Main: loadTile()
  • Worker: Worker#loadTile => WorkerTile#parse => getGlyphs(..., cb1) ...
    • Main: request glyphs asset...
  • Main: loadTile() // same tile requested before first one completes
  • Worker: Worker#reloadTile => WorkerTile#parse (replaces this.symbolBuckets) => getGlyphs(..., cb2) ...
  • Main: ... glyph asset response

@jfirebaugh
Copy link
Contributor

@anandthakker Doesn't

// schedule tile reloading after it has been loaded
tile.reloadCallback = callback;
prevent reloadTile being called for a tile that hasn't completed loading yet?

@anandthakker
Copy link
Contributor

@jfirebaugh Ah. Yeah, it does... maybe it was two reloadTiles rather than a loadTile+reloadTile. Even when it's reloading, there's still the noop get{Glyphs,Icons} requests that @lbud referred to, right?

@jfirebaugh
Copy link
Contributor

Yeah, looks like two reloadTiles could produce a race where a worker tile is interleaving multiple calls to parse for the same tile. If that's what's causing this issue, I think the right fix is to prevent it from happening (maybe introduce a reloading state for Tile?), rather than adding a guard to SymbolBucket#isEmpty.

@lbud
Copy link
Contributor Author

lbud commented Nov 23, 2016

@jfirebaugh @anandthakker thanks for the help — I was able to demonstrate that condition where it's reloading a tile before a previous parse on the same tile is done. Will fix now.

@lbud
Copy link
Contributor Author

lbud commented Nov 23, 2016

😞 thought I fixed this in 07c9d5a, but upon further interaction with the map (rotating/pitching, so I suspect redoPlacement) I've been using to debug came upon some assertion errors from serialization that I have to assume I just introduced — still digging.

@lbud
Copy link
Contributor Author

lbud commented Nov 29, 2016

The redoPlacement race I mentioned is unrelated to the above fix (so presumably should be replicable outside of this branch — will verify) and occurs when workerTile.parse and workerTile.redoPlacement are called in very quick succession, as follows (apologies for the verbose description, mostly for my own understanding):

meanwhile, in the redoPlacement that's just been called again:

At this point,

  • workerTile.symbolBuckets are mid-serialization/transfer from being called in parse; these are now partially transferred
  • workerTile.symbolBuckets are also starting a new serialization/transfer process from being called in redoPlacement, because we didn't wait to re-call the batch of SymbolBucket.createArrays until after the first transfer was complete
  • we then trip this assert because we're trying to newly transfer a set of partially transferred arrays

(Presumably (though I haven't confirmed yet) resetting symbol buckets' arrays while parse was still serializing them would also cause bugs from that standpoint (serializing empty arrays where there should be symbol content))

@lbud
Copy link
Contributor Author

lbud commented Nov 29, 2016

Confirmed that the above AssertionError can be nearly replicated in master — the isEmpty bug was masking it before, but with a bandaid like 4cacc0e that has nothing to do with tile reparsing, this happens as well.

@bbodien
Copy link

bbodien commented Nov 30, 2016

Hi guys, are you anywhere near getting this merged and included in a release? I'm experiencing the problems described in #3663 (I'm using setFilter()), so I can't upgrade past v0.26.0, and would love to benefit from the performance improvements in v0.27.0

Lauren Budorick added 2 commits November 30, 2016 17:08
…es race condition between WorkerTile.parse and WorkerTile.redoPlacement
@lbud
Copy link
Contributor Author

lbud commented Dec 1, 2016

@bbodien yes, we are actively working on it and will include the fix in the next release.

@lbud
Copy link
Contributor Author

lbud commented Dec 1, 2016

I think this is ready — @jfirebaugh could I bother you for a review? 🙏 OP updated to reflect issues/fixes.

@jfirebaugh
Copy link
Contributor

redoPlacement races ahead and resets each SymbolBucket's arrays (with SymbolBucket#place), which can still be mid-serialization.

Can you explain this in more detail? Bucket serialization is a synchronous operation, so I'm not clear how it can interleave with SymbolBucket#place. (Like the main JS thread, each worker is individually "single threaded" -- it can't race with itself unless there's an explicit asynchronous operation involved, like waiting for a message to be posted.)

@lbud
Copy link
Contributor Author

lbud commented Dec 3, 2016

@jfirebaugh per chat, I was mistaken in my diagnosis of the problem (based on the single worker thread being synchronous) but had inadvertently fixed it by moving redoPlacement after buckets are serialized. I rolled back the callbacks I had unnecessarily added in 1959cce and updated OP.

Lauren Budorick added 2 commits December 2, 2016 17:23
@lbud
Copy link
Contributor Author

lbud commented Dec 3, 2016

@jfirebaugh capturing from voice: I

  • removed the redoPlacementAfterDone callback altogether (as it was both unnecessary and wouldn't have returned deferred data as intended anyway),
  • moved WorkerTile#parse's collisionTile to be more localized to its use (as it is unnecessary in this FeatureIndex), and
  • factored out a bit of symbol placement code to be shared between parse and redoPlacement.

@jfirebaugh
Copy link
Contributor

Almost there -- the last thing I see is that the new CollisionTile(this.angle, ...) needs to happen later in the process (while done is executing), so that it picks up the most recent this.angle value set by redoPlacement. One way to do this would be to create it inside, and return it from, placeSymbols, and then move the call to placeSymbols into done.

@lbud
Copy link
Contributor Author

lbud commented Dec 3, 2016

@jfirebaugh 👍 addressed in dbd4259 (also, looks like we were not updating this.pitch in redoPlacement for the sake of re-parsing).

@@ -144,14 +141,9 @@ class WorkerTile {
deps++;
if (deps === 2) {
for (const bucket of this.symbolBuckets) {
// Layers are shared and may have been used by a WorkerTile with a different zoom.
Copy link
Contributor

@jfirebaugh jfirebaugh Dec 5, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to happen immediately prior to prepare as well, since prepare uses layer properties. (Making this less fragile is tracked in #3479.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@jfirebaugh jfirebaugh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SymbolBucket.prototype.isEmpty throws TypeErrors on startup
5 participants