
Race when registering multiple service workers for the same scope #783

Closed
catalinb opened this issue Nov 19, 2015 · 30 comments

@catalinb

Consider a sequence of two register() calls for two worker scripts on the same scope. The first service worker rejects its install event's waitUntil promise after some time, resulting in the registration being cleared. If the second register() is called while the first service worker is in the installing state but hasn't rejected yet, it may end up completing with a registration that was removed from the scope-to-registration map.

The order of events is as follows:

0. Call register() for the first service worker.
1. registration1 is at step 14 of the Install algorithm: "Wait for task to have executed or been discarded."
2. Call register() for the second service worker.
3. registration2 hits step 4.1 of the Register algorithm. This step invokes Get Registration, which finds the registration created by the previous call.
4. registration2 executes atomically all the way to step 3 of Update.
5. The first service worker rejects, and registration1 hits step 16.3.1 of the Install algorithm, which invokes Clear Registration.
6. registration2 hits step 4.2 and continues with the Update steps using a registration that was removed from the registration map when Clear Registration was invoked above.

This can happen because step 4.2 from Update can race with step 15 from Install.
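The interleaving can be mocked in a few lines of plain JavaScript (a toy model, not the spec algorithms; the script names are invented):

```javascript
// Toy model of the race: a scope-to-registration map, with the two
// jobs' steps interleaved by hand in the order listed above.
const registrationMap = new Map();

// Steps 0-1: the first register() creates a registration whose worker
// begins installing.
const registration1 = { scope: '/', installing: 'sw-a.js' };
registrationMap.set('/', registration1);

// Steps 2-4: the second register() runs Get Registration and captures a
// direct reference to registration1; this captured reference is the hazard.
const foundByJob2 = registrationMap.get('/');

// Step 5: sw-a.js rejects its install-event waitUntil promise, and Clear
// Registration removes the entry from the map.
registrationMap.delete('/');

// Step 6: the second job resumes with its stale reference and "updates"
// a registration that is no longer in the map.
foundByJob2.installing = 'sw-b.js';

// The map and the second job now disagree: the job thinks sw-b.js is
// installing under scope '/', but the map has no registration at all.
const mapHasScope = registrationMap.has('/');
```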

@jungkees (Collaborator)

Right. That scenario makes them race. I think we should not invoke Clear Registration when the install fails. A subsequent Update will either invoke Install with a new script or clear the registration out (in the Update steps) if it encounters an error anyway.

So, let's remove Install step 11.4.1.5 and Install step 16.3. WDYT?

/cc @mattto

@jungkees jungkees added this to the Version 1 milestone Nov 30, 2015
@wanderview (Member)

> So, let's remove Install step 11.4.1.5 and Install step 16.3. WDYT?

I think the issue here is that Register touches global state outside of a synchronization mechanism. I don't think removing steps within the synchronization mechanism is the right fix.

I think we could fix this with a few tweaks, but it really feels like we should simplify the synchronization mechanisms here.

Can we do something like this?

  1. Replace registration queue, installation queue, activation queue, etc with a single per-scope work queue.
  2. Every Register, Update, Unregister, etc. operation would get a slot in this queue. Only one of these could ever be running at a time.
  3. Nothing that touches global state like the registration map would happen outside of the queue.

This would mean we could not be installing a SW while activating another one, but personally I think that's fine. I don't think it's worth the complexity to optimize this case, and it would be better to enforce strict synchronization.
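The per-scope work queue described above might look something like this (an assumed shape for illustration, not spec text; names are invented):

```javascript
// One FIFO queue per scope: register/update/unregister jobs run
// strictly one at a time, and all access to the registration map
// happens inside a job.
const registrationMap = new Map();
const jobQueues = new Map(); // scope -> array of pending jobs

function scheduleJob(scope, job) {
  if (!jobQueues.has(scope)) jobQueues.set(scope, []);
  const queue = jobQueues.get(scope);
  queue.push(job);
  if (queue.length === 1) runQueue(scope); // nothing running: start
}

function runQueue(scope) {
  const queue = jobQueues.get(scope);
  while (queue.length > 0) {
    const job = queue[0];
    job(registrationMap); // the running job owns the map
    queue.shift();
  }
}

// Three jobs on the same scope run in order, never concurrently.
const order = [];
scheduleJob('/', () => order.push('register'));
scheduleJob('/', () => order.push('update'));
scheduleJob('/', () => order.push('unregister'));
```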

@jungkees (Collaborator) commented Dec 1, 2015

That option was considered, but it's not a good fit for synchronizing Unregister with Register/Update, because Unregister takes effect only when all the clients are unloaded. That is, it's not practical for all the Register/Update requests coming from different tabs, along with Soft Updates running in parallel, to block until the condition for Unregister is met.

@wanderview (Member)

> That option was considered, but it's not a good fit for synchronizing Unregister with Register/Update, because Unregister takes effect only when all the clients are unloaded. That is, it's not practical for all the Register/Update requests coming from different tabs, along with Soft Updates running in parallel, to block until the condition for Unregister is met.

It's fine to still use a flag on the registration to indicate an unregister is in progress without blocking. I'm just saying a single queue to step through job state would be the easiest to understand and would avoid the races we are seeing in the spec at the moment.

@jungkees (Collaborator) commented Dec 4, 2015

I basically don't want to break any behavior that Blink has implemented in conformance with the current spec so far. But I'm willing to refine the algorithms for simpler reasoning and the expected behavior if both Blink and Gecko agree.

Here's the outline of the refinement (if agreed) based on what's suggested:

  • Q: a job queue.
  • J: a job: fetch a script and make it get through to .waiting (i.e. Update to Install). There are two types of jobs: 1) Jr: a register job and 2) Ju: an update job.
  • Jobs are entirely serialized in Q.
  • Activate is not serialized and can just be invoked in parallel. It basically promotes the current .waiting worker to .active. (It simply returns when .waiting is null.)
    (* I don't think it's reasonable to put Activate in the single unit of job serialization, as triggering Activate is nondeterministic.)
  • An update job (Ju) can't be inserted in Q if Q isn't empty. (That update attempt is just ignored.) <-- I'd like to know Chrome's position on this.
    • If we allow multiple (Ju)s in Q, the expected behavior for "Ju - Jr with a new scriptURL - Ju" will be ambiguous.
    • Note: currently, Soft Update doesn't insert a Ju in Q in this case, but registration.update() does. That inserted update job makes the current .installing worker redundant and goes ahead with its own Update/Install steps.
  • Unregister is not serialized and can just be invoked in parallel. It basically sets the uninstalling flag. When no client is using the registration, the UA clears it out right away; otherwise, the last client unload will.
  • When the uninstalling flag is set, an update attempt just returns without scheduling an update job (Ju).
  • When the uninstalling flag is set, a register attempt unsets the flag and goes ahead with scheduling a register job (Jr).

@wanderview @mattto Based on the discussion here, I'll try to refine the spec algorithms.
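The scheduling rules sketched in this outline could be modeled roughly like this (illustrative only; function names are invented and the real spec algorithms differ in detail):

```javascript
const Q = []; // serialized job queue for one scope
let uninstalling = false;

function scheduleRegister(scriptURL) {
  uninstalling = false; // a register attempt unsets the flag
  Q.push({ type: 'register', scriptURL });
  return true;
}

function scheduleUpdate(scriptURL) {
  if (uninstalling) return false; // ignored while uninstalling
  if (Q.length > 0) return false; // ignored if Q isn't empty
  Q.push({ type: 'update', scriptURL });
  return true;
}

function unregister() {
  uninstalling = true; // not serialized: just set the flag
}

// An update attempt while a register job is queued is ignored.
scheduleRegister('sw.js');
const updateAccepted = scheduleUpdate('sw.js');

// After unregister, updates are ignored but a register revives
// the registration by unsetting the flag.
unregister();
Q.length = 0;
const updateWhileUninstalling = scheduleUpdate('sw.js');
const registerRevives = scheduleRegister('sw.js');
```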

@wanderview (Member)

I'm less certain we need this large a change now that we're removing the script URL from the registration and making it an argument to the job. I need to think about it more.

About the original race: if we move the .register() accesses to the registration into a job queue, then we could make Clear Registration conditional on the queue being empty. So we only clear when there is no competing registration, etc.

Perhaps another way would be to have it set the uninstalling flag instead of calling Clear Registration.

In the end, I expect what Chrome does is not racy. If we could just accurately describe that algorithm in the spec, then we could match it.

@wanderview (Member)

Thinking about this more, how about this:

Instead of passing the registration to the job as an argument, we instead pass the scope. Then, once we are in the job synchronization mechanism, we pull the registration out of the map again each time. This lets us explicitly handle the case where the scope is not in the map: for a register() operation we would create a new registration; for an update() we would simply abort, etc.

This should allow us to avoid or explicitly handle any races around accessing the registration map.
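A minimal sketch of this scope-passing approach (assumed shape, not spec text):

```javascript
// Jobs carry only the scope string; the registration is looked up from
// the map inside the job, so a cleared registration is handled explicitly.
const registrationMap = new Map();

function runRegisterJob(scope, scriptURL) {
  let registration = registrationMap.get(scope);
  if (registration === undefined) {
    // Scope not (or no longer) in the map: create a fresh registration.
    registration = { scope, scriptURL };
    registrationMap.set(scope, registration);
  }
  return registration;
}

function runUpdateJob(scope) {
  const registration = registrationMap.get(scope);
  if (registration === undefined) return null; // registration gone: abort
  return registration;
}

// If the registration is cleared between scheduling and running,
// a register job recreates it while an update job aborts rather
// than touching stale state.
const rereg = runRegisterJob('/', 'sw.js');
registrationMap.delete('/');
const aborted = runUpdateJob('/');
```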

Thoughts?

@jungkees (Collaborator) commented Dec 8, 2015

Sounds good. It's basically what we discussed in terms of putting register jobs into a job synchronization mechanism. For this, I think the algorithm thread queue should be defined as a per-scope queue instead of a per-registration queue, as there's no registration yet for first-time registrations. I'll check whether there are any further problems to solve in order to spec it this way.

Regarding maintaining multiple queues, do you agree to keep them? Since there are blocking waits during the register/update job steps (e.g. fetching the script, waiting for oninstall execution, etc.), I think it makes sense to give the next request a chance to replace the current job.

@wanderview (Member)

Yea, I think multiple queues are ok. I think we just have to be very cautious that any job waiting in a queue references only transient parameters and not shared state.

@wanderview (Member)

I'm effectively implementing this in gecko in this bug:

https://bugzilla.mozilla.org/show_bug.cgi?id=1231974

@wanderview (Member)

So, having implemented this, I can see there are still some races:

  1. Start with an active worker with script A.
  2. A register() adds an Update job with script B to the update queue.
  3. Add a second Update job with script A to the update queue.
  4. The first Update is popped off the update queue, verifies that we are registering a new script, and queues an Install job for script B in the install queue.
  5. The second Update job is popped off the update queue. It still just sees the script A active worker as the newest worker, so it thinks it's OK to update with script A. It also queues an Install job, but with script A.
  6. Install job for script B runs.
  7. Install job for script A runs.

End result is we lose the registration of script B.

To help close this race, I'm trying to check for changes in the newest worker's script URL right before queuing the install job from an update, but I think there is still the potential for a race.

If the install was performed as part of the same job as the update this race would not exist. We could not start the next update until the previous install had completed.
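The interleaving above can be replayed with mock state (a toy model; not Gecko or spec code):

```javascript
// An active worker runs script A; two update jobs both run before
// either install job, so each reads newestScript === 'A' and decides
// its install may proceed.
let newestScript = 'A';
const installQueue = [];

installQueue.push('B'); // update job for register('B'): B is new vs A
installQueue.push('A'); // second update job: still sees A as newest

// Install jobs then run in FIFO order; the later install for A
// overwrites the install for B.
for (const script of installQueue) newestScript = script;

// End result: the registration of script B is lost.
```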

@wanderview (Member)

> To help close this race, I'm trying to check for changes in the newest worker's script URL right before queuing the install job from an update, but I think there is still the potential for a race.

Actually, to better fix this race, I'm going to pass a value into the install job indicating whether the script URL can change. That way the update job can indicate that changing the script is not permitted, and the install job can abort.
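The described flag might work roughly like this (an assumed shape for illustration; names are invented):

```javascript
// The update job tells the install job whether the script URL may
// change; the install job aborts if the newest script changed
// underneath it and changes aren't permitted.
function installJob(state, script, scriptMayChange) {
  if (!scriptMayChange && state.newestScript !== script) {
    return 'aborted'; // newest worker changed since the update job ran
  }
  state.newestScript = script;
  return 'installed';
}

// Replaying the earlier race with the flag: register('B') may change
// the script; the stale update for A may not, so its install aborts.
const state = { newestScript: 'A' };
const r1 = installJob(state, 'B', true);
const r2 = installJob(state, 'A', false);
```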

I'm not sure that's the best way to represent this in the spec.

@jungkees (Collaborator)

> The second Update job is popped off the update queue. It still just sees the script A active worker as the newest worker, so it thinks it's OK to update with script A. It also queues an Install job, but with script A.

The current behavior in the spec is that the second job replaces the first. That is, in the example above, when the second update job gets into the synchronized steps, it first replaces the current installing worker by terminating the executing installing worker. Then this update job continues and advances to the Install phase if the conditions are all met. The terminated installing worker becomes redundant in Install step 16. I think I missed popping the top element from the queue in this error case. I'll double-check.

@jungkees (Collaborator)

> In the end, I expect what Chrome does is not racy. If we could just accurately describe that algorithm in the spec, then we could match it.

I'll review what chrome does now and revisit the spec algorithms to make them congruent.

This issue merges #789.

@wanderview (Member)

> The current behavior in the spec is that the second job replaces the first. That is, in the example above, when the second update job gets into the synchronized steps, it first replaces the current installing worker by terminating the executing installing worker. Then this update job continues and advances to the Install phase if the conditions are all met. The terminated installing worker becomes redundant in Install step 16. I think I missed popping the top element from the queue in this error case. I'll double-check.

In my example above there is no installing worker yet; the first install job has not been executed. It's a race between running the install job and running the second update job.

@jungkees (Collaborator)

Understood your concern. So it was between the first install and the second update.

> End result is we lose the registration of script B.

I think the result would be what we expected, as the second job's worker (script A) would be the waiting worker at the end. The rationale I relied on to not break the state is that the second install will never run before the first install completes, so the second job's worker will be the final waiting worker. (The registration's last update check time is the only state that the Update algorithm modifies.)

But I admit that this is subtle and not easy to reason about. Also, there can be a behavior mismatch when the second update aborts in the example scenario: the spec says the first job will continue and make script B the waiting worker, but Chrome may have already doomed the first job by the time the second update runs.

I'm leaning toward adopting a single job queue. I believe that's what Chrome currently does and what @wanderview wanted, if that behavior can be spec'ed.

@jungkees (Collaborator)

I think this is the summary of the expected behavior and what Chrome does now: #396 (comment).

@jungkees (Collaborator)

FYI, I started working on this against the nightly version in the "job-structure-revision" branch: https://github.com/slightlyoff/ServiceWorker/blob/job-structure-revision/spec/service_worker/index.html. It's a WIP (447d2d6). Please don't review it in detail; a lot of things still need updating. But you may want to check it out to see whether this revision is what we'd want.

@jungkees (Collaborator)

@wanderview @mattto I just updated the spec as discussed: 7b49e85. The basic behavior is:

  • It uses a single job queue, where register/update/unregister jobs are scheduled in FIFO order.
  • Equivalent jobs are aggregated when scheduled.
  • The UA dooms the installing worker in favor of the next job in the queue when possible.

I intended to write it conforming to the current implementation. Please review it.
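The aggregation rule can be sketched like this (illustrative only; "equivalent" is simplified here to same type, scope, and script URL):

```javascript
// Jobs equivalent to the last scheduled job are aggregated into it
// instead of being appended to the queue.
const jobQueue = [];

function scheduleJob(job) {
  const last = jobQueue[jobQueue.length - 1];
  if (last && last.type === job.type && last.scope === job.scope &&
      last.scriptURL === job.scriptURL) {
    return last; // equivalent to the last scheduled job: aggregate
  }
  jobQueue.push(job);
  return job;
}

// Two back-to-back update jobs for the same script collapse into one
// queue entry; a register job for a different script does not.
const u1 = scheduleJob({ type: 'update', scope: '/', scriptURL: 'sw.js' });
const u2 = scheduleJob({ type: 'update', scope: '/', scriptURL: 'sw.js' });
scheduleJob({ type: 'register', scope: '/', scriptURL: 'sw2.js' });
```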

/cc @slightlyoff @jakearchibald @annevk

@wanderview (Member)

nit: There is a broken link to Run Job from the Finish Job section.

@wanderview (Member)

Can we talk about this at the f2f in January? @jungkees, @mattto, will you be there?

@jungkees (Collaborator)

> nit: There is a broken link to Run Job from the Finish Job section.

Fixed: 44f97f1.

> Can we talk about this at the f2f in January? @jungkees, @mattto, will you be there?

Yes, I'll be there. Any feedback before that would be great, too.

@mfalken (Member) commented Jan 5, 2016

> Can we talk about this at the f2f in January? @jungkees, @mattto, will you be there?

I'm still TBD; I have some personal stuff going on in January. If I can't make it, I'll make an effort to join remotely, at least for the late afternoon Pacific time.

@mfalken (Member) commented Jan 5, 2016

> The UA dooms the installing worker in favor of the next job in the queue when possible.

This is something Chrome does, but I'm aiming to remove it. I believe it adds unnecessary complexity; it's simpler to just let the installing worker finish doing its thing or eventually time out. I originally added it to stop event.waitUntil(new Promise(() => {})) from blocking the queue, but now Chrome has event timeouts, so that is not a concern.

@wanderview (Member)

> but now Chrome has event timeouts, so that is not a concern.

How long of a timeout do you use for this?

@mfalken (Member) commented Jan 5, 2016

Right now, 30 seconds for executing sync JS (while (true) {}), and five minutes for any event (event.waitUntil(new Promise(() => {}))).

@mfalken (Member) commented Jan 22, 2016

Just to reiterate before the F2F: I'd be really happy if the spec is as simple as possible. The preemptive dooming of an installing worker was something the old Chrome implementation did to prevent getting stuck forever. The same can be accomplished with a simple timeout.

@jungkees (Collaborator)

I think the only case we should consider before removing it is a job with the same scope and a different script URL wanting to preempt.

@jakearchibald (Contributor)

F2F resolution: a single job queue may fix this. @wanderview to review.

@wanderview (Member)

I've implemented this and landed it in FF48. I found a few stray problems that I opened as separate issues. Overall, though, I think this makes the algorithm much less racy. Thanks!
