Race when registering multiple service workers for the same scope #783
Right. That scenario makes them race. I think we should not invoke Clear Registration when the install fails. The subsequent Update will invoke Install with a new script, or clear it out (in the Update steps) if it encounters an error anyway. So, let's remove Install step 11.4.1.5 and Install step 16.3. WDYT? /cc @mattto
I think the issue here is that Register touches global state outside of a synchronization mechanism. I don't think removing steps within the synchronization mechanism is the right fix. I think we could fix this with a few tweaks, but it really feels like we should simplify the synchronization mechanisms here. Can we do something like this?
This would mean we could not be installing a SW while activating another one, but personally I think that's fine. I don't think it's worth the complexity to optimize this case, and it would be better to enforce strict synchronization.
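A per-scope job queue along those lines can be sketched in a few lines of JavaScript (all names here are illustrative, not from the spec): each job for a scope is chained behind the previous one, so register, update, and unregister steps for the same scope can never interleave.

```javascript
// Hypothetical sketch of a per-scope job queue. Each scheduled job runs
// only after every previously scheduled job for that scope has settled,
// so jobs for the same scope cannot interleave.
const jobQueues = new Map(); // scope -> promise for the queue's tail

function scheduleJob(scope, job) {
  const tail = jobQueues.get(scope) || Promise.resolve();
  // Swallow the previous job's rejection so one failed job
  // does not wedge the rest of the queue.
  const next = tail.catch(() => {}).then(() => job());
  jobQueues.set(scope, next);
  return next;
}
```

A rejected install job simply settles its promise; the next queued job then observes whatever state the failed job left behind, from inside the same synchronization mechanism.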
That option was considered, but it's not suitable for synchronizing Unregister with Register/Update, since Unregister takes effect only once all the clients are unloaded. That is, it's not practical for all the Register/Update requests coming from different tabs, and Soft Updates running in parallel, to block until the condition for Unregister is met.
It's fine to still use a flag on the registration to indicate an unregister is in progress without blocking. I'm just saying a single queue to step through job state would be the easiest to understand and would avoid the races we are seeing in the spec at the moment.
I basically don't want to break any behavior that Blink has implemented in conformance with the current spec so far. But I'm willing to refine the algorithms for simpler reasoning and the expected behavior if both Blink and Gecko agree. Here's the outline of the refinement (if agreed) based on what's suggested:
@wanderview @mattto Based on the discussion here, I'll try to refine the spec algorithms.
I'm less certain we need this large a change now that we're removing the script URL from the registration and making it an argument to the job. I need to think about it more. About the original race: if we move the .register() accesses to the registration into a job queue, then we could make Clear Registration conditional on whether the queue is empty, so we only clear when there is no competing registration, etc. Perhaps another way would be to have it set the uninstalling flag instead of calling Clear Registration. In the end, I expect what Chrome does is not racy. If we could just accurately describe that algorithm in the spec, then we could match it.
Thinking about this more, how about this: instead of passing the registration to the job as an argument, we instead pass the scope. Then, once we are in the job synchronization mechanism, we pull the registration out of the map again each time. This lets us explicitly handle the case where the scope is not in the map. For a register() operation we would create a new registration; for an update() we would simply abort, etc. This should allow us to avoid, or explicitly handle, any races around accessing the registration map. Thoughts?
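As a sketch of that shape (hypothetical names; the real spec map and job steps are more involved), the job carries only the scope and re-reads the scope-to-registration map once it is inside the synchronized step:

```javascript
// Hypothetical scope-to-registration map, re-read inside each job.
const registrationMap = new Map();

// register(): an absent scope just means "create a fresh registration".
function runRegisterJobStep(scope, scriptURL) {
  let registration = registrationMap.get(scope);
  if (!registration) {
    registration = { scope, scriptURL };
    registrationMap.set(scope, registration);
  } else {
    registration.scriptURL = scriptURL;
  }
  return registration;
}

// update(): an absent scope means the registration was cleared by a
// competing job, so the update simply aborts.
function runUpdateJobStep(scope) {
  const registration = registrationMap.get(scope);
  if (!registration) throw new Error('update aborted: no registration for ' + scope);
  return registration;
}
```

The point is that neither step holds a reference to a registration captured before the job was queued, so a cleared registration is observed explicitly rather than raced against.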
Sounds good. It's basically what we discussed in terms of putting register jobs into a job synchronization mechanism. For this, I think the algorithm thread queue should be defined as a per-scope queue instead of a per-registration queue, as there's no registration for first-time registrations. I'll check whether there are any further problems to solve in order to spec it this way. Regarding maintaining multiple queues, do you agree to keep them? Given the blocking waits during the register/update job steps (e.g. fetching the script, waiting for oninstall execution, etc.), I think it makes sense to allow the next request to have a chance to replace the current job.
Yeah, I think multiple queues are OK. I think we just have to be very cautious that any job waiting in a queue references only transient parameters and not shared state.
I'm effectively implementing this in gecko in this bug: |
So, having implemented this, I can see there are still some races:
The end result is we lose the registration of script B. To help close this race I'm trying to check for changes in the newest worker's script URL right before queuing the install job from an update, but I think there is still the potential for a race. If the install were performed as part of the same job as the update, this race would not exist: we could not start the next update until the previous install had completed.
Actually, to better fix this race I'm going to pass a value into the install job indicating whether the script URL can change or not. That way the update job can indicate that changing the script is not permitted, and the install job can abort. Not sure that's the best way to represent this in the spec.
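A minimal sketch of that flag (names are hypothetical, not the spec's): the update job queues the install with "script may change" set to false, so the install aborts if a competing register() has swapped the script in the meantime.

```javascript
// Hypothetical install-job step guarded by a scriptMayChange flag.
// An update job passes scriptMayChange = false; a register job passes true.
function runInstallJob(registration, scriptURL, scriptMayChange) {
  if (!scriptMayChange && registration.scriptURL !== scriptURL) {
    // The registration's script changed between update and install:
    // abort instead of silently clobbering the newer script.
    return { aborted: true };
  }
  registration.scriptURL = scriptURL;
  return { aborted: false };
}
```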
The current behavior in the spec is that the second job replaces the first job. That is, in the example above, when the second update job gets into the synchronized steps, it first replaces the current installing worker by terminating the executing installing worker. Then this update job continues and advances to the Install phase if the conditions are all met. The terminated installing worker becomes redundant in Install step 16. I think I missed popping the top element from the queue in this error case. Will double-check it.
I'll review what chrome does now and revisit the spec algorithms to make them congruent. This issue merges #789. |
In my example above there is no installing worker yet; the first install job has not been executed yet. It's a race between running the install job and the second update job.
Understood your concern. So the race is between the first install and the second update.
I think the result would be what we expected, as the second job's worker (script A) would be the waiting worker at the end. The rationale I relied on to not break the state is that the second install will never run before the first install completes, so the second job's worker will be the final waiting worker. (The registration's last update check time is the only state that the Update algorithm modifies.) But I admit that this is subtle and not easy to reason about. Also, there can be a behavior mismatch when the second update aborts in the example scenario: the spec says the first job will continue and make script B the waiting worker, but Chrome may have already doomed the first job when the second update is running. I'm leaning toward adopting a single job queue. I believe that's what Chrome currently does and what @wanderview wanted, if that behavior can be spec'ed.
I think this is the summary of the expected behavior and what Chrome does now: #396 (comment). |
FYI, I started working on this on the nightly version in the "job-structure-revision" branch: https://github.com/slightlyoff/ServiceWorker/blob/job-structure-revision/spec/service_worker/index.html. It's WIP (447d2d6). Please don't review it in detail. A lot of things to update. But you may want to check it out to see if this revision will be what we'd want. |
@wanderview @mattto I just updated the spec as discussed: 7b49e85. The basic behavior is:
I intended to write it to conform to the current implementation. Please review it.
nit: There is a broken link to Run Job from the Finish Job section. |
This is something Chrome does, but I'm aiming to remove it. I believe it adds unnecessary complexity. It's simpler to just let the installing worker finish doing its thing or eventually time out. I originally added it to stop
How long of a timeout do you use for this? |
Right now, 30 seconds for executing sync JS ( |
Just to reiterate before the F2F: I'd be really happy if the spec were as simple as possible. The preemptive dooming of an installing worker was something the old Chrome implementation did to prevent getting stuck forever. The same can be accomplished with a simple timeout.
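The timeout alternative is easy to state: race the install against a timer rather than preemptively dooming the worker. A sketch with hypothetical names (the 30-second figure mentioned above would just be the `ms` parameter here):

```javascript
// Hypothetical helper: reject if the install promise does not settle in time.
function installWithTimeout(installPromise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('install timed out')), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([installPromise, timeout]).finally(() => clearTimeout(timer));
}
```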
I think the only case we should consider before removing it is a job with the same scope and a different script URL wanting to preempt.
F2F resolution: Single job queue may fix this. @wanderview to review. |
I've implemented this and landed it in FF48. I found a few stray problems that I opened up as separate issues. Overall, though, I think this makes the algorithm much less racy. Thanks! |
Consider a sequence of two register() calls for two worker scripts on the same scope. The first service worker rejects its install event waitUntil promise after some time, resulting in the registration being cleared. If the second register() is called while the first service worker is in the installing state but hasn't rejected yet, it may end up completing with a registration that was removed from the scope-to-registration map.
The order of events is as follows:
0. call register() for the first service worker
the registration created by the previous call.
This can happen because step 4.2 from Update can race with step 15 from Install.
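The racy clear can be modeled with a toy scope-to-registration map (everything below is a simplified illustration with hypothetical names, not the spec's algorithm): worker A's failed install unconditionally clears the map entry, even though a second register() has already handed that registration to the page.

```javascript
// Toy model of the race described above.
const scopeToRegistrationMap = new Map();

function register(scope, scriptURL) {
  let registration = scopeToRegistrationMap.get(scope);
  if (!registration) {
    registration = { scope };
    scopeToRegistrationMap.set(scope, registration);
  }
  registration.scriptURL = scriptURL;
  return registration;
}

// The problematic step: Clear Registration runs unconditionally when an
// install fails, without checking for a competing register()/update().
function clearRegistration(scope) {
  scopeToRegistrationMap.delete(scope);
}

register('/app/', 'a.js');              // 1. first register(); A starts installing
const regB = register('/app/', 'b.js'); // 2. second register() while A installs
clearRegistration('/app/');             // 3. A's waitUntil rejects; entry cleared
// regB now refers to a registration that is no longer in the map.
```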