Missing notifications due to stuck background workers #6837
Conversation
```diff
- open fun doOnError(params: PARAM): Result {
+ open fun doOnError(params: PARAM, failureMessage: String): Result {
      // Forward the error
      return Result.success(inputData)
```
Previously we were always using the inputData, which contained the lastFailureMessage. This meant that the next time the work ran, the data would be restored with the pre-existing failure, causing the worker to never execute doSafeWork.
The fix is to consume the message and use the updated state for the result, so that it can be persisted without the lastFailureMessage.
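A self-contained sketch of that consume-the-failure idea; SessionWorkerParams and consumeFailure mirror the diff further down in this PR, while SyncParams and main are illustrative only:

```kotlin
interface SessionWorkerParams {
    var lastFailureMessage: String?
}

internal fun SessionWorkerParams.consumeFailure(): String? {
    // Read the failure once and clear it, so it is not persisted again
    // with the next Result
    return lastFailureMessage.also { lastFailureMessage = null }
}

// Hypothetical params implementation, for illustration only
data class SyncParams(override var lastFailureMessage: String? = null) : SessionWorkerParams

fun main() {
    val params = SyncParams(lastFailureMessage = "error")
    println(params.consumeFailure()) // "error" - handled exactly once
    println(params.consumeFailure()) // null - the next run starts clean
}
```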
OK, so we no longer forward the error to the next worker, since it has been removed from the params parameter, right? (Maybe update the comment above.) The original idea of transmitting the error was to ensure that the last worker of the chain would handle it, without breaking the whole worker chain forever.
It's still strange to me that the lastFailureMessage gets restored when we start a new work.
I was also surprised by this; I think it's caused by the lastFailureMessage creating a merge conflict in the input data.

```kotlin
// requireBackgroundSync
val data = WorkerParamsFactory.toData(
        Params(
                sessionId = sessionId,
                timeout = serverTimeoutInSeconds,
                delay = 0L,
                periodic = false,
                random = UUID.randomUUID().toString() // create a unique id for each work request
        )
)
```

Logging the work request data and the inputData on job execution:
```
// No forced crash - with random
requireBackground sync params: Data {WORKER_PARAMS_JSON : {"random":"7675"}, }
doWork input: Data {WORKER_PARAMS_JSON : {"random":"7675"}, }
requireBackground sync params: Data {WORKER_PARAMS_JSON : {"random":"107a"}, }
doWork input: Data {WORKER_PARAMS_JSON : {"random":"107a"}, }
requireBackground sync params: Data {WORKER_PARAMS_JSON : {"random":"c502"}, }
doWork input: Data {WORKER_PARAMS_JSON : {"random":"c502"}, }
requireBackground sync params: Data {WORKER_PARAMS_JSON : {"random":"a22b"}, }
doWork input: Data {WORKER_PARAMS_JSON : {"random":"a22b"}, }

// Force crash
requireBackground sync params: Data {WORKER_PARAMS_JSON : {"random":"49b1"}, }
doWork input: Data {WORKER_PARAMS_JSON : {"random":"49b1"}, }
requireBackground sync params: Data {WORKER_PARAMS_JSON : {"random":"fca1"}, }
doWork input: Data {WORKER_PARAMS_JSON : {"random":"49b1","lastFailureMessage":"error"}, }
requireBackground sync params: Data {WORKER_PARAMS_JSON : {"random":"b7e6"}, }
doWork input: Data {WORKER_PARAMS_JSON : {"random":"49b1","lastFailureMessage":"error"}, }

// Force crash removed
requireBackground sync params: Data {WORKER_PARAMS_JSON : {"random":"97c9"}, }
doWork input: Data {WORKER_PARAMS_JSON : {"random":"49b1","lastFailureMessage":"error"}, }
requireBackground sync params: Data {WORKER_PARAMS_JSON : {"random":"3e36"}, }
doWork input: Data {WORKER_PARAMS_JSON : {"random":"49b1","lastFailureMessage":"error"}, }
```
Notice that as soon as the lastFailureMessage is introduced, the request data is ignored, which I assume is caused by the unspecified ordering of the default OverwritingInputMerger: https://developer.android.com/reference/androidx/work/InputMerger
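A rough illustration of that merge behaviour (the payloads are taken from the logs above; the ordering of the two putAll calls is exactly what OverwritingInputMerger leaves unspecified):

```kotlin
import androidx.work.Data

fun main() {
    val newRequest = Data.Builder()
            .putString("WORKER_PARAMS_JSON", """{"random":"fca1"}""")
            .build()
    val previousResult = Data.Builder()
            .putString("WORKER_PARAMS_JSON", """{"random":"49b1","lastFailureMessage":"error"}""")
            .build()

    // OverwritingInputMerger keeps the last value written per key; if the
    // previous worker's Result happens to be applied after the new request,
    // the new request payload is lost entirely
    val merged = Data.Builder()
            .putAll(newRequest)
            .putAll(previousResult) // the stale failure wins in this ordering
            .build()
    println(merged)
}
```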
After a bit more investigation, the default behaviour is:
- CoroutineWorker/ListenableWorker do not call stop when doWork completes, which means the worker often already "exists", causing ExistingWorkPolicy.APPEND_OR_REPLACE to use the append flow (see the enqueue sketch below)
- When appending, Result payloads merge with the input payloads (eg Result.success(foobar)); in our case it means overwriting the JSON key and replacing the original request payload with the worker result
- Successful SyncWorkers provide no payload (Result.success()), hence why they receive the new request payloads
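For context, a hedged sketch of the unique-work enqueue that ends up in that append flow; the work name "sync-work" is illustrative and SyncWorker stands in for the project's worker class:

```kotlin
import androidx.work.Data
import androidx.work.ExistingWorkPolicy
import androidx.work.OneTimeWorkRequestBuilder
import androidx.work.WorkManager

fun scheduleSync(workManager: WorkManager, data: Data) {
    val request = OneTimeWorkRequestBuilder<SyncWorker>()
            .setInputData(data)
            .build()
    // If a previous SyncWorker still "exists" (CoroutineWorker does not stop
    // after doWork completes), APPEND_OR_REPLACE appends and the input merge
    // described above kicks in
    workManager
            .beginUniqueWork("sync-work", ExistingWorkPolicy.APPEND_OR_REPLACE, request)
            .enqueue()
}
```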
another solution is to set an InputMerger which always favours the latest input specifically for the SyncWorker
```kotlin
class Merger : InputMerger() {
    override fun merge(inputs: MutableList<Data>): Data {
        return inputs.first()
    }
}
```

I'm not aware of the historical details around the other workers, but as they're based on CoroutineWorker/ListenableWorker they may also suffer from the same issue of becoming stuck in the failure state if they use append policies.
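Wiring such a merger onto the request would look roughly like this (a sketch reusing the Merger class above; setInputMerger is the standard WorkManager builder API):

```kotlin
val request = OneTimeWorkRequestBuilder<SyncWorker>()
        .setInputData(data)
        // Ignore previous Result payloads when this request is appended
        .setInputMerger(Merger::class.java)
        .build()
```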
> another solution is to set an InputMerger which always favours the latest input specifically for the SyncWorker

This should be far less confusing IMHO. The API of the work manager is so strange...
Yes, if there are chained workers we need to cancel using this mechanism, otherwise workers will be stuck if we return a Failed state. This worker API is so bad...
I think it will break some flows if we consume the failure: as we always return Success, if there is a chain of workers, the next one will run when it shouldn't...
We should probably let the Worker implementation decide whether it wants to consume the error or not, instead of doing that in SessionSafeCoroutineWorker.
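One hedged shape for that suggestion, building on the doOnError signature from the diff above; consumesFailure is a hypothetical hook, not existing code:

```kotlin
// Sketch only: each worker decides whether the failure is consumed
// (downstream workers keep running) or forwarded (the chain stops)
open fun doOnError(params: PARAM, failureMessage: String): Result {
    return if (consumesFailure()) {
        // Persist the cleaned params and let the chain continue
        Result.success(WorkerParamsFactory.toData(paramClass, params))
    } else {
        // Propagate the failure; WorkManager cancels the rest of the chain
        Result.failure()
    }
}

// Hypothetical override point; defaults to forwarding the failure
protected open fun consumesFailure(): Boolean = false
```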
In those cases, should we be using Result.failure() to cancel the chain? My understanding is that APPEND_OR_REPLACE would allow the chain to be recreated the next time a request comes in.
I'll rework the PR to make use of the chain extension (NoMerge), we can think about the worker flows as a separate ticket (as other workers may be getting stuck)
updated 1fd1a4e
```diff
      }
  } catch (throwable: Throwable) {
-     buildErrorResult(params, throwable.localizedMessage ?: "error")
+     buildErrorResult(params, "${throwable::class.java.name}: ${throwable.localizedMessage ?: "N/A error message"}")
```
added extra details about the error to help debug in the future
```diff
- return Result.success(inputData)
-         .also { Timber.e("Work cancelled due to input error from parent") }
+ return Result.success(WorkerParamsFactory.toData(paramClass, params))
+         .also { Timber.e("Work cancelled due to input error from parent: $failureMessage") }
```
includes the failure message in the error log
```diff
  }
+
+ internal fun SessionWorkerParams.consumeFailure(): String? {
+     return lastFailureMessage.also { lastFailureMessage = null }
+ }
```
Maybe we could avoid making lastFailureMessage a var by using the data class copy function. It could be painful since we have an interface here, so every impl would have to do the copy.
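A sketch of that copy-based alternative, with a hypothetical SyncParams implementation to show the per-impl boilerplate:

```kotlin
interface SessionWorkerParams {
    val lastFailureMessage: String?
    fun withoutFailure(): SessionWorkerParams
}

data class SyncParams(
        val sessionId: String,
        override val lastFailureMessage: String? = null
) : SessionWorkerParams {
    // Every one of the many worker params classes would need this copy
    override fun withoutFailure() = copy(lastFailureMessage = null)
}
```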
I was trying to avoid that as we have quite a few worker params, happy to make the change if preferred!
Let's keep the var like this for now. Thanks!
Force-pushed from b03a1ee to 1a21ec5
- The sync worker makes use of the CoroutineWorker, which does not stop when the work completes; this means we often append to the existing worker. When appending, by default the previous worker's result payload is merged with (or in our case overwrites) the input data, meaning any failure state is set and kept until the worker stops, which in turn causes the sync worker to never sync. The fix is to make use of an input merger that always favours the request input data instead of the previous worker results.
Force-pushed from 1a21ec5 to 1fd1a4e
Kudos, SonarCloud Quality Gate passed!
```diff
  .setConstraints(WorkManagerProvider.workConstraints)
  .setBackoffCriteria(BackoffPolicy.LINEAR, WorkManagerProvider.BACKOFF_DELAY_MILLIS, TimeUnit.MILLISECONDS)
  .setInputData(data)
+ .startChain(true)
```
This line is the fix; it matches the multiple-event worker by ignoring previous results when appending new workers.
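For reference, a plausible shape for such an extension, assuming it swaps in a first-input merger like the NoMerge one mentioned earlier; this is a sketch, not necessarily the project's actual code:

```kotlin
import androidx.work.Data
import androidx.work.InputMerger
import androidx.work.OneTimeWorkRequest

class NoMerger : InputMerger() {
    // Keep only the request's own input, ignoring previous Result payloads
    override fun merge(inputs: MutableList<Data>): Data = inputs.first()
}

fun OneTimeWorkRequest.Builder.startChain(startChain: Boolean): OneTimeWorkRequest.Builder {
    if (startChain) {
        setInputMerger(NoMerger::class.java)
    }
    return this
}
```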
bmarty left a comment:
LGTM, thanks for the update. I am more comfortable with this version of the fix.








Type of change
Content
The sync worker makes use of the CoroutineWorker, which does not stop when the work completes; this means we often append to the existing worker when scheduling background syncs. When appending, by default the previous worker's result payload is merged with (or in our case overwrites) the input data, meaning any failure state is set and kept until the worker stops, which in turn causes the SyncWorker to skip syncing.

The fix is to make use of an InputMerger that always favours the request input data instead of the previous worker results, via the existing startChain extension.

Motivation and context
Fixes #6836 - implementations of SessionSafeCoroutineWorker suffer from uncaught exceptions causing the worker to become stuck in the failure state. We update the work params/input data on error but never reset the state back to the default arguments.
Screenshots / GIFs
No UI changes
Tests
- SyncWorker
- Settings -> General -> Clear cache fixes the stuck state

Tested devices