
Conversation

Contributor

@ouchadam ouchadam commented Aug 15, 2022

Type of change

  • Feature
  • Bugfix
  • Technical
  • Other :

Content

  • The sync worker uses CoroutineWorker, which does not stop when the work completes, so we often append to the existing worker when scheduling background syncs. When appending, the previous worker's result payload is by default merged with (in our case, it overwrites) the request input data, meaning any failure state is set and kept until the worker stops, which in turn causes the SyncWorker to skip syncing.

  • The fix is to use an InputMerger that always favours the request input data over the previous worker's results, via the existing startChain extension (see the sketch below).
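
For illustration, a minimal sketch of the shape of the fix, using hypothetical names (the real startChain extension and its merger already live in the SDK):

import androidx.work.Data
import androidx.work.InputMerger
import androidx.work.OneTimeWorkRequest

// Sketch only, not the SDK's actual implementation: names are illustrative.
// Favour the fresh request input over any previous worker result when a new chain starts.
class NoMerger : InputMerger() {
    override fun merge(inputs: MutableList<Data>): Data = inputs.first()
}

fun OneTimeWorkRequest.Builder.startChain(isStartOfChain: Boolean): OneTimeWorkRequest.Builder {
    // Replace the default OverwritingInputMerger so appended requests keep their own input
    return if (isStartOfChain) setInputMerger(NoMerger::class.java) else this
}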

Motivation and context

Fixes #6836 - implementations of SessionSafeCoroutineWorker suffer from uncaught exceptions causing the worker to become stuck in the failure state.

We update the work params/input data on error but never reset the state back to the default arguments

Screenshots / GIFs

No UI changes

Tests

  • Add a forced uncaught crash to the SyncWorker
  • Install the app and attempt to receive messages whilst the app is in the background
  • Remove the forced crash and attempt to receive messages whilst the app is in the background
  • Notice the background jobs are still failing with the original error
  • Settings -> General -> Clear cache fixes the stuck state
MatrixWorkerFactory.createWorker for org.matrix.android.sdk.internal.session.sync.job.SyncWorker
SyncWorker Params(sessionId=8742ee0c2fc165532212ce2480e7bbca, timeout=0, delay=0, periodic=false, forceImmediate=false, lastFailureMessage=java.lang.RuntimeException: manual throw)
Work cancelled due to input error from parent: java.lang.RuntimeException: manual throw

Tested devices

  • Physical
  • Emulator
  • OS version(s): 31

-open fun doOnError(params: PARAM): Result {
+open fun doOnError(params: PARAM, failureMessage: String): Result {
     // Forward the error
     return Result.success(inputData)
Contributor Author

previously we were always using the inputData, which contained the lastFailureMessage; this meant that the next time the work ran, the data would be restored with the pre-existing failure, causing the worker to never execute doSafeWork

the fix is to consume the message and use the updated state for the result (so that it can be persisted without the lastFailureMessage)

Member

OK, so we no longer forward the error to the next worker, since it has been removed from the params parameter, right? (Maybe update the comment above.) The original idea of transmitting the error was to ensure that the last worker of the chain would handle it, without breaking the whole worker chain forever.

It's still strange to me that the lastFailureMessage gets restored when we start new work.

Contributor Author
@ouchadam ouchadam Aug 16, 2022

I was also surprised by this; I think it's caused by the lastFailureMessage creating a merge conflict in the input data

// requireBackgroundSync
val data = WorkerParamsFactory.toData(
        Params(
                sessionId = sessionId,
                timeout = serverTimeoutInSeconds,
                delay = 0L,
                periodic = false,
                random = UUID.randomUUID().toString()  // create a unique id for each work request
        )
)

Logging the work request data and the inputData on job execution

// No forced crash - with random

requireBackground sync params: Data {WORKER_PARAMS_JSON : {"random":"7675"}, }
doWork input: Data {WORKER_PARAMS_JSON : {"random":"7675"}, }
requireBackground sync params: Data {WORKER_PARAMS_JSON : {"random":"107a"}, }
doWork input: Data {WORKER_PARAMS_JSON : {"random":"107a"}, }
requireBackground sync params: Data {WORKER_PARAMS_JSON : {"random":"c502"}, }
doWork input: Data {WORKER_PARAMS_JSON : {"random":"c502"}, }
requireBackground sync params: Data {WORKER_PARAMS_JSON : {"random":"a22b"}, }
doWork input: Data {WORKER_PARAMS_JSON : {"random":"a22b"}, }


// Force crash

requireBackground sync params: Data {WORKER_PARAMS_JSON : {"random":"49b1"}, }
doWork input: Data {WORKER_PARAMS_JSON : {"random":"49b1"}, }
requireBackground sync params: Data {WORKER_PARAMS_JSON : {"random":"fca1"}, }
doWork input: Data {WORKER_PARAMS_JSON : {"random":"49b1","lastFailureMessage":"error"}, }
requireBackground sync params: Data {WORKER_PARAMS_JSON : {"random":"b7e6"}, }
doWork input: Data {WORKER_PARAMS_JSON : {"random":"49b1","lastFailureMessage":"error"}, }


// Force crash removed

requireBackground sync params: Data {WORKER_PARAMS_JSON : {"random":"97c9"}, }
doWork input: Data {WORKER_PARAMS_JSON : {"random":"49b1","lastFailureMessage":"error"}, }
requireBackground sync params: Data {WORKER_PARAMS_JSON : {"random":"3e36"}, }
doWork input: Data {WORKER_PARAMS_JSON : {"random":"49b1","lastFailureMessage":"error"}, }

notice that as soon as the lastFailureMessage is added, the request data gets ignored, which I assume is caused by the unspecified ordering of the default OverwritingInputMerger https://developer.android.com/reference/androidx/work/InputMerger
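
For context, the default merger conceptually behaves roughly like this (simplified sketch, not the actual AndroidX source):

import androidx.work.Data
import androidx.work.InputMerger

class OverwritingMergerSketch : InputMerger() {
    override fun merge(inputs: MutableList<Data>): Data {
        val output = Data.Builder()
        // Later inputs overwrite matching keys from earlier ones, and the relative order of
        // the request input vs the previous worker's output is unspecified, so the stale
        // WORKER_PARAMS_JSON from a failed result can win over the fresh request data.
        inputs.forEach { output.putAll(it) }
        return output.build()
    }
}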

Contributor Author

after a bit more investigation, the default behaviour is -

  • CoroutineWorker/ListenableWorker do not call stop when doWork completes, which means the worker often already "exists", causing ExistingWorkPolicy.APPEND_OR_REPLACE to use the append flow
  • When appending, Result payloads are merged with the input payloads (e.g. Result.success(foobar)); in our case that means the JSON key is overwritten and the original request payload is replaced by the worker result
  • Successful SyncWorker runs provide no payload (Result.success()), which is why they receive the new request payloads

another solution is to set an InputMerger which always favours the latest input specifically for the SyncWorker

import androidx.work.Data
import androidx.work.InputMerger

// Sketch: keep a single input instead of merging, so previous worker results cannot overwrite it
class Merger : InputMerger() {
    override fun merge(inputs: MutableList<Data>): Data {
        return inputs.first()
    }
}
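
If we went that route, wiring it up would presumably look something like this (sketch, using the standard request builder API and the data built in requireBackgroundSync above):

import androidx.work.OneTimeWorkRequestBuilder

// Sketch: attach the custom merger so the request input wins over previous worker results
val request = OneTimeWorkRequestBuilder<SyncWorker>()
        .setInputData(data)
        .setInputMerger(Merger::class.java)
        .build()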

I'm not aware of the historical details around the other workers, but as they're based on CoroutineWorker/ListenableWorker they may also suffer from the same issue of becoming stuck in the failure state if they use append policies

Member

another solution is to set an InputMerger which always favours the latest input specifically for the SyncWorker

This should be far less confusing IMHO. The API of the work manager is so strange...

Member

Yes, if there are chained workers we need to cancel using this mechanism, otherwise workers will be stuck if we return a Failed state. This worker API is so bad...

Member
@ganfra ganfra Aug 17, 2022

I think it will break some flows if we consume the failure: as we always return Success, if there is a chain of workers, the next one will run when it shouldn't...
We should probably let the Worker implementation decide whether it wants to consume the error or not, instead of doing that in SessionSafeCoroutineWorker

Contributor Author

In those cases, should we be using Result.Failure to cancel the chain? My understanding is that APPEND_OR_REPLACE would allow the chain to be recreated the next time a request comes in
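
i.e. something along these lines (sketch, assuming a WorkManager instance and a request built as above; the unique work name is illustrative):

// APPEND_OR_REPLACE appends to a pending chain, but if the existing chain has failed or
// been cancelled it replaces it with a fresh one on the next enqueue.
workManager.enqueueUniqueWork(
        "BG_SYNC_WORK", // illustrative unique name
        ExistingWorkPolicy.APPEND_OR_REPLACE,
        request
)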

Contributor Author

I'll rework the PR to make use of the chain extension (NoMerge); we can think about the worker flows as a separate ticket (as other workers may be getting stuck)

Contributor Author

updated 1fd1a4e

    }
} catch (throwable: Throwable) {
-    buildErrorResult(params, throwable.localizedMessage ?: "error")
+    buildErrorResult(params, "${throwable::class.java.name}: ${throwable.localizedMessage ?: "N/A error message"}")
Contributor Author

added extra details about the error to help debug in the future

-return Result.success(inputData)
-        .also { Timber.e("Work cancelled due to input error from parent") }
+return Result.success(WorkerParamsFactory.toData(paramClass, params))
+        .also { Timber.e("Work cancelled due to input error from parent: $failureMessage") }
Contributor Author

includes the failure message in the error log

@ouchadam ouchadam requested review from a team and yostyle and removed request for a team August 15, 2022 16:41
}

internal fun SessionWorkerParams.consumeFailure(): String? {
    return lastFailureMessage.also { lastFailureMessage = null }
}
Member

Maybe we could avoid making lastFailureMessage a var by using the data.copy fun. It could be painful since we have an interface here, so every impl would have to do the copy.
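
For reference, a sketch of what the copy-based version could look like for one implementation, assuming its Params is a data class (names illustrative):

// Sketch: return the failure message together with a cleared copy instead of mutating a var;
// every Params implementation would need its own version of this.
internal fun SyncWorker.Params.consumeFailure(): Pair<SyncWorker.Params, String?> {
    return copy(lastFailureMessage = null) to lastFailureMessage
}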

Contributor Author

I was trying to avoid that as we have quite a few worker params; happy to make the change if preferred!

Member

Let's keep the var like this for now. Thanks!

@ouchadam ouchadam force-pushed the feature/adm/missing-notifications branch from b03a1ee to 1a21ec5 Compare August 16, 2022 14:50
- the sync worker makes use of the CoroutineWorker which does not stop when the work completes; this means we often append to the existing worker. When appending, by default the previous worker result payload is merged with (or in our case overwrites) the input data, meaning any failure state is set and kept until the worker stops, which in turn causes the sync worker to never sync
- the fix is to make use of an input merger that always favours the request input data instead of the previous worker results
@ouchadam ouchadam force-pushed the feature/adm/missing-notifications branch from 1a21ec5 to 1fd1a4e Compare August 17, 2022 14:22
@sonarqubecloud

Kudos, SonarCloud Quality Gate passed!

Bug: A (0 Bugs)
Vulnerability: A (0 Vulnerabilities)
Security Hotspot: A (0 Security Hotspots)
Code Smell: A (0 Code Smells)

Coverage: 54.5%
Duplication: 0.0%

@ouchadam ouchadam added Z-NextRelease (For issues and PRs which should be included in the NextRelease) and PR-Small (PR with less than 20 updated lines) labels Aug 22, 2022
.setConstraints(WorkManagerProvider.workConstraints)
.setBackoffCriteria(BackoffPolicy.LINEAR, WorkManagerProvider.BACKOFF_DELAY_MILLIS, TimeUnit.MILLISECONDS)
.setInputData(data)
.startChain(true)
Contributor Author

This line is the fix; it matches the multiple event worker by ignoring previous results when appending new workers

Member
@bmarty bmarty left a comment

LGTM, thanks for the update. I am more comfortable with this version of the fix.

@ouchadam ouchadam merged commit 841ee33 into develop Aug 22, 2022
@ouchadam ouchadam deleted the feature/adm/missing-notifications branch August 22, 2022 16:20

Labels

A-Notifications, PR-Small (PR with less than 20 updated lines), Z-NextRelease (For issues and PRs which should be included in the NextRelease)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SessionSafeCoroutineWorker becomes stuck until a cache clear

4 participants