
Conversation

Contributor

@ouchadam ouchadam commented Aug 15, 2022

Type of change

  • Feature
  • Bugfix
  • Technical
  • Other :

Content

  • The sync worker uses CoroutineWorker, which does not stop when the work completes, so we often append to the existing worker when scheduling background syncs. When appending, the previous worker's result payload is by default merged with (in our case, it overwrites) the request input data, meaning any failure state is set and kept until the worker stops, which in turn causes the SyncWorker to skip syncing.

  • The fix is to use an InputMerger that always favours the request input data over the previous worker's results, via the existing startChain extension (see the sketch below).
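
For illustration, a minimal sketch of the shape of the fix, using hypothetical names (the real startChain extension and its merger already live in the SDK):

import androidx.work.Data
import androidx.work.InputMerger
import androidx.work.OneTimeWorkRequest

// Sketch only, not the SDK's actual implementation: names are illustrative.
// Favour the fresh request input over any previous worker result when a new chain starts.
class NoMerger : InputMerger() {
    override fun merge(inputs: MutableList<Data>): Data = inputs.first()
}

fun OneTimeWorkRequest.Builder.startChain(isStartOfChain: Boolean): OneTimeWorkRequest.Builder {
    // Replace the default OverwritingInputMerger so appended requests keep their own input
    return if (isStartOfChain) setInputMerger(NoMerger::class.java) else this
}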

Motivation and context

Fixes #6836 - implementations of SessionSafeCoroutineWorker suffer from uncaught exceptions causing the worker to become stuck in the failure state.

We update the work params/input data on error but never reset the state back to the default arguments

Screenshots / GIFs

No UI changes

Tests

  • Add a forced uncaught crash to the SyncWorker
  • Install the app and attempt to receive messages whilst the app is in the background
  • Remove the forced crash and attempt to receive messages whilst the app is in the background
  • Notice the background jobs are still failing with the original error
  • Settings -> General -> Clear cache fixes the stuck state
MatrixWorkerFactory.createWorker for org.matrix.android.sdk.internal.session.sync.job.SyncWorker
SyncWorker Params(sessionId=8742ee0c2fc165532212ce2480e7bbca, timeout=0, delay=0, periodic=false, forceImmediate=false, lastFailureMessage=java.lang.RuntimeException: manual throw)
Work cancelled due to input error from parent: java.lang.RuntimeException: manual throw

Tested devices

  • Physical
  • Emulator
  • OS version(s): 31

-open fun doOnError(params: PARAM): Result {
+open fun doOnError(params: PARAM, failureMessage: String): Result {
     // Forward the error
     return Result.success(inputData)
Contributor Author

previously we were always using the inputData, which contained the lastFailureMessage; this meant that the next time the work ran, the data would be restored with the pre-existing failure, causing the worker to never execute doSafeWork

the fix is to consume the message and use the updated state for the result (so that it can be persisted without the lastFailureMessage)

Member

OK, so we no longer forward the error to the next worker, since it has been removed from the params parameter, right? (Maybe update the comment above.) The original idea of transmitting the error was to ensure that the last worker of the chain would handle it, without breaking the whole worker chain forever.

It's still strange to me that the lastFailureMessage gets restored when we start new work.

Contributor Author
@ouchadam ouchadam Aug 16, 2022

I was also surprised by this; I think it's caused by the lastFailureMessage creating a merge conflict in the input data

// requireBackgroundSync
val data = WorkerParamsFactory.toData(
        Params(
                sessionId = sessionId,
                timeout = serverTimeoutInSeconds,
                delay = 0L,
                periodic = false,
                random = UUID.randomUUID().toString()  // create a unique id for each work request
        )
)

Logging the work request data and the inputData on job execution

// No forced crash - with random

requireBackground sync params: Data {WORKER_PARAMS_JSON : {"random":"7675"}, }
doWork input: Data {WORKER_PARAMS_JSON : {"random":"7675"}, }
requireBackground sync params: Data {WORKER_PARAMS_JSON : {"random":"107a"}, }
doWork input: Data {WORKER_PARAMS_JSON : {"random":"107a"}, }
requireBackground sync params: Data {WORKER_PARAMS_JSON : {"random":"c502"}, }
doWork input: Data {WORKER_PARAMS_JSON : {"random":"c502"}, }
requireBackground sync params: Data {WORKER_PARAMS_JSON : {"random":"a22b"}, }
doWork input: Data {WORKER_PARAMS_JSON : {"random":"a22b"}, }


// Force crash

requireBackground sync params: Data {WORKER_PARAMS_JSON : {"random":"49b1"}, }
doWork input: Data {WORKER_PARAMS_JSON : {"random":"49b1"}, }
requireBackground sync params: Data {WORKER_PARAMS_JSON : {"random":"fca1"}, }
doWork input: Data {WORKER_PARAMS_JSON : {"random":"49b1","lastFailureMessage":"error"}, }
requireBackground sync params: Data {WORKER_PARAMS_JSON : {"random":"b7e6"}, }
doWork input: Data {WORKER_PARAMS_JSON : {"random":"49b1","lastFailureMessage":"error"}, }


// Force crash removed

requireBackground sync params: Data {WORKER_PARAMS_JSON : {"random":"97c9"}, }
doWork input: Data {WORKER_PARAMS_JSON : {"random":"49b1","lastFailureMessage":"error"}, }
requireBackground sync params: Data {WORKER_PARAMS_JSON : {"random":"3e36"}, }
doWork input: Data {WORKER_PARAMS_JSON : {"random":"49b1","lastFailureMessage":"error"}, }

notice that as soon as the lastFailureMessage is added, the request data gets ignored, which I assume is caused by the unspecified ordering of the default OverwritingInputMerger https://developer.android.com/reference/androidx/work/InputMerger
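
For context, the default merger conceptually behaves roughly like this (simplified sketch, not the actual AndroidX source):

import androidx.work.Data
import androidx.work.InputMerger

class OverwritingMergerSketch : InputMerger() {
    override fun merge(inputs: MutableList<Data>): Data {
        val output = Data.Builder()
        // Later inputs overwrite matching keys from earlier ones, and the relative order of
        // the request input vs the previous worker's output is unspecified, so the stale
        // WORKER_PARAMS_JSON from a failed result can win over the fresh request data.
        inputs.forEach { output.putAll(it) }
        return output.build()
    }
}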

Contributor Author

after a bit more investigation, the default behaviour is -

  • CoroutineWorker/ListenableWorker do not call stop when doWork completes, which means the worker often already "exists", causing ExistingWorkPolicy.APPEND_OR_REPLACE to use the append flow
  • When appending, Result payloads are merged with the input payloads (e.g. Result.success(foobar)); in our case that means the JSON key is overwritten and the original request payload is replaced by the worker result
  • Successful SyncWorker runs provide no payload (Result.success()), which is why they receive the new request payloads

another solution is to set an InputMerger which always favours the latest input specifically for the SyncWorker

import androidx.work.Data
import androidx.work.InputMerger

// Sketch: keep a single input instead of merging, so previous worker results cannot overwrite it
class Merger : InputMerger() {
    override fun merge(inputs: MutableList<Data>): Data {
        return inputs.first()
    }
}
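
If we went that route, wiring it up would presumably look something like this (sketch, using the standard request builder API and the data built in requireBackgroundSync above):

import androidx.work.OneTimeWorkRequestBuilder

// Sketch: attach the custom merger so the request input wins over previous worker results
val request = OneTimeWorkRequestBuilder<SyncWorker>()
        .setInputData(data)
        .setInputMerger(Merger::class.java)
        .build()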

I'm not aware of the historical details around the other workers, but as they're based on CoroutineWorker/ListenableWorker they may also suffer from the same issue of becoming stuck in the failure state if they use append policies

Member

another solution is to set an InputMerger which always favours the latest input specifically for the SyncWorker

This should be far less confusing IMHO. The API of the work manager is so strange...

Member

Yes, if there are chained workers we need to cancel using this mechanism, otherwise workers will be stuck if we return a Failed state. This worker API is so bad...

Member
@ganfra ganfra Aug 17, 2022

I think it will break some flows if we consume the failure: as we always return Success, if there is a chain of workers, the next one will run when it shouldn't...
We should probably let the Worker implementation decide whether it wants to consume the error or not, instead of doing that in SessionSafeCoroutineWorker

Contributor Author

In those cases, should we be using Result.Failure to cancel the chain? My understanding is that APPEND_OR_REPLACE would allow the chain to be recreated the next time a request comes in
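
i.e. something along these lines (sketch, assuming a WorkManager instance and a request built as above; the unique work name is illustrative):

// APPEND_OR_REPLACE appends to a pending chain, but if the existing chain has failed or
// been cancelled it replaces it with a fresh one on the next enqueue.
workManager.enqueueUniqueWork(
        "BG_SYNC_WORK", // illustrative unique name
        ExistingWorkPolicy.APPEND_OR_REPLACE,
        request
)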

Contributor Author

I'll rework the PR to make use of the chain extension (NoMerge); we can think about the worker flows as a separate ticket (as other workers may be getting stuck)

Contributor Author

updated 1fd1a4e

    }
} catch (throwable: Throwable) {
-    buildErrorResult(params, throwable.localizedMessage ?: "error")
+    buildErrorResult(params, "${throwable::class.java.name}: ${throwable.localizedMessage ?: "N/A error message"}")
Contributor Author

added extra details about the error to help debug in the future

-return Result.success(inputData)
-        .also { Timber.e("Work cancelled due to input error from parent") }
+return Result.success(WorkerParamsFactory.toData(paramClass, params))
+        .also { Timber.e("Work cancelled due to input error from parent: $failureMessage") }
Contributor Author

includes the failure message in the error log

@ouchadam ouchadam requested review from a team and yostyle and removed request for a team August 15, 2022 16:41
}

internal fun SessionWorkerParams.consumeFailure(): String? {
    return lastFailureMessage.also { lastFailureMessage = null }
}
Member

Maybe we could avoid making lastFailureMessage a var by using the data.copy fun. It could be painful since we have an interface here, so every impl would have to do the copy.
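
For reference, a sketch of what the copy-based version could look like for one implementation, assuming its Params is a data class (names illustrative):

// Sketch: return the failure message together with a cleared copy instead of mutating a var;
// every Params implementation would need its own version of this.
internal fun SyncWorker.Params.consumeFailure(): Pair<SyncWorker.Params, String?> {
    return copy(lastFailureMessage = null) to lastFailureMessage
}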

Contributor Author

I was trying to avoid that as we have quite a few worker params; happy to make the change if preferred!

Member

Let's keep the var like this for now. Thanks!

@ouchadam ouchadam force-pushed the feature/adm/missing-notifications branch from b03a1ee to 1a21ec5 Compare August 16, 2022 14:50
- the sync worker makes use of the CoroutineWorker which does not stop when the work completes; this means we often append to the existing worker. When appending, by default the previous worker result payload is merged with (or in our case overwrites) the input data, meaning any failure state is set and kept until the worker stops, which in turn causes the sync worker to never sync
- the fix is to make use of an input merger that always favours the request input data instead of the previous worker results
@ouchadam ouchadam force-pushed the feature/adm/missing-notifications branch from 1a21ec5 to 1fd1a4e Compare August 17, 2022 14:22
@sonarqubecloud

Kudos, SonarCloud Quality Gate passed!

Bug: A (0 Bugs)
Vulnerability: A (0 Vulnerabilities)
Security Hotspot: A (0 Security Hotspots)
Code Smell: A (0 Code Smells)

Coverage: 54.5%
Duplication: 0.0%

@ouchadam ouchadam added Z-NextRelease (For issues and PRs which should be included in the NextRelease) and PR-Small (PR with less than 20 updated lines) labels Aug 22, 2022
.setConstraints(WorkManagerProvider.workConstraints)
.setBackoffCriteria(BackoffPolicy.LINEAR, WorkManagerProvider.BACKOFF_DELAY_MILLIS, TimeUnit.MILLISECONDS)
.setInputData(data)
.startChain(true)
Contributor Author

This line is the fix; it matches the multiple event worker by ignoring previous results when appending new workers

Member
@bmarty bmarty left a comment

LGTM, thanks for the update. I am more comfortable with this version of the fix.

@ouchadam ouchadam merged commit 841ee33 into develop Aug 22, 2022
@ouchadam ouchadam deleted the feature/adm/missing-notifications branch August 22, 2022 16:20

Labels

A-Notifications, PR-Small (PR with less than 20 updated lines), Z-NextRelease (For issues and PRs which should be included in the NextRelease)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SessionSafeCoroutineWorker becomes stuck until a cache clear

4 participants