Fail to reprocess records from previous version #5581

saig0 · 2020-10-13T11:42:14Z

Describe the bug
After upgrading Zeebe to a new version, a partition can fail to start and stay unhealthy.

The issue is caused by a conceptual problem in the reprocessing. On reprocessing, the broker restores/rehydrates the state (i.e. RocksDB) by reading the records on the log stream and processing them again (without writing any follow-up records). If the behavior of the workflow engine changes in the new version (e.g. because of a non-user-facing refactoring or a bug fix), then reprocessing may write data to the state that doesn't match the records on the log stream. As a result, the state may be corrupted, or records may fail to be reprocessed (i.e. because their preconditions are not fulfilled).
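To make the failure mode concrete, here is a minimal sketch of it. This is a toy model and not Zeebe's actual engine code: the `Engine` class, the `FORK`/`COMPLETE` intents, and the `tokensPerFork` parameter are all hypothetical names chosen for illustration. The log was written by an "old" engine that creates two tokens per fork; a "new" engine that creates only one token per fork cannot replay the same log.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReprocessingSketch {

  /** Hypothetical log record: an intent plus the key of the instance it belongs to. */
  static final class LogRecord {
    final String intent;
    final long instanceKey;

    LogRecord(String intent, long instanceKey) {
      this.intent = intent;
      this.instanceKey = instanceKey;
    }
  }

  /** Toy engine that rebuilds a token count per instance by replaying the log. */
  static final class Engine {
    // Hypothetical behavior difference between two engine versions.
    private final int tokensPerFork;
    private final Map<Long, Integer> tokens = new HashMap<>();

    Engine(int tokensPerFork) {
      this.tokensPerFork = tokensPerFork;
    }

    /** Applies records to the state only; no follow-up records are written. */
    void replay(List<LogRecord> log) {
      for (LogRecord record : log) {
        int current = tokens.getOrDefault(record.instanceKey, 0);
        if ("FORK".equals(record.intent)) {
          tokens.put(record.instanceKey, current + tokensPerFork);
        } else if ("COMPLETE".equals(record.intent)) {
          if (current <= 0) {
            // Precondition not fulfilled: the state no longer matches the log.
            throw new IllegalStateException(
                "No active token left for instance " + record.instanceKey);
          }
          tokens.put(record.instanceKey, current - 1);
        } else {
          throw new IllegalArgumentException("Unknown intent " + record.intent);
        }
      }
    }
  }

  /** A log as the old engine (two tokens per fork) would have produced it. */
  static List<LogRecord> sampleLog() {
    return Arrays.asList(
        new LogRecord("FORK", 1L),
        new LogRecord("COMPLETE", 1L),
        new LogRecord("COMPLETE", 1L));
  }

  static boolean replaySucceeds(Engine engine, List<LogRecord> log) {
    try {
      engine.replay(log);
      return true;
    } catch (IllegalStateException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("old engine: " + replaySucceeds(new Engine(2), sampleLog()));
    System.out.println("new engine: " + replaySucceeds(new Engine(1), sampleLog()));
  }
}
```

The log itself is unchanged between the two runs; only the engine behavior differs, which is enough to make the second `COMPLETE` record fail its precondition.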

The issue can be avoided if a snapshot is created before the upgrade. If no new records are processed after the snapshot, then no reprocessing is performed.
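The effect of the snapshot can be sketched as follows. This is a simplified illustration, assuming that a snapshot covers every record up to and including a given log position; `positionsToReprocess` and its parameters are hypothetical names, not part of Zeebe's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SnapshotSketch {

  /**
   * Returns the positions of already-processed records that are not covered
   * by the snapshot and therefore have to be reprocessed on restart.
   */
  static List<Long> positionsToReprocess(List<Long> processedPositions, long snapshotPosition) {
    List<Long> result = new ArrayList<>();
    for (long position : processedPositions) {
      if (position > snapshotPosition) {
        result.add(position);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Long> processed = Arrays.asList(1L, 2L, 3L, 4L);

    // Snapshot taken at position 2: records 3 and 4 are reprocessed.
    System.out.println(positionsToReprocess(processed, 2L)); // prints [3, 4]

    // Snapshot taken at the last processed position: nothing is reprocessed.
    System.out.println(positionsToReprocess(processed, 4L)); // prints []
  }
}
```

With an empty list, the new version never replays records that were processed by the old version, so the mismatch cannot occur.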

To Reproduce
See #5251, #5268, #5393

Expected behavior
I can upgrade Zeebe to a new version and continue with my existing data.

Log/Stacktrace
See the linked issues.

Environment:

OS: [e.g. Linux]
Zeebe Version: 0.24.2, 0.25.0-SNAPSHOT
Configuration: [e.g. exporters etc.]


saig0 · 2020-10-13T11:49:40Z

To improve the situation, we implemented a check in the reprocessing that detects such inconsistencies (#5381), and an endpoint to trigger the creation of a snapshot (#5405).

npepinpe · 2020-11-09T15:18:14Z

I'm assigning it to @saig0 for now so I can track who's working on it, and so it's clear that someone is working on it.

npepinpe · 2021-01-12T09:22:06Z

Once we start on an issue breakdown, please create a milestone instead and close this issue (and reference the milestone).

strawhat5 · 2021-01-13T05:40:45Z

Just to add a few more findings from my side: the issue can happen not only while upgrading, but also during a simple restart of a broker. I came across it in 0.25.3, where one of the brokers got restarted and fell into EndlessRetryStrategy hell.

2021-01-04 07:13:24.041 [Broker-0-StreamProcessor-1] [Broker-0-zb-actors-1] ERROR io.zeebe.util.retry.EndlessRetryStrategy - Catched exception class java.lang.IllegalStateException with message Expected to find a workflow deployed with key '2251799814815370' but not found., will retry...
java.lang.IllegalStateException: Expected to find a workflow deployed with key '2251799814815370' but not found.
	at io.zeebe.engine.state.deployment.WorkflowState.getFlowElement(WorkflowState.java:99) ~[zeebe-workflow-engine-0.25.3.jar:0.25.3]
	at io.zeebe.engine.processing.bpmn.BpmnStreamProcessor.getElement(BpmnStreamProcessor.java:144) ~[zeebe-workflow-engine-0.25.3.jar:0.25.3]
	at io.zeebe.engine.processing.bpmn.BpmnStreamProcessor.processRecord(BpmnStreamProcessor.java:89) ~[zeebe-workflow-engine-0.25.3.jar:0.25.3]
	at io.zeebe.engine.processing.streamprocessor.TypedRecordProcessor.processRecord(TypedRecordProcessor.java:51) ~[zeebe-workflow-engine-0.25.3.jar:0.25.3]
	at io.zeebe.engine.processing.streamprocessor.ReProcessingStateMachine.lambda$chooseOperationForEvent$5(ReProcessingStateMachine.java:322) ~[zeebe-workflow-engine-0.25.3.jar:0.25.3]
	at io.zeebe.db.impl.rocksdb.transaction.ZeebeTransaction.run(ZeebeTransaction.java:79) ~[zeebe-db-0.25.3.jar:0.25.3]
	at io.zeebe.engine.processing.streamprocessor.ReProcessingStateMachine.lambda$processUntilDone$2(ReProcessingStateMachine.java:297) ~[zeebe-workflow-engine-0.25.3.jar:0.25.3]
	at io.zeebe.util.retry.ActorRetryMechanism.run(ActorRetryMechanism.java:36) ~[zeebe-util-0.25.3.jar:0.25.3]
	at io.zeebe.util.retry.EndlessRetryStrategy.run(EndlessRetryStrategy.java:50) ~[zeebe-util-0.25.3.jar:0.25.3]
	at io.zeebe.util.sched.ActorJob.invoke(ActorJob.java:73) [zeebe-util-0.25.3.jar:0.25.3]
	at io.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-0.25.3.jar:0.25.3]
	at io.zeebe.util.sched.ActorTask.execute(ActorTask.java:122) [zeebe-util-0.25.3.jar:0.25.3]
	at io.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:94) [zeebe-util-0.25.3.jar:0.25.3]
	at io.zeebe.util.sched.ActorThread.doWork(ActorThread.java:78) [zeebe-util-0.25.3.jar:0.25.3]
	at io.zeebe.util.sched.ActorThread.run(ActorThread.java:191) [zeebe-util-0.25.3.jar:0.25.3]

saig0 · 2021-01-13T06:08:49Z

@strawhat5 thanks for sharing 👍 I'm not aware of this bug. It doesn't seem to be directly related to this issue. Please open a new issue if you have more information about it or about how it can be reproduced.

strawhat5 · 2021-01-13T07:09:09Z

It is related to #5251, as we have the same parallel workflow deployed here as in #5251. In this case, though, I did not perform any version upgrade: one of the broker pods randomly restarted and went into an inconsistent state during reprocessing.

I just wanted to point out, regarding this line, that the reprocessing failure can happen even during a normal restart:

Describe the bug
When upgrading Zeebe to a new version

saig0 · 2021-01-26T13:35:30Z

The new concept of the workflow processing is described here: ZEP 004
The progress is tracked by the milestone: https://github.com/zeebe-io/zeebe/milestone/51

saig0 added the kind/bug (Categorizes an issue or PR as a bug), scope/broker (Marks an issue or PR to appear in the broker section of the changelog), Impact: Availability, and severity/high (Marks a bug as having a noticeable impact on the user with no known workaround) labels Oct 13, 2020

saig0 mentioned this issue Oct 13, 2020

Write an Zeebe upgrade guide #5534

Merged

8 tasks

npepinpe added Priority: High and removed Status: Needs Priority labels Oct 14, 2020

npepinpe mentioned this issue Nov 6, 2020

Forking parallel gateway is not correctly reprocessed on upgrade from 0.23 to 0.24 #5268

Closed

npepinpe assigned saig0 Nov 9, 2020

npepinpe added Status: Ready and removed Status: Planned labels Dec 14, 2020

npepinpe added Status: In Progress and removed Status: Ready labels Jan 12, 2021

saig0 closed this as completed Jan 26, 2021

caugustus-ezcater mentioned this issue Feb 2, 2021

Cluster starts with no leader for partition #3670

Closed

npepinpe mentioned this issue Feb 5, 2021

java.lang.IllegalStateException: Expected the active token count to be positive but was -1 #5251

Closed

github-merge-queue bot pushed a commit that referenced this issue Mar 14, 2024

fix(migration): refresh target indices before getting docs count (#5581)

9a479e2
