Fail to reprocess records from previous version #5581

Closed
saig0 opened this issue Oct 13, 2020 · 7 comments

Labels
  • kind/bug: Categorizes an issue or PR as a bug
  • scope/broker: Marks an issue or PR to appear in the broker section of the changelog
  • severity/high: Marks a bug as having a noticeable impact on the user with no known workaround

Comments

@saig0
Member

saig0 commented Oct 13, 2020

Describe the bug
When upgrading Zeebe to a new version, a partition can fail to start and remain unhealthy after the upgrade.

The issue is caused by a conceptual problem in the reprocessing. On reprocessing, the broker restores/rehydrates the data (i.e. RocksDB) by reading the records on the log stream and processing them again (without writing any follow-up records). If the behavior of the workflow engine changes in the new version (e.g. through a non-user-facing refactoring or a bug fix), then reprocessing may write data to the state that doesn't match the records on the log stream. As a result, the state may be corrupted, or records may not be reprocessed (i.e. their preconditions are not fulfilled).
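
To illustrate the mismatch, here is a minimal sketch. It is not Zeebe's actual code, and every name in it (`ReprocessingSketch`, `LogRecord`, `Engine`, `reprocess`) is hypothetical. A toy log written by an "old" engine, whose keys started at 1, is replayed by an engine whose key generation changed. The replayed state then no longer contains the key that a later record on the log refers to, so that record's precondition fails.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ReprocessingSketch {

  /** Toy log record: a CREATE command, or a COMPLETE command referring to an earlier key. */
  static final class LogRecord {
    final String intent;
    final long key; // only meaningful for COMPLETE

    LogRecord(String intent, long key) {
      this.intent = intent;
      this.key = key;
    }
  }

  /** Hypothetical engine that rebuilds its state by processing the log again. */
  static final class Engine {
    private final long firstKey;

    Engine(long firstKey) {
      this.firstKey = firstKey;
    }

    /** Returns the index of the first record that cannot be reprocessed, or -1 if all succeed. */
    int reprocess(List<LogRecord> log) {
      Set<Long> state = new HashSet<>();
      long nextKey = firstKey;
      for (int i = 0; i < log.size(); i++) {
        LogRecord r = log.get(i);
        if (r.intent.equals("CREATE")) {
          // The key is derived by the engine logic, not read back from the log.
          state.add(nextKey++);
        } else if (!state.remove(r.key)) {
          // Precondition not fulfilled: the key on the log is missing from the state.
          return i;
        }
      }
      return -1;
    }
  }

  public static void main(String[] args) {
    // Log written by the old version: CREATE produced key 1, COMPLETE refers to it.
    List<LogRecord> log =
        List.of(new LogRecord("CREATE", 0), new LogRecord("COMPLETE", 1));

    System.out.println(new Engine(1).reprocess(log)); // unchanged behavior: prints -1
    System.out.println(new Engine(100).reprocess(log)); // changed behavior: prints 1
  }
}
```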

The issue can be avoided if a snapshot is created before the upgrade. If no new records are processed after the snapshot, then no reprocessing is performed.
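
As a rough sketch of why the snapshot helps (a simplified model, not the broker's implementation; `SnapshotSketch`, `positionsToReprocess` and the idea of a single "last processed position" stored with the snapshot are assumptions made for illustration): only records after the position covered by the snapshot need to be reprocessed, so a snapshot taken at the end of the log leaves nothing to replay.

```java
import java.util.List;
import java.util.stream.Collectors;

public class SnapshotSketch {

  /** Returns the log positions that still need reprocessing after restoring a snapshot. */
  static List<Long> positionsToReprocess(List<Long> logPositions, long lastProcessedInSnapshot) {
    return logPositions.stream()
        .filter(position -> position > lastProcessedInSnapshot)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Long> log = List.of(1L, 2L, 3L);

    // Snapshot taken after the last record: nothing is left to reprocess.
    System.out.println(positionsToReprocess(log, 3L)); // prints []

    // Snapshot taken earlier: record 3 is reprocessed by the (possibly changed) engine.
    System.out.println(positionsToReprocess(log, 2L)); // prints [3]
  }
}
```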

To Reproduce
See #5251, #5268, #5393

Expected behavior
I can upgrade Zeebe to a new version and continue with my existing data.

Log/Stacktrace
See the linked issues.

Environment:

  • OS: [e.g. Linux]
  • Zeebe Version: 0.24.2, 0.25.0-SNAPSHOT
  • Configuration: [e.g. exporters etc.]
@saig0 added the kind/bug, scope/broker, Impact: Availability, and severity/high labels on Oct 13, 2020
@saig0
Member Author

saig0 commented Oct 13, 2020

To improve the situation, we implemented a check in the reprocessing to detect such situations (#5381), and an endpoint to trigger a snapshot creation (#5405).
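
The details of the check in #5381 are not spelled out in this thread. One plausible shape, shown purely as an assumption, is to compare the follow-up records the engine would produce on replay with the ones actually on the log, and to fail fast on a mismatch. All names below (`ReprocessingCheckSketch`, `FollowUp`, `verify`) are hypothetical.

```java
import java.util.List;
import java.util.Objects;

public class ReprocessingCheckSketch {

  /** Toy follow-up record, identified by key and intent. */
  static final class FollowUp {
    final long key;
    final String intent;

    FollowUp(long key, String intent) {
      this.key = key;
      this.intent = intent;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof FollowUp)) {
        return false;
      }
      FollowUp other = (FollowUp) o;
      return key == other.key && intent.equals(other.intent);
    }

    @Override
    public int hashCode() {
      return Objects.hash(key, intent);
    }

    @Override
    public String toString() {
      return intent + "@" + key;
    }
  }

  /** Throws if the records produced on replay differ from the records on the log. */
  static void verify(List<FollowUp> onLog, List<FollowUp> reproduced) {
    if (!onLog.equals(reproduced)) {
      throw new IllegalStateException(
          "Reprocessing diverged: log has " + onLog + " but replay produced " + reproduced);
    }
  }

  public static void main(String[] args) {
    List<FollowUp> onLog = List.of(new FollowUp(1, "CREATED"));
    verify(onLog, List.of(new FollowUp(1, "CREATED"))); // matches, no exception
    try {
      verify(onLog, List.of(new FollowUp(100, "CREATED")));
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```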

@npepinpe
Member

npepinpe commented Nov 9, 2020

I'm assigning it to @saig0 for now, mainly so that I can track who is working on it and see that someone is working on it.

@npepinpe
Member

Once we start on an issue breakdown, please create a milestone instead and close this issue (and reference the milestone).

@strawhat5

Just to add some findings from my side: the issue can happen not just while upgrading, but also during a simple restart of a broker. I came across it in 0.25.3, where one of the brokers was restarted and fell into EndlessRetryStrategy hell.

2021-01-04 07:13:24.041 [Broker-0-StreamProcessor-1] [Broker-0-zb-actors-1] ERROR io.zeebe.util.retry.EndlessRetryStrategy - Catched exception class java.lang.IllegalStateException with message Expected to find a workflow deployed with key '2251799814815370' but not found., will retry...
java.lang.IllegalStateException: Expected to find a workflow deployed with key '2251799814815370' but not found.
	at io.zeebe.engine.state.deployment.WorkflowState.getFlowElement(WorkflowState.java:99) ~[zeebe-workflow-engine-0.25.3.jar:0.25.3]
	at io.zeebe.engine.processing.bpmn.BpmnStreamProcessor.getElement(BpmnStreamProcessor.java:144) ~[zeebe-workflow-engine-0.25.3.jar:0.25.3]
	at io.zeebe.engine.processing.bpmn.BpmnStreamProcessor.processRecord(BpmnStreamProcessor.java:89) ~[zeebe-workflow-engine-0.25.3.jar:0.25.3]
	at io.zeebe.engine.processing.streamprocessor.TypedRecordProcessor.processRecord(TypedRecordProcessor.java:51) ~[zeebe-workflow-engine-0.25.3.jar:0.25.3]
	at io.zeebe.engine.processing.streamprocessor.ReProcessingStateMachine.lambda$chooseOperationForEvent$5(ReProcessingStateMachine.java:322) ~[zeebe-workflow-engine-0.25.3.jar:0.25.3]
	at io.zeebe.db.impl.rocksdb.transaction.ZeebeTransaction.run(ZeebeTransaction.java:79) ~[zeebe-db-0.25.3.jar:0.25.3]
	at io.zeebe.engine.processing.streamprocessor.ReProcessingStateMachine.lambda$processUntilDone$2(ReProcessingStateMachine.java:297) ~[zeebe-workflow-engine-0.25.3.jar:0.25.3]
	at io.zeebe.util.retry.ActorRetryMechanism.run(ActorRetryMechanism.java:36) ~[zeebe-util-0.25.3.jar:0.25.3]
	at io.zeebe.util.retry.EndlessRetryStrategy.run(EndlessRetryStrategy.java:50) ~[zeebe-util-0.25.3.jar:0.25.3]
	at io.zeebe.util.sched.ActorJob.invoke(ActorJob.java:73) [zeebe-util-0.25.3.jar:0.25.3]
	at io.zeebe.util.sched.ActorJob.execute(ActorJob.java:39) [zeebe-util-0.25.3.jar:0.25.3]
	at io.zeebe.util.sched.ActorTask.execute(ActorTask.java:122) [zeebe-util-0.25.3.jar:0.25.3]
	at io.zeebe.util.sched.ActorThread.executeCurrentTask(ActorThread.java:94) [zeebe-util-0.25.3.jar:0.25.3]
	at io.zeebe.util.sched.ActorThread.doWork(ActorThread.java:78) [zeebe-util-0.25.3.jar:0.25.3]
	at io.zeebe.util.sched.ActorThread.run(ActorThread.java:191) [zeebe-util-0.25.3.jar:0.25.3]

@saig0
Member Author

saig0 commented Jan 13, 2021

@strawhat5 thanks for sharing 👍 I'm not aware of this bug. It doesn't seem to be directly related to this issue. Please open a new issue if you have more information about it or about how it can be reproduced.

@strawhat5

It is related to #5251, as we have the same parallel workflow deployed here as in #5251. In this case, though, I did not perform any version upgrade: one of the broker pods restarted randomly and ended up in an inconsistent state during reprocessing.

I just wanted to stress, regarding the following line, that the reprocessing failure can happen even during a normal restart:

Describe the bug
When upgrading Zeebe to a new version

@saig0
Member Author

saig0 commented Jan 26, 2021

The new concept of the workflow processing is described here: ZEP 004
The progress is tracked by the milestone: https://github.com/zeebe-io/zeebe/milestone/51
