handle IO errors between store_log_entry and end_of_append_batch #607

Open
Besroy opened this issue Dec 12, 2024 · 0 comments

Besroy commented Dec 12, 2024

In the current implementation, once all the data for a batch has been received, raft saves the log entries into the log store and performs pre-commit. The end_of_append_batch step then waits until all data has been written. However, if an IO error occurs between the store_log_entry and end_of_append_batch stages, HomeStore may get stuck (https://github.com/eBay/HomeStore/blob/master/src/lib/replication/repl_dev/raft_repl_dev.cpp#L454) or crash (https://github.com/eBay/HomeStore/blob/master/src/lib/replication/repl_dev/raft_repl_dev.cpp#L873).
For example:
t1: handle_raft_event passes (all data received)
t2: append the log to the log store and add the rreq to the state machine
t3: pre-commit passes
t4: the async write fails in on_push_data_received (triggering handle_error) or in on_fetch_data_received (crashing)
t5: end_of_append_batch waits for all data to be written and gets stuck or crashes
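
To make the t4/t5 interaction concrete, here is a minimal sketch of a batch check that treats a failed write as a terminal state, so the caller can tell "still waiting" apart from "will never complete". All names here (DataState, BatchEntry, BatchResult, check_batch) are hypothetical and are not HomeStore or NuRaft APIs; the real end_of_append_batch may be structured quite differently.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-request data state; not HomeStore's actual type.
enum class DataState { pending, written, failed };

// Hypothetical stand-in for one log entry in the append batch.
struct BatchEntry {
    long lsn;
    DataState state;
};

enum class BatchResult { all_written, still_pending, io_error };

// Inspect every entry in the batch. A failed write is reported to the caller
// as io_error instead of being waited on forever or hitting an assert.
inline BatchResult check_batch(const std::vector<BatchEntry>& batch,
                               std::vector<long>& failed_lsns) {
    bool pending = false;
    for (const auto& e : batch) {
        if (e.state == DataState::failed) {
            failed_lsns.push_back(e.lsn);  // remember which LSNs need cleanup
        } else if (e.state == DataState::pending) {
            pending = true;
        }
    }
    if (!failed_lsns.empty()) return BatchResult::io_error;
    return pending ? BatchResult::still_pending : BatchResult::all_written;
}
```

The point of the sketch is only that the error path has to leave the request in a state the waiter can observe; what the caller does with io_error (retry, emergent GC, stepping down) is left open.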

Note that we cannot simply do the unlink in handle_error as soon as the IO error occurs (see the following case), so we need a solution that handles the error more gracefully. One potential approach is emergent GC.
t1: Handle the raft event; it passes.
t2: Append the log to the log store and save the LSN into the state machine.
t3: The write fails due to an IO error, which triggers handle_error and removes the LSN from the state machine.
t4: The pre-commit phase cannot find the LSN, leading to a nullptr dereference.
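
One way to picture a more graceful error path is to mark the entry as failed instead of removing it, and to have pre-commit check the lookup result before using it. The sketch below assumes a simple LSN-to-request map; Request, LsnTable, and PreCommitResult are hypothetical names for illustration, not the actual state machine types in HomeStore.

```cpp
#include <cassert>
#include <map>
#include <memory>

// Hypothetical request record; the real rreq carries much more state.
struct Request {
    explicit Request(long l) : lsn(l) {}
    long lsn;
    bool io_failed = false;
};

enum class PreCommitResult { ok, skipped_io_error, unknown_lsn };

// Hypothetical stand-in for the LSN -> request map kept by the state machine.
class LsnTable {
public:
    // t2: the LSN is linked when the log entry is appended.
    void link(long lsn) { m_reqs[lsn] = std::make_shared<Request>(lsn); }

    // t3: on IO error, keep the entry and flag it rather than unlinking it.
    void mark_failed(long lsn) {
        auto it = m_reqs.find(lsn);
        if (it != m_reqs.end()) it->second->io_failed = true;
    }

    // t4: pre-commit checks the lookup and the flag before touching the request.
    PreCommitResult pre_commit(long lsn) const {
        auto it = m_reqs.find(lsn);
        if (it == m_reqs.end()) return PreCommitResult::unknown_lsn;
        return it->second->io_failed ? PreCommitResult::skipped_io_error
                                     : PreCommitResult::ok;
    }

private:
    std::map<long, std::shared_ptr<Request>> m_reqs;
};
```

With this shape, the flagged entries would still need to be reclaimed later, which is where something like the emergent GC mentioned above could come in.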

We also need a flip to mock the IO error.
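
As a rough illustration of what such a mock could look like in a test, the sketch below uses a tiny self-contained fault injector. It is a hypothetical stand-in and does not use the real flip API; FaultInjector, write_data, and the point name "data_write_io_error" are all assumptions made for this example.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical fault injector: each named point can be armed to fire N times.
class FaultInjector {
public:
    void arm(const std::string& name, int count) { m_remaining[name] = count; }

    // Returns true (and consumes one shot) while the named point is armed.
    bool should_fail(const std::string& name) {
        auto it = m_remaining.find(name);
        if (it == m_remaining.end() || it->second <= 0) return false;
        --it->second;
        return true;
    }

private:
    std::unordered_map<std::string, int> m_remaining;
};

// Hypothetical data write path: returns false when an IO error is injected.
inline bool write_data(FaultInjector& faults) {
    if (faults.should_fail("data_write_io_error")) return false;
    return true;  // the real write would happen here
}
```

A test could arm the point once between the log append and the batch-end step, then verify that the replica neither hangs nor crashes.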
