[CORE-NNNN] archival: Use read-write fence in the ntp_archiver #24574
Conversation
can we have a ducktape test that reproduces CORE-8769 so that we can verify this as a fix?
This is not a bugfix and CORE-8769 is just a high-level description of the feature; I don't think we have something to reproduce. But since it's used for every upload/GC/retention operation you can be sure that the coverage is good.
Sorry, is it this one? https://redpandadata.atlassian.net/browse/CORE-8742
Thanks! I guess we can use these examples to drive the construction of fault injection tests?
Overall looks good to me. I added several questions to help with my understanding of the changes.
At a high level it looks like this "fencing" is preventing certain types of state from being replicated into the log, is that correct?
The other, complementary aspect of fencing is the usage of fences on the read side to disallow problematic states. Does this PR have such a thing? Especially if this does exist, how do you expect the system to behave after upgrade if the system already contains a problematic state? In this case, the former would prevent new problems, but what about existing problems in persistent state?
if (mdiff.empty()) {
    vlog(_rtclog.debug, "No upload metadata collected, returning early");
    co_return total;
}
it feels like this makes things more complicated and fragile? before the fencing we'd still drop through and build a new batch even if mdiff was empty. if fencing is in place to prevent duplicates from being applied, why are we now bailing early here before replicating? is this an optimization? if so, it seems like it would be preferable to have a unified code path? please add some comments to explain what is happening.
if the fencing feature is disabled, then unsafe_add gets set to true automatically. is the mdiff.empty check related to this feature too?
This is just a cleanup; the check for emptiness is done at a lower level in the archival STM. If the array is empty nothing will be replicated, but we will still go into the STM, take locks, etc. It's much better to bail out early. The `mdiff` could become empty not because of fencing (the fencing is applied in the STM when the commands are applied). It can become empty if uploads have failed.
> It can become empty if uploads have failed.

right, but it's in here without any explanation and it's unclear if it is related to fencing, since that is what the commit is related to.
It's not related to the fencing and can be removed. But it's a minor thing and it is also a small improvement.
> It's not related to the fencing and can be removed. But it's a minor thing and it is also a small improvement.

That's fine, it's just not clear at all when reviewing the code what it is. A two-line commit that adds that improvement, or a self-review comment, could make that clear.
/// Replicate archival metadata
ss::future<ntp_archiver::upload_group_result> replicate_archival_metadata(
  archival_stm_fence fence,
  std::list<wait_uploads_complete_result> finished_uploads);
This list will have one or two elements; `std::vector`?
-    if (add_segments.empty()) {
+    if (segments.empty()) {
         co_return errc::success;
     }
storage::record_batch_builder b(
  model::record_batch_type::archival_metadata, model::offset(0));
for (auto& meta : add_segments) {
    iobuf key_buf = serde::to_iobuf(add_segment_cmd::key);
i don't really understand what is being deduplicated. maybe if you move some of the code further away in the file (e.g. to the bottom) the diff would render in a way that is more obvious.
another non-obvious thing is locking. you were calling do_add_segment under a lock from add_segment, but do_add_segment seems to have been removed and i don't see the lock in the new code, so that's confusing.
The `archival_metadata_stm` has methods that replicate specific commands, like `add_segments`. But it also has a command builder utility that allows building a custom config batch. Most of the `archival_metadata_stm` methods that build batches just create the command builder internally and use it to create and replicate the batch (the command builder does all the locking internally). The `add_segments` method was an exception. Initially, in the `ntp_archiver` I started using the command builder instead of `add_segments` to add the `read_write_fence` command. But that created a situation where we had two different code paths that can add segments to the STM.

In this commit I used the command builder to build the same batch and replicate it. I also added the fence so the `ntp_archiver` can just use the `add_segments` method as before. The code is covered by tests extensively so there is no need to add new ones.
can you address what seems to be the removal of some locking? if this is just deduplicating, then i'd expect the locking to be retained. so maybe it is retained somehow indirectly?
The locking is not removed; the locking is done in the `command_builder`. It acquires the lock inside the `replicate` method. The actual batch generation doesn't have to happen under the lock, it's just a transformation of the input.
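To make that split concrete, here is a minimal standalone sketch of the pattern; `command_batch_builder` is named after the utility discussed above, but the mutex and the string-based commands are simplified stand-ins, not the real `archival_metadata_stm` code:

```cpp
#include <mutex>
#include <string>
#include <vector>

// Simplified stand-in for the archival STM command batch builder. Building
// the batch is a pure transformation of the input and needs no lock; only
// replicate() serializes against other writers.
class command_batch_builder {
public:
    explicit command_batch_builder(std::mutex& stm_lock)
      : _stm_lock(stm_lock) {}

    // Appends add_segment commands; no locking required at this point.
    command_batch_builder& add_segments(const std::vector<std::string>& segments) {
        for (const auto& s : segments) {
            _commands.push_back("add_segment:" + s);
        }
        return *this;
    }

    // Puts the fence in front of the batch, as it must be the first command.
    command_batch_builder& read_write_fence(long last_applied_offset) {
        _commands.insert(
          _commands.begin(), "fence:" + std::to_string(last_applied_offset));
        return *this;
    }

    // The lock is acquired here, inside replicate(), exactly as described in
    // the reply above; batch construction stayed lock-free.
    bool replicate() {
        std::lock_guard<std::mutex> guard(_stm_lock);
        // ... replicate _commands through raft and wait for them to apply ...
        return true;
    }

private:
    std::mutex& _stm_lock;
    std::vector<std::string> _commands;
};
```

With this shape a method like `add_segments` can be re-implemented on top of the builder without holding the lock while the batch is constructed, which matches the deduplication described in the earlier reply.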
The snapshot doesn't serialize the applied-offset field of the manifest. This commit fixes that. The field is used by the fencing mechanism. Prior to this commit (and the PR that adds this commit) the field was write-only: it was updated but never actually used, because we don't have any code in the write path that generates read_write_fence commands for the archival STM. Before the fencing mechanism can be used the snapshot has to be fixed, otherwise the replicas may diverge. Signed-off-by: Evgeny Lazin <[email protected]>
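A simplified illustration of the fix described above; the struct and field names are stand-ins (the real snapshot uses Redpanda's serde), the point is only that the last-applied offset now travels with the snapshot:

```cpp
#include <cstdint>

// Sketch of the archival STM snapshot. Before the fix the last-applied
// offset was not serialized, so a replica restored from a snapshot would
// compare read-write fences against a default value and could diverge
// from replicas that replayed the full log.
struct archival_snapshot_sketch {
    // ... serialized manifest state elided ...
    int64_t last_applied_offset{-1}; // now persisted along with the manifest
};
```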
Previously, the `read_write_fence` command was added to the `archival_metadata_stm` to ensure consistency. The command operates as follows:

- The STM stores the offset of the last applied command.
- The `read_write_fence` command is initialized with the last applied offset and is included as the first entry in the archival STM record batch.
- When the batch is replicated and eventually applied to the STM's in-memory state, the `read_write_fence` compares its stored offset with the STM's last applied offset.
- If the offsets do not match, all remaining commands in the batch are discarded.
- If the offsets match, the remaining commands are applied.

This mechanism ensures that if any changes to the STM's in-memory state occur after the ntp_archiver reads the last applied offset, the batch is aborted. This requires the `read_write_fence` to always be the first command in the record batch.

Currently, the `ntp_archiver` does not emit the `read_write_fence` command. This commit addresses that by implementing the following logic:

1. The `ntp_archiver` reads the last applied offset from the STM at the start of a new upload iteration.
2. It searches for an upload candidate and performs the upload.
3. The `ntp_archiver` replicates metadata describing the upload, including a `read_write_fence` command in the record batch.
4. The `read_write_fence` is initialized using the last applied offset cached at the beginning.

This ensures that the configuration batch generated by the `ntp_archiver` can only be applied to the STM if its state remains unchanged. Any concurrent changes to the STM state will prevent the batch from being applied. This utility ensures that if any changes to the STM are made while segments are uploaded to S3, there will be no metadata conflict. The same holds true if a portion of the log is replayed by the STM twice. Every configuration batch can only be applied to a specific version of the archival STM state. This makes the archival STM effectively idempotent.

Note that there is no way for the `ntp_archiver` to replay any log messages twice or make any concurrent modifications to the archival STM. This is only possible if we have some bug. We're not relying on this mechanism for correctness and only using it as a safety net.

Signed-off-by: Evgeny Lazin <[email protected]>
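A minimal standalone sketch of the apply-side check described in the commit message above; it is plain C++ without Seastar or the real Redpanda types, so `command`, `fence_cmd`, and `apply_batch` are illustrative names rather than the actual STM API:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Simplified stand-ins for the archival STM state and its commands.
struct fence_cmd {
    int64_t expected_last_applied; // offset cached when the batch was built
};

struct command {
    std::optional<fence_cmd> fence; // set only on the first entry of a batch
    // ... payload of add_segment / truncate / spillover commands elided ...
};

struct archival_stm_sketch {
    int64_t last_applied_offset{-1};

    // Apply one replicated batch. If the leading read-write fence does not
    // match the current last applied offset, the remaining commands in the
    // batch are discarded, as the commit message describes.
    bool apply_batch(const std::vector<command>& batch, int64_t batch_offset) {
        if (
          !batch.empty() && batch.front().fence
          && batch.front().fence->expected_last_applied != last_applied_offset) {
            return false; // fence mismatch: drop the rest of the batch
        }
        // ... apply the remaining commands to the in-memory state ...
        last_applied_offset = batch_offset;
        return true;
    }
};
```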
The check is no longer needed because we're using the fencing mechanism to prevent concurrency. The check is actually performed when the segment is uploaded. The removed check was used as an optimistic concurrency control mechanism and is now replaced by the RW-fence. Also, deprecate the error code that the STM uses to communicate the consistency violation to the caller. The error code is not used anywhere else. Signed-off-by: Evgeny Lazin <[email protected]>
The option was never turned on by default. Our plan was to eventually enable it, but it is now replaced with the fencing mechanism, which is always on and can't be disabled. Signed-off-by: Evgeny Lazin <[email protected]>
This commit enables the rw-fence mechanism for housekeeping reuploads. If the housekeeping is running on a stale version of the STM snapshot, the update will be rejected. Signed-off-by: Evgeny Lazin <[email protected]>
Use the rw-fence for both normal retention/GC and spillover. Currently, the upload metadata validation only checks added segments. Because of that there is still some room for concurrent updates during housekeeping (GC or retention). With the rw-fence the metadata updates related to retention or GC become safer. Signed-off-by: Evgeny Lazin <[email protected]>
The previous implementation of the archival STM had a bug because of which the last applied offset wasn't added to the snapshot. This could potentially make replicas diverge. The solution is to add a feature flag and use the rw-fence mechanism only if all replicas are upgraded. Signed-off-by: Evgeny Lazin <[email protected]>
Split the method into two parts:

- wait_uploads_complete
- replicate_archival_metadata

The first method waits until all uploads are finished. But unlike wait_uploads it doesn't replicate the metadata; instead it just returns all information needed to replicate it. The second method takes the output of wait_uploads_complete and replicates it. Finally, the wait_uploads method is now implemented using these two new methods.

This is a preparation for the next commit. Currently, when we're uploading both compacted and non-compacted segments in the same iteration of the upload loop they are both replicated independently. This worked fine before (with the exception of the race condition on _projected_manifest_clean_at). But with the fencing mechanism one of these wait_uploads calls will fail to replicate because of the fence, since the same fence is used to invoke wait_uploads for compacted and non-compacted segments. The solution to this problem is to use wait_uploads_complete to wait until both compacted and non-compacted uploads are completed and then use a single replicate_archival_metadata call. This call will be able to use the fence value correctly.

Signed-off-by: Evgeny Lazin <[email protected]>
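A rough sketch of the described split; the real methods are Seastar coroutines on `ntp_archiver`, so everything here is synchronous and simplified, with only the method and helper type names taken from the commit message:

```cpp
#include <cstddef>
#include <list>
#include <utility>

struct archival_stm_fence { long last_applied_offset{-1}; };
struct wait_uploads_complete_result { /* uploaded segment metadata, etc. */ };
struct upload_group_result { std::size_t num_succeeded{0}; };

struct ntp_archiver_sketch {
    // 1) Wait for in-flight uploads; do not replicate anything yet, just
    //    return what is needed to replicate later.
    wait_uploads_complete_result wait_uploads_complete(bool /*compacted*/) {
        return {};
    }

    // 2) Replicate metadata for all finished uploads in a single batch,
    //    guarded by a single fence.
    upload_group_result replicate_archival_metadata(
      archival_stm_fence /*fence*/,
      std::list<wait_uploads_complete_result> finished_uploads) {
        return {.num_succeeded = finished_uploads.size()};
    }

    // The old entry point is composed of the two new steps: compacted and
    // non-compacted results go into one replicate call, so one fence value
    // covers the whole upload loop iteration.
    upload_group_result wait_uploads(archival_stm_fence fence) {
        std::list<wait_uploads_complete_result> finished;
        finished.push_back(wait_uploads_complete(/*compacted=*/false));
        finished.push_back(wait_uploads_complete(/*compacted=*/true));
        return replicate_archival_metadata(fence, std::move(finished));
    }
};
```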
Put non-compacted segment metadata and compacted segment metadata into the same archival control batch. This is needed to make fencing work correctly if both compacted and non-compacted uploads are made in the same upload loop iteration. Signed-off-by: Evgeny Lazin <[email protected]>
Signed-off-by: Evgeny Lazin <[email protected]>
Switch back to using the 'add_segments' method. The implementation now uses the builder to generate commands. There is no longer a need to duplicate the logic, and the logging becomes the same as it was before the fence was added. Signed-off-by: Evgeny Lazin <[email protected]>
All checks and initialization are more readable this way. Signed-off-by: Evgeny Lazin <[email protected]>
The new cluster config is called 'cloud_storage_disable_archival_stm_rw_fence'. If it's set to 'true' it disables the fencing mechanism. The config is set to false by default and the description text mentions that it shouldn't be set. Signed-off-by: Evgeny Lazin <[email protected]>
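A hedged sketch of how such a kill switch would typically be consulted; only the property name comes from the commit message, the accessor shape below is an assumption and not the actual Redpanda config API:

```cpp
// Sketch only: the real flag is a Redpanda cluster configuration property.
struct cluster_config_sketch {
    bool cloud_storage_disable_archival_stm_rw_fence{false}; // default: false
};

// The archiver attaches the read-write fence unless the kill switch is set.
inline bool should_attach_rw_fence(const cluster_config_sketch& cfg) {
    return !cfg.cloud_storage_disable_archival_stm_rw_fence;
}
```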
@Lazin I have no idea what changed in the force push; it contains 18,000 LOC of change across 500 files. I realize it was to fix a conflict so it was necessary, but what changed with respect to this PR?
@Lazin can you also respond to the high level review comment? #24574 (review)
cc @dotnwat
No. It just prevents commands from being applied to the in-memory state in the wrong order. In a nutshell, if we started an upload based on the archival STM state that was generated by applying offsets from 0 to 38727, we will add a `read_write_fence` initialized with offset 38727 to the resulting config batch, so the batch can only be applied while the STM state is still at that offset.
The reader in this case is a background loop of the `ntp_archiver`.
Thanks @Lazin for being patient with all my questions!
The fence is a value produced by recording the current offset of the last applied archival STM command. The fence is added to the archival STM config batch. If no commands were applied to the STM in the meantime, the offset of the last applied command will be the same and the configuration batch will be applied. Otherwise, if some command was applied after the record batch was constructed, the batch will not be applied.
This is essentially a concurrency control mechanism. The commands are already present in the STM but not used for now. The existing metadata validation mechanism in the NTP archiver is still working, but we no longer need to check the commands in the STM. If every archival STM command batch starts with the fence, the batches become idempotent. This eliminates various scenarios in which Tiered-Storage metadata could be corrupted.
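Before the concrete failure scenarios listed below, a small self-contained example of the idempotency claim: two batches built against the same STM state carry the same fence value, so only the first one can apply (the model is illustrative, not the real STM code):

```cpp
#include <cassert>
#include <cstdint>

// Minimal model: a batch carries the fence value it was built against.
struct stm_model {
    int64_t last_applied{-1};

    // Returns true if the batch applied; a fence mismatch discards it.
    bool apply(int64_t fence, int64_t new_last_applied) {
        if (fence != last_applied) {
            return false;
        }
        last_applied = new_last_applied;
        return true;
    }
};

int main() {
    stm_model stm;
    const int64_t fence = stm.last_applied; // both writers observe this state

    assert(stm.apply(fence, 10));  // first batch applies and moves the state
    assert(!stm.apply(fence, 20)); // a replayed or concurrent batch built
                                   // against the old state is rejected
    return 0;
}
```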
Some examples:

- The log is truncated below the `last_uploaded_offset` and then the most recent part of the log is shipped from another replica. Then the STM replays commands starting from the insync offset, which means that it applies some commands twice. If all the commands contain the `rw_fence` there will be no metadata corruption.
- The `ntp_archiver` stops on the old shard and then starts on the new shard. The problem is that on the new shard it starts in the middle of the term, which means that the archival STM may apply commands asynchronously. If the segment upload is started concurrently with the application of commands we may upload a segment which doesn't align well with the last segment.
- The `ntp_archiver` is restarted mid-term. This is similar to the x-shard movement case, but the archiver is restarted because we used an escape hatch to reset the manifest or because we enabled compaction for the topic.

All the problems mentioned above are fixed using other means. We should still use the `rw_fence` mechanism to avoid similar problems in the future.

Backports Required
Release Notes