Skip to content

Conversation

@hemantk-12
Copy link
Contributor

What changes were proposed in this pull request?

We found two race conditions issues HDDS-10524 and HDDS-10590 which are fixed in PR #6443 and PR #6456 respectively.

There is still an issue with the existing way batch snapshot purge is processed.

As part of the snapshot purge, the deep clean flag of the next active snapshot, and the global and path previous of the next global and path level snapshots get updated. For this, updatedSnapInfos and updatedPathPreviousAndGlobalSnapshots maps are maintained in OMSnapshotPurgeRequest, and then these maps are flushed sequentially in OMSnapshotPurgeResponse. There is a problem with that and can cause chain corruption.
For example, let's assume as part of deep clean info update, snapshots are updated as {E -> E', F -> F', B' -> B'', G -> G'} and kept in updatedSnapInfos: [E', F', B'', G'] and previous snapshots are updated as {A - > A', B -> B', C -> C', D -> D'} and kept in updatedPathPreviousAndGlobalSnapshots: [A', B', C', D'].
After the purge final snapshot list should be [A', B'', C', D', E', F', G'] but because these maps are added to the batch sequentially [A', B', C', D', E', F', G'] or [A', B'', C', D', E', F', G'] depending on which one is added to the batch first code. The problem can still exist even if you fix the order of maps flush.

Ideally, these should be flushed in the same order the purge batch is processed.

This change is to fix the issue by changing the snapshot purge to take one snapshot at a time rather than the list of snapshots. For backward compatibility when Ratis transaction contains a list of snapshots, a new object is introduced to maintain the order of transaction and flush in the same order, they were updated in OMSnapshotPurgeRequest.

What is the link to the Apache JIRA

HDDS-9198

How was this patch tested?

Added and updated unit tests.

@hemantk-12 hemantk-12 added the snapshot https://issues.apache.org/jira/browse/HDDS-6517 label Apr 8, 2024
@aswinshakil aswinshakil self-requested a review May 16, 2024 14:18
@SaketaChalamchala
Copy link
Contributor

@swamirishi could you please take a look?

@hemantk-12 hemantk-12 requested a review from swamirishi July 29, 2024 19:25
@swamirishi
Copy link
Contributor

swamirishi commented Aug 6, 2024

@hemantk-12 Can't we just lookup the object in the maps updatedPathPreviousAndGlobalSnapshots & updatedSnapInfos instead of just getting a fresh copy of the snapshotInfo from the table over and over again
Can't we have a single map like updateSnapshotInfoMap which has all the keys that have been updated?

SnapshotInfo updatedSnapshotInfo = omMetadataManager.getSnapshotInfoTable().get(snapInfo.getTableKey()); & SnapshotInfo nextPathSnapInfo =
        nextPathSnapshotKey != null ? metadataManager.getSnapshotInfoTable().get(nextPathSnapshotKey) : null;

    SnapshotInfo nextGlobalSnapInfo =
        nextGlobalSnapshotKey != null ? metadataManager.getSnapshotInfoTable().get(nextGlobalSnapshotKey) : null;
``` in  the functions updateSnapshotInfoAndCache & updateSnapshotChainAndCache instead of this we could have a function like:
`private SnapshotInfo getLatestSnapshotInfo(String key, Map<String, SnapshotInfo> updatedMap) {
     if (updatedMap.containsKey(key)) {
           return updatedMap.get(key);
     } 
    return metadataManager.getSnapshotInfoTable().get(nextPathSnapshotKey);
}`

@hemantk-12
Copy link
Contributor Author

hemantk-12 commented Aug 6, 2024

@hemantk-12 Can't we just lookup the object in the maps updatedPathPreviousAndGlobalSnapshots & updatedSnapInfos instead of just getting a fresh copy of the snapshotInfo from the table over and over again Can't we have a single map like updateSnapshotInfoMap which has all the keys that have been updated?

I think we can have a local map to OMSnapshotPurgeRequest. I'll address this as part of HDDS-11137. We have locks right now, and I'm uncomfortable replacing it with Map as suggested.
It will be easy after we remove the locks and double access of object.

@hemantk-12
Copy link
Contributor Author

hemantk-12 commented Aug 6, 2024

As per @swamirishi's suggestion, we can maintain a Map that will work as a local cache to OMSnapshotPurgeRequest and returns update snapshotInfo all the time.
Currently, the problem is that we maintain two maps updatedPathPreviousAndGlobalSnapshots and updatedSnapInfos which creates an issue with batch processing because order matters. If we replace these maps with one map, it would solve the problem and we can still keep the batch purge operation.

Closing this PR in favor of #7045

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

snapshot https://issues.apache.org/jira/browse/HDDS-6517

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants