Optimizations for Chunk's hash validation #204

Open
lurenpluto opened this issue Apr 12, 2023 · 0 comments
Labels
Any Idea Any problem/ideas/suggestions BDT bucky data transfer protocol CYFS Stack This is CYFS Stack feature New feature Performance about performance issues

Comments

@lurenpluto
Member

Chunk storage design

Chunks managed by cyfs-stack currently fall into two types, depending on where their data is stored:

  • External: stored as files in any directory, with the cyfs-stack's tracker responsible for recording these associations. The probability of data invalidation and errors is higher.
  • Internal: placed in the data/chunk-cache directory, managed by the chunk-manager. The probability of data invalidation and errors is lower.

For examples of such errors, see the problems reported in #158 and #201.

Solutions currently in use

To handle chunk data errors, cyfs-stack currently uses a "validation during reading" mode: each time a chunk is requested, its data is validated as it is read from the target disk file. This approach is simple, but it has some problems:

  • Special cases during reading
    Partial reads cannot be handled correctly, because a chunk's hash can only be checked against the complete chunk data.
  • Wasted performance
    Every read triggers a validation, even when the same chunk was validated moments earlier. In most cases the chunk file has not changed and contains no errors, so frequent requests for the same chunk add a lot of unnecessary overhead.
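
To make the two problems concrete, here is a minimal sketch of the "validation during reading" pattern. It is not the cyfs-stack implementation: the function names are hypothetical, and SHA-256 is only assumed as a stand-in for whatever hash the chunk id actually carries.

```python
import hashlib


def read_chunk_validated(path: str, expected_hex: str) -> bytes:
    """Read a whole chunk file and hash it on every call (hypothetical helper)."""
    with open(path, "rb") as f:
        data = f.read()
    # Assumption: the chunk id is a SHA-256 hex digest of the chunk data.
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_hex:
        raise ValueError(f"data mismatch: expected {expected_hex}, got {actual}")
    return data


def read_chunk_range(path: str, offset: int, length: int) -> bytes:
    """Read only part of a chunk file.

    The returned bytes cannot be compared with the whole-chunk hash, so a
    partial read is either unvalidated or forces a full read just to hash it.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```

Calling `read_chunk_validated` repeatedly for the same unchanged file recomputes the same digest each time, which is the overhead described above.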

Possible improvements

Given these problems, several optimizations and improvements are possible:

1. Add a regular local chunk validation mechanism

  • Similar to the GC mechanism, periodically scan the chunks recorded in the NDC and the tracker and attempt to validate each one, updating its status afterwards (for example, the last validation time and the validation result).
  • Trigger validation on demand: for example, when a chunk is requested and turns out not to have been validated for a long time, validation can be triggered immediately.

With this in place, when a chunk is requested and its last validation result was a failure, the "data mismatch" error can be returned to the caller directly, without validating again.
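
The following sketch shows one way the recorded validation state could drive both the periodic scan and the request path. Everything in it is an assumption made for illustration: the `ChunkValidator` class, the `load` callback standing in for the NDC/tracker lookup, the `max_age` threshold, and the use of SHA-256 hex digests as chunk ids.

```python
import hashlib
import time
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


class DataMismatch(Exception):
    """Raised when a chunk's data does not match its id."""


@dataclass
class ValidationRecord:
    last_checked: float  # time of the last validation, from the injected clock
    ok: bool             # result of the last validation


class ChunkValidator:
    def __init__(self, load: Callable[[str], bytes], max_age: float,
                 clock: Callable[[], float] = time.monotonic):
        self.load = load        # hypothetical: fetches chunk bytes by id
        self.max_age = max_age  # seconds before a record counts as stale
        self.clock = clock
        self.records: Dict[str, ValidationRecord] = {}

    def validate(self, chunk_id: str) -> bool:
        """Hash the chunk and record the time and the result."""
        ok = hashlib.sha256(self.load(chunk_id)).hexdigest() == chunk_id
        self.records[chunk_id] = ValidationRecord(self.clock(), ok)
        return ok

    def scan(self, chunk_ids: Iterable[str]) -> None:
        """Periodic pass, meant to be run from a background task like GC."""
        for chunk_id in chunk_ids:
            self.validate(chunk_id)

    def read(self, chunk_id: str) -> bytes:
        """Request path: validate only if the record is missing or stale."""
        rec = self.records.get(chunk_id)
        if rec is None or self.clock() - rec.last_checked > self.max_age:
            self.validate(chunk_id)
            rec = self.records[chunk_id]
        if not rec.ok:
            # A recent failed result is returned directly, with no re-hash.
            raise DataMismatch(chunk_id)
        return self.load(chunk_id)
```

In this sketch a stale record is re-validated before it is trusted, so a chunk that was repaired after a failed check becomes readable again once `max_age` has elapsed. A real implementation would also need to persist the records and avoid loading the data twice on the stale path.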

2. Add validation at the BDT layer on the requester side

Currently, BDT has no step that validates the chunk hash while transferring files and chunks. By design, the cyfs-stack layer should ensure that chunk data requested from elsewhere is correct (similar to a download in a Web 2 browser). It therefore seems necessary for the BDT layer to provide this validation mechanism, at least as an option. @photosssa
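
As a rough illustration of optional requester-side validation, the sketch below hashes a chunk incrementally as its pieces arrive and compares the digest once the transfer completes. It does not reflect BDT's actual API: `receive_chunk`, the `verify` flag, and the SHA-256 hex digest used as the expected hash are all assumptions.

```python
import hashlib
from typing import Iterable


def receive_chunk(pieces: Iterable[bytes], expected_hex: str,
                  verify: bool = True) -> bytes:
    """Assemble a chunk from received pieces, optionally checking its hash."""
    hasher = hashlib.sha256()  # assumption: chunk ids are SHA-256 digests
    buf = bytearray()
    for piece in pieces:
        buf += piece
        if verify:
            # Hash each piece as it arrives, so no second pass is needed.
            hasher.update(piece)
    if verify and hasher.hexdigest() != expected_hex:
        raise ValueError("data mismatch: received chunk does not match its hash")
    return bytes(buf)
```

Hashing piece by piece assumes the pieces are consumed in order. If pieces can arrive out of order, the hash would have to be computed over the reassembled buffer instead.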

@lurenpluto lurenpluto added feature New feature Performance about performance issues BDT bucky data transfer protocol CYFS Stack This is CYFS Stack labels Apr 12, 2023
@lurenpluto lurenpluto added the Any Idea Any problem/ideas/suggestions label Apr 12, 2023