Skip to content

Local pinning fix#255

Merged
jmozah merged 5 commits intomasterfrom
pinfix
Jun 9, 2020
Merged

Local pinning fix#255
jmozah merged 5 commits intomasterfrom
pinfix

Conversation

@jmozah
Copy link
Copy Markdown
Contributor

@jmozah jmozah commented Jun 4, 2020

Problem: Sometime a pinned chunk gets GCd and is not available in localstore. This was not simulatable. It once occurred in FairData setup and now in @santicomp2014 's setup in global pinning.

Fix: Even though we don't know the exact scenario when this happens, a code walk through found out that we are not checking for pinIndex presence when a chunk is added in gcIndex. We found 4 such places that are missed out and fixing it in this PR.

@jmozah jmozah added the bug Something isn't working label Jun 4, 2020
@jmozah jmozah self-assigned this Jun 4, 2020
@jmozah jmozah requested review from acud and janos and removed request for janos June 5, 2020 11:04
@jmozah jmozah changed the title [WIP] Local pinning fix Local pinning fix Jun 5, 2020
@acud
Copy link
Copy Markdown
Contributor

acud commented Jun 5, 2020

@jmozah If this PR is ready for review, please elaborate on what exactly is the bug and how it is fixed in the PR description.

Comment thread pkg/localstore/mode_get.go Outdated
// add new entry to gc index ONLY if it is not present in pinIndex
ok, err := db.pinIndex.Has(item)
if err != nil {
fmt.Println("mode_get: Not adding in gcIndex", hex.EncodeToString(item.Address))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove Println.

Comment thread pkg/localstore/mode_put.go Outdated
// add new entry to gc index ONLY if it is not present in pinIndex
ok, err := db.pinIndex.Has(item)
if err != nil {
fmt.Println("mode_put: Not adding in gcIndex", hex.EncodeToString(item.Address))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove Println.

Comment thread pkg/localstore/mode_set.go Outdated

ok, err := db.pinIndex.Has(item)
if err != nil {
fmt.Println("mode_set: Not adding in gcIndex", hex.EncodeToString(item.Address))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove Println.

Comment thread pkg/localstore/mode_set.go Outdated
// Add in gcIndex only if this chunk is not pinned
ok, err := db.pinIndex.Has(item)
if err != nil {
fmt.Println("mode_set: Not adding in gcIndex", hex.EncodeToString(item.Address))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove Println.

Comment thread pkg/localstore/gc.go Outdated
Comment on lines +119 to +122
yes, err := db.pinIndex.Has(item)
if err == nil && yes {
fmt.Println("GCing pinned item ", hex.EncodeToString(item.Address))
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be removed as it is used only to print a line.

@jmozah
Copy link
Copy Markdown
Contributor Author

jmozah commented Jun 5, 2020

Update: Just now @santicomp2014 confirmed that this works in there setup.

@jmozah
Copy link
Copy Markdown
Contributor Author

jmozah commented Jun 5, 2020

@jmozah If this PR is ready for review, please elaborate on what exactly is the bug and how it is fixed in the PR description.

@acud. Added the descriptions. Hope you are happy now.

Comment thread pkg/localstore/gc_test.go
Comment thread pkg/localstore/gc_test.go Outdated
pinnedChunks := make([]swarm.Address, 0)

// upload random chunks above db capacity to see if chunks are still pinned
for i := 0; i < 2000; i++ {
Copy link
Copy Markdown
Contributor

@acud acud Jun 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the capacity is 100, why do you need to insert 6000? and every time the CI runs?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the test case,... The issue occurs when we bombard chunks greater that dbCapacity ..more than once... 2000 is a big number for CI, May be we can reduce it to 200.

Comment thread pkg/localstore/localstore_test.go
Comment thread pkg/localstore/gc_test.go Outdated
}
gotChunk, err := db.Get(context.Background(), storage.ModeGetRequest, swarm.NewAddress(outItem.Address))
if err != nil {
fmt.Println("Pinned chunk missing ", addr)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please remove all fmt.Println references? If there's a reason to annotate an error please use t.Fatalf("something has occured %v" , err)

Comment thread pkg/localstore/gc_test.go Outdated

}

func addRandomChunks(t *testing.T, count int, db *DB) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't you use this in all of the tests above?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because other tests dont need this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm seeing a lot of Puts and Sets above this, if this test helper would have a few parameters, it could save you all of the manual Puts and Sets

@acud
Copy link
Copy Markdown
Contributor

acud commented Jun 5, 2020

Hope you are happy now

You don't need to please me @jmozah. I want to know what I'm approving. Having a broad problem space that is difficult to define makes such PRs very problematic. Please note now that you're adding extra DB operations for every operation that we're doing (extra get to check the pin index). So instead of understanding what's causing this in the first place, we're taking the approach of blindly solving it by adding more redundant operations which will obviously slow down every db op.

It is also unclear to me how is it, that chunks that are pinned are not cleared out of the gc index when the exclude trigger is hit within the garbage collection run. If I would put my money on something it would be that something there is broken (either some shed.Item does not have the right fields passed to the gc for the exclude index or some other peculiarity. If the item is in the exclude index too and gets GCd then we have some problem around that area).

@jmozah
Copy link
Copy Markdown
Contributor Author

jmozah commented Jun 8, 2020

You don't need to please me @jmozah.

That was a joke. :-)

So instead of understanding what's causing this in the first place, we're taking the approach of blindly solving it by adding more redundant operations which will obviously slow down every db op.

That not true, sorry. if you don't understand the fix, I can shepherd you through the problem and the fix.

It is also unclear to me how is it, that chunks that are pinned are not cleared out of the gc index when the exclude trigger is hit within the garbage collection run. If I would put my money on something it would be that something there is broken (either some shed.Item does not have the right fields passed to the gc for the exclude index or some other peculiarity. If the item is in the exclude index too and gets GCd then we have some problem around that area).

Actually they will get cleared from exchudeIndex once they come there. Then after that the pinned chunk should not come to gcIndex (we have the same condition in ModeSetUpload), that was the design. This is to avoid doin the pinIndex.Has() in every GC, For now this is the logical fix. I get that the performance will degrade, but in this release we are not not optimizing for the performance.. are we?

Copy link
Copy Markdown
Contributor

@acud acud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmozah I understand the fix perfectly fine. You're claiming that the chunks should not be in the garbage collection index if they are in the pin index, but that is not correct according to the old design. That is why the gcExcludeIndex index was introduced, so that you could do a gcIndex ¬ gcExcludeIndex then GC. I accept the fact that there might be some data race or some other problem and that it could be theoretically that some chunks might not be in the exclude index in the first place, which might make them to get GC. But the nature of the fix does not expose where is this consistency problem in localstore.

The changes that you've done in this PR basically mean that gcExcludeIndex is no longer necessary. You can try to remove it completely and see if the problem persists, since in that case we would be doing a significant amount of redundant operations.

Did you check that the chunks that are missing also exist in the exclude gc index? Their presence in the pin index before the fix is one thing, but their protection from being garbage collected is secured with their presence in the exclude index.

Another question about the tests that you've introduced here: do they fail persistently without the fix? do they flake? if yes by which ratio?

Comment thread pkg/localstore/gc_test.go Outdated

}

func addRandomChunks(t *testing.T, count int, db *DB) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm seeing a lot of Puts and Sets above this, if this test helper would have a few parameters, it could save you all of the manual Puts and Sets

Comment thread pkg/localstore/gc_test.go Outdated

_, err = db.Get(context.Background(), storage.ModeGetRequest, rch.Address())
if err != nil {
fmt.Println("Pinned chunk missing ", rch.Address())
Copy link
Copy Markdown
Contributor

@acud acud Jun 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again and again we ask to remove fmt.Println from the codebase. Can we please agree that you review your own code and get rid of such trivial mistakes that consistently come up in every review?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops... my bad.. will remove it

@jmozah
Copy link
Copy Markdown
Contributor Author

jmozah commented Jun 9, 2020

you're claiming that the chunks should not be in the garbage collection index if they are in the pin index, but that is not correct according to the old design.

No. That is how we designed. I would be very happy if you could help me remember what the design was. You can look at the original places we did this. but left out in few other places, which is biting us now.

That is why the gcExcludeIndex index was introduced, so that you could do a gcIndex ¬ gcExcludeIndex then GC.

gcExcludeIndex was introduced for another reason altogether. You first upload a file and then pin it in the code. If the chunk enters in to gcIndex before pinning actually happens, then gcExcludeIndex would come handy and will remove the item from gcIndex just before GC is done. You can also check in the code that once the chunk is removed from the gcIndex, it will also be removed from the gcExclude Index . So this is basically 1 time saving mechanism that we built. After that, gcIndex is not supposed to have a pinned chunk by design. So basically after after one round the gcExcludeIndex will be empty. gcExcludeIndex is only for the just uploaded chunks, pinIndex is for all chunks ever pinned. This comment might remind you. Also this is the @janos comment that triggered the gcExcludeIndex design.

The changes that you've done in this PR basically mean that gcExcludeIndex is no longer necessary. You can try to remove it completely and see if the problem persists, since in that case we would be doing a significant amount of redundant operations.

As said above, gcExcludeIndex is necessary. otherwise we might loose some chunks that enter GC between uploading and pinning the file.

Did you check that the chunks that are missing also exist in the exclude gc index? Their presence in the pin index before the fix is one thing, but their protection from being garbage collected is secured with their presence in the exclude index.

gcExcludeIndex becomes empty after one round. So there is no protection after the first round of protection from gcExcludeIndex.

Another question about the tests that you've introduced here: do they fail persistently without the fix? do they flake? if yes by which ratio?

They always pass, as i said i could not reproduce this in unit tests cases here. The same scenario creates issue in @santicomp2014's setup. So i added it as a precaution.

@acud
Copy link
Copy Markdown
Contributor

acud commented Jun 9, 2020

So then the problem is clear no? If pinning the chunk is two-phased operation then there will always be a possibility for a data race where a GC runs between the upload phase and the pinning phase. Solving this thus must be an atomic operation where we can say that the chunk upload API can take a header saying to pin the chunk immediately on upload, and then create a ModePutUploadPin which takes care of adding the item into the pin index, gc exclude index and not to the GC index.

And to emphasize, there is still a possibility for a data race even with this fix, where the chunk is added to the gc without it being in the pin index, this can happen when the chunk is accessed before it is pinned (you are relying on the fact that it is in the pin index when you setAccess on Get, so if it not yet pinned it will get garbage collected). This fix narrows the window of this possible data race, but does not eliminate it entirely.

@acud
Copy link
Copy Markdown
Contributor

acud commented Jun 9, 2020

I'm also willing to bet a few bucks that the problem is still there, and that it will show itself again when you try to upload and pin a large collection (let's say a large directory with a lot of files, that will take a bit longer to traverse for pinning)

@jmozah
Copy link
Copy Markdown
Contributor Author

jmozah commented Jun 9, 2020

So then the problem is clear no? If pinning the chunk is two-phased operation then there will always be a possibility for a data race where a GC runs between the upload phase and the pinning phase.

Pinning is a two phased operation always. You can upload using a API and later call a another API to pin. That's the design at that time (although i wont do it if i do it in bee). That does not mean there will be a race condition. That means that there can be chunks that are going to be GC'd and we need to remove them from gcIndex.

And to emphasize, there is still a possibility for a data race even with this fix, where the chunk is added to the gc without it being in the pin index, this can happen when the chunk is accessed before it is pinned

That's precisely the reason of gcExcludeIndex. You still don't seem to get the gcExcludeIndex sole existence.

This fix narrows the window of this possible data race, but does not eliminate it entirely.

Can you explain the data race? where is the data race? without knowing here is the data race how are you claiming that it is narrowing the possibility.

I'm also willing to bet a few bucks that the problem is still there, and that it will show itself again when you try to upload and pin a large collection (let's say a large directory with a lot of files, that will take a bit longer to traverse for pinning)

I am also of the same impression that the problem might still be there. until we know the issue exactly, we cannot say for sure that it is fixed. Nor i am claiming that this will fix it 100%. This is a guard fix that needs to be there logically as per the design and it is not here. This PR adds it. Got it.

@jmozah jmozah merged commit 4062b5c into master Jun 9, 2020
@jmozah jmozah deleted the pinfix branch June 9, 2020 15:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants