
Delete leaves dangling entities, allows double-upserts #2148

Closed

aphyr opened this issue Feb 21, 2018 · 4 comments
aphyr commented Feb 21, 2018

On the current Dgraph nightly (2018/02/20), in a healthy five-node cluster without any faults, concurrently upserting, deleting, and reading records by a single indexed integer attribute (key) can result in queries for that key returning long-lived records with only a UID and no key attached, as if the UID were still present in the key's index even though the triple has been deleted. Moreover, reads may see one of these dangling deleted records concurrently with a new record for the same key, and may even observe two complete records, one of which has been deleted.

This violates snapshot isolation, since reads are clearly not seeing a consistent point-in-time view of state. It breaks internal consistency, since a query for eq(key, 4) can return a record which doesn't have a key at all. And since a single process can actually see deleted data, it's definitely not sequential.

You can reproduce this on Jepsen 282623ca5b733097b9a524f70a451ca40151310e by running

lein run test -w delete -f --package-url https://github.com/dgraph-io/dgraph/releases/download/nightly/dgraph-linux-amd64.tar.gz --time-limit 120 --concurrency 100

For future users: you'll need Dgraph 1.0.3 or 1.0.4, since this bug is on the nightlies and I don't think those have stable URLs. Test code is here.

The schema for this test is key: int @index(int) . (the trailing dot is part of Dgraph's schema syntax). We read keys by querying

{
  q(func: eq(key, $key)) {
    uid
    key
  }
}

... delete by querying for just the UID and issuing a delete for every attribute with {uid: 0x123}, and upsert by reading for an existing record with that key and, if none exists, inserting a new one. All operations occur within a transaction.
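For concreteness, here is a minimal sketch of those read, delete, and upsert transactions, written against pydgraph rather than the Jepsen Clojure client the test actually uses; the server address and the helper names are illustrative assumptions:

import json
import pydgraph

# Assumed address of one alpha node; the test runs against a five-node cluster.
stub = pydgraph.DgraphClientStub('localhost:9080')
client = pydgraph.DgraphClient(stub)

def read(txn, key):
    # Query the key index for all records with this key.
    res = txn.query('{ q(func: eq(key, %d)) { uid key } }' % key)
    return json.loads(res.json)['q']

def delete(client, key):
    # Delete every record the index returns for this key.
    txn = client.txn()
    try:
        for record in read(txn, key):
            # A delete mutation with only a uid removes all of the node's attributes.
            txn.mutate(del_obj={'uid': record['uid']})
        txn.commit()
    finally:
        txn.discard()

def upsert(client, key):
    # Insert a record for this key only if none currently exists.
    txn = client.txn()
    try:
        if not read(txn, key):
            txn.mutate(set_obj={'key': key})
        txn.commit()
    finally:
        txn.discard()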

In this run, for key 0, an entity 0x298 already exists for the key but is deleted by process 2. Processes 0, 1, 6, and 8 confirm the deletion of that entity: their queries find no matches. Process 2 then upserts a new record, which is visible in reads by processes 5 and 6: {:uid "0x2ab", :key 0}. Concurrently with that upsert, process 4 deletes the key. Concurrent reads return dangling records: {:uid "0x2ab"} with no key. Other processes continue trying to delete key 0; their queries observe uid 0x2ab and issue a delete for it, but apparently with no effect. This dangling record, 0x2ab, persists for the remainder of the test--just over two minutes.

[screenshot from 2018-02-20 19-41-38]

In this diagram, time flows down, each process is a vertical track, and operations are colored bars. Blue means success; pink is a known failed operation, i.e. one that did not occur.

Moreover, queries may return multiple dangling records:

[{:uid "0xcf"} {:uid "0x110"}]

... a dangling record and a complete record:

[{:uid "0xcf"} {:uid "0x177", :key 1}]

... or multiple complete records for the same key:

[{:uid "0x148b"} {:uid "0x14f6", :key 13}]
[{:uid "0x148b"} {:uid "0x150c", :key 13}]
[{:uid "0x148b", :key 13} {:uid "0x150c", :key 13}]

For instance, in this run, shortly after an upsert, process 33 manages to observe two complete, coexisting records for the same key: 0x25fb and 0x26e0. Note that 0x25fb had already been deleted, which means it went from dangling back to non-dangling. Note also that for the upsert to take place, it must have failed to observe 0x25fb, or it wouldn't have inserted anything. Something really odd is going on here.

[screenshot from 2018-02-20 20-49-11]

Even weirder, in 20180220T205917.000-0600.zip, 0x7e7 is a dangling record which process 36 can see and apparently successfully delete (although the deletion doesn't actually remove the dangling record). However, right after reading 0x7e7, process 36 attempts a delete and finds nothing to delete. Then 0x7e7 goes right back to being a dangling record and is visible to process 36 again--this time as a full record, in conjunction with the newly upserted 0x808. 0x808 turns out to be deletable. 0x7e7 then goes back to being a full record, and a dangling record after that.

[screenshot from 2018-02-20 21-06-11]

@manishrjain added the kind/bug label Feb 21, 2018
@janardhan1993 self-assigned this Mar 7, 2018
janardhan1993 commented Mar 13, 2018

#2220 should fix some of the cases where this test was failing (due to issues with SP* deletes).

@manishrjain added the kind/bug label Mar 21, 2018
janardhan1993 commented
The test is passing now with https://transfer.sh/13hS3v/dgraph-linux-amd64.tar.gz. Fixed by #2261.

@manishrjain added this to the Sprint-000 milestone Mar 28, 2018
@manishrjain modified the milestones: Sprint-000, Sprint-001 Apr 2, 2018
@ghost removed the backlog label Apr 4, 2018
janardhan1993 commented
This passes now with server_side_sequencing hardcoded in the server.
@aphyr We need to add the @upsert directive for this test as well.
We also need to use the latest client and set the sequencing mode once this PR (https://github.com/dgraph-io/dgraph4j/pull/54/files) is merged.
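For reference, adding the @upsert directive would change the test schema roughly as follows (a sketch: the index line is the schema quoted above, and @upsert is Dgraph's directive for enabling index-conflict detection on a predicate):

key: int @index(int) @upsert .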

aphyr commented Apr 4, 2018

I'm going to build the new client and give that a shot. :)
