
Conversation

@russelldb
Contributor

The kv679_dataloss_fb.erl test was part of the original kv679 suite, but at the time of 2.1
it was put in its own branch. See #719 for discussion and decision. In short, Riak Test is made up of passing tests,
and kv679_dataloss_fb.erl still failed. Adding this back to a
branch off develop as I now have a fix for this issue.

There is also a new test, kv679_dataloss_fb2.erl, that shows another dataloss edge case in riak, even in 2.1 with
per-key-actor-epochs enabled. The test is a little convoluted, and is
based on a quickcheck counter example, included in the riak_kv/test
directory. In short, both this test, and the other kv679_dataloss_fb
test, show that even with multiple replicas acking/storing, a single
disk error on a single replica is enough to cause acked writes to be
silently and permanently lost. For a replicated database, that is bad.
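
The sketch below is a hedged illustration (not the actual kv679_dataloss_fb.erl code) of the failure shape described above: an acked write, a single replica losing its backend data, and the key later coming back notfound. It uses common riak_test/riakc helpers; the cluster size, bucket/key names, and the way the disk error is simulated are assumptions for illustration only.

```erlang
%% Hedged sketch of the dataloss shape described above; not the real test.
-module(kv679_dataloss_sketch).
-export([confirm/0]).
-include_lib("eunit/include/eunit.hrl").

-define(BUCKET, <<"kv679">>).
-define(KEY, <<"test">>).

confirm() ->
    %% Build a small cluster and perform a write that the cluster acks.
    [Node1 | _] = rt:build_cluster(4),
    Client = rt:pbc(Node1),
    Obj = riakc_obj:new(?BUCKET, ?KEY, <<"acked-value">>),
    ok = riakc_pb_socket:put(Client, Obj, [{w, 3}]),

    %% Simulate a single disk error on one replica, e.g. by removing that
    %% vnode's backend data (the real tests do this far more surgically,
    %% driving fallbacks and handoff to hit the interleaving from the
    %% quickcheck counterexample).
    %% ... backend data for one partition deleted here ...

    %% Pre-fix, the acked write can end up notfound on every replica,
    %% i.e. silently and permanently lost. This assertion captures that.
    ?assertMatch({ok, _}, riakc_pb_socket:get(Client, ?BUCKET, ?KEY)),
    pass.
```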

@russelldb
Contributor Author

I'm going to open the PR against riak_kv that provides a fix for both these tests shortly, and will edit this to add the link when I do. In the meantime, feel free to run these against 2.1 and up.

@nickelization
Contributor

nickelization commented Mar 27, 2017

Both tests make sense, and I confirmed they both fail without basho/riak_kv#1643 and pass with it. Still reviewing the riak_kv PR, but the test PR looks good to me. Thanks, Russell!

@nickelization
Contributor

+1

The partition repair test deletes all the data at a partition, and then
repairs it from neighbouring partitions. The subset of repaired data
that was originally coordinated by the deleted partition's vnode showed
up as `notfound` since the latest kv679 changes in
basho/riak_kv#1643. The reason is that the fix
in the KV repo adds a new actor to the repaired key's vclock. Prior to
this change, `verify` in partition_repair.erl did a simple equality check
on the binary-encoded riak_object values. This change takes into account
that a new actor may be in the vclock at the repaired vnode, and uses a
semantic equality check based on riak_object merge and riak_object
equal.
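
A hedged sketch of the kind of semantic check described above, assuming the riak_object merge/equal API referenced in the comment (exact function names and arities are assumptions and may differ between riak_kv versions):

```erlang
%% Instead of comparing the binary-encoded objects byte-for-byte, decode
%% both sides, merge them, and compare the merge against the repaired
%% replica's object. An extra actor in the repaired vclock then no longer
%% shows up as a spurious mismatch.
semantically_equal(Bucket, Key, ExpectedBin, RepairedBin) ->
    Expected = riak_object:from_binary(Bucket, Key, ExpectedBin),
    Repaired = riak_object:from_binary(Bucket, Key, RepairedBin),
    Merged   = riak_object:merge(Expected, Repaired),
    riak_object:equal(Merged, Repaired).
```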
@JeetKunDoug
Contributor

+1 - merging to fix partition_repair before next Giddyup run. Still need to add the two new tests to the Giddyup database.

@JeetKunDoug JeetKunDoug merged commit 8170137 into basho:develop Apr 5, 2017
