-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MDEV-17565: Sporadic Galera failures when testing MariaDB with mtr #4
base: 3.x
Are you sure you want to change the base?
Conversation
This patch fixes several bugs in Galera that lead to sporadic failures when testing MariaDB server with the mtr due to false error messages (or warnings) that are not related to the new fixes. The patch includes a some changes taken from the latest versions of Galera from Codership, as well as some changes taken from PXC version of Galera, as well as some corrections made by me. 1) Some mtr tests sometimes fails due to "[Warning] WSREP: gcs_caused() returned -1" warnings: Currently gcs_.caused() function works only when the group is primary, and fails if the group is non-primary or even if the group in a transient state (during configuration changes). Instead of failing immediately, this patch changes gcs_.caused() to return EAGAIN error code when function was called while group in a transient state. On receiving EAGAIN error code ReplicatorSMM::causal_read() retries to obtain a new seqno (by calling gcs_.caused() again). 2) Some mtr tests sometimes fails due to "[Warning] WSREP: Failed to report last committed <number>" warnings: This is because when processing cluster configuration changes, the GCS layer does not always timely update the group->last_applied variable. To correct this error, I added an additional call to the group_redo_last_applied() function. In addition, to protect against other similar situations, I added a cycle to re-call gcs_.set_last_applied() in case of failure due to interruption of internal operations in the current Galera implementation. 3) Some mtr test sometimes fails when node is evicted from the cluster in middle of SST. Even when node evicted, the SST script may completes normally. After this, the node calls the gcs_join() function and tries to join the cluster. However, this is impossible, because the node is already evicted. Therefore, the _join() function (which called from gcs_join) fails. Then node does IST (which also fails), after/during which it is aborted. To fix this, we should avoid joining the cluster through gcs_join function if node is evicted. To do this, we should check the current connection state in the gcs_join() function to return from it immediately if the node's communication channel was closed. 4) If SST fails due to a network error, the node that acted as a donor sometimes does not return to its original state, which leads to failure due to the inability to continue the test execution (due to a timeout). If sst_sent() fails node should restore itself back to joined state. The sst_sent function can fail. commonly due to network errors, where DONOR may lose connectivity to JOINER (or existing cluster). But on re-join it should restore the original state without waiting for transition to JOINER state. SST failure on JOINER will gracefully shutdown the joiner. https://jira.mariadb.org/browse/MDEV-17565
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please address my mostly trivial review commends and wait for review from Alexey.
act->type = GCS_ACT_COMMIT_CUT; | ||
|
||
int const buf_len(sizeof(commit_cut)); | ||
void* const buf(malloc(buf_len)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not clear to me is this memory allocation freed automatically?
@@ -255,7 +255,7 @@ group_check_donor (gcs_group_t* group) | |||
gu_warn ("Donor %s is no longer in the group. State transfer cannot " | |||
"be completed, need to abort. Aborting...", donor_id); | |||
|
|||
gu_abort(); | |||
// gu_abort(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why leave commented code ?
@@ -311,7 +333,10 @@ namespace galera | |||
return ret; | |||
} | |||
|
|||
gcs_seqno_t caused() { return global_seqno_; } | |||
void caused(gcs_seqno_t& seqno, gu::datetime::Date& wait_until) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope recent compiler does not produce a error on unused parameter, have you tried e.g. gcc 9 ?
@@ -1420,6 +1452,7 @@ galera::ReplicatorSMM::process_conf_change(void* recv_ctx, | |||
assert(app_req_len <= 0); | |||
log_fatal << "View callback failed. This is unrecoverable, " | |||
<< "restart required."; | |||
local_monitor_.leave(lo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add comment why this is necessary.
@@ -1428,14 +1461,13 @@ galera::ReplicatorSMM::process_conf_change(void* recv_ctx, | |||
log_fatal << "Local state UUID " << state_uuid_ | |||
<< " is different from group state UUID " << group_uuid | |||
<< ", and SST request is null: restart required."; | |||
local_monitor_.leave(lo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here also comment.
@@ -1528,6 +1560,7 @@ galera::ReplicatorSMM::process_conf_change(void* recv_ctx, | |||
{ | |||
log_fatal << "Internal error: unexpected next state for " | |||
<< "non-prim: " << next_state << ". Restart required."; | |||
local_monitor_.leave(lo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add comment also here.
libgcs_env.Append(CPPFLAGS = ' -Wno-missing-field-initializers') | ||
libgcs_env.Append(CPPFLAGS = ' -Wno-effc++') | ||
libgcs_env.Append(CCFLAGS = ' -Wno-missing-field-initializers') | ||
libgcs_env.Append(CCFLAGS = ' -Wno-variadic-macros') | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is these changes really needed and why ?
@@ -985,8 +999,6 @@ wsrep_status_t galera::ReplicatorSMM::causal_read(wsrep_gtid_t* gtid) | |||
// at monitor drain and disallowing further waits until | |||
// configuration change related operations (SST etc) have been | |||
// finished. | |||
gu::datetime::Date wait_until(gu::datetime::Date::calendar() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you remove this as it is used below ?
This patch fixes several bugs in Galera that lead to sporadic
failures when testing MariaDB server with the mtr due to false
error messages (or warnings) that are not related to the new
fixes.
The patch includes a some changes taken from the latest
versions of Galera from Codership, as well as some changes
taken from PXC version of Galera, as well as some corrections
made by me.
"[Warning] WSREP: gcs_caused() returned -1" warnings:
Currently gcs_.caused() function works only when the group
is primary, and fails if the group is non-primary or even if
the group in a transient state (during configuration changes).
Instead of failing immediately, this patch changes gcs_.caused()
to return EAGAIN error code when function was called while
group in a transient state. On receiving EAGAIN error code
ReplicatorSMM::causal_read() retries to obtain a new seqno
(by calling gcs_.caused() again).
"[Warning] WSREP: Failed to report last committed "
warnings:
This is because when processing cluster configuration changes,
the GCS layer does not always timely update the group->last_applied
variable.
To correct this error, I added an additional call to the
group_redo_last_applied() function. In addition, to protect
against other similar situations, I added a cycle to re-call
gcs_.set_last_applied() in case of failure due to interruption
of internal operations in the current Galera implementation.
the cluster in middle of SST.
Even when node evicted, the SST script may completes normally.
After this, the node calls the gcs_join() function and tries
to join the cluster. However, this is impossible, because the
node is already evicted. Therefore, the _join() function
(which called from gcs_join) fails. Then node does IST
(which also fails), after/during which it is aborted.
To fix this, we should avoid joining the cluster through
gcs_join function if node is evicted. To do this, we should
check the current connection state in the gcs_join() function
to return from it immediately if the node's communication
channel was closed.
as a donor sometimes does not return to its original state,
which leads to failure due to the inability to continue
the test execution (due to a timeout).
If sst_sent() fails node should restore itself back to joined
state. The sst_sent function can fail. commonly due to network
errors, where DONOR may lose connectivity to JOINER (or existing
cluster). But on re-join it should restore the original state
without waiting for transition to JOINER state. SST failure
on JOINER will gracefully shutdown the joiner.
https://jira.mariadb.org/browse/MDEV-17565