MDEV-17565: Sporadic Galera failures when testing MariaDB with mtr #4

sysprg · 2018-10-30T02:03:23Z

This patch fixes several bugs in Galera that lead to sporadic
failures when testing MariaDB server with the mtr due to false
error messages (or warnings) that are not related to the new
fixes.

The patch includes a some changes taken from the latest
versions of Galera from Codership, as well as some changes
taken from PXC version of Galera, as well as some corrections
made by me.

Some mtr tests sometimes fails due to
"[Warning] WSREP: gcs_caused() returned -1" warnings:

Currently gcs_.caused() function works only when the group
is primary, and fails if the group is non-primary or even if
the group in a transient state (during configuration changes).

Instead of failing immediately, this patch changes gcs_.caused()
to return EAGAIN error code when function was called while
group in a transient state. On receiving EAGAIN error code
ReplicatorSMM::causal_read() retries to obtain a new seqno
(by calling gcs_.caused() again).

Some mtr tests sometimes fails due to
"[Warning] WSREP: Failed to report last committed "
warnings:

This is because when processing cluster configuration changes,
the GCS layer does not always timely update the group->last_applied
variable.

To correct this error, I added an additional call to the
group_redo_last_applied() function. In addition, to protect
against other similar situations, I added a cycle to re-call
gcs_.set_last_applied() in case of failure due to interruption
of internal operations in the current Galera implementation.

Some mtr test sometimes fails when node is evicted from
the cluster in middle of SST.

Even when node evicted, the SST script may completes normally.
After this, the node calls the gcs_join() function and tries
to join the cluster. However, this is impossible, because the
node is already evicted. Therefore, the _join() function
(which called from gcs_join) fails. Then node does IST
(which also fails), after/during which it is aborted.

To fix this, we should avoid joining the cluster through
gcs_join function if node is evicted. To do this, we should
check the current connection state in the gcs_join() function
to return from it immediately if the node's communication
channel was closed.

If SST fails due to a network error, the node that acted
as a donor sometimes does not return to its original state,
which leads to failure due to the inability to continue
the test execution (due to a timeout).

If sst_sent() fails node should restore itself back to joined
state. The sst_sent function can fail. commonly due to network
errors, where DONOR may lose connectivity to JOINER (or existing
cluster). But on re-join it should restore the original state
without waiting for transition to JOINER state. SST failure
on JOINER will gracefully shutdown the joiner.

https://jira.mariadb.org/browse/MDEV-17565

This patch fixes several bugs in Galera that lead to sporadic failures when testing MariaDB server with the mtr due to false error messages (or warnings) that are not related to the new fixes. The patch includes a some changes taken from the latest versions of Galera from Codership, as well as some changes taken from PXC version of Galera, as well as some corrections made by me. 1) Some mtr tests sometimes fails due to "[Warning] WSREP: gcs_caused() returned -1" warnings: Currently gcs_.caused() function works only when the group is primary, and fails if the group is non-primary or even if the group in a transient state (during configuration changes). Instead of failing immediately, this patch changes gcs_.caused() to return EAGAIN error code when function was called while group in a transient state. On receiving EAGAIN error code ReplicatorSMM::causal_read() retries to obtain a new seqno (by calling gcs_.caused() again). 2) Some mtr tests sometimes fails due to "[Warning] WSREP: Failed to report last committed <number>" warnings: This is because when processing cluster configuration changes, the GCS layer does not always timely update the group->last_applied variable. To correct this error, I added an additional call to the group_redo_last_applied() function. In addition, to protect against other similar situations, I added a cycle to re-call gcs_.set_last_applied() in case of failure due to interruption of internal operations in the current Galera implementation. 3) Some mtr test sometimes fails when node is evicted from the cluster in middle of SST. Even when node evicted, the SST script may completes normally. After this, the node calls the gcs_join() function and tries to join the cluster. However, this is impossible, because the node is already evicted. Therefore, the _join() function (which called from gcs_join) fails. Then node does IST (which also fails), after/during which it is aborted. To fix this, we should avoid joining the cluster through gcs_join function if node is evicted. To do this, we should check the current connection state in the gcs_join() function to return from it immediately if the node's communication channel was closed. 4) If SST fails due to a network error, the node that acted as a donor sometimes does not return to its original state, which leads to failure due to the inability to continue the test execution (due to a timeout). If sst_sent() fails node should restore itself back to joined state. The sst_sent function can fail. commonly due to network errors, where DONOR may lose connectivity to JOINER (or existing cluster). But on re-join it should restore the original state without waiting for transition to JOINER state. SST failure on JOINER will gracefully shutdown the joiner. https://jira.mariadb.org/browse/MDEV-17565

janlindstrom

Please address my mostly trivial review commends and wait for review from Alexey.

janlindstrom · 2019-12-05T06:11:07Z

gcs/src/gcs_core.cpp

-                act->type                 = GCS_ACT_COMMIT_CUT;
+
+            int   const buf_len(sizeof(commit_cut));
+            void* const buf(malloc(buf_len));


It is not clear to me is this memory allocation freed automatically?

janlindstrom · 2019-12-05T06:11:46Z

gcs/src/gcs_group.cpp

@@ -255,7 +255,7 @@ group_check_donor (gcs_group_t* group)
        gu_warn ("Donor %s is no longer in the group. State transfer cannot "
                 "be completed, need to abort. Aborting...", donor_id);

-        gu_abort();
+        // gu_abort();


Why leave commented code ?

janlindstrom · 2019-12-05T06:13:23Z

galera/src/galera_gcs.hpp

@@ -311,7 +333,10 @@ namespace galera
            return ret;
        }

-        gcs_seqno_t caused() { return global_seqno_; }
+        void caused(gcs_seqno_t& seqno, gu::datetime::Date& wait_until)


I hope recent compiler does not produce a error on unused parameter, have you tried e.g. gcc 9 ?

janlindstrom · 2019-12-05T06:17:41Z

galera/src/replicator_smm.cpp

@@ -1420,6 +1452,7 @@ galera::ReplicatorSMM::process_conf_change(void*                    recv_ctx,
        assert(app_req_len <= 0);
        log_fatal << "View callback failed. This is unrecoverable, "
                  << "restart required.";
+        local_monitor_.leave(lo);


Add comment why this is necessary.

janlindstrom · 2019-12-05T06:17:57Z

galera/src/replicator_smm.cpp

@@ -1428,14 +1461,13 @@ galera::ReplicatorSMM::process_conf_change(void*                    recv_ctx,
        log_fatal << "Local state UUID " << state_uuid_
                  << " is different from group state UUID " << group_uuid
                  << ", and SST request is null: restart required.";
+        local_monitor_.leave(lo);


Here also comment.

janlindstrom · 2019-12-05T06:18:27Z

galera/src/replicator_smm.cpp

@@ -1528,6 +1560,7 @@ galera::ReplicatorSMM::process_conf_change(void*                    recv_ctx,
        {
            log_fatal << "Internal error: unexpected next state for "
                      << "non-prim: " << next_state << ". Restart required.";
+            local_monitor_.leave(lo);


Add comment also here.

janlindstrom · 2019-12-05T06:20:31Z

gcs/src/SConscript

-libgcs_env.Append(CPPFLAGS = ' -Wno-missing-field-initializers')
-libgcs_env.Append(CPPFLAGS = ' -Wno-effc++')
+libgcs_env.Append(CCFLAGS = ' -Wno-missing-field-initializers')
+libgcs_env.Append(CCFLAGS = ' -Wno-variadic-macros')



Is these changes really needed and why ?

janlindstrom · 2019-12-05T10:27:18Z

galera/src/replicator_smm.cpp

@@ -985,8 +999,6 @@ wsrep_status_t galera::ReplicatorSMM::causal_read(wsrep_gtid_t* gtid)
        // at monitor drain and disallowing further waits until
        // configuration change related operations (SST etc) have been
        // finished.
-        gu::datetime::Date wait_until(gu::datetime::Date::calendar()


Why did you remove this as it is used below ?

janlindstrom self-assigned this Dec 4, 2019

janlindstrom reviewed Dec 5, 2019

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MDEV-17565: Sporadic Galera failures when testing MariaDB with mtr #4

MDEV-17565: Sporadic Galera failures when testing MariaDB with mtr #4

sysprg commented Oct 30, 2018

janlindstrom left a comment

janlindstrom Dec 5, 2019

janlindstrom Dec 5, 2019

janlindstrom Dec 5, 2019

janlindstrom Dec 5, 2019

janlindstrom Dec 5, 2019

janlindstrom Dec 5, 2019

janlindstrom Dec 5, 2019

janlindstrom Dec 5, 2019

MDEV-17565: Sporadic Galera failures when testing MariaDB with mtr #4

Are you sure you want to change the base?

MDEV-17565: Sporadic Galera failures when testing MariaDB with mtr #4

Conversation

sysprg commented Oct 30, 2018

janlindstrom left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment