MDEV-25855 Added support for Galera replication with cluster auto bootstrapping #377

Open
wants to merge 2 commits into
base: master


@tymonx tymonx commented May 21, 2021

This patch adds support for Galera replication. It fixes #28 (Support Galera Replication).

Features:

  • it detects whether Galera replication is enabled, using the mysql
    configuration files or the provided mysqld command line arguments
  • by default it enables the cluster auto bootstrap feature
  • by default the first cluster node is used for cluster auto bootstrapping,
    based on the wsrep_cluster_address parameter from the mysql
    configuration files, the mysqld command line arguments, or the
    WSREP_CLUSTER_ADDRESS environment variable
  • the cluster auto bootstrap feature can be disabled by setting the
    WSREP_SKIP_AUTO_BOOTSTRAP environment variable
  • use the WSREP_AUTO_BOOTSTRAP_ADDRESS environment variable to explicitly
    choose another node for cluster bootstrapping
  • cluster node hostnames or IP addresses must be valid to enable cluster
    auto bootstrapping
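
As an illustration of the first bullet, here is a minimal sketch of detecting Galera from the mysqld command line arguments only. The function name `wsrep_on_from_args` is hypothetical and is not taken from the patch; configuration files are not covered, and the last occurrence of the option is assumed to win.

```sh
#!/usr/bin/env sh

# Hypothetical helper (not from the patch): succeed when the given mysqld
# arguments enable Galera through --wsrep-on / --wsrep_on.
# Assumption: the last occurrence of the option wins.
wsrep_on_from_args() {
    enabled=1   # exit status convention: 1 means "not enabled"
    for arg in "$@"; do
        case "${arg}" in
            --wsrep-on | --wsrep_on)
                enabled=0 ;;
            --wsrep-on=* | --wsrep_on=*)
                case "${arg#*=}" in
                    1 | [Oo][Nn] | [Tt][Rr][Uu][Ee]) enabled=0 ;;
                    *) enabled=1 ;;
                esac ;;
        esac
    done
    return "${enabled}"
}
```

For example, `wsrep_on_from_args --wsrep-on=on --binlog_format=row` succeeds, while `wsrep_on_from_args --port 3306` fails.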

How to use it.

  1. Prepare mysql configuration file galera.cnf:
[galera]
wsrep_on                       = ON
wsrep_sst_method               = rsync
wsrep_provider                 = /usr/lib/libgalera_smm.so
bind-address                   = 0.0.0.0
binlog_format                  = row
default_storage_engine         = InnoDB
innodb_doublewrite             = 1
innodb_autoinc_lock_mode       = 2
innodb_flush_log_at_trx_commit = 2
  2. Remove write permission for others (it fixes Warning: World-writable config file):
chmod o-w galera.cnf
  3. Prepare Docker Compose file docker-compose.yml:
services:
    node:
        image: mariadb
        restart: always
        environment:
            WSREP_CLUSTER_ADDRESS: "${WSREP_CLUSTER_ADDRESS:-}"
            MYSQL_ROOT_PASSWORD: example
        volumes:
            - ./galera.cnf:/etc/mysql/conf.d/10-galera.cnf:ro,z
        command:
            - --wsrep-cluster-address=gcomm://db_node_1,db_node_2,db_node_3
        deploy:
            replicas: 3
  4. Start Docker Compose:
docker-compose --project-name db up

To start N MariaDB instances using an environment variable:

export WSREP_CLUSTER_ADDRESS="gcomm://db_node_1,db_node_2,db_node_3,db_node_4,db_node_5"
docker-compose --project-name db up --scale node="$(echo "${WSREP_CLUSTER_ADDRESS}" | tr ',' ' ' | wc -w)"

To start N MariaDB instances using a mysql configuration file:

docker-compose --project-name db up --scale node="$(grep -i wsrep_cluster_address <name>.cnf | tr -d ' ' | tr ',' ' ' | wc -w)"
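
The one-liners above derive the replica count by turning commas into spaces and counting words. A slightly more defensive sketch is shown here, assuming the value is a single gcomm:// URI; the function name `count_cluster_nodes` is hypothetical and not part of the patch.

```sh
#!/usr/bin/env sh

# Hypothetical helper: print the number of nodes in a cluster address
# such as gcomm://a,b,c (an empty address list prints 0).
count_cluster_nodes() {
    address="${1#*://}"         # drop a URI scheme such as gcomm://
    address="${address%%\?*}"   # drop trailing ?option=value pairs
    [ -n "${address}" ] || { echo 0; return; }
    (
        # The IFS and globbing changes stay inside this subshell.
        IFS=','
        set -f
        # Intentionally unquoted: split the list on commas.
        set -- ${address}
        echo "$#"
    )
}
```

It could then be used as `--scale node="$(count_cluster_nodes "${WSREP_CLUSTER_ADDRESS}")"`.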

To start N MariaDB instances using a POSIX shell helper script:

#!/usr/bin/env sh

# usage: scale.sh <project-name> <service-name> <scale>
#  e.g.: scale.sh db node 5

PROJECT_NAME="${1:-db}"
SERVICE_NAME="${2:-node}"
SCALE="${3:-3}"

# exported so that docker-compose can substitute it in docker-compose.yml
export WSREP_CLUSTER_ADDRESS="gcomm://${PROJECT_NAME}_${SERVICE_NAME}_1"

for i in $(seq 2 "${SCALE}"); do
    WSREP_CLUSTER_ADDRESS="${WSREP_CLUSTER_ADDRESS},${PROJECT_NAME}_${SERVICE_NAME}_${i}"
done

docker-compose --project-name "${PROJECT_NAME}" up --scale "${SERVICE_NAME}"="${SCALE}"

Example usage:

./scale.sh db node 5

@julienfritsch44
Collaborator

@janlindstrom do you think you can review this, please?

@grooverdan grooverdan changed the title Added support for Galera replication with cluster auto bootstrapping MDEV-25855 Added support for Galera replication with cluster auto bootstrapping Jun 4, 2021
@janlindstrom

I must say I do not know much about Docker, but the changes do look reasonable.

@grooverdan
Member

Thanks @janlindstrom.

@tymonx sorry I've been so slow; I am progressing. I've been testing with podman{,-compose}; being userspace-only limits some things, like unique IP addresses per node (there will probably be a way eventually). I've also been reacquainting myself with Galera and Compose to ensure that it's the right design.

I'm pretty happy so far. Just been composing test cases.

Success:

  • detection of volume state and the initialization

Not Yet (to be fixed eventually):

  • ports on the cluster address should be ignored (very small change to docker_address_match).

What was the rationale behind the order in: docker_ip_match "$resolved" || docker_ip_match "$1" || docker_hostname_match "$resolved" || docker_hostname_match "$1" ? Wouldn't you take direct $1 matches before a resolution?

@ChristianCiach

ChristianCiach commented Jun 27, 2021

Hi @tymonx! Thank you for doing this! We are currently evaluating bitnami/mariadb-galera, but we are seeing quite a lot of bugs. Some of these bugs happen because this image is not designed for host networking (--network host) or for using IP addresses instead of hostnames in the wsrep-cluster-address (even though this is recommended by the Galera documentation).

Please make sure that your PR also works in these cases.

Also, you may want to provide an option to force a container into bootstrap mode. When the whole cluster crashes, it may happen that no node is safe_to_bootstrap. When this happens, one node must be forced to bootstrap. On native MariaDB installations, you would just run mysqld --wsrep-new-cluster again after editing grastate.dat to set safe_to_bootstrap: 1. The Bitnami image provides the environment variable MARIADB_GALERA_FORCE_SAFETOBOOTSTRAP (see https://github.com/bitnami/bitnami-docker-mariadb-galera/blob/3b93659e7d0647a5bf3810cc204d71d834120266/10.5/debian-10/rootfs/opt/bitnami/scripts/libmariadbgalera.sh#L99).

But after thinking about this for a minute, this is probably not necessary here, because the user could just pass --wsrep-new-cluster as a command to docker run, right? (This is not possible when using the Bitnami image, which is probably why they invented the environment variable).
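
For reference, the manual grastate.dat edit mentioned above can be sketched as follows. The function name `force_safe_to_bootstrap` is hypothetical; the sketch assumes the file contains a line starting with `safe_to_bootstrap:` and that the caller has already decided this node is the right one to bootstrap from.

```sh
#!/usr/bin/env sh

# Hypothetical helper: rewrite the safe_to_bootstrap line of a grastate.dat
# file to 1. Returns non-zero if the file does not exist.
force_safe_to_bootstrap() {
    grastate="$1"
    [ -f "${grastate}" ] || return 1
    tmp="${grastate}.tmp.$$"
    # Write to a temporary file first, then replace the original.
    sed 's/^safe_to_bootstrap:.*/safe_to_bootstrap: 1/' "${grastate}" > "${tmp}" &&
        mv "${tmp}" "${grastate}"
}
```

After that, the node would be started with --wsrep-new-cluster as described above. Note that the replaced file is created with the current user and umask, so ownership may need to be checked inside a container.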

@ChristianCiach

ChristianCiach commented Jun 27, 2021

It would be nice if you could provide a way to force a node into bootstrap mode just once. In case of a cluster crash, I want a node to force-bootstrap just once to repair the cluster. But when I do docker restart when the cluster is working again, I don't want the container to force-bootstrap again.

Edit: I have no idea how this could be achieved...
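
One possible direction, sketched here purely as an assumption and not as anything the patch implements: pass a one-time token and remember it in the data directory, which survives docker restart. The function name, the token argument, and the marker file name are all hypothetical.

```sh
#!/usr/bin/env sh

# Hypothetical helper: succeed only the first time a given token is seen
# for a data directory, so a forced bootstrap is not repeated on restart.
# usage: should_force_bootstrap_once <token> <datadir>
should_force_bootstrap_once() {
    token="$1"
    marker="$2/.force_bootstrap_token"   # hypothetical marker file name
    [ -n "${token}" ] || return 1        # no token: never force
    if [ -f "${marker}" ] && [ "$(cat "${marker}")" = "${token}" ]; then
        return 1                         # this token was already consumed
    fi
    printf '%s\n' "${token}" > "${marker}"
}
```

A later recovery would then need a new token value, while a plain restart with the old value does nothing.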

@grooverdan
Member

@ChristianCiach thanks for your interest and for describing the requirements and use cases. The number of variants is what is making this take so long to review. While the aim is not to be comprehensive in the first version, I do aim for an implementation that will be stable.

Yes, --wsrep-new-cluster can be passed as an argument to force bootstrapping, but as you mentioned, this isn't desired on restart, so a different option/variable is needed.

I'm going to consider this bootstrap first, and then recovery as the next step.

@ChristianCiach

ChristianCiach commented Jun 27, 2021

Bitnami's MARIADB_GALERA_FORCE_SAFETOBOOTSTRAP has the same issue, as it also doesn't remove itself. When using this environment variable, you have to remember to re-deploy the container without this variable after the cluster has recovered.

@tymonx
Author

tymonx commented Jun 27, 2021

I'm back :)

  • ports on the cluster address should be ignored (very small change to docker_address_match).

Fixed. I have also added a line for stripping cluster address options ?option1=value1[&option2=value2]:

# it removes URI schemes like gcomm://
address="${address#[[:graph:]]*://}"

# it removes port suffix per address
address="${address/:[0-9]*/}"

# it removes options suffix ?option1=value1[&option2=value2]
address="${address%\?[[:graph:]]*}"
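
A self-contained sketch of the same stripping for a single address, using only POSIX parameter expansion. The function name `normalize_address` is hypothetical, and bracketed IPv6 addresses are not handled.

```sh
#!/usr/bin/env sh

# Hypothetical helper: reduce one cluster address entry to its bare
# hostname or IPv4 address.
normalize_address() {
    address="${1#*://}"             # drop a URI scheme such as gcomm://
    address="${address%%\?*}"       # drop ?option1=value1[&option2=value2]
    address="${address%%:[0-9]*}"   # drop a :port suffix
    printf '%s\n' "${address}"
}
```

For example, `normalize_address 'gcomm://db_node_1:4567?option1=value1'` prints `db_node_1`.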

What was the rationale behind the order in: docker_ip_match "$resolved" || docker_ip_match "$1" || docker_hostname_match "$resolved" || docker_hostname_match "$1" ? Wouldn't you take direct $1 matches before a resolution?

I was just randomly hitting my keyboard; there was no specific reason. I have already changed the order, so hostnames are now matched first.

I have added new changes after some intense testing in various environments: Docker Compose, Docker Swarm, QEMU, and Fedora CoreOS, with and without virtualization, and on physical machines.

  1. DNS lookups in both directions, IP -> hostname and hostname -> IP. This allows a node to be matched correctly whether it is listed by IP address or by hostname.

Reasons:

  • Docker Compose/Swarm implicitly creates two hostnames: <service-name>-<id>.<network-name> and a random hash. The lookup allows matching against either <service-name>-<id>.<network-name> or <service-name>-<id>
  • Virtual machines like QEMU hide the guest (the container with MariaDB) in its own network with its own IP. It is possible to set the hostname from Compose/Swarm like this: -netdev user,id=<name>,hostname=$(hostname) -device virtio-net,netdev=<name>, and then use <service-name>-<id>.<network-name> or <service-name>-<id>
  • The machine hostname can be any name, including one that is not reachable from the network. A DNS reverse lookup resolves that
  2. I have fixed the YAML example in the PR description. The proper SELinux label is :ro,z, not :ro,Z (see "Configure the selinux label")
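
The matching idea from item 1 (accepting both the full <service-name>-<id>.<network-name> form and the short <service-name>-<id> form) could be sketched as below. This is an illustration only: `address_matches` is a hypothetical name, and the list of local names and IP addresses is assumed to have been collected by the caller.

```sh
#!/usr/bin/env sh

# Hypothetical helper: succeed if CANDIDATE equals one of the local names
# or addresses, or is the short form of a dotted local name.
# usage: address_matches <candidate> <local-name-or-ip>...
address_matches() {
    candidate="$1"
    shift
    for name in "$@"; do
        case "${name}" in
            # exact match, or candidate followed by ".<network-name>"
            "${candidate}" | "${candidate}".*) return 0 ;;
        esac
    done
    return 1
}
```

For example, `address_matches node-1 node-1.db_default c38f5df9273f` succeeds. The short-form rule is a plain prefix test, so it is meant for hostnames; a partial IP address such as `10.0.0` would also match `10.0.0.1`.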

To Do:

  • Checking the $wsrepdir/gvwstate.dat file is not enough. On a graceful container shutdown this file is removed by the MariaDB daemon, which causes bootstrapping to run again. I'm currently looking into improving this.

@ChristianCiach

ChristianCiach commented Jun 27, 2021

To be honest, I don't fully trust your IP/hostname detection logic. There are too many "but what if"s. For example, what happens if the machine has multiple network devices and the container is deployed using host networking? Also, I've seen many environments where a DNS reverse lookup is just not possible.

I would like to be able to explicitly define the node address of the current container. For example, if wsrep_cluster_address is gcomm://172.28.180.96,172.28.180.97,172.28.180.98, I would like to be able to explicitly define the node address of the second node to 172.28.180.97. If you already know the node address of the current node, there is no need to guess anymore. In fact, I already do pass the node address to the container using --wsrep_node_address.

@tymonx
Author

tymonx commented Jun 27, 2021

@ChristianCiach no problem, I can add a comparison with the wsrep-node-address value.

It depends on user needs. For example, wsrep-node-address is useless when someone is using replicas or global mode, because it would somehow have to be set separately for each created container.

@ChristianCiach

ChristianCiach commented Jun 27, 2021

Yes, of course, I agree with you :) It is not always possible to have different configurations for each node. For example, if you want to scale your cluster up/down dynamically (for example using Docker Swarm services or Kubernetes StatefulSet), then it is very hard or even impossible to set wsrep-node-address.

I think it would be awesome if you could at least look at wsrep-node-address if it is set, just like you said! Also, please support both cases, where wsrep-node-address is defined inside a .cnf file or passed as a command by using --wsrep-node-address.

Again, thank you so much for doing this. It already looks very promising!

@tymonx
Author

tymonx commented Jun 27, 2021

I think it would be awesome if you could at least look at wsrep-node-address if it is set, just like you said! Also, please support both cases, where wsrep-node-address is defined inside a .cnf file or passed as a command by using --wsrep-node-address

Sure. It is very reasonable to do that. I was thinking about the same.

@tymonx
Author

tymonx commented Jun 27, 2021

@ChristianCiach I have already added support for the --wsrep-node-address.

When wsrep-node-address is provided through the configuration files or the command line, the automatic Docker address matching mechanism is skipped and that value is used to decide whether this node is the one to bootstrap. By default it is compared to the first value of wsrep-cluster-address. To choose another node, use the WSREP_AUTO_BOOTSTRAP_ADDRESS environment variable.
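
The decision described here can be sketched as follows. Both function names are hypothetical and the comparison is a plain string match, which is an assumption; the patch itself may normalize addresses before comparing.

```sh
#!/usr/bin/env sh

# Hypothetical helper: print the address that should bootstrap the cluster.
# usage: select_bootstrap_address <cluster-address> [override]
# The override stands in for WSREP_AUTO_BOOTSTRAP_ADDRESS.
select_bootstrap_address() {
    if [ -n "${2:-}" ]; then
        printf '%s\n' "$2"
        return
    fi
    address="${1#*://}"         # drop gcomm://
    address="${address%%\?*}"   # drop ?options
    printf '%s\n' "${address%%,*}"   # first entry of the list
}

# Hypothetical helper: succeed if this node should bootstrap.
# usage: is_bootstrap_node <node-address> <cluster-address> [override]
is_bootstrap_node() {
    [ -n "$1" ] && [ "$1" = "$(select_bootstrap_address "$2" "${3:-}")" ]
}
```

With gcomm://172.28.180.96,172.28.180.97,172.28.180.98, the node 172.28.180.96 bootstraps by default, and 172.28.180.97 bootstraps only when it is passed as the override.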

@grooverdan
Member

Just to share some rough stuff I've been looking at (that covers other galera options) and needing to reread the above:

diff --git a/docker-entrypoint.sh b/docker-entrypoint.sh
index 1b10dc2..e51dc02 100755
--- a/docker-entrypoint.sh
+++ b/docker-entrypoint.sh
@@ -359,7 +359,25 @@ docker_ip_match() {
 #    ie: docker_address_match node1
 # it returns true if provided value match with container IP address or container hostname. Otherwise it returns false
 docker_address_match() {
-       local resolved="$(resolveip --silent "$1" 2>/dev/null)" # it converts hostname to ip or vice versa
+       local host=${1%%:*}
+       local port=${1#*:}
+       if [ -n "$port" ]; then
+               local wsrep_provider_options="$(mysql_get_config wsrep_provider_options)"
+               wsrep_provider_options=( ${wsrep_provider_options//,/ } )
+               for opt in "${wsrep_provider_options[@]}"; do
+                       if [[ "$opt" =~ gmcast.listen_addr.* ]]; then
+                               local val="${opt#*=[[:graph:]]*://}"
+                               case "$val" in
+                                       ${host}:${port})        return 1 ;;
+                                       0.0.0.0:${port})        break ;;
+                                       *:${port})              break ;;
+                                       *)                      return 0;;
+                               esac
+                       fi
+               done
+
+       fi
+       local resolved="$(resolveip --silent "$host" 2>/dev/null)" # it converts hostname to ip or vice versa
 
        docker_ip_match "$resolved" || docker_ip_match "$1" || docker_hostname_match "$resolved" || docker_hostname_match "$1"
 }

As a crude hack with:

#!/bin/bash
podman pod stop db && podman pod rm db
podman pod create --name=db  --share net
for n in 1 2 3
do
	podman create --name=db_node_$n --pod=db \
	       	--security-opt label=disable --label io.podman.compose.config-hash=123 --label io.podman.compose.project=db --label io.podman.compose.version=0.0.1 --label com.docker.compose.container-number=$n --label com.docker.compose.service=node \
		-e MARIADB_ROOT_PASSWORD=example \
		--add-host node:127.0.0.1 --add-host db_node_1:127.0.0.1 --add-host db_node_2:127.0.0.1 --add-host db_node_3:127.0.0.1 \
		--restart always \
		mariadb:testgalera --port $(( 3306 - 1 + $n )) --wsrep_cluster_address=gcomm://db_node_1:4567,db_node_2:4577,db_node_3:4587 --wsrep-node-address=127.0.0.1 --wsrep_provider_options="gmcast.listen_addr=tcp://0.0.0.0:$(( 4567 + ( $n - 1 ) * 10 ))" --wsrep-on=1 --wsrep-provider=/usr/lib/libgalera_smm.so --binlog_format=ROW
done

Is there a point at which the autobootstrap is (always?) applied if you are actually starting from an empty datadir? Anything else is recovery.

Should non-first nodes not initialize with /docker-entrypoint-initdb.d/ (and rely on galera sst)?

@tymonx
Author

tymonx commented Jun 28, 2021

Is there a point at which the autobootstrap is (always?) applied if you are actually starting from an empty datadir?

The Docker daemon (I don't know about Podman) always creates a volume for the container. If the container stops and starts again (including restarts), the files are still present, so bootstrapping will not fire.

I have also tested and confirmed that on a graceful shutdown (docker kill --signal SIGTERM <container>) the mysqld daemon removes the gvwstate.dat file.

I'm looking into a more proper solution to handle this.

@tymonx
Author

tymonx commented Jun 28, 2021

For Podman I cannot simply strip port numbers from wsrep-cluster-address. They have to be included in the comparison as well, because in a Podman pod all containers work on 127.0.0.1, whereas Docker always creates each container with its own IP address.
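
A port-aware comparison could look like the sketch below. The function names are hypothetical, bracketed IPv6 addresses are not handled, and 4567 is used as the default Galera port when an entry carries none.

```sh
#!/usr/bin/env sh

# Hypothetical helper: append the default Galera port when none is given.
with_default_port() {
    case "$1" in
        *:[0-9]*) printf '%s\n' "$1" ;;
        *)        printf '%s:4567\n' "$1" ;;
    esac
}

# Hypothetical helper: succeed if both entries name the same host and port.
same_endpoint() {
    [ "$(with_default_port "$1")" = "$(with_default_port "$2")" ]
}
```

With this, `db_node_2:4577` and `db_node_2:4567` are treated as different nodes, which is what a shared 127.0.0.1 setup needs.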

@tymonx
Author

tymonx commented Jun 28, 2021

Working Podman example script to start N containers in db pod for commit 45149e2:

#!/usr/bin/env sh

NODES="${1:-3}"

options="--add-host db_node_1:127.0.0.1"
address="db_node_1:4567"

for i in $(seq 2 "${NODES}"); do
    options="${options} --add-host db_node_$i:127.0.0.1"
    address="${address},db_node_$i:$(( 4567 + ( $i - 1 ) * 10 ))"
done

podman pod stop db
podman pod rm db
podman pod create --name=db --share net

for i in $(seq 1 "${NODES}"); do
    podman create \
        --pod=db \
        --name=db_node_$i \
        --security-opt label=disable \
        --env MARIADB_ROOT_PASSWORD=example \
        --restart always \
        ${options:+${options}} \
        mariadb:dev \
        --port $(( 3305 + $i )) \
        --wsrep_cluster_address="gcomm://${address}" \
        --wsrep-node-address="db_node_$i:$(( 4567 + ( $i - 1 ) * 10 ))" \
        --wsrep-on=on \
        --wsrep-provider=/usr/lib/libgalera_smm.so \
        --binlog_format=row
done

podman pod start db

View logs:

podman logs --follow db_node_1

Output:

View:
  id: b98b33bc-d845-11eb-99df-0245217d5d15:2
  status: primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 2
  members(3):
        0: b988176f-d845-11eb-994b-576602aed1c3, c38f5df9273f
        1: b98a6f95-d845-11eb-a223-e6bc7724aaf5, 7268f0eb1373
        2: b98aa0de-d845-11eb-945b-e21ff9f0f09e, 81f7c1cfefbe

@tymonx
Author

tymonx commented Jun 29, 2021

Added support for the safe_to_bootstrap value from the grastate.dat file. This works when all nodes are shut down gracefully, but one at a time: Galera writes 1 to the node that was gracefully shut down last.

For Docker Compose users: after docker-compose up, they should manually call docker stop db_node_<n> for each node. Invoking the docker-compose stop command or pressing CTRL + C will gracefully shut down all nodes at the same time, and Galera cannot handle this properly.
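
Reading that flag can be sketched like this. The function names are hypothetical and the sketch is not the code from the patch; it only assumes that grastate.dat contains a line of the form `safe_to_bootstrap: <number>`.

```sh
#!/usr/bin/env sh

# Hypothetical helper: print the safe_to_bootstrap value of a grastate.dat
# file (prints nothing if the line is absent; fails if the file is missing).
read_safe_to_bootstrap() {
    [ -f "$1" ] || return 1
    sed -n 's/^safe_to_bootstrap:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1"
}

# Hypothetical helper: succeed only when the file marks this node as safe.
may_bootstrap() {
    [ "$(read_safe_to_bootstrap "$1")" = "1" ]
}
```

A node shut down last and gracefully would pass `may_bootstrap`, while the others, or a node without the file, would not.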

@grooverdan
Member

I've rebased and squashed the commits. ShellCheck changed a few things. As a basic bootstrap it's OK. I'm still looking at what crash recovery would look like. We probably need to make our own state transition diagram.

https://galeracluster.com/library/documentation/crash-recovery.html

tymonx and others added 2 commits February 15, 2022 13:02
This patch adds support for Galera replication.

Features:
- It detects if Galera replication was enabled (wsrep_on=ON)
- By default it enables cluster auto bootstrap feature
- By default the first cluster node is used for cluster auto bootstrapping
  based on the wsrep_cluster_address parameter or by setting the
  `WSREP_CLUSTER_ADDRESS` environment variable
- cluster auto bootstrap feature can be disabled by setting the
  `WSREP_SKIP_AUTO_BOOTSTRAP` environment variable
- use the `WSREP_AUTO_BOOTSTRAP_ADDRESS` environment variable to explicitly
  choose another node for cluster bootstrapping
- cluster node hostnames or IP addresses must be valid to enable cluster
  auto bootstrapping

How to use it.

1. Prepare MariaDB configuration file `galera.cnf`:

```plaintext
[galera]
wsrep_on                       = ON
wsrep_sst_method               = mariabackup
wsrep_provider                 = /usr/lib/libgalera_smm.so
binlog_format                  = row
default_storage_engine         = InnoDB
innodb_doublewrite             = 1
innodb_autoinc_lock_mode       = 2
```

2. Make it read-only:

```plaintext
chmod 444 galera.cnf
```

3. Prepare Docker Compose file `docker-compose.yml`:

```yaml
services:
    node:
        image: mariadb
        restart: always
        security_opt:
            - label=disable
        environment:
            WSREP_CLUSTER_ADDRESS: "${WSREP_CLUSTER_ADDRESS:-}"
            MARIADB_ROOT_PASSWORD: example
        volumes:
            - ./galera.cnf:/etc/mysql/conf.d/10-galera.cnf:ro
        command:
            - --wsrep-cluster-address=gcomm://db_node_1,db_node_2,db_node_3
        deploy:
            replicas: 3
```

4. Start Docker Compose:

```plaintext
docker-compose --project-name db up
```

To start N MariaDB instances using environment variable:

```plaintext
export WSREP_CLUSTER_ADDRESS="gcomm://db_node_1,db_node_2,db_node_3,db_node_4,db_node_5"
docker-compose --project-name db up --scale node="$(echo "${WSREP_CLUSTER_ADDRESS}" | tr ',' ' ' | wc -w)"
```

To start N MariaDB instances using MariaDB configuration file:

```plaintext
docker-compose --project-name db up --scale node="$(grep -i wsrep_cluster_address <name>.cnf | tr -d ' ' | tr ',' ' ' | wc -w)"
```

Closes: MariaDB#28
@grooverdan grooverdan force-pushed the feature-support-galera-replication branch from 5fbf4c6 to 38bebe0 Compare February 15, 2022 05:02
@grooverdan
Member

@ChristianCiach et al.: I welcome any summary of the test cases needed, in MDEV-25855 (preferred) or here. I have looked through the Bitnami Galera issue referenced above and the blog, from which I'll derive some cases too.

@jozefrebjak

Hello, any news on this PR?

Successfully merging this pull request may close these issues.

Support Galera Replication
7 participants