etcd2 fails to start with option ETCD_WAL_DIR #7287
I suspect you're booting etcd with a wal-dir that's already populated but with a wiped member directory. etcd fails because it expects a member directory but it's not there:
$ ./bin/etcd -wal-dir waldir
[ctrl-c]
# remove member dir, keep waldir
$ rm -rf default.etcd
$ ./bin/etcd -wal-dir waldir |
The storage for the wal dir is created by the cloud-config (I tested it with a whole new cluster and new servers). The following snippet of the cloud-config is used to create the wal-dir storage:
The storage is created before the server is installed and is mounted with this piece of the cloud-config:
The whole storage should be (and is) empty. According to the docs the member folder should not change if |
@pizzarabe Is your WAL directory writable? Maybe check the permission? |
@pizzarabe kindly ping |
etcd2 runs as the user etcd, I changed the owner of the /var/wal to etcd (and set SELinux to permissive), but still no success.
I followed https://coreos.com/etcd/docs/latest/etcd-live-cluster-reconfiguration.html#replace-a-failed-etcd-member-on-coreos-container-linux to remove and add the node to the cluster (so I can remove the wal dir located at /var/lib/etcd2/member). Anyway, starting the node with wal-dir does not work:
After removing the wal-dir option from the systemd unit file, the service works.
The error message is different, but I guess it is related. |
Did you remove any files in data-dir before restart? Simple steps to reproduce would be helpful. |
@gyuho yes, I removed the whole content
To reproduce:
if I skip
it works :/ Like I said, I followed the docs at https://coreos.com/etcd/docs/latest/etcd-live-cluster-reconfiguration.html#replace-a-failed-etcd-member-on-coreos-container-linux |
If you remove |
It will print out
|
I misread the comment. I think it would be useful. Thanks!
|
No, the old wal dir was located at /var/lib/etcd2/wal and was removed with |
If you remove all the data, wal directories completely, how does your etcd log show this line?
This means you had |
Maybe I did not remove the member dir at that point... Anyway, I removed it now too. If I try to start etcd2 with a dedicated wal dir and no member dir I get the following error (again):
Without removing the member dir, and with a dedicated wal-dir configured:
If I do not configure wal-dir, it always works (with and without the member dir)
|
Can you share your |
Mar 01 16:43:56 alien6 etcd2[20036]: cannot write to member directory: open /var/lib/etcd2/member/.touch: no such file or directory
Mar 01 16:43:56 alien6 systemd[1]: etcd2.service: Main process exited, code=exited, status=1/FAILURE
Mar 01 16:43:56 alien6 systemd[1]: Failed to start etcd2.
This is expected behavior; it should fail to start. It needs a better error message like I suggested.
Mar 01 16:49:42 alien6 etcd2[20510]: open /var/lib/etcd2/member/snap: no such file or directory
Mar 01 16:49:42 alien6 systemd[1]: etcd2.service: Main process exited, code=exited, status=1/FAILURE
Mar 01 16:49:42 alien6 systemd[1]: Failed to start etcd2.
This seems corrupted in some way. The member directory is missing contents. It should not start.
It's filling in the wal data. If the member directory already exists, it probably should refuse to start if there are no wal files. The member directory and wal directory depend on each other. If either is missing data, etcd should usually refuse to boot. This isn't as much of a problem when the wal directory is inside the member directory since if the member directory is missing, the wal directory will be missing too. |
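To illustrate the rule above (both directories populated, or neither), here is a minimal sketch of a pre-start check in shell. The function names `is_populated` and `etcd_dirs_state` are hypothetical, and "populated" is assumed to mean that the directory exists and contains at least one entry; etcd's own check may differ.

```shell
#!/bin/sh
# Sketch: classify a member directory / wal directory pair before starting etcd.
# Assumption: "populated" means the directory exists and holds at least one
# entry. This is an illustration, not etcd's actual logic.

is_populated() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Prints one of: fresh, initialized, inconsistent
etcd_dirs_state() {
    member_dir=$1
    wal_dir=$2
    if is_populated "$member_dir" && is_populated "$wal_dir"; then
        echo initialized
    elif ! is_populated "$member_dir" && ! is_populated "$wal_dir"; then
        echo fresh
    else
        # One side has data and the other does not: etcd should refuse to boot.
        echo inconsistent
    fi
}
```

For example, `etcd_dirs_state /var/lib/etcd2/member /var/wal` would print `inconsistent` when only one of the two directories holds data, which matches both failure cases in the logs above.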
@gyuho this is the normal installation of etcd from CoreOS (Container Linux) 1325.1.0 alpha. I did not change anything in the source or anything like that.
I am not sure I understand that correctly: both ways should fail? How can I configure a dedicated wal dir then? |
@pizzarabe the dedicated wal directory should be created / destroyed along with the member directory. It would be configured at the time of member directory creation. |
So this seems like a real bug and not a layer 8 error? |
yes
|
@pizzarabe the bug is that it's not giving a reasonable error on panic. If etcd tries to start with some configured bits on disk but others missing, it should refuse to boot. |
If this is the only bug here, how should I configure the wal dir to get it working? |
@pizzarabe if you delete the member directory, then delete the wal directory too. If the member is already initialized with a wal directory inside the member directory, move it to what's defined in |
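A minimal sketch of that move, assuming etcd is stopped first and the dedicated directory is empty or does not exist yet. `relocate_wal` is a hypothetical helper, not part of etcd; the source and destination paths are passed in because they depend on the setup.

```shell
#!/bin/sh
# Hypothetical helper: move wal files from the old location to a dedicated one.
# Assumes etcd is stopped and that the destination is empty or absent.
relocate_wal() {
    src=$1
    dst=$2
    [ -d "$src" ] || { echo "no wal directory at $src" >&2; return 1; }
    if [ -d "$dst" ] && [ -n "$(ls -A "$dst" 2>/dev/null)" ]; then
        echo "destination $dst is not empty" >&2
        return 1
    fi
    mkdir -p "$dst" || return 1
    # Move every entry, including dotfiles; unmatched patterns are skipped.
    for entry in "$src"/* "$src"/.[!.]* "$src"/..?*; do
        [ -e "$entry" ] || continue
        mv "$entry" "$dst"/ || return 1
    done
    # Remove the now-empty old directory so only one wal location remains.
    rmdir "$src"
}
```

Usage would look like `relocate_wal <old wal dir> <dedicated wal dir>`, run while etcd2 is stopped; the destination should match the configured wal-dir.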
@heyitsanthony Okay, this does work if I copy/move the wal dir to the dedicated directory after the node has joined the cluster. If I try to bootstrap the cluster with new nodes, they fail with:
The node is clean: it is a new OS installation, and the dedicated wal dir storage was wiped before the installation:
And again, removing the line |
This is similar to the problem from before. There's an existing wal directory and no member directory. etcd must have both the wal directory and member directory or neither in order to boot. etcd is interpreting the |
I can confirm that this is exactly the problem (creating a dir under that path solved it for me). |
Bootstrapping a new etcd2 server on CoreOS with a dedicated wal directory fails with the following error message:
cannot write to member directory: open /var/lib/etcd2/member/.touch: no such file or directory
Removing the wal-dir flag resolves that problem.
Here is the config of an etcd2 cluster node:
According to the docs, the wal-dir option should only change the path of the wal files. I don't know why it has problems with the data-dir...