
Commit d46bd90

refactor: add troubleshooting for etcd nospace docs (#548)
* add troubleshooting for etcd nospace docs --------- Co-authored-by: Denise <[email protected]>
1 parent 2db7171 commit d46bd90

File tree

2 files changed: +212 -0 lines changed

vcluster/troubleshoot/_category_.json

+5

{
  "label": "Troubleshoot",
  "position": "8",
  "collapsible": true
}

+207
---
title: Resolve etcd NOSPACE alarm in vCluster
sidebar_label: etcd alarm - NOSPACE
sidebar_position: 1
description: Diagnose and resolve the etcd NOSPACE alarm in vCluster.
---

import Flow, { Step } from '@site/src/components/Flow';

# Resolve etcd NOSPACE alarm in vCluster

The etcd `NOSPACE` alarm signals that the etcd database inside the vCluster has exhausted its available storage space. When this occurs, etcd fails its health checks, which causes the control plane to become unresponsive. As a result, all cluster operations—such as deploying workloads, updating resources, or managing cluster components—are blocked, and the vCluster is unusable until the issue is resolved.

## Error message

If etcd has run out of storage space, you might find the following error in the logs of your etcd pods:

```bash title="etcd NOSPACE alarm"
etcdhttp/metrics.go:86 /health error ALARM NOSPACE status-code 503
```
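
This log line comes from etcd's `/health` endpoint, which starts returning `503` once the alarm is raised. If `curl` is available in your environment, you can query the endpoint directly; this is only a sketch that reuses the client TLS certificate paths shown in the solution steps below:

```bash
# Sketch: query etcd's /health endpoint directly (assumes curl and the client TLS certs).
curl --cacert /run/config/pki/etcd-ca.crt \
  --cert /run/config/pki/etcd-peer.crt \
  --key /run/config/pki/etcd-peer.key \
  https://$ETCD_SRVNAME:2379/health
```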

<details id="etcd-nospace-alarm">
<summary>Identifying an etcd `NOSPACE` alarm in vCluster</summary>

When interacting with the affected vCluster using `kubectl`, API requests fail with timeout errors:

```bash
Error from server: etcdserver: request timed out
```

Additionally, the etcd health metrics endpoint returns a `503` status code and the following error:

```text
etcdhttp/metrics.go:86 /health error ALARM NOSPACE status-code 503
```

To verify the `NOSPACE` alarm, run the following command against the etcd instance:

```bash
etcdctl alarm list --endpoints=https://$ETCD_SRVNAME:2379 [...]
```

The output displays the triggered alarm:

```text
memberID:XXXXX alarm:NOSPACE
```

</details>
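
The `[...]` stands for the TLS flags your etcd deployment requires. For the etcd StatefulSet deployed by vCluster, which uses the certificate paths shown in the solution steps below, the full command might look like this sketch:

```bash
# Sketch: full alarm list command using the certificate paths from the solution steps.
etcdctl alarm list \
  --endpoints=https://$ETCD_SRVNAME:2379 \
  --cacert=/run/config/pki/etcd-ca.crt \
  --key=/run/config/pki/etcd-peer.key \
  --cert=/run/config/pki/etcd-peer.crt
```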

## Causes

The `NOSPACE` alarm typically occurs due to one of two conditions:

- **Excessive etcd data growth:** A large number of objects—such as Deployments, ConfigMaps, and Secrets—can fill etcd’s storage if regular compaction is not performed. To see which resource types are consuming the most space, use the diagnostic sketch after this list.

- **Synchronization conflicts:** Conflicting objects between the vCluster and host cluster can trigger continuous sync loops. For example, a Custom Resource Definition (CRD) modified by the host cluster might sync back to the vCluster repeatedly. This behavior quickly fills etcd’s backend storage.
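
To see which resource types are filling etcd, you can count keys by prefix. This is a diagnostic sketch that assumes you run it inside the etcd pod (with `$ETCD_SRVNAME` and the TLS flags set as in the solution steps below) and that standard shell utilities such as `awk` and `sort` are available there:

```bash
# Sketch: count etcd keys per Kubernetes resource type.
# Keys follow the /registry/<resource>/<namespace>/<name> layout; reading every
# key can be slow on a database that is close to its quota.
etcdctl get /registry --prefix --keys-only \
  --endpoints=https://$ETCD_SRVNAME:2379 \
  --cacert=/run/config/pki/etcd-ca.crt \
  --key=/run/config/pki/etcd-peer.key \
  --cert=/run/config/pki/etcd-peer.crt \
  | awk -F/ 'NF > 2 {print $3}' | sort | uniq -c | sort -rn | head
```

An unusually high count for a single resource type is a strong hint of a sync loop.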

## Solution

To resolve the issue, compact and defragment the etcd database to free up space. Then, reconfigure etcd with automatic compaction and increase its storage quota to prevent recurrence.

<Flow id="resolve-etcd-nospace-alarm">

<Step>

**Identify if there's a syncing conflict**.

Check for objects that might be caught in a sync loop:

```bash
kubectl -n <namespace> logs <vcluster-pod> | grep -i "sync" | grep -i "error"
```

If you find a problematic object, pause syncing for it in your vCluster config, as shown in the sketch below.
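
A minimal sketch of what pausing a sync might look like in `vcluster.yaml`, assuming a vCluster 0.20+ configuration format and that the loop involves ConfigMaps; substitute the resource type that is actually affected and check the vcluster.yaml reference for the exact key:

```yaml
# Sketch: stop syncing the affected resource type to the host cluster.
# ConfigMaps are used here only as an example of a resource stuck in a sync loop.
sync:
  toHost:
    configMaps:
      enabled: false
```
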
</Step>

<Step>

**Compact and defragment etcd**.

- Connect to each etcd pod. Open a shell in the pod using the following command:

  ```bash
  kubectl -n <namespace> exec -it <etcd-pod-name> -- sh
  ```

- Set an environment variable. Export the etcd pod name so the following commands can reference it:

  ```bash
  export ETCD_SRVNAME=<etcd-pod-name>
  ```

- Get the current revision number. Retrieve it with the following command; a sketch for extracting the value from the JSON output follows this list:

  ```bash
  etcdctl endpoint status --write-out json \
    --endpoints=https://$ETCD_SRVNAME:2379 \
    --cacert=/run/config/pki/etcd-ca.crt \
    --key=/run/config/pki/etcd-peer.key \
    --cert=/run/config/pki/etcd-peer.crt
  ```

- Compact the etcd database to remove old revisions and free up disk space:

  ```bash
  etcdctl --command-timeout=600s compact <revision-number> \
    --endpoints=https://$ETCD_SRVNAME:2379 \
    --cacert=/run/config/pki/etcd-ca.crt \
    --key=/run/config/pki/etcd-peer.key \
    --cert=/run/config/pki/etcd-peer.crt
  ```

  Replace `<revision-number>` with the value retrieved from the previous command.

- Defragment etcd to release the space freed by compaction back to the file system and improve performance:

  ```bash
  etcdctl --command-timeout=600s defrag \
    --endpoints=https://$ETCD_SRVNAME:2379 \
    --cacert=/run/config/pki/etcd-ca.crt \
    --key=/run/config/pki/etcd-peer.key \
    --cert=/run/config/pki/etcd-peer.crt
  ```

- Repeat for all etcd pods in your cluster.
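
The revision is reported as the `revision` field under `Status.header` in the JSON output of `etcdctl endpoint status`. If `jq` happens to be available where you run the command (it may not be installed in the etcd container), a sketch for capturing it looks like this:

```bash
# Sketch: capture the current revision from the endpoint status JSON.
# Assumes jq is available; otherwise, read the "revision" field manually.
REVISION=$(etcdctl endpoint status --write-out json \
  --endpoints=https://$ETCD_SRVNAME:2379 \
  --cacert=/run/config/pki/etcd-ca.crt \
  --key=/run/config/pki/etcd-peer.key \
  --cert=/run/config/pki/etcd-peer.crt | jq -r '.[0].Status.header.revision')
echo "$REVISION"
```

You can then pass `$REVISION` to the `compact` command in place of `<revision-number>`.
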
</Step>

<Step>

**Verify disk usage reduction**.

Check that the operation freed up space:

```bash
etcdctl endpoint status -w table \
  --endpoints=https://$ETCD_SRVNAME:2379 \
  --cacert=/run/config/pki/etcd-ca.crt \
  --key=/run/config/pki/etcd-peer.key \
  --cert=/run/config/pki/etcd-peer.crt
```

The `DB SIZE` column should now be noticeably smaller than before compaction and defragmentation.

</Step>

<Step>

**Disarm the `NOSPACE` alarm**.

Remove the alarm to restore normal operation:

```bash
etcdctl alarm disarm \
  --endpoints=https://$ETCD_SRVNAME:2379 \
  --cacert=/run/config/pki/etcd-ca.crt \
  --key=/run/config/pki/etcd-peer.key \
  --cert=/run/config/pki/etcd-peer.crt
```

After disarming, run the `etcdctl alarm list` command from earlier in this guide again to confirm that no alarms remain.

</Step>

</Flow>

## Prevention

Update your vCluster configuration to prevent future occurrences. Use the following recommended settings to enable automatic maintenance of your etcd database:

```yaml title="vcluster.yaml"
controlPlane:
  backingStore:
    etcd:
      embedded:
        enabled: false
      deploy:
        enabled: true
        statefulSet:
          enabled: true
          extraArgs:
            - '--auto-compaction-mode=periodic'
            - '--auto-compaction-retention=30m'
            - '--quota-backend-bytes=8589934592'
```

This configuration enables periodic compaction every 30 minutes, sets the etcd storage quota to 8 GiB, and uses deployed etcd instead of embedded etcd for better control. Adjust these parameters based on your needs.
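
To apply the updated settings, upgrade the vCluster with the modified `vcluster.yaml`. The following is a sketch assuming you deploy with the vCluster CLI; `my-vcluster` and `team-x` are hypothetical placeholders for your vCluster name and namespace. If you deploy with Helm or another tool, pass the same file as chart values instead.

```bash
# Sketch: upgrade an existing vCluster with the updated configuration.
# "my-vcluster" and "team-x" are placeholders; replace them with your own values.
vcluster create my-vcluster --namespace team-x --upgrade -f vcluster.yaml
```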

## Verification

After completing the solution steps:

1. Check that etcd pods are healthy:

   ```bash
   kubectl -n <namespace> get pods | grep etcd
   ```

2. Verify that vCluster is functioning properly:

   ```bash
   kubectl -n <namespace> get pods
   kubectl -n <namespace> logs <vcluster-pod> | grep -i "alarm"
   ```
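
You can also confirm that the virtual cluster's API server responds again, since request timeouts were the original symptom. A sketch using the vCluster CLI, with `my-vcluster` and `team-x` as hypothetical placeholders:

```bash
# Sketch: run a command against the virtual cluster to confirm the API responds.
# "my-vcluster" and "team-x" are placeholders; replace them with your own values.
vcluster connect my-vcluster --namespace team-x -- kubectl get namespaces
```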

## Best practices

To ensure optimal etcd performance in vCluster:

- **Monitor etcd disk usage**: Use metrics tools to track disk usage and set up alerts for high usage levels. See the sample alert rule after this list.
- **Enable automated compaction**: Configure compaction with `--auto-compaction-mode=periodic` and `--auto-compaction-retention=30m` to clean up old data.
- **Size etcd storage appropriately**: Set `--quota-backend-bytes` based on usage, with a buffer for growth.
- **Defragment etcd regularly**: Optimize disk usage by defragmenting etcd periodically.
- **Resolve syncing conflicts**: Identify and fix syncing issues to prevent unnecessary data growth.
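
As an example of such an alert, the following Prometheus rule is a sketch that fires when the etcd database exceeds 80% of its backend quota. It assumes you already scrape etcd's metrics endpoint; `etcd_mvcc_db_total_size_in_bytes` and `etcd_server_quota_backend_bytes` are standard etcd server metrics.

```yaml
# Sketch: Prometheus alert for etcd approaching its storage quota.
# Assumes etcd metrics are scraped; adjust thresholds and labels to your setup.
groups:
  - name: etcd-storage
    rules:
      - alert: EtcdDatabaseAlmostFull
        expr: etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "etcd database is over 80% of its quota; compact and defragment soon."
```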
