Commit 6169642

kate-osborn authored and amimimor committed
Add scale test results for 1.2 (nginx#1734)
Problem: Need results for 1.2 scale tests. Solution: Add results.
1 parent 1d39271 commit 6169642

File tree

27 files changed: +431 -6 lines changed

internal/mode/static/nginx/config/upstreams_template.go

+1 -1

@@ -2,7 +2,7 @@ package config

 // FIXME(kate-osborn): Dynamically calculate upstream zone size based on the number of upstreams.
 // 512k will support up to 648 upstream servers for OSS.
-// NGINX Plus needs 1m to support roughly the same amount of servers.
+// NGINX Plus needs 1m to support roughly the same amount of servers (556 upstream servers).
 // https://github.com/nginxinc/nginx-gateway-fabric/issues/483
 var upstreamsTemplateText = `
 {{ range $u := . }}
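For context, the zone referred to in this comment is the shared-memory zone declared in each rendered `upstream` block; its size limits how many `server` entries it can hold (roughly 648 at 512k on OSS, and roughly 556 at 1m on NGINX Plus, which keeps more state per server). A minimal sketch of the shape of such a block (the upstream name and server address are illustrative, not taken from the template):

```text
upstream default_backend_80 {
    zone default_backend_80 512k;  # 1m when running NGINX Plus
    server 10.4.0.15:8080;
    # one server entry per endpoint; more endpoints require a larger zone
}
```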

tests/scale/results/1.2.0/1.2.0.md

+367

@@ -0,0 +1,367 @@
# Results for v1.2.0

<!-- TOC -->
- [Results for v1.2.0](#results-for-v120)
  - [Summary](#summary)
  - [Versions](#versions)
  - [Tests](#tests)
    - [Scale Listeners](#scale-listeners)
    - [Scale HTTPS Listeners](#scale-https-listeners)
    - [Scale HTTPRoutes](#scale-httproutes)
    - [Scale Upstream Servers](#scale-upstream-servers)
    - [Scale HTTP Matches](#scale-http-matches)
  - [Future Improvements](#future-improvements)
<!-- TOC -->

## Summary

- Overall, reloads and reload times look similar to 1.1.
- As expected, for N+ there are no reloads for upstream servers.
- The number of event batches has decreased, which in turn increases the average processing time of each batch.
- Memory, CPU usage, and time to ready numbers are similar to 1.1.
- In general, N+ and OSS numbers are similar. In a few tests, N+ CPU usage is lower than OSS; in one test, the N+ memory and time to ready numbers are higher than OSS.
- No concerning errors or restarts.

## Versions

NGF version:

```text
"version":"edge"
"commit":"e1d6ebb5065bab73af3a89faba4f49c7a5b971cd"
"date":"2024-03-18T20:41:44Z"
```

NGINX:

```text
nginx version: nginx/1.25.4
built by gcc 12.2.1 20220924 (Alpine 12.2.1_git20220924-r10)
built with OpenSSL 3.1.3 19 Sep 2023 (running with OpenSSL 3.1.4 24 Oct 2023)
```

NGINX Plus:

```text
nginx version: nginx/1.25.3 (nginx-plus-r31-p1)
built by gcc 13.2.1 20231014 (Alpine 13.2.1_git20231014)
built with OpenSSL 3.1.4 24 Oct 2023
```

Kubernetes: `v1.29.2-gke.1217000`

## Tests

### Scale Listeners

Reloads:

| OSS/N+ | Total | Total Errors | Ave Time (ms)      | <= 500ms |
|--------|-------|--------------|--------------------|----------|
| OSS    | 128   | 0            | 135.21245321573875 | 100%     |
| N+     | 128   | 0            | 137.11475409836066 | 100%     |

Event Batch Processing:

| OSS/N+ | Total | Ave Time (ms)      | <= 500ms | <= 1000ms | <= 5000ms |
|--------|-------|--------------------|----------|-----------|-----------|
| OSS    | 384   | 274.37551927060446 | 83.85%   | 95.05%    | 100%      |
| N+     | 384   | 276.705856062934   | 82.29%   | 91.15%    | 100%      |

**NGINX Errors**: None.

**NGF Errors**: None.

**Pod Restarts**: None.

**CPU and Memory:**

**OSS:**

![CPU.png](TestScale_Listeners/CPU.png)
![Memory.png](TestScale_Listeners/Memory.png)

**N+:**

![CPU.png](TestScale_Listeners_Plus/CPU.png)
![Memory.png](TestScale_Listeners_Plus/Memory.png)

**Time To Ready:**

**OSS:**

![TTR.png](TestScale_Listeners/TTR.png)

**N+:**

![TTR.png](TestScale_Listeners_Plus/TTR.png)

**Findings**

- Reload count and reload times are similar to 1.1.
- Fewer event batches.
- Event batch processing time has increased.
- Memory and CPU look similar.
- Time to ready numbers look similar.
- CPU is slightly better for N+; memory and time to ready numbers are slightly worse.

### Scale HTTPS Listeners

Reloads:

| OSS/N+ | Total | Total Errors | Ave Time (ms)      | <= 500ms |
|--------|-------|--------------|--------------------|----------|
| OSS    | 128   | 0            | 158.3148148148148  | 100%     |
| N+     | 128   | 0            | 155.80952380952382 | 100%     |

Event Batch Processing:

| OSS/N+ | Total | Ave Time (ms)     | <= 500ms        | <= 1000ms | <= 5000ms |
|--------|-------|-------------------|-----------------|-----------|-----------|
| OSS    | 446   | 376.7357377720319 | 85.2% (380/446) | 95%       | 100%      |
| N+     | 446   | 582.7019230769231 | 83.0% (370/446) | 95%       | 100%      |

**NGINX Errors**: None.

**NGF Errors**: None.

**Pod Restarts**: None.

**CPU and Memory:**

**OSS:**

![CPU.png](TestScale_HTTPSListeners/CPU.png)
![Memory.png](TestScale_HTTPSListeners/Memory.png)

**N+:**

![CPU.png](TestScale_HTTPSListeners_Plus/CPU.png)
![Memory.png](TestScale_HTTPSListeners_Plus/Memory.png)

**Time To Ready:**

**OSS:**

![TTR.png](TestScale_HTTPSListeners/TTR.png)

**N+:**

![TTR.png](TestScale_HTTPSListeners_Plus/TTR.png)

**Findings**

- Reloads have gone up slightly, but the reload time is similar.
- Fewer event batches.
- Event processing time has increased.
- Memory and CPU look similar.
- Time to ready numbers are slightly worse.
- N+ event processing time is greater than OSS; this could be a one-off.
- A few of the requests failed with `remote error: tls: unrecognized name`. This appears to be a transient error; I was able to pass traffic to these server names after the test ended.
- CPU is slightly better for N+; memory and time to ready numbers are similar.

### Scale HTTPRoutes

Reloads:

| OSS/N+ | Total | Total Errors | Ave Time (ms)     | <= 500ms | <= 1000ms |
|--------|-------|--------------|-------------------|----------|-----------|
| OSS    | 1001  | 0            | 383.3207313264937 | 75.72%   | 100%      |
| N+     | 1001  | 0            | 363.4901960784314 | 79.02%   | 100%      |

Event Batch Processing:

| OSS/N+ | Total | Ave Time (ms)      | <= 500ms | <= 1000ms | <= 5000ms |
|--------|-------|--------------------|----------|-----------|-----------|
| OSS    | 1005  | 470.87463201381377 | 59.64%   | 99.7%     | 100%      |
| N+     | 1005  | 448.05991285403053 | 63.84%   | 99.8%     | 100%      |

> Note: In the scale tests for the 1.1 release, we tested with and without a delay. Since the results were very similar, I dropped the test with the delay for this release.

**NGINX Errors**: None.

**NGF Errors**: None.

**Pod Restarts**: None.

**CPU and Memory:**

**OSS:**

![CPU.png](TestScale_HTTPRoutes/CPU.png)
![Memory.png](TestScale_HTTPRoutes/Memory.png)

**N+:**

![CPU.png](TestScale_HTTPRoutes_Plus/CPU.png)
![Memory.png](TestScale_HTTPRoutes_Plus/Memory.png)

**Time To Ready:**

**OSS:**

![TTR.png](TestScale_HTTPRoutes/TTR.png)

The peak at point (511, 30) corresponds to the following error in the [results.csv](TestScale_HTTPRoutes/results.csv): `"Get ""http://35.236.49.243/"": dial tcp 35.236.49.243:80: i/o timeout"`.
The logs of the `nginx-gateway` container show that the 511th HTTPRoute (named `route-510` due to zero-indexing) was reconciled and configured in under 500ms:

```text
INFO 2024-03-19T18:55:13.993993097Z {"HTTPRoute":{…}, "controller":"httproute", "controllerGroup":"gateway.networking.k8s.io", "controllerKind":"HTTPRoute", "level":"info", "msg":"Reconciling the resource", "name":"route-510", "namespace":"default", "reconcileID":"e1041f73-d1b2-401c-8dfb-757021e08507", "ts":"2024-03-19T18:55:13Z"}
INFO 2024-03-19T18:55:13.994026717Z {"HTTPRoute":{…}, "controller":"httproute", "controllerGroup":"gateway.networking.k8s.io", "controllerKind":"HTTPRoute", "level":"info", "msg":"Upserted the resource", "name":"route-510", "namespace":"default", "reconcileID":"e1041f73-d1b2-401c-8dfb-757021e08507", "ts":"2024-03-19T18:55:13Z"}
INFO 2024-03-19T18:55:13.994033817Z {"level":"info", "logger":"eventLoop", "msg":"added an event to the next batch", "total":1, "ts":"2024-03-19T18:55:13Z", "type":"*events.UpsertEvent"}
INFO 2024-03-19T18:55:13.994044917Z {"batchID":522, "level":"info", "logger":"eventLoop.eventHandler", "msg":"Handling events from the batch", "total":1, "ts":"2024-03-19T18:55:13Z"}
DEBUG 2024-03-19T18:55:13.994052427Z {"batchID":522, "level":"debug", "logger":"eventLoop.eventHandler", "msg":"Started processing event batch", "ts":"2024-03-19T18:55:13Z"}
INFO 2024-03-19T18:55:14.005555375Z {"level":"info", "logger":"nginxFileManager", "msg":"Deleted file", "path":"/etc/nginx/conf.d/http.conf", "ts":"2024-03-19T18:55:14Z"}
INFO 2024-03-19T18:55:14.005593935Z {"level":"info", "logger":"nginxFileManager", "msg":"Deleted file", "path":"/etc/nginx/conf.d/config-version.conf", "ts":"2024-03-19T18:55:14Z"}
INFO 2024-03-19T18:55:14.005765195Z {"level":"info", "logger":"nginxFileManager", "msg":"Wrote file", "path":"/etc/nginx/conf.d/http.conf", "ts":"2024-03-19T18:55:14Z"}
INFO 2024-03-19T18:55:14.005780375Z {"level":"info", "logger":"nginxFileManager", "msg":"Wrote file", "path":"/etc/nginx/conf.d/config-version.conf", "ts":"2024-03-19T18:55:14Z"}
INFO 2024-03-19T18:55:14.413460919Z {"batchID":522, "level":"info", "logger":"eventLoop.eventHandler", "msg":"NGINX configuration was successfully updated", "ts":"2024-03-19T18:55:14Z"}
```

This makes me believe that the error happened at the LoadBalancer and is not significant.

When the point (511, 30) is removed, the time to ready graph looks like the following:

![TTR-without-peak.png](TestScale_HTTPRoutes/TTR-without-peak.png)

**N+:**

![TTR.png](TestScale_HTTPRoutes_Plus/TTR.png)

**Findings**

- Reload count and times look similar to 1.1.
- Fewer event batches.
- Event processing time increased.
- Memory and CPU are pretty similar.
- Time to ready numbers, when the outlier is removed, are similar.
- CPU, memory, and time to ready numbers are similar for N+ and OSS.

### Scale Upstream Servers

| OSS/N+ | # Upstream Servers | Start Time (UNIX) | End Time (UNIX) | Duration (s) |
|--------|--------------------|-------------------|-----------------|--------------|
| OSS    | 648                | 1710883196        | 1710883236      | 40           |
| N+     | 556                | 1710886186        | 1710886227      | 41           |

Reloads:

| OSS/N+ | Total | Total Errors | Ave Time (ms)      | <= 500ms |
|--------|-------|--------------|--------------------|----------|
| OSS    | 122   | 0            | 125.82786885245902 | 100%     |
| N+     | 0     | 0            | N/A                | N/A      |

Event Batch Processing:

| OSS/N+ | Total | Ave Time (ms)      | <= 500ms | <= 1000ms | <= 5000ms |
|--------|-------|--------------------|----------|-----------|-----------|
| OSS    | 122   | 209.40983606557376 | 100%     | 100%      | 100%      |
| N+     | 113   | 131.24778761061947 | 98.23%   | 99.12%    | 100%      |

> Note:
> The Prometheus `rate` queries did not return a value, probably due to the short test duration. The average times were instead calculated by dividing the sum by the count.

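For illustration, that fallback is simply the ratio of the duration histogram's `_sum` and `_count` series at the end of the test window. A sketch of that kind of query (the metric name below is a placeholder, not necessarily the exact NGF metric name):

```text
# Placeholder metric name, shown only to illustrate the sum/count fallback.
sum(nginx_reload_duration_milliseconds_sum) / sum(nginx_reload_duration_milliseconds_count)

# e.g. for OSS: ~15351 ms total across 122 reloads gives ~125.8 ms per reload.
```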

**NGINX Errors**: None.

**NGF Errors**: None.

**Pod Restarts**: None.

**CPU and Memory:**

**OSS:**

![CPU.png](TestScale_UpstreamServers/CPU.png)
![Memory.png](TestScale_UpstreamServers/Memory.png)

**N+:**

![CPU.png](TestScale_UpstreamServers_Plus/CPU.png)
![Memory.png](TestScale_UpstreamServers_Plus/Memory.png)

**Findings**

- Fewer reloads for OSS.
- Reload time is much lower, but the 1.1 number looks like it could be wrong: all reloads were under 500ms, yet the reported average time is 1126ms. My guess is that the average time was actually 112.6ms.
- Fewer event batches.
- Similar event processing time.
- The CPU peak looks higher than 1.1, but it's hard to tell since the 1.1 graph does not appear to capture the peak.
- Memory looks similar.
- N+ results confirm no reloads for upstream servers.
- CPU is better for N+ (peak at .05 vs .2); memory is similar.

### Scale HTTP Matches
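
The numbers below are `wrk` output from 30-second runs with 2 threads and 10 connections against the test hostname. A sketch of the kind of invocation that produces this output (assuming `cafe.example.com` resolves to the NGF service's external IP, e.g. via an `/etc/hosts` entry; the exact flags used by the test may differ):

```text
wrk -t2 -c10 -d30s http://cafe.example.com
```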

**Results for the first match**:

OSS:

```text
Running 30s test @ http://cafe.example.com
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.93ms   35.39ms    1.04s   99.56%
    Req/Sec     4.97k    344.60     5.89k   72.21%
  296946 requests in 30.10s, 104.78MB read
Requests/sec:   9865.48
Transfer/sec:      3.48MB
```

N+:

```text
Running 30s test @ http://cafe.example.com
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.99ms  180.94us   4.30ms    73.44%
    Req/Sec     5.02k    290.27     5.83k    71.05%
  299941 requests in 30.10s, 105.84MB read
Requests/sec:   9964.72
Transfer/sec:      3.52MB
```

**Results for the last match**:

OSS:

```text
Running 30s test @ http://cafe.example.com
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.21ms  274.51us   4.20ms    70.17%
    Req/Sec     4.12k    340.53     5.30k    68.89%
  246174 requests in 30.10s, 86.86MB read
Requests/sec:   8178.64
Transfer/sec:      2.89MB
```

N+:

```text
Running 30s test @ http://cafe.example.com
  2 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.34ms    4.40ms 207.21ms    99.85%
    Req/Sec     4.17k    352.84     5.12k    68.39%
  249209 requests in 30.10s, 87.93MB read
Requests/sec:   8279.45
Transfer/sec:      2.92MB
```

**Findings**:

- For N+, the first match response times are slightly better than the last match.
- For OSS, the last match response times are slightly better than the first match.
- N+ performance is slightly better than OSS for the first match, but slightly worse for the last match.

## Future Improvements

- Check that the statuses of the Gateway API resources are updated after each scaling event.
- Measure the time it takes for NGF to update the status of the Gateway API resources after creating or updating the resources.