ccy
This test investigates how OpenStack handles a client burst. In the following we run two Rally scenarios with different levels of concurrency. The first scenario is a boot and delete (b&d) and the second one a boot and list (b&l).
In the following, we analyse the errors that appear during the tests.
- physical nodes
- 1 control, 1 network, 1 util, 3 rabbitmq-node, 20 computes.
- control
- neutron_server, nova_conductor, nova_scheduler, nova_novncproxy, nova_consoleauth, nova_api, glance_api, glance_registry, keystone, rabbitmq, mariadb, memcached, cron, kolla_toolbox, heka, cadvisor, grafana, influx, docker_registry, collectd
- network
- neutron_metadata_agent, neutron_l3_agent, neutron_dhcp_agent, neutron_openvswitch_agent, openvswitch_db, keepalived, haproxy, cron, kolla_toolbox, heka, cadvisor
- util
- cadvisor, grafana, influx, (rally)
- rabbitmq-node
- rabbitmq, cadvisor, collectd, heka
- compute
- nova_ssh, nova_libvirt, nova_compute_fake_1, …, nova_compute_fake_#fake, openvswitch_db, openvswitch_vswitchd, neutron_openvswitch_agent, neutron_openvswitch_agent_fake_1, …, neutron_openvswitch_agent_fake_#fake, cron, kolla_toolbox, heka, cadvisor
First, find the name of the host machine.
cd results
vagrant up concurrency
XPHOST=`vagrant ssh-config concurrency |grep HostName |awk '{print $2}'`
Then, create the SSH tunnels:
# Get an access to the grafana
ssh -NL 3000:${XPHOST}:3000 rennes.g5k
# Get an access to the kibana
ssh -NL 5601:${XPHOST}:5601 rennes.g5k
# Get an access to the nginx with kolla logs
ssh -NL 8000:${XPHOST}:8000 rennes.g5k
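Once the tunnels are up, the UIs answer on the local ports. A quick sanity check from the workstation (assuming curl is installed):
# grafana, kibana and the nginx that serves the kolla logs
curl -s -o /dev/null -w "grafana: %{http_code}\n" http://localhost:3000
curl -s -o /dev/null -w "kibana:  %{http_code}\n" http://localhost:5601
curl -s -o /dev/null -w "logs:    %{http_code}\n" http://localhost:8000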
In this test suite the architecture is always made of 100 computes (5 fake computes on each of the 20 physical compute nodes) and the concurrency varies from 100 to 1000 users in parallel.
The test suite is as follows (a sketch of such a Rally task is given after the list):
- a. 100 concurrent users that try to b&d 200 VMs, then wait 2 minutes. b. 100 concurrent users that try to b&l 200 VMs, then wait 5 minutes.
- a. 1000 concurrent users that try to b&d 2000 VMs, then wait 2 minutes. b. 1000 concurrent users that try to b&l 2000 VMs, then wait 5 minutes.
- a. 100 concurrent users that try to b&d 200 VMs, then wait 2 minutes. b. 100 concurrent users that try to b&l 200 VMs, then wait 5 minutes.
- a. 500 concurrent users that try to b&d 1000 VMs, then wait 2 minutes. b. 500 concurrent users that try to b&l 1000 VMs, then wait 5 minutes.
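As an illustration, the b&d step with 100 concurrent users over 200 VMs roughly corresponds to a Rally task like the sketch below (the image and flavor names are placeholders, not necessarily the ones used in these runs):
cat > boot_and_delete.json <<'EOF'
{
  "NovaServers.boot_and_delete_server": [
    {
      "args": {
        "image": {"name": "cirros"},
        "flavor": {"name": "m1.tiny"}
      },
      "runner": {"type": "constant", "times": 200, "concurrency": 100},
      "context": {"users": {"tenants": 1, "users_per_tenant": 1}}
    }
  ]
}
EOF
rally task start boot_and_delete.json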
TL;DR: Tests 1 and 3 perform well. Test 2 quickly goes wrong, facing problems from mariadb, nova-scheduler and nova-conductor. Test 4 eventually goes wrong after a certain amount of time. The aim of test 3 was to make sure that OpenStack can recover a normal behaviour after a heavy load (such as test 2).
Why: The most frequently returned error is raised by a Rally client. This error always hides another error. A Rally client that launches a VM calls wait_for [fn:wait_for], which waits for a given resource to reach a given status with a timeout of 60 seconds. wait_for raises this error when the timeout is exceeded. The next questions are: why is the timeout exceeded? Which service is responsible for giving no response within 60 seconds?
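To fix ideas, here is a simplified sketch, written in shell rather than Rally's Python, of what such a wait loop amounts to (the server name is hypothetical); the error discussed here is reported exactly when this kind of deadline is missed:
SERVER=rally-test-vm                # hypothetical name, for illustration only
DEADLINE=$((SECONDS + 60))          # same 60 s budget as wait_for
until [ "$(openstack server show "$SERVER" -f value -c status)" = "ACTIVE" ]; do
  if [ "$SECONDS" -ge "$DEADLINE" ]; then
    echo "timeout: $SERVER did not reach ACTIVE within 60 s" >&2
    exit 1
  fi
  sleep 2                           # poll interval
done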
What: The number of concurrent users is so large that haproxy performs many retries, which slows down the test execution. To confirm this we run another experiment (available under the namespace ccy200x5-haproxy1000vs400). With a default configuration for haproxy, we record 534 retries over keystone_internal.
file:images/ccy200x5-hap400-retries.png
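The retries counter can be read directly from haproxy's statistics. A possible way to extract it (the admin socket path is an assumption; kolla may expose the statistics only through the stats HTTP page):
# print proxy, server and the wretr (retries) counter for keystone_internal
echo "show stat" | socat stdio unix-connect:/var/lib/haproxy/haproxy.sock \
  | awk -F, 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "wretr") c = i; next }
             $1 ~ /keystone_internal/ { print $1, $2, $c }'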
So first, we tune haproxy and increase the number of available connections. The number of retries then gets smaller but there are still some, meaning that in case of a big burst we should also increase the timeout connect. We could also increase the number of threads in haproxy. A sketch of this kind of tuning is given below.
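A minimal sketch of the haproxy tuning meant here, assuming illustrative values (they are not the exact settings used in the ccy experiments); the snippet would end up in the haproxy.cfg deployed by kolla:
global
    maxconn 100000        # raise the global connection limit (4000 by default here)
    nbproc 4              # more haproxy processes (nbthread on recent versions)

defaults
    maxconn 100000        # raise the per-proxy limit (2000 by default here)
    retries 3
    timeout connect 30s   # give a burst more time before a retry is triggered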
We try this second possibility in the test ccy100x5-key10vs20 and now there are no more timeouts. The only remaining kind of error is database deadlocks.
By the way, without considering haproxy, this test shows that keystone_internal represents a bottleneck for OpenStack in case of a client burst.
My first guess was that this error comes from a limitation of Rally itself when you run many users at the same time.
Possibility: the haproxy retries cause this timeout. There are a lot of retries on keystone_internal. haproxy waits 10 s before doing a retry and one request can be retried 3 times, so retries alone can add up to 3 x 10 s = 30 s to a single request, half of the 60 s wait_for budget.
See http://www.haproxy.org/download/1.4/doc/configuration.txt for the meaning of the timers: the connection time includes the retries time, and the response time is the time spent waiting for the server to send a full HTTP response.
Collectd misses the total time; I think I have to compute something like ct + rt (connection time plus response time).
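A rough sketch of how that total could be approximated from haproxy's own HTTP log instead (the log path is an assumption; Tc is the connect time and Tr the response time in the Tq/Tw/Tc/Tr/Tt field):
awk '{
  n = split($0, f, " ")
  for (i = 1; i <= n; i++)
    if (split(f[i], t, "/") == 5) {          # first timer field Tq/Tw/Tc/Tr/Tt
      if (t[3] + 0 >= 0 && t[4] + 0 >= 0)    # skip aborted requests reported as -1
        print t[3] + t[4]                    # approximate total time: Tc + Tr (ms)
      break
    }
}' /var/log/haproxy.log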
To get the answer we have to take a close look at haproxy. The default configuration of haproxy limits the number of connections to one service to 2000 and the global number of connections to 4000. Moreover, we run two other experiments that launch 1000 concurrent Rally clients against a tuned versus an untuned haproxy. Results of the experiment are available under the namespace ccy200x5-hap100000vs4000-cpt20-nfk05.
[fn:wait_for] https://github.com/openstack/rally/blob/fbdc1b186ad7f2ea3c4f0fdeea6a6039c298dce6/rally/task/utils.py#L104