
Concurrency Experimentation

This test investigates how OpenStack handles a client burst. In the following we run two rally scenarios with different levels of concurrency: the first scenario is boot and delete (b&d) and the second one boot and list (b&l).

We then analyse the errors that appear during the tests.

Configuration

physical nodes
1 control, 1 network, 1 util, 3 rabbitmq-node, 20 computes.
control
neutron_server, nova_conductor, nova_scheduler, nova_novncproxy, nova_consoleauth, nova_api, glance_api, glance_registry, keystone, rabbitmq, mariadb, memcached, cron, kolla_toolbox, heka, cadvisor, grafana, influx, docker_registry, collectd
network
neutron_metadata_agent, neutron_l3_agent, neutron_dhcp_agent, neutron_openvswitch_agent, openvswitch_db, keepalived, haproxy, cron, kolla_toolbox, heka, cadvisor
util
cadvisor, grafana, influx, (rally)
rabbitmq-node
rabbitmq, cadvisor, collectd, heka
compute
nova_ssh, nova_libvirt, nova_compute_fake_1, …, nova_compute_fake_#fake, openvswitch_db, openvswitch_vswitchd, neutron_openvswitch_agent, neutron_openvswitch_agent_fake_1, …, neutron_openvswitch_agent_fake_#fake, cron, kolla_toolbox, heka, cadvisor

Get results

First, find the name of the host machine.

cd results
vagrant up concurrency
XPHOST=`vagrant ssh-config concurrency |grep HostName |awk '{print $2}'`

Then, create ssh tunnels

# Get access to grafana
ssh -NL 3000:${XPHOST}:3000 rennes.g5k
# Get access to kibana
ssh -NL 5601:${XPHOST}:5601 rennes.g5k
# Get access to the nginx serving the kolla logs
ssh -NL 8000:${XPHOST}:8000 rennes.g5k
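
Once the tunnels are up, the dashboards are reachable on the corresponding local ports. A quick, optional sanity check from another terminal (assuming curl is available locally):

# Check that each tunnel answers; expect an HTTP status line for each port
curl -sI http://localhost:3000 | head -n 1   # grafana
curl -sI http://localhost:5601 | head -n 1   # kibana
curl -sI http://localhost:8000 | head -n 1   # nginx with kolla logs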

Results

Reference architecture: 100 computes & 1 rally (ccy0100-1000-cpt20-nfk05)

In this test suite the architecture always consists of 100 computes (5 fake computes on each of the 20 physical nodes) and the concurrency varies from 100 to 1000 users in parallel.

The test suite is as follows:

  1. a. 100 concurrent users try to b&d 200 VMs, then wait 2 minutes. b. 100 concurrent users try to b&l 200 VMs, then wait 5 minutes.
  2. a. 1000 concurrent users try to b&d 2000 VMs, then wait 2 minutes. b. 1000 concurrent users try to b&l 2000 VMs, then wait 5 minutes.
  3. a. 100 concurrent users try to b&d 200 VMs, then wait 2 minutes. b. 100 concurrent users try to b&l 200 VMs, then wait 5 minutes.
  4. a. 500 concurrent users try to b&d 1000 VMs, then wait 2 minutes. b. 500 concurrent users try to b&l 1000 VMs, then wait 5 minutes.

TL;DR: Tests 1 and 3 perform well. Test 2 quickly goes wrong, facing problems with mariadb, nova-scheduler and nova-conductor. Test 4 eventually goes wrong after a certain amount of time. The aim of test 3 is to make sure that OpenStack can return to a normal behaviour after a heavy load (such as test 2).
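
For reference, a scenario such as 1a can be expressed as a rally task along the following lines. This is a minimal sketch: the image name, flavor name and file name are assumptions, not the exact task files used in this experiment.

cat > boot-and-delete.json <<'EOF'
{
  "NovaServers.boot_and_delete_server": [
    {
      "args": {
        "image": {"name": "cirros"},
        "flavor": {"name": "m1.tiny"}
      },
      "runner": {
        "type": "constant",
        "times": 200,
        "concurrency": 100
      }
    }
  ]
}
EOF
rally task start boot-and-delete.json

The b&l scenario follows the same pattern with NovaServers.boot_and_list_server.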

Errors during test 2a

Timed out waiting for a reply to message ID

Why: This is the error most often raised by a rally client, and it always hides another error. A rally client that launches a VM calls wait_for [fn:wait_for], which waits for a given resource to reach a given status with a timeout of 60 seconds. wait_for raises this error when the timeout is exceeded. The next questions are: why is the timeout exceeded, and which service is responsible for not answering within 60 seconds?
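
To see which services are involved, a quick check is to count the occurrences of this message in the service logs. The sketch below uses kolla's usual log location, which is an assumption for this deployment.

# Count "Timed out waiting for a reply" per log file to spot the services involved
grep -Rc --include='*.log' "Timed out waiting for a reply" /var/log/kolla 2>/dev/null \
  | sort -t: -k2 -nr | head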

What: There are many possibilities here, but our first guess is that the number of concurrent users is so high that haproxy performs many retries, which slows down the test execution. To confirm this we ran another experiment (available under the namespace ccy200x5-haproxy1000vs400). With the default haproxy configuration, we record 534 retries on keystone_internal.

file:images/ccy200x5-hap400-retries.png
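
The retry counters can be read directly from haproxy. The sketch below assumes the admin socket is enabled and reachable at the given path (the path is an assumption to adapt to the deployment); it prints the wretr (retries) counter for each backend.

# Dump haproxy statistics and print the retry counter (wretr) per backend
echo "show stat" | socat stdio UNIX-CONNECT:/var/lib/haproxy/haproxy.sock \
  | awk -F, 'NR==1 {for (i = 1; i <= NF; i++) if ($i == "wretr") c = i; next}
             $2 == "BACKEND" {print $1, $c}'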

So first, we tune haproxy and increase the number of available connections. The number of retries then becomes smaller, but some remain, meaning that in case of a big burst we should also increase the connect timeout. We could also increase the number of threads in haproxy.
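
To make this concrete, the directives involved are shown below. This is only a sketch written to a scratch file: the values are illustrative, not the exact configuration used in the experiment.

# Illustrative haproxy directives for the tuning discussed above (sketch only)
cat > /tmp/haproxy-tuning-example.cfg <<'EOF'
global
  maxconn 100000          # global connection limit (4000 in the default setup)

defaults
  retries 3               # a request may be retried up to 3 times
  timeout connect 10s     # time waited before a retry

listen keystone_internal  # bind line omitted in this sketch
  maxconn 100000          # per-proxy limit (2000 by default)
EOF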

We try this second possibility in the test ccy100x5-key10vs20 and there are no more timeouts. The only remaining kind of error is database deadlocks.
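
The deadlocks themselves can be inspected on the mariadb node. A hedged example (the container name and credentials are assumptions):

# Show the InnoDB status and look for the "LATEST DETECTED DEADLOCK" section
docker exec -it mariadb mysql -u root -p -e "SHOW ENGINE INNODB STATUS\G"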

Incidentally, even leaving haproxy aside, this test shows that keystone_internal is a bottleneck for OpenStack in case of a client burst.

All retries appear on the keystone_internal interface: haproxy records 534 of them during our test with 1000 concurrent users. Our first guess was that the error came from a limitation of rally itself when many users run at the same time, but the haproxy retries are a better explanation: haproxy waits 10 s before performing a retry and a single request can be retried 3 times, so retries alone can account for up to 30 s of the 60 s budget before the request even reaches the service.

According to the haproxy documentation (http://www.haproxy.org/download/1.4/doc/configuration.txt), the connection time includes the retry time, while the response time is the time spent waiting for the server to send a full HTTP response. Collectd does not report the total time, so we have to approximate it as ct + rt (connection time plus response time).
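
As a purely hypothetical illustration of the ct + rt approximation, suppose the two collectd series have been exported as CSV files ct.csv and rt.csv with time,value rows aligned on the same timestamps (the file names and format are assumptions):

# Approximate total time as connection time + response time, row by row
paste -d, ct.csv rt.csv | awk -F, '{print $1 "," ($2 + $4)}' > total.csv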

Note that this does not show up in the test with 5 rally nodes in parallel. To get the answer we have to take a closer look at haproxy: its default configuration limits the number of connections to a single service to 2000 and the global number of connections to 4000. Moreover, we ran two other experiments that launch 1000 concurrent rally clients against a tuned versus an untuned haproxy. The results of these experiments are available under the namespace ccy200x5-hap100000vs4000-cpt20-nfk05.

[fn:wait_for] https://github.com/openstack/rally/blob/fbdc1b186ad7f2ea3c4f0fdeea6a6039c298dce6/rally/task/utils.py#L104

Unknown Error (HTTP 504)

Failed to get the resource

Unable to establish connection to keystone-public

Unable to establish connection to

Gateway Timeout (HTTP 504)

Service Unavailable (HTTP 503)

Reference architecture (1000 computes)