xp load clust rabbitmq
Evaluate with a clustered rabbitmq.
- physical nodes
- 1 control, 1 network, 20 computes, 1 util, 3 rabbitmq nodes.
- control
- neutron_server, nova_scheduler, nova_novncproxy, nova_consoleauth, nova_api, glance_api, glance_registry, keystone, cron, memcached, kolla_toolbox, heka, cadvisor, docker_registry, nova_conductor, collectd
- network
- neutron_metadata_agent, neutron_l3_agent, neutron_dhcp_agent, neutron_openvswitch_agent, openvswitch_db, keepalived, cron, kolla_toolbox, haproxy, heka, cadvisor, collectd
- compute
- nova_ssh, nova_libvirt, nova_compute_fake_1, …, nova_compute_fake_#fake, openvswitch_db, openvswitch_vswitchd, neutron_openvswitch_agent, neutron_openvswitch_agent_fake_1, …, neutron_openvswitch_agent_fake_#fake, cron, kolla_toolbox, heka, cadvisor, collectd
- util
- cadvisor, grafana, influx (rally)
- 3 rabbitmq nodes
- rabbitmq, cadvisor, collectd, heka
First, find the name of the host machine.

```
cd results
vagrant up load-clust-rabbit
```
- deploy a 1000-node OpenStack
- boot_and_delete concurrency=50 times=100
- wait
- boot_and_list concurrency=50 times=100
More concretely :

```
./kolla-g5k.py ; sleep 500; ./kolla-g5k.py bench --scenarios=vanilla-boot-delete-then-boot-list.txt --times=100 --concurrency=50 --wait=100 ; sleep 300; ./kolla-g5k.py bench --scenarios=vanilla-boot-delete-then-boot-list.txt --times=100 --concurrency=50 --wait=100
```
```
> cat vanilla-boot-delete-then-boot-list.txt
disco-rally-boot-and-delete.json
disco-rally-boot-and-list.json
```
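The contents of the disco-rally-*.json task files are not shown here; for reference, a typical Rally task for the boot-and-delete scenario looks roughly like the sketch below (the flavor and image names are assumptions, not taken from the experiment):

```json
{
  "NovaServers.boot_and_delete_server": [
    {
      "args": {
        "flavor": {"name": "m1.tiny"},
        "image": {"name": "cirros"}
      },
      "runner": {
        "type": "constant",
        "times": 100,
        "concurrency": 50
      }
    }
  ]
}
```

The `times` and `concurrency` values in the runner section correspond to the `--times=100 --concurrency=50` flags passed on the command line above.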
In nova-scheduler : ComputeFilter returns 0 hosts: VMs can't find a host to start on because the computes are declared offline.
Hypothesis : compute node states aren't updated within the service_down_time interval, and the nodes are thus declared offline.
Solutions :
- report_interval (10s) is too low and puts too much pressure on the system. Increasing it would put less pressure on the conductor and the DB; service_down_time must be increased accordingly.
- ~~instance_sync_interval (120s)~~ is also putting some load on the scheduler.
- scale the conductor
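As an illustration of the first two solutions, a possible nova.conf fragment is sketched below. The values are assumptions, not the ones used in the experiment; the key constraint is that service_down_time must stay well above report_interval:

```ini
[DEFAULT]
# Default is 10s; reporting less often reduces the pressure the
# compute status reports put on the conductor and the DB.
report_interval = 60
# Default is 60s; with report_interval = 60 this must grow too,
# otherwise computes are declared down between two reports.
service_down_time = 300
```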
In rabbitmq : Closing connections due to {handshake_timeout,frame_header} or {handshake_timeout, timeout}
Hypothesis : too much latency on rabbitmq when opening a connection.
Solutions:
- Increase the timeout
- Decrease the load
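For the first solution, the timeout in question is RabbitMQ's handshake_timeout, which defaults to 10 seconds. A possible /etc/rabbitmq/rabbitmq.config fragment (the 30s value is an assumption):

```erlang
%% Raise the AMQP handshake timeout from the 10s default
%% (values are in milliseconds).
[
  {rabbit, [
    {handshake_timeout, 30000}
  ]}
].
```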
Goal : see the effect of applying solutions 1. & 2. of observation 1.
- The ComputeFilter is OK for every request (checked in nova-scheduler.log)
- Only 12 errors when deleting instances.
But since instances aren't synced as often, the scheduler has a less accurate view of the system. This could lead to more retries when running an "almost full" system.
Goal : see the effect of increasing the handshake timeout.
- No more timeouts in the rabbitmq logs
- Every test passed
Goal : try to absorb the load of the compute reports by increasing the number of conductors (here 10).
- (nova-scheduler) ComputeFilter returns 0 hosts:

```
grep "ComputeFilter returned 0 hosts" nova-scheduler.log | grep -o "req-[[:alnum:]]*" | uniq | wc -l
-> 291
```

- some DB errors at the rally level (disconnections, reentrant calls)
- many timeouts
- many failures in the rally tests
Increasing the number of conductors obviously doesn't solve anything.
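A side note on the counting pipeline: `uniq` only collapses adjacent duplicates, so a request ID that reappears later in the log is counted more than once; piping through `sort -u` deduplicates globally. A self-contained sketch on a synthetic log (the path and log contents are made up for illustration):

```shell
# Build a tiny fake scheduler log where req-aaa appears twice,
# separated by another request.
printf '%s\n' \
  'req-aaa ComputeFilter returned 0 hosts' \
  'req-bbb ComputeFilter returned 0 hosts' \
  'req-aaa ComputeFilter returned 0 hosts' \
  > /tmp/nova-scheduler.log

# `uniq` alone counts req-aaa twice (duplicates are not adjacent): 3
grep "ComputeFilter returned 0 hosts" /tmp/nova-scheduler.log \
  | grep -o "req-[[:alnum:]]*" | uniq | wc -l

# `sort -u` deduplicates globally: 2
grep "ComputeFilter returned 0 hosts" /tmp/nova-scheduler.log \
  | grep -o "req-[[:alnum:]]*" | sort -u | wc -l
```

If the same request never reappears after other log lines, the two pipelines agree, which may be why the original count was usable.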
Goal : see if adding more schedulers can help make the rally benchmarks pass.
all tests passed
Goal : see if the system can handle the load in the long term (times = 10000)
- Many 504 errors starting at iteration #9000.
- MySQL connections are hitting the 2000-connection limit of haproxy
Let’s confirm with a second run
confirmed
Goal : increase the haproxy limits (global maxconn: 100000 and frontend: 2000)
All tests passed
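The corresponding haproxy.cfg change is sketched below; the frontend section name is an assumption, and the 2000-connection default on frontends is what the previous runs were hitting:

```
global
    # Global ceiling on concurrent connections across all frontends.
    maxconn 100000

# A frontend without an explicit maxconn defaults to 2000 concurrent
# connections, which the MySQL traffic exceeded around iteration #9000.
frontend mariadb_front
    maxconn 10000
```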
Making haproxy as transparent as possible makes life easier!
Running 8 schedulers on 8 hosts is easy with kolla-g5k, but running 8 schedulers on the same host becomes a bit trickier. Here is a method :
- create a /scheduler_1 directory to hold the logs
- give it the right permissions : chown 162 /scheduler_1
- start the container :

```
docker run -ti --net=host -v /scheduler_1:/var/log/kolla:rw -v /etc/localtime:/etc/localtime:ro -v /etc/kolla/nova-scheduler/:/var/lib/kolla/config_files/:ro -e "KOLLA_SERVICE_NAME=nova-scheduler" -e "KOLLA_CONFIG_STRATEGY=COPY_ALWAYS" -e "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" -e "KOLLA_BASE_DISTRO=centos" -e "KOLLA_INSTALL_TYPE=binary" -e "KOLLA_INSTALL_METATYPE=rdo" -e "PS1=$(tput bold)($(printenv KOLLA_SERVICE_NAME))$(tput sgr0)[$(id -un)@$(hostname -s) $(pwd)]$ " --name nova_scheduler_1 -d kolla/centos-binary-nova-scheduler:2.0.2
```

Repeat the above for each scheduler (a small bash script will do).
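The "small bash script" could look like this sketch. It only echoes the docker commands (drop the `echo` to actually run them), uses /tmp as the parent directory, and trims most of the environment variables for brevity; the image tag and uid 162 (the kolla nova user) come from the command above:

```shell
#!/usr/bin/env bash
# Prepare log directories and print the docker command for N
# nova-scheduler containers on a single host (a sketch, not the
# exact script used in the experiment).
N="${1:-8}"
for i in $(seq 1 "$N"); do
  logdir="/tmp/scheduler_${i}"
  mkdir -p "$logdir"
  # chown needs root; 162 is the kolla nova uid.
  chown 162 "$logdir" 2>/dev/null || true
  echo docker run -d --net=host \
    -v "${logdir}:/var/log/kolla:rw" \
    -v /etc/localtime:/etc/localtime:ro \
    -v /etc/kolla/nova-scheduler/:/var/lib/kolla/config_files/:ro \
    -e "KOLLA_SERVICE_NAME=nova-scheduler" \
    -e "KOLLA_CONFIG_STRATEGY=COPY_ALWAYS" \
    --name "nova_scheduler_${i}" \
    kolla/centos-binary-nova-scheduler:2.0.2
done
```

Each container gets its own log directory and `--name`, which is what makes several schedulers coexist on one host.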
Note : the service hostname will be the same for all schedulers, so the database will hold only one record for them. Is it an issue ? Benchmarks have been run successfully using this configuration.
Note : on g5k, it may be a better idea to use /tmp as the parent directory (size limitation on / otherwise).