
Conversation

@MarcialRosales (Contributor) commented on Oct 20, 2021

This fixes #863 (WIP)

Summary of Changes

The issues I identified were mainly these:

  • Persistent volume operations such as volume expansion take far longer than initially expected. The timeout used is 10m, which should be long enough, but in practice it sometimes takes longer. It may make sense to fetch the PVC's events before aborting the test case so that we know what has happened to the PVC up to that point. For the time being, I have added a logging statement that logs the PVC's status.conditions.
  • Test cases that checked the ConfigMap annotation rabbitmq.com/pluginsUpdatedAt were very time-sensitive. I managed to reproduce the failure a few times, so it seems the 30-second timeout was too tight. We can ramp it up to 1 minute without any issues; the next timeout is 4 minutes, which is when we check that the annotation no longer exists.
  • The last issue was related to testing protocols such as MQTT, STOMP, and Streams. I observed that even though the StatefulSet was fully ready, the ports were not yet open and ready to serve traffic. Therefore, an extra function was required that waits until the port is ready (see the sketch after this list).
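
For the third point, the wait can be implemented as a small helper that keeps dialing the port until a TCP connection succeeds. A minimal sketch of the idea in Go; the function name, timeouts, and polling interval are illustrative, not the exact helper added in this PR:

```go
import (
	"fmt"
	"net"
	"time"
)

// waitUntilPortReady polls addr (e.g. "10.0.0.5:1883") until a TCP
// connection can be established, or gives up after the deadline.
// Name and timings are illustrative, not the exact helper from this PR.
func waitUntilPortReady(addr string, deadline time.Duration) error {
	timeout := time.After(deadline)
	for {
		conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
		if err == nil {
			conn.Close()
			return nil
		}
		select {
		case <-timeout:
			return fmt.Errorf("port %s not ready after %v: %w", addr, deadline, err)
		case <-time.After(time.Second): // retry once a second
		}
	}
}
```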

Commit messages in this PR (excerpts; bracketed text marks fragments whose subject lines were lost):

  • I noticed that it randomly failed to establish a connection. Raising the timeout from 10 to 30 seconds seems to have fixed it, at least when running in GKE.
  • [...] and wait until the stream port is ready only when the feature flag stream_queue is enabled.
  • [Raised the timeout] to 1 minute. We still have 3 more minutes before the annotation is removed.
  • [Raised the timeout] to get the RabbitMQCluster resource. Also provided a description of the asserted generation values to help troubleshoot when the failure occurs.
  • TODO: Provide an assertion message that gives us information about the PVC, such as events or conditions, so that we can know why it has not expanded (see the sketch below).
  • Even though the RabbitMQ diagnostics tool says that the stream port is ready to accept connections, the fact of the matter is that the stream client cannot connect on the first attempt. The stream client connects via the k8s NodePort, not directly to the stream port.
  • [Raised the timeout] to send a message via the stream protocol.
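
As a starting point for the TODO above, the PVC's status.conditions can be logged before failing the assertion. A hedged sketch, assuming a controller-runtime client is available to the test (as it is in the system tests); the helper name is hypothetical:

```go
import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// logPVCConditions fetches a PVC and prints its status.conditions so a
// failed expansion assertion leaves a trace of what the PVC was doing.
func logPVCConditions(ctx context.Context, c client.Client, namespace, name string) {
	var pvc corev1.PersistentVolumeClaim
	if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, &pvc); err != nil {
		fmt.Printf("failed to get PVC %s/%s: %v\n", namespace, name, err)
		return
	}
	for _, cond := range pvc.Status.Conditions {
		fmt.Printf("PVC %s condition: type=%s status=%s reason=%q message=%q\n",
			name, cond.Type, cond.Status, cond.Reason, cond.Message)
	}
}
```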
@ChunyiLyu (Contributor) left a comment


Looks good in general! Just one comment about the stream test:

One issue I see with the change to the stream test is that the test now checks for the feature flag, which requires a running RMQ. If we are using an RMQ image without the feature, the test will deploy a dedicated cluster just to find out that streams are not supported, then skip the test and tear down the RMQ. Personally, I would prefer verifying the image before deploying RabbitMQ, to save the unnecessary cluster creation. What do you think?

@Zerpet (Member) commented on Oct 25, 2021

There seems to be a legitimate go vet error in the unit/integration tests:

https://github.com/rabbitmq/cluster-operator/runs/3965814556?check_suite_focus=true

@MarcialRosales (Contributor, Author) commented:

@ChunyiLyu I see what you mean, but I thought that, given that the hasFeatureEnabled function is under the system_test package, it would only ever be called against a running cluster. You are right, though, that if someone called it just to check whether the release has streams enabled, it won't work, because it requires a running cluster.

The reason I changed it was that I thought it was more reliable and simpler to check directly whether the feature is enabled, rather than checking versions.
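
For reference, a check of this kind has to ask a running node for its feature flags. A minimal sketch of the idea — the helper name follows the discussion, but the kubectl exec mechanics and JSON handling here are assumptions, not the PR's exact code:

```go
import (
	"encoding/json"
	"fmt"
	"os/exec"
)

type featureFlag struct {
	Name  string `json:"name"`
	State string `json:"state"`
}

// hasFeatureEnabled asks a running RabbitMQ node whether a feature flag
// (e.g. "stream_queue") is enabled. It shells out to kubectl, so it only
// works against a deployed cluster -- the limitation discussed above.
func hasFeatureEnabled(namespace, podName, flagName string) (bool, error) {
	out, err := exec.Command("kubectl", "exec", "-n", namespace, podName, "--",
		"rabbitmqctl", "list_feature_flags", "--formatter=json").Output()
	if err != nil {
		return false, fmt.Errorf("listing feature flags: %w", err)
	}
	var flags []featureFlag
	if err := json.Unmarshal(out, &flags); err != nil {
		return false, err
	}
	for _, f := range flags {
		if f.Name == flagName {
			return f.State == "enabled", nil
		}
	}
	return false, nil
}
```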

@MarcialRosales (Contributor, Author) commented:

@ChunyiLyu I have moved the test case for the stream protocol back to where it was originally, i.e. together with the STOMP and MQTT test cases. This way, as you pointed out, we do not deploy a RabbitMQ cluster unnecessarily when the stream plugin/feature is not enabled.

@ansd self-requested a review on November 3, 2021 15:14

@ansd (Member) left a comment


Currently, the readiness probe checks whether the AMQP port is open. As soon as the AMQP port is open, the Pod starts accepting traffic. However, we observe in the system tests that other ports take longer to open, causing the system tests to fail.

When looking at the logs:

2021-10-15 13:39:26.629506+02:00 [info] <0.44.0> Application rabbitmq_management started on 'rabbit@host'
2021-10-15 13:39:26.630140+02:00 [info] <0.891.0> MQTT: will wait for 300 more ms for cluster members to join before triggering a Raft leader election
2021-10-15 13:39:26.931683+02:00 [info] <0.891.0> MQTT: will wait for 300 more ms for cluster members to join before triggering a Raft leader election
2021-10-15 13:39:27.233728+02:00 [info] <0.891.0> MQTT: will wait for 300 more ms for cluster members to join before triggering a Raft leader election
2021-10-15 13:39:27.535770+02:00 [info] <0.891.0> MQTT: will wait for 300 more ms for cluster members to join before triggering a Raft leader election
2021-10-15 13:39:27.837307+02:00 [info] <0.891.0> MQTT: will wait for 300 more ms for cluster members to join before triggering a Raft leader election
2021-10-15 13:39:28.138570+02:00 [info] <0.891.0> MQTT: will wait for 300 more ms for cluster members to join before triggering a Raft leader election
2021-10-15 13:39:28.440774+02:00 [info] <0.891.0> MQTT: will wait for 300 more ms for cluster members to join before triggering a Raft leader election
2021-10-15 13:39:28.742584+02:00 [info] <0.891.0> MQTT: will wait for 300 more ms for cluster members to join before triggering a Raft leader election
2021-10-15 13:39:29.044627+02:00 [info] <0.891.0> MQTT: will wait for 300 more ms for cluster members to join before triggering a Raft leader election
2021-10-15 13:39:29.346577+02:00 [info] <0.891.0> MQTT: will wait for 300 more ms for cluster members to join before triggering a Raft leader election
2021-10-15 13:39:29.648615+02:00 [info] <0.891.0> MQTT: will wait for 300 more ms for cluster members to join before triggering a Raft leader election
2021-10-15 13:39:29.950628+02:00 [info] <0.891.0> MQTT: will wait for 300 more ms for cluster members to join before triggering a Raft leader election
2021-10-15 13:39:30.252812+02:00 [info] <0.891.0> MQTT: will wait for 300 more ms for cluster members to join before triggering a Raft leader election
2021-10-15 13:39:30.554050+02:00 [info] <0.891.0> MQTT: will wait for 300 more ms for cluster members to join before triggering a Raft leader election
2021-10-15 13:39:30.855862+02:00 [info] <0.891.0> MQTT: will wait for 300 more ms for cluster members to join before triggering a Raft leader election
2021-10-15 13:39:31.157550+02:00 [info] <0.891.0> MQTT: will wait for 300 more ms for cluster members to join before triggering a Raft leader election
2021-10-15 13:39:31.459744+02:00 [info] <0.891.0> MQTT: will wait for 300 more ms for cluster members to join before triggering a Raft leader election
2021-10-15 13:39:31.781635+02:00 [debug] <0.897.0> mqtt_node: ra_log:init recovered last_index_term {0,0} first index 0
2021-10-15 13:39:31.823148+02:00 [debug] <0.897.0> mqtt_node: recover -> recover in term: 0 machine version: 1
2021-10-15 13:39:31.823761+02:00 [debug] <0.897.0> mqtt_node: recovering state machine version 0:1 from index 0 to 0

we see why the MQTT plugin takes longer to initialise (compared to the AMQP port being opened): it needs to trigger a Raft leader election, which takes about 5 seconds in the output above.

I'd favour solving the root cause of this issue instead of implementing a workaround in the system test, because users will observe the same issue.

One possible solution would be to define an MQTT readiness check when the MQTT plugin is enabled (instead of the AMQP readiness check), because we know that the MQTT port takes longer to open than the AMQP port.

WDYT?
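
To illustrate the suggestion, a TCP readiness probe pointing at the MQTT listener could be built roughly like this. This is a hedged sketch, not the operator's actual probe wiring; the port and thresholds are illustrative:

```go
import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// mqttReadinessProbe returns a TCP readiness probe against the MQTT
// listener (1883) instead of AMQP (5672). Field names follow the
// client-go versions current at the time of this PR, where the probe
// handler struct is corev1.Handler (renamed ProbeHandler in later releases).
func mqttReadinessProbe() *corev1.Probe {
	return &corev1.Probe{
		Handler: corev1.Handler{
			TCPSocket: &corev1.TCPSocketAction{
				Port: intstr.FromInt(1883), // default MQTT listener port
			},
		},
		InitialDelaySeconds: 10,
		TimeoutSeconds:      5,
		PeriodSeconds:       10,
		FailureThreshold:    3,
	}
}
```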

@ChunyiLyu merged commit c91a0c8 into main on Nov 10, 2021
@ChunyiLyu deleted the fix-system-tests branch on November 10, 2021 16:41