Is this a BUG REPORT or FEATURE REQUEST?
BUG REPORT

What happened:
In the case of ec2-terminate-by-tag with MANAGED_SUBGROUP=enable (i.e. when the EC2 instances are managed by an ASG), there is an issue when trying to execute the chaos only once.
In general, setting CHAOS_INTERVAL=TOTAL_CHAOS_DURATION is the way to get a single execution.
But if we set CHAOS_INTERVAL<TOTAL_CHAOS_DURATION, the chaos fails because of the following behavior:
the code loops for the whole TOTAL_CHAOS_DURATION (see litmus-go/chaoslib/litmus/ec2-terminate-by-tag/lib/ec2-terminate-by-tag.go, line 82 in b6d04fb), and inside that loop it iterates over instanceIDList, so it can try to stop the same instance multiple times during the chaos duration. In the case of MANAGED_SUBGROUP=disable, the instance is "stopped" instead of terminated, so it will stop/start/stop the same instance without any issue. But in the case of MANAGED_SUBGROUP=enable, the instance is "terminated", which causes an issue: since the instance is never removed from instanceIDList, the next iteration tries to stop an instance that no longer exists.
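The loop shape described above can be sketched as follows. This is a minimal, self-contained simplification, not the actual litmus-go code: `runChaos`, `stopInstance`, the instance ids, and the millisecond timing are hypothetical stand-ins chosen to reproduce the reported behavior.

```go
package main

import (
	"fmt"
	"time"
)

// stopInstance stands in for the real AWS StopInstances call; here it fails
// for instances the ASG has already terminated (hypothetical simplification
// of the reported IncorrectInstanceState error).
func stopInstance(id string, terminated map[string]bool) error {
	if terminated[id] {
		return fmt.Errorf("IncorrectInstanceState: instance %s cannot be stopped", id)
	}
	// With MANAGED_SUBGROUP=enable the ASG terminates the stopped instance.
	terminated[id] = true
	return nil
}

// runChaos mimics the loop described above: the outer loop runs until the
// chaos duration elapses, and the inner loop re-targets every instance in
// instanceIDList on each iteration.
func runChaos(instanceIDList []string, totalChaosDuration, chaosInterval int) error {
	terminated := map[string]bool{}
	for elapsed := 0; elapsed < totalChaosDuration; elapsed += chaosInterval {
		for _, id := range instanceIDList {
			if err := stopInstance(id, terminated); err != nil {
				return err // the failure reported above
			}
		}
		// Milliseconds instead of seconds, to keep the demo fast.
		time.Sleep(time.Duration(chaosInterval) * time.Millisecond)
	}
	return nil
}

func main() {
	// CHAOS_INTERVAL < TOTAL_CHAOS_DURATION: the second iteration hits an
	// already-terminated instance and the experiment fails.
	if err := runChaos([]string{"i-aaa", "i-bbb"}, 2, 1); err != nil {
		fmt.Println("err:", err)
	}
	// CHAOS_INTERVAL = TOTAL_CHAOS_DURATION: a single iteration succeeds.
	if err := runChaos([]string{"i-aaa", "i-bbb"}, 1, 1); err == nil {
		fmt.Println("single-iteration run succeeded")
	}
}
```

Running this prints the IncorrectInstanceState error for the first configuration and succeeds for the second, matching the behavior described in this report.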
The only way to get a success is to set CHAOS_INTERVAL=TOTAL_CHAOS_DURATION, but then we have to wait CHAOS_INTERVAL for nothing at the end of the first (and only) chaos iteration.
=> the case where the interval is 0/1 should be handled.

What you expected to happen:
In the case of MANAGED_SUBGROUP=enable, the instance has to be removed from instanceIDList to avoid trying to stop it again in the next iterations.
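One possible shape for that fix is sketched below. This is only an illustration, not litmus-go code: `removeInstanceID` and the second instance id are hypothetical; the idea is simply that after a termination with MANAGED_SUBGROUP=enable, the id is dropped from the target list so later iterations never retry it.

```go
package main

import "fmt"

// removeInstanceID returns instanceIDList without the given id, leaving the
// original slice untouched.
func removeInstanceID(instanceIDList []string, id string) []string {
	out := make([]string, 0, len(instanceIDList))
	for _, v := range instanceIDList {
		if v != id {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	// The second id is a made-up example alongside the one from this report.
	instanceIDList := []string{"i-0fd0da669ea93c044", "i-0aaaabbbbccccdddd"}

	// After the ASG terminates the stopped instance, drop it from the list
	// so the next chaos iteration does not try to stop it again.
	instanceIDList = removeInstanceID(instanceIDList, "i-0fd0da669ea93c044")
	fmt.Println(instanceIDList) // the terminated id is gone
}
```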
How to reproduce it (as minimally and precisely as possible):
1. Tag an instance with chaos=allowed.
2. Launch the experiment with: MANAGED_SUBGROUP=enable, TOTAL_CHAOS_DURATION=500s (a sufficient time to allow the ASG to terminate the stopped instance), CHAOS_INTERVAL=0 (or any value < TOTAL_CHAOS_DURATION), and INSTANCE_TAG='chaos:allowed'.
3. The instance is stopped; after several minutes, the instance is terminated by the ASG.
4. The code waits CHAOS_INTERVAL => 0 seconds.
5. The instance is still in the list of instances to stop => the experiment fails with: err: ec2 instance failed to stop, err: IncorrectInstanceState: This instance 'i-0fd0da669ea93c044' is not in a state from which it can be stopped.
The only way to avoid the failure is to set CHAOS_INTERVAL=TOTAL_CHAOS_DURATION, but then:
1. The instance is stopped; after several minutes, the instance is terminated by the ASG.
2. The code waits the full CHAOS_INTERVAL (so 400s for nothing).
3. The experiment succeeds (but with a useless waiting period of CHAOS_INTERVAL).
Anything else we need to know?:
@ksatchit and @uditgaurav already agreed on that missing behavior, thanks to them for the support to understand this issue 👍
Additional info: the above behavior also causes an issue if I tag more than one instance with chaos: allowed.
Indeed, if I tag 2 instances, 2 targeted instances are detected; the first is stopped and then terminated, and since its instanceId is still in the list, the code tries to stop the first instance again, which is already terminated, and it loops over this error.
The workflow never ends, I have to delete it manually, and of course the 2nd instance is never stopped.
The details are explained in this Slack discussion: https://kubernetes.slack.com/archives/CNXNB0ZTN/p1643826054494339?thread_ts=1643739932.025119&cid=CNXNB0ZTN