Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Selenium Gird Scaler | Edge active sessions not being properly counted #2709

Closed
Wolfe1 opened this issue Mar 2, 2022 · 8 comments · Fixed by #3062
Closed

Selenium Gird Scaler | Edge active sessions not being properly counted #2709

Wolfe1 opened this issue Mar 2, 2022 · 8 comments · Fixed by #3062
Labels
bug Something isn't working

Comments

@Wolfe1
Copy link
Contributor

Wolfe1 commented Mar 2, 2022

Report

Been working on getting a selenium grid 4 deployment working in AKS using Keda to scale. Chrome and Firefox have been no problem but scaling Edge browser pods seems to have a bug.

Currently using the following scaled-object:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: selenium-edge-scaledobject
  namespace: legion-selenium-grid
  labels:
    deploymentName: selenium-edge-node-deployment
spec:
  minReplicaCount: 0
  maxReplicaCount: 80
  scaleTargetRef:
    name: selenium-edge-node-deployment
  triggers:
    - type: selenium-grid
      metadata:
        url: 'https://selenium-grid-qe-tools.accruentsystems.com/graphql'
        browserName: 'MicrosoftEdge'

New sessions in the queue appear as 'MicrosoftEdge' and scale up just fine:
image

However, taking a look at the active sessions I noticed that the browserName on the active sessions becomes 'msedge':
image

Thus when it comes time for Keda to check for a scale-down event it sees that there are no active sessions and proceeds scale to 0.

Expected Behavior

Keda detects that Edge pods are still in use and scales down gradually as needed.

Actual Behavior

When scaling down Keda will remove all Edge pods regardless of current active sessions.

Steps to Reproduce the Problem

  1. Deploy selenium grid and Edge deployments
  2. Apply scaled-object for edge (see description)
  3. Can be done in a variety of ways but an easy way: Run a single long (10 minute +) test on selenium grid. Grid will scale up to allow your test to run but will eventually scale to 0 before your test can finish.

Logs from KEDA operator

example

KEDA Version

2.6.1

Kubernetes Version

1.21

Platform

Microsoft Azure

Scaler Details

Selenium Grid

Anything else?

I am not in any way familiar with Go and am having issues deploying a local change to AKS but I believe all that really needs done here is a specific check for "MicrosoftEdge" as a browserName and then simply use the session-specific naming of 'msedge' in that situation.

Probably something like this (though im sure there is a more elegant way to put it). Around line 225 in selenium_grid_scaler:

var sessions = seleniumResponse.Data.SessionsInfo.Sessions
	for _, session := range sessions {
		var capability = capability{}
		var browserCapName = ""
		if browserName == "MicrosoftEdge" {
			browserCapName = "msedge"
		} else {
			browserCapName = browserName
		}
		if err := json.Unmarshal([]byte(session.Capabilities), &capability); err == nil {
			if capability.BrowserName == browserCapName {
				if strings.HasPrefix(capability.BrowserVersion, browserVersion) {
					count++
				} else if browserVersion == DefaultBrowserVersion {
					count++
				}
			}
		} else {
			seleniumGridLog.Error(err, fmt.Sprintf("Error when unmarshaling sessions info: %s", err))
		}
	}
@Wolfe1 Wolfe1 added the bug Something isn't working label Mar 2, 2022
@tomkerkhove tomkerkhove moved this to Proposed in Roadmap - KEDA Core Mar 2, 2022
@Wolfe1
Copy link
Contributor Author

Wolfe1 commented Mar 8, 2022

Testing this a bit further and it seems that this limits the scale-up as well.

For example, take this testing scenario:

  • Scaled object allows up to 20 edge pods at once, normal scale-up behavior.
  • I have 10 Edge pods, each currently running a test.
  • I have 5 sessions for edge waiting in the session queue.

What we would expect to happen here is that we would scale up to 15 pods in order to allow the 5 tests in the queue to start. In actuality, nothing happens. I believe this is also because of the edge sessions having a different browserName when they are in an active session. Keda knows there are 10 pods up but does not see any active sessions under the name "MicrosoftEdge" so it assumes the sessions in the queue will be picked up soon and thus does not scale.

@JorTurFer
Copy link
Member

Hi,
I'm not an expert but this seems an error in the upstream, I mean. KEDA recovers those values and applies the logic in base of them, but I guess that the consistency between them is part of the upstream responsibility.

Maybe I'm missing something but I think that if we apply that patch and then the upstream uses the same name in both cases, KEDA will fail again.

thoughts @kedacore/keda-core-contributors ?

@Wolfe1
Copy link
Contributor Author

Wolfe1 commented Mar 9, 2022

@JorTurFer After talking with some contributors for selenium it seems that the edgeDriver accepts a few values as valid (having trouble finding any sort of a list though) when starting but will in the end self identify as 'msedge' every time:

image

It would seem in this case that the driver session would never report anything other than "msedge".

Modifying my session call to appear as "msedge" gets it in the session queue but my current hurdle is getting the Edge node to accept the "msedge" browserName stereotype as something to pick up on the node.

Still working on this so there may be a way to get this working as-is. Even though this is an upstream responsibility I think a small modification in one/two file(s) here would be much simpler than large code changes across multiple projects and tools.


Perhaps this could be an optional parameter such as sessionBrowserName that if set is used for session identification, otherwise ignored and browserName is used as normal? This would allow flexibility outside of this case for edge as well.

@JorTurFer
Copy link
Member

JorTurFer commented Mar 10, 2022

I agree with that optional parameter. Maybe explaining it well in the docs we could fill the gap and don't be attached to future changes in their side because it's just remove the value for the optional parameter (but this edge case should be documented IMO)
Let's give some time to other @kedacore/keda-core-contributors for sharing their thoughts?

@Wolfe1
Copy link
Contributor Author

Wolfe1 commented Apr 12, 2022

@JorTurFer Is there a tag needed for this to be discussed or have someone chime in?

@zroubalik
Copy link
Member

zroubalik commented Apr 13, 2022

An optional parameter seems like the best solution for this, together with a proper documentation. Anyone up for a contribution?

@Wolfe1
Copy link
Contributor Author

Wolfe1 commented May 20, 2022

@zroubalik I have some time so taking a stab at this now.

@Wolfe1
Copy link
Contributor Author

Wolfe1 commented May 23, 2022

I have a working solution for this but ran into a bit of a snag: #3061

Going to see about fixing that as well in this as it should be fairly simple.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants