Windows agents are soooooooooo slooooooooooooooooooow #3117

Open
daniel-beck opened this issue Sep 2, 2022 · 20 comments
Comments

@daniel-beck

Service(s)

ci.jenkins.io

Summary

Looking through some successful builds in https://ci.jenkins.io/blue/organizations/jenkins/Core%2Fjenkins/activity I see wildly different build durations per platform:

https://ci.jenkins.io/blue/organizations/jenkins/Core%2Fjenkins/detail/PR-7050/2/pipeline

  • Linux JDK11: ~ 1 hr 50 min
  • Linux JDK17: ~ 1 hr 40 min
  • Windows JDK11: ~ 5 hrs 50 min

https://ci.jenkins.io/blue/organizations/jenkins/Core%2Fjenkins/detail/PR-7047/3/pipeline

  • Linux JDK11: ~ 1 hr 40 min
  • Linux JDK17: ~ 1 hr 35 min
  • Windows JDK11: ~ 5 hrs

https://ci.jenkins.io/blue/organizations/jenkins/Core%2Fjenkins/detail/PR-7054/1/pipeline

  • Linux JDK11: ~ 2 hrs
  • Linux JDK17: ~ 1 hr 35 min
  • Windows JDK11: ~ 5 hrs 40 min

https://ci.jenkins.io/blue/organizations/jenkins/Core%2Fjenkins/detail/master/4002/pipeline

  • Linux JDK11: ~ 1 hr 30 min
  • Linux JDK17: ~ 2 hrs 5 min
  • Windows JDK11: ~ 5 hrs 50 min

Waiting six hours (almost a work day) to get an incrementals deployment seems too long, especially when Linux builds are reliably done in two hours, sometimes less.
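
To put a number on the gap, here is a small sketch that converts the approximate durations listed above into minutes and computes how much slower Windows is than the slower of the two Linux branches. The dictionary keys and the `slowdown` helper are illustrative names, not anything from ci.jenkins.io, and the inputs are only as precise as the rounded figures above.

```python
# Durations in minutes, transcribed from the four builds listed above.
builds = {
    "PR-7050/2": {"linux_jdk11": 110, "linux_jdk17": 100, "windows_jdk11": 350},
    "PR-7047/3": {"linux_jdk11": 100, "linux_jdk17": 95, "windows_jdk11": 300},
    "PR-7054/1": {"linux_jdk11": 120, "linux_jdk17": 95, "windows_jdk11": 340},
    "master/4002": {"linux_jdk11": 90, "linux_jdk17": 125, "windows_jdk11": 350},
}


def slowdown(durations):
    """Windows duration divided by the slower of the two Linux durations."""
    slowest_linux = max(durations["linux_jdk11"], durations["linux_jdk17"])
    return durations["windows_jdk11"] / slowest_linux


for name, durations in builds.items():
    print(f"{name}: Windows takes {slowdown(durations):.1f}x the slowest Linux branch")
```

Even measured against the slowest Linux branch of each build, Windows comes out at roughly 2.8x to 3.2x.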

Reproduction steps

No response

@daniel-beck daniel-beck added the triage Incoming issues that need review label Sep 2, 2022
@NotMyFault
Member

Waiting six hours (almost a work day) to get an incrementals deployment

It feels like you get a lucky shot for incrementals, if you build at peak times.
I can't recall the number of times builds had to be restarted recently, because Windows machines were either disconnected or killed.

@basil
Collaborator

basil commented Sep 2, 2022

I think the Windows builds have been very slow since I re-enabled them in jenkinsci/jenkins#6024. I am not sure if throwing more hardware resources at the problem would necessarily improve performance. I seem to recall a Jira ticket being open about one possible cause: the creation and subsequent deletion of a fresh Jenkins home directory for each test (which involves extracting many .jpi files that each contain many tiny files) is far slower on Windows than it is on Unix-like systems. I think that ticket mentioned the idea of implementing plugin class loading without unzipping each plugin .jpi file using something like Tomcat's unpackWARs=false, i.e. https://github.com/apache/tomcat/blob/5190f92b5e8288cde5c0f4a9814b46166e6447bb/java/org/apache/catalina/webresources/JarWarResourceSet.java. That might be possible in theory, but it would be some amount of work to implement, and based on the comments on this page the result might be just trading one performance problem for another. I doubt there is an easy or practical option in the short to medium term.
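
A rough way to check the "many tiny files" theory on a given agent is a micro-benchmark along these lines. It is only a sketch under loud assumptions: a synthetic zip stands in for a .jpi file, the entry count and size are made up, and Python's zipfile module is not how Jenkins loads plugins. It compares unpacking into a fresh directory and deleting it against reading the same entries directly from the archive, which is the general idea behind the unpackWARs=false approach.

```python
import io
import shutil
import tempfile
import time
import zipfile
from pathlib import Path


def make_archive(entries=2000, size=64):
    """Build an in-memory zip with many tiny entries (a stand-in for a .jpi file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for i in range(entries):
            archive.writestr(f"dir{i % 50}/file{i}.txt", b"x" * size)
    return buffer.getvalue()


def extract_and_delete(archive_bytes):
    """Unpack into a fresh directory, count the files, remove it. Returns (seconds, file_count)."""
    start = time.perf_counter()
    target = Path(tempfile.mkdtemp(prefix="fake-home-"))
    try:
        with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
            archive.extractall(target)
        count = sum(1 for path in target.rglob("*") if path.is_file())
    finally:
        shutil.rmtree(target)
    return time.perf_counter() - start, count


def read_in_place(archive_bytes):
    """Read every entry straight from the archive without unpacking to disk."""
    start = time.perf_counter()
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        count = sum(1 for name in archive.namelist() if archive.read(name) is not None)
    return time.perf_counter() - start, count


data = make_archive()
unpack_seconds, unpacked = extract_and_delete(data)
in_place_seconds, read = read_in_place(data)
print(f"extract + delete: {unpacked} files in {unpack_seconds:.3f}s")
print(f"read in place:    {read} entries in {in_place_seconds:.3f}s")
```

Running the same script on a Linux agent and a Windows agent of comparable size would give a first indication of whether file system overhead alone could explain a gap of this size. It says nothing about the other costs that reading from archives might introduce.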

Flakiness could be mitigated in the short term by adding e.g. retry(count: 3, conditions: [kubernetesAgent(), nonresumable()]) to the Jenkinsfile as was done in buildPlugin() and the BOM Jenkinsfile, though this merely hides the problem rather than fixing the root cause. Being one to prefer fixing the root cause, I would not oppose such a change (nor did I oppose it for buildPlugin() and the BOM Jenkinsfile) but I have not gone out of my way to implement it.
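
The general shape of "retry, but only for infrastructure failures" can be sketched in plain Python as follows. This is an analogy rather than the Jenkins implementation: `AgentLostError`, `run_with_retry`, and `flaky_build` are hypothetical names, and the sketch only shows that genuine build or test failures pass through without being retried.

```python
class AgentLostError(Exception):
    """Hypothetical marker for infrastructure failures (agent disconnected or killed)."""


def run_with_retry(task, count=3, retry_on=(AgentLostError,)):
    """Call task() up to `count` times, retrying only on the exception types in retry_on."""
    for attempt in range(1, count + 1):
        try:
            return task()
        except retry_on:
            if attempt == count:
                raise  # out of attempts: surface the infrastructure failure


attempts = []


def flaky_build():
    """Simulated build that loses its agent twice before succeeding."""
    attempts.append(1)
    if len(attempts) < 3:
        raise AgentLostError("agent went away")
    return "SUCCESS"


print(run_with_retry(flaky_build), "after", len(attempts), "attempts")
```

Any exception that is not listed in `retry_on` (a real test failure, say) propagates on the first attempt, which is why this kind of retry masks agent flakiness without masking broken builds.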

Perhaps we ought to declare that we are not getting much value from Windows testing and reduce its scope to just those tests in the org.jvnet.hudson.test.SmokeTest group by adding -Psmoke-test to the core Jenkinsfile on Windows. While lowering test coverage, that would improve test runtime and cut costs, and if we do not feel the value is high it may be a decent tradeoff.

@daniel-beck
Author

daniel-beck commented Sep 2, 2022

Perhaps we ought to declare that we are not getting much value from Windows testing and reduce its scope to just those tests in the org.jvnet.hudson.test.SmokeTest group by adding -Psmoke-test to the core Jenkinsfile on Windows. While lowering test coverage, that would improve test runtime and cut costs, and if we do not feel the value is high it may be a decent tradeoff.

Another alternative might be to do the incrementals deployment once one (or both) of the Linux builds have passed, so we don't wait for the slower Windows build to finish. While we'd want Windows coverage before merging, I would expect that we rarely have to actively wait for builds to finish, whereas waiting for an incrementals deployment is probably more common. Of course, the use case of integrated core + plugin PRs also isn't that common…

Might need a careful look at incrementals validation to see whether this is even doable.
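
As a sketch of the control flow only (not of how incrementals deployment or its validation actually work), the idea is to gate deployment on the Linux branches while the Windows branch keeps running and still decides the overall result. All names here (`run_pipeline`, the branch labels, `deploy`) are hypothetical, and the sleeps merely simulate build time.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(branches, deploy, gate=("linux-jdk11", "linux-jdk17")):
    """Run all branches in parallel and call deploy() as soon as the gate branches pass."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = {name: pool.submit(fn) for name, fn in branches.items()}
        # result() re-raises a branch failure, so deploy() is skipped if a gate branch fails.
        for name in gate:
            futures[name].result()
        deploy()
        # The remaining branches (Windows here) still determine the overall outcome.
        for future in futures.values():
            future.result()


order = []


def branch(name, seconds):
    """Build a simulated branch that sleeps, then records its completion."""
    def run():
        time.sleep(seconds)
        order.append(name)
    return run


run_pipeline(
    {
        "linux-jdk11": branch("linux-jdk11", 0.05),
        "linux-jdk17": branch("linux-jdk17", 0.05),
        "windows-jdk11": branch("windows-jdk11", 0.5),
    },
    deploy=lambda: order.append("deploy"),
)
print(order)
```

In this simulation the deployment is recorded before the Windows branch completes, while a Windows failure would still fail the run as a whole.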

@jtnord

jtnord commented Sep 27, 2022

Random thought: I believe that on Windows Server, disk write caching is disabled by default (it is enabled by default on client OSes). If we are using ephemeral machines and it is disabled, enabling it may well help a bit. (I know Jenkins startup is slower on Windows than on Linux for the same hardware, but not normally by the factor observed in this ticket.)

There are two options: write caching and write-cache buffer flushing. The latter may also help (but depending on the drive it could hinder).

@basil
Collaborator

basil commented Sep 27, 2022

@lemeurherve added this to the not-actionable-by-infra-team milestone 7 days ago

Is this really the case? I think the Jenkinsfile changes I suggested above are actionable by the infrastructure team.

@dduportal
Contributor

@lemeurherve added this to the not-actionable-by-infra-team milestone 7 days ago

Is this really the case? I think the Jenkinsfile changes I suggested above are actionable by the infrastructure team.

The infra team tends to avoid changing the Jenkinsfile of the Jenkins Core project so as not to interfere with the contribution processes, as such changes might impact people in areas where we lack knowledge. This is why we added this milestone: to mark this issue and watch it, but without really knowing what to do with it.

Your suggestion still seems actionable: if I understand correctly, the scope is to make sure that failed Windows test suites are retried until the root cause is identified. Is my understanding correct?

@dduportal
Contributor

Random thought: I believe that on Windows Server, disk write caching is disabled by default (it is enabled by default on client OSes). If we are using ephemeral machines and it is disabled, enabling it may well help a bit. (I know Jenkins startup is slower on Windows than on Linux for the same hardware, but not normally by the factor observed in this ticket.)

There are two options: write caching and write-cache buffer flushing. The latter may also help (but depending on the drive it could hinder).

Interesting. If I understand correctly, this would be a Windows setting? Since we customize the VM images, that should be easy to do in https://github.com/jenkins-infra/packer-images/blob/main/provisioning/windows-provision.ps1 ?

Or is it a cloud-related setting to set up in the VM definition (e.g. in the EC2 and Azure-VM plugin setups)?

@basil
Collaborator

basil commented Sep 27, 2022

The infra team tends to avoid changing the Jenkinsfile of the Jenkins Core project

https://github.com/jenkins-infra/pipeline-library/commits?author=dduportal ?

@dduportal
Contributor

The infra team tends to avoid changing the Jenkinsfile of the Jenkins Core project

https://github.com/jenkins-infra/pipeline-library/commits?author=dduportal ?

I'm not sure I understand; could you clarify?

I'm not saying that the infra team is not going to take care of that.
I'm saying that we (the jenkins-infra team) tend to avoid touching things that we do not understand when it can impact others.
Unless, of course, we have an idea of the scope (and it matches our skills and knowledge).

So I'm asking for clarification because I'm not as skilled as you or other contributors so I need help to understand what has to be done if you want me or the team to do it.

@basil
Collaborator

basil commented Sep 27, 2022

I do not see a substantial difference between working on pipeline-library, which is effectively a set of Jenkinsfiles for plugins and other repositories, and the Jenkinsfile of a particular repository. If your team does not want to do the work, please move this ticket to an issue tracking component used by the development team.

@basil basil removed this from the not-directly-actionable-by-infra-team milestone Sep 28, 2022
@basil
Collaborator

basil commented Sep 28, 2022

I have removed this issue from the not-directly-actionable-by-infra-team milestone. This issue is directly actionable by the infrastructure team as in the last paragraph of #3117 (comment).

@dduportal
Contributor

@basil I think I understand what you are saying, but please, can you let the infrastructure team manage their milestones, as it helps us to track our work in a consensual way.

For info, we make milestone changes during the weekly meeting (which did not happen this week due to DevOps World).
Based on the inputs you gave + James' inputs, the infra team was going to reconsider and see what should be done.

@basil
Collaborator

basil commented Sep 28, 2022

As I wrote previously, if the infrastructure team does not consent to doing this work, please move this ticket to an issue tracking component used by the development team.

@dduportal
Contributor

As I wrote previously, if the infrastructure team does not consent to doing this work, please move this ticket to an issue tracking component used by the development team.

I've never implied that.

We are happy to take this task; we'll plan it at our next infra meeting and work on it when we are able to.

@basil
Collaborator

basil commented Oct 11, 2022

Still need to determine whether write caching is enabled or disabled and enable it if necessary.

@dduportal
Contributor

Quick update: jenkins-infra/jenkins-infra#2635 changes the type of disk used by the VM instances (NOT Windows container!) from HDD to premium SSD.

It could be interesting to check the difference once deployed.

@daniel-beck
Author

daniel-beck commented Feb 23, 2023

@dduportal Should a ci.j.io build from today show this change? https://ci.jenkins.io/blue/organizations/jenkins/Core%2Fjenkins/detail/PR-7669/1/pipeline/146 still took 5 hrs to build on Windows, compared to 1.5 hrs on Linux.

@dduportal
Contributor

@dduportal Should a ci.j.io build from today show this change? https://ci.jenkins.io/blue/organizations/jenkins/Core%2Fjenkins/detail/PR-7669/1/pipeline/146 still took 5 hrs to build on Windows, compared to 1.5 hrs on Linux.

  • Builds on ci.jenkins.io that use a Windows VM agent are expected to see an improvement, but this has yet to be confirmed.
  • The job you linked uses the label maven-17-windows, which is a Windows container (running in ACI). These container agents were not in the scope of the change above (using premium SSD). We are going to look at the ACI container resource to see if we can specify an improved disk for these. The alternative is migrating this workload from ACI to a Kubernetes cluster that we manage, with high-end SSDs (and a Windows machine pool).

@timja
Member

timja commented Feb 25, 2023

Another example that's a bit simpler than core:
https://ci.jenkins.io/blue/organizations/jenkins/Plugins%2Fpipeline-graph-view-plugin/detail/main/143/pipeline/69

Over 3x slower on Windows

@smerle33
Contributor

Another example that's a bit simpler than core: https://ci.jenkins.io/blue/organizations/jenkins/Plugins%2Fpipeline-graph-view-plugin/detail/main/143/pipeline/69

Over 3x slower on Windows

It looks like it is also running on an ACI container; for now we have only improved the Windows VMs. We will have a look at the ACI containers soon.

8 participants