[CI] Better utilization of CI resources and other CI improvements #5891

Closed
4 of 10 tasks
hcho3 opened this issue Jul 14, 2020 · 15 comments

@hcho3
Collaborator

hcho3 commented Jul 14, 2020

The runaway cloud cost of our Jenkins CI server has been a pressing issue (#5176). The server is hosted on AWS, which charges by the hour. #5884 finally created a mechanism to enforce a daily budget via throttling.

We now have a dashboard page to keep track of daily spending: https://xgboost-ci.net/dashboard/
[Image: Screen Shot 2020-07-14 at 3 54 49 AM]

Now it is time to extract savings and make sure we spend our limited CI resources where they matter.

State of the CI: The free credits from AWS ran out this month, so we now have to start drawing from the Open Collective account, which currently has 10531.16 USD. If we limit ourselves to spending 33 USD per day, the balance will last 287 days.

  • Make all tests conditional on the presence of a GitHub comment (e.g. run tests). Right now, tests run automatically, and there are many cases where automatically starting tests is wasteful.
  • Skip tests for all draft pull requests.
  • Allow "rolling over" left-over allowance from previous day. For example, if the daily budget is 33 USD and we spent only 10 USD yesterday, we should be able to spend up to 56 USD (33 USD + 23 USD roll over). The spending pattern is quite spiky:
    [Image: Screen Shot 2020-07-14 at 4 21 34 AM]
  • Migrate some tests to free services. I suspect this may have limited impact, since any tests using GPUs need to run in Jenkins.
  • Pinpoint which tests cost the most. I was surprised to learn that Windows jobs account for more than 50% of the expenses, even though we run more tests on Linux:
    [Image: Screen Shot 2020-07-14 at 4 23 14 AM]
  • Related: write a summary report of the AWS expenses for the last 6 months.
  • Speed up C++ builds. It takes ~ 10 min on Linux and ~ 15 min on Windows to build XGBoost with GPU support.
  • Try to get more funding, which is easier said than done.
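
The roll-over idea from the list above can be sketched in a few lines. This is only an illustration under stated assumptions, not the logic used by the actual budget enforcement: the function name `todays_allowance` is hypothetical, only a single day of left-over allowance carries forward, and overspending yesterday does not shrink today's cap.

```python
def todays_allowance(daily_budget: float, yesterday_spent: float) -> float:
    """Return today's spending cap, carrying over yesterday's unused budget.

    Assumptions (not from the issue): only one day of left-over allowance
    rolls over, and overspending never reduces the cap below the base budget.
    """
    unused = max(0.0, daily_budget - yesterday_spent)
    return daily_budget + unused


# Example from the text: 33 USD budget, 10 USD spent yesterday.
print(todays_allowance(33.0, 10.0))  # 56.0
```

Capping the roll-over at one day keeps a long quiet period from building up into a very large single-day allowance.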

Other CI improvements, outside of Jenkins

  • Migrate from AppVeyor to GitHub Actions. AppVeyor tests are often a bottleneck because AppVeyor runs only one test at a time.
  • Remove CPU-only tests from the Jenkins CI pipeline. This is especially important for Windows targets, since Windows instances tend to cost more.
@hcho3 hcho3 pinned this issue Jul 14, 2020
@hcho3 hcho3 self-assigned this Jul 14, 2020
@terrytangyuan
Member

terrytangyuan commented Jul 15, 2020

Thanks for surfacing and outlining the issues. Would a build be triggered by anyone who comments "run tests", or would that be limited to committers?

@hcho3
Collaborator Author

hcho3 commented Jul 15, 2020

@terrytangyuan I think only committers should have the right to start tests.

  • We can migrate some basic checks off of the Jenkins server, so that they are always run. For example, lint checks should always run.
  • Or we could let anyone start tests for the first time, and then subsequent tests would require a go-ahead from a committer.

@terrytangyuan
Member

The second option sounds better to me. Even though lint checks are fast, it could become a problem if a contributor pushes many commits.

@hcho3
Collaborator Author

hcho3 commented Jul 15, 2020

Actually, the two options can go together. Lint checks can run in GitHub Actions and would not be subject to the EC2 quota, so we can let contributors run as many lint checks as they'd like. More substantial tests should require a committer's approval when they are run a second time.
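
The combined policy described above can be written down as a small decision function. This is a sketch of the rule as stated in this thread, not the configuration that was eventually deployed; `RunRequest`, its fields, and the `"lint"` / `"full"` labels are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunRequest:
    check_kind: str              # "lint" or "full" (hypothetical labels)
    prior_full_runs: int         # full test runs this pull request has already had
    approved_by_committer: bool  # whether a committer gave a go-ahead


def may_run(req: RunRequest) -> bool:
    """Decide whether a requested CI run should be allowed to start."""
    if req.check_kind == "lint":
        # Lint checks run on a free service, so they are never gated.
        return True
    if req.prior_full_runs == 0:
        # Anyone may trigger the first full test run.
        return True
    # Every later full run needs a committer's approval.
    return req.approved_by_committer


print(may_run(RunRequest("full", prior_full_runs=1, approved_by_committer=False)))  # False
```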

@terrytangyuan
Member

Got it. Yea all checks on GitHub Actions can always run.

@hcho3
Collaborator Author

hcho3 commented Jul 15, 2020

I raised the daily limit to 50 USD, to allow more jobs to run each day. We don't want to slow down our development speed too much. I'm really hoping to reduce the cost of each test job, so that we can bring the daily rate back down.

@hcho3
Collaborator Author

hcho3 commented Jul 15, 2020

Summary of expenses this year, per OS:

Month        Windows (Amazon VPC) ($)   Linux/UNIX ($)    No Platform ($)   Total cost ($)
Total        7823.18                    3397.2535665664   898.3373072795    12118.7708738459
2020-01-01   1145.258                   439.38216705      185.986642061     1770.626809111
2020-02-01   1154.082                   487.434750354     128.5031551771    1770.0199055311
2020-03-01   1052.63                    431.9893799469    134.3857972443    1619.0051771912
2020-04-01   1588.134                   693.1112277207    130.5596068873    2411.804834608
2020-05-01   1122.826                   468.5888252106    130.633854798     1722.0486800086
2020-06-01   1147.706                   554.182259014     118.1775170761    1820.0657760901
2020-07-01   612.544                    322.5649572702    70.0907340357     1005.1996913059

Windows jobs account for a whopping 64.5% of the total, while Linux jobs account for only 28.0%.
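
The per-platform shares can be recomputed from the totals row of the table above. The snippet below is a small check, using only the three platform totals reported in this comment; the variable names are illustrative.

```python
# Year-to-date totals in USD, copied from the totals row of the table.
totals = {
    "Windows (Amazon VPC)": 7823.18,
    "Linux/UNIX": 3397.2535665664,
    "No Platform": 898.3373072795,
}

grand_total = sum(totals.values())

# Share of each platform as a percentage of the grand total.
shares = {platform: 100.0 * cost / grand_total for platform, cost in totals.items()}

for platform, share in shares.items():
    print(f"{platform}: {share:.2f}% of {grand_total:.2f} USD")
```

The three platform totals add up to the reported total cost, and the resulting shares agree with the quoted percentages to within rounding.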

@hcho3
Collaborator Author

hcho3 commented Jul 18, 2020

@dmlc/xgboost-committer Good news: #5904 reduces the cost of the Windows test pipeline by up to 66%.

@hcho3
Collaborator Author

hcho3 commented Jul 22, 2020

It's been a week since we enacted a daily budget, and so far we've managed to rein in the cost:

[Image: Screen Shot 2020-07-21 at 8 45 37 PM]

@hcho3
Collaborator Author

hcho3 commented Jul 22, 2020

I just came up with a more robust method: change the permissions of the Jenkins manager node. Ordinarily, it is given the right to launch new EC2 instances via an IAM policy. To block the provisioning of new instances, it suffices to attach the following policy JSON:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "ec2:RunInstances",
                "ec2:StartInstances"
            ],
            "Effect": "Deny",
            "Resource": "*"
        }
    ]
}

This Deny policy overrides the pre-existing permission, so the manager is no longer able to launch any new EC2 worker instances. (It can, however, still terminate existing EC2 instances.)

Update. The Cost Watcher Lambda function now also controls whether the Jenkins manager (master) can launch EC2 workers or not: hcho3/xgboost-devops@e7402fe
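
A budget watcher could toggle this Deny policy roughly as follows. This is a minimal sketch and not the code in the linked commit: it assumes the manager's permissions come from an IAM role, the function name `sync_launch_permission` and the policy name `DenyEC2Launch` are hypothetical, and `iam_client` is expected to behave like a boto3 IAM client (only its `put_role_policy` and `delete_role_policy` calls are used).

```python
import json

# The Deny policy from the comment above, as a Python dict.
DENY_LAUNCH_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": ["ec2:RunInstances", "ec2:StartInstances"],
            "Effect": "Deny",
            "Resource": "*",
        }
    ],
}


def sync_launch_permission(iam_client, role_name, spent_today, daily_budget,
                           policy_name="DenyEC2Launch"):
    """Attach the Deny policy when the budget is used up, remove it otherwise.

    Returns "deny" or "allow" to report which state was applied.
    """
    if spent_today >= daily_budget:
        iam_client.put_role_policy(
            RoleName=role_name,
            PolicyName=policy_name,
            PolicyDocument=json.dumps(DENY_LAUNCH_POLICY),
        )
        return "deny"
    # Note: with a real IAM client this call raises an error if the inline
    # policy is not currently attached; a real caller would need to handle that.
    iam_client.delete_role_policy(RoleName=role_name, PolicyName=policy_name)
    return "allow"
```

With boto3 installed, the client would typically be created with `boto3.client("iam")`. Passing the client in as a parameter keeps the decision logic testable without AWS credentials.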

@hcho3
Collaborator Author

hcho3 commented Jul 29, 2020

#5904 has been very effective in reducing the cost of the Windows CI pipeline:

[Image: Screen Shot 2020-07-29 at 1 10 58 PM]

@terrytangyuan
Member

Great work!

@hcho3
Collaborator Author

hcho3 commented Oct 28, 2020

There were two mistakes that slowed down GPU tests to ~ 40 minutes:

Now the GPU test suite completes in 15 min.

@hcho3
Collaborator Author

hcho3 commented Sep 14, 2022

Completed in #8142. We now require manual approval before tests run on pull requests.

@hcho3 hcho3 closed this as completed Sep 14, 2022