
[Azure Billing] Excessive consumption on eastus-cijenkinsio resource group #3551

Closed · dduportal opened this issue on Apr 27, 2023 · 9 comments

@dduportal

Service(s)

Azure, ci.jenkins.io

Summary

The Azure Billing assistant alerted us about an abnormal cost increase on the resource group eastus-cijenkinsio, which hosts the ci.jenkins.io VM and ACI agents.

(screenshot: 2023-04-27 at 17:33:32)

Even though the VM agent workload has fully shifted to Azure (no more AWS VMs), there are several ways we might control this cost:

  • Check that the VMs are properly sized (by checking the Datadog metrics)
  • Check that there are no dangling resources (see the sketch after this list)
  • Check whether we could use another sponsored cloud provider (not AWS because of Spring 2023: Decrease AWS costs #3502, but possibly DigitalOcean or Oracle)
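
A possible starting point for the dangling-resources check, from the Azure CLI (a sketch, not part of the original issue; it assumes an authenticated az session):

# Inventory everything billed in the resource group
az resource list --resource-group eastus-cijenkinsio --output table

# Unattached managed disks are a classic source of silent costs
az disk list --resource-group eastus-cijenkinsio \
  --query "[?diskState=='Unattached'].{name:name, sizeGB:diskSizeGb, sku:sku.name}" \
  --output table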

Reproduction steps

No response

@dduportal

A first step: the "highmem" VMs, used by jenkinsci/jenkins and jenkinsci/ath, do seem to be under-used:

(screenshot: 2023-04-27 at 18:28:57)

These machines are Standard_D16s_v3 (ref. https://github.com/jenkins-infra/jenkins-infra/blob/c8c44f88f011d545c483fce107e34c49d69c2ec2/hieradata/clients/azure.ci.jenkins.io.yaml#L433), which are described here: https://learn.microsoft.com/en-us/azure/virtual-machines/dv3-dsv3-series#dsv3-series

They use 16 vCPUs and 64 GB of memory.
These machines are the main culprit for the cost:

(screenshot: 2023-04-27 at 18:36:07)

=> Let's resize them to Standard_D8s_v3 (8 vCPUs / 32 GB) as soon as possible, since that would halve their price.
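
As a sanity check before opening the change, the two sizes can be compared from the Azure CLI (a sketch, not from the original comment; it assumes an authenticated az session):

# Compare vCPU/memory of the current and target sizes in eastus
az vm list-sizes --location eastus \
  --query "[?name=='Standard_D16s_v3' || name=='Standard_D8s_v3'].{name:name, vCPUs:numberOfCores, memoryMB:memoryInMb}" \
  --output table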

dduportal self-assigned this on Apr 27, 2023
dduportal removed the triage label (incoming issues that need review) on Apr 27, 2023
@dduportal

Here we are: jenkins-infra/jenkins-infra#2797

@dduportal

After 5 days, we clearly see a decrease in the daily spend and no visible problem with CPU/memory usage on the machines (except that free memory is halved, of course):

(screenshot: 2023-05-02 at 10:24:00)
(screenshot: 2023-05-02 at 10:24:46)
(screenshot: 2023-05-02 at 10:25:11)


@dduportal commented May 11, 2023

Reopening: despite the decrease from the previous actions, there are still costs to cut in this resource group.

Current costs are clearly better:

(screenshot: 2023-05-11 at 20:16:40)

However, we have a few more improvements to keep decreasing this cost:

@dduportal commented May 12, 2023

A few improvements for the VM agents:

  • Using ephemeral OS disks would avoid paying for managed disks. Ephemeral disks use the local SSD storage (either "Temp" or "Cache") of the hypervisor for the OS disk.
    • The disk size is constrained by the maximum cache storage OR temp storage available, and this maximum depends on the instance type. With the current settings (150 GB OS disk size for all VM agents, Standard_D4s_v3 for Linux/Windows VMs and Standard_D8s_v3 for Linux highmem VMs), only the highmem instance would support an ephemeral OS disk (see the sketch below):

(screenshot)
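
For illustration, this is roughly how an ephemeral OS disk is requested with the Azure CLI (a sketch only: the real agents are provisioned by the Jenkins Azure VM Agents plugin, and the VM name and image below are placeholders):

# Ephemeral OS disks require read-only caching and must fit in the cache/temp storage
az vm create \
  --resource-group eastus-cijenkinsio \
  --name ephemeral-disk-test \
  --image Ubuntu2204 \
  --size Standard_D8s_v3 \
  --ephemeral-os-disk true \
  --ephemeral-os-disk-placement CacheDisk \
  --os-disk-caching ReadOnly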

  • Instance size / price / spot price:
    • For the highmem instances:
      • Currently, we use Standard_D8s_v3 (8 vCPUs, 32 GB, local cache storage of 200 GB - 16,000 IOPS - 128 MBps) at $0.3840/hour. Spot is roughly half the price at $0.1695/hour. We also pay for a managed disk (Premium SSD P15) at $0.05/hour, but it could be removed.
      • After checking prices and specifications, and validating manually on infra.ci, the Standard_D8ads_v5 is a serious contender:
        • Faster CPU (AMD EPYC 7763v at 3.5 GHz instead of a Xeon at 2.4 GHz), even if it's hard to compare due to different caching and turbo-boost systems
        • Ability to use the temp storage as an ephemeral disk (300 GB, e.g. 150 GB for the system and 150 GB in /mnt) on a local NVMe, with the same IOPS as a Premium P15 SSD (thanks to the AMD EPYC cached PCIe lanes)
        • Cost of $0.4120/hour: a bit more expensive. BUT the spot price is 10x less at $0.0412/hour! Clearly worth the investment!
    • For the "normal" VMs, the change to Standard_D4ads_v5 is clearly also worth it: $0.2060/hour on demand with 150 GB (!) of ephemeral storage, and a spot price 10x lower at $0.0206/hour, which is already cheaper than the Standard_D4s_v3 with its managed disk ($0.1920 + $0.05 = $0.242/hour). A worked monthly comparison is sketched right after this list.
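
To make the gap concrete, a back-of-the-envelope monthly comparison (a sketch using 730 hours/month and the hourly prices quoted above, nothing else assumed):

# Rough monthly cost per always-on "normal" agent, using the prices above
awk 'BEGIN {
  h = 730                                    # hours in an average month
  printf "D4s_v3 on demand + P15 disk  : $%.2f/month\n", (0.1920 + 0.05) * h
  printf "D4ads_v5 on demand, ephemeral: $%.2f/month\n", 0.2060 * h
  printf "D4ads_v5 spot, ephemeral     : $%.2f/month\n", 0.0206 * h
}'
# => ~$176.66 vs ~$150.38 vs ~$15.04: spot + ephemeral is more than 10x cheaper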

@dduportal

  • Additional check run on infra.ci to ensure I/O performance does not regress with the new instances and their ephemeral disks (a guess at the fio invocation follows the logs):
# With Standard_D8s_v3 and a Premium SSD P15 of 150 GB
12:32:40  fiotest: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
12:32:40  fio-3.25
12:32:40  Starting 1 process
12:32:40  fiotest: Laying out IO file (1 file / 8192MiB)
12:44:01  
12:44:01  fiotest: (groupid=0, jobs=1): err= 0: pid=3005: Fri May 12 10:43:57 2023
12:44:01    read: IOPS=239, BW=960KiB/s (983kB/s)(574MiB/612606msec)
12:44:01     bw (  KiB/s): min=    8, max= 1472, per=100.00%, avg=961.97, stdev=236.62, samples=1222
12:44:01     iops        : min=    2, max=  368, avg=240.45, stdev=59.14, samples=1222
12:44:01    write: IOPS=3183, BW=12.4MiB/s (13.0MB/s)(7618MiB/612606msec); 0 zone resets
12:44:01     bw (  KiB/s): min=   32, max=15608, per=100.00%, avg=12761.32, stdev=2928.16, samples=1222
12:44:01     iops        : min=    8, max= 3902, avg=3190.28, stdev=732.02, samples=1222
12:44:01    cpu          : usr=0.57%, sys=1.40%, ctx=330551, majf=0, minf=7
12:44:01    IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
12:44:01       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
12:44:01       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
12:44:01       issued rwts: total=147002,1950150,0,0 short=0,0,0,0 dropped=0,0,0,0
12:44:01       latency   : target=0, window=0, percentile=100.00%, depth=64
12:44:01  
12:44:01  Run status group 0 (all jobs):
12:44:01     READ: bw=960KiB/s (983kB/s), 960KiB/s-960KiB/s (983kB/s-983kB/s), io=574MiB (602MB), run=612606-612606msec
12:44:01    WRITE: bw=12.4MiB/s (13.0MB/s), 12.4MiB/s-12.4MiB/s (13.0MB/s-13.0MB/s), io=7618MiB (7988MB), run=612606-612606msec
12:44:01  
12:44:01  Disk stats (read/write):
12:44:01    sda: ios=147202/1954573, merge=0/5728, ticks=1694859/37949627, in_queue=40122165, util=96.13%
# With Standard_D4ads_v5 and an ephemeral storage of 150 GB
12:34:40  fiotest: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
12:34:40  fio-3.25
12:34:40  Starting 1 process
12:34:40  fiotest: Laying out IO file (1 file / 8192MiB)
12:35:47  
12:35:47  fiotest: (groupid=0, jobs=1): err= 0: pid=3013: Fri May 12 10:35:47 2023
12:35:47    read: IOPS=2706, BW=10.6MiB/s (11.1MB/s)(574MiB/54322msec)
12:35:47     bw (  KiB/s): min= 7344, max=12640, per=100.00%, avg=10853.04, stdev=521.67, samples=108
12:35:47     iops        : min= 1836, max= 3160, avg=2713.26, stdev=130.42, samples=108
12:35:47    write: IOPS=35.9k, BW=140MiB/s (147MB/s)(7618MiB/54322msec); 0 zone resets
12:35:47     bw (  KiB/s): min=97080, max=158704, per=100.00%, avg=143985.70, stdev=5771.30, samples=108
12:35:47     iops        : min=24270, max=39676, avg=35996.43, stdev=1442.82, samples=108
12:35:47    cpu          : usr=3.37%, sys=13.70%, ctx=693561, majf=0, minf=7
12:35:47    IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
12:35:47       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
12:35:47       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
12:35:47       issued rwts: total=147002,1950150,0,0 short=0,0,0,0 dropped=0,0,0,0
12:35:47       latency   : target=0, window=0, percentile=100.00%, depth=64
12:35:47  
12:35:47  Run status group 0 (all jobs):
12:35:47     READ: bw=10.6MiB/s (11.1MB/s), 10.6MiB/s-10.6MiB/s (11.1MB/s-11.1MB/s), io=574MiB (602MB), run=54322-54322msec
12:35:47    WRITE: bw=140MiB/s (147MB/s), 140MiB/s-140MiB/s (147MB/s-147MB/s), io=7618MiB (7988MB), run=54322-54322msec
12:35:47  
12:35:47  Disk stats (read/write):
12:35:47    sda: ios=146684/1945949, merge=0/105, ticks=248751/3178820, in_queue=3427575, util=99.33%
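
For the record, the exact fio job file is not in the thread, but the output header and the issued read/write totals suggest an invocation along these lines:

# Guessed from the logs above: 4k random R/W, libaio, queue depth 64, 8 GiB file
fio --name=fiotest --ioengine=libaio --iodepth=64 \
  --rw=randrw --rwmixread=7 --bs=4k --size=8192m \
  --direct=1
# --rwmixread=7 approximates the ~147k reads vs ~1.95M writes issued;
# --direct=1 (bypassing the page cache) is an assumption, not visible in the output.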

@dduportal

Update: we see positive effects, but it is a bit early to conclude (we'll have to watch the weekdays):

(screenshot: 2023-05-15 at 19:25:57)

@dduportal

This issue can be closed: the effects of the last changes are persistent and effective:

(screenshot: 2023-05-22 at 16:32:09)
