
[Azure Billing] Excessive consumption on eastus-cijenkinsio resource group #3551

Closed · dduportal opened this issue on Apr 27, 2023 · 9 comments

@dduportal

Service(s)

Azure, ci.jenkins.io

Summary

The Azure Billing assistant alerted us about an abnormal cost increase on the resource group eastus-cijenkinsio, which hosts the ci.jenkins.io VM and ACI agents.

(screenshot: 2023-04-27 at 17:33:32)

Even though the VM agent workload has fully shifted to Azure (no more AWS VMs), there are several ways we might control this cost:

  • Check that the VMs are properly sized (by checking the Datadog metrics)
  • Check that there are no dangling resources (see the sketch after this list)
  • Check whether we could use another sponsored cloud provider (not AWS because of Spring 2023: Decrease AWS costs #3502, but possibly DigitalOcean or Oracle)
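
A possible starting point for the dangling-resources check, from the Azure CLI (a sketch, not part of the original issue; it assumes an authenticated az session):

# Inventory everything billed in the resource group
az resource list --resource-group eastus-cijenkinsio --output table

# Unattached managed disks are a classic source of silent costs
az disk list --resource-group eastus-cijenkinsio \
  --query "[?diskState=='Unattached'].{name:name, sizeGB:diskSizeGb, sku:sku.name}" \
  --output table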

Reproduction steps

No response

@dduportal

A first step: the "highmem" VMs, used by jenkinsci/jenkins and jenkinsci/ath, do seem to be under-used:

(screenshot: 2023-04-27 at 18:28:57)

These machines are Standard_D16s_v3 (ref. https://github.com/jenkins-infra/jenkins-infra/blob/c8c44f88f011d545c483fce107e34c49d69c2ec2/hieradata/clients/azure.ci.jenkins.io.yaml#L433), which are described here: https://learn.microsoft.com/en-us/azure/virtual-machines/dv3-dsv3-series#dsv3-series

They use 16 vCPUs and 64 GB of memory.
These machines are the main culprit for the cost:

(screenshot: 2023-04-27 at 18:36:07)

=> Let's resize them to Standard_D8s_v3 (8 vCPUs / 32 GB) as soon as possible, since that would halve their price.
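
As a sanity check before opening the change, the two sizes can be compared from the Azure CLI (a sketch, not from the original comment; it assumes an authenticated az session):

# Compare vCPU/memory of the current and target sizes in eastus
az vm list-sizes --location eastus \
  --query "[?name=='Standard_D16s_v3' || name=='Standard_D8s_v3'].{name:name, vCPUs:numberOfCores, memoryMB:memoryInMb}" \
  --output table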

dduportal self-assigned this on Apr 27, 2023
dduportal removed the triage label (incoming issues that need review) on Apr 27, 2023
@dduportal

Here we are: jenkins-infra/jenkins-infra#2797

@dduportal

After 5 days, we clearly see a decrease in the daily spend and no visible problem with CPU/memory usage on the machines (except that free memory is halved, of course):

(screenshot: 2023-05-02 at 10:24:00)
(screenshot: 2023-05-02 at 10:24:46)
(screenshot: 2023-05-02 at 10:25:11)


@dduportal commented May 11, 2023

Reopening: despite the decrease from the previous actions, there are still costs to cut in this resource group.

Current costs are clearly better:

(screenshot: 2023-05-11 at 20:16:40)

However, we have a few more improvements to keep decreasing this cost:

@dduportal commented May 12, 2023

A few improvements for the VM agents:

  • Using ephemeral OS disks would avoid paying for managed disks. Ephemeral disks use the local SSD storage (either "Temp" or "Cache") of the hypervisor for the OS disk.
    • The disk size is constrained by the maximum cache storage OR temp storage available, and this maximum depends on the instance type. With the current settings (150 GB OS disk size for all VM agents, Standard_D4s_v3 for Linux/Windows VMs and Standard_D8s_v3 for Linux highmem VMs), only the highmem instance would support an ephemeral OS disk (see the sketch below):

(screenshot)
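
For illustration, this is roughly how an ephemeral OS disk is requested with the Azure CLI (a sketch only: the real agents are provisioned by the Jenkins Azure VM Agents plugin, and the VM name and image below are placeholders):

# Ephemeral OS disks require read-only caching and must fit in the cache/temp storage
az vm create \
  --resource-group eastus-cijenkinsio \
  --name ephemeral-disk-test \
  --image Ubuntu2204 \
  --size Standard_D8s_v3 \
  --ephemeral-os-disk true \
  --ephemeral-os-disk-placement CacheDisk \
  --os-disk-caching ReadOnly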

  • Instance size / price / spot price:
    • For the highmem instances:
      • Currently, we use Standard_D8s_v3 (8 vCPUs, 32 GB, local cache storage of 200 GB - 16,000 IOPS - 128 MBps) at $0.3840/hour. Spot is roughly half the price at $0.1695/hour. We also pay for a managed disk (Premium SSD P15) at $0.05/hour, but it could be removed.
      • After checking prices and specifications, and validating manually on infra.ci, the Standard_D8ads_v5 is a serious contender:
        • Faster CPU (AMD EPYC 7763v at 3.5 GHz instead of a Xeon at 2.4 GHz), even if it's hard to compare due to different caching and turbo-boost systems
        • Ability to use the temp storage as an ephemeral disk (300 GB, e.g. 150 GB for the system and 150 GB in /mnt) on a local NVMe, with the same IOPS as a Premium P15 SSD (thanks to the AMD EPYC cached PCIe lanes)
        • Cost of $0.4120/hour: a bit more expensive. BUT the spot price is 10x less at $0.0412/hour! Clearly worth the investment!
    • For the "normal" VMs, the change to Standard_D4ads_v5 is clearly also worth it: $0.2060/hour on demand with 150 GB (!) of ephemeral storage, and a spot price 10x lower at $0.0206/hour, which is already cheaper than the Standard_D4s_v3 with its managed disk ($0.1920 + $0.05 = $0.242/hour). A worked monthly comparison is sketched right after this list.
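
To make the gap concrete, a back-of-the-envelope monthly comparison (a sketch using 730 hours/month and the hourly prices quoted above, nothing else assumed):

# Rough monthly cost per always-on "normal" agent, using the prices above
awk 'BEGIN {
  h = 730                                    # hours in an average month
  printf "D4s_v3 on demand + P15 disk  : $%.2f/month\n", (0.1920 + 0.05) * h
  printf "D4ads_v5 on demand, ephemeral: $%.2f/month\n", 0.2060 * h
  printf "D4ads_v5 spot, ephemeral     : $%.2f/month\n", 0.0206 * h
}'
# => ~$176.66 vs ~$150.38 vs ~$15.04: spot + ephemeral is more than 10x cheaper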

@dduportal

  • Additional check run on infra.ci to ensure I/O performance does not regress with the new instances and their ephemeral disks (a guess at the fio invocation follows the logs):
# With Standard_D8s_v3 and a Premium SSD P15 of 150 GB
12:32:40  fiotest: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
12:32:40  fio-3.25
12:32:40  Starting 1 process
12:32:40  fiotest: Laying out IO file (1 file / 8192MiB)
12:44:01  
12:44:01  fiotest: (groupid=0, jobs=1): err= 0: pid=3005: Fri May 12 10:43:57 2023
12:44:01    read: IOPS=239, BW=960KiB/s (983kB/s)(574MiB/612606msec)
12:44:01     bw (  KiB/s): min=    8, max= 1472, per=100.00%, avg=961.97, stdev=236.62, samples=1222
12:44:01     iops        : min=    2, max=  368, avg=240.45, stdev=59.14, samples=1222
12:44:01    write: IOPS=3183, BW=12.4MiB/s (13.0MB/s)(7618MiB/612606msec); 0 zone resets
12:44:01     bw (  KiB/s): min=   32, max=15608, per=100.00%, avg=12761.32, stdev=2928.16, samples=1222
12:44:01     iops        : min=    8, max= 3902, avg=3190.28, stdev=732.02, samples=1222
12:44:01    cpu          : usr=0.57%, sys=1.40%, ctx=330551, majf=0, minf=7
12:44:01    IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
12:44:01       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
12:44:01       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
12:44:01       issued rwts: total=147002,1950150,0,0 short=0,0,0,0 dropped=0,0,0,0
12:44:01       latency   : target=0, window=0, percentile=100.00%, depth=64
12:44:01  
12:44:01  Run status group 0 (all jobs):
12:44:01     READ: bw=960KiB/s (983kB/s), 960KiB/s-960KiB/s (983kB/s-983kB/s), io=574MiB (602MB), run=612606-612606msec
12:44:01    WRITE: bw=12.4MiB/s (13.0MB/s), 12.4MiB/s-12.4MiB/s (13.0MB/s-13.0MB/s), io=7618MiB (7988MB), run=612606-612606msec
12:44:01  
12:44:01  Disk stats (read/write):
12:44:01    sda: ios=147202/1954573, merge=0/5728, ticks=1694859/37949627, in_queue=40122165, util=96.13%
# With Standard_D4ads_v5 and an ephemeral storage of 150 GB
12:34:40  fiotest: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
12:34:40  fio-3.25
12:34:40  Starting 1 process
12:34:40  fiotest: Laying out IO file (1 file / 8192MiB)
12:35:47  
12:35:47  fiotest: (groupid=0, jobs=1): err= 0: pid=3013: Fri May 12 10:35:47 2023
12:35:47    read: IOPS=2706, BW=10.6MiB/s (11.1MB/s)(574MiB/54322msec)
12:35:47     bw (  KiB/s): min= 7344, max=12640, per=100.00%, avg=10853.04, stdev=521.67, samples=108
12:35:47     iops        : min= 1836, max= 3160, avg=2713.26, stdev=130.42, samples=108
12:35:47    write: IOPS=35.9k, BW=140MiB/s (147MB/s)(7618MiB/54322msec); 0 zone resets
12:35:47     bw (  KiB/s): min=97080, max=158704, per=100.00%, avg=143985.70, stdev=5771.30, samples=108
12:35:47     iops        : min=24270, max=39676, avg=35996.43, stdev=1442.82, samples=108
12:35:47    cpu          : usr=3.37%, sys=13.70%, ctx=693561, majf=0, minf=7
12:35:47    IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
12:35:47       submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
12:35:47       complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
12:35:47       issued rwts: total=147002,1950150,0,0 short=0,0,0,0 dropped=0,0,0,0
12:35:47       latency   : target=0, window=0, percentile=100.00%, depth=64
12:35:47  
12:35:47  Run status group 0 (all jobs):
12:35:47     READ: bw=10.6MiB/s (11.1MB/s), 10.6MiB/s-10.6MiB/s (11.1MB/s-11.1MB/s), io=574MiB (602MB), run=54322-54322msec
12:35:47    WRITE: bw=140MiB/s (147MB/s), 140MiB/s-140MiB/s (147MB/s-147MB/s), io=7618MiB (7988MB), run=54322-54322msec
12:35:47  
12:35:47  Disk stats (read/write):
12:35:47    sda: ios=146684/1945949, merge=0/105, ticks=248751/3178820, in_queue=3427575, util=99.33%
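
For the record, the exact fio job file is not in the thread, but the output header and the issued read/write totals suggest an invocation along these lines:

# Guessed from the logs above: 4k random R/W, libaio, queue depth 64, 8 GiB file
fio --name=fiotest --ioengine=libaio --iodepth=64 \
  --rw=randrw --rwmixread=7 --bs=4k --size=8192m \
  --direct=1
# --rwmixread=7 approximates the ~147k reads vs ~1.95M writes issued;
# --direct=1 (bypassing the page cache) is an assumption, not visible in the output.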

@dduportal

Update: we see positive effects, but it is a bit early to conclude (we'll have to watch the weekdays):

(screenshot: 2023-05-15 at 19:25:57)

@dduportal

This issue can be closed: the effects of the last changes are persistent and effective:

(screenshot: 2023-05-22 at 16:32:09)
