check_lsi_raid affects host I/O performance #42

Closed
BlackZork opened this issue Sep 5, 2024 · 1 comment

BlackZork commented Sep 5, 2024

01:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02)
storcli 007.3006.0000.0000-1 on ArchLinux (installed from AUR).

# /usr/lib/monitoring-plugins/check_lsi_raid -V
check_lsi_raid: Nagios/Icinga plugin to check LSI Raid Controller status
Version: 2.5
StorCli SAS Customization Utility Ver 007.3006.0000.0000 Apr 17, 2024

Icinga service definition:

apply Service for (name => config in host.vars.lsiraid) {
  import "generic-service"

  check_command = "lsi-raid"
  vars += config
}

object Host "myhost" {
  /* Import the default host template defined in `templates.conf`. */
  import "linux-server"

  address = "1.1.1.1"

  vars.lsiraid["LSI 3108"] = {
    lsi_ignored_other_errors = 9999999
    lsi_ignored_media_errors = 9999999
  }

  vars.lsiraid["RAID slot 1"] = {
    lsi_enclosure_id=1
    lsi_pd_id=0
    lsi_ignored_other_errors=8
  }
 
  [... and next 15 slots as above]
}

When the default 1-minute check interval was used, host I/O performance suffered dramatically. It looks like the controller stops some I/O operations when a storcli command is executed. I discovered this by looking for processes in the IO_WAIT state. The number of waiting processes increased when storcli was executed and I experienced slowdowns of various VMs and services hosted on my server.

As a workaround, I've added check_period=15m and a TimePeriod with a 15-minute daily window, which forces Icinga2 to check the LSI controller only once a day.
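
For anyone wanting to try something similar, here is a minimal sketch of how a once-a-day window could be expressed in Icinga 2. The TimePeriod name `lsi-raid-daily` and the 03:00-03:15 window are hypothetical, and pairing `check_interval = 15m` with a `check_period` is an assumption about how to get this effect, not necessarily the exact configuration used on this host.

```icinga2
/* Hypothetical 15-minute daily window; pick a time with low I/O load. */
object TimePeriod "lsi-raid-daily" {
  display_name = "LSI RAID daily check window"
  ranges = {
    "monday"    = "03:00-03:15"
    "tuesday"   = "03:00-03:15"
    "wednesday" = "03:00-03:15"
    "thursday"  = "03:00-03:15"
    "friday"    = "03:00-03:15"
    "saturday"  = "03:00-03:15"
    "sunday"    = "03:00-03:15"
  }
}

apply Service for (name => config in host.vars.lsiraid) {
  import "generic-service"

  check_command = "lsi-raid"

  /* Checks are only scheduled inside the window above. With an interval
     as long as the window, each service should run roughly once a day. */
  check_interval = 15m
  check_period = "lsi-raid-daily"

  vars += config
}
```

The trade-off is that a failed disk may go unreported for up to a day, so the window and interval should match how quickly a RAID fault needs to be noticed.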

I am aware that there is probably nothing you can do to fix this problem. I spent a lot of time trying to figure out what was causing the I/O problems on my host, so it may be worth adding a warning to this plugin's documentation for others.

@gschoenberger
Member

A 1-minute check interval might indeed not be a suitable option for this plugin!
I think the heaviest operation is:

time /usr/local/bin/storcli /c0 show all
real	0m1,798s
user	0m0,034s
sys	0m0,038s

If I remember correctly, we had this issue when we were running "adpallinfo" in the plugin.

Added a "Warning" to the README with commit e27cc73
