
Validate the racks #2056

Merged · 12 commits into master · Jan 31, 2020

Conversation

@sjeropkipruto (Contributor) commented Jan 7, 2020

Status: Work in progress 🚧

Description

We want to validate the Singularity racks, because adding new instances in racks that Singularity doesn't know about causes scheduling issues.

Issue: https://git.hubteam.com/HubSpot/PaaS-Run/issues/1102

Todos

  • Specify the valid racks in the Singularity configuration
  • Implement validation checks to confirm that the racks are known to Singularity
  • Testing
  • Rollout
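The configuration item above might look something like the following YAML fragment. This is a hypothetical sketch: the key name `expectedRacks` comes from the review discussion below, but its exact name, placement, and the rack identifiers are assumptions.

```yaml
# Hypothetical sketch -- the actual key name and placement in the
# Singularity configuration may differ. Leaving it empty/unset should
# remain the default (current behavior).
expectedRacks:
  - us_east_1a
  - us_east_1b
  - us_east_1c
```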

@ssalinas (Member) commented Jan 7, 2020

I think this is close, just needs a tweak. We likely still want to be able to schedule some things on the smaller racks we add, but we don't want a single undersized rack to limit placement among the others. I think the hardcoded list should inform whether we choose RACK_SATURATED, rather than rejecting a new/unseen rack altogether. Essentially this gives us a hardcoded count of 'expectedRacks' versus 'validRacks', if that makes sense.
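The distinction above could be sketched roughly as follows. This is illustrative only, not the actual Singularity code; the class, enum, and parameter names are invented for the example.

```java
import java.util.List;

public class RackSaturation {
    enum RackCheckResult { OK, RACK_SATURATED }

    // Sketch: expectedRacks informs the saturation decision; a new/unseen
    // rack is not rejected outright, it is simply excluded from the
    // saturation math so it can still receive some placements.
    static RackCheckResult checkRack(String rackId, int instancesAlreadyOnRack,
                                     double numPerRack, List<String> expectedRacks) {
        if (!expectedRacks.isEmpty() && !expectedRacks.contains(rackId)) {
            // Unexpected rack: still allowed to schedule.
            return RackCheckResult.OK;
        }
        return instancesAlreadyOnRack < numPerRack
            ? RackCheckResult.OK
            : RackCheckResult.RACK_SATURATED;
    }
}
```

With an empty `expectedRacks` list, every rack goes through the normal saturation check, which keeps the current behavior as the default.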

@ssalinas (Member) commented Jan 9, 2020

Close, but this is still a bit too restrictive. As written, it won't let anything schedule on unexpected racks, and it doesn't solve the underlying issue: the rack count will still make the scheduler think something is saturated too early.

Check out the isRackOk method. That is where we want to hook in. There is a line in there like:

double numPerRack = numDesiredInstances / (double) rackManager.getNumActive();
if (racksAccountedFor < rackManager.getNumActive()) {
  ...

The rackManager.getNumActive() call is what gets us right now when we accidentally do something like add a single host in a new rack. Singularity's RackManager then thinks there are, for example, 4 racks, when only 3 actually have sufficient capacity for scheduling. We want to use the count of expectedRacks here instead of rackManager.getNumActive() (probably falling back to rackManager in the case that expectedRacks is empty/unspecified, which should be the default).
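The suggested substitution with its fallback might look like this sketch. Names such as `effectiveRackCount` are illustrative, not the actual Singularity identifiers.

```java
import java.util.List;

public class RackMath {
    // Use the configured expectedRacks count when present; otherwise fall
    // back to the live active-rack count from RackManager (current behavior).
    static int effectiveRackCount(List<String> expectedRacks, int numActiveRacks) {
        return (expectedRacks == null || expectedRacks.isEmpty())
            ? numActiveRacks
            : expectedRacks.size();
    }

    // Mirrors the numPerRack line in isRackOk, with the count substituted.
    static double numPerRack(int numDesiredInstances, List<String> expectedRacks,
                             int numActiveRacks) {
        return numDesiredInstances / (double) effectiveRackCount(expectedRacks, numActiveRacks);
    }
}
```

For example, with 3 expected racks, 6 desired instances, and a stray 4th rack active, this yields a per-rack target of 2.0 instead of 1.5, so the three real racks are not marked saturated prematurely.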

@ssalinas (Member) commented
Looking good, if all tests are still passing let's give it a go in staging 👍

@WH77 (Contributor) commented Jan 28, 2020

Verified that scheduling still works normally on staging, added a unit test just to be safe.

@ssalinas (Member) commented
🚢

@WH77 WH77 added the hs_stable label Jan 30, 2020
@WH77 WH77 merged commit 34603ca into master Jan 31, 2020
@ssalinas ssalinas added this to the 1.2.0 milestone Feb 18, 2020
@ssalinas ssalinas deleted the rack branch July 17, 2020 18:18