Conversation

@jlledom commented Nov 20, 2025

We use the number of available CPUs to determine the number of listener workers when neither LISTENER_WORKERS nor PUMA_WORKERS is set.

To detect the number of CPUs, we just read Etc.nprocessors, which on Linux equals the number of cores on the physical hardware. However, in Kubernetes, the number of CPUs assigned to a container is determined by cgroups.

Since we were using Etc.nprocessors, we were ignoring the requests.cpu and limits.cpu container parameters.

This PR implements the following fallback chain to get the number of available CPUs:

Cgroups v2 || Cgroups v1 || Etc.nprocessors

For cgroups v2, we compute the value from cpu.max; for cgroups v1, we compute it from cpu.cfs_quota_us and cpu.cfs_period_us. This corresponds to the limits.cpu value in Kubernetes.
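
As a rough sketch of that chain (assuming the default /sys/fs/cgroup mount and ignoring nested cgroups; the actual code in the PR may differ in details):

require 'etc'

def available_cpus
  cgroups_v2_cpus || cgroups_v1_cpus || Etc.nprocessors
end

def cgroups_v2_cpus
  # cpu.max contains "<quota> <period>", or "max <period>" when unlimited
  quota, period = File.read("/sys/fs/cgroup/cpu.max").split
  return nil if quota == "max"
  (quota.to_f / period.to_i).ceil
rescue StandardError
  nil
end

def cgroups_v1_cpus
  quota = File.read("/sys/fs/cgroup/cpu/cpu.cfs_quota_us").to_i
  return nil if quota == -1 # -1 means no quota was set
  period = File.read("/sys/fs/cgroup/cpu/cpu.cfs_period_us").to_i
  (quota.to_f / period).ceil
rescue StandardError
  nil
end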

I observed that in porta we are instead using the value from the Kubernetes requests.cpu parameter, which maps to cpu.weight in cgroups v2 and cpu.shares in cgroups v1. I think that's essentially incorrect, because weight/shares is a proportional value the kernel uses to arbitrate between running containers under high load; that is, it's a relative scheduling priority.

Such a priority cannot be directly translated into a number of CPUs, because the final amount of available CPU time depends on how many actual cores the node has, how many other containers are running, and what weights they have, all of which can change while the container runs. Deriving the number of workers from the weight means deriving a static value from a relative one.
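
To illustrate with made-up numbers (nothing here comes from the PR):

node_cores = 4
weights = { ours: 100, neighbour: 300 } # hypothetical weights
# Alone on the node, our container may use all 4 cores regardless of weight.
# Under full contention, it only gets its proportional slice:
our_share = weights[:ours].to_f / weights.values.sum
our_share * node_cores # => 1.0 core, and this changes whenever a neighbour starts or stops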

For cgroups v1, Kubernetes set the arbitrary equivalence of 1024 shares = 1 CPU, and only allowed setting this priority via requests.cpu, expressed in CPU cores. In my opinion, using CPU cores as a unit of relative priority is pretty confusing.

Even today, CPU cores seem to be the only way to indicate priority in Kubernetes, even after cgroups v2 was released, where the translation from 1024 shares to 1 core no longer makes sense. See this GH issue.

In cgroups v1, shares range from 2 to 262142, and a "core" is 1024: 512 times the minimum and 1/256 of the maximum. In cgroups v2, however, weights range from 1 to 10000, with 100 as the default: 100 times the minimum and 1/100 of the maximum.
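
For reference, the long-standing linear interpolation runc uses to map one range onto the other (the current mapping, not the upcoming formula mentioned below) shows how far a v1 "core" lands from the v2 default:

def shares_to_weight(shares)
  # map [2, 262142] onto [1, 10000] with integer arithmetic
  1 + (shares - 2) * 9999 / 262142
end

shares_to_weight(2)      # => 1     (v1 minimum -> v2 minimum)
shares_to_weight(1024)   # => 39    (one v1 "core": nowhere near the v2 default of 100)
shares_to_weight(262142) # => 10000 (v1 maximum -> v2 maximum)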

In fact, they are about to announce a new formula that converts v1 values to v2: kubernetes/website#52793. So the formula we use in porta is incorrect.

What we really want is the "maximum number of available cores", and in Kubernetes that's limits.cpu. That is, the equivalent of what Etc.nprocessors would return on a physical machine.
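
As a concrete example, assuming the default 100ms CFS period: a hypothetical pod with limits.cpu: 1500m ends up with a cpu.max of "150000 100000", so the computation above gives:

quota, period = 150_000, 100_000 # cpu.max for limits.cpu: 1500m
(quota.to_f / period).ceil # => 2 listener workers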

Issue

https://issues.redhat.com/browse/THREESCALE-10187

Notes

Claude wrote the code and the tests.

@jlledom self-assigned this Nov 20, 2025
(quota_int.to_f / period_int).ceil
rescue
# Silent failure - fall back to other detection methods
nil
Contributor

I think logging an error would be good here, because I don't see anything in the method that would be expected to raise. So while returning nil makes sense, it's good to know if something is broken.

Contributor Author

Done: ad22ad3

quota = File.read(quota_path).strip.to_i
period = File.read(period_path).strip.to_i

return nil if quota == -1 # unlimited quota
Contributor

this line seems redundant

Contributor Author

Done: c26d2b0

(quota.to_f / period).ceil
rescue
# Silent failure - fall back to Etc.nprocessors
nil
Contributor

It would be better to log errors here as well.

@akostadinov left a comment

Man, is AI writing ugly code!

Anyway, looks good except for a couple of really ugly pieces of s^Mcode

@jlledom commented Nov 24, 2025

> Man, is AI writing ugly code!
>
> Anyway, looks good except for a couple of really ugly pieces of s^Mcode

The AI will take over the world, and you will regret this comment. Please respect our masters

- rescue
-   # Silent failure - fall back to other detection methods
+ rescue StandardError => e
+   Backend.logger.info "Getting CPU quota from cgroups v2 failed, falling back to cgroups v1: #{e.message}"
Contributor

if we get here, this is an error, so it should be at least warn

Contributor

same with the other place

Contributor Author

Done: 018594d

@akostadinov

Cool, looks good!

jlledom and others added 2 commits November 26, 2025 08:47
Co-authored-by: Aleksandar N. Kostadinov <[email protected]>
Co-Authored-By: Claude <[email protected]>
@jlledom force-pushed the THREESCALE-10187-detect-cpus branch from 018594d to b457fa9 on November 26, 2025 07:47
@jlledom merged commit fdbe781 into 3scale:master Nov 26, 2025
12 checks passed