Implement Linux memory policy #3230
Conversation
saku3
left a comment
Thanks for the PR!
Would it be possible to add an integration test similar to runc’s?
https://github.com/opencontainers/runc/pull/4726/files#diff-eab28bb7cc15c949e829d02adfa5ad5869d5cea2dde63d0c2c5f4c52225967e6
In youki, we implement integration tests here:
https://github.com/youki-dev/youki/tree/main/tests/contest
As noted in the description, nodes can be empty in some cases. The runtime-spec has also been updated:
https://github.com/opencontainers/runtime-spec/pull/1294/files
Could you update oci-spec-rs accordingly as well?
const MPOL_F_STATIC_NODES: u32 = 1 << 15; // 0x8000

let base_mode = match policy.mode() {
oci_spec::runtime::MemoryPolicyModeType::MpolDefault => MPOL_DEFAULT,
import them like this:
use oci_spec::runtime::{ /* other */, LinuxMemoryPolicy, MemoryPolicyFlagType, MemoryPolicyModeType };
Please refer to line 11.
let highest_node = node_ids.iter().max().copied().unwrap_or(0) as usize;

// Calculate maxnode as highest_node + 1 (number of nodes to consider)
let maxnode = (highest_node + 1) as u64;
Could you explain the rationale for how maxnode is calculated?
When I run an example like the one in runc’s test code, youki fails with an error. I suspect the cause is a difference in how maxnode is implemented, but I don’t fully understand it yet.
In set_mempolicy(2), nodemask points to a bit mask of node IDs that contains up to maxnode bits, and the kernel uses bits only up to maxnode. Also, a maxnode value of zero specifies the empty set of nodes (in which case nodemask is ignored).
Therefore, to make the highest set bit in nodemask visible to the kernel, maxnode should be one greater than the highest node ID we want to include.
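For context, glibc does not wrap set_mempolicy(2) (the wrapper lives in libnuma), so a runtime typically issues the raw syscall. Below is a minimal sketch of such a call in Rust using the libc crate on Linux; it is illustrative only, not youki's actual wrapper, and the whole-word maxnode in the example anticipates the discussion further down:

fn set_mempolicy_raw(mode: i32, nodemask: &[libc::c_ulong], maxnode: libc::c_ulong) -> std::io::Result<()> {
    // SAFETY: nodemask is a valid slice; maxnode tells the kernel how many bits it may read.
    let ret = unsafe { libc::syscall(libc::SYS_set_mempolicy, mode, nodemask.as_ptr(), maxnode) };
    if ret == 0 {
        Ok(())
    } else {
        Err(std::io::Error::last_os_error())
    }
}

fn main() {
    const MPOL_INTERLEAVE: i32 = 3; // from linux/mempolicy.h
    let bits_per_ulong = std::mem::size_of::<libc::c_ulong>() * 8;
    let nodemask: [libc::c_ulong; 1] = [0b10]; // bit 1 set = node 1
    match set_mempolicy_raw(MPOL_INTERLEAVE, &nodemask, bits_per_ulong as libc::c_ulong) {
        Ok(()) => println!("interleave policy applied"),
        Err(e) => eprintln!("set_mempolicy failed: {e}"),
    }
}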
The failure you encountered is likely not caused by the maxnode calculation itself. Right now, oci-spec-rs has already merged support for Linux memory policy (including mode/nodes/flags), but no new crate release has been published yet. If your local environment is still pointing to the crates.io release instead of the git revision with that PR, you may hit errors when parsing or building the spec. That may explain the discrepancy you observed when comparing with runc’s tests.
Sorry for the long message.
Please let me know if my analysis is incorrect.
If it’s working on your local environment, then something might be wrong with mine.
I first confirmed the machine has two NUMA nodes and the kernel version:
cat /sys/devices/system/node/online
0-1
uname -a
Linux ip-172-31-11-219 6.14.0-1012-aws #12~24.04.1-Ubuntu SMP Fri Aug 15 00:16:05 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
I also switched oci-spec to a branch that includes memoryPolicy support:
[dependencies]
- oci-spec = { version = "0.8.2", features = ["runtime"] }
+ oci-spec = { git = "https://github.com/n4mlz/oci-spec-rs.git", branch = "fix/mempolicy", features = ["runtime"] }
Then I ran the contest tests, which failed:
just test-contest memory_policy
/home/ubuntu/workspace/youki/scripts/build.sh -o /home/ubuntu/workspace/youki -r -c youki
* FEATURES: <default>
* TARGET: x86_64-unknown-linux-gnu
Compiling libcontainer v0.5.5 (/home/ubuntu/workspace/youki/crates/libcontainer)
Compiling youki v0.5.5 (/home/ubuntu/workspace/youki/crates/youki)
Finished `release` profile [optimized] target(s) in 52.06s
/home/ubuntu/workspace/youki/scripts/build.sh -o /home/ubuntu/workspace/youki -r -c contest
* FEATURES: <default>
* TARGET: x86_64-unknown-linux-gnu
removed '/home/ubuntu/workspace/youki/contest'
Compiling contest v0.1.0 (/home/ubuntu/workspace/youki/tests/contest/contest)
Finished `release` profile [optimized] target(s) in 33.22s
removed '/home/ubuntu/workspace/youki/runtimetest'
Finished `release` profile [optimized] target(s) in 0.12s
sudo /home/ubuntu/workspace/youki/scripts/contest.sh /home/ubuntu/workspace/youki/youki memory_policy
# Start group memory_policy
1 / 10 : bind_static : not ok
container stderr was not empty, found : ERROR libcontainer::process::init::process: failed to set memory policy err=Nix(EINVAL)
ERROR libcontainer::process::container_intermediate_process: failed to initialize container process: failed syscall
ERROR libcontainer::process::container_main_process: failed to wait for init ready: exec process failed with error error in executing process : failed syscall
ERROR libcontainer::container::builder_impl: failed to run container process exec process failed with error error in executing process : failed syscall
ERROR youki: error in executing command: failed to create container: exec process failed with error error in executing process : failed syscall
error in executing command: failed to create container: exec process failed with error error in executing process : failed syscall
Error: failed to create container: exec process failed with error error in executing process : failed syscall
2 / 10 : bind_way_too_large_node_number : ok
3 / 10 : default_with_missing_nodes_ok : ok
4 / 10 : empty_memory_policy : ok
5 / 10 : interleave_without_flags : not ok
container stderr was not empty, found : ERROR libcontainer::process::init::process: failed to set memory policy err=Nix(EINVAL)
ERROR libcontainer::process::container_intermediate_process: failed to initialize container process: failed syscall
ERROR libcontainer::process::container_main_process: failed to wait for init ready: exec process failed with error error in executing process : failed syscall
ERROR libcontainer::container::builder_impl: failed to run container process exec process failed with error error in executing process : failed syscall
ERROR youki: error in executing command: failed to create container: exec process failed with error error in executing process : failed syscall
error in executing command: failed to create container: exec process failed with error error in executing process : failed syscall
Error: failed to create container: exec process failed with error error in executing process : failed syscall
6 / 10 : invalid_flag_string : ok
7 / 10 : invalid_mode_string : ok
8 / 10 : missing_mode_but_nodes_present : ok
9 / 10 : preferred_relative : not ok
container stderr was not empty, found : ERROR libcontainer::process::init::process: failed to set memory policy (MPOL_PREFERRED) err=Nix(EINVAL)
ERROR libcontainer::process::container_intermediate_process: failed to initialize container process: failed syscall
ERROR libcontainer::process::container_main_process: failed to wait for init ready: exec process failed with error error in executing process : failed syscall
ERROR libcontainer::container::builder_impl: failed to run container process exec process failed with error error in executing process : failed syscall
ERROR youki: error in executing command: failed to create container: exec process failed with error error in executing process : failed syscall
error in executing command: failed to create container: exec process failed with error error in executing process : failed syscall
Error: failed to create container: exec process failed with error error in executing process : failed syscall
10 / 10 : syscall_invalid_arguments : ok
# End group memory_policy
error: Recipe `test-contest` failed on line 62 with exit code 1
A simple manual run with the following OCI config also fails:
"linux": {
"memoryPolicy" : {
"mode": "MPOL_INTERLEAVE",
"nodes": "1"
}
}
sudo strace -f -tt -s 128 -e trace=set_mempolicy -o /tmp/youki.strace \
./youki run -b tutorial/ container
ERROR libcontainer::process::init::process: failed to set memory policy err=Nix(EINVAL)
ERROR libcontainer::process::container_intermediate_process: failed to initialize container process: failed syscall
ERROR libcontainer::process::container_main_process: failed to wait for init ready: exec process failed with error error in executing process : failed syscall
ERROR libcontainer::container::builder_impl: failed to run container process exec process failed with error error in executing process : failed syscall
INFO libcgroups::common: cgroup manager V2 will be used
DEBUG libcgroups::v2::manager: remove cgroup "/sys/fs/cgroup/:youki:container"
ERROR youki: error in executing command: failed to create container: exec process failed with error error in executing process : failed syscall
run failed : failed to create container: exec process failed with error error in executing process : failed syscall
trace result
50065 06:03:37.212423 set_mempolicy(MPOL_INTERLEAVE, [0x00000000000002], 2) = -1 EINVAL (Invalid argument)
50064 06:03:37.212710 +++ exited with 0 +++
50063 06:03:37.212746 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=50064, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
50065 06:03:37.213411 +++ exited with 255 +++
50063 06:03:37.213466 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=50065, si_uid=0, si_status=255, si_utime=0, si_stime=0} ---
50063 06:03:37.221851 +++ exited with 255 +++
The root cause is explained by the Linux source code:
- set_mempolicy
- kernel_set_mempolicy
- get_nodes
- get_bitmap
- do_set_mempolicy
- mpol_new
See mm/mempolicy.c around here: https://github.com/torvalds/linux/blob/v6.14/mm/mempolicy.c
Why set_mempolicy(MPOL_INTERLEAVE, [0x00000000000002], 2) fails:
In the kernel’s get_nodes(), maxnode is decremented first (--maxnode).
With maxnode=2 from userspace, it becomes 1 inside the kernel.
Then get_bitmap() applies a tail mask to keep only the lowest maxnode bits:
mask[nlongs - 1] &= (1UL << (maxnode % BITS_PER_LONG)) - 1;
With maxnode=1 and BITS_PER_LONG=64, the mask is 0x1.
nodemask word is 0x2 (bit1 set = node1).
0x2 & 0x1 = 0x0 → the effective nodemask becomes empty.
MPOL_INTERLEAVE does not allow an empty nodemask, so the kernel returns -EINVAL.
In mpol_new():
} else if (nodes_empty(*nodes))
        return ERR_PTR(-EINVAL);
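To make the failure concrete, here is a small standalone sketch (plain Rust mirroring the kernel arithmetic described above, not kernel or youki code) that reproduces the masking step for maxnode=2 and nodemask=0x2:

fn main() {
    const BITS_PER_LONG: u64 = 64;

    // Userspace passes maxnode = 2 and a single nodemask word with bit 1 set (node 1).
    let mut maxnode: u64 = 2;
    let mut mask: u64 = 0x2;

    // Kernel's get_nodes(): maxnode is decremented first.
    maxnode -= 1; // now 1

    // Kernel's get_bitmap(): keep only the lowest `maxnode` bits of the last word.
    if maxnode % BITS_PER_LONG != 0 {
        mask &= (1u64 << (maxnode % BITS_PER_LONG)) - 1; // 0x2 & 0x1
    }

    // The effective nodemask is now empty, which MPOL_INTERLEAVE rejects with EINVAL.
    assert_eq!(mask, 0);
    println!("effective nodemask: {mask:#x}");
}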
runc
With the above in mind, running runc’s code related to memory_policy succeeds: opencontainers/runc#4726
sudo strace -f -tt -s 128 -e trace=set_mempolicy -o /tmp/youki.strace runc run -b tutorial/ container
strace
38296 06:57:27.319222 set_mempolicy(MPOL_INTERLEAVE, [0x00000000000002], 64) = 0
Therefore, I think it’s preferable to implement the maxnode specification the same way as runc.
numactl
strace -f -tt -s 128 -e trace=set_mempolicy -o /tmp/numa.strace numactl --interleave=1 sh
cat /tmp/numa.strace
38676 06:59:11.357283 set_mempolicy(MPOL_INTERLEAVE, [0x00000000000002, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000], 1025) = 0
38676 06:59:14.501691 +++ exited with 0 +++
For your reference, when I run numactl, it uses a much larger value for maxnode.
@saku3
Thank you very much for this thorough response!
I realize now that my initial verification was insufficient. When I set up a virtual machine with two NUMA nodes and retested the functionality, I encountered exactly the same error you pointed out.
As you've correctly noted, the maxnode parameter of set_mempolicy(2) needs to cover the total number of bits in the nodemask array. I've made the required modifications to prepare the nodemask in the same format as runc. Could you please verify this again?
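For reference, the idea looks roughly like the sketch below. It is illustrative only (not the exact code in this PR): the nodemask is rounded up to whole libc::c_ulong words and maxnode is reported as the total bit count of that buffer, the same convention runc uses:

// Build a nodemask sized in whole c_ulong words; maxnode is the total number
// of bits in the buffer, so the kernel's internal `--maxnode` still covers
// every set bit.
fn build_nodemask(node_ids: &[u32]) -> (Vec<libc::c_ulong>, libc::c_ulong) {
    let bits_per_ulong = std::mem::size_of::<libc::c_ulong>() * 8;
    let highest = node_ids.iter().copied().max().unwrap_or(0) as usize;
    let words = highest / bits_per_ulong + 1;
    let mut mask = vec![0 as libc::c_ulong; words];
    for &node in node_ids {
        let n = node as usize;
        mask[n / bits_per_ulong] |= (1 as libc::c_ulong) << (n % bits_per_ulong);
    }
    let maxnode = (words * bits_per_ulong) as libc::c_ulong;
    (mask, maxnode)
}

fn main() {
    let (mask, maxnode) = build_nodemask(&[1]);
    // On x86_64 this prints: nodemask=[2] maxnode=64
    println!("nodemask={mask:?} maxnode={maxnode}");
}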
cf: In my environment, the test results looked like this:
$ cat /sys/devices/system/node/online
0-1
$ uname -a
Linux kernelhack 6.1.0-37-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.140-1 (2025-05-22) x86_64 GNU/Linux
$ just test-contest memory_policy
/home/n4mlz/youki/scripts/build.sh -o /home/n4mlz/youki -r -c youki
* FEATURES: <default>
* TARGET: x86_64-unknown-linux-gnu
Finished `release` profile [optimized] target(s) in 0.30s
/home/n4mlz/youki/scripts/build.sh -o /home/n4mlz/youki -r -c contest
* FEATURES: <default>
* TARGET: x86_64-unknown-linux-gnu
removed '/home/n4mlz/youki/contest'
Finished `release` profile [optimized] target(s) in 0.30s
removed '/home/n4mlz/youki/runtimetest'
Finished `release` profile [optimized] target(s) in 0.27s
sudo /home/n4mlz/youki/scripts/contest.sh /home/n4mlz/youki/youki memory_policy
Validation successful for runtime /home/n4mlz/youki/youki
Since the oci-spec-rs release that includes support for Linux memory policy hasn't been published yet, I couldn't run the tests via CI. However, I have added integration tests. By cloning the branch containing these changes in oci-spec-rs, editing youki's Cargo.toml to point to the local crate, and running the tests locally, I was able to verify that they pass.
Ok(())
}

fn setup_memory_policy(
(Before we proceed with the overall review.)
For maintainability, please consider moving setup_memory_policy, build_nodemask, and parse_node_string into a separate file like memorypolicy.rs.
saku3
left a comment
I’ve reviewed it, and the overall structure looks good.
I left comments on the finer details—when you have a moment, please consider addressing them. If anything is unclear, feel free to ask anytime.
memory_policy: &Option<oci_spec::runtime::LinuxMemoryPolicy>,
syscall: &dyn Syscall,
) -> Result<()> {
if let Some(policy) = memory_policy {
How about using early returns? The if nesting is deep.
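For example, a sketch of the early-return shape (signature simplified; this assumes the git revision of oci-spec-rs that defines LinuxMemoryPolicy):

use oci_spec::runtime::LinuxMemoryPolicy;

// Return early when no policy is configured, so the main body is not nested
// inside `if let Some(...)`.
fn setup_memory_policy(
    memory_policy: &Option<LinuxMemoryPolicy>,
) -> Result<(), Box<dyn std::error::Error>> {
    let Some(policy) = memory_policy else {
        return Ok(());
    };
    // ... validate and apply `policy` here, one indentation level shallower ...
    let _ = policy;
    Ok(())
}

fn main() {
    assert!(setup_memory_policy(&None).is_ok());
}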
for flag in flags {
match flag {
oci_spec::runtime::MemoryPolicyFlagType::MpolFNumaBalancing => {
Please add a use import so MemoryPolicyFlagType can be referenced directly.
}
flags_value |= MPOL_F_NUMA_BALANCING;
}
oci_spec::runtime::MemoryPolicyFlagType::MpolFRelativeNodes => {
Same here MemoryPolicyFlagType
has_relative = true;
flags_value |= MPOL_F_RELATIVE_NODES;
}
oci_spec::runtime::MemoryPolicyFlagType::MpolFStaticNodes => {
Same here MemoryPolicyFlagType
| "memory_policy_default" => { | ||
| tests::validate_memory_policy(&spec, Some(MemoryPolicyModeType::MpolDefault)) | ||
| } | ||
| "memory_policy_empty" => tests::validate_memory_policy(&spec, None), |
I think the memory_policy_empty test case is checking the same thing as hello_world, so it’s fine to omit it.
MemoryPolicyModeType::MpolWeightedInterleave => MPOL_WEIGHTED_INTERLEAVE,
};

let mut flags_value: u32 = 0;
Personally, I think it would be easier to read if the validation and the construction of flags_value were separated.
let mode_with_flags = base_mode | (flags_value as i32);

// Handle special cases for MPOL_DEFAULT and MPOL_LOCAL
match base_mode {
Personally, I think it would be clearer to separate validation from the application of set_mempolicy.
// Handle special cases for MPOL_DEFAULT and MPOL_LOCAL
match base_mode {
MPOL_DEFAULT => {
Can the validation for MPOL_DEFAULT and MPOL_LOCAL be consolidated?
// Double-check: if the parsed result is empty (e.g., whitespace-only input),
// treat as empty nodemask
if maxnode == 0 {
if flags_value & (MPOL_F_RELATIVE_NODES | MPOL_F_STATIC_NODES) != 0 {
memo: https://docs.kernel.org/admin-guide/mm/numa_memory_policy.html
It is possible for the user to specify that local allocation is always preferred by passing an empty nodemask with this mode. If an empty nodemask is passed, the policy cannot use the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags described below.
assert_eq!(got_args[0].mode, 2); // MPOL_BIND (corrected value)
assert_eq!(got_args[0].nodemask.len(), 1); // One c_ulong needed
assert_eq!(got_args[0].nodemask[0], 3); // 2^0 + 2^1 = 1 + 2 = 3
assert_eq!(got_args[0].maxnode, 2); // highest node ID (1) + 1 = 2
Since maxnode is calculated as (num_u64s * u64_bits) as u64 above, the test may be incorrect.
Thank you for the thorough code review. On a machine with two NUMA nodes, I have verified that the tests pass as follows:
let highest_node = node_ids.iter().max().copied().unwrap_or(0) as usize;

// Calculate how many c_ulong values we need to store the bitmask
let bits_per_ulong = std::mem::size_of::<libc::c_ulong>() * 8;
c_ulong may have a different bit width depending on the platform, so it is treated as libc::c_ulong rather than u64.
saku3
left a comment
Thanks for updating the changes. I really appreciate your thoughtful handling. I left a few additional notes—when you have time, please check them.
InvalidNodes(String),

#[error("Invalid memory policy mode: {0}")]
InvalidMode(String),
Is this flag used anywhere?
use crate::namespaces::Namespaces;
use crate::process::args::{ContainerArgs, ContainerType};
use crate::process::channel;
use crate::process::memory_policy;
Running just lint shows the following:
- use crate::process::memory_policy;
+ use crate::process::{channel, memory_policy};
@@ -0,0 +1,514 @@
use oci_spec::runtime::MemoryPolicyFlagType;
Running just lint shows the following:
- use oci_spec::runtime::MemoryPolicyFlagType;
+ use oci_spec::runtime::{MemoryPolicyFlagType, MemoryPolicyModeType};
let nodes = match policy.nodes() {
None => {
return Err(MemoryPolicyError::InvalidNodes(format!(
"Mode {} requires non-empty node specification",
&& !numa_maps_indicates_node0(first_line, policy_field)
{
eprintln!(
"expected interleave including node0, but got: {}",
Should this say "bind"?
fn numa_maps_indicates_node0(first_line: &str, policy_field: &str) -> bool {
policy_field.contains("bind:0")
|| policy_field.contains("prefer:0")
|| first_line.contains("N0=")
I think this line is unnecessary.
|| first_line.contains("N0=")
}

let policy = memory_policy.as_ref().unwrap();
if policy.mode() != expected {
It’s unnecessary because it always matches.
let expected_mode = memory_policy.as_ref().map(|p| p.mode());
let expected = expected_mode.unwrap();
let policy = memory_policy.as_ref().unwrap();
policy.mode()
Thank you, as always, for the thorough reviews. The corrections have been completed. Please check them.
Please wait a little while until oci-spec-rs is released 🙏
Please resolve the conflict.
@saku3
@saku3
Sorry for the delay.
saku3
left a comment
@n4mlz
Sorry for the delay. It's a small thing, but I've left some comments.
Please take a look 🙏
if has_static_flag && !policy_field.contains("static") {
eprintln!("expected bind static, but found: {}", policy_field);
}
if expected_nodes.as_ref().is_some_and(|v| v.contains(&0))
I’d like it to support checking values other than 0, too.
Example:
const EXPECT_NODE: u32 = 1;
if expected_nodes.as_ref().is_some_and(|v| v.contains(&EXPECT_NODE))
    && !numa_maps_expected_node(policy_field) // check for EXPECT_NODE
I know we'll specify only 0 on GitHub Actions, but I want to be able to specify 1, 0-2, etc. locally and still run valid tests.
Since you've also implemented the parse_node_string function, I believe this is feasible. What do you think?
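For illustration, the parsing could look roughly like this sketch (the PR's actual parse_node_string may differ):

// Parse a node list string such as "1" or "0-2,4" into a sorted list of node IDs.
fn parse_node_string(s: &str) -> Result<Vec<u32>, String> {
    let mut nodes = Vec::new();
    for part in s.split(',').map(str::trim).filter(|p| !p.is_empty()) {
        match part.split_once('-') {
            Some((start, end)) => {
                let start: u32 = start.trim().parse().map_err(|e| format!("{part}: {e}"))?;
                let end: u32 = end.trim().parse().map_err(|e| format!("{part}: {e}"))?;
                if start > end {
                    return Err(format!("invalid range: {part}"));
                }
                nodes.extend(start..=end);
            }
            None => nodes.push(part.parse().map_err(|e| format!("{part}: {e}"))?),
        }
    }
    nodes.sort_unstable();
    nodes.dedup();
    Ok(nodes)
}

fn main() {
    assert_eq!(parse_node_string("1,0-2").unwrap(), vec![0, 1, 2]);
}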
Thanks for the suggestion.
Supporting nodes other than 0 makes sense. However, I realized that the current test logic itself wasn’t actually correct; even the original implementation didn’t properly handle the intended semantics when MPOL_F_RELATIVE_NODES is used.
When this flag is set, the nodemask is interpreted relative to the process’s cpuset, not as absolute physical node IDs.
So a nodemask containing 0 doesn’t necessarily mean physical node 0 — it could mean “the first node in Mems_allowed_list”.
In that case, /proc/self/numa_maps will legitimately show a different node number, so our current assumption is not correct.
References:
- set_mempolicy(2): MPOL_F_RELATIVE_NODES / MPOL_F_STATIC_NODES
- Kernel docs: NUMA Memory Policy (flags section)
I’ll update the test so that it maps relative nodemasks to the actual physical nodes (using Mems_allowed_list from /proc/self/status) before comparing them with /proc/self/numa_maps.
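Roughly, that translation would look like the sketch below (the procfs paths are real; the helper and the index-based mapping are illustrative):

use std::fs;

// Read the cpuset's allowed memory nodes, e.g. "Mems_allowed_list:\t0-1",
// from /proc/self/status and expand the list into physical node IDs.
fn mems_allowed_list() -> Option<Vec<u32>> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    let line = status.lines().find(|l| l.starts_with("Mems_allowed_list:"))?;
    let list = line.split(':').nth(1)?.trim();
    let mut nodes = Vec::new();
    for part in list.split(',') {
        match part.split_once('-') {
            Some((a, b)) => nodes.extend(a.trim().parse::<u32>().ok()?..=b.trim().parse::<u32>().ok()?),
            None => nodes.push(part.trim().parse().ok()?),
        }
    }
    Some(nodes)
}

fn main() {
    // With MPOL_F_RELATIVE_NODES, relative index i maps to the i-th node in
    // Mems_allowed_list, so relative node 0 is allowed[0], not necessarily physical node 0.
    if let Some(allowed) = mems_allowed_list() {
        let relative_index = 0usize;
        if let Some(physical) = allowed.get(relative_index) {
            println!("relative node {relative_index} -> physical node {physical}");
        }
    }
}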
run: sudo add-apt-repository -y ppa:criu/ppa
- name: Install requirements
run: sudo env PATH=$PATH just ci-prepare
- name: Install runc 1.3.2
I’m sorry for the trouble.
validate-contest-runc is intended to verify the validity of newly added integration tests by running them with runc.
Thinking ahead to other integration tests we might add, I believe it’s best to use the stable version here.
If the memory policy test fails, we can skip it.
Please refer to the following.
}

fn expected_nodes_from_spec(spec: &Spec) -> Option<Vec<u32>> {
fn mems_allowed_list() -> Option<Vec<u32>> {
I’ve added logic to handle the MPOL_F_RELATIVE_NODES case by reading Mems_allowed_list from /proc/self/status and comparing the translated physical nodes against /proc/self/numa_maps.
This makes the test technically correct, but the logic became quite large and complex.
If you think this level of validation is too heavy for an integration test, I can remove or trim it down. What do you think?
In an environment with two NUMA nodes, I confirmed that the tests pass as follows:
$ cat /sys/devices/system/node/online
0-1
$ uname -a
Linux kernelhack 6.1.0-37-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.140-1 (2025-05-22) x86_64 GNU/Linux
$ just test-contest memory_policy
/home/n4mlz/youki/scripts/build.sh -o /home/n4mlz/youki -r -c youki
* FEATURES: <default>
* TARGET: x86_64-unknown-linux-gnu
Finished `release` profile [optimized] target(s) in 0.29s
/home/n4mlz/youki/scripts/build.sh -o /home/n4mlz/youki -r -c contest
* FEATURES: <default>
* TARGET: x86_64-unknown-linux-gnu
removed '/home/n4mlz/youki/contest'
Finished `release` profile [optimized] target(s) in 0.28s
removed '/home/n4mlz/youki/runtimetest'
Finished `release` profile [optimized] target(s) in 0.27s
sudo /home/n4mlz/youki/scripts/contest.sh /home/n4mlz/youki/youki memory_policy
Validation successful for runtime /home/n4mlz/youki/youki
$
Thank you for the thoughtful feedback.
When MPOL_F_RELATIVE_NODES or MPOL_F_STATIC_NODES is specified, I think it’s sufficient to verify that relative or static appears in /proc/self/numa_maps.
If we design tests to verify that, when multiple nodes are specified, the nodes are honored exactly as specified, it becomes quite complex—not only due to MPOL_F_RELATIVE_NODES but also features like MPOL_WEIGHTED_INTERLEAVE. I don’t think we need to account for that level of complexity.
For multiple nodes, I think we can stick to the simplest case. For example:
"mode": "MPOL_BIND",
"nodes": "0-1"
If we can verify that this is reflected in /proc/self/numa_maps, that should be sufficient to ensure Youki's handling of "multiple nodes passed to the syscall" works correctly for other modes and flags as well.
Rather than testing every (node (one-or-multiple) × mode × flag) combination, we’ll split coverage by dimension:
- Provide test cases for multiple nodes.
- Provide test cases for each mode.
- Provide test cases for each flag.
This gives us confidence in the behavior without enumerating all cross-product combinations.
So as long as the test covers a simple case where multiple nodes work correctly, there’s no problem.
If we have enough capacity, it might be good to add the following test cases as well:
{ "mode": "MPOL_LOCAL" }
{ "mode": "MPOL_WEIGHTED_INTERLEAVE", "nodes": "0" }
{ "mode": "MPOL_PREFERRED_MANY", "nodes": "0" }
{ "mode": "MPOL_BIND", "nodes": "0", "flags": ["MPOL_F_NUMA_BALANCING"] }Signed-off-by: n4mlz <[email protected]>
Signed-off-by: n4mlz <[email protected]>
@n4mlz
Please resolve the conflicts.
Overall, I think it's mostly fine from my side.
Thank you for patiently working through all the changes.
@utam0k
utam0k
left a comment
First review. I'll take a look another day 🙏
crates/libcgroups/Cargo.toml (outdated)
[package]
name = "libcgroups"
- version = "0.5.7" # MARK: Version
+ version = "0.5.7" # MARK: Version
I'm unsure about the reason for this change.
const MPOL_DEFAULT: i32 = 0;
const MPOL_PREFERRED: i32 = 1;
const MPOL_BIND: i32 = 2;
const MPOL_INTERLEAVE: i32 = 3;
const MPOL_LOCAL: i32 = 4;
const MPOL_PREFERRED_MANY: i32 = 5;
const MPOL_WEIGHTED_INTERLEAVE: i32 = 6;

// Memory policy flag constants from Linux UAPI
const MPOL_F_NUMA_BALANCING: u32 = 1 << 13; // 0x2000
const MPOL_F_RELATIVE_NODES: u32 = 1 << 14; // 0x4000
const MPOL_F_STATIC_NODES: u32 = 1 << 15; // 0x8000
An enum would be better suited for these two groups of constants.
let base_mode = match policy.mode() {
MemoryPolicyModeType::MpolDefault => MPOL_DEFAULT,
MemoryPolicyModeType::MpolPreferred => MPOL_PREFERRED,
MemoryPolicyModeType::MpolBind => MPOL_BIND,
MemoryPolicyModeType::MpolInterleave => MPOL_INTERLEAVE,
MemoryPolicyModeType::MpolLocal => MPOL_LOCAL,
MemoryPolicyModeType::MpolPreferredMany => MPOL_PREFERRED_MANY,
MemoryPolicyModeType::MpolWeightedInterleave => MPOL_WEIGHTED_INTERLEAVE,
};
You can use the From trait.
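For instance, a sketch of how this could look with a dedicated repr(i32) enum and a From impl; the kernel values follow linux/mempolicy.h, and it assumes the seven MemoryPolicyModeType variants quoted above are the full set:

use oci_spec::runtime::MemoryPolicyModeType;

// Kernel memory-policy modes from linux/mempolicy.h as an enum instead of loose consts.
#[repr(i32)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum MpolMode {
    Default = 0,
    Preferred = 1,
    Bind = 2,
    Interleave = 3,
    Local = 4,
    PreferredMany = 5,
    WeightedInterleave = 6,
}

impl From<MemoryPolicyModeType> for MpolMode {
    fn from(mode: MemoryPolicyModeType) -> Self {
        match mode {
            MemoryPolicyModeType::MpolDefault => MpolMode::Default,
            MemoryPolicyModeType::MpolPreferred => MpolMode::Preferred,
            MemoryPolicyModeType::MpolBind => MpolMode::Bind,
            MemoryPolicyModeType::MpolInterleave => MpolMode::Interleave,
            MemoryPolicyModeType::MpolLocal => MpolMode::Local,
            MemoryPolicyModeType::MpolPreferredMany => MpolMode::PreferredMany,
            MemoryPolicyModeType::MpolWeightedInterleave => MpolMode::WeightedInterleave,
        }
    }
}

fn main() {
    let base_mode = MpolMode::from(MemoryPolicyModeType::MpolBind) as i32;
    assert_eq!(base_mode, 2); // MPOL_BIND
}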
let mut has_static = false;
let mut has_relative = false;
if let Some(flags) = policy.flags() {
for flag in flags.iter() {
- let mut has_static = false;
- let mut has_relative = false;
- if let Some(flags) = policy.flags() {
- for flag in flags.iter() {
+ if let Some(flags) = policy.flags() {
+ let mut has_static = false;
+ let mut has_relative = false;
+ for flag in flags.iter() {
MemoryPolicyFlagType::MpolFRelativeNodes => {
has_relative = true;
}
MemoryPolicyFlagType::MpolFStaticNodes => {
has_static = true;
}
Using flags.contains(...) makes this logic simpler.
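A sketch of that, assuming MemoryPolicyFlagType implements PartialEq:

use oci_spec::runtime::MemoryPolicyFlagType;

// Replace the manual match-and-set loop with direct membership checks on the flags slice.
fn static_relative_flags(flags: &[MemoryPolicyFlagType]) -> (bool, bool) {
    let has_static = flags.contains(&MemoryPolicyFlagType::MpolFStaticNodes);
    let has_relative = flags.contains(&MemoryPolicyFlagType::MpolFRelativeNodes);
    (has_static, has_relative)
}

fn main() {
    let flags = vec![MemoryPolicyFlagType::MpolFStaticNodes];
    assert_eq!(static_relative_flags(&flags), (true, false));
}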
Description
This PR adds support for the new linux.memoryPolicy field introduced in the OCI Runtime Spec (opencontainers/runtime-spec#1282). With this change, youki can configure NUMA memory policies for containers by calling set_mempolicy(2) during container initialization.
- Mapping of MemoryPolicyModeType and MemoryPolicyFlagType from the spec to the corresponding Linux mempolicy.h constants.
- Validation of mode/flag/node combinations:
  - MPOL_DEFAULT / MPOL_LOCAL require empty nodes and no flags.
  - MPOL_PREFERRED supports empty nodes (local allocation) and forbids STATIC_NODES / RELATIVE_NODES in that case.
  - BIND / INTERLEAVE / PREFERRED_MANY / WEIGHTED_INTERLEAVE require non-empty nodes.
  - MPOL_F_NUMA_BALANCING is only valid with BIND.
  - STATIC_NODES and RELATIVE_NODES are mutually exclusive.
- Parsing of node list strings (e.g. "0,1-3") and conversion to the bitmask representation required by set_mempolicy.
- A wrapper for set_mempolicy, integrated into container setup.
Type of Change
Testing
Related Issues
Fixes #3215
Related to opencontainers/runtime-spec#1282
Additional Context
This implementation depends on the new memory policy types recently added to oci-spec-rs (see youki-dev/oci-spec-rs#290). Although that PR has been merged, no new crate version has been published yet. As a result:
- building and testing currently require the git revision of oci-spec-rs containing the new definitions;
- CI can only run against crates.io once a new oci-spec release is published.
Once oci-spec is released with memory policy support, the dependency in Cargo.toml can be updated and this PR will build and run correctly.