Conversation

@n4mlz
Contributor

@n4mlz n4mlz commented Aug 27, 2025

Description

This PR adds support for the new linux.memoryPolicy field introduced in the OCI Runtime Spec (opencontainers/runtime-spec#1282).
With this change, youki can configure NUMA memory policies for containers by calling set_mempolicy(2) during container initialization.

  • Implemented mapping of MemoryPolicyModeType and MemoryPolicyFlagType from the spec to the corresponding Linux mempolicy.h constants.
  • Added validation logic to ensure mode/flag/node combinations conform to the spec and kernel semantics:
    • MPOL_DEFAULT / MPOL_LOCAL require empty nodes and no flags.
    • MPOL_PREFERRED supports empty nodes (local allocation) and forbids STATIC_NODES/RELATIVE_NODES in that case.
    • BIND / INTERLEAVE / PREFERRED_MANY / WEIGHTED_INTERLEAVE require non-empty nodes.
    • MPOL_F_NUMA_BALANCING is only valid with BIND.
    • STATIC_NODES and RELATIVE_NODES are mutually exclusive.
  • Implemented a parser for nodemask strings ("0,1-3") and conversion to the bitmask representation required by set_mempolicy.
  • Added syscall abstraction for set_mempolicy and integrated it into container setup.
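The nodemask handling described above can be sketched as follows. This is an illustrative standalone sketch, not youki's actual implementation; the function names `parse_node_string` and `build_nodemask` mirror the PR's naming but the bodies here are assumptions (64-bit words for simplicity):

```rust
// Parse a nodemask string like "0,1-3" into a list of node IDs (illustrative).
fn parse_node_string(s: &str) -> Result<Vec<u32>, String> {
    let mut nodes = Vec::new();
    for part in s.split(',').map(str::trim).filter(|p| !p.is_empty()) {
        match part.split_once('-') {
            Some((lo, hi)) => {
                let lo: u32 = lo.trim().parse().map_err(|_| format!("bad range: {part}"))?;
                let hi: u32 = hi.trim().parse().map_err(|_| format!("bad range: {part}"))?;
                if lo > hi {
                    return Err(format!("bad range: {part}"));
                }
                nodes.extend(lo..=hi);
            }
            None => nodes.push(part.parse().map_err(|_| format!("bad node: {part}"))?),
        }
    }
    Ok(nodes)
}

// Convert node IDs into the word-array bitmask representation that
// set_mempolicy(2) expects (bit n set = node n).
fn build_nodemask(nodes: &[u32]) -> Vec<u64> {
    let bits = 64;
    let len = nodes.iter().max().map_or(0, |&m| m as usize / bits + 1);
    let mut mask = vec![0u64; len];
    for &n in nodes {
        mask[n as usize / bits] |= 1u64 << (n as usize % bits);
    }
    mask
}

fn main() {
    assert_eq!(parse_node_string("0,1-3").unwrap(), vec![0, 1, 2, 3]);
    // nodes 0..=3 set the low four bits of the first word
    assert_eq!(build_nodemask(&[0, 1, 2, 3]), vec![0b1111]);
}
```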

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test updates
  • CI/CD related changes
  • Other (please describe):

Testing

  • Added new unit tests
  • Added new integration tests
  • Ran existing test suite
  • Tested manually (please provide steps)

Related Issues

Fixes #3215
Related to opencontainers/runtime-spec#1282

Additional Context

This implementation depends on the new memory policy types recently added to oci-spec-rs (see youki-dev/oci-spec-rs#290).
Although that PR has been merged, no new crate version has been published yet.
As a result:

  • Locally, this branch was tested against a manually cloned version of oci-spec-rs containing the new definitions.
  • CI builds against the released crate version will likely fail until a new oci-spec release is published.

Once oci-spec is released with memory policy support, the dependency in Cargo.toml can be updated and this PR will build and run correctly.

@n4mlz n4mlz changed the title Implement set_mempolicy syscall and related error handling Implement Linux memory policy Aug 27, 2025
Member

@saku3 saku3 left a comment

Thanks for the PR!

Would it be possible to add an integration test similar to runc’s?
https://github.com/opencontainers/runc/pull/4726/files#diff-eab28bb7cc15c949e829d02adfa5ad5869d5cea2dde63d0c2c5f4c52225967e6

In youki, we implement integration tests here:
https://github.com/youki-dev/youki/tree/main/tests/contest


As noted in the description, nodes can be empty in some cases. The runtime-spec has also been updated:
https://github.com/opencontainers/runtime-spec/pull/1294/files

Could you update oci-spec-rs accordingly as well?

const MPOL_F_STATIC_NODES: u32 = 1 << 15; // 0x8000

let base_mode = match policy.mode() {
oci_spec::runtime::MemoryPolicyModeType::MpolDefault => MPOL_DEFAULT,
Member

import them like this:

use oci_spec::runtime::{ /* other */, LinuxMemoryPolicy, MemoryPolicyFlagType, MemoryPolicyModeType };

Please refer to line 11.

let highest_node = node_ids.iter().max().copied().unwrap_or(0) as usize;

// Calculate maxnode as highest_node + 1 (number of nodes to consider)
let maxnode = (highest_node + 1) as u64;
Member

Could you explain the rationale for how maxnode is calculated?

When I run an example like the one in runc’s test code, youki fails with an error. I suspect the cause is a difference in how maxnode is implemented, but I don’t fully understand it yet.

https://github.com/opencontainers/runc/pull/4726/files#diff-1c1e2ce4c942d6847ab0885e1ff1b92c28f8502fc536cc66e33493a8cbe13109R172

Contributor Author

In set_mempolicy(2), nodemask points to a bit mask of node IDs containing up to maxnode bits, and the kernel uses only bits up to maxnode. A maxnode value of zero specifies the empty set of nodes (in which case nodemask is ignored).
Therefore, to make the highest set bit in nodemask visible to the kernel, maxnode should be one greater than the highest node ID we want to include.

The failure you encountered is likely not caused by the maxnode calculation itself. Right now, oci-spec-rs has already merged support for Linux memory policy (including mode/nodes/flags), but no new crate release has been published yet. If your local environment is still pointing to the crates.io release instead of the git revision with that PR, you may hit errors when parsing or building the spec. That may explain the discrepancy you observed when comparing with runc’s tests.

Member

Sorry for the long message.
Please let me know if my analysis is incorrect.
If it’s working on your local environment, then something might be wrong with mine.

I first confirmed the machine has two NUMA nodes and the kernel version:

cat /sys/devices/system/node/online
0-1

uname -a
Linux ip-172-31-11-219 6.14.0-1012-aws #12~24.04.1-Ubuntu SMP Fri Aug 15 00:16:05 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

I also switched oci-spec to a branch that includes memoryPolicy support:

[dependencies]
- oci-spec = { version = "0.8.2", features = ["runtime"] }
+ oci-spec = { git = "https://github.com/n4mlz/oci-spec-rs.git", branch = "fix/mempolicy", features = ["runtime"] }

Then I ran the contest tests:

failed

just test-contest memory_policy
/home/ubuntu/workspace/youki/scripts/build.sh -o /home/ubuntu/workspace/youki -r -c youki
* FEATURES: <default>
* TARGET: x86_64-unknown-linux-gnu
   Compiling libcontainer v0.5.5 (/home/ubuntu/workspace/youki/crates/libcontainer)
   Compiling youki v0.5.5 (/home/ubuntu/workspace/youki/crates/youki)
    Finished `release` profile [optimized] target(s) in 52.06s
/home/ubuntu/workspace/youki/scripts/build.sh -o /home/ubuntu/workspace/youki -r -c contest
* FEATURES: <default>
* TARGET: x86_64-unknown-linux-gnu
removed '/home/ubuntu/workspace/youki/contest'
   Compiling contest v0.1.0 (/home/ubuntu/workspace/youki/tests/contest/contest)
    Finished `release` profile [optimized] target(s) in 33.22s
removed '/home/ubuntu/workspace/youki/runtimetest'
    Finished `release` profile [optimized] target(s) in 0.12s
sudo /home/ubuntu/workspace/youki/scripts/contest.sh /home/ubuntu/workspace/youki/youki memory_policy
# Start group memory_policy
1 / 10 : bind_static : not ok
        container stderr was not empty, found : ERROR libcontainer::process::init::process: failed to set memory policy err=Nix(EINVAL)
ERROR libcontainer::process::container_intermediate_process: failed to initialize container process: failed syscall
ERROR libcontainer::process::container_main_process: failed to wait for init ready: exec process failed with error error in executing process : failed syscall
ERROR libcontainer::container::builder_impl: failed to run container process exec process failed with error error in executing process : failed syscall
ERROR youki: error in executing command: failed to create container: exec process failed with error error in executing process : failed syscall
error in executing command: failed to create container: exec process failed with error error in executing process : failed syscall
Error: failed to create container: exec process failed with error error in executing process : failed syscall

2 / 10 : bind_way_too_large_node_number : ok
3 / 10 : default_with_missing_nodes_ok : ok
4 / 10 : empty_memory_policy : ok
5 / 10 : interleave_without_flags : not ok
        container stderr was not empty, found : ERROR libcontainer::process::init::process: failed to set memory policy err=Nix(EINVAL)
ERROR libcontainer::process::container_intermediate_process: failed to initialize container process: failed syscall
ERROR libcontainer::process::container_main_process: failed to wait for init ready: exec process failed with error error in executing process : failed syscall
ERROR libcontainer::container::builder_impl: failed to run container process exec process failed with error error in executing process : failed syscall
ERROR youki: error in executing command: failed to create container: exec process failed with error error in executing process : failed syscall
error in executing command: failed to create container: exec process failed with error error in executing process : failed syscall
Error: failed to create container: exec process failed with error error in executing process : failed syscall

6 / 10 : invalid_flag_string : ok
7 / 10 : invalid_mode_string : ok
8 / 10 : missing_mode_but_nodes_present : ok
9 / 10 : preferred_relative : not ok
        container stderr was not empty, found : ERROR libcontainer::process::init::process: failed to set memory policy (MPOL_PREFERRED) err=Nix(EINVAL)
ERROR libcontainer::process::container_intermediate_process: failed to initialize container process: failed syscall
ERROR libcontainer::process::container_main_process: failed to wait for init ready: exec process failed with error error in executing process : failed syscall
ERROR libcontainer::container::builder_impl: failed to run container process exec process failed with error error in executing process : failed syscall
ERROR youki: error in executing command: failed to create container: exec process failed with error error in executing process : failed syscall
error in executing command: failed to create container: exec process failed with error error in executing process : failed syscall
Error: failed to create container: exec process failed with error error in executing process : failed syscall

10 / 10 : syscall_invalid_arguments : ok
# End group memory_policy

error: Recipe `test-contest` failed on line 62 with exit code 1

A simple manual run with the following OCI config also fails:

"linux": {
  "memoryPolicy" : {
    "mode": "MPOL_INTERLEAVE",
    "nodes": "1"
  }
}
sudo strace -f -tt -s 128 -e trace=set_mempolicy -o /tmp/youki.strace \
  ./youki run -b tutorial/ container

ERROR libcontainer::process::init::process: failed to set memory policy err=Nix(EINVAL)
ERROR libcontainer::process::container_intermediate_process: failed to initialize container process: failed syscall
ERROR libcontainer::process::container_main_process: failed to wait for init ready: exec process failed with error error in executing process : failed syscall
ERROR libcontainer::container::builder_impl: failed to run container process exec process failed with error error in executing process : failed syscall
 INFO libcgroups::common: cgroup manager V2 will be used
DEBUG libcgroups::v2::manager: remove cgroup "/sys/fs/cgroup/:youki:container"
ERROR youki: error in executing command: failed to create container: exec process failed with error error in executing process : failed syscall
run failed : failed to create container: exec process failed with error error in executing process : failed syscall

trace result

50065 06:03:37.212423 set_mempolicy(MPOL_INTERLEAVE, [0x00000000000002], 2) = -1 EINVAL (Invalid argument)
50064 06:03:37.212710 +++ exited with 0 +++
50063 06:03:37.212746 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=50064, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
50065 06:03:37.213411 +++ exited with 255 +++
50063 06:03:37.213466 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=50065, si_uid=0, si_status=255, si_utime=0, si_stime=0} ---
50063 06:03:37.221851 +++ exited with 255 +++

The root cause is explained by the Linux source code:

  • set_mempolicy
  • kernel_set_mempolicy
  • get_nodes
  • get_bitmap
  • do_set_mempolicy
  • mpol_new

See mm/mempolicy.c around here: https://github.com/torvalds/linux/blob/v6.14/mm/mempolicy.c

Why set_mempolicy(MPOL_INTERLEAVE, [0x00000000000002], 2) fails:

In the kernel’s get_nodes(), maxnode is decremented first (--maxnode).
With maxnode=2 from userspace, it becomes 1 inside the kernel.

Then get_bitmap() applies a tail mask to keep only the lowest maxnode bits:

mask[nlongs - 1] &= (1UL << (maxnode % BITS_PER_LONG)) - 1;

With maxnode=1 and BITS_PER_LONG=64, the mask is 0x1.

The nodemask word is 0x2 (bit 1 set = node 1).
0x2 & 0x1 = 0x0, so the effective nodemask becomes empty.

MPOL_INTERLEAVE does not allow an empty nodemask, so the kernel returns -EINVAL.

mpol_new

} else if (nodes_empty(*nodes))
	return ERR_PTR(-EINVAL);
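The masking arithmetic above can be reproduced in a few lines. This is a standalone illustration of the kernel's get_nodes()/get_bitmap() behavior, not youki or kernel code:

```rust
fn main() {
    const BITS_PER_LONG: u64 = 64;
    let nodemask: u64 = 0x2; // bit 1 set = node 1
    let maxnode_user: u64 = 2; // what youki passed to set_mempolicy(2)

    // Kernel's get_nodes(): --maxnode, then keep only the lowest maxnode bits.
    let maxnode = maxnode_user - 1; // becomes 1 inside the kernel
    let tail_mask = (1u64 << (maxnode % BITS_PER_LONG)) - 1; // 0x1
    let effective = nodemask & tail_mask; // 0x2 & 0x1 = 0x0

    // Empty effective nodemask -> mpol_new() returns -EINVAL for MPOL_INTERLEAVE.
    assert_eq!(effective, 0);
}
```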

runc

With the above in mind, running runc’s code related to memory_policy succeeds: opencontainers/runc#4726

sudo strace -f -tt -s 128 -e trace=set_mempolicy -o /tmp/youki.strace runc run -b tutorial/ container

strace

38296 06:57:27.319222 set_mempolicy(MPOL_INTERLEAVE, [0x00000000000002], 64) = 0

Therefore, I think it’s preferable to implement the maxnode specification the same way as runc.
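The runc-style sizing can be sketched like this: maxnode spans every bit of the words actually passed, so the kernel's internal `--maxnode` cannot clip the highest set bit. A hypothetical sketch assuming a 64-bit c_ulong, not youki's actual code:

```rust
// Build the nodemask and a runc-style maxnode that covers the full width of
// the words passed to set_mempolicy(2).
fn nodemask_and_maxnode(nodes: &[u32]) -> (Vec<u64>, u64) {
    const BITS: usize = 64; // assumes 64-bit c_ulong
    let highest = nodes.iter().max().copied().unwrap_or(0) as usize;
    let nlongs = highest / BITS + 1;
    let mut mask = vec![0u64; nlongs];
    for &n in nodes {
        mask[n as usize / BITS] |= 1u64 << (n as usize % BITS);
    }
    // maxnode = total bits in the words we pass, not highest_node + 1.
    (mask, (nlongs * BITS) as u64)
}

fn main() {
    // node 1 -> mask 0x2 with maxnode 64, matching runc's strace above.
    let (mask, maxnode) = nodemask_and_maxnode(&[1]);
    assert_eq!(mask, vec![0x2]);
    assert_eq!(maxnode, 64);
}
```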

numactl

strace -f -tt -s 128 -e trace=set_mempolicy -o /tmp/numa.strace numactl --interleave=1 sh
cat /tmp/numa.strace
38676 06:59:11.357283 set_mempolicy(MPOL_INTERLEAVE, [0x00000000000002, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000, 0000000000000000], 1025) = 0
38676 06:59:14.501691 +++ exited with 0 +++

For your reference, when I run numactl, it uses a much larger value for maxnode.

Contributor Author

@saku3
Thank you very much for this thorough response!
I realize now that my initial verification was insufficient. When I set up a virtual machine with two NUMA nodes and retested, I encountered exactly the same error you pointed out.

As you've correctly noted, the maxnode parameter in set_mempolicy(2) requires the total number of bits in the nodemask array. I've made the required modifications to properly prepare the nodemask in the same format as runc. Could you please verify this again?

cf: In my environment, the test results looked like this:

$ cat /sys/devices/system/node/online
0-1

$ uname -a
Linux kernelhack 6.1.0-37-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.140-1 (2025-05-22) x86_64 GNU/Linux

$ just test-contest memory_policy
/home/n4mlz/youki/scripts/build.sh -o /home/n4mlz/youki -r -c youki
* FEATURES: <default>
* TARGET: x86_64-unknown-linux-gnu
    Finished `release` profile [optimized] target(s) in 0.30s
/home/n4mlz/youki/scripts/build.sh -o /home/n4mlz/youki -r -c contest
* FEATURES: <default>
* TARGET: x86_64-unknown-linux-gnu
removed '/home/n4mlz/youki/contest'
    Finished `release` profile [optimized] target(s) in 0.30s
removed '/home/n4mlz/youki/runtimetest'
    Finished `release` profile [optimized] target(s) in 0.27s
sudo /home/n4mlz/youki/scripts/contest.sh /home/n4mlz/youki/youki memory_policy
Validation successful for runtime /home/n4mlz/youki/youki

@n4mlz
Contributor Author

n4mlz commented Sep 3, 2025

Since the oci-spec-rs release that includes support for Linux memory policy hasn't been published yet, I couldn't run the tests in CI. However, I have added integration tests. By cloning the branch containing these changes in oci-spec-rs, editing youki's Cargo.toml to point to the local crate, and running the tests locally, I verified that they pass.

@n4mlz n4mlz requested a review from saku3 September 3, 2025 10:28
Ok(())
}

fn setup_memory_policy(
Member

(Before we proceed with the overall review.)
For maintainability, please consider moving setup_memory_policy, build_nodemask, and parse_node_string into a separate file like memorypolicy.rs.

@n4mlz n4mlz requested a review from saku3 September 7, 2025 12:43
Member

@saku3 saku3 left a comment

I’ve reviewed it, and the overall structure looks good.
I left comments on the finer details—when you have a moment, please consider addressing them. If anything is unclear, feel free to ask anytime.

memory_policy: &Option<oci_spec::runtime::LinuxMemoryPolicy>,
syscall: &dyn Syscall,
) -> Result<()> {
if let Some(policy) = memory_policy {
Member

How about using early returns? The if nesting is deep.


for flag in flags {
match flag {
oci_spec::runtime::MemoryPolicyFlagType::MpolFNumaBalancing => {
Member

Please add a use import so MemoryPolicyFlagType can be referenced directly.

}
flags_value |= MPOL_F_NUMA_BALANCING;
}
oci_spec::runtime::MemoryPolicyFlagType::MpolFRelativeNodes => {
Member

Same here MemoryPolicyFlagType

has_relative = true;
flags_value |= MPOL_F_RELATIVE_NODES;
}
oci_spec::runtime::MemoryPolicyFlagType::MpolFStaticNodes => {
Member

Same here MemoryPolicyFlagType

"memory_policy_default" => {
tests::validate_memory_policy(&spec, Some(MemoryPolicyModeType::MpolDefault))
}
"memory_policy_empty" => tests::validate_memory_policy(&spec, None),
Member

I think the memory_policy_empty test case is checking the same thing as hello_world, so it’s fine to omit it.

MemoryPolicyModeType::MpolWeightedInterleave => MPOL_WEIGHTED_INTERLEAVE,
};

let mut flags_value: u32 = 0;
Member

Personally, I think it would be easier to read if the validation and the construction of flags_value were separated.

let mode_with_flags = base_mode | (flags_value as i32);

// Handle special cases for MPOL_DEFAULT and MPOL_LOCAL
match base_mode {
Member

Personally, I think it would be clearer to separate validation from the application of set_mempolicy.


// Handle special cases for MPOL_DEFAULT and MPOL_LOCAL
match base_mode {
MPOL_DEFAULT => {
Member

Can the validation for MPOL_DEFAULT and MPOL_LOCAL be consolidated?

// Double-check: if the parsed result is empty (e.g., whitespace-only input),
// treat as empty nodemask
if maxnode == 0 {
if flags_value & (MPOL_F_RELATIVE_NODES | MPOL_F_STATIC_NODES) != 0 {
Member

memo: https://docs.kernel.org/admin-guide/mm/numa_memory_policy.html

It is possible for the user to specify that local allocation is always preferred by passing an empty nodemask with this mode. If an empty nodemask is passed, the policy cannot use the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags described below.

assert_eq!(got_args[0].mode, 2); // MPOL_BIND (corrected value)
assert_eq!(got_args[0].nodemask.len(), 1); // One c_ulong needed
assert_eq!(got_args[0].nodemask[0], 3); // 2^0 + 2^1 = 1 + 2 = 3
assert_eq!(got_args[0].maxnode, 2); // highest node ID (1) + 1 = 2
Member

Since maxnode is calculated as (num_u64s * u64_bits) as u64 above, the test may be incorrect.

@n4mlz
Contributor Author

n4mlz commented Sep 19, 2025

Thank you for the thorough code review.
Sorry for the delay!
I've made all the requested changes. Please take a look.

On a machine with two NUMA nodes, I have verified that the tests pass as follows:

$ cat /sys/devices/system/node/online
0-1

$ just test-contest memory_policy
/home/n4mlz/youki/scripts/build.sh -o /home/n4mlz/youki -r -c youki
* FEATURES: <default>
* TARGET: x86_64-unknown-linux-gnu
    Finished `release` profile [optimized] target(s) in 0.29s
/home/n4mlz/youki/scripts/build.sh -o /home/n4mlz/youki -r -c contest
* FEATURES: <default>
* TARGET: x86_64-unknown-linux-gnu
removed '/home/n4mlz/youki/contest'
    Finished `release` profile [optimized] target(s) in 0.30s
removed '/home/n4mlz/youki/runtimetest'
    Finished `release` profile [optimized] target(s) in 0.27s
sudo /home/n4mlz/youki/scripts/contest.sh /home/n4mlz/youki/youki memory_policy
Validation successful for runtime /home/n4mlz/youki/youki

$

let highest_node = node_ids.iter().max().copied().unwrap_or(0) as usize;

// Calculate how many c_ulong values we need to store the bitmask
let bits_per_ulong = std::mem::size_of::<libc::c_ulong>() * 8;
Contributor Author

c_ulong may have a different bit width depending on the platform, so the mask is built from libc::c_ulong rather than u64.

@n4mlz n4mlz requested a review from saku3 September 19, 2025 06:43
Member

@saku3 saku3 left a comment

Thanks for updating the changes. I really appreciate your thoughtful handling. I left a few additional notes—when you have time, please check them.

InvalidNodes(String),

#[error("Invalid memory policy mode: {0}")]
InvalidMode(String),
Member

Is this flag used anywhere?

use crate::namespaces::Namespaces;
use crate::process::args::{ContainerArgs, ContainerType};
use crate::process::channel;
use crate::process::memory_policy;
Member

Running just lint shows the following:

Suggested change
use crate::process::memory_policy;
use crate::process::{channel, memory_policy};

@@ -0,0 +1,514 @@
use oci_spec::runtime::MemoryPolicyFlagType;
Member

Running just lint shows the following:

Suggested change
use oci_spec::runtime::MemoryPolicyFlagType;
use oci_spec::runtime::{MemoryPolicyFlagType, MemoryPolicyModeType};

&& !numa_maps_indicates_node0(first_line, policy_field)
{
eprintln!(
"expected interleave including node0, but got: {}",
Member

bind?

fn numa_maps_indicates_node0(first_line: &str, policy_field: &str) -> bool {
policy_field.contains("bind:0")
|| policy_field.contains("prefer:0")
|| first_line.contains("N0=")
Member

I think this line is unnecessary.

|| first_line.contains("N0=")

}

let policy = memory_policy.as_ref().unwrap();
if policy.mode() != expected {
Member

It’s unnecessary because it always matches.

let expected_mode = memory_policy.as_ref().map(|p| p.mode());
let expected = expected_mode.unwrap();
let policy = memory_policy.as_ref().unwrap();
policy.mode()

@n4mlz
Contributor Author

n4mlz commented Sep 29, 2025

Thank you for your always thorough reviews. The corrections have been completed. Please check them.

@n4mlz n4mlz requested a review from saku3 September 29, 2025 15:01
@saku3
Member

saku3 commented Sep 30, 2025

Please wait a little while until oci-spec-rs is released🙏

@saku3
Member

saku3 commented Oct 7, 2025

Please resolve the conflict.

@n4mlz n4mlz mentioned this pull request Oct 17, 2025
13 tasks
@n4mlz n4mlz force-pushed the feat/set-mempolicy branch from 068fc6b to eb39c93 Compare October 17, 2025 05:21
@n4mlz
Contributor Author

n4mlz commented Oct 17, 2025

@saku3
I apologize for the delay🙏
When I updated the version of runc used in integration tests to one that supports Linux memory policy (v1.4.0-rc.2), it seems there was a change in how cpu.weight is calculated in cgroup v2, causing the CI to fail (cf: https://github.com/n4mlz/youki/actions/runs/18583292135/job/52982229747?pr=3#step:9:64).
I have created a PR (#3274) updating runc to the first version where this change occurred (v1.3.2), so I would appreciate it if you could merge that into main first.

@n4mlz
Contributor Author

n4mlz commented Nov 1, 2025

@saku3
Thank you for waiting. I have updated the version of runc for CI, confirmed that the CI completes successfully in my forked repository, and am ready to run the tests. I would appreciate it if you could check it.

@saku3
Member

saku3 commented Nov 2, 2025

Sorry for the delay.
I’ll review it in the coming days.

Member

@saku3 saku3 left a comment

@n4mlz
Sorry for the delay—
it’s a small thing, but I’ve left some comments.
Please take a look 🙏

if has_static_flag && !policy_field.contains("static") {
eprintln!("expected bind static, but found: {}", policy_field);
}
if expected_nodes.as_ref().is_some_and(|v| v.contains(&0))
Member

I’d like it to support checking values other than 0, too.

example image

const EXPECT_NODE: u32 = 1;
if expected_nodes.as_ref().is_some_and(|v| v.contains(&EXPECT_NODE))
    && !numa_maps_expected_node(policy_field) // check for EXPECT_NODE

I know we’ll specify only 0 on GitHub Actions, but I want to be able to specify 1,0-2, (or etc.) locally and still run valid tests.
Since you’ve also implemented functions parse_node_string, I believe it’s feasible—what do you think?

Contributor Author

Thanks for the suggestion.
Supporting nodes other than 0 makes sense. However, I realized that the current test logic itself wasn’t actually correct; even the original implementation didn’t properly handle the intended semantics when MPOL_F_RELATIVE_NODES is used.

When this flag is set, the nodemask is interpreted relative to the process’s cpuset, not as absolute physical node IDs.
So a nodemask containing 0 doesn’t necessarily mean physical node 0 — it could mean “the first node in Mems_allowed_list”.
In that case, /proc/self/numa_maps will legitimately show a different node number, so our current assumption is not correct.

Reference:

I’ll update the test so that it maps relative nodemasks to the actual physical nodes (using Mems_allowed_list from /proc/self/status) before comparing them with /proc/self/numa_maps.
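The translation described here can be sketched as follows. The helper names are hypothetical and the parsing is simplified; with MPOL_F_RELATIVE_NODES, bit i in the nodemask means "the i-th node in Mems_allowed_list", not physical node i:

```rust
// Parse a node list like "2,4-5" (the format of Mems_allowed_list in
// /proc/self/status) into physical node IDs.
fn parse_list(list: &str) -> Vec<u32> {
    list.trim()
        .split(',')
        .flat_map(|part| match part.split_once('-') {
            Some((lo, hi)) => {
                (lo.parse::<u32>().unwrap()..=hi.parse::<u32>().unwrap()).collect::<Vec<u32>>()
            }
            None => vec![part.parse::<u32>().unwrap()],
        })
        .collect()
}

// Map relative nodemask indices onto the allowed physical nodes.
fn relative_to_physical(relative: &[u32], mems_allowed: &[u32]) -> Vec<u32> {
    relative
        .iter()
        .filter_map(|&i| mems_allowed.get(i as usize).copied())
        .collect()
}

fn main() {
    let allowed = parse_list("2,4-5");
    assert_eq!(allowed, vec![2, 4, 5]);
    // relative index 0 maps to physical node 2, index 1 to node 4
    assert_eq!(relative_to_physical(&[0, 1], &allowed), vec![2, 4]);
}
```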

run: sudo add-apt-repository -y ppa:criu/ppa
- name: Install requirements
run: sudo env PATH=$PATH just ci-prepare
- name: Install runc 1.3.2
Member

I’m sorry for the trouble.
validate-contest-runc is intended to verify the validity of newly added integration tests by running them with runc.
Thinking ahead to other integration tests we might add, I believe it’s best to use the stable version here.
If the memory policy test fails, we can skip it.
Please refer to the following.

https://github.com/youki-dev/youki/blob/main/tests/contest/contest/src/tests/fd_control/mod.rs#L100

@n4mlz n4mlz force-pushed the feat/set-mempolicy branch 2 times, most recently from 82b64ca to ab4a91a Compare November 11, 2025 09:16
}

fn expected_nodes_from_spec(spec: &Spec) -> Option<Vec<u32>> {
fn mems_allowed_list() -> Option<Vec<u32>> {
Contributor Author

I’ve added logic to handle the MPOL_F_RELATIVE_NODES case by reading Mems_allowed_list from /proc/self/status and comparing the translated physical nodes against /proc/self/numa_maps.
This makes the test technically correct, but the logic became quite large and complex.
If you think this level of validation is too heavy for an integration test, I can remove or trim it down. What do you think?

Contributor Author

In an environment with two NUMA nodes, I confirmed that the tests pass as follows:

$ cat /sys/devices/system/node/online
0-1
$ uname -a
Linux kernelhack 6.1.0-37-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.140-1 (2025-05-22) x86_64 GNU/Linux
$ just test-contest memory_policy
/home/n4mlz/youki/scripts/build.sh -o /home/n4mlz/youki -r -c youki
* FEATURES: <default>
* TARGET: x86_64-unknown-linux-gnu
    Finished `release` profile [optimized] target(s) in 0.29s
/home/n4mlz/youki/scripts/build.sh -o /home/n4mlz/youki -r -c contest
* FEATURES: <default>
* TARGET: x86_64-unknown-linux-gnu
removed '/home/n4mlz/youki/contest'
    Finished `release` profile [optimized] target(s) in 0.28s
removed '/home/n4mlz/youki/runtimetest'
    Finished `release` profile [optimized] target(s) in 0.27s
sudo /home/n4mlz/youki/scripts/contest.sh /home/n4mlz/youki/youki memory_policy
Validation successful for runtime /home/n4mlz/youki/youki
$

Member

Thank you for the thoughtful feedback.

When MPOL_F_RELATIVE_NODES or MPOL_F_STATIC_NODES is specified, I think it’s sufficient to verify that relative or static appears in /proc/self/numa_maps.

If we design tests to verify that, when multiple nodes are specified, the nodes are honored exactly as specified, it becomes quite complex—not only due to MPOL_F_RELATIVE_NODES but also features like MPOL_WEIGHTED_INTERLEAVE. I don’t think we need to account for that level of complexity.

For multiple nodes, I think we can stick to the simplest case. For example:

"mode": "MPOL_BIND",
"nodes": "0-1"

If we can verify that this is reflected in /proc/self/numa_maps, that should be sufficient to ensure Youki’s handling of “multiple nodes passed to the syscall” works correctly for other modes and flags as well.

Rather than testing every combination of node count, mode, and flag, we'll split coverage by dimension:

  • Provide test cases for multiple nodes.
  • Provide test cases for each mode.
  • Provide test cases for each flag.

This gives us confidence in the behavior without enumerating all cross-product combinations.
So as long as the test covers a simple case where multiple nodes work correctly, there’s no problem.
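The check discussed above can be sketched as follows. This is a minimal illustration, not the PR's actual test code: given the contents of /proc/self/numa_maps, it confirms that some mapping carries the expected policy string (e.g. "bind:0-1" for the MPOL_BIND case above).

```rust
// Hypothetical helper: returns true if any mapping in the given
// /proc/self/numa_maps contents reports the expected memory policy.
fn policy_in_numa_maps(contents: &str, expected: &str) -> bool {
    // Each numa_maps line looks like "<addr> <policy> <attrs...>",
    // e.g. "7f2c4f200000 bind:0-1 file=/usr/bin/sleep mapped=2".
    contents
        .lines()
        .filter_map(|line| line.split_whitespace().nth(1))
        .any(|policy| policy.starts_with(expected))
}
```

Inside the container, the runtimetest binary would read the file with `std::fs::read_to_string("/proc/self/numa_maps")` and pass the contents in; matching on a prefix also covers the `static`/`relative` suffixes mentioned earlier.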

Member

If we have enough capacity, it might be good to add the following test cases as well:

{ "mode": "MPOL_LOCAL" }
{ "mode": "MPOL_WEIGHTED_INTERLEAVE", "nodes": "0" }
{ "mode": "MPOL_PREFERRED_MANY", "nodes": "0" }
{ "mode": "MPOL_BIND", "nodes": "0", "flags": ["MPOL_F_NUMA_BALANCING"] }

@n4mlz n4mlz requested a review from saku3 November 12, 2025 04:22
Member

@saku3 saku3 left a comment

@n4mlz
Please resolve the conflicts.

Overall, I think it's mostly fine from my side.
Thank you for patiently working through all the changes.

@saku3
Member

saku3 commented Nov 20, 2025

@utam0k
PTAL 🙏

@n4mlz n4mlz force-pushed the feat/set-mempolicy branch from 0225be5 to a7e0a1d on November 20, 2025 12:24
Member

@utam0k utam0k left a comment

First review. I'll take a look another day 🙏

[package]
name = "libcgroups"
version = "0.5.7" # MARK: Version
version = "0.5.7" # MARK: Version
Member

I'm unsure about the reason for this change.

Comment on lines 39 to 50
const MPOL_DEFAULT: i32 = 0;
const MPOL_PREFERRED: i32 = 1;
const MPOL_BIND: i32 = 2;
const MPOL_INTERLEAVE: i32 = 3;
const MPOL_LOCAL: i32 = 4;
const MPOL_PREFERRED_MANY: i32 = 5;
const MPOL_WEIGHTED_INTERLEAVE: i32 = 6;

// Memory policy flag constants from Linux UAPI
const MPOL_F_NUMA_BALANCING: u32 = 1 << 13; // 0x2000
const MPOL_F_RELATIVE_NODES: u32 = 1 << 14; // 0x4000
const MPOL_F_STATIC_NODES: u32 = 1 << 15; // 0x8000
Member

An enum is better suited for these two sets of constants.
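One way to read this suggestion (a sketch, with hypothetical names rather than the PR's final ones) is to carry the Linux UAPI values from mempolicy.h as explicit enum discriminants:

```rust
// Memory policy modes, with the kernel UAPI values as discriminants.
#[repr(i32)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[allow(dead_code)]
enum MpolMode {
    Default = 0,
    Preferred = 1,
    Bind = 2,
    Interleave = 3,
    Local = 4,
    PreferredMany = 5,
    WeightedInterleave = 6,
}

// Memory policy flags from the same header.
#[repr(u32)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[allow(dead_code)]
enum MpolFlag {
    NumaBalancing = 1 << 13, // 0x2000
    RelativeNodes = 1 << 14, // 0x4000
    StaticNodes = 1 << 15,   // 0x8000
}
```

At the syscall boundary, `MpolMode::Bind as i32` recovers the raw value to pass to set_mempolicy(2), so the untyped constants never leak outside that one spot.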

Comment on lines 52 to 60
let base_mode = match policy.mode() {
MemoryPolicyModeType::MpolDefault => MPOL_DEFAULT,
MemoryPolicyModeType::MpolPreferred => MPOL_PREFERRED,
MemoryPolicyModeType::MpolBind => MPOL_BIND,
MemoryPolicyModeType::MpolInterleave => MPOL_INTERLEAVE,
MemoryPolicyModeType::MpolLocal => MPOL_LOCAL,
MemoryPolicyModeType::MpolPreferredMany => MPOL_PREFERRED_MANY,
MemoryPolicyModeType::MpolWeightedInterleave => MPOL_WEIGHTED_INTERLEAVE,
};
Member

You can use the From trait.
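A sketch of the From-trait version of the match above. `MemoryPolicyModeType` is the oci-spec-rs enum; a local stand-in is defined here only so the example is self-contained.

```rust
// Stand-in for oci_spec::runtime::MemoryPolicyModeType.
#[derive(Clone, Copy)]
enum MemoryPolicyModeType {
    MpolDefault,
    MpolPreferred,
    MpolBind,
    MpolInterleave,
    MpolLocal,
    MpolPreferredMany,
    MpolWeightedInterleave,
}

// Mapping to the mempolicy.h mode values lives in one From impl.
impl From<MemoryPolicyModeType> for i32 {
    fn from(mode: MemoryPolicyModeType) -> i32 {
        match mode {
            MemoryPolicyModeType::MpolDefault => 0,
            MemoryPolicyModeType::MpolPreferred => 1,
            MemoryPolicyModeType::MpolBind => 2,
            MemoryPolicyModeType::MpolInterleave => 3,
            MemoryPolicyModeType::MpolLocal => 4,
            MemoryPolicyModeType::MpolPreferredMany => 5,
            MemoryPolicyModeType::MpolWeightedInterleave => 6,
        }
    }
}
```

The call site then shrinks to something like `let base_mode: i32 = policy.mode().into();`.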

Comment on lines 62 to 65
let mut has_static = false;
let mut has_relative = false;
if let Some(flags) = policy.flags() {
for flag in flags.iter() {
Member

Suggested change
- let mut has_static = false;
- let mut has_relative = false;
- if let Some(flags) = policy.flags() {
-     for flag in flags.iter() {
+ if let Some(flags) = policy.flags() {
+     let mut has_static = false;
+     let mut has_relative = false;
+     for flag in flags.iter() {

Comment on lines 74 to 79
MemoryPolicyFlagType::MpolFRelativeNodes => {
has_relative = true;
}
MemoryPolicyFlagType::MpolFStaticNodes => {
has_static = true;
}
Member

Using flags.contains(...) makes this logic simpler.
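The `contains(...)` version might look like the sketch below; the flag enum is again a local stand-in for the oci-spec-rs type, and the helper name is hypothetical.

```rust
// Stand-in for oci_spec::runtime::MemoryPolicyFlagType.
#[derive(PartialEq)]
enum MemoryPolicyFlagType {
    MpolFNumaBalancing,
    MpolFRelativeNodes,
    MpolFStaticNodes,
}

// Replaces the mutable booleans and the manual loop: slice::contains
// answers each membership question directly.
fn static_relative(flags: &[MemoryPolicyFlagType]) -> (bool, bool) {
    let has_static = flags.contains(&MemoryPolicyFlagType::MpolFStaticNodes);
    let has_relative = flags.contains(&MemoryPolicyFlagType::MpolFRelativeNodes);
    (has_static, has_relative)
}
```

The mutual-exclusivity check from the PR description then reduces to rejecting the `(true, true)` pair.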

@n4mlz n4mlz requested a review from utam0k November 27, 2025 12:03