Skip to content

Conversation

@edolstra
Copy link
Collaborator

@edolstra edolstra commented Jun 23, 2025

Motivation

Updated version of NixOS#10938.

Context

Summary by CodeRabbit

  • New Features

    • Experimental parallel evaluation: thread-pool executor, eval-cores setting, new __parallel primitive; parallelized flake checks, flake show, search, and JSON/XML rendering.
  • Performance

    • Memory-aware GC sizing and stack fixes; evaluator threading; Boost thread support; default remote-store max connections increased to 64.
  • UX Improvements

    • Better rendering for failed values in text/JSON/XML; improved error-location accuracy and diagnostics.
  • CI/Tests

    • Expanded CI matrix and richer diagnostic output; updated functional tests for adjusted error locations.

edolstra added 30 commits May 20, 2024 10:09
This is a mapping from paths to "resolved" paths (i.e. with
`default.nix` added, if appropriate). `fileParseCache` and
`fileEvalCache` are now keyed on the resolved path *only*.
Previously, the optimistic concurrency approach in `evalFile()` meant
that a `nix search nixpkgs ^` would do hundreds of duplicated
parsings/evaluations. Now, we reuse the thunk locking mechanism to
ensure it's done only once.
This refactoring allows the symbol table to be stored as something
other than std::strings.
This allows symbol IDs to be offsets into an arena whose base offset
never moves, and can therefore be dereferenced without any locks.
This makes it less likely that we concurrently execute tasks that
would block on a common subtask, e.g. evaluating `libfoo` and
`libfoo_variant` are likely to have common dependencies.
@github-actions github-actions bot temporarily deployed to pull request September 2, 2025 13:47 Inactive
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (3)
src/libexpr/eval-gc.cc (1)

99-106: Restore pthread_attr_get_np fallback; current #error breaks BSD/non-glibc builds

The FreeBSD/non-glibc path is hard-erroring. Implement the pthread_attr_get_np branch with proper error checks.

Apply:

-#    ifdef HAVE_PTHREAD_GETATTR_NP
-    if (pthread_getattr_np(pthread_id, &pattr))
-        throw Error("fixupBoehmStackPointer: pthread_getattr_np failed");
-#    else
-#      error "Need  `pthread_attr_get_np`"
-#    endif
+#    ifdef HAVE_PTHREAD_GETATTR_NP
+    if (pthread_getattr_np(pthread_id, &pattr))
+        throw Error("fixupBoehmStackPointer: pthread_getattr_np failed");
+#    elif defined(HAVE_PTHREAD_ATTR_GET_NP)
+    if (pthread_attr_get_np(pthread_id, &pattr))
+        throw Error("fixupBoehmStackPointer: pthread_attr_get_np failed");
+#    else
+#      error "Need pthread_getattr_np or pthread_attr_get_np"
+#    endif
src/libexpr/include/nix/expr/symbol-table.hh (2)

240-244: Add bounds and alignment checks in operator[] and launder the pointer

Prevent out-of-bounds access and misaligned loads; also use std::launder for safety after placement construction.

     SymbolStr operator[](Symbol s) const
     {
-        assert(s.id);
-        return SymbolStr(*reinterpret_cast<const SymbolValue *>(arena.data + s.id));
+        assert(s.id);
+        // Basic integrity checks
+        assert(s.id < arena.size); // offset within arena
+        assert((s.id % alignment) == 0); // start of an aligned record
+        assert(s.id + sizeof(SymbolValue) <= arena.size); // header fits
+        auto p = std::launder(reinterpret_cast<const SymbolValue *>(arena.data + s.id));
+        return SymbolStr(*p);
     }

256-273: Fix dump() record-advance logic to account for header in alignment calculation

Current padding uses only the string length; it should include the Value header to remain aligned. Otherwise, iteration can desynchronize and hit asserts or skip symbols.

     void dump(T callback) const
     {
         std::string_view left{arena.data, arena.size};
         left = left.substr(alignment);
         while (true) {
             if (left.empty())
                 break;
             left = left.substr(sizeof(Value));
             auto p = left.find('\0');
             assert(p != left.npos);
             auto sym = left.substr(0, p);
             callback(sym);
             // skip alignment padding
-            auto n = sym.size() + 1;
-            left = left.substr(n + (n % alignment ? alignment - (n % alignment) : 0));
+            auto n = sizeof(Value) + sym.size() + 1; // header + string + NUL
+            auto alignedSize = ((n + alignment - 1) / alignment) * alignment;
+            left = left.substr(alignedSize - sizeof(Value));
         }
     }
🧹 Nitpick comments (11)
src/libexpr/eval-gc.cc (4)

38-40: Fix comment: /proc/meminfo, not /proc/cpuinfo

The comment references cpuinfo but the code reads meminfo.

Apply:

-    /* On Linux, use the `MemAvailable` or `MemFree` fields from
-       /proc/cpuinfo. */
+    /* On Linux, use the `MemAvailable` or `MemFree` fields from
+       /proc/meminfo. */

41-59: Harden Linux meminfo parsing against read failures

readFile("/proc/meminfo") can throw; let’s fall back instead of propagating.

Apply:

-    {
+    try {
         std::unordered_map<std::string, std::string> fields;
         for (auto & line :
              tokenizeString<std::vector<std::string>>(readFile(std::filesystem::path("/proc/meminfo")), "\n")) {
             auto colon = line.find(':');
             if (colon == line.npos)
                 continue;
             fields.emplace(line.substr(0, colon), trim(line.substr(colon + 1)));
         }

         auto i = fields.find("MemAvailable");
         if (i == fields.end())
             i = fields.find("MemFree");
         if (i != fields.end()) {
             auto kb = tokenizeString<std::vector<std::string>>(i->second, " ");
             if (kb.size() == 2 && kb[1] == "kB")
                 return string2Int<size_t>(kb[0]).value_or(0) * 1024;
         }
-    }
+    } catch (const std::exception &) {
+        /* Fall through to generic sysconf-based fallback below. */
+    }

151-159: Comment/code mismatch (80% vs 50%)

Comment says 80% but code uses ~50% of free RAM.

Apply one of:

  • Update comment to 50%:
-    /* Set the initial heap size to something fairly big (80% of
+    /* Set the initial heap size to something fairly big (50% of
        free RAM, up to a maximum of 4 GiB) so that in most cases
  • Or change code to 80% (if intended).

164-166: Avoid float math; use integer halving

free * 0.5 promotes to double and back. Use integer division for precision.

Apply:

-        auto free = getFreeMem();
-        size = std::max(size, std::min((size_t) (free * 0.5), maxSize));
+        auto free = getFreeMem();
+        auto halfFree = free / 2;
+        size = std::max(size, std::min(halfFree, maxSize));
src/libexpr/include/nix/expr/symbol-table.hh (7)

27-40: Make arena.data writable without const_casts and keep the pointer itself const

Construction writes through data; using const here forces const_cast elsewhere. Prefer a const pointer to mutable memory.

-    const char * data;
+    char * const data;

Follow-up: drop const_casts at call sites (e.g., SymbolStr::SymbolStr in symbol-table.cc).


32-36: Confirm cache-line isolation intent is achieved

aligned(64) on the atomic ensures its start is 64B-aligned, but only if the compiler inserts padding before it. Given preceding fields, this is likely fine, but please verify with static_asserts or sizeof/offsetof in a unit test.

Example check:

  • Expect offsetof(ContiguousArena, size) % 64 == 0.
  • Optionally add alignas(64) to the struct and reorder members if needed.

101-109: Key stores arena by reference; ensure lifetime and thread-safety are clear

Key captures ContiguousArena& and is used inside a concurrent container; document that arena outlives the set and is append-only. If not guaranteed, consider a pointer.


139-149: size()/empty() are O(n) due to std::string_view length computation

This is fine if not hot; otherwise consider storing the length next to the string to avoid repeated scans.


219-225: Constructor reserves ID 0; also assert arena invariants

Add cheap asserts to catch early misuse.

     SymbolTable()
         : arena(1 << 30)
     {
         // Reserve symbol ID 0 and ensure alignment of the first allocation.
         arena.allocate(alignment);
+        assert((uintptr_t)arena.data % alignment == 0);
+        assert(arena.size == alignment);
     }

251-254: Atomic load semantics for totalSize()

If used for metrics only, relaxed is fine and cheaper; otherwise use acquire. Make it explicit.

-    size_t totalSize() const
-    {
-        return arena.size;
-    }
+    size_t totalSize() const
+    {
+        return arena.size.load(std::memory_order_relaxed);
+    }

22-24: Consistent access: consider using Value APIs or document the aliasing

Returning a string_view by reinterpreting (this + 1) is correct for your layout but bypasses Value’s accessor; add a comment asserting that mkString stored the same pointer and that the trailing bytes remain immutable.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 3d8127c and 3425c5d.

📒 Files selected for processing (4)
  • src/libexpr/eval-gc.cc (3 hunks)
  • src/libexpr/include/nix/expr/counter.hh (1 hunks)
  • src/libexpr/include/nix/expr/symbol-table.hh (7 hunks)
  • src/libexpr/meson.build (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/libexpr/include/nix/expr/counter.hh
  • src/libexpr/meson.build
🧰 Additional context used
🧬 Code graph analysis (2)
src/libexpr/eval-gc.cc (2)
src/libutil/file-system.cc (7)
  • readFile (251-264)
  • readFile (251-251)
  • readFile (266-269)
  • readFile (266-266)
  • readFile (271-298)
  • readFile (271-271)
  • path (850-850)
src/libutil/unix/file-descriptor.cc (2)
  • readFile (32-39)
  • readFile (32-32)
src/libexpr/include/nix/expr/symbol-table.hh (2)
src/libexpr/include/nix/expr/value.hh (14)
  • s (1143-1146)
  • s (1143-1143)
  • s (1148-1148)
  • s (1150-1150)
  • s (1152-1152)
  • p (1199-1199)
  • n (205-208)
  • n (205-205)
  • n (1128-1131)
  • n (1128-1128)
  • n (1133-1136)
  • n (1133-1133)
  • n (1216-1219)
  • n (1216-1216)
src/libexpr/symbol-table.cc (1)
  • SymbolStr (45-61)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: build_aarch64-darwin / build
  • GitHub Check: build_x86_64-darwin / build
  • GitHub Check: build_x86_64-linux / build
  • GitHub Check: build_aarch64-linux / build
🔇 Additional comments (2)
src/libexpr/eval-gc.cc (2)

82-84: Confirm assumption: no GC roots in coroutines

Assuming no GC roots live on coroutine stacks is risky. Please confirm or document invariants/tests ensuring this remains true under parallel eval.


148-150: SP corrector hook: LGTM

Hooking and asserting the corrector is installed looks good.

@github-actions github-actions bot temporarily deployed to pull request September 2, 2025 14:02 Inactive
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (4)
src/libexpr/include/nix/expr/value.hh (4)

440-542: Complex concurrency design requires careful analysis.

The atomic p0 and non-atomic p1 design with the ordering constraint ("p1 must be written before p0") is a sophisticated approach to lock-free value updates. However, this creates several concerns:

  1. Race condition potential: The requirement that "p1 must always be updated before p0" (Line 459) must be strictly enforced across all code paths.

  2. Memory ordering: The use of memory_order_release for p0 writes and memory_order_acquire for reads is correct, but the implementation must be consistent.

  3. Debug output in production: Lines 539-541 contain debug output that may not be appropriate for production builds.

Apply this fix for the debug output:

-        else if (pd == pdThunk) {
-            printError("BAD FINISH %x", this);
-            unreachable();
-        }
+        else if (pd == pdThunk)
+            unreachable();

1308-1312: Race condition warning for thunk accessor.

The comment "FIXME: remove this since reading it is racy" correctly identifies that accessing thunk data is unsafe in a multithreaded context. This method should be marked deprecated or removed.

Consider deprecating this method or adding runtime checks:

+    [[deprecated("Reading thunk data is racy in multithreaded evaluation")]]
     ClosureThunk thunk() const noexcept

1319-1323: Race condition warning for app accessor.

The comment "FIXME: remove this since reading it is racy" correctly identifies the same threading issue as with the thunk accessor. This should also be addressed.


781-791: Testing-only blackhole function should be restricted.

The mkBlackhole() function is marked as "only used for testing" but is in the public interface. This could be accidentally used in production code.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 3425c5d and 2a001a8.

📒 Files selected for processing (1)
  • src/libexpr/include/nix/expr/value.hh (17 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/libexpr/include/nix/expr/value.hh (2)
src/libexpr/parallel-eval.cc (4)
  • notifyWaiters (248-253)
  • notifyWaiters (248-248)
  • waitOnThunk (186-245)
  • waitOnThunk (186-186)
src/libexpr/include/nix/expr/eval.hh (24)
  • v (536-539)
  • v (536-536)
  • v (541-541)
  • v (547-547)
  • v (552-552)
  • v (553-553)
  • v (554-554)
  • v (556-556)
  • v (561-561)
  • v (565-565)
  • v (566-566)
  • v (567-572)
  • v (573-573)
  • v (592-592)
  • v (757-757)
  • v (842-842)
  • pos (594-595)
  • pos (611-618)
  • pos (641-641)
  • pos (646-646)
  • pos (651-655)
  • pos (672-672)
  • pos (786-786)
  • pos (932-932)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: build_aarch64-darwin / build
  • GitHub Check: build_aarch64-linux / build
  • GitHub Check: build_x86_64-darwin / build
  • GitHub Check: build_x86_64-linux / build
🔇 Additional comments (16)
src/libexpr/include/nix/expr/value.hh (16)

336-336: LGTM! Added Failed payload type mapping.

The macro definition correctly includes the Failed payload type in the storage system.


544-572: LGTM! Proper memory ordering in payload setters.

The payload setting functions correctly use the finish() method to ensure proper memory ordering. The bit manipulation and alignment assertions provide good safety checks.


587-621: LGTM! Efficient internal type decoding.

The getInternalType() function efficiently decodes the bit-packed discriminator information. The switch statement correctly handles all primary discriminator cases.


623-641: LGTM! Clean macro-based payload accessors.

The macro generates appropriate getter/setter pairs for pointer-based payloads, maintaining the correct memory ordering through the established payload setter functions.


643-698: LGTM! Consistent payload access patterns.

The storage getters correctly extract data from p1 for single-word payloads and use proper bit manipulation for pointer-based payloads.


695-698: LGTM! Failed payload accessor.

The Failed pointer getter is implemented consistently with other pointer-based payload types.


750-753: LGTM! Failed payload setter.

The setter correctly uses setSingleDWordPayload for storing the Failed pointer.


793-807: LGTM! Forward declarations for multithreaded operations.

The forward declarations for force, waitOnThunk, and notifyWaiters establish the interface for the multithreaded evaluation system.


810-814: LGTM! Template specialization declarations.

The explicit template specializations ensure that the multithreaded operations are properly instantiated for the target pointer size.


1028-1032: LGTM! Efficient finished check.

The isFinished() method provides a clean interface to check if a value is in weak-head normal form.


1048-1052: LGTM! Updated blackhole detection.

The isBlackhole() implementation correctly identifies pending and awaited states as blackhole conditions in the new state machine.


1070-1073: LGTM! Failed value predicate.

The isFailed() method provides a clean interface to check for failed values.


1079-1116: LGTM! Updated type mapping with failure support.

The type() method correctly maps the new tFailed internal type to nFailed and handles the new pending/awaited states as thunk types.


1221-1224: LGTM! Failed value constructor.

The mkFailed() method correctly creates a Failed value with the current exception. The use of std::current_exception() is appropriate for capturing the active exception.


1244-1251: LGTM! Updated trivial value detection.

The isTrivial() documentation correctly reflects that finished values are considered trivial in the new evaluation model.


1335-1338: LGTM! Failed value accessor.

The failed() method provides clean access to the Failed payload, consistent with other payload accessors.

@edolstra edolstra force-pushed the multithreaded-eval-v2 branch from ceb8c69 to 92788a0 Compare September 2, 2025 14:45
@github-actions github-actions bot temporarily deployed to pull request September 2, 2025 14:49 Inactive
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
src/libexpr/include/nix/expr/value.hh (1)

572-585: Release-store in setThunkPayload is correct.

This now preserves the “p1-before-p0” ordering without spurious wakeups for awaited states. Thanks for addressing the earlier fence concern.

🧹 Nitpick comments (3)
src/libexpr/include/nix/expr/eval.hh (1)

377-381: Good move to SharedSync for caches; consider similar treatment for remaining maps.

lookupPathResolved, primOpCalls, and attrSelects are still marked non-thread-safe. Consider SharedSync or per-thread aggregation with periodic merge to avoid lock contention.

Also applies to: 392-393, 398-398

src/libexpr/include/nix/expr/value.hh (2)

1217-1221: Failed payload: confirm GC ownership and lifetime.

mkFailed() allocates with new on a gc-derived type. Please confirm this is traced by the GC in both Boehm and non-Boehm builds and that any printers/visitors handle nFailed without dereferencing a null ex.

Optionally add a brief docstring to ValueType::nFailed and Value::failed() clarifying semantics.

Also applies to: 1331-1335


1044-1049: Expose racy accessors only where safe.

isBlackhole() is fine. The “FIXME: remove this since reading it is racy” for thunk()/app() remains—consider #ifdef UNIT_TESTS or moving to a test-only header to prevent accidental use.

Also applies to: 1304-1309, 1315-1320

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 2a001a8 and 92788a0.

📒 Files selected for processing (4)
  • src/libexpr/eval-gc.cc (3 hunks)
  • src/libexpr/include/nix/expr/eval.hh (10 hunks)
  • src/libexpr/include/nix/expr/symbol-table.hh (7 hunks)
  • src/libexpr/include/nix/expr/value.hh (17 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/libexpr/include/nix/expr/symbol-table.hh
  • src/libexpr/eval-gc.cc
🧰 Additional context used
🧬 Code graph analysis (2)
src/libexpr/include/nix/expr/value.hh (2)
src/libexpr/parallel-eval.cc (4)
  • notifyWaiters (248-253)
  • notifyWaiters (248-248)
  • waitOnThunk (186-245)
  • waitOnThunk (186-186)
src/libexpr/include/nix/expr/eval.hh (24)
  • v (537-540)
  • v (537-537)
  • v (542-542)
  • v (548-548)
  • v (553-553)
  • v (554-554)
  • v (555-555)
  • v (557-557)
  • v (562-562)
  • v (566-566)
  • v (567-567)
  • v (568-573)
  • v (574-574)
  • v (593-593)
  • v (758-758)
  • v (843-843)
  • pos (595-596)
  • pos (612-619)
  • pos (642-642)
  • pos (647-647)
  • pos (652-656)
  • pos (673-673)
  • pos (787-787)
  • pos (933-933)
src/libexpr/include/nix/expr/eval.hh (2)
src/libexpr/primops.cc (2)
  • struct State (4312-4315)
  • re (4310-4334)
src/libexpr/eval.cc (3)
  • EvalState (378-380)
  • EvalState (187-375)
  • eval (1758-1784)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: build_aarch64-darwin / build
  • GitHub Check: build_x86_64-darwin / build
  • GitHub Check: build_x86_64-linux / build
🔇 Additional comments (3)
src/libexpr/include/nix/expr/eval.hh (3)

193-193: RegexCache ownership change looks good; verify all call sites.

Return type moved to ref<RegexCache> and the member matches. Please confirm all makeRegexCache() callers were updated and no lingering shared_ptr assumptions remain.

Also applies to: 408-408


537-540: Inline forceValue forwarding is correct.


780-781: Per-thread callDepth via thread_local is appropriate.

Comment on lines +227 to +228
ref<Executor> executor;

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

❓ Verification inconclusive

Ensure executor is always non-null (avoid nullptr deref).

waitOnThunk() uses state.executor->evalCores; if executor is unset you'll crash. Enforce non-null in the constructor (initialize a single-threaded default) or guard all uses.

Would you like a patch that wires a default Executor{evalCores=1} in EvalState’s ctor?


Initialize executor in EvalState’s constructor
Ensure executor is never null by assigning a default single-threaded Executor (e.g. with evalCores=1) in the EvalState ctor to avoid null-pointer deref in waitOnThunk().

🤖 Prompt for AI Agents
In src/libexpr/include/nix/expr/eval.hh around lines 227-228, EvalState's member
ref<Executor> executor is not initialized; update the EvalState constructor to
create and assign a default single-threaded Executor (e.g. construct Executor
with evalCores=1) so executor is never null and waitOnThunk() cannot dereference
a null pointer. Instantiate the Executor with appropriate ownership semantics
and assign it to the executor field in the ctor, ensuring any required
headers/params are provided.

@edolstra edolstra enabled auto-merge September 2, 2025 15:00
@cole-h cole-h disabled auto-merge September 2, 2025 15:23
@grahamc grahamc added this pull request to the merge queue Sep 2, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 3, 2025
@edolstra edolstra added this pull request to the merge queue Sep 3, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Sep 3, 2025
When useMaster is true, startMaster() acquires the state lock, then
calls isMasterRunning(), which calls addCommonSSHOpts(), which tries
to acquire the state lock again, causing a deadlock.

The solution is to move tmpDir out of the state. It doesn't need to be
there in the first place because it never changes.
@cole-h cole-h enabled auto-merge September 3, 2025 13:35
@github-actions github-actions bot temporarily deployed to pull request September 3, 2025 13:39 Inactive
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/libstore/ssh.cc (2)

118-125: Control master check ignores custom -S path; can skip starting our master and then use a non-existent socket

isMasterRunning() doesn’t pass the control socket path, but startMaster() later returns state->socketPath and callers use “-S ”. If a user-level master exists (default ControlPath), you’ll return our tmpDir path even though no master is listening there.

Apply these changes to scope the check to our socket:

-bool SSHMaster::isMasterRunning()
+bool SSHMaster::isMasterRunning(Path controlPath)
 {
     Strings args = {"-O", "check", hostnameAndUser};
     addCommonSSHOpts(args);
+    if (!controlPath.empty())
+        args.insert(args.end(), {"-S", controlPath});

And in the header:

-    bool isMasterRunning();
+    bool isMasterRunning(Path controlPath = "");

And in startMaster (see below).


249-253: Pass the socket path when checking master state

Tie the check to the socket we’ll actually use.

-    if (isMasterRunning())
+    if (isMasterRunning(state->socketPath))
         return state->socketPath;
🧹 Nitpick comments (2)
src/libstore/include/nix/store/ssh.hh (1)

30-31: Make tmpDir immutable and document lifetime/thread-safety

tmpDir is effectively write-once; marking it const prevents accidental reassignment and clarifies intent. Also, because Connections don’t hold a ref to tmpDir, verify SSHMaster outlives any in-flight ssh child using files under tmpDir.

-    ref<AutoDelete> tmpDir;
+    const ref<AutoDelete> tmpDir;

If you want to make usage clearer and avoid repeated recomputation of paths, consider adding:

  • a cached known-hosts file path (e.g., Path knownHostsFile;)
  • a cached control socket path (e.g., Path controlSocket;)
src/libstore/ssh.cc (1)

241-242: Use consistent path joining; avoid deref/cast of ref

Dereferencing tmpDir and C-style casting to Path is brittle. Use the same path API you used above.

-    state->socketPath = (Path) *tmpDir + "/ssh.sock";
+    state->socketPath = (tmpDir->path() / "ssh.sock").string();
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 92788a0 and 4b34658.

📒 Files selected for processing (2)
  • src/libstore/include/nix/store/ssh.hh (2 hunks)
  • src/libstore/ssh.cc (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/libstore/ssh.cc (2)
src/libutil/include/nix/util/file-system.hh (1)
  • createTempDir (330-330)
src/libstore/include/nix/store/ssh.hh (2)
  • args (42-42)
  • getNixSshOpts (12-12)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: build_x86_64-darwin / build
  • GitHub Check: build_aarch64-linux / build
  • GitHub Check: build_x86_64-linux / build
  • GitHub Check: build_aarch64-darwin / build
🔇 Additional comments (2)
src/libstore/include/nix/store/ssh.hh (1)

4-4: Include looks correct and necessary

Adding ref.hh is appropriate for the new ref<...> member.

src/libstore/ssh.cc (1)

87-88: Good: secure temp dir creation

Creating the tmp dir at construction with 0700 is correct and avoids races.

Comment on lines 97 to 103
if (!keyFile.empty())
args.insert(args.end(), {"-i", keyFile});
if (!sshPublicHostKey.empty()) {
std::filesystem::path fileName = state->tmpDir->path() / "host-key";
std::filesystem::path fileName = tmpDir->path() / "host-key";
writeFile(fileName.string(), authority.host + " " + sshPublicHostKey + "\n");
args.insert(args.end(), {"-oUserKnownHostsFile=" + fileName.string()});
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

⚠️ Potential issue

Known_hosts line is malformed (missing key type and base64; wrong host format)

OpenSSH’s known_hosts requires “host keytype base64”. You’re writing raw bytes without a key type, and you don’t bracket non-default ports ([host]:port). This will cause host key verification to fail and may be racy if multiple threads rewrite the same file.

Apply this fix (defaults to ssh-ed25519; ideally plumb the real type):

-    if (!sshPublicHostKey.empty()) {
-        std::filesystem::path fileName = tmpDir->path() / "host-key";
-        writeFile(fileName.string(), authority.host + " " + sshPublicHostKey + "\n");
-        args.insert(args.end(), {"-oUserKnownHostsFile=" + fileName.string()});
-    }
+    if (!sshPublicHostKey.empty()) {
+        std::filesystem::path fileName = tmpDir->path() / "known_hosts";
+        // Use "[host]:port" when a non-default port is set.
+        std::string hostForKnownHosts =
+            authority.port ? fmt("[%s]:%d", authority.host, *authority.port)
+                           : std::string(authority.host);
+        // TODO: pass the real key type instead of defaulting.
+        const char *keyType = "ssh-ed25519";
+        writeFile(
+            fileName.string(),
+            hostForKnownHosts + " " + keyType + " " + base64::encode(sshPublicHostKey) + "\n");
+        args.insert(args.end(), {"-oUserKnownHostsFile=" + fileName.string()});
+    }

Optionally, write the file once (e.g., at construction) to avoid concurrent rewrites. I can send a follow-up patch using std::once_flag.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if (!keyFile.empty())
args.insert(args.end(), {"-i", keyFile});
if (!sshPublicHostKey.empty()) {
std::filesystem::path fileName = state->tmpDir->path() / "host-key";
std::filesystem::path fileName = tmpDir->path() / "host-key";
writeFile(fileName.string(), authority.host + " " + sshPublicHostKey + "\n");
args.insert(args.end(), {"-oUserKnownHostsFile=" + fileName.string()});
}
if (!keyFile.empty())
args.insert(args.end(), {"-i", keyFile});
if (!sshPublicHostKey.empty()) {
std::filesystem::path fileName = tmpDir->path() / "known_hosts";
// Use "[host]:port" when a non-default port is set.
std::string hostForKnownHosts =
authority.port ? fmt("[%s]:%d", authority.host, *authority.port)
: std::string(authority.host);
// TODO: pass the real key type instead of defaulting.
const char *keyType = "ssh-ed25519";
writeFile(
fileName.string(),
hostForKnownHosts + " " + keyType + " " + base64::encode(sshPublicHostKey) + "\n");
args.insert(args.end(), {"-oUserKnownHostsFile=" + fileName.string()});
}
🤖 Prompt for AI Agents
In src/libstore/ssh.cc around lines 97 to 103, the known_hosts line is
malformed: you currently write raw key bytes without a key type or base64 and
you don’t bracket hosts with non-default ports, plus the file may be rewritten
concurrently; fix by formatting the entry as "host keytype base64key" (default
keytype to "ssh-ed25519" if the real type isn’t available), base64-encode
sshPublicHostKey before writing, bracket host as "[host]:port" when a
non-default port is used, and ensure the file is created/written only once
(e.g., move writeFile to construction or guard with a once_flag) so multiple
threads don’t race rewriting it.

@cole-h cole-h added this pull request to the merge queue Sep 3, 2025
Merged via the queue into main with commit 2bdbe00 Sep 3, 2025
49 checks passed
@cole-h cole-h deleted the multithreaded-eval-v2 branch September 3, 2025 14:33
@coderabbitai coderabbitai bot mentioned this pull request Oct 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

flake-regression-test Run the flake regressions test suite on this PR upload to s3

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants