Performance Analysis Summary

Project: llama.cpp

Overview

This change addresses a race condition in threadpool synchronization by consolidating two atomic variables (`n_graph` and `n_threads_cur`) into a single atomic value.

Key Findings

Performance-Critical Area Impact

Synchronization Functions: …

Parameter Accessor Functions: …
These improvements are unrelated to the race condition fix and likely stem from compiler optimization changes or code refactoring in the parameter handling infrastructure.

Forward Computation Functions: …
The consistent pattern suggests shared dispatch infrastructure changes affecting operation parameter extraction or type dispatch logic.

Inference Performance Impact

Tokens Per Second: No measurable impact expected. The modified functions (…) …

Impacted Functions: None of the core inference functions are modified. The 6-7 ns increases in forward computation dispatch are negligible relative to the microsecond-scale execution times of actual tensor operations.

Power Consumption Analysis

Binary: …
The power consumption increase is driven by the cumulative throughput time increases in forward computation functions (…).

Other Binaries: All other binaries (…) …

Technical Implementation

The race condition fix uses bit-packing to atomically update both the graph counter and the thread count:

```c
// Upper 16 bits: graph counter, lower 16 bits: thread count
n_graph = ((graph_counter + 1) << 16) | (n_threads & 0xFFFF);
```

This ensures worker threads cannot observe mismatched graph counter and thread count values during rapid thread count changes (e.g., alternating between 1 and N threads). The single atomic store with sequential consistency ordering provides the necessary memory fence for polling threads using relaxed reads. The implementation removes the …
Force-pushed from c09526e to 222c9f8
Performance Analysis Summary: PR #426

Overview

PR #426 addresses a race condition in threadpool synchronization by combining `n_graph` and `n_threads_cur` into a single atomic variable.

Key Findings

Performance-Critical Area Impact

Threadpool Synchronization Functions: The modified functions show improvements in parameter handling operations:

Absolute Changes: …

Inference Performance Impact

Tokens Per Second: No impact expected. The modified functions (…) …
The improvements in synchronization overhead (20-50 ns per operation) are negligible compared to inference computation time (milliseconds per token). These changes fix correctness issues in multi-threaded execution without affecting single-token processing time.

Impacted Functions for Inference: None. No changes to …

Power Consumption Analysis

Impacted Binary: …
This increase is driven by unrelated compute-forward function regressions (…).

All other binaries (…) …

Code Change Analysis

The implementation combines two separate atomic variables into one using bit packing (lower 16 bits: thread count; upper 16 bits: graph counter). This ensures atomic consistency when threads check for new work, eliminating the race condition where threads could read mismatched graph ID and thread count values during rapid thread count changes. The fix trades minimal bit-manipulation overhead (1-3 cycles) for elimination of the race condition and a reduction in atomic operations, resulting in both a correctness improvement and a performance gain in the synchronization path.
Force-pushed from 9612097 to c217e38
Force-pushed from 51e8448 to 78ff3d3
Mirrored from ggml-org/llama.cpp#17748
The original discussion started in #17515
The short summary is that we have a race condition when the number of active threads changes rapidly while the worker threads are still in their hybrid polling loops.
I updated `test_barrier` to test for this scenario. There is an additional test in there now that flip-flops between doing `graph_compute` with 1 and N threads. Without the fix, this new test quickly and reliably fails on all platforms that I tested: Snapdragon Gen3/4/5 (Android), Mac M4-Pro, AMD Ryzen 9 (Linux).

See this comment for the original report and analysis of the end-to-end use cases that trigger this scenario:
ggml-org/llama.cpp#17515 (comment)
This PR combines `n_graph` and `n_threads_cur` (number of active threads) into a single atomic update.

I played with a bunch of ideas and this seems to be the cleanest/simplest way to ensure all threads see a consistent state without adding extra logic. Also worth noting that adding additional memory ordering restrictions (i.e. instead of doing relaxed reads) is not sufficient, because the threads can get preempted in between the atomic reads and still end up with an inconsistent state.
Here is a quick test report from various systems:
AMD Ryzen 9 3950X (16-Cores) -- tested with and without OpenMP, with and without TSAN
Galaxy S24 Ultra (Gen3) -- no OpenMP, also tested Galaxy S25 (Gen4) and Gen5 device
Mac M4-Pro -- no OpenMP, with and without TSAN
Also tested all the usual stuff: `llama-cli` and `llama-bench` with various models and backends with partial offloads.

@DamonFool Please give this a shot on your setup.
@jeffbolznv @ggerganov