Skip to content

Conversation

@snnn
Copy link
Contributor

@snnn snnn commented May 27, 2025

Symptom

The following unit tests failed when building ONNX Runtime with Visual Studio 17.14 in Release or RelWithDebInfo configuration.

  • SparseTensorConversionTests.TestDenseToSparseConversion
  • MeanVarianceNormalizationTest.AllAxes
  • MVNContribOpTest.MeanVarianceNormalizationCPUTest_Version1_TO_8

Minimal Reproducible Example

// ConsoleApplication1.cpp
#include <iostream>
#include <span> // Used for comparison/workaround
#include <vector>
#include <algorithm>
#include <numeric>
#include <functional>

#include <gsl/gsl> // Microsoft GSL library

// Function exhibiting the bug with gsl::span
std::vector<float> MeanStdev_gsl(gsl::span<const float> v, float mean) {
    std::vector<float> diff(v.size()); // diff is sized correctly
    // The following transform behaves incorrectly
    std::transform(v.begin(), v.end(), diff.begin(),
                   std::bind(std::minus<float>(), std::placeholders::_1, mean));
    return diff;
}

// Same function with std::span (works correctly)
std::vector<float> MeanStdev_std(std::span<const float> v, float mean) {
    std::vector<float> diff(v.size());
    std::transform(v.begin(), v.end(), diff.begin(),
                   std::bind(std::minus<float>(), std::placeholders::_1, mean));
    return diff;
}

int main() {
    std::vector<float> X{
        1.0f, 1.0f, 2.0f, 4.0f, 1.0f, 4.0f, 0.0f, 4.0f,
        -1.0f, 1.0f, 2.0f, -1.0f, -1.0f, -2.0f, 4.0f, -5.0f, // X[15] is -5.0f
        5.0f, -3.0f, -1.0f, 0.0f, 3.0f, 3.0f, 0.0f, 0.0f  // 24 elements total
    };
    float mean_val = 0.875f; // Example mean

    std::cout << "Testing with gsl::span (problematic):" << std::endl;
    const auto mean_stdev_gsl = MeanStdev_gsl(X, mean_val);
    for (size_t i = 0; i < mean_stdev_gsl.size(); ++i) {
        std::cout << mean_stdev_gsl[i] << (i == mean_stdev_gsl.size() - 1 ? "" : " ");
    }
    std::cout << std::endl;
    // Last non-zero element X[15] - mean = -5.0f - 0.875f = -5.875f
    // Output shows processing stops after 16 elements.

    std::cout << "\nTesting with std::span (correct):" << std::endl;
    const auto mean_stdev_std = MeanStdev_std(X, mean_val);
    for (size_t i = 0; i < mean_stdev_std.size(); ++i) {
        std::cout << mean_stdev_std[i] << (i == mean_stdev_std.size() - 1 ? "" : " ");
    }
    std::cout << std::endl;

    return 0;
}

Observed Behavior (with gsl::span):

When compiling the MeanStdev_gsl function (which uses gsl::span) in Release mode, the std::transform operation does not process all elements of the input span.
For the provided example X (24 elements), the output is:

4.125 -3.875 -1.875 -0.875 2.125 2.125 -0.875 -0.875 -1.875 0.125 1.125 -1.875 -1.875 -2.875 3.125 -5.875 0 0 0 0 0 0 0 0

Only the first 16 elements of the input vector X appear to be processed by std::transform. The remaining 8 elements in the diff vector retain their default-initialized value (0.0f). The calculation for diff[0] is also incorrect if based on X[0]; the first value 4.125 corresponds to X[16] - mean if the output was shifted, but that's less likely than premature termination. More accurately, the 16th value computed (-5.875) correctly corresponds to X[15] - mean

The correct output (achieved when using std::span) would be:

0.125 0.125 1.125 3.125 0.125 3.125 -0.875 3.125 -1.875 0.125 1.125 -1.875 -1.875 -2.875 3.125 -5.875 4.125 -3.875 -1.875 -0.875 2.125 2.125 -0.875 -0.875

Detailed Analysis of the Issue:

  • Vector Sizing: The std::vector<float> diff inside the MeanStdev_gsl function is correctly sized based on v.size(). If v is a span of 24 floats, diff is allocated to hold 24 floats. This indicates that v.size() is correctly read at this point.

  • Premature Loop Termination: The std::transform operation, when gsl::span is used, appears to terminate its loop prematurely. In the example with 24 elements, only 16 elements are processed.

  • Assembly Code Discrepancy:

    • gsl::span version (Incorrect): The provided assembly shows that the compiler generates a scalar loop (using movss, subss) for std::transform. The loop control logic involves:

      • rcx: Input pointer, starting at v.data().
      • rdi: Holds the size of the span in bytes (e.g., 24 elements * 4 bytes/element = 96 bytes). This rdi is derived from v.size() which was also used for diff's allocation.
      • r8: Calculated as v.data() + rdi (i.e., v.data() + size_in_bytes), which should be the correct end pointer for the input span.
      • The loop condition is cmp rcx, r8 followed by jne (loop if not equal). Despite r8 seemingly being calculated for the full range (e.g., 24 elements), the loop effectively terminates as if the range were shorter (16 elements). This suggests that the compiler might be incorrectly optimizing or misinterpreting the loop's trip count or the value of rdi (or the span size it represents) specifically when setting up or executing the std::transform loop with gsl::span and std::bind.
    • std::span version (Correct): The assembly for the std::span version typically shows the compiler successfully auto-vectorizing the loop (using movups, subps), and it processes all elements correctly.

Suspected Cause: The issue seems to be a subtle code generation bug in the MSVC optimizer when handling std::transform with gsl::span (a non-standard library type, though widely used) in conjunction with std::bind. The compiler correctly determines v.size() for memory allocation but appears to use an effectively different (and incorrect) size for controlling the iteration of the std::transform loop. This leads to processing fewer elements than expected. I believe the problem can be reproduced without using GSL, though I haven't found an example.

asm1.txt
asm2.txt

Resolution

This PR change the MeanStdev function to not using gsl::span as a workaround of the original issue. After this change, the two MVN related tests can pass.

@a-akoval
Copy link

a-akoval commented May 28, 2025

We experienced the same problem with std::transform compiler optimization.

The differences are:

  1. We used raw pointers to int32_t as a source range in std::transform, and vector<int64_t> as an output.
  2. 24 elements would be successfully processed in our case, unlike 23 or any other number that is not a multiple of 8. I.e. in our case the last 7 elements would overwrite the first 7 in the output.
  3. The possible resolutions were:
    • using std::span instead of raw pointers
    • using named inline function instead of lambda passed to std::transform
  4. I was able to reproduce the problem within a unit test in the project, but it is not trivially reproducible in a separate test application, even though the same compiler and linker options were applied.

Based on our investigation there is no premature loop termination. All elements are processed, but their processing is split into two blocks:

  1. First block uses SIMD instructions (movups in your case for floats, pmovsxdq in our case for integers) and handles all elements except the remainder (amount is equal to remainder of division by 8 in our case, and 8 in your case for some reason).
  2. Then the remaining elements are processed using scalar operations (movss in your case).

The problem is that the register with the destination address is not updated after the first block. Hence the last remaining elements overwrite the first ones in the resulting output, and the last elements in the result left default initialized (zero).

@a-akoval
Copy link

a-akoval commented May 28, 2025

Another problem is that SparseTensorConversionTests.TestDenseToSparseConversion still fails on my local build after I applied the fix from this PR. The compiler version is 19.44.35207.1 for x64.
The problem in that case with another STL helper std::copy, which is used internally by Google protobuf's RepeatedField<Element>::Add(Iter begin, Iter end) and which behaves exactly like std::transform above when copying data: https://github.com/protocolbuffers/protobuf/blob/v3.21.12/src/google/protobuf/repeated_field.h#L726

The simplest solution is to replace Add(begin, end) with Add(element) in the test code.
This works:

        // tp.mutable_int32_data()->Add(values.cbegin(), values.cend());
        for (const auto& v : values) tp.mutable_int32_data()->Add(v);

Another solution is to upgrade the protobuf version used from outdated 3.21.12 to some recent version, as implementation of RepeatedField<Element>::Add(Iter begin, Iter end) has changed. That especially makes sense given that the recommended version in the onnx itself is 5.29.2.

@snnn snnn merged commit 9705b17 into main May 29, 2025
86 of 88 checks passed
@snnn snnn deleted the snnn/fix_bug branch May 29, 2025 21:24
@vmoroz
Copy link
Member

vmoroz commented May 30, 2025

@snnn, @a-akoval , I had similar issue in my project and reported it here:
https://developercommunity.visualstudio.com/t/Auto-vectorization-breaks-std::copy-in-V/10912564
Please consider to upvote it.

quic-ankus pushed a commit to CodeLinaro/onnxruntime that referenced this pull request Nov 25, 2025
The following unit tests failed when building ONNX Runtime with Visual Studio 17.14 in Release or RelWithDebInfo configuration.

- SparseTensorConversionTests.TestDenseToSparseConversion
- MeanVarianceNormalizationTest.AllAxes
- MVNContribOpTest.MeanVarianceNormalizationCPUTest_Version1_TO_8

This PR provides a workaround for the two MVN tests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants