Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

std::swap of arrays, why is there no specialization for trivial types #2683

Closed
monamimani opened this issue Apr 28, 2022 · 11 comments · Fixed by #2689 or #4991
Closed

std::swap of arrays, why is there no specialization for trivial types #2683

monamimani opened this issue Apr 28, 2022 · 11 comments · Fixed by #2689 or #4991
Labels
fixed Something works now, yay! performance Must go faster

Comments

@monamimani
Copy link

I was looking at what swap was doing for arrays and in the MSVC stl it loops trough the array and swap each element. (libstdc++ and libc++ do the same thing).

The thing is, it was an array of std::byte, I was surprised that none of the STL have specializations for trivially copyable/movable types that would call memcopy. Now I figure, this is maybe how the spec is. So my first question is there something preventing those specialization from existing.

My second question which is maybe off-topic for this is even tough clang and gcc also do a loop, "I believe" their auto vectorizer see the pattern and emits SSE instructions but MSVC doesn't it still loops.

Here is a compiler explorer link
https://godbolt.org/z/W8GxPv1ov

Now this is maybe something more for the compiler team, I don't know.
In any case I saw the differences and I thought I would point it out.

Thank you

@monamimani monamimani added the question Further information is requested label Apr 28, 2022
@frederick-vs-ja
Copy link
Contributor

I think this is a compiler issue, as libstdc++'s implementation of std::swap is naive.

Perhaps we can use the same mechanism as that of std::swap_ranges for vectorization, but I don't know if it is suitable.

@StephanTLavavej StephanTLavavej added performance Must go faster and removed question Further information is requested labels Apr 28, 2022
@StephanTLavavej
Copy link
Member

@monamimani , note that we can't simply use memcpy/memmove as they don't swap, they overwrite (so we'd need temporary storage).

@frederick-vs-ja , I believe that the vectorized swap_ranges implementation is indeed applicable to std::swap() for built-in arrays; I also observe that std::array::swap already uses it:

STL/stl/inc/array

Lines 461 to 463 in 314f65f

_CONSTEXPR20 void swap(array& _Other) noexcept(_Is_nothrow_swappable<_Ty>::value) {
_Swap_ranges_unchecked(_Elems, _Elems + _Size, _Other._Elems);
}

STL/stl/inc/xutility

Lines 6129 to 6143 in 314f65f

template <class _FwdIt1, class _FwdIt2>
_CONSTEXPR20 _FwdIt2 _Swap_ranges_unchecked(_FwdIt1 _First1, const _FwdIt1 _Last1, _FwdIt2 _First2) {
// swap [_First1, _Last1) with [_First2, ...)
#if _USE_STD_VECTOR_ALGORITHMS
using _Elem1 = remove_reference_t<_Iter_ref_t<_FwdIt1>>;
using _Elem2 = remove_reference_t<_Iter_ref_t<_FwdIt2>>;
if constexpr (is_same_v<_Elem1, _Elem2> && _Is_trivially_swappable_v<_Elem1> //
&& _Iterators_are_contiguous<_FwdIt1, _FwdIt2>) {
#if _HAS_CXX20
if (!_STD is_constant_evaluated())
#endif // _HAS_CXX20
{
__std_swap_ranges_trivially_swappable_noalias(
_To_address(_First1), _To_address(_Last1), _To_address(_First2));

@monamimani
Copy link
Author

@StephanTLavavej, of course it would need temporary storage like swapping ints. Sorry that I wasn't clear enough. :)

Because I feel this is more an issue with the vectorizer I also opened an issue on the Developper Community

@StephanTLavavej
Copy link
Member

I can implement this - it just needs a bit of surgery in <xutility> and <utility>. PR coming soon...

@frederick-vs-ja
Copy link
Contributor

I can implement this - it just needs a bit of surgery in <xutility> and <utility>. PR coming soon...

Thank you!

But I found that <utility> is categorized as a core header in MSVC STL, while <tuple> is not. I'm afraid that it is possibly the vectorized algorithms declared in <xutility> that make <tuple> non-core. BTW, I think <tuple> should also be a core header.

@StephanTLavavej StephanTLavavej added the fixed Something works now, yay! label May 1, 2022
@monamimani
Copy link
Author

Great thanks!

@StephanTLavavej
Copy link
Member

I need to revert my PR as this broke users of std::swap who were including only <utility>, see #2699. Reopening this issue as we may be able to solve this later, but we need to carefully think about users like ATL who are including <type_traits> and expecting no IDL mismatch pragmas to be dragged in. (We could possibly separate out the pragma that drags in the import lib / static lib from the rest of the <yvals.h> machinery.)

@monamimani
Copy link
Author

As I figured, at the core it might be a Vectorizer issue I also opened an issue over at the Developper Community, here

They have acknowledge the issue and are working on it, it may take time. People that have access might be able to know more. I at least wanted to tell you they are aware of this.

I don't exactly know what the mechanism of std::swap_ranges does but, as a stop gap is it possible to just do a specialization for trivial type, constrain it and use memcpy? I bet the standard specify more and would prevent that.

Why not do something stupid like this?

    std::byte storageTmp[Size];
    std::memcpy(storageTmp, arrayA, Size);
    std::memcpy(arrayA, arrayB, Size);
    std::memcpy(arrayB, storageTmp, Size);

I am pretty sure that something that I don't know about will make this none desirable, but this should get vectorized.

@monamimani
Copy link
Author

monamimani commented May 4, 2022

I might add that from their own page The 13xx reason codes apply to the vectorizer.

void code_1300(int *A, int *B)
{
    // Code 1300 is emitted when the compiler detects that there is
    // no computation in the loop body.

    for (int i=0; i<1000; ++i)
    {
        A[i] = B[i]; // Do not vectorize, instead emit memcpy
    }
}

It says it should emit memcpy. but it doesn't. :)

@StephanTLavavej
Copy link
Member

Thanks - the autovectorizer may not be able to deal with the swap algorithm, but we do plan to properly fix this in the STL in the future. (Our vectorized algorithm is faster than memcpy, by the way.)

@AlexGuteniev
Copy link
Contributor

AlexGuteniev commented Sep 24, 2024

There's a possibility to implement that in headers using memcpy and fixed power-of-two portions.

See how copying by 32-bit portions is vectorized using SSE2 and AVX2: https://godbolt.org/z/YcEPz848W

Note how despite using immediate buffer in C++ it is not used in the assembly.

Can do this in two or more loops with descending sizes to handle the variety of sizes.

The exact portions size and the way of handling the tail (memcpy of variable length vs manual loop vs unrolled loop) is sort of trade-off between code size and performance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fixed Something works now, yay! performance Must go faster
Projects
None yet
4 participants