Shortcut for ZSTD_compressBound() is slower when compiled with MSVC2019 #2314

Closed
ghost opened this issue Sep 18, 2020 · 22 comments

ghost commented Sep 18, 2020

I wrote a pyzstd module for Python. It has a "rich memory mode" in which the size of the output buffer is provided by the ZSTD_compressBound() function.

The documentation says this should be faster, but I observed that when the code is compiled with MSVC2019, "rich memory mode" is slower.

This is my test code; it tests two cases (a minimal sketch of the comparison follows the list):

  • the size of the output buffer is ZSTD_compressBound(srcSize) - 1
  • the size of the output buffer is ZSTD_compressBound(srcSize)
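
A minimal sketch of the comparison (not the exact test program; it assumes src/srcSize are already loaded, leaves the compression level at its default, and abbreviates error handling):

/* Hypothetical helper: compress `src` with a single
 * ZSTD_compressStream2(..., ZSTD_e_end) call into a buffer of `dstCapacity`
 * bytes, then report the output size and the elapsed time. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <zstd.h>

static void compress_once(const void* src, size_t srcSize, size_t dstCapacity)
{
    void* const dst = malloc(dstCapacity);
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    ZSTD_inBuffer in = { src, srcSize, 0 };
    ZSTD_outBuffer out = { dst, dstCapacity, 0 };
    clock_t const start = clock();
    size_t const remaining = ZSTD_compressStream2(cctx, &out, &in, ZSTD_e_end);
    clock_t const stop = clock();
    if (ZSTD_isError(remaining) || remaining != 0) {
        fprintf(stderr, "compression failed or did not finish\n");
    } else {
        printf("output size: %zu, time: %ld ms\n", out.pos,
               (long)((stop - start) * 1000 / CLOCKS_PER_SEC));
    }
    ZSTD_freeCCtx(cctx);
    free(dst);
}

/* The two cases:
 *   compress_once(src, srcSize, ZSTD_compressBound(srcSize) - 1);
 *   compress_once(src, srcSize, ZSTD_compressBound(srcSize));
 */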

On Windows 10 64-bit, MSVC2019:

read 545863680 bytes.
ZSTD_compressBound() value: 547995960

ZSTD_compressBound() - 1
output size: 286842128, time: 9270 ms

ZSTD_compressBound()
output size: 286842126, time: 14669 ms

Using ZSTD_compressBound() as the output buffer size is a lot slower.

FYI, compressing the same file with the pyzstd module (compiled with MSVC2019):

  • ZSTD_compressBound()-1 consumes 4.9 seconds
  • ZSTD_compressBound() consumes 6.9 seconds
  • when the output buffer grows gradually, consumes 5.6 seconds

When compiled with GCC 9.3.0, using ZSTD_compressBound() as the output buffer size is faster, as expected.
On Windows 10 64-bit, Cygwin64 (gcc 9.3.0):

ZSTD_compressBound() - 1
output size: 286842128, time: 4530 ms

ZSTD_compressBound()
output size: 286842126, time: 4376 ms

On Windows 10 WSL2, Ubuntu 9.3.0-17ubuntu1~20.04:

ZSTD_compressBound() - 1
output size: 286842128, time: 4578125 ms (apparently microseconds)

ZSTD_compressBound()
output size: 286842126, time: 4390625 ms (apparently microseconds)

ghost commented Sep 18, 2020

I set #define DEBUGLEVEL 5 and compared the output of the two builds (MSVC2019 / Cygwin64); they are exactly the same except for memory addresses.

Maybe it's not a problem in the zstd code.

compress\zstd_compress.c: ZSTD_compressStream2, endOp=2  
compress\zstd_compress.c: ZSTD_getCParams_internal (cLevel=3) 
compress\zstd_compress.c: ZSTD_compressStream2 : transparent init stage 
compress\zstd_compress.c: ZSTD_getCParams_internal (cLevel=3) 
compress\zstd_compress.c: ZSTD_resetCStream_internal 
compress\zstd_compress.c: ZSTD_getCParams_internal (cLevel=3) 
compress\zstd_compress.c: ZSTD_compressBegin_internal: wlog=21 
compress\zstd_compress.c: ZSTD_resetCCtx_internal: pledgedSrcSize=545863680, wlog=21 
compress\zstd_compress.c: chainSize: 65536 - hSize: 131072 - h3Size: 0 
compress\zstd_compress.c: Need 3567KB workspace, including 768KB for match state, and 2304KB for buffers 
compress\zstd_compress.c: windowSize: 2097152 - blockSize: 131072 
compress\zstd_compress.c: Resize workspaceSize from 0KB to 3567KB 
...


ghost commented Sep 19, 2020

I set #define DEBUGLEVEL 5 and added microsecond timestamps to DEBUGLOG.

It seems the time spent between these two points differs significantly:

  • compress/zstd_double_fast.c : ZSTD_compressBlock_doubleFast_generic
  • compress/zstd_compress.c : ZSTD_compressSequences_internal (nbSeq=dddd)

Some examples:

msvc    gcc
1679    909   (microseconds)
1358    800
1290    791
1490    826
1560    880
1461    847
...

With #define DEBUGLEVEL 6, there are a lot of Cpos lines between the above two:

1 flib/compress/zstd_double_fast.c : ZSTD_compressBlock_doubleFast_generic
4 fE:\dev\pyzstd\lib\compress\zstd_compress_internal.h : Cpos1048576 :  2 literals, match   6 bytes at offCode  71959
4 fE:\dev\pyzstd\lib\compress\zstd_compress_internal.h : Cpos1048584 :  2 literals, match   5 bytes at offCode  88375
3 fE:\dev\pyzstd\lib\compress\zstd_compress_internal.h : Cpos1048591 :  0 literals, match   7 bytes at offCode 223824
9 fE:\dev\pyzstd\lib\compress\zstd_compress_internal.h : Cpos1048598 : 24 literals, match  14 bytes at offCode  89233
2 fE:\dev\pyzstd\lib\compress\zstd_compress_internal.h : Cpos1048636 :  0 literals, match   8 bytes at offCode    112
...(Cpos has more than ten thousand lines)
3 flib/compress/zstd_compress.c : ZSTD_compressSequences_internal (nbSeq=11973)


ghost commented Sep 19, 2020

Sorry, I made a mistake; it's not solved yet.


Cyan4973 commented Sep 19, 2020

It seems the time spent between these two points differs significantly:

If you are testing compression speed at the default compression level,
this is expected:
ZSTD_compressBlock_doubleFast_generic is where the match finding happens;
ZSTD_compressSequences_internal is where the encoding happens.
In terms of relative importance, ZSTD_compressBlock_doubleFast_generic is expected to use significantly more time than ZSTD_compressSequences_internal (likely ~3x).
For analyzing the performance differences between gcc and msvc, these are indeed the prime targets: they likely concentrate > 90% of the cpu time.

With #define DEBUGLEVEL 6, there are a lot of Cpos lines

These are the matches found by ZSTD_compressBlock_doubleFast_generic.
I would expect this list to be identical between gcc and msvc,
otherwise, it would lead to different compressed results.

I set #define DEBUGLEVEL 5 and compared the output of the two builds (MSVC2019 / Cygwin64); they are exactly the same except for memory addresses.

This is expected.
Differences in memory addresses shouldn't lead to any performance difference.

So far, the guess is that the speed difference comes primarily from the match finder.
Something like bad constant propagation, which would prevent the compiler from properly optimizing a specific instance, thus leaving more branches than necessary.
Just a guess though; we don't use MSVC, so we lack first-hand debugging capabilities.


ghost commented Sep 20, 2020

I compared the assembly code generated by MSVC and GCC: the ZSTD_hashPtr() function was not inlined by MSVC.

This patch fixes the problem:

 lib/compress/zstd_compress_internal.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/compress/zstd_compress_internal.h b/lib/compress/zstd_compress_internal.h
index db73f6c..52eea30 100644
--- a/lib/compress/zstd_compress_internal.h
+++ b/lib/compress/zstd_compress_internal.h
@@ -626,7 +626,7 @@ static const U64 prime8bytes = 0xCF1BBCDCB7A56463ULL;
 static size_t ZSTD_hash8(U64 u, U32 h) { return (size_t)(((u) * prime8bytes) >> (64-h)) ; }
 static size_t ZSTD_hash8Ptr(const void* p, U32 h) { return ZSTD_hash8(MEM_readLE64(p), h); }
 
-MEM_STATIC size_t ZSTD_hashPtr(const void* p, U32 hBits, U32 mls)
+FORCE_INLINE_TEMPLATE size_t ZSTD_hashPtr(const void* p, U32 hBits, U32 mls)
 {
     switch(mls)
     {

Compressing the file mentioned above, with ZSTD_compressBound() as the output buffer size:

                 before   after
test code Win10  14.6s    11.9s   (32-bit build)
test code WSL2   4.39s    ----    (64-bit build)
pyzstd Win10     6.9s     4.86s   (64-bit build)
pyzstd WSL2      4.6s     ----    (64-bit build)
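
For context, here is a rough sketch of the difference between the two macros (only an approximation; the real definitions live in lib/common/mem.h and lib/common/compiler.h and differ in detail):

/* Approximate sketch, not the actual zstd definitions:
 * MEM_STATIC is essentially a plain `static inline`, so the inlining
 * decision is left to the compiler, while FORCE_INLINE_TEMPLATE adds a
 * compiler-specific always-inline request. */
#if defined(_MSC_VER)
#  define FORCE_INLINE_TEMPLATE static __forceinline
#elif defined(__GNUC__)
#  define FORCE_INLINE_TEMPLATE static __inline__ __attribute__((always_inline))
#else
#  define FORCE_INLINE_TEMPLATE static inline
#endif

#define MEM_STATIC static __inline

The intent is that inlining ZSTD_hashPtr() expands its small switch on mls at each call site, where mls is typically a compile-time constant, so the compiler can prune it.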

I'm not going to submit a PR, since there are still other MEM_STATIC functions in zstd_compress_internal.h; please check what else needs to be changed.

FYI, assembly code of lib/compress/zstd_double_fast.c generated by MSVC:
https://github.com/animalize/misc/blob/master/zstd_double_fast.asm

@Cyan4973

I went ahead and resurrected an old file system with Windows 10 installed on a home computer.
It contains both a version of Visual Studio and MinGW64.
Not exactly "up to date", but not horribly out of date either:
this is Visual Studio 2017, vs gcc v9.1.0.

First thing: compare the raw performance of these compilers, using zstd's internal benchmark.
Gosh, the performance difference sure is huge.
To give a single data point, at level 3 (default), VS2017 compression speed is 140 MB/s, while gcc 9.1 reaches 200 MB/s. That's way too large a gap.
The decompression speed difference is even worse: VS2017 offers ~620 MB/s, while gcc 9.1 reaches ... 1100 MB/s. Almost 2x faster.
So there is a lot of room for improvement on MSVC.

I then tested your suggested change for ZSTD_hashPtr().
It works great, and produces nice benefits for MSVC.

Level 3 is the one getting the most benefit from this change, moving up from 140 to 180 MB/s. Still slower than gcc (200 MB/s), but a lot closer, and firmly within the "acceptable" range.
Level 1 gets some benefit too, but it's less impressive: from 280 -> 310 MB/s, compared to gcc's 380 MB/s.
Starting at level 5, there may be some minor benefit, but it's difficult to distinguish from the noise. And there sure is a lot of noise on this Windows platform.
I also checked that the performance change on gcc is basically neutral.

Overall, this seems like a positive change, at worst neutral, so I believe it could be integrated into zstd "as is", while further improvements are provided later on.

there are still some MEM_STATIC functions in file zstd_compress_internal.h

Things are not as simple as just converting every MEM_STATIC into FORCE_INLINE_TEMPLATE.
In many cases, such a change doesn't improve performance, or even reduces it.
In all cases, it negatively impacts binary size (or at best is neutral), with a corresponding impact on the instruction cache.

In theory, we are supposed to "trust" the compiler to make the correct choice regarding inlining.
Compilers are supposed to know the cost and benefit of this operation, which varies depending on the other choices they make (inlining might be a good choice for one compiler, and a bad one for another).
In practice, we know this is not true, and compilers can be especially bad at inlining large functions which are drastically simplified thanks to constant propagation, even though that is a major performance boost.

Small functions like ZSTD_hashPtr(), though, are supposed to be easier to evaluate.
But then, it's MSVC; maybe it requires more manual guidance.


Cyan4973 commented Sep 20, 2020

Also,
regarding the issue at hand,
i.e. the difference in performance between dstCapacity = ZSTD_compressBound() and dstCapacity = ZSTD_compressBound()-1 when invoking ZSTD_compressStream2(,,ZSTD_e_end):

I cannot reproduce the reported issue.

On my test system (using MSVC2017), using ZSTD_compressBound() correctly leads to slightly improved performance, matching ZSTD_compress2() on the same data with the same parameters, while ZSTD_compressBound()-1 leads to slightly slower performance (~-5%) due to the need to employ intermediate buffers to guarantee compression success.
I tried this multiple times, with multiple sources and buffer sizes; the results are consistent with expectations.
However, they don't match the first report in this issue.

I can't yet explain this difference in observation. Could it be specific to MSVC2019? Or to the way the workload is measured?
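
For reference, the fast path described above corresponds to the usual single-shot pattern, where the destination buffer is sized with ZSTD_compressBound() so no intermediate buffering is needed (a minimal sketch, error handling abbreviated):

/* Sketch: destination sized at ZSTD_compressBound(srcSize), allowing
 * compression to write directly into the user buffer in a single pass. */
size_t const dstCapacity = ZSTD_compressBound(srcSize);
void* const dst = malloc(dstCapacity);
ZSTD_CCtx* const cctx = ZSTD_createCCtx();
size_t const cSize = ZSTD_compress2(cctx, dst, dstCapacity, src, srcSize);
if (ZSTD_isError(cSize)) { /* handle error */ }
ZSTD_freeCCtx(cctx);
free(dst);

With dstCapacity even one byte below the bound, success can no longer be guaranteed for a direct write, hence the intermediate buffers and the ~5% penalty mentioned above.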


ghost commented Sep 21, 2020

I uploaded the assembly code of zstd v1.4.5 generated by MSVC2019; hope it can help you check the other MEM_STATIC functions.
https://github.com/animalize/zstd_asm

regarding the issue at hand, aka the difference in performance between dstCapacity = ZSTD_compressBound() and dstCapacity = ZSTD_compressBound()-1 when invoking ZSTD_compressStream2(,,ZSTD_e_end)
I cannot reproduce the reported issue.

Could it be because you tested after applying the patch?

Before applying the patch, I also got a ~5% difference when compiled with GCC.

Cygwin64 (gcc 9.3.0):

ZSTD_compressBound() - 1
output size: 286842128, time: 4530 ms

ZSTD_compressBound()
output size: 286842126, time: 4376 ms

On Windows 10 WSL2, Ubuntu 9.3.0-17ubuntu1~20.04:

ZSTD_compressBound() - 1
output size: 286842128, time: 4578125 ms (apparently microseconds)

ZSTD_compressBound()
output size: 286842126, time: 4390625 ms (apparently microseconds)

Moreover, I found the Win10 build is still slower than the WSL2 build (4.85s vs 4.58s), so maybe other places can be improved as well.

                before   after
pyzstd Win10    6.9s     4.85s
pyzstd WSL2     4.58s    ----

(best of 5 results)

@Cyan4973

Could it be because you tested after applying the patch?

I tried both with and without the patch, compiled with MSVC2017.
The direction of the speed difference remains the same, i.e. dstCapacity == ZSTD_compressBound() is faster than dstCapacity == ZSTD_compressBound() - 1.
The delta is ~4% with MEM_STATIC, reduced to ~3% with FORCE_INLINE_TEMPLATE, so roughly equivalent.


ghost commented Sep 21, 2020

Could you have a look at the assembly code of the ZSTD_compressBlock_doubleFast_generic function generated by MSVC2017? In MSVC2019, the ZSTD_hashPtr() function was not inlined:
https://github.com/animalize/zstd_asm/blob/master/v1.4.5_msvc2019/zstd_double_fast.asm#L557

@Cyan4973

With MEM_STATIC, it does call ZSTD_hashPtr.

There's no question regarding the FORCE_INLINE_ATTR impact: it does inline ZSTD_hashPtr.
My observation is solely focused on the impact of dstCapacity, and it's roughly the same whether FORCE_INLINE_ATTR is present or not.


ghost commented Sep 21, 2020

Does it execute a different code path?
What I tested was a 500 MB input file with all parameters at their defaults (unset); compress/zstd_double_fast.c : ZSTD_compressBlock_doubleFast_generic was the low-performance code being executed.

Cyan4973 added a commit that referenced this issue Sep 21, 2020
makes it possible to measure scenarios such as #2314

ghost commented Sep 22, 2020

Adding FORCE_INLINE_ATTR to the ZSTD_getLowestPrefixIndex() function speeds things up a little:
(edit: sorry, this was a mistake.)

4.83s -> 4.76s
(best of 5 results)

ZSTD_getLowestPrefixIndex() function:

/**
 * Returns the lowest allowed match index in the prefix.
 */
MEM_STATIC FORCE_INLINE_ATTR
U32 ZSTD_getLowestPrefixIndex(const ZSTD_matchState_t* ms, U32 current, unsigned windowLog)
{
    U32    const maxDistance = 1U << windowLog;
    U32    const lowestValid = ms->window.dictLimit;
    U32    const withinWindow = (current - lowestValid > maxDistance) ? current - maxDistance : lowestValid;
    U32    const isDictionary = (ms->loadedDictEnd != 0);
    U32    const matchLowest = isDictionary ? lowestValid : withinWindow;
    return matchLowest;
}

I don't know if it will slow things down for other compilers.
Maybe adding an MSVC_FORCE_INLINE_ATTR macro would reduce the uncertainty?
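
Something along these lines (a hypothetical macro, not currently in zstd):

/* Hypothetical sketch: request inlining only under MSVC, and leave other
 * compilers free to make their own inlining decisions. */
#if defined(_MSC_VER)
#  define MSVC_FORCE_INLINE_ATTR __forceinline
#else
#  define MSVC_FORCE_INLINE_ATTR
#endif

/* usage would mirror the experiment above, e.g.:
 *   MEM_STATIC MSVC_FORCE_INLINE_ATTR
 *   U32 ZSTD_getLowestPrefixIndex(const ZSTD_matchState_t* ms, U32 current, unsigned windowLog);
 */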

@Cyan4973

It looks to me like it should be possible to simplify this function.

That being said, it's only invoked once per block,
so it should not be in a position to measurably impact performance.


ghost commented Sep 25, 2020

It's faster almost every time, so it may not be just noise.

(edit: sorry, this was a mistake; adding the attribute to ZSTD_getLowestPrefixIndex() has no effect.)

There are dozens of MEM_STATIC functions; maybe FORCE_INLINE_ATTR can be added to some of them, in the hope that the performance of the MSVC build gets closer to the GCC build.

(edit: I gradually added FORCE_INLINE_ATTR to MEM_STATIC functions; the performance did not change significantly.)

Adding FORCE_INLINE_ATTR only to the ZSTD_hashPtr() function
(output buffer size is ZSTD_compressBound()):

                 before   after

test code 32bit  14.6s    11.9s
test code 64bit  6.9s     4.84s
test code WSL2   4.39s    ----

(best of ~5 results)

Six MEM_readLE16() calls were not inlined in huf_decompress.c; this is probably not expected:
https://github.com/animalize/zstd_asm/blob/master/v1.4.5_msvc2019/huf_decompress.asm#L6499

MEM_STATIC U16 MEM_readLE16(const void* memPtr)
{
    if (MEM_isLittleEndian())
        return MEM_read16(memPtr);
    else {
        const BYTE* p = (const BYTE*)memPtr;
        return (U16)(p[0] + (p[1]<<8));
    }
}

The MEM_readLE16() function was compiled to the following assembly code, and the calls to it were not inlined.
MSVC is not very smart in this case.

; File E:\dev\pyzstd\lib\common\mem.h
_TEXT	SEGMENT
memPtr$ = 8
MEM_readLE16 PROC

; 320  :     if (MEM_isLittleEndian())
; 321  :         return MEM_read16(memPtr);

	movzx	eax, WORD PTR [rcx]

; 322  :     else {
; 323  :         const BYTE* p = (const BYTE*)memPtr;
; 324  :         return (U16)(p[0] + (p[1]<<8));
; 325  :     }
; 326  : }

	ret	0
MEM_readLE16 ENDP
_TEXT	ENDS


jpapp05 commented Sep 30, 2020

Not sure if this helps with your quest to close the performance gap, but Visual Studio 2019 has a new optimization option that might be useful...

See /Ob3, which turns on aggressive inlining...
https://docs.microsoft.com/en-us/cpp/build/reference/ob-inline-function-expansion?view=vs-2019


ghost commented Sep 30, 2020

Not sure if this helps with your quest to close the performance gap, but Visual Studio 2019 has a new optimization option that might be useful...

See /Ob3, which turns on aggressive inlining...
https://docs.microsoft.com/en-us/cpp/build/reference/ob-inline-function-expansion?view=vs-2019

Thanks for this information.

It works: just turning on this option, with no code changes, gives 6.97s -> 4.77s.


Cyan4973 commented Sep 30, 2020

It looks to me like inlining ZSTD_hashPtr() produces the bulk of the benefits,
so it's the one change we should make to the source code,
and that's basically the case now, with the merging of @animalize's PR.

This /Ob3 compilation flag is a pretty good discovery,
thanks for suggesting it @jpapp05.
Using it, instead of the more classical /Ob2, could be done by updating the cmake and visual scripts.
However, I'm concerned about the impact on older versions of Visual Studio.

If they don't support /Ob3 directly, but it becomes equivalent to /Ob2 on such older versions, that's fine.
But if compilation refuses to even start, that's more concerning,
and would require a Visual Studio version selector.


ghost commented Oct 1, 2020

It looks to me that inlining ZSTD_hashPtr() produces the bulk of the benefits,
so it's the one change we should make to the source code

In some ZSTD_hashPtr() calls, the third argument can't be predicted at compile time; maybe forcing inlining will have a slightly negative effect in that case:

for ( ; ip + fastHashFillStep < iend + 2; ip += fastHashFillStep) {
    U32 const current = (U32)(ip - base);
    size_t const hash0 = ZSTD_hashPtr(ip, hBits, mls);
    hashTable[hash0] = current;
    if (dtlm == ZSTD_dtlm_fast) continue;
    /* Only load extra positions for ZSTD_dtlm_full */
    {   U32 p;
        for (p = 1; p < fastHashFillStep; ++p) {
            size_t const hash = ZSTD_hashPtr(ip + p, hBits, mls);
            if (hashTable[hash] == 0) {  /* not yet filled */
                hashTable[hash] = current + p;
    }   }   }   }

Remember that zstd v1.4.6 will be released in the near future; maybe this can be carefully improved later.

Some ZSTD_hashPtr() definitions don't use the inline keyword, please have a look:
(edit: in /contrib/linux-kernel/lib/zstd/compress.c, having no inline keyword seems OK.)


ghost commented Oct 3, 2020

Also,
regarding the issue at hand,
i.e. the difference in performance between dstCapacity = ZSTD_compressBound() and dstCapacity = ZSTD_compressBound()-1 when invoking ZSTD_compressStream2(,,ZSTD_e_end),

I cannot reproduce the reported issue.

Maybe it's due to MSVC2017; when using MSVC2017 I also get that result:

                           MSVC2017      MSVC2019
ZSTD_compressBound()       6.4 sec       6.9 sec
ZSTD_compressBound()-1     7.0 sec       5.0 sec

(zstd v1.4.5, no code changed.)


ghost commented Dec 17, 2020

When I have time, I will write a script to find uninlined functions in MSVC builds; maybe the performance can be improved a little.

ghost closed this as completed Dec 17, 2020

ghost commented Dec 27, 2020

I will write a script to find uninlined functions in MSVC

I'm working on this; I hope it can be completed in two or three weeks, as a good pastime. :)

This issue was closed.