Shortcut for ZSTD_compressBound() is slower when compiled with MSVC2019 #2314
Comments
Maybe it's not a problem in the zstd code.
|
It seems there is a significant time difference between these two:
Some examples:
If
|
Sorry, I made a mistake, not solved yet. |
If you are testing compression speed using default compression level,
These are the matches found by
This is expected. So far, the guess is that the speed difference probably comes primarily from the matchfinder. |
I compared the assembly code generated by MSVC and GCC, This patch fixes the problem:
 lib/compress/zstd_compress_internal.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/compress/zstd_compress_internal.h b/lib/compress/zstd_compress_internal.h
index db73f6c..52eea30 100644
--- a/lib/compress/zstd_compress_internal.h
+++ b/lib/compress/zstd_compress_internal.h
@@ -626,7 +626,7 @@ static const U64 prime8bytes = 0xCF1BBCDCB7A56463ULL;
static size_t ZSTD_hash8(U64 u, U32 h) { return (size_t)(((u) * prime8bytes) >> (64-h)) ; }
static size_t ZSTD_hash8Ptr(const void* p, U32 h) { return ZSTD_hash8(MEM_readLE64(p), h); }
-MEM_STATIC size_t ZSTD_hashPtr(const void* p, U32 hBits, U32 mls)
+FORCE_INLINE_TEMPLATE size_t ZSTD_hashPtr(const void* p, U32 hBits, U32 mls)
{
switch(mls)
{
Compress the file mentioned above, the output buffer size is
I'm not going to submit a PR, since there are still some FYI, assembly code of |
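For readers who want to see what the one-line patch above changes in practice, here is a minimal, self-contained sketch of a hash dispatcher marked for forced inlining. It is not the zstd source: `DEMO_FORCE_INLINE`, the `demo_*` names, and the byte-wise little-endian readers are hypothetical stand-ins, and the 4-byte multiplier is an assumption chosen for illustration. Only `prime8bytes` and the shape of the `switch(mls)` come from the diff context. The idea is that when `mls` is a compile-time constant at the call site and the dispatcher is inlined, the compiler can drop the `switch` entirely.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical macro for this sketch: ask each compiler to always inline. */
#if defined(_MSC_VER)
#  define DEMO_FORCE_INLINE static __forceinline
#elif defined(__GNUC__)
#  define DEMO_FORCE_INLINE static inline __attribute__((always_inline))
#else
#  define DEMO_FORCE_INLINE static inline
#endif

/* 8-byte multiplier copied from the diff context above. */
static const uint64_t demo_prime8bytes = 0xCF1BBCDCB7A56463ULL;
/* 4-byte multiplier: an assumption for illustration only. */
static const uint32_t demo_prime4bytes = 2654435761U;

/* Portable little-endian reads (stand-ins for MEM_readLE32 / MEM_readLE64). */
DEMO_FORCE_INLINE uint32_t demo_read_le32(const void* p)
{
    const unsigned char* b = (const unsigned char*)p;
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8)
         | ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

DEMO_FORCE_INLINE uint64_t demo_read_le64(const void* p)
{
    const unsigned char* b = (const unsigned char*)p;
    return (uint64_t)demo_read_le32(b) | ((uint64_t)demo_read_le32(b + 4) << 32);
}

/* Multiplicative hashes; h must be in 1..32 (hash4) or 1..64 (hash8). */
DEMO_FORCE_INLINE size_t demo_hash4(uint32_t u, uint32_t h)
{
    return (size_t)((u * demo_prime4bytes) >> (32 - h));
}

DEMO_FORCE_INLINE size_t demo_hash8(uint64_t u, uint32_t h)
{
    return (size_t)((u * demo_prime8bytes) >> (64 - h));
}

/* Dispatcher: with forced inlining and a constant mls, the switch folds away. */
DEMO_FORCE_INLINE size_t demo_hash_ptr(const void* p, uint32_t hBits, uint32_t mls)
{
    switch (mls) {
    case 8: return demo_hash8(demo_read_le64(p), hBits);
    default:
    case 4: return demo_hash4(demo_read_le32(p), hBits);
    }
}

/* A caller that fixes mls at compile time, as a specialized match finder would. */
size_t demo_hash_position_mls8(const void* p, uint32_t hBits)
{
    return demo_hash_ptr(p, hBits, 8);
}
```

Comparing the assembly emitted for `demo_hash_position_mls8` with and without the forced-inline attribute is one way to check whether a given compiler keeps a call to the dispatcher.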
I went ahead and resurrected an old file system with Windows 10 installed on a home computer. First thing: compare raw performance of these compilers, using I then tested your suggested change for Level 3 is the one offering the best benefit, moving up from 140 to 180 MB/s. Still slower than Level 3 gets the most benefit from this change. Overall, this seems like a positive change, at worst neutral, so I believe it could be integrated
Things are not as simple as just transferring all In theory, we are supposed to "trust" the compiler, about making the correct choice regarding inlining. Small functions like |
Also, I cannot reproduce the reported issue. On my test system (using MSVC2017), using I can't yet explain this difference in observation. Could it be specific to MSVC2019? Or to the way the workload is measured? |
I uploaded assembly code of zstd v1.4.5 generated by MSVC2019, hope it can help you to check other
Could it be because you tested after applying the patch? Before applying the patch, I also got a ~5% difference when compiled with GCC.
Moreover, I found that the Win10 build is still slower than the WSL2 build (4.85s -> 4.58s); maybe other sites can be improved as well.
|
I tried both with and without the patch, and compiled with MSVC2017. |
Could you have a look at the assembly code of |
With There's no question regarding the |
Does it execute a different code path? |
makes it possible to measure scenarios such as #2314
ZSTD_getLowestPrefixIndex() function:
/**
 * Returns the lowest allowed match index in the prefix.
 */
MEM_STATIC FORCE_INLINE_ATTR
U32 ZSTD_getLowestPrefixIndex(const ZSTD_matchState_t* ms, U32 current, unsigned windowLog)
{
    U32 const maxDistance = 1U << windowLog;
    U32 const lowestValid = ms->window.dictLimit;
    U32 const withinWindow = (current - lowestValid > maxDistance) ? current - maxDistance : lowestValid;
    U32 const isDictionary = (ms->loadedDictEnd != 0);
    U32 const matchLowest = isDictionary ? lowestValid : withinWindow;
    return matchLowest;
}
Don't know if it will slow down for other compilers. |
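To make the window-clamping logic in the function quoted above easier to follow, here is a standalone sketch with the same arithmetic. The `demo_match_state` struct is a hypothetical, flattened stand-in for `ZSTD_matchState_t` (only the two fields the function reads), so this is an illustration rather than the library's code.

```c
#include <stdint.h>

/* Hypothetical, flattened stand-in for the two fields the function reads. */
typedef struct {
    uint32_t dictLimit;      /* lowest valid index in the prefix */
    uint32_t loadedDictEnd;  /* non-zero when a dictionary is loaded */
} demo_match_state;

/* Same arithmetic as the quoted function: clamp to the window unless a
 * dictionary is loaded, in which case the prefix start is returned as-is. */
static uint32_t demo_lowest_prefix_index(const demo_match_state* ms,
                                         uint32_t current, unsigned windowLog)
{
    uint32_t const maxDistance  = 1U << windowLog;
    uint32_t const lowestValid  = ms->dictLimit;
    uint32_t const withinWindow = (current - lowestValid > maxDistance)
                                ? current - maxDistance
                                : lowestValid;
    uint32_t const isDictionary = (ms->loadedDictEnd != 0);
    return isDictionary ? lowestValid : withinWindow;
}
```

For example, with `dictLimit = 100`, `windowLog = 10` (a 1024-byte window) and `current = 5000`, the distance 4900 exceeds the window, so the result is 5000 - 1024 = 3976 when no dictionary is loaded, and 100 when one is.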
It looks to me that it should be possible to simplify this function. That being said, it's only invoked once per block, |
(edit, sorry, this was a mistake, adding the attr to
(edit, I gradually added Only add
6 MEM_STATIC U16 MEM_readLE16(const void* memPtr)
{
if (MEM_isLittleEndian())
return MEM_read16(memPtr);
else {
const BYTE* p = (const BYTE*)memPtr;
return (U16)(p[0] + (p[1]<<8));
}
}
; File E:\dev\pyzstd\lib\common\mem.h
_TEXT SEGMENT
memPtr$ = 8
MEM_readLE16 PROC
; 320 : if (MEM_isLittleEndian())
; 321 : return MEM_read16(memPtr);
movzx eax, WORD PTR [rcx]
; 322 : else {
; 323 : const BYTE* p = (const BYTE*)memPtr;
; 324 : return (U16)(p[0] + (p[1]<<8));
; 325 : }
; 326 : }
ret 0
MEM_readLE16 ENDP
_TEXT ENDS |
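As a reading aid for the listing above: on a little-endian target the endianness test is a compile-time constant, so the whole function reduces to a single load (the `movzx` in the listing). The fact that a standalone `MEM_readLE16 PROC` appears at all suggests the compiler also kept an out-of-line copy. The sketch below reproduces the same structure with standard C only; the `demo_*` helpers are hypothetical stand-ins for `MEM_isLittleEndian` and `MEM_read16`, not the zstd implementations.

```c
#include <stdint.h>
#include <string.h>

/* Runtime endianness probe; compilers typically fold this to a constant. */
static int demo_is_little_endian(void)
{
    const union { uint32_t u; unsigned char c[4]; } one = { 1 };
    return one.c[0];
}

/* Unaligned native-endian 16-bit read via memcpy (no aliasing issues). */
static uint16_t demo_read16(const void* p)
{
    uint16_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Little-endian 16-bit read, same shape as the function in the listing. */
static uint16_t demo_read_le16(const void* p)
{
    if (demo_is_little_endian()) {
        return demo_read16(p);
    } else {
        const unsigned char* b = (const unsigned char*)p;
        return (uint16_t)(b[0] + (b[1] << 8));
    }
}
```

Both branches return the same value for the same bytes, so the function behaves identically on either kind of host; only the generated code differs.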
Not sure if this helps with your quest to close the performance gap, but Visual Studio 2019 has a new optimization option that might be useful... See /Ob3, which turns on aggressive inlining... |
Thanks for this information. It works, just turn on this option, no code changed: |
It looks to me that inlining This If they don't support |
In some Lines 30 to 41 in a880ca2
Remember that zstd v1.4.6 will be released in the near future; maybe this problem can be carefully improved later.
|
Maybe it is due to MSVC2017; when using MSVC2017 I also get this result:
|
When I have time, I will write a script to find uninlined functions in MSVC, maybe the performance can be improved a little. |
I'm doing this, hope it can be completed in two or three weeks, as a good pastime. :) |
I wrote a pyzstd module for Python. It has a "rich memory mode" in which the size of the output buffer is provided by the ZSTD_compressBound() function. The documentation says this should be faster, but I observed that when the code is compiled with MSVC2019, "rich memory mode" is slower.
This is a test code; it tests two cases:
ZSTD_compressBound(srcSize) - 1
ZSTD_compressBound(srcSize)
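To give a feel for how close the two output buffer sizes in these cases are, here is a small sketch that computes both without linking against zstd. The formula is my recollection of the `ZSTD_COMPRESSBOUND` macro in `zstd.h` around v1.4.5, so treat it as an assumption and call the real `ZSTD_compressBound()` in actual code; the `demo_*` names are hypothetical.

```c
#include <stddef.h>

/* Assumed bound formula: srcSize + srcSize/256, plus a small margin for
 * inputs under 128 KB. Not authoritative; use ZSTD_compressBound() for real. */
static size_t demo_compress_bound(size_t srcSize)
{
    size_t const threshold = (size_t)128 << 10;
    size_t const margin = (srcSize < threshold)
                        ? ((threshold - srcSize) >> 11)
                        : 0;
    return srcSize + (srcSize >> 8) + margin;
}

/* The two output capacities the test compares. */
typedef struct {
    size_t belowBound;  /* ZSTD_compressBound(srcSize) - 1 */
    size_t atBound;     /* ZSTD_compressBound(srcSize)     */
} demo_capacities;

static demo_capacities demo_two_cases(size_t srcSize)
{
    demo_capacities c;
    c.atBound = demo_compress_bound(srcSize);
    c.belowBound = c.atBound - 1;
    return c;
}
```

The two capacities differ by a single byte, so any timing gap between the cases comes from which code path the compressor selects, not from the amount of memory involved.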
On Windows 10 64-bit, MSVC2019:
With the ZSTD_compressBound() output buffer size, compression is a lot slower.
FYI, compressing the same file using the pyzstd module (compiled with MSVC2019):
ZSTD_compressBound()-1 consumes 4.9 seconds
ZSTD_compressBound() consumes 6.9 seconds
When compiled with GCC 9.3.0, the ZSTD_compressBound() output buffer size is faster, as expected:
On Windows 10 64-bit, Cygwin64 (gcc 9.3.0):
On Windows 10 WSL2, Ubuntu 9.3.0-17ubuntu1~20.04: