Shortcut for ZSTD_compressBound() is slower when compiled with MSVC2019 #2314

Closed
ghost opened this issue Sep 18, 2020 · 22 comments

ghost commented Sep 18, 2020

I wrote a pyzstd module for Python. It has a "rich memory mode" in which the size of the output buffer is provided by the ZSTD_compressBound() function.

The documentation says this should be faster, but I observed that when the code is compiled with MSVC2019, "rich memory mode" is slower.

This is my test code; it tests two cases (a minimal sketch of the comparison follows the list):

  • the size of the output buffer is ZSTD_compressBound(srcSize) - 1
  • the size of the output buffer is ZSTD_compressBound(srcSize)
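
A minimal sketch of the comparison (not the exact test program; it assumes src/srcSize are already loaded, leaves the compression level at its default, and abbreviates error handling):

/* Hypothetical helper: compress `src` with a single
 * ZSTD_compressStream2(..., ZSTD_e_end) call into a buffer of `dstCapacity`
 * bytes, then report the output size and the elapsed time. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <zstd.h>

static void compress_once(const void* src, size_t srcSize, size_t dstCapacity)
{
    void* const dst = malloc(dstCapacity);
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    ZSTD_inBuffer in = { src, srcSize, 0 };
    ZSTD_outBuffer out = { dst, dstCapacity, 0 };
    clock_t const start = clock();
    size_t const remaining = ZSTD_compressStream2(cctx, &out, &in, ZSTD_e_end);
    clock_t const stop = clock();
    if (ZSTD_isError(remaining) || remaining != 0) {
        fprintf(stderr, "compression failed or did not finish\n");
    } else {
        printf("output size: %zu, time: %ld ms\n", out.pos,
               (long)((stop - start) * 1000 / CLOCKS_PER_SEC));
    }
    ZSTD_freeCCtx(cctx);
    free(dst);
}

/* The two cases:
 *   compress_once(src, srcSize, ZSTD_compressBound(srcSize) - 1);
 *   compress_once(src, srcSize, ZSTD_compressBound(srcSize));
 */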

On Windows 10 64-bit, MSVC2019:

read 545863680 bytes.
ZSTD_compressBound() value: 547995960

ZSTD_compressBound() - 1
output size: 286842128, time: 9270 ms

ZSTD_compressBound()
output size: 286842126, time: 14669 ms

Using ZSTD_compressBound() as the output buffer size is a lot slower.

FYI, compressing the same file with the pyzstd module (compiled with MSVC2019):

  • ZSTD_compressBound()-1 consumes 4.9 seconds
  • ZSTD_compressBound() consumes 6.9 seconds
  • when the output buffer grows gradually, consumes 5.6 seconds

When compiled with GCC 9.3.0, using ZSTD_compressBound() as the output buffer size is faster, as expected.
On Windows 10 64-bit, Cygwin64 (gcc 9.3.0):

ZSTD_compressBound() - 1
output size: 286842128, time: 4530 ms

ZSTD_compressBound()
output size: 286842126, time: 4376 ms

On Windows 10 WSL2, Ubuntu 9.3.0-17ubuntu1~20.04:

ZSTD_compressBound() - 1
output size: 286842128, time: 4578125 ms (apparently microseconds)

ZSTD_compressBound()
output size: 286842126, time: 4390625 ms (apparently microseconds)

ghost commented Sep 18, 2020

I set #define DEBUGLEVEL 5 and compared the output of the two builds (MSVC2019 / Cygwin64); they are exactly the same except for memory addresses.

Maybe it's not a problem in the zstd code.

compress\zstd_compress.c: ZSTD_compressStream2, endOp=2  
compress\zstd_compress.c: ZSTD_getCParams_internal (cLevel=3) 
compress\zstd_compress.c: ZSTD_compressStream2 : transparent init stage 
compress\zstd_compress.c: ZSTD_getCParams_internal (cLevel=3) 
compress\zstd_compress.c: ZSTD_resetCStream_internal 
compress\zstd_compress.c: ZSTD_getCParams_internal (cLevel=3) 
compress\zstd_compress.c: ZSTD_compressBegin_internal: wlog=21 
compress\zstd_compress.c: ZSTD_resetCCtx_internal: pledgedSrcSize=545863680, wlog=21 
compress\zstd_compress.c: chainSize: 65536 - hSize: 131072 - h3Size: 0 
compress\zstd_compress.c: Need 3567KB workspace, including 768KB for match state, and 2304KB for buffers 
compress\zstd_compress.c: windowSize: 2097152 - blockSize: 131072 
compress\zstd_compress.c: Resize workspaceSize from 0KB to 3567KB 
...


ghost commented Sep 19, 2020

I set #define DEBUGLEVEL 5 and added microsecond timestamps to DEBUGLOG.

It seems the time spent between these two points differs significantly:

  • compress/zstd_double_fast.c : ZSTD_compressBlock_doubleFast_generic
  • compress/zstd_compress.c : ZSTD_compressSequences_internal (nbSeq=dddd)

Some examples:

msvc    gcc
1679    909   (microseconds)
1358    800
1290    791
1490    826
1560    880
1461    847
...

With #define DEBUGLEVEL 6, there are a lot of Cpos lines between the above two:

1 flib/compress/zstd_double_fast.c : ZSTD_compressBlock_doubleFast_generic
4 fE:\dev\pyzstd\lib\compress\zstd_compress_internal.h : Cpos1048576 :  2 literals, match   6 bytes at offCode  71959
4 fE:\dev\pyzstd\lib\compress\zstd_compress_internal.h : Cpos1048584 :  2 literals, match   5 bytes at offCode  88375
3 fE:\dev\pyzstd\lib\compress\zstd_compress_internal.h : Cpos1048591 :  0 literals, match   7 bytes at offCode 223824
9 fE:\dev\pyzstd\lib\compress\zstd_compress_internal.h : Cpos1048598 : 24 literals, match  14 bytes at offCode  89233
2 fE:\dev\pyzstd\lib\compress\zstd_compress_internal.h : Cpos1048636 :  0 literals, match   8 bytes at offCode    112
...(Cpos has more than ten thousand lines)
3 flib/compress/zstd_compress.c : ZSTD_compressSequences_internal (nbSeq=11973)


ghost commented Sep 19, 2020

Sorry, I made a mistake; it's not solved yet.


Cyan4973 commented Sep 19, 2020

It seems the time spent between these two points differs significantly:

If you are testing compression speed at the default compression level,
this is expected:
ZSTD_compressBlock_doubleFast_generic is where the match finding happens;
ZSTD_compressSequences_internal is where the encoding happens.
In terms of relative importance, ZSTD_compressBlock_doubleFast_generic is expected to use significantly more time than ZSTD_compressSequences_internal (likely ~3x).
For analyzing the performance differences between gcc and msvc, these are indeed the prime targets: they likely concentrate > 90% of the cpu time.

With #define DEBUGLEVEL 6, there are a lot of Cpos lines

These are the matches found by ZSTD_compressBlock_doubleFast_generic.
I would expect this list to be identical between gcc and msvc,
otherwise, it would lead to different compressed results.

I set #define DEBUGLEVEL 5 and compared the output of the two builds (MSVC2019 / Cygwin64); they are exactly the same except for memory addresses.

This is expected.
Differences in memory addresses shouldn't lead to any performance difference.

So far, the guess is that the speed difference comes primarily from the match finder.
Something like bad constant propagation, which would prevent the compiler from properly optimizing a specific instance, thus leaving more branches than necessary.
Just a guess though; we don't use MSVC, so we lack first-hand debugging capabilities.


ghost commented Sep 20, 2020

I compared the assembly code generated by MSVC and GCC: the ZSTD_hashPtr() function was not inlined by MSVC.

This patch fixes the problem:

 lib/compress/zstd_compress_internal.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/compress/zstd_compress_internal.h b/lib/compress/zstd_compress_internal.h
index db73f6c..52eea30 100644
--- a/lib/compress/zstd_compress_internal.h
+++ b/lib/compress/zstd_compress_internal.h
@@ -626,7 +626,7 @@ static const U64 prime8bytes = 0xCF1BBCDCB7A56463ULL;
 static size_t ZSTD_hash8(U64 u, U32 h) { return (size_t)(((u) * prime8bytes) >> (64-h)) ; }
 static size_t ZSTD_hash8Ptr(const void* p, U32 h) { return ZSTD_hash8(MEM_readLE64(p), h); }
 
-MEM_STATIC size_t ZSTD_hashPtr(const void* p, U32 hBits, U32 mls)
+FORCE_INLINE_TEMPLATE size_t ZSTD_hashPtr(const void* p, U32 hBits, U32 mls)
 {
     switch(mls)
     {

Compressing the file mentioned above, with ZSTD_compressBound() as the output buffer size:

                 before   after
test code Win10  14.6s    11.9s   (32-bit build)
test code WSL2   4.39s    ----    (64-bit build)
pyzstd Win10     6.9s     4.86s   (64-bit build)
pyzstd WSL2      4.6s     ----    (64-bit build)
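
For context, here is a rough sketch of the difference between the two macros (only an approximation; the real definitions live in lib/common/mem.h and lib/common/compiler.h and differ in detail):

/* Approximate sketch, not the actual zstd definitions:
 * MEM_STATIC is essentially a plain `static inline`, so the inlining
 * decision is left to the compiler, while FORCE_INLINE_TEMPLATE adds a
 * compiler-specific always-inline request. */
#if defined(_MSC_VER)
#  define FORCE_INLINE_TEMPLATE static __forceinline
#elif defined(__GNUC__)
#  define FORCE_INLINE_TEMPLATE static __inline__ __attribute__((always_inline))
#else
#  define FORCE_INLINE_TEMPLATE static inline
#endif

#define MEM_STATIC static __inline

The intent is that inlining ZSTD_hashPtr() expands its small switch on mls at each call site, where mls is typically a compile-time constant, so the compiler can prune it.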

I'm not going to submit a PR, since there are still other MEM_STATIC functions in zstd_compress_internal.h; please check what else needs to be changed.

FYI, assembly code of lib/compress/zstd_double_fast.c generated by MSVC:
https://github.com/animalize/misc/blob/master/zstd_double_fast.asm

@Cyan4973

I went ahead and resurrected an old file system with Windows 10 installed on a home computer.
It contains both a version of Visual Studio and MinGW64.
Not exactly "up to date", but not horribly out of date either:
this is Visual Studio 2017, vs gcc v9.1.0.

First thing: compare the raw performance of these compilers, using zstd's internal benchmark.
Gosh, the performance difference sure is huge.
To give a single data point, at level 3 (default), VS2017 compression speed is 140 MB/s, while gcc 9.1 reaches 200 MB/s. That's way too large a gap.
The decompression speed difference is even worse: VS2017 offers ~620 MB/s, while gcc 9.1 reaches ... 1100 MB/s. Almost 2x faster.
So there is a lot of room for improvement on MSVC.

I then tested your suggested change for ZSTD_hashPtr().
It works great, and produces nice benefits for MSVC.

Level 3 is the one getting the most benefit from this change, moving up from 140 to 180 MB/s. Still slower than gcc (200 MB/s), but a lot closer, and firmly within the "acceptable" range.
Level 1 gets some benefit too, but it's less impressive: from 280 -> 310 MB/s, compared to gcc's 380 MB/s.
Starting at level 5, there may be some minor benefit, but it's difficult to distinguish from the noise. And there sure is a lot of noise on this Windows platform.
I also checked that the performance change on gcc is basically neutral.

Overall, this seems like a positive change, at worst neutral, so I believe it could be integrated into zstd "as is", while further improvements are provided later on.

there are still some MEM_STATIC functions in file zstd_compress_internal.h

Things are not as simple as just converting every MEM_STATIC into FORCE_INLINE_TEMPLATE.
In many cases, such a change doesn't improve performance, or even reduces it.
In all cases, it negatively impacts binary size (or at best is neutral), with a corresponding impact on the instruction cache.

In theory, we are supposed to "trust" the compiler to make the correct choice regarding inlining.
Compilers are supposed to know the cost and benefit of this operation, which varies depending on the other choices they make (inlining might be a good choice for one compiler, and a bad one for another).
In practice, we know this is not true, and compilers can be especially bad at inlining large functions which are drastically simplified thanks to constant propagation, even though that is a major performance boost.

Small functions like ZSTD_hashPtr(), though, are supposed to be easier to evaluate.
But then, it's MSVC; maybe it requires more manual guidance.


Cyan4973 commented Sep 20, 2020

Also,
regarding the issue at hand,
i.e. the difference in performance between dstCapacity = ZSTD_compressBound() and dstCapacity = ZSTD_compressBound()-1 when invoking ZSTD_compressStream2(,,ZSTD_e_end):

I cannot reproduce the reported issue.

On my test system (using MSVC2017), using ZSTD_compressBound() correctly leads to slightly improved performance, matching ZSTD_compress2() on the same data with the same parameters, while ZSTD_compressBound()-1 leads to slightly slower performance (~-5%) due to the need to employ intermediate buffers to guarantee compression success.
I tried this multiple times, with multiple sources and buffer sizes; the results are consistent with expectations.
However, they don't match the first report in this issue.

I can't yet explain this difference in observation. Could it be specific to MSVC2019? Or to the way the workload is measured?
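
For reference, the fast path described above corresponds to the usual single-shot pattern, where the destination buffer is sized with ZSTD_compressBound() so no intermediate buffering is needed (a minimal sketch, error handling abbreviated):

/* Sketch: destination sized at ZSTD_compressBound(srcSize), allowing
 * compression to write directly into the user buffer in a single pass. */
size_t const dstCapacity = ZSTD_compressBound(srcSize);
void* const dst = malloc(dstCapacity);
ZSTD_CCtx* const cctx = ZSTD_createCCtx();
size_t const cSize = ZSTD_compress2(cctx, dst, dstCapacity, src, srcSize);
if (ZSTD_isError(cSize)) { /* handle error */ }
ZSTD_freeCCtx(cctx);
free(dst);

With dstCapacity even one byte below the bound, success can no longer be guaranteed for a direct write, hence the intermediate buffers and the ~5% penalty mentioned above.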


ghost commented Sep 21, 2020

I uploaded the assembly code of zstd v1.4.5 generated by MSVC2019; hope it can help you check the other MEM_STATIC functions.
https://github.com/animalize/zstd_asm

regarding the issue at hand, aka the difference in performance between dstCapacity = ZSTD_compressBound() and dstCapacity = ZSTD_compressBound()-1 when invoking ZSTD_compressStream2(,,ZSTD_e_end)
I cannot reproduce the reported issue.

Could it be because you tested after applying the patch?

Before applying the patch, I also got a ~5% difference when compiled with GCC.

Cygwin64 (gcc 9.3.0):

ZSTD_compressBound() - 1
output size: 286842128, time: 4530 ms

ZSTD_compressBound()
output size: 286842126, time: 4376 ms

On Windows 10 WSL2, Ubuntu 9.3.0-17ubuntu1~20.04:

ZSTD_compressBound() - 1
output size: 286842128, time: 4578125 ms (apparently microseconds)

ZSTD_compressBound()
output size: 286842126, time: 4390625 ms (apparently microseconds)

Moreover, I found the Win10 build is still slower than the WSL2 build (4.85s vs 4.58s), so maybe other places can be improved as well.

                before   after
pyzstd Win10    6.9s     4.85s
pyzstd WSL2     4.58s    ----

(best of 5 results)

@Cyan4973

Could it be because you tested after applying the patch?

I tried both with and without the patch, compiled with MSVC2017.
The direction of the speed difference remains the same, i.e. dstCapacity == ZSTD_compressBound() is faster than dstCapacity == ZSTD_compressBound() - 1.
The delta is ~4% with MEM_STATIC, reduced to ~3% with FORCE_INLINE_TEMPLATE, so roughly equivalent.


ghost commented Sep 21, 2020

Could you have a look at the assembly code of the ZSTD_compressBlock_doubleFast_generic function generated by MSVC2017? In MSVC2019, the ZSTD_hashPtr() function was not inlined:
https://github.com/animalize/zstd_asm/blob/master/v1.4.5_msvc2019/zstd_double_fast.asm#L557

@Cyan4973

With MEM_STATIC, it does call ZSTD_hashPtr.

There's no question regarding the FORCE_INLINE_ATTR impact: it does inline ZSTD_hashPtr.
My observation is solely focused on the impact of dstCapacity, and it's roughly the same whether FORCE_INLINE_ATTR is present or not.


ghost commented Sep 21, 2020

Does it execute a different code path?
What I tested was a 500 MB input file with all parameters at their defaults (unset); compress/zstd_double_fast.c : ZSTD_compressBlock_doubleFast_generic was the low-performance code being executed.

Cyan4973 added a commit that referenced this issue Sep 21, 2020
makes it possible to measure scenarios such as #2314

ghost commented Sep 22, 2020

Adding FORCE_INLINE_ATTR to the ZSTD_getLowestPrefixIndex() function speeds things up a little:
(edit: sorry, this was a mistake.)

4.83s -> 4.76s
(best of 5 results)

ZSTD_getLowestPrefixIndex() function:

/**
 * Returns the lowest allowed match index in the prefix.
 */
MEM_STATIC FORCE_INLINE_ATTR
U32 ZSTD_getLowestPrefixIndex(const ZSTD_matchState_t* ms, U32 current, unsigned windowLog)
{
    U32    const maxDistance = 1U << windowLog;
    U32    const lowestValid = ms->window.dictLimit;
    U32    const withinWindow = (current - lowestValid > maxDistance) ? current - maxDistance : lowestValid;
    U32    const isDictionary = (ms->loadedDictEnd != 0);
    U32    const matchLowest = isDictionary ? lowestValid : withinWindow;
    return matchLowest;
}

I don't know if it will slow things down for other compilers.
Maybe adding an MSVC_FORCE_INLINE_ATTR macro would reduce the uncertainty?
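
Something along these lines (a hypothetical macro, not currently in zstd):

/* Hypothetical sketch: request inlining only under MSVC, and leave other
 * compilers free to make their own inlining decisions. */
#if defined(_MSC_VER)
#  define MSVC_FORCE_INLINE_ATTR __forceinline
#else
#  define MSVC_FORCE_INLINE_ATTR
#endif

/* usage would mirror the experiment above, e.g.:
 *   MEM_STATIC MSVC_FORCE_INLINE_ATTR
 *   U32 ZSTD_getLowestPrefixIndex(const ZSTD_matchState_t* ms, U32 current, unsigned windowLog);
 */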

@Cyan4973

It looks to me like it should be possible to simplify this function.

That being said, it's only invoked once per block,
so it should not be in a position to measurably impact performance.


ghost commented Sep 25, 2020

It's faster almost every time, so it may not be just noise.

(edit: sorry, this was a mistake; adding the attribute to ZSTD_getLowestPrefixIndex() has no effect.)

There are dozens of MEM_STATIC functions; maybe FORCE_INLINE_ATTR can be added to some of them, in the hope that the performance of the MSVC build gets closer to the GCC build.

(edit: I gradually added FORCE_INLINE_ATTR to MEM_STATIC functions; the performance did not change significantly.)

Adding FORCE_INLINE_ATTR only to the ZSTD_hashPtr() function
(output buffer size is ZSTD_compressBound()):

                 before   after

test code 32bit  14.6s    11.9s
test code 64bit  6.9s     4.84s
test code WSL2   4.39s    ----

(best of ~5 results)

Six MEM_readLE16() calls were not inlined in huf_decompress.c; this is probably not expected:
https://github.com/animalize/zstd_asm/blob/master/v1.4.5_msvc2019/huf_decompress.asm#L6499

MEM_STATIC U16 MEM_readLE16(const void* memPtr)
{
    if (MEM_isLittleEndian())
        return MEM_read16(memPtr);
    else {
        const BYTE* p = (const BYTE*)memPtr;
        return (U16)(p[0] + (p[1]<<8));
    }
}

The MEM_readLE16() function was compiled to the following assembly code, and the calls to it were not inlined.
MSVC is not very smart in this case.

; File E:\dev\pyzstd\lib\common\mem.h
_TEXT	SEGMENT
memPtr$ = 8
MEM_readLE16 PROC

; 320  :     if (MEM_isLittleEndian())
; 321  :         return MEM_read16(memPtr);

	movzx	eax, WORD PTR [rcx]

; 322  :     else {
; 323  :         const BYTE* p = (const BYTE*)memPtr;
; 324  :         return (U16)(p[0] + (p[1]<<8));
; 325  :     }
; 326  : }

	ret	0
MEM_readLE16 ENDP
_TEXT	ENDS


jpapp05 commented Sep 30, 2020

Not sure if this helps with your quest to close the performance gap, but Visual Studio 2019 has a new optimization option that might be useful...

See /Ob3, which turns on aggressive inlining...
https://docs.microsoft.com/en-us/cpp/build/reference/ob-inline-function-expansion?view=vs-2019


ghost commented Sep 30, 2020

Not sure if this helps with your quest to close the performance gap, but Visual Studio 2019 has a new optimization option that might be useful...

See /Ob3, which turns on aggressive inlining...
https://docs.microsoft.com/en-us/cpp/build/reference/ob-inline-function-expansion?view=vs-2019

Thanks for this information.

It works: just turning on this option, with no code changes, gives 6.97s -> 4.77s.


Cyan4973 commented Sep 30, 2020

It looks to me like inlining ZSTD_hashPtr() produces the bulk of the benefits,
so it's the one change we should make to the source code,
and that's basically the case now, with the merging of @animalize's PR.

This /Ob3 compilation flag is a pretty good discovery,
thanks for suggesting it @jpapp05.
Using it, instead of the more classical /Ob2, could be done by updating the cmake and visual scripts.
However, I'm concerned about the impact on older versions of Visual Studio.

If they don't support /Ob3 directly, but it becomes equivalent to /Ob2 on such older versions, that's fine.
But if compilation refuses to even start, that's more concerning,
and would require a Visual Studio version selector.


ghost commented Oct 1, 2020

It looks to me that inlining ZSTD_hashPtr() produces the bulk of the benefits,
so it's the one change we should make to the source code

In some ZSTD_hashPtr() calls, the third argument can't be predicted at compile time; maybe forcing inlining will have a slightly negative effect in that case:

for ( ; ip + fastHashFillStep < iend + 2; ip += fastHashFillStep) {
    U32 const current = (U32)(ip - base);
    size_t const hash0 = ZSTD_hashPtr(ip, hBits, mls);
    hashTable[hash0] = current;
    if (dtlm == ZSTD_dtlm_fast) continue;
    /* Only load extra positions for ZSTD_dtlm_full */
    {   U32 p;
        for (p = 1; p < fastHashFillStep; ++p) {
            size_t const hash = ZSTD_hashPtr(ip + p, hBits, mls);
            if (hashTable[hash] == 0) {  /* not yet filled */
                hashTable[hash] = current + p;
    }   }   }   }

Remember that zstd v1.4.6 will be released in the near future; maybe this can be carefully improved later.

Some ZSTD_hashPtr() definitions don't use the inline keyword, please have a look:
(edit: in /contrib/linux-kernel/lib/zstd/compress.c, having no inline keyword seems OK.)


ghost commented Oct 3, 2020

Also,
regarding the issue at hand,
i.e. the difference in performance between dstCapacity = ZSTD_compressBound() and dstCapacity = ZSTD_compressBound()-1 when invoking ZSTD_compressStream2(,,ZSTD_e_end),

I cannot reproduce the reported issue.

Maybe it's due to MSVC2017; when using MSVC2017 I also get that result:

                           MSVC2017      MSVC2019
ZSTD_compressBound()       6.4 sec       6.9 sec
ZSTD_compressBound()-1     7.0 sec       5.0 sec

(zstd v1.4.5, no code changed.)


ghost commented Dec 17, 2020

When I have time, I will write a script to find uninlined functions in MSVC builds; maybe the performance can be improved a little.

ghost closed this as completed Dec 17, 2020

ghost commented Dec 27, 2020

I will write a script to find uninlined functions in MSVC

I'm working on this; I hope it can be completed in two or three weeks, as a good pastime. :)

This issue was closed.