huff0: translate asm implementation into avo program #543
Conversation
Currently register allocation fails for some reason. Fixes #529

Some notes:
The "Compiling" version: https://gist.github.com/klauspost/8f8dbbd9745662464dfac37d00cbd5f6 |
Force-pushed from 418c6e8 to 45efab9.
There are significant regressions. The asm code cached more values in GPRs.
@WojciechMula There must be something else. There is no way that reading/writing L1-cached values slows things down that much; maybe a percent or two. It will be a couple of days before I can look at this.
I managed to fix the obvious mistakes, but there are still regressions. I will investigate further; for now I'm just dumping the current state.
Hint: Disabling BMI2 brings back most of the performance. The 8-bit version also appears worse.
It seems like BMI just isn't a gain here. It seems we have enough regs to move this back out of the main loop: […] Also, the 8-bit variant is a tiny bit slower here. Does it improve things on your side?
If you want a good speedup, decode directly to the destination. Technically you don't even need to return between loops with that.
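A sketch of what that suggestion amounts to; every name below is hypothetical (the package's real decoder entry points differ), but the point is that writing into the final destination removes one memcopy per block:

```go
// decodeAll is a sketch with hypothetical names. decodeBlock stands in for
// a per-block decoder that fills the slice it is given and returns the
// number of bytes produced.
func decodeAll(dst, src []byte, decodeBlock func(out, in []byte) int) []byte {
	// Instead of decoding into a scratch buffer and copying:
	//   n := decodeBlock(scratch[:], src)
	//   dst = append(dst, scratch[:n]...)
	// decode straight into the destination's spare capacity:
	n := decodeBlock(dst[len(dst):cap(dst)], src)
	return dst[:len(dst)+n]
}
```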
I need to investigate it; there might be some strange code generated.
True, but we had to introduce these dereferences due to a register-allocation failure.
Will check it.
Are you still having problems with the register allocator?
Constraint(buildtags.Not("appengine").ToConstraint()) | ||
Constraint(buildtags.Not("noasm").ToConstraint()) | ||
Constraint(buildtags.Term("gc").ToConstraint()) | ||
Constraint(buildtags.Not("noasm").ToConstraint()) |
You can use `ConstraintExpr` and just provide a string (in the old `+build` syntax).
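A minimal sketch of that suggestion, assuming a typical avo generator layout with the build package dot-imported; the single expression collapses the constraints quoted above:

```go
//go:build ignore

package main

import . "github.com/mmcloughlin/avo/build"

func main() {
	// One expression in the old +build syntax replaces the individual
	// Constraint(...) calls: comma means AND, ! means NOT.
	ConstraintExpr("!appengine,!noasm,gc")

	// ... TEXT/RET function definitions go here ...

	Generate()
}
```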
huff0/_generate/gen.go
```go
peekBits := GP64()
buffer := GP64()
table := GP64()

Comment("Preload values")
{
	Load(Param("peekBits"), peekBits)
	Load(Param("buf"), buffer)
	Load(Param("tbl"), table)
}
```
`Load` returns the register. You can do it like this:

```go
peekBits := Load(Param("peekBits"), GP64())
```
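Applied to the whole preload block above, that collapses each allocate-then-load pair into one line (the same three loads, just restated):

```go
// Load returns its destination register, so allocation and load fuse.
peekBits := Load(Param("peekBits"), GP64())
buffer := Load(Param("buf"), GP64())
table := Load(Param("tbl"), GP64())
```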
Yes, I saw this in the examples. For me, keeping the allocation separate from the use is a bit cleaner.
First of all, thank you for such a great tool! I haven't worked on this PR recently; I'm getting back to it this week.
Let me know. As @klauspost alluded to, sometimes it's just a matter of writing the code to limit the number of live variables. However, there's a chance there are bugs/inefficiencies in avo.
Force-pushed from 4d03601 to 1b8091b.
So today's findings are quite strange. The BMI2 functions are significantly slower, despite the fact that I now manage to keep all the values in registers, as in master's version. I spent some time comparing the generated assembly with the current version and couldn't spot any differences (other than different registers being used). I'm continuing tomorrow. @mmcloughlin Would you please explain how I can convert an arbitrary pointer stored in a reg into a […]?
Sorry, can you elaborate on what you're trying to do?
Oh, sorry for not being precise. When we have a pointer to a struct as a parameter, then reading fields is obvious: […] But we read the pointer directly: […] And we want to interpret that bare pointer as a structure, to use […]
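For context, a sketch of the two situations in avo; the struct and field names are hypothetical, not taken from the PR. When the pointer is a declared parameter, avo resolves field offsets from the type; for a bare pointer already in a register there is no type information, so the usual fallback is a raw memory operand with a hand-computed displacement:

```go
// Case 1: pointer parameter; avo knows the type, so fields resolve by name.
br := Dereference(Param("br")) // hypothetical *bitReader parameter
value := Load(br.Field("value"), GP64())

// Case 2: bare pointer in a register; no type info, so the field offset
// must be supplied manually (assuming "value" lives at offset 8 here).
p := Load(Param("br"), GP64())
value2 := GP64()
MOVQ(Mem{Base: p, Disp: 8}, value2)
```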
@WojciechMula Do you mean as we did in zstd: https://github.com/klauspost/compress/blob/master/zstd/_generate/gen.go#L165-L166
@klauspost No, something else. OK, this is an actual snippet from this branch: […]
Ah, OK. I don't know how to do that.
I find the same, so we shouldn't enable them. Let's analyze with numbers from https://github.com/InstLatx64/InstLatx64:

BMI: […] X86: […]

MOVs are just register renames, so these can be expected to perform the same. Should be the same again, though BMI doesn't have to wait for CX if the MOV can't be executed ahead of time (which I expect it would be). So I don't really see this being the cause. Either way, I trust the benchmarks on this. Let's keep BMI disabled.
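For reference, the two sequences being compared look roughly like this in avo (a fragment from inside a generator function body, with illustrative register names; the operand order of SHRXQ follows Go assembler convention, count first, destination last):

```go
peekBits, bits, val := GP64(), GP64(), GP64()

// Classic path: the count must sit in CL and SHR shifts in place, so both
// a count move and a value copy are needed. The MOVs are register renames
// and close to free.
MOVQ(peekBits, RCX)
MOVQ(bits, val)
SHRQ(CL, val) // val >>= CL

// BMI2 path: SHRX reads the count from any register and writes to a
// separate destination, folding both MOVs away.
SHRXQ(peekBits, bits, val) // val = bits >> peekBits
```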
I'm comparing the old assembly with the avo-generated one, instruction by instruction. For the non-BMI version the assembly is almost identical, with slightly different encodings in a few cases due to different register choices.
@WojciechMula The "old" is not using BMI unless you set the v3 env var for Go 1.18 |
I didn't figure out the reasons for the regressions. The assembly code generated by avo […]
I see at most a 1-3% regression. Nothing to really worry about.
Instead, if you can eliminate the memcopies, that would make a much bigger difference: instead of decoding to […]. You could also look into branchless filling similar to #550; I can't remember if I already looked at this. It is not that I don't care about 2%, but I think there are bigger fish to catch.
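A sketch of the branchless-refill idea behind that suggestion, in pure Go with hypothetical names; for simplicity this reader runs forward and MSB-first, whereas the package's real bit readers consume the stream backwards:

```go
package bits

import "encoding/binary"

// bitReader is a hypothetical forward, MSB-first bit reader.
type bitReader struct {
	in       []byte
	off      int    // byte offset of the current 64-bit window
	value    uint64 // bit container; next unread bit at the MSB
	bitsRead uint   // bits consumed from value since the last fill
}

// fillBranchless refills without branching on how many bits remain.
// It assumes the input is padded so in[off:off+8] is always readable.
func (b *bitReader) fillBranchless() {
	b.off += int(b.bitsRead >> 3) // advance by whole bytes consumed
	b.bitsRead &= 7               // keep only the sub-byte remainder
	v := binary.BigEndian.Uint64(b.in[b.off:])
	b.value = v << b.bitsRead // next unread bit back at the MSB
	// 64-7 = 57 valid bits are now guaranteed, which covers the
	// "unconditionally fill 56 bits" goal mentioned further down (4x11 peeks).
}
```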
Hm, on my Ice Lake machines there are 15-20% regressions in a few cases. But if you are happy with the current shape, we may merge this PR. Then I'll eliminate the mem copying as you suggested.
Running this branch, I get these numbers on […]
Just tried removing the "peekBits" variable shift and creating a version that has 9, 10 and 11 bits of fixed peek. That was 200MB/s worse than the variable shift for the relevant benchmarks. THAT is a surprise.
It's weird. I'm starting to suspect that there's something odd in the tests.
You are welcome to check, but I am pretty sure it holds up. It just shows that you can never trust intuition or what "makes sense" to be true, and should always benchmark every small change (see the edits below). Simplifying and comparing these 3 variants: […]

The first is by far the fastest: […]

The existing code gave […]

EDIT: Actually, it seems I should have picked this up. Looking at Zen 2 timings: […]

There are more shift pipelines for variable shifts; see how the throughput is a bit higher. It may be able to do 2 fixed shifts per cycle but 3 variable shifts per cycle, indicating that different pipelines are used. It could also be that the pipelines handling fixed shifts are already busy with other work.

EDIT 2: It seems Intel (Tiger Lake here) has the opposite: […]

Fixed shifts have 2x the throughput of variable ones...
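For concreteness, the difference between the two variants boils down to something like this in avo (a fragment with illustrative registers; the PR's actual code differs):

```go
br := GP64()  // bit container
val := GP64() // receives the peeked table index

// Variable-shift variant: one function serves every table size, and
// peekBits (holding 64-n) is loaded from a parameter at runtime.
peekBits := Load(Param("peekBits"), GP64())
MOVQ(peekBits, RCX)
MOVQ(br, val)
SHRQ(CL, val) // val = br >> peekBits

// Fixed-shift variant: one generated function per table size, so the
// count can be an immediate (64-11 = 53 for the 11-bit version).
MOVQ(br, val)
SHRQ(Imm(53), val)
```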
@WojciechMula I factored out the bit-filling code and made versions for 9, 10 and 11 bits of peek: https://gist.github.com/klauspost/617e149f31f8967bc184f5a48c3834f4 This is the same speed, but mainly for future extensions. I would like to unconditionally fill 56 bits, so we have enough for 4x11, but I haven't gotten it to work yet. There should also be less register use.
@klauspost Great, and thank you for the answers. Yeah, I keep forgetting that intuition too often does not match reality. :) Let me recheck the code on Ice Lake once more; I'm giving myself 2-3 hours. If I don't find any obvious mistake, I propose we merge this code. Then I will pick up #576. It may give a significant boost, as you wrote.
This PR got replaced by #577. |