This repository has been archived by the owner on Nov 20, 2020. It is now read-only.

Snappy compression #146

Merged
merged 7 commits into from
May 25, 2017

Contributor

@scrwtp scrwtp commented May 20, 2017

Adding Snappy compression using Snappy.NET.

@scrwtp scrwtp mentioned this pull request May 20, 2017
@@ -146,7 +146,8 @@ Target "Build" (fun _ ->
 Target "RunTests" (fun _ ->
     !! testAssemblies
     |> NUnit3 (fun p ->
-        { p with TimeOut = TimeSpan.FromMinutes 20.})
+        { p with TimeOut = TimeSpan.FromMinutes 20.
+                 ProcessModel = NUnit3ProcessModel.SingleProcessModel })
Contributor Author

Not sure if you want this change - but I couldn't get the tests to run otherwise on my machine.

Contributor Author

scrwtp commented May 22, 2017

It seems this will be a bit trickier than that. Kafka uses non-standard framing (the snappy-java format, with the prefix "\x82SNAPPY\x0"), so we'd either need Snappy.NET to support that or somehow work around it on the kafunk side.

Currently Snappy.NET uses a hardcoded value for that ("sNaPpY").
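
For readers unfamiliar with the framing being discussed, here is a minimal F# sketch of a snappy-java style header. Only the 8-byte prefix comes from this thread; the two big-endian Int32 fields that follow (version and minimum compatible version, both set to 1) are an assumption about snappy-java's defaults, and the names buildHeader and hasSnappyJavaMagic are hypothetical helpers, not kafunk or Snappy.NET APIs.

```fsharp
open System

// Magic prefix quoted in the thread: "\x82SNAPPY\x00" (8 bytes).
let magic : byte[] = Array.concat [ [| 0x82uy |]; "SNAPPY"B; [| 0uy |] ]

// Writes a 32-bit integer in big-endian byte order at the given offset.
let writeInt32BE (buf: byte[]) (offset: int) (value: int) =
    buf.[offset]     <- byte (value >>> 24)
    buf.[offset + 1] <- byte (value >>> 16)
    buf.[offset + 2] <- byte (value >>> 8)
    buf.[offset + 3] <- byte value

// Builds a 16-byte header: magic, version, minimum compatible version.
// ASSUMPTION: both version fields are 1, as in snappy-java's defaults.
let buildHeader () : byte[] =
    let header = Array.zeroCreate<byte> 16
    Array.blit magic 0 header 0 magic.Length
    writeInt32BE header 8 1
    writeInt32BE header 12 1
    header

// Returns true when the data starts with the snappy-java magic prefix.
let hasSnappyJavaMagic (data: byte[]) =
    data.Length >= magic.Length
    && Array.forall2 (=) magic (Array.sub data 0 magic.Length)
```

A decoder could call hasSnappyJavaMagic first and fall back to treating the payload as raw Snappy data when the prefix is absent.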

Contributor

eulerfx commented May 23, 2017

@scrwtp do we need framing for this?

Contributor Author

scrwtp commented May 24, 2017

@eulerfx I've just updated the PR with new changes.

I've switched from SnappyStream to plain Compress/Decompress calls, and we also now put the same headers on the message as snappy-java does.

Checked compatibility with reference kafka-console-producer/kafka-console-consumer, looks good both ways.

[<assembly: AssemblyVersionAttribute("0.0.40")>]
[<assembly: AssemblyFileVersionAttribute("0.0.40")>]
[<assembly: AssemblyVersionAttribute("0.1.0")>]
[<assembly: AssemblyFileVersionAttribute("0.1.0")>]
Contributor

I think your fork might have been an older version?

Contributor Author

I think it just wasn't bumped, the fork was created this weekend?

let compressed = SnappyCodec.Compress(inputBytes)
let output = CompressedMessage.pack compressed
createMessage (Binary.ofArray output) CompressionCodec.Snappy
with :? IOException as _ex ->
Contributor

I'd say let's just drop this try/with here and above, since we're not adding any context.

Contributor Author

I'm fine dropping it. Don't you want logging there though?

Contributor

The exception would be caught at a higher level in the code anyway. Adding some context here might be useful, but not necessary IMO.

Contributor Author

Done.

MessageSet.Write (messageVer,ms,BinaryZipper(buf))
try
let compressed = SnappyCodec.Compress(inputBytes)
let output = CompressedMessage.pack compressed
Contributor

I think it should be possible to avoid copying the array. Looking at SnappyCodec.Compress, it uses GetMaxCompressedLength to estimate the size of the array. What if you use that to allocate the array with the header bytes added to it, then use the other overload of Compress to write into that array?
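
The suggestion above can be sketched roughly as follows. To keep the sketch runnable without Snappy.NET, maxCompressedLength and compressInto are hypothetical stand-ins (an identity "codec") for GetMaxCompressedLength and for a Compress overload that writes into a caller-supplied array; the exact signatures of the real overloads are an assumption here. The layout (header, then a 4-byte big-endian length, then the payload) is likewise an assumption based on the discussion in this thread.

```fsharp
open System

// HYPOTHETICAL stand-in for SnappyCodec.GetMaxCompressedLength.
let maxCompressedLength (inputLength: int) = inputLength + 32

// HYPOTHETICAL stand-in for a Compress overload that writes into `output`
// at `outOffset` and returns the number of bytes written. It only copies.
let compressInto (input: byte[]) (offset: int) (length: int) (output: byte[]) (outOffset: int) : int =
    Array.blit input offset output outOffset length
    length

// Allocates one buffer sized for header + length prefix + worst-case payload,
// compresses straight into it, and returns a segment trimmed to the bytes used.
let compressWithHeader (header: byte[]) (input: ArraySegment<byte>) : ArraySegment<byte> =
    let lengthPrefixSize = 4
    let payloadOffset = header.Length + lengthPrefixSize
    let buf = Array.zeroCreate<byte> (payloadOffset + maxCompressedLength input.Count)
    Array.blit header 0 buf 0 header.Length
    let written = compressInto input.Array input.Offset input.Count buf payloadOffset
    // Back-fill the big-endian length now that the compressed size is known.
    buf.[header.Length]     <- byte (written >>> 24)
    buf.[header.Length + 1] <- byte (written >>> 16)
    buf.[header.Length + 2] <- byte (written >>> 8)
    buf.[header.Length + 3] <- byte written
    ArraySegment<byte>(buf, 0, payloadOffset + written)
```

The point of the design is that the compressed bytes are never copied into a second array; only the segment bounds are adjusted.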

Contributor Author

I'll take a stab at it.

Contributor Author

Done.


// write content
// NOTE: this will also write content length in the first 4 bytes, this is expected.
bz.WriteBytes(Binary.ofArray bytes)
Contributor

May be worth using an internally copied version of this. BinaryZipper was designed specifically for the Kafka protocol and it's possible it will cause an issue down the road if it's refactored.

Contributor Author

@scrwtp scrwtp May 24, 2017

I can do a WriteInt32 followed by WriteBlock, but I'd think one of the tests added would fail if WriteBytes changed. (Compression.Snappy reads messages that are compatible with reference implementation)

Contributor

I meant to create a function called writeBytes local to the snappy compressor, that would simply do the same thing as WriteBytes. The reason is that WriteBytes might change for another reason at some point, driven by a need from the Kafka protocol, and it can cause an issue here down the road.
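
A local helper of the kind described might look like the sketch below. It assumes the behaviour mentioned earlier in the thread (a 4-byte length followed by the content) and a big-endian length; the function names and the use of ArraySegment<byte> as the buffer type are illustrative guesses, not the code that was merged.

```fsharp
open System

// Writes a big-endian Int32 at the start of `buf` and returns the remaining segment.
let writeInt32 (value: int) (buf: ArraySegment<byte>) : ArraySegment<byte> =
    let a, o = buf.Array, buf.Offset
    a.[o]     <- byte (value >>> 24)
    a.[o + 1] <- byte (value >>> 16)
    a.[o + 2] <- byte (value >>> 8)
    a.[o + 3] <- byte value
    ArraySegment<byte>(a, o + 4, buf.Count - 4)

// Length-prefixed write local to the compressor: 4-byte length, then the raw bytes.
// Returns the segment that follows the written data.
let writeBytes (bytes: ArraySegment<byte>) (buf: ArraySegment<byte>) : ArraySegment<byte> =
    let rest = writeInt32 bytes.Count buf
    Buffer.BlockCopy(bytes.Array, bytes.Offset, rest.Array, rest.Offset, bytes.Count)
    ArraySegment<byte>(rest.Array, rest.Offset + bytes.Count, rest.Count - bytes.Count)
```

Because the helper lives next to the compressor, a later change to the protocol-level WriteBytes would not alter the framing written here.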

Contributor Author

Done, no more dependency on BinaryZipper.

let output = CompressedMessage.compress inputBytes

createMessage (Binary.ofArray output) CompressionCodec.Snappy

let decompress (messageVer:ApiVersion) (m:Message) =
let inputBytes = m.value |> Binary.toArray
Contributor

Can this copy be avoided by passing m.value directly into the binary zipper?

Contributor Author

Done - the same fix can be applied in the gzip version.

let arr = ArraySegment<byte>(buf.Array, buf.Offset, length)
arr, Binary.shiftOffset length buf

let truncateIfSmaller actualLength maxLength (array: byte []) =
Contributor

Rather than copying an array, you can truncate by adjusting the offset on an ArraySegment (Binary module has some helper for this).

Contributor Author

I followed what Snappy.NET did, but I guess it can be avoided. Made both compress and decompress here be Binary.Segment -> Binary.Segment.

Contributor

Yeah looks good. Also here:

let buf = Array.zeroCreate uncompressedLength
let actualLength = SnappyCodec.Uncompress(content.Array, content.Offset, content.Count, buf, 0)
Binary.truncateIfSmaller actualLength uncompressedLength buf

The call to Binary.truncateIfSmaller should operate on ArraySegment<byte> rather than array directly. This way, when it actually has to truncate, it doesn't call Binary.toArray which creates a new array, but simply adjusts the offsets on the ArraySegment.
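
As an illustration of truncating without a copy, here is a small sketch. The name and parameter order mirror the snippet quoted above, but the body is an assumption about one way to do it, not the kafunk implementation.

```fsharp
open System

// Returns a view over the first `actualLength` bytes of `array` (capped at
// `maxLength`). No bytes are copied; only the segment's count changes.
let truncateIfSmaller (actualLength: int) (maxLength: int) (array: byte[]) : ArraySegment<byte> =
    ArraySegment<byte>(array, 0, min actualLength maxLength)
```

The returned segment shares the backing array with the input, so callers must not reuse that array while the segment is still being read.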

Contributor Author

It doesn't call Binary.toArray anymore - it does take an array, but it will return a Segment of appropriate length.

Contributor

Ah, sorry, I misread ofArray.

Contributor

eulerfx commented May 25, 2017

Looks good. I've tried this out and it appears to be working, and performs well. I'm going to merge, and the only thing I'll do is I'll "unroll" the Write/Read methods on the SnappyBinaryZipper (much like BinaryZipper is unrolled) to avoid FSharpFunc allocations and virtual calls - this is a critical enough area where we should get low-level.

@eulerfx eulerfx merged commit 308f3a6 into jet:master May 25, 2017
Contributor Author

scrwtp commented May 25, 2017

I like the way they read, how about making them inline?

Contributor

eulerfx commented May 25, 2017

In critical paths you often forgo readability/reusability for the sake of performance. Making them inline won't help in this case - the calls to the underlying operations on the Binary module, however, are already inlined if you look at the disassembly.

Btw, if you look at the IL for the former implementation of WriteInt32, it looks like this:

IL_0001: ldarg.1      // x
IL_0002: newobj       instance void Kafunk.CompressionModule/SnappyModule/WriteInt32@95::.ctor(int32)
IL_0007: stloc.0      // V_0
IL_0008: ldarg.0      // this
IL_0009: ldloc.0      // V_0
IL_000a: ldarg.0      // this
IL_000b: ldfld        valuetype [mscorlib]System.ArraySegment`1<unsigned int8> Kafunk.CompressionModule/SnappyModule/SnappyBinaryZipper::buffer
IL_0010: callvirt     instance !1/*valuetype [mscorlib]System.ArraySegment`1<unsigned int8>*/ class [FSharp.Core]Microsoft.FSharp.Core.FSharpFunc`2<valuetype [mscorlib]System.ArraySegment`1<unsigned int8>, valuetype [mscorlib]System.ArraySegment`1<unsigned int8>>::Invoke(!0/*valuetype [mscorlib]System.ArraySegment`1<unsigned int8>*/)
IL_0015: stfld        valuetype [mscorlib]System.ArraySegment`1<unsigned int8> Kafunk.CompressionModule/SnappyModule/SnappyBinaryZipper::buffer
IL_001a: ret       

The newobj opcode creates an instance of an FSharpFunc which represents the actual writing of the Int32 - this contains the core of the logic. Its constructor initializes the object's fields with the captured value. After that, callvirt invokes the object. Unrolling this into direct calls avoids the object allocation, the movement of values to/from the heap, and the virtual call. The only instantiations that remain are stack allocations of ArraySegment.
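
To make the difference concrete, here is a hedged sketch of the two shapes being compared. ClosureWriter and UnrolledWriter are hypothetical types written for this illustration; they are a guess at the general pattern, not the SnappyBinaryZipper source. Whether the closure version actually allocates depends on compiler optimizations, so treat the comments as describing the likely outcome rather than a guarantee.

```fsharp
open System

// Helper in the style of a Binary module: writes a big-endian Int32 and
// returns the advanced segment.
let writeInt32 (x: int) (buf: ArraySegment<byte>) : ArraySegment<byte> =
    let a, o = buf.Array, buf.Offset
    a.[o]     <- byte (x >>> 24)
    a.[o + 1] <- byte (x >>> 16)
    a.[o + 2] <- byte (x >>> 8)
    a.[o + 3] <- byte x
    ArraySegment<byte>(a, o + 4, buf.Count - 4)

// Closure style: the partial application `writeInt32 x` is a function value,
// which may be compiled to an FSharpFunc object that is then invoked.
type ClosureWriter(initial: ArraySegment<byte>) =
    let mutable buffer = initial
    member this.Apply (f: ArraySegment<byte> -> ArraySegment<byte>) = buffer <- f buffer
    member this.WriteInt32 (x: int) = this.Apply (writeInt32 x)
    member this.Buffer = buffer

// Unrolled style: the write is spelled out directly, so no function value is needed.
type UnrolledWriter(initial: ArraySegment<byte>) =
    let mutable buffer = initial
    member this.WriteInt32 (x: int) =
        let a, o = buffer.Array, buffer.Offset
        a.[o]     <- byte (x >>> 24)
        a.[o + 1] <- byte (x >>> 16)
        a.[o + 2] <- byte (x >>> 8)
        a.[o + 3] <- byte x
        buffer <- ArraySegment<byte>(a, o + 4, buffer.Count - 4)
    member this.Buffer = buffer
```

Both writers produce the same bytes; the unrolled one trades some repetition for a more predictable hot path.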
