Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API Edge Cases Undocumented #2

Closed
TheBlueMatt opened this issue Nov 16, 2016 · 49 comments
Closed

API Edge Cases Undocumented #2

TheBlueMatt opened this issue Nov 16, 2016 · 49 comments

Comments

@TheBlueMatt
Copy link

Some fun things callers need to be aware of:

  • You may not call wirehair_read twice with the same packet ID, this causes decode failure.
  • Memory lifetimes generally arent documented - during encode the memory you pass into wirehair_encode must be available until the last time you call wirehair_write (and maybe later? probably not), whereas the same is not true for writing.

In other news, for my application I hardcoded the chunk size, assumed 16/32-byte alignment (depending on where its used) for XOR, replaced the different-compile-unit XOR code with simple avx-/sse-based intrinsics and saw something like a 3x speedup :).

@catid
Copy link
Owner

catid commented Nov 16, 2016

Nice speedup! I have a newer version of the codec that needs some more
work for tuning over in wh256 repo that should be similarly fast? Maybe I
can learn something from your memxor() code

On Nov 15, 2016 5:42 PM, "Matt Corallo" [email protected] wrote:

Some fun things callers need to be aware of:

  • You may not call wirehair_read twice with the same packet ID, this
    causes decode failure.
  • Memory lifetimes generally arent documented - during encode the
    memory you pass into wirehair_encode must be available until the last time
    you call wirehair_write (and maybe later? probably not), whereas the same
    is not true for writing.

In other news, for my application I hardcoded the chunk size, assumed
16/32-byte alignment (depending on where its used) for XOR, replaced the
different-compile-unit XOR code with simple avx-/sse-based intrinsics and
saw something like a 3x speedup :).


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#2, or mute the thread
https://github.com/notifications/unsubscribe-auth/AAPZIXXjw8bMiuZd4uVyKcLVw0lK0ducks5q-l9sgaJpZM4KzPvJ
.

@TheBlueMatt
Copy link
Author

Oh, also, if you set ALL_ORIGINAL and receive all blocks from non-FEC chunks, even though wirehair has copied each chunk into its memory it will only give you back 0s when you ask for chunk reconstructions.

@TheBlueMatt
Copy link
Author

TheBlueMatt commented Nov 16, 2016

The optimizations in question are at bitcoinfibre/bitcoinfibre@eafd1ca though note that I really only benchmarked in my benchmark tool, which works on relatively large data (~1MB @ 1024-byte chunks). Sadly, I think the initial setup/EncodeFeed adds too much latency for my application (since I'm bound by, essentially, setup/encode/first packet->setup/decode with low loss latency). Any pointers on reducing the initial setup latency, even if it comes at the cost of a handful of additional packets?

@catid
Copy link
Owner

catid commented Nov 16, 2016

One thing you can do is have the setup run in a background thread and when it's complete start generating output symbols. Since the first ~1000 symbols are going to be the originals you have some time to allow the CPU to catch up.

cm256 is faster for a small number of symbols (about 32). But wirehair with speedups is much faster after that point. More details here: https://github.com/catid/wh256/blob/master/docs/wh256_performance.pdf

@TheBlueMatt
Copy link
Author

Sadly for my application the receiving end has the vast majority of source symbols before it even receives the first packet (just needs a few packets of metadata to figure out how to place the data in FEC chunks). The FEC is just to communicate some random missing data, essentially, and since it realistically normally only ever needs a few packets to fill in the gaps the latency-to-first-FEC-packet is critical.

@catid
Copy link
Owner

catid commented Nov 17, 2016

Maybe something like this is more what you need?

https://github.com/catid/shorthair

That project could be improved a lot by: (1) Using cm256 instead of longhair and (2) by using an approach like Tetrys instead https://www.ietf.org/proceedings/86/slides/slides-86-nwcrg-1.pdf

(2) requires a bunch more work that I've started but haven't gotten into a working state yet

@TheBlueMatt
Copy link
Author

TheBlueMatt commented Nov 17, 2016

I looked at shorthair, though it looks like a full network implementation and everything above...I've already got my own networking layer with lots of random crap that I need for my specific use-case.

I'm not sure that cm256 and sliding window are really ideal...let me describe a bit more:

I'm sending Bitcoin blocks around a global network of servers at incredibly low-latency (10-20 milliseconds slower than the RTT/2 between the servers, usually, including through the GFW), so: every 10 minutes I get a new block (1-4MB) which contains almost entirely transactions that both sides have already seen, and I need to get it to the other side fast. I send a small set of data to encode which transactions go where in the ultimate block, send that (1-2 ms, including FEC, though that should be easily optimized by using cm256, etc), and then send FEC data for the whole block. Currently this works OK with some pretty shitty LDPC FEC code, though the code isnt well-optimized and is slightly frightening in its memory usage. Was hoping to switch to wirehair but it ends up wasting too much time in the initial encode by about 2x (I havent benchmarked, but I assume it has to do with calculating the GF(2^8) RS code data) to be a strict gain.

@catid
Copy link
Owner

catid commented Nov 17, 2016

How big is the data to send? Sounds like less than 1 MB diffs that would
encode faster than full data

On Nov 16, 2016 4:56 PM, "Matt Corallo" [email protected] wrote:

I looked at shorthair, though it looks like a full network implementation
and everything above...I've already got my own networking layer with lots
of random crap that I need for my specific use-case.

I'm not sure that cm256 and sliding window are really ideal...let me
describe a bit more:

I'm sending Bitcoin blocks around a global network of servers at
incredibly low-latency (10-20 milliseconds slower than the RTT/2 between
the servers, usually), so: every 10 minutes I get a new block (1-4MB) which
contains almost entirely transactions that both sides have already seen,
and I need to get it to the other side fast. I send a small set of data to
encode which transactions go where in the ultimate block, send that (1-2
ms, including FEC, though that should be easily optimized by using cm256,
etc), and then send FEC data for the whole block. Currently this works OK
with some pretty shitty LDPC FEC code, though the code isnt well-optimized
and is slightly frightening in its memory usage. Was hoping to switch to
wirehair but it ends up wasting too much time in the initial encode by
about 2x (I havent benchmarked, but I assume it has to do with calculating
the GF(2^8) RS code data) to be a strict gain.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#2 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAPZIVUSmoqWilVSZGhFujVWlaKNL0jJks5q-6Y1gaJpZM4KzPvJ
.

@TheBlueMatt
Copy link
Author

It ranges widely...There's some data at http://bitcoinfibre.org/stats.html but, essentially, 90-something% of the time you get the full block reconstructed with ~40k of data, but a non-trivial amount of time it can take 1MB+ of data before the current LDPC stuff completes a reconstruction.

@TheBlueMatt
Copy link
Author

Oh, and I really would prefer to not introduce of knowledge-of-receiver-side-data assumption, because speed-of-light makes that a bitch :(.

@catid
Copy link
Owner

catid commented Nov 17, 2016

You could give wh256 a shot. It switches to cm256 when it is faster to do
that, for small data. Tuning is not done yet but I can get that done this
week. Also could build you a codec that just does xors and no GF256 stuff
that would require more overhead but encode faster..

On Nov 16, 2016 5:27 PM, "Matt Corallo" [email protected] wrote:

Oh, and I really would prefer to not introduce of
knowledge-of-receiver-side-data assumption, because speed-of-light makes
that a bitch :(.


You are receiving this because you commented.

Reply to this email directly, view it on GitHub
#2 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAPZIbdFChRDqcCDflH8MYxErd6Pf-Seks5q-61ygaJpZM4KzPvJ
.

@catid
Copy link
Owner

catid commented Nov 17, 2016

You could give wh256 a shot. It switches to cm256 when it is faster to do
that, for small data. Tuning is not done yet but I can get that done this
week. Also could build you a codec that just does xors and no GF256 stuff
that would require more overhead but encode faster..

On Nov 16, 2016 5:27 PM, "Matt Corallo" [email protected] wrote:

Oh, and I really would prefer to not introduce of
knowledge-of-receiver-side-data assumption, because speed-of-light makes
that a bitch :(.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#2 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAPZIbdFChRDqcCDflH8MYxErd6Pf-Seks5q-61ygaJpZM4KzPvJ
.

@TheBlueMatt
Copy link
Author

Yea, I think I was hoping a no-GF256 codec would be a simple constant-change that I could hack up...happy to get you a beer if its more involved :).

@gmaxwell
Copy link

Hi, -- In addition to patching over packet loss, this system ab(uses) the forward error correction to create a diff where you have no idea what the remote side knows, which we don't because it's a distributed system with data coming in every which direction and multiple sources... and 100ms+ one way delays.

Two potential source nodes might even receive the same data at once, and both can usefully contribute to all the receivers recovering as fast as possible (as they can send different ranges of recovery data).

The currently live system uses a LDPC erasure code which decodes exclusively via pealing (and has a fairly low quality implementation that Matt has somewhat fixed up). It's fast-- but very high overhead (e.g. 20% more packets than a RS could would have needed). I encountered whirehair a while back and suggested Matt give it a shot-- the performance is really impressive overall. I think what is frustrating him now is the time to the first recovery packet (as mentioned, he doesn't send the source packets because they're reconstructed on the far end).

As an aside, if you want some number crunching done for coming up with except-seeds for wh256 I have a good hundred cores sitting idle for the moment and wouldn't mind running something for you. I was suggesting WH256 to Matt last night, with the hope that the faster GF256 parts might help some of his issues. (Prior to that I tried turning the heavy row count down but got lost trying to figure out why it crashing.)

@catid
Copy link
Owner

catid commented Nov 17, 2016

Two other random thoughts: (1) MTU size increase would improve latency in
the netwotk, if you can push it above 1024 bytes. (2) The decoder
recreates the recovery set during decoding, so it would not be necessary to
reinitialize the encoder afterwards if the codec could transition into an
encoder mode with the samr source data.

On Nov 16, 2016 5:42 PM, "Matt Corallo" [email protected] wrote:

Yea, I think I was hoping a no-GF256 codec would be a simple
constant-change that I could hack up...happy to get you a beer if its more
involved :).


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#2 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAPZIf11or7hbtIfR7Ps_wR-oVxkJVm6ks5q-7EcgaJpZM4KzPvJ
.

@TheBlueMatt
Copy link
Author

"The decoder recreates the recovery set during decoding" huh? I'm assuming you're referring to the GF2^8 RS stuff? Is that strictly neccessary? Seems like it should be able to get away with only doing that when needed?

@catid
Copy link
Owner

catid commented Nov 18, 2016

Recovery set is a set of symbols slightly larger than the original data
that are xored together after encoder is initialized. The decoder recovers
these in the case of lost data and then recovers any missing parts.

If a decoder has recovered lost data then it can be reused for encoding
because the recovery set does not need to be recalculated.

On Nov 17, 2016 10:16 PM, "Matt Corallo" [email protected] wrote:

"The decoder recreates the recovery set during decoding" huh? I'm assuming
you're referring to the GF2^8 RS stuff? Is that strictly neccessary? Seems
like it should be able to get away with only doing that when needed?


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#2 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAPZIRuA0ihGk-Jpjls48Uuo5iQgKZkIks5q_UKxgaJpZM4KzPvJ
.

@gmaxwell
Copy link

That is useful for the application here (once a participant receives enough to recover, they'll encode and start generating new symbols to help nodes beyond simply forwarding along already received symbols), but unfortunately it's not on the critical path (which it sounds like Matt is saying is time to first encoded symbols on nodes that received the data from outside.)

Pardon my ignorance, but is it the case that some non-trivial number of IDs would happen to not reference anything but original packets? (I'm familiar with the basic relevant FEC-tech, but ignorant of your construction except in a vague hand-waving sense :) ) if so, extracting and sending a wad of those first might resolve Matt's concern though I suppose it would seriously degrade the overhead.

@catid
Copy link
Owner

catid commented Nov 21, 2016

So I see two ways to make the encoder super-fast without writing a whole other codec:

(1) Add a feature to work in non-systematic mode, where it multiplies by the matrix to generate the recovery set and then you'd send across the network only linear xor combinations of those. Instead of sending the original data you'd send the output of the encoder.

(2) Add a larger feature to work in a mixed mode, where it again multiplies the matrix to generate the recovery set, but then it allows you to send the original data followed by some random combinations of the original data as in the systematic code. It then requires a more complicated decoder to solve the recovery set in two stages.

I've not fully thought through (2) so maybe the decoder would be much slower for 50-50 original data and recovery data? (1) is a safer way forward.

I tried out wh256 for your data sizes and it takes about 1.3-1.6 milliseconds to initialize the encoder. Is that really significant?

@TheBlueMatt
Copy link
Author

So I hacked up some code to cache the matrix generation, which seems to be a nontrivial improvement in synthetic and real-world benchmarks, though it doesnt show up in callgrind almost at all (and I havent hooked a real profier up to it yet) (see code at bitcoinfibre/bitcoinfibre@d32a9a2, though I'm pretty sure the copy stuff is busted since I didnt bother to audit anything).

I also swapped out for wh256 and it seemed to improve a bit more (getting within spitting distance of the naiive LDPC stuff I had previously - so probably worth swapping out at this point). Do you have the code to generate seeds handy and I could spin it up one some beefy servers?

@catid
Copy link
Owner

catid commented May 10, 2017

Would you guys mind also evaluating https://github.com/catid/siamese ?

@gmaxwell
Copy link

Neat! For Fibre the targeting for few recovery symbols will likely get in the way, but we have some related things where the ability to have a rolling window of changing data would be very useful. (... once I figure out how that actually works with Siamese :) ).

@catid
Copy link
Owner

catid commented May 11, 2017

Oh I thought the files were like 1 MB, so 1000 pieces, and 256 limit on output would be 25% - should be higher than loss right?

@catid
Copy link
Owner

catid commented May 11, 2017

Actually for that use case it would be nicer if Siamese had an API that supported fixed-sized input blocks that are maintained outside of the library to avoid some copies. It's probably still within 20% of current performance though

@TheBlueMatt
Copy link
Author

Well FIBRE has a few strange properties - because we (usually) have almost the entire 1MB of data at the other end before we start sending, we really want to get the data in just a handful of packets. However, because it occasionally has much lower hitrate, we also need to be able to get the data at the other end after having just sent a bunch of recovery symbols. I suppose I could do some hack where it first sends 256 recovery symbols and then starts sending data packets, but then you'd waste a bunch of packets on data that is useless to the other side :/.

@catid
Copy link
Owner

catid commented May 11, 2017

If I increased the number of recovery packets to some larger number that would work for you? It's somewhat arbitrary

@gmaxwell
Copy link

For FIBRE we really want recovery packets to be at least equal to the input packet count, twice the input size is even better.

Here is a brief and somewhat simplified outline of what we do w/ FIBRE today:

There is a small collection of nodes spread around the world.

Our task will be to get a block from a random source to all the targets with at little latency as possible. The nodes have gigabit+ connectivity, and the distance between them is such that a single round trip anywhere will dwarf the transmission time.

Each have a many megabyte queue of transactions that will potentially end up in blocks, learned via a gossip protocol and prioritized based on their odds of showing up soon. These queues are roughly but not precisely consistent among the participating nodes.

At one of these nodes, a block shows up (poisson process with expectation of ~10 minutes)
A block shows up and most of the time 99.9% of the transactions in it will be already known. The block is about 1MB in size and has about 3000 txn in it.

The receiving node makes a sketch of the block-- a collection of short hashes and sizes for each of the transactions.. this is on the order of 15 kbytes in size. It's streamed out to peers over UDP with a bit of FEC.

The targets get this data use the hashes and sizes to mock up the block in memory. There is a size based permutation applied to the transaction order that also adds a bit of padding so that when the block gets packetized missing transactions will cause fewer packet erasures.

The source performs the same permutation+packetization process and runs a FEC encoder on it (currently wirehair, originally a more boring LDPC code). It then starts streaming out recovery packets, interleaving between peers (you can think of it as sending half east around the world, half west around the world).

There is no congestion control, rather each node has a preconfigured transmission rate that is adjusted to send as fast as possible without causing queuing delays/losses.

The targets feed their locally generated (from their queues + sketches) into the FEC decoder.

When a peer receives a recovery packet it may pass it on to another peer. (There is a precomputed forwarding topology that minimizes latencies and avoids peers getting duplicates of the same data-- it's not sufficient to simply have the source send directly, getting the latency minimizing path can require passing through other nodes).

Once a target has enough data, it reconstructs the whole block. It sends a packet to each of its peers telling them to not bother transmitting any more about the block to it. It will then FEC encode it and begin sending out novel chunks which it never received to whatever peers haven't told it that they're done. (this last point isn't that important except when there are outages, though it's easy to do with LDPC-like codes)

Sometimes a block will show up which isn't very consistent with nodes queues and may have a much smaller hitrate. (e.g. a few percent) or nodes might have been recently restarted and not have a useful queue yet. In those cases, their reconstruction will take more data, but we still must avoid a round-trip.

Also, a block might show up at multiple sources about the same time-- so it is useful if both can operate independently sending non-overlapping packet sets, and both contribute usefully to the reconstruction... rather than having a bunch of duplicates show up due to sources which can't know about each other's transmissions (outside each other's light cone).

As you can see in this model the source never sends any of the input packets, because any of them potentially duplicate what the target already knows.

Often recovery only needs a couple packets past the sketch... maybe two dozen (I asked Matt to get a historgram of that). But sometimes the hitrate is low and it needs an amount equal to the input. Because the source sends packets in multiple directions around the world, it's also good if the recovery count is 2x whats needed, so non-duplicates can be sent in different directions.

The source can more or less detect in advance of sending if the hitrate is going to be very low: because the block will also not coincide well with the source's queue. We could potentially switch coding schemes based on that. We just can't wait for replies from the targets e.g. to tell us what they were missing so that we can just send original data-- because the whole process should be done before that reply would be received.

I think it would be fine for us if the encoder could produce more recovery packets but decode was somewhat slower/less efficient if many were needed... otherwise we could potentially switch from WH to Siamese based on the expected queue hitrate.. and if siamese is use, after sending all the recovery packets start streaming out the input until we're told to stop. If that would be attractive would depend on the performance. Right now, WH encode+decode performance is a major factor in the controllable part of our total latency. (which 95% of the time is less than 16ms over the one-packet latency).

@catid
Copy link
Owner

catid commented May 12, 2017

I appreciate the extra detail; let me think about other methods that may help.

Okay I'll put together a specialized block codec that can produce more output symbols (with higher perf) so you can evaluate that as an alternative.

@TheBlueMatt
Copy link
Author

TheBlueMatt commented May 12, 2017

Ideally, of course, the decode would be iterative on packets to avoid taking too much CPU time if we do need a lot of packets - even at 1Gbps packets aren't coming in that fast, so you've got a bit of time to process things, and I've already got infrastructure to fill the block with transactions on a background thread (which used to be an expensive process with lots of mallocs, but is now largely free) which could easily be converted to a "do a step of something useful in FEC decoder" thread.

@catid
Copy link
Owner

catid commented May 15, 2017

Please evaluate this library for your purposes: https://github.com/catid/fecal

@TheBlueMatt
Copy link
Author

Hmm, I hacked in a naive version into my fec wrapper and it turned out quite a bit slower (2x on my few-missing-chunks test, which is around 14 missing chunks out of 880).

You can see the fec wrapper changes at bitcoinfibre/bitcoinfibre@6f69893#diff-208ddd15cda5e81d824f18d7daffdde1 it has some extra memcpys that add a bit of overhead, but at least callgrind's cycle estimation thinks the fecal_decode is where nearly all the time ended up.

I'm happy to provide callgrind outputs for your analysis (or you can run the bench_bitcoin binary in that repo using ./autogen.sh && CXXFLAGS="-march=native -mtune=native -O2 -g" ./configure --disable-wallet && make && ./src/bench/bench_bitcoin)

@catid
Copy link
Owner

catid commented May 16, 2017

Disappointment =(

@catid
Copy link
Owner

catid commented May 16, 2017

Well, I'll get back to you guys next time I have some new software to share. Feel free to reach out to me at [email protected] for more support.

@catid catid closed this as completed May 16, 2017
@catid
Copy link
Owner

catid commented May 30, 2017

If you're interested in evaluating another codec, I just put up this one:

https://github.com/catid/leopard

It's limited in that the recovery data cannot be streamed out like a fountain code. Also the recovery set cannot be larger than the original data size. Hope it's useful.

@catid catid reopened this May 30, 2017
@TheBlueMatt
Copy link
Author

TheBlueMatt commented Jun 2, 2017

Damn, super close, but didnt beat wh256 :(. Indeed, the lack of streaming output/incremental decode really hurts this one for FIBRE. In my synthetic benchmark on my Haswell-EP workstation, wh256 has ~1.5ms of setup time, then streams out enough chunks to get a decode super fast, with decode run taking around 1.5-1.7ms ignoring encode time. leopard is closer to 1.5 on the encode, 4 on the decode. It got a bit faster after I hardcoded the block size like I do in wh256, but only by a ms or 2 in total, which wasn't quite enough to get over the hump.

(so, in conclusion, one thing to consider for future fecs is an option to hardcoded fec chunk block size - in FIBRE I use fixed-size UDP packet which makes tons of things lots simpler, and hardcoding the block size all the way down the the XOR functions can be a decent gain).

@catid
Copy link
Owner

catid commented Jun 2, 2017 via email

@TheBlueMatt
Copy link
Author

Oh wow, alright, that might just do it to beat wh256. I can try to PR a decent version of the optionally-hardcode-fec-size, if you'd like.

@catid
Copy link
Owner

catid commented Jun 2, 2017 via email

@catid
Copy link
Owner

catid commented Jun 4, 2017 via email

@TheBlueMatt
Copy link
Author

Nice: for a single test block (~800 1152-byte chunks) it went from ~1.3 on the encode to ~1.05. On the decode side, ~4.5-> ~3.6. Still, compared to wh256's ~1.2 to start encoding, ~1.56 to do the full encode and ~1.1 to do a decode, wh256 still wins by a healthy margin. I'd be curious to see the 12-bit field.

(fec-al numbers, since I went and did things a bit more formally so went back and benchmarked that: ~0.2ms to start streaming, ~0.5 to do a full encode, ~1ms to decode in my low-loss bench, which is realistic for some reasonably-high-percent of cases, but the ~6ms to decode in my medium-loss bench makes it hard to swallow the lack of consistency :( ).

@catid
Copy link
Owner

catid commented Jun 5, 2017

If you allow for multithreading it can decode quite a lot faster. I am still working on gaining another 15% or so performance by moving copies/xors into the inner loops. Then MT after that's done, which could cut the decode time in half or more...

12-bit fields are going to be a little faster in that there would be less register pressure in the inner loops.
FFT_m and IFFT_m and FormalDerivative_m steps would not be working on smaller data sizes though. Also instead of a simple "multiple of 64 bytes" requirement it would be some weird number like 96.. I'm more interested in doing an XOR-only version for platforms that do not have PSHUFB instructions like quadcopter processors.

@catid
Copy link
Owner

catid commented Jun 6, 2017

So Sunday and tonight I managed to implement a worker thread pool and got somewhat tuned work spread across the threads. In practice it's not quite a 2x performance boost to add multi-threading:

Parameters: [original count=1000] [recovery count=1000] [buffer bytes=1344] [loss count=1000] [random seed=2] (multi-threading OFF)
Leopard Encoder(1.344 MB in 1000 pieces, 1000 losses): Input=1745.45 MB/s, Output=1745.45 MB/s
Leopard Decoder(1.344 MB in 1000 pieces, 1000 losses): Input=491.947 MB/s, Output=491.947 MB/s

Parameters: [original count=1000] [recovery count=1000] [buffer bytes=1344] [loss count=1000] [random seed=2] (multi-threading ON)
Leopard Encoder(1.344 MB in 1000 pieces, 1000 losses): Input=1669.57 MB/s, Output=1669.57 MB/s
Leopard Decoder(1.344 MB in 1000 pieces, 1000 losses): Input=691.358 MB/s, Output=691.358 MB/s

I'm not 100% sure that I'm spreading the work across the workers the most optimal way, but it looks like about a 40% performance improvement for the case you care about. I don't think this is enough to make you prefer Leopard over Wirehair. Also my thread pool only works on Windows.

@Bulat-Ziganshin
Copy link

In practice it's not quite a 2x performance boost to add multi-threading:

your cpu is 4 GHz in s/t mode and 2.8 GHz in m/t mode

@catid
Copy link
Owner

catid commented Jun 6, 2017

Switched from my own threadpool to OpenMP and tuned more:

Leopard Encoder(1.344 MB in 1000 pieces, 1000 losses): Input=3162.35 MB/s, Output=3162.35 MB/s
Leopard Decoder(1.344 MB in 1000 pieces, 1000 losses): Input=714.514 MB/s, Output=714.514 MB/s

Maybe this will be good enough?

@catid
Copy link
Owner

catid commented Apr 18, 2018

New version of the codec is up

@catid catid closed this as completed Apr 18, 2018
@TheBlueMatt
Copy link
Author

Hey! Sorry, somehow missed the first message and just got around to testing the new wirehair updates. Even with the OpenMP threadpool the leopard updates weren't a winner. They still did better than wirehair/wh256 on most/all-chunks-missing benchmarks, but on tests where most chunks were locally available before sending starts, they failed to quite reach the wh256/wirehair speed (but...)

The wirehair updates did, indeed, provide a pretty healthy boost, though I went ahead and implemented the behavior of wh256 where it uses cm256 if the chunk count is below some threshold, which appeared to help my small-chunk-count reconstruction runs (recall, my use-case is two-pass: the first is around 5-10 chunks of 1152 bytes and never has any pre-knowlege, the second is usually around 800 chunks of 1152 bytes, but usually upwards of 780 chunks are available on the receiving end before any packets are even sent, as they can be calculated from the first pass). I did not, however, test backporting this behavior to leopard, but it looked like leopard lost to wirehair normally anyway, so I'd guess it wont save it.

@catid
Copy link
Owner

catid commented May 3, 2018 via email

@catid
Copy link
Owner

catid commented Sep 17, 2018 via email

@TheBlueMatt
Copy link
Author

Hmm, sadly the stuff I've got is all super-context-specific so a more general non-FEC-only framework probably doesn't apply. Thanks for the update, though!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants