
Inline each hashtable using macros #270

Closed
splitice wants to merge 2 commits from the patch/inline-hashtable branch

Conversation

splitice
Contributor

This allows the compiler to optimize out a significant number of instructions (calls to memcmp and hashing-function conditionals).

There is room for a further PR after this:
a) further optimizing the hash table
b) making the table collision safe (!!)

As this is a micro-optimization, at this stage it's liable to be difficult to benchmark on its own. I got a 0.2% improvement, but that's well within the margin of error for my environment.
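
A minimal sketch of what the macro "templates" could look like (hypothetical names and a 0x100-slot table, not the exact code in this PR): each instantiation bakes the key type into the generated functions, so memcmp sees a compile-time constant size and the hash loop a constant trip count, both of which the optimizer can flatten.

```c
#include <stdint.h>
#include <string.h>

/* Simple byte-wise hash; a stand-in for tinc's real hash function. */
static inline uint32_t hash_function(const void *p, size_t len) {
	const uint8_t *q = p;
	uint32_t hash = 0;

	for(size_t i = 0; i < len; i++) {
		hash = hash * 0x9E3779B1u + q[i];
	}

	return hash;
}

#define HASH_SIZE 0x100

/* One instantiation per key type, emulating a C++ template. */
#define hash_define(t) \
	typedef struct { \
		t keys[HASH_SIZE]; \
		const void *values[HASH_SIZE]; \
	} hash_##t; \
	static inline uint32_t hash_modulo_##t(uint32_t hash) { \
		return hash & (HASH_SIZE - 1); \
	} \
	static inline void hash_insert_##t(hash_##t *h, const t *key, const void *value) { \
		uint32_t i = hash_modulo_##t(hash_function(key, sizeof(t))); \
		h->keys[i] = *key; \
		h->values[i] = value; \
	} \
	static inline const void *hash_search_##t(const hash_##t *h, const t *key) { \
		uint32_t i = hash_modulo_##t(hash_function(key, sizeof(t))); \
		if(h->values[i] && !memcmp(&h->keys[i], key, sizeof(t))) { \
			return h->values[i]; \
		} \
		return NULL; \
	}

typedef uint32_t ipv4_t;
hash_define(ipv4_t)
```

With `sizeof(t)` known at each call site, the compiler can replace the `memcmp` with one or two word compares and fully unroll the hashing loop, which is the point of the inlining.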

@splitice splitice force-pushed the patch/inline-hashtable branch from 8905a73 to e05d62b Compare June 29, 2021 02:11
@fangfufu fangfufu added 1.1 Issue related to Tinc 1.1 enhancement New feature requests or performance improvement. labels Jun 30, 2021
Owner

@gsliepen gsliepen left a comment


That's a lot of macros. I wonder if we can avoid this by just moving the function definitions from hash.c to hash.h, and ensuring the compiler can see what size is by statically initializing it, like:

static hash_t ipv4_cache = {
    .size = sizeof(ipv4_t),
};

@splitice
Contributor Author

@gsliepen

There are two new macros per type: one that defines the specialized functions and one call for each function used. Effectively I'm emulating C++ template functions.

With -O0 I'm highly doubtful the compiler would inline the functions, even if there are gains in dead-code elimination from second-level global static members.

The main reason this needs to be inlined is to make the hash function significantly faster; right now it's unoptimized in the produced code. I intend to replace it with variants for size 4 (or any size divisible by 4) and size 6, falling back to the current method for any other size (which won't be used).

I'm expecting the order of low hanging fruit optimizations to be epoll (easy), tap overheads (hard), encryption (hard), hash (easy). Of course I've found a few unexpected improvements along the way and this may happen again. Hash improvements are easy to make :)

@splitice splitice force-pushed the patch/inline-hashtable branch from f629260 to 70eef6a Compare July 1, 2021 00:18
@splitice
Contributor Author

splitice commented Jul 2, 2021

  1. Why not make sure modulo() can be inlined by moving it into hash.h as well?

Because I verified it was inlined without that. Also, in a future patch I intend to test a simplified modulo calculation: masking is better (faster, less code) when the hash table sizes are fixed to powers of 2 (which all of them currently are).
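
To illustrate the masking point (helper names here are made up): for a power-of-two table size, `hash % size` and `hash & (size - 1)` are equivalent, and the mask avoids an integer division entirely.

```c
#include <stdint.h>

enum { HASH_SIZE = 0x100 };   /* must be a power of two */

/* Generic reduction: compiles to an integer division (or a strength-reduced
 * equivalent) when the size is not known to be a power of two. */
static inline uint32_t modulo_div(uint32_t hash) {
	return hash % HASH_SIZE;
}

/* Power-of-two reduction: a single AND instruction. */
static inline uint32_t modulo_mask(uint32_t hash) {
	return hash & (HASH_SIZE - 1);
}
```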

  2. Hm, instead of creating multiple versions of the hash functions, why not have an explicit size parameter, so that you have to do: hash_search(sizeof(mac_t), mac_cache, address)?

I feel using types may open up additional room for optimization. Specifically, IP addresses have different distribution parameters (being largely within one subnet with a common prefix). While it's possible to do this by size (size=4 aka IPv4), since you don't have many hash types I felt putting the type name in the definition was cleaner. It also makes the GCC backtrace better (hash_search_ipv4_t is much nicer than hash_search_4). This is stage 1 of what I intend to offer up in PRs for the hash table.

  3. If we want to go this way, then instead of having a hash_alloc_ ## t() that returns a generic hash_t *, I'd rather have hash_alloc_define(t) create a specialized hash_ ## t ## _t that has a member variable t keys[0x100],

Re: static. Me too; I didn't do it because of the resize function that exists. I was hence intending it as a later optional commit, as it depends on your plans to (in the future) use this currently unused function.

I'm happy to merge that into this single commit. I'd also be removing the currently unused resize function (no plans to use that?). Note: the unused hash_resize is also possibly incomplete; it doesn't redistribute entries over the hash table, it only resizes the underlying allocation.

While I'm doing that, what about I add open addressing? That way your hash table will be collision safe. Or I can add both of these to the next-stage PR for the hash table.

I generally prefer smaller incremental PRs.

@splitice
Contributor Author

splitice commented Jul 2, 2021

By the way, re: open addressing, I think that's definitely something for another PR.

I haven't benchmarked it yet.

It might be better to just place a key check in hash_delete and not delete the node if the key doesn't match. The cost of cache misses (e.g. due to replacement or an erroneous delete) needs to be weighed against the cost of open addressing.

Open addressing with a linear search (next bucket in the positive direction) is very cheap and takes advantage of the cache. There are ways to make it cheaper still, but a simple search over, say, the next 3 buckets only extends each lookup from a 4/8-byte read to 16/32 bytes and produces significantly fewer hash collisions on large networks.
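
That linear probe can be sketched as follows (hypothetical names; the trivial masking "hash" is a placeholder so collisions are easy to see). Each lookup scans at most PROBE adjacent buckets, which typically sit within one or two cache lines.

```c
#include <stdint.h>
#include <stddef.h>

#define HASH_SIZE 0x100
#define PROBE 4   /* home bucket plus the next 3 */

typedef struct {
	uint32_t keys[HASH_SIZE + PROBE];     /* over-allocate to avoid wraparound */
	const void *values[HASH_SIZE + PROBE];
} probe_hash_t;

/* Placeholder hash: real code would scramble the key first. */
static inline uint32_t probe_slot(uint32_t key) {
	return key & (HASH_SIZE - 1);
}

static inline void probe_insert(probe_hash_t *h, uint32_t key, const void *value) {
	uint32_t i = probe_slot(key);

	for(uint32_t j = 0; j < PROBE; j++) {
		if(!h->values[i + j] || h->keys[i + j] == key) {
			h->keys[i + j] = key;
			h->values[i + j] = value;
			return;
		}
	}

	/* Probe window full: overwrite the home slot, it is only a cache. */
	h->keys[i] = key;
	h->values[i] = value;
}

static inline const void *probe_search(const probe_hash_t *h, uint32_t key) {
	uint32_t i = probe_slot(key);

	for(uint32_t j = 0; j < PROBE; j++) {
		if(h->values[i + j] && h->keys[i + j] == key) {
			return h->values[i + j];
		}
	}

	return NULL;
}
```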

Currently the birthday paradox suggests that with:

  • 100 nodes in cache: 99.99999998278067529% chance of at least one collision
  • 50 nodes in cache: 99.406783227075113949% chance of at least one collision
  • 25 nodes in cache: 70.214748117218117013% chance of at least one collision
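
These figures follow from the exact birthday-paradox formula: the chance of at least one collision among n keys in m buckets is 1 - prod_{i=0}^{n-1} (m - i)/m, assuming a uniform hash and m = 256 (the 0x100-slot tables discussed here).

```c
/* Exact probability of at least one collision when n uniformly hashed keys
 * are placed into m buckets. */
static double collision_probability(int n, int m) {
	double p_none = 1.0;   /* probability that all n keys land in distinct buckets */

	for(int i = 0; i < n; i++) {
		p_none *= (double)(m - i) / (double)m;
	}

	return 1.0 - p_none;
}
```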

Also, what would be your opinion on increasing the size of the cache by default? 1024 might be a more sensible default.
If I were willing to touch autoconf I'd suggest a configure-defined constant.

On low-memory devices (which are likely all 32-bit) the IPv4 hash table uses 2 KB of RAM, the IPv6 table 3 KB, and the MAC table 2.5 KB (7.5 KB total). An increase to 1024 slots would quadruple these requirements (30 KB of RAM) but greatly reduce collisions (to approximately what 25 nodes see now, for the <=50 node category).

High-memory devices and those on 64-bit Linux are unlikely to care about a few KB. Perhaps for 64-bit architectures we should use an even larger constant?

Open addressing would make collisions basically nonexistent for the network sizes currently reasonable with tinc.

@splitice
Contributor Author

splitice commented Jul 2, 2021

I went all the way to 4094 for 32-bit devices with c4ef1b6.

Thinking about tinc's memory usage and the size of the smallest reasonable devices, I think they can spare 120 KB of RAM.

@gsliepen
Owner

gsliepen commented Jul 2, 2021

The hash tables are just used to cache things, so if there is a conflict, overwriting what was there before is safe. Most of the time we will have large groups of packets being sent to and received from the same address. However, that uses two entries in the cache (one for the destination address of packets sent, and one for the source address of packets received), so we want to avoid those two conflicting. Linear probing once might therefore be ideal (then we also don't have to worry about the load factor).

Bumping the size to 1024 entries seems reasonable, the only drawback is that hash_clear() will take longer. So if open addressing solves the conflict issue, I don't think we need to increase the size.

Resizing should indeed be removed. The patches look good, small PRs are fine with me as long as each commit leaves tinc in a working state :)

@splitice
Contributor Author

splitice commented Jul 2, 2021

@gsliepen I'd like to reduce or remove hash_clear anyway. In my testing I've had a horrible hash table hit rate. Initially I noticed the collisions, but I'm sure hash_clear would also factor in if I benchmarked it.

hash_clear() can be largely eliminated from most people's networks anyway, I suspect. Many people run /32-per-node networks, I dare say. Of course not all.

Perhaps if the subnet being added is a /32 (/128 for IPv6), only that IP could be removed? And if it's actually a subnet (a /31 or shorter prefix), then and only then do a full clear (also, IPv4 subnets only need to clear ipv4_cache and mac_cache).
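
That selective invalidation could look something like this (hypothetical helper names, IPv4 only, with a trivial placeholder hash): a /32 deletes exactly one entry, anything shorter falls back to a full clear.

```c
#include <stdint.h>
#include <string.h>

enum { HASH_SIZE = 0x100 };

typedef struct {
	uint32_t keys[HASH_SIZE];
	const void *values[HASH_SIZE];
} ipv4_cache_t;

static inline uint32_t cache_slot(uint32_t ip) {
	return ip & (HASH_SIZE - 1);   /* placeholder hash */
}

/* Drop a single entry, but only if the stored key actually matches. */
static inline void cache_delete(ipv4_cache_t *c, uint32_t ip) {
	uint32_t i = cache_slot(ip);

	if(c->values[i] && c->keys[i] == ip) {
		c->values[i] = NULL;
	}
}

static inline void cache_clear(ipv4_cache_t *c) {
	memset(c->values, 0, sizeof(c->values));
}

/* Invalidate as little as possible when a subnet is added. */
static inline void cache_flush_subnet(ipv4_cache_t *c, uint32_t network, int prefixlen) {
	if(prefixlen == 32) {
		cache_delete(c, network);   /* single host: drop one entry */
	} else {
		cache_clear(c);             /* real subnet: full clear */
	}
}
```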

Most of the time, we will have large groups of packets being sent to and received from the same address.

I have quite the opposite use case: one server receiving packets from a large number (approx. 50 currently) of other servers. For it the cache is basically useless.

Bumping the size to 1024 entries seems reasonable, the only drawback is that hash_clear() will take longer. So if open addressing solves the conflict issue, I don't think we need to increase the size.

Would you be able to make it a configure option? I've long since become terminally fed up with autoconf.

@splitice
Contributor Author

splitice commented Jul 2, 2021

Also hash_search_or_insert() is entirely unused. Should it be removed?

@gsliepen
Owner

gsliepen commented Jul 2, 2021

Unused functions should be removed. I shouldn't have added them in the first place...

As for hash_clear(), removing individual /32 or /128s would be great, and especially for Mode = switch the possibility to remove a single MAC address would be helpful. We can solve the overhead of hash_clear() for other cases by introducing a generation counter. This way we can increase the size of the cache easily.
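
The generation counter can be sketched like this (assumed names): every entry records the generation it was written in, so clearing is a single increment regardless of table size, and stale entries are simply ignored on lookup.

```c
#include <stdint.h>
#include <stddef.h>

enum { HASH_SIZE = 0x100 };

typedef struct {
	uint32_t keys[HASH_SIZE];
	uint32_t gens[HASH_SIZE];      /* generation each slot was written in */
	const void *values[HASH_SIZE];
	uint32_t generation;           /* current generation of the whole cache */
} gen_cache_t;

static inline uint32_t gen_slot(uint32_t key) {
	return key & (HASH_SIZE - 1);   /* placeholder hash */
}

static inline void gen_insert(gen_cache_t *c, uint32_t key, const void *value) {
	uint32_t i = gen_slot(key);
	c->keys[i] = key;
	c->gens[i] = c->generation;
	c->values[i] = value;
}

static inline const void *gen_search(const gen_cache_t *c, uint32_t key) {
	uint32_t i = gen_slot(key);

	if(c->gens[i] == c->generation && c->values[i] && c->keys[i] == key) {
		return c->values[i];
	}

	return NULL;
}

/* O(1) clear: all existing entries become stale at once. */
static inline void gen_clear(gen_cache_t *c) {
	c->generation++;
}
```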

@splitice
Contributor Author

splitice commented Jul 2, 2021

@gsliepen Individual IP removal complete. Open addressing too.

introducing a generation counter

I thought similarly. But honestly, clearing a few pages, as long as it's done sparingly, is cheap. Ideally it shouldn't be needed at all.

I suspect the best way would be a reference counter on the actual node_t combined with a flag to mark staleness. I'm not looking to do that currently though, at least until lower-hanging fruit has been dealt with.

the possibility to remove a single MAC address would be helpful.

Easy enough, but I made the subnet function static for now.

@splitice
Contributor Author

splitice commented Jul 2, 2021

Why not make sure modulo() can be inlined by moving it into hash.h as well?

As the last holdout in hash.c, I've taken your suggestion.

@splitice
Contributor Author

splitice commented Jul 2, 2021

Added type-specific hash functions with a particular emphasis on IPv4. IPv6 and MAC addresses could probably benefit from similar algorithms in the future; for now I implemented a similar bit scrambler using the supplied golden-ratio prime constant. Nothing fancy, but simpler than the existing implementation (and no more while(true) loop).
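
A plausible shape for such a scrambler (my sketch, not the PR's exact code): a single multiply by the 32-bit golden-ratio prime plus an xor-shift fold, instead of a byte loop. Both steps are invertible mod 2^32, so distinct addresses always produce distinct 32-bit hashes; collisions only appear when reducing to the table size.

```c
#include <stdint.h>

#define GOLDEN_PRIME 0x9E3779B1u   /* 2^32 divided by the golden ratio, a prime */

static inline uint32_t hash_ipv4(uint32_t ip, uint32_t seed) {
	uint32_t h = (ip ^ seed) * GOLDEN_PRIME;
	return h ^ (h >> 16);   /* fold high-bit entropy into the low index bits */
}
```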

Added a hash seed. It's a small security issue, but technically someone could exploit known collisions to (currently) force cache clears or (with this PR) reduce performance. A randomized seed prevents IP/MAC address collisions from being predictable.

@splitice
Contributor Author

splitice commented Jul 2, 2021

By the way, that IPv4 hashing algorithm should be collision-free for a /24 on 32-bit and 64-bit, and collision-free for a /16 on 64-bit (on the first slot).

@splitice
Contributor Author

splitice commented Jul 2, 2021

The hash_seed is of limited use for IPv4. Unfortunately I'm not sure the algorithm can be seeded securely while maintaining the no-collision properties for standard network sizes.

Specifically, someone could purposefully choose to send traffic to/from 10.0.0.1, 10.1.0.1, etc. to force collisions. Given that the same property applied to the existing implementation (although you would have had more work to calculate the collisions), I'm not super worried.

The choices are really:
a) apply a small hash_seed as a multiplier to the return value. This would remove the predictability of the collision properties. It should be quite good, but not zero. For 0...255 in a /24, no collisions could be maintained on 64-bit.
b) live with it. It's not a big issue (and it already exists).
c) apply an 8-, 16- or 32-bit bit-scrambling hash (or really any), as they have a higher bar for calculating collisions.
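
Option (a) might look like this (purely illustrative): since any odd multiplier is invertible mod 2^32, multiplying the result by a seed-derived odd constant reshuffles which inputs collide after reduction to the table size, without losing the 1:1 property of the full 32-bit hash.

```c
#include <stdint.h>

#define GOLDEN_PRIME 0x9E3779B1u

static inline uint32_t hash_ipv4_seeded(uint32_t ip, uint32_t seed) {
	uint32_t h = ip * GOLDEN_PRIME;
	return h * (seed | 1u);   /* force the seed multiplier odd: stays bijective */
}
```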

Owner

@gsliepen gsliepen left a comment


It would be great if the 18 commits could be squashed into one before merging this pull request.

src/graph.c (outdated; resolved)
src/subnet.c (resolved)
src/subnet.c (resolved)
src/system.h (resolved)
@splitice splitice force-pushed the patch/inline-hashtable branch 2 times, most recently from 7f2c7a9 to 015a6f7 Compare July 11, 2021 23:03
@splitice splitice requested a review from gsliepen July 12, 2021 00:02
@fangfufu
Collaborator

fangfufu commented Aug 9, 2021

It would be great if the 18 commits could be squashed into one before merging this pull request.

@gsliepen , you can always press the "Squash and merge" button, and this whole pull request would be merged into one commit. You can then edit the commit message. The documentation is here.

I did exactly that here: fangfufu/httpdirfs#78, the resulting commit is here: fangfufu/httpdirfs@60b8851

@gsliepen
Owner

gsliepen commented Aug 9, 2021

you can always press the "Squash and merge" button, and this whole pull request would be merged into one commit.

I know, but first of all, GitHub for me is just a mirror of the official Git repository at https://tinc-vpn.org/git/tinc/, so I wouldn't use that button (but of course I'd just squash and merge it locally on my own machine almost as easily). Second, ideally pull requests are already in a state I can merge them without any extra work. There's also still some pending change requests, which is the main reason this hasn't been merged.

I'm still not happy with the size of the hash tables. The amount of memory available for tinc can vary a lot depending on which machine it is running on (if someone puts it into a constrained cgroup, for example), regardless of whether it is a 32-bit or 64-bit machine. Maybe dynamically growing the tables is the right thing to do after all?

@fangfufu
Collaborator

fangfufu commented Aug 9, 2021

Here are some of my thoughts, I am not sure how useful they are.

Maybe dynamically growing the tables is the right thing to do after all?

I suppose it depends on whether you want to go for memory footprint or performance. Earlier on in the thread you said:

That way we also avoid malloc calls and pointer dereferencing as much as possible.

Is there a cheap way to do both? I'll admit that I am a bit clueless. Right now, how much performance difference does this pull request make anyway? @splitice

Right now the performance of tinc doesn't seem to be great: https://vpncloud.ddswd.de/features/comparison. However, I like tinc because of its P2P nature, and the stability and maturity of the project. Personally I think the performance is reasonable enough for me; I am limited more by the physical bandwidth than by the overhead of tinc.

I do note the memory requirement of this patch:

So subnet cache usage on a 32bit system 152KB
On a 64bit system it's 2.3MB for comparison.

I think these are reasonable.

With respect to:

The amount of memory available for tinc can vary a lot depending on which machine it is running on, if someone wants to put it into a constrained cgroup for example, regardless of whether it is on a 32 bit or 64 bit machine.

If a user is using cgroups on a 64-bit machine, how much memory do you reckon they will allocate to tinc? Perhaps we could note the minimum requirements somewhere in the manual.

@splitice
Contributor Author

splitice commented Aug 10, 2021

you can always press the "Squash and merge" button, and this whole pull request would be merged into one commit.

Something that shouldn't be done, at least, until the maintainer is happy with all aspects and ready to merge.

There's also still some pending change requests, which is the main reason this hasn't been merged.

That's news to me. This PR is not waiting on me to reply or make changes; from my point of view it is complete.

Is there a cheap way to do both? I'll admit that I am a bit clueless. Right now, how much performance difference does this pull request make anyway? @splitice

It stops (or at least postpones) a performance regression on large networks (where the subnet cache on current versions is nearly useless). I don't have accurate numbers at this stage as it's very workload-dependent. It has resolved the peaks to full saturation we see on a network where nearly every node communicates with every other (leaving the subnet cache thrashed). Splay-tree lookup is fast, but not as fast as the cache lookup, and lookup time increases with size (roughly O(log n), it appears). With high-PPS thrashing there are considerable gains.

Right now the performance of tinc doesn't seem to be great: https://vpncloud.ddswd.de/features/comparison

Very true, and there are multiple areas that need to be tackled to get the performance up to comparable levels. This is just one of them.

I do hope that by making the subnet cache more useful there will be further gains in the future. Not performance-related, but I particularly want to explore improving route finding through the network with relay nodes, as that's a big issue for me with a world-spanning network. Having a useful (non-thrashed) subnet cache is essential to my plans here.

Maybe dynamically growing the tables is the right thing to do after all?

That's a substantial rewrite, and I quite like the statically allocated tables you suggested. Given the small memory requirement of the subnet cache I don't think it's worth making configurable. Perhaps an autoconf option for low memory (defaulting to true on 32-bit)?

@gsliepen
Owner

gsliepen commented Aug 10, 2021

On a 64bit system it's 2.3MB for comparison.

I think these are reasonable.

Maybe I'm just overreacting.

Thats news to me. This PR is not waiting on me to reply or make changes. From my point of view this is complete.

Ah, GitHub still shows it in the "Changes requested" state for me. I'll review it again tonight.

Perhaps an autoconf option for low memory (defaulting to true on 32-bit)?

That could work.

@fangfufu
Collaborator

@splitice , it shows "Changes requested" on my side as well.

@splitice
Contributor Author

@fangfufu It will show that until I or gsliepen mark the conversation as resolved.

Owner

@gsliepen gsliepen left a comment


Please rebase as well, there are some merge conflicts.

src/subnet.c (outdated; resolved)
src/subnet.c (resolved)
src/subnet.c (outdated; resolved)
@splitice splitice force-pushed the patch/inline-hashtable branch from 015a6f7 to 690b077 Compare August 13, 2021 01:47
 - inline & staticly allocated hash table
 - increased hashtable size (32bit: 1024, 64bit: 65536)
 - re-arrange subnet members
 - Add key type
 - reduce clearing of hash table
 - cleanup key pointer operations
 - removed unused hash_search_or_insert
 - add open addressing to hash table
 - type specific hash functions & hash seeding
 - no collisions for 32bit os
 - implement cache flush by SUBNET_MAC
@splitice splitice force-pushed the patch/inline-hashtable branch from b7cd5ab to 9a018c2 Compare August 13, 2021 02:01
@splitice
Contributor Author

Rebased, requested changes applied. Squashed as per request.

@splitice splitice force-pushed the patch/inline-hashtable branch from 377acf6 to e12c616 Compare August 13, 2021 05:04
@splitice
Contributor Author

Still needs further work by someone who understands what to do in autoconf to get both static analysis and sanitizer passing.

@gsliepen
Owner

Still needs further work by someone who understands what to do in autoconf to get both static analysis and sanitizer passing.

Yeah it's a bit weird it's complaining about those integers when you explicitly cast them. I'll try to see what the best way is to fix these warnings. The code looks fine to me otherwise!

@splitice
Contributor Author

I tried explicit casting in two different ways without luck, too (reverted).

@gsliepen
Owner

Merged 9a018c2, fixed the sanitizer warnings by some more casting that keeps Clang's sanitizer happy. Thanks for the patch!

@gsliepen gsliepen closed this Aug 13, 2021
@fangfufu fangfufu linked an issue Aug 14, 2021 that may be closed by this pull request

Successfully merging this pull request may close these issues.

Sanitizer (address) CI checks fail due to timeout.