Inline each hashtable using macros #270
Conversation
That's a lot of macros. I wonder if we can avoid this by just moving the function definitions from hash.c to hash.h, and ensuring the compiler can see what size is, by statically initializing it, like:

    static hash_t ipv4_cache = {
        .size = sizeof(ipv4_t),
    };
There are two new macros per type: a define and a call for each hashed function. Effectively I'm emulating C++ template functions. The main reason this needs to be inlined is to make the hash function significantly faster; right now it's unoptimized in the produced code. Although I intend to replace it with versions for sizes of 4 (or any multiple of 4) and 6, with a fallback to the current method for any other size (which won't be used). I'm expecting the order of low-hanging-fruit optimizations to be epoll (easy), tap overheads (hard), encryption (hard), hash (easy). Of course I've found a few unexpected improvements along the way, and that may happen again. Hash improvements are easy to make :)
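To make the template-emulation idea concrete, here is a minimal sketch of what such a per-type macro could look like. `HASH_DEFINE`, `ipv4_t`, and the field names are illustrative stand-ins, not tinc's actual identifiers; the point is that `sizeof(keytype)` is a compile-time constant, so the generated functions can be fully inlined and unrolled.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[4]; } ipv4_t;

/* One macro expansion per cached key type, emulating a C++ template. */
#define HASH_DEFINE(name, keytype, size)                                \
    static keytype name##_keys[size];                                   \
    static void *name##_values[size];                                   \
    static inline uint32_t name##_hash(const keytype *key) {            \
        uint32_t h = 0;                                                 \
        /* sizeof(keytype) is known at compile time, so the compiler    \
         * can unroll this loop and inline the whole function */        \
        for (size_t i = 0; i < sizeof(keytype); i++)                    \
            h = h * 0x9E3779B1u + ((const uint8_t *)key)[i];            \
        return h & ((size) - 1);                                        \
    }                                                                   \
    static inline void *name##_search(const keytype *key) {             \
        uint32_t i = name##_hash(key);                                  \
        if (!memcmp(&name##_keys[i], key, sizeof(keytype)))             \
            return name##_values[i];                                    \
        return NULL;                                                    \
    }                                                                   \
    static inline void name##_insert(const keytype *key, void *value) { \
        uint32_t i = name##_hash(key);                                  \
        name##_keys[i] = *key;                                          \
        name##_values[i] = value;                                       \
    }

HASH_DEFINE(ipv4_cache, ipv4_t, 1024)
```

Each `HASH_DEFINE(...)` line then behaves like an explicit template instantiation: it stamps out one set of type-specialized functions per key type.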
Because I verified it wasn't inlined without it. That, and in a future patch I intend to test a simplified modulo calculation; masking is better (faster, smaller code) when the hash table sizes are fixed to powers of 2 (which all of them currently are).
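The mask-versus-modulo point can be shown in a few lines. Assuming a fixed power-of-two table size, the two forms below always produce the same bucket, but the mask compiles to a single AND instead of a division (names here are illustrative):

```c
#include <stdint.h>

#define TABLE_SIZE 1024 /* must be a power of two */

/* Division-based bucket selection. */
static inline uint32_t bucket_mod(uint32_t hash) {
    return hash % TABLE_SIZE;
}

/* Mask-based bucket selection: identical result for power-of-two
 * sizes, but just one AND instruction. */
static inline uint32_t bucket_mask(uint32_t hash) {
    return hash & (TABLE_SIZE - 1);
}
```

In practice a modern compiler will already strength-reduce `% 1024` on an unsigned operand to the same AND, but only when the size is a visible compile-time constant, which is exactly what inlining buys here.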
I feel using types may open up additional room for optimization. Specifically, IP addresses have different distribution parameters (being largely within one subnet with a common prefix). While it's possible to do this by size (size=4, a.k.a. IPv4), since you don't have many hash types I felt putting the type name in the definition was cleaner. It also makes the GCC backtrace better.
Re: static. Me too; I didn't do it because of the resize function that exists. I was hence intending it as a later optional commit, as it depends on your plans to use this currently unused function in the future. I'm happy to merge that into this single commit. I'd also be removing the currently unused resize function (no plans to use that?). While I'm doing that, what about I add open addressing? That way your hash table will be conflict safe. Or I can add both of these to the next-stage PR for the hash table. I generally prefer smaller incremental PRs.
By the way, re: open addressing, I think that's definitely something for another PR. I haven't benchmarked it yet. It might be better to just place a key check in the lookup. Open addressing with a linear search (the next hash over, in the positive direction) is very cheap and takes advantage of the cache. There are ways to make it even cheaper again, but a simple search over, say, the next 3 buckets would only take a 4/8-byte read up to 16/32 bytes and provide significantly fewer hash conflicts on large networks. Currently the birthday paradox suggests that with:
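A rough sketch of the linear probing described above, assuming a hypothetical flat key/value layout and a fixed probe depth of 4 (both invented for illustration). Because the probed slots are adjacent in memory, the extra comparisons tend to stay within one cache line:

```c
#include <stdint.h>
#include <stddef.h>

#define SIZE 1024
#define PROBES 4

static uint32_t keys[SIZE];
static void *values[SIZE];

/* Search the home bucket plus the next few, wrapping at the end. */
static void *probe_search(uint32_t hash, uint32_t key) {
    for (int i = 0; i < PROBES; i++) {
        uint32_t slot = (hash + i) & (SIZE - 1);
        if (keys[slot] == key)
            return values[slot];
    }
    return NULL;
}

/* Insert into the first matching or empty slot within the probe
 * window; if none is free, overwrite the home bucket (safe for a
 * pure cache). */
static void probe_insert(uint32_t hash, uint32_t key, void *value) {
    uint32_t slot = hash & (SIZE - 1);
    for (int i = 0; i < PROBES; i++) {
        uint32_t s = (hash + i) & (SIZE - 1);
        if (keys[s] == key || !values[s]) {
            slot = s;
            break;
        }
    }
    keys[slot] = key;
    values[slot] = value;
}
```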
Also, what would be your opinion on increasing the size of the cache by default? 1024 might be a more sensible default. On low-memory devices (which are likely all 32-bit) the IPv4 hash table uses 2 KB of RAM, the IPv6 table 3 KB, and the MAC table 2.5 KB (7.5 KB total). An increase to 1024 slots would multiply these requirements by 4 (30 KB of RAM usage) but reduce conflicts greatly (to approximately what 25 nodes are now in the <=50 nodes category). High-memory devices and those on 64-bit Linux are unlikely to care about a few KB. Perhaps for 64-bit architectures we use an even larger constant? Open addressing would make conflicts basically nonexistent for the size of networks currently reasonable with tinc.
I went all the way to 4096 for 32-bit devices with c4ef1b6. When thinking about tinc's memory usage and the size of the smallest reasonable devices, I think they can spare 120 KB of RAM.
The hash tables are just used to cache things, so if there is a conflict, overwriting what was there before is safe. Most of the time we will have large groups of packets being sent to and received from the same address. However, that uses two entries in the cache (one for the destination address of packets sent, and one for the destination address of packets received), so we want to avoid those two conflicting. So linear probing once might be ideal (then we also don't have to worry about the load factor). Bumping the size to 1024 entries seems reasonable; the only drawback is the increased memory usage. Resizing should indeed be removed. The patches look good; small PRs are fine with me as long as each commit leaves tinc in a working state :)
@gsliepen I'd like to reduce or remove hash_clear anyway. In my testing I've had a horrible hash table hit rate. Initially I noticed the collisions, but I'm sure hash_clear would also be a factor if I benchmarked it.
Perhaps if the subnet being added is a /32 (/128 for IPv6), only that IP could be removed? And if it's actually a subnet (/31 or shorter), then and only then do a clear operation (also, IPv4 subnets only need to clear the IPv4 cache).
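A hypothetical sketch of that fast path: evict a single slot when the new subnet is a host route, and fall back to a full flush for wider prefixes. `hash_ipv4`, `cache_invalidate`, and the table layout are invented here for illustration; they are not tinc's real functions.

```c
#include <stdint.h>
#include <string.h>

#define SIZE 1024

static uint32_t keys[SIZE];
static void *values[SIZE];

static uint32_t hash_ipv4(uint32_t ip) {
    return (ip * 0x9E3779B1u) & (SIZE - 1);
}

static void cache_insert(uint32_t ip, void *value) {
    uint32_t i = hash_ipv4(ip);
    keys[i] = ip;
    values[i] = value;
}

static void *cache_lookup(uint32_t ip) {
    uint32_t i = hash_ipv4(ip);
    return keys[i] == ip ? values[i] : NULL;
}

static void cache_clear(void) {
    memset(values, 0, sizeof(values));
}

/* /32: drop just this host's slot. Anything wider: flush everything. */
static void cache_invalidate(uint32_t ip, int prefixlen) {
    if (prefixlen == 32)
        values[hash_ipv4(ip)] = NULL;
    else
        cache_clear();
}
```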
I have quite the opposite use case: one server receiving packets from a large number (approx. 50 currently) of other servers. For it, the cache is basically useless.
Would you be able to make it a configure option? I've long since become terminally fed up with autoconf.
Unused functions should be removed. I shouldn't have added them in the first place...
@gsliepen Individual IP removal complete. Open addressing too.
I thought similarly. But honestly, clearing a few pages, as long as it's done sparingly, is cheap. Ideally it shouldn't be needed at all. I suspect the best way would be a reference counter on the actual node_t combined with a flag to mark staleness. I'm not looking to do that currently, though; at least not until lower-hanging fruit has been dealt with.
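The refcount-plus-staleness idea floated above could be sketched as follows. `node_t` here is a simplified stand-in, not tinc's real structure; the point is that lookups treat a stale entry as a miss instead of the cache ever needing a global clear.

```c
#include <stddef.h>
#include <stdbool.h>

/* Simplified stand-in for a cached node. */
typedef struct node {
    int refcount; /* live references held by cache entries */
    bool stale;   /* set when this node's routing info changes */
} node_t;

/* Lookup validates the cached entry instead of relying on clears:
 * a stale node is simply treated as a cache miss. */
static node_t *cache_lookup(node_t *slot) {
    if (slot && !slot->stale)
        return slot;
    return NULL;
}
```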
Easy enough, but I made the subnet function static for now.
As the last holdout in hash.c, I've taken your suggestion.
Added type-specific hash functions with a particular emphasis on IPv4. IPv6 and MAC addresses could probably benefit from similar algorithms in the future. For now, though, I implemented a similar bit scrambler using the golden prime constant supplied. Nothing fancy, but simpler than the existing implementation. Also added a hash seed. It's a small security issue, but technically someone could exploit known collisions to (currently) force cache clears or (with this PR) reduce performance. A randomized seed prevents IP/MAC address collisions from being predictable.
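A sketch of what a seeded multiplicative scrambler along these lines might look like, using the 32-bit golden-ratio prime 0x9E3779B1 mentioned above. The exact mixing and seed handling here are illustrative, not the PR's literal code; the key property is that without knowing the seed, an attacker cannot predict which addresses collide.

```c
#include <stdint.h>

/* Assumed to be initialized once at startup from a random source. */
static uint32_t hash_seed;

static inline uint32_t hash_ipv4(uint32_t ip) {
    /* XOR in the seed, then scramble with the golden-ratio prime.
     * The top bits of a multiplicative hash are the best mixed,
     * so take those for the bucket index. */
    uint32_t h = (ip ^ hash_seed) * 0x9E3779B1u;
    return h >> 16;
}
```

Changing the seed permutes which address pairs collide, so collisions observed against one tinc instance do not transfer to another.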
By the way, that IPv4 hashing algorithm should be collision-free for a /24 on 32-bit and 64-bit, and collision-free for a /16 on 64-bit (on the first slot).
Specifically, someone could purposefully choose to send traffic to/from colliding addresses. Choices are really:
It would be great if the 18 commits could be squashed into one before merging this pull request.
@gsliepen, you can always press the "Squash and merge" button, and this whole pull request would be merged into one commit. You can then edit the commit message. The documentation is here. I did exactly that here: fangfufu/httpdirfs#78; the resulting commit is here: fangfufu/httpdirfs@60b8851
I know, but first of all, GitHub for me is just a mirror of the official Git repository at https://tinc-vpn.org/git/tinc/, so I wouldn't use that button (but of course I'd just squash and merge it locally on my own machine almost as easily). Second, ideally pull requests are already in a state where I can merge them without any extra work. There are also still some pending change requests, which is the main reason this hasn't been merged. I'm still not happy with the size of the hash tables. The amount of memory available for tinc can vary a lot depending on which machine it is running on, if someone wants to put it into a constrained cgroup for example, regardless of whether it is on a 32-bit or 64-bit machine. Maybe dynamically growing the tables is the right thing to do after all?
Here are some of my thoughts, I am not sure how useful they are.
I suppose it depends on whether you want to go for memory footprint or performance. Earlier on in the thread you said:
Is there a cheap way to do both? I'll admit that I am a bit clueless. Right now, how much performance difference does this pull request make anyway? @splitice Right now the performance of tinc doesn't seem to be great: https://vpncloud.ddswd.de/features/comparison. However, I like tinc because of its P2P nature, and the stability and maturity of the project. Personally I think the performance is enough / reasonable for me; I am limited by the physical bandwidth rather than by the overhead of tinc. I do note the memory requirement of this patch:
I think these are reasonable. With respect to:
If a user is using cgroups on a 64-bit machine, how much memory do you reckon the user will allocate to tinc? Perhaps we can write down somewhere in the manual the minimum requirements for users.
Something that shouldn't be done, at least (if at all), until the maintainer is happy with all aspects and ready to merge.
That's news to me. This PR is not waiting on me to reply or make changes. From my point of view, this is complete.
It stops (or at least postpones) a performance regression on large networks (where the subnet cache on current versions is nearly useless). I don't have accurate numbers at this stage as it's very workload dependent. It's resolved the peaks to full saturation we receive on a network where nearly every node communicates with every other (rendering the subnet cache thrashed). Splay tree lookup is fast, but not as fast as the cache lookup, and lookup time increases with size (appears to be roughly O(log n)). With high-PPS thrashing there are considerable gains.
Very true, and there are multiple areas that need to be tackled to get the performance up to comparable levels. This is just one of them. I do hope that by making the subnet cache more useful there will be further gains in the future. Not performance related: I particularly want to explore improving route finding through the network with relay nodes, as that's a big issue for me with a world-spanning network. Having a useful (non-thrashed) subnet cache is essential to my plans here.
That's a substantial rewrite, and I quite like the statically allocated tables you suggested. For the small memory requirement of the subnet cache I think it's not worth making it configurable. Perhaps an autoconf option for low memory (default true on 32-bit)?
Maybe I'm just overreacting.
Ah, GitHub still shows it in the "Changes requested" state for me. I'll review it again tonight.
That could work.
@splitice, it shows "Changes requested" on my side as well.
@fangfufu it will show that until I or gsliepen mark the conversation as resolved.
Please rebase as well, there are some merge conflicts.
- inline & statically allocated hash table
- increased hash table size (32-bit: 1024, 64-bit: 65536)
- re-arrange subnet members
- add key type
- reduce clearing of hash table
- clean up key pointer operations
- removed unused hash_search_or_insert
- add open addressing to hash table
- type-specific hash functions & hash seeding
- no collisions for 32-bit OS
- implement cache flush by SUBNET_MAC
Rebased, requested changes applied. Squashed as per request.
Still needs further work by someone who understands what to do in autoconf to get both static analysis and sanitizer passing. |
Yeah, it's a bit weird that it's complaining about those integers when you explicitly cast them. I'll try to see what the best way is to fix these warnings. The code looks fine to me otherwise!
I tried explicit casting in two different ways without luck too (reverted).
Merged 9a018c2, and fixed the sanitizer warnings with some more casting that keeps Clang's sanitizer happy. Thanks for the patch!
This allows the compiler to optimize out a significant number of instructions (calls to memcmp and hashing function conditionals).
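As one illustration of the memcmp point (names here are hypothetical, not the PR's code): once the key size is a compile-time constant visible at the call site, compilers typically lower the memcmp to one or two word comparisons instead of a library call.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[4]; } ipv4_t;

static inline int ipv4_equal(const ipv4_t *a, const ipv4_t *b) {
    /* sizeof(ipv4_t) is a constant 4, so this memcmp is usually
     * inlined as a single 32-bit load-and-compare rather than an
     * actual function call. */
    return memcmp(a, b, sizeof(ipv4_t)) == 0;
}
```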
There is room for a further PR after this:
a) further optimizing the hash table
b) making the table collision safe (!!)
As this is a micro-optimization, at this stage it's liable to be difficult to benchmark on its own. I got a 0.2% improvement, but that's well within the margin of error for my environment.