Commit d1e5e64

q2ven authored and kuba-moo committed
tcp: Introduce optional per-netns ehash.
The more sockets we have in the hash table, the longer we spend looking
up the socket. While running a number of small workloads on the same
host, they penalise each other and cause performance degradation.

The root cause might be a single workload that consumes much more
resources than the others. It often happens on a cloud service where
different workloads share the same computing resource.

On EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash
entries), after running iperf3 in different netns, creating 24Mi sockets
without data transfer in the root netns causes about 10% performance
regression for the iperf3's connection.

  thash_entries    sockets    length    Gbps
         524288          1         1    50.7
                      24Mi        48    45.1

It is basically related to the length of the list of each hash bucket.
For testing purposes to see how performance drops along the length,
I set 131072 (1Mi / 8) to thash_entries, and here's the result.

  thash_entries    sockets    length    Gbps
         131072          1         1    50.7
                       1Mi         8    49.9
                       2Mi        16    48.9
                       4Mi        32    47.3
                       8Mi        64    44.6
                      16Mi       128    40.6
                      24Mi       192    36.3
                      32Mi       256    32.5
                      40Mi       320    27.0
                      48Mi       384    25.0

To resolve the socket lookup degradation, we introduce an optional
per-netns hash table for TCP, but it's just ehash, and we still share
the global bhash, bhash2 and lhash2.

With a smaller ehash, we can look up non-listener sockets faster and
isolate such noisy neighbours. In addition, we can reduce lock
contention.

We can control the ehash size by a new sysctl knob. However, depending
on workloads, it will require very sensitive tuning, so we disable the
feature by default (net.ipv4.tcp_child_ehash_entries == 0). Moreover,
we can fall back to using the global ehash in case we fail to allocate
enough memory for a new ehash. The maximum size is 16Mi, which is large
enough that even if we have 48Mi sockets, the average list length is 3,
and regression would be less than 1%.

We can check the current ehash size by another read-only sysctl knob,
net.ipv4.tcp_ehash_entries. A negative value means the netns shares
the global ehash (per-netns ehash is disabled or failed to allocate
memory).

  # dmesg | cut -d ' ' -f 5- | grep "established hash"
  TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)

  # sysctl net.ipv4.tcp_ehash_entries
  net.ipv4.tcp_ehash_entries = 524288  # can be changed by thash_entries

  # sysctl net.ipv4.tcp_child_ehash_entries
  net.ipv4.tcp_child_ehash_entries = 0  # disabled by default

  # ip netns add test1
  # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
  net.ipv4.tcp_ehash_entries = -524288  # share the global ehash

  # sysctl -w net.ipv4.tcp_child_ehash_entries=100
  net.ipv4.tcp_child_ehash_entries = 100

  # ip netns add test2
  # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
  net.ipv4.tcp_ehash_entries = 128  # own a per-netns ehash with 2^n buckets

When more than two processes in the same netns create per-netns ehash
concurrently with different sizes, we need to guarantee the size in
one of the following ways:

  1) Share the global ehash and create per-netns ehash

  First, unshare() with tcp_child_ehash_entries==0. It creates dedicated
  netns sysctl knobs where we can safely change tcp_child_ehash_entries
  and clone()/unshare() to create a per-netns ehash.

  2) Control write on sysctl by BPF

  We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write on
  sysctl knobs.

Note that the global ehash allocated at the boot time is spread over
available NUMA nodes, but inet_pernet_hashinfo_alloc() will allocate
pages for each per-netns ehash depending on the current process's NUMA
policy. By default, the allocation is done in the local node only, so
the per-netns hash table could fully reside on a random node. Thus,
depending on the NUMA policy the netns is created with and the CPU the
current thread is running on, we could see some performance differences
for highly optimised networking applications.

Note also that the default values of two sysctl knobs depend on the
ehash size and should be tuned carefully:

  tcp_max_tw_buckets  : tcp_child_ehash_entries / 2
  tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)

As a bonus, we can dismantle netns faster. Currently, while destroying
netns, we call inet_twsk_purge(), which walks through the global ehash.
It can be potentially big because it can have many sockets other than
TIME_WAIT in all netns. Splitting ehash changes that situation, where
it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets
in each netns.

With regard to this, we do not free the per-netns ehash in
inet_twsk_kill() to avoid UAF while iterating the per-netns ehash in
inet_twsk_purge(). Instead, we do it in tcp_sk_exit_batch() after
calling tcp_twsk_purge() to keep it protocol-family-independent.

In the future, we could optimise ehash lookup/iteration further by
removing netns comparison for the per-netns ehash.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
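
The "length" column in the commit message's tables appears to be the socket count divided by the number of ehash buckets. The following is a minimal userspace sketch of that arithmetic, assuming a perfectly even hash distribution; `MI` and `avg_chain_len()` are hypothetical names for illustration, not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* 1Mi = 2^20, the unit the commit message uses for socket counts. */
#define MI (UINT64_C(1) << 20)

/*
 * Average ehash chain length for a given socket count and bucket count,
 * assuming the hash spreads sockets evenly (an idealisation).
 * Hypothetical helper, not part of the kernel.
 */
static inline uint64_t avg_chain_len(uint64_t sockets, uint64_t buckets)
{
	assert(buckets != 0);
	return sockets / buckets;
}
```

For example, `avg_chain_len(24 * MI, 524288)` gives the 48 shown in the first table, and `avg_chain_len(48 * MI, 16 * MI)` gives the average length of 3 quoted for the 16Mi maximum.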
1 parent edc12f0 commit d1e5e64


9 files changed: +164 -8 lines changed

Documentation/networking/ip-sysctl.rst

Lines changed: 29 additions & 0 deletions
@@ -1040,6 +1040,35 @@ tcp_challenge_ack_limit - INTEGER
 	TCP stack implements per TCP socket limits anyway.
 	Default: INT_MAX (unlimited)

+tcp_ehash_entries - INTEGER
+	Show the number of hash buckets for TCP sockets in the current
+	networking namespace.
+
+	A negative value means the networking namespace does not own its
+	hash buckets and shares the initial networking namespace's one.
+
+tcp_child_ehash_entries - INTEGER
+	Control the number of hash buckets for TCP sockets in the child
+	networking namespace, which must be set before clone() or unshare().
+
+	If the value is not 0, the kernel uses a value rounded up to 2^n
+	as the actual hash bucket size. 0 is a special value, meaning
+	the child networking namespace will share the initial networking
+	namespace's hash buckets.
+
+	Note that the child will use the global one in case the kernel
+	fails to allocate enough memory. In addition, the global hash
+	buckets are spread over available NUMA nodes, but the allocation
+	of the child hash table depends on the current process's NUMA
+	policy, which could result in performance differences.
+
+	Note also that the default value of tcp_max_tw_buckets and
+	tcp_max_syn_backlog depend on the hash bucket size.
+
+	Possible values: 0, 2^n (n: 0 - 24 (16Mi))
+
+	Default: 0
+
 UDP variables
 =============
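
The documentation above says a non-zero `tcp_child_ehash_entries` is rounded up to 2^n. Below is a small userspace sketch of that mapping, assuming the 16Mi upper bound from the sysctl table; `child_ehash_buckets()` and `CHILD_EHASH_ENTRIES_MAX` are hypothetical names, and the loop is only a stand-in for the kernel's `roundup_pow_of_two()`.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound taken from the sysctl's documented range (16Mi). */
#define CHILD_EHASH_ENTRIES_MAX (16u * 1024u * 1024u)

/*
 * Map a requested tcp_child_ehash_entries value to the bucket count a
 * child netns would get. Returns 0 when the child shares the global
 * ehash. Hypothetical helper for illustration only.
 */
static inline uint32_t child_ehash_buckets(uint32_t requested)
{
	uint32_t buckets = 1;

	assert(requested <= CHILD_EHASH_ENTRIES_MAX);

	if (requested == 0)
		return 0;	/* 0 is special: share the global table */

	while (buckets < requested)
		buckets <<= 1;	/* smallest power of two >= requested */

	return buckets;
}
```

With this sketch, a request of 100 maps to 128 buckets, which matches the `test2` namespace in the commit message's shell session.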

include/net/inet_hashtables.h

Lines changed: 6 additions & 0 deletions
@@ -168,6 +168,8 @@ struct inet_hashinfo {
 	/* The 2nd listener table hashed by local port and address */
 	unsigned int lhash2_mask;
 	struct inet_listen_hashbucket *lhash2;
+
+	bool pernet;
 };

 static inline struct inet_hashinfo *tcp_or_dccp_get_hashinfo(const struct sock *sk)
@@ -214,6 +216,10 @@ static inline void inet_ehash_locks_free(struct inet_hashinfo *hashinfo)
 	hashinfo->ehash_locks = NULL;
 }

+struct inet_hashinfo *inet_pernet_hashinfo_alloc(struct inet_hashinfo *hashinfo,
+						 unsigned int ehash_entries);
+void inet_pernet_hashinfo_free(struct inet_hashinfo *hashinfo);
+
 struct inet_bind_bucket *
 inet_bind_bucket_create(struct kmem_cache *cachep, struct net *net,
 			struct inet_bind_hashbucket *head,

include/net/netns/ipv4.h

Lines changed: 1 addition & 0 deletions
@@ -171,6 +171,7 @@ struct netns_ipv4 {
 	int sysctl_tcp_pacing_ca_ratio;
 	int sysctl_tcp_wmem[3];
 	int sysctl_tcp_rmem[3];
+	unsigned int sysctl_tcp_child_ehash_entries;
 	unsigned long sysctl_tcp_comp_sack_delay_ns;
 	unsigned long sysctl_tcp_comp_sack_slack_ns;
 	int sysctl_max_syn_backlog;

net/dccp/proto.c

Lines changed: 2 additions & 0 deletions
@@ -1197,6 +1197,8 @@ static int __init dccp_init(void)
 		INIT_HLIST_HEAD(&dccp_hashinfo.bhash2[i].chain);
 	}

+	dccp_hashinfo.pernet = false;
+
 	rc = dccp_mib_init();
 	if (rc)
 		goto out_free_dccp_bhash2;

net/ipv4/inet_hashtables.c

Lines changed: 47 additions & 0 deletions
@@ -1145,3 +1145,50 @@ int inet_ehash_locks_alloc(struct inet_hashinfo *hashinfo)
 	return 0;
 }
 EXPORT_SYMBOL_GPL(inet_ehash_locks_alloc);
+
+struct inet_hashinfo *inet_pernet_hashinfo_alloc(struct inet_hashinfo *hashinfo,
+						 unsigned int ehash_entries)
+{
+	struct inet_hashinfo *new_hashinfo;
+	int i;
+
+	new_hashinfo = kmemdup(hashinfo, sizeof(*hashinfo), GFP_KERNEL);
+	if (!new_hashinfo)
+		goto err;
+
+	new_hashinfo->ehash = vmalloc_huge(ehash_entries * sizeof(struct inet_ehash_bucket),
+					   GFP_KERNEL_ACCOUNT);
+	if (!new_hashinfo->ehash)
+		goto free_hashinfo;
+
+	new_hashinfo->ehash_mask = ehash_entries - 1;
+
+	if (inet_ehash_locks_alloc(new_hashinfo))
+		goto free_ehash;
+
+	for (i = 0; i < ehash_entries; i++)
+		INIT_HLIST_NULLS_HEAD(&new_hashinfo->ehash[i].chain, i);
+
+	new_hashinfo->pernet = true;
+
+	return new_hashinfo;
+
+free_ehash:
+	vfree(new_hashinfo->ehash);
+free_hashinfo:
+	kfree(new_hashinfo);
+err:
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(inet_pernet_hashinfo_alloc);
+
+void inet_pernet_hashinfo_free(struct inet_hashinfo *hashinfo)
+{
+	if (!hashinfo->pernet)
+		return;
+
+	inet_ehash_locks_free(hashinfo);
+	vfree(hashinfo->ehash);
+	kfree(hashinfo);
+}
+EXPORT_SYMBOL_GPL(inet_pernet_hashinfo_free);
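
The allocation path above copies the global hashinfo, swaps in a freshly allocated ehash, and unwinds with `goto` labels on failure. The sketch below mirrors that shape in userspace, assuming `malloc()`/`calloc()` in place of `kmemdup()`/`vmalloc_huge()` and omitting the lock array. The struct and function names are hypothetical simplifications, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Simplified, hypothetical stand-ins for the kernel structures. */
struct bucket { void *chain; };

struct hashinfo {
	struct bucket *ehash;
	unsigned int ehash_mask;
	bool pernet;
};

/* Duplicate the template and give the copy its own bucket array. */
static struct hashinfo *pernet_hashinfo_alloc(const struct hashinfo *tmpl,
					      unsigned int entries)
{
	struct hashinfo *h;

	/* The mask trick below only works for powers of two. */
	assert(entries != 0 && (entries & (entries - 1)) == 0);

	h = malloc(sizeof(*h));
	if (!h)
		goto err;
	memcpy(h, tmpl, sizeof(*h));	/* userspace analogue of kmemdup() */

	h->ehash = calloc(entries, sizeof(*h->ehash));
	if (!h->ehash)
		goto free_hashinfo;

	h->ehash_mask = entries - 1;
	h->pernet = true;
	return h;

free_hashinfo:
	free(h);
err:
	return NULL;
}

/* Only tables marked pernet are owned by a netns and may be freed. */
static void pernet_hashinfo_free(struct hashinfo *h)
{
	if (!h->pernet)
		return;

	free(h->ehash);
	free(h);
}
```

The `pernet` guard in the free path is what lets callers pass either a private table or the shared global one without checking first.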

net/ipv4/sysctl_net_ipv4.c

Lines changed: 39 additions & 0 deletions
@@ -39,6 +39,7 @@ static u32 u32_max_div_HZ = UINT_MAX / HZ;
 static int one_day_secs = 24 * 3600;
 static u32 fib_multipath_hash_fields_all_mask __maybe_unused =
 	FIB_MULTIPATH_HASH_FIELD_ALL_MASK;
+static unsigned int tcp_child_ehash_entries_max = 16 * 1024 * 1024;

 /* obsolete */
 static int sysctl_tcp_low_latency __read_mostly;
@@ -382,6 +383,29 @@ static int proc_tcp_available_ulp(struct ctl_table *ctl,
 	return ret;
 }

+static int proc_tcp_ehash_entries(struct ctl_table *table, int write,
+				  void *buffer, size_t *lenp, loff_t *ppos)
+{
+	struct net *net = container_of(table->data, struct net,
+				       ipv4.sysctl_tcp_child_ehash_entries);
+	struct inet_hashinfo *hinfo = net->ipv4.tcp_death_row.hashinfo;
+	int tcp_ehash_entries;
+	struct ctl_table tbl;
+
+	tcp_ehash_entries = hinfo->ehash_mask + 1;
+
+	/* A negative number indicates that the child netns
+	 * shares the global ehash.
+	 */
+	if (!net_eq(net, &init_net) && !hinfo->pernet)
+		tcp_ehash_entries *= -1;
+
+	tbl.data = &tcp_ehash_entries;
+	tbl.maxlen = sizeof(int);
+
+	return proc_dointvec(&tbl, write, buffer, lenp, ppos);
+}
+
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 static int proc_fib_multipath_hash_policy(struct ctl_table *table, int write,
 					  void *buffer, size_t *lenp,
@@ -1320,6 +1344,21 @@ static struct ctl_table ipv4_net_table[] = {
 		.extra1 = SYSCTL_ZERO,
 		.extra2 = SYSCTL_ONE,
 	},
+	{
+		.procname = "tcp_ehash_entries",
+		.data = &init_net.ipv4.sysctl_tcp_child_ehash_entries,
+		.mode = 0444,
+		.proc_handler = proc_tcp_ehash_entries,
+	},
+	{
+		.procname = "tcp_child_ehash_entries",
+		.data = &init_net.ipv4.sysctl_tcp_child_ehash_entries,
+		.maxlen = sizeof(unsigned int),
+		.mode = 0644,
+		.proc_handler = proc_douintvec_minmax,
+		.extra1 = SYSCTL_ZERO,
+		.extra2 = &tcp_child_ehash_entries_max,
+	},
 	{
 		.procname = "udp_rmem_min",
 		.data = &init_net.ipv4.sysctl_udp_rmem_min,
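
The read-only handler above encodes "shares the global ehash" as a negative bucket count. The sign rule can be sketched as a pure function, under the assumption that the only inputs are the mask, whether the netns is the initial one, and the `pernet` flag; `reported_ehash_entries()` is a hypothetical name.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Value that net.ipv4.tcp_ehash_entries would report, following the
 * rule in proc_tcp_ehash_entries(): negative when a child netns uses
 * the global table. Hypothetical helper for illustration only.
 */
static int reported_ehash_entries(unsigned int ehash_mask,
				  bool is_init_net, bool pernet)
{
	int entries;

	assert(ehash_mask < INT_MAX);	/* keep mask + 1 representable */
	entries = (int)(ehash_mask + 1);

	if (!is_init_net && !pernet)
		entries *= -1;

	return entries;
}
```

This reproduces the three cases from the commit message's shell session: 524288 in the initial netns, -524288 in `test1`, and 128 in `test2`.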

net/ipv4/tcp.c

Lines changed: 1 addition & 0 deletions
@@ -4790,6 +4790,7 @@ void __init tcp_init(void)
 		INIT_HLIST_HEAD(&tcp_hashinfo.bhash2[i].chain);
 	}

+	tcp_hashinfo.pernet = false;

 	cnt = tcp_hashinfo.ehash_mask + 1;
 	sysctl_tcp_max_orphans = cnt / 2;

net/ipv4/tcp_ipv4.c

Lines changed: 32 additions & 6 deletions
@@ -3110,10 +3110,38 @@ static void __net_exit tcp_sk_exit(struct net *net)
 			       net->ipv4.tcp_congestion_control->owner);
 }

-static int __net_init tcp_sk_init(struct net *net)
+static void __net_init tcp_set_hashinfo(struct net *net)
 {
-	int cnt;
+	struct inet_hashinfo *hinfo;
+	unsigned int ehash_entries;
+	struct net *old_net;
+
+	if (net_eq(net, &init_net))
+		goto fallback;
+
+	old_net = current->nsproxy->net_ns;
+	ehash_entries = READ_ONCE(old_net->ipv4.sysctl_tcp_child_ehash_entries);
+	if (!ehash_entries)
+		goto fallback;
+
+	ehash_entries = roundup_pow_of_two(ehash_entries);
+	hinfo = inet_pernet_hashinfo_alloc(&tcp_hashinfo, ehash_entries);
+	if (!hinfo) {
+		pr_warn("Failed to allocate TCP ehash (entries: %u) "
+			"for a netns, fallback to the global one\n",
+			ehash_entries);
+fallback:
+		hinfo = &tcp_hashinfo;
+		ehash_entries = tcp_hashinfo.ehash_mask + 1;
+	}
+
+	net->ipv4.tcp_death_row.hashinfo = hinfo;
+	net->ipv4.tcp_death_row.sysctl_max_tw_buckets = ehash_entries / 2;
+	net->ipv4.sysctl_max_syn_backlog = max(128U, ehash_entries / 128);
+}

+static int __net_init tcp_sk_init(struct net *net)
+{
 	net->ipv4.sysctl_tcp_ecn = 2;
 	net->ipv4.sysctl_tcp_ecn_fallback = 1;

@@ -3140,11 +3168,8 @@ static int __net_init tcp_sk_init(struct net *net)
 	net->ipv4.sysctl_tcp_no_ssthresh_metrics_save = 1;

 	refcount_set(&net->ipv4.tcp_death_row.tw_refcount, 1);
-	cnt = tcp_hashinfo.ehash_mask + 1;
-	net->ipv4.tcp_death_row.sysctl_max_tw_buckets = cnt / 2;
-	net->ipv4.tcp_death_row.hashinfo = &tcp_hashinfo;
+	tcp_set_hashinfo(net);

-	net->ipv4.sysctl_max_syn_backlog = max(128, cnt / 128);
 	net->ipv4.sysctl_tcp_sack = 1;
 	net->ipv4.sysctl_tcp_window_scaling = 1;
 	net->ipv4.sysctl_tcp_timestamps = 1;
@@ -3209,6 +3234,7 @@ static void __net_exit tcp_sk_exit_batch(struct list_head *net_exit_list)
 	tcp_twsk_purge(net_exit_list, AF_INET);

 	list_for_each_entry(net, net_exit_list, exit_list) {
+		inet_pernet_hashinfo_free(net->ipv4.tcp_death_row.hashinfo);
 		WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
 		tcp_fastopen_ctx_destroy(net);
 	}
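
`tcp_set_hashinfo()` derives two per-netns defaults from whichever ehash size ends up in use. The sketch below restates those two formulas in userspace so the effect of a small per-netns table is easy to see. `struct tcp_netns_defaults` and `tcp_defaults_for()` are hypothetical names, and the ternary stands in for the kernel's `max()` macro.

```c
#include <assert.h>

/* Hypothetical container for the two derived defaults. */
struct tcp_netns_defaults {
	unsigned int max_tw_buckets;	/* ehash_entries / 2 */
	unsigned int max_syn_backlog;	/* max(128, ehash_entries / 128) */
};

/* Compute the defaults a netns would start with for a given ehash size. */
static struct tcp_netns_defaults tcp_defaults_for(unsigned int ehash_entries)
{
	struct tcp_netns_defaults d;
	unsigned int backlog = ehash_entries / 128;

	assert(ehash_entries != 0);

	d.max_tw_buckets = ehash_entries / 2;
	d.max_syn_backlog = backlog > 128U ? backlog : 128U;
	return d;
}
```

Under these formulas a 128-entry per-netns ehash starts with only 64 TIME_WAIT buckets and the floor of 128 for the SYN backlog, which is presumably why the commit message asks for careful tuning.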

net/ipv4/tcp_minisocks.c

Lines changed: 7 additions & 2 deletions
@@ -349,15 +349,20 @@ EXPORT_SYMBOL_GPL(tcp_twsk_destructor);

 void tcp_twsk_purge(struct list_head *net_exit_list, int family)
 {
+	bool purged_once = false;
 	struct net *net;

 	list_for_each_entry(net, net_exit_list, exit_list) {
 		/* The last refcount is decremented in tcp_sk_exit_batch() */
 		if (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) == 1)
 			continue;

-		inet_twsk_purge(&tcp_hashinfo, family);
-		break;
+		if (net->ipv4.tcp_death_row.hashinfo->pernet) {
+			inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo, family);
+		} else if (!purged_once) {
+			inet_twsk_purge(&tcp_hashinfo, family);
+			purged_once = true;
+		}
 	}
 }
 EXPORT_SYMBOL_GPL(tcp_twsk_purge);
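
The reworked loop walks each per-netns table once, but walks the shared global table at most once per batch. A small userspace model of that control flow follows, counting table walks rather than purging anything. `struct netns_model`, `struct purge_stats` and `count_purge_walks()` are hypothetical names, and `has_timewait` stands in for the `tw_refcount != 1` check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a dying netns. */
struct netns_model {
	bool pernet;		/* owns a private ehash */
	bool has_timewait;	/* stands in for tw_refcount != 1 */
};

struct purge_stats {
	unsigned int global_walks;
	unsigned int pernet_walks;
};

/* Count how many table walks the batch purge would perform. */
static struct purge_stats count_purge_walks(const struct netns_model *nets,
					    size_t n)
{
	struct purge_stats st = { 0, 0 };
	bool purged_once = false;
	size_t i;

	assert(nets != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (!nets[i].has_timewait)
			continue;	/* nothing to purge for this netns */

		if (nets[i].pernet) {
			st.pernet_walks++;	/* private table: walk it */
		} else if (!purged_once) {
			st.global_walks++;	/* shared table: walk once */
			purged_once = true;
		}
	}

	return st;
}
```

In this model, a batch containing several netns that share the global ehash still triggers a single global walk, while each netns with its own table and pending TIME_WAIT sockets adds one walk of its (smaller) table.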
