Major slowdowns writing to tile38 #697

Open
TheCycoONE opened this issue Jul 21, 2023 · 6 comments

@TheCycoONE

Describe the bug
We are experiencing significant slowdowns in our write operations. For the past 3 days we've been seeing write speeds for points of approximately 10-30 per second, down from the typical 10,000 or more per second. We've experienced these slowdowns a few times in the past. We suspect, but cannot yet confirm, that it may be related to replication.

We are running 6 instances of Tile38, split between two geographically separated datacenters with a ~94 ms round-trip ping time between them, with one instance as the leader and all others as replicas. Coordination is managed by Redis Sentinel.
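
For reference, each write is a single point SET issued over the Redis protocol, roughly like the following (the id, coordinates, and response values here are illustrative, not real data):

127.0.0.1:9851> SET mcontacts contact:12345 POINT 33.4626 -112.2681
{"ok":true,"elapsed":"54.7µs"}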

To Reproduce
This is a behaviour we've observed in production several times, but we do not have steps to reproduce outside of production.

Expected behavior
We expect a consistent write speed that can keep up with our load of thousands of points per second.
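
One rough way to sanity-check raw write throughput against a single instance is redis-benchmark, since Tile38 speaks the Redis protocol; something along these lines (collection name and coordinates are illustrative, and XXXX stands in for the real password):

$ redis-benchmark -h tile2.sje.raveu.net -p 9851 -a XXXX -n 100000 -r 1000000 SET bench id:__rand_int__ POINT 33.46 -112.26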

Logs
Not Applicable

Operating System (please complete the following information):

  • OS: Rocky Linux 8.6
  • CPU: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz x4
  • Version: 1.31.0
  • Container: VMware ESX

Additional context
See attached
tile38_slowness_server_cmds.txt

@tidwall
Owner

tidwall commented Jul 31, 2023

Sorry, but I'm unable to help without some way to reproduce the issue.

@Kilowhisky
Contributor

Are there a large number of points at the same GPS location? I've found that if you reuse the same static point, or have tons of points at the same location, it will slow things down substantially as the indexes get huge.
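
If it helps to check, something along these lines should count how many objects sit within a metre of a suspect location (collection name, coordinates, and the response values are illustrative):

127.0.0.1:9851> NEARBY mcontacts COUNT POINT 33.4626 -112.2681 1
{"ok":true,"count":48231,"cursor":0,"elapsed":"1.2ms"}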

@TheCycoONE
Author

We hit this again in production but still can't reproduce it on demand; I know that's not especially helpful. All replicas are behaving the same, and it persists over a restart, so the AOF file may be enough to reproduce it elsewhere. Unfortunately it's > 20 GB and probably can't be shared, but we can test.
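
The rough reproduction path we have in mind is to copy the AOF into an empty data directory and start a throwaway standalone instance from it, something like this (paths are illustrative):

$ mkdir /tmp/tile38-repro
$ cp /path/to/data/appendonly.aof /tmp/tile38-repro/
$ tile38-server -d /tmp/tile38-repro -p 9851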

@TheCycoONE
Author

Similar to when this happened in 2023, the problem resolved itself after a couple of days. Last time it seemed to correlate with removing followers, but that didn't help this time; instead, it correlated with what we call a reindex, where we synced all the points with their values in our SQL database.

During the slow period we were seeing 6 to 10 inserts per second.

Things we tried that did not help (the AOFSHRINK and GC invocations are sketched below):

  • Reducing the size of other collections in Tile38
  • Removing replicas
  • Running AOFSHRINK, or starting from a shrunk AOF
  • Failing over to a replica
  • Restarting
  • Running GC
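
For reference, the AOF shrink and GC attempts were just the built-in commands, roughly (responses shown are illustrative):

127.0.0.1:9851> AOFSHRINK
{"ok":true,"elapsed":"23.1µs"}
127.0.0.1:9851> GC
{"ok":true,"elapsed":"18.9µs"}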

@TheCycoONE
Author

TheCycoONE commented Feb 18, 2025

The collections we were inserting into:

[root@sje-redsent1 ~]$ redis-cli -h tile2.sje.raveu.net -p 9851 -a XXXX stats mcontacts
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
1) 1) "in_memory_size"
   2) "588272369"
   3) "num_objects"
   4) "8824934"
   5) "num_points"
   6) "9544022"
   7) "num_strings"
   8) "0"
[root@sje-redsent1 ~]$ redis-cli -h tile2.sje.raveu.net -p 9851 -a XXXX stats smart911
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
1) 1) "in_memory_size"
   2) "1133818330"
   3) "num_objects"
   4) "13642596"
   5) "num_points"
   6) "20877048"
   7) "num_strings"
   8) "0"

INFO output from the primary (after failing over) during the slowdown:

[root@sje-redsent1 ~]$ redis-cli -h tile2.sje.raveu.net -p 9851 -a XXXX info
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
# Server
tile38_version:1.31.0
redis_version:1.31.0
uptime_in_seconds:44447

# Clients
connected_clients:51

# Memory
used_memory:17618532576

# Persistence
aof_enabled:1
aof_rewrite_in_progress:0
aof_last_rewrite_time_sec:0
aof_current_rewrite_time_sec:0

# Stats
total_connections_received:18430
total_commands_processed:3821922
total_messages_sent:317896
expired_keys:0

# Replication
role:master
slave0:ip=10.1.1.21,port=9851,state=online
connected_slaves:1

# CPU
used_cpu_sys:232.00
used_cpu_user:45464.00
used_cpu_sys_children:0.00
used_cpu_user_children:0.00

# Cluster
cluster_enabled:0

SERVER EXT output from the previous day, during the incident, when we were still on the primary and had tested shutting down all the replicas:

[XXXX@sje-tile1 ~]$ /rave/tile38/tile38-cli -p 9851
127.0.0.1:9851> auth XXXX
{"ok":true,"elapsed":"2.605µs"}
127.0.0.1:9851> server ext
{"ok":true,"stats":{"alloc_bytes":17697877848,"alloc_bytes_total":2329690545608,"buck_hash_sys_bytes":1890924,"frees_total":22683236441,"gc_cpu_fraction":0.048510525131749416,"gc_sys_bytes":778766912,"go_goroutines":83,"go_threads":15,"go_version":"go1.20.4","heap_alloc_bytes":17697877848,"heap_idle_bytes":8312029184,"heap_inuse_bytes":20857585664,"heap_objects":265452711,"heap_released_bytes":6203179008,"heap_sys_bytes":29169614848,"last_gc_time_seconds":1739600860.8283973,"lookups_total":0,"mallocs_total":22948689152,"mcache_inuse_bytes":4800,"mcache_sys_bytes":15600,"mspan_inuse_bytes":369807040,"mspan_sys_bytes":534023040,"next_gc_bytes":35373472056,"other_sys_bytes":40042460,"stack_inuse_bytes":1769472,"stack_sys_bytes":1769472,"sys_bytes":30526123256,"sys_cpus":4,"tile38_aof_current_rewrite_time_sec":0,"tile38_aof_enabled":true,"tile38_aof_last_rewrite_time_sec":0,"tile38_aof_rewrite_in_progress":false,"tile38_aof_size":20545506622,"tile38_avg_point_size":196,"tile38_cluster_enabled":false,"tile38_connected_clients":56,"tile38_connected_slaves":1,"tile38_expired_keys":0,"tile38_http_transport":true,"tile38_id":"tile38SJE1","tile38_in_memory_size":5468794609,"tile38_max_heap_size":25769803776,"tile38_num_collections":8,"tile38_num_hook_groups":0,"tile38_num_hooks":0,"tile38_num_object_groups":0,"tile38_num_objects":81982387,"tile38_num_points":89946115,"tile38_num_strings":0,"tile38_pid":775508,"tile38_pointer_size":8,"tile38_read_only":false,"tile38_total_commands_processed":393617400,"tile38_total_connections_received":7577352,"tile38_total_messages_sent":183307586,"tile38_type":"leader","tile38_uptime_in_seconds":17340178.520754464,"tile38_version":"1.31.0"},"elapsed":"261.579217ms"}

@tidwall
Owner

tidwall commented Feb 18, 2025

Could this be related to #756?
