Major slowdowns writing to tile38 #697

Open
TheCycoONE opened this issue Jul 21, 2023 · 6 comments

@TheCycoONE

Describe the bug
We are experiencing significant slowdowns in our write operations. For the past 3 days we've been seeing write speeds for points of approximately 10-30 per second, down from the typical 10,000 or more per second. We've experienced these slowdowns a few times in the past. We suspect, but cannot yet confirm, that it may be related to replication.

We are running 6 instances of Tile38, split between two geographically separated datacenters with a ~94 ms round-trip ping time between them, with one instance as the leader and all others as replicas. Coordination is managed by Redis Sentinel.
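
For reference, each write is a single point SET issued over the Redis protocol, roughly like the following (the id, coordinates, and response values here are illustrative, not real data):

127.0.0.1:9851> SET mcontacts contact:12345 POINT 33.4626 -112.2681
{"ok":true,"elapsed":"54.7µs"}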

To Reproduce
This is a behaviour we've observed in production several times, but we do not have steps to reproduce outside of production.

Expected behavior
We expect a consistent write speed that can keep up with our load of thousands of points per second.
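
One rough way to sanity-check raw write throughput against a single instance is redis-benchmark, since Tile38 speaks the Redis protocol; something along these lines (collection name and coordinates are illustrative, and XXXX stands in for the real password):

$ redis-benchmark -h tile2.sje.raveu.net -p 9851 -a XXXX -n 100000 -r 1000000 SET bench id:__rand_int__ POINT 33.46 -112.26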

Logs
Not Applicable

Operating System (please complete the following information):

  • OS: Rocky Linux 8.6
  • CPU: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz x4
  • Version: 1.31.0
  • Container: VMware ESX

Additional context
See attached
tile38_slowness_server_cmds.txt

@tidwall
Owner

tidwall commented Jul 31, 2023

Sorry, but I'm unable to help without some way to reproduce the issue.

@Kilowhisky
Contributor

Are there a large number of points at the same GPS location? I've found that if you reuse the same static point, or have tons of points at the same location, it will slow things down substantially as the indexes get huge.
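
If it helps to check, something along these lines should count how many objects sit within a metre of a suspect location (collection name, coordinates, and the response values are illustrative):

127.0.0.1:9851> NEARBY mcontacts COUNT POINT 33.4626 -112.2681 1
{"ok":true,"count":48231,"cursor":0,"elapsed":"1.2ms"}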

@TheCycoONE
Author

We hit this again in production but still can't reproduce it on demand; I know that's not especially helpful. All replicas are behaving the same, and it persists over a restart, so the AOF file may be enough to reproduce it elsewhere. Unfortunately it's > 20 GB and probably can't be shared, but we can test.
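
The rough reproduction path we have in mind is to copy the AOF into an empty data directory and start a throwaway standalone instance from it, something like this (paths are illustrative):

$ mkdir /tmp/tile38-repro
$ cp /path/to/data/appendonly.aof /tmp/tile38-repro/
$ tile38-server -d /tmp/tile38-repro -p 9851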

@TheCycoONE
Author

Similar to when this happened in 2023, the problem resolved itself after a couple of days. Last time it seemed to correlate with removing followers, but that didn't help this time; instead, it correlated with what we call a reindex, where we synced all the points with their values in our SQL database.

During the slow period we were seeing 6 to 10 inserts per second.

Things we tried that did not help (the AOFSHRINK and GC invocations are sketched below):

  • Reducing the size of other collections in Tile38
  • Removing replicas
  • Running AOFSHRINK, or starting from a shrunk AOF
  • Failing over to a replica
  • Restarting
  • Running GC
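
For reference, the AOF shrink and GC attempts were just the built-in commands, roughly (responses shown are illustrative):

127.0.0.1:9851> AOFSHRINK
{"ok":true,"elapsed":"23.1µs"}
127.0.0.1:9851> GC
{"ok":true,"elapsed":"18.9µs"}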

@TheCycoONE
Author

TheCycoONE commented Feb 18, 2025

The collections we were inserting into:

[root@sje-redsent1 ~]$ redis-cli -h tile2.sje.raveu.net -p 9851 -a XXXX stats mcontacts
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
1) 1) "in_memory_size"
   2) "588272369"
   3) "num_objects"
   4) "8824934"
   5) "num_points"
   6) "9544022"
   7) "num_strings"
   8) "0"
[root@sje-redsent1 ~]$ redis-cli -h tile2.sje.raveu.net -p 9851 -a XXXX stats smart911
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
1) 1) "in_memory_size"
   2) "1133818330"
   3) "num_objects"
   4) "13642596"
   5) "num_points"
   6) "20877048"
   7) "num_strings"
   8) "0"

INFO output from the primary (after failing over) during the slowdown:

[root@sje-redsent1 ~]$ redis-cli -h tile2.sje.raveu.net -p 9851 -a XXXX info
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
# Server
tile38_version:1.31.0
redis_version:1.31.0
uptime_in_seconds:44447

# Clients
connected_clients:51

# Memory
used_memory:17618532576

# Persistence
aof_enabled:1
aof_rewrite_in_progress:0
aof_last_rewrite_time_sec:0
aof_current_rewrite_time_sec:0

# Stats
total_connections_received:18430
total_commands_processed:3821922
total_messages_sent:317896
expired_keys:0

# Replication
role:master
slave0:ip=10.1.1.21,port=9851,state=online
connected_slaves:1

# CPU
used_cpu_sys:232.00
used_cpu_user:45464.00
used_cpu_sys_children:0.00
used_cpu_user_children:0.00

# Cluster
cluster_enabled:0

SERVER EXT output from the previous day, during the incident, when we were still on the primary and had tested shutting down all the replicas:

[XXXX@sje-tile1 ~]$ /rave/tile38/tile38-cli -p 9851
127.0.0.1:9851> auth XXXX
{"ok":true,"elapsed":"2.605µs"}
127.0.0.1:9851> server ext
{"ok":true,"stats":{"alloc_bytes":17697877848,"alloc_bytes_total":2329690545608,"buck_hash_sys_bytes":1890924,"frees_total":22683236441,"gc_cpu_fraction":0.048510525131749416,"gc_sys_bytes":778766912,"go_goroutines":83,"go_threads":15,"go_version":"go1.20.4","heap_alloc_bytes":17697877848,"heap_idle_bytes":8312029184,"heap_inuse_bytes":20857585664,"heap_objects":265452711,"heap_released_bytes":6203179008,"heap_sys_bytes":29169614848,"last_gc_time_seconds":1739600860.8283973,"lookups_total":0,"mallocs_total":22948689152,"mcache_inuse_bytes":4800,"mcache_sys_bytes":15600,"mspan_inuse_bytes":369807040,"mspan_sys_bytes":534023040,"next_gc_bytes":35373472056,"other_sys_bytes":40042460,"stack_inuse_bytes":1769472,"stack_sys_bytes":1769472,"sys_bytes":30526123256,"sys_cpus":4,"tile38_aof_current_rewrite_time_sec":0,"tile38_aof_enabled":true,"tile38_aof_last_rewrite_time_sec":0,"tile38_aof_rewrite_in_progress":false,"tile38_aof_size":20545506622,"tile38_avg_point_size":196,"tile38_cluster_enabled":false,"tile38_connected_clients":56,"tile38_connected_slaves":1,"tile38_expired_keys":0,"tile38_http_transport":true,"tile38_id":"tile38SJE1","tile38_in_memory_size":5468794609,"tile38_max_heap_size":25769803776,"tile38_num_collections":8,"tile38_num_hook_groups":0,"tile38_num_hooks":0,"tile38_num_object_groups":0,"tile38_num_objects":81982387,"tile38_num_points":89946115,"tile38_num_strings":0,"tile38_pid":775508,"tile38_pointer_size":8,"tile38_read_only":false,"tile38_total_commands_processed":393617400,"tile38_total_connections_received":7577352,"tile38_total_messages_sent":183307586,"tile38_type":"leader","tile38_uptime_in_seconds":17340178.520754464,"tile38_version":"1.31.0"},"elapsed":"261.579217ms"}

@tidwall
Owner

tidwall commented Feb 18, 2025

Could this be related to #756?
