
[🐛 BUG]: Memory Leak in KV 'in-memory' driver #2051

Closed · 1 task done
segrax opened this issue Nov 15, 2024 · 9 comments · Fixed by roadrunner-server/memory#155
Labels: B-bug Bug: bug, exception
Milestone: v2024.3

@segrax commented Nov 15, 2024

No duplicates 🥲.

  • I have searched for a similar issue in our bug tracker and didn't find any solutions.

What happened?

When setting an item in the KV with a TTL (e.g. 2000; using a small value, like 5, seems to stop the issue), memory usage continually increases over time.

I suspect it's related to the callback here, which is created each time a key is set.

I've provided some pprof outputs in the log section below.

// Example usage: a simple HTTP request handler that sets the same key on every request

try {
    while ($req = $http->waitRequest()) {
        $storage->set('key', 'value', 2000);
        $http->respond(200, 'asd');
    }
} catch (\Throwable $e) {
    $worker->error($e->getMessage());
}

Unrelated, but I just discovered that if you set your TTL to 1, you can cause a panic:

panic: non-positive interval for NewTicker

goroutine 11800 [running]:
time.NewTicker(0x417665?)
        time/tick.go:38 +0xbb
github.com/roadrunner-server/memory/v5/memorykv.(*Driver).Set.(*Driver).ttlcallback.func2(0xc0001804d0)
        github.com/roadrunner-server/memory/[email protected]/memorykv/kv.go:389 +0x73
github.com/roadrunner-server/memory/v5/memorykv.(*Driver).Set.func1()
        github.com/roadrunner-server/memory/[email protected]/memorykv/kv.go:199 +0x22
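Regarding the panic: time.NewTicker panics for any duration that is not strictly positive, so if the interval computed from the TTL has already elapsed (or rounds down to zero) by the time the ticker is created, you get exactly this trace. A minimal standalone demonstration of the stdlib behaviour (not the driver code; the link to the driver's TTL handling is an assumption):

package main

import "time"

func main() {
	// time.NewTicker requires a strictly positive duration; anything <= 0
	// panics with "non-positive interval for NewTicker", as in the trace above.
	expiry := time.Now()           // an expiry that has already been reached
	interval := time.Until(expiry) // <= 0 by the time we compute it
	_ = time.NewTicker(interval)   // panic: non-positive interval for NewTicker
}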

Version (rr --version)

2024.2.1

How to reproduce the issue?

Just using rakyll/hey to hit the HTTP endpoint:

hey -c 20 -z 90m http://localhost:8080

RR Config:

version: '3'

rpc:
  listen: tcp://127.0.0.1:6001

status:
  address: 0.0.0.0:2114

metrics:
  address: 0.0.0.0:2112 

server:
  command: "php public/worker.php"


http:
  address: "0.0.0.0:8080"
  access_logs: false
  middleware: []
  pool:
    num_workers: 2
    debug: false

    supervisor:
      watch_tick: 1s
      exec_ttl: 25s
    
kv:
  cache:
    driver: memory

Relevant log output

Some pprof outputs over time, starting at 0 requests processed and going up to over 1,000,000 by the last capture.

================================================================
Showing top 10 nodes out of 83
      flat  flat%   sum%        cum   cum%
   10.41MB 27.05% 27.05%    10.41MB 27.05%  github.com/roadrunner-server/goridge/v3/internal.internalAllocate.func1
       6MB 15.60% 42.65%        6MB 15.60%  runtime.malg
    4.50MB 11.70% 54.35%     4.50MB 11.70%  github.com/roadrunner-server/memory/v5/memorykv.(*Driver).ttlcallback
    3.04MB  7.90% 62.25%     6.54MB 17.00%  github.com/roadrunner-server/memory/v5/memorykv.(*Driver).Set.(*Driver).ttlcallback.func2
       3MB  7.80% 70.05%        3MB  7.80%  github.com/aws/aws-sdk-go/aws/endpoints.init
    2.51MB  6.52% 76.57%     2.51MB  6.52%  github.com/goccy/go-json/internal/decoder.init.0
    2.50MB  6.50% 83.07%     3.50MB  9.10%  time.NewTicker
       1MB  2.61% 85.67%        1MB  2.61%  bufio.NewReaderSize
       1MB  2.60% 88.27%        1MB  2.60%  time.newTimer
       1MB  2.60% 90.87%        1MB  2.60%  github.com/roadrunner-server/pool/worker_watcher/container/channel.(*Vec).Pop

================================================================

Showing nodes accounting for 586.78MB, 97.50% of 601.80MB total
Dropped 54 nodes (cum <= 3.01MB)
Showing top 10 nodes out of 47
      flat  flat%   sum%        cum   cum%
  172.07MB 28.59% 28.59%   172.07MB 28.59%  runtime.malg
  119.82MB 19.91% 48.50%   230.83MB 38.36%  github.com/roadrunner-server/memory/v5/memorykv.(*Driver).Set.(*Driver).ttlcallback.func2
  101.51MB 16.87% 65.37%   101.51MB 16.87%  github.com/roadrunner-server/memory/v5/memorykv.(*Driver).ttlcallback
   61.51MB 10.22% 75.59%   111.01MB 18.45%  time.NewTicker
   54.44MB  9.05% 84.64%    54.44MB  9.05%  time.newTimer
      36MB  5.98% 90.62%       37MB  6.15%  github.com/roadrunner-server/pool/worker_watcher/container/channel.(*Vec).Pop
   32.38MB  5.38% 96.00%    32.38MB  5.38%  github.com/roadrunner-server/goridge/v3/internal.internalAllocate.func1
       4MB  0.66% 96.67%        4MB  0.66%  github.com/roadrunner-server/goridge/v3/pkg/frame.(*Frame).ReadOptions
    3.54MB  0.59% 97.25%     3.54MB  0.59%  runtime.allgadd
    1.50MB  0.25% 97.50%     6.43MB  1.07%  context.WithDeadlineCause

================================================================

Showing nodes accounting for 784.38MB, 97.75% of 802.41MB total
Dropped 56 nodes (cum <= 4.01MB)
Showing top 10 nodes out of 47
      flat  flat%   sum%        cum   cum%
  239.60MB 29.86% 29.86%   239.60MB 29.86%  runtime.malg
  155.64MB 19.40% 49.26%   301.15MB 37.53%  github.com/roadrunner-server/memory/v5/memorykv.(*Driver).Set.(*Driver).ttlcallback.func2
  142.02MB 17.70% 66.96%   142.02MB 17.70%  github.com/roadrunner-server/memory/v5/memorykv.(*Driver).ttlcallback
   79.01MB  9.85% 76.80%   145.51MB 18.13%  time.NewTicker
   73.80MB  9.20% 86.00%    73.80MB  9.20%  time.newTimer
   51.50MB  6.42% 92.42%    52.50MB  6.54%  github.com/roadrunner-server/pool/worker_watcher/container/channel.(*Vec).Pop
   32.38MB  4.04% 96.45%    32.38MB  4.04%  github.com/roadrunner-server/goridge/v3/internal.internalAllocate.func1
    4.50MB  0.56% 97.01%     4.50MB  0.56%  github.com/roadrunner-server/goridge/v3/pkg/frame.(*Frame).ReadOptions
    4.43MB  0.55% 97.57%     4.43MB  0.55%  runtime.allgadd
    1.50MB  0.19% 97.75%     8.79MB  1.10%  context.WithDeadlineCause
(pprof)

================================================================

Showing nodes accounting for 1272.26MB, 97.66% of 1302.79MB total
Dropped 69 nodes (cum <= 6.51MB)
Showing top 10 nodes out of 46
      flat  flat%   sum%        cum   cum%
  383.16MB 29.41% 29.41%   383.16MB 29.41%  runtime.malg
  259.02MB 19.88% 49.29%   489.54MB 37.58%  github.com/roadrunner-server/memory/v5/memorykv.(*Driver).Set.(*Driver).ttlcallback.func2
  225.02MB 17.27% 66.56%   225.02MB 17.27%  github.com/roadrunner-server/memory/v5/memorykv.(*Driver).ttlcallback
  129.01MB  9.90% 76.47%   230.52MB 17.69%  time.NewTicker
  121.49MB  9.33% 85.79%   121.49MB  9.33%  time.newTimer
   91.01MB  6.99% 92.78%    95.01MB  7.29%  github.com/roadrunner-server/pool/worker_watcher/container/channel.(*Vec).Pop
   50.89MB  3.91% 96.68%    50.89MB  3.91%  github.com/roadrunner-server/goridge/v3/internal.internalAllocate.func1
    8.66MB  0.67% 97.35%     8.66MB  0.67%  runtime.allgadd
       4MB  0.31% 97.66%    23.98MB  1.84%  context.WithDeadlineCause
         0     0% 97.66%    23.98MB  1.84%  context.WithDeadline

================================================================


Showing nodes accounting for 1729.22MB, 98.80% of 1750.25MB total
Dropped 74 nodes (cum <= 8.75MB)
Showing top 10 nodes out of 49
      flat  flat%   sum%        cum   cum%
  548.73MB 31.35% 31.35%   548.73MB 31.35%  runtime.malg
  365.77MB 20.90% 52.25%   677.30MB 38.70%  github.com/roadrunner-server/memory/v5/memorykv.(*Driver).Set.(*Driver).ttlcallback.func2
  299.53MB 17.11% 69.36%   299.53MB 17.11%  github.com/roadrunner-server/memory/v5/memorykv.(*Driver).ttlcallback
  172.02MB  9.83% 79.19%   311.53MB 17.80%  time.NewTicker
  151.10MB  8.63% 87.82%   151.10MB  8.63%  time.newTimer
  129.51MB  7.40% 95.22%   129.51MB  7.40%  github.com/roadrunner-server/pool/worker_watcher/container/channel.(*Vec).Pop
   31.23MB  1.78% 97.01%    31.23MB  1.78%  github.com/roadrunner-server/goridge/v3/internal.internalAllocate.func1
   11.50MB  0.66% 97.67%    11.50MB  0.66%  github.com/roadrunner-server/goridge/v3/pkg/frame.(*Frame).ReadOptions
   10.84MB  0.62% 98.28%    10.84MB  0.62%  runtime.allgadd
       9MB  0.51% 98.80%        9MB  0.51%  google.golang.org/protobuf/internal/impl.consumeStringValidateUTF8

================================================================

Showing nodes accounting for 1.99GB, 98.81% of 2.01GB total
Dropped 82 nodes (cum <= 0.01GB)
Showing top 10 nodes out of 49
      flat  flat%   sum%        cum   cum%
    0.64GB 31.64% 31.64%     0.64GB 31.64%  runtime.malg
    0.42GB 20.72% 52.36%     0.78GB 38.58%  github.com/roadrunner-server/memory/v5/memorykv.(*Driver).Set.(*Driver).ttlcallback.func2
    0.35GB 17.25% 69.61%     0.35GB 17.25%  github.com/roadrunner-server/memory/v5/memorykv.(*Driver).ttlcallback
    0.20GB 10.03% 79.65%     0.36GB 17.86%  time.NewTicker
    0.18GB  8.81% 88.45%     0.18GB  8.81%  time.newTimer
    0.14GB  7.14% 95.60%     0.15GB  7.26%  github.com/roadrunner-server/pool/worker_watcher/container/channel.(*Vec).Pop
    0.03GB  1.46% 97.06%     0.03GB  1.46%  github.com/roadrunner-server/goridge/v3/internal.internalAllocate.func1
    0.01GB  0.66% 97.72%     0.01GB  0.66%  runtime.allgadd
    0.01GB  0.58% 98.30%     0.01GB  0.58%  github.com/roadrunner-server/goridge/v3/pkg/frame.(*Frame).ReadOptions
    0.01GB  0.51% 98.81%     0.01GB  0.51%  google.golang.org/protobuf/internal/impl.consumeStringValidateUTF8

================================================================

(pprof) top
Showing nodes accounting for 2.70GB, 98.21% of 2.75GB total
Dropped 98 nodes (cum <= 0.01GB)
Showing top 10 nodes out of 46
      flat  flat%   sum%        cum   cum%
    0.87GB 31.48% 31.48%     0.87GB 31.48%  runtime.malg
    0.57GB 20.56% 52.04%     1.06GB 38.64%  github.com/roadrunner-server/memory/v5/memorykv.(*Driver).Set.(*Driver).ttlcallback.func2
    0.46GB 16.84% 68.88%     0.46GB 16.84%  github.com/roadrunner-server/memory/v5/memorykv.(*Driver).ttlcallback
    0.28GB 10.00% 78.88%     0.50GB 18.08%  time.NewTicker
    0.26GB  9.42% 88.31%     0.26GB  9.42%  time.newTimer
    0.20GB  7.22% 95.52%     0.20GB  7.34%  github.com/roadrunner-server/pool/worker_watcher/container/channel.(*Vec).Pop
    0.03GB  0.98% 96.51%     0.03GB  0.98%  github.com/roadrunner-server/goridge/v3/internal.internalAllocate.func1
    0.02GB   0.6% 97.11%     0.02GB   0.6%  runtime.allgadd
    0.02GB  0.55% 97.66%     0.02GB  0.55%  google.golang.org/protobuf/internal/impl.consumeStringValidateUTF8
    0.02GB  0.55% 98.21%     0.02GB  0.55%  github.com/roadrunner-server/goridge/v3/pkg/frame.(*Frame).ReadOptions
@rustatian (Member)

Hey @segrax 👋
Do you mean that if you push a significant number of items to the in-memory KV, they're not cleaned up after the TTL? Are you still able to get the TTL-ed item?

@rustatian (Member)

And, just to clarify some things: I used a hashmap (a sync.Map, actually, but under the hood that is two hashmaps, dirty and clean) to store the KV items. In Go, maps, and more specifically the buckets in a map, can't be shrunk when you delete keys. So if you add 100 million items to the map and then delete them, you will see approximately the same memory usage (minus your data, but including the buckets), and that memory won't be reclaimed. The only way to reclaim it is to re-create the map itself.
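As a standalone illustration of that point (a sketch, not the driver code): fill a map, delete every key, and compare the live heap before and after re-creating the map.

package main

import (
	"fmt"
	"runtime"
)

// heapMB forces a GC and reports the live heap in MB.
func heapMB() float64 {
	var ms runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&ms)
	return float64(ms.HeapAlloc) / (1 << 20)
}

func main() {
	m := make(map[int][64]byte)
	for i := 0; i < 1_000_000; i++ {
		m[i] = [64]byte{}
	}
	fmt.Printf("after insert:    %6.1f MB\n", heapMB())

	for i := 0; i < 1_000_000; i++ {
		delete(m, i) // entries are gone, but the buckets stay allocated
	}
	fmt.Printf("after delete:    %6.1f MB\n", heapMB()) // still roughly the same

	m = make(map[int][64]byte) // drop the old map entirely
	fmt.Printf("after re-create: %6.1f MB\n", heapMB()) // now the GC can reclaim it
	runtime.KeepAlive(m)
}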

@rustatian (Member)

Maybe it'd be better to use a binary-search + slice combination, so that at large sizes the Go runtime would be able to reclaim freed memory. But after intensive use the tree might degenerate into a linked list with O(n) search time, so an AVL tree or a similar self-balancing structure could be applied here to prevent that.
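As a rough sketch of what the binary-search + slice idea could look like (hypothetical types, illustration only, not a proposed implementation): entries kept in a slice sorted by key, lookups via sort.Search, deletions by re-slicing so the backing array can shrink when it is re-allocated.

package main

import (
	"fmt"
	"sort"
)

// entry is a hypothetical KV record used only for this sketch.
type entry struct {
	key string
	val string
}

// store keeps entries sorted by key, so lookups are O(log n) via binary search.
type store struct{ items []entry }

func (s *store) get(key string) (string, bool) {
	i := sort.Search(len(s.items), func(i int) bool { return s.items[i].key >= key })
	if i < len(s.items) && s.items[i].key == key {
		return s.items[i].val, true
	}
	return "", false
}

func (s *store) delete(key string) {
	i := sort.Search(len(s.items), func(i int) bool { return s.items[i].key >= key })
	if i < len(s.items) && s.items[i].key == key {
		s.items = append(s.items[:i], s.items[i+1:]...) // shift left; length shrinks
	}
}

func main() {
	s := &store{items: []entry{{"a", "1"}, {"b", "2"}, {"c", "3"}}}
	v, ok := s.get("b")
	fmt.Println(v, ok) // 2 true
	s.delete("b")
	_, ok = s.get("b")
	fmt.Println(ok) // false
}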

@segrax (Author) commented Nov 15, 2024

Hey @segrax 👋 Do you mean that if you push a significant number of items to the in-memory KV, they're not cleaned up after the TTL? Are you still able to get the TTL-ed item?

Hey @rustatian,
In this case, we're just setting the same key, same value, and same TTL over and over:

$storage->set('key', 'value', 2000);

@segrax (Author) commented Nov 15, 2024

And, just to clarify some things: I used a hashmap (a sync.Map, actually, but under the hood that is two hashmaps, dirty and clean) to store the KV items. In Go, maps, and more specifically the buckets in a map, can't be shrunk when you delete keys. So if you add 100 million items to the map and then delete them, you will see approximately the same memory usage (minus your data, but including the buckets), and that memory won't be reclaimed. The only way to reclaim it is to re-create the map itself.

If an existing key is set again (before it's expired), is the memory reclaimed?

My Go experience is pretty limited, but if this runs again (with the same key) before the callback has executed, is it cleaned up?

clbk, stopCh, updateCh := d.ttlcallback(items[i].Key(), tm)
go func() {
	clbk(*d.broadcastStopCh.Load())
}()

// store the callback since we have TTL
d.callbacks.Store(items[i].Key(), &cb{
	updateCh: updateCh,
	stopCh:   stopCh,
})

I did come across this issue: Go Issue, but I'm not sure if it's related or not.

@rustatian (Member)

And this case is interesting: the callback won't be cleaned up, but it should fire after the TTL expires and remove that key; other goroutines, though, would keep waiting for the TTL. But yeah, good point I think. I'll double-check that case and also fix the panic.

@rustatian added this to the v2024.3 milestone Nov 15, 2024
@segrax (Author) commented Nov 15, 2024

And this case is interesting: the callback won't be cleaned up, but it should fire after the TTL expires and remove that key; other goroutines, though, would keep waiting for the TTL. But yeah, good point I think. I'll double-check that case and also fix the panic.

I have tried setting the same key ~50k times with a TTL of 2000... then waiting 5 minutes, but the memory use never goes down

This is my full worker.php, although you should be able to replicate it pretty easily just by calling the RPC in a loop with a fixed payload

<?php

require __DIR__ . '/vendor/autoload.php';

use Spiral\RoadRunner;
use Spiral\RoadRunner\Environment;
use Spiral\Goridge\RPC\RPC;
use Spiral\RoadRunner\KeyValue\Factory;

$worker = RoadRunner\Worker::create();
$http = new RoadRunner\Http\HttpWorker($worker);

$rpc = RPC::create('tcp://127.0.0.1:6001');
$factory = new Factory($rpc);
$storage = $factory->select('cache');

try {
    while ($req = $http->waitRequest()) {
        $storage->set('key', 'value', 2000);
        $http->respond(200, 'asd');
    }
} catch (\Throwable $e) {
    $worker->error($e->getMessage());
}

@rustatian (Member)

@segrax I've located the issue, will release the patch soon.

@rustatian (Member)

Ok, here is what the reason was:

  1. Under certain conditions, when the user wants to overwrite an entry under the same key (e.g. an entry with the key a already exists and the incoming key is also a), if the new key a is set before the callback removes the old one, the fired callback removes the new a key, the one that overwrote the old one. All of these operations are thread-safe.
  2. After that operation, the key is removed and no a key is present in the local storage anymore.
  3. The goroutine waiting on the select then waits forever, which causes the memory leak (see the simplified sketch below).
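For illustration only, a much-simplified sketch of that pattern (hypothetical names, not the actual driver code): each Set stores fresh stop/update channels for the key and spawns a goroutine waiting on them. When the map entry is overwritten, the previous goroutine can no longer be signaled; in the real driver, the race described above means some of these goroutines are never released at all.

package main

import (
	"sync"
	"time"
)

// cb mirrors the idea of per-key callback channels (names are hypothetical).
type cb struct {
	stopCh   chan struct{}
	updateCh chan time.Duration
}

var callbacks sync.Map // key -> *cb

// set spawns one TTL goroutine per call. Storing a new *cb under the same key
// replaces the previous one, so the old goroutine's stopCh/updateCh become
// unreachable: nothing can stop or update it early, and it stays parked on its
// select, holding its ticker, until its own TTL fires.
func set(key string, ttl time.Duration) {
	c := &cb{
		stopCh:   make(chan struct{}),
		updateCh: make(chan time.Duration),
	}
	callbacks.Store(key, c) // overwrites (and orphans) any previous *cb for key

	go func() {
		t := time.NewTicker(ttl)
		defer t.Stop()
		for {
			select {
			case <-t.C: // TTL fired: remove the key
				callbacks.Delete(key)
				return
			case newTTL := <-c.updateCh: // TTL updated
				t.Reset(newTTL)
			case <-c.stopCh: // explicit stop
				return
			}
		}
	}()
}

func main() {
	// Same key, long TTL, over and over -- like $storage->set('key', 'value', 2000)
	// in the reproduction: each call orphans the previous goroutine, so goroutines
	// and tickers pile up, matching the growth in the pprof profiles above.
	for i := 0; i < 100_000; i++ {
		set("key", 2000*time.Second)
	}
	time.Sleep(time.Second)
}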
