
OOM in a 2G box #5412

Closed
cannium opened this issue Aug 31, 2018 · 21 comments

Labels
kind/bug A bug in existing code (including security flaws)
topic/perf Performance

Comments

@cannium commented Aug 31, 2018

Version information:

go-ipfs version: 0.4.18-dev-
Repo version: 7
System version: amd64/linux
Golang version: go1.10.1

Type:

Bug

Description:

I reserved a droplet (a DigitalOcean VPS) with 1 core and 2 GB of memory, and ran only IPFS on it. After about 12 hours, an OOM kill occurred. The commands I used to build and run IPFS were:

go get -u -d github.com/ipfs/go-ipfs
cd $GOPATH/src/github.com/ipfs/go-ipfs
make install
ipfs init
ipfs daemon &

kernel OOM log:

[52300.460643] Out of memory: Kill process 9053 (ipfs) score 885 or sacrifice child
[52300.465445] Killed process 9053 (ipfs) total-vm:2660108kB, anon-rss:1860420kB, file-rss:0kB, shmem-rss:0kB

Monitor graph:
[screenshot: digitalocean_-_ubuntu-usa]

As a second try, I lowered the connection manager limits to

"LowWater": 60
"HighWater": 90

I'll wait and see whether it OOMs again.

I'd like to know the recommended memory configuration for a production system. On the other hand, if I need to run IPFS in a 2 GB box, how do I tune its memory usage (through both config and code)?
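
For context on what those two settings control (as I understand the libp2p connection manager, it starts trimming connections once the count passes HighWater, closing the least useful ones until only LowWater remain), here is a minimal, hypothetical Go sketch of that trimming idea. The types and names are invented for illustration; this is not the actual go-libp2p code.

// Hypothetical sketch of watermark-based connection trimming, to illustrate
// what LowWater/HighWater mean. Not the actual go-libp2p connection manager.
package main

import (
    "fmt"
    "sort"
    "time"
)

type conn struct {
    peer     string
    lastUsed time.Time
}

type connManager struct {
    lowWater, highWater int
    conns               []conn
}

// maybeTrim closes the least recently used connections once the total
// exceeds highWater, until only lowWater connections remain.
func (cm *connManager) maybeTrim() {
    if len(cm.conns) <= cm.highWater {
        return
    }
    sort.Slice(cm.conns, func(i, j int) bool {
        return cm.conns[i].lastUsed.After(cm.conns[j].lastUsed)
    })
    for _, c := range cm.conns[cm.lowWater:] {
        fmt.Println("closing", c.peer) // stand-in for closing the real connection
    }
    cm.conns = cm.conns[:cm.lowWater]
}

func main() {
    cm := &connManager{lowWater: 60, highWater: 90}
    for i := 0; i < 120; i++ {
        cm.conns = append(cm.conns, conn{peer: fmt.Sprintf("peer-%d", i), lastUsed: time.Now()})
    }
    cm.maybeTrim()
    fmt.Println("connections kept:", len(cm.conns)) // 60
}

In other words, lowering LowWater/HighWater caps how many live connections (and their per-connection buffers) the daemon keeps around, which is usually the first knob to turn on a small box.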

@Stebalien Stebalien added kind/support A question or request for support topic/perf Performance labels Aug 31, 2018
@cannium (Author) commented Sep 1, 2018

Out of memory again. It ran for about 24 hours with:

"LowWater": 60
"HighWater": 90

[screenshot: digitalocean_-_ubuntu-usa]

@schomatis (Contributor)

Hey @cannium, could you provide a few heap dumps (ideally at different uptimes, e.g. every 3-4 hours)? You can capture one with:

curl localhost:5001/debug/pprof/heap > ipfs.heap

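(For what it's worth, that /debug/pprof/heap endpoint is Go's standard pprof handler exposed on the API port. If the HTTP API is not reachable by the time memory is already ballooning, a heap profile can also be written directly from Go code via runtime/pprof; a minimal sketch, assuming you are able to add a debug hook to the binary you built:)

// Minimal sketch: write a heap profile to a file from inside a Go program.
// Equivalent in spirit to fetching /debug/pprof/heap over HTTP.
package main

import (
    "log"
    "os"
    "runtime"
    "runtime/pprof"
)

func main() {
    f, err := os.Create("ipfs.heap")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    runtime.GC() // refresh allocation statistics before dumping
    if err := pprof.WriteHeapProfile(f); err != nil {
        log.Fatal(err)
    }
}
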
@cannium (Author) commented Sep 4, 2018

Sure, will do.

@cannium (Author) commented Sep 6, 2018

In case you're getting impatient, I'll put some updates here.
I started another instance with the LowWater 60 / HighWater 90 config. It ran for about 2 days, but the OOM didn't happen. The only difference is that this instance is in the Toronto region.
So I started another instance in the US region today. Hopefully the OOM will happen within 24 hours...

@schomatis (Contributor)

No problem, take your time.

@schomatis schomatis added the need/author-input Needs input from the original author label Sep 6, 2018
@Stebalien Stebalien added need/author-input Needs input from the original author and removed need/author-input Needs input from the original author labels Sep 6, 2018
@cannium (Author) commented Sep 7, 2018

It happened again. I started a file server containing all the heap files here: http://159.203.36.22:8000/
Note that the last 2 heap files are empty because ipfs had already been killed; I left them there as an indicator.
I also put up the binary (the file named ipfs) in case someone needs it.

Monitor graphs for the last few hours:
[screenshot: digitalocean_-_ubuntu-can]

@magik6k magik6k removed the need/author-input Needs input from the original author label Sep 7, 2018
@Stebalien (Member)

What commit did you build?

@cannium (Author) commented Sep 7, 2018

I believe it's 78a32f2

@schomatis (Contributor)

It would seem to be related to the number of active connections.

go tool pprof "ipfs.heap-Fri Sep  7 02_15_20 UTC 2018"
[...]
(pprof) top
Showing nodes accounting for 682.71MB, 84.78% of 805.26MB total
Dropped 183 nodes (cum <= 4.03MB)
Showing top 10 nodes out of 88
      flat  flat%   sum%        cum   cum%
  290.93MB 36.13% 36.13%   435.98MB 54.14%  gx/ipfs/QmZt87ZHYGaZFBrtGPEqYjAC2yNAhgiRDBmu8KaCzHjx6h/yamux.newSession
  148.57MB 18.45% 54.58%   148.57MB 18.45%  bufio.NewReaderSize (inline)
   57.51MB  7.14% 61.72%    57.51MB  7.14%  gx/ipfs/QmZt87ZHYGaZFBrtGPEqYjAC2yNAhgiRDBmu8KaCzHjx6h/yamux.newStream
   46.52MB  5.78% 67.50%    48.02MB  5.96%  crypto/cipher.NewCTR
   45.51MB  5.65% 73.15%    45.51MB  5.65%  crypto/aes.newCipher
   24.51MB  3.04% 76.19%    33.01MB  4.10%  gx/ipfs/QmWri2HWdxHjWBUermhWy7QWJqN1cV8Gd1QbDiB5m86f1H/go-libp2p-secio.newSecureSession
      18MB  2.24% 78.43%       18MB  2.24%  gx/ipfs/QmXTpwq2AkzQsPjKqFQDNY2bMdsAT53hUBETeyj8QRHTZU/sha256-simd.New
   17.50MB  2.17% 80.60%    35.50MB  4.41%  crypto/hmac.New
   17.15MB  2.13% 82.73%    23.15MB  2.88%  gx/ipfs/QmVYxfoJQiZijTgPNHCHgHELvQpbsJNTg6Crmc3dQkj3yy/golang-lru/simplelru.(*LRU).Add
   16.50MB  2.05% 84.78%    16.50MB  2.05%  math/big.nat.make

@cannium (Author) commented Sep 10, 2018

Could you give more details about the session/stream structure? I'd like to understand how a session could take 290 MB of memory. Does it cache the whole file object in memory?

@Stebalien (Member)

It's not a single session; that's all of your connections combined. However, that's still an absurd number.

I'm getting:

         .          .     89:	s := &Session{
         .          .     90:		config:     config,
         .          .     91:		logger:     log.New(config.LogOutput, "", log.LstdFlags),
         .          .     92:		conn:       conn,
         .          .     93:		bufRead:    bufio.NewReader(conn),
    1.50MB     1.50MB     94:		pings:      make(map[uint32]chan struct{}),
  512.02kB   512.02kB     95:		streams:    make(map[uint32]*Stream),
       1MB        1MB     96:		inflight:   make(map[uint32]struct{}),
       2MB        2MB     97:		synCh:      make(chan struct{}, config.AcceptBacklog),
  108.92MB   108.92MB     98:		acceptCh:   make(chan *Stream, config.AcceptBacklog),
   70.70MB    70.70MB     99:		sendCh:     make(chan sendReady, 64),
    2.50MB     2.50MB    100:		recvDoneCh: make(chan struct{}),
       5MB        5MB    101:		shutdownCh: make(chan struct{}),
         .          .    102:	}
         .          .    103:	if client {
         .          .    104:		s.nextStreamID = 1
         .          .    105:	} else {
         .          .    106:		s.nextStreamID = 2

Really, those channels should be pretty small. I wonder if we're leaking connection objects somewhere. Looking at this, I'm pretty sure we are.
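
(A note on why those channels are so visible in the profile: a buffered channel in Go allocates its backing array up front, so every session that is never released keeps its acceptCh/sendCh buffers resident even if nothing is ever sent on them. A rough, self-contained demonstration follows; the types and buffer sizes are stand-ins, not yamux's real definitions.)

// Rough demonstration that leaked sessions retain their channel buffers.
// The types and buffer sizes are stand-ins, not yamux's actual definitions.
package main

import (
    "fmt"
    "runtime"
)

type sendReady struct { // stand-in for yamux's sendReady
    Hdr  []byte
    Body []byte
    Err  chan error
}

type stream struct{} // stand-in for a yamux stream

type session struct { // only the channel fields that dominate the profile
    acceptCh chan *stream
    sendCh   chan sendReady
}

func heapAlloc() uint64 {
    runtime.GC()
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    return m.HeapAlloc
}

func main() {
    const leaked = 10000 // pretend this many sessions are never freed
    before := heapAlloc()

    keep := make([]*session, 0, leaked)
    for i := 0; i < leaked; i++ {
        keep = append(keep, &session{
            acceptCh: make(chan *stream, 256), // buffer allocated eagerly
            sendCh:   make(chan sendReady, 64),
        })
    }

    after := heapAlloc()
    fmt.Printf("~%d KiB retained just by channel buffers of %d idle sessions\n",
        (after-before)/1024, leaked)
    runtime.KeepAlive(keep)
}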

@Stebalien Stebalien added regression and removed kind/support A question or request for support labels Sep 10, 2018
@Stebalien (Member)

Yeah, there's definitely a slow leak. I've tried killing a bunch of connections but that memory doesn't move anywhere. However, this is quite a slow leak so it'll be tricky to track down where it's happening.

@Stebalien (Member)

Historically, this has always been a race in stream management. That is, some service is holding onto an open stream but not actually using it. When services do that, they're supposed to detect that the client has disconnected and free the stream, but I'm guessing that isn't happening here.
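
(For readers following along, a minimal, library-agnostic sketch of the pattern being described; the Stream interface here is hypothetical, not go-libp2p's actual API. The point is that a handler has to close or reset the stream on every exit path, otherwise the stream, and the connection behind it, stays referenced.)

// Hypothetical stream interface and handler, for illustration only.
package streamexample

import "io"

type Stream interface {
    Read(p []byte) (int, error)
    Write(p []byte) (int, error)
    Close() error // graceful close
    Reset() error // abort and release resources immediately
}

// handle echoes data back, releasing the stream on every exit path. A handler
// that returns without Close/Reset is exactly the kind of leak described above.
func handle(s Stream) {
    buf := make([]byte, 4096)
    for {
        n, err := s.Read(buf)
        if err != nil {
            if err == io.EOF {
                s.Close() // remote finished cleanly
            } else {
                s.Reset() // remote vanished or misbehaved; don't keep the stream around
            }
            return
        }
        if _, err := s.Write(buf[:n]); err != nil {
            s.Reset()
            return
        }
    }
}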

@Stebalien (Member)

Found it! If we abort the connection while the identify protocol is running, we'll keep a reference to the connection forever. This is an ancient bug.
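
(To make the shape of that bug concrete, a hypothetical sketch; this is not the actual go-libp2p identify code, just an illustration of how skipping cleanup on the abort/error path retains a connection forever.)

// Illustrative sketch of the kind of leak described above; not real libp2p code.
package identleak

import "sync"

type Conn struct{ /* ...buffers, streams, channels... */ }

type IdentifyService struct {
    mu      sync.Mutex
    pending map[*Conn]struct{} // connections whose identify exchange is in flight
}

// IdentifyConn registers the connection and runs the identify exchange.
func (ids *IdentifyService) IdentifyConn(c *Conn) {
    ids.mu.Lock()
    ids.pending[c] = struct{}{}
    ids.mu.Unlock()

    if err := ids.runIdentify(c); err != nil {
        // BUG: on the abort/error path the entry is never removed, so the
        // map keeps the *Conn (and everything it references) alive forever.
        return
    }

    ids.mu.Lock()
    delete(ids.pending, c) // only the success path cleans up
    ids.mu.Unlock()
}

// IdentifyConnFixed makes the cleanup unconditional, e.g. with defer.
func (ids *IdentifyService) IdentifyConnFixed(c *Conn) {
    ids.mu.Lock()
    ids.pending[c] = struct{}{}
    ids.mu.Unlock()

    defer func() {
        ids.mu.Lock()
        delete(ids.pending, c)
        ids.mu.Unlock()
    }()

    _ = ids.runIdentify(c)
}

func (ids *IdentifyService) runIdentify(c *Conn) error { return nil } // stub for the real exchange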

@Stebalien (Member) commented Sep 10, 2018

fix: libp2p/go-libp2p#420

@Stebalien Stebalien added kind/bug A bug in existing code (including security flaws) and removed regression labels Sep 10, 2018
@cannium (Author) commented Sep 11, 2018

Awesome! I'll test with the latest version to see if it's fixed.

@Stebalien (Member)

You'll have to wait for me to bubble the update to go-ipfs. I should get to that in a few days.

@cannium (Author) commented Sep 12, 2018

I gx-go linked it. The memory graph still grows slowly, but it's far better than before.

@Stebalien (Member)

That's great to hear (I didn't want to have to trouble you with getting gx working but I'm glad you were able to).

The remaining growth is probably the peerstore (which we're working on). However, so we don't make assumptions, mind posting a heap profile (just one should be fine) along with the IPFS binary and commit used to build it?

@cannium (Author) commented Sep 13, 2018

Put it here as before: http://159.203.36.22:8000/
The go-ipfs version is 7ad12bf; the linked go-libp2p version is 542d43a30fdece4b61517cacff56a3ccbedc4139.

@Stebalien (Member)

Thanks! Yeah, that looks like a known bug: ipfs/go-ipfs-blockstore#3. Basically, we should be using a significantly more efficient "has" cache for our blockstore.
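
(For reference, the idea behind a more efficient "has" cache, as a rough sketch: keep a small in-memory structure in front of the blockstore so that most Has() queries never hit the underlying datastore. The interfaces and names below are hypothetical rather than the actual go-ipfs-blockstore code, and a production version would use something far more compact, e.g. a bloom filter plus a bounded LRU, instead of the naive map shown here.)

// Sketch of a "has" cache wrapped around a blockstore (hypothetical interfaces).
package hascache

import "sync"

type Blockstore interface {
    Has(key string) (bool, error)
    Put(key string, data []byte) error
}

type cachedBlockstore struct {
    inner Blockstore

    mu    sync.Mutex
    seen  map[string]bool // key -> known Has() result
    limit int             // crude bound on cache size
}

func Wrap(inner Blockstore, limit int) Blockstore {
    return &cachedBlockstore{inner: inner, seen: make(map[string]bool), limit: limit}
}

func (c *cachedBlockstore) Has(key string) (bool, error) {
    c.mu.Lock()
    if has, ok := c.seen[key]; ok {
        c.mu.Unlock()
        return has, nil // answered from the cache, no datastore hit
    }
    c.mu.Unlock()

    has, err := c.inner.Has(key)
    if err != nil {
        return false, err
    }
    c.remember(key, has)
    return has, nil
}

func (c *cachedBlockstore) Put(key string, data []byte) error {
    if err := c.inner.Put(key, data); err != nil {
        return err
    }
    c.remember(key, true)
    return nil
}

func (c *cachedBlockstore) remember(key string, has bool) {
    c.mu.Lock()
    defer c.mu.Unlock()
    if len(c.seen) >= c.limit {
        c.seen = make(map[string]bool) // crude eviction, for illustration only
    }
    c.seen[key] = has
}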
