openslide memory leak #24

Open
gabricampanella opened this issue Aug 25, 2017 · 6 comments

@gabricampanella

Context

Issue type: bug report
Operating system: CentOS 7.3.1611
Platform: x86_64 GNU/Linux
OpenSlide Python version: 1.1.0
Slide format: SVS

Details

Dear OpenSlide team,
Thank you for making this software available; it greatly facilitates my research. I am building large-scale pipelines for digital pathology that rely on extracting tiles from thousands of slides on the fly. To do this I create all the OpenSlide objects at the beginning and append them to a list, then fetch tiles by looping through a list of pixel coordinates and slide indices. Somewhere in that loop memory leaks: usage keeps increasing until the process runs out of memory. The script below reproduces the behavior; you only need to change the path to a slide to make it work.

import openslide
import random
import sys
import numpy as np

class library(object):
    def __init__(self, slides, idx, grid, size):
        sl = []
        for i,slide in enumerate(slides):
            sl.append(openslide.OpenSlide(slide))
            sys.stdout.write("{}/{} \r".format(i+1,len(slides)))
            sys.stdout.flush()
        tup = list(zip(grid, idx))  # materialize as a list (zip is an iterator on Python 3)
        self.size = size
        self.slides = slides
        self.tup = tup
        self.sl = sl

    def make_list_of_objs(self):
        sl = []
        for i,slide in enumerate(self.slides):
            sl.append(openslide.OpenSlide(slide))
            sys.stdout.write("{}/{} \r".format(i+1,len(self.slides)))
            sys.stdout.flush()
        self.sl = sl

    def __getitem__(self, index):
        tup = self.tup[index]
        img = self.sl[tup[1]].read_region(tup[0],0,(self.size,self.size)).convert('RGB')
        return img

    def __len__(self):
        return len(self.tup)

class batch_loader(object):
    def __init__(self,dset,n):
        batches = []
        for i in range(0, len(dset), n):
            d = min(n, len(dset) - i)
            batches.append(range(i,i+d))
        self.batches = batches

#MAKE DATA
slides = ['/scratch/gabriele/prostate_dict/389973.svs']*2000 #CHANGE SLIDE NAME
idx = list(np.random.choice(len(slides),2000000,replace=True))
gridx = list(np.random.choice(10000,2000000,replace=True))
gridy = list(np.random.choice(10000,2000000,replace=True))
grid = list(zip(gridx, gridy))  # materialize as a list (zip is an iterator on Python 3)

#SET UP LOADER
dset = library(slides,idx,grid,224)
loader = batch_loader(dset,512)

#LEAKING LOOP
for ii,batch in enumerate(loader.batches):
    imgs = []
    for i,idx in enumerate(batch):
        img = np.array(dset[idx])
        imgs.append(img)
    sys.stdout.write("{}/{} \r".format(ii+1,len(loader.batches)))
    sys.stdout.flush()

To work around this, one can delete the list of OpenSlide objects and recreate it every few iterations. This stabilizes the memory footprint, but it is an unfortunate workaround that wastes a lot of time reopening OpenSlide objects:

import openslide
import random
import sys
import numpy as np

class library(object):
    def __init__(self, slides, idx, grid, size):
        sl = []
        for i,slide in enumerate(slides):
            sl.append(openslide.OpenSlide(slide))
            sys.stdout.write("{}/{} \r".format(i+1,len(slides)))
            sys.stdout.flush()
        tup = list(zip(grid, idx))  # materialize as a list (zip is an iterator on Python 3)
        self.size = size
        self.slides = slides
        self.tup = tup
        self.sl = sl
        self.idx = 0

    def make_list_of_objs(self):
        sl = []
        for i,slide in enumerate(self.slides):
            sl.append(openslide.OpenSlide(slide))
            sys.stdout.write("{}/{} \r".format(i+1,len(self.slides)))
            sys.stdout.flush()
        self.sl = sl

    def __getitem__(self, index):
        # workaround: every 5120 tiles, rebuild the handle list so the old
        # OpenSlide objects (and their internal caches) can be freed
        if self.idx > 5120:
            self.idx = 0
            self.make_list_of_objs()
        self.idx += 1
        tup = self.tup[index]
        img = self.sl[tup[1]].read_region(tup[0],0,(self.size,self.size)).convert('RGB')
        return img

    def __len__(self):
        return len(self.tup)

class batch_loader(object):
    def __init__(self,dset,n):
        batches = []
        for i in range(0, len(dset), n):
            d = min(n, len(dset) - i)
            batches.append(range(i,i+d))
        self.batches = batches

#MAKE DATA
slides = ['/scratch/gabriele/prostate_dict/389973.svs']*2000 #CHANGE SLIDE NAME
idx = list(np.random.choice(len(slides),2000000,replace=True))
gridx = list(np.random.choice(10000,2000000,replace=True))
gridy = list(np.random.choice(10000,2000000,replace=True))
grid = list(zip(gridx, gridy))  # materialize as a list (zip is an iterator on Python 3)

#SET UP LOADER
dset = library(slides,idx,grid,224)
loader = batch_loader(dset,512)

#LOOP WITH WORKAROUND
for ii,batch in enumerate(loader.batches):
    imgs = []
    for i,idx in enumerate(batch):
        img = np.array(dset[idx])
        imgs.append(img)
    sys.stdout.write("{}/{} \r".format(ii+1,len(loader.batches)))
    sys.stdout.flush()

Thanks for your patience.

@bgilbert
Member

Hi @gabricampanella. Each OpenSlide object has an internal cache of decoded pixel data that can grow to 32 MiB. There's currently no API to configure the size of that cache (openslide/openslide#38).

My usual recommendation is to keep a fixed-length cache of OpenSlide handles, closing old ones as necessary. That works reasonably well if the workload exhibits locality, such as in the backend of a web-based slide viewer, but of course it's not great for truly random workloads. Do you have any opportunity to coalesce multiple accesses to a slide?
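
For illustration, a minimal sketch of that fixed-length handle cache, assuming an LRU eviction policy on Python 3; the SlideCache class and its max_open parameter are invented for this example and are not part of the OpenSlide API:

from collections import OrderedDict
import openslide

class SlideCache(object):
    # Keep at most max_open OpenSlide handles; evict (and close) the least
    # recently used handle so at most max_open internal caches exist at once.
    def __init__(self, max_open=64):
        self.max_open = max_open
        self._handles = OrderedDict()  # path -> OpenSlide, ordered by recency

    def get(self, path):
        if path in self._handles:
            self._handles.move_to_end(path)  # mark as most recently used
        else:
            if len(self._handles) >= self.max_open:
                _, oldest = self._handles.popitem(last=False)  # drop oldest handle
                oldest.close()
            self._handles[path] = openslide.OpenSlide(path)
        return self._handles[path]

# usage: img = cache.get(path).read_region((x, y), 0, (224, 224)).convert('RGB')

If the workload revisits the same slides often, evicted handles release their internal caches, so cache memory stays near max_open x 32 MiB instead of growing with the number of open slides.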

@gabricampanella
Author

Thanks for your reply. If I understand correctly, each handle's cache can grow to 32 MiB, so 2000 open handles should grow to no more than about 64 GiB in total. But I observe a memory footprint that goes well beyond that.
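
A quick arithmetic check of that bound (no OpenSlide calls involved):

2000 * 32           # = 64000 MiB of cache across 2000 handles
2000 * 32 / 1024.0  # = 62.5 GiB, roughly the 64 GiB figure above
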
I believe that in a supervised learning framework it matters that samples are drawn independently of each other. Fully random tile extraction seems more correct, though in practice good learning might also be achievable with batches of tiles that all come from the same slide. For now I am more inclined to use the hack than to give up random sampling.

@bgilbert
Member

For what it's worth, you're also storing at least 373 GiB of pixel data in imgs.

If you have a fixed set of samples to collect, as in your reproduction code, you could do it in two passes: first pick a set of samples, then sort by source slide, extract the pixel data, and re-sort into the original order. That way you'd only need to have one slide open at a time.
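
A rough sketch of that two-pass approach, assuming the samples are ((x, y), slide_path) tuples as in the reproduction script; the extract_sorted helper is invented for this example:

import openslide

def extract_sorted(samples, size):
    # pass 1: visit sample indices grouped by slide so only one handle is open at a time
    order = sorted(range(len(samples)), key=lambda i: samples[i][1])
    images = [None] * len(samples)
    current_path, handle = None, None
    for i in order:
        (x, y), path = samples[i]
        if path != current_path:  # lazily switch to the next slide
            if handle is not None:
                handle.close()
            handle = openslide.OpenSlide(path)
            current_path = path
        images[i] = handle.read_region((x, y), 0, (size, size)).convert('RGB')
    if handle is not None:
        handle.close()
    # pass 2 is implicit: writing into images[i] preserves the original sample order
    return images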

@gabricampanella
Author

imgs should contain only 512 images of size 3x224x224 (about 77 million values). It gets rewritten at every step of the loop, so it definitely shouldn't grow to 370 GiB.
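
A rough size check, assuming np.array of an RGB PIL image gives uint8 (one byte per value):

512 * 224 * 224 * 3            # = 77,070,336 values per batch
512 * 224 * 224 * 3 / 2.0**20  # ~ 73.5 MiB as uint8 (a few hundred MiB even as float32)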

I see what you are saying. In the worst case I would need to open 512 slides per batch (with a batch size of 512); in the best case only one. It is an interesting idea. I'll think about it.

@bgilbert
Member

Whoops, you're right about imgs, I misread.

@happyalfred2016

I have a similar issue. I am loading patches from slides while training a deep learning model, but even though I close each slide (OpenSlide.close()) after loading a patch, memory still grows heavily. Do you have any suggestions or a workaround?
