[ray] Object store shared memory numpy leak in worker loop #7653

Open · 1 of 2 tasks
DMTSource opened this issue Mar 19, 2020 · 0 comments
Labels
bug: Something that is supposed to be working; but isn't
P3: Issue moderate in impact or severity

Comments


DMTSource commented Mar 19, 2020

What is the problem?

When looping over rows of a shared 2D array, I see a consistent memory buildup in my ray workers whenever I do any work with the array. I attempted to recreate this in a loop on the main process, but there the problem does not occur.

I assumed this was a numpy bug similar to the issue below, so I built my test to run both in the main process and in a ray worker, but I can't seem to work on the shared object in a worker without causing the leak.
numpy/numpy#15746

Is this expected behavior when touching a read-only item in a worker's loop, or should it behave like the main-process example (illustrated in the plot below)?

I imagine that over the course of the loop the entire contents of the read-only object end up loaded into local memory for manipulation. Perhaps what I'm really asking is: is there a way around this, or do I need to work with a local copy of very large objects, or go back to loading data in batches from file?
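To make the last option concrete, something like the following is what I mean by working with a local copy inside the worker. This is an untested sketch, and the full copy gives up exactly the per-worker memory savings that ray.put was supposed to provide:

import numpy as np
import ray

@ray.remote
def event_loop_local_copy(shared_features):
    # Hypothetical workaround: materialize a private copy of the whole array
    # inside the worker so the loop never touches the object-store-backed view.
    local_features = np.array(shared_features, copy=True)
    for i in range(local_features.shape[0]):
        features_i = [float(v) for v in local_features[i]]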

Ray version and other system information (Python version, TensorFlow version, OS):
Ubuntu 14.04
Python 3.7.6 (Anaconda environment active)
[GCC 7.3.0] :: Anaconda, Inc. on linux
numpy 1.18.1
ray 0.8.2

Reproduction

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

(Attached: mprof memory profile plot.)

I ran my script with the following commands to obtain the memory profile plot:

mprof run --include-children --multiprocess test_ray_numpy_read_only_leak_in_loop.py
mprof plot
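A small psutil helper like the one below could also be dropped into the loops to watch the resident set size directly in each process's stdout. This is not part of the repro, just an optional extra check on top of mprof:

import os
import psutil

def log_rss(tag):
    # Print this process's resident set size in MB so any growth can be
    # watched directly in the worker's stdout.
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1e6
    print("%s: RSS = %.1f MB" % (tag, rss_mb))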

# Similar to issue 15746? Unable to get around the leak with copy though, so probably ndarray + pyarrow/object store related?
# https://github.com/numpy/numpy/issues/15746

# Wed Mar 18 2020
import numpy as np # 1.18.1
import ray         # 0.8.2

ray.init(num_cpus=1)

# fake data
features = np.random.random((1000000,100)).astype(np.float32)
# shared memory store
feature_store_id = ray.put(features)
del features
# We may have many workers looping over the read-only shared array, with the intention of reducing memory usage

def test_read_only(feats):
    # ensure the shared array is read-only: in-place modification should raise
    try:
        feats += 1.
    except ValueError:
        print("Features are properly read-only from ray shared memory store.")

###################################################
# I do not believe there is any problem here, according to the first half of the mprof plot
def event_loop_local():
    # I could also use global features here
    #   (bypass ray and skip deleting features above) and I would have no mem leak.
    shared_features = ray.get(feature_store_id)

    test_read_only(shared_features)

    for i in range(1000000):
        # Does this mem leak?:  NO (see 'mprof plot')
        features_i = shared_features[i]
        features_i = [float(v) for v in features_i]

# test as func on main process
print("Testing event_loop_local")
event_loop_local()
###################################################

###################################################
# Now test in ray as a worker
@ray.remote
def event_loop_remote(shared_features):
    # tested both by passing the array via .remote() and with ray.get inside the worker
    #shared_features = ray.get(feature_store_id)
    test_read_only(shared_features)

    for i in range(1000000):
        # Does this leak?:  YES; let's try the copy fixes from np issue 15746
        features_i = shared_features[i]
        # Does this leak?:  YES (see 'mprof plot')
        #features_i = shared_features[i].copy()
        # Does this leak?: YES
        #features_i = np.array(shared_features[i], copy=True, dtype=object)#.copy()
        # Does this leak?: YES
        #features_i = shared_features[i:i+1]#.copy()

        # None of the above prevent the work done here from leaking
        features_i = [float(v) for v in features_i]


        # Does this leak?: YES
        #features_i = shared_features[i].copy()
        #features_i += 1.

# test inside a ray worker
print("Testing event_loop_remote.remote")
ray.get(event_loop_remote.remote(feature_store_id))
###################################################

ray.shutdown()

Expected Output:

Testing event_loop_local
Features are properly read-only from ray shared memory store.
Testing event_loop_remote.remote
(pid=13191) Features are properly read-only from ray shared memory store.

DMTSource added the bug label on Mar 19, 2020
ericl added the P3 label on Mar 19, 2020