[ray] Object store shared memory numpy leak in worker loop #7653

Open · 1 of 2 tasks
DMTSource opened this issue Mar 19, 2020 · 0 comments
Labels
bug: Something that is supposed to be working; but isn't
P3: Issue moderate in impact or severity

Comments


DMTSource commented Mar 19, 2020

What is the problem?

When looping over rows of a shared 2D array, I see a consistent memory buildup in my ray workers whenever I do any work with the array. I attempted to recreate this in a loop on the main process, but there the problem does not occur.

I assumed this was a numpy bug similar to the issue below, so I built my test to run both in the main process and in a ray worker, but I can't seem to work on the shared object in a worker without causing the leak.
numpy/numpy#15746

Is this expected behavior when touching a read-only item in a worker's loop, or should it behave like the main-process example (illustrated in the plot below)?

I imagine that over the course of the loop the entire contents of the read-only object end up loaded into local memory for manipulation. Perhaps what I'm really asking is: is there a way around this, or do I need to work with a local copy of very large objects, or go back to loading data in batches from file?
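To make the last option concrete, something like the following is what I mean by working with a local copy inside the worker. This is an untested sketch, and the full copy gives up exactly the per-worker memory savings that ray.put was supposed to provide:

import numpy as np
import ray

@ray.remote
def event_loop_local_copy(shared_features):
    # Hypothetical workaround: materialize a private copy of the whole array
    # inside the worker so the loop never touches the object-store-backed view.
    local_features = np.array(shared_features, copy=True)
    for i in range(local_features.shape[0]):
        features_i = [float(v) for v in local_features[i]]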

Ray version and other system information (Python version, TensorFlow version, OS):
Ubuntu 14.04
Python 3.7.6 (Anaconda environment active)
[GCC 7.3.0] :: Anaconda, Inc. on linux
numpy 1.18.1
ray 0.8.2

Reproduction

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

(Attached: mprof memory profile plot.)

I ran my script with the following commands to obtain the memory profile plot:

mprof run --include-children --multiprocess test_ray_numpy_read_only_leak_in_loop.py
mprof plot
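A small psutil helper like the one below could also be dropped into the loops to watch the resident set size directly in each process's stdout. This is not part of the repro, just an optional extra check on top of mprof:

import os
import psutil

def log_rss(tag):
    # Print this process's resident set size in MB so any growth can be
    # watched directly in the worker's stdout.
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1e6
    print("%s: RSS = %.1f MB" % (tag, rss_mb))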

# Similar to issue 15746? Unable to get around the leak with copy though, so probably ndarray + pyarrow/object store related?
# https://github.com/numpy/numpy/issues/15746

# Wed Mar 18 2020
import numpy as np # 1.18.1
import ray         # 0.8.2

ray.init(num_cpus=1)

# fake data
features = np.random.random((1000000,100)).astype(np.float32)
# shared memory store
feature_store_id = ray.put(features)
del features
# We may have many workers looping over the read-only shared array, with the intention of reducing memory usage

def test_read_only(feats):
    # ensure the shared array is read-only: in-place modification should raise
    try:
        feats += 1.
    except ValueError:
        print("Features are properly read-only from ray shared memory store.")

###################################################
# I do not believe there is any problem here, according to the first half of the mprof plot
def event_loop_local():
    # I could also use global features here
    #   (bypass ray and skip deleting features above) and I would have no mem leak.
    shared_features = ray.get(feature_store_id)

    test_read_only(shared_features)

    for i in range(1000000):
        # Does this mem leak?:  NO (see 'mprof plot')
        features_i = shared_features[i]
        features_i = [float(v) for v in features_i]

# test as func on main process
print("Testing event_loop_local")
event_loop_local()
###################################################

###################################################
# Now test in ray as a worker
@ray.remote
def event_loop_remote(shared_features):
    # tested both by passing the array via .remote() and with ray.get inside the worker
    #shared_features = ray.get(feature_store_id)
    test_read_only(shared_features)

    for i in range(1000000):
        # Does this leak?:  YES; let's try the copy fixes from np issue 15746
        features_i = shared_features[i]
        # Does this leak?:  YES (see 'mprof plot')
        #features_i = shared_features[i].copy()
        # Does this leak?: YES
        #features_i = np.array(shared_features[i], copy=True, dtype=object)#.copy()
        # Does this leak?: YES
        #features_i = shared_features[i:i+1]#.copy()

        # None of the above prevent the work done here from leaking
        features_i = [float(v) for v in features_i]


        # Does this leak?: YES
        #features_i = shared_features[i].copy()
        #features_i += 1.

# test inside a ray worker
print("Testing event_loop_remote.remote")
ray.get(event_loop_remote.remote(feature_store_id))
###################################################

ray.shutdown()

Expected Output:

Testing event_loop_local
Features are properly read-only from ray shared memory store.
Testing event_loop_remote.remote
(pid=13191) Features are properly read-only from ray shared memory store.

DMTSource added the bug label on Mar 19, 2020
ericl added the P3 label on Mar 19, 2020