
[Bug] Object lost even though a reference to it exists #20018

Closed · 1 of 2 tasks

birgerbr opened this issue Nov 3, 2021 · 2 comments
Labels: bug (Something that is supposed to be working; but isn't), triage (Needs triage (eg: priority, bug/not-bug, and owning component))
Milestone: Core Backlog

Comments

birgerbr (Contributor) commented Nov 3, 2021

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core

What happened + What you expected to happen

An object stored with ray.put is lost when its owner dies, even though references to the object still exist.

I expected the object to be copied during worker shutdown so that it would not be lost.

Versions / Dependencies

Ray 1.8.0, Python 3.8.10, Ubuntu 20.04

Reproduction script

# stdlib
import asyncio
import time

# thirdparty
import numpy as np
import ray
import ray.util.queue


def make_data():
    return np.random.random((10, 10))


make_data_remote = ray.remote(make_data)


@ray.remote
def make_data_collection(make_it_fail: bool):
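    # With make_it_fail=True the objects are created with ray.put inside this
    # task, so the task's worker process becomes their owner.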
    if make_it_fail:
        return [
            ray.put(make_data()),
            ray.put(make_data()),
        ]
    else:
        return [make_data_remote.remote(), make_data_remote.remote()]


def use_actor():
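    # Start and immediately shut down an actor-backed Queue; as described under
    # "Anything else" below, this tears down the worker the actor ran on.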
    queue = ray.util.queue.Queue()
    queue.shutdown()


@ray.remote
def process_data(x):
    return len(x)


async def main(make_it_fail: bool):
    print(f"Make it fail: {make_it_fail}")
    ray.init()

    make_data_collection_task = ray.get(make_data_collection.remote(make_it_fail))

    use_actor()
    time.sleep(1)

    for ref in make_data_collection_task:
        print(await process_data.remote(ref))

    ray.shutdown()


asyncio.run(main(False))
asyncio.run(main(True))

Anything else

In our case the problem is magnified because we use actors, which, as I understand it, shut down the worker they are running on when they are done.

Here is the output of the script:

Make it fail: False
2021-11-03 12:46:37,597 INFO services.py:1270 -- View the Ray dashboard at http://127.0.0.1:8265
10
10
Make it fail: True
2021-11-03 12:46:41,696 INFO services.py:1270 -- View the Ray dashboard at http://127.0.0.1:8265
Traceback (most recent call last):
  File "play_ray_issue_2.py", line 55, in <module>
    asyncio.run(main(True))
  File "/usr/lib/python3.8/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "play_ray_issue_2.py", line 49, in main
    print(await process_data.remote(ref))
ray.exceptions.RayTaskError: ray::process_data() (pid=3242475, ip=10.0.0.63)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.OwnerDiedError: Failed to retrieve object 32cccd03c567a254ffffffffffffffffffffffff0100000002000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.

The object's owner has exited. This is the Python worker that first created the ObjectRef via `.remote()` or `ray.put()`. Check cluster logs (`/tmp/ray/session_latest/logs/*8011c02e5d0655287e75d4d9920c4238139f2b0956d9a752874fa6d8*` at IP address 10.0.0.63) for more information about the Python worker failure.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
birgerbr added the bug and triage labels on Nov 3, 2021
rkooo567 added this to the Core Backlog milestone on Nov 4, 2021
rkooo567 (Contributor) commented Nov 4, 2021

cc @ericl Is this addressable by the `_owner=` private API? I assume this is the ownership transfer problem?

ericl (Contributor) commented Nov 12, 2021

Yes, this duplicates #12635. You can work around this for now by setting a designated owner actor via `ray.put(_owner=actor)`.
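
For illustration, here is a minimal sketch of how that workaround could be applied to the reproduction script above. The ObjectOwner actor is a hypothetical placeholder for whichever long-lived actor is chosen as the designated owner, and the sketch assumes the experimental _owner keyword argument to ray.put behaves as described.

# thirdparty
import numpy as np
import ray


@ray.remote
class ObjectOwner:
    """Hypothetical long-lived actor that exists only to own objects."""

    def ping(self):
        return "ok"


@ray.remote
def make_data_collection(owner):
    # Assign ownership of the put objects to the long-lived actor instead of
    # this task's worker, so they are not lost when the worker exits.
    return [
        ray.put(np.random.random((10, 10)), _owner=owner),
        ray.put(np.random.random((10, 10)), _owner=owner),
    ]


ray.init()
owner = ObjectOwner.remote()
refs = ray.get(make_data_collection.remote(owner))
# The objects should stay retrievable for as long as `owner` is alive, even
# after the worker that ran make_data_collection has shut down.
print([data.shape for data in ray.get(refs)])
ray.shutdown()

The owner actor (or a detached actor) needs to stay alive for as long as the references are used; if it dies, the objects are lost again.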

ericl closed this as completed on Nov 12, 2021