
[core] Support detached/GCS owned objects #12635

Open
ericl opened this issue Dec 5, 2020 · 52 comments
Labels: core (Issues that should be addressed in Ray Core), enhancement (Request for new feature and/or capability), P2 (Important issue, but not time-critical), size-large

@ericl
Contributor

ericl commented Dec 5, 2020

Overview

In the current ownership model, every created object is owned by some worker executing the job. However, this means that when a job exits, all of the objects it created become unavailable.

In certain cases it is desirable to share objects between jobs (e.g., shared cache), without creating objects explicitly from a detached actor.

We have a current internal API to allow the owner of an object created with ray.put() to be assigned to a specific actor, e.g.:

ray.put(data, _owner=actor_handle)

The object created will then fate-share with that actor instead of with the current worker process. This means it is already possible to create global objects by creating a named detached actor and setting the owner to that actor.

However, it would be nice to support ray.put(data, _owner="global") to avoid the need for that hack, and allow the object to be truly HA (e.g., tracked durably in HA GCS storage).
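For concreteness, a rough sketch of that detached-actor workaround (the actor name and keep() method below are just illustrative, and note the owner actor needs to be handed the ref so the object stays in scope after the creating worker exits):

import ray

ray.init()

@ray.remote
class GlobalOwner:
    def __init__(self):
        self.refs = []

    def keep(self, refs):
        # Hold the references so the objects stay in scope with their owner.
        self.refs.extend(refs)

# Named detached actor: it outlives the driver/job that created it.
owner = GlobalOwner.options(name="global_owner", lifetime="detached").remote()

# The object now fate-shares with the detached actor rather than this worker.
ref = ray.put("shared data", _owner=owner)

# Wrap the ref in a list so it is passed as a reference, not resolved to its value.
ray.get(owner.keep.remote([ref]))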

@ericl ericl added enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Dec 5, 2020
@fishbone
Contributor

fishbone commented Dec 6, 2020

I have a concern about this method.
In this way, it looks like we treat the global storage as a key-value (with schema) store: after running ray.util.put_global_ref(obj_ref, lookup_key="objfoo1") we store objfoo1 -> obj_ref in the storage, which might lead to memory leaks. It seems to make memory management complicated, unlike an actor, which releases all of its resources when it terminates. Besides, if we do it this way, we probably want to introduce namespaces to avoid conflicts and to prevent an object from being accidentally updated by other apps.

If we have a global actor per group of tasks that serves this purpose, will it just work? Would it be simpler? Do we still need to support moving ownership for this?

@ericl
Contributor Author

ericl commented Dec 7, 2020

I agree binding the lifetime of an object to an actor might be the way to go. The way I see it there are probably a few use cases:
(1) pin object permanently, manual deletion
(2) allow object to be evicted if under memory pressure (caching case)
(3) bind lifetime of object to another actor/task (similar to ownership transfer)

Perhaps we could create the following API? We may not need to give objects an explicit name, just allow their lifetime to be changed:

ray.util.set_object_lifetime(obj_ref, lifetime="detached"|"lru"|<actorRef>)

# can only implement <actorRef> case initially

Btw, actually doing ownership transfer is pretty hard, since the owner metadata is propagated inline with the object reference. You'd need to add some kind of transfer table to handle that. Assuming the number of objects that require transfer is quite small, it would be easier to instead only allow ownership to be transferred to the GCS. This is simpler since it means owner lookups can always fall back to GCS on owner failure.
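To make the <actorRef> case concrete, usage might look like the following (purely a sketch of the proposed API; nothing here is implemented, and "shared_cache" is just a hypothetical named detached actor):

cache_actor = ray.get_actor("shared_cache")   # existing named detached actor
ref = ray.put({"features": [1, 2, 3]})

# Proposed: the object now fate-shares with the cache actor instead of this worker.
ray.util.set_object_lifetime(ref, lifetime=cache_actor)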

@ericl ericl added P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Dec 8, 2020
@fishbone
Contributor

I think my concern here is two things: 1) how to manage all the detached variables; 2) how to do better isolation between groups of jobs.

how to manage all the detached variables
We plan to run Ray in a cluster, and developers can submit their jobs to the cluster. If someone creates many detached variables using (1), it could prevent other jobs from running due to OOM. And it's not easy to recover from this since we don't have extra information about all these detached variables.
Besides, if we do ray.util.set_object_lifetime(obj_ref, lifetime="detached"|"lru"|<actorRef>), how can another job get the obj_ref? It should be a variable shared between jobs, but if one job exits, how can it be transferred from one to another?

Here I assume a job is a .py file that is executed by ray submit job.py or python job.py. Please let me know if I'm wrong.

how to do better isolation between groups of jobs
If we use the variable name to communicate between jobs (if I understand the concept of a job correctly), the variable name will be a global name visible to all jobs in the cluster. If we have a series of jobs and run them with different inputs, there will be conflicts.

I thought of introducing job groups or something else, but then it looks like one driver job with several sub-jobs. So sub-jobs (separate .py files) can run within the driver job (main .py file), and the driver job will hold a reference to the shared data, which would work.

I think I'm missing something here. Please share more details.

@ericl
Contributor Author

ericl commented Dec 10, 2020

Another user asked for this in #12748.

@ericl
Contributor Author

ericl commented Dec 10, 2020

If someone creates many detached variables using (1), it could prevent other jobs from running due to OOM. And it's not easy to recover from this since we don't have extra information about all these detached variables.

This isn't something we need to worry about though--- if the user pins too much, it's their problem. With an autoscaling cluster, we can also spill / add more nodes to handle this case.

Besides, if we do ray.util.set_object_lifetime(obj_ref, lifetime="detached"|"lru"|<actorRef>), how can another job get the obj_ref? It should be a variable shared between jobs, but if one job exits, how can it be transferred from one to another?

One way is that the object refs can be transferred through a named actor. Perhaps we can try this to see if it satisfies use cases before adding another API.
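A rough sketch of that hand-off pattern (names are made up, and it assumes the _owner argument from the issue description so the object outlives the producing job):

@ray.remote
class RefHolder:
    def __init__(self):
        self.refs = {}

    def put(self, key, refs):
        self.refs[key] = refs

    def get(self, key):
        return self.refs[key]

# Job 1: create a named detached actor, make it the owner, and register the ref.
holder = RefHolder.options(name="ref_holder", lifetime="detached").remote()
ref = ray.put(list(range(1000)), _owner=holder)
ray.get(holder.put.remote("dataset", [ref]))   # wrap in a list so the ref itself is stored

# Job 2 (a different driver): look the actor up by name and fetch the ref.
holder = ray.get_actor("ref_holder")
[ref] = ray.get(holder.get.remote("dataset"))
data = ray.get(ref)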

@ericl
Contributor Author

ericl commented Dec 10, 2020

I thought of introducing job groups or something else, but then it looks like one driver job with several sub-jobs. So sub-jobs (separate .py files) can run within the driver job (main .py file), and the driver job will hold a reference to the shared data, which would work.

Btw, in the RayDP/Spark on Ray use case, the variable just needs to be transferred within the job, so there isn't a multi-job use case yet. For caching use cases, the variables can be transferred to named actors that are global across many jobs. So I think changing the lifetime is sufficient for both of these cases, but perhaps there are others that require more functionality.

@fishbone
Contributor

I think I misunderstood some parts. So let me just summarize it here and please let me know if I was wrong:

  • Having a named actor shared across jobs to exchange data between them
  • If the user wants cross-job communication, pass the ownership to that actor
  • Multiple eviction policies can be supported here (detached, lru, binding it to another job/actor)

Another thing I'd like clarified is: what's a job? Is it a Python script? Some demo code showing what needs to be achieved would help me understand.

This isn't something we need to worry about though--- if the user pins too much, it's their problem. With an autoscaling cluster, we can also spill / add more nodes to handle this case.

I'm thinking of a cluster where multiple users will want to submit jobs. If someone uses too much and we have no way to group the related variables, it's going to be a mess. But if it goes through a named actor, and different users/jobs have different named actors, then it's OK. We can limit memory usage through these actors to protect the cluster.

@edoakes edoakes added the serve Ray Serve Related Issue label Dec 11, 2020
@ericl
Contributor Author

ericl commented Dec 11, 2020 via email

@DonYum

DonYum commented Dec 12, 2020

Both @ericl's and @ahbone's opinions are impressive.

As @ahbone says, it looks like the management of detached objects shouldn't be pushed onto the user; it would clutter user code.
In practice, however, we do need a way to create long-lived objects.

Yes, using a detached named actor to build a 'detached object' is one method. But it's too complicated, and it's implicit.

As a Ray novice, I spent hours searching the docs, issues, and discussions for a way to 'detach' a plasma object, but failed.
I had to submit a new issue, #12748, which led me here.
The process was painful. Along the way, I kept thinking: why not provide such an API? Is there a technical problem?
...

In my opinion, Ray core is like a toolbox: providing enough APIs is our responsibility, and how users use them is a separate matter that we need not constrain too much.

Hope my ideas will help.
Thanks a lot!

@fishbone
Contributor

@DonYum Thanks for your ideas here, and I agree this is a useful feature to support.
My concern here is not about whether to provide such an API/feature, but about how we should best support it. Early discussion will save us effort in the future.
What worries me here is how to manage all these long-lived objects. Suppose a lot of long-lived objects have been created and the user now wants to free some of them through the Ray client; it's hard to know which ones to free since they are all just ids (maybe I'm wrong, I'm not an experienced Ray user). It would often be useful to free all the global variables created by a job once they are no longer needed. Binding them to a named actor sounds to me like a potential way to go.

@ericl For RayDP, you mentioned that:

So in the Spark on Ray case, all actors would run within a single job.

In that case, it looks like all objects can be passed back to the driver. Why would detached objects help here?
If you have some sample code for the use case we'd like to support, it would be helpful.

@ericl
Contributor Author

ericl commented Dec 16, 2020

Perhaps we don't allow "true" global objects in the first iteration, to avoid the problems mentioned above. If an object can only be transferred to another actor, then that would solve the RayDP use case (single job) and also allow other use cases which span multiple jobs.

In that case, it looks like all objects can be passed back to the driver. Why would detached objects help here?

The issue is that the objects are still owned by the producing worker (the one that calls ray.put). If that worker is killed, the object is lost, even if the reference is passed back to the driver.

>>> @ray.remote
... class A:
...    def __init__(self):
...       self.ref = ray.put("hi")
...    def get(self):
...       return self.ref
... 
>>> 
>>> a = A.remote()
>>> ref = ray.get(a.get.remote())
>>> ray.get(ref)
'hi'
>>> ray.kill(a)
>>> ray.get(ref)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/eric/Desktop/ray/python/ray/worker.py", line 1378, in get
    raise value
ray.exceptions.ObjectLostError: Object ffffffffffffffffa67dc375e60ddd1a23bd3bb90100000002000000 is lost due to node failure.

@fishbone
Contributor

Thanks for the explanation. I'll spend some time to check how to support this version.

@fishbone
Contributor

fishbone commented Jan 24, 2021

@ericl, sorry for not making progress for a while. I recently picked it up again.

I'm wondering whether this will make RayDP's case and other cases work: ref = ray.get(a.get.remote(), clone=True)
Underlying this get, we'd make ref an object created by the calling task, so its life cycle is independent of the actor.

Since the stored value is immutable, whenever we need an updated value we have to call the actor's get again, which will generate a new ref.

Btw, actually doing ownership transfer is pretty hard, since the owner metadata is propagated inline with the object reference.

We don't need to take care of this, since ref references an object owned by the caller itself, and all later copies are of ref.

Please let me know whether it's the correct direction to go or not.

@ericl
Contributor Author

ericl commented Jan 26, 2021

I'm wondering whether this will make RayDP's case and other cases work: ref = ray.get(a.get.remote(), clone=True)
Underlying this get, we'd make ref an object created by the calling task, so its life cycle is independent of the actor.

Is this the same thing as ray.put(ray.get(a.get.remote())) -- making a copy? It could be another solution, though the cost is that we'd use twice the memory for a short period of time.

@fishbone
Contributor

fishbone commented Jan 27, 2021

I double-checked your sample code and realized that I misunderstood it. It's about returning the ObjectRef directly from the remote call. I've updated the API a little. I'm trying to avoid introducing an API that turns the system into a global key-value store, and instead make it more like a variable whose ownership can be shared/moved.

>>> @ray.remote
... class A:
...    def __init__(self):
...       self.ref = ray.put("hi")
...    def get(self):
...       return self.ref
... 
>>> 
>>> a = A.remote()
>>> ref = ray.get(a.get.remote()).shared()
>>> ray.get(ref)

Here, we give ObjectRef an API, maybe called shared(); the returned ref is a local object id, and inside the GCS we have a link from ref -> loc.

And here we need to update the raylet to make an object shareable by multiple tasks. I haven't checked how to do this yet, but I feel this makes the API look cleaner, and the major updates should sit in the raylet. As for inline values, since they're small and need to be copied to remote tasks anyway, it should be OK to create a new object. If we do this:

shared_ref = ray.get(a.get.remote()).shared()
ray.get(shared_ref)

ref = ray.get(a.get.remote())
ray.get(ref)

There will be a duplicate; otherwise, there is no extra memory cost. Correct me if I'm wrong here.

So most of the updates should be in raylet. Here are some thoughts from me:

  • When ref = ray.get(a.get.remote()).shared() is executed, the raylet will trigger a copy and pull the data from the remote node.
  • The object stored in the raylet will carry information like owners: [task1, task2]. The object is eligible for eviction if none of the owners is local.
  • When task1 finishes, it notifies the raylet, which removes task1 from the owner list there.

Some of these steps may be hard to do, so please leave comments.

@fishbone
Contributor

fishbone commented Feb 6, 2021

@ericl Here is my plan to support this

  • Update the plasma store to support two objects sharing the same entry. The LRU policy needs to be updated to operate at the entry level. After this, we have ref sharing at the low level.
  • Introduce an API in the plasma store to make an object share another object's entry.
  • Update the GCS to include the sharing information. The shared obj_refs are equal to each other except that their ownership differs; they are symmetric.
  • Update the Python/Java APIs to support these semantics.

So the overall flow might look like

  • Python/Java: triggered by a_ref = b_ref.shared()
  • Put a into the GCS with the current node as owner; add a to b's sharing list and b to a's sharing list.
  • Start downloading b's data locally.
  • After the download finishes, make a share b's entry in the plasma store.
  • Do this for the nested refs in b too.

In this way, the objects' life cycle stays the same as before. We also avoid a duplicate low-level copy, and we lay the groundwork for ownership sharing. Maybe ownership sharing is enough for now; we can call delete on the first owner to support ownership transfer.

If you think it's good to go, I'll start implementing it.

@ericl
Contributor Author

ericl commented Feb 6, 2021

Hmm this seems a little complicated, I'm not sure we should be changing plasma here or introducing entry sharing as a concept. How about we outline a couple different design options before coming to a conclusion?

I was originally thinking this would be a simple change, where the object is simply not deleted when the owner dies. If it turns out to be much harder then the feature might not be worth it.

Btw: there is a new ownership-based object directory that might conflict with this change: #12552

@fishbone
Contributor

fishbone commented Feb 6, 2021

@ericl
Updating the plasma store to support object sharing at the low level is not that complicated. We need to change the ObjectTable so that the value part is a raw pointer, and eviction needs to be managed at the entry level. Right now it's the ObjectId in the hash table with a linked list; we need to update it so the entry pointer is in the hash table with a linked list. The overall update in this module is OK.

The sharing info tells the lower-level store that it doesn't need to create a new entry and can instead share the existing one. I haven't figured out how to ingest this sharing info into the GCS or the new ownership-based object directory.

I was originally thinking this would be a simple change, where the object is simply not deleted when the owner dies. If it turns out to be much harder then the feature might not be worth it.

I thought about this before. But if this node (A) passes this obj_ref to another node (B), B needs to download this info from the owner (X), which is dead. So B also needs the information that the obj_ref is owned by X but pinned in B.
Without this information, B will see that X is dead and will throw errors, as in the current design. But with this info, it starts to look like object sharing, which could be implemented in a better way.

If we want to support pinning an object locally, with visibility limited to the current node, it's a simpler change: we just don't delete the pinned object even if the owner is dead.
But if we'd like to implement ownership sharing, we need to make more changes.

@ericl
Contributor Author

ericl commented Feb 6, 2021

I thought about this before. But if this node (A) passes this obj_ref to another node (B), B needs to download this info from the owner (X), which is dead. So B also needs the information that the obj_ref is owned by X but pinned in B.

You can fall back to the GCS in this case right? The owner can respond that the object is owned by the GCS if it's up; if it's down the caller can auto-retry at the GCS just in case.

@fishbone
Contributor

fishbone commented Feb 6, 2021

@ericl Maybe I missed some parts of the system. I thought the owner needs to be up to access that object.

  • ref is owned by X
  • A gets ref and pins it locally.
  • X then gets killed.
  • A can still access the data pointed to by ref.
  • A passes ref to B.
  • B wants to access ref, and under the current design it needs the owner to be up, since the ref fate-shares with the owner. X is a task doing some work, for example an accumulation; it finished its job and wants to pass the result to the driver, so X will not be up anymore after returning the result to the driver.

@ericl
Contributor Author

ericl commented Feb 6, 2021

B wants to access ref, and under the current design it needs the owner to be up, since the ref fate-shares with the owner. X is a task doing some work, for example an accumulation; it finished its job and wants to pass the result to the driver, so X will not be up anymore after returning the result to the driver.

Right, but we can change the metadata resolution protocol to fall back to querying the GCS if X is down. So in this case, B can fall back to querying the GCS before giving up on resolving the object metadata.

Btw, another related use case is that we might consider transferring object ownership to the GCS if it's ever spilled to external storage such as S3 (cc @rkooo567 @clarkzinzow ). This would allow the object to be retrieved even if the owner has been killed for some reason (might be useful for large scale distributed shuffle use cases). In this case, we would want to automatically transfer ownership to the GCS on spill without any API call from the user.

@fishbone
Contributor

fishbone commented Feb 6, 2021

@ericl

Right, but we can change the metadata resolution protocol to fall back to querying the GCS if X is down. So in this case, B can fall back to querying the GCS before giving up on resolving the object metadata.

But even if it falls back to querying the GCS, it still can't get the actual data since the owner is dead.

I feel that we are talking about different scenarios.

  1. One is to transfer/share ownership from one node to another. For example, a complicated actor finishes a task, decides to pass the locally created object to the driver, and is terminated afterwards. This is the case I was thinking about.
  2. Another is to improve availability. We'd like to relax the fate-sharing restriction: if a ref is used on some node and the object is available somewhere in the cluster (or locally), we take that copy even if the owner is down. We assume the owner will come back up eventually. In this case, the new API ray.util.set_object_lifetime(obj_ref, lifetime="detached"|"lru"|<actorRef>) is a way to modify the behavior of the eviction policy, and we always query the local store first and the owner later.

Btw, another related use case is that we might consider transferring object ownership to the GCS if it's ever spilled to external storage such as S3 (cc @rkooo567 @clarkzinzow ). This would allow the object to be retrieved even if the owner has been killed for some reason (might be useful for large scale distributed shuffle use cases). In this case, we would want to automatically transfer ownership to the GCS on spill without any API call from the user.

Is this one also case 2?

@ericl
Contributor Author

ericl commented Feb 6, 2021

But even if it falls back to querying the GCS, it still can't get the actual data since the owner is dead.

If we transfer the metadata to the GCS this would fix that right? The protocol I'm imagining is:

Original owner:

  1. Transfer the metadata to the GCS
  2. Forward any future queries to the GCS

Resolver:

  1. First try the original owner
  2. If the owner says the object is transferred to the GCS, or the owner is down, retry at the GCS
  3. If the GCS says it doesn't know about the object, then an error is raised.
  4. Get the data (it should be available unless the node fails). We aren't worrying about the node failure case --- that could be handled with spilling.
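In Python-style pseudocode, the resolver side would look roughly like this (purely illustrative; the query callables are stand-ins for the real owner/GCS RPCs, and none of this is an existing API):

def resolve_object_locations(object_id, owner_addr, query_owner, query_gcs):
    # Step 1: ask the original owner first.
    try:
        reply = query_owner(owner_addr, object_id)
        if reply is not None and not reply.get("transferred_to_gcs"):
            return reply["locations"]
    except ConnectionError:
        pass  # owner is down; fall through to the GCS
    # Step 2: the owner transferred the object (or is unreachable), so retry at the GCS.
    reply = query_gcs(object_id)
    if reply is None:
        # Step 3: the GCS doesn't know about the object either.
        raise RuntimeError(f"Object {object_id} is lost")
    # Step 4: return the locations so the caller can go fetch the data.
    return reply["locations"]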

One is to transfer/share ownership from one node to another. For example, a complicated actor finishes a task, decides to pass the locally created object to the driver, and is terminated afterwards. This is the case I was thinking about.

This seems like a narrower case than general ownership transfer to the GCS. Note that nodes don't own objects; worker processes within the node own objects. The problem here is not a node dying, it's the ability to kill a worker/actor.

Another is to improve availability. We'd like to relax the fate-sharing restriction: if a ref is used on some node and the object is available somewhere in the cluster (or locally), we take that copy even if the owner is down. We assume the owner will come back up eventually. In this case, the new API ray.util.set_object_lifetime(obj_ref, lifetime="detached"|"lru"|<actorRef>) is a way to modify the behavior of the eviction policy, and we always query the local store first and the owner later.

Hmm I don't quite get this one, once the owner is dead, it will never come back right?

Is this one also case 2?

I think it's neither; it's transferring ownership to the GCS.

@fishbone
Contributor

fishbone commented Feb 6, 2021

@ericl Thanks for the clarification. It looks like we are talking about the same problem. The only difference is about who should own that object. Let me summarize our ideas below, and please correct me if I was wrong.

First one:

  • GCS will be the owner of the object.
  • The ObjectId remains the same as before, so objects stay stored on the node even if the original owner (task) is no longer there.
  • The resolver queries the owner first and then falls back to the GCS to find where the object is.
  • We can't evict the object on the node where the original owner lived, since the GCS owns it. Some garbage collection or explicit deletion is needed.

Things that need to be updated:

  • A way to transfer the ownership of an object to the GCS
  • Update the resolver to fall back if the owner is dead
  • Take care of object life-cycle management. It used to be done by the original owner; now the GCS should take this responsibility.

Second one:

  • A task can shadow-copy an obj_ref from another node. This task will be the owner of the new object.
  • We then have two object_ids pointing to the same object but owned by different tasks/actors.
  • The resolver will check whether the object is on the local node (also checking the shared ones). Everything else remains the same.
  • The lifetime of the objects stays the same as before.

Things that need to be updated:

  • Make the storage layer manage objects at the entry level instead of by ID. The same goes for the external storage used for spilling.
  • When an id is not locally available, the resolver should also check whether the shared ids are local, to avoid copying.
  • Maintain a new field in the GCS object table for the sharing ids.

My opinion on the pros and cons:

  • First one: 1) the ObjectId remains the same, so after running for a while and deciding to kill an actor, we can transfer ownership to the GCS; 2) the underlying storage doesn't need to be updated; it stays the same as before, rows are managed by key, and no two keys can reference the same value; 3) the fallback has overhead, especially when the object is retrieved frequently by different tasks; 4) the definition of an object's lifetime changes, and we need a way to manage it in the raylet, whereas it used to be managed in the task.
  • Second one: 1) the ObjectId is different, but it shares the same underlying data, so users need to call shared() up front; 2) the storage layer needs to be updated to manage data per entry and to support different keys referring to the same value; 3) there is no fallback, but we need to check whether the shared objects have already been downloaded; 4) the definition of object lifetime stays the same, so we don't need to update this.

I feel the complexity of transferring ownership to the GCS comes from object management, which would look different from before; that's probably OK. The fallback can cause some performance issues, but it might be OK depending on the use case.
The complexity of ownership sharing comes from the underlying storage layer. For the plasma store it's OK as far as I understand; I'm not sure about spilling to external storage.

@ericl
Contributor Author

ericl commented Feb 6, 2021

Hmm I see, basically you're proposing a fast logical copy which generates a new object id. @stephanie-wang any thoughts on this vs ownership transfer?

We could potentially also take a copy approach to spilling.

Does it make sense to investigate the difficulty of these two approaches more first?

@fishbone
Contributor

For anyone who might be lost in this long discussion, I have a summary for this thread. Please leave a comment there if you have any questions/suggestions.
cc @ericl @stephanie-wang @fschlimb @rkooo567 @DonYum

@fschlimb

fschlimb commented Feb 11, 2021

@iycheng Shouldn't a fourth solution be considered which simply treats an objref like a shared_ptr? E.g. nothing needed from the user, as long as one ObjRef exists the object will be available. This will probably imply moving objects to the GCS if the owner dies and additional effort to get the ref-counting right. As mentioned before, this seems to be the only fully composable option.

@fishbone
Contributor

@fschlimb The second one is quite similar to the one you mentioned. Making it the default for all variables introduces extra performance cost, so somewhere we need some kind of call to tell the system that it's a shared_ptr, not a unique_ptr. In the second solution, the caller does this. Alternatively, we could make the callee do this kind of work: it returns an ObjectRef with a 'shared bit' set to true, and on the caller side the sharing then happens automatically.

At the API level we can support this easily if we support the second proposal: one is explicitly shared and the other is shared by default.

@fschlimb

@iycheng Yes, I agree, these are closely related.

The point I am trying to make is that I do not see how 2 (or 1 and 3) is (are) composable. Without shared being the default, how would different packages orchestrated into a larger program synchronize? I see only two practical options: 1. all producers explicitly declare all outgoing refs shared, or 2. the consumers always convert all incoming objrefs to shared. Both are pretty fragile/error-prone and cumbersome.

@fishbone
Contributor

@fschlimb I'm not an expert in Ray usage patterns, so correct me if I'm wrong. From my understanding, the actor will be killed by the user (not sure :p), and the user knows which objects will be used after the kill. So in this case, the user only needs to make those particular objects shared.

One difference between (2) and (4) is that (2) always generates a new object id while (4) reuses the old one. The former fits into the current framework easily, and we only pay overhead for the objects that need it.

Making it the default is definitely one way to do this at the API level, but we'd need to revisit the reference-counting algorithm and memory management. Shared objects are pinned on the node to avoid eviction, so if sharing is the default for everything, it will waste a lot of memory.

@guykhazma
Contributor

Hi,

Is this issue still active? Was there any progress?

Thanks

@kira-lin
Contributor

Hi all,
First, thank you all for working on this issue! I am from the RayDP team. This feature is very important for our project, so we want to settle on a solution and implement it.

I have read the proposals. @ericl @iycheng Has there been any progress since then? All of them would work for us, but some are harder to implement than others. So far the proposals assume the objects are already in plasma, but is it possible for a worker to put an object into plasma on behalf of another worker? In other words, we could add an API like ray.put(obj, owner=another_worker), where another_worker could be a global named worker. Setting the ownership info correctly in the first place might save us from having inconsistent copies. I'm not an expert on Ray core, so please correct me if I'm wrong.
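For example, something along these lines (just a sketch of the shape we have in mind; the names are made up, and whether owner= can take a named actor/worker handle is exactly the open question):

partition = b"...serialized Spark partition..."   # placeholder payload
owner = ray.get_actor("raydp_object_owner")       # a global named actor created elsewhere
ref = ray.put(partition, owner=owner)             # proposed: ownership assigned at put time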

Thanks

@ericl
Contributor Author

ericl commented May 31, 2021 via email

@edoakes edoakes removed the serve Ray Serve Related Issue label Oct 18, 2021
@ericl ericl moved this to Planned in Ray Core Public Roadmap Nov 11, 2021
@ericl ericl modified the milestones: Containerized Workers, Core Backlog Nov 12, 2021
@ericl ericl added P2 Important issue, but not time-critical and removed P2 Important issue, but not time-critical labels Nov 17, 2021
@ericl
Contributor Author

ericl commented Nov 18, 2021

I've updated the description given the current state of things. The proposal is to extend the ray.put(..., _owner=...) API to allow GCS-owned objects. In the future, we can then enable automatic transfer to the GCS.

@ericl ericl moved this from Planned to In discussion in Ray Core Public Roadmap Nov 20, 2021
@ericl ericl moved this from In discussion to Planned in Ray Core Public Roadmap Nov 23, 2021
@ericl ericl moved this from Planned to In discussion in Ray Core Public Roadmap Nov 23, 2021
@ericl ericl moved this from In discussion to Planned in Ray Core Public Roadmap Nov 23, 2021
@ericl ericl moved this from Planned to In discussion in Ray Core Public Roadmap Nov 23, 2021
@jovany-wang
Contributor

However, it would be nice to support ray.put(data, _owner="global") to avoid the need for that hack, and allow the object to be truly HA (e.g., tracked durably in HA GCS storage).

You mean the data is put into the backend storage? When and how do we clear it? By TTL or an explicit API?

@ericl
Contributor Author

ericl commented Dec 21, 2021

Only the metadata ownership would be handled by the GCS; everything else, including ref counting, remains the same.

@rkooo567 rkooo567 added the core Issues that should be addressed in Ray Core label Dec 9, 2022