[core] Support detached/GCS owned objects #12635
Comments
I have a concern about this method. If we have a global actor per group of tasks, and it serves this purpose, will it just work? Will it be simpler? Do we need to support moving ownership for this?
I agree binding the lifetime of an object to an actor might be the way to go. The way I see it, there are probably a few use cases. Perhaps we could create the following API? Perhaps we don't need to give objects an explicit name, just allow their lifetime to be changed:
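For illustration, a minimal sketch of what such a lifetime-changing call could look like (the `set_owner` function and its signature are hypothetical, not an existing Ray API):

```python
import ray

ray.init()

ref = ray.put({"weights": [1, 2, 3]})   # owned by this worker by default
cache = ray.get_actor("shared_cache")   # assumes a named detached actor exists

# Hypothetical API: re-bind the object's lifetime so it fate-shares with
# the actor (or the GCS) instead of the worker that called ray.put().
ray.experimental.set_owner(ref, owner=cache)
```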
Btw, actually doing ownership transfer is pretty hard, since the owner metadata is propagated inline with the object reference. You'd need to add some kind of transfer table to handle that. Assuming the number of objects that require transfer is quite small, it would be easier to instead only allow ownership to be transferred to the GCS. This is simpler since it means owner lookups can always fall back to the GCS on owner failure.
I think my concern here is two things: 1) how to manage all the detached variables; 2) how to do better isolation between groups of jobs.

On (1), how to manage all the detached variables: here I assume one job to be a [...].

On (2), how to do better isolation between groups of jobs: I thought of introducing job groups or something else, but then it looks like one driver job with several sub-jobs. So sub-jobs (separate .py files) can run in the driver job (main .py file), and the driver job will have a reference to the shared data, which will work. I think something is missing here in my brain. Please share more details about this.
Another user asks about this in #12748.
This isn't something we need to worry about though--- if the user pins too much, it's their problem. With an autoscaling cluster, we can also spill / add more nodes to handle this case.
One way is that object refs can be transferred through a named actor, as in the sketch below. Perhaps we can try this to see if it satisfies the use cases before adding another API.
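A minimal sketch of that workaround with the existing API (the actor and key names are illustrative):

```python
import ray

# A named, detached actor that holds ObjectRefs so other workers/jobs can
# look them up by name.
@ray.remote
class RefHolder:
    def __init__(self):
        self._refs = {}

    def put(self, key, ref_list):
        # The ref is wrapped in a list so Ray passes the ObjectRef itself
        # rather than resolving it to its value before the method runs.
        self._refs[key] = ref_list[0]

    def get(self, key):
        return self._refs[key]

ray.init()
holder = RefHolder.options(name="ref_holder", lifetime="detached").remote()
ref = ray.put("shared payload")
ray.get(holder.put.remote("payload", [ref]))

# Elsewhere: look the actor up by name and fetch the ref.
holder2 = ray.get_actor("ref_holder")
shared_ref = ray.get(holder2.get.remote("payload"))
print(ray.get(shared_ref))  # "shared payload"
```

Note that this only shares the reference; as discussed below, the object itself still fate-shares with the worker that created it.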
Btw, in the RayDP/Spark on Ray use case, the variable just needs to be transferred within the job, so there isn't a multiple-jobs use case yet. For caching use cases, the variables can be transferred to named actors that are global across many jobs. So I think changing the lifetime is sufficient for both of these cases, but perhaps there are others that require more functionality.
I think I misunderstood some parts. So let me just summarize it here and please let me know if I was wrong:

- Having a named actor shared across jobs to exchange data between them
- If the user wants to do cross-job communication, pass the ownership to the actor
- Multiple eviction policies can be supported here (detached, LRU, binding it to another job/actor)
Another thing I'd like clarified: what's a job? Is it a Python script? Some demo code about what needs to be achieved would help me understand it.
I'm thinking of a cluster where multiple users want to submit jobs. If someone uses too much and we have no way to group the related variables, it's going to be a mess. But if it's done through named actors and different users' jobs have different named actors, then it's OK. We can limit memory usage through these actors to protect the cluster.
That sounds right to me. A couple clarifications: a Python script/driver defines a job, since we generate a unique job on connecting to a cluster. So in the Spark on Ray case, all actors would run within a single job.

Re: memory usage, the isolation here is not very well developed, but note this is a general problem in Ray and not something specific to detached objects. Hence, it is out of scope of this feature itself.
Both @ericl's and @ahbone's opinions are impressive. As @ahbone says, management of detached objects shouldn't be handed to the user; it would clutter the user's code. Yes, using a detached named actor to emulate 'detached objects' is one method, but it's too complicated, and implicit. As a Ray novice, I spent hours searching the docs/issues/discussions for a way to 'detach' a plasma object, and failed. In my opinion, [...] Hope my ideas will help.
@DonYum Thanks for your ideas here, and I agree this is a useful feature to support. @ericl For RayDP, as you mentioned:

In this way, it looks like all objects can be passed back to the driver. Why would detached objects help here?
Perhaps we don't allow "true" global objects in the first iteration, to avoid the problems mentioned above. If an object can only be transferred to another actor, that would solve the RayDP use case (single job) and also allow other use cases that span multiple jobs.

The issue is that the objects are still owned by the producing worker (the one that calls ray.put()). If that worker is killed, the object is lost, even if the reference was passed back to the driver.
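A small sketch of that failure mode (the actor and payload are illustrative):

```python
import ray
from ray.exceptions import RayError

@ray.remote
class Producer:
    def make(self):
        # The returned object is owned by this actor's worker process.
        return [ray.put("big payload")]

ray.init()
producer = Producer.remote()
(ref,) = ray.get(producer.make.remote())

ray.kill(producer)  # the owner dies
try:
    ray.get(ref)
except RayError as e:
    # Fails with an owner-died/object-lost error, even though the driver
    # still holds the reference.
    print(f"object lost: {e}")
```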
Thanks for the explanation. I'll spend some time checking how to support this version.
@ericl, sorry for not making progress for a while; I recently picked this up again. I'm wondering whether this will make RayDP's case and other cases work: since the value stored is immutable, whenever we need an updated value, we need to call [...].

We don't need to take care of this, since [...]. Please let me know whether it's the correct direction to go or not.
Is this the same thing as [...]?
I double-checked your sample code, and I realized that I misunderstood it. It's about returning [...].

Here, if we give [...], and here, we need to update [...].

There will be a duplicate; otherwise, there is no extra memory cost. Correct me if I was wrong here. So most of the updates should be in [...].

Maybe some step is hard to do, so please give me some comments.
@ericl Here is my plan to support this: [...]

So the overall flow might look like: [...]

In this way, the objects' life-cycle stays the same as before, we avoid a duplicate low-level copy, and we provide the fundamental support for ownership sharing. Maybe ownership sharing is enough for now; we can call delete on the first owner to support ownership transferring. If you think it's good to go, I'll start implementing it.
Hmm, this seems a little complicated; I'm not sure we should be changing plasma here or introducing entry sharing as a concept. How about we outline a couple of different design options before coming to a conclusion? I was originally thinking this would be a simple change, where the object is simply not deleted when the owner dies. If it turns out to be much harder, then the feature might not be worth it. Btw: there is a new ownership-based object directory that might conflict with this change: #12552
@ericl The sharing info tells the lower-level store that it doesn't need to create an entry; instead, it can share the existing one. I haven't figured out how to ingest this sharing info in the GCS or the new ownership-based object directory.

I thought about this before. But if this node (A) passes the obj_ref to another node (B), B needs to download this info from the owner (X), which is dead. So B also needs the information that the obj_ref is owned by X but pinned in B. If we want to support pinning an obj locally with visibility limited to the current node, it'll be a simpler change: we just don't delete the pinned obj even if the owner is dead.
You can fall back to the GCS in this case, right? The owner can respond that the object is owned by the GCS if it's up; if it's down, the caller can auto-retry at the GCS just in case.
@ericl Maybe I missed some parts of the system. I thought the owner needs to be up to access that object.
Right, but we can change the metadata resolution protocol to fall back to querying the GCS if X is down. So in this case, [...]

Btw, another related use case is that we might consider transferring object ownership to the GCS if it's ever spilled to external storage such as S3 (cc @rkooo567 @clarkzinzow). This would allow the object to be retrieved even if the owner has been killed for some reason (might be useful for large-scale distributed shuffle use cases). In this case, we would want to automatically transfer ownership to the GCS on spill without any API call from the user.
But even if it falls back to querying the GCS, it still can't get the actual data since the owner is dead. I feel we are talking about different scenarios.

Is this one also case 2?
If we transfer the metadata to the GCS this would fix that, right? The protocol I'm imagining is:

Original owner: [...]

Resolver: [...]
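Based on the fallback described earlier in the thread, the resolver side might look something like this sketch (all names are hypothetical, not actual Ray internals):

```python
class OwnerUnreachable(Exception):
    """Hypothetical error: the owner worker did not respond."""

def resolve_object(object_id, owner_client, gcs_client):
    """Sketch: resolve object metadata, falling back to the GCS."""
    try:
        reply = owner_client.get_metadata(object_id)
        if reply.get("owned_by_gcs"):
            # The owner is up and reports that ownership was transferred.
            return gcs_client.get_metadata(object_id)
        return reply
    except OwnerUnreachable:
        # Owner is down: auto-retry at the GCS in case ownership was
        # transferred (e.g., on spill) before the owner died.
        return gcs_client.get_metadata(object_id)
```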
This seems like a narrower case than general ownership transfer to the GCS. Note that nodes don't own objects; worker processes within the node own objects. The problem here is not a node dying, it's the ability to kill a worker actor.

Hmm, I don't quite get this one; once the owner is dead, it will never come back, right?

I think it's neither; it's transferring ownership to the GCS.
@ericl Thanks for the clarification. It looks like we are talking about the same problem; the only difference is who should own that object. Let me summarize our ideas below, and please correct me if I was wrong.

First one: [...]

Things that need updating: [...]

Second one: [...]

Things that need updating: [...]

My opinion on the pros and cons: [...]

I feel the complexity of transferring ownership to the GCS comes from object management, which looks different from before. Probably it's OK. The fallback can cause some performance issues, but that might be OK depending on the use case.
Hmm, I see; basically you're proposing a fast logical copy which generates a new object id. @stephanie-wang any thoughts on this vs. ownership transfer? We could potentially also take a copy approach to spilling. Does it make sense to investigate the difficulty of these two approaches more first?
@iycheng Shouldn't a fourth solution be considered which simply treats an ObjectRef like a shared_ptr? I.e., nothing is needed from the user; as long as one ObjectRef exists, the object will be available. This will probably imply moving objects to the GCS if the owner dies, plus additional effort to get the ref-counting right. As mentioned before, this seems to be the only fully composable option.
@fschlimb The second one is almost identical to the one you mentioned, except that making it the default for all variables introduces extra performance cost. So somewhere, we need some kind of call to tell the system it's a shared_ptr, not a unique_ptr. In the second solution, the caller does this. Or we could make the callee do this kind of work: return an ObjectRef with the 'shared bit' set to true, and the caller side will just handle it automatically. At the API level we can do this easily if we support the second proposal. One is explicitly shared and the other is shared by default.
@iycheng Yes, I agree, these are closely related. The point I am trying to make is that I don't see how 2 (or 1 and 3) is (are) composable. Without shared being the default, how would different packages orchestrated into a larger program synchronize? I see only 2 practical options: 1. all producers explicitly declare all outgoing refs shared, or 2. the consumers always convert all incoming ObjectRefs to shared. Both are pretty fragile/error-prone and cumbersome.
@fschlimb I'm not an expert in Ray's usage patterns, so correct me if I'm wrong. From my understanding, the [...]. One difference between (2) and (4) is that (2) always generates a new object id while (4) reuses the old one. The first one fits into the current framework easily, and we'll only have overhead for the objects that need it. Making it the default is definitely one way to do this at the API level, but then we'd need to revisit the reference-counting algorithm and memory management: shared objects would be pinned on their node to avoid eviction, and if we make that the default for everything, it'll waste a lot of memory.
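To make the two directions concrete, a short hypothetical sketch (neither behavior exists in Ray today; `share` is an invented name):

```python
import ray

ray.init()
ref = ray.put(b"payload")

# (2) Explicit sharing: an opt-in call that creates a new, GCS-backed
# object id for the same data; only these objects pay the pinning cost.
shared_ref = ray.experimental.share(ref)  # hypothetical API

# (4) Sharing by default: every ObjectRef already behaves like a
# shared_ptr (the object lives while any reference exists, with the GCS
# taking over on owner death), so no call is needed, but every object
# pays the bookkeeping/pinning cost whether or not it is ever shared.
```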
Hi, is this issue still active? Was there any progress? Thanks
Hi all, first, thank you all for working on this issue! I am from the raydp team. This feature is very important in our project, so we want to settle on a solution and implement it.

I have read the proposals. @ericl @iycheng Is there any progress since then? All of them work for us, but some are harder to implement than others. So far the proposals assume objects are already in plasma, but is it possible for a worker to put an object into plasma on behalf of another worker? In other words, we could add an API like ray.put(obj, owner=another_worker). Here another_worker could be a global named worker. Changing the ownership info in the first place might save us from having inconsistent copies. I'm not an expert on Ray core, so please correct me if I'm wrong.

Thanks
We have a public design doc here on the future approach, which is in progress: https://docs.google.com/document/d/1Immr9m049Ikj8Pq3G06Csm8sHyhBvzpWENGB5lHPbOY/edit?usp=drivesdk
I've updated the description given the current state of things. The proposal is to extend the ray.put-with-owner API to allow GCS-owned objects. In the future, we can then enable automatic transfer to the GCS.
You mean the [...]?
Only the metadata ownership would be handled by the GCS--- everything else, including ref counting, remains the same.
Overview
In the current ownership model, all created objects are owned by some worker executing the job. This means, however, that when a job exits, all the objects it created become unavailable.
In certain cases it is desirable to share objects between jobs (e.g., a shared cache), without creating objects explicitly from a detached actor.
We currently have an internal API that allows the owner of an object created with ray.put() to be assigned to a specific actor, e.g.:
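A minimal sketch of that call (assuming a named detached actor serves as the owner; the variable names are illustrative):

```python
import ray

ray.init()
data = {"key": "value"}
owner = ray.get_actor("global_owner")  # assumes a named detached actor exists

# The object now fate-shares with `owner` instead of the current worker.
ref = ray.put(data, _owner=owner)
```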
The object created will fate-share with that actor instead of the current worker process, which means it's already possible to create global objects by creating a named detached actor and setting the owner to that actor.
However, it would be nice to support
ray.put(data, _owner="global")
to avoid the need for that hack, and allow the object to be truly HA (e.g., tracked durably in HA GCS storage).