[Data] RandomAccessDataset.multiget return unexpected values for missing keys. #44768

Open
sunyakun opened this issue Apr 16, 2024 · 1 comment · May be fixed by #44769
Labels
bug: Something that is supposed to be working; but isn't
data: Ray Data-related issues
P3: Issue moderate in impact or severity

Comments

@sunyakun

What happened + What you expected to happen

ray.data.RandomAccessDataset.multiget is expected to return None for missing records, but it actually returns an unexpected value for a missing key.

I found that PR #24825 updated _RandomAccessWorker.multiget to use np.searchsorted to speed up lookups. However, np.searchsorted returns the insertion point for a key that is not present, and multiget uses that result directly to fetch the row from the block without testing col[i] == key. A correct lookup needs that check, as in this code:

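import bisect

def find_index(column, x):
    # bisect_left gives the leftmost insertion point; it is a match only if the element there equals x.
    i = bisect.bisect_left(column, x)
    if i != len(column) and column[i] == x:
        return i
    return None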
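
The same check can be applied to a whole batch of keys. The following is only a rough sketch with plain NumPy, not the actual Ray code and not necessarily what the fix in #44769 does; `lookup_indices` and `multiget_sketch` are hypothetical names, and a single sorted array stands in for a block's key column.

```python
import numpy as np

def lookup_indices(sorted_keys, queries):
    """Return the index of each query in sorted_keys, or -1 if it is absent.

    Hypothetical helper for illustration; not part of Ray.
    """
    sorted_keys = np.asarray(sorted_keys)
    queries = np.asarray(queries)
    if len(sorted_keys) == 0:
        return np.full(queries.shape, -1, dtype=np.int64)
    # searchsorted returns insertion points, which exist even for missing keys.
    pos = np.searchsorted(sorted_keys, queries)
    # Clamp so an insertion point equal to len(sorted_keys) stays in bounds.
    safe = np.minimum(pos, len(sorted_keys) - 1)
    # A position is a real hit only if the stored key equals the query.
    found = sorted_keys[safe] == queries
    return np.where(found, safe, -1)

def multiget_sketch(sorted_keys, queries):
    """Return one row dict per query, or None where the key is missing."""
    sorted_keys = np.asarray(sorted_keys)
    idx = lookup_indices(sorted_keys, queries)
    return [None if i < 0 else {"item": int(sorted_keys[i])} for i in idx]

keys = np.arange(0, 1000, 2)

# Using the insertion points directly reproduces the wrong rows.
print(keys[np.searchsorted(keys, [1, 901])].tolist())  # [2, 902]

# With the equality check, missing keys map to None.
print(multiget_sketch(keys, [1, 901, 4, 2000]))  # [None, None, {'item': 4}, None]
```

The clamp matters for keys larger than every stored key, where the insertion point equals the array length and indexing with it directly would raise an IndexError.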

Versions / Dependencies

Ray: latest master
Python: 3.9.2
OS: linux

Reproduction script

import ray
import ray.data

kv_store = ray.data.from_items(
    [i for i in range(0, 1000, 2)]
).repartition(5).to_random_access_dataset(key="item", num_workers=1)

print(ray.get(kv_store.get_async(1)), ray.get(kv_store.get_async(901)))
# output: None None

print(kv_store.multiget([1, 901]))
# output: [{'item': 2}, {'item': 902}]

Issue Severity

None

@sunyakun sunyakun added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Apr 16, 2024
@sunyakun sunyakun changed the title [<Ray component: Data>] RandomAccessDataset.multiget return unexpected values for missing keys. [Data] RandomAccessDataset.multiget return unexpected values for missing keys. Apr 16, 2024
@tespent
Contributor

tespent commented Apr 16, 2024

I can reproduce this problem, and I have opened pull request #44769 to try to fix it.

@anyscalesam anyscalesam added the data Ray Data-related issues label Apr 16, 2024
@bveeramani bveeramani added P3 Issue moderate in impact or severity and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Apr 18, 2024

4 participants