-
Notifications
You must be signed in to change notification settings - Fork 95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Align extract_partitioning_index
logic with upstream shuffling
#60
Align extract_partitioning_index
logic with upstream shuffling
#60
Conversation
cc @VibhuJawa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initial review since this is a wip.
- I can test the fix out on larger datasets if you feel it's ready for testing.
- You might need to amend/update existing commits to include a signoff (commit -s), to pass the DCO check
Signed-off-by: rjzamora <[email protected]>
This PR adds a new tutorial to demonstrate data curation for PEFT use-cases. Signed-off-by: Mehran Maghoumi <[email protected]> Signed-off-by: rjzamora <[email protected]>
Signed-off-by: rjzamora <[email protected]>
d453b6e
to
1f28a35
Compare
@ayushdg - thanks for the review. Still don't have an evironment to test this myself. I will try to do that later today, but if it's easy for you to test it is very welcome on my end :) |
Can confirm I'm seeing expected results on the larger scale dataset with this fix. I'll run a few more tests on my end but it's generally looking good. |
* Move PII constants to a seperate file that does not import presidio/spacy and other GPU dependencies Signed-off-by: Ayush Dattagupta <[email protected]> * Add comment around import, move constant import to global scope Signed-off-by: Ayush Dattagupta <[email protected]> --------- Signed-off-by: Ayush Dattagupta <[email protected]>
@VibhuJawa - Still trying to fix my environment so I can confirm locally, but won't the |
In theory the shuffle gives incorrect results, but the dataset/num_partitions here is small enough that it doesn't impact the correctness of final results (duplicate documents detected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving, lets merge it in when we add the requested unit test.
Signed-off-by: rjzamora <[email protected]>
npartitions_right, | ||
npartitions_right, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having trouble wrapping my head around the "correct" way to test the batched case. However, my sense is that this test already covers the critical requirement that we are extracting a partitioning index that is consistent with ddf.shuffle
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I don't think this specific bug impacts the batched case any differently than the reproducer here so I think it should be good to go.
Signed-off-by: rjzamora <[email protected]>
Signed-off-by: rjzamora <[email protected]>
b0e41ab
to
647406f
Compare
Signed-off-by: rjzamora <[email protected]>
from nemo_curator.utils.fuzzy_dedup_utils.shuffle_utils import ( | ||
rearange_by_column_direct, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that I moved this import into merge_left_to_shuffled_right
, because shuffle_utils
currently requires cudf
/dask_cuda
, while other logic in this module does not.
Signed-off-by: rjzamora <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @rjzamora
npartitions_right, | ||
npartitions_right, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I don't think this specific bug impacts the batched case any differently than the reproducer here so I think it should be good to go.
…DIA#60) * update extract_partitioning_index with compat code Signed-off-by: rjzamora <[email protected]> * [Tutorials] Add a tutorial for PEFT data curation (NVIDIA#45) This PR adds a new tutorial to demonstrate data curation for PEFT use-cases. Signed-off-by: Mehran Maghoumi <[email protected]> Signed-off-by: rjzamora <[email protected]> * move compat code to _compat file Signed-off-by: rjzamora <[email protected]> * Only import PII constants during Curator import (NVIDIA#61) * Move PII constants to a seperate file that does not import presidio/spacy and other GPU dependencies Signed-off-by: Ayush Dattagupta <[email protected]> * Add comment around import, move constant import to global scope Signed-off-by: Ayush Dattagupta <[email protected]> --------- Signed-off-by: Ayush Dattagupta <[email protected]> * add unit test Signed-off-by: rjzamora <[email protected]> * add pytest.mark.gpu Signed-off-by: rjzamora <[email protected]> * move extract_partitioning_index import for now Signed-off-by: rjzamora <[email protected]> * test both cudf and pandas Signed-off-by: rjzamora <[email protected]> * spelling Signed-off-by: rjzamora <[email protected]> --------- Signed-off-by: rjzamora <[email protected]> Signed-off-by: Mehran Maghoumi <[email protected]> Signed-off-by: Ayush Dattagupta <[email protected]> Co-authored-by: Mehran Maghoumi <[email protected]> Co-authored-by: Ayush Dattagupta <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]>
Dask modified how
partitioning_index
is used for shuffling in dask/dask#10705 (included indask>=2023.12.1
). This PR modifiesextract_partitioning_index
to use the same logic.TODO: