Pull requests: NVIDIA/NeMo-Curator
Pull requests list
Create separate files for each deduplication class
gpuci
Run GPU CI/CD on PR
#389
opened Nov 22, 2024 by
sarahyurick
Fix GPU error messages for fuzzy deduplication
#387
opened Nov 22, 2024 by
sarahyurick
•
Draft
1 of 2 tasks
Fuzzy Dedup: Make skipping the False positive check the default
enhancement
New feature or request
gpuci
Run GPU CI/CD on PR
#386
opened Nov 21, 2024 by
ayushdg
2 of 3 tasks
Remove max_text_bytes_per_part
gpuci
Run GPU CI/CD on PR
#385
opened Nov 20, 2024 by
sarahyurick
Global cache_dir variable for exact, fuzzy, and semantic deduplication
gpuci
Run GPU CI/CD on PR
#384
opened Nov 19, 2024 by
sarahyurick
3 tasks done
Add blocksize to DocumentDataset.read_* that uses dd.from_map
#374
opened Nov 15, 2024 by
praateekmahajan
•
Draft
3 tasks
Synthetic data generation for Retriever Evaluation
#370
opened Nov 14, 2024 by
vinay-raman
3 tasks done
Convert translation_example.py into a Jupyter Notebook tutorial
#336
opened Oct 29, 2024 by
sarahyurick
•
Draft
Add READMEs to examples/ and nemo_curator/scripts directories
#332
opened Oct 28, 2024 by
sarahyurick
Add codepath for computing buckets without int conversion
enhancement
New feature or request
gpuci
Run GPU CI/CD on PR
#326
opened Oct 25, 2024 by
ayushdg
3 tasks done
Dapt data curation tutorial fuzzy and semantic dedupe
gpuci
Run GPU CI/CD on PR
#322
opened Oct 24, 2024 by
ruchaa-apte
Add blocksize to DocumentDataset.read_* that uses dask_cudf.read_*
#285
opened Oct 8, 2024 by
praateekmahajan
3 tasks
Added example notebook for translation with ct2 model.
documentation
Improvements or additions to documentation