[RE-OPENED ELSEWHERE] HuggingFace support for Domain Classifier #138
Minor comments.
Hi @VibhuJawa, this is ready for review. I can confirm the HuggingFace domain classifier produces the same results as our previous pipeline.
Mostly looks great to me. Just have some nits around type hints. Looks great to me.
Thanks for working on this @sarahyurick, code is so much cleaner and usable now
Thanks @VibhuJawa! Updated, LMK what you think.
One minor thing. Looks great, thank you so much!
tutorials/distributed_data_classification/distributed_data_classification.ipynb
Thanks @ryantwolf and @VibhuJawa! Should be ready for another review. The
Good on my end, thanks a bunch for this!
LGTM
* Stricter query planning checks with newer versions of dask Signed-off-by: Ayush Dattagupta <[email protected]>
* Add checks to tests/__init__ Signed-off-by: Ayush Dattagupta <[email protected]>
* Check sys.modules to ensure dask-expr is not enabled Signed-off-by: Ayush Dattagupta <[email protected]>
* Search for "dask_expr" in sys modules Co-authored-by: Richard (Rick) Zamora <[email protected]> Signed-off-by: Ayush Dattagupta <[email protected]>
* use dask_expr instead of dask-expr Signed-off-by: Ayush Dattagupta <[email protected]>
---------
Signed-off-by: Ayush Dattagupta <[email protected]>
Co-authored-by: Richard (Rick) Zamora <[email protected]>
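For readers following the `dask_expr` commits above, here is a minimal sketch of what a `sys.modules` check of this kind could look like. It is not the code from those commits: the function names are hypothetical, and treating "`dask_expr` has been imported" as a sign that query planning is enabled is an assumption taken from the commit titles. It also shows why the lookup key should be the import name `dask_expr` rather than the hyphenated package name.

```python
import sys


def query_planning_loaded(modules=None):
    """Return True if the dask_expr package appears among the imported modules.

    `modules` defaults to sys.modules; passing a plain dict keeps this testable.
    (Hypothetical helper, not taken from the repository.)
    """
    loaded = sys.modules if modules is None else modules
    # Keys in sys.modules are import names, so the underscore form is the one
    # to look for; the hyphenated name "dask-expr" would normally never match.
    return "dask_expr" in loaded


def assert_query_planning_disabled(modules=None):
    """Raise RuntimeError if dask_expr has already been imported (hypothetical guard)."""
    if query_planning_loaded(modules):
        raise RuntimeError(
            "dask_expr is loaded; query planning appears to be enabled."
        )
```

A guard like this would typically be called once, early in a package or test `__init__`, before any Dask DataFrame work starts.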
Signed-off-by: Sarah Yurick <[email protected]>
* Applying SEO Best Pratices (NVIDIA#104)
  * Rename CPUvsGPU.rst to cpuvsgpu.rst Signed-off-by: Andrew Schilling <[email protected]>
  * Rename DataCuration.rsts to datacuration.rsts Signed-off-by: Andrew Schilling <[email protected]>
  * Rename DistributedDataClassification.rst to distributeddataclassification.rst Signed-off-by: Andrew Schilling <[email protected]>
  * Rename DocumentDataset.rst to documentdataset.rst Signed-off-by: Andrew Schilling <[email protected]>
  * Rename Download.rst to download.rst Signed-off-by: Andrew Schilling <[email protected]>
  * Rename GpuDeduplication.rst to gpudeduplication.rst Signed-off-by: Andrew Schilling <[email protected]>
  * Rename KubernetesCurator.rst to kubernetescurator.rst Signed-off-by: Andrew Schilling <[email protected]>
  * Rename LanguageIdentificationUnicodeFormatting.rst to languageidentificationunicodeformatting.rst Signed-off-by: Andrew Schilling <[email protected]>
  * Rename PersonalIdentifiableInformationIdentificationAndRemoval.rst to personalidentifiableinformationidentificationandremoval.rst Signed-off-by: Andrew Schilling <[email protected]>
  * Rename QualityFiltering.rst to qualityfiltering.rst Signed-off-by: Andrew Schilling <[email protected]>
  * Rename TaskDecontamination.rst to taskdecontamination.rst Signed-off-by: Andrew Schilling <[email protected]>
  * Update index.rst Setting all RST files to lowercase names. Signed-off-by: Andrew Schilling <[email protected]>
  * Ignore docs for EOF fixer hook Signed-off-by: Ayush Dattagupta <[email protected]>
  ---------
  Signed-off-by: Andrew Schilling <[email protected]>
  Signed-off-by: Ayush Dattagupta <[email protected]>
  Co-authored-by: Ayush Dattagupta <[email protected]>
  Signed-off-by: Vibhu Jawa <[email protected]>
* Shuffle CC result on group before writing out (NVIDIA#110) Signed-off-by: Ayush Dattagupta <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]>
* Update index.rst (NVIDIA#113) Added links to tutorials Signed-off-by: jgerh <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]>
* first commit Signed-off-by: avinashvem <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]>
* mv under modules dir Signed-off-by: avinashvem <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]>
* first commit Signed-off-by: avinashvem <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]>
* mv under modules dir Signed-off-by: avinashvem <[email protected]> Signed-off-by: Vibhu Jawa <[email protected]>
* first commit Signed-off-by: Vibhu Jawa <[email protected]>
* mv under modules dir Signed-off-by: Vibhu Jawa <[email protected]>
* embed by cluster saved Signed-off-by: Vibhu Jawa <[email protected]>
* id map script Signed-off-by: Vibhu Jawa <[email protected]>
* test commit Signed-off-by: Vibhu Jawa <[email protected]>
* add id map script Signed-off-by: Vibhu Jawa <[email protected]>
* Cleanup compute_embeddings_crossfit.py Signed-off-by: Vibhu Jawa <[email protected]>
* Cleanup compute_embeddings_crossfit.py Signed-off-by: Vibhu Jawa <[email protected]>
* Pre-commit style fixes Signed-off-by: Vibhu Jawa <[email protected]>
* clustering_dask_crossfit.py Signed-off-by: Vibhu Jawa <[email protected]>
* Minor clean up to sort_clusters_crossfit.py Signed-off-by: Vibhu Jawa <[email protected]>
* cleanup semdedup_crossfit Signed-off-by: Vibhu Jawa <[email protected]>
* Remove undo changes Signed-off-by: Vibhu Jawa <[email protected]>
* Remove rename changes Signed-off-by: Vibhu Jawa <[email protected]>
* Fix rename Signed-off-by: Vibhu Jawa <[email protected]>
* Readme formatting Signed-off-by: Vibhu Jawa <[email protected]>
* add dask to semdedup_crossfit.py Signed-off-by: Vibhu Jawa <[email protected]>
* README.md updates Signed-off-by: Vibhu Jawa <[email protected]>
* README.md updates Signed-off-by: Vibhu Jawa <[email protected]>
* README.md updates Signed-off-by: Vibhu Jawa <[email protected]>
* README.md updates Signed-off-by: Vibhu Jawa <[email protected]>
* README.md updates Signed-off-by: Vibhu Jawa <[email protected]>
* configure max memory using a cli Signed-off-by: Vibhu Jawa <[email protected]>
* Dumb id results to parquet Signed-off-by: Vibhu Jawa <[email protected]>
* Embedding fixes Signed-off-by: Vibhu Jawa <[email protected]>
* README.md updates Signed-off-by: Vibhu Jawa <[email protected]>
* Working end to end Signed-off-by: Vibhu Jawa <[email protected]>
* Minor yaml fixes Signed-off-by: Vibhu Jawa <[email protected]>
* Undo changes to index.rst Signed-off-by: Vibhu Jawa <[email protected]>
* Update .pre-commit-config.yaml Signed-off-by: Vibhu Jawa <[email protected]>
* Update index.rst Signed-off-by: Vibhu Jawa <[email protected]>
* Update index.rst Signed-off-by: Vibhu Jawa <[email protected]>
* Undo changes to docs/personalidentifiableinformationidentificationandremoval.rst Signed-off-by: Vibhu Jawa <[email protected]>
* Update fuzzy_dedup.py Signed-off-by: Vibhu Jawa <[email protected]>
* Undo changes to docs/personalidentifiableinformationidentificationandremoval.rst Signed-off-by: Vibhu Jawa <[email protected]>
* Update index.rst Signed-off-by: Vibhu Jawa <[email protected]>
* Add end to end script in readme.md Signed-off-by: Vibhu Jawa <[email protected]>
* Add type hints Signed-off-by: Vibhu Jawa <[email protected]>
* Use dask for sort_clusters Signed-off-by: Vibhu Jawa <[email protected]>
* Make sort_clusters work on MNMG scales Signed-off-by: Vibhu Jawa <[email protected]>
* Cleaned up dask shutdown Signed-off-by: Vibhu Jawa <[email protected]>
* Decrease noise in E2E scripts Signed-off-by: Vibhu Jawa <[email protected]>
* Clean up scripts Signed-off-by: Vibhu Jawa <[email protected]>
* Fix scripts/end_to_end_script.sh Signed-off-by: Vibhu Jawa <[email protected]>
* Some more cleanup Signed-off-by: Vibhu Jawa <[email protected]>
* Add copyright Signed-off-by: Vibhu Jawa <[email protected]>
* Fix README.md Signed-off-by: Vibhu Jawa <[email protected]>
* Address reviews Signed-off-by: Vibhu Jawa <[email protected]>
* Make work with a SemDedupConfig Signed-off-by: Vibhu Jawa <[email protected]>
* Make work with SemDedupConfig Signed-off-by: Vibhu Jawa <[email protected]>
* Move to nemo-curator's logger Signed-off-by: Vibhu Jawa <[email protected]>
* Semdedup-extract_dedup_data.py Signed-off-by: Vibhu Jawa <[email protected]>
* Update index.rst Signed-off-by: Vibhu Jawa <[email protected]>
* Applying SEO Best Pratices (NVIDIA#104)
  * Rename CPUvsGPU.rst to cpuvsgpu.rst Signed-off-by: Andrew Schilling <[email protected]>
  * Rename DataCuration.rsts to datacuration.rsts Signed-off-by: Andrew Schilling <[email protected]>
  * Rename DistributedDataClassification.rst to distributeddataclassification.rst Signed-off-by: Andrew Schilling <[email protected]>
  * Rename DocumentDataset.rst to documentdataset.rst Signed-off-by: Andrew Schilling <[email protected]>
  * Rename Download.rst to download.rst Signed-off-by: Andrew Schilling <[email protected]>
  * Rename GpuDeduplication.rst to gpudeduplication.rst Signed-off-by: Andrew Schilling <[email protected]>
  * Rename KubernetesCurator.rst to kubernetescurator.rst Signed-off-by: Andrew Schilling <[email protected]>
  * Rename LanguageIdentificationUnicodeFormatting.rst to languageidentificationunicodeformatting.rst Signed-off-by: Andrew Schilling <[email protected]>
  * Rename PersonalIdentifiableInformationIdentificationAndRemoval.rst to personalidentifiableinformationidentificationandremoval.rst Signed-off-by: Andrew Schilling <[email protected]>
  * Rename QualityFiltering.rst to qualityfiltering.rst Signed-off-by: Andrew Schilling <[email protected]>
  * Rename TaskDecontamination.rst to taskdecontamination.rst Signed-off-by: Andrew Schilling <[email protected]>
  * Update index.rst Setting all RST files to lowercase names. Signed-off-by: Andrew Schilling <[email protected]>
  * Ignore docs for EOF fixer hook Signed-off-by: Ayush Dattagupta <[email protected]>
  ---------
  Signed-off-by: Andrew Schilling <[email protected]>
  Signed-off-by: Ayush Dattagupta <[email protected]>
  Co-authored-by: Ayush Dattagupta <[email protected]>
* Update index.rst Signed-off-by: Vibhu Jawa <[email protected]>
* Fix bad merge Signed-off-by: Vibhu Jawa <[email protected]>
* Update index.rst Signed-off-by: Vibhu Jawa <[email protected]>
* Update index.rst Signed-off-by: Vibhu Jawa <[email protected]>
* Update index.rst Signed-off-by: Vibhu Jawa <[email protected]>
* Update index.rst Signed-off-by: Vibhu Jawa <[email protected]>
* Add Module for embedding+clustering Signed-off-by: Vibhu Jawa <[email protected]>
* Add sorting to clustering Signed-off-by: Vibhu Jawa <[email protected]>
* Refactor Semdup modules Signed-off-by: Vibhu Jawa <[email protected]>
* Refactor Semdup modules Signed-off-by: Vibhu Jawa <[email protected]>
* Refactor Semdup modules Signed-off-by: Vibhu Jawa <[email protected]>
* Fix Readme.md Signed-off-by: Vibhu Jawa <[email protected]>
* Add a environment variable to silence HF warnings Signed-off-by: Vibhu Jawa <[email protected]>
* dask-cudf fix Signed-off-by: Vibhu Jawa <[email protected]>
* dask-cudf fix Signed-off-by: Vibhu Jawa <[email protected]>
* dask-cudf fix Signed-off-by: Vibhu Jawa <[email protected]>
* Make config a flat file based on reviews Signed-off-by: Vibhu Jawa <[email protected]>
* Add docstrings Signed-off-by: Vibhu Jawa <[email protected]>
* Fix argparse and seed function Signed-off-by: Vibhu Jawa <[email protected]>
* Use argparse to read config Signed-off-by: Vibhu Jawa <[email protected]>
* Move around config files Signed-off-by: Vibhu Jawa <[email protected]>
* Move around config files Signed-off-by: Vibhu Jawa <[email protected]>
* Move around config files Signed-off-by: Vibhu Jawa <[email protected]>
* Remove end_to_end_script.sh Signed-off-by: Vibhu Jawa <[email protected]>
* Append Readme Signed-off-by: Vibhu Jawa <[email protected]>
* Address Reviews Signed-off-by: Vibhu Jawa <[email protected]>
* Change config Signed-off-by: Vibhu Jawa <[email protected]>
* Make embedding creation optionally lazy Signed-off-by: Vibhu Jawa <[email protected]>
* fix docstring Signed-off-by: Vibhu Jawa <[email protected]>
* Address Reviews and docstrings Signed-off-by: Vibhu Jawa <[email protected]>
* Address Reviews and make eps_thresholds a list of values Signed-off-by: Vibhu Jawa <[email protected]>
* Minor import fix Signed-off-by: Vibhu Jawa <[email protected]>
* Empty Commit Signed-off-by: Vibhu Jawa <[email protected]>
* Add modules to __init__ and README.md Signed-off-by: Vibhu Jawa <[email protected]>
* Fix init Signed-off-by: Vibhu Jawa <[email protected]>
* Move comment Signed-off-by: Vibhu Jawa <[email protected]>
* Empty commit to restart CI (which failed due to a download issue) Signed-off-by: Vibhu Jawa <[email protected]>
* Empty commit to restart CI (which failed due to a download issue) Signed-off-by: Vibhu Jawa <[email protected]>
---------
Signed-off-by: Andrew Schilling <[email protected]>
Signed-off-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: jgerh <[email protected]>
Signed-off-by: avinashvem <[email protected]>
Co-authored-by: Andrew Schilling <[email protected]>
Co-authored-by: Ayush Dattagupta <[email protected]>
Co-authored-by: jgerh <[email protected]>
Co-authored-by: avinashvem <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Chris Alexiuk <[email protected]>
* Begin implementation on OpenAI client Signed-off-by: Ryan Wolf <[email protected]>
* Fix relative import Signed-off-by: Ryan Wolf <[email protected]>
* Add temperature Signed-off-by: Ryan Wolf <[email protected]>
* Modify client interface and begin ultrachat Signed-off-by: Ryan Wolf <[email protected]>
* Change type annotation in openai client Signed-off-by: Ryan Wolf <[email protected]>
* Make imports easier Signed-off-by: Ryan Wolf <[email protected]>
* Reformat to match nemotron report Signed-off-by: Ryan Wolf <[email protected]>
* Add yaml conversion Signed-off-by: Ryan Wolf <[email protected]>
* Fix index error Signed-off-by: Ryan Wolf <[email protected]>
* Add error handling for yaml parsing Signed-off-by: Ryan Wolf <[email protected]>
* Fix error Signed-off-by: Ryan Wolf <[email protected]>
* Add additional yaml parsing check Signed-off-by: Ryan Wolf <[email protected]>
* Add more yaml error handling Signed-off-by: Ryan Wolf <[email protected]>
* Export conversion error Signed-off-by: Ryan Wolf <[email protected]>
* Change variable naming Signed-off-by: Ryan Wolf <[email protected]>
* Make error catching more general Signed-off-by: Ryan Wolf <[email protected]>
* Refactor list out of nemotron Signed-off-by: Ryan Wolf <[email protected]>
* Add prompt helper function Signed-off-by: Ryan Wolf <[email protected]>
* Add revisions and writing prompts Signed-off-by: Ryan Wolf <[email protected]>
* Fix default prompt templates Signed-off-by: Ryan Wolf <[email protected]>
* Add closed qa Signed-off-by: Ryan Wolf <[email protected]>
* Fix prompt Signed-off-by: Ryan Wolf <[email protected]>
* Add math and coding Signed-off-by: Ryan Wolf <[email protected]>
* Add problem generation Signed-off-by: Ryan Wolf <[email protected]>
* Rename function Signed-off-by: Ryan Wolf <[email protected]>
* Add dialogue support Signed-off-by: Ryan Wolf <[email protected]>
* Fix mispell Signed-off-by: Ryan Wolf <[email protected]>
* Add two turn generation Signed-off-by: Ryan Wolf <[email protected]>
* Add reward model as judge Signed-off-by: Ryan Wolf <[email protected]>
* Refactor reward query Signed-off-by: Ryan Wolf <[email protected]>
* Add error handling for non-reward models Signed-off-by: Ryan Wolf <[email protected]>
* Add error handling to sync client Signed-off-by: Ryan Wolf <[email protected]>
* Add open qa pipeline Signed-off-by: Ryan Wolf <[email protected]>
* Improve docs and add writing pipeline Signed-off-by: Ryan Wolf <[email protected]>
* Add closed qa pipeline Signed-off-by: Ryan Wolf <[email protected]>
* Add math pipeline Signed-off-by: Ryan Wolf <[email protected]>
* Add python pipeline Signed-off-by: Ryan Wolf <[email protected]>
* Add async nemotron generator Signed-off-by: Ryan Wolf <[email protected]>
* Fix await with index Signed-off-by: Ryan Wolf <[email protected]>
* Add seed parameter Signed-off-by: Ryan Wolf <[email protected]>
* Add missing await Signed-off-by: Ryan Wolf <[email protected]>
* Fix parameter names Signed-off-by: Ryan Wolf <[email protected]>
* Fix subscript await issues Signed-off-by: Ryan Wolf <[email protected]>
* Switch parsing method for reward model Signed-off-by: Ryan Wolf <[email protected]>
* Add initial docs Signed-off-by: Ryan Wolf <[email protected]>
* Add nemo deploy client Signed-off-by: Ryan Wolf <[email protected]>
* Add easy import Signed-off-by: Ryan Wolf <[email protected]>
* Move conversation formatter Signed-off-by: Ryan Wolf <[email protected]>
* Add other file Signed-off-by: Ryan Wolf <[email protected]>
* Update nemotron import Signed-off-by: Ryan Wolf <[email protected]>
* Update model client import Signed-off-by: Ryan Wolf <[email protected]>
* Remove model in query call Signed-off-by: Ryan Wolf <[email protected]>
* Add extra index Signed-off-by: Ryan Wolf <[email protected]>
* Fix response indexing Signed-off-by: Ryan Wolf <[email protected]>
* Add top k Signed-off-by: Ryan Wolf <[email protected]>
* Remove extras Signed-off-by: Ryan Wolf <[email protected]>
* Add safe import for nemo deploy Signed-off-by: Ryan Wolf <[email protected]>
* Add pandas conversions Signed-off-by: Ryan Wolf <[email protected]>
* Add partition default Signed-off-by: Ryan Wolf <[email protected]>
* Add no format Signed-off-by: Ryan Wolf <[email protected]>
* Move no format location Signed-off-by: Ryan Wolf <[email protected]>
* Use top_k in nemo client Signed-off-by: Ryan Wolf <[email protected]>
* Address vibhu's review Signed-off-by: Ryan Wolf <[email protected]>
* Add logging import Signed-off-by: Ryan Wolf <[email protected]>
* Fix import Signed-off-by: Ryan Wolf <[email protected]>
* Fix tqdm Signed-off-by: Ryan Wolf <[email protected]>
* Add missing awaits Signed-off-by: Ryan Wolf <[email protected]>
* Standardize names Signed-off-by: Ryan Wolf <[email protected]>
* Address Ayush nit Signed-off-by: Ryan Wolf <[email protected]>
---------
Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ayush Dattagupta <[email protected]>
* Begin docs Signed-off-by: Ryan Wolf <[email protected]>
* Add slurm sdk example Signed-off-by: Ryan Wolf <[email protected]>
* Use safe import Signed-off-by: Ryan Wolf <[email protected]>
* Fix bugs in sdk Signed-off-by: Ryan Wolf <[email protected]>
* Update docs and tweak scripts Signed-off-by: Ryan Wolf <[email protected]>
* Add interface helper function Signed-off-by: Ryan Wolf <[email protected]>
* Update docs Signed-off-by: Ryan Wolf <[email protected]>
* Fix formatting Signed-off-by: Ryan Wolf <[email protected]>
* Add config docstring Signed-off-by: Ryan Wolf <[email protected]>
* Address comments Signed-off-by: Ryan Wolf <[email protected]>
---------
Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
updates:
- [github.com/pre-commit/pre-commit-hooks: v4.5.0 → v4.6.0](pre-commit/pre-commit-hooks@v4.5.0...v4.6.0)
- [github.com/psf/black: 24.3.0 → 24.4.2](psf/black@24.3.0...24.4.2)
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix bug with torch rmm and nemo Signed-off-by: Ryan Wolf <[email protected]>
* Change pycld2 version pin Signed-off-by: Ryan Wolf <[email protected]>
---------
Signed-off-by: Ryan Wolf <[email protected]>
* Preving plugging an allocator twice Signed-off-by: Vibhu Jawa <[email protected]>
* Remove extra import Signed-off-by: Vibhu Jawa <[email protected]>
* Fix defaults for RMM-POOL and other style fixes Signed-off-by: Vibhu Jawa <[email protected]>
* Switch it rmm_pytorch off by default Signed-off-by: Vibhu Jawa <[email protected]>
---------
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
7f1b5f7 to e121329
Signed-off-by: Sarah Yurick <[email protected]>
Closes #72
Closes #71
https://huggingface.co/nvidia/domain-classifier