Global cache_dir variable for exact, fuzzy, and semantic deduplication #384

Open

wants to merge 9 commits into main

Conversation

sarahyurick
Collaborator

@sarahyurick sarahyurick commented Nov 19, 2024

TODO:

  • Exact deduplication files
  • Fuzzy deduplication files
  • Semantic deduplication files

@sarahyurick sarahyurick changed the title from "Global cache variable for exact, fuzzy, and semantic deduplication" to "Global cache_dir variable for exact, fuzzy, and semantic deduplication" on Nov 19, 2024
Comment on lines +154 to +160
You also need to set a global variable representing the cache directory where the outputs are written:

.. code-block:: python

from nemo_curator.cache import initialize_cache_directory

initialize_cache_directory("cache_dir")
Collaborator Author


It could also make more sense to call this something else, like deduplication_outputs.
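
For reference, a minimal sketch of how a module-level cache helper along these lines might look. Only the two function names (initialize_cache_directory and get_cache_directory) come from this PR; the internals of nemo_curator.cache shown here are assumptions:

# Hypothetical sketch of nemo_curator/cache.py; internals are assumed.
import os

_CACHE_DIR = None  # module-level global shared by exact, fuzzy, and semantic dedup


def initialize_cache_directory(cache_dir: str) -> None:
    # Record (and create) the directory that all deduplication outputs are written to.
    global _CACHE_DIR
    os.makedirs(cache_dir, exist_ok=True)
    _CACHE_DIR = cache_dir


def get_cache_directory() -> str:
    # Fail fast if a dedup module asks for the cache before it has been configured.
    if _CACHE_DIR is None:
        raise ValueError(
            "Cache directory not set. Call initialize_cache_directory() first."
        )
    return _CACHE_DIR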

id_column=id_column,
id_column_type=id_column_type,
which_to_keep=config.which_to_keep,
output_dir=os.path.join(cache_dir, config.clustering_save_loc),
Collaborator Author


We might want to re-add output_dir to SemanticClusterLevelDedup and/or add it as a parameter for SemDedup.
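
One possible shape for that, sketched with an assumed constructor signature; only the id_column, id_column_type, and which_to_keep parameters and the "clustering" subdirectory appear in this diff, the rest is hypothetical:

# Hypothetical sketch: output_dir as an optional parameter on
# SemanticClusterLevelDedup, falling back to the global cache directory.
import os

from nemo_curator.cache import get_cache_directory


class SemanticClusterLevelDedup:
    def __init__(self, id_column, id_column_type, which_to_keep, output_dir=None):
        self.id_column = id_column
        self.id_column_type = id_column_type
        self.which_to_keep = which_to_keep
        # An explicit output_dir wins; otherwise write under the shared cache.
        self.output_dir = output_dir or os.path.join(
            get_cache_directory(), "clustering"
        )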

self.sorted_clusters_dir = os.path.join(
get_cache_directory(), "clustering", "sorted"
)
self.output_dir = os.path.join(get_cache_directory(), "clustering")
Collaborator Author


Suggested change
self.output_dir = os.path.join(get_cache_directory(), "clustering")
self.output_dir = os.path.join(get_cache_directory(), "duplicates")

?

sarahyurick and others added 6 commits November 19, 2024 16:13
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick sarahyurick added the gpuci Run GPU CI/CD on PR label Nov 20, 2024
@sarahyurick sarahyurick marked this pull request as ready for review November 20, 2024 23:27