Global cache_dir variable for exact, fuzzy, and semantic deduplication #384

Open

wants to merge 9 commits into main

Conversation

sarahyurick
Collaborator

@sarahyurick sarahyurick commented Nov 19, 2024

TODO:

  • Exact deduplication files
  • Fuzzy deduplication files
  • Semantic deduplication files

@sarahyurick sarahyurick changed the title from "Global cache variable for exact, fuzzy, and semantic deduplication" to "Global cache_dir variable for exact, fuzzy, and semantic deduplication" on Nov 19, 2024
Comment on lines +154 to +160
You also need to set a global variable representing the cache directory where the outputs are written:

.. code-block:: python

from nemo_curator.cache import initialize_cache_directory

initialize_cache_directory("cache_dir")
Collaborator Author


It could also make more sense to call this something else, like deduplication_outputs.
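
For reference, a minimal sketch of how a module-level cache helper along these lines might look. Only the two function names (initialize_cache_directory and get_cache_directory) come from this PR; the internals of nemo_curator.cache shown here are assumptions:

# Hypothetical sketch of nemo_curator/cache.py; internals are assumed.
import os

_CACHE_DIR = None  # module-level global shared by exact, fuzzy, and semantic dedup


def initialize_cache_directory(cache_dir: str) -> None:
    # Record (and create) the directory that all deduplication outputs are written to.
    global _CACHE_DIR
    os.makedirs(cache_dir, exist_ok=True)
    _CACHE_DIR = cache_dir


def get_cache_directory() -> str:
    # Fail fast if a dedup module asks for the cache before it has been configured.
    if _CACHE_DIR is None:
        raise ValueError(
            "Cache directory not set. Call initialize_cache_directory() first."
        )
    return _CACHE_DIR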

id_column=id_column,
id_column_type=id_column_type,
which_to_keep=config.which_to_keep,
output_dir=os.path.join(cache_dir, config.clustering_save_loc),
Collaborator Author


We might want to re-add output_dir to SemanticClusterLevelDedup and/or add it as a parameter for SemDedup.
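
One possible shape for that, sketched with an assumed constructor signature; only the id_column, id_column_type, and which_to_keep parameters and the "clustering" subdirectory appear in this diff, the rest is hypothetical:

# Hypothetical sketch: output_dir as an optional parameter on
# SemanticClusterLevelDedup, falling back to the global cache directory.
import os

from nemo_curator.cache import get_cache_directory


class SemanticClusterLevelDedup:
    def __init__(self, id_column, id_column_type, which_to_keep, output_dir=None):
        self.id_column = id_column
        self.id_column_type = id_column_type
        self.which_to_keep = which_to_keep
        # An explicit output_dir wins; otherwise write under the shared cache.
        self.output_dir = output_dir or os.path.join(
            get_cache_directory(), "clustering"
        )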

self.sorted_clusters_dir = os.path.join(
get_cache_directory(), "clustering", "sorted"
)
self.output_dir = os.path.join(get_cache_directory(), "clustering")
Collaborator Author


Suggested change
self.output_dir = os.path.join(get_cache_directory(), "clustering")
self.output_dir = os.path.join(get_cache_directory(), "duplicates")

?

sarahyurick and others added 6 commits November 19, 2024 16:13
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick sarahyurick added the gpuci Run GPU CI/CD on PR label Nov 20, 2024
@sarahyurick sarahyurick marked this pull request as ready for review November 20, 2024 23:27