
Add repo level ordering tf #377

Merged: 5 commits, Jul 23, 2024

Conversation

shivdeep-singh-ibm
Collaborator

Why are these changes needed?

Related issue number (if any).

Collaborator

@Param-S Param-S left a comment

will continue to review the remaining files later today.

@@ -73,7 +73,7 @@ INGEST_TO_PARQUET_VERSION=$(DPK_VERSION)

KFP_DOCKER_VERSION=$(DPK_VERSION)
KFP_DOCKER_VERSION_v2=$(DPK_VERSION)

REPO_LVL_ORDER_RAY_VERSION=$(DPK_VERSION)
Collaborator

are we going to add support for python runtime?

@@ -0,0 +1,8 @@
# Repo Level Order Transform
The repo level order transforms serves as a simple exemplar to demonstrate the development
Collaborator

Description of the transform needs to be updated.

COPY --chown=ray:users data-processing-lib-ray/ data-processing-lib-ray/
RUN cd data-processing-lib-ray && pip install --no-cache-dir -e .

#COPY requirements.txt requirements.txt
Collaborator

can we remove the commented steps?

testing and IDE set up.

## Summary
This project wraps the [repo_level_order transform](../python) with a Ray runtime.
Collaborator

can you update python link/redirection till we add support for python runtime.


## Configuration and command line Options

repo_level_order configuration and command line options are the same as for the base python transform.
Collaborator

since we don't have base python transform, we need to update this sentence.

Transform Configuration.

- For output:
either the output is directly a file or if dominant language flag is enabled, it should output
Collaborator

Can we briefly define terms like "dominant language", "superrows", and the "different types of sorting currently supported" as a quick reference...

Collaborator

or we can have a separate doc page which explains the overall architecture of this transform, different stages of the transform, need for backend storage and different backend storage supported.

Collaborator

while explaining the types of sorting, we may need to mention the emerge package dependency.

license = {text = "Apache-2.0"}
readme = {file = "README.md", content-type = "text/markdown"}
authors = [
{ name = "David Wood", email = "[email protected]" },
Collaborator

need to update the authors?

"vue",
]
ret = True
for l in non_prog_list:
Collaborator

can we use the in operator, like:

if lang.lower() in non_prog_list:
    return False
return True
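The reviewer's loop-to-membership-test suggestion could look like this minimal sketch (the list entries and function name are illustrative, not taken from the PR):

```python
# Hypothetical subset of the non-programming-language markers from the PR.
NON_PROG_LIST = ["vue", "css", "html", "markdown"]


def is_programming_language(lang: str) -> bool:
    # A single membership test replaces the explicit for-loop over the list.
    return lang.lower() not in NON_PROG_LIST
```

For example, `is_programming_language("Python")` is True while `is_programming_language("VUE")` is False.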

@shivdeep-singh-ibm shivdeep-singh-ibm force-pushed the add_repo_level_ordering_tf branch 2 times, most recently from a35a049 to 13d9069 Compare July 11, 2024 06:44
COPY --chown=ray:users data-processing-lib-ray/ data-processing-lib-ray/
RUN cd data-processing-lib-ray && pip install --no-cache-dir -e .

#COPY requirements.txt requirements.txt
Collaborator

can we remove these commented statements?

This function takes a table with columns ['title'] and ['language'] where title
is a path and language represents a programming language.
"""
df = table.to_pandas()
Collaborator

I am just wondering if we can avoid this arrow -> pandas conversion; will that help with execution time and memory if we use arrow.compute.value_counts() to get the language count?

Collaborator Author

this code uses pandas for most functionality, may need to be re-written.

Collaborator

I am suggesting the change only for this function. As per my understanding, this specific conversion is scoped to the current function. I can see the df is referenced at line no. 45 and line 65.

# of RepoLevelOrderTransformConfiguration class
super().__init__(config)

from data_processing.utils import get_logger
Collaborator

can you move this import to top along with other imports

Collaborator Author

This is for use in an actor.

Member

agreed, since this is distributed code, this needs its own logger import (apparently).

self.config = config
self.store = None
self.grouping_column = config.get(grouping_column_key)
self.store = None
Collaborator

self.store is initialized in multiple places (lines 90 and 92). Can you remove these redundant assignments?

store_s3_keyid_key = "store_s3_key"
store_s3_secret_key = "store_s3_secret"
store_s3_endpoint_key = "store_s3_endpoint"
store_type_key = "store_type"
Collaborator

store_type_key is initialized two times, can we remove one?


class RepoLevelOrderRuntime(DefaultRayTransformRuntime):
"""
Exact dedup runtime support
Collaborator

This docstring mentions exact dedup; can you correct this comment?

Collaborator Author

ok

data = self.pool.get_next_unordered()
if data != None:
result = result + [k for k in data]
except Exception as e:
Collaborator

except is repeated; can you remove one?

data = self.pool.get_next_unordered()
if data != None:
result = result + data
except Exception as e:
Collaborator

except Exception as e: repeated here too

Collaborator Author

ok
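The drain loop under discussion, with the duplicated except handler collapsed to one, could be sketched like this (has_next/get_next_unordered mirror the ray.util.ActorPool interface; the pool object here is a stand-in, not the PR's actual class):

```python
def drain_pool(pool):
    """Collect results from a pool exposing has_next()/get_next_unordered(),
    the interface of ray.util.ActorPool (used here only as a stand-in)."""
    result = []
    while pool.has_next():
        try:
            data = pool.get_next_unordered()
        except Exception:
            # One handler is enough; a second identical except clause is dead code.
            continue
        if data is not None:  # prefer `is not None` over `!= None`
            result += data
    return result
```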

Signed-off-by: Shivdeep Singh <[email protected]>
access_key=key,
secret_key=secret,
endpoint_override=endpoint,
request_timeout=20,
Collaborator

request_timeout and connect_timeout are hardcoded. Can we parameterize them with defaults?

Collaborator Author

done

@shivdeep-singh-ibm shivdeep-singh-ibm changed the title [DRAFT ]Add repo level ordering tf Add repo level ordering tf Jul 12, 2024

p_pool = ActorPool(processors)
replies = list(p_pool.map(lambda a, x: a.process.remote(x[0], x[1]), p_input))
return {"nrepos": len(p_input)}
Collaborator

I think you need to delete actors and pool here

Member

@daw3rd daw3rd left a comment

Approved for rush status, but needs to be revisited to address comments.

@@ -0,0 +1,8 @@
# Repo Level Order Transform
The repo level order transforms serves as a simple exemplar to demonstrate the development
Member

wrong docs


This transform requires the input data to have the following columns atleast:

- repo name: Name of the repo, it is used for grouping in this transform.
Member

This belongs under the next paragraph, which duplicates some of this info.


The transform gives the option to write the repo to file in the following ways.

a) sort the repo content by file path and write a parquet with multiple rows
Member

This could use some more description. It is not clear to me what is going on here. An example might help.

--repo_lvl_sorting_enabled \
--repo_lvl_sorting_algo SORT_SEMANTIC \
--repo_lvl_output_by_langs
```

"""
Put Transform-specific to convert one Table to 0 or more tables. It also returns
a dictionary of execution statistics - arbitrary dictionary
This implementation makes no modifications so effectively implements a copy of the
Member

wrong doc



def test_sample_1():
pass
Member

ummm, needs a test.

from pyarrow.parquet import ParquetDataset, write_table


class DataAccessAlternative:
Collaborator

Why do we need a data access alternative using S3FS? That's not good. This is unreliable and uses disk space. Why can't we use data access from the library?

".m",
".rb",
}

Collaborator

I do not like this being hard coded. Can we externalize this into a file?

".m",
".rb",
}

Collaborator

I do not like this being hard coded. Can we externalize this into a file?

g.add_edge("L", "M")

return g

Collaborator

This should be in the test, not the main module

".m",
".rb",
}

Collaborator

I do not like this being hard coded. Can we externalize this into a file?

return original_df_cp


def configure_log():
Collaborator

Why do we need special log configuration here?

return []


class KeyedValueListActorPool:
Collaborator

This is not really good. Take a look at the exact dedup for the same functionality

# may randomly append to an actor
dict_ = random.choice(self.processors)
return ray.get(dict_.put.remote(key, value))

Collaborator

Why do you do ray.get here? It's a put; let it be async.


Limitations of filesystem constrain the size of keys and values in this store.
"""

Collaborator

Why? Again, take a look at ededup to see the Ray-native way.

self.put(key, value)


if __name__ == "__main__":
Collaborator

This does not belong here; it's a test.

remove redundant code from repo-level-order transform

Signed-off-by: Shivdeep Singh <[email protected]>
Collaborator

@Param-S Param-S left a comment

Considering there will be separate PRs which will take up the README updates and the cleanup of the additional print statements, I approve this PR.

@daw3rd
Member

daw3rd commented Jul 16, 2024

There is much more here to be resolved than readmes. I'd also like to see a merge of dev into this PR before merging it into dev. Some checks have been added in dev that may not be tested here.

Signed-off-by: Shivdeep Singh <[email protected]>
@shivdeep-singh-ibm
Collaborator Author

The issues listed in this PR are tracked here.

@shivdeep-singh-ibm
Collaborator Author

shivdeep-singh-ibm commented Jul 18, 2024

I will merge these changes to get an initial version of the transform into the repo so that it is possible to continue on issue: #418

set-versions:
$(MAKE) TRANSFORM_PYTHON_VERSION=dummy TOML_VERSION=$(REPO_LVL_ORDER_RAY_VERSION) .transforms.set-versions

build-dist:: set-versions .defaults.build-dist
Member

Please remove set-versions

@daw3rd daw3rd merged commit 7cfc16c into IBM:dev Jul 23, 2024
21 checks passed