-
Notifications
You must be signed in to change notification settings - Fork 135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add repo level ordering tf #377
Add repo level ordering tf #377
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will continue to review the remaining files later today.
@@ -73,7 +73,7 @@ INGEST_TO_PARQUET_VERSION=$(DPK_VERSION) | |||
|
|||
KFP_DOCKER_VERSION=$(DPK_VERSION) | |||
KFP_DOCKER_VERSION_v2=$(DPK_VERSION) | |||
|
|||
REPO_LVL_ORDER_RAY_VERSION=$(DPK_VERSION) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we going to add support for python runtime?
@@ -0,0 +1,8 @@ | |||
# Repo Level Order Transform | |||
The repo level order transforms serves as a simple exemplar to demonstrate the development |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Description of the transform needs to be updated.
COPY --chown=ray:users data-processing-lib-ray/ data-processing-lib-ray/ | ||
RUN cd data-processing-lib-ray && pip install --no-cache-dir -e . | ||
|
||
#COPY requirements.txt requirements.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we remove commented steps
testing and IDE set up. | ||
|
||
## Summary | ||
This project wraps the [repo_level_order transform](../python) with a Ray runtime. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you update python link/redirection till we add support for python runtime.
|
||
## Configuration and command line Options | ||
|
||
repo_level_order configuration and command line options are the same as for the base python transform. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we don't have base python transform, we need to update this sentence.
Transform Configuration. | ||
|
||
- For output: | ||
either the output is directly a file or if dominant language flag is enabled, it should output |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we brief the terms like "domiant language", "superrows", "different types of sorting currently supported" as quick reference...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or we can have a separate doc page which explains the overall architecture of this transform, different stages of the transform, need for backend storage and different backend storage supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while explaining the types of sorting, we may need to mention about the emerge
package dependency .
license = {text = "Apache-2.0"} | ||
readme = {file = "README.md", content-type = "text/markdown"} | ||
authors = [ | ||
{ name = "David Wood", email = "[email protected]" }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to update the authors?
"vue", | ||
] | ||
ret = True | ||
for l in non_prog_list: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we use in
operator like:
if lang.lower in non_prog_list:
return False
return True
a35a049
to
13d9069
Compare
COPY --chown=ray:users data-processing-lib-ray/ data-processing-lib-ray/ | ||
RUN cd data-processing-lib-ray && pip install --no-cache-dir -e . | ||
|
||
#COPY requirements.txt requirements.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we remove this commented statements
This function takes a table with columns ['title'] and ['language'] where title | ||
is a path and language represents a programming language. | ||
""" | ||
df = table.to_pandas() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am just wondering if we can avoid this arrow -> pandas conversion, will that help with execution time and memory. if we use arrow.compute.value_counts() to get the language count.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this code uses pandas for most functionality, may need to be re-written.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am suggesting the change only for this function. As per my understanding, this specific conversion scoped to current function. I can see the df
is referenced at line no. 45 and line 65.
# of RepoLevelOrderTransformConfiguration class | ||
super().__init__(config) | ||
|
||
from data_processing.utils import get_logger |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you move this import to top along with other imports
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for use in actor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed, since this is distributed code, this needs its own logger import (apparently).
self.config = config | ||
self.store = None | ||
self.grouping_column = config.get(grouping_column_key) | ||
self.store = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.store is initialized multiple places line no. 90 and 92. can you remove these redundant assignments.
store_s3_keyid_key = "store_s3_key" | ||
store_s3_secret_key = "store_s3_secret" | ||
store_s3_endpoint_key = "store_s3_endpoint" | ||
store_type_key = "store_type" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
store_type_key is initialized two times, can we remove one?
|
||
class RepoLevelOrderRuntime(DefaultRayTransformRuntime): | ||
""" | ||
Exact dedup runtime support |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mentioning of exact dedup. can you correct this comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
data = self.pool.get_next_unordered() | ||
if data != None: | ||
result = result + [k for k in data] | ||
except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
except repeated. can you remove one
data = self.pool.get_next_unordered() | ||
if data != None: | ||
result = result + data | ||
except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
except Exception as e:
repeated here too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
Signed-off-by: Shivdeep Singh <[email protected]>
7bb7aaf
to
99cfe7b
Compare
access_key=key, | ||
secret_key=secret, | ||
endpoint_override=endpoint, | ||
request_timeout=20, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
request_timeout and connect_time is hardcoded. can we parameterized with default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Signed-off-by: Shivdeep Singh <[email protected]>
|
||
p_pool = ActorPool(processors) | ||
replies = list(p_pool.map(lambda a, x: a.process.remote(x[0], x[1]), p_input)) | ||
return {"nrepos": len(p_input)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need to delete actors and pool here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved for rush status, but needs to be returned to to address comments.
@@ -0,0 +1,8 @@ | |||
# Repo Level Order Transform | |||
The repo level order transforms serves as a simple exemplar to demonstrate the development |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wrong docs
|
||
This transform requires the input data to have the following columns atleast: | ||
|
||
- repo name: Name of the repo, it is used for grouping in this transform. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This belongs under the next paragraph, which duplicates some of this info.
|
||
The transform gives the option to write the repo to file in the following ways. | ||
|
||
a) sort the repo content by file path and write a parquet with multiple rows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could use some more description. It is not clear to me what is going on here. An example might help.
--repo_lvl_sorting_enabled \ | ||
--repo_lvl_sorting_algo SORT_SEMANTIC \ | ||
--repo_lvl_output_by_langs | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should also add the new section on running transform images...see https://github.com/IBM/data-prep-kit/tree/dev/transforms/code/code2parquet/python#to-see-results-of-the-transform
# of RepoLevelOrderTransformConfiguration class | ||
super().__init__(config) | ||
|
||
from data_processing.utils import get_logger |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed, since this is distributed code, this needs its own logger import (apparently).
""" | ||
Put Transform-specific to convert one Table to 0 or more tables. It also returns | ||
a dictionary of execution statistics - arbitrary dictionary | ||
This implementation makes no modifications so effectively implements a copy of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wrong doc
|
||
|
||
def test_sample_1(): | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ummm, needs a test.
from pyarrow.parquet import ParquetDataset, write_table | ||
|
||
|
||
class DataAccessAlternative: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need data access alternative using S3FS. Thats not good. This is unreliable and uses disk space. Why can't we use data access from the library
".m", | ||
".rb", | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not like this being hard coded. Can we externalize this into file
".m", | ||
".rb", | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not like this being hard coded. Can we externalize this into file
g.add_edge("L", "M") | ||
|
||
return g | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be in the test, not the main module
".m", | ||
".rb", | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not like this being hard coded. Can we externalize this into file
return original_df_cp | ||
|
||
|
||
def configure_log(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need special log configuration here?
return [] | ||
|
||
|
||
class KeyedValueListActorPool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not really good. Take a look at the exact dedup for the same functionality
# may randomly append to an actor | ||
dict_ = random.choice(self.processors) | ||
return ray.get(dict_.put.remote(key, value)) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you do ray.get here? its put. Let it be asynch
|
||
Limitations of filesystem constrain the size of keys and values in this store. | ||
""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why. Again, take a look at ededup to see Ray native way
self.put(key, value) | ||
|
||
|
||
if __name__ == "__main__": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not belong here its a test
remove redundant code from repo-level-order transform Signed-off-by: Shivdeep Singh <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering there will be separate PRs which will take up the README updates and cleaning up the additional print statements. I approve this PR.
There is much more here to be resolved than readmes. I'd also like to see merge of dev into this PR before merging it into dev. Some checks have been added in dev that may not be being tested here. |
Signed-off-by: Shivdeep Singh <[email protected]>
The issues listed this PR are tracked here. |
I will merge these changes to get a n initial of transform into the repo so that it is possible to continue on |
Signed-off-by: Shivdeep Singh <[email protected]>
set-versions: | ||
$(MAKE) TRANSFORM_PYTHON_VERSION=dummy TOML_VERSION=$(REPO_LVL_ORDER_RAY_VERSION) .transforms.set-versions | ||
|
||
build-dist:: set-versions .defaults.build-dist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove set-versions
Why are these changes needed?
Related issue number (if any).