
Add repo level ordering tf #377

Merged: 5 commits, Jul 23, 2024

Conversation

shivdeep-singh-ibm
Collaborator

Why are these changes needed?

Related issue number (if any).

Collaborator

@Param-S Param-S left a comment

will continue to review the remaining files later today.

@@ -73,7 +73,7 @@ INGEST_TO_PARQUET_VERSION=$(DPK_VERSION)

KFP_DOCKER_VERSION=$(DPK_VERSION)
KFP_DOCKER_VERSION_v2=$(DPK_VERSION)

REPO_LVL_ORDER_RAY_VERSION=$(DPK_VERSION)
Collaborator

are we going to add support for python runtime?

@@ -0,0 +1,8 @@
# Repo Level Order Transform
The repo level order transforms serves as a simple exemplar to demonstrate the development
Collaborator

Description of the transform needs to be updated.

COPY --chown=ray:users data-processing-lib-ray/ data-processing-lib-ray/
RUN cd data-processing-lib-ray && pip install --no-cache-dir -e .

#COPY requirements.txt requirements.txt
Collaborator

can we remove the commented steps?

testing and IDE set up.

## Summary
This project wraps the [repo_level_order transform](../python) with a Ray runtime.
Collaborator

can you update python link/redirection till we add support for python runtime.


## Configuration and command line Options

repo_level_order configuration and command line options are the same as for the base python transform.
Collaborator

since we don't have base python transform, we need to update this sentence.

Transform Configuration.

- For output:
either the output is directly a file or if dominant language flag is enabled, it should output
Collaborator

Can we briefly define terms like "dominant language", "superrows", and the "different types of sorting currently supported" as a quick reference...

Collaborator

or we can have a separate doc page which explains the overall architecture of this transform, different stages of the transform, need for backend storage and different backend storage supported.

Collaborator

while explaining the types of sorting, we may need to mention the emerge package dependency.

license = {text = "Apache-2.0"}
readme = {file = "README.md", content-type = "text/markdown"}
authors = [
{ name = "David Wood", email = "[email protected]" },
Collaborator

need to update the authors?

"vue",
]
ret = True
for l in non_prog_list:
Collaborator

can we use the in operator, like:

if lang.lower() in non_prog_list:
    return False
return True
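The reviewer's loop-to-membership-test suggestion could look like this minimal sketch (the list entries and function name are illustrative, not taken from the PR):

```python
# Hypothetical subset of the non-programming-language markers from the PR.
NON_PROG_LIST = ["vue", "css", "html", "markdown"]


def is_programming_language(lang: str) -> bool:
    # A single membership test replaces the explicit for-loop over the list.
    return lang.lower() not in NON_PROG_LIST
```

For example, `is_programming_language("Python")` is True while `is_programming_language("VUE")` is False.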

@shivdeep-singh-ibm shivdeep-singh-ibm force-pushed the add_repo_level_ordering_tf branch 2 times, most recently from a35a049 to 13d9069 Compare July 11, 2024 06:44
COPY --chown=ray:users data-processing-lib-ray/ data-processing-lib-ray/
RUN cd data-processing-lib-ray && pip install --no-cache-dir -e .

#COPY requirements.txt requirements.txt
Collaborator

can we remove these commented statements?

This function takes a table with columns ['title'] and ['language'] where title
is a path and language represents a programming language.
"""
df = table.to_pandas()
Collaborator

I am just wondering if we can avoid this arrow -> pandas conversion; will that help with execution time and memory if we use arrow.compute.value_counts() to get the language count?

Collaborator Author

this code uses pandas for most functionality, may need to be re-written.

Collaborator

I am suggesting the change only for this function. As per my understanding, this specific conversion is scoped to the current function. I can see the df is referenced at line no. 45 and line 65.

# of RepoLevelOrderTransformConfiguration class
super().__init__(config)

from data_processing.utils import get_logger
Collaborator

can you move this import to top along with other imports

Collaborator Author

This is for use in an actor.

Member

agreed, since this is distributed code, this needs its own logger import (apparently).

self.config = config
self.store = None
self.grouping_column = config.get(grouping_column_key)
self.store = None
Collaborator

self.store is initialized in multiple places (lines 90 and 92). Can you remove these redundant assignments?

store_s3_keyid_key = "store_s3_key"
store_s3_secret_key = "store_s3_secret"
store_s3_endpoint_key = "store_s3_endpoint"
store_type_key = "store_type"
Collaborator

store_type_key is initialized two times, can we remove one?


class RepoLevelOrderRuntime(DefaultRayTransformRuntime):
"""
Exact dedup runtime support
Collaborator

This docstring mentions exact dedup; can you correct this comment?

Collaborator Author

ok

data = self.pool.get_next_unordered()
if data != None:
result = result + [k for k in data]
except Exception as e:
Collaborator

except is repeated; can you remove one?

data = self.pool.get_next_unordered()
if data != None:
result = result + data
except Exception as e:
Collaborator

except Exception as e: repeated here too

Collaborator Author

ok
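The drain loop under discussion, with the duplicated except handler collapsed to one, could be sketched like this (has_next/get_next_unordered mirror the ray.util.ActorPool interface; the pool object here is a stand-in, not the PR's actual class):

```python
def drain_pool(pool):
    """Collect results from a pool exposing has_next()/get_next_unordered(),
    the interface of ray.util.ActorPool (used here only as a stand-in)."""
    result = []
    while pool.has_next():
        try:
            data = pool.get_next_unordered()
        except Exception:
            # One handler is enough; a second identical except clause is dead code.
            continue
        if data is not None:  # prefer `is not None` over `!= None`
            result += data
    return result
```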

Signed-off-by: Shivdeep Singh <[email protected]>
access_key=key,
secret_key=secret,
endpoint_override=endpoint,
request_timeout=20,
Collaborator

request_timeout and connect_timeout are hardcoded. Can we parameterize them with defaults?

Collaborator Author

done

@shivdeep-singh-ibm shivdeep-singh-ibm changed the title [DRAFT ]Add repo level ordering tf Add repo level ordering tf Jul 12, 2024

p_pool = ActorPool(processors)
replies = list(p_pool.map(lambda a, x: a.process.remote(x[0], x[1]), p_input))
return {"nrepos": len(p_input)}
Collaborator

I think you need to delete actors and pool here

Member

@daw3rd daw3rd left a comment

Approved for rush status, but needs to be revisited to address comments.

@@ -0,0 +1,8 @@
# Repo Level Order Transform
The repo level order transforms serves as a simple exemplar to demonstrate the development
Member

wrong docs


This transform requires the input data to have the following columns atleast:

- repo name: Name of the repo, it is used for grouping in this transform.
Member

This belongs under the next paragraph, which duplicates some of this info.


The transform gives the option to write the repo to file in the following ways.

a) sort the repo content by file path and write a parquet with multiple rows
Member

This could use some more description. It is not clear to me what is going on here. An example might help.

--repo_lvl_sorting_enabled \
--repo_lvl_sorting_algo SORT_SEMANTIC \
--repo_lvl_output_by_langs
```

"""
Put Transform-specific to convert one Table to 0 or more tables. It also returns
a dictionary of execution statistics - arbitrary dictionary
This implementation makes no modifications so effectively implements a copy of the
Member

wrong doc



def test_sample_1():
pass
Member

ummm, needs a test.

from pyarrow.parquet import ParquetDataset, write_table


class DataAccessAlternative:
Collaborator

Why do we need a data access alternative using S3FS? That's not good. This is unreliable and uses disk space. Why can't we use data access from the library?

".m",
".rb",
}

Collaborator

I do not like this being hard coded. Can we externalize this into a file?

".m",
".rb",
}

Collaborator

I do not like this being hard coded. Can we externalize this into a file?

g.add_edge("L", "M")

return g

Collaborator

This should be in the test, not the main module

".m",
".rb",
}

Collaborator

I do not like this being hard coded. Can we externalize this into a file?

return original_df_cp


def configure_log():
Collaborator

Why do we need special log configuration here?

return []


class KeyedValueListActorPool:
Collaborator

This is not really good. Take a look at the exact dedup for the same functionality

# may randomly append to an actor
dict_ = random.choice(self.processors)
return ray.get(dict_.put.remote(key, value))

Collaborator

Why do you do ray.get here? It's a put; let it be async.


Limitations of filesystem constrain the size of keys and values in this store.
"""

Collaborator

Why? Again, take a look at ededup to see the Ray-native way.

self.put(key, value)


if __name__ == "__main__":
Collaborator

This does not belong here; it's a test.

remove redundant code from repo-level-order transform

Signed-off-by: Shivdeep Singh <[email protected]>
Collaborator

@Param-S Param-S left a comment

Considering there will be separate PRs which will take up the README updates and the cleanup of the additional print statements, I approve this PR.

@daw3rd
Member

daw3rd commented Jul 16, 2024

There is much more here to be resolved than readmes. I'd also like to see a merge of dev into this PR before merging it into dev. Some checks have been added in dev that may not be tested here.

Signed-off-by: Shivdeep Singh <[email protected]>
@shivdeep-singh-ibm
Collaborator Author

The issues listed in this PR are tracked here.

@shivdeep-singh-ibm
Collaborator Author

shivdeep-singh-ibm commented Jul 18, 2024

I will merge these changes to get an initial version of the transform into the repo so that it is possible to continue on issue: #418

set-versions:
$(MAKE) TRANSFORM_PYTHON_VERSION=dummy TOML_VERSION=$(REPO_LVL_ORDER_RAY_VERSION) .transforms.set-versions

build-dist:: set-versions .defaults.build-dist
Member

Please remove set-versions

@daw3rd daw3rd merged commit 7cfc16c into IBM:dev Jul 23, 2024
21 checks passed