diff --git a/doc/quick-start/quick-start.md b/doc/quick-start/quick-start.md
index 06331ea16..513e84a8f 100644
--- a/doc/quick-start/quick-start.md
+++ b/doc/quick-start/quick-start.md
@@ -1,5 +1,53 @@
# Quick Start for Data Prep Kit
-Here we provided short examples of various uses of the Data Prep Kit.
+Here we provide short examples of various uses of the Data Prep Kit. Most users who want to jump right in can use a standard pip install to deploy the data-prep-toolkit library and the python or ray transforms into a virtual python environment.
+
+- When setting up a virtual environment, it is recommended to use python 3.11, as in the conda example below.
+
+  - Set up a virtual environment (example using conda):
+
+    `conda create -n data-prep-kit-1 -y python=3.11`
+
+  - Install the gcc/g++ compilers that are required when building fasttext:
+
+    `conda install gcc_linux-64`
+
+    `conda install gxx_linux-64`
+
+  - Activate the new conda environment:
+
+    `conda activate data-prep-kit-1`
+
+  - Make sure the environment has switched to data-prep-kit-1 and check the python version:
+
+    `python --version`
+
+    The command above should report python 3.11.
+
+  - Install jupyter lab:
+
+    `pip3 install jupyterlab`
+
+Then choose one of the following:
+- Deploy the latest release of the data prep toolkit library:
+
+  `pip3 install data-prep-toolkit`
+
+  or
+- Deploy the latest releases of the data prep toolkit library and all python transforms:
+
+  `pip3 install data-prep-toolkit-transforms`
+
+  or
+- Deploy the latest releases of the data prep toolkit library, all python transforms, and all ray transforms:
+
+  `pip3 install data-prep-toolkit-transforms-ray`
+
+
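+To quickly verify the installation, you can try importing the library in python. The snippet below is a minimal sketch, not part of the toolkit itself; it only assumes the module names used in the data-prep-kit example notebooks (`data_processing` for the base library, `data_processing_ray` for the ray runtime).
+
+```python
+# Minimal post-install check (a sketch; module names as used in the example notebooks).
+from data_processing.utils import ParamsUtils  # installed by data-prep-toolkit
+
+try:
+    # Present only if the ray variant (data-prep-toolkit-transforms-ray) was installed.
+    from data_processing_ray.runtime.ray import RayTransformLauncher
+    print("data-prep-toolkit with the ray runtime is available")
+except ImportError:
+    print("data-prep-toolkit is available (python-only, no ray runtime)")
+```
+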
## Running transforms
diff --git a/examples/notebooks/code/sample-notebook_llama.ipynb b/examples/notebooks/code/sample-notebook_llama.ipynb
deleted file mode 100644
index 9a6c033e0..000000000
--- a/examples/notebooks/code/sample-notebook_llama.ipynb
+++ /dev/null
@@ -1,913 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "NbF_Zw3KBazf"
- },
- "source": [
- "# **Demo on building a data prep pipeline for model fine tuning**"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- " \n",
- ""
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "id": "_-NOkuTxiP7r",
- "outputId": "043f32fc-c476-433e-86b6-d7e9abd4d285"
- },
- "source": [
- "This demo notebook shows how to use [data-prep-kit](https://github.com/IBM/data-prep-kit) to build a data preparation pipeline that can be used for fine tuning Llama models. We will discuss the various data preparation steps to process raw data (code repositories), tokenise it, and fine tune Llama models. We will also discuss a novel recipe for semantic ordering of files in a repository, which has been shown to enhance model training. Please see our [paper](https://arxiv.org/abs/2407.13739) for more details. For this demo, we will use the [codeparrot/github-code](https://huggingface.co/datasets/codeparrot/github-code) dataset hosted on Hugging Face datasets. \n",
- "\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Setup\n",
- "\n",
- "Install the data-prep-toolkit and datasets libraries. This notebook requires at least 8 cpus. \n",
- "To run on google colab, it is recommended to change the runtime to TPUs to get the required number of cpus.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {},
- "outputs": [],
- "source": [
- "%%capture logpip --no-stderr\n",
- "!pip install data-prep-toolkit-transforms-ray==0.2.1.dev1\n",
- "!pip install datasets"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "8VhIsZViaU2i"
- },
- "source": [
- "We use Ray for parallel processing, so that beyond this demo a user can also run the same pipeline for actual production runs on larger datasets with minor code changes. Please read [here](https://github.com/IBM/data-prep-kit?tab=readme-ov-file#-about-) about the various features of data-prep-kit, which include the flexibility to run on anything from a laptop to a cluster. There are three parameters that the user can change, as per use case:\n",
- "\n",
- "`runtime_num_worker`: number of parallel workers to be used\n",
- "\n",
- "`num_cpus`: number of cpus to be used per worker\n",
- "\n",
- "`run_locally: True` starts a ray cluster for parallel computation\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "J_UbnF9wbj95"
- },
- "outputs": [],
- "source": [
- "from data_processing_ray.runtime.ray import RayTransformLauncher\n",
- "from data_processing.utils import ParamsUtils\n",
- "import sys\n",
- "\n",
- "#Default parameters for computation\n",
- "worker_options = {\"num_cpus\": 0.8}\n",
- "common_config_params = {\n",
- " \"run_locally\": True,\n",
- " \"runtime_worker_options\": ParamsUtils.convert_to_ast(worker_options),\n",
- " \"runtime_num_workers\": 2,\n",
- " }\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- "We will do all the processing in the `sample_data` folder. This concludes our setup section."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {},
- "outputs": [],
- "source": [
- "!mkdir -p sample_data\n",
- "!mkdir -p sample_data/hf_2_parquet"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Data Preparation Steps\n",
- "\n",
- "We now discuss the various data preparation steps that clean and transform the raw data and produce a tokenised output. We use the [parquet data format](https://parquet.apache.org/) for all our operations. This helps to efficiently scale to actual production runs, beyond the demo. \n",
- "\n",
- "1. HuggingFace2Parquet: Read the dataset from HF and convert into parquet format. \n",
- "2. Exact Deduplication: Remove exact duplicates. \n",
- "3. Fuzzy Deduplication: Remove near duplicates. \n",
- "4. Programming Lang Selection: Select the programming languages to be used for the analysis.\n",
- "5. Code Quality Annotations: Annotate whether a given code file is of high quality or not using various rules.\n",
- "6. Filtering: Filter the dataset to retain only programming languages of interest. \n",
- "7. Semantic Ordering: Organise code files by their semantic dependencies. \n",
- "8. Tokenization: Tokenise the data for model fine tuning.\n",
- "\n",
- "The data processing pipeline is organised such that the output of the previous transform is used as input to the next one. Refer to the papers [here](https://arxiv.org/pdf/2405.04324) and [here](https://arxiv.org/abs/2407.13739) for complete details for each of the above steps. "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "xliMSdQEEwYx"
- },
- "source": [
- "## 1. Huggingface datasets to Parquet\n",
- "\n",
- "This is the first component of this pipeline. It ingests a dataset `codeparrot/github-code` from huggingface and converts it into\n",
- "parquet files for consumption by the next steps in this data processing pipeline.\n",
- "\n",
- "For this demo, we process only a few records. The following fields can be updated in case you want to use more data.\n",
- "_total_files_ = 10\n",
- "_rows_per_file_ = 10\n",
- "\n",
- "The output of this stage of the pipeline would be written to `sample_data/hf_2_parquet`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "id": "wit7ic1GauWN",
- "outputId": "cc9ee442-ea65-446c-d495-e5ac83bd5f1c"
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Writing sample_data/hf_2_parquet/data_0.parquet\n",
- "Writing sample_data/hf_2_parquet/data_1.parquet\n",
- "Writing sample_data/hf_2_parquet/data_2.parquet\n",
- "Writing sample_data/hf_2_parquet/data_3.parquet\n",
- "Writing sample_data/hf_2_parquet/data_4.parquet\n",
- "Writing sample_data/hf_2_parquet/data_5.parquet\n",
- "Writing sample_data/hf_2_parquet/data_6.parquet\n",
- "Writing sample_data/hf_2_parquet/data_7.parquet\n",
- "Writing sample_data/hf_2_parquet/data_8.parquet\n",
- "Writing sample_data/hf_2_parquet/data_9.parquet\n"
- ]
- }
- ],
- "source": [
- "import os\n",
- "import pyarrow as pa\n",
- "import pyarrow.parquet as pq\n",
- "\n",
- "from datasets import load_dataset\n",
- "\n",
- "import uuid\n",
- "from data_processing.utils import TransformUtils\n",
- "from collections import defaultdict\n",
- "\n",
- "DATASET_NAME='codeparrot/github-code'\n",
- "\n",
- "ds = load_dataset(DATASET_NAME, \n",
- " streaming=True, \n",
- " split=\"train\",\n",
- " trust_remote_code=True)\n",
- "\n",
- "def row_mapper(row):\n",
- " return {\n",
- " 'ext': TransformUtils.get_file_extension(row['path'])[1],\n",
- " 'document_id': str(uuid.uuid4())\n",
- " }\n",
- "\n",
- "parquet_data_output = \"sample_data/hf_2_parquet\"\n",
- "\n",
- "def hf_dataset_to_parquet(ds, skip, nrows, file_name, mapper=None, renamed_columns=[]):\n",
- " dst_ = ds.skip(skip).take(nrows)\n",
- " data_dict = defaultdict(list)\n",
- "\n",
- " dst = dst_.map(mapper)\n",
- "\n",
- " for data in dst:\n",
- " for k, v in data.items():\n",
- " data_dict[k].append(v)\n",
- "\n",
- " for old, new in renamed_columns:\n",
- " data_dict[new] = data_dict[old]\n",
- " del data_dict[old]\n",
- "\n",
- " table = pa.Table.from_pydict(data_dict)\n",
- " pq.write_table(table, file_name)\n",
- "\n",
- "\n",
- "## Create parquet files \n",
- "\n",
- "total_files = 10\n",
- "rows_per_file = 10\n",
- "for num in range(total_files):\n",
- " file_name = os.path.join(\n",
- " f\"{parquet_data_output}\",\n",
- " f\"data_{num}.parquet\"\n",
- " )\n",
- " print (f\"Writing {file_name}\")\n",
- " hf_dataset_to_parquet(ds, \n",
- "                          num * rows_per_file,\n",
- " rows_per_file,\n",
- " file_name=file_name,\n",
- " mapper=row_mapper,\n",
- " renamed_columns=[(\"code\", \"contents\"),\n",
- " (\"path\", \"title\")])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 2. Exact Deduplication\n",
- "\n",
- "This step will find exact duplicates in the 'contents' column and remove them. This is done by computing a SHA256 hash on the code files and removing records having identical hashes.\n",
- "\n",
- "The transform specific params for exact deduplication are:\n",
- " _ededup_hash_cpu_ - Number of cpus per worker\n",
- " _ededup_num_hashes_ - Number of workers used to store hashes\n",
- " _ededup_doc_column_ - Name of column which has to be checked for deduplication\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "id": "bRUfjHExbd1g",
- "outputId": "39459ec3-491a-4a6d-c80d-8b9bf1333a15"
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "18:41:16 INFO - Running locally\n",
- "18:41:16 INFO - exact dedup params are {'doc_column': 'contents', 'hash_cpu': 0.5, 'num_hashes': 2}\n",
- "18:41:16 INFO - exact dedup params are {'doc_column': 'contents', 'hash_cpu': 0.5, 'num_hashes': 2}\n",
- "18:41:16 INFO - data factory data_ is using local data access: input_folder - sample_data/hf_2_parquet output_folder - sample_data/ededup_out\n",
- "18:41:16 INFO - data factory data_ max_files -1, n_sample -1\n",
- "18:41:16 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n",
- "18:41:16 INFO - pipeline id pipeline_id\n",
- "18:41:16 INFO - code location None\n",
- "18:41:16 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n",
- "18:41:16 INFO - actor creation delay 0\n",
- "18:41:16 INFO - job details {'job category': 'preprocessing', 'job name': 'ededup', 'job type': 'ray', 'job id': 'job_id'}\n",
- "2024-08-21 18:41:18,293\tINFO worker.py:1744 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32m127.0.0.1:8265 \u001b[39m\u001b[22m\n",
- "\u001b[36m(orchestrate pid=50349)\u001b[0m 18:41:18 INFO - orchestrator started at 2024-08-21 18:41:18\n",
- "\u001b[36m(orchestrate pid=50349)\u001b[0m 18:41:18 INFO - Number of files is 10, source profile {'max_file_size': 0.029517173767089844, 'min_file_size': 0.029506683349609375, 'total_file_size': 0.2951393127441406}\n",
- "\u001b[36m(orchestrate pid=50349)\u001b[0m 18:41:18 INFO - Cluster resources: {'cpus': 16, 'gpus': 0, 'memory': 27.052905273623765, 'object_store': 2.0}\n",
- "\u001b[36m(orchestrate pid=50349)\u001b[0m 18:41:18 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each\n",
- "\u001b[36m(orchestrate pid=50349)\u001b[0m 18:41:19 INFO - Completed 1 files in 0.008073918024698893 min\n",
- "\u001b[36m(orchestrate pid=50349)\u001b[0m 18:41:19 INFO - Completed 2 files in 0.008082167307535807 min\n",
- "\u001b[36m(orchestrate pid=50349)\u001b[0m 18:41:19 INFO - Completed 3 files in 0.008125070730845134 min\n",
- "\u001b[36m(orchestrate pid=50349)\u001b[0m 18:41:19 INFO - Completed 4 files in 0.008133133252461752 min\n",
- "\u001b[36m(orchestrate pid=50349)\u001b[0m 18:41:19 INFO - Completed 5 files in 0.008171304066975912 min\n",
- "\u001b[36m(orchestrate pid=50349)\u001b[0m 18:41:19 INFO - Completed 6 files in 0.008177065849304199 min\n",
- "\u001b[36m(orchestrate pid=50349)\u001b[0m 18:41:19 INFO - Completed 7 files in 0.008215431372324626 min\n",
- "\u001b[36m(orchestrate pid=50349)\u001b[0m 18:41:19 INFO - Completed 8 files in 0.008221383889516194 min\n",
- "\u001b[36m(orchestrate pid=50349)\u001b[0m 18:41:19 INFO - Completed 8 files (80.0%) in 0.008222103118896484 min. Waiting for completion\n",
- "\u001b[36m(orchestrate pid=50349)\u001b[0m 18:41:19 INFO - Completed processing 10 files in 0.0082634170850118 min\n",
- "\u001b[36m(orchestrate pid=50349)\u001b[0m 18:41:19 INFO - done flushing in 0.0004973411560058594 sec\n",
- "18:41:29 INFO - Completed execution in 0.21302308638890585 min, execution result 0\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- "0"
- ]
- },
- "execution_count": 5,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "import os\n",
- "import sys\n",
- "from ededup_transform_ray import EdedupRayTransformConfiguration\n",
- "\n",
- "input_folder = parquet_data_output # Output of previous stage is used as input.\n",
- "output_folder = \"sample_data/ededup_out\"\n",
- "\n",
- "local_conf = {\n",
- " \"input_folder\": input_folder,\n",
- " \"output_folder\": output_folder,\n",
- "}\n",
- "\n",
- "ededup_params = {\n",
- " # ededup parameters\n",
- " \"ededup_hash_cpu\": 0.5,\n",
- " \"ededup_num_hashes\": 2,\n",
- " \"ededup_doc_column\": \"contents\",\n",
- " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf)\n",
- "}\n",
- "\n",
- "params = common_config_params | ededup_params\n",
- "sys.argv = ParamsUtils.dict_to_req(d=params)\n",
- "ededup_launcher = RayTransformLauncher(EdedupRayTransformConfiguration())\n",
- "ededup_launcher.launch()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 3. Fuzzy Deduplication\n",
- "\n",
- "This step will find near duplicates and remove them. The code is broken into two code cells: one for adding document ids to the parquet files and one for running fuzzy dedup. Document id addition is a prerequisite for fuzzy dedup. \n",
- "\n",
- "We first add the document ids as an additional column to the parquet files.\n",
- "_doc_column_ - specifies name of the column containing the document (required for ID generation)\n",
- "_hash_column_ - specifies name of the column created to hold the string document id, if None, id is not generated\n",
- "_int_id_column_ - specifies name of the column created to hold the integer document id, if None, id is not generated\n",
- "At least one of hash_column or int_id_column must be specified.\n",
- "\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "id": "H4cYttNlbgf0",
- "outputId": "72790550-fac1-4dba-a332-fb36e4dcf483"
- },
- "outputs": [],
- "source": [
- "input_folder = \"sample_data/ededup_out\"\n",
- "output_folder = \"sample_data/docid_out\"\n",
- "\n",
- "\n",
- "from doc_id_transform_ray import DocIDRayTransformConfiguration\n",
- "local_conf = {\n",
- " \"input_folder\": input_folder,\n",
- " \"output_folder\": output_folder,\n",
- "}\n",
- "\n",
- "doc_id_params = {\n",
- " # doc id configuration\n",
- " \"doc_id_doc_column\": \"contents\",\n",
- " \"doc_id_hash_column\": \"hash_column\",\n",
- " \"doc_id_int_column\": \"int_id_column\",\n",
- " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf)\n",
- "}\n",
- "\n",
- "params = doc_id_params | common_config_params\n",
- "sys.argv = ParamsUtils.dict_to_req(d=params)\n",
- "launcher = RayTransformLauncher(DocIDRayTransformConfiguration())\n",
- "launcher.launch()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "After adding the document ids, the next step is to run fuzzy deduplication. We apply a two-step method: (1) compute MinHashes of all the documents and then use Locality Sensitive Hashing (LSH) to group documents based on their MinHash fingerprints, (2) measure the Jaccard similarity between each pair of documents\n",
- "in the same bucket and annotate all but one of them as duplicates based on a similarity\n",
- "threshold. \n",
- "\n",
- "Some important transform specific params are:\n",
- "_fdedup_doc_column_ - Column to be used for deduplication\n",
- "_fdedup_threshold_ - specifies the Jaccard similarity threshold (default is 0.7)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "id": "b11MMQEheO6q",
- "outputId": "4e6f4d73-4e60-4a28-b3c5-392c8c220111"
- },
- "outputs": [],
- "source": [
- "input_folder = \"sample_data/docid_out\"\n",
- "output_folder = \"sample_data/fdedup_out\"\n",
- "\n",
- "\n",
- "import os\n",
- "import sys\n",
- "\n",
- "from data_processing.utils import ParamsUtils\n",
- "from fdedup_transform_ray import FdedupRayTransformConfiguration\n",
- "\n",
- "local_conf = {\n",
- " \"input_folder\": input_folder,\n",
- " \"output_folder\": output_folder,\n",
- "}\n",
- "worker_options = {\"num_cpus\": 0.8}\n",
- "code_location = {\"github\": \"github\", \"commit_hash\": \"12345\", \"path\": \"path\"}\n",
- "fdedup_params = {\n",
- " # columns used\n",
- " \"fdedup_doc_column\": \"contents\",\n",
- " \"fdedup_id_column\": \"int_id_column\",\n",
- " \"fdedup_cluster_column\": \"hash_column\",\n",
- " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf)\n",
- "}\n",
- "\n",
- "params = common_config_params| fdedup_params\n",
- "\n",
- "# Pass commandline params\n",
- "sys.argv = ParamsUtils.dict_to_req(d=params)\n",
- "\n",
- "# launch\n",
- "fdedup_launcher = RayTransformLauncher(FdedupRayTransformConfiguration())\n",
- "fdedup_launcher.launch()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 4. Programming Language Selection\n",
- "\n",
- "This module helps retain the code files for the languages of interest, which can be specified using selected_languages_file. Post this step, a new column is added that contains the programming language name. One can use the code in the Filtering step to do analytics on how many files are found for which languages and thereby selectively filter. \n",
- "\n",
- "The important parameters used by this transform are:\n",
- "_lang_allowed_langs_file_key_ - A file with a list of allowed languages.\n",
- "_lang_lang_column_key_ - The name of column which has programming language.\n",
- "_lang_output_column_key_ - The name of annotation column.\n",
- "\n",
- "For this demo, we will use this [file](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/proglang_select/python/test-data/languages/allowed-code-languages.txt) to specify languages of interest, and the module will add a new column called \"language_of_interest\" which can have two values, 0 or 1. A 1 is added for all rows whose code files belong to a programming language specified in the list."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "id": "QGaG8NWUAbAu",
- "outputId": "ac40800f-d48a-4e64-c488-da8a16b7f6d5"
- },
- "outputs": [],
- "source": [
- "input_folder = \"sample_data/fdedup_out\"\n",
- "output_folder = \"sample_data/ps_out\"\n",
- "\n",
- "# download allowed-code-languages.txt\n",
- "!wget https://raw.githubusercontent.com/IBM/data-prep-kit/dev/transforms/code/proglang_select/python/test-data/languages/allowed-code-languages.txt\n",
- "selected_languages_file = \"./allowed-code-languages.txt\"\n",
- "\n",
- "from proglang_select_transform_ray import ProgLangSelectRayConfiguration\n",
- "from proglang_select_transform import (\n",
- " lang_allowed_langs_file_key,\n",
- " lang_lang_column_key,\n",
- " lang_output_column_key,\n",
- ")\n",
- "\n",
- "# create parameters\n",
- "language_column_name = \"language\"\n",
- "annotated_column_name = \"language_of_interest\"\n",
- "\n",
- "local_conf = {\n",
- " \"input_folder\": input_folder,\n",
- " \"output_folder\": output_folder,\n",
- "}\n",
- "\n",
- "langselect_config = {\n",
- " lang_allowed_langs_file_key: selected_languages_file,\n",
- " lang_lang_column_key: language_column_name,\n",
- " lang_output_column_key: annotated_column_name,\n",
- " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf)\n",
- "}\n",
- "\n",
- "params = common_config_params| langselect_config\n",
- "\n",
- "sys.argv = ParamsUtils.dict_to_req(d=params)\n",
- "\n",
- "# create launcher\n",
- "launcher = RayTransformLauncher(ProgLangSelectRayConfiguration())\n",
- "launcher.launch()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 5. Code Quality\n",
- "\n",
- "We experiment with various code quality metrics but finally retain the four code quality metrics used by Li et al. (2023) to balance the tradeoff between code quality and data volume."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 24,
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "18:42:36 INFO - Running locally\n",
- "18:42:36 INFO - data factory data_ is using local data access: input_folder - sample_data/ps_out output_folder - sample_data/cq_out\n",
- "18:42:36 INFO - data factory data_ max_files -1, n_sample -1\n",
- "18:42:36 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n",
- "18:42:36 INFO - pipeline id pipeline_id\n",
- "18:42:36 INFO - code location None\n",
- "18:42:36 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n",
- "18:42:36 INFO - actor creation delay 0\n",
- "18:42:36 INFO - job details {'job category': 'preprocessing', 'job name': 'code_quality', 'job type': 'ray', 'job id': 'job_id'}\n",
- "2024-08-21 18:42:38,257\tINFO worker.py:1744 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32m127.0.0.1:8265 \u001b[39m\u001b[22m\n",
- "\u001b[36m(orchestrate pid=50820)\u001b[0m 18:42:39 INFO - orchestrator started at 2024-08-21 18:42:39\n",
- "\u001b[36m(orchestrate pid=50820)\u001b[0m 18:42:39 INFO - Number of files is 2, source profile {'max_file_size': 0.021309852600097656, 'min_file_size': 0.015825271606445312, 'total_file_size': 0.03713512420654297}\n",
- "\u001b[36m(orchestrate pid=50820)\u001b[0m 18:42:39 INFO - Cluster resources: {'cpus': 16, 'gpus': 0, 'memory': 27.17441406287253, 'object_store': 2.0}\n",
- "\u001b[36m(orchestrate pid=50820)\u001b[0m 18:42:39 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each\n",
- "\u001b[36m(orchestrate pid=50820)\u001b[0m 18:42:39 INFO - Completed 0 files (0.0%) in 4.450480143229167e-06 min. Waiting for completion\n",
- "\u001b[36m(RayTransformFileProcessor pid=50838)\u001b[0m /Users/shivdeep/workspace/projects/current/oss-data-prep/patchsets/20aug/examples/notebooks/code/venv/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
- "\u001b[36m(RayTransformFileProcessor pid=50838)\u001b[0m warnings.warn(\n",
- "\u001b[36m(orchestrate pid=50820)\u001b[0m 18:42:41 INFO - Completed processing 2 files in 0.03352181911468506 min\n",
- "\u001b[36m(orchestrate pid=50820)\u001b[0m 18:42:41 INFO - done flushing in 0.0006060600280761719 sec\n",
- "\u001b[36m(RayTransformFileProcessor pid=50838)\u001b[0m Token indices sequence length is longer than the specified maximum sequence length for this model (4244 > 1024). Running this sequence through the model will result in indexing errors\n",
- "18:42:51 INFO - Completed execution in 0.24930408000946044 min, execution result 0\n",
- "\u001b[36m(RayTransformFileProcessor pid=50837)\u001b[0m /Users/shivdeep/workspace/projects/current/oss-data-prep/patchsets/20aug/examples/notebooks/code/venv/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
- "\u001b[36m(RayTransformFileProcessor pid=50837)\u001b[0m warnings.warn(\n",
- "\u001b[36m(RayTransformFileProcessor pid=50837)\u001b[0m Token indices sequence length is longer than the specified maximum sequence length for this model (3350 > 1024). Running this sequence through the model will result in indexing errors\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- "0"
- ]
- },
- "execution_count": 24,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "input_folder = \"sample_data/ps_out\"\n",
- "output_folder = \"sample_data/cq_out\"\n",
- "\n",
- "from code_quality_transform_ray import CodeQualityRayTransformConfiguration\n",
- "\n",
- "local_conf = {\n",
- " \"input_folder\": input_folder,\n",
- " \"output_folder\": output_folder,\n",
- "}\n",
- "language_column_name = \"language\"\n",
- "params = {\n",
- " \"cq_contents_column_name\": \"contents\",\n",
- " \"cq_language_column_name\": language_column_name,\n",
- " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf)\n",
- "}\n",
- "\n",
- "params = common_config_params| params\n",
- "sys.argv = ParamsUtils.dict_to_req(d=params)\n",
- "\n",
- "# create launcher\n",
- "launcher = RayTransformLauncher(CodeQualityRayTransformConfiguration())\n",
- "# launch\n",
- "launcher.launch()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "oXu_i9jLAo9H"
- },
- "source": [
- "## 6. Filtering\n",
- "\n",
- "This step can be used to filter the code files based on our chosen conditions. In this demo example, we have used only one annotation, which adds the programming language name for each code file. To demonstrate the utility, we will use this module to retain only code files of interest."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "id": "OAl7B58oAyZQ",
- "outputId": "5fc229ef-bb87-4e34-9302-1670b8832d97"
- },
- "outputs": [],
- "source": [
- "input_folder = \"sample_data/cq_out\"\n",
- "output_folder = \"sample_data/filter_out\"\n",
- "\n",
- "\n",
- "from filter_transform import (\n",
- " filter_columns_to_drop_cli_param,\n",
- " filter_criteria_cli_param,\n",
- " filter_logical_operator_cli_param,\n",
- ")\n",
- "from filter_transform_ray import FilterRayTransformConfiguration\n",
- "\n",
- "local_conf = {\n",
- " \"input_folder\": input_folder,\n",
- " \"output_folder\": output_folder,\n",
- "}\n",
- "\n",
- "# This is just an example criteria to filter\n",
- "filter_criteria = [\n",
- " \"language_of_interest = 1\",\n",
- " \"total_num_lines > 10 AND total_num_lines < 90\"\n",
- "]\n",
- "filter_logical_operator = \"AND\"\n",
- "filter_columns_to_drop = [\"language_of_interest\", \"hash_column\"]\n",
- "\n",
- "filter_params = {\n",
- " filter_criteria_cli_param: filter_criteria,\n",
- " filter_columns_to_drop_cli_param: filter_columns_to_drop,\n",
- " filter_logical_operator_cli_param: filter_logical_operator,\n",
- " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf)\n",
- "}\n",
- "\n",
- "\n",
- "sys.argv = ParamsUtils.dict_to_req(common_config_params| filter_params)\n",
- "launcher = RayTransformLauncher(FilterRayTransformConfiguration())\n",
- "launcher.launch()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 7. Semantic Ordering of Code Files\n",
- "\n",
- "In this step, we order the code files such that we pack files from the same repository together, arranging them to prioritize semantic dependencies. We identify these dependencies by analyzing file imports and create a directed acyclic graph, where each file is a node and edges represent API imports between files. After breaking any cycles in the graph, we perform a topological sort to establish an ordering of files based on their semantic dependencies. We then organize the files in a repository by placing documentation and build files first, followed by the ordered set of files with semantic dependencies, and finally the remaining non-connected files. These non-connected files are arranged according to their folder structure, using a depth-first search to traverse the repository. Finally, we determine the dominant programming language of a repository based on file extensions and presence of build files, to organise repo-ordered files by programming languages.\n",
- "\n",
- "\n",
- "This transform has the following parameters:\n",
- " _repo_lvl_sorting_enabled_ - If True, the repo level output is sorted using _repo_lvl_sorting_algo_\n",
- " _repo_lvl_sorting_algo_ - Select the sorting algorithm to be used for repo level sorting. Use SORT_SEMANTIC_NORMALISED to organise by semantic dependencies or SORT_BY_PATH to arrange files based on folder structure in a repository.\n",
- " _repo_lvl_store_backend_dir_ - Directory to use for local store. Needed only when repo_lvl_store_type=local\n",
- " _repo_lvl_output_by_langs_ - If True, it organises output into folders of programming language.\n",
- " _repo_lvl_combine_rows_ - If True, it combines the contents of a repo into a single row.\n",
- "\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "18:43:07 INFO - Running locally\n",
- "18:43:07 INFO - data factory data_ is using local data access: input_folder - sample_data/filter_out output_folder - sample_data/rlo_out\n",
- "18:43:07 INFO - data factory data_ max_files -1, n_sample -1\n",
- "18:43:07 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n",
- "18:43:07 INFO - pipeline id pipeline_id\n",
- "18:43:07 INFO - code location None\n",
- "18:43:07 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n",
- "18:43:07 INFO - actor creation delay 0\n",
- "18:43:07 INFO - job details {'job category': 'preprocessing', 'job name': 'repo_lvl', 'job type': 'ray', 'job id': 'job_id'}\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Creating Store Params\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "2024-08-21 18:43:08,706\tINFO worker.py:1744 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32m127.0.0.1:8265 \u001b[39m\u001b[22m\n",
- "\u001b[36m(orchestrate pid=51009)\u001b[0m 18:43:09 INFO - orchestrator started at 2024-08-21 18:43:09\n",
- "\u001b[36m(orchestrate pid=51009)\u001b[0m 18:43:09 INFO - Number of files is 2, source profile {'max_file_size': 0.010923385620117188, 'min_file_size': 0.004130363464355469, 'total_file_size': 0.015053749084472656}\n",
- "\u001b[36m(orchestrate pid=51009)\u001b[0m 18:43:09 INFO - Cluster resources: {'cpus': 16, 'gpus': 0, 'memory': 27.229208374395967, 'object_store': 2.0}\n",
- "\u001b[36m(orchestrate pid=51009)\u001b[0m 18:43:09 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each\n",
- "\u001b[36m(orchestrate pid=51009)\u001b[0m 18:43:09 INFO - => get_transform_config started\n",
- "\u001b[36m(orchestrate pid=51009)\u001b[0m 18:43:09 INFO - dict_keys(['store_backend_dir', 'store_type', 's3_creds'])\n",
- "\u001b[36m(orchestrate pid=51009)\u001b[0m 18:43:09 INFO - <= get_transform_config\n",
- "\u001b[36m(orchestrate pid=51009)\u001b[0m 18:43:09 INFO - Completed 0 files (0.0%) in 2.797444661458333e-06 min. Waiting for completion\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\u001b[36m(orchestrate pid=51009)\u001b[0m Init Store params\n",
- "\u001b[36m(RayTransformFileProcessor pid=51025)\u001b[0m Creating local store.\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "\u001b[36m(orchestrate pid=51009)\u001b[0m 18:43:09 INFO - Completed processing 2 files in 0.00863101085027059 min\n",
- "\u001b[36m(orchestrate pid=51009)\u001b[0m 18:43:09 INFO - done flushing in 0.0005919933319091797 sec\n",
- "\u001b[36m(orchestrate pid=51009)\u001b[0m 18:43:09 INFO - Store Backend is None\n",
- "\u001b[36m(orchestrate pid=51009)\u001b[0m 18:43:09 INFO - Stage 1 Finished in 0:00:00.524131.\n",
- "\u001b[36m(RayTransformFileProcessor pid=51024)\u001b[0m 18:43:09 WARNING - table is empty, skipping processing\n",
- "\u001b[36m(orchestrate pid=51009)\u001b[0m 18:43:10 I - Repo level sorting is enabled. Algo: SORT_SEMANTIC_NORMALISED\n",
- "\u001b[36m(orchestrate pid=51009)\u001b[0m 18:43:10 I - normalised semantic sort enabled\n",
- "\u001b[36m(orchestrate pid=51009)\u001b[0m 18:43:10 I - Output by language enabled.\n",
- "\u001b[36m(orchestrate pid=51009)\u001b[0m 18:43:10 I - Combine rows enabled.\n",
- "\u001b[36m(orchestrate pid=51009)\u001b[0m 18:43:10 I - Processing 2 repos with 2 workers\n",
- "\u001b[36m(orchestrate pid=51009)\u001b[0m 18:43:11 I - Finished the transform in 0:00:02.268307 \n",
- "\u001b[36m(GroupByRepoActor pid=51030)\u001b[0m 18:43:11 I - Write C/wvuRc2%2Frc2client, tables: 1\n",
- "18:43:21 INFO - Completed execution in 0.2429994503657023 min, execution result 0\n",
- "\u001b[36m(GroupByRepoActor pid=51029)\u001b[0m 18:43:11 I - Write C/becm%2Fmpt-solver, tables: 1\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\u001b[36m(orchestrate pid=51009)\u001b[0m Creating local store.\u001b[32m [repeated 2x across cluster]\u001b[0m\n"
- ]
- }
- ],
- "source": [
- "input_folder = \"sample_data/filter_out\"\n",
- "output_folder = \"sample_data/rlo_out\"\n",
- "\n",
- "import tempfile\n",
- "from repo_level_order_transform import RepoLevelOrderRayTransformConfiguration\n",
- "with tempfile.TemporaryDirectory() as tmpdirname:\n",
- "\n",
- " # create parameters\n",
- " local_conf = {\n",
- " \"input_folder\": input_folder,\n",
- " \"output_folder\": output_folder,\n",
- " }\n",
- "\n",
- " worker_options = {\"num_cpus\": 0.8}\n",
- " code_location = {\"github\": \"github\", \"commit_hash\": \"12345\", \"path\": \"path\"}\n",
- "\n",
- " repo_level_params = {\n",
- " \"repo_lvl_sorting_algo\": \"SORT_SEMANTIC_NORMALISED\",\n",
- " \"repo_lvl_store_type\": \"local\",\n",
- " \"repo_lvl_store_backend_dir\": tmpdirname,\n",
- " \"repo_lvl_output_by_langs\": True,\n",
- " \"repo_lvl_combine_rows\": True,\n",
- " \"repo_lvl_sorting_enabled\": True,\n",
- " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf)\n",
- " }\n",
- "\n",
- " \n",
- " sys.argv = ParamsUtils.dict_to_req(d= common_config_params| repo_level_params)\n",
- " launcher = RayTransformLauncher(RepoLevelOrderRayTransformConfiguration())\n",
- " launcher.launch()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "id": "byK75Kb1A3E7"
- },
- "source": [
- "## 8. Tokenization\n",
- "\n",
- "Next, we tokenize the data to be used for fine tuning. \n",
- "\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {
- "colab": {
- "base_uri": "https://localhost:8080/"
- },
- "id": "kBYg93WMBBq6",
- "outputId": "b3e0541e-4a3d-46f4-8809-ccc8778a53fc"
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "18:43:23 INFO - Running locally\n",
- "18:43:23 INFO - data factory data_ is using local data access: input_folder - sample_data/rlo_out output_folder - sample_data/tokenize_out\n",
- "18:43:23 INFO - data factory data_ max_files -1, n_sample -1\n",
- "18:43:23 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n",
- "18:43:23 INFO - pipeline id pipeline_id\n",
- "18:43:23 INFO - code location None\n",
- "18:43:23 INFO - number of workers 2 worker options {'num_cpus': 0.8, 'max_restarts': -1}\n",
- "18:43:23 INFO - actor creation delay 0\n",
- "18:43:23 INFO - job details {'job category': 'preprocessing', 'job name': 'Tokenization', 'job type': 'ray', 'job id': 'job_id'}\n",
- "2024-08-21 18:43:24,730\tINFO worker.py:1744 -- Started a local Ray instance. View the dashboard at \u001b[1m\u001b[32m127.0.0.1:8265 \u001b[39m\u001b[22m\n",
- "\u001b[36m(orchestrate pid=51142)\u001b[0m 18:43:25 INFO - orchestrator started at 2024-08-21 18:43:25\n",
- "\u001b[36m(orchestrate pid=51142)\u001b[0m 18:43:25 INFO - Number of files is 3, source profile {'max_file_size': 0.015120506286621094, 'min_file_size': 0.00478363037109375, 'total_file_size': 0.0263824462890625}\n",
- "\u001b[36m(orchestrate pid=51142)\u001b[0m 18:43:25 INFO - Cluster resources: {'cpus': 16, 'gpus': 0, 'memory': 27.185578919015825, 'object_store': 2.0}\n",
- "\u001b[36m(orchestrate pid=51142)\u001b[0m 18:43:25 INFO - Number of workers - 2 with {'num_cpus': 0.8, 'max_restarts': -1} each\n",
- "\u001b[36m(RayTransformFileProcessor pid=51160)\u001b[0m /Users/shivdeep/workspace/projects/current/oss-data-prep/patchsets/20aug/examples/notebooks/code/venv/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
- "\u001b[36m(RayTransformFileProcessor pid=51160)\u001b[0m warnings.warn(\n",
- "\u001b[36m(orchestrate pid=51142)\u001b[0m 18:43:27 INFO - Completed 1 files in 0.02494144837061564 min\n",
- "\u001b[36m(orchestrate pid=51142)\u001b[0m 18:43:27 INFO - Completed 1 files (33.333333333333336%) in 0.024943232536315918 min. Waiting for completion\n",
- "\u001b[36m(orchestrate pid=51142)\u001b[0m 18:43:27 INFO - Completed processing 3 files in 0.02498700221379598 min\n",
- "\u001b[36m(orchestrate pid=51142)\u001b[0m 18:43:27 INFO - done flushing in 0.0005640983581542969 sec\n",
- "18:43:37 INFO - Completed execution in 0.24013988176981607 min, execution result 0\n",
- "\u001b[36m(RayTransformFileProcessor pid=51161)\u001b[0m /Users/shivdeep/workspace/projects/current/oss-data-prep/patchsets/20aug/examples/notebooks/code/venv/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.\n",
- "\u001b[36m(RayTransformFileProcessor pid=51161)\u001b[0m warnings.warn(\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- "0"
- ]
- },
- "execution_count": 11,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "input_folder = \"sample_data/rlo_out\"\n",
- "output_folder = \"sample_data/tokenize_out\"\n",
- "\n",
- "from tokenization_transform_ray import TokenizationRayConfiguration\n",
- "\n",
- "local_conf = {\n",
- " \"input_folder\": input_folder,\n",
- " \"output_folder\": output_folder,\n",
- "}\n",
- "\n",
- "tf_params= {\n",
- " \"data_local_config\": ParamsUtils.convert_to_ast(local_conf)\n",
- "}\n",
- "sys.argv = ParamsUtils.dict_to_req(d=common_config_params| tf_params)\n",
- "# create launcher\n",
- "launcher = RayTransformLauncher(TokenizationRayConfiguration())\n",
- "# Launch the ray actor(s) to process the input\n",
- "launcher.launch()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "id": "xFUrzzjeBFfJ"
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- }
- ],
- "metadata": {
- "accelerator": "TPU",
- "colab": {
- "gpuType": "V28",
- "provenance": [],
- "toc_visible": true
- },
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.11.6"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}