A project for determining the similarity of python repositories based on embedding approach

RepoMining/RepoSim4Py

RepoSim4Py

Description

RepoSim4Py is a project for determining the semantic similarity among Python repositories using an embedding-based approach. It contains a series of Jupyter notebooks and Python scripts that we used to detect semantically similar Python repositories with different embedding models.

The Jupyter notebooks cover the whole process of evaluating and choosing the best model among Doc2Vec models and pre-trained models based on the Transformer architecture. The Python scripts, in turn, generate embeddings at different levels and calculate cosine similarities using the best models.

We considered repository information at several levels: code, docstrings, requirements, READMEs, and structure. Currently, at each level (except the structure level, which we discarded in the end), our best-performing model is UniXCoder fine-tuned on the code search task with the AdvTest dataset.

More details about the model evaluations can be found in the Doc2Vec and Embedding folders. Details on the implementation and application of our approach can be found in the Script folder.

Features

Given a list of Python repository names, RepoSim4Py can:

  • Extract all code, docstrings, requirements, and READMEs from the given repositories.
  • Generate embeddings for each of these levels using a UniXCoder model fine-tuned on the nl-code-search task.
  • Aggregate the 4 level-wise mean embeddings into 1 repo-level mean embedding.
  • Calculate semantic similarities between two or more repositories on the different embedding levels (codes-level, docs-level, requirements-level, readmes-level, and repo-level).
  • Save repository information (such as code, docs, code embeddings, repository embeddings, etc.) into a .pkl file, and the semantic similarity results into a .csv file.
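The aggregation and similarity steps listed above can be illustrated with a minimal sketch. It is not the project's implementation: the helper names (`mean_vector`, `repo_embedding`, `pairwise_similarities`) and the toy data are hypothetical, and it assumes the repo-level vector is a plain average of the level means, which the description does not spell out. It uses only the standard library so it runs without the project's dependencies.

```python
import math
from itertools import combinations


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def repo_embedding(level_embeddings):
    """Average each level's embeddings, then average the level means.

    Assumption: the repo-level vector is the plain mean of the level means.
    Levels with no extracted items are skipped.
    """
    level_means = [mean_vector(vecs) for vecs in level_embeddings.values() if vecs]
    return mean_vector(level_means)


def pairwise_similarities(repos):
    """Return {(name1, name2): similarity} for every unordered pair of repos."""
    vectors = {name: repo_embedding(levels) for name, levels in repos.items()}
    return {
        (a, b): cosine_similarity(vectors[a], vectors[b])
        for a, b in combinations(vectors, 2)
    }


# Toy 2-dimensional embeddings with made-up repository names;
# real model embeddings are much larger.
toy_repos = {
    "owner/repo-a": {"codes": [[1.0, 0.0], [1.0, 0.0]], "readmes": [[1.0, 0.0]]},
    "owner/repo-b": {"codes": [[0.0, 1.0]], "readmes": [[0.0, 1.0]]},
    "owner/repo-c": {"codes": [[2.0, 0.0]], "readmes": [[4.0, 0.0]]},
}
similarities = pairwise_similarities(toy_repos)
```

Because cosine similarity ignores vector length, `owner/repo-a` and `owner/repo-c` come out as maximally similar here even though their vectors differ in magnitude.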

Installation

Prerequisites

  • Python 3.9+
  • pip

Package dependencies

accelerate==0.21.0
torch==2.0.1
numpy==1.25.0
pandas==2.0.3
transformers==4.30.2
sentence-transformers==2.2.2
tqdm==4.65.0
inspect4py==0.0.8
scikit-learn==1.3.0
matplotlib==3.7.2

Installation from code

  1. Clone the project: download the source code with the following command.
git clone https://github.com/PythonSimilarity/RepoSim4Py.git
  2. Install package dependencies: install the required dependencies by running the following command.
pip install -r requirements.txt

After following the instructions above, you will have everything needed to run this project.

Usage

Change your current directory to the Script folder before using the features above:

cd Script

Then run the program using the following format:

python RepoSim4Py.py --input <repo1> <repo2> ... --output <output_dir> [--eval]

For example:

python RepoSim4Py.py -i lepture/authlib idan/oauthlib evonove/django-oauth-toolkit selwin/python-user-agents SmileyChris/django-countries django-compressor/django-compressor billpmurphy/hask pytoolz/toolz Suor/funcy przemyslawjanpietrzak/pyMonet -o output/ --eval
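For readers who want to wrap or script the command line, the options shown above could be parsed roughly as follows. This is a sketch built on the standard `argparse` module, not the code in `RepoSim4Py.py`: the functions `build_parser` and `parse_cli` are hypothetical, and treating `--output` as required is an assumption.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (assumed layout)."""
    parser = argparse.ArgumentParser(prog="RepoSim4Py.py")
    parser.add_argument(
        "-i", "--input", nargs="+", required=True, metavar="OWNER/REPO",
        help="GitHub repositories to compare, e.g. lepture/authlib",
    )
    parser.add_argument(
        "-o", "--output", required=True, metavar="OUTPUT_DIR",
        help="directory that receives the output files",
    )
    parser.add_argument(
        "--eval", action="store_true",
        help="also write the pairwise similarity CSV",
    )
    return parser


def parse_cli(argv):
    """Parse argv and enforce the documented minimum of two repositories."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if len(args.input) < 2:
        parser.error("--input needs at least 2 repositories")
    return args


args = parse_cli(["-i", "lepture/authlib", "idan/oauthlib", "-o", "output/", "--eval"])
```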

The input is a list of at least 2 GitHub repository names in the format <owner>/<repo> (e.g. lepture/authlib). The output of the script is a Python pickle file, <output_dir>/output.pkl, which stores a list of dictionaries, one per repository. Each dictionary contains the repository's name, topics, license, and stars; the lists of extracted codes, docstrings, requirements, and readmes; the embeddings and the mean embedding for each of these levels; and the repo-level mean embedding. This file can be used for later experiments such as semantic similarity calculation and comparison.
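A short sketch of how the pickle file might be inspected afterwards. The exact dictionary keys are not listed in this README, so the code discovers them instead of assuming names; `load_repo_records` and `field_names` are hypothetical helpers. Loading the real file may also require the libraries used to create the embeddings to be installed, and pickle files should only be loaded from sources you trust.

```python
import pickle
from pathlib import Path


def load_repo_records(pkl_path):
    """Load the list of per-repository dictionaries from a pickle file."""
    with Path(pkl_path).open("rb") as fh:
        records = pickle.load(fh)
    if not isinstance(records, list):
        raise TypeError(f"expected a list, got {type(records).__name__}")
    return records


def field_names(records):
    """Return the sorted union of keys found across all record dictionaries."""
    return sorted(
        {key for record in records if isinstance(record, dict) for key in record}
    )
```

For example, `field_names(load_repo_records("output/output.pkl"))` would list the fields actually stored, assuming the script was run with `-o output/`.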

When --eval is specified, the script also compares each pair of repositories in the input list and saves the results to <output_dir>/evaluation_result.csv. This CSV file has 9 columns: repo1, repo2, topics1, topics2, code_sim, doc_sim, requirement_sim, readme_sim, and repo_sim, identifying the two repositories and giving their similarity scores on the mean embeddings at each level.
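As an illustration of working with this file, the sketch below ranks repository pairs by one of the similarity columns. It assumes the CSV has a header row using exactly the column names listed above and that the scores parse as floats; `top_pairs` is a hypothetical helper, not part of the project.

```python
import csv
import math

SIM_COLUMNS = ("code_sim", "doc_sim", "requirement_sim", "readme_sim", "repo_sim")


def top_pairs(csv_path, column="repo_sim", k=5):
    """Return the k highest-scoring (repo1, repo2, score) tuples for a column."""
    if column not in SIM_COLUMNS:
        raise ValueError(f"unknown similarity column: {column}")
    with open(csv_path, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))
    scored = []
    for row in rows:
        try:
            score = float(row[column])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows whose score is missing or not numeric
        if math.isnan(score):
            continue  # skip undefined similarities
        scored.append((row.get("repo1"), row.get("repo2"), score))
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:k]
```

Calling `top_pairs("output/evaluation_result.csv", column="code_sim", k=3)` would then return the three pairs with the most similar code-level embeddings, assuming the script was run with `-o output/ --eval`.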

License

Distributed under the MIT License. See LICENSE for more information.
