FalsePositiveDetection/README.md at master · Kaushl2208/FalsePositiveDetection · GitHub

This repository is intented to work towards building a first class False positives detection method.

Documentation about the project and weekly progress reports can be found on this link:Fossology GSoC

Using the Jupyter Notebook

Clone the repository using git clone [email protected]:Kaushl2208/FalsePositiveDetection.git
You can install Jupyter Notebook in your system or Use Jupyter Notebook Extension in VS Code.

Input file type and Expected changes:

Input file should be .csv file characteristics:

a. "Copyright": That contains the copyright statements

b. "Manual Tag" (optional): If you want to calculate the accuracy over manual tagging.
One flag should be provided clutter_flag, Which tells the script to remove the unwanted clutter from the TP copyright statements.

Output File:

The output file will be updatedChanges.csv containing:

a. "Hit&Miss": which tells us about the algorithm's output, t for a true copyright statement and f for a false copyright statement.

b. "edited_text" : if the clutter_flag was true, The updated text without clutter will be seen here.

Miscellaneous

You can also train the model over for your specific wordset/dataset/knowledge bag.
Provide the dataset in the spacy input format and run the model_train.py script.

a. You can also tune in the number of epochs and even train your own model with name in modelName variable.

b. Normally it will train the pre used en_core_web_sm model over the new training set and may result into more accuracy for specific set.