Monitoring AI-Modified Content at Scale

Installation | Usage | Using Your Own Datasets

A simple and effective method for estimating the fraction of text in a large corpus that has been substantially modified or generated by AI:

Distributional GPT Detection. In contrast with instance-level detection, this framework focuses on population-level estimates. We demonstrate how to estimate the proportion of content in a given corpus that has been generated or significantly modified by AI, without the need to perform inference on any individual instance.
Easy Deployment and Usage. Our code can quickly estimate the distribution of both AI- and human-generated text without an expensive model training procedure. Using these estimated text distributions, we can accurately predict the fraction of text in a large corpus that has been substantially modified or generated by AI.
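
To make the population-level idea concrete, here is a minimal sketch. It is not the repository's implementation. It assumes each document's likelihood under the human distribution P and the AI distribution Q is already known, models the corpus as the mixture (1 - alpha) * P + alpha * Q, and picks the alpha that maximizes the log-likelihood by a simple grid search. The function names and the toy two-symbol corpus are hypothetical.

```python
import math

def mixture_log_likelihood(alpha, p_human, p_ai):
    """Log-likelihood of the corpus under (1 - alpha) * P + alpha * Q."""
    return sum(
        math.log((1.0 - alpha) * p + alpha * q)
        for p, q in zip(p_human, p_ai)
    )

def estimate_alpha(p_human, p_ai, steps=1000):
    """Grid-search the alpha in [0, 1] that maximizes the mixture log-likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda a: mixture_log_likelihood(a, p_human, p_ai))

# Toy setup (assumption): every "document" is a single symbol, "a" or "b".
human_prob = {"a": 0.8, "b": 0.2}   # P: likelihood under the human distribution
ai_prob = {"a": 0.3, "b": 0.7}      # Q: likelihood under the AI distribution

# Corpus built so that the true AI fraction is 0.2:
# P(doc == "a") = 0.8 * 0.8 + 0.2 * 0.3 = 0.70
docs = ["a"] * 700 + ["b"] * 300

p_human = [human_prob[d] for d in docs]
p_ai = [ai_prob[d] for d in docs]

alpha_hat = estimate_alpha(p_human, p_ai)
print(f"estimated alpha: {alpha_hat:.3f}")
```

Note that no individual document is ever labeled as human or AI; only the corpus-level fraction is estimated.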

Installation

This repository was built using Python 3.8.19 and should work with any Python >= 3.8. It was developed and has been thoroughly tested with pandas 2.0.3, numpy 1.24.4, pyarrow 15.0.2, fastparquet 2024.2.0, scipy 1.10.1, and ipykernel 6.29.4.

You can install this package locally by cloning the repository and creating the conda environment from the provided environment.yml file:

git clone https://github.com/Weixin-Liang/Mapping-the-Increasing-Use-of-LLMs-in-Scientific-Papers.git
cd Mapping-the-Increasing-Use-of-LLMs-in-Scientific-Papers
conda env create -f environment.yml

If you run into any problems during the installation process, please file a GitHub Issue.
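
If you want to confirm that your environment matches the versions listed above, a small check like the following may help. It is a sketch using only the standard library's importlib.metadata; the helper name installed_versions is hypothetical, and a version that differs from the tested one is not necessarily a problem.

```python
from importlib import metadata

# Versions this README lists as tested.
TESTED_VERSIONS = {
    "pandas": "2.0.3",
    "numpy": "1.24.4",
    "pyarrow": "15.0.2",
    "fastparquet": "2024.2.0",
    "scipy": "1.10.1",
    "ipykernel": "6.29.4",
}

def installed_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found

for name, have in installed_versions(TESTED_VERSIONS).items():
    want = TESTED_VERSIONS[name]
    if have is None:
        status = "missing"
    elif have == want:
        status = "ok"
    else:
        status = "differs"
    print(f"{name:12} tested={want:10} installed={have or '-':10} {status}")
```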

Usage

Once installed, estimating distributions and running inference is easy (see demo.ipynb for the full demo):

from src.estimation import estimate_text_distribution
from src.MLE import MLE

# Estimate the human and AI content distributions from the two training
# corpora and save them to distribution/CS.parquet
estimate_text_distribution(
    "data/training_data/CS/human_data.parquet",
    "data/training_data/CS/ai_data.parquet",
    "distribution/CS.parquet",
)
# Load the estimated word occurrence frequencies into the framework
model = MLE("distribution/CS.parquet")
# Validate the method on mixed corpora with known ground-truth alpha
print(f"{'Ground Truth':>12},{'Prediction':>12},{'CI':>12},{'Error':>12}")
for alpha in [0, 0.025, 0.05, 0.075, 0.1, 0.125, 0.15, 0.175, 0.2, 0.225, 0.25]:
    estimated, ci = model.inference(f"data/validation_data/CS/ground_truth_alpha_{alpha}.parquet")
    error = abs(estimated - alpha)
    print(f"{alpha:12.3f},{estimated:12.3f},{ci:12.3f},{error:12.3f}")

For a complete demonstration, check out demo.ipynb.
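
The validation files above mix human and AI text at a known ratio. If you need a similar file for your own domain, one possible way to build it is sketched below. How the provided validation data was actually constructed is not described here, so the function mix_corpus and its sampling scheme are assumptions for illustration only.

```python
import random

def mix_corpus(human_sentences, ai_sentences, alpha, n, seed=0):
    """Sample n sentences, of which round(alpha * n) come from the AI corpus."""
    rng = random.Random(seed)
    n_ai = round(alpha * n)
    mixed = rng.sample(ai_sentences, n_ai) + rng.sample(human_sentences, n - n_ai)
    rng.shuffle(mixed)  # avoid any ordering by source
    return mixed

# Toy tokenized corpora; the first token marks the source so the mix can be checked.
human = [["human", str(i)] for i in range(100)]
ai = [["ai", str(i)] for i in range(100)]

mixed = mix_corpus(human, ai, alpha=0.25, n=40)
ai_share = sum(1 for sentence in mixed if sentence[0] == "ai") / len(mixed)
print(f"sentences: {len(mixed)}, AI share: {ai_share:.3f}")
```

Sampling is without replacement, so each source corpus must contain at least as many sentences as are drawn from it.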

Using Your Own Datasets

This repository includes the arXiv abstracts used for the analysis in our second paper. However, our framework can easily be extended to other domains of your choice. It requires two datasets--one consisting of documents written entirely by humans, and another consisting of documents written entirely by AI--which are used to estimate the distribution of human- and AI-generated text in your chosen domain. Using these estimates, you can perform inference on a target dataset with an unknown fraction of AI-generated content.
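
The Usage comments refer to word occurrence frequencies. As a rough illustration of what such a statistic can look like, the sketch below computes, for each word, the fraction of sentences in which it appears at least once. This is an assumption about the general idea, not a description of what estimate_text_distribution does internally; the function name occurrence_frequency is hypothetical.

```python
from collections import Counter

def occurrence_frequency(sentences):
    """Fraction of sentences in which each word appears at least once."""
    if not sentences:
        return {}
    counts = Counter()
    for tokens in sentences:
        counts.update(set(tokens))  # count each word once per sentence
    total = len(sentences)
    return {word: count / total for word, count in counts.items()}

human_sentences = [
    ["we", "propose", "a", "method"],
    ["we", "show", "results", "results"],
]
human_freq = occurrence_frequency(human_sentences)
print(human_freq["we"], human_freq["propose"], human_freq["results"])
```

Computing the same table on the AI corpus gives a second distribution, and comparing the two shows which words are over- or under-represented in AI-generated text for your domain.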

The function estimate_text_distribution in src.estimation takes two file paths as input, indicating where the human- and AI-generated text are stored (plus an output path for the estimated distribution, as in the Usage example). Both input files must be in .parquet format. For human-generated text, the parquet file must have a column named human_sentence, with one tokenized sentence (a list of words) per row. Similarly, for AI-generated text, the file must have a column named ai_sentence, with one tokenized sentence (a list of words) per row.

Example of human-generated data:

human_sentence
["This", "is", "an", "example"]
["Another", "sentence", "for", "you"]

Example of AI-generated data:

ai_sentence
["This", "is", "an", "example"]
["Another", "sentence", "for", "you"]
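
A minimal sketch of producing training files in this layout is shown below. It assumes pandas with a parquet engine (pyarrow or fastparquet) is installed, as listed under Installation. The helper names and the my_domain output paths are hypothetical, and the write calls are left commented out so the snippet has no side effects.

```python
def to_rows(sentences, column):
    """Wrap each tokenized sentence in a one-column record."""
    return [{column: list(tokens)} for tokens in sentences]

def write_parquet(rows, path):
    """Write the records to a parquet file, one tokenized sentence per row."""
    import pandas as pd  # imported here so the rest of the sketch runs without pandas
    pd.DataFrame(rows).to_parquet(path, index=False)

sentences = [
    ["This", "is", "an", "example"],
    ["Another", "sentence", "for", "you"],
]
human_rows = to_rows(sentences, "human_sentence")
ai_rows = to_rows(sentences, "ai_sentence")
print(human_rows[0])

# Hypothetical output paths; uncomment to write the files.
# write_parquet(human_rows, "data/training_data/my_domain/human_data.parquet")
# write_parquet(ai_rows, "data/training_data/my_domain/ai_data.parquet")
```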

For inference on a target dataset, the inference method of the MLE class also takes a file path as input. The input parquet file must have a column named inference_sentence, with one tokenized sentence (a list of words) per row.

Example of inference data:

inference_sentence
["This", "is", "an", "example"]
["Another", "sentence", "for", "you"]

Note that we provide our tokenize function in tokenize_demo.ipynb for reference.
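
If you only need a quick way to turn raw text into rows of this shape, the sketch below may serve as a starting point. It deliberately swaps in a simple regular-expression tokenizer and sentence splitter from the standard library instead of the spaCy-based function in tokenize_demo.ipynb, so its output will not match that function exactly. All names here are hypothetical.

```python
import re

WORD_PATTERN = re.compile(r"\w+")
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def split_sentences(text):
    """Split text after '.', '!' or '?' followed by whitespace (a rough heuristic)."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]

def simple_tokenize(sentence):
    """Return the runs of word characters in the sentence, dropping punctuation."""
    return WORD_PATTERN.findall(sentence)

text = "This is an example. Another sentence for you."
inference_rows = [
    {"inference_sentence": simple_tokenize(sentence)}
    for sentence in split_sentences(text)
]
for row in inference_rows:
    print(row)
```

Whichever tokenizer you choose, use the same one for the human, AI, and inference data so that the vocabularies line up.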

Repository Structure

Below is a high-level overview of the repository/project file-tree:

data/ - Data source consisting of arXiv abstract data across five main fields (Physics, Mathematics, Computer Science, Statistics, and Electrical Engineering and Systems Science). The training_data folder contains corpora known to be entirely AI-generated or entirely human-written, which are used for distribution estimation. The validation_data folder contains corpora that mix AI-generated and human-written data in a known ground-truth proportion; these are used to validate the effectiveness of our framework. Details on the data can be found in our second paper.
distribution/ - Folder to save the distribution parquet generated by the estimate_text_distribution function for demo purposes.
src/ - Package source providing core utilities for distribution estimation, framework loading, data inference, etc.
LICENSE - All code is made available under the MIT License; happy hacking!
demo.ipynb - Demonstration of our framework on arXiv abstracts across five main fields. This includes estimating the distributions of human- and AI-generated content, followed by a validation process on manually mixed data with a known ground-truth proportion of AI-written text.
tokenize_demo.ipynb - Demonstration of how we tokenize arXiv abstracts, including the tokenize function. Here we use the spaCy (https://spacy.io/) library, but other tools such as nltk also work. You may need to modify the function for your own use case.
environment.yml - Full project configuration details (including dependencies), as well as tool configurations.
README.md - You are here!

Citation

If you find our code or framework useful in your work, please cite our first paper and second paper:

@article{liang2024monitoring,
  title={Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews},
  author={Liang, Weixin and Izzo, Zachary and Zhang, Yaohui and Lepp, Haley and Cao, Hancheng and Zhao, Xuandong and Chen, Lingjiao and Ye, Haotian and Liu, Sheng and Huang, Zhi and others},
  journal={arXiv preprint arXiv:2403.07183},
  year={2024}
}
@article{liang2024mapping,
  title={Mapping the Increasing Use of LLMs in Scientific Papers},
  author={Liang, Weixin and Zhang, Yaohui and Wu, Zhengxuan and Lepp, Haley and Ji, Wenlong and Zhao, Xuandong and Cao, Hancheng and Liu, Sheng and He, Siyu and Huang, Zhi and others},
  journal={arXiv preprint arXiv:2404.01268},
  year={2024}
}
