CodeLLM Paper

This repository provides a curated list of research papers on Large Language Models (LLMs) for code. It aims to help researchers and practitioners explore the rapidly growing literature on this topic. The papers are systematically collected from top-tier venues, categorized, and labeled for easier navigation.

A. Venues

We systematically selected papers from the following venues, which are top-tier conferences and journals in the SE, PL, Sec, and NLP communities.

Software Engineering (SE)
- ICSE2023, FSE2023, ASE2023, ISSTA2023, TSE2023, TOSEM2023
- ICSE2024, FSE2024, ASE2024, ISSTA2024, TSE2024, TOSEM2024
Programming Languages (PL)
Security (Sec)
- S&P2023, USENIXSec2023, CCS2023, NDSS2023
- S&P2024, NDSS2024
Natural Language Processing (NLP)
- ACL2023, EMNLP2023, NAACL2023
- ACL2024, EMNLP2024, NAACL2024

The papers accepted by USENIXSec2024 and CCS2024 have not yet been published in the proceedings. Due to the large volume, we do not systematically collect papers published in top-tier ML conferences (ICML, NeurIPS, and ICLR) or on arXiv. However, we continue to manually add important works published in these venues. We plan to expand the collection over time, and contributions are welcome. For details, see the section How to Contribute.

B. Selection Strategy

1. Abstract Extraction: Extract the abstracts from bib files or HTML files. The bib and HTML files of the venues listed above are stored in the directory data/rawdata.
2. Keyword Matching: Keep the abstracts that meet both of the following conditions:
   - Contains at least one keyword from: {"pretrain", "LLM", "large language model", "transformer", "code model"}.
   - Contains the keyword "code" or "program".
3. Relevance Check Using LLMs: Use LLMs to verify whether the papers obtained in Step 2 are related to LLMs for code.
4. Manual Labeling: Manually assign labels to the papers based on domain knowledge.
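
The keyword matching step can be sketched as follows. This is a minimal illustration, not the code in src/process.py: the function name `matches_keywords`, the case-insensitive comparison, and the use of plain substring matching are all assumptions made for this example.

```python
# Keyword sets taken from the two conditions of the keyword matching step.
LLM_KEYWORDS = ("pretrain", "llm", "large language model", "transformer", "code model")
CODE_KEYWORDS = ("code", "program")


def matches_keywords(abstract: str) -> bool:
    """Return True if the abstract satisfies both keyword conditions.

    Assumption: matching is case-insensitive and substring-based, so
    "LLMs" matches "llm" and "programming" matches "program".
    """
    text = abstract.lower()
    has_llm_term = any(keyword in text for keyword in LLM_KEYWORDS)
    has_code_term = any(keyword in text for keyword in CODE_KEYWORDS)
    return has_llm_term and has_code_term


if __name__ == "__main__":
    # Hypothetical abstracts used only to exercise the filter.
    abstracts = [
        "We use LLMs to repair programs.",
        "A transformer for image classification.",
    ]
    selected = [a for a in abstracts if matches_keywords(a)]
    print(selected)
```

Substring matching is deliberately permissive here. It over-selects (for example, "barcode" contains "code"), which is acceptable when a later relevance check and manual labeling filter out false positives.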

All the selected papers, along with their labels, are maintained in the JSON file data/labeldata/labeldata.json. src/process.py is the Python script used for selecting and labeling papers.

C. Taxonomy

The papers in this repository are categorized along three dimensions: Application, Principle, and Research Paradigm. Each paper is assigned multiple labels based on these categories. Note that categories are not necessarily disjoint.

C.1. Application

This category focuses on typical tasks in Software Engineering (SE) and Programming Languages (PL).

General Coding Task (31)
Code Generation (188)
- Program Synthesis (81)
- Code Completion (22)
- Program Repair (38)
- Program Transformation (29)
Program Testing (49)
- Fuzzing (20)
- Library Testing (1)
- DBMS Testing (1)
- Compiler Testing (4)
- Protocol Fuzzing (1)
- Mutation Testing (2)
- Unit Testing (7)
- Differential Testing (1)
- Debugging (9)
- Bug Reproduction (2)
- Vulnerability Exploitation (6)
Static Analysis (121)
- Syntactic Analysis (1)
- Pointer Analysis (3)
- Call Graph Analysis (2)
- Data-flow Analysis (8)
- Type Inference (3)
- Specification Inference (7)
- Equivalence Checking (1)
- Code Similarity Analysis (5)
- Bug Detection (58)
- Program Verification (17)
- Program Optimization (3)
- Program Decompilation (8)
- Code Summarization (8)
- Code Search (5)
- Software Composition Analysis (3)
Software Maintenance and Deployment (18)
- Code Review (6)
- Documentation Generation (2)
- Commit Message Generation (4)
- Software Configuration (1)
- System Log Analysis (4)

C.2. Principle

This category concentrates on the ability of LLMs to understand different forms of code and on their non-functional properties (e.g., security and robustness). We also consider how to use LLMs for general reasoning problems, such as typical agent-centric designs and specific PL designs for LLMs.

Code Model (107)
- Code Model Training (83)
  - Source Code Model (64)
  - IR Code Model (5)
  - Binary Code Model (14)
- Code Model Security (20)
- Code Model Robustness (4)
Hallucination In Reasoning (11)
PL Design For LLMs (3)
Agent Design (18)
- Prompt Strategy (36)
  - Retrieval-augmented Generation (7)
  - Reason With Code (17)
  - Sampling And Ranking (3)
- Planning (8)

C.3. Research Paradigm

This category includes studies on benchmarks, empirical evaluations, and surveys. Papers that fall into none of the following three categories are purely technical papers.

Benchmark (45)
Empirical Study (76)
Survey (15)

D. How to Contribute

D.1. PR Submission

We welcome contributions to expand this repository. If you want to add new papers to the list, please follow these steps:

Prepare a JSON File: Format the file like data/labeldata/patch/example.json. Each paper should include:
- title, authors, abstract, url, venue, and labels (aligned with the taxonomy in data/labeldata/patch).
Upload the File: Place the JSON file in the data/labeldata/patch directory.
Update Markdown Files: Run the following command to update the repository:
```
cd src && python patch.py
```
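
Before opening a PR, it may help to check that each entry carries the fields listed above. The sketch below is not part of the repository: the helper `missing_fields`, the placeholder values, and the field types (for instance, `labels` as a list of strings) are assumptions, and data/labeldata/patch/example.json remains the authoritative format.

```python
import json

# Field names taken from the list of required fields above.
REQUIRED_FIELDS = ("title", "authors", "abstract", "url", "venue", "labels")


def missing_fields(paper: dict) -> list:
    """Return the required field names that are absent from a paper entry."""
    return [field for field in REQUIRED_FIELDS if field not in paper]


# Hypothetical entry with placeholder values; the value types are assumptions.
paper = {
    "title": "An Example Paper Title",
    "authors": "First Author, Second Author",
    "abstract": "Placeholder abstract text.",
    "url": "https://example.org/paper",
    "venue": "ICSE2024",
    "labels": ["Code Generation", "Program Repair"],
}

if __name__ == "__main__":
    problems = missing_fields(paper)
    if problems:
        print("Missing fields:", ", ".join(problems))
    else:
        print(json.dumps(paper, indent=2))
```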

If you want to add new labels or change the current taxonomy, please open an issue first and propose your taxonomy (see below).

D.2. Issue Submission

Another option is to post the papers you wish to add in an issue. Please include a permanently valid link to the paper and specify the venue. If you'd like, you can also categorize the paper based on your understanding of the work by attaching appropriate labels from the existing options in data/category.json or by creating new ones. We will add the paper to our repository very soon.

D.3. Request for Batch Updates

To facilitate timely batch updates to the paper repository, we prefer to use the proceedings of conferences and journals. Here are several examples: ASE2024, OOPSLA2023, S&P2023, and ACL2024. By parsing bib files and HTML files (see data/rawdata) and extracting information such as abstracts, we can semi-automatically classify papers based on the selection strategy described above. If the conference or journal you are following has recently released its complete proceedings, please notify us by submitting an issue. We will prioritize the batch update and add the corresponding conference or journal name to the venue list.

E. Disclaimer and Contact

This paper repository is intended solely for research purposes. All raw data is sourced from publicly available information on ACM, IEEE, and corresponding conference websites. Any content involving additional copyright information, including full PDF versions of the papers, is not disclosed in this repository.

For any questions or suggestions, please contact stephenw.wangcp@gmail.com or wang6590@purdue.edu.
