This repository provides a curated list of research papers on Large Language Models (LLMs) for code. It aims to help researchers and practitioners explore the rapidly growing literature on this topic. The papers are systematically collected from various top-tier venues, then categorized and labeled for easier navigation.
We have systematically selected papers from the following venues, which are top-tier conferences and journals in the SE/PL/Sec/NLP communities:

- Software Engineering (SE)
- Programming Languages (PL)
- Security (Sec)
- Natural Language Processing (NLP)
The papers accepted by USENIXSec2024 and CCS2024 have not yet been published in their proceedings. Due to the large volume, we do not systematically collect papers published in top-tier ML conferences (ICML, NeurIPS, and ICLR) or on arXiv; however, we manually add important works from these venues. We plan to expand the collection over time, and contributions are welcome. For details, see the section How to Contribute.
1. **Abstract Extraction**: Extract the abstracts from bib files or HTML files. The bib and HTML files of the venues listed above are stored in the directory `data/rawdata`.
2. **Keyword Matching**: Filter abstracts that meet both of the following conditions (see the sketch after this list):
   - Contains at least one keyword from `{"pretrain", "LLM", "large language model", "transformer", "code model"}`.
   - Contains the keyword `"code"` or `"program"`.
3. **Relevance Check Using LLMs**: Use LLMs to verify whether the papers obtained in Step 2 are related to LLMs for code.
4. **Manual Labeling**: Manually assign labels to the papers based on domain knowledge.

All the selected papers, along with their labels, are maintained in the JSON file `data/labeldata/labeldata.json`. `src/process.py` is the Python script used for selecting and labeling papers.
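
For illustration, here is a minimal sketch of the keyword-matching step (Step 2). It assumes case-insensitive substring matching; the authoritative logic lives in `src/process.py` and may differ in details.

```python
# Minimal sketch of Step 2 (keyword matching); assumes case-insensitive
# substring matching. The authoritative implementation is src/process.py.
LLM_KEYWORDS = ["pretrain", "llm", "large language model", "transformer", "code model"]
CODE_KEYWORDS = ["code", "program"]

def matches_selection_criteria(abstract: str) -> bool:
    text = abstract.lower()
    has_llm_keyword = any(kw in text for kw in LLM_KEYWORDS)
    has_code_keyword = any(kw in text for kw in CODE_KEYWORDS)
    return has_llm_keyword and has_code_keyword
```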
The papers in this repository are categorized along three dimensions: Application, Principle, and Research Paradigm. Each paper may be assigned multiple labels based on these categories; note that the categories are not necessarily disjoint (see the lookup sketch after the category lists below).
This category focuses on typical tasks in Software Engineering (SE) and Programming Languages (PL).
- General Coding Task (31)
- Code Generation (188)
  - Program Synthesis (81)
  - Code Completion (22)
  - Program Repair (38)
  - Program Transformation (29)
- Program Testing (49)
  - Fuzzing (20)
    - Library Testing (1)
    - DBMS Testing (1)
    - Compiler Testing (4)
    - Protocol Fuzzing (1)
  - Mutation Testing (2)
  - Unit Testing (7)
  - Differential Testing (1)
  - Debugging (9)
  - Bug Reproduction (2)
  - Vulnerability Exploitation (6)
- Static Analysis (121)
  - Syntactic Analysis (1)
  - Pointer Analysis (3)
  - Call Graph Analysis (2)
  - Data-flow Analysis (8)
  - Type Inference (3)
  - Specification Inference (7)
  - Equivalence Checking (1)
  - Code Similarity Analysis (5)
  - Bug Detection (58)
  - Program Verification (17)
  - Program Optimization (3)
  - Program Decompilation (8)
  - Code Summarization (8)
  - Code Search (5)
- Software Composition Analysis (3)
- Software Maintenance and Deployment (18)
This category concentrates on the LLMs' ability to understand different forms of code and on the non-functional properties of LLMs (e.g., security and robustness). We also consider how to utilize LLMs for general reasoning problems, such as typical agent-centric designs and PL designs specific to LLMs.
- Code Model (107)
  - Code Model Training (83)
    - Source Code Model (64)
    - IR Code Model (5)
    - Binary Code Model (14)
  - Code Model Security (20)
  - Code Model Robustness (4)
- Hallucination In Reasoning (11)
- PL Design For LLMs (3)
- Agent Design (18)
- Prompt Strategy (36)
- Planning (8)
This category includes studies on benchmarks, empirical evaluations, and surveys. Papers that do not fall into any of these three categories are purely technical papers.
- Benchmark (45)
- Empirical Study (76)
- Survey (15)
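
Since a paper can carry several labels across the three dimensions, filtering `data/labeldata/labeldata.json` by label is straightforward. Below is a minimal sketch; the field names (`title`, `labels`) follow the contribution schema described later, and the assumption that the file is a list of such objects is ours.

```python
# Minimal sketch: list papers tagged with a given label. Assumes
# labeldata.json is a list of objects with "title" and "labels" fields
# (field names follow the contribution schema; the exact layout may differ).
import json

with open("data/labeldata/labeldata.json") as f:
    papers = json.load(f)

for paper in papers:
    if "Program Repair" in paper.get("labels", []):
        print(paper["title"])
```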
We welcome contributions to expand this repository. If you want to add new papers to the list, please follow these steps:
1. **Prepare a JSON File**: Format the file like `data/labeldata/patch/example.json`. Each paper should include: `title`, `authors`, `abstract`, `url`, `venue`, and `labels` (aligned with the taxonomy in `data/labeldata/patch`). A hypothetical entry is sketched after these steps.
2. **Upload the File**: Place the JSON file in the `data/labeldata/patch` directory.
3. **Update Markdown Files**: Run the following command to update the repository:

```bash
cd src && python patch.py
```
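
For reference, a hypothetical patch entry might look like the following sketch. All field values here are made up, and `data/labeldata/patch/example.json` remains the authoritative template, including whether the file holds a single entry or a list.

```python
# Hypothetical patch entry; values are illustrative only, and
# data/labeldata/patch/example.json defines the authoritative format.
import json

entry = {
    "title": "An Example Paper on LLM-Based Program Repair",
    "authors": ["Ada Lovelace", "Alan Turing"],
    "abstract": "We study how large language models repair programs ...",
    "url": "https://example.org/example-paper",
    "venue": "ICSE2025",
    "labels": ["Program Repair", "Code Generation"],
}

with open("data/labeldata/patch/my_papers.json", "w") as f:
    json.dump([entry], f, indent=2)  # assumes the file holds a list of entries
```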
If you want to add new labels or change the current taxonomy, please post an issue first and suggest your taxonomy (see below).
Another option is to post the papers you wish to add in an issue. Please include a permanently valid link to the paper and specify the venue. If you'd like, you can also categorize the paper based on your understanding of the work by attaching appropriate labels from the existing options in `data/category.json` or by creating new ones. We will add the paper to our repository very soon.
To facilitate timely batch updates to the paper repository, we prefer to work from the proceedings of various conferences and journals; examples include ASE2024, OOPSLA2023, S&P2023, and ACL2024. By parsing bib files and HTML files (see `data/rawdata`) and extracting information such as abstracts, we can semi-automatically classify papers using the selection strategy described above. If a conference or journal you follow has recently released its complete proceedings, please notify us by submitting an issue. We will prioritize the batch update and add the corresponding conference or journal name to the venue list.
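
As an illustration of the bib-parsing side, the following sketch pulls titles and abstracts out of a bib file with a simple regular expression. This is an assumption about how such extraction could be done (including the hypothetical file name); the repository's actual extraction logic lives in `src/process.py`.

```python
# Minimal sketch: extract title/abstract pairs from a .bib file using a
# regex over "field = {value}" pairs. Assumes single-level braces; the
# repository's actual extraction logic (src/process.py) may differ.
import re

FIELD_RE = re.compile(r"(title|abstract)\s*=\s*\{([^{}]*)\}", re.IGNORECASE)

def extract_fields(bib_text: str) -> list[dict]:
    records = []
    for entry in bib_text.split("@")[1:]:  # naive split on entry starts
        fields = {name.lower(): value.strip()
                  for name, value in FIELD_RE.findall(entry)}
        if "abstract" in fields:
            records.append(fields)
    return records

with open("data/rawdata/example.bib") as f:  # hypothetical file name
    for record in extract_fields(f.read()):
        print(record.get("title", "(no title)"))
```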
This paper repository is intended solely for research purposes. All raw data is sourced from publicly available information on ACM, IEEE, and corresponding conference websites. Any content involving additional copyright information, including full PDF versions of the papers, is not disclosed in this repository.
For any questions or suggestions, please contact [email protected] or [email protected].