-
Natural Is the Best: Model-Agnostic Code Simplification for Pre-trained Large Language Models, (FSE2024)
- Abstract: Pre-trained Large Language Models (LLM) have achieved remarkable successes in several domains. However, code-oriented LLMs are often heavy in computational complexity, and quadratically with the length of the input code sequence. Toward simplifying the input program of an LLM, the state-of-the-art approach has the strategies to filter the input code tokens based on the attention scores given by the LLM. The decision to simplify the input program should not rely on the attention patterns of an LL...
- Labels: static analysis, code search, code summarization, code model, code model training, source code model
-
On the Effectiveness of Transfer Learning for Code Search, (TSE2023)
- Abstract: The Transformer architecture and transfer learning have marked a quantum leap in natural language processing, improving the state of the art across a range of text-based tasks. This paper examines how these advancements can be applied to and improve code search. To this end, we pre-train a BERT-based model on combinations of natural language and source code data and fine-tune it on pairs of StackOverflow question titles and code answers. Our results show that the pre-trained models consistently ...
- Labels: static analysis, code search, code model, code model training, source code model
-
Self-Supervised Query Reformulation for Code Search, (FSE2023)
- Abstract: Automatic query reformulation is a widely utilized technology for enriching user requirements and enhancing the outcomes of code search. It can be conceptualized as a machine translation task, wherein the objective is to rephrase a given query into a more comprehensive alternative. While showing promising results, training such a model typically requires a large parallel corpus of query pairs (i.e., the original query and a reformulated query) that are confidential and unpublished by online code...
- Labels: static analysis, code search
-
Survey of Code Search Based on Deep Learning, (TOSEM2024)
- Abstract: Code writing is repetitive and predictable, inspiring us to develop various code intelligence techniques. This survey focuses on code search, that is, to retrieve code that matches a given natural language query by effectively capturing the semantic similarity between the query and code. Deep learning, being able to extract complex semantics information, has achieved great success in this field. Recently, various deep learning methods, such as graph neural networks and pretraining models, have b...
- Labels: survey, static analysis, code search
-
Virtual Compiler Is All You Need For Assembly Code Search, (ACL2024)
- Abstract: Assembly code search is vital for reducing the burden on reverse engineers, allowing them to quickly identify specific functions using natural language within vast binary programs.Despite its significance, this critical task is impeded by the complexities involved in building high-quality datasets. This paper explores training a Large Language Model (LLM) to emulate a general compiler. By leveraging Ubuntu packages to compile a dataset of 20 billion tokens, we further continue pre-train CodeLlam...
- Labels: code generation, program transformation, static analysis, code search, code model, code model training, source code model