-
BinCola: Diversity-Sensitive Contrastive Learning for Binary Code Similarity Detection, (TSE2024)
- Abstract: Binary Code Similarity Detection (BCSD) is a fundamental binary analysis technique in the area of software security. Recently, advanced deep learning algorithms have been integrated into BCSD platforms to achieve superior performance on well-known benchmarks. However, real-world large programs embed more complex diversities due to different compilers, various optimization levels, multiple architectures and even obfuscations. Existing BCSD solutions suffer from low accuracy issues in such complicated r...
- Labels: static analysis, code similarity analysis, code model, code model training, binary code model
-
BinaryAI: Binary Software Composition Analysis via Intelligent Binary Source Code Matching, (ICSE2024)
- Abstract: While third-party libraries (TPLs) are extensively reused to enhance productivity during software development, they can also introduce potential security risks such as vulnerability propagation. Software composition analysis (SCA), proposed to identify reused TPLs for reducing such risks, has become an essential procedure within modern DevSecOps. As one of the mainstream SCA techniques, binary-to-source SCA identifies the third-party source projects contained in binary files via binary source co...
- Labels: static analysis, software composition analysis, code model, code model training, binary code model
-
Codeart: Better code models by attention regularization when symbols are lacking, (FSE2024)
- Abstract: Transformer-based code models have impressive performance in many software engineering tasks. However, their effectiveness degrades when symbols are missing or not informative. The reason is that the model may not learn to pay attention to the right correlations/contexts without the help of symbols. We propose a new method to pre-train general code models when symbols are lacking. We observe that in such cases, programs degenerate to something written in a very primitive language. We hence propo...
- Labels: code model, code model training, binary code model
-
Improving Binary Code Similarity Transformer Models by Semantics-Driven Instruction Deemphasis, (ISSTA2023)
- Abstract: Given a function in the binary executable form, binary code similarity analysis determines a set of similar functions from a large pool of candidate functions. These similar functions are usually compiled from the same source code with different compilation setups. Such analysis has a large number of applications, such as malware detection, code clone detection, and automatic software patching. The state-of-the-art methods utilize complex Deep Learning models such as Transformer models. We obser...
- Labels: code model, code model training, binary code model, static analysis, code similarity analysis
-
jTrans: Jump-aware transformer for binary code similarity detection, (ISSTA2022)
- Abstract: Binary code similarity detection (BCSD) has important applications in various fields such as vulnerability detection, software component analysis, and reverse engineering. Recent studies have shown that deep neural networks (DNNs) can comprehend instructions or control-flow graphs (CFG) of binary code and support BCSD. In this study, we propose a novel Transformer-based approach, namely jTrans, to learn representations of binary code. It is the first solution that embeds control flow informati...
- Labels: static analysis, bug detection, code model, code model training, binary code model
-
LLM4Decompile: Decompiling Binary Code with Large Language Models, (EMNLP2024)
- Abstract: Decompilation aims to convert binary code to high-level source code, but traditional tools like Ghidra often produce results that are difficult to read and execute. Motivated by the advancements in Large Language Models (LLMs), we propose LLM4Decompile, the first and largest open-source LLM series (1.3B to 33B) trained to decompile binary code. We optimize the LLM training process and introduce the LLM4Decompile-End models to decompile binary directly. The resulting models significantly outperfo...
- Labels: static analysis, program decompilation, code model, code model training, binary code model
-
Lmpa: Improving decompilation by synergy of large language model and program analysis, (arXiv2023)
- Abstract: Decompilation aims to recover the source code form of a binary executable. It has many applications in security and software engineering such as malware analysis, vulnerability detection and code reuse. A prominent challenge in decompilation is to recover variable names. We propose a novel method that leverages the synergy of large language model (LLM) and program analysis. Language models encode rich multi-modal knowledge, but their limited input size prevents providing sufficient global context ...
- Labels: static analysis, program decompilation, code model, code model training, binary code model
-
Nova: Generative Language Models for Assembly Code with Hierarchical Attention and Contrastive Learning, (arXiv2023)
- Abstract: Binary code analysis is the foundation of crucial tasks in the security domain; thus building effective binary analysis techniques is more important than ever. Although large language models (LLMs) have brought impressive improvements to source code tasks, they do not directly generalize to assembly code due to the unique challenges of assembly: (1) the low information density of assembly and (2) the diverse optimizations in assembly code. To overcome these challenges, this work proposes a hierarchica...
- Labels: static analysis, program decompilation, code similarity analysis, code model, code model training, binary code model
-
PELICAN: exploiting backdoors of naturally trained deep learning models in binary code analysis, (USENIXSec2023)
- Abstract: Deep Learning (DL) models are increasingly used in many cyber-security applications and achieve superior performance compared to traditional solutions. In this paper, we study backdoor vulnerabilities in naturally trained models used in binary analysis. These backdoors are not injected by attackers but rather products of defects in datasets and/or training processes. The attacker can exploit these vulnerabilities by injecting some small fixed input pattern (e.g., an instruction) called backdoor ...
- Labels: code model, code model security, code model training, binary code model
-
RCFG2Vec: Considering Long-Distance Dependency for Binary Code Similarity Detection, (ASE2024)
- Abstract: Binary code similarity detection (BCSD), as a fundamental technique in software security, has various applications, including malware family detection, known vulnerability detection and code plagiarism detection. Recent deep learning-based BCSD approaches have demonstrated promising performance. However, they face two significant challenges that limit detection performance. First, most approaches that use sequence networks (like RNN and Transformer) utilize coarse-grained tokenization methods, wh...
- Labels: static analysis, code similarity analysis, code model, code model training, binary code model
-
ReSym: Harnessing LLMs to Recover Variable and Data Structure Symbols from Stripped Binaries, (CCS2024)
- Abstract: Decompilation aims to recover a binary executable to the source code form and hence has a wide range of applications in cyber security, such as malware analysis and legacy code hardening. A prominent challenge is to recover variable symbols, including both primitive and complex types such as user-defined data structures, along with their symbol information such as names and types. Existing efforts focus on solving parts of the problem, e.g., recovering only types (without names) or only local vari...
- Labels: code model, code model training, binary code model, static analysis, program decompilation
-
Self-Constructed Context Decompilation with Fined-grained Alignment Enhancement, (EMNLP2024)
- Abstract: Decompilation transforms compiled code back into a high-level programming language for analysis when source code is unavailable. Previous work has primarily focused on enhancing decompilation performance by increasing the scale of model parameters or training data for pre-training. Based on the characteristics of the decompilation task, we propose two methods: (1) Without fine-tuning, the Self-Constructed Context Decompilation (sc²dec) method recompiles the LLM’s decompilation results to constru...
- Labels: static analysis, program decompilation, code model, code model training, binary code model, benchmark
-
Source Code Foundation Models are Transferable Binary Analysis Knowledge Bases, (NeurIPS2024)
- Abstract: Human-Oriented Binary Reverse Engineering (HOBRE) lies at the intersection of binary and source code, aiming to lift binary code to human-readable content relevant to source code, thereby bridging the binary-source semantic gap. Recent advancements in uni-modal code model pre-training, particularly in generative Source Code Foundation Models (SCFMs) and binary understanding models, have laid the groundwork for transfer learning applicable to HOBRE. However, existing approaches for HOBRE rely hea...
- Labels: code model, code model training, binary code model
-
WaDec: Decompiling WebAssembly Using Large Language Model, (ASE2024)
- Abstract: WebAssembly (abbreviated Wasm) has emerged as a cornerstone of web development, offering a compact binary format that allows high-performance applications to run at near-native speeds in web browsers. Despite its advantages, Wasm's binary nature presents significant challenges for developers and researchers, particularly regarding readability when debugging or analyzing web applications. Therefore, effective decompilation becomes crucial. Unfortunately, traditional decompilers often struggle wit...
- Labels: static analysis, program decompilation, code model, code model training, binary code model