This repository implements InnerEye, an LSTM-based tool for generating cross-platform binary code embeddings, introduced in the following paper.
@inproceedings{zuo2019neural,
title={Neural Machine Translation Inspired Binary Code Similarity Comparison beyond Function Pairs},
author={Zuo, Fei and Li, Xiaopeng and Young, Patrick and Luo, Lannan and Zeng, Qiang and Zhang, Zhexin},
booktitle={Proceedings of the 2019 Network and Distributed Systems Security Symposium (NDSS)},
year={2019} }
The main purpose of this implementation is to provide a baseline for cross-platform binary code embedding research; its experimental results appear in "Improving Cross-Platform Binary Analysis using Representation Learning via Graph Alignment".
Our implementation is largely based on the official implementation by the paper's authors, but we apply the model to our own cross-platform datasets, which cover a broad range of software disciplines: SQLite3 (database), OpenSSL (network), cURL (file transfer), Httpd (web server), libcrypto (crypto library), and glibc (standard library). The data is preprocessed following the scheme described in the original paper and stored in the data directory, which is structured in the same way as in XBA. The other components are structured as follows.
.
├── README
├── Pipfile # Manages a Python virtualenv.
├── Pipfile.lock # Manages a Python virtualenv (Do not touch).
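├── extract.py # Extracts basic block embeddings with a trained model.
├── train.py # Trains the i2v embeddings and the Siamese-LSTM.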
├── utils.py #
├── validation.py # Tests the trained model.
├── data # Preprocessed datasets.
├── embeddings #
├── weights #
Python 3.8 or above is required. To install the Python dependencies, first install pipenv.
$ pip3 install pipenv
Install dependencies
$ pipenv install
Activate pipenv shell
$ pipenv shell
Export requirements.txt (for installing without pipenv)
$ pipenv lock -r > requirements.txt
Install dependencies with pip
$ pip install -r requirements.txt
Several common command sequences are defined in the Makefile.
Train the Instruction2vec (i2v) embeddings and the Siamese-LSTM on the data in /revos/data/done/${programs}/innereye.csv
$ pipenv run -- python train.py --targets={programs}
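To make the relationship between --targets and the data path above concrete, here is a minimal sketch of how a target list might be turned into CSV paths. It assumes that {programs} is a comma-separated list of program names; the helper names parse_targets and csv_paths are hypothetical, and the actual train.py may handle its arguments differently.

```python
import argparse
from pathlib import PurePosixPath

# Data root taken from the path shown in this README.
DATA_ROOT = PurePosixPath("/revos/data/done")


def parse_targets(argv):
    """Parse --targets into a list of program names.

    Assumption: the value is a comma-separated list such as "curl,openssl".
    """
    parser = argparse.ArgumentParser(description="Hypothetical --targets parser")
    parser.add_argument("--targets", required=True,
                        help="comma-separated program names")
    args = parser.parse_args(argv)
    return [name.strip() for name in args.targets.split(",") if name.strip()]


def csv_paths(programs):
    """Map each program name to its innereye.csv under DATA_ROOT."""
    return [str(DATA_ROOT / name / "innereye.csv") for name in programs]


if __name__ == "__main__":
    for path in csv_paths(parse_targets(["--targets=curl,openssl"])):
        print(path)
```

Building the paths in one place like this makes it easy to check that every requested program has an innereye.csv before training starts.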
Test the trained model
$ pipenv run -- python validation.py
Extract basic block embeddings using a model trained on {programs}
$ pipenv run -- python extract.py --targets={programs}
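Once embeddings are extracted, two basic blocks can be compared by the distance between their vectors. The sketch below assumes the Manhattan-distance similarity exp(-||h1 - h2||_1) commonly used with Siamese LSTMs; whether this repository scores pairs in exactly this way is an assumption, and the function name manhattan_similarity is hypothetical.

```python
import math


def manhattan_similarity(h1, h2):
    """Return exp(-L1 distance) between two embedding vectors.

    The score lies in (0, 1]; identical vectors score exactly 1.0.
    Assumption: embeddings are plain sequences of floats of equal length.
    """
    if len(h1) != len(h2):
        raise ValueError("embeddings must have the same dimension")
    distance = sum(abs(x - y) for x, y in zip(h1, h2))
    return math.exp(-distance)


if __name__ == "__main__":
    # Toy vectors for illustration only, not real extracted embeddings.
    print(manhattan_similarity([0.1, 0.2, 0.3], [0.1, 0.25, 0.3]))
```

A score close to 1 suggests the two blocks are semantically similar, while a score near 0 suggests they are unrelated; any decision threshold would need to be chosen on validation data.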