Yara rule generator using VirusTotal code similarity feature code-similar-to:
This Yara generator uses VirusTotal's 'code-similar-to:' beta search modifier to gather code blocks from PE files and automatically build a Yara signature from them. It was presented at GReAT Ideas in July 2020, showing how the generated Yara rule can be used to hunt for similar APT samples and how Kaspersky KTAE can greatly refine the results.
- VirusTotal Enterprise API key
- Python 2/3, requests, json
Provide hash, get Yara rule to hunt for similar samples.
This tool accepts a PE file hash and queries VirusTotal for files sharing code blocks with it, then post-processes the results using minimum code block length and similarity score thresholds that you can set.
It then iterates over the returned files, collecting for each file its code blocks, their offsets, and the file size; the file sizes are used to determine the file size range for the Yara rule. It ranks code blocks by how many of the returned files they appear in (the most popular code blocks).
The user is prompted to choose how many of the most popular code blocks to include in the Yara rule. The selected code blocks are then compared against the code blocks of the original file (the hash passed to the generator) to determine the Yara rule's minimal matching condition.
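The ranking and condition steps described above can be sketched roughly as follows. This is an illustrative sketch, not the tool's actual code: the `files` structure, the function names, and the reading of "minimal matching condition" as "how many of the selected blocks also occur in the original sample" are all assumptions made for the example.

```python
from collections import Counter


def rank_blocks(files):
    """Return (block, file_count) pairs, most widely shared blocks first."""
    counts = Counter()
    for entry in files:
        # A block counts once per file, however often it repeats inside it.
        counts.update(set(entry["blocks"]))
    # Sort by descending count, then by block, so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def size_range(files):
    """Smallest and largest file size among the returned files."""
    sizes = [entry["size"] for entry in files]
    return min(sizes), max(sizes)


def min_match(selected, original_blocks):
    """Count the selected blocks that also occur in the original sample."""
    original = set(original_blocks)
    return sum(1 for block in selected if block in original)


# Hypothetical, hand-made data standing in for VirusTotal results.
files = [
    {"size": 1000, "blocks": ["aa", "bb", "cc"]},
    {"size": 3000, "blocks": ["aa", "bb"]},
    {"size": 2000, "blocks": ["aa", "dd"]},
]
ranked = rank_blocks(files)
top = [block for block, _ in ranked[:2]]  # the user picks the top 2
```

With this data, `top` holds the two blocks shared by the most files, `size_range(files)` gives the bounds a `filesize` condition could use, and `min_match(top, original_blocks)` gives a candidate for the "N of them" part of the rule.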
    usage: VTCodeSimilarity_YaraGen.py [-h] [--hash MD5/SHA1/SHA256]
                                       [--threshold 50%] [--hashlist hashes.txt]
                                       [--min_block 4] [--apikey APIKEY] [--debug]

    VirusTotal Code Similarity Yara Generator

    optional arguments:
      -h, --help            show this help message and exit
      --hash MD5/SHA1/SHA256
                            MD5/SHA1/SHA256 hash to check on VTi
      --threshold 50%       Minimum similarity threshold
      --hashlist hashes.txt
                            Path to a file containing list of hashes
      --min_block 4         Minimum desired codeblock size
      --apikey APIKEY       VT API Key
      --debug               Store VirusTotal JSON data
I suggest using --debug if you're experimenting with a rule. It stores the results from VirusTotal in a JSON file instead of fetching them each time.
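The idea behind storing results locally can be sketched as a small load-or-fetch helper. This is only an illustration of the caching pattern: the helper name `load_or_fetch`, the cache file name, and the stand-in `fake_fetch` function are hypothetical, and the tool's own file naming and logic may differ.

```python
import json
import os
import tempfile


def load_or_fetch(cache_path, fetch):
    """Return cached JSON if present; otherwise call fetch() and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path) as fh:
            return json.load(fh)
    data = fetch()
    with open(cache_path, "w") as fh:
        json.dump(data, fh)
    return data


calls = []


def fake_fetch():
    """Stand-in for a real VirusTotal query; records each invocation."""
    calls.append(1)
    return {"data": []}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vt_cache.json")
    first = load_or_fetch(path, fake_fetch)   # performs the "query"
    second = load_or_fetch(path, fake_fetch)  # served from the JSON file
```

The second call never reaches `fake_fetch`, which is the point when tuning thresholds repeatedly against the same hash.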
python VTCodeSimilarity_YaraGen.py --hash 7ec8a9641d7342d1a471ebcd98e28b62 --threshold 80%
Some stats from running KTAE on the Retrohunt results of the rule above:
This VirusTotal feature is still in beta phase.
- The sample set is limited
- Packed samples are an issue
- Code blocks returned could be a subset of each other
- No code block whitelist. Code blocks might be of a 3rd party library and therefore 'benign'.
- General bugs in code similarity calculation.
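To illustrate the subset limitation, here is a minimal sketch of how overlapping code blocks could be pruned before building a rule. The tool does not necessarily do this; `drop_contained` is a hypothetical helper, and it assumes code blocks are available as hex strings.

```python
def drop_contained(hex_blocks):
    """Drop every block whose bytes occur inside a longer kept block."""
    # Compare raw bytes so matches cannot start in the middle of a byte.
    raw = sorted({bytes.fromhex(h) for h in hex_blocks},
                 key=lambda b: (-len(b), b))  # longest first, stable ties
    kept = []
    for block in raw:
        if not any(block in longer for longer in kept):
            kept.append(block)
    return [b.hex() for b in kept]


# Hypothetical blocks: the second one is a byte-level slice of the first.
pruned = drop_contained(["558bec83ec10", "8bec83", "909090"])
```

Here `8bec83` is dropped because it is fully contained in `558bec83ec10`, while `909090` is kept.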
The code blocks are stored in a nested dictionary:
{ binary : { counter : 1, offset : 536990455,43655401} }
Where:
- binary is the actual code block
- counter contains the number of times this code block was seen across results
- asm contained the assembly representation of that code block (removed, since it might be removed from the API in the future)
- offset contains all the offsets that this code block was seen at
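A dictionary of this shape could be filled in as sketched below. The function name `record_file`, the per-file input format, and the choice to store offsets as a list are assumptions made for the example; only the `counter` and `offset` keys come from the description above.

```python
def record_file(blocks, file_blocks):
    """Merge one file's code blocks (binary -> offsets) into `blocks`."""
    for binary, offsets in file_blocks.items():
        entry = blocks.setdefault(binary, {"counter": 0, "offset": []})
        entry["counter"] += 1            # one more file contains this block
        entry["offset"].extend(offsets)  # remember every offset seen
    return blocks


blocks = {}
# Hypothetical code blocks and offsets from two returned files.
record_file(blocks, {"558bec": [536990455, 43655401]})
record_file(blocks, {"558bec": [4096], "909090": [8192]})
```

After the first call, the entry for `558bec` has a counter of 1 with two offsets, matching the shape shown above; the second call raises its counter to 2 and adds a new entry for `909090`.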
Please feel free to contribute and submit pull requests. Some ideas so far:
- Print the assembly code, most likely interactively (removed, since it might be removed from the API in the future).
- Make use of the offsets of the code blocks in the Yara condition.
- Give more information when pruning less popular code blocks.
Thanks to Juan Infantes Diaz from VirusTotal for his help and this new feature.
This project is licensed under the GNU GPLv3 - see the LICENSE.md file for details.