Automatic YARA rule generation for malware repositories. Currently used to build YARA signatures for Malpedia (https://malpedia.caad.fkie.fraunhofer.de) and limited to x86/x86-64 executables and memory dumps for Linux, macOS and Windows.
This software is useful for larger organizations like companies or CERTs as well as for indivuduals. It only requires a modern, personal computer (8 cores/threads and 16 GiB recommended) and a curated malware repository. Curated means in this context that all samples are already sorted and clustered to families. Each family can contain various samples. In general the tool works better for unpacked malware because we try to detect special code regions or functions that identify a given family.
You have a curated malware repository, formatted like this:
./malware-repo/<malware_family_name>/<sample_name_or_hash>
You build disassembly reports with smda (https://github.com/danielplohmann/smda) and store them as input files for YARA-Signator. We will need both, the malware repo in the respective folder structure and the reports. The reports will be consumed to evaluate the binary code of the samples to generate possible sequence candidates that determine the family. Those are checked against the rpository to increase the quality significantly.
When the preconditions are met you may start the process and run YARA-Signator as stated in chapter -> Workflow.
If you run postgres and yara-signator on one machine, I would recommend you to have at least 16GB memory and your system is nearly headless. Postgres will be very slow with less than 8GB memory and yara-signator uses many threads and will consume up to 6GB memory very quick. If you set the threads to 1 you will have a very low memory usage but this will result in low performance.
The full chain will create database files up to 100GB very quickly, so you should have at least 100GB space left. The speed should be dramatically increased if your space is on faster storage.
A full run over all malpedia samples on a SATA-HDD without RAID and 20GB memory running yara-signator, capstone_server and postgres on the same device takes around 10h. The insertion takes around 3-4h, the database operations 4-6h and the YARA rule derivation around 2-3h.
See our new wiki and the installation guide: https://github.com/fxb-cocacoding/yara-signator/wiki/YARA-Signator-%23-User-Manual-for-Version-0.6.0
IMPORTANT: Make sure you have not a database in postgres using the same name as mentioned in the config file. The first thing yara-signator will do is a database drop if you have not activated the skipSMDAInsertions
using a true
flag.
- Get SMDA reports from your malware pool. (https://github.com/danielplohmann/smda)
- Create a valid config file (set folders to smda reports and to your pool, etc).
- Start postgresql daemon
- Start capstone_server on port 12345, using the daemonize script
- Launch yara-signator (
java -jar target/yara-signator-0.5.0-SNAPSHOT-jar-with-dependencies.jar >> logfile.txt
) - Monitor your log file:
tail -F logfile.txt
Q: This capstone_server is a C program, why don't you use the capstone bindings for JAVA like everyone else?
A: I had some strange memory leaks (>10GB lost memory) when running many millions of opcodes against capstone over some hours. I couldn't fix them in JAVA and so I created a small program in C which communicates over TCP with other processes. If this process would have any memory leaks, you could simply restart it and the memory would be released back to the OS.
See the slides of our presentation for a short introduction of YARA-Signator and how it works: https://www.botconf.eu/wp-content/uploads/2019/12/B2019-Bilstein-Plohmann-YaraSignator.pdf
See our paper for more information: https://journal.cecyf.fr/ojs/index.php/cybin/article/view/24
The approach is described in detail in my BS thesis:
http://cocacoding.com/papers/Automatic_Generation_of_code_based_YARA_Signatures.pdf
SHA512: 0384d6ec497cbfca2ec4a7739337088c8c859e86ca63339fd3d26c8be2176e3378210ac6cbd725d08a2f672e9cb2dcecc09eef2ec2dbc975de726b0b918795b2
SHA256: 4f0530d0da48b394cb0798c434ffc70c33ae351e54c77454c87546d17ec52b60
This project was mainly developed during the writing of the previously mentioned BS thesis. According to this I would like to thank Daniel Plohmann for his great support and helpful ideas during his supervision and beyond, especially regarding to this software project.