The Elastic GDPR Scanner checks Elasticsearch instances for GDPR compliance, i.e. for the presence of PII (Personally Identifiable Information).
The Elastic GDPR Scanner consists of 2 tools:
- the port scanner, that identifies Elasticsearch instances by port scanning the network,
- the GDPR checker itself, that tests Elasticsearch targets against lists of regexes.
Disclaimer: neither Vincent Maury nor Elastic can be held responsible for the use of this script! Use it at your own risk!
These instructions will get you a copy of the project up and running on your local machine.
This Python script has no prerequisite other than Python 3.
It should work on any platform.
For NER usage, the script relies on a PII detection model.
Just clone this repository:
git clone https://github.com/blookot/elastic-gdpr-scanner
Just run it:
python elastic-port-scanner.py -h
The script has several options:
- -h will display help.
- -t TARGETS to enter a specific target (hostname, single IP, or IP range in CIDR format, e.g. 10.50.3.0/24). Defaults to localhost.
- -p PORTS to specify the port range (e.g. 9200-9210,9300-9310) where Elasticsearch could be running. Defaults to 9200.
- -u USER to set a username to use when trying to authenticate to Elasticsearch. Defaults to elastic.
- -pwd PASSWORD to set a password to use when trying to authenticate to Elasticsearch. Defaults to changeme.
- -o OUTPUT to specify the name of the output file in json format. Defaults to targets.json.
- --nb-threads NB_THREADS to specify how many hosts you want to scan in parallel. Defaults to 10.
- --socket-timeout TIMEOUT to set the timeout for socket connect (open port testing), in seconds. Set it to 2 on the Internet, 0.5 in local networks. Defaults to 2.
- --log-file FILE to specify the name of the file to output logs. Defaults to es-scanner.csv.
- --verbose turns on verbose output in console. Defaults to False.
Easy run on a local Elasticsearch:
python elastic-port-scanner.py -pwd myelasticpassword
Or on an Elastic Cloud instance:
python elastic-port-scanner.py -t myelasticsearchid.europe-west9.gcp.elastic-cloud.com -p 443 -pwd myelasticpassword
The targets.json file output from this port scan can be used by the GDPR scanner below.
You can grab the es-scanner.csv file generated by this script, as a report. Column titles should be self-explanatory.
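To illustrate how the -t, -p, --nb-threads and --socket-timeout options could fit together, here is a minimal sketch of a threaded TCP connect scan. It is not the code of elastic-port-scanner.py: the function names (parse_ports, expand_targets, is_port_open, scan) are hypothetical, and the sketch only tests whether a port accepts a connection, without checking that Elasticsearch answers on it.

```python
import ipaddress
import socket
from concurrent.futures import ThreadPoolExecutor


def parse_ports(spec):
    """Expand a spec such as '9200-9210,9300-9310' into a sorted list of ports."""
    ports = set()
    for chunk in spec.split(","):
        chunk = chunk.strip()
        if not chunk:
            continue
        if "-" in chunk:
            low, high = chunk.split("-", 1)
            ports.update(range(int(low), int(high) + 1))
        else:
            ports.add(int(chunk))
    return sorted(ports)


def expand_targets(target):
    """Return the hosts behind a hostname, a single IP, or a CIDR range."""
    try:
        network = ipaddress.ip_network(target, strict=False)
    except ValueError:
        return [target]  # not an IP or CIDR: assume it is a hostname
    if network.num_addresses == 1:
        return [str(network.network_address)]
    return [str(host) for host in network.hosts()]


def is_port_open(host, port, timeout=2.0):
    """Try a TCP connect; any socket error (refused, timeout, DNS) means closed."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def scan(target, port_spec="9200", nb_threads=10, timeout=2.0):
    """Scan hosts in parallel (one task per host) and return open (host, port) pairs."""
    ports = parse_ports(port_spec)

    def scan_host(host):
        return [(host, port) for port in ports if is_port_open(host, port, timeout)]

    with ThreadPoolExecutor(max_workers=nb_threads) as pool:
        per_host = list(pool.map(scan_host, expand_targets(target)))
    return [hit for hits in per_host for hit in hits]
```

For example, scan("10.50.3.0/24", "9200-9210", nb_threads=10, timeout=0.5) would try 11 ports on each of the 254 usable addresses of that range, ten hosts at a time.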
You could also run the famous nmap port scanner to list hosts and ports before running the Elastic port scanner itself.
First, install nmap, then run:
nmap -h
nmap -p PORTS -sV --host-timeout TIMEOUT -oG OUTPUT TARGETS
For instance:
nmap -p 9200-9205 -sV --host-timeout 20s -oG out.txt 192.168.1.1-254
Then a bit of parsing will extract the Elasticsearch instances:
awk '
/^Host: [0-9]/ {ip=$2}
/Ports:/ {
    split($0, ports, "Ports: ")
    split(ports[2], entries, ", ")
    for (i in entries) {
        if (entries[i] ~ /open\/tcp.*Elasticsearch/) {
            split(entries[i], parts, "/")
            print "Host: " ip ", port: " parts[1]
        }
    }
}' out.txt
Finally, pass these targets & ports to the Elastic port scanner :-)
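If you prefer to stay in Python, the same extraction can be sketched as below. The function name parse_grepable is hypothetical; the sketch assumes nmap's grepable (-oG) layout, where the fields of a line are separated by tabs and each port entry reads port/state/protocol/owner/service/rpc/version/.

```python
def parse_grepable(text):
    """Return (host, port) pairs whose open TCP service mentions Elasticsearch."""
    hits = []
    for line in text.splitlines():
        if not line.startswith("Host: "):
            continue  # skip comments and blank lines
        fields = line.split("\t")
        host = fields[0].split()[1]
        for field in fields[1:]:
            if not field.startswith("Ports: "):
                continue
            for entry in field[len("Ports: "):].split(", "):
                parts = entry.split("/")
                if len(parts) < 3:
                    continue
                is_open_tcp = parts[1] == "open" and parts[2] == "tcp"
                # Like the awk pattern, look for the product name after the protocol.
                if is_open_tcp and "Elasticsearch" in "/".join(parts[3:]):
                    hits.append((host, int(parts[0])))
    return hits


if __name__ == "__main__":
    # out.txt is the file written by the nmap command above.
    with open("out.txt", encoding="utf-8") as handle:
        for host, port in parse_grepable(handle.read()):
            print(f"Host: {host}, port: {port}")
```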
Analyzing for GDPR compliance requires looking inside Elasticsearch documents for the presence of PII.
The simple (and fast) approach is to extract N documents from an Elasticsearch index and search inside them using regexes.
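As a rough illustration of the regex approach, the sketch below walks through every field of a few documents and reports each match. It is not the analyzer's code: the helper names are hypothetical, and the two patterns are simplified examples written for this sketch, not the contents of regexes.json.

```python
import re

# Simplified example patterns (assumptions for this sketch only).
EXAMPLE_REGEXES = {
    "email": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",
    "iban": r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){3,7}(?: ?[A-Z0-9]{1,3})?\b",
}


def iter_strings(value, path=""):
    """Yield (field_path, text) for every leaf value of a nested document."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from iter_strings(child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        for child in value:
            yield from iter_strings(child, path)
    elif value is not None:
        yield path, str(value)


def scan_documents(docs, regexes):
    """Return (field_path, regex_name, matched_text) for every regex hit."""
    compiled = {name: re.compile(pattern) for name, pattern in regexes.items()}
    findings = []
    for doc in docs:
        for field, text in iter_strings(doc):
            for name, pattern in compiled.items():
                for match in pattern.finditer(text):
                    findings.append((field, name, match.group(0)))
    return findings
```

Each document is the _source of a search hit. Scanning only N documents per index keeps the check fast, at the cost of missing PII that appears only in documents outside the sample.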
NER (named-entity recognition) can also be used to infer the presence of classes of PII with a machine learning model. To do so, we use a trained model from HuggingFace (ref) that covers 29 classes across 7 languages (English, Spanish, Swedish, German, Italian, Dutch, French)!
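The sketch below shows one way the output of such a model could be summarized per PII class. It assumes the entities come from a transformers token-classification pipeline created with aggregation_strategy="simple", which returns dictionaries with entity_group, score and word keys. The model identifier is not given here, so MODEL_ID is a placeholder; the function name, the 0.8 threshold and the class labels in the usage example are hypothetical too.

```python
from collections import Counter

# Obtaining entities (not run here; MODEL_ID is a placeholder):
#   from transformers import pipeline
#   ner = pipeline("token-classification", model=MODEL_ID,
#                  aggregation_strategy="simple")
#   entities = ner("some text taken from a document field")


def summarize_entities(entities, min_score=0.8):
    """Count detected entities per class, ignoring low-confidence detections."""
    counts = Counter()
    for entity in entities:
        if float(entity["score"]) >= min_score:
            counts[entity["entity_group"]] += 1
    return dict(counts)


# Usage with hand-written entities standing in for real pipeline output.
example = [
    {"entity_group": "GIVENNAME", "score": 0.99, "word": "Vincent"},
    {"entity_group": "SURNAME", "score": 0.97, "word": "Maury"},
    {"entity_group": "CITY", "score": 0.41, "word": "lambda"},
]
print(summarize_entities(example))
```

A score threshold is a common way to trade recall for precision: lowering it reports more candidate PII, including more false positives.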
Set up a virtual environment and install the 2 dependencies:
python -m venv .venv
source .venv/bin/activate
pip install transformers torch
The ML model (about 2 GB) will be downloaded at first execution:
python elastic-gdpr-analyzer.py -h
The script has several options:
- -h will display help.
- -t TARGET_FILE to specify the json file to read cluster hosts from. Defaults to targets.json. See the targets_example.json file as an example.
- -i INDEX to enter the name of a specific index to scan. By default, the script scans all indices.
- -r REGEX to enter a specific regex to look for (if set, the regexes from the regexes.json file are not run).
- -n NB_DOCS to set the number of documents (up to 10000) to get from each Elasticsearch index. Defaults to 1.
- -o REPORT_FILE to specify the name of the file to output results. csv or json supported. Defaults to es-gdpr-report.csv.
- --no-hidden is a flag to exclude hidden indices (whose names start with a dot). By default, the script scans any index (hidden or not).
- --no-ner is a flag to disable NER scanning, i.e. only search for regexes. By default, the script runs NER!
- --nb-threads NB_THREADS to specify how many hosts you want to scan in parallel. Defaults to 10.
- --socket-timeout TIMEOUT to set the timeout for socket connect (open port testing), in seconds. Set it to 2 on the Internet, 0.5 in local networks. Defaults to 2.
- --verbose is a flag to turn on verbose output in console.
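To make the -i, -n and --no-hidden options more concrete, here is a sketch of how they could translate into index selection and a search request. The function names are hypothetical and the analyzer may proceed differently. The 10000 cap is assumed to mirror Elasticsearch's default index.max_result_window, which limits how many hits a single search can return.

```python
from urllib.parse import quote

MAX_DOCS = 10000  # assumed to mirror the default index.max_result_window


def select_indices(names, only=None, no_hidden=False):
    """Pick the indices to scan from the list of index names of a cluster."""
    if only:
        return [name for name in names if name == only]
    if no_hidden:
        return [name for name in names if not name.startswith(".")]
    return list(names)


def build_search_request(index, nb_docs=1):
    """Return the URL path and JSON body fetching nb_docs documents of an index."""
    size = max(1, min(int(nb_docs), MAX_DOCS))
    path = f"/{quote(index, safe='')}/_search"
    body = {"size": size, "query": {"match_all": {}}}
    return path, body
```

With these helpers, -n 100 -i test-rgpd would send the body {"size": 100, "query": {"match_all": {}}} to /test-rgpd/_search.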
Easy run on 100 documents:
python elastic-gdpr-analyzer.py -n 100
Or analyze the 'test-rgpd' index (see below) with a custom regex:
python elastic-gdpr-analyzer.py -i 'test-rgpd' -r '[vV]incent.[mM]aury'
Or analyze a custom list of targets, a custom regex and a custom json output file:
python elastic-gdpr-analyzer.py -t mytargets.json -r '[vV]incent.[mM]aury' -o gdpr-report.json
At the end of your tests, you may delete the virtual environment:
deactivate
rm -rf .venv
If you want a test with a true positive, you can connect to an Elasticsearch instance via Kibana dev tools and run:
POST test-rgpd/_doc/1
{
"id": "abcd",
"postDate": "2025-03-15T13:12:00",
"message": "Message lambda contenant par exemple mon nom Vincent Maury et celui de Stéphanie de Monaco qui ne devraient pas être là.",
"email": "Mon email est [email protected] (pro).",
"phone": "J'ai un faux numéro de téléphone : (+33)674642014 et j'espère qu'il n'est pas utilisé...",
"addr": "Une adresse pro au 128, rue du Faubourg Saint-Honoré, 75008 Paris.",
"iban": "On va mettre un IBAN factice comme FR76 3000 4031 8400 0078 7353 152 pour la France."
}
The email and the fake IBAN should be identified by regex matching.
The 5 PII (name, email, phone, address and IBAN) should be identified by NER testing.
You can grab the es-gdpr-report.csv file, or any custom (json or csv) file generated by this script, as a report. Column titles should be self-explanatory.
Note: this script only scans non-internal indices (the ones not starting with .), so the sum over non-internal indices does not equal the totals of each node.
- Vincent Maury - Initial commit - blookot
This project is licensed under the Apache 2.0 License. See the LICENSE.md file for details.