Elastic GDPR Scanner

The Elastic GDPR Scanner checks Elasticsearch instances for GDPR compliance, i.e. for the presence of PII (Personally Identifiable Information).

The Elastic GDPR Scanner consists of 2 tools:

  • the port scanner, which identifies Elasticsearch instances by port scanning the network,
  • the GDPR checker itself, which tests Elasticsearch targets against lists of regexes.

Disclaimer: neither Vincent Maury nor Elastic can be held responsible for the use of this script! Use it at your own risk!

0. Getting Started

These instructions will get you a copy of the project up and running on your local machine.

Prerequisites

This piece of Python has no prerequisite other than Python 3.
It should work on any platform.
For NER usage, the script relies on a PII detection model.

Get ready!

Just clone this repository:

git clone https://github.com/blookot/elastic-gdpr-scanner

1. Scanning for Elasticsearch instances

Running the script

Just run it:

python elastic-port-scanner.py -h

The script has several options:

  • -h will display help.
  • -t TARGETS to enter a specific target (hostname, single IP, or IP range in CIDR format, e.g. 10.50.3.0/24). Defaults to localhost.
  • -p PORTS to specify the port range (e.g. 9200-9210,9300-9310) where Elasticsearch could be running. Defaults to 9200.
  • -u USER to set a username to use when trying to authenticate to Elasticsearch. Defaults to elastic.
  • -pwd PASSWORD to set a password to use when trying to authenticate to Elasticsearch. Defaults to changeme.
  • -o OUTPUT to specify the name of the output file in JSON format. Defaults to targets.json.
  • --nb-threads NB_THREADS to specify how many hosts you want to scan in parallel. Defaults to 10.
  • --socket-timeout TIMEOUT to set the timeout for socket connect (open port testing), in seconds. Set it to 2 on the Internet, 0.5 in local networks. Defaults to 2.
  • --log-file FILE to specify the name of the file to output logs to. Defaults to es-scanner.csv.
  • --verbose turns on verbose output in the console. Defaults to False.

Easy run on a local Elasticsearch:

python elastic-port-scanner.py -pwd myelasticpassword

Or on an Elastic Cloud instance:

python elastic-port-scanner.py -t myelasticsearchid.europe-west9.gcp.elastic-cloud.com -p 443 -pwd myelasticpassword
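
You can also combine options, for instance to scan a whole local subnet over a port range (the subnet and password below are illustrative):

python elastic-port-scanner.py -t 10.50.3.0/24 -p 9200-9210 -pwd myelasticpassword --nb-threads 20 --socket-timeout 0.5 --verbose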

Output

The targets.json file output from this port scan can be used by the GDPR scanner below.

You can grab the es-scanner.csv file generated by this script, as a report. Column titles should be self-explanatory.

nmap prescan

You could also run the famous nmap port scanner to list hosts and ports before running the Elastic port scanner itself.

First, install nmap, then run:

nmap -h
nmap -p PORTS -sV --host-timeout TIMEOUT -oG OUTPUT TARGETS

For instance:

nmap -p 9200-9205 -sV --host-timeout 20s -oG out.txt 192.168.1.1-254

Then a bit of parsing will extract the Elasticsearch instances:

awk '
# remember the IP address of the current Host line
/^Host: [0-9]/ {ip=$2}
# on the Ports line, keep the open ports identified as Elasticsearch
/Ports:/ {
  split($0, ports, "Ports: ")
  split(ports[2], entries, ", ")
  for (i in entries) {
    if (entries[i] ~ /open\/tcp.*Elasticsearch/) {
      split(entries[i], parts, "/")
      print "Host: " ip ", port: " parts[1]
    }
  }
}' out.txt

Finally, pass these targets & ports to the Elastic port scanner :-)
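
For instance, if the parsing above reports a host 192.168.1.12 with port 9200 open (illustrative values), you would then run:

python elastic-port-scanner.py -t 192.168.1.12 -p 9200 -pwd myelasticpassword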

2. Running the GDPR analyzer

Analyzing for GDPR compliance requires looking inside Elasticsearch documents for the presence of PII.

The simple (and fast) approach is to extract N documents from an Elasticsearch index and search inside them using regexes.
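
As an illustration of this regex approach, here is a minimal sketch (not the analyzer's actual code). It assumes an unsecured cluster on localhost:9200 and a single email pattern; the real script reads its patterns from the regexes.json file:

import json
import re
import urllib.request

# illustrative email regex; the analyzer itself reads its patterns from regexes.json
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

# pull up to 100 documents from one index (local, unauthenticated cluster assumed)
with urllib.request.urlopen("http://localhost:9200/test-rgpd/_search?size=100") as resp:
    hits = json.load(resp)["hits"]["hits"]

for hit in hits:
    # flatten each document to a string and look for matches
    text = json.dumps(hit["_source"], ensure_ascii=False)
    for match in EMAIL.findall(text):
        print(f"{hit['_id']}: possible email found: {match}")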

NER (Named-entity recognition) can also be used to leverage the power of AI to infer the presence of classes of PII. To do so, we use a trained model from HuggingFace (ref) that covers 29 classes across 7 languages (English, Spanish, Swedish, German, Italian, Dutch, French)!
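
As a rough sketch of what NER inference looks like with the transformers pipeline API (the model id below is a placeholder, not necessarily the model this project uses):

from transformers import pipeline

# "some-org/multilingual-pii-ner" is a hypothetical model id, used here only for illustration
ner = pipeline("token-classification",
               model="some-org/multilingual-pii-ner",
               aggregation_strategy="simple")  # merge sub-word tokens into whole entities

text = "Mon email est [email protected] et mon IBAN est FR76 3000 4031 8400 0078 7353 152."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 2))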

Running the script

Set up a virtual environment and install the two dependencies:

python -m venv .venv
source .venv/bin/activate
pip install transformers torch

The ML model (about 2 GB) will be downloaded at the first execution:

python elastic-gdpr-analyzer.py -h

The script has several options:

  • -h will display help.
  • -t TARGET_FILE to specify the JSON file to read cluster hosts from. Defaults to targets.json. See the targets_example.json file as an example.
  • -i INDEX to enter the name of a specific index to scan. By default, the script scans all indices.
  • -r REGEX to enter a specific regex to look for (if set, cancels running all regexes from regexes.json file).
  • -n NB_DOCS to set the number of documents (up to 10000) to get from each Elasticsearch index. Defaults to 1.
  • -o REPORT_FILE to specify the name of the file to output results. csv or json supported. Defaults to es-gdpr-report.csv.
  • --no-hidden is a flag to exclude hidden indices (whose name starts with a dot). By default, the script scans any index (hidden or not).
  • --no-ner is a flag to disable NER scanning, i.e. only search for regexes. By default, the script runs NER!
  • --nb-threads NB_THREADS to specify how many hosts you want to scan in parallel. Defaults to 10.
  • --socket-timeout TIMEOUT to set the timeout for socket connect (open port testing), in seconds. Set it to 2 on the Internet, 0.5 in local networks. Defaults to 2.
  • --verbose is a flag to turn on verbose output in the console.

Easy run on 100 documents:

python elastic-gdpr-analyzer.py -n 100

Or analyze the 'test-rgpd' index (see below) with a custom regex:

python elastic-gdpr-analyzer.py -i 'test-rgpd' -r '[vV]incent.[mM]aury'

Or analyze a custom list of targets, a custom regex and a custom json output file:

python elastic-gdpr-analyzer.py -t mytargets.json -r '[vV]incent.[mM]aury' -o gdpr-report.json
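
Or run a regex-only scan (NER disabled) over 1000 documents per index, skipping hidden indices (flags taken from the option list above):

python elastic-gdpr-analyzer.py --no-ner --no-hidden -n 1000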

At the end of your tests, you may delete the virtual environment:

deactivate
rm -rf .venv

Testing with a sample document

If you want to test with a true positive, you can connect to an Elasticsearch instance via the Kibana Dev Tools and run:

POST test-rgpd/_doc/1
{
  "id": "abcd",
  "postDate": "2025-03-15T13:12:00",
  "message": "Message lambda contenant par exemple mon nom Vincent Maury et celui de Stéphanie de Monaco qui ne devraient pas être là.",
  "email": "Mon email est [email protected] (pro).",
  "phone": "J'ai un faux numéro de téléphone : (+33)674642014 et j'espère qu'il n'est pas utilisé...",
  "addr": "Une adresse pro au 128, rue du Faubourg Saint-Honoré, 75008 Paris.",
  "iban": "On va mettre un IBAN factice comme FR76 3000 4031 8400 0078 7353 152 pour la France."
}
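
You can then check that the document was indexed, using the standard Elasticsearch document GET API:

GET test-rgpd/_doc/1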

The email and the fake IBAN should be identified by regex matching.
All 5 PII items (name, email, phone number, address and IBAN) should be identified by NER.

Report

You can grab the es-gdpr-report.csv or any custom (JSON or CSV) file generated by this script as a report. Column titles should be self-explanatory.

Note: this script only scans non-internal indices (the ones not starting with a dot), so the sum over non-internal indices does not equal the totals of each node.

Authors

  • Vincent Maury - Initial commit - blookot

License

This project is licensed under the Apache 2.0 License - see the LICENSE.md file for details.
