Elastic GDPR Scanner

The Elastic GDPR Scanner checks Elasticsearch instances for GDPR compliance, i.e. for the presence of PII (Personally Identifiable Information).

The Elastic GDPR Scanner consists of 2 tools:

  • the port scanner, which identifies Elasticsearch instances by port scanning the network,
  • the GDPR checker itself, which tests Elasticsearch targets against lists of regexes.

Disclaimer: neither Vincent Maury nor Elastic can be held responsible for the use of this script! Use it at your own risk!

0. Getting Started

These instructions will get you a copy of the project up and running on your local machine.

Prerequisites

This piece of Python has no prerequisite other than Python 3.
It should work on any platform.
For NER usage, the script relies on a PII detection model.

Get ready!

Just clone this repository:

git clone https://github.com/blookot/elastic-gdpr-scanner

1. Scanning for Elasticsearch instances

Running the script

Just run it:

python elastic-port-scanner.py -h

The script has several options:

  • -h will display help.
  • -t TARGETS to enter a specific target (hostname, single IP, or IP range in CIDR format, e.g. 10.50.3.0/24). Defaults to localhost.
  • -p PORTS to specify the port range (e.g. 9200-9210,9300-9310) where Elasticsearch could be running. Defaults to 9200.
  • -u USER to set a username to use when trying to authenticate to Elasticsearch. Defaults to elastic.
  • -pwd PASSWORD to set a password to use when trying to authenticate to Elasticsearch. Defaults to changeme.
  • -o OUTPUT to specify the name of the output file in JSON format. Defaults to targets.json.
  • --nb-threads NB_THREADS to specify how many hosts you want to scan in parallel. Defaults to 10.
  • --socket-timeout TIMEOUT to set the timeout for socket connect (open port testing), in seconds. Set it to 2 on the Internet, 0.5 in local networks. Defaults to 2.
  • --log-file FILE to specify the name of the file to output logs to. Defaults to es-scanner.csv.
  • --verbose turns on verbose output in the console. Defaults to False.

Easy run on a local Elasticsearch:

python elastic-port-scanner.py -pwd myelasticpassword

Or on an Elastic Cloud instance:

python elastic-port-scanner.py -t myelasticsearchid.europe-west9.gcp.elastic-cloud.com -p 443 -pwd myelasticpassword
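
You can also combine options, for instance to scan a whole local subnet over a port range (the subnet and password below are illustrative):

python elastic-port-scanner.py -t 10.50.3.0/24 -p 9200-9210 -pwd myelasticpassword --nb-threads 20 --socket-timeout 0.5 --verbose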

Output

The targets.json file output from this port scan can be used by the GDPR scanner below.

You can grab the es-scanner.csv file generated by this script, as a report. Column titles should be self-explanatory.

nmap prescan

You could also run the famous nmap port scanner to list hosts and ports before running the Elastic port scanner itself.

First, install nmap, then run:

nmap -h
nmap -p PORTS -sV --host-timeout TIMEOUT -oG OUTPUT TARGETS

For instance:

nmap -p 9200-9205 -sV --host-timeout 20s -oG out.txt 192.168.1.1-254

Then a bit of parsing will extract the Elasticsearch instances:

awk '
# remember the IP address of the current Host line
/^Host: [0-9]/ {ip=$2}
# on the Ports line, keep the open ports identified as Elasticsearch
/Ports:/ {
  split($0, ports, "Ports: ")
  split(ports[2], entries, ", ")
  for (i in entries) {
    if (entries[i] ~ /open\/tcp.*Elasticsearch/) {
      split(entries[i], parts, "/")
      print "Host: " ip ", port: " parts[1]
    }
  }
}' out.txt

Finally, pass these targets & ports to the Elastic port scanner :-)
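
For instance, if the parsing above reports a host 192.168.1.12 with port 9200 open (illustrative values), you would then run:

python elastic-port-scanner.py -t 192.168.1.12 -p 9200 -pwd myelasticpassword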

2. Running the GDPR analyzer

Analyzing for GDPR compliance requires looking inside Elasticsearch documents for the presence of PII.

The simple (and fast) approach is to extract N documents from an Elasticsearch index and search inside them using regexes.
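
As an illustration of this regex approach, here is a minimal sketch (not the analyzer's actual code). It assumes an unsecured cluster on localhost:9200 and a single email pattern; the real script reads its patterns from the regexes.json file:

import json
import re
import urllib.request

# illustrative email regex; the analyzer itself reads its patterns from regexes.json
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

# pull up to 100 documents from one index (local, unauthenticated cluster assumed)
with urllib.request.urlopen("http://localhost:9200/test-rgpd/_search?size=100") as resp:
    hits = json.load(resp)["hits"]["hits"]

for hit in hits:
    # flatten each document to a string and look for matches
    text = json.dumps(hit["_source"], ensure_ascii=False)
    for match in EMAIL.findall(text):
        print(f"{hit['_id']}: possible email found: {match}")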

NER (Named-entity recognition) can also be used to leverage the power of AI to infer the presence of classes of PII. To do so, we use a trained model from HuggingFace (ref) that covers 29 classes across 7 languages (English, Spanish, Swedish, German, Italian, Dutch, French)!
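
As a rough sketch of what NER inference looks like with the transformers pipeline API (the model id below is a placeholder, not necessarily the model this project uses):

from transformers import pipeline

# "some-org/multilingual-pii-ner" is a hypothetical model id, used here only for illustration
ner = pipeline("token-classification",
               model="some-org/multilingual-pii-ner",
               aggregation_strategy="simple")  # merge sub-word tokens into whole entities

text = "Mon email est [email protected] et mon IBAN est FR76 3000 4031 8400 0078 7353 152."
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 2))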

Running the script

Set up a virtual environment and install the two dependencies:

python -m venv .venv
source .venv/bin/activate
pip install transformers torch

The ML model (about 2 GB) will be downloaded at the first execution:

python elastic-gdpr-analyzer.py -h

The script has several options:

  • -h will display help.
  • -t TARGET_FILE to specify the JSON file to read cluster hosts from. Defaults to targets.json. See the targets_example.json file as an example.
  • -i INDEX to enter the name of a specific index to scan. By default, the script scans all indices.
  • -r REGEX to enter a specific regex to look for (if set, cancels running all regexes from regexes.json file).
  • -n NB_DOCS to set the number of documents (up to 10000) to get from each Elasticsearch index. Defaults to 1.
  • -o REPORT_FILE to specify the name of the file to output results. csv or json supported. Defaults to es-gdpr-report.csv.
  • --no-hidden is a flag to exclude hidden indices (whose name starts with a dot). By default, the script scans any index (hidden or not).
  • --no-ner is a flag to disable NER scanning, i.e. only search for regexes. By default, the script runs NER!
  • --nb-threads NB_THREADS to specify how many hosts you want to scan in parallel. Defaults to 10.
  • --socket-timeout TIMEOUT to set the timeout for socket connect (open port testing), in seconds. Set it to 2 on the Internet, 0.5 in local networks. Defaults to 2.
  • --verbose is a flag to turn on verbose output in the console.

Easy run on 100 documents:

python elastic-gdpr-analyzer.py -n 100

Or analyze the 'test-rgpd' index (see below) with a custom regex:

python elastic-gdpr-analyzer.py -i 'test-rgpd' -r '[vV]incent.[mM]aury'

Or analyze a custom list of targets, a custom regex and a custom json output file:

python elastic-gdpr-analyzer.py -t mytargets.json -r '[vV]incent.[mM]aury' -o gdpr-report.json
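
Or run a regex-only scan (NER disabled) over 1000 documents per index, skipping hidden indices (flags taken from the option list above):

python elastic-gdpr-analyzer.py --no-ner --no-hidden -n 1000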

At the end of your tests, you may delete the virtual environment:

deactivate
rm -rf .venv

Testing with a sample document

If you want to test with a true positive, you can connect to an Elasticsearch instance via the Kibana Dev Tools and run:

POST test-rgpd/_doc/1
{
  "id": "abcd",
  "postDate": "2025-03-15T13:12:00",
  "message": "Message lambda contenant par exemple mon nom Vincent Maury et celui de Stéphanie de Monaco qui ne devraient pas être là.",
  "email": "Mon email est [email protected] (pro).",
  "phone": "J'ai un faux numéro de téléphone : (+33)674642014 et j'espère qu'il n'est pas utilisé...",
  "addr": "Une adresse pro au 128, rue du Faubourg Saint-Honoré, 75008 Paris.",
  "iban": "On va mettre un IBAN factice comme FR76 3000 4031 8400 0078 7353 152 pour la France."
}
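
You can then check that the document was indexed, using the standard Elasticsearch document GET API:

GET test-rgpd/_doc/1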

The email and the fake IBAN should be identified by regex matching.
All 5 PII items (name, email, phone number, address and IBAN) should be identified by NER.

Report

You can grab the es-gdpr-report.csv or any custom (JSON or CSV) file generated by this script as a report. Column titles should be self-explanatory.

Note: this script only scans non-internal indices (the ones not starting with a dot), so the sum over non-internal indices does not equal the totals of each node.

Authors

  • Vincent Maury - Initial commit - blookot

License

This project is licensed under the Apache 2.0 License - see the LICENSE.md file for details.
