This repository has been archived by the owner on Jan 3, 2025. It is now read-only.

Commit

Merge pull request #3 from jftuga/dev
v1.1.0 revamp
jftuga authored Dec 15, 2024
2 parents 23a5888 + 84ebca3 commit 16a2b2a
Showing 14 changed files with 621 additions and 1,127 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -130,9 +130,12 @@ dmypy.json

# user defined exclusions
*.txt
!requirements.txt
*.docx
scripts/
pyvenv.cfg
playground/
.idea/encodings.xml
.idea/
*.json
.??*~

3 changes: 0 additions & 3 deletions .idea/.gitignore

This file was deleted.

8 changes: 0 additions & 8 deletions .idea/deidentify.iml

This file was deleted.

12 changes: 0 additions & 12 deletions .idea/inspectionProfiles/Project_Default.xml

This file was deleted.

6 changes: 0 additions & 6 deletions .idea/inspectionProfiles/profiles_settings.xml

This file was deleted.

4 changes: 0 additions & 4 deletions .idea/misc.xml

This file was deleted.

8 changes: 0 additions & 8 deletions .idea/modules.xml

This file was deleted.

6 changes: 0 additions & 6 deletions .idea/vcs.xml

This file was deleted.

171 changes: 124 additions & 47 deletions README.md
@@ -1,79 +1,156 @@
# deidentify
Deidentify people's names along with pronoun substitution

## Synopsis

This is a command-line program used to substitute a person's given name and/or surname along with any gender specific pronouns. A [Windows GUI](https://github.com/jftuga/deidentify-gui) for this program is also available.
# Text De-identification Tool

## INTRODUCTION

This command-line tool automatically identifies and replaces personal
information in text documents using Natural Language Processing (NLP)
techniques. It focuses on finding and replacing person names and
gender-specific pronouns while maintaining the text's readability and
structure.

[Natural Language Processing](https://en.wikipedia.org/wiki/Natural_language_processing)
is a field of artificial intelligence that enables computers to
understand, interpret, and manipulate human language. This tool
specifically uses
[Named Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition)
(NER), an NLP technique that locates and classifies named entities
(like person names, organizations, locations) in text. NER helps identify
person names even in complex contexts, making it more reliable than simple
word matching.

Key Features:
- Automatic detection of person names using
[spaCy's transformer model](https://spacy.io/universe/project/spacy-transformers)
- Gender-specific pronoun replacement with neutral alternatives
- Intelligent encoding detection and Unicode handling
- Optional HTML output with color-coded replacements
- Detection of potentially missed names *(possessives, hyphenated names)*
- Efficient metadata caching for quick reprocessing

## INSTALLATION

1. Clone the repository:
```shell
git clone https://github.com/jftuga/deidentify.git
cd deidentify
```

## Example
2. Create and activate a Python virtual environment:
```shell
# On Windows
python -m venv venv
venv\Scripts\activate

# On macOS/Linux
python3 -m venv venv
source venv/bin/activate
```
Input:
I think John Smith likes programming. You can tell he enjoys using Python.

Output:
I think PERSON likes programming. You can tell HE/SHE enjoys using Python.
3. Install dependencies:
```shell
pip install -r requirements.txt
```
**Note:** *As of 2024-12-15, spaCy is not yet supported on macOS with Python 3.13.*

## Configuration

* This program relies on [Spacy](https://spacy.io/) for [Named-entity recognition](https://en.wikipedia.org/wiki/Named-entity_recognition) and [pronoun](https://en.wikipedia.org/wiki/Pronoun) substitution.
* For best results, you can set up a [Python Virtual Environment](https://docs.python.org/3/library/venv.html) and install `Spacy` with these settings:
* ![Spacy Settings](spacy_settings.png)
* `Spacy` can be installed with [other Spacy configuration options](https://spacy.io/usage).
4. Download the spaCy model:
```shell
python -m spacy download en_core_web_trf
```
**Note:** The transformer model is large (~500MB) but provides superior accuracy.

## Installation
## USAGE

Basic usage with output to STDOUT:
```shell
git clone https://github.com/jftuga/deidentify.git
python -m venv deidentify
cd deidentify
(Windows) - scripts\activate
(Linux/MacOS) - source bin/activate
python -m pip install --upgrade pip
pip install setuptools wheel
pip install spacy
python -m spacy download en_core_web_trf
python deidentify.py input.txt -r "PERSON"
```

## Usage
Generate color-coded HTML output:
```shell
python deidentify.py input.txt -r "[REDACTED]" -H -o output.html
```
usage: deidentify.py [-h] -r REPLACEMENT [-o OUTPUT_FILE] [-H] input_file

Command line options:
```shell
usage: deidentify.py [-h] -r REPLACEMENT [-o OUTPUT_FILE] [-H] [-v] input_file

positional arguments:
input_file text file to deidentify

optional arguments:
options:
-h, --help show this help message and exit
-r REPLACEMENT, --replacement REPLACEMENT
a word/phrase to replace identified names with
-o OUTPUT_FILE, --output_file OUTPUT_FILE
output file
-H, --html output in HTML format
-v, --version display program version and then exit
```

## Operation
### HTML Output Colors:
* Yellow: Gender-specific pronouns replaced with neutral alternatives
* Turquoise: Person names replaced with specified text, given by the `-r` switch
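
The color legend above can be illustrated with a minimal sketch. It assumes each replacement is described by a `(start, end, kind, replacement)` tuple and that the colors are applied with inline `background-color` styles; the `highlight` function, the `COLORS` table, and the tuple layout are hypothetical and are not taken from `deidentify.py`.

```python
from html import escape

# Hypothetical mapping that mirrors the legend: pronouns yellow, names turquoise.
COLORS = {"pronoun": "yellow", "name": "turquoise"}


def highlight(text, spans):
    """Return HTML for `text` with each span replaced and color-coded.

    `spans` holds (start, end, kind, replacement) tuples that are sorted by
    start offset and do not overlap (an assumption of this sketch).
    """
    parts = []
    cursor = 0
    for start, end, kind, replacement in spans:
        # Copy the untouched text before the span, escaping HTML characters.
        parts.append(escape(text[cursor:start]))
        parts.append(
            f'<span style="background-color: {COLORS[kind]}">'
            f"{escape(replacement)}</span>"
        )
        cursor = end
    # Copy whatever follows the last span.
    parts.append(escape(text[cursor:]))
    return "".join(parts)


text = "John Smith said he agrees."
name_start = text.index("John Smith")
pronoun_start = text.index(" he ") + 1
spans = [
    (name_start, name_start + len("John Smith"), "name", "PERSON"),
    (pronoun_start, pronoun_start + len("he"), "pronoun", "HE/SHE"),
]
html_out = highlight(text, spans)
print(html_out)
```

Escaping the surrounding text keeps characters such as `<` and `&` in the input from being interpreted as markup in the generated page.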

```shell
-- Windows
### Possible Misses

cd deidentify
scripts\activate
python deidentify.py -r PERSON -o output.txt input.txt
diff input.txt output.txt

-- Linux
These are listed as `possible_misses` in an intermediate JSON file named
`input--tokens.json` when using `input.txt` as the input file name.

cd deidentify
source bin/activate
python deidentify.py -r PERSON -o output.txt input.txt
diff input.txt output.txt
### Example

-- HTML Output
Input:
```
John Smith's report was excellent. He clearly understands the topic.
```

python deidentify.py -H -r PERSON -o output.htm input.txt
Output:
```
PERSON's report was excellent. HE/SHE clearly understands the topic.
```

## Possible Misses
## TECHNICAL DETAILS

The tool processes text in two stages:

1. Identification Stage: Uses spaCy's transformer model to identify:
   * Person names through Named Entity Recognition
   * Gender-specific pronouns through part-of-speech tagging

2. Replacement Stage: Replaces identified items while maintaining text integrity:
   * Processes text from end to beginning to preserve character positions
   * Handles gender-specific pronouns with neutral alternatives
   * Supports optional HTML output with color-coded replacements
   * Handles various Unicode punctuation variants
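
The reason for processing from end to beginning can be seen in a short sketch. It assumes the identification stage yields `(start, end, replacement)` character offsets; the `replace_spans` function and the tuple layout are hypothetical and only illustrate the idea, not the tool's actual code.

```python
def replace_spans(text, spans):
    """Apply (start, end, replacement) edits from the last span to the first.

    Editing the tail of the string first means the offsets of earlier spans
    stay valid even when a replacement is longer or shorter than the original.
    """
    for start, end, replacement in sorted(spans, reverse=True):
        text = text[:start] + replacement + text[end:]
    return text


text = "John Smith's report was excellent. He clearly understands the topic."
name_start = text.index("John Smith")
pronoun_start = text.index("He")
spans = [
    (name_start, name_start + len("John Smith"), "PERSON"),
    (pronoun_start, pronoun_start + len("He"), "HE/SHE"),
]
result = replace_spans(text, spans)
print(result)
```

If the spans were applied front to back instead, replacing the ten-character name with the six-character `PERSON` would shift every later offset by four, and the pronoun span would point at the wrong characters.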

### Text Processing Features:

* Intelligent encoding detection using the `chardet` third-party Python module
* Unicode punctuation normalization
* Safe handling of mixed encodings
* Metadata caching for efficient reprocessing
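
As an illustration of Unicode punctuation normalization, the sketch below maps a few common typographic variants to their ASCII equivalents so that a possessive such as `Smith’s` matches `Smith's`. The `PUNCT_MAP` table and the `normalize_punctuation` function are assumptions made for this example; the set of characters the tool actually normalizes is not stated here.

```python
# Hypothetical translation table: curly quotes and dashes to ASCII forms.
PUNCT_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (typographic apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
})


def normalize_punctuation(text):
    """Replace typographic punctuation with plain ASCII equivalents."""
    return text.translate(PUNCT_MAP)


sample = "John Smith\u2019s \u201cfinal\u201d report"
normalized = normalize_punctuation(sample)
print(normalized)
```

Each mapping is one character to one character, so character offsets computed on the normalized text still line up with the original input.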

### spaCy NER model

The `en_core_web_trf` (Transformer-based) model is used because:
* Highest accuracy for most NLP tasks, especially for named entity recognition
and dependency parsing
* Best performance on complex or ambiguous sentences
* Most robust handling of modern language and edge cases

However, be aware of these shortcomings vs
[other spaCy models](https://spacy.io/models/en):
* Much slower than statistical models
* Higher memory requirements (~200MB+)
* Not suitable for real-time processing of large volumes of text
* Runs fastest on a GPU, but still performs acceptably on CPU only

## ACKNOWLEDGEMENTS

This tool relies on several excellent open-source projects:

* [spaCy](https://github.com/explosion/spaCy) - Industrial-strength Natural Language Processing
* [chardet](https://github.com/chardet/chardet) - Universal character encoding detector

## LICENSE

These are listed as `possible_misses` in an intermediate JSON file named `input--tokens.json` when using `input.txt` as the input file.
[MIT LICENSE](https://github.com/jftuga/deidentify/blob/main/LICENSE)