Merge pull request #3 from jftuga/dev

v1.1.0 revamp
jftuga · Dec 15, 2024 · 16a2b2a
2 parents 23a5888 + 84ebca3
commit 16a2b2a

Showing 14 changed files with 621 additions and 1,127 deletions.
diff --git a/.gitignore b/.gitignore
@@ -130,9 +130,12 @@ dmypy.json
 
 # user defined exclusions
 *.txt
+!requirements.txt
 *.docx
 scripts/
 pyvenv.cfg
 playground/
-.idea/encodings.xml
+.idea/
 *.json
+.??*~
+
diff --git a/.idea/.gitignore b/.idea/.gitignore
diff --git a/.idea/deidentify.iml b/.idea/deidentify.iml
diff --git a/.idea/inspectionProfiles/Project_Default.xml b/.idea/inspectionProfiles/Project_Default.xml
diff --git a/.idea/inspectionProfiles/profiles_settings.xml b/.idea/inspectionProfiles/profiles_settings.xml
diff --git a/.idea/misc.xml b/.idea/misc.xml
diff --git a/.idea/modules.xml b/.idea/modules.xml
diff --git a/.idea/vcs.xml b/.idea/vcs.xml
diff --git a/README.md b/README.md
@@ -1,79 +1,156 @@
-# deidentify
-Deidentify people's names along with pronoun substitution
-
-## Synopsis
-
-This is a command-line program used to substitute a person's given name and/or surname along with any gender specific pronouns. A [Windows GUI](https://github.com/jftuga/deidentify-gui) for this program is also available.
+# Text De-identification Tool
+
+## INTRODUCTION
+
+This command-line tool automatically identifies and replaces personal
+information in text documents using Natural Language Processing (NLP)
+techniques. It focuses on finding and replacing person names and
+gender-specific pronouns while maintaining the text's readability and
+structure.
+
+[Natural Language Processing](https://en.wikipedia.org/wiki/Natural_language_processing)
+is a field of artificial intelligence that enables computers to
+understand, interpret, and manipulate human language. This tool
+specifically uses
+[Named Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition)
+(NER), an NLP technique that locates and classifies named entities
+(like person names, organizations, locations) in text. NER helps identify
+person names even in complex contexts, making it more reliable than simple
+word matching.
+
+Key Features:
+- Automatic detection of person names using
+  [spaCy's transformer model](https://spacy.io/universe/project/spacy-transformers)
+- Gender-specific pronoun replacement with neutral alternatives
+- Intelligent encoding detection and Unicode handling
+- Optional HTML output with color-coded replacements
+- Detection of potentially missed names *(possessives, hyphenated names)*
+- Efficient metadata caching for quick reprocessing
+
+## INSTALLATION
+
+1. Clone the repository:
+```shell
+git clone https://github.com/jftuga/deidentify.git
+cd deidentify
+```
 
-## Example
+2. Create and activate a Python virtual environment:
+```shell
+# On Windows
+python -m venv venv
+venv\Scripts\activate
 
+# On macOS/Linux
+python3 -m venv venv
+source venv/bin/activate
 ```
-Input:
-I think John Smith likes programming. You can tell he enjoys using Python.
 
-Output:
-I think PERSON likes programming. You can tell HE/SHE enjoys using Python.
+3. Install dependencies:
+```shell
+pip install -r requirements.txt
 ```
+**Note:** *As of 2024-12-15, spaCy is not yet supported on macOS with Python 3.13.*
 
-## Configuration
-
-* This program relies on [Spacy](https://spacy.io/) for [Named-entitiy recognition](https://en.wikipedia.org/wiki/Named-entity_recognition) and [pronoun](https://en.wikipedia.org/wiki/Pronoun) substitution.
-* For best results, you can set up a [Python Virtual Environment](https://docs.python.org/3/library/venv.html) and install `Spacy` with these settings:
-* ![Spacy Settings](spacy_settings.png)
-* `Spacy` can be installed with [other Spacy configuration options](https://spacy.io/usage).
+4. Download the spaCy model:
+```shell
+python -m spacy download en_core_web_trf
+```
+**Note:** The transformer model is large (~500MB) but is more accurate than spaCy's smaller English models.
 
-## Installation
+## USAGE
 
+Basic usage with output to STDOUT:
 ```shell
-git clone https://github.com/jftuga/deidentify.git
-python -m venv deidentify
-cd deidentify
-(Windows) - scripts\activate
-(Linux/MacOS) - source bin/activate
-python -m pip install --upgrade pip
-pip install setuptools wheel
-pip install spacy
-python -m spacy download en_core_web_trf
+python deidentify.py input.txt -r "PERSON"
 ```
 
-## Usage
+Generate color-coded HTML output:
+```shell
+python deidentify.py input.txt -r "[REDACTED]" -H -o output.html
 ```
-usage: deidentify.py [-h] -r REPLACEMENT [-o OUTPUT_FILE] [-H] input_file
+
+Command line options:
+```shell
+usage: deidentify.py [-h] -r REPLACEMENT [-o OUTPUT_FILE] [-H] [-v] input_file
 
 positional arguments:
   input_file            text file to deidentify
 
-optional arguments:
+options:
   -h, --help            show this help message and exit
   -r REPLACEMENT, --replacement REPLACEMENT
                         a word/phrase to replace identified names with
   -o OUTPUT_FILE, --output_file OUTPUT_FILE
                         output file
   -H, --html            output in HTML format
+  -v, --version         display program version and then exit
 ```
 
-## Operation
+### HTML Output Colors:
+* Yellow: Gender-specific pronouns replaced with neutral alternatives
+* Turquoise: Person names replaced with specified text, given by the `-r` switch
 
-```shell
--- Windows 
+### Possible Misses
 
-cd deidentify
-scripts\activate
-python deidentify.py -r PERSON -o output.txt input.txt
-diff input.txt output.txt
-
--- Linux
+Names that may have been missed are listed as `possible_misses` in an intermediate
+JSON file named `input--tokens.json` when using `input.txt` as the input file name.
 
-cd deidentify
-source bin/activate
-python deidentify.py -r PERSON -o output.txt input.txt
-diff input.txt output.txt
+### Example
 
--- HTML Output
+Input:
+```
+John Smith's report was excellent. He clearly understands the topic.
+```
 
-python deidentify.py -H -r PERSON -o output.htm input.txt
+Output:
+```
+PERSON's report was excellent. HE/SHE clearly understands the topic.
 ```
 
-## Possible Misses
+## TECHNICAL DETAILS
+
+The tool processes text in two stages:
+
+1. Identification Stage: Uses spaCy's transformer model to identify:
+* * Person names through Named Entity Recognition
+* * Gender-specific pronouns through part-of-speech tagging
+
+2. Replacement Stage: Replaces identified items while maintaining text integrity:
+* * Processes text from end to beginning to preserve character positions
+* * Handles gender-specific pronouns with neutral alternatives
+* * Supports optional HTML output with color-coded replacements
+* * Handles various Unicode punctuation variants
+
+### Text Processing Features:
+
+* Intelligent encoding detection using the `chardet` third-party Python module
+* Unicode punctuation normalization
+* Safe handling of mixed encodings
+* Metadata caching for efficient reprocessing
+
+### spaCy NER model
+
+The `en_core_web_trf` (Transformer-based) model is used because:
+* Highest accuracy for most NLP tasks, especially for named entity recognition
+  and dependency parsing
+* Best performance on complex or ambiguous sentences
+* Most robust handling of modern language and edge cases
+
+However, be aware of these shortcomings vs
+[other spaCy models](https://spacy.io/models/en):
+* Much slower than statistical models
+* Higher memory requirements (~200MB+)
+* Not suitable for real-time processing of large volumes of text
+* Runs best on a GPU, but still performs acceptably on CPU-only systems
+
+## ACKNOWLEDGEMENTS
+
+This tool relies on several excellent open-source projects:
+
+* [spaCy](https://github.com/explosion/spaCy) - Industrial-strength Natural Language Processing
+* [chardet](https://github.com/chardet/chardet) - Universal character encoding detector
+
+## LICENSE
 
-These are listed as `possible_misses` in an intermeadiate JSON file named `input--tokens.json` when using `input.txt` as the input file.
+[MIT LICENSE](https://github.com/jftuga/deidentify/blob/main/LICENSE)
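
The Replacement Stage in the new README says text is processed from end to beginning to preserve character positions. A minimal sketch of that idea follows, assuming each match is available as a start offset, an end offset, and a replacement string. The `Span` type and the `replace_spans` function are hypothetical names used only for illustration; they are not taken from `deidentify.py`, and the offsets in the demo are written by hand rather than produced by spaCy.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical record: character offsets into the original text plus a replacement."""
    start: int
    end: int
    replacement: str


def replace_spans(text: str, spans: list[Span]) -> str:
    """Apply replacements from the last span to the first.

    Editing the tail of the string first means the offsets of spans
    nearer the start still point at the right characters, even when a
    replacement is longer or shorter than the text it replaces.
    Spans are assumed not to overlap.
    """
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + span.replacement + text[span.end:]
    return text


if __name__ == "__main__":
    sample = "John Smith's report was excellent. He clearly understands the topic."
    spans = [
        Span(0, 10, "PERSON"),   # covers "John Smith"
        Span(35, 37, "HE/SHE"),  # covers "He"
    ]
    print(replace_spans(sample, spans))
    # PERSON's report was excellent. HE/SHE clearly understands the topic.
```

Applying the same spans front to back would require shifting every later offset by the length difference of each earlier replacement; working backward avoids that bookkeeping.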