Human_Native

PII Detection API

Overview

This API provides a service for detecting Personally Identifiable Information (PII) in text using a fine-tuned BERT model. It can identify sensitive information such as names, email addresses, phone numbers, and other personal data that might need to be redacted or handled with care.

Features

  • Fast and accurate PII detection
  • REST API interface
  • Containerized deployment with Docker
  • Built with FastAPI for high performance
  • Powered by a fine-tuned BERT model

Getting Started

Prerequisites

  • Docker
  • Docker Compose (optional)

Installation and Setup

  1. Clone the repository:

    git clone https://github.com/Tofu0142/Human_Native.git
    cd Human_Native
  2. Build the Docker image:

    docker build -t pii-detector-api .
  3. Run the container:

    docker run -p 8000:8000 pii-detector-api

The API will be available at http://localhost:8000.
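
Docker Compose is listed as an optional prerequisite; a minimal, hypothetical docker-compose.yml for this service (assuming the Dockerfile at the repository root and the default port) could look like:

version: "3.8"
services:
  pii-detector-api:
    build: .
    ports:
      - "8000:8000"

With this file in place, docker compose up --build replaces the manual build and run steps above.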

API Usage

Detect PII in Text

Endpoint: /predict

Method: POST

Request Body:

{
  "text": "My email is [email protected] and my phone is 555-123-4567"
}

Response:

{
  "redacted_text": "My email is [EMAIL] and my phone is [PHONE]",
  "has_pii": true,
  "confidence": 0.9999833106994629
}

Example Usage with curl

curl -X 'POST' \
  'http://localhost:8000/predict' \
  -H 'Content-Type: application/json' \
  -d '{
  "text": "My email is [email protected] and my phone is 555-123-4567"
}'
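
The same request can be made from Python. The sketch below assumes the requests package is installed and applies a hypothetical confidence threshold on top of the documented response fields (see the note on thresholds under Assumptions and Design Decisions):

import requests

API_URL = "http://localhost:8000/predict"  # local deployment from the Docker steps above
CONFIDENCE_THRESHOLD = 0.9  # hypothetical threshold; tune for your use case

payload = {"text": "My email is [email protected] and my phone is 555-123-4567"}
response = requests.post(API_URL, json=payload, timeout=10)
response.raise_for_status()

result = response.json()
if result["has_pii"] and result["confidence"] >= CONFIDENCE_THRESHOLD:
    # Use the redacted version instead of the original text
    print(result["redacted_text"])
else:
    print(payload["text"])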

Interactive API Documentation

FastAPI automatically generates interactive API documentation:

  1. Open a web browser
  2. Navigate to http://localhost:8000/docs
  3. You'll see the Swagger UI where you can explore and test all endpoints

Development

Local Development Setup

  1. Install Poetry:

    curl -sSL https://install.python-poetry.org | python3 -
  2. Install dependencies:

    poetry install
  3. Run the application:

    poetry run uvicorn App.app:app --reload
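
For orientation, here is a minimal sketch of what App/app.py exposes, reconstructed from the documented endpoint and response fields; the actual model-loading and redaction logic in the repository will differ:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="PII Detection API")

class PredictRequest(BaseModel):
    text: str

class PredictResponse(BaseModel):
    redacted_text: str
    has_pii: bool
    confidence: float

@app.post("/predict", response_model=PredictResponse)
def predict(request: PredictRequest) -> PredictResponse:
    if not request.text.strip():
        # The README documents error handling for empty texts
        raise HTTPException(status_code=400, detail="Text must not be empty")
    # Placeholder: the real implementation runs the fine-tuned BERT model here
    return PredictResponse(redacted_text=request.text, has_pii=False, confidence=1.0)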

Running Tests

poetry run pytest
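
A representative test using FastAPI's TestClient (a sketch; the repository's actual tests may differ):

from fastapi.testclient import TestClient

from App.app import app  # module path taken from the uvicorn command above

client = TestClient(app)

def test_predict_returns_expected_fields():
    response = client.post("/predict", json={"text": "My email is [email protected]"})
    assert response.status_code == 200
    body = response.json()
    # Fields documented in the API Usage section
    assert set(body) >= {"redacted_text", "has_pii", "confidence"}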

Data Generation and Training

The models were trained on a synthetic dataset generated to include various types of PII:

  1. Data Generation: We used a custom data generation pipeline to create realistic text samples with and without PII (a simplified sketch follows this list)
  2. Training Process:
    • Random Forest: Trained on feature-engineered text data (TF-IDF features combined with rule-based detection signals)
    • BERT: Fine-tuned on raw text with binary labels
  3. Model Selection: After evaluation, we selected the BERT model for its higher accuracy and performance in identifying common PII types.
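
As a minimal illustration of step 1, a template-based generator might pair PII slots with fake values. The snippet below uses the Faker library, which is an assumption on our part; the repository's actual pipeline is custom and may differ:

import random

from faker import Faker  # hypothetical dependency for illustration

fake = Faker()

PII_TEMPLATES = [
    "My email is {email} and my phone is {phone}",
    "Contact {name} at {email}",
]
CLEAN_SENTENCES = [
    "The meeting is scheduled for next Tuesday.",
    "Our quarterly report shows steady growth.",
]

def make_sample():
    """Return a (text, label) pair: label 1 = contains PII, 0 = clean."""
    if random.random() < 0.5:
        template = random.choice(PII_TEMPLATES)
        text = template.format(
            name=fake.name(), email=fake.email(), phone=fake.phone_number()
        )
        return text, 1
    return random.choice(CLEAN_SENTENCES), 0

dataset = [make_sample() for _ in range(1000)]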

Model Comparison

Our evaluation shows the performance comparison between the Random Forest and BERT models:

Metric          Random Forest   BERT
Accuracy        0.943000        1.000000
Precision       0.873503        1.000000
Recall          0.951876        1.000000
F1 Score        0.911007        1.000000
AUC             0.985578        1.000000
Inference Time  32.590107       29.393385

The BERT model provides higher accuracy with lower inference time, so we selected it for the API.

Assumptions and Design Decisions

  1. Dual Model Approach: We implemented both a deep learning model (BERT) and a traditional ML model (Random Forest) to provide options for different use cases.
  2. Binary Classification: The models perform binary classification (PII/No PII) rather than multi-class classification of specific PII types.
  3. Confidence Score: The API returns a confidence score to allow users to set their own thresholds for PII detection.
  4. Stateless API: The API is designed to be stateless, making it easy to scale horizontally.
  5. Docker Deployment: The solution is containerized for easy deployment in various environments.
  6. No Persistent Storage: The API doesn't store any of the processed text, ensuring privacy.
  7. Performance Optimization: The models are loaded once at startup to minimize inference time (see the sketch after this list).
  8. Error Handling: The API includes robust error handling for empty texts and other edge cases.
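
For item 7, one way to load a model once at startup is FastAPI's lifespan hook. This sketch assumes the Hugging Face transformers library and a hypothetical model path; the repository may use a different loading mechanism:

from contextlib import asynccontextmanager

from fastapi import FastAPI
from transformers import pipeline  # assumed dependency

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the fine-tuned model once, before the first request is served
    app.state.classifier = pipeline(
        "text-classification", model="path/to/fine-tuned-bert"  # hypothetical path
    )
    yield
    # No teardown needed for an in-memory model

app = FastAPI(lifespan=lifespan)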

Future Work and Considerations

Model Architecture and Performance

  1. Hybrid Inference Strategy: Implement a cascading approach where the faster Random Forest model performs initial screening, and the more accurate BERT model only processes uncertain cases (sketched after this list).
  2. Model Optimization: Explore model quantization, distillation, or smaller pre-trained models like DistilBERT to reduce model size and improve inference speed.
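
A hypothetical shape for the cascade in item 1, assuming a scikit-learn-style Random Forest and a transformers text-classification pipeline (the label name "PII" is illustrative):

def detect_pii(text, rf_model, rf_vectorizer, bert_classifier, low=0.2, high=0.8):
    """Cheap Random Forest screening first; BERT only for the uncertain band."""
    features = rf_vectorizer.transform([text])
    p = rf_model.predict_proba(features)[0][1]  # probability of the PII class
    if p <= low:
        return False, 1.0 - p  # confidently clean
    if p >= high:
        return True, p  # confidently PII
    # Uncertain band: defer to the slower but more accurate BERT model
    result = bert_classifier(text)[0]
    return result["label"] == "PII", result["score"]

The low/high thresholds control how much traffic reaches BERT, trading latency against accuracy.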

Data and Training Strategies

  1. Synthetic Data Limitations: While our models perform excellently on synthetic data, real-world text may present additional challenges. Consider generating more diverse synthetic data or incorporating real-world examples.
  2. Class Imbalance: Evaluate and address potential class imbalance issues using techniques like oversampling, undersampling, or weighted loss functions (a weighted-loss sketch follows).
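
For item 2, a minimal class-weighted loss sketch, assuming binary labels and PyTorch as the training framework (the class counts are hypothetical):

import torch

n_clean, n_pii = 9000, 1000  # hypothetical counts illustrating imbalance
# Weight each class inversely to its frequency so rare PII examples count more
weights = torch.tensor([1.0 / n_clean, 1.0 / n_pii])
weights = weights / weights.sum()

loss_fn = torch.nn.CrossEntropyLoss(weight=weights)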

Monitoring and Continuous Improvement

  1. Model Performance Monitoring: Implement systems to track key metrics over time and design feedback mechanisms for users to report false positives/negatives.
  2. A/B Testing Framework: Design a framework to safely introduce model improvements by comparing multiple model versions simultaneously.

Extended Application Scenarios

  1. Multilingual Support: Explore multilingual BERT models or language-specific models to support PII detection across different languages.

License

MIT License
