Add deidentification project #4

Open
wants to merge 1 commit into
base: gh-pages
26 changes: 26 additions & 0 deletions projects/deidentification-text.md
@@ -0,0 +1,26 @@
---
remote_theme: nhsx/nhsx-io-theme
title: Machine learning approaches to deidentification of text based data
description: Machine learning approaches to deidentification of text based data
permalink: /deidentification-text/
---

# Machine learning approaches to deidentification of text based data

**Keywords:** Natural Language Processing, Confidential Data, Patient Experience, Clinical Notes

**Need:** Much of the text produced within healthcare settings contains confidential information only incidentally and could be deidentified with little loss of value. For example, patient experience text is not, in general, identifiable, but patients sometimes include personal details that are irrelevant to the feedback itself, such as contact details when they want a response. Small quantities of identifiable data scattered throughout a large body of text prevent the data from being freely shared and collaborated on, or worked on in the cloud, without rigorous information governance (IG) procedures.

Being able to automatically scrub identifiable information from text-based data such as patient and staff experience feedback (and, in certain circumstances, clinical data), along with other relevant datasets, would give data scientists and researchers in the NHS better access to deidentified text for building algorithms and data products.

**Current Knowledge/Examples & Possible Techniques/Approaches:** [Philter](https://github.com/BCHSI/philter-ucsf) is available under the permissive BSD-3 licence. [UKCRIS](https://crisnetwork.co/uk-cris-programme) uses similar technology but the code is likely proprietary.
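
As a rough illustration of the simplest possible baseline, the sketch below replaces a few kinds of identifier with placeholder tags using regular expressions. It is not how Philter or UKCRIS work internally; the pattern set, the `PATTERNS` name and the `scrub` function are assumptions made for this example, and the patterns are deliberately loose approximations of UK formats rather than validated matchers.

```python
import re

# Illustrative patterns only (assumed for this sketch). A real system would
# need far broader coverage, including names and free-text addresses.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "PHONE": re.compile(r"\b0\d{2,4}[ -]?\d{3,4}[ -]?\d{3,4}\b"),
    "NHS_NUMBER": re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{4}\b"),
    "POSTCODE": re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b", re.IGNORECASE),
}


def scrub(text: str) -> str:
    """Replace every match of each pattern with a bracketed placeholder tag."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    print(scrub("Please call me on 07700 900123 or email jane.doe@example.com about ward 7."))
```

Patterns like these cannot catch names or identifying narrative detail, which is one reason the project looks towards machine learning approaches.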

**Related Previous Internship Projects:** n/a as first year of the scheme.

**Enables Future Work:** This work would support a wide range of text-based analytics in healthcare and could potentially release a large volume of data under an open licence for use by individuals across public and private healthcare contexts.

**Outcome/Learning Objectives:** The outcome would be an algorithm, or system of algorithms, capable of deidentifying text-based data. Learning outcomes would likely include experience with a broad range of supervised and unsupervised methods, as well as building and maintaining data pipelines that can process moderately large datasets.

**Datasets:** Datasets would need to be either synthesised (which may reduce the quality of the resulting algorithm) or be sourced from NHS organisations with strict data protection policies applied.
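
One way the synthesised route could work is to plant known, made-up identifiers into template sentences, so that any deidentification function can be scored by how many planted identifiers survive. The sketch below is a minimal illustration under that assumption; the templates and the `make_synthetic` and `leak_rate` helpers are hypothetical and not part of any existing tool.

```python
import random
import re

# Hypothetical feedback templates; one deliberately contains no identifier.
TEMPLATES = [
    "The nurses were lovely. Please reply to {email}.",
    "Waited three hours in A&E, call me on {phone} to discuss.",
    "Great care on the ward, no complaints.",
]


def make_synthetic(n, seed=0):
    """Return n (text, planted_identifiers) pairs built from made-up values."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        email = f"user{rng.randrange(10_000)}@example.com"
        phone = "07700 900" + f"{rng.randrange(1000):03d}"
        text = rng.choice(TEMPLATES).format(email=email, phone=phone)
        # Record only the identifiers that actually appear in this text.
        planted = [value for value in (email, phone) if value in text]
        records.append((text, planted))
    return records


def leak_rate(records, scrub):
    """Fraction of planted identifiers still present after scrubbing."""
    planted = leaked = 0
    for text, identifiers in records:
        cleaned = scrub(text)
        for identifier in identifiers:
            planted += 1
            leaked += identifier in cleaned
    return leaked / planted if planted else 0.0


if __name__ == "__main__":
    records = make_synthetic(200)
    # A deliberately weak scrubber that only handles email-like tokens.
    email_only = lambda text: re.sub(r"\S+@\S+", "[EMAIL]", text)
    print(f"leak rate: {leak_rate(records, email_only):.2f}")
```

Template-generated text is much more regular than real feedback, so a low leak rate here would say little about performance on real NHS data. That is the quality trade-off noted above.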

**Desired skill set:** When applying, please highlight any experience of working with text data (specifically patient and staff experience or clinical text data), natural language processing, coding experience (including any coding in the open), and any other data science experience you feel is relevant.