nhsx · ChrisBeeley · Sep 7, 2021
diff --git a/projects/deidentification-text.md b/projects/deidentification-text.md
@@ -0,0 +1,26 @@
+---
+remote_theme: nhsx/nhsx-io-theme
+title: Machine learning approaches to deidentification of text based data
+description: Machine learning approaches to deidentification of text based data
+permalink: /deidentification-text/
+---
+
+# Machine learning approaches to deidentification of text based data
+
+**Keywords:** Natural Language Processing, Confidential Data, Patient Experience, Clinical Notes
+
+**Need:** Much of the text produced in healthcare settings contains confidential information only incidentally and could easily be deidentified. Patient experience text, for example, is not generally identifiable, but patients sometimes include irrelevant personal information, for instance when they require a response to their feedback. Even small quantities of identifiable data scattered through a large body of text make it impossible to share the data freely, collaborate on it, or work on it in the cloud without rigorous information governance (IG) procedures.
+
+Automatically scrubbing identifiable information from text based data such as patient and staff experience feedback (and, in certain circumstances, clinical data), along with other relevant datasets, would give data scientists and researchers in the NHS better access to deidentified text for building algorithms and data products.
+
+**Current Knowledge/Examples & Possible Techniques/Approaches:** [Philter](https://github.com/BCHSI/philter-ucsf), an open source tool from UCSF for removing protected health information from clinical text, is available under the permissive BSD 3-Clause licence. [UKCRIS](https://crisnetwork.co/uk-cris-programme) uses similar technology, but its code is likely proprietary.
+
+**Related Previous Internship Projects:** None, as this is the first year of the scheme.
+
+**Enables Future Work:** This work would support a wide range of text based analytics in healthcare and could make a large amount of data available under an open licence for use by individuals across public and private healthcare contexts.
+
+**Outcome/Learning Objectives:** The outcome would be an algorithm, or system of algorithms, capable of deidentifying text based data. Learning outcomes would likely include experience with a broad range of supervised and unsupervised methods, as well as building and maintaining data pipelines that can process moderately large datasets.
+
+**Datasets:** Datasets would need to be either synthesised (which may reduce the quality of the resulting algorithm) or sourced from NHS organisations, with strict data protection policies applied.
+
+**Desired skill set:** When applying, please highlight any experience of working with text data (specifically patient and staff experience or clinical text data), natural language processing, coding (including any coding in the open), and any other data science experience you feel is relevant.