
Commit a502da8

updated docs and added new section on anonymisation. fixed some tests
1 parent 56a1221 commit a502da8

15 files changed: +361 -26 lines changed

docs/source/anonymisation.rst

+110
@@ -0,0 +1,110 @@
+Anonymisation
+=============
+
+Mudlark can now be used to fully anonymise a Maintenance Work Order CSV file. In this section we will cover anonymising the text column, and anonymising other columns.
+
+Anonymising text
+----------------
+
+To anonymise text, simply set the ``anonymise_text`` argument to true, either via the command line or in a config file.
+
+When this is set to true, any terms that Mudlark deems to be "asset identifiers" will be replaced with unique anonymised identifiers. For example, given the following data::
+
+    text
+    ABC 124 is broken
+    ABC 123 has a problem
+    ABC-124 is broken
+    pumps busted
+    enGiNe was broken
+    a leak was Formed
+
+The result will be::
+
+    text
+    Asset1 is broken
+    Asset2 has a problem
+    Asset1 is broken
+    pump bust
+    engine is broken
+    a leak is form
+
+Note how both ``ABC 124`` and ``ABC-124`` become ``Asset1``, while ``ABC 123`` becomes ``Asset2``.
+
+Some further notes on how this works:
+
+* Currently, a term is recognised as an asset identifier if it has one or more uppercase letters, followed by either nothing, a space, or a hyphen, then followed by one or more digits.
+* The order of the terms is randomised prior to generating the numbers for each asset identifier.
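The recognition rule in the notes above can be written as a regular expression. The following is a sketch only — ``ASSET_ID`` is a hypothetical name, and Mudlark's actual pattern may differ:

```python
import re

# Sketch of the rule described above: one or more uppercase letters,
# followed by nothing, a space, or a hyphen, then one or more digits.
# Illustration only; Mudlark's real pattern may differ.
ASSET_ID = re.compile(r"\b[A-Z]+[ -]?\d+\b")

print(ASSET_ID.findall("ABC 124 is broken"))  # ['ABC 124']
print(ASSET_ID.findall("ABC-124 is broken"))  # ['ABC-124']
print(ASSET_ID.findall("pumps busted"))       # []
```

Note that this pattern also matches identifiers with no separator at all, such as ``ABC123``.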
+Anonymising/processing other columns
+------------------------------------
+
+Mudlark can also anonymise and/or process other columns. To do this, set ``column_config_path`` prior to running Mudlark (either via the command line or in a config file). An example column config file might look like this::
+
+    columns:
+      - name: floc
+        handler: FLOC
+      - name: mwo_id
+        handler: RandomiseInteger
+      - name: unique_floc_desc
+        handler: ToUniqueString
+        prefix: FLOC_Desc_
+      - name: other
+        handler: None
+
+This file dictates which columns should be kept by Mudlark, and how each of them should be handled. There are currently four available "handlers":
+
+* ``None`` simply passes the column through to the output file without doing anything to it.
+* ``RandomiseInteger`` replaces each unique value of the column with a randomised 7-digit integer.
+* ``FLOC`` treats the column as a Functional Location. It splits the value on either a ``.`` or a ``-``, then converts each unique value at each level of the FLOC hierarchy into a number. For example, ``123-45-67`` might become ``1-1-1``, ``123-45-68`` might become ``1-1-2``, and so on.
+* ``ToUniqueString`` converts each unique value into an anonymised string starting with the given ``prefix``. For example, ``Pump FLOC`` might become ``FLOC_Desc_1``, ``Belt FLOC`` might become ``FLOC_Desc_2``, and so on.
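As a rough illustration of the handlers described above, here is a minimal sketch. The function names are hypothetical, the ``-`` join separator for ``FLOC`` is an assumption, and Mudlark's real implementations (in ``process_column``) may differ:

```python
import random
import re

def floc_handler(values):
    """FLOC sketch: split on '.' or '-', number each unique value per level."""
    seen = {}    # (level, part) -> assigned number (as a string)
    counts = {}  # level -> how many unique parts seen so far
    out = []
    for v in values:
        parts = re.split(r"[.-]", v)
        anon = []
        for level, part in enumerate(parts):
            if (level, part) not in seen:
                counts[level] = counts.get(level, 0) + 1
                seen[(level, part)] = str(counts[level])
            anon.append(seen[(level, part)])
        out.append("-".join(anon))
    return out

def to_unique_string(values, prefix):
    """ToUniqueString sketch: prefix + running number per unique value."""
    seen = {}
    return [seen.setdefault(v, f"{prefix}{len(seen) + 1}") for v in values]

def randomise_integer(values):
    """RandomiseInteger sketch: each unique value -> one random 7-digit integer.

    (A real implementation would also guard against collisions.)
    """
    seen = {}
    return [seen.setdefault(v, str(random.randint(1_000_000, 9_999_999)))
            for v in values]
```

For example, ``floc_handler(["123-45-67", "123-45-68"])`` returns ``['1-1-1', '1-1-2']``, matching the behaviour described above (the example output later in this commit joins levels with ``_`` rather than ``-``).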
+Example
+^^^^^^^
+
+Here is an example dataset::
+
+    text,cost,other,floc,mwo_id,unique_floc_desc
+    ABC 124 is broken,123,test,123.45.67,123,FLOC 123
+    replace,43,xx,123.45.68,123,FLOC 124
+    X/X,540,test,123.45.69,123,FLOC 125
+    ABC 123 has a problem,3,test,123.45.67,123,FLOC 123
+    slurries,4.3,xx,123.45.67,123,FLOC 123
+    ABC-124 is broken,4.33,yrds,123.45.67,123,FLOC 123
+    pumps busted,43.43,tyrdtrd,111.45.67,123,FLOC 125
+    enGiNe was broken,4332.3,6t554,112.45.67,123,FLOC 126
+    a leak was Formed,333,545,113.45.67,123,FLOC 127
+
+We are going to anonymise the text (as discussed in the first section), and will keep the ``other``, ``floc``, ``mwo_id``, and ``unique_floc_desc`` columns. Here is our ``column-config.yml``::
+
+    columns:
+      - name: floc
+        handler: FLOC
+      - name: mwo_id
+        handler: RandomiseInteger
+      - name: unique_floc_desc
+        handler: ToUniqueString
+        prefix: FLOC_Desc_
+      - name: other
+        handler: None
+
+And our Mudlark config file::
+
+    input_path: my-file.csv
+    output_path: my-file-out.csv
+    text_column: text
+    output_format: csv
+    anonymise_text: true
+    column_config_path: column-config.yml
+
+Once we run Mudlark, the output will be::
+
+    text,other,floc,mwo_id,unique_floc_desc
+    Asset1 is broken,test,1_1_1,2462749,FLOC_Desc_1
+    replace,xx,1_1_2,7832383,FLOC_Desc_2
+    x/x,test,1_1_3,5472030,FLOC_Desc_3
+    Asset2 has a problem,test,1_1_1,2806910,FLOC_Desc_1
+    slurry,xx,1_1_1,1640112,FLOC_Desc_1
+    Asset1 is broken,yrds,1_1_1,7360650,FLOC_Desc_1
+    pump bust,tyrdtrd,2_1_1,9995977,FLOC_Desc_3
+    engine is broken,6t554,3_1_1,6573352,FLOC_Desc_4
+    a leak is form,545,4_1_1,6717645,FLOC_Desc_5

docs/source/index.rst

+1
@@ -44,5 +44,6 @@ Part of the normalisation stage involves replacing any words appearing in a pred
    self
    installation
    usage
+   anonymisation
    implementation
docs/source/usage.rst

+17-5
@@ -39,12 +39,15 @@ Optional arguments:
    * - Argument
      - Type
      - Details
-   * - ``output-format``
-     - Text
-     - The format to save the output. Can be either 'quickgraph' (saves the output as a QuickGraph-compatible JSON file) or 'csv' (saves the output as a CSV file). [default: quickgraph]
    * - ``output-path``
      - Text
      - The path to save the normalised dataset to once complete.
+   * - ``output-format``
+     - Text
+     - The format to save the output. Can be either 'quickgraph' (saves the output as a QuickGraph-compatible JSON file) or 'csv' (saves the output as a CSV file). [default: quickgraph]
+   * - ``anonymise-text``
+     - Boolean
+     - Whether to anonymise asset identifiers in the text. If true, any asset identifiers, e.g. "ABC 123", "ABC123", "ABC-123", will be converted to anonymised identifiers, e.g. "Asset1". See the next section for more details.
    * - ``max-rows``
      - Integer
      - If specified, the output will be randomly sampled to contain the specified maximum number of rows.
@@ -57,12 +60,20 @@ Optional arguments:
    * - ``drop-duplicates``
      - Boolean
      - If true, any rows with the same text in the text field as another row will be dropped. [default: False]
-   * - ``csv-keep-columns``
+   * - ``column-config-path``
      - Text
-     - If specified, only the given columns will be kept in the final output. Columns should be given as a comma separated list surrounded by double quotes, e.g. "col1, col2, col3"... This argument is only relevant when output_format = csv.
+     - If specified, the given path (must be a ``.yml`` file) will be used to determine which columns to keep in the output, and what to do with those columns. See the next section for more details.
    * - ``quickgraph-id-columns``
      - Text
      - If specified, the given column(s) will be used as id columns when generating output for QuickGraph. You may specify one column (for example 'my_id'), or multiple columns separated via comma (for example 'my_id, surname'). This argument is only relevant when output_format = quickgraph.
+   * - ``dump-anonymised-terms-path``
+     - Text
+     - If specified, any terms that were anonymised by Mudlark (assuming ``anonymise-text`` was set to ``True``) will be saved to this path. This allows you to reverse the anonymisation at a later date if necessary.
+   * - ``dump-processed-columns-path``
+     - Text
+     - If specified, any columns that were processed by Mudlark (assuming ``column-config-path`` was set) will be saved to this path. This allows you to reverse the column processing at a later date if necessary.
+
 
 Simple example
 ^^^^^^^^^^^^^^
@@ -102,6 +113,7 @@ Then, you can read it in via the ``config`` argument::
 
 Note that the arguments have underscores (``_``) instead of dashes (``-``) when written in the yaml file.
 
+
 Running Mudlark in Python
 -------------------------

mudlark/column_config.py

-1
@@ -25,4 +25,3 @@ class Column(BaseModel):
 
 class ColumnConfig(ConfiguredBaseModel):
     columns: List[Column]
-    output_path: Optional[StrictStr] = None

mudlark/main.py

+29-18
@@ -1,6 +1,8 @@
 """The main functions of Mudlark, i.e. normalise_csv and normalise_text."""
 import json
 import typer
+import random
+from collections import OrderedDict
 from typing_extensions import Annotated, Optional
 from typer_config import use_yaml_config
 
@@ -36,12 +38,6 @@ def normalise_csv(
             "'short text', 'risk name', etc."
         ),
     ],
-    anonymise_text: Annotated[
-        bool,
-        typer.Argument(
-            help="Whether to anonymise asset identifiers in the text."
-        ),
-    ] = False,
     output_path: Annotated[
         str,
         typer.Option(
@@ -56,6 +52,12 @@ def normalise_csv(
             "a QuickGraph-compatible JSON file)."
         ),
     ] = "quickgraph",
+    anonymise_text: Annotated[
+        bool,
+        typer.Argument(
+            help="Whether to anonymise asset identifiers in the text."
+        ),
+    ] = False,
     max_rows: Annotated[
         Optional[int],
         typer.Option(
@@ -105,6 +107,15 @@ def normalise_csv(
             "mudlark will be dumped to the given path."
         ),
     ] = None,
+    dump_processed_columns_path: Annotated[
+        Optional[str],
+        typer.Option(
+            help="If specified, all original columns from the CSV that "
+            "were modified by Mudlark will be dumped to the given path. "
+            "This can be used to map the values from the new CSV back to "
+            "the old CSV."
+        ),
+    ] = None,
 ):
     """Normalise the CSV located at the given path.
 
@@ -186,7 +197,7 @@ def normalise_csv(
     # Maintain a list of anonymised terms to dump later, if needed
     #
     # TODO: Move into its own function/refactor this code
-    anonymised_terms_map = {}
+    anonymised_terms_map = OrderedDict()
     if anonymise_text:
         anonymised_terms = set()
         logger.info("Anonymising text...")
@@ -199,20 +210,20 @@ def normalise_csv(
 
         # Map each anonymised term to an Asset ID e.g.
         # ABC123 -> Asset1
-        for term in anonymised_terms:
+        anonymised_terms = sorted(list(anonymised_terms))
+        norm_term_map = {}
+        random.shuffle(anonymised_terms)
+        for i, term in enumerate(anonymised_terms):
             term_normed = term.replace(" ", "").replace("-", "")
-            if term_normed not in anonymised_terms_map:
-                n = len(anonymised_terms_map) + 1
-                anonymised_terms_map[term_normed] = f"Asset{n}"
-            if term_normed in anonymised_terms_map:
-                anonymised_terms_map[term] = anonymised_terms_map[term_normed]
+            if term_normed not in norm_term_map:
+                norm_term_map[term_normed] = f"Asset{len(norm_term_map) + 1}"
+            anonymised_terms_map[term] = norm_term_map[term_normed]
 
         # If desired, dump the anonymised terms to a file
         if dump_anonymised_terms_path and len(anonymised_terms) > 0:
             with open(dump_anonymised_terms_path, "w", encoding="utf-8") as f:
                 for i, term in enumerate(anonymised_terms):
-                    term_normed = term.replace(" ", "").replace("-", "")
-                    f.write(term + ", " + (anonymised_terms_map[term_normed]))
+                    f.write(term + ", " + (anonymised_terms_map[term]))
                     f.write("\n")
             logger.info(
                 f"Dumped {len(anonymised_terms)} anonymised terms to "
@@ -250,11 +261,11 @@ def normalise_csv(
             df, mappings[c.name] = process_column(
                 df, c.name, c.handler, c.prefix
             )
-        if column_config.output_path:
-            with open(column_config.output_path, "w", encoding="utf-8") as f:
+        if dump_processed_columns_path:
+            with open(dump_processed_columns_path, "w", encoding="utf-8") as f:
                 json.dump(mappings, f, indent=2)
             logger.info(
-                f"Dumped column details to {column_config.output_path}."
+                f"Dumped column details to {dump_processed_columns_path}."
             )
 
     if not output_path:

tests/conftest.py

+13-1
@@ -34,6 +34,19 @@ def expected_output_path(request):
     return os.path.join(FIXTURE_DIR, "output", f"{name}")
 
 
+@pytest.fixture
+def expected_output_terms_path(request):
+    """Return the path of the given dataset e.g. test.csv.
+
+    Args:
+        request (object): The request object.
+
+    Returns:
+        str: The dataset path.
+    """
+    name = request.param
+    return os.path.join(FIXTURE_DIR, "output", f"{name}")
+
 
 @pytest.fixture
 def test_correction_dictionary_path(request):
@@ -47,4 +60,3 @@ def test_correction_dictionary_path(request):
     """
     name = request.param
     return os.path.join(FIXTURE_DIR, "input", f"{name}")
-
