
Commit a502da8

updated docs and added new section on anonymisation. fixed some tests
1 parent 56a1221 commit a502da8

15 files changed: +361 -26 lines changed

docs/source/anonymisation.rst

+110
@@ -0,0 +1,110 @@
+Anonymisation
+=============
+
+Mudlark can now be used to fully anonymise a Maintenance Work Order CSV file. In this section we will cover anonymising the text column, and anonymising other columns.
+
+Anonymising text
+----------------
+
+To anonymise text, simply set the ``anonymise_text`` argument to true, either via the command line or in a config file.
+
+When this is set to true, any terms that Mudlark deems to be "asset identifiers" will be replaced with unique anonymised identifiers. For example, given the following data::
+
+    text
+    ABC 124 is broken
+    ABC 123 has a problem
+    ABC-124 is broken
+    pumps busted
+    enGiNe was broken
+    a leak was Formed
+
+The result will be::
+
+    text
+    Asset1 is broken
+    Asset2 has a problem
+    Asset1 is broken
+    pump bust
+    engine is broken
+    a leak is form
+
+Note how both ``ABC 124`` and ``ABC-124`` become ``Asset1``, while ``ABC 123`` becomes ``Asset2``.
+
+Some further notes on how this works:
+
+* Currently, a term is recognised as an asset identifier if it has one or more uppercase letters, followed by either nothing, a space, or a hyphen, then followed by one or more digits.
+* The order of the terms is randomised prior to generating the numbers for each asset identifier.
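The recognition rule in the notes above can be written as a regular expression. The following is a sketch only — ``ASSET_ID`` is a hypothetical name, and Mudlark's actual pattern may differ:

```python
import re

# Sketch of the rule described above: one or more uppercase letters,
# followed by nothing, a space, or a hyphen, then one or more digits.
# Illustration only; Mudlark's real pattern may differ.
ASSET_ID = re.compile(r"\b[A-Z]+[ -]?\d+\b")

print(ASSET_ID.findall("ABC 124 is broken"))  # ['ABC 124']
print(ASSET_ID.findall("ABC-124 is broken"))  # ['ABC-124']
print(ASSET_ID.findall("pumps busted"))       # []
```

Note that this pattern also matches identifiers with no separator at all, such as ``ABC123``.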
+Anonymising/processing other columns
+------------------------------------
+
+Mudlark can also anonymise and/or process other columns. To do this, set ``column_config_path`` prior to running Mudlark (either via the command line or in a config file). An example column config file might look like this::
+
+    columns:
+      - name: floc
+        handler: FLOC
+      - name: mwo_id
+        handler: RandomiseInteger
+      - name: unique_floc_desc
+        handler: ToUniqueString
+        prefix: FLOC_Desc_
+      - name: other
+        handler: None
+
+This file dictates which columns should be kept by Mudlark, and how each of them should be handled. There are currently four available "handlers":
+
+* ``None`` simply passes the column through to the output file without doing anything to it.
+* ``RandomiseInteger`` replaces each unique value of the column with a randomised 7-digit integer.
+* ``FLOC`` treats the column as a Functional Location. It splits the value on either a ``.`` or a ``-``, then converts each unique value at each level of the FLOC hierarchy into a number. For example, ``123-45-67`` might become ``1-1-1``, ``123-45-68`` might become ``1-1-2``, and so on.
+* ``ToUniqueString`` converts each unique value into an anonymised string starting with the given ``prefix``. For example, ``Pump FLOC`` might become ``FLOC_Desc_1``, ``Belt FLOC`` might become ``FLOC_Desc_2``, and so on.
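As a rough illustration of the handlers described above, here is a minimal sketch. The function names are hypothetical, the ``-`` join separator for ``FLOC`` is an assumption, and Mudlark's real implementations (in ``process_column``) may differ:

```python
import random
import re

def floc_handler(values):
    """FLOC sketch: split on '.' or '-', number each unique value per level."""
    seen = {}    # (level, part) -> assigned number (as a string)
    counts = {}  # level -> how many unique parts seen so far
    out = []
    for v in values:
        parts = re.split(r"[.-]", v)
        anon = []
        for level, part in enumerate(parts):
            if (level, part) not in seen:
                counts[level] = counts.get(level, 0) + 1
                seen[(level, part)] = str(counts[level])
            anon.append(seen[(level, part)])
        out.append("-".join(anon))
    return out

def to_unique_string(values, prefix):
    """ToUniqueString sketch: prefix + running number per unique value."""
    seen = {}
    return [seen.setdefault(v, f"{prefix}{len(seen) + 1}") for v in values]

def randomise_integer(values):
    """RandomiseInteger sketch: each unique value -> one random 7-digit integer.

    (A real implementation would also guard against collisions.)
    """
    seen = {}
    return [seen.setdefault(v, str(random.randint(1_000_000, 9_999_999)))
            for v in values]
```

For example, ``floc_handler(["123-45-67", "123-45-68"])`` returns ``['1-1-1', '1-1-2']``, matching the behaviour described above (the example output later in this commit joins levels with ``_`` rather than ``-``).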
+Example
+^^^^^^^
+
+Here is an example dataset::
+
+    text,cost,other,floc,mwo_id,unique_floc_desc
+    ABC 124 is broken,123,test,123.45.67,123,FLOC 123
+    replace,43,xx,123.45.68,123,FLOC 124
+    X/X,540,test,123.45.69,123,FLOC 125
+    ABC 123 has a problem,3,test,123.45.67,123,FLOC 123
+    slurries,4.3,xx,123.45.67,123,FLOC 123
+    ABC-124 is broken,4.33,yrds,123.45.67,123,FLOC 123
+    pumps busted,43.43,tyrdtrd,111.45.67,123,FLOC 125
+    enGiNe was broken,4332.3,6t554,112.45.67,123,FLOC 126
+    a leak was Formed,333,545,113.45.67,123,FLOC 127
+
+We are going to anonymise the text (as discussed in the first section), and will keep the ``other``, ``floc``, ``mwo_id``, and ``unique_floc_desc`` columns. Here is our ``column-config.yml``::
+
+    columns:
+      - name: floc
+        handler: FLOC
+      - name: mwo_id
+        handler: RandomiseInteger
+      - name: unique_floc_desc
+        handler: ToUniqueString
+        prefix: FLOC_Desc_
+      - name: other
+        handler: None
+
+And our Mudlark config file::
+
+    input_path: my-file.csv
+    output_path: my-file-out.csv
+    text_column: text
+    output_format: csv
+    anonymise_text: true
+    column_config_path: column-config.yml
+
+Once we run Mudlark, the output will be::
+
+    text,other,floc,mwo_id,unique_floc_desc
+    Asset1 is broken,test,1_1_1,2462749,FLOC_Desc_1
+    replace,xx,1_1_2,7832383,FLOC_Desc_2
+    x/x,test,1_1_3,5472030,FLOC_Desc_3
+    Asset2 has a problem,test,1_1_1,2806910,FLOC_Desc_1
+    slurry,xx,1_1_1,1640112,FLOC_Desc_1
+    Asset1 is broken,yrds,1_1_1,7360650,FLOC_Desc_1
+    pump bust,tyrdtrd,2_1_1,9995977,FLOC_Desc_3
+    engine is broken,6t554,3_1_1,6573352,FLOC_Desc_4
+    a leak is form,545,4_1_1,6717645,FLOC_Desc_5

docs/source/index.rst

+1
@@ -44,5 +44,6 @@ Part of the normalisation stage involves replacing any words appearing in a pred
    self
    installation
    usage
+   anonymisation
    implementation
docs/source/usage.rst

+17-5
@@ -39,12 +39,15 @@ Optional arguments:
    * - Argument
      - Type
      - Details
-   * - ``output-format``
-     - Text
-     - The format to save the output. Can be either 'quickgraph' (saves the output as a QuickGraph-compatible JSON file) or 'csv' (saves the output as a CSV file). [default: quickgraph]
    * - ``output-path``
      - Text
      - The path to save the normalised dataset to once complete.
+   * - ``output-format``
+     - Text
+     - The format to save the output. Can be either 'quickgraph' (saves the output as a QuickGraph-compatible JSON file) or 'csv' (saves the output as a CSV file). [default: quickgraph]
+   * - ``anonymise-text``
+     - Boolean
+     - Whether to anonymise asset identifiers in the text. If true, any asset identifiers, e.g. "ABC 123", "ABC123", "ABC-123", will be converted to anonymised identifiers, e.g. "Asset1". See the next section for more details.
    * - ``max-rows``
      - Integer
      - If specified, the output will be randomly sampled to contain the specified maximum number of rows.
@@ -57,12 +60,20 @@ Optional arguments:
    * - ``drop-duplicates``
      - Boolean
      - If true, any rows with the same text in the text field as another row will be dropped. [default: False]
-   * - ``csv-keep-columns``
+   * - ``column-config-path``
      - Text
-     - If specified, only the given columns will be kept in the final output. Columns should be given as a comma separated list surrounded by double quotes, e.g. "col1, col2, col3"... This argument is only relevant when output_format = csv.
+     - If specified, the given path (must be a ``.yml`` file) will be used to determine which columns to keep in the output, and what to do with those columns. See the next section for more details.
    * - ``quickgraph-id-columns``
      - Text
      - If specified, the given column(s) will be used as id columns when generating output for QuickGraph. You may specify one column (for example 'my_id'), or multiple columns separated via comma (for example 'my_id, surname'). This argument is only relevant when output_format = quickgraph.
+   * - ``dump-anonymised-terms-path``
+     - Text
+     - If specified, any terms that were anonymised by Mudlark (assuming ``anonymise-text`` was set to ``True``) will be saved to this path. This allows you to reverse the anonymisation at a later date if necessary.
+   * - ``dump-processed-columns-path``
+     - Text
+     - If specified, any columns that were processed by Mudlark (assuming ``column-config-path`` was set) will be saved to this path. This allows you to reverse the column processing at a later date if necessary.
+
 
 Simple example
 ^^^^^^^^^^^^^^
@@ -102,6 +113,7 @@ Then, you can read it in via the ``config`` argument::
 
 Note that the arguments have underscores (``_``) instead of dashes (``-``) when written in the yaml file.
 
+
 Running Mudlark in Python
 -------------------------

mudlark/column_config.py

-1
@@ -25,4 +25,3 @@ class Column(BaseModel):
 
 class ColumnConfig(ConfiguredBaseModel):
     columns: List[Column]
-    output_path: Optional[StrictStr] = None

mudlark/main.py

+29-18
@@ -1,6 +1,8 @@
 """The main functions of Mudlark, i.e. normalise_csv and normalise_text."""
 import json
 import typer
+import random
+from collections import OrderedDict
 from typing_extensions import Annotated, Optional
 from typer_config import use_yaml_config
 
@@ -36,12 +38,6 @@ def normalise_csv(
             "'short text', 'risk name', etc."
         ),
     ],
-    anonymise_text: Annotated[
-        bool,
-        typer.Argument(
-            help="Whether to anonymise asset identifiers in the text."
-        ),
-    ] = False,
     output_path: Annotated[
         str,
         typer.Option(
@@ -56,6 +52,12 @@ def normalise_csv(
             "a QuickGraph-compatible JSON file)."
         ),
     ] = "quickgraph",
+    anonymise_text: Annotated[
+        bool,
+        typer.Argument(
+            help="Whether to anonymise asset identifiers in the text."
+        ),
+    ] = False,
     max_rows: Annotated[
         Optional[int],
         typer.Option(
@@ -105,6 +107,15 @@ def normalise_csv(
             "mudlark will be dumped to the given path."
         ),
     ] = None,
+    dump_processed_columns_path: Annotated[
+        Optional[str],
+        typer.Option(
+            help="If specified, all original columns from the CSV that "
+            "were modified by Mudlark will be dumped to the given path. "
+            "This can be used to map the values from the new CSV back to "
+            "the old CSV."
+        ),
+    ] = None,
 ):
     """Normalise the CSV located at the given path.
 
@@ -186,7 +197,7 @@ def normalise_csv(
     # Maintain a list of anonymised terms to dump later, if needed
     #
     # TODO: Move into its own function/refactor this code
-    anonymised_terms_map = {}
+    anonymised_terms_map = OrderedDict()
     if anonymise_text:
         anonymised_terms = set()
         logger.info("Anonymising text...")
@@ -199,20 +210,20 @@ def normalise_csv(
 
         # Map each anonymised term to an Asset ID e.g.
         # ABC123 -> Asset1
-        for term in anonymised_terms:
+        anonymised_terms = sorted(list(anonymised_terms))
+        norm_term_map = {}
+        random.shuffle(anonymised_terms)
+        for i, term in enumerate(anonymised_terms):
             term_normed = term.replace(" ", "").replace("-", "")
-            if term_normed not in anonymised_terms_map:
-                n = len(anonymised_terms_map) + 1
-                anonymised_terms_map[term_normed] = f"Asset{n}"
-            if term_normed in anonymised_terms_map:
-                anonymised_terms_map[term] = anonymised_terms_map[term_normed]
+            if term_normed not in norm_term_map:
+                norm_term_map[term_normed] = f"Asset{len(norm_term_map) + 1}"
+            anonymised_terms_map[term] = norm_term_map[term_normed]
 
         # If desired, dump the anonymised terms to a file
         if dump_anonymised_terms_path and len(anonymised_terms) > 0:
             with open(dump_anonymised_terms_path, "w", encoding="utf-8") as f:
                 for i, term in enumerate(anonymised_terms):
-                    term_normed = term.replace(" ", "").replace("-", "")
-                    f.write(term + ", " + (anonymised_terms_map[term_normed]))
+                    f.write(term + ", " + (anonymised_terms_map[term]))
                     f.write("\n")
             logger.info(
                 f"Dumped {len(anonymised_terms)} anonymised terms to "
@@ -250,11 +261,11 @@ def normalise_csv(
             df, mappings[c.name] = process_column(
                 df, c.name, c.handler, c.prefix
             )
-        if column_config.output_path:
-            with open(column_config.output_path, "w", encoding="utf-8") as f:
+        if dump_processed_columns_path:
+            with open(dump_processed_columns_path, "w", encoding="utf-8") as f:
                 json.dump(mappings, f, indent=2)
             logger.info(
-                f"Dumped column details to {column_config.output_path}."
+                f"Dumped column details to {dump_processed_columns_path}."
             )
 
     if not output_path:

tests/conftest.py

+13-1
@@ -34,6 +34,19 @@ def expected_output_path(request):
     return os.path.join(FIXTURE_DIR, "output", f"{name}")
 
 
+@pytest.fixture
+def expected_output_terms_path(request):
+    """Return the path of the given dataset e.g. test.csv.
+
+    Args:
+        request (object): The request object.
+
+    Returns:
+        str: The dataset path.
+    """
+    name = request.param
+    return os.path.join(FIXTURE_DIR, "output", f"{name}")
+
 
 @pytest.fixture
 def test_correction_dictionary_path(request):
@@ -47,4 +60,3 @@ def test_correction_dictionary_path(request):
     """
     name = request.param
     return os.path.join(FIXTURE_DIR, "input", f"{name}")
-
