Merged
56 commits
751eac1
Regenerated for stable API version
Nov 7, 2024
d7bed4a
Custom methods
Nov 8, 2024
f04a4d1
Customization WIP
Nov 8, 2024
e525196
Add client methods
Nov 8, 2024
eb370ea
Overrides for async client
Nov 8, 2024
d3cfa69
Update tests for new API spec
Nov 8, 2024
a8fc006
AttributeError: 'DeidentificationClient' object has no attribute 'beg…
Nov 8, 2024
7d97948
Merge branch 'main' of https://github.com/alexathomases/azure-sdk-for…
Nov 8, 2024
eedbd1a
Internal operations!
Nov 8, 2024
9208417
Still missing client method
Nov 8, 2024
f6aa621
Update tests for job refactor
Nov 8, 2024
68ed6f1
Inheritance for customizations
Nov 9, 2024
1c0e9b7
Model imports
Nov 9, 2024
b39a426
Add kwargs for maxpagesize
Nov 9, 2024
a502da5
Add @distributed_trace
Nov 9, 2024
1f2a357
Fix pylint-next errors/warnings
Nov 10, 2024
a7cc69e
Update configuration files
Nov 11, 2024
d3936e1
Changelog update
Nov 11, 2024
e081272
Updating tests for new API version
Nov 12, 2024
b268504
Pagination test, fixes for urls
Nov 15, 2024
9d78e4a
work in progress test sanitizing
Dec 12, 2024
3c49ca2
Regenerated for stable API version
Nov 7, 2024
973debc
Custom methods
Nov 8, 2024
66be3e8
Customization WIP
Nov 8, 2024
7ab9e05
Add client methods
Nov 8, 2024
8572d18
Overrides for async client
Nov 8, 2024
258adb9
Update tests for new API spec
Nov 8, 2024
36aa6e4
AttributeError: 'DeidentificationClient' object has no attribute 'beg…
Nov 8, 2024
96f4d3a
Internal operations!
Nov 8, 2024
97f7359
Still missing client method
Nov 8, 2024
aacbebc
Update tests for job refactor
Nov 8, 2024
b2f743b
Inheritance for customizations
Nov 9, 2024
4f29b8f
Model imports
Nov 9, 2024
9f31f62
Add kwargs for maxpagesize
Nov 9, 2024
b7dcc03
Add @distributed_trace
Nov 9, 2024
b0cf1ce
Fix pylint-next errors/warnings
Nov 10, 2024
6976f32
Update configuration files
Nov 11, 2024
da929e1
Changelog update
Nov 11, 2024
1564fe9
Updating tests for new API version
Nov 12, 2024
ca151be
Pagination test, fixes for urls
Nov 15, 2024
3af8307
work in progress test sanitizing
Dec 12, 2024
1783cba
Merge branch 'main' of https://github.com/alexathomases/azure-sdk-for…
josiahvinson Apr 28, 2025
37d6e74
Tests running against latest TypeSpec
josiahvinson Apr 30, 2025
57fc9bb
Update TypeSpec before customizations
josiahvinson Apr 30, 2025
ee7f0f9
Pull in SDK client name updates
josiahvinson May 1, 2025
0d75537
Update changelog and samples
josiahvinson May 2, 2025
5e5992e
update changelog to unreleased
josiahvinson May 2, 2025
a0fbb2f
remove unreleased beta version from changelog
josiahvinson May 2, 2025
438e711
Updating version to 1.0.0
josiahvinson May 2, 2025
6419709
Update README, samples
josiahvinson May 6, 2025
72de0ca
Update spelling
josiahvinson May 6, 2025
6d0a372
Separate samples for each operation
josiahvinson May 6, 2025
d1a5ffd
adding black formatting
josiahvinson May 7, 2025
864866b
update snippets after black formatting
josiahvinson May 7, 2025
45970af
Update generated code
josiahvinson May 13, 2025
55dcb03
Updating TypeSpec commit
josiahvinson May 19, 2025
4 changes: 4 additions & 0 deletions .vscode/cspell.json
@@ -75,6 +75,7 @@
"sdk/eventhub/azure-eventhub/**",
"sdk/easm/azure-defender-easm/azure/defender/easm/**",
"sdk/graphrbac/azure-graphrbac/**",
"sdk/healthdataaiservices/azure-health-deidentification/tests/data/**/*",
"sdk/healthinsights/azure-healthinsights-cancerprofiling/azure/**",
"sdk/healthinsights/azure-healthinsights-clinicalmatching/azure/**",
"sdk/formrecognizer/azure-ai-formrecognizer/samples/sample_forms/**",
@@ -224,6 +225,8 @@
"dateutil",
"ddos",
"decryptor",
"deidentification",
"deidservice",
"delenv",
"dependened",
"deque",
@@ -426,6 +429,7 @@
"struct",
"STRUCT",
"substringof",
"surrogated",
"systemperf",
"tenvparallel",
"Teradata",
Original file line number Diff line number Diff line change
@@ -1,14 +1,31 @@
# Release History

## 1.0.0b2 (Unreleased)
## 1.0.0 (Unreleased)

### Features Added

### Breaking Changes
- Introduced `DeidentificationCustomizationOptions` and `DeidentificationJobCustomizationOptions` models.
    - Added `surrogate_locale` field in these models.
    - Moved `redaction_format` field into these models.
- Introduced `overwrite` property in `TargetStorageLocation` model, which allows a job to overwrite existing documents in the storage location.

### Bugs Fixed
### Breaking Changes

### Other Changes
- Changed method names in `DeidentificationClient` to match functionality:
    - Changed the `deidentify` method name to `deidentify_text`.
    - Changed the `begin_create_job` method name to `begin_deidentify_documents`.
- Renamed the property `DeidentificationContent.operation` to `operation_type`.
- Deprecated `DocumentDataType`.
- Changed the model `DeidentificationDocumentDetails`:
    - Renamed `input` to `input_location`.
    - Renamed `output` to `output_location`.
- Changed the model `DeidentificationJob`:
    - Renamed `name` to `job_name`.
    - Renamed `operation` to `operation_type`.
- Renamed the model `OperationState` to `OperationStatus`.
- Changed `path` field to `location` in `SourceStorageLocation` and `TargetStorageLocation`.
- Changed `outputPrefix` behavior to no longer include `job_name` by default.
- Deprecated `path` and `location` from `TaggerResult` model.
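
When migrating from `1.0.0b1`, a quick scan for the renamed identifiers listed above may help locate call sites. The following is only a sketch: `RENAMES` and `find_renamed_identifiers` are hypothetical helpers that are not part of this package, and only the renames that are distinctive enough to search for by name are included.

```python
import re

# Old name -> new name, taken from the breaking changes listed above.
# Generic property names such as `name` or `input` are left out on purpose,
# because a plain text search for them would produce too many false hits.
RENAMES = {
    "deidentify": "deidentify_text",
    "begin_create_job": "begin_deidentify_documents",
    "OperationState": "OperationStatus",
}


def find_renamed_identifiers(source):
    """Return (line_number, old_name, new_name) for each old name found."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for old, new in RENAMES.items():
            # \b keeps `deidentify` from matching inside `deidentify_text`.
            if re.search(rf"\b{re.escape(old)}\b", line):
                hits.append((line_no, old, new))
    return hits


old_code = "result = client.deidentify(body)\npoller = client.begin_create_job(name, job)\n"
for line_no, old, new in find_renamed_identifiers(old_code):
    print(f"line {line_no}: {old} -> {new}")
```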

## 1.0.0b1 (2024-08-15)

Original file line number Diff line number Diff line change
@@ -4,4 +4,4 @@ include azure/health/deidentification/py.typed
recursive-include tests *.py
recursive-include samples *.py *.md
include azure/__init__.py
include azure/health/__init__.py
include azure/health/__init__.py
236 changes: 191 additions & 45 deletions sdk/healthdataaiservices/azure-health-deidentification/README.md
@@ -1,82 +1,220 @@
# Azure Health Data Services de-identification service client library for Python

This package contains a client library for the de-identification service in Azure Health Data Services which
enables users to tag, redact, or surrogate health data containing Protected Health Information (PHI).
For more on service functionality and important usage considerations, see [the de-identification service overview][product_documentation].

# Azure Health Deidentification client library for Python
Azure.Health.Deidentification is a managed service that enables users to tag, redact, or surrogate health data.
This library supports API versions `2024-11-15` and earlier.

Use the client library for the de-identification service to:
- Discover PHI in unstructured text
- Replace PHI in unstructured text with placeholder values
- Replace PHI in unstructured text with realistic surrogate values
- Manage asynchronous jobs to de-identify documents in Azure Storage

[Source code](https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/healthdataaiservices/azure-health-deidentification/azure/health/deidentification)
| [Package (PyPI)](https://pypi.org/project/azure-health-deidentification)
| [API reference documentation](https://learn.microsoft.com/python/api/overview/azure/health-deidentification)
| [Product documentation][product_documentation]
| [Samples][samples]

## Getting started

### Prerequisites

- Python 3.9 or later is required to use this package.
- Install [pip][pip].
- You need an [Azure subscription][azure_sub] to use this package.
- [Deploy the de-identification service][deid_quickstart].
- [Configure Azure role-based access control (RBAC)][deid_rbac] for the operations you will perform.

### Install the package

```bash
python -m pip install azure-health-deidentification
```

#### Prequisites
### Authentication
To authenticate with the de-identification service, install [`azure-identity`][azure_identity_pip]:

- Python 3.8 or later is required to use this package.
- You need an [Azure subscription][azure_sub] to use this package.
- An existing Azure Health Deidentification instance.
#### Create with an Azure Active Directory Credential
To use an [Azure Active Directory (AAD) token credential][authenticate_with_token],
provide an instance of the desired credential type obtained from the
[azure-identity][azure_identity_credentials] library.
```bash
python -m pip install azure.identity
```

You can use [DefaultAzureCredential][default_azure_credential] to automatically find the best credential to use at runtime.

To authenticate with AAD, you must first [pip][pip] install [`azure-identity`][azure_identity_pip]
You will need a **service URL** to instantiate a client object. You can find the service URL for a particular resource in the [Azure portal][azure_portal], or using the [Azure CLI][azure_cli]:

After setup, you can choose which type of [credential][azure_identity_credentials] from azure.identity to use.
As an example, [DefaultAzureCredential][default_azure_credential] can be used to authenticate the client:
```bash
# Get the service URL for the resource
az deidservice show --name "<resource-name>" --resource-group "<resource-group-name>" --query "properties.serviceUrl"
```

Set the values of the client ID, tenant ID, and client secret of the AAD application as environment variables:
`AZURE_CLIENT_ID`, `AZURE_TENANT_ID`, `AZURE_CLIENT_SECRET`
Optionally, save the service URL as an environment variable named `AZURE_HEALTH_DEIDENTIFICATION_ENDPOINT` for the sample client initialization code.

Use the returned token credential to authenticate the client:
Create a client with the endpoint and credential:
<!-- SNIPPET: examples.create_client -->

```python
>>> from azure.health.deidentification import DeidentificationClient
>>> from azure.identity import DefaultAzureCredential
>>> client = DeidentificationClient(endpoint='<endpoint>', credential=DefaultAzureCredential())
endpoint = os.environ["AZURE_HEALTH_DEIDENTIFICATION_ENDPOINT"]
credential = DefaultAzureCredential()
client = DeidentificationClient(endpoint, credential)
```

<!-- END SNIPPET -->
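
If the environment variable is missing or malformed, the client snippet above fails with a bare `KeyError` or a later request error. A small check before creating the client can give a clearer message. This is a minimal sketch using only the standard library: `read_service_url` is a hypothetical helper, not part of the SDK, and the `https` requirement and the stripping of surrounding quotes (in case the value was copied from JSON-formatted Azure CLI output) are assumptions.

```python
import os
from urllib.parse import urlparse

ENDPOINT_VAR = "AZURE_HEALTH_DEIDENTIFICATION_ENDPOINT"


def read_service_url(env=None):
    """Return the service URL from `env` (defaults to os.environ), or raise."""
    env = os.environ if env is None else env
    # Tolerate surrounding double quotes, e.g. from pasted JSON output.
    value = env.get(ENDPOINT_VAR, "").strip().strip('"')
    if not value:
        raise RuntimeError(f"Set {ENDPOINT_VAR} to your service URL.")
    parsed = urlparse(value)
    # Assumption: the service URL is always an absolute https URL.
    if parsed.scheme != "https" or not parsed.netloc:
        raise ValueError(f"{ENDPOINT_VAR} is not an https URL: {value!r}")
    return value


# Example with an explicit mapping instead of the real environment:
print(read_service_url({ENDPOINT_VAR: '"https://contoso.deid.azure.com"'}))
```

The returned string can then be passed as the `endpoint` argument when constructing the client.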

## Key concepts

**Operation Modes**
- Tag: Will return a structure of offset and length with the PHI category of the related text spans.
- Redact: Will return output text with placeholder stubbed text. ex. `[name]`
- Surrogate: Will return output text with synthetic replacements.
- `My name is John Smith`
- `My name is Tom Jones`
### De-identification operations:
Given an input text, the de-identification service can perform three main operations:
- `Tag` returns the category and location within the text of detected PHI entities.
- `Redact` returns output text where detected PHI entities are replaced with placeholder text. For example `John` replaced with `[name]`.
- `Surrogate` returns output text where detected PHI entities are replaced with realistic replacement values. For example, `My name is John Smith` could become `My name is Tom Jones`.

### Available endpoints
There are two ways to interact with the de-identification service. You can send text directly, or you can create jobs
to de-identify documents in Azure Storage.

You can de-identify text directly using the `DeidentificationClient`:
<!-- SNIPPET: deidentify_text_surrogate.surrogate -->

```python
body = DeidentificationContent(input_text="Hello, my name is John Smith.")
result: DeidentificationResult = client.deidentify_text(body)
print(f'\nOriginal Text: "{body.input_text}"')
print(f'Surrogated Text: "{result.output_text}"') # Surrogated output: Hello, my name is <synthetic name>.
```

<!-- END SNIPPET -->

To de-identify documents in Azure Storage, see [Tutorial: Configure Azure Storage to de-identify documents][deid_configure_storage]
for prerequisites and configuration options.

To run the sample code below, populate the following environment variables:
- `AZURE_STORAGE_ACCOUNT_LOCATION`: an Azure Storage container endpoint, like `https://<storageaccount>.blob.core.windows.net/<container>`.
- `INPUT_PREFIX`: the prefix of the input document name(s) in the container. For example, providing `folder1` would create a job that would process documents like `https://<storageaccount>.blob.core.windows.net/<container>/folder1/document1.txt`

The client exposes a `begin_deidentify_documents` method that returns a [LROPoller](https://learn.microsoft.com/python/api/azure-core/azure.core.polling.lropoller) instance. You can get the result of the operation by calling `result()`, optionally passing in a `timeout` value in seconds:
<!-- SNIPPET: deidentify_documents.sample -->

```python
endpoint = os.environ["AZURE_HEALTH_DEIDENTIFICATION_ENDPOINT"]
storage_location = os.environ["AZURE_STORAGE_ACCOUNT_LOCATION"]
inputPrefix = os.environ["INPUT_PREFIX"]
outputPrefix = "_output"

credential = DefaultAzureCredential()

client = DeidentificationClient(endpoint, credential)

**Job Integration with Azure Storage**
Instead of sending text, you can send an Azure Storage Location to the service. We will asynchronously
process the list of files and output the deidentified files to a location of your choice.
jobname = f"sample-job-{uuid.uuid4().hex[:8]}"

Limitations:
- Maximum file count per job: 1000 documents
- Maximum file size per file: 2 MB
job = DeidentificationJob(
    source_location=SourceStorageLocation(
        location=storage_location,
        prefix=inputPrefix,
    ),
    target_location=TargetStorageLocation(location=storage_location, prefix=outputPrefix, overwrite=True),
)

finished_job: DeidentificationJob = client.begin_deidentify_documents(jobname, job).result(timeout=60)

print(f"Job Name: {finished_job.job_name}")
print(f"Job Status: {finished_job.status}")
print(f"File Count: {finished_job.summary.total_count if finished_job.summary is not None else 0}")
```

<!-- END SNIPPET -->

## Examples
The following sections provide code samples covering some of the most common client use cases, including:

- [Discover PHI in unstructured text](#discover-phi-in-unstructured-text)
- [Replace PHI in unstructured text with placeholder values](#replace-phi-in-unstructured-text-with-placeholder-values)
- [Replace PHI in unstructured text with realistic surrogate values](#replace-phi-in-unstructured-text-with-realistic-surrogate-values)

See the [samples][samples] for code files illustrating common patterns, including creating and managing jobs to de-identify documents in Azure Storage.

### Discover PHI in unstructured text
When you specify the `TAG` operation, the service will return information about the PHI entities it detects. You can use this information to customize your de-identification workflow:
<!-- SNIPPET: deidentify_text_tag.tag -->

```python
>>> from azure.health.deidentification import DeidentificationClient
>>> from azure.identity import DefaultAzureCredential
>>> from azure.core.exceptions import HttpResponseError
body = DeidentificationContent(
    input_text="Hello, I'm Dr. John Smith.", operation_type=DeidentificationOperationType.TAG
)
result: DeidentificationResult = client.deidentify_text(body)
print(f'\nOriginal Text: "{body.input_text}"')

if result.tagger_result and result.tagger_result.entities:
    print(f"Tagged Entities:")
    for entity in result.tagger_result.entities:
        print(
            f'\tEntity Text: "{entity.text}", Entity Category: "{entity.category}", Offset: "{entity.offset.code_point}", Length: "{entity.length.code_point}"'
        )
else:
    print("\tNo tagged entities found.")
```

<!-- END SNIPPET -->
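
As one illustration of such a customized workflow, the code-point offsets and lengths can be used to apply your own masking to the original text. The sketch below runs without the service: `Span` and `mask_spans` are hypothetical helpers, and the single span (category `Doctor`, offset 15, length 10) is a hand-written stand-in for a tagged entity, not real service output. Python strings are indexed by code point, so the `code_point` values map directly onto slice positions.

```python
from typing import NamedTuple


class Span(NamedTuple):
    category: str
    offset: int  # code-point offset into the original text
    length: int  # length in code points


def mask_spans(text, spans):
    """Replace each span of `text` with an upper-case <CATEGORY> marker."""
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.offset):
        if span.offset < cursor:
            continue  # skip spans that overlap one already masked
        pieces.append(text[cursor:span.offset])
        pieces.append(f"<{span.category.upper()}>")
        cursor = span.offset + span.length
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "Hello, I'm Dr. John Smith."
spans = [Span("Doctor", 15, 10)]  # illustrative values covering "John Smith"
print(mask_spans(text, spans))  # Hello, I'm Dr. <DOCTOR>.
```

With real results, each `Span` could be built from `entity.category`, `entity.offset.code_point`, and `entity.length.code_point` as printed in the snippet above.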

>>> client = DeidentificationClient(endpoint='<endpoint>', credential=DefaultAzureCredential())
>>> try:
<!-- write test code here -->
except HttpResponseError as e:
print('service responds error: {}'.format(e.response.json()))
### Replace PHI in unstructured text with placeholder values
When you specify the `REDACT` operation, the service will replace the PHI entities it detects with placeholder values. You can learn more about [redaction customization][deid_redact].
<!-- SNIPPET: deidentify_text_redact.redact -->

```python
body = DeidentificationContent(
    input_text="It's great to work at Contoso.", operation_type=DeidentificationOperationType.REDACT
)
result: DeidentificationResult = client.deidentify_text(body)
print(f'\nOriginal Text: "{body.input_text}"')
print(f'Redacted Text: "{result.output_text}"') # Redacted output: "It's great to work at [organization]."
```

## Next steps
<!-- END SNIPPET -->

- Find a bug, or have feedback? Raise an issue with "Health Deidentification" Label.
### Replace PHI in unstructured text with realistic surrogate values
The default operation is the `SURROGATE` operation. Using this operation, the service will replace the PHI entities it detects with realistic surrogate values:
<!-- SNIPPET: deidentify_text_surrogate.surrogate -->

```python
body = DeidentificationContent(input_text="Hello, my name is John Smith.")
result: DeidentificationResult = client.deidentify_text(body)
print(f'\nOriginal Text: "{body.input_text}"')
print(f'Surrogated Text: "{result.output_text}"') # Surrogated output: Hello, my name is <synthetic name>.
```

<!-- END SNIPPET -->

### Troubleshooting
The `DeidentificationClient` raises various `AzureError` [exceptions][azure_error]. For example, if you
provide an invalid service URL, a `ServiceRequestError` is raised with a message indicating the failure cause.
In the following code snippet, the error is handled and displayed:
<!-- SNIPPET: examples.handle_error -->

```python
error_client = DeidentificationClient("https://contoso.deid.azure.com", credential)
body = DeidentificationContent(input_text="Hello, I'm Dr. John Smith.")

try:
    error_client.deidentify_text(body)
except AzureError as e:
    print("\nError: " + e.message)
```

<!-- END SNIPPET -->

If you encounter an error indicating that the service is unable to access source or target storage in a de-identification job:
- Ensure you [assign a managed identity][deid_managed_identity] to your de-identification service
- Ensure you [assign appropriate permissions][deid_rbac] to the managed identity to access the storage account

## Next steps

Find a bug, or have feedback? Raise an issue with the [Health Deidentification][github_issue_label] label.

## Troubleshooting

- **Unabled to Access Source or Target Storage**
- **Unable to Access Source or Target Storage**
- Ensure you create your deid service with a system assigned managed identity
- Ensure your storage account has given permissions to that managed identity

@@ -99,10 +237,18 @@ additional questions or comments.

<!-- LINKS -->
[code_of_conduct]: https://opensource.microsoft.com/codeofconduct/
[authenticate_with_token]: https://learn.microsoft.com/azure/cognitive-services/authentication?tabs=powershell#authenticate-with-an-authentication-token
[azure_identity_credentials]: https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/identity/azure-identity#credentials
[product_documentation]: https://learn.microsoft.com/azure/healthcare-apis/deidentification/
[azure_identity_pip]: https://pypi.org/project/azure-identity/
[default_azure_credential]: https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/identity/azure-identity#defaultazurecredential
[pip]: https://pypi.org/project/pip/
[azure_sub]: https://azure.microsoft.com/free/

[deid_quickstart]: https://learn.microsoft.com/azure/healthcare-apis/deidentification/quickstart
[deid_redact]: https://learn.microsoft.com/azure/healthcare-apis/deidentification/redaction-format
[deid_rbac]: https://learn.microsoft.com/azure/healthcare-apis/deidentification/manage-access-rbac
[deid_managed_identity]: https://learn.microsoft.com/azure/healthcare-apis/deidentification/managed-identities
[deid_configure_storage]: https://learn.microsoft.com/azure/healthcare-apis/deidentification/configure-storage
[azure_cli]: https://learn.microsoft.com/cli/azure/healthcareapis/deidservice?view=azure-cli-latest
[azure_portal]: https://ms.portal.azure.com
[azure_error]: https://learn.microsoft.com/python/api/azure-core/azure.core.exceptions.azureerror
[samples]: https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/healthdataaiservices/azure-health-deidentification/samples
[github_issue_label]: https://github.com/Azure/azure-sdk-for-python/labels/Health%20Deidentification
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
{
"CrossLanguagePackageId": "HealthDataAIServices.DeidServices",
"CrossLanguageDefinitionId": {
"azure.health.deidentification.models.DeidentificationContent": "HealthDataAIServices.DeidServices.DeidentificationContent",
"azure.health.deidentification.models.DeidentificationCustomizationOptions": "HealthDataAIServices.DeidServices.DeidentificationCustomizationOptions",
"azure.health.deidentification.models.DeidentificationDocumentDetails": "HealthDataAIServices.DeidServices.DeidentificationDocumentDetails",
"azure.health.deidentification.models.DeidentificationDocumentLocation": "HealthDataAIServices.DeidServices.DeidentificationDocumentLocation",
"azure.health.deidentification.models.DeidentificationJob": "HealthDataAIServices.DeidServices.DeidentificationJob",
"azure.health.deidentification.models.DeidentificationJobCustomizationOptions": "HealthDataAIServices.DeidServices.DeidentificationJobCustomizationOptions",
"azure.health.deidentification.models.DeidentificationJobSummary": "HealthDataAIServices.DeidServices.DeidentificationJobSummary",
"azure.health.deidentification.models.DeidentificationResult": "HealthDataAIServices.DeidServices.DeidentificationResult",
"azure.health.deidentification.models.PhiEntity": "HealthDataAIServices.DeidServices.PhiEntity",
"azure.health.deidentification.models.PhiTaggerResult": "HealthDataAIServices.DeidServices.PhiTaggerResult",
"azure.health.deidentification.models.SourceStorageLocation": "HealthDataAIServices.DeidServices.SourceStorageLocation",
"azure.health.deidentification.models.StringIndex": "HealthDataAIServices.DeidServices.StringIndex",
"azure.health.deidentification.models.TargetStorageLocation": "HealthDataAIServices.DeidServices.TargetStorageLocation",
"azure.health.deidentification.models.DeidentificationOperationType": "HealthDataAIServices.DeidServices.DeidentificationOperationType",
"azure.health.deidentification.models.OperationStatus": "Azure.Core.Foundations.OperationState",
"azure.health.deidentification.models.PhiCategory": "HealthDataAIServices.DeidServices.PhiCategory",
"azure.health.deidentification.DeidentificationClient.get_job": "HealthDataAIServices.DeidServices.getJob",
"azure.health.deidentification.DeidentificationClient.begin_deidentify_documents": "HealthDataAIServices.DeidServices.deidentifyDocuments",
"azure.health.deidentification.DeidentificationClient.cancel_job": "HealthDataAIServices.DeidServices.cancelJob",
"azure.health.deidentification.DeidentificationClient.delete_job": "HealthDataAIServices.DeidServices.deleteJob",
"azure.health.deidentification.DeidentificationClient.deidentify_text": "HealthDataAIServices.DeidServices.deidentifyText"
}
}
Original file line number Diff line number Diff line change
@@ -2,5 +2,5 @@
"AssetsRepo": "Azure/azure-sdk-assets",
"AssetsRepoPrefixPath": "python",
"TagPrefix": "python/healthdataaiservices/azure-health-deidentification",
"Tag": "python/healthdataaiservices/azure-health-deidentification_a8eed6d322"
"Tag": "python/healthdataaiservices/azure-health-deidentification_a9eda6ed27"
}