Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Study Instance UID and Series Instance UID not changed when dicom contains Related Series Sequence tags #153

Closed
skarrea opened this issue Oct 1, 2020 · 6 comments

Comments

@skarrea
Copy link

skarrea commented Oct 1, 2020

It seems that Deid does not manage to find and change the the Study and Series Instance UID tags that are outside the Related Series Sequence when Related Series Sequence tags are present.

I've appended a (almost) minimally working example reproducing the error I see in my data. The python code first generates a dicom file and then tries to clean it by changing contains:StudyInstanceUID and contains:SeriesInstanceUID from their original values to modified.

A copy of the output is copied below. Ignore the weird warning formatting it is due to the windows console not having colors.

################################################################################
Original file
################################################################################
(0008, 0012) Instance Creation Date              DA: '20201001'
(0008, 0033) Content Time                        TM: '20201001'
(0008, 1250)  Related Series Sequence   3 item(s) ----
   (0020, 000d) Study Instance UID                  UI: RelStudy0
   (0020, 000e) Series Instance UID                 UI: RelSeries0
   ---------
   (0020, 000d) Study Instance UID                  UI: RelStudy1
   (0020, 000e) Series Instance UID                 UI: RelSeries1
   ---------
   (0020, 000d) Study Instance UID                  UI: RelStudy2
   (0020, 000e) Series Instance UID                 UI: RelSeries2
   ---------
(0010, 0010) Patient's Name                      PN: 'ReproducingTheError'
(0020, 000d) Study Instance UID                  UI: SeriesUID
(0020, 000e) Series Instance UID                 UI: StudyUID
File saved.
�[93mWARNING �[0mNo specification, loading default base deid.dicom



################################################################################
Cleaned file
################################################################################
(0008, 0012) Instance Creation Date              DA: '20201001'
(0008, 0033) Content Time                        TM: '20201001'
(0008, 1250)  Related Series Sequence   3 item(s) ----
   (0020, 000d) Study Instance UID                  UI: modified
   (0020, 000e) Series Instance UID                 UI: modified
   ---------
   (0020, 000d) Study Instance UID                  UI: modified
   (0020, 000e) Series Instance UID                 UI: modified
   ---------
   (0020, 000d) Study Instance UID                  UI: modified
   (0020, 000e) Series Instance UID                 UI: modified
   ---------
(0010, 0010) Patient's Name                      PN: 'ReproducingTheError'
(0020, 000d) Study Instance UID                  UI: SeriesUID
(0020, 000e) Series Instance UID                 UI: StudyUID

Producing code:

import os
import datetime
import tempfile
from deid.config import DeidRecipe
from deid.dicom import get_files, get_identifiers, replace_identifiers

from pydicom.dataset import Dataset, FileDataset
from pydicom.uid import generate_uid
from pydicom.dataset import Dataset
from pydicom.sequence import Sequence

recipe = DeidRecipe('test.dicom')

# Create some temporary filenames
suffix = '.dcm'
filename_little_endian = tempfile.NamedTemporaryFile(suffix=suffix).name
filename_big_endian = tempfile.NamedTemporaryFile(suffix=suffix).name

print("Setting dataset values...")
# Create the FileDataset instance (initially no data elements, but file_meta
# supplied)
ds = FileDataset(filename_little_endian, {},
                 preamble=b"\0" * 128)

# Add the data elements -- not trying to set all required here. Check DICOM
# standard

# Set the transfer syntax
ds.is_little_endian = True
ds.is_implicit_VR = True

# Referemced image sequence
 
ds.PatientName = 'ReproducingTheError'

dt = datetime.datetime.now()
ds.InstanceCreationDate = dt.strftime('%Y%m%d')
ds.ContentTime = dt.strftime('%Y%m%d')

studyUID = generate_uid()
rel_im_ser_list = []
ds.RelatedSeriesSequence = Sequence([Dataset(), Dataset(), Dataset()])
for i in range(3):
    ds.RelatedSeriesSequence[i].StudyInstanceUID = 'RelStudy' + str(i)
    ds.RelatedSeriesSequence[i].SeriesInstanceUID = 'RelSeries' + str(i)

ds.SeriesInstanceUID = 'StudyUID'
ds.StudyInstanceUID = 'SeriesUID'

print(2*'\n')
print(80*'#')
print('Original file')
print(80*'#')
print(ds)

ds.save_as(filename_little_endian)
print("File saved.")


files = list(get_files(filename_little_endian))

items = get_identifiers(files)

for item in items:
    items[item]['new_val'] = 'modified'

cleaned_files = replace_identifiers(files,
                                    deid = recipe,
                                    ids = items)

print(2*'\n')
print(80*'#')
print('Cleaned file')
print(80*'#')

print(cleaned_files[0])

os.remove(filename_little_endian)

Deid recipe

FORMAT dicom

%header

REPLACE contains:StudyInstanceUID var:new_val
REPLACE contains:SeriesInstanceUID var:new_val
@vsoch
Copy link
Member

vsoch commented Oct 1, 2020

Thank you for the report, and the clear way to reproduce! I'm testing locally and I've reproduced, and I'll share what I've found. It does look like the replacement function itself works okay - here we see that all the occurrences of the instance UIDs are changed to modified:

: parser.fields                                                                                                                            
Out[6]: 
{'(0008, 0012)': (0008, 0012) Instance Creation Date              DA: '20201001'  [InstanceCreationDate],
 '(0008, 0033)': (0008, 0033) Content Time                        TM: '20201001'  [ContentTime],
 '(0008, 1250)': (0008, 1250) Related Series Sequence             SQ: <Sequence, length 3>  [RelatedSeriesSequence],
 '(0020, 000d)': (0020, 000d) Study Instance UID                  UI: modified  [StudyInstanceUID],
 '(0020, 000e)': (0020, 000e) Series Instance UID                 UI: modified  [SeriesInstanceUID],
 '(0010, 0010)': (0010, 0010) Patient's Name                      PN: 'ReproducingTheError'  [PatientName],
 '(0008, 1250)__0__(0020, 000d)': (0020, 000d) Study Instance UID                  UI: modified  [RelatedSeriesSequence__StudyInstanceUID],
 '(0008, 1250)__0__(0020, 000e)': (0020, 000e) Series Instance UID                 UI: modified  [RelatedSeriesSequence__SeriesInstanceUID],
 '(0008, 1250)__1__(0020, 000d)': (0020, 000d) Study Instance UID                  UI: modified  [RelatedSeriesSequence__StudyInstanceUID],
 '(0008, 1250)__1__(0020, 000e)': (0020, 000e) Series Instance UID                 UI: modified  [RelatedSeriesSequence__SeriesInstanceUID],
 '(0008, 1250)__2__(0020, 000d)': (0020, 000d) Study Instance UID                  UI: modified  [RelatedSeriesSequence__StudyInstanceUID],
 '(0008, 1250)__2__(0020, 000e)': (0020, 000e) Series Instance UID                 UI: modified  [RelatedSeriesSequence__SeriesInstanceUID]}

However the bug seems to be in actually using those fields to do the save, because the resulting file appears as you've showed. I suspect it's indexing. In the DeidParser perform_action we do grab the fields correctly:

fields
{'(0020, 000d)': (0020, 000d) Study Instance UID                  UI: RelStudy0  [StudyInstanceUID],
 '(0008, 1250)__0__(0020, 000d)': (0020, 000d) Study Instance UID                  UI: RelStudy0  [RelatedSeriesSequence__StudyInstanceUID],
 '(0008, 1250)__1__(0020, 000d)': (0020, 000d) Study Instance UID                  UI: RelStudy1  [RelatedSeriesSequence__StudyInstanceUID],
 '(0008, 1250)__2__(0020, 000d)': (0020, 000d) Study Instance UID                  UI: RelStudy2  [RelatedSeriesSequence__StudyInstanceUID]}

Ah so I found the bug (now we are in the class add_value method. The issue is that the dicom generated doesn't have a value attribute, likely because you created it on the fly. The reason it has the field but then only replaces the nested ones is because we use not having the value directly as an indicator that there is a nested field, e.g.,

# Nested fields
while not hasattr(element, "value"):
    element = element.element

I don't know why, but the iteration through the self.fields seems to change both instances (the nested and non-nested) but doesn't actual edit the dicom. Here is an example - see that modified appears twice in self.fields, but only once in the self.dicom

 self.fields
{'(0008, 0012)': (0008, 0012) Instance Creation Date              DA: '20201001'  [InstanceCreationDate],
 '(0008, 0033)': (0008, 0033) Content Time                        TM: '20201001'  [ContentTime],
 '(0008, 1250)': (0008, 1250) Related Series Sequence             SQ: <Sequence, length 3>  [RelatedSeriesSequence],
 '(0020, 000d)': (0020, 000d) Study Instance UID                  UI: modified  [StudyInstanceUID],
 '(0020, 000e)': (0020, 000e) Series Instance UID                 UI: RelSeries0  [SeriesInstanceUID],
 '(0010, 0010)': (0010, 0010) Patient's Name                      PN: 'ReproducingTheError'  [PatientName],
 '(0008, 1250)__0__(0020, 000d)': (0020, 000d) Study Instance UID                  UI: modified  [RelatedSeriesSequence__StudyInstanceUID],
 '(0008, 1250)__0__(0020, 000e)': (0020, 000e) Series Instance UID                 UI: RelSeries0  [RelatedSeriesSequence__SeriesInstanceUID],
 '(0008, 1250)__1__(0020, 000d)': (0020, 000d) Study Instance UID                  UI: RelStudy1  [RelatedSeriesSequence__StudyInstanceUID],
 '(0008, 1250)__1__(0020, 000e)': (0020, 000e) Series Instance UID                 UI: RelSeries1  [RelatedSeriesSequence__SeriesInstanceUID],
 '(0008, 1250)__2__(0020, 000d)': (0020, 000d) Study Instance UID                  UI: RelStudy2  [RelatedSeriesSequence__StudyInstanceUID],
 '(0008, 1250)__2__(0020, 000e)': (0020, 000e) Series Instance UID                 UI: RelSeries2  [RelatedSeriesSequence__SeriesInstanceUID]}

> self.dicom
(0008, 0012) Instance Creation Date              DA: '20201001'
(0008, 0033) Content Time                        TM: '20201001'
(0008, 1250)  Related Series Sequence   3 item(s) ---- 
   (0020, 000d) Study Instance UID                  UI: modified
   (0020, 000e) Series Instance UID                 UI: RelSeries0
   ---------
   (0020, 000d) Study Instance UID                  UI: RelStudy1
   (0020, 000e) Series Instance UID                 UI: RelSeries1
   ---------
   (0020, 000d) Study Instance UID                  UI: RelStudy2
   (0020, 000e) Series Instance UID                 UI: RelSeries2
   ---------
(0010, 0010) Patient's Name                      PN: 'ReproducingTheError'
(0020, 000d) Study Instance UID                  UI: SeriesUID
(0020, 000e) Series Instance UID                 UI: StudyUID

The way this works is that (technically) the self.fields should always have a DicomField class that references the same element in the dicom data, that way when we edit one, the other edits as well. But this doesn't appear to be happening for some reason.

I tried removing the deepcopy of the fields (added recently) and this isn't the issue - we still don't change the first one. I then moved earlier in the code and just tried manually changing a value - this hints that perhaps the reference to the top level field is incorrect, in fact it's pointing to the one that comes much later!

field
Out[5]: (0020, 000d) Study Instance UID                  UI: RelStudy0  [StudyInstanceUID]

In [6]: field.element.value
Out[6]: 'RelStudy0'

In [7]: self.dicom
Out[7]: 
(0008, 0012) Instance Creation Date              DA: '20201001'
(0008, 0033) Content Time                        TM: '20201001'
(0008, 1250)  Related Series Sequence   3 item(s) ---- 
   (0020, 000d) Study Instance UID                  UI: RelStudy0
   (0020, 000e) Series Instance UID                 UI: RelSeries0
   ---------
   (0020, 000d) Study Instance UID                  UI: RelStudy1
   (0020, 000e) Series Instance UID                 UI: RelSeries1
   ---------
   (0020, 000d) Study Instance UID                  UI: RelStudy2
   (0020, 000e) Series Instance UID                 UI: RelSeries2
   ---------
(0010, 0010) Patient's Name                      PN: 'ReproducingTheError'
(0020, 000d) Study Instance UID                  UI: SeriesUID
(0020, 000e) Series Instance UID                 UI: StudyUID

In [8]: field.element.value
Out[8]: 'RelStudy0'

In [9]: field.element.value = 'ADIFFERENTONE'

In [10]: self.dicom
Out[10]: 
(0008, 0012) Instance Creation Date              DA: '20201001'
(0008, 0033) Content Time                        TM: '20201001'
(0008, 1250)  Related Series Sequence   3 item(s) ---- 
   (0020, 000d) Study Instance UID                  UI: ADIFFERENTONE
   (0020, 000e) Series Instance UID                 UI: RelSeries0
   ---------
   (0020, 000d) Study Instance UID                  UI: RelStudy1
   (0020, 000e) Series Instance UID                 UI: RelSeries1
   ---------
   (0020, 000d) Study Instance UID                  UI: RelStudy2
   (0020, 000e) Series Instance UID                 UI: RelSeries2
   ---------
(0010, 0010) Patient's Name                      PN: 'ReproducingTheError'
(0020, 000d) Study Instance UID                  UI: SeriesUID
(0020, 000e) Series Instance UID                 UI: StudyUID

In the above, since we indexed uid (0008, 0012) we would expect the top level one to change, but that isn't the case. So then I stepped back and looked at fields, get_fields method, and it appears that the issue arises because we use dataset.iterall(), which essentially flattens the dicom fields (and gets them mixed up). Here is an example of using iterall vs. just items():

In [19]: for key, value in dataset.items():
    ...:     print(key)
    ...:     print(value)
    ...:
(0008, 0012)
(0008, 0012) Instance Creation Date              DA: '20201001'
(0008, 0033)
(0008, 0033) Content Time                        TM: '20201001'
(0008, 1250)
(0008, 1250) Related Series Sequence             SQ: <Sequence, length 3>
(0010, 0010)
(0010, 0010) Patient's Name                      PN: 'ReproducingTheError'
(0020, 000d)
(0020, 000d) Study Instance UID                  UI: SeriesUID
(0020, 000e)
(0020, 000e) Series Instance UID                 UI: StudyUID

In [20]: for item in dataset.iterall():
    ...:     print(item)
    ...:
    ...:
(0008, 0012) Instance Creation Date              DA: '20201001'
(0008, 0033) Content Time                        TM: '20201001'
(0008, 1250) Related Series Sequence             SQ: <Sequence, length 3>
(0020, 000d) Study Instance UID                  UI: RelStudy0
(0020, 000e) Series Instance UID                 UI: RelSeries0
(0020, 000d) Study Instance UID                  UI: RelStudy1
(0020, 000e) Series Instance UID                 UI: RelSeries1
(0020, 000d) Study Instance UID                  UI: RelStudy2
(0020, 000e) Series Instance UID                 UI: RelSeries2
(0010, 0010) Patient's Name                      PN: 'ReproducingTheError'
(0020, 000d) Study Instance UID                  UI: SeriesUID
(0020, 000e) Series Instance UID                 UI: StudyUID

So I've been interactively debugging for a few hours this morning - I know the issue is here with respect to how we iterate through the elements and save the association, but I haven't yet figured out what's wrong / how to fix it. @wetzelj do you have any ideas?

@vsoch
Copy link
Member

vsoch commented Oct 1, 2020

oh oops, I spoke too soon... I think I fixed it! Let me put in a PR.

vsoch added a commit that referenced this issue Oct 1, 2020
this leads to an error in replacement of nested fields and top level
fields having mixed up references. We fixed this by iterating through
the dataset as is, instead of using iterall.

Signed-off-by: vsoch <[email protected]>
@wetzelj
Copy link
Contributor

wetzelj commented Oct 1, 2020

I've only cursorily read this (and have to hop on to a meeting now), but wasn't the iterall introduced to enable us to act on sequences and sub-sequences as well? Does removing the iterall break anything with detecting data in those subsequences?

@vsoch
Copy link
Member

vsoch commented Oct 1, 2020

Here is the PR! #154

@wetzelj actually I don't think so! The reason is that we already loop through a sequence, discover that it is a sequence, and then add it's values as datasets to parse through. So we sort of unwrap the list. I think this issue likely arose because this function first just iterated through fields (and did it's own manual extraction of nested) but then you showed me the magic of iterall and I changed it, and I suspect we didn't fully test the replace again. That said, possibly now we lost something that iterall was providing? Please take a look at the PR and let me know what you think!

@wetzelj
Copy link
Contributor

wetzelj commented Oct 2, 2020

Sounds good. I ran some tests and it seems to be okay - I didn't notice anything being lost, just the new corrected functionality. :) I'll keep an eye out as I continue development/testing.

vsoch added a commit that referenced this issue Oct 2, 2020
fixing issue #153 that fields are not properly indexed
@vsoch
Copy link
Member

vsoch commented Oct 2, 2020

Fixed with #154

@vsoch vsoch closed this as completed Oct 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants