
Allow override of dup. field error, and zlib compression level #65

Merged (7 commits) on Aug 25, 2018

Conversation

ihnorton (Contributor)

  • Allow overriding the duplicate field error with a warning instead (duplicates occasionally occur with buggy writers)

  • Allow changing the compression level: this can significantly improve write performance at minimal space cost. As an illustrative example:

timing test details:

    import nrrd
    nrrd.reader._NRRD_ALLOW_DUPLICATE_FIELD = True
    img, hdr = nrrd.read("/path/to/dwi.nhdr")

    %timeit -n1 -r1 nrrd.write("test.nrrd", img, hdr)
    # 193 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

    hdr['encoding'] = 'gz'
    %timeit -n1 -r1 nrrd.write("test_zl_9.nrrd", img, hdr)  # current default, level 9
    # 20.2 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

    nrrd.writer._ZLIB_LEVEL = -1  # zlib default (roughly level 6)
    %timeit -n1 -r1 nrrd.write("test_zl_default.nrrd", img, hdr)
    # 9.94 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

    nrrd.writer._ZLIB_LEVEL = 1
    %timeit -n1 -r1 nrrd.write("test_zl_1.nrrd", img, hdr)
    # 2.68 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


Output files below (test.nrrd is original, and zl_default is roughly level 6 per the docs):

!ls -lah *.nrrd

    -rw-r--r--  1 inorton  staff    61M Aug 24 16:38 test.nrrd
    -rw-r--r--  1 inorton  staff    27M Aug 24 16:39 test_zl_1.nrrd
    -rw-r--r--  1 inorton  staff    26M Aug 24 16:38 test_zl_9.nrrd
    -rw-r--r--  1 inorton  staff    26M Aug 24 16:38 test_zl_default.nrrd

So, level 9 is ~10x slower than level 1, but saves only 1 MB. This is brain MRI data.
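The trade-off can be reproduced with the standard-library `zlib` alone; a minimal sketch on synthetic data (not the MRI volume above, so the exact numbers will differ):

```python
import time
import zlib

# Compressible synthetic payload standing in for image data
data = bytes(range(256)) * 50000  # ~12.8 MB

for level in (1, -1, 9):  # -1 selects zlib's default (roughly level 6)
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    print("level %2d: %8d bytes in %.3f s" % (level, len(compressed), elapsed))
```

On typical data the pattern is the same as in the measurements above: higher levels cost disproportionately more time for a small size gain.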

codecov-io commented Aug 24, 2018

Codecov Report

Merging #65 into master will increase coverage by 0.74%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master      #65      +/-   ##
==========================================
+ Coverage   86.64%   87.39%   +0.74%     
==========================================
  Files           6        6              
  Lines         352      357       +5     
  Branches      113      114       +1     
==========================================
+ Hits          305      312       +7     
+ Misses         23       22       -1     
+ Partials       24       23       -1
Impacted Files   Coverage Δ
nrrd/writer.py   87.17% <100%> (ø) ⬆️
nrrd/reader.py   81.98% <100%> (+1.85%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update caf8506...4183a59.

@addisonElliott (Collaborator) left a comment

This is a good contribution.

My main issue with it is that it requires changing private members in the module.

I would make these parameters a part of the read function instead.

Additionally, it would be nice if you could add documentation for this (edit the docstring or look at the RST files in the docs directory).

Also, there is an issue with Travis failing on Python 2.7; this needs to be resolved as well.


dup_message = "Duplicate header field: 'type'"
try:
    header = nrrd.read_header(header_txt_tuple)
Collaborator

This can be made one line with the assertRaisesRegex function. See here for an example: https://github.com/mhe/pynrrd/blob/master/nrrd/tests/test_writing.py#L178

Contributor Author

Done.
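For reference, the suggested pattern looks roughly like this, with a hypothetical read_header and ValueError standing in for the library's NRRDError:

```python
import unittest

def read_header(lines):
    # Hypothetical stand-in for nrrd.read_header's duplicate-field check
    header = {}
    for line in lines:
        field, _, value = line.partition(': ')
        if field in header:
            raise ValueError("Duplicate header field: '%s'" % field)
        header[field] = value
    return header

class DuplicateFieldTest(unittest.TestCase):
    def test_duplicate_field_raises(self):
        # One line replaces a try/except/fail block
        with self.assertRaisesRegex(ValueError, "Duplicate header field: 'type'"):
            read_header(['type: float', 'dimension: 3', 'type: float'])
```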

expected_header = {u'type': 'float', u'dimension': 3}
header_txt_tuple = ('NRRD0005', 'type: float', 'dimension: 3', 'type: float')

dup_message = "Duplicate header field: 'type'"
Collaborator
Along with the comment below: I would remove dup_message and just put the string directly in the assertRaisesRegex call.

@addisonElliott addisonElliott self-assigned this Aug 24, 2018
ihnorton (Contributor Author)

Also, there is an issue with Travis failing on Python 2.7; this needs to be resolved as well.

Addressed.

I would make these parameters a part of the read function instead.

I made the compression level a parameter of write and added the docstring.
But the field error override is kind of odd/special, and violates the NRRD spec, so it probably shouldn't be advertised.

@addisonElliott (Collaborator) left a comment

You mentioned that duplicate fields occur in practice in NRRD files. When there is a duplicate field, how do you want it handled? Right now it overwrites and saves the last field given. Is this the best way to handle it?

Also, I was thinking it might be worthwhile to write a test for the compression level on gzip and bzip2. My thought is to set the compression level to something besides 9 for both of these and then check that the resulting file size is as expected.
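A test along those lines could be sketched as follows, using the stdlib gzip module on synthetic data (the names and payload here are hypothetical, not the library's actual ball test data):

```python
import gzip
import os
import tempfile

def gzip_sizes_by_level(levels=(1, 9)):
    # Compressible stand-in for the test image payload
    data = bytes(range(256)) * 4000
    sizes = {}
    with tempfile.TemporaryDirectory() as tmpdir:
        for level in levels:
            path = os.path.join(tmpdir, 'ball_%d.gz' % level)
            with gzip.open(path, 'wb', compresslevel=level) as fh:
                fh.write(data)
            # Data must survive the round trip regardless of level
            with gzip.open(path, 'rb') as fh:
                assert fh.read() == data
            sizes[level] = os.path.getsize(path)
    return sizes
```

A test would then assert, for example, that level 9 never produces a larger file than level 1.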

nrrd/writer.py Outdated
@@ -263,7 +267,7 @@ def _write_data(data, fh, header):

# Construct the compressor object based on encoding
if header['encoding'] in ['gzip', 'gz']:
-compressobj = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS | 16)
+compressobj = zlib.compressobj(compression_level, zlib.DEFLATED, zlib.MAX_WBITS | 16)
elif header['encoding'] in ['bzip2', 'bz2']:
compressobj = bz2.BZ2Compressor()
Collaborator

Add the compression level parameter here too so that it applies to bzip2 as well.

Contributor Author

Done
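With the change applied to both branches, the construction might look roughly like this (a standalone sketch, not the exact library code):

```python
import bz2
import zlib

def make_compressor(encoding, compression_level=9):
    if encoding in ('gzip', 'gz'):
        # MAX_WBITS | 16 tells zlib to emit a gzip header and trailer
        return zlib.compressobj(compression_level, zlib.DEFLATED, zlib.MAX_WBITS | 16)
    elif encoding in ('bzip2', 'bz2'):
        # BZ2Compressor also accepts a 1-9 compression level
        return bz2.BZ2Compressor(compression_level)
    raise ValueError('Unsupported encoding: %s' % encoding)
```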

nrrd/writer.py Outdated
@@ -124,6 +124,10 @@ def write(filename, data, header={}, detached_header=False, custom_field_map=Non
custom_field_map : :class:`dict` (:class:`str`, :class:`str`), optional
Dictionary used for parsing custom field types where the key is the custom field name and the value is a
string identifying datatype for the custom field.
compression_level : int
Collaborator

Use :class:`int` here just so Sphinx will link to it, to stay consistent with the other parameters.

Contributor Author

Ok, for consistency. But linking docs for native types is excessive IMHO, and it makes the docstrings ugly outside of Sphinx.

nrrd/writer.py Outdated
@@ -124,6 +124,10 @@ def write(filename, data, header={}, detached_header=False, custom_field_map=Non
custom_field_map : :class:`dict` (:class:`str`, :class:`str`), optional
Dictionary used for parsing custom field types where the key is the custom field name and the value is a
string identifying datatype for the custom field.
compression_level : int
Int specifying compression level, when applicable.
Collaborator

Instead of "when applicable", maybe be more specific and say it applies when the encoding is set to a compressed format (gzip or bzip2).

Contributor Author

Ok

nrrd/reader.py Outdated
@@ -15,6 +16,10 @@

_NRRD_REQUIRED_FIELDS = ['dimension', 'type', 'encoding', 'sizes']

# Duplicated fields are prohibited by the spec, but do occur in the wild.
# Set True to allow duplicate fields, with a warning.
_NRRD_ALLOW_DUPLICATE_FIELD = False
Collaborator

I would make this a parameter to the read & read_header functions. As I said, I'm not a fan of having to edit a private member in a module, especially one that isn't documented.

I understand your apprehension not to document this because it is not supported by the NRRD specification, but I think whatever parameters we add, they should be documented.

What you can do though, in the docstring, you can put a warning saying it is not supported. Something like this (I'm not sure if it can go in the argument list, but you can put it in the body of the docstring for sure).

    .. warning::
        Allowing duplicate fields with :obj:`allow_duplicate_fields` is not explicitly supported in the NRRD specification. Use this parameter at your own risk.

Contributor Author

Ok. It is no longer a private field, but I will not be adding it to read. This really is a special case, just like the existing chunk size parameter.

nrrd/reader.py Outdated
if not _NRRD_ALLOW_DUPLICATE_FIELD:
    raise NRRDError(dup_message)

warnings.warn(dup_message)
Collaborator

So with a duplicated field, the old field will be overwritten. Would it be better if we appended an integer to the end of the field name, something like 2, so that it doesn't conflict and both values are saved?

Contributor Author

I think it's fine to overwrite. If the values were duplicated then there would probably need to be a hack in the save path to remove the second key while saving, which would just be confusing.

@codecov-io

Updated Codecov report: same coverage numbers as above. Last update caf8506...77d5d2c.

ihnorton (Contributor Author)

Right now it overwrites and saves the last field given. Is this the best way to handle it?

Yes

Also, I was thinking it might be worthwhile to write a test for the compression level on gzip and bzip2. My thought is to set the compression level to something besides 9 for both of these and then check that the resulting file size is as expected.

Done. For now I commented out the assertion about the file size difference for bz2, because for the binary ball data the output is the same size at all levels.

@addisonElliott addisonElliott merged commit b744a63 into mhe:master Aug 25, 2018
addisonElliott (Collaborator)

LGTM 👍

I guess I wasn't fully considering the use case for the duplicate field handling. I think I agree that it should not be a function parameter in read or read_header.

The tests look good!

ihnorton (Contributor Author)

Thanks!
