Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files
|
|
|
ChristianGeng
left a comment
There was a problem hiding this comment.
Apiwise I think that having only one utils.hash as before and distinguishing calculations based on the include_order_and_names kwarg is nice.
I cannot do a technical review of the md5 algorithm though, so I am not attempting.
The test coverage is fairly extensive. This will help a lot in the discovery of implementation changes on the pandas side, or even in the unlikely case that sth changes in hashlib.md5.
I would not be passionate about the name of the keyword argument, be it "strict" or "include_order_and_names" as long as the motivation becomes clear. The code and the issue description contain already links to the pandas issue where this was discussed. I wonder whether links should even be extended to point to the audformat issues that led to the implementation as is (the pyarrow metdatata has, effectively clarifying why this was needed)?
Apart from that I think that this is good and can be approved.
|
I added a reference to #419 in the description and renamed the argument to |
Closes #447
Adds the
strictargument toaudformat.utils.hash().If
Trueit will take the order of rows and the names of index levels and columns into account. The returned hash is then identical to the one attached to parquet table files.A hashing variant that takes into account the order of rows, name of columns and levels, and the data types was needed for calculating a hash for parquet tables as introduced in #419. Before, this was handled by a private method, independent of
audformat.utils.hash().