
Bug: dataset.py - All 4 Open statements in this file are lacking encoding="utf-8" argument #1171

Open · CAW-nz opened this issue Nov 21, 2024 · 0 comments

CAW-nz commented Nov 21, 2024

Originally I had identified that the two `open()` statements in `EvaluationDataset.save_as()` were missing the `encoding="utf-8"` argument; see the details below. However, further testing has shown that ALL four `open()` statements in this file need this encoding argument.

The equivalent two `open()` statements in `Synthesizer.save_as()` do have this argument.
The missing `encoding="utf-8"` argument causes problems in the JSON and CSV output when saving via the dataset version.

For example, `synthesizer.save_as(file_type='json', directory=out_dir)` produced the following output snippet for context:

"context": [
" for achieving \nambitious climate change mitigation goals and climate resilient \ndevelopment ( high confidence ). Climate resilient development is \nenabled by increased international cooperation including mobilising \nand enhancing access to finance, particularly for developing countries, \nvulnerable regions, sectors and groups and aligning finance flows \nfor climate action to be consistent with ambition levels and funding \nneeds ( high confidence )."

whereas when converted to a dataset via

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset(goldens=synthesizer.synthetic_goldens)
dataset.save_as(file_type='json', directory=out_dir)
```

for the same data, `dataset.save_as()` gave:

"context": [
" for achieving \nambitious climate change mitigation goals and climate resilient \ndevelopment ( high con\ufb01dence ). Climate resilient development is \nenabled by increased international cooperation including mobilising \nand enhancing access to \ufb01nance, particularly for developing countries, \nvulnerable regions, sectors and groups and aligning \ufb01nance \ufb02ows \nfor climate action to be consistent with ambition levels and funding \nneeds ( high con\ufb01dence )."

All instances of words containing "fi" (actually stored as the single ligature character "ﬁ") became "\ufb01", and "fl" (actually the single ligature "ﬂ") became "\ufb02". This happened throughout the entire JSON file.
Another character that isn't handled is the degree sign in "1.5°C", which appears as "1.5\u00b0C" in the dataset version but is output properly in the synthesizer version.
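For reference, a quick standard-library check confirms what these escaped code points are; single-character ligatures like this are typical of text extracted from PDFs:

```python
import unicodedata

# Identify the escaped code points seen in the JSON output above.
for ch in "\ufb01\ufb02\u00b0":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+FB01 LATIN SMALL LIGATURE FI
# U+FB02 LATIN SMALL LIGATURE FL
# U+00B0 DEGREE SIGN
```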

The call `dataset.save_as(file_type='csv')` even fails outright for my data, probably due to the same issue (whereas `synthesizer.save_as(file_type='csv')` succeeds).
Bottom of the traceback:

```
File C:\ProgramData\anaconda3\Lib\encodings\cp1252.py:19, in IncrementalEncoder.encode(self, input, final)
     18 def encode(self, input, final=False):
---> 19     return codecs.charmap_encode(input,self.errors,encoding_table)[0]

UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' in position 911: character maps to <undefined>
```
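This is easy to reproduce outside deepeval. `open()` in text mode without an `encoding` argument falls back to `locale.getpreferredencoding()`, which is cp1252 on a Windows machine with a Western-European locale, and cp1252 has no mapping for the ligature character (a minimal sketch, not the deepeval code):

```python
# Hypothetical reproduction: force the encoding a bare open(..., "w")
# would pick on a cp1252 system, then write the "fi" ligature.
with open("out.csv", "w", encoding="cp1252") as f:
    f.write("high con\ufb01dence")
# UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' ...
```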


Further testing has shown that the two `open(..., "r")` calls in this dataset.py file also need the same encoding argument; otherwise they can silently inject incorrect data at read time.
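A sketch of that read-side failure mode (the file name and string are made up for illustration): a UTF-8 file read back with the locale default decodes multi-byte sequences into the wrong characters, with no exception raised:

```python
# Write valid UTF-8, then read it back the way a bare open(..., "r")
# behaves on a cp1252 system: the data is corrupted silently.
with open("sample.json", "w", encoding="utf-8") as f:
    f.write("1.5\u00b0C")  # the degree sign is two bytes in UTF-8: 0xC2 0xB0

with open("sample.json", "r", encoding="cp1252") as f:
    print(f.read())  # prints "1.5Â°C" -- wrong data, no error
```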


The only other difference between the two `save_as()` methods is that their JSON branches pass different arguments to `json.dump()`:

- synthesizer.py has `json.dump(json_data, file, indent=4, ensure_ascii=False)`
- dataset.py has `json.dump(json_data, file, indent=4)`

Whichever arguments are appropriate, the two versions should be consistent. My testing has shown that the synthesizer version gives more consistently readable output than the current code in `dataset.save_as()`, even after adding the encoding argument to both the "r" and "w" `open()` calls: e.g. "1.5°C" was still saved as "1.5\u00b0C" without the `ensure_ascii=False` argument to `json.dump()`. Therefore it seems we should simply apply all arguments from the synthesizer versions.
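The interaction is easy to see in isolation (a sketch; the dict is made up): `ensure_ascii=False` keeps the characters literal in the JSON text, and `encoding="utf-8"` on `open()` is what then lets those literal characters be written safely:

```python
import json

data = {"context": ["1.5\u00b0C and high con\ufb01dence"]}

print(json.dumps(data, indent=4))
# escaped: "1.5\u00b0C and high con\ufb01dence"   <- dataset.py today
print(json.dumps(data, indent=4, ensure_ascii=False))
# readable: "1.5°C and high conﬁdence"           <- synthesizer.py
```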


Therefore, in summary, five line changes are needed to fix this issue:

- Line 352 (in the `add_test_cases_from_json_file` method) changes to: `with open(file_path, "r", encoding="utf-8") as file:`
- Line 512 (in the `add_goldens_from_json_file` method) changes to: `with open(file_path, "r", encoding="utf-8") as file:`
- Line 767 (in the `save_as` method) changes to: `with open(full_file_path, "w", encoding="utf-8") as file:`
- Line 778 (in the `save_as` method) changes to: `json.dump(json_data, file, indent=4, ensure_ascii=False)`
- Line 781 (in the `save_as` method) changes to: `with open(full_file_path, "w", newline="", encoding="utf-8") as file:`
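Put together, the corrected write path would look roughly like the sketch below (assuming the branch structure mirrors `synthesizer.save_as()`; `json_data` and `full_file_path` are the existing names cited above, and the CSV row-writing body is elided):

```python
# Sketch of the corrected save_as() write calls -- not a verbatim patch.
if file_type == "json":
    with open(full_file_path, "w", encoding="utf-8") as file:
        json.dump(json_data, file, indent=4, ensure_ascii=False)
elif file_type == "csv":
    with open(full_file_path, "w", newline="", encoding="utf-8") as file:
        ...  # existing csv writing code, unchanged
```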
