
Bug: dataset.py - All 4 Open statements in this file are lacking encoding="utf-8" argument #1171

Open · CAW-nz opened this issue Nov 21, 2024 · 0 comments

CAW-nz commented Nov 21, 2024

Originally I had identified that the two `open()` statements in `EvaluationDataset.save_as()` were missing the `encoding="utf-8"` argument; see the details below. However, further testing has shown that ALL four `open()` statements in this file need this encoding argument.

The equivalent two `open()` statements in `Synthesizer.save_as()` do have this argument.
The missing `encoding="utf-8"` argument causes problems in the JSON and CSV output when saving via the dataset version.

For example, `synthesizer.save_as(file_type='json', directory=out_dir)` produced the following output snippet for context:

"context": [
" for achieving \nambitious climate change mitigation goals and climate resilient \ndevelopment ( high confidence ). Climate resilient development is \nenabled by increased international cooperation including mobilising \nand enhancing access to finance, particularly for developing countries, \nvulnerable regions, sectors and groups and aligning finance flows \nfor climate action to be consistent with ambition levels and funding \nneeds ( high confidence )."

whereas when converted to a dataset via

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset(goldens=synthesizer.synthetic_goldens)
dataset.save_as(file_type='json', directory=out_dir)
```

for the same data, `dataset.save_as()` gave:

"context": [
" for achieving \nambitious climate change mitigation goals and climate resilient \ndevelopment ( high con\ufb01dence ). Climate resilient development is \nenabled by increased international cooperation including mobilising \nand enhancing access to \ufb01nance, particularly for developing countries, \nvulnerable regions, sectors and groups and aligning \ufb01nance \ufb02ows \nfor climate action to be consistent with ambition levels and funding \nneeds ( high con\ufb01dence )."

All instances of words containing "fi" (actually stored as the single ligature character "ﬁ") became "\ufb01", and "fl" (actually the single ligature "ﬂ") became "\ufb02". This happened throughout the entire JSON file.
Another character that isn't handled is the degree sign in "1.5°C", which appears as "1.5\u00b0C" in the dataset version but is output properly in the synthesizer version.
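For reference, a quick standard-library check confirms what these escaped code points are; single-character ligatures like this are typical of text extracted from PDFs:

```python
import unicodedata

# Identify the escaped code points seen in the JSON output above.
for ch in "\ufb01\ufb02\u00b0":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+FB01 LATIN SMALL LIGATURE FI
# U+FB02 LATIN SMALL LIGATURE FL
# U+00B0 DEGREE SIGN
```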

The call `dataset.save_as(file_type='csv')` even fails outright for my data, probably due to the same issue (whereas `synthesizer.save_as(file_type='csv')` succeeds).
Bottom of the traceback:

```
File C:\ProgramData\anaconda3\Lib\encodings\cp1252.py:19, in IncrementalEncoder.encode(self, input, final)
     18 def encode(self, input, final=False):
---> 19     return codecs.charmap_encode(input,self.errors,encoding_table)[0]

UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' in position 911: character maps to <undefined>
```
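This is easy to reproduce outside deepeval. `open()` in text mode without an `encoding` argument falls back to `locale.getpreferredencoding()`, which is cp1252 on a Windows machine with a Western-European locale, and cp1252 has no mapping for the ligature character (a minimal sketch, not the deepeval code):

```python
# Hypothetical reproduction: force the encoding a bare open(..., "w")
# would pick on a cp1252 system, then write the "fi" ligature.
with open("out.csv", "w", encoding="cp1252") as f:
    f.write("high con\ufb01dence")
# UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' ...
```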


Further testing has shown that the two `open(..., "r")` calls in this dataset.py file also need the same encoding argument; otherwise they can silently inject incorrect data at read time.
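A sketch of that read-side failure mode (the file name and string are made up for illustration): a UTF-8 file read back with the locale default decodes multi-byte sequences into the wrong characters, with no exception raised:

```python
# Write valid UTF-8, then read it back the way a bare open(..., "r")
# behaves on a cp1252 system: the data is corrupted silently.
with open("sample.json", "w", encoding="utf-8") as f:
    f.write("1.5\u00b0C")  # the degree sign is two bytes in UTF-8: 0xC2 0xB0

with open("sample.json", "r", encoding="cp1252") as f:
    print(f.read())  # prints "1.5Â°C" -- wrong data, no error
```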


The only other difference between the two `save_as()` methods is that their JSON branches pass different arguments to `json.dump()`:

- synthesizer.py has `json.dump(json_data, file, indent=4, ensure_ascii=False)`
- dataset.py has `json.dump(json_data, file, indent=4)`

Whichever arguments are appropriate, the two versions should be consistent. My testing has shown that the synthesizer version gives more consistently readable output than the current code in `dataset.save_as()`, even after adding the encoding argument to both the "r" and "w" `open()` calls: e.g. "1.5°C" was still saved as "1.5\u00b0C" without the `ensure_ascii=False` argument to `json.dump()`. Therefore it seems we should simply apply all arguments from the synthesizer versions.
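The interaction is easy to see in isolation (a sketch; the dict is made up): `ensure_ascii=False` keeps the characters literal in the JSON text, and `encoding="utf-8"` on `open()` is what then lets those literal characters be written safely:

```python
import json

data = {"context": ["1.5\u00b0C and high con\ufb01dence"]}

print(json.dumps(data, indent=4))
# escaped: "1.5\u00b0C and high con\ufb01dence"   <- dataset.py today
print(json.dumps(data, indent=4, ensure_ascii=False))
# readable: "1.5°C and high conﬁdence"           <- synthesizer.py
```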


Therefore, in summary, five line changes are needed to fix this issue:

- Line 352 (in the `add_test_cases_from_json_file` method) changes to: `with open(file_path, "r", encoding="utf-8") as file:`
- Line 512 (in the `add_goldens_from_json_file` method) changes to: `with open(file_path, "r", encoding="utf-8") as file:`
- Line 767 (in the `save_as` method) changes to: `with open(full_file_path, "w", encoding="utf-8") as file:`
- Line 778 (in the `save_as` method) changes to: `json.dump(json_data, file, indent=4, ensure_ascii=False)`
- Line 781 (in the `save_as` method) changes to: `with open(full_file_path, "w", newline="", encoding="utf-8") as file:`
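Put together, the corrected write path would look roughly like the sketch below (assuming the branch structure mirrors `synthesizer.save_as()`; `json_data` and `full_file_path` are the existing names cited above, and the CSV row-writing body is elided):

```python
# Sketch of the corrected save_as() write calls -- not a verbatim patch.
if file_type == "json":
    with open(full_file_path, "w", encoding="utf-8") as file:
        json.dump(json_data, file, indent=4, ensure_ascii=False)
elif file_type == "csv":
    with open(full_file_path, "w", newline="", encoding="utf-8") as file:
        ...  # existing csv writing code, unchanged
```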
