Originally I had identified that the two `open()` statements in `EvaluationDataset.save_as()` lack the `encoding="utf-8"` argument (refer to the information below). However, further testing has shown that ALL four file `open()` statements in this file need this encoding argument.
The equivalent two `open()` statements in `Synthesizer.save_as()` do have this argument.
The missing `encoding="utf-8"` argument causes problems in the JSON and CSV output when calling the dataset version.
For example, `synthesizer.save_as(file_type='json', directory=out_dir)` produced the following output snippet for `context`:
"context": [
" for achieving \nambitious climate change mitigation goals and climate resilient \ndevelopment ( high confidence ). Climate resilient development is \nenabled by increased international cooperation including mobilising \nand enhancing access to finance, particularly for developing countries, \nvulnerable regions, sectors and groups and aligning finance flows \nfor climate action to be consistent with ambition levels and funding \nneeds ( high confidence )."
whereas when converted to a dataset via

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset(goldens=synthesizer.synthetic_goldens)
dataset.save_as(file_type='json', directory=out_dir)
```

`dataset.save_as()` gave the following for the same data:
"context": [
" for achieving \nambitious climate change mitigation goals and climate resilient \ndevelopment ( high con\ufb01dence ). Climate resilient development is \nenabled by increased international cooperation including mobilising \nand enhancing access to \ufb01nance, particularly for developing countries, \nvulnerable regions, sectors and groups and aligning \ufb01nance \ufb02ows \nfor climate action to be consistent with ambition levels and funding \nneeds ( high con\ufb01dence )."
All instances of words containing "fi" (actually stored as the single ligature character "ﬁ", U+FB01) became `\ufb01`, and "fl" (the single ligature "ﬂ", U+FB02) became `\ufb02`. This happened throughout the entire JSON file.
Another character that isn't handled is the degree sign in "1.5°C", which appears as `1.5\u00b0C` in the dataset version but is output correctly in the synthesizer version.
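A minimal, self-contained sketch of the escaping difference (made-up data, purely to illustrate the `ensure_ascii` behaviour, not the library's actual code):

```python
import json

# Hypothetical data containing an fi ligature (U+FB01) and a degree
# sign (U+00B0), mimicking the snippets above.
data = {"context": ["high con\ufb01dence at 1.5\u00b0C"]}

# Default json.dump/json.dumps behaviour (as in dataset.py):
# every non-ASCII character is escaped to \uXXXX.
print(json.dumps(data, indent=4))
# ... "high con\ufb01dence at 1.5\u00b0C"

# With ensure_ascii=False (as in synthesizer.py): characters are
# written as-is, so the file stays human-readable.
print(json.dumps(data, indent=4, ensure_ascii=False))
# ... "high conﬁdence at 1.5°C"
```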
The call `dataset.save_as(file_type='csv')` even fails outright for my data, probably due to this issue (whereas `synthesizer.save_as(file_type='csv')` succeeds).
Bottom of the trace dump:

```
UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' in position 911: character maps to <undefined>
```
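For context, the `charmap` codec is what `open()` falls back to on a Windows locale (typically cp1252) when no encoding is given. Forcing that codec reproduces the same failure (hypothetical minimal example):

```python
text = "high con\ufb01dence"  # contains the fi ligature U+FB01

# cp1252 has no mapping for U+FB01, so the write raises:
# UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01'
# in position 8: character maps to <undefined>
with open("out.csv", "w", encoding="cp1252") as file:
    file.write(text)
```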
Further testing has shown that the two `open(..., "r")` calls in this dataset.py file also need the same encoding argument; otherwise they can inject incorrect data at read time.
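A sketch of the read-side corruption (the locale default is simulated here as cp1252; names and data are illustrative):

```python
# A utf-8 file decoded with a non-utf-8 locale default silently
# corrupts multi-byte characters instead of raising an error.
with open("sample.json", "w", encoding="utf-8") as file:
    file.write('{"temp": "1.5\u00b0C"}')

with open("sample.json", "r", encoding="cp1252") as file:
    print(file.read())  # {"temp": "1.5Â°C"}  <- mojibake injected on read

with open("sample.json", "r", encoding="utf-8") as file:
    print(file.read())  # {"temp": "1.5°C"}   <- correct with the fix
```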
The only other difference between the two `save_as()` methods is that the JSON versions pass different arguments to `json.dump()`:

- synthesizer.py has `json.dump(json_data, file, indent=4, ensure_ascii=False)`
- dataset.py has `json.dump(json_data, file, indent=4)`

Whatever the appropriate arguments are, they should be consistent in both versions too.
My testing has shown that the synthesizer version gives better, more consistently readable output than the current code in `dataset.save_as()`, even after adding the encoding argument (to both the "r" and "w" `open()` calls). For example, "1.5°C" was still saved as `1.5\u00b0C` without the `ensure_ascii=False` argument for `json.dump()`. Therefore it seems we should simply apply all the arguments from the synthesizer versions.
In summary, five line changes are needed to fix this issue (a condensed sketch follows the list):
- Line 352 (in the `add_test_cases_from_json_file` method) changes to: `with open(file_path, "r", encoding="utf-8") as file:`
- Line 512 (in the `add_goldens_from_json_file` method) changes to: `with open(file_path, "r", encoding="utf-8") as file:`
- Line 767 (in the `save_as` method) changes to: `with open(full_file_path, "w", encoding="utf-8") as file:`
- Line 778 (in the `save_as` method) changes to: `json.dump(json_data, file, indent=4, ensure_ascii=False)`
- Line 781 (in the `save_as` method) changes to: `with open(full_file_path, "w", newline="", encoding="utf-8") as file:`
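Taken together, a hypothetical condensed view of the fixed write path, mirroring the argument set `Synthesizer.save_as()` already uses (the variable names `json_data`, `rows`, and `full_file_path` are illustrative, not the actual dataset.py code):

```python
import csv
import json

def save_as_fixed(json_data, rows, full_file_path, file_type):
    if file_type == "json":
        # utf-8 handle plus ensure_ascii=False keeps ligatures and
        # symbols like ° readable instead of \uXXXX-escaped.
        with open(full_file_path, "w", encoding="utf-8") as file:
            json.dump(json_data, file, indent=4, ensure_ascii=False)
    elif file_type == "csv":
        # newline="" is the csv-module convention; encoding="utf-8"
        # avoids the charmap failure shown in the traceback above.
        with open(full_file_path, "w", newline="", encoding="utf-8") as file:
            csv.writer(file).writerows(rows)
```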