Closed
Labels: bug (Something isn't working)
Description
It seems that in the casia2015 subset, some samples have the form {'c[hn]': 'xxx', 'en': 'aa'}, with a 'c[hn]' key where 'zh' is expected.
As a result, loading the wmt17 zh-en dataset with data = load_dataset('wmt17', "zh-en") raises the following exception:
Traceback (most recent call last):
  File "train.py", line 78, in <module>
    data = load_dataset(args.dataset, "zh-en")
  File "/usr/local/lib/python3.7/dist-packages/datasets/load.py", line 1684, in load_dataset
    use_auth_token=use_auth_token,
  File "/usr/local/lib/python3.7/dist-packages/datasets/builder.py", line 705, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/usr/local/lib/python3.7/dist-packages/datasets/builder.py", line 1221, in _download_and_prepare
    super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
  File "/usr/local/lib/python3.7/dist-packages/datasets/builder.py", line 793, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/builder.py", line 1215, in _prepare_split
    num_examples, num_bytes = writer.finalize()
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_writer.py", line 533, in finalize
    self.write_examples_on_file()
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_writer.py", line 410, in write_examples_on_file
    self.write_batch(batch_examples=batch_examples)
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_writer.py", line 503, in write_batch
    arrays.append(pa.array(typed_sequence))
  File "pyarrow/array.pxi", line 230, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
  File "/usr/local/lib/python3.7/dist-packages/datasets/arrow_writer.py", line 198, in __arrow_array__
    out = cast_array_to_feature(out, type, allow_number_to_str=not self.trying_type)
  File "/usr/local/lib/python3.7/dist-packages/datasets/table.py", line 1675, in wrapper
    return func(array, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/table.py", line 1846, in cast_array_to_feature
    return array_cast(array, feature(), allow_number_to_str=allow_number_to_str)
  File "/usr/local/lib/python3.7/dist-packages/datasets/table.py", line 1675, in wrapper
    return func(array, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/table.py", line 1756, in array_cast
    raise TypeError(f"Couldn't cast array of type\n{array.type}\nto\n{pa_type}")
TypeError: Couldn't cast array of type
struct<c[hn]: string, en: string, zh: string>
to
struct<en: string, zh: string>
A workaround is to rebuild the offending array manually before it is cast:
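import pyarrow as pa

# `array` is the struct array about to be cast; rebuild it only if it carries the malformed key.
if 'c[hn]' in str(array.type):
    data_list = []
    for vo in array.to_pylist():
        # to_pylist() fills fields that a row lacks with None, so test the value, not the key.
        zh = vo['zh'] if vo.get('zh') is not None else vo.get('c[hn]')
        data_list.append({'en': vo['en'], 'zh': zh})
    # Re-create the array with exactly the two expected fields.
    array = pa.array(data_list, type=pa.struct([
        pa.field('en', pa.string()),
        pa.field('zh', pa.string()),
    ]))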
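The same renaming can also be done on plain Python dicts, before the samples ever reach Arrow. The following is only a minimal sketch under that assumption: the helper names `normalize_pair` and `normalize_all` are hypothetical and are not part of the `datasets` library, and the sample values are the placeholders from the description above.

```python
from typing import Dict, Iterable, Iterator, Optional

BAD_KEY = "c[hn]"  # the malformed key reported in this issue


def normalize_pair(sample: Dict[str, Optional[str]]) -> Dict[str, Optional[str]]:
    """Return a dict with exactly the keys 'en' and 'zh' (hypothetical helper)."""
    zh = sample.get("zh")
    if zh is None:
        # Fall back to the text stored under the malformed key, if any.
        zh = sample.get(BAD_KEY)
    return {"en": sample.get("en"), "zh": zh}


def normalize_all(samples: Iterable[Dict[str, Optional[str]]]) -> Iterator[Dict[str, Optional[str]]]:
    """Lazily normalize every sample in an iterable (hypothetical helper)."""
    for sample in samples:
        yield normalize_pair(sample)


samples = [
    {"c[hn]": "xxx", "en": "aa"},  # malformed sample, as in the description
    {"zh": "yyy", "en": "bb"},     # well-formed sample
]
cleaned = list(normalize_all(samples))
print(cleaned)
```

Because every cleaned dict has the same two keys, an array built from them would no longer pick up a third `c[hn]` field.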
The proper fix is therefore probably to update the original casia2015 file with a corrected version.
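To check whether a given copy of the data still contains malformed samples, one could tally the key sets that occur. This is a rough sketch that assumes the samples are available as an iterable of dicts; `count_key_sets` and `unexpected_keys` are hypothetical helper names, not `datasets` APIs.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def count_key_sets(samples: Iterable[Dict[str, str]]) -> Counter:
    """Count how often each distinct (sorted) set of keys occurs."""
    return Counter(tuple(sorted(sample)) for sample in samples)


def unexpected_keys(samples: Iterable[Dict[str, str]],
                    expected: Tuple[str, ...] = ("en", "zh")) -> List[str]:
    """List every key that is not one of the expected language codes."""
    allowed = set(expected)
    return sorted({key for sample in samples for key in sample if key not in allowed})


samples = [
    {"c[hn]": "xxx", "en": "aa"},  # malformed sample, as in the description
    {"zh": "yyy", "en": "bb"},
    {"zh": "zzz", "en": "cc"},
]
key_counts = count_key_sets(samples)
bad_keys = unexpected_keys(samples)
print(key_counts)
print(bad_keys)
```

An empty list from `unexpected_keys` would indicate that every sample uses only the `en` and `zh` keys.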