This repository was archived by the owner on Jul 7, 2023. It is now read-only.
Cannot download MRPC data #1246
Closed
Description
I get a UnicodeDecodeError when trying to generate the "MSR Paraphrase Corpus" data. It happens with both t2t-datagen and t2t-trainer.
Environment information
OS: macOS 10.13.4
$ pip freeze | grep tensor
mesh-tensorflow==0.0.4
tensor2tensor==1.11.0
tensorboard==1.12.0
tensorflow==1.12.0
tensorflow-metadata==0.9.0
tensorflow-probability==0.5.0
$ python -V
Python 3.6.4
For bugs: reproduction and error logs
# Steps to reproduce:
$ t2t-datagen \
--data_dir=~/t2t_data/msr_paraphrase_corpus \
--tmp_dir=/tmp/t2t_tmp \
--problem=msr_paraphrase_corpus
# Error logs:
INFO:tensorflow:Generated 8152 Examples
INFO:tensorflow:Found vocab file: /Users/ywkim/t2t_data/msr_paraphrase_corpus/vocab.msr_paraphrase_corpus.8192.subwords
Traceback (most recent call last):
File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/bin/t2t-datagen", line 28, in <module>
tf.app.run()
File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/bin/t2t-datagen", line 23, in main
t2t_datagen.main(argv)
File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensor2tensor/bin/t2t_datagen.py", line 198, in main
generate_data_for_registered_problem(problem)
File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensor2tensor/bin/t2t_datagen.py", line 260, in generate_data_for_registered_problem
problem.generate_data(data_dir, tmp_dir, task_id)
File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensor2tensor/data_generators/text_problems.py", line 306, in generate_data
self.generate_encoded_samples(data_dir, tmp_dir, split)), paths)
File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensor2tensor/data_generators/generator_utils.py", line 165, in generate_files
for case in generator:
File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensor2tensor/data_generators/text_problems.py", line 542, in generate_encoded_samples
for sample in generator:
File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensor2tensor/data_generators/mrpc.py", line 114, in generate_samples
for row in tf.gfile.Open(os.path.join(mrpc_dir, "dev_ids.tsv")):
File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 220, in __next__
return self.next()
File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 214, in next
retval = self.readline()
File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 184, in readline
return self._prepare_value(self._read_buf.ReadLineAsString())
File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 100, in _prepare_value
return compat.as_str_any(val)
File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 107, in as_str_any
return as_str(value)
File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 80, in as_text
return bytes_or_text.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa2 in position 12: invalid start byte
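The error means the byte 0xa2 appears where UTF-8 expects the first byte of a character, so the line being read is probably not UTF-8 encoded. The following is a minimal sketch of that failure and one possible workaround, using only the Python standard library. The sample bytes, the `decode_line` helper, and the choice of `latin-1` as the fallback encoding are assumptions for illustration; the issue does not confirm the actual encoding of the downloaded MRPC file.

```python
def decode_line(raw: bytes, fallback: str = "latin-1") -> str:
    """Decode bytes as UTF-8, falling back to a single-byte encoding.

    `fallback` is an assumption: latin-1 maps every byte to a character,
    so it never raises, but it may not be the file's true encoding.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback)


# Made-up sample line with the stray byte 0xa2 at index 12,
# mirroring the byte and position reported in the traceback.
sample = b"hello world \xa2 end"

try:
    sample.decode("utf-8")
except UnicodeDecodeError as err:
    # 0xa2 is a UTF-8 continuation byte, so it cannot start a character.
    print("strict UTF-8 failed:", err)

print(decode_line(sample))
```

Reading the file in binary mode and decoding each line this way would avoid the crash, at the cost of possibly mis-rendering characters if the fallback encoding is wrong.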