This repository was archived by the owner on Jul 7, 2023. It is now read-only.

Cannot download MRPC data #1246

@ywkim

Description

I get a UnicodeDecodeError when generating the "MSR Paraphrase Corpus" (MRPC) data. It happens with both t2t-datagen and t2t-trainer.

Environment information

OS: macOS 10.13.4

$ pip freeze | grep tensor
mesh-tensorflow==0.0.4
tensor2tensor==1.11.0
tensorboard==1.12.0
tensorflow==1.12.0
tensorflow-metadata==0.9.0
tensorflow-probability==0.5.0

$ python -V
Python 3.6.4

For bugs: reproduction and error logs

# Steps to reproduce:
$ t2t-datagen \
  --data_dir=~/t2t_data/msr_paraphrase_corpus \
  --tmp_dir=/tmp/t2t_tmp \
  --problem=msr_paraphrase_corpus
# Error logs:
INFO:tensorflow:Generated 8152 Examples
INFO:tensorflow:Found vocab file: /Users/ywkim/t2t_data/msr_paraphrase_corpus/vocab.msr_paraphrase_corpus.8192.subwords                                                
Traceback (most recent call last):
  File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/bin/t2t-datagen", line 28, in <module>                                                                    
    tf.app.run()
  File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run                          
    _sys.exit(main(argv))
  File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/bin/t2t-datagen", line 23, in main                                                                        
    t2t_datagen.main(argv)
  File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensor2tensor/bin/t2t_datagen.py", line 198, in main                          
    generate_data_for_registered_problem(problem)
  File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensor2tensor/bin/t2t_datagen.py", line 260, in generate_data_for_registered_problem
    problem.generate_data(data_dir, tmp_dir, task_id)
  File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensor2tensor/data_generators/text_problems.py", line 306, in generate_data   
    self.generate_encoded_samples(data_dir, tmp_dir, split)), paths)
  File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensor2tensor/data_generators/generator_utils.py", line 165, in generate_files
    for case in generator:
  File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensor2tensor/data_generators/text_problems.py", line 542, in generate_encoded_samples
    for sample in generator:
  File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensor2tensor/data_generators/mrpc.py", line 114, in generate_samples         
    for row in tf.gfile.Open(os.path.join(mrpc_dir, "dev_ids.tsv")):
  File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 220, in __next__                   
    return self.next()
  File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 214, in next                       
    retval = self.readline()
  File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 184, in readline                   
    return self._prepare_value(self._read_buf.ReadLineAsString())
  File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensorflow/python/lib/io/file_io.py", line 100, in _prepare_value             
    return compat.as_str_any(val)
  File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 107, in as_str_any                    
    return as_str(value)
  File "/Users/ywkim/.local/share/virtualenvs/rally-f4OA2-t-/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 80, in as_text                        
    return bytes_or_text.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa2 in position 12: invalid start byte
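
The final error can be reproduced in isolation: 0xa2 is a UTF-8 continuation byte, so it cannot start a character. Below is a minimal sketch, assuming the downloaded file uses a single-byte encoding such as Latin-1 (the logs above do not confirm this). The sample bytes and the `decode_line` helper are hypothetical and are not part of tensor2tensor or TensorFlow.

```python
def decode_line(raw: bytes, fallback: str = "latin-1") -> str:
    """Decode bytes as UTF-8, falling back to a single-byte encoding.

    The fallback encoding is an assumption; the real encoding of the
    MRPC files is not stated in this report.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail.
        return raw.decode(fallback)


# Hypothetical sample line containing the byte from the error message.
sample = b"costs 25\xa2 each"

try:
    sample.decode("utf-8")
except UnicodeDecodeError as err:
    # Same failure class as in the traceback above.
    print("strict UTF-8 decode failed:", err.reason)

print(decode_line(sample))
```

If that assumption holds, reading the file as bytes and decoding each line explicitly would avoid the implicit UTF-8 decode seen in the traceback.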
