in training sm_cnn, ValueError: could not convert string to float: '<pad>' #142

Open
liudonglei opened this issue Aug 18, 2018 · 6 comments
@liudonglei

$ python train.py --mode static --gpu 1
Note: You are using GPU for training
Dataset TREC Mode static
VOCAB num 13
LABEL.target_class: 13
LABELS: ['', '2', '0', '7', '3', '1', '8', '4', '5', '9', '6', '\t', '.']
Train instance 53417
Dev instance 1148
Test instance 1517
Shift model to GPU
Time Epoch Iteration Progress (%Epoch) Loss Dev/Loss Accuracy Dev/Accuracy
Traceback (most recent call last):
  File "train.py", line 147, in <module>
    for batch_idx, batch in enumerate(train_iter):
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/iterator.py", line 151, in __iter__
    self.train)
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/batch.py", line 27, in __init__
    setattr(self, name, field.process(batch, device=device, train=train))
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/field.py", line 188, in process
    tensor = self.numericalize(padded, device=device, train=train)
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/field.py", line 308, in numericalize
    arr = self.postprocessing(arr, None, train)
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/pipeline.py", line 37, in __call__
    x = pipe.call(x, *args)
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/pipeline.py", line 52, in call
    return [self.convert_token(tok, *args) for tok in x]
  File "/home/dm/anaconda3/envs/theano.3/lib/python3.6/site-packages/torchtext/data/pipeline.py", line 52, in <listcomp>
    return [self.convert_token(tok, *args) for tok in x]
  File "train.py", line 62, in <lambda>
    postprocessing=data.Pipeline(lambda arr, _, train: [float(y) for y in arr]))
  File "train.py", line 62, in <listcomp>
    postprocessing=data.Pipeline(lambda arr, _, train: [float(y) for y in arr]))
ValueError: could not convert string to float: '<pad>'

@liudonglei liudonglei changed the title in training sm_cnn, what the <PAD> meaning? in training sm_cnn, ValueError: could not convert string to float: '<pad>' Aug 18, 2018
@liudonglei
Author

(castor) [ldl@402 sm_cnn 15:15:35] $ python train.py --mode static --no_cuda
Dataset TREC Mode static
VOCAB num 13
LABEL.target_class: 13
LABELS: ['', '2', '0', '7', '3', '1', '8', '4', '5', '9', '6', '\t', '.']
Train instance 53417
Dev instance 1148
Test instance 1517
Time Epoch Iteration Progress (%Epoch) Loss Dev/Loss Accuracy Dev/Accuracy
Traceback (most recent call last):
  File "train.py", line 147, in <module>
    for batch_idx, batch in enumerate(train_iter):
  File "/home/ldl/anaconda2/envs/castor/lib/python3.6/site-packages/torchtext/data/iterator.py", line 151, in __iter__
    self.train)
  File "/home/ldl/anaconda2/envs/castor/lib/python3.6/site-packages/torchtext/data/batch.py", line 27, in __init__
    setattr(self, name, field.process(batch, device=device, train=train))
  File "/home/ldl/anaconda2/envs/castor/lib/python3.6/site-packages/torchtext/data/field.py", line 188, in process
    tensor = self.numericalize(padded, device=device, train=train)
  File "/home/ldl/anaconda2/envs/castor/lib/python3.6/site-packages/torchtext/data/field.py", line 308, in numericalize
    arr = self.postprocessing(arr, None, train)
  File "/home/ldl/anaconda2/envs/castor/lib/python3.6/site-packages/torchtext/data/pipeline.py", line 37, in __call__
    x = pipe.call(x, *args)
  File "/home/ldl/anaconda2/envs/castor/lib/python3.6/site-packages/torchtext/data/pipeline.py", line 52, in call
    return [self.convert_token(tok, *args) for tok in x]
  File "/home/ldl/anaconda2/envs/castor/lib/python3.6/site-packages/torchtext/data/pipeline.py", line 52, in <listcomp>
    return [self.convert_token(tok, *args) for tok in x]
  File "train.py", line 62, in <lambda>
    postprocessing=data.Pipeline(lambda arr, _, train: [float(y) for y in arr]))
  File "train.py", line 62, in <listcomp>
    postprocessing=data.Pipeline(lambda arr, _, train: [float(y) for y in arr]))
ValueError: could not convert string to float: '<pad>'
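
The last frames of the traceback point at the postprocessing lambda on line 62 of train.py, which calls float() on every token in a batch. As the issue title reports, the token it chokes on is the padding string '<pad>'. The snippet below is only a minimal sketch that reproduces the same error outside torchtext. The function names `to_floats` and `to_floats_padded` and the sample values are made up for illustration and are not part of the repo.

```python
def to_floats(arr):
    """Same conversion as the lambda in train.py: float() on every token."""
    return [float(y) for y in arr]


def to_floats_padded(arr, pad_token="<pad>", pad_value=0.0):
    """Hypothetical variant that maps the pad token to a fill value."""
    return [pad_value if y == pad_token else float(y) for y in arr]


# Made-up feature tokens, with one padding token appended.
features = ["0.25", "0.5", "<pad>"]

try:
    to_floats(features)
    message = None
except ValueError as err:
    # float() cannot parse the padding string, which matches the reported error.
    message = str(err)

print(message)
print(to_floats_padded(features))
```

Mapping the pad token to a fill value only hides the symptom. If the padding shows up because the columns of the data file are assigned to the wrong fields, the field order is what needs fixing.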

@Impavidity
Member

Hey @liudonglei, to my understanding, you are using your own dataset, right?
Can you post your dataset format in this thread? That would make it easier for me to understand this issue.

@liudonglei
Author

liudonglei commented Sep 28, 2018

@Impavidity It is not my own dataset. I am just trying the sm_cnn model on the TrecQA dataset from your Castor-data repo. I followed all the steps in Castor/README.md and Castor/sm_cnn/README.md.

@SawanKumar28

Hi @liudonglei, were you able to resolve this issue? I am facing the same issue.

@liudonglei
Author

Hi @liudonglei, were you able to resolve this issue? I am facing the same issue.

Sorry, I couldn't. I am unfamiliar with the torchtext package this repo uses.

@liudonglei
Author

@rosequ
@SawanKumar28
Hi, today I tried this repo again and fixed the problem.
The problem comes from how trec_dataset.py uses torchtext.data.TabularDataset. I don't know the exact cause; it may be a bug related to Python's class inheritance.
After half a day of debugging, I traced the problem to trec_dataset.py and borrowed similar code from the blog post http://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext to make the repo work.

You can just replace trec_dataset.py with the code below:

----the right trec_dataset.py file ----
from torchtext import data


class TrecDataset:
    dirname = 'data'

    @classmethod
    def splits(cls, question_id, question_field, answer_field, external_field, label_field):
        # TabularDataset assigns fields to columns by position, so this order
        # has to match the column order of the TSV files.
        tv_datafields = [('qid', question_id), ('label', label_field), ('question', question_field),
                         ('answer', answer_field), ('ext_feat', external_field)]

        train, dev, test = data.TabularDataset.splits(
            path="data",  # the root directory where the data lies
            train='trecqa.train.tsv', validation='trecqa.dev.tsv', test='trecqa.test.tsv',
            format='tsv',
            # skip_header=True,  # pass this if the files have a header row, so it isn't processed as data
            fields=tv_datafields)
        return train, dev, test
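
Since the replacement file maps fields to columns by position, it may help to check that the TSV files really have five tab-separated columns in the order qid, label, question, answer, ext_feat before training. The following is a rough sketch using only the standard library. The helper name `check_rows` and the sample rows are invented for illustration, and treating ext_feat as whitespace-separated numbers is an assumption based on the float() conversion in train.py, not something confirmed in this thread.

```python
import csv
import io

# Field order taken from tv_datafields in the replacement trec_dataset.py.
EXPECTED_COLUMNS = ["qid", "label", "question", "answer", "ext_feat"]


def check_rows(lines):
    """Return (line_number, problem) pairs for rows that do not fit the field order."""
    problems = []
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_number, row in enumerate(reader, start=1):
        if len(row) != len(EXPECTED_COLUMNS):
            problems.append((line_number, "expected 5 columns, got %d" % len(row)))
            continue
        try:
            # Assumption: ext_feat holds whitespace-separated numbers.
            [float(tok) for tok in row[4].split()]
        except ValueError:
            problems.append((line_number, "ext_feat is not numeric"))
    return problems


# Made-up rows: the first is well formed, the second is missing two columns.
sample = io.StringIO(
    "1\t1\twho wrote it\tthe author wrote it\t0.5 0.25 0.1 0.0\n"
    "2\t0\tmissing columns\n"
)
report = check_rows(sample)
print(report)
```

To run the same check on a real file, pass an open file handle instead of the StringIO object, for example `check_rows(open('data/trecqa.train.tsv', newline=''))`.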
