I am building a sequence tagger that tags each character in a sentence. I have training data of a few million sentences, resulting in roughly one billion training examples, where one training example is one character with its corresponding label.
I am instantiating ColumnCorpus like this:
corpus = ColumnCorpus(data_path, columns,
                      in_memory=False,
                      train_file='train',
                      test_file='test',
                      dev_file='dev')
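For reference, ColumnCorpus reads CoNLL-style column files: one token per line (here, one character per line), whitespace-separated columns, and a blank line between sentences. Below is a minimal plain-Python sketch of parsing that layout, assuming a two-column text/label mapping like the `columns` dict passed above (the column names and the `B-WORD`/`I-WORD` labels are illustrative placeholders, not part of the original question):

```python
import io

# Assumed column mapping, analogous to what is passed to ColumnCorpus.
columns = {0: "text", 1: "label"}

# A tiny character-level sample in column format: one character and its
# label per line, blank line between sentences. Labels are placeholders.
sample = """\
H B-WORD
i I-WORD

y B-WORD
o I-WORD
"""

def read_column_file(f, column_format):
    """Parse a column-format stream into sentences of per-character rows."""
    sentences, current = [], []
    for line in f:
        line = line.rstrip("\n")
        if not line:  # a blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        current.append({column_format[i]: v for i, v in enumerate(fields)})
    if current:  # flush a trailing sentence with no final blank line
        sentences.append(current)
    return sentences

sentences = read_column_file(io.StringIO(sample), columns)
# sentences[0] -> [{'text': 'H', 'label': 'B-WORD'}, {'text': 'i', 'label': 'I-WORD'}]
```

At a billion such lines, anything that builds Python objects for the whole file at once will exhaust memory, which is why streaming (`in_memory=False`) or sharding the train file becomes relevant.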
Initially, I was getting an OOM error, so I set the in_memory flag to False. However, loading takes forever and the job gets killed. I am using 2 GPUs. I have the following questions:
Is ColumnCorpus the right data format for data of this size? Or is there a data format better suited for this purpose?
Is there a different way to instantiate ColumnCorpus that is better suited for large datasets?