
[Question]: ColumnCorpus taking forever to load large dataset #3494

Closed · pxb5080 opened this issue Jul 7, 2024 · 0 comments

Labels: question (Further information is requested)

pxb5080 commented Jul 7, 2024
Question

I am building a sequence tagger that tags each character in a sentence. I have training data of a few million sentences, resulting in ~1 billion training examples, where one training example is one character with its corresponding label.
I am instantiating ColumnCorpus like this:

    from flair.datasets import ColumnCorpus

    # Column 0 holds the character, column 1 its label (the mapping shown
    # here is illustrative; data_path points at the corpus directory).
    columns = {0: 'text', 1: 'ner'}
    corpus = ColumnCorpus(data_path, columns,
                          in_memory=False,
                          train_file='train',
                          test_file='test',
                          dev_file='dev')
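
For reference, a minimal self-contained sketch of the column format this reader expects; the {0: 'text', 1: 'ner'} mapping and the toy labels are illustrative placeholders, not my actual schema:

    from pathlib import Path

    from flair.datasets import ColumnCorpus

    data_path = Path("toy_corpus")
    data_path.mkdir(exist_ok=True)

    # One character per line, tab-separated from its label; a blank line
    # ends a sentence.
    sample = "h\tB-WORD\ni\tI-WORD\n\no\tB-WORD\nk\tI-WORD\n\n"
    for split in ("train", "test", "dev"):
        (data_path / split).write_text(sample, encoding="utf-8")

    columns = {0: "text", 1: "ner"}
    corpus = ColumnCorpus(data_path, columns,
                          train_file="train", test_file="test", dev_file="dev")
    print(corpus)  # expect 2 train / 2 dev / 2 test sentences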

Initially, I was getting OOM errors, so I set in_memory=False. However, loading still takes forever and the job gets killed. I am using 2 GPUs. I have the following questions:

  1. Is ColumnCorpus the right data format for this data size, or is there a format better suited for this purpose?
  2. Is there a different way to instantiate ColumnCorpus that is better suited for large datasets? (A load-timing sketch follows below.)
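
As context for question 2, here is a hedged sketch (paths and the column mapping are placeholders) of how I would time the loader on a small slice of the train file before committing to the full corpus:

    # Sketch (assumed file layout): copy the first 10,000 sentences of the
    # full train file into a small sample directory and time ColumnCorpus
    # on it, to extrapolate the full-corpus load time.
    import time
    from pathlib import Path

    from flair.datasets import ColumnCorpus

    src = Path("data/train")        # placeholder path to the full train file
    sample_dir = Path("data_sample")
    sample_dir.mkdir(exist_ok=True)

    remaining, kept = 10_000, []
    with src.open(encoding="utf-8") as f:
        for line in f:
            kept.append(line)
            if line.strip() == "":  # blank line marks a sentence boundary
                remaining -= 1
                if remaining == 0:
                    break
    for split in ("train", "test", "dev"):
        (sample_dir / split).write_text("".join(kept), encoding="utf-8")

    start = time.perf_counter()
    corpus = ColumnCorpus(sample_dir, {0: "text", 1: "ner"},
                          train_file="train", test_file="test", dev_file="dev",
                          in_memory=False)
    print(f"Loaded 10k-sentence sample in {time.perf_counter() - start:.1f}s")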
pxb5080 added the question label Jul 7, 2024
pxb5080 closed this as completed Jul 7, 2024