
[Question]: ColumnCorpus taking forever to load large dataset #3494

Closed · pxb5080 opened this issue Jul 7, 2024 · 0 comments

Labels: question (Further information is requested)

pxb5080 commented Jul 7, 2024
Question

I am building a sequence tagger that tags each character in a sentence. I have training data of a few million sentences, resulting in ~1 billion training examples, where one training example is one character with its corresponding label.
I am instantiating ColumnCorpus like this:

    from flair.datasets import ColumnCorpus

    # Column 0 holds the character, column 1 its label (the mapping shown
    # here is illustrative; data_path points at the corpus directory).
    columns = {0: 'text', 1: 'ner'}
    corpus = ColumnCorpus(data_path, columns,
                          in_memory=False,
                          train_file='train',
                          test_file='test',
                          dev_file='dev')
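
For reference, a minimal self-contained sketch of the column format this reader expects; the {0: 'text', 1: 'ner'} mapping and the toy labels are illustrative placeholders, not my actual schema:

    from pathlib import Path

    from flair.datasets import ColumnCorpus

    data_path = Path("toy_corpus")
    data_path.mkdir(exist_ok=True)

    # One character per line, tab-separated from its label; a blank line
    # ends a sentence.
    sample = "h\tB-WORD\ni\tI-WORD\n\no\tB-WORD\nk\tI-WORD\n\n"
    for split in ("train", "test", "dev"):
        (data_path / split).write_text(sample, encoding="utf-8")

    columns = {0: "text", 1: "ner"}
    corpus = ColumnCorpus(data_path, columns,
                          train_file="train", test_file="test", dev_file="dev")
    print(corpus)  # expect 2 train / 2 dev / 2 test sentences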

Initially, I was getting OOM errors, so I set in_memory=False. However, loading still takes forever and the job gets killed. I am using 2 GPUs. I have the following questions:

  1. Is ColumnCorpus the right data format for this data size, or is there a format better suited for this purpose?
  2. Is there a different way to instantiate ColumnCorpus that is better suited for large datasets? (A load-timing sketch follows below.)
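
As context for question 2, here is a hedged sketch (paths and the column mapping are placeholders) of how I would time the loader on a small slice of the train file before committing to the full corpus:

    # Sketch (assumed file layout): copy the first 10,000 sentences of the
    # full train file into a small sample directory and time ColumnCorpus
    # on it, to extrapolate the full-corpus load time.
    import time
    from pathlib import Path

    from flair.datasets import ColumnCorpus

    src = Path("data/train")        # placeholder path to the full train file
    sample_dir = Path("data_sample")
    sample_dir.mkdir(exist_ok=True)

    remaining, kept = 10_000, []
    with src.open(encoding="utf-8") as f:
        for line in f:
            kept.append(line)
            if line.strip() == "":  # blank line marks a sentence boundary
                remaining -= 1
                if remaining == 0:
                    break
    for split in ("train", "test", "dev"):
        (sample_dir / split).write_text("".join(kept), encoding="utf-8")

    start = time.perf_counter()
    corpus = ColumnCorpus(sample_dir, {0: "text", 1: "ner"},
                          train_file="train", test_file="test", dev_file="dev",
                          in_memory=False)
    print(f"Loaded 10k-sentence sample in {time.perf_counter() - start:.1f}s")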
pxb5080 added the question label Jul 7, 2024
pxb5080 closed this as completed Jul 7, 2024