
A little problem in train function #3

Open
largelymfs opened this issue Jul 18, 2014 · 2 comments

@largelymfs

Hi!
The code is great!

I am using this code to implement paragraph2vec, and the training may need to run for several iterations. If we call train several times, like this:

    for (int i = 0; i < n; i++)
        model.train(sentences);

the first iteration is fine, but each subsequent iteration uses more and more memory. After reviewing word2vec.h, I found that the following code may have a problem:

    #pragma omp parallel for
    for (size_t i = 0; i < n_sentences; ++i) {
        auto sentence = sentences[i].get();
        if (sentence->tokens_.empty())
            continue;
        size_t len = sentence->tokens_.size();
        for (size_t i = 0; i < len; ++i) {
            auto it = vocab_.find(sentence->tokens_[i]);
            if (it == vocab_.end()) continue;
            Word *word = it->second.get();
            // subsampling
            if (sample_ > 0) {
                float rnd = (sqrt(word->count_ / (sample_ * total_words)) + 1) * (sample_ * total_words) / word->count_;
                if (rnd < rng(eng)) continue;
            }
            sentence->words_.emplace_back(it->second.get());
        }
    }

The vector sentence->words_ keeps growing if we call the train function a second time. We can clear the vector first:

    #pragma omp parallel for
    for (size_t i = 0; i < n_sentences; ++i) {
        auto sentence = sentences[i].get();

        // By Largelymfs: drop the words collected by the previous call
        sentence->words_.clear();

        if (sentence->tokens_.empty())
            continue;
        size_t len = sentence->tokens_.size();
        for (size_t i = 0; i < len; ++i) {
            auto it = vocab_.find(sentence->tokens_[i]);
            if (it == vocab_.end()) continue;
            Word *word = it->second.get();
            // subsampling
            if (sample_ > 0) {
                float rnd = (sqrt(word->count_ / (sample_ * total_words)) + 1) * (sample_ * total_words) / word->count_;
                if (rnd < rng(eng)) continue;
            }
            sentence->words_.emplace_back(it->second.get());
        }
    }

Then the train function can safely be called in a loop.
Thanks a lot!
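
To make the growth pattern concrete, here is a minimal, self-contained sketch. The Sentence type below is a simplified, hypothetical stand-in for illustration, not the actual class from word2vec.h:

    #include <cstdio>
    #include <string>
    #include <vector>

    // Hypothetical, simplified stand-in for the Sentence type in word2vec.h.
    struct Sentence {
        std::vector<std::string> tokens_;
        std::vector<std::string> words_;   // filled during train(), never cleared
    };

    // Mimics the loop above: each call appends the surviving tokens again.
    void build_words(Sentence &s) {
        for (const auto &t : s.tokens_)
            s.words_.emplace_back(t);
    }

    int main() {
        Sentence s{{"the", "quick", "fox"}, {}};
        for (int epoch = 0; epoch < 3; ++epoch) {
            build_words(s);
            std::printf("epoch %d: words_.size() = %zu\n", epoch, s.words_.size());
        }
        // Prints 3, 6, 9: words_ keeps growing unless it is cleared first.
        return 0;
    }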

@jdeng
Owner

jdeng commented Jul 18, 2014

You can add one check to skip the vocab lookup step. Something like below:

    if (sentence->words_.empty()) {
        for (size_t i = 0; i < len; ++i) {
            auto it = vocab_.find(sentence->tokens_[i]);
            if (it == vocab_.end()) continue;
            Word *word = it->second.get();
            // subsampling
            if (sample_ > 0) {
                float rnd = (sqrt(word->count_ / (sample_ * total_words)) + 1) * (sample_ * total_words) / word->count_;
                if (rnd < rng(eng)) continue;
            }
            sentence->words_.emplace_back(it->second.get());
        }
    }
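
One difference between the two fixes worth noting: clearing words_ re-runs the vocabulary lookup and the subsampling on every call, so each pass draws a fresh random sample, whereas the empty() guard above reuses the sample built on the first pass and skips that work on later calls.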

@angelteers

Hi, I want to see the result of training on text8, but I ran into a lot of problems. I entered the command g++ main.cc, but got many errors. What should I do? Thanks in advance!
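
For what it's worth, the snippets in this thread use C++11 features (auto, emplace_back) and OpenMP pragmas, so a bare g++ main.cc enables neither. Something along these lines may get further; the exact flags are an assumption, so check the repository's README for the intended build command:

    g++ -std=c++11 -fopenmp -O2 main.cc -o word2vec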
