
A little problem in train function #3

Open
largelymfs opened this issue Jul 18, 2014 · 2 comments

@largelymfs

Hi!
The code is great!

I am using this code to implement paragraph2vec, and the training may need to run for several iterations. If we call train several times, like this:

    for (int i = 0; i < n; i++)
        model.train(sentences);

the first iteration is fine, but each subsequent iteration uses more and more memory. After reviewing word2vec.h, I found that the following code may have a problem:

    #pragma omp parallel for
    for (size_t i = 0; i < n_sentences; ++i) {
        auto sentence = sentences[i].get();
        if (sentence->tokens_.empty())
            continue;
        size_t len = sentence->tokens_.size();
        for (size_t i = 0; i < len; ++i) {
            auto it = vocab_.find(sentence->tokens_[i]);
            if (it == vocab_.end()) continue;
            Word *word = it->second.get();
            // subsampling
            if (sample_ > 0) {
                float rnd = (sqrt(word->count_ / (sample_ * total_words)) + 1) * (sample_ * total_words) / word->count_;
                if (rnd < rng(eng)) continue;
            }
            sentence->words_.emplace_back(it->second.get());
        }
    }

The vector sentence->words_ keeps growing if we call the train function a second time. We can clear the vector first:

    #pragma omp parallel for
    for (size_t i = 0; i < n_sentences; ++i) {
        auto sentence = sentences[i].get();

        // By Largelymfs: drop the words collected by the previous call
        sentence->words_.clear();

        if (sentence->tokens_.empty())
            continue;
        size_t len = sentence->tokens_.size();
        for (size_t i = 0; i < len; ++i) {
            auto it = vocab_.find(sentence->tokens_[i]);
            if (it == vocab_.end()) continue;
            Word *word = it->second.get();
            // subsampling
            if (sample_ > 0) {
                float rnd = (sqrt(word->count_ / (sample_ * total_words)) + 1) * (sample_ * total_words) / word->count_;
                if (rnd < rng(eng)) continue;
            }
            sentence->words_.emplace_back(it->second.get());
        }
    }

Then the train function can safely be called in a loop.
Thanks a lot!
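
To make the growth pattern concrete, here is a minimal, self-contained sketch. The Sentence type below is a simplified, hypothetical stand-in for illustration, not the actual class from word2vec.h:

    #include <cstdio>
    #include <string>
    #include <vector>

    // Hypothetical, simplified stand-in for the Sentence type in word2vec.h.
    struct Sentence {
        std::vector<std::string> tokens_;
        std::vector<std::string> words_;   // filled during train(), never cleared
    };

    // Mimics the loop above: each call appends the surviving tokens again.
    void build_words(Sentence &s) {
        for (const auto &t : s.tokens_)
            s.words_.emplace_back(t);
    }

    int main() {
        Sentence s{{"the", "quick", "fox"}, {}};
        for (int epoch = 0; epoch < 3; ++epoch) {
            build_words(s);
            std::printf("epoch %d: words_.size() = %zu\n", epoch, s.words_.size());
        }
        // Prints 3, 6, 9: words_ keeps growing unless it is cleared first.
        return 0;
    }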

@jdeng
Owner

jdeng commented Jul 18, 2014

You can add one check to skip the vocab lookup step. Something like below:

    if (sentence->words_.empty()) {
        for (size_t i = 0; i < len; ++i) {
            auto it = vocab_.find(sentence->tokens_[i]);
            if (it == vocab_.end()) continue;
            Word *word = it->second.get();
            // subsampling
            if (sample_ > 0) {
                float rnd = (sqrt(word->count_ / (sample_ * total_words)) + 1) * (sample_ * total_words) / word->count_;
                if (rnd < rng(eng)) continue;
            }
            sentence->words_.emplace_back(it->second.get());
        }
    }
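
One difference between the two fixes worth noting: clearing words_ re-runs the vocabulary lookup and the subsampling on every call, so each pass draws a fresh random sample, whereas the empty() guard above reuses the sample built on the first pass and skips that work on later calls.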

@angelteers

Hi, I want to see the result of training on text8, but I ran into a lot of problems. I entered the command g++ main.cc, but got many errors. What should I do? Thanks in advance!
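
For what it's worth, the snippets in this thread use C++11 features (auto, emplace_back) and OpenMP pragmas, so a bare g++ main.cc enables neither. Something along these lines may get further; the exact flags are an assumption, so check the repository's README for the intended build command:

    g++ -std=c++11 -fopenmp -O2 main.cc -o word2vec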
