
Investigate marian-decoder memory usage #957

Open
gregtatum opened this issue Dec 13, 2024 · 8 comments

@gregtatum
Member

We're hitting some pretty large memory usage in Firefox when translating: ~300MB RSS for "tiny" models, and ~500MB RSS for "base" models. We should do some memory analysis here on the native build.

gregtatum self-assigned this Dec 13, 2024
@gregtatum
Member Author

This is a "tiny" model.

https://share.firefox.dev/3Bqa1T0

This was generated by running:

cd 3rd_party/browsermt-marian-dev/build

cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo

make

model_dir=../../../data/remote-settings/models/decoder-tiny-en-lt-WhOkOOIoQlmo636ixpQ3kw

echo "Translate to Lithuanian" |
valgrind --tool=dhat ./marian-decoder \
  --models $model_dir/model.bin \
  --vocabs $model_dir/vocab.spm $model_dir/vocab.spm \
  --beam-size 1 \
  --workspace 16 \
  --cpu-threads 1
==6245== Total:     766,562,556 bytes in 304,686 blocks
==6245== At t-gmax: 403,802,022 bytes in 145,759 blocks
==6245== At t-end:  16,800 bytes in 261 blocks
==6245== Reads:     919,128,652 bytes
==6245== Writes:    679,042,897 bytes

Model size on disk: 17MB

Here is a summary of the "top offenders" for memory.

| Function | Memory |
| --- | --- |
| marian::ExpressionGraph::forward | 134MB |
| Translate::Translate threadpool | 134MB |
| Scorer | 41MB |
| Parameters::allocateForward | 41MB |
| binary::loadItems (this is the serialized model loaded into memory) | 41MB |
| SentencePiece | 8MB |

It seems that the serialized model is retained in memory, and ExpressionGraph::forward appears to be what loads the model into memory. I'm unclear on what the Translate::Translate threadpool is doing: it appears to be reserving the workspace, but I set that to a low value (16MB), so the 134MB attributed to it is unclear to me: https://github.com/browsermt/marian-dev/blob/master/src/translator/translator.h#L95-L120
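For reference, the linked code builds per-thread state and reserves a workspace for each worker, so memory multiplies with the thread count. A minimal sketch of that pattern (hypothetical names, not marian's actual classes):

#include <cstddef>
#include <memory>
#include <vector>

// Hedged sketch of the per-thread pattern in translator.h (hypothetical
// names): every worker thread reserves its own workspace and holds its own
// copy of the graph/scorer state, so resident memory scales with
// --cpu-threads even when --workspace is small.
struct Workspace {
  explicit Workspace(std::size_t megabytes) : buffer(megabytes * 1024 * 1024) {}
  std::vector<char> buffer;  // pre-reserved scratch memory for the graph
};

struct PerThreadState {
  std::unique_ptr<Workspace> workspace;
  // graph, scorers, and vocabs would also live here in marian itself
};

std::vector<PerThreadState> makeThreadStates(std::size_t numThreads,
                                             std::size_t workspaceMB) {
  std::vector<PerThreadState> states(numThreads);
  for (auto& state : states)
    state.workspace = std::make_unique<Workspace>(workspaceMB);
  return states;
}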

@gregtatum
Member Author

Here is the "base" model: https://share.firefox.dev/4gAIE7B

@gregtatum
Member Author

Here is a better profile with threads disabled: https://share.firefox.dev/3DlI92Z

@gregtatum
Member Author

Here is another profile where I do a quick exit right before inference. "Bytes at End" has the relevant graph.

        scorers_[id] = scorers;
        std::quick_exit(0);
        graph->forward();

https://share.firefox.dev/3VMJqX7

@gregtatum
Member Author

The Wemb field for embeddings is a big culprit in the memory inflation when running the models. The embeddings are a big part of the models, and they are shipped as quantized int8 values. However, when the model is loaded they are inflated to float32 values, quadrupling their size.

For instance, in a "base" model the Wemb data is stored as 16_384_256 bytes, but once inflated in memory it takes 65_536_000 bytes.
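A minimal sketch of why the in-memory size quadruples (the scale handling here is an assumption; marian's actual dequantization path may differ):

#include <cstddef>
#include <cstdint>
#include <vector>

// Each 1-byte int8 weight becomes a 4-byte float32, so the inflated buffer
// is 4x the size of the quantized one that was shipped on disk.
std::vector<float> dequantize(const std::vector<std::int8_t>& quantized,
                              float scale) {
  std::vector<float> inflated(quantized.size());  // 4x the bytes of the input
  for (std::size_t i = 0; i < quantized.size(); ++i)
    inflated[i] = static_cast<float>(quantized[i]) * scale;
  return inflated;
}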

@gregtatum
Member Author

gregtatum commented Dec 18, 2024

Inside of marian::io::binary::loadItems, the call to prepareAndTransposeB is responsible for another 52MB allocation while transposing Wemb, but that allocation is not retained during the inference process, so it may not be an issue.

This is backed by bindings to gemmology.

It kind of looks like this one won't matter on the Wasm side, as it appears to be these calls to genericMalloc which are the culprits.
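For context, an out-of-place prepare/transpose step necessarily allocates a second buffer the size of the matrix it rearranges, which would explain a large transient allocation like this one. A generic sketch (not gemmology's actual code):

#include <cstddef>
#include <vector>

// Out-of-place transpose: the destination is a second full-size buffer, so
// the matrix briefly occupies twice its size while being rearranged.
std::vector<float> transpose(const std::vector<float>& src,
                             std::size_t rows,
                             std::size_t cols) {
  std::vector<float> dst(src.size());  // transient second copy
  for (std::size_t r = 0; r < rows; ++r)
    for (std::size_t c = 0; c < cols; ++c)
      dst[c * rows + r] = src[r * cols + c];
  return dst;
}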

@gregtatum
Member Author

gregtatum commented Dec 18, 2024

I hacked in a shared ptr for the Item's data: https://share.firefox.dev/4iL0y9t

With a quick exit right before the inference step:

230MB vs 823MB

Here is the full execution: https://share.firefox.dev/400QbqX
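The hack is essentially this shape of change (a sketch with assumed member names, not the exact patch): the Item's byte buffer becomes a std::shared_ptr, so the frequent copies of Item bump a refcount instead of duplicating the model data.

#include <memory>
#include <string>
#include <vector>

// Sketch of the shared-pointer hack for marian::io::Item (member names are
// assumptions). The implicitly generated copy operations now share `bytes`
// rather than copying the underlying model data.
struct Item {
  std::shared_ptr<std::vector<char>> bytes;  // was a value-owned buffer
  std::string name;
  // shape, type, etc. omitted for brevity
};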

@gregtatum
Member Author

gregtatum commented Dec 20, 2024

The results on the Wasm side are quite different, as it hits separate code paths. There are many copies of the model sitting around in memory. The first trick will be to transform the Item struct to use a shared pointer for its data, since the Item is copied multiple times in the code. The following instrumentation logs each copy and move to confirm where the copies happen:

diff --git a/src/common/io_item.h b/src/common/io_item.h
index 24968ec4..e06272c3 100755
--- a/src/common/io_item.h
+++ b/src/common/io_item.h
@@ -3,6 +3,7 @@
 #include "common/shape.h"
 #include "common/types.h"
 
+#include <emscripten.h>
 #include <string>
 
 namespace marian {
@@ -27,7 +28,25 @@ struct Item {
         mapped(other.mapped),
         name(other.name),
         shape(other.shape),
-        type(other.type) {}
+        type(other.type) {
+    printf("!!! Item copy constructor %s (count %ld) %p\n",
+           name.c_str(),
+           other.bytes.use_count(),
+           other.bytes.get());
+
+    // clang-format off
+    EM_ASM({
+      const name = UTF8ToString($0);
+      const size = $1;
+      const pointer = $2;
+
+      ChromeUtils.addProfilerMarker(
+        `Item() "${name}" ${size} 0x${pointer.toString(16)}`,
+        { captureStack: true }
+      );
+    }, name.c_str(), bytes->size(), reinterpret_cast<size_t>(bytes.get()));
+    // clang-format on
+  }
 
   // Copy assignment operator
   Item& operator=(const Item& other) {
@@ -38,6 +57,7 @@ struct Item {
       name = other.name;
       shape = other.shape;
       type = other.type;
+      printf("!!! copy assignment operator %s\n", name.c_str());
     }
     return *this;
   }
@@ -52,6 +72,7 @@ struct Item {
         type(other.type) {
     other.ptr = nullptr;
     other.mapped = false;
+    printf("!!! move constructor %s\n", name.c_str());
   }
 
   // Move assignment operator
@@ -66,6 +87,8 @@ struct Item {
 
       other.ptr = nullptr;
       other.mapped = false;
+
+      printf("!!! Move assignment operator %s\n", name.c_str());
     }
     return *this;
   }

The Item is an immutable data source that is then read into a Tensor, which is backed by a "Device" with an allocator. This abstraction exists so that these computations can happen on both the CPU side and the GPU side; in the Wasm implementation we only need the CPU. The Tensor is just a pointer and a data shape, so possibly we could construct the tensors directly from the model memory. This is somewhat hard, as the tensor is passed, already allocated, to a lambda that captures the item by value and then copies it into the allocated Tensor.
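A zero-copy construction could look roughly like this (hypothetical names; marian's actual Tensor and allocator APIs differ in their details):

#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical zero-copy view: a tensor that borrows the model's bytes
// instead of copying them into a Device allocation. Holding a shared_ptr to
// the blob keeps the mapped model memory alive as long as any view exists.
struct TensorView {
  const float* data;                               // points into the model blob
  std::vector<std::size_t> shape;                  // e.g. rows x cols for Wemb
  std::shared_ptr<const std::vector<char>> owner;  // keeps the blob alive
};

TensorView viewFromModel(std::shared_ptr<const std::vector<char>> blob,
                         std::size_t byteOffset,
                         std::vector<std::size_t> shape) {
  TensorView view;
  view.data = reinterpret_cast<const float*>(blob->data() + byteOffset);
  view.shape = std::move(shape);
  view.owner = std::move(blob);
  return view;
}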

A quicker, more immediate fix is to at least release the backing data after the graph is built. There are currently two copies of the model: the first is the model that is passed into the engine from JavaScript, and the second is the items owned by ScorerWrapper.

So if we add:

class ScorerWrapper : public Scorer {
  // Assumes a corresponding virtual clearItems() is declared on Scorer.
  void clearItems() override { items_.clear(); }
};

And then clear both copies:

diff --git a/inference/src/translator/translation_model.cpp b/inference/src/translator/translation_model.cpp
index 84e5bd30..2004200e 100644
--- a/inference/src/translator/translation_model.cpp
+++ b/inference/src/translator/translation_model.cpp

       scorer->setShortlistGenerator(shortlistGenerator_);
     }
   }
+  printf("!!! Clearing the model memory");
   graph->forward();
+
+  for (auto scorer : scorerEnsemble) {
+    scorer->clearItems();
+  }
+
+  memory_.clear();
 }

While clearing the memory like this could be a bit unsafe outside of Wasm, it works for how we use our current implementation. Perhaps we could add some more runtime checks for accessing the memory.
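One form such a runtime check could take (an assumption, not existing marian code) is a guarded container that asserts the items haven't been cleared before anything reads them:

#include <cassert>
#include <utility>
#include <vector>

// Hypothetical guard: once the backing items are cleared, any further access
// trips an assertion instead of silently reading freed or empty data.
template <typename T>
class ClearableStorage {
public:
  explicit ClearableStorage(std::vector<T> items) : items_(std::move(items)) {}

  const std::vector<T>& items() const {
    assert(!cleared_ && "model items were cleared; access after clear is a bug");
    return items_;
  }

  void clear() {
    items_.clear();
    items_.shrink_to_fit();  // actually release the memory
    cleared_ = true;
  }

private:
  std::vector<T> items_;
  bool cleared_ = false;
};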

With this current change I've gotten the "tiny" model's memory from 300MB to 250MB, and the "base" model from 500MB to 371MB.
