
Investigate marian-decoder memory usage #957

Open
gregtatum opened this issue Dec 13, 2024 · 8 comments

@gregtatum
Member

We're hitting some pretty large memory usage in Firefox when translating: ~300MB RSS for "tiny" models, and ~500MB RSS for "base" models. We should do some memory analysis here on the native build.

gregtatum self-assigned this Dec 13, 2024
@gregtatum
Member Author

This is a "tiny" model.

https://share.firefox.dev/3Bqa1T0

This was generated by running:

cd 3rd_party/browsermt-marian-dev/build

cmake .. -DCMAKE_BUILD_TYPE=RelWithDebInfo

make

model_dir=../../../data/remote-settings/models/decoder-tiny-en-lt-WhOkOOIoQlmo636ixpQ3kw

echo "Translate to Lithuanian" |
valgrind --tool=dhat ./marian-decoder \
  --models $model_dir/model.bin \
  --vocabs $model_dir/vocab.spm $model_dir/vocab.spm \
  --beam-size 1 \
  --workspace 16 \
  --cpu-threads 1
==6245== Total:     766,562,556 bytes in 304,686 blocks
==6245== At t-gmax: 403,802,022 bytes in 145,759 blocks
==6245== At t-end:  16,800 bytes in 261 blocks
==6245== Reads:     919,128,652 bytes
==6245== Writes:    679,042,897 bytes

Model size on disk: 17MB

Here is a summary of the "top offenders" for memory.

| Function | Memory |
| --- | --- |
| marian::ExpressionGraph::forward | 134MB |
| Translate::Translate threadpool | 134MB |
| Scorer | 41MB |
| Parameters::allocateForward | 41MB |
| binary::loadItems (this is the serialized model loaded into memory) | 41MB |
| SentencePiece | 8MB |

It seems that the serialized model is retained in memory, and ExpressionGraph::forward appears to be what loads the model into memory. I'm unclear on what the Translate::Translate threadpool is doing: it appears to be reserving the workspace, but I set that to a low value (16MB), so the 134MB attributed to it is unclear to me: https://github.com/browsermt/marian-dev/blob/master/src/translator/translator.h#L95-L120
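For reference, the linked code builds per-thread state and reserves a workspace for each worker, so memory multiplies with the thread count. A minimal sketch of that pattern (hypothetical names, not marian's actual classes):

#include <cstddef>
#include <memory>
#include <vector>

// Hedged sketch of the per-thread pattern in translator.h (hypothetical
// names): every worker thread reserves its own workspace and holds its own
// copy of the graph/scorer state, so resident memory scales with
// --cpu-threads even when --workspace is small.
struct Workspace {
  explicit Workspace(std::size_t megabytes) : buffer(megabytes * 1024 * 1024) {}
  std::vector<char> buffer;  // pre-reserved scratch memory for the graph
};

struct PerThreadState {
  std::unique_ptr<Workspace> workspace;
  // graph, scorers, and vocabs would also live here in marian itself
};

std::vector<PerThreadState> makeThreadStates(std::size_t numThreads,
                                             std::size_t workspaceMB) {
  std::vector<PerThreadState> states(numThreads);
  for (auto& state : states)
    state.workspace = std::make_unique<Workspace>(workspaceMB);
  return states;
}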

@gregtatum
Member Author

Here is the "base" model: https://share.firefox.dev/4gAIE7B

@gregtatum
Member Author

Here is a better profile with threads disabled: https://share.firefox.dev/3DlI92Z

@gregtatum
Member Author

Here is another profile where I do a quick exit right before inference. "Bytes at End" has the relevant graph.

        scorers_[id] = scorers;
        std::quick_exit(0);
        graph->forward();

https://share.firefox.dev/3VMJqX7

@gregtatum
Member Author

The Wemb field for embeddings is a big culprit in the memory inflation when running the models. The embeddings are a big part of the models, and they are shipped as quantized int8 values. However, when the model is loaded they are inflated to float32 values, quadrupling their size.

For instance, in a "base" model the Wemb data is stored as 16_384_256 bytes, but once inflated in memory it takes 65_536_000 bytes.
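A minimal sketch of why the in-memory size quadruples (the scale handling here is an assumption; marian's actual dequantization path may differ):

#include <cstddef>
#include <cstdint>
#include <vector>

// Each 1-byte int8 weight becomes a 4-byte float32, so the inflated buffer
// is 4x the size of the quantized one that was shipped on disk.
std::vector<float> dequantize(const std::vector<std::int8_t>& quantized,
                              float scale) {
  std::vector<float> inflated(quantized.size());  // 4x the bytes of the input
  for (std::size_t i = 0; i < quantized.size(); ++i)
    inflated[i] = static_cast<float>(quantized[i]) * scale;
  return inflated;
}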

@gregtatum
Member Author

gregtatum commented Dec 18, 2024

Inside of marian::io::binary::loadItems, the call to prepareAndTransposeB is responsible for another 52MB allocation while transposing Wemb, but that allocation is not retained during the inference process, so it may not be an issue.

This is backed by bindings to gemmology.

It kind of looks like this one won't matter on the Wasm side, as it appears to be these calls to genericMalloc which are the culprits.
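For context, an out-of-place prepare/transpose step necessarily allocates a second buffer the size of the matrix it rearranges, which would explain a large transient allocation like this one. A generic sketch (not gemmology's actual code):

#include <cstddef>
#include <vector>

// Out-of-place transpose: the destination is a second full-size buffer, so
// the matrix briefly occupies twice its size while being rearranged.
std::vector<float> transpose(const std::vector<float>& src,
                             std::size_t rows,
                             std::size_t cols) {
  std::vector<float> dst(src.size());  // transient second copy
  for (std::size_t r = 0; r < rows; ++r)
    for (std::size_t c = 0; c < cols; ++c)
      dst[c * rows + r] = src[r * cols + c];
  return dst;
}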

@gregtatum
Member Author

gregtatum commented Dec 18, 2024

I hacked in a shared ptr for the Item's data: https://share.firefox.dev/4iL0y9t

With a quick exit right before the inference step:

230MB vs 823MB

Here is the full execution: https://share.firefox.dev/400QbqX
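The hack is essentially this shape of change (a sketch with assumed member names, not the exact patch): the Item's byte buffer becomes a std::shared_ptr, so the frequent copies of Item bump a refcount instead of duplicating the model data.

#include <memory>
#include <string>
#include <vector>

// Sketch of the shared-pointer hack for marian::io::Item (member names are
// assumptions). The implicitly generated copy operations now share `bytes`
// rather than copying the underlying model data.
struct Item {
  std::shared_ptr<std::vector<char>> bytes;  // was a value-owned buffer
  std::string name;
  // shape, type, etc. omitted for brevity
};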

@gregtatum
Member Author

gregtatum commented Dec 20, 2024

The results on the Wasm side are quite different, as it hits separate code paths. There are many copies of the model sitting around in memory. The first trick will be to transform the Item struct to use a shared pointer for its data, since the Item is copied multiple times in the code. The following instrumentation logs each copy and move to confirm where the copies happen:

diff --git a/src/common/io_item.h b/src/common/io_item.h
index 24968ec4..e06272c3 100755
--- a/src/common/io_item.h
+++ b/src/common/io_item.h
@@ -3,6 +3,7 @@
 #include "common/shape.h"
 #include "common/types.h"
 
+#include <emscripten.h>
 #include <string>
 
 namespace marian {
@@ -27,7 +28,25 @@ struct Item {
         mapped(other.mapped),
         name(other.name),
         shape(other.shape),
-        type(other.type) {}
+        type(other.type) {
+    printf("!!! Item copy constructor %s (count %ld) %p\n",
+           name.c_str(),
+           other.bytes.use_count(),
+           other.bytes.get());
+
+    // clang-format off
+    EM_ASM({
+      const name = UTF8ToString($0);
+      const size = $1;
+      const pointer = $2;
+
+      ChromeUtils.addProfilerMarker(
+        `Item() "${name}" ${size} 0x${pointer.toString(16)}`,
+        { captureStack: true }
+      );
+    }, name.c_str(), bytes->size(), reinterpret_cast<size_t>(bytes.get()));
+    // clang-format on
+  }
 
   // Copy assignment operator
   Item& operator=(const Item& other) {
@@ -38,6 +57,7 @@ struct Item {
       name = other.name;
       shape = other.shape;
       type = other.type;
+      printf("!!! copy assignment operator %s\n", name.c_str());
     }
     return *this;
   }
@@ -52,6 +72,7 @@ struct Item {
         type(other.type) {
     other.ptr = nullptr;
     other.mapped = false;
+    printf("!!! move constructor %s\n", name.c_str());
   }
 
   // Move assignment operator
@@ -66,6 +87,8 @@ struct Item {
 
       other.ptr = nullptr;
       other.mapped = false;
+
+      printf("!!! Move assignment operator %s\n", name.c_str());
     }
     return *this;
   }

The Item is an immutable data source that is then read into a Tensor, which is backed by a "Device" with an allocator. This abstraction exists so that these computations can happen on both the CPU side and the GPU side; in the Wasm implementation we only need the CPU. The Tensor is just a pointer and a data shape, so possibly we could construct the tensors directly from the model memory. This is somewhat hard, as the tensor is passed, already allocated, to a lambda that captures the item by value and then copies it into the allocated Tensor.
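A zero-copy construction could look roughly like this (hypothetical names; marian's actual Tensor and allocator APIs differ in their details):

#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical zero-copy view: a tensor that borrows the model's bytes
// instead of copying them into a Device allocation. Holding a shared_ptr to
// the blob keeps the mapped model memory alive as long as any view exists.
struct TensorView {
  const float* data;                               // points into the model blob
  std::vector<std::size_t> shape;                  // e.g. rows x cols for Wemb
  std::shared_ptr<const std::vector<char>> owner;  // keeps the blob alive
};

TensorView viewFromModel(std::shared_ptr<const std::vector<char>> blob,
                         std::size_t byteOffset,
                         std::vector<std::size_t> shape) {
  TensorView view;
  view.data = reinterpret_cast<const float*>(blob->data() + byteOffset);
  view.shape = std::move(shape);
  view.owner = std::move(blob);
  return view;
}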

A quicker, more immediate fix is to at least release the backing data after the graph is built. There are currently two copies of the model: the first is the model that is passed into the engine from JavaScript, and the second is the items owned by ScorerWrapper.

So if we add:

class ScorerWrapper : public Scorer {
  // Assumes a corresponding virtual clearItems() is declared on Scorer.
  void clearItems() override { items_.clear(); }
};

And then clear both copies:

diff --git a/inference/src/translator/translation_model.cpp b/inference/src/translator/translation_model.cpp
index 84e5bd30..2004200e 100644
--- a/inference/src/translator/translation_model.cpp
+++ b/inference/src/translator/translation_model.cpp

       scorer->setShortlistGenerator(shortlistGenerator_);
     }
   }
+  printf("!!! Clearing the model memory");
   graph->forward();
+
+  for (auto scorer : scorerEnsemble) {
+    scorer->clearItems();
+  }
+
+  memory_.clear();
 }

While clearing the memory like this could be a bit unsafe outside of Wasm, it works for how we use our current implementation. Perhaps we could add some more runtime checks for accessing the memory.
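One form such a runtime check could take (an assumption, not existing marian code) is a guarded container that asserts the items haven't been cleared before anything reads them:

#include <cassert>
#include <utility>
#include <vector>

// Hypothetical guard: once the backing items are cleared, any further access
// trips an assertion instead of silently reading freed or empty data.
template <typename T>
class ClearableStorage {
public:
  explicit ClearableStorage(std::vector<T> items) : items_(std::move(items)) {}

  const std::vector<T>& items() const {
    assert(!cleared_ && "model items were cleared; access after clear is a bug");
    return items_;
  }

  void clear() {
    items_.clear();
    items_.shrink_to_fit();  // actually release the memory
    cleared_ = true;
  }

private:
  std::vector<T> items_;
  bool cleared_ = false;
};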

With this current change I've gotten the "tiny" model's memory from 300MB to 250MB, and the "base" model from 500MB to 371MB.
