Should use mmap for model loading #91
Comments
@ggerganov I'm working on putting together the PR. Almost done. I don't know anything about the order that ggml accesses the weights in. Would you say that it's sequential? If so, there's also …
You probably don't want to use madvise; what may be more appropriate is to use …
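For reference, a minimal sketch (assumptions only, not project code; the helper name and error handling are made up) of what pairing mmap() with a sequential-access hint might look like on POSIX systems:
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static void *map_weights(const char *path, size_t *out_size) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }
    void *addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                        /* the mapping stays valid after close */
    if (addr == MAP_FAILED) return NULL;
    /* if the weights really are read front-to-back, this lets the kernel
       read ahead aggressively; drop it if access turns out to be random */
    posix_madvise(addr, (size_t)st.st_size, POSIX_MADV_SEQUENTIAL);
    *out_size = (size_t)st.st_size;
    return addr;
}
Whether POSIX_MADV_SEQUENTIAL actually helps depends on whether ggml really walks the weights front-to-back, which is exactly the open question above.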
That will definitely happen with … I'm still getting up to speed on this codebase, so I'd like to hear everyone's ideas on how best we'd ensure object data structures (or their floating point content at the very least) could be made directly mappable, thus side-stepping the loading process. One dirty hack I've been considering, for example, would be overriding the memory allocators to get all objects at a fixed address, and persisting that to disk. That way, when all the C/C++ objects are loaded back into memory using …
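A rough sketch of that dirty hack, purely to make the idea concrete (the fixed address, sizes, and function names are all assumptions, and MAP_FIXED at an arbitrary address is exactly the fragile part):
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define ARENA_BASE ((void *)0x200000000000ull)   /* hypothetical fixed address */
#define ARENA_SIZE ((size_t)1 << 34)             /* 16 GiB of address space */

static char *arena_cur;

static void *arena_init(void) {                  /* first run: empty arena */
    void *p = mmap(ARENA_BASE, ARENA_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (p == MAP_FAILED) return NULL;
    arena_cur = (char *)p;
    return p;
}

static void *arena_alloc(size_t n) {             /* stand-in for malloc() */
    void *p = arena_cur;
    arena_cur += (n + 31) & ~(size_t)31;         /* keep 32-byte alignment */
    return p;
}

static int arena_save(const char *path) {        /* persist the whole heap */
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0) return -1;
    ssize_t rc = write(fd, ARENA_BASE, (size_t)(arena_cur - (char *)ARENA_BASE));
    close(fd);
    return rc < 0 ? -1 : 0;
}

static void *arena_restore(const char *path, size_t size) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    /* map the saved heap back at the same address, so the pointers stored
       inside it on the previous run are still valid */
    void *p = mmap(ARENA_BASE, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_FIXED, fd, 0);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}
The obvious caveats are that the chosen address must be free in every run, partial write()s aren't handled, and ASLR or a different build could invalidate the saved heap, which is why the pointer-fixup and file-format approaches discussed later are more robust.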
@jart I think … (see llama.cpp lines 569 to 575 at commit 7213110).
All tensors … Edit: so above I incorrectly referenced the "eval" … (see lines 228 to 240 at commit 7213110).
Same stuff. If the pointer is …
In other words, there is a set memory cap?
For loading from a physical/network drive, resharding the larger models to a single file might help, imho, whereas loading multiple files in parallel would be slower. https://github.com/jankais3r/LLaMA_MPS/blob/main/reshard.py
I'm getting ready to take another swing at it. My idea of what to do so far: …
I can do 1 and I'll submit a PR for that shortly, but it isn't super clear to me how memory is laid out so that I can do 2. In particular, I'm wondering about the "whatever else needs to be done" part. I'm certain that I'm missing something, and it probably wouldn't be obvious to me what I'm breaking even after hours of monkeying.
My concern with doing that is, wouldn't it effectively double the disk usage? LLaMA is big enough that folks are already likely to be stretched thin on disk space after creating the second copy needed to quantize the model. I'm still working on studying the codebase to determine what transformations exactly need to be made at runtime. For example, if it's just a bunch of float16's on disk, and we're using a bunch of float16's in memory, then I don't see why the buffer field of these tensors couldn't just be populated with pointers to the appropriate positions in the file. Unless of course it needed to be reshaped or shifted to meet things like AVX alignment requirements. In that case, we'd ideally want to modify the quantizer script so that it generates something suitable for our purposes, so that only a single conversion step needs to be performed.
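To make the "just point at the file" idea concrete, a hypothetical fragment (mapped_base and file_offset are placeholder names, not ggml API):
#include <stdint.h>

/* hypothetical helper: if the on-disk layout already matches the in-memory
   layout and alignment, "loading" a tensor is just pointer arithmetic into
   the mapping, with no read() or memcpy() */
static void *tensor_data_in_mapping(void *mapped_base, uint64_t file_offset) {
    return (char *)mapped_base + file_offset;
}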
(comment is more a #202 thing) This is the way I was thinking about it: after the model loads and the prompt is tokenized, create a hash of the context. If that hash exists in the cache directory (and the flag is set), load state. Saving the state has two modes, imho: post-prompt (only works when state hasn't been loaded) and at end of generation. The post-prompt mode allows jump-starting the model. End saving allows starting up where one left off. As to the memory organization, I'd leave that to existing code. The hash would act as pseudo-verification that it's okay to load a buffer of bytes. The model and prompt would need to be the same, maybe even other options. Just throwing out some head-space. Haven't started coding anything.
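A very rough sketch of that cache-key idea (FNV-1a is an arbitrary choice here, and hashing only the model path and prompt is an assumption; real code would likely fold in the sampling options too):
#include <stdint.h>
#include <string.h>

/* sketch: derive a cache key from the model path and the prompt, and use
   it to name the saved-state file, e.g. "%016llx.state" */
static uint64_t fnv1a(uint64_t h, const void *data, size_t len) {
    const unsigned char *p = (const unsigned char *)data;
    for (size_t i = 0; i < len; i++) { h ^= p[i]; h *= 0x100000001b3ull; }
    return h;
}

static uint64_t state_cache_key(const char *model_path, const char *prompt) {
    uint64_t h = 0xcbf29ce484222325ull;   /* FNV-1a offset basis */
    h = fnv1a(h, model_path, strlen(model_path));
    h = fnv1a(h, prompt, strlen(prompt));
    return h;
}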
Wait. What problem are we trying to solve here exactly? Are we trying to (1) eliminate the three second startup delay? Or are we trying to (2) store the changes made to memory back to disk? Because if your goal is to solve (2), then the only thing you need to save are the random seed and the prompt, since that would restore the state deterministically. Right now I'm focusing on (1), since having fast mmap() loading would not change llama.cpp's behavior, and would instead simply make it go faster. If you want (2), then this change could be the stepping stone you need. All you'd have to do is change …
I did not intend to expand the meaning of the thread. (2) should probably be addressed elsewhere. #202
@jart It would double the disk usage, yes. But so does converting the model, and so does quantizing it. I think people are prepared for this. You're right, though, in that the scripts that convert the model are probably the best way to do this. I was only thinking about implementing a cache for …
So I added some logging statements to track the read() operations that are happening. It's 200k+ lines that look like this: …
All it's doing is (1) reshaping and (2) aligning the data in the file. That's why llama.cpp takes several seconds to start. It wouldn't make sense to cache a bunch of memcpy() operations. The quickest thing we could do is introduce a third conversion step that creates a new file format, where the data is in the appropriate shape and alignment ahead of time. Then we could work our way backwards through the conversion tools to reduce the number of pipeline chores from 3 to 1.
Here's another reason why this issue is so important. I just ran the 13B model with F16C on my workstation with 32GB of RAM. The model, once loaded, comes very close to hitting the physical memory limit, using maybe ~30GB peak RSS. Bringing memory up to the edge of swapping effectively compounds tragedy, since the kernel reacts by dropping its file caches. If we were using mmap(), then the kernel would know that the loaded pages and the file pages are the same thing. But since we're copying the memory, the file cache goes away, and loading ends up taking a minute each time.
@apaz-cli Have you attempted implementing the thing you proposed yet? It might work if you use …
@jart I have no idea how to support that in a portable way. I haven't dug too deep into it. I'm halfway through implementing part 1. The troubling thing is actually the default implementation for opening files with the C/C++ stdlib. There is no portable way in C++11 to check the size of a file or binary stream, not even with … See this link to the C standard. The C++ standard says the same about its own streams. Although it's UB, the …
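For what it's worth, a sketch of how the file size is usually obtained without fseek/ftell on a binary stream (the platform APIs below do exist, but the helper itself is illustrative; since C++17 there is also std::filesystem::file_size):
#if defined(_WIN32)
#include <windows.h>
static long long file_size_bytes(const char *path) {
    WIN32_FILE_ATTRIBUTE_DATA fad;
    if (!GetFileAttributesExA(path, GetFileExInfoStandard, &fad)) return -1;
    return ((long long)fad.nFileSizeHigh << 32) | fad.nFileSizeLow;
}
#else
#include <sys/stat.h>
static long long file_size_bytes(const char *path) {
    struct stat st;                      /* fstat() works too, given an fd */
    return stat(path, &st) == 0 ? (long long)st.st_size : -1;
}
#endif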
I've implemented a working prototype for UNIX systems: 5b8023d. Your LLaMA AGI models will now load instantly without any user-visible latency. The video (llama-mmap2.mp4) should be all the proof you need.

I did this by writing an append-only malloc() that lets us transactionally capture heap allocations, so they can be restored on subsequent runs. It's a ~200 LOC change that only took me a few hours and it worked the first time. Much easier than the alternative, which likely would have entailed serializing C++ STL objects.

This change could be productionized as it stands. I'd need to add the WIN32 mmap() code. I'd also need to store the flags, and possibly file timestamps for comparison, in the serialized object too, since right now the state change can only happen when …

However, I still firmly believe this change is not the right thing to do. The file format should be fixed so that it's aligned and doesn't need to be reshaped. Doing that will take 1k+ lines of refactorings, plus a data migration, plus changes to the GGML API. I don't think it's possible to use the ggml_init_params::mem_buffer field, because that memory region contains pointers. That makes it basically the same as my malloc() capturing code, except less generalized. If you wanted to mmap() that field in a portable way, you'd have to do what linkers do and apply fixups to all the pointers. What I thought might make more sense is doing a …

I'll also note that the gains here are mostly due to not copying memory anymore, and better cooperation with the kernel's page manager. We unfortunately aren't getting any additional gains from lazy page loading, since this is a dense model. To generate a single token, every single page in the model file needs to be loaded. What this means is that first runs that load from spinning disk are still going to be slow, even though the average case has greatly improved. I don't view that as a problem, since having the better cooperation with the page manager ensures that the kernel file caches are much less likely to be evicted.
On devices with less RAM and no swap (iOS), will this allow the inference to proceed without hitting the memory limit by evicting weights from the page cache during inference? For the pointers issue, you could make a custom smart pointer type that uses a global variable to do the fixups at runtime (not sure if this would have a perf impact though):
#include <cassert>
#include <cstddef>
#include <cstdio>

void *ctx_base;  // base address of the mapped context in the current run

template<typename T>
class ctx_ptr {
    std::ptrdiff_t offset;  // stored as an offset so it survives remapping
public:
    inline ctx_ptr(T *value)
        : offset(reinterpret_cast<char *>(value) - static_cast<char *>(ctx_base)) {
        assert(offset >= 0);
    }
    inline T *operator->() const {
        return reinterpret_cast<T *>(static_cast<char *>(ctx_base) + offset);
    }
};

// usage (illustrative addresses and names, as in the original comment):
ctx_base = (void *)0x1234;
ctx_ptr<ggml_whatever> ptr(the_raw_ptr);
ctx_base = (void *)0x5678;
printf("%s\n", ptr->some_field);
mmap(2) allows using files larger than available RAM.
Sorry, I missed that.
This is possible for non-sharded models. For example, if you take a look at the gpt-2 example, the model loading is a straight read from the file into each tensor. However, the larger LLaMA models (>7B) are split into parts, and one has to merge them. Each tensor is split across all the parts, either by rows or by columns, so the reading logic becomes complicated because of that (see lines 457 to 503 at commit 367946c).
We could create a combined … Anyway, impressive stuff!
Maybe the conversion script could combine all the …
Perhaps the merging can be addressed with multiple mmaps stitched together into a contiguous region, as described here: …
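A sketch of that stitching trick, under the assumption that each part's size is a multiple of the page size (names and error handling are illustrative, not project code): reserve one contiguous PROT_NONE region, then map each part over a slice of it with MAP_FIXED:
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* sizes[i] is assumed to be a multiple of the page size so the next part
   starts on a page boundary; real code would round up and record offsets */
static void *map_parts(const char **paths, const size_t *sizes, int n, size_t total) {
    char *base = (char *)mmap(NULL, total, PROT_NONE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);  /* reserve only */
    if (base == MAP_FAILED) return NULL;
    size_t off = 0;
    for (int i = 0; i < n; i++) {
        int fd = open(paths[i], O_RDONLY);
        if (fd < 0) return NULL;
        void *p = mmap(base + off, sizes[i], PROT_READ,
                       MAP_PRIVATE | MAP_FIXED, fd, 0);  /* fill the slice */
        close(fd);
        if (p == MAP_FAILED) return NULL;
        off += sizes[i];
    }
    return base;
}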
I've just tested @slaren's change locally. It works great for 7B (which only has 1-dimensional tensors), loads instantly, and has zero performance regression on evaluation. My only worry is that …
Tried it out on the single-part …
Tests are all green on the mmap branch. It now works on Windows with MSVC. This is thanks to some pretty good POSIX polyfills that @oKatanaaa and I put a lot of thought into creating. See https://github.com/ggerganov/llama.cpp/blob/mmap/mmap.c The mmap branch no longer relies on overriding malloc(). Unfortunately I no longer see a smooth path for merging the mmap branch into master, because #370 introduced refactorings that don't mesh with its design. @slaren's suggestion combined with the new mmap() win32 polyfill would probably be our best bet right now for getting a measurable improvement into the master branch.
I can try to integrate the Windows patch on my branch and clean up the code a bit so that at least it can be used with 7B for now. @ggerganov what do you think would be the best way to modify ggml to allow it to create tensors without reserving memory? (Like the …)
Last time I researched this topic, I concluded that there is no real difference in performance between aligned and non-aligned memory access. Probably this mattered in the past, but nowadays I think the differences are negligible. I think … And even if I am wrong, given that all tensors in the model have sizes that are multiples of 32, I think it wouldn't be difficult to guarantee the alignment, right?
It has to be a new variable in the … (see lines 314 to 320 at commit 2a98bc1).
Name it …
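For reference, a sketch of the kind of flag being discussed on the context parameters (the comments are mine and the flag name is only illustrative of the idea, not necessarily the name chosen in the thread):
#include <stdbool.h>
#include <stddef.h>

struct ggml_init_params {
    size_t mem_size;    // size of the context's memory pool in bytes
    void * mem_buffer;  // caller-provided pool, or NULL to allocate internally
    bool   no_alloc;    // illustrative: create tensor metadata only, so
                        // tensor->data can later be pointed into an mmap
};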
I can't imagine it'd be difficult at all. I think the main thing that misaligns the format is those nul-terminated token strings at the beginning of the file. If you do a roundup(offset, 32) after producing those, then you might luck out and everything else will magically align.
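The round-up referred to above is the usual bit trick; a generic sketch, not the project's converter code:
#include <stddef.h>

/* round an offset up to the next multiple of 32 before writing tensor data,
   padding the file with zeros in between */
static size_t roundup32(size_t offset) {
    return (offset + 31) & ~(size_t)31;
}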
Totally - at this point we can easily make changes to the …
This is a breaking change that's going to give you three benefits:

1. Your inference commands should load 100x faster
2. You may be able to safely load models 2x larger
3. You can run many concurrent inference processes

This was accomplished by changing the file format so we can mmap() weights directly into memory without having to read() or copy them, thereby ensuring the kernel can make its file cache pages directly accessible to our inference processes; and secondly, that the file cache pages are much less likely to get evicted (which would force loads to hit disk) because they're no longer competing with memory pages that were needlessly created by gigabytes of standard i/o.

The new file format supports single-file models like LLaMA 7b, and it also supports multi-file models like LLaMA 13B. Our Python tool now merges the foo.1, foo.2, etc. files back into a single file so that the C++ code which maps it doesn't need to reshape data every time. That's made llama.cpp so much simpler. Much of its load code has now been deleted.

Furthermore, this change ensures that tensors are aligned properly on a 32-byte boundary. That opens the door to seeing if we can get additional performance gains on some microprocessors, by using ops that require memory alignment.

Lastly note that both POSIX and the Windows platform are supported.

Fixes ggerganov#91
@ggerganov is it possible for you guys to make mmap work on gptj (pyg6b)?
@Ar57m Why would it not be working for a specific model? Do you have an error message to share? We need more to go on. Also, this is not the right place. You should look up the error message in the issues, and if you can't find it, create a new one. Your other interaction on GitHub asks if it's possible to support loading big models on low memory devices. Yes and no. Memory mapping requires that you have enough memory to map into. So if you're running out of RAM, consider allocating swap space. This allows the OS to use the reserved disk space as though it were RAM. It isn't RAM, so expect it to run much slower. But it could make the difference between the program running and not running.
It doesn't. It's just like swap.
Enough virtual memory. Also there has to be enough physical memory backing it, or it will fault. So you are technically correct. Possibly the best kind of correct, but this is pedantry.
@apaz-cli I'm trying to run pyg6b q4_0 on termux (Android). When the model is loading it crashes termux (because it's using too much RAM). idk, I think it's because it's not mmapping; I heard somewhere that gptj doesn't support mmap. I can run almost any size Llama-based model (very slowly for big models) with mmap on.
So it doesn't create an extra copy in RAM and lives in the kernel page cache happily, loading instantly on subsequent runs.