llama fp8 - enable non reuse cache flow for fp8 (#64) #766
Add a disk_offload flag that controls device_map=auto. Setting this flag enables offloading weights to disk when CPU memory runs out. Add a const serialization path flag that takes the path where const sections are serialized, so that if the device has no space to hold all const sections, they are offloaded to disk.
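As a rough illustration of the first flag, the sketch below shows one way a `--disk_offload` switch could be turned into model-loading keyword arguments. It is not the code from this PR: the helper name `model_load_kwargs` and the default offload directory are hypothetical, and the only assumption about the loader is that it accepts `device_map` and `offload_folder`, as `from_pretrained` in transformers does.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a single on/off switch for disk offload."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--disk_offload",
        action="store_true",
        help="Offload weights to disk when CPU memory is exhausted.",
    )
    return parser


def model_load_kwargs(args: argparse.Namespace, offload_folder: str = "/tmp/offload") -> dict:
    """Hypothetical helper: map the flag to loader keyword arguments.

    The offload_folder default is an illustrative assumption, not a value
    taken from the PR.
    """
    kwargs = {}
    if args.disk_offload:
        # device_map="auto" lets the loader place weights across devices,
        # CPU memory, and finally the offload folder on disk.
        kwargs["device_map"] = "auto"
        kwargs["offload_folder"] = offload_folder
    return kwargs


if __name__ == "__main__":
    args = build_parser().parse_args(["--disk_offload"])
    print(model_load_kwargs(args))
```

The resulting dictionary would be passed as extra keyword arguments when loading the model. Without the flag the dictionary stays empty, so the default loading path is unchanged.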
* llama fp8 - enable non reuse cache flow for fp8
  remove deprecated kv cache fp8 flag
  Change-Id: Id76f94a127dee202376e8f27de7b28f58affedae
* fixing lm eval
  Change-Id: I230fa53e7b49d8bb36397b063f652ba3def84600
* remove old quantization mode
  Change-Id: I538172f29870311349ed79d928cfacc60fb534e8
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@HolyFalafel There is some overlap between this PR and #780, I see the same changes for
Yes, in order to avoid merge conflicts, I had to base this commit on top of the commit which is relevant to #780.
Hi @HolyFalafel, I see that "Code check quality" tests are failing. Could you fix formatting with ruff?
We need to merge #780 first. I'll try it out with Synapse 1.14 but I'm not sure it will work. If it does, I'll merge it; otherwise we'll need to wait for the release of Synapse 1.15.
Update configuration_utils.py
@HolyFalafel We'll merge this PR in the
…ggingface#2302) (huggingface#766) Co-authored-by: Karol Brejna <karolbrejna@apache.org>