feat: WIP: Adjust GPU Layers #3737

Draft
wants to merge 27 commits into
base: master


siddimore
Contributor

@siddimore siddimore commented Oct 6, 2024

Description

  1. Add GGUF Parser

TODO

  1. Install nvidia-smi driver
  2. Fetch GPU Device information
  3. Offload layers to GPU based on GGUF Parser metadata
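
For TODO item 2, one possible way to fetch device information is to shell out to `nvidia-smi` and parse its CSV output. This is only a minimal sketch: the PR's own code uses the NVML Go bindings instead, and the `GPUDevice` type and the `parseNvidiaSMI` and `queryGPUs` helpers are hypothetical names introduced here for illustration. It assumes device names contain no commas.

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// GPUDevice is a hypothetical record for one GPU; memory values are in MiB.
type GPUDevice struct {
	Name     string
	TotalMiB uint64
	FreeMiB  uint64
}

// parseNvidiaSMI parses the output of:
//   nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader,nounits
// Each non-empty line is expected to look like "NAME, TOTAL, FREE".
func parseNvidiaSMI(out string) ([]GPUDevice, error) {
	var devices []GPUDevice
	for _, line := range strings.Split(out, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		fields := strings.Split(line, ",")
		if len(fields) != 3 {
			return nil, fmt.Errorf("unexpected nvidia-smi line: %q", line)
		}
		total, err := strconv.ParseUint(strings.TrimSpace(fields[1]), 10, 64)
		if err != nil {
			return nil, fmt.Errorf("parsing total memory: %w", err)
		}
		free, err := strconv.ParseUint(strings.TrimSpace(fields[2]), 10, 64)
		if err != nil {
			return nil, fmt.Errorf("parsing free memory: %w", err)
		}
		devices = append(devices, GPUDevice{
			Name:     strings.TrimSpace(fields[0]),
			TotalMiB: total,
			FreeMiB:  free,
		})
	}
	return devices, nil
}

// queryGPUs runs nvidia-smi and returns the parsed device list.
func queryGPUs() ([]GPUDevice, error) {
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=name,memory.total,memory.free",
		"--format=csv,noheader,nounits").Output()
	if err != nil {
		return nil, fmt.Errorf("running nvidia-smi: %w", err)
	}
	return parseNvidiaSMI(string(out))
}

func main() {
	devices, err := queryGPUs()
	if err != nil {
		fmt.Println("no GPU information available:", err)
		return
	}
	for i, d := range devices {
		fmt.Printf("GPU %d: %s, %d MiB total, %d MiB free\n", i, d.Name, d.TotalMiB, d.FreeMiB)
	}
}
```

Keeping the parsing separate from the command invocation makes the parser testable on machines without a GPU.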

This PR fixes #3541

Notes for Reviewers

Signed commits

  • Yes, I signed my commits.

Signed-off-by: Siddharth More <[email protected]>

netlify bot commented Oct 6, 2024

Deploy Preview for localai ready!

Name Link
🔨 Latest commit cd1dc5d
🔍 Latest deploy log https://app.netlify.com/sites/localai/deploys/6705d246290228000812ee36
😎 Deploy Preview https://deploy-preview-3737--localai.netlify.app

@siddimore siddimore changed the title from "WIP: Figure out GPU Layers" to "feat: WIP: Figure out GPU Layers" Oct 6, 2024
localai-bot and others added 17 commits October 6, 2024 09:01
…6b61dc98b87a` (mudler#3718)

⬆️ Update ggerganov/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <[email protected]>
…dfbc9d51570c4e` (mudler#3719)

⬆️ Update ggerganov/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <[email protected]>
Updated some formatting in the doc.

Signed-off-by: JJ Asghar <[email protected]>
…f07d9d7a6077` (mudler#3725)

⬆️ Update ggerganov/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <[email protected]>
…ab770389bb442b` (mudler#3724)

⬆️ Update ggerganov/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <[email protected]>
feat(multimodal): allow to template image placeholders

Signed-off-by: Ettore Di Giacinto <[email protected]>
Signed-off-by: Ettore Di Giacinto <[email protected]>
)

* feat(vllm): add support for image-to-text

Related to mudler#3670

Signed-off-by: Ettore Di Giacinto <[email protected]>

* feat(vllm): add support for video-to-text

Closes: mudler#2318

Signed-off-by: Ettore Di Giacinto <[email protected]>

* feat(vllm): support CPU installations

Signed-off-by: Ettore Di Giacinto <[email protected]>

* feat(vllm): add bnb

Signed-off-by: Ettore Di Giacinto <[email protected]>

* chore: add docs reference

Signed-off-by: Ettore Di Giacinto <[email protected]>

* Apply suggestions from code review

Signed-off-by: Ettore Di Giacinto <[email protected]>

---------

Signed-off-by: Ettore Di Giacinto <[email protected]>
Signed-off-by: Ettore Di Giacinto <[email protected]>
…b00e0223b6fa` (mudler#3731)

⬆️ Update ggerganov/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <[email protected]>
…5c9e2b2529ff2c` (mudler#3730)

⬆️ Update ggerganov/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <[email protected]>
We default to a soft kill, however, we might want to force killing
backends after a while to avoid hanging requests (which may hallucinate
indefinitely)

Signed-off-by: Ettore Di Giacinto <[email protected]>
If the LLM does not implement any logic for PredictStream, we close the
channel immediately to not leave the process hanging.

Signed-off-by: Ettore Di Giacinto <[email protected]>
…c8930d19f45773` (mudler#3735)

⬆️ Update ggerganov/whisper.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <[email protected]>
…1bd8811a9b44` (mudler#3736)

⬆️ Update ggerganov/llama.cpp

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: mudler <[email protected]>
Signed-off-by: Siddharth More <[email protected]>
@github-actions github-actions bot added the kind/documentation (Improvements or additions to documentation) and area/ai-model labels Oct 6, 2024
@siddimore siddimore changed the title from "feat: WIP: Figure out GPU Layers" to "feat: WIP: Adjust GPU Layers" Oct 7, 2024
@siddimore
Contributor Author

@mudler can you kindly check the PR approach and give some high-level feedback when possible?

The next step I will add is some sort of GPU layer estimator based on:

  1. VRAM from GGUF parsing
  2. Number of GPUs on the device
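
A rough sketch of what such an estimator could look like follows. It is not the PR's implementation: the `estimateGPULayers` function, its parameters, and the assumptions that every layer has the same size and that a fixed overhead is reserved per GPU are all hypothetical simplifications.

```go
package main

import "fmt"

// gib is one gibibyte in bytes.
const gib uint64 = 1 << 30

// estimateGPULayers returns how many of totalLayers could be offloaded,
// given the model size, the free VRAM of each GPU, and a per-GPU overhead
// reserved for context and scratch buffers.
// Assumption: all layers are the same size (modelBytes / totalLayers, rounded up).
func estimateGPULayers(modelBytes uint64, totalLayers int, freeVRAM []uint64, overheadBytes uint64) int {
	if totalLayers <= 0 || modelBytes == 0 {
		return 0
	}
	n := uint64(totalLayers)
	perLayer := (modelBytes + n - 1) / n // ceiling division, so we never underestimate

	layers := 0
	for _, free := range freeVRAM {
		if free <= overheadBytes {
			continue // this GPU has no usable memory left
		}
		layers += int((free - overheadBytes) / perLayer)
	}
	if layers > totalLayers {
		layers = totalLayers // cannot offload more layers than the model has
	}
	return layers
}

func main() {
	// An 8 GiB model with 32 layers, one GPU with 6 GiB free, 1 GiB reserved.
	fmt.Println(estimateGPULayers(8*gib, 32, []uint64{6 * gib}, 1*gib))
}
```

Summing whole layers per GPU, rather than pooling all free memory, reflects that a single layer cannot be split across two devices in this simplified model.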

Signed-off-by: Siddharth More <[email protected]>
Signed-off-by: Siddharth More <[email protected]>
@@ -70,6 +70,7 @@ type RunCMD struct {
Federated bool `env:"LOCALAI_FEDERATED,FEDERATED" help:"Enable federated instance" group:"federated"`
DisableGalleryEndpoint bool `env:"LOCALAI_DISABLE_GALLERY_ENDPOINT,DISABLE_GALLERY_ENDPOINT" help:"Disable the gallery endpoints" group:"api"`
LoadToMemory []string `env:"LOCALAI_LOAD_TO_MEMORY,LOAD_TO_MEMORY" help:"A list of models to load into memory at startup" group:"models"`
AdjustGPULayers bool `env:"LOCALAI_ADJUST_GPU_LAYERS,ADJUST_GPU_LAYERS" help:"Enable OffLoading of model layers to GPU" group:"models"`
Owner


Minor nit: I would call this something like AutomaticallyAdjustGPULayers:

Suggested change
- AdjustGPULayers bool `env:"LOCALAI_ADJUST_GPU_LAYERS,ADJUST_GPU_LAYERS" help:"Enable OffLoading of model layers to GPU" group:"models"`
+ AutomaticallyAdjustGPULayers bool `env:"LOCALAI_AUTO_ADJUST_GPU_LAYERS,ADJUST_GPU_LAYERS" help:"Enable Automatic OffLoading of model layers to GPU" group:"models"`


// GetNvidiaGpuInfo uses the nvml package, a Go binding around the C API provided by libnvidia-ml.so,
// to fetch GPU stats
func GetNvidiaGpuInfo() ([]GPUInfo, error) {
Owner


Minor nit, but it's good practice to keep acronyms uppercase, e.g. UnmarshalYAML, GetXYZ:

Suggested change
- func GetNvidiaGpuInfo() ([]GPUInfo, error) {
+ func GetNvidiaGPUInfo() ([]GPUInfo, error) {

}
}

func TestGetModelGGufData_URL_WithMockedEstimateModelMemoryUsage(t *testing.T) {
Owner


This is still a minor nit, but all the other tests use ginkgo. Would you be up for using ginkgo here as well? Not a blocker in any case; at some point we will refactor things to be more consistent if needed.

)

// Interface for parsing different model formats
type LocalAIGGUFParser interface {
Owner

@mudler mudler Oct 9, 2024


Small style nit (non-blocking): the code would be more reusable if the interface here required only ParseGGUFFile, with different parsers implementing their own ParseGGUFFile logic, for instance an Ollama parser, a Hugging Face parser, etc.

The caller would then instantiate the needed parser down the line, for instance:

type GGUFParser interface {
	Parse(path string) (*ggufparser.GGUFFile, error)
}

// GetModelGGufData returns the resources estimation needed to load the model.
func GetModelGGufData(modelPath string, estimator ModelMemoryEstimator, ollamaModel bool) (*ModelEstimate, error) {
	ctx := context.Background()

	fmt.Println("ModelPath: ", modelPath)

	var ggufParser GGUFParser

	// Pick the parser that matches the model source
	switch {
	case isURL(modelPath):
		ggufParser = &RemoteFileParser{ctx, modelPath}
	case ollamaModel:
		ggufParser = &OllamaParser{ctx, modelPath}
	// ... other parsers here
	}

	// Parse the model with the selected parser, then estimate from the parsed data
	ggufData, err := ggufParser.Parse(modelPath)
	if err != nil {
		return nil, err
	}
	return estimator.Estimate(ggufData)
}

Since we pass an estimator down the line, this should arguably be a method of ModelMemoryEstimator as well:

func (g GGUFEstimator) GetModelGGufData(modelPath string,  ollamaModel bool) (*ModelEstimate, error) {
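
To make the two suggestions above concrete, here is a self-contained sketch of how a single-method parser interface and an estimator method might fit together. It is an illustration only: `GGUFFile` is a hypothetical stand-in for `*ggufparser.GGUFFile`, the three parsers are stubs that return placeholder data, and `parserFor`, `LocalFileParser`, and the fields of `ModelEstimate` are assumptions rather than code from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// GGUFFile is a hypothetical stand-in for *ggufparser.GGUFFile.
type GGUFFile struct {
	Source    string // which parser produced this value
	SizeBytes uint64 // placeholder for parsed metadata
}

// GGUFParser is the single-method interface suggested in the review.
type GGUFParser interface {
	Parse(path string) (*GGUFFile, error)
}

// The parsers below are stubs returning placeholder data. Real ones would
// read a local file, fetch a URL, or resolve an Ollama model reference.
type LocalFileParser struct{}
type RemoteFileParser struct{}
type OllamaParser struct{}

func (LocalFileParser) Parse(path string) (*GGUFFile, error) {
	return &GGUFFile{Source: "local", SizeBytes: 1 << 30}, nil
}

func (RemoteFileParser) Parse(path string) (*GGUFFile, error) {
	return &GGUFFile{Source: "remote", SizeBytes: 2 << 30}, nil
}

func (OllamaParser) Parse(path string) (*GGUFFile, error) {
	return &GGUFFile{Source: "ollama", SizeBytes: 3 << 30}, nil
}

// ModelEstimate is a simplified, hypothetical estimate result.
type ModelEstimate struct {
	Source    string
	VRAMBytes uint64
}

// GGUFEstimator owns both parser selection and estimation.
type GGUFEstimator struct{}

// parserFor picks a parser based on the model source.
func (GGUFEstimator) parserFor(modelPath string, ollamaModel bool) GGUFParser {
	switch {
	case strings.HasPrefix(modelPath, "http://"), strings.HasPrefix(modelPath, "https://"):
		return RemoteFileParser{}
	case ollamaModel:
		return OllamaParser{}
	default:
		return LocalFileParser{}
	}
}

// GetModelGGufData parses the model and derives an estimate from the result.
func (g GGUFEstimator) GetModelGGufData(modelPath string, ollamaModel bool) (*ModelEstimate, error) {
	if modelPath == "" {
		return nil, errors.New("empty model path")
	}
	file, err := g.parserFor(modelPath, ollamaModel).Parse(modelPath)
	if err != nil {
		return nil, fmt.Errorf("parsing %q: %w", modelPath, err)
	}
	return &ModelEstimate{Source: file.Source, VRAMBytes: file.SizeBytes}, nil
}

func main() {
	est := GGUFEstimator{}
	for _, p := range []string{"https://example.com/model.gguf", "models/model.gguf"} {
		e, err := est.GetModelGGufData(p, false)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: parsed by the %s parser, %d bytes\n", p, e.Source, e.VRAMBytes)
	}
}
```

With this shape, adding a new model source means adding one type with a `Parse` method and one `case` in `parserFor`, while callers keep using the estimator unchanged.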

@mudler
Owner

mudler commented Oct 9, 2024

@siddimore thanks for taking a stab at this, the direction looks good here - just a few minor nits here and there but definitely not blockers

@siddimore
Contributor Author

@siddimore thanks for taking a stab at this, the direction looks good here - just a few minor nits here and there but definitely not blockers

Thanks much @mudler, you are welcome! I will improve the code and add some more testing. I appreciate the feedback and will address the comments.

Labels
area/ai-model kind/documentation Improvements or additions to documentation
Development

Successfully merging this pull request may close these issues.

feat: automatically adjust default gpu_layers by available GPU memory
4 participants