
Pr139 dev #8

Merged · 32 commits merged into main on Aug 28, 2024
Conversation

risingsunomi (Owner)

Merging in PR139 development

  • Tested with Qwen2 Instruct, TinyLlama, and Llama3 8B models and successfully generated tokens.
  • Output quality is currently poor, often producing gibberish. I am investigating top_p, top_k, temperature settings, and the logit sampling method for improvements (see the sampling sketch after this list).
  • Successfully tested the setup on an NVIDIA A10 with enough resources for model generation.
  • Tested on local NVIDIA-based servers, achieving cross-computer generation on Linux systems. Encountered out-of-memory (OOM) errors even with smaller models; will explore memory management on low-VRAM GPUs.
  • Generalized the ShardedHuggingFaceModel forward function to be compatible with various HuggingFace transformer model classes, though specific handling for different model types is still required.
  • Reformatted the interface test for PyTorch.
  • Updated generate_deps.py to update the tinychat model-select dropdown.
  • Fixed the fast check for loading tokenizers, which was comparing a string against a pathlib.Path (see the path-normalization sketch after this list).
  • Updated ShardedHuggingFaceModel to always use caching.
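
For context on the sampling investigation mentioned above, here is a minimal sketch of temperature / top-k / top-p (nucleus) logit sampling in PyTorch. The function name `sample_logits` and its defaults are illustrative and not the PR's actual code; it only shows the general technique being tuned, including a defensive fallback for the "top_p too low or high" case.

```python
import torch
import torch.nn.functional as F

def sample_logits(logits: torch.Tensor,
                  temperature: float = 0.8,
                  top_k: int = 50,
                  top_p: float = 0.9) -> torch.Tensor:
    """Sample one token id from a 1-D [vocab_size] logits tensor.

    Hypothetical helper, not the repository's code: applies temperature
    scaling, then top-k, then nucleus (top-p) filtering, and finally
    multinomial sampling over the surviving tokens.
    """
    # Temperature scaling: lower values sharpen the distribution.
    logits = logits / max(temperature, 1e-5)

    # Top-k: keep only the k highest-scoring tokens.
    if top_k > 0:
        k = min(top_k, logits.size(-1))
        kth_value = torch.topk(logits, k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))

    # Top-p (nucleus): keep the smallest set of tokens whose cumulative
    # probability exceeds top_p, always retaining the single best token.
    if 0.0 < top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cumprobs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        remove = cumprobs > top_p
        remove[1:] = remove[:-1].clone()  # shift so the crossing token stays
        remove[0] = False
        logits[sorted_idx[remove]] = float("-inf")

    probs = F.softmax(logits, dim=-1)
    # Defensive fallback: if filtering left no usable probability mass,
    # fall back to a uniform random pick over the vocabulary.
    if not torch.isfinite(probs).all() or probs.sum() <= 0:
        return torch.randint(0, logits.size(-1), (1,))
    return torch.multinomial(probs, num_samples=1)
```

In use, this would be called on the last-position logits of a causal LM forward pass (e.g. `outputs.logits[0, -1, :]`), with `use_cache=True` so `past_key_values` can be reused across steps, which is the standard HuggingFace mechanism behind the "always use caching" item above.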
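
On the tokenizer fast-check fix: the underlying bug class is that a pathlib.Path never compares equal to a str, so an equality or membership check against a model identifier silently fails. A minimal sketch, assuming a hypothetical `use_fast_tokenizer` helper and model list (not the repository's code):

```python
from pathlib import Path

# Hypothetical allow-list for illustration only.
FAST_TOKENIZER_MODELS = {
    "Qwen/Qwen2-7B-Instruct",
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
}

def use_fast_tokenizer(model_path) -> bool:
    """Illustrative fast-check, not the PR's exact code.

    Wrong:  Path("TinyLlama/...") in FAST_TOKENIZER_MODELS  -> always False,
            because a Path never equals a str.
    Right:  normalize to str before comparing.
    """
    model_id = model_path.as_posix() if isinstance(model_path, Path) else str(model_path)
    return model_id in FAST_TOKENIZER_MODELS
```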

Truncated commit messages from the merge:

…ference_engine.py, adding a PyTorch option for the inference engine
… along with adding torch topk; added better random distribution selection for when top_p is too low or high; started work on forward_layer_cached, but the infer functions need to be changed to accept any type rather than a string
… where data from infer_prompt is not complete
…ken/logit coming from infer_tensor to infer_prompt; running into OOM issues when trying on the server
risingsunomi merged commit 46667b6 into main on Aug 28, 2024.