* New Feature:
  1. Sum_Rows:
     fix CUDA kernel overflow
     fix block shape error when nrows is too big
  2. Im2Col:
     Support batch in CUDA
     Support f32 to f32 in both CPU and CUDA
  3. DepthWiseConv:
     Support via Im2Col and MulMat (a sketch of the decomposition follows the commit message)
  4. Pool_2d:
     Support avg pooling in CUDA
  5. HardSigmoid:
     Implement in CUDA
  6. HardSwish:
     Implement in CUDA (a sketch of both activations follows this list)
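For reference on items 5 and 6: `hardsigmoid(x) = clamp((x + 3) / 6, 0, 1)` and `hardswish(x) = x * hardsigmoid(x)`. Below is a minimal sketch of how such elementwise f32 kernels typically look in CUDA, assuming a flat tensor and a 1D launch; the kernel names and signatures are illustrative, not taken from the PR's actual ggml-cuda code.

```cuda
#include <cuda_runtime.h>

// hardsigmoid(x) = clamp((x + 3) / 6, 0, 1)
__global__ void hardsigmoid_f32(const float * x, float * dst, const int n) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        dst[i] = fminf(1.0f, fmaxf(0.0f, (x[i] + 3.0f) / 6.0f));
    }
}

// hardswish(x) = x * hardsigmoid(x)
__global__ void hardswish_f32(const float * x, float * dst, const int n) {
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        dst[i] = x[i] * fminf(1.0f, fmaxf(0.0f, (x[i] + 3.0f) / 6.0f));
    }
}
```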
* fix tabs instead of spaces
* code cleanup
* CUDA POOL2D
* add POOL2D test case in test-backend-ops.cpp
* code cleanup
* fix pool2d_kernel nits
* fix bug in pool2d kernel
* fix avg pooling, count_include_pad (see the avg-pooling sketch after this list)
* test-backend-ops : add more pool_2d tests
* cuda : fix warnings and formatting
* ggml : check types in release builds too in pool_2d
* test-backend-ops : remove f16 pool_2d tests
* cuda : more style fixes
* add assert in ggml_cuda_op_pool2d
* pool2d float padding fallback
* test-backend-ops : add dst_type to im2col
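On the `count_include_pad` fix in the list above: with this semantic, taps that fall in the padding contribute zero to the running sum but still count toward the divisor, so every output is divided by the full window size. Below is a standalone sketch of that behavior, assuming square kernels, a single channel, and a 2D launch over the output; all names are hypothetical and this is not the actual `pool2d_kernel`.

```cuda
__global__ void avg_pool2d_f32(const float * src, float * dst,
                               const int iw, const int ih,
                               const int ow, const int oh,
                               const int k, const int stride, const int pad) {
    const int ox = blockDim.x * blockIdx.x + threadIdx.x;
    const int oy = blockDim.y * blockIdx.y + threadIdx.y;
    if (ox >= ow || oy >= oh) return;

    float sum = 0.0f;
    for (int ky = 0; ky < k; ++ky) {
        for (int kx = 0; kx < k; ++kx) {
            const int ix = ox * stride + kx - pad;
            const int iy = oy * stride + ky - pad;
            // padded (out-of-bounds) taps add 0 to the sum, but they are
            // not excluded from the divisor below
            if (ix >= 0 && ix < iw && iy >= 0 && iy < ih) {
                sum += src[iy * iw + ix];
            }
        }
    }
    // count_include_pad: divide by the full window size, padding included
    dst[oy * ow + ox] = sum / (float) (k * k);
}
```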
---------
Co-authored-by: slaren <[email protected]>
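On the DepthWiseConv entry in the feature list: because each channel is convolved only with its own k×k filter, the op can be decomposed into an im2col unfold per channel followed by a 1×(k·k) matrix product, which is what "Support via Im2Col and MulMat" refers to. The sketch below is a CPU-side illustration with hypothetical names (the real graph composes `ggml_im2col` and `ggml_mul_mat`); it compiles with nvcc or any C++ compiler.

```cuda
#include <vector>

void depthwise_conv2d(const float * src, const float * ker, float * dst,
                      const int ch, const int iw, const int ih,
                      const int k, const int stride, const int pad) {
    const int ow = (iw + 2 * pad - k) / stride + 1;
    const int oh = (ih + 2 * pad - k) / stride + 1;
    std::vector<float> cols((size_t) k * k * ow * oh); // im2col buffer: [k*k][ow*oh]

    for (int c = 0; c < ch; ++c) {
        // im2col: every k×k window of channel c becomes one column
        for (int oy = 0; oy < oh; ++oy)
        for (int ox = 0; ox < ow; ++ox)
        for (int ky = 0; ky < k; ++ky)
        for (int kx = 0; kx < k; ++kx) {
            const int ix = ox * stride + kx - pad;
            const int iy = oy * stride + ky - pad;
            const float v = (ix >= 0 && ix < iw && iy >= 0 && iy < ih)
                          ? src[(c * ih + iy) * iw + ix] : 0.0f;
            cols[(size_t) (ky * k + kx) * ow * oh + oy * ow + ox] = v;
        }
        // mul_mat: the channel's [1 x k*k] filter row times the [k*k x ow*oh] columns
        for (int o = 0; o < ow * oh; ++o) {
            float sum = 0.0f;
            for (int t = 0; t < k * k; ++t) {
                sum += ker[c * k * k + t] * cols[(size_t) t * ow * oh + o];
            }
            dst[c * ow * oh + o] = sum;
        }
    }
}
```

Reusing the existing Im2Col and MulMat kernels this way avoids maintaining a dedicated depthwise kernel in every backend.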
examples/llava/MobileVLM-README.md: 56 additions, 2 deletions
## Orin compile and run

### compile

```sh
make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_87 LLAMA_CUDA_F16=1 -j 32
```

### run on Orin

### case 1

**input**

```sh
./llava-cli \
    -m /data/local/tmp/ggml-model-q4_k.gguf \
    --mmproj /data/local/tmp/mmproj-model-f16.gguf \
    --image /data/local/tmp/demo.jpeg \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWho is the author of this book? \nAnswer the question using a single word or phrase. ASSISTANT:" \
    --n-gpu-layers 999
```
**output**

```sh
encode_image_with_clip: image encoded in 296.62 ms by CLIP ( 2.06 ms per image patch)

Susan Wise Bauer

llama_print_timings: load time = 1067.64 ms
llama_print_timings: sample time = 1.53 ms / 6 runs ( 0.25 ms per token, 3934.43 tokens per second)
llama_print_timings: prompt eval time = 306.84 ms / 246 tokens ( 1.25 ms per token, 801.72 tokens per second)
llama_print_timings: eval time = 91.50 ms / 6 runs ( 15.25 ms per token, 65.58 tokens per second)
llama_print_timings: total time = 1352.63 ms / 252 tokens
```
### case 2

**input**

```sh
./llava-cli \
    -m /data/local/tmp/ggml-model-q4_k.gguf \
    --mmproj /data/local/tmp/mmproj-model-f16.gguf \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat is in the image? ASSISTANT:" \
    --n-gpu-layers 999
```
**output**

```sh
encode_image_with_clip: image encoded in 302.15 ms by CLIP ( 2.10 ms per image patch)

The image features a cat lying in the grass.

llama_print_timings: load time = 1057.07 ms
llama_print_timings: sample time = 3.27 ms / 11 runs ( 0.30 ms per token, 3360.83 tokens per second)
llama_print_timings: prompt eval time = 213.60 ms / 232 tokens ( 0.92 ms per token, 1086.14 tokens per second)
llama_print_timings: eval time = 166.65 ms / 11 runs ( 15.15 ms per token, 66.01 tokens per second)
llama_print_timings: total time = 1365.47 ms / 243 tokens
```
## Minor shortcomings

The `n_patch` of the output in `ldp` is 1/4 of the input. To get a quick implementation, we uniformly modified the `clip_n_patches` function to return a quarter of the original patch count; as a result, when counting time consumption, the calculated per-patch time is 4 times larger than the real cost.
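Concretely, the quarter adjustment amounts to something like the sketch below. This is a hypothetical standalone version with illustrative example sizes; the real `clip_n_patches` in `clip.cpp` takes a `clip_ctx` and reads the sizes from the model hparams.

```cuda
// LDP downsamples the patch grid by 2 in each spatial dimension, so the
// effective patch count is a quarter of the ViT encoder's.
int clip_n_patches(const int image_size, const int patch_size) {
    const int side      = image_size / patch_size; // e.g. 336 / 14 = 24
    const int n_patches = side * side;             // ViT patches, e.g. 576
    return n_patches / 4;                          // after LDP: e.g. 144
}
```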
## TODO

- [x] Support non-CPU backend for the new operators, such as `depthwise`, `hardswish`, `hardsigmoid`
- [ ] Optimize LDP projector performance
  - Optimize the structure definition to avoid unnecessary memory rearrangements, to reduce the use of `ggml_permute_cpy`;
  - Optimize operator implementation (ARM CPU/NVIDIA GPU): such as depthwise conv, hardswish, hardsigmoid, etc.
- [x] run MobileVLM on `Jetson Orin`
- [ ] Support more model variants, such as `MobileVLM-3B`.