### 2. Creating the Visual Component GGUF
Next, create a new directory to hold the visual components, and copy the llava.clip/projector files, as shown below.
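A minimal sketch of this step is shown below; it assumes the new directory is tracked in an `$ENCODER_PATH` environment variable and that `llava.clip` is copied in as `pytorch_model.bin`, which matches the `ls` output further down.

```bash
# Directory that will hold the visual encoder components (assumed name).
$ export ENCODER_PATH=$PWD/visual_encoder
$ mkdir -p $ENCODER_PATH

# Copy the visual encoder weights and the projector out of the granite model.
$ cp $GRANITE_MODEL/llava.clip $ENCODER_PATH/pytorch_model.bin
$ cp $GRANITE_MODEL/llava.projector $ENCODER_PATH/llava.projector
```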
Now, we need to write a config for the visual encoder. In order to convert the model, be sure to use the correct `image_grid_pinpoints`, as these may vary based on the model. You can find the `image_grid_pinpoints` in `$GRANITE_MODEL/config.json`.
Note: we refer to this file as `$VISION_CONFIG` later on.
```json
{
    "_name_or_path": "siglip-model",
    "architectures": [
      "SiglipVisionModel"
    ],
    "image_grid_pinpoints": [
        [384,384],
        [384,768],
        [384,1152],
        [384,1536],
        ...
    ],
    ...
}
```
Copy the vision config (i.e. `$VISION_CONFIG`) into `$ENCODER_PATH` as `config.json`. At this point you should have something like this:
```bash
$ ls $ENCODER_PATH
config.json llava.projector pytorch_model.bin
```
Now convert the components to GGUF. Note that we also override the image mean/std dev to `[.5,.5,.5]`, since we use the SigLIP visual encoder; in the `transformers` model, you can find these numbers in the `preprocessor_config.json`.
```bash
$ python convert_image_encoder_to_gguf.py \
    -m $ENCODER_PATH \
    --llava-projector $ENCODER_PATH/llava.projector \
    --output-dir $ENCODER_PATH \
    --clip-model-is-vision \
    --clip-model-is-siglip \
    --image-mean 0.5 0.5 0.5 \
    --image-std 0.5 0.5 0.5
```
This will create the first GGUF file at `$ENCODER_PATH/mmproj-model-f16.gguf`; we will refer to the absolute path of this file as `$VISUAL_GGUF_PATH`.
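If you like, you can record that absolute path in an environment variable now; a small sketch, assuming the default output filename produced by the conversion step above:

```bash
$ export VISUAL_GGUF_PATH=$ENCODER_PATH/mmproj-model-f16.gguf
```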
### 3. Creating the LLM GGUF
The granite vision model contains a granite LLM as its language model. For now, the easiest way to get the GGUF for the LLM is by loading the composite model in `transformers` and exporting the LLM so that it can be directly converted with the normal conversion path.
First, set `LLM_EXPORT_PATH` to the path that the `transformers` LLM should be exported to.
```bash
$ export LLM_EXPORT_PATH=$PWD/granite_vision_llm
```
Then, in a short Python script, validate the environment variables before exporting:

```python
import os

MODEL_PATH = os.getenv("GRANITE_MODEL")
if not MODEL_PATH:
    raise ValueError("env var GRANITE_MODEL is unset!")

LLM_EXPORT_PATH = os.getenv("LLM_EXPORT_PATH")
if not LLM_EXPORT_PATH:
    raise ValueError("env var LLM_EXPORT_PATH is unset!")
```
Note that currently you cannot quantize the visual encoder because granite vision models use SigLIP as the visual encoder, which has tensor dimensions that are not divisible by 32.
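The language model GGUF can still be quantized in the usual way; a hedged sketch using llama.cpp's `llama-quantize` tool, where the output filename and quantization type are only examples and `$LLM_GGUF_PATH` is the converted LLM GGUF used in the run command below:

```bash
$ ./build/bin/llama-quantize $LLM_GGUF_PATH granite_vision_llm_q4_k_m.gguf Q4_K_M
```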
### 5. Running the Model in llama.cpp
Build llama.cpp normally; you should have a target binary named `llama-llava-cli`, to which you can pass the two GGUF files. As an example, we pass the llama.cpp banner image.
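If you have not built llama.cpp yet, a typical CMake build looks roughly like this (backend-specific flags, e.g. for CUDA or Metal, are omitted):

```bash
$ cmake -B build
$ cmake --build build --config Release -j
```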
```bash
$ ./build/bin/llama-llava-cli -m $LLM_GGUF_PATH \
    --mmproj $VISUAL_GGUF_PATH \
    --image ./media/llama0-banner.png \
    -c 16384 \
    -p "<|system|>\nA chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n<|user|>\n\<image>\nWhat does the text in this image say?\n<|assistant|>\n" \
    --temp 0
```
Sample output: `The text in the image reads "LLAMA C++ Can it run DOOM Llama?"`