How do you load a lora model? #639

Closed
MotorCityCobra opened this issue Jan 25, 2024 · 3 comments

@MotorCityCobra

I did LoRA fine-tuning over in llama.cpp, and when it finished it created two gguf files and one bin file. I'm pretty sure the bin file is the LoRA base, but what goes in the LoRA field and what goes in the model field? Does the original model get loaded? Do the gguf files need to follow a specific naming format?

@LostRuins
Owner

Model field = your base GGUF source model, before any modifications.
LoRA adapter = your fine-tuned LoRA adapter, which modifies a small part of the base model.
LoRA base = an optional F16 model that the LoRA layers are applied on top of, for greater precision. More info here: #224

To answer your question: yes, both models are required to load a LoRA.
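
For reference, the same three fields can also be expressed in code. Below is a minimal sketch using the llama-cpp-python bindings (an assumption, since this thread is about the KoboldCpp launcher, but the parameters map one-to-one); the file names are placeholders.

# Minimal sketch, assuming llama-cpp-python is installed (pip install llama-cpp-python).
# File names are placeholders, not files from this issue.
from llama_cpp import Llama

llm = Llama(
    model_path="base-model-q5_0.gguf",   # "Model" field: the base GGUF model
    lora_path="lora-adapter.bin",        # "LoRA adapter" field: the fine-tuned adapter
    lora_base="base-model-f16.gguf",     # "LoRA base" field: optional f16 copy used for precision
    n_gpu_layers=0,                      # keep layers on the CPU while the adapter is applied
)
output = llm("Once upon a time", max_tokens=32)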

@MotorCityCobra
Author

My issue is that I don't know whether my LoRA fine-tuning of Mistral 7B with llama.cpp failed because only Llama models are supported, or whether I'm just having trouble loading the quantized model together with the LoRA adapter file.
I suspect finetune.cpp is not producing a working LoRA adapter for Mistral 7B. Any idea what I could change to make this work?

That said, I fine-tuned on a quantized Mistral 7B model, and training seemed to run and reduce the loss with no errors. When fine-tuning finishes I get a bin file and a gguf file. With my quantized model in the 'model' field, loading the gguf file the fine-tuning produced gives me the first bit of text below, and loading the bin file as the LoRA instead gives me the second error text...
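
For context, a sketch of the kind of finetune run being described, written as a Python subprocess wrapper; the flag names follow the llama.cpp finetune example from around this period and the file names are hypothetical, so check them against finetune --help. As I understand it, the --checkpoint-out gguf files are training checkpoints rather than adapters, while --lora-out is the adapter itself, which would explain the "bad file magic" below when a checkpoint gguf is loaded in the LoRA field.

# Sketch of a finetune run of this kind, as a Python wrapper around the llama.cpp binary.
# Flag names follow the finetune example from around this period; file names are hypothetical.
import subprocess

subprocess.run([
    "./finetune",
    "--model-base", "mistral-7b-q5_0.gguf",                # quantized base model
    "--checkpoint-out", "chk-shakespeare-ITERATION.gguf",  # training checkpoints (the gguf outputs)
    "--lora-out", "lora-shakespeare-ITERATION.bin",        # the LoRA adapter (the bin output)
    "--train-data", "shakespeare.txt",
    "--save-every", "10",
    "--threads", "6", "--adam-iter", "30", "--batch", "4", "--ctx", "64",
    "--use-checkpointing",
], check=True)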

The first error says "bad file magic":


llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.65 GiB (5.52 BPW)
llm_load_print_meta: general.name     = models
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '<|im_end|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size       =    0.11 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: system memory used  =   86.05 MiB
llm_load_tensors: VRAM used           = 4679.56 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
...................................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_new_context_with_model: n_ctx      = 2128
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 266.00 MB
llama_new_context_with_model: KV self size  =  266.00 MiB, K (f16):  133.00 MiB, V (f16):  133.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 164.35 MiB
llama_new_context_with_model: VRAM scratch buffer: 161.16 MiB
llama_new_context_with_model: total VRAM used: 5106.72 MiB (model: 4679.56 MiB, context: 427.16 MiB)

Attempting to apply LORA adapter: C:\Users\ooo\tor\llama.cpp\build\bin\Release\mistral-7b-shakespeare-LATEST2.gguf
llama_apply_lora_from_file_internal: applying lora adapter from 'C:\Users\ooo\tor\llama.cpp\build\bin\Release\mistral-7b-shakespeare-LATEST2.gguf' - please wait ...
llama_apply_lora_from_file_internal: bad file magic
gpttype_load_model: error: failed to apply lora adapter
Load Model OK: False
Could not load model: C:\Users\ooo\tor\llama.cpp\models\dolph\ggml-model-q5_0.gguf

The second error, with the bin file, says "Error: the simultaneous use of LoRAs and GPU acceleration is only supported for f16 models":


llm_load_print_meta: n_vocab          = 32001
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 4.65 GiB (5.52 BPW)
llm_load_print_meta: general.name     = models
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '<|im_end|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size       =    0.11 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: system memory used  =   86.05 MiB
llm_load_tensors: VRAM used           = 4679.56 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
...................................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_new_context_with_model: n_ctx      = 2128
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 266.00 MB
llama_new_context_with_model: KV self size  =  266.00 MiB, K (f16):  133.00 MiB, V (f16):  133.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 164.35 MiB
llama_new_context_with_model: VRAM scratch buffer: 161.16 MiB
llama_new_context_with_model: total VRAM used: 5106.72 MiB (model: 4679.56 MiB, context: 427.16 MiB)

Attempting to apply LORA adapter: C:\Users\ooo\tor\llama.cpp\build\bin\Release\loro-elliots_gguf.bin
llama_apply_lora_from_file_internal: applying lora adapter from 'C:\Users\ooo\tor\llama.cpp\build\bin\Release\loro-elliots_gguf.bin' - please wait ...
llama_apply_lora_from_file_internal: r = 4, alpha = 4, scaling = 1.00
llama_apply_lora_from_file_internal: allocating 1500 MB for lora temporary buffer
llama_apply_lora_from_file_internal: warning: using a lora adapter with a quantized model may result in poor quality, use a f16 or f32 base model with --lora-base
Error: the simultaneous use of LoRAs and GPU acceleration is only supported for f16 models
llama_apply_lora_from_file: failed to apply lora adapter: llama_apply_lora_from_file_internal: error: the simultaneous use of LoRAs and GPU acceleration is only supported for f16 models
gpttype_load_model: error: failed to apply lora adapter
Load Model OK: False
Could not load model: C:\Users\ooo\tor\llama.cpp\models\dolph\ggml-model-q5_0.gguf

Since that said the bin file (the adapter) needs an f16 model, I loaded the f16 model, and that gives me this error:


llm_load_print_meta: n_vocab          = 32001
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = unknown, may not work
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 13.49 GiB (16.00 BPW)
llm_load_print_meta: general.name     = models
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '<|im_end|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size       =    0.11 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: system memory used  =  250.12 MiB
llm_load_tensors: VRAM used           = 13563.02 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
...................................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:10000.0).
llama_new_context_with_model: n_ctx      = 2128
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 266.00 MB
llama_new_context_with_model: KV self size  =  266.00 MiB, K (f16):  133.00 MiB, V (f16):  133.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 164.35 MiB
llama_new_context_with_model: VRAM scratch buffer: 161.16 MiB
llama_new_context_with_model: total VRAM used: 13990.18 MiB (model: 13563.02 MiB, context: 427.16 MiB)

Attempting to apply LORA adapter: C:\Users\ooo\tor\llama.cpp\build\bin\Release\loro-elliots_gguf.bin
llama_apply_lora_from_file_internal: applying lora adapter from 'C:\Users\ooo\tor\llama.cpp\build\bin\Release\loro-elliots_gguf.bin' - please wait ...
llama_apply_lora_from_file_internal: r = 4, alpha = 4, scaling = 1.00
llama_apply_lora_from_file_internal: allocating 1500 MB for lora temporary buffer

Error: the simultaneous use of LoRAs and GPU acceleration is only supported for f16 models
llama_apply_lora_from_file: failed to apply lora adapter: llama_apply_lora_from_file_internal: error: the simultaneous use of LoRAs and GPU acceleration is only supported for f16 models
gpttype_load_model: error: failed to apply lora adapter
Load Model OK: False
Could not load model: C:\Users\ooo\tor\llama.cpp\models\dolph\ggml-model-f16.gguf
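
One workaround the error message itself points at is to keep the model entirely on the CPU while the adapter is applied, since the f16/GPU restriction appears to concern only layers that have been offloaded. A minimal sketch with the llama-cpp-python bindings (an assumption, as above), reusing the file names from the logs:

# Minimal sketch, assuming llama-cpp-python; apply the adapter with no GPU offload so the
# "f16 models only" restriction on GPU-offloaded tensors should not be hit.
from llama_cpp import Llama

llm = Llama(
    model_path=r"C:\Users\ooo\tor\llama.cpp\models\dolph\ggml-model-f16.gguf",
    lora_path=r"C:\Users\ooo\tor\llama.cpp\build\bin\Release\loro-elliots_gguf.bin",  # the bin file is the adapter (r = 4, alpha = 4 above)
    n_gpu_layers=0,  # keep every layer on the CPU while the LoRA is applied
)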



@LostRuins
Owner

I can't really provide much assistance on LoRAs as I don't use them myself. Maybe you can try generating the composite model with llama.cpp instead?
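
For the composite-model route: llama.cpp ships an export-lora example that merges the adapter into a standalone GGUF, which could then be loaded in the Model field with GPU offload and no LoRA fields at all. A sketch as a Python subprocess call; the flag names are as recalled from the export-lora usage text of builds from around this time, so confirm them with export-lora --help, and the output file name is hypothetical.

# Sketch: merge the LoRA adapter into the f16 base model with llama.cpp's export-lora tool,
# producing one GGUF that needs no LoRA fields at load time. Flag names should be checked
# against export-lora --help; the output file name is hypothetical.
import subprocess

subprocess.run([
    "./export-lora",
    "--model-base", r"C:\Users\ooo\tor\llama.cpp\models\dolph\ggml-model-f16.gguf",
    "--lora",       r"C:\Users\ooo\tor\llama.cpp\build\bin\Release\loro-elliots_gguf.bin",
    "--model-out",  "mistral-7b-shakespeare-merged-f16.gguf",
], check=True)

The merged GGUF could then be re-quantized with the usual quantize tool and loaded on its own.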
