Confusing Error Message (I think). #290

Closed

ashercn97 opened this issue Jul 18, 2023 · 6 comments

Comments

@ashercn97

I am getting this error:

```
[2023-07-18 01:04:39,361] [WARNING] [axolotl.validate_config:16] [PID:640] batch_size is not recommended. Please use gradient_accumulation_steps instead.
To calculate the equivalent gradient_accumulation_steps, divide batch_size / micro_batch_size / number of gpus.
[2023-07-18 01:04:39,362] [INFO] [axolotl.scripts.train:219] [PID:640] loading tokenizer... openlm-research/open_llama_3b
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at huggingface/transformers#24565
[2023-07-18 01:04:39,438] [DEBUG] [axolotl.load_tokenizer:55] [PID:640] EOS: 2 / </s>
[2023-07-18 01:04:39,438] [DEBUG] [axolotl.load_tokenizer:56] [PID:640] BOS: 1 / <s>
[2023-07-18 01:04:39,438] [DEBUG] [axolotl.load_tokenizer:57] [PID:640] PAD: None / None
[2023-07-18 01:04:39,438] [DEBUG] [axolotl.load_tokenizer:58] [PID:640] UNK: 0 / <unk>
[2023-07-18 01:04:39,438] [INFO] [axolotl.load_tokenized_prepared_datasets:82] [PID:640] Unable to find prepared dataset in last_run_prepared/ba96bd8ae0099721227d1c8a23d6d7a4
[2023-07-18 01:04:39,438] [INFO] [axolotl.load_tokenized_prepared_datasets:83] [PID:640] Loading raw datasets...
[2023-07-18 01:04:39,438] [INFO] [axolotl.load_tokenized_prepared_datasets:88] [PID:640] No seed provided, using default seed of 42
100%|██████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 320.57it/s]
[2023-07-18 01:04:39,971] [INFO] [axolotl.load_tokenized_prepared_datasets:264] [PID:640] tokenizing, merging, and shuffling master dataset
Traceback (most recent call last):
  File "/home/studio-lab-user/axolotl/scripts/finetune.py", line 356, in <module>
    fire.Fire(train)
  File "/home/studio-lab-user/.conda/envs/python39/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/studio-lab-user/.conda/envs/python39/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/studio-lab-user/.conda/envs/python39/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/studio-lab-user/axolotl/scripts/finetune.py", line 226, in train
    train_dataset, eval_dataset = load_prepare_datasets(
  File "/home/studio-lab-user/axolotl/src/axolotl/utils/data.py", line 393, in load_prepare_datasets
    dataset = load_tokenized_prepared_datasets(
  File "/home/studio-lab-user/axolotl/src/axolotl/utils/data.py", line 268, in load_tokenized_prepared_datasets
    samples = samples + list(d)
  File "/home/studio-lab-user/axolotl/src/axolotl/datasets.py", line 42, in __iter__
    yield self.prompt_tokenizer.tokenize_prompt(example)
  File "/home/studio-lab-user/axolotl/src/axolotl/prompt_tokenizers.py", line 116, in tokenize_prompt
    tokenized_res_prompt = self._tokenize(
  File "/home/studio-lab-user/axolotl/src/axolotl/prompt_tokenizers.py", line 64, in _tokenize
    result = self.tokenizer(
  File "/home/studio-lab-user/.conda/envs/python39/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2571, in __call__
    raise ValueError("You need to specify either text or text_target.")
ValueError: You need to specify either text or text_target.
Traceback (most recent call last):
  File "/home/studio-lab-user/.conda/envs/python39/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/studio-lab-user/.conda/envs/python39/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/studio-lab-user/.conda/envs/python39/lib/python3.9/site-packages/accelerate/commands/launch.py", line 979, in launch_command
    simple_launcher(args)
  File "/home/studio-lab-user/.conda/envs/python39/lib/python3.9/site-packages/accelerate/commands/launch.py", line 628, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/studio-lab-user/.conda/envs/python39/bin/python3.9', 'scripts/finetune.py', 'config/OpenOrcaTest.yml']' returned non-zero exit status 1.
```

My config settings are:
```yaml
base_model: openlm-research/open_llama_3b
base_model_config: openlm-research/open_llama_3b
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
load_in_8bit: true
load_in_4bit: false
strict: false
push_dataset_to_hub:
datasets:
  - path: ashercn97/Testing
    type: alpaca
dataset_prepared_path: last_run_prepared
val_set_size: 0.02
adapter: lora
lora_model_dir:
sequence_len: 256
max_packed_sequence_len:
lora_r: 8
lora_alpha: 16
lora_dropout: 0.0
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj
lora_fan_in_fan_out:
wandb_project:
wandb_watch:
wandb_run_id:
wandb_log_model:
output_dir: ./openorca-out
batch_size: 16
micro_batch_size: 4
num_epochs: 3
optimizer: adamw_bnb_8bit
torchdistx_path:
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention: true
flash_attention:
gptq_groupsize:
gptq_model_v1:
warmup_steps: 10
eval_steps: 50
save_steps:
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
```
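(Side note: the `batch_size` warning at the top of the log is separate from the crash. By the formula it prints, `batch_size: 16` with `micro_batch_size: 4` works out to `gradient_accumulation_steps: 4` assuming a single GPU, i.e. 16 / 4 / 1.)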
@ashercn97
Author

Idk what happened to the formatting...

@NanoCode012
Collaborator

Can you check that all your prompts have non-empty output?

I recall this `You need to specify either text or text_target` error could mean the tokenizer has nothing to tokenize.
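If it helps, here is a minimal sketch for spotting such rows, assuming the dataset is the `ashercn97/Testing` repo from your config and uses an alpaca-style `output` column (the field name is an assumption):

```python
from datasets import load_dataset

# Load the raw dataset and list rows whose "output" is empty or whitespace-only.
ds = load_dataset("ashercn97/Testing", split="train")
empty = [i for i, row in enumerate(ds) if not (row.get("output") or "").strip()]
print(f"{len(empty)} rows with empty output, e.g. indices {empty[:10]}")
```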

@ashercn97
Author

@NanoCode012 I think one of them is empty. Is there a way to just drop rows with empty values?

@NanoCode012
Collaborator

@ashercn97, unfortunately not with axolotl. I think you can simply load the data into pandas and drop those rows.
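For example, a rough sketch assuming your data is a JSON-lines file with an alpaca-style `output` column (the filenames are placeholders):

```python
import pandas as pd

# Load the dataset, drop rows whose "output" is missing or blank, write it back.
df = pd.read_json("testing.jsonl", lines=True)
before = len(df)
df = df[df["output"].fillna("").str.strip() != ""]
df.to_json("testing_clean.jsonl", orient="records", lines=True)
print(f"dropped {before - len(df)} empty rows")
```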

@ashercn97
Author

@NanoCode012 Okay. I will try to do that now. Thanks so much!

@ashercn97
Author

My thing is fixed!
