This is the code for paper Towards Tool Use Alignment of Large Language Models.
Our model is on Huggingface.
Our data generation process is illustrated in the following image:
The train data are in trainset
folder
Use tar -xf dpo_train_data.tar
to unzip the DPO train data.
To obtain AlignToolLLaMA-SFT (SFT on ToolLLaMA), please run toolbench/sft_train_script_v2/align_train_deepspeed.sh
,
To obtain AlignToolLLaMA-DPO (DPO training on AlignToolLLaMA-DPO), please run toolbench/dpo_train_script_full_v2/train_dpo_mem.sh
The testset in ToolAlign for helpfulness, harmlessness, and autonomy are in testset/
ToolAlign_testset/
: Testset for harmlessness and autonomy.
unsafe_api.json
: testset for harmful tool responses.
unsafe_input_intro.json
and unsafe_input_safellama.json
: testset for harmful instructions.
- unsafe_input_intro.json: prompt ChatGPT to transform these instructions into unsafe ones
- unsafe_input_safellama.json: sample and rewrite from Anthropic Red Teaming Dataset
without_tool.json
: testset for autonomy.
ToolBench_testset/
: Testset for helpfulness.
This testset is copied from ToolBench
First, you should follwing ToolBench Data Release to prepare the tool environment.
After downloading the data use Google Drive or Tsinghua Cloud link provided in ToolBench, you can see the /data/toolenv/tools
folder and please put the folder into ToolAlign/server
folder.
We recommend to prepare the tool environment following StableToolBench by download the StableToolBench cache to avoid downloading useless data.
After downloading the cache provide in StableToolBench, unzip the folder into the ToolAlign/server
folder and ensure the ToolAlign/server
folder contains tools folder.
Scripts under test_script/toolbench_script/
are test for helpfulness, where aligntoolllama.sh
and toolllama.sh
are for AlignToolLLaMA-SFT/DPO and ToolLLaMA, respectively.
Scripts under test_script/toolalign_script/
are test for harmlessness and autonomy, where aligntoolllama.sh
and toolllama.sh
are for AlignToolLLaMA-SFT/DPO and ToolLLaMA, respectively.
For helpfulness, we follow ToolEval in ToolBench to evaluate the LLMs. The following paragraph refers to ToolBench and StableToolBench. The file structure of the directory should be:
├── /aligntoolllama_sft_CoT@1/
│ ├── /G1_instruction/
│ │ ├── /[email protected]
│ │ └── ...
│ ├── /G1_tool/
│ │ ├── /[email protected]
│ │ └── ...
│ ├── ...
│ ├── /G3_instruction/
│ │ ├── /[email protected]
│ │ └── ...
You should first prepare the data by run:
toolbench/tooleval/convert_to_answer_format.sh
Then you can calculate the pass rate by run toolbench/tooleval/eval_rass_rate.sh
and calculate the win rate by run toolbench/tooleval/eval_win_rate.sh
For harmlessness and autonomy, the file structure of the directory should be: The file structure of the directory should be:
├── /aligntoolllama_sft/
│ ├── /unsafe_api/
│ │ ├── /[email protected]
│ │ └── ...
│ ├── /unsafe_input_intro/
│ │ ├── /[email protected]
│ │ └── ...
│ ├── /unsafe_input_safellama/
│ │ ├── /[email protected]
│ │ └── ...
│ ├── /without_tool/
│ │ ├── /[email protected]
│ │ └── ...
For harmlessness, run python evaluation/ToolAlign/unsafe_api.py
to evaluate the refusal response rate of harmful tool responses and obtain a .json file.
The structure of the .json file should be
{
"54905": {
"is_refuse": true # or null/false
},
...
}
where the key is the test query id.
For harmful tool responses, the refusal response rate is calculated by "is_refuse": true
and the number of "is_refuse": false
in the saved .json file.
Run python evaluation/ToolAlign/unsafe_input.py
to evaluate the refusal response rate of harmful instructions and obtain a .json file.
The file structure should be
{
"43280": {
"is_refused": true # or false
},
...
}
where the key is the test query/response id.
We found that for responses not refused due to being harmful but because the tool could not resolve the issue after being called, GPT-4 often categorizes them as successfully refusing unsafe instructions.
So, for harmful instructions, we manually check the the refusal responses with "is_refused": true
, and label the fake refusal response as "is_refused": false
.
Finally, you can calculate the refusal response rate by "is_refused": true
after manual check in the .json file and unsafe_input_intro.json
and unsafe_input_safellama.json
).
For autonomy, run python evaluation/ToolAlign/without_tool.py
to evaluate the direct response rate, and you will obtain a .json file.
The file structure should be
{
"47": {
"answer_without_tool": false # or true
},
...
}
You can calculate the direct response rate by
"answer_without_tool": true
in the .json file, and