
tool-call: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars #12034

Open
ochafik wants to merge 33 commits into master

Conversation

@ochafik ochafik commented Feb 22, 2025

TL;DR: fixes tool calling of Qwen 2.5 Coder 0.5B/1.5B/3B/7B/32B... at any temperature

instructions to build this branch
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git remote add ochafik https://github.com/ochafik/llama.cpp
git fetch ochafik
git checkout ochafik/tool-bench-prod
cmake -B build -DLLAMA_CURL=1
cmake --build build -t llama-server --parallel --config Release
alias llama-server=./build/bin/llama-server
llama-server --jinja -fa -c 0 -hf unsloth/Qwen2.5-Coder-7B-Instruct-128K-GGUF
```
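A quick way to exercise the tool-call path once the server above is running is to post an OpenAI-style chat completion with a `tools` array. The snippet below is a rough sketch rather than part of this PR; it assumes the default port 8080 and mirrors the weather test referenced further down.

```python
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "qwen2.5-coder",  # free-form here; a single-model llama-server serves the loaded model (assumption)
        "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        "temperature": 1.0,
    },
    timeout=600,
)
# A compliant run returns a tool_calls entry naming get_weather instead of a plain-text answer.
print(resp.json()["choices"][0]["message"].get("tool_calls"))
```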
  • Added support for regex grammar triggers, and honour the flag that restricts them to matching at the start of the output only (this was already declared but not implemented; it should avoid spurious triggering when triggers are defined as wide catch-alls). A minimal illustration of the idea follows this list.
    • In llama.h, llama_sampler_init_grammar_lazy (which took tokens or words) is deprecated in favour of llama_sampler_init_grammar_lazy_patterns (which takes tokens or full-string regex patterns with a capture group that marks where the grammar starts being triggered)
  • Dramatically improved the tool-call success rate of Qwen 2.5 Coder (Hermes 2 format) by adding looser, regex-based triggers that match what the model actually tends to output, especially at higher temperatures
  • Added scripts/tool_bench.py to evaluate the tool-call compliance probability of llama-server & ollama across models and temperatures
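To make the trigger-pattern idea concrete, here is a small self-contained Python sketch (illustration only; the pattern below is hypothetical, not the exact regex shipped in this PR, and the real implementation lives in the C++ sampler): a full-string pattern is matched against the partial output, and the capture group marks the offset from which the lazy grammar takes over.

```python
import re

# Hypothetical trigger pattern: an optional chatty preamble, then the opening of a
# Hermes-2-style tool call. The capture group marks where grammar enforcement begins.
TRIGGER = re.compile(r"(?:.*?\n)?(<tool_call>\s*\{)", re.DOTALL)

def find_trigger(output: str):
    """Return the offset from which the grammar would be enforced, or None."""
    m = TRIGGER.match(output)  # anchored at the start, like a full-string pattern
    if not m:
        return None
    return m.start(1)  # grammar kicks in at the start of the capture group

for text in ['<tool_call>{"name": "get_weather"}',
             'Sure! Here is the call:\n<tool_call>{']:
    print(find_trigger(text), repr(text))
```

Running this prints offset 0 for a bare tool call and offset 24 when the model adds a preamble before the <tool_call> tag, so the grammar only constrains the actual call.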

The following heatmap shows the compliance ratio on two very basic tool-call tests (the hello-world and weather tests from examples/server/tests/unit/test_tool_call.py, now shared with the bench tool). There are three pairs of columns: llama-server from this PR, the baseline llama-server (master), and ollama.

[heatmap image]

[heatmap image: qwenc1.5b]

```bash
export ARGS=( --n 30 --llama-baseline="$(which llama-server)" --temp -1 --temp 0 --temp 0.5 --temp 0.75 --temp 1 --temp 1.5 --temp 2 --temp 5 )

./scripts/tool_bench.py run ${ARGS[@]} --model "Qwen 2.5 Coder 7B Q4_K_M"   --output ../qwenc7b.jsonl   --hf unsloth/Qwen2.5-Coder-7B-Instruct-128K-GGUF:Q4_K_M   --ollama qwen2.5-coder:7b-instruct-q4_K_M
./scripts/tool_bench.py run ${ARGS[@]} --model "Qwen 2.5 Coder 1.5B Q4_K_M" --output ../qwenc1.5b.jsonl --hf unsloth/Qwen2.5-Coder-1.5B-Instruct-128K-GGUF:Q4_K_M --ollama qwen2.5-coder:1.5b-instruct-q4_K_M
```

See gist with results for many more models

Notes about results:

  • the failures of llama-server at temp = 2 come down to model humour / stylistic choices ("Sure! You can use the following Python code..." instead of a tool call)
  • ollama seems to only recognize the tool-call format from the template, but models like Qwen 2.5 Coder 7B are quite... creative in their tool-call outputs, especially at higher temperatures.
  • ollama's default temperature seems to be 0.6 (hence its row at temp = None roughly matches the results of the lower-temperature rows)
  • The tests may need further tweaking to accept arguably “correct” answers. The framing of the hello-world test is questionable; sometimes models just explain how they would write the code.
  • The benchmark tool also supports running test_calc_results, which evaluates how well a model follows up on tool results. That test shows more varied failure modes, so it is not evaluated by default.

TODO:

  • Run & share more bench results (esp. other Qwen Coder variants!)
  • Stabilize tests / CI
  • Analyze bench times

GuuD commented Feb 22, 2025

Was Qwen 2.5 Coder even trained for tool use? 🤯


ochafik commented Feb 23, 2025

Was Qwen 2.5 Coder even trained for tool use? 🤯

@GuuD I guess all models must be, to some extent, these days. Their technical report only mentions in passing that BigCodeBench is "primarily aimed at evaluating the ability of tool-use and complex instruction following", and their results on that benchmark look quite decent. But given the variety of outputs the model wraps tool calls in, I doubt they stuck to the syntax used in their Jinja template.

@ochafik ochafik marked this pull request as ready for review February 25, 2025 12:01
@ochafik ochafik requested a review from ngxson as a code owner February 25, 2025 12:01