tool-call: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars #12034
base: master
Conversation
Was Qwen 2.5 Coder even trained for tool use? 🤯
@GuuD I guess all models must be to some extent, these days. Their technical report only mentions in passing that BigCodeBench is "primarily aimed at evaluating the ability of tool-use and complex instruction following", and their results on that benchmark look quite decent. But given the variety of outputs the model wraps tool calls in, I doubt they stuck to the syntax used in their jinja template.
TL;DR: fixes tool calling of Qwen 2.5 Coder 0.5B/1.5B/3B/7B/32B... at any temperature
instructions to build this branch
- `llama.h`: deprecates `llama_sampler_init_grammar_lazy` (which used to take tokens or words) in favour of `llama_sampler_init_grammar_lazy_patterns`, which takes tokens or full-string regex patterns with a group that marks the position from which the grammar is triggered (see the sketch below)
- `scripts/tool_bench.py`: new script to evaluate the tool call compliance probability of `llama-server` & `ollama` on different models, at different temperatures

The following heatmap shows the compliance ratio on two super basic tool call tests (the hello world & weather tests from `examples/server/tests/unit/test_tool_call.py`, now shared with the bench tool), with 3 pairs of columns for llama-server of this PR, baseline llama-server (master), and ollama.

See the gist with results for many more models.
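To make the trigger-pattern idea concrete, here is a minimal Python sketch of the semantics described above: a full-string regex is matched against the text generated so far, and the start of a capture group marks the offset from which the lazy grammar takes over. The pattern, the `<tool_call>` opener, and the helper name are illustrative assumptions, not the actual patterns or C API used by llama.cpp (on the C side this sits behind the new `llama_sampler_init_grammar_lazy_patterns` entry point).

```python
import re
from typing import Optional

# Illustrative trigger patterns (not the ones shipped in llama.cpp): each is a
# full-string regex whose first capturing group marks where the grammar kicks in.
TRIGGER_PATTERNS = [
    # Allow arbitrary free-form text, then a <tool_call> opener; everything
    # from the opener onwards is grammar-constrained.
    re.compile(r"[\s\S]*?(<tool_call>[\s\S]*)"),
]

def find_grammar_trigger_offset(generated_text: str) -> Optional[int]:
    """Return the offset from which grammar-constrained sampling should start,
    or None if no trigger pattern matches the full generated text."""
    for pattern in TRIGGER_PATTERNS:
        m = pattern.fullmatch(generated_text)
        if m is not None and m.group(1) is not None:
            return m.start(1)
    return None

text = 'Let me check the weather.\n<tool_call>{"name": "get_weather"'
offset = find_grammar_trigger_offset(text)
print(offset, repr(text[offset:] if offset is not None else None))
```

The point of the capture group is that the model can ramble freely before the trigger (as Qwen 2.5 Coder tends to), while the grammar only constrains the output from the marked position onwards.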
Notes about results:

- Some failures are the model answering with plain text (e.g. "Sure! You can use the following Python code..." instead of a tool call).
- Results @ None temperature kinda fit the results of the lower rows.
- There is also a `test_calc_results` test which evaluates how well a model follows up on tool results (sketched below). This seems to have more varied failure modes, so it's not evaluated by default.

TODO:
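For context on the `test_calc_results` scenario mentioned in the notes: it checks whether a model actually uses a tool's result in its follow-up answer. Below is a rough sketch of the message flow such a test drives; the field names follow the OpenAI-style chat schema accepted by `llama-server`'s `/v1/chat/completions` endpoint, while the tool name, arguments, and assertion are illustrative and may differ from the real `test_tool_call.py`.

```python
import json

# Conversation state after the model has requested a tool call and the test
# harness has "executed" it; the next completion should use the result.
messages = [
    {"role": "user", "content": "What is 102 + 7 * 3?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {"name": "calc", "arguments": json.dumps({"expr": "102 + 7 * 3"})},
        }],
    },
    # The tool's output comes back with role "tool"; a compliant model's next
    # reply should state 123 instead of ignoring or re-deriving the value.
    {"role": "tool", "tool_call_id": "call_1", "content": "123"},
]

# The test would POST {"messages": messages, "tools": [...]} to the server and
# assert that the final assistant message mentions the tool's result.
print(json.dumps(messages, indent=2))
```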