BigCodeBench v0.2.1.post3
What's Changed
- Fix the `calibration` setting in code evaluation.
- Add the `--no_execute` argument for code evaluation.
- Support concurrent API inference for `o1` and `deepseek-chat`.
- Fix API inference for Google Gemini.
- Add the `--instruction_prefix` and `--response_prefix` arguments for code generation.
- Change the `--id_range` input type.
- Add the `--revision` argument for code generation.
Evaluated LLMs (144 models)
- Qwen2.5-Coder-32B-Instruct
- grok-beta
- claude-3-5-haiku-20241022
Full Changelog: v0.2.0...v0.2.1.post2