A project for evaluating reasoning capabilities in large language models (LLMs).
Read this in other languages: English, 中文.
```shell
pip install llm_evaluation_in_reasoning
```
Create a `.env` file with the following:

```
OPENAI_API_KEY=<your key>
ANTHROPIC_API_KEY=<your key>
...
```
The API keys you provide are used to fetch the valid models supported by LiteLLM.
Supported benchmarks: GSM-Symbolic, GSM8K, MMLU, SimpleBench.
To run a benchmark:

```shell
llm_eval --model_name=ollama/qwen2.5:0.5b --dataset=SimpleBench # run llm_eval --help to see help information
```
Model support is based on LiteLLM; see the docs here: LiteLLM Providers.
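LiteLLM model names follow a `provider/model` convention, as in the `ollama/qwen2.5:0.5b` example above. A minimal sketch of that convention (the `split_model_name` helper below is illustrative only, not part of this project or of LiteLLM's API):

```python
def split_model_name(name: str) -> tuple[str, str]:
    """Split a LiteLLM-style model string into (provider, model).

    Names without a provider prefix are returned with an empty
    provider; LiteLLM itself applies its own defaults in that case.
    """
    provider, sep, model = name.partition("/")
    if not sep:
        # No "/" present: the whole string is the model name
        return "", name
    return provider, model
```

For example, `split_model_name("ollama/qwen2.5:0.5b")` yields `("ollama", "qwen2.5:0.5b")`.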
Clone the GitHub repo and cd into it:

```shell
git clone https://github.com/ashengstd/llm_evaluation_in_reasoning.git
cd llm_evaluation_in_reasoning
```
The best way to install dependencies is to use uv. If you don't have it installed in your environment, you can install it with the following:
```shell
curl -LsSf https://astral.sh/uv/install.sh | sh  # macOS and Linux
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"  # Windows
```
Then sync the project dependencies:

```shell
uv sync --all-extras
```