This repository contains the official code for the paper:
LLM Agents Making Agent Tools
Georg Wölflein, Dyke Ferber, Daniel Truhn, Ognjen Arandjelović and Jakob N. Kather
arXiv, Feb 2025.
Abstract:
Tool use has turned large language models (LLMs) into powerful agents that can perform complex multi-step tasks by dynamically utilising external software components. However, these tools must be implemented in advance by human developers, hindering the applicability of LLM agents in domains which demand large numbers of highly specialised tools, like in life sciences and medicine. Motivated by the growing trend of scientific studies accompanied by public code repositories, we propose ToolMaker, a novel agentic framework that autonomously transforms papers with code into LLM-compatible tools. Given a short task description and a repository URL, ToolMaker autonomously installs required dependencies and generates code to perform the task, using a closed-loop self-correction mechanism to iteratively diagnose and rectify errors. To evaluate our approach, we introduce a benchmark comprising 15 diverse and complex computational tasks spanning both medical and non-medical domains with over 100 unit tests to objectively assess tool correctness and robustness. ToolMaker correctly implements 80% of the tasks, substantially outperforming current state-of-the-art software engineering agents. ToolMaker therefore is a step towards fully autonomous agent-based scientific workflows.

First, install uv.
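If uv is not already available on your system, it can typically be installed via its standalone installer or from PyPI; the commands below reflect what uv's documentation recommends at the time of writing, so check there if they have changed:
curl -LsSf https://astral.sh/uv/install.sh | sh # standalone installer (macOS/Linux)
pip install uv # alternative, via pip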
Then, create a virtual environment and install the dependencies with:
uv sync
Also, create a .env file in the root directory with the following content:
OPENAI_API_KEY=sk-proj-... # your OpenAI API key (required to run toolmaker)
HF_TOKEN=hf_... # your Hugging Face API key (required for some benchmark tools)
CUDA_VISIBLE_DEVICES=0 # if you have a GPU
First, use toolmaker to install the repository (replace $TOOL with the path to the tool definition file, e.g. benchmark/tasks/uni_extract_features.yaml):
uv run python -m toolmaker install $TOOL --name my_tool_installed
Then, use toolmaker to create the tool:
uv run python -m toolmaker create $TOOL --name my_tool --installed my_tool_installed
Finally, you can run the tool on one of the test cases:
uv run python -m toolmaker run my_tool --name kather100k_muc
Here, kather100k_muc is the name of the test case defined in the tool definition file.
See benchmark/README.md for details on how tools are defined.
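Putting the three steps together for the example task above (the names my_uni_installed and my_uni_tool are arbitrary placeholders, and this assumes kather100k_muc is one of the test cases defined in uni_extract_features.yaml):
uv run python -m toolmaker install benchmark/tasks/uni_extract_features.yaml --name my_uni_installed
uv run python -m toolmaker create benchmark/tasks/uni_extract_features.yaml --name my_uni_tool --installed my_uni_installed
uv run python -m toolmaker run my_uni_tool --name kather100k_muc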
To visualize the trajectory of the tool creation process (showing actions, LLM calls, etc.), use the following command:
uv run python -m toolmaker.utils.visualize -i tool_output/tools/my_uni_tool/logs.jsonl -o my_uni_tool.html
This will create a my_uni_tool.html file in the current directory, which you can view in your browser.
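The logs.jsonl path above corresponds to a tool created with --name my_uni_tool; in general, the logs for a tool are presumably written to tool_output/tools/<tool_name>/logs.jsonl, where <tool_name> is the value passed via --name when creating the tool.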
To run the unit tests that constitute the benchmark, use the following command (note that this requires the benchmark dependency group to be installed via uv sync --group benchmark):
uv run python -m pytest benchmark/tests --junit-xml=benchmark.xml -m cached # only run cached tests (faster)
This will create a benchmark.xml file containing JUnit-style XML test results.
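To run the complete benchmark rather than only the cached tests, a reasonable sketch is to drop the -m cached filter; note that the full run will likely be much slower, since it may exercise the generated tools end to end:
uv sync --group benchmark # install the benchmark dependency group first
uv run python -m pytest benchmark/tests --junit-xml=benchmark.xml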
To run toolmaker's own unit tests (not to be confused with the unit tests in the benchmark), use the following command:
uv run python -m pytest tests
If you find our work useful in your research or use parts of this code, please consider citing our preprint:
@misc{wolflein2025toolmaker,
  author        = {W\"{o}lflein, Georg and Ferber, Dyke and Truhn, Daniel and Arandjelovi\'{c}, Ognjen and Kather, Jakob Nikolas},
  title         = {{LLM} Agents Making Agent Tools},
  year          = {2025},
  eprint        = {2502.11705},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2502.11705}
}