Thank you for your interest in contributing to Reasoning Gym! This document provides guidelines and instructions for contributing to the project.
-
Clone the repository:
git clone https://github.com/open-thought/reasoning-gym.git
-
Create a virtual environment (using conda):
conda create --name reasoning_gym python=3.11 -y
conda activate reasoning_gym
-
Install the package in editable mode:
pip install -e .
-
Install development dependencies:
pip install -r requirements-dev.txt
When creating new datasets, please follow these guidelines:
-
Focus on Complex Problems:
- Prioritize problems where guessing has a low probability of success (e.g., number multiplication)
- Avoid tasks with small answer sets (true/false, multiple-choice) as they create noisy rewards for RL
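To make the guessing argument concrete, here is a rough sketch comparing the chance of a lucky guess across answer-set sizes. The helper `random_guess_success` is hypothetical (it is not part of Reasoning Gym), and treating every integer in the product range as a candidate answer is a simplifying assumption.

```python
def random_guess_success(num_possible_answers: int) -> float:
    """Probability that a uniformly random guess hits the single correct answer."""
    if num_possible_answers < 1:
        raise ValueError("need at least one possible answer")
    return 1.0 / num_possible_answers


# True/false task: two possible answers.
true_false = random_guess_success(2)

# Four-option multiple-choice task.
multiple_choice = random_guess_success(4)

# Product of two 3-digit numbers: results lie between 100*100 and 999*999.
# Assumption: a guesser picks uniformly among all integers in that range.
candidates = 999 * 999 - 100 * 100 + 1
multiplication = random_guess_success(candidates)

print(f"true/false:      {true_false}")
print(f"multiple choice: {multiple_choice}")
print(f"multiplication:  {multiplication:.2e}")
```

With a small answer set, a policy that guesses still collects a sizeable share of the reward, which is the noise the guideline above warns about.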
-
Implementation Requirements:
- Create a configuration class
- Derive your dataset class from ProceduralDataset (see dataset.py)
- Include comprehensive unit tests
- Return dictionary items with keys: "question", "answer", and "metadata"
- For datasets with multiple correct answers, override the score_answer() method (return value range: [0, 1])
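The requirements above might look roughly like the following sketch. To keep it runnable on its own, it uses a small stand-in base class, `ProceduralDatasetStandIn`, instead of the real ProceduralDataset; the config fields, the constructor, and the `score_answer(answer, entry)` signature are assumptions made for illustration, so check dataset.py and chain_sum.py for the actual interface.

```python
from dataclasses import dataclass
from random import Random
from typing import Any, Optional


@dataclass
class MultiplicationConfig:
    """Hypothetical configuration class for this sketch."""

    min_value: int = 100
    max_value: int = 999
    seed: Optional[int] = None
    size: int = 500

    def validate(self) -> None:
        assert self.min_value <= self.max_value, "min_value must not exceed max_value"
        assert self.size > 0, "size must be positive"


class ProceduralDatasetStandIn:
    """Stand-in for the real ProceduralDataset (assumption: the real class differs)."""

    def __init__(self, config: MultiplicationConfig):
        config.validate()
        self.config = config
        self.seed = config.seed if config.seed is not None else 0
        self.size = config.size

    def __len__(self) -> int:
        return self.size

    def score_answer(self, answer: Optional[str], entry: dict[str, Any]) -> float:
        # Default scoring: exact string match gives 1.0, anything else 0.0.
        return 1.0 if answer is not None and answer == entry["answer"] else 0.0


class MultiplicationDataset(ProceduralDatasetStandIn):
    def __getitem__(self, idx: int) -> dict[str, Any]:
        if not 0 <= idx < self.size:
            raise IndexError(idx)
        # Seeding per index makes every item reproducible.
        rng = Random(self.seed + idx)
        a = rng.randint(self.config.min_value, self.config.max_value)
        b = rng.randint(self.config.min_value, self.config.max_value)
        return {
            "question": f"What is {a} * {b}?",
            "answer": str(a * b),
            "metadata": {"a": a, "b": b},
        }

    def score_answer(self, answer: Optional[str], entry: dict[str, Any]) -> float:
        # Override: accept equivalent spellings such as "1,234" or " 1234 ".
        if answer is None:
            return 0.0
        normalized = answer.strip().replace(",", "")
        return 1.0 if normalized == entry["answer"] else 0.0


dataset = MultiplicationDataset(MultiplicationConfig(seed=42, size=5))
item = dataset[0]
print(item["question"], "->", item["answer"])
```

The override always returns a value in [0, 1]; a dataset with partially correct answers could return intermediate values instead of only 0.0 and 1.0.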
-
Getting Started:
- Review an example implementation:
- Configuration & dataset class: chain_sum.py
- Unit tests: test_chain_sum.py
- Write clear question prompts that an average human can understand and answer correctly
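As a starting point for unit tests, the following pytest-style sketch checks three properties that usually matter for a procedural dataset: determinism for a fixed seed, the required item keys, and consistency between the answer and the metadata. The `make_item` function is a hypothetical stand-in for your dataset's item generation; the tests in test_chain_sum.py may be organized differently.

```python
from random import Random


def make_item(seed: int, idx: int) -> dict:
    """Hypothetical stand-in for a dataset's item generation."""
    rng = Random(seed + idx)
    a, b = rng.randint(1, 99), rng.randint(1, 99)
    return {
        "question": f"What is {a} + {b}?",
        "answer": str(a + b),
        "metadata": {"terms": [a, b]},
    }


def test_items_are_deterministic():
    # The same seed and index must always produce the same item.
    assert make_item(seed=42, idx=3) == make_item(seed=42, idx=3)


def test_items_have_required_keys():
    item = make_item(seed=42, idx=0)
    assert set(item) == {"question", "answer", "metadata"}


def test_answer_matches_metadata():
    # The stored answer should be derivable from the stored metadata.
    item = make_item(seed=42, idx=0)
    assert item["answer"] == str(sum(item["metadata"]["terms"]))
```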
-
Fork and Clone:
- Fork the repository
- Clone your fork locally
- Read more about forks
-
Create a Feature Branch:
- Work on a new branch
- Keep changes focused and minimal
-
Code Quality:
- Install pre-commit hooks:
pre-commit install
- Run pre-commit run -a before committing
- When using AI coding assistants (cursor, aider, etc.), ensure proper formatting
-
Submit Your PR:
- Create a Pull Request
- Request review
- Do not include changes to GALLERY.md (it's updated automatically)
- (Optional, but desirable) If you have an OpenRouter API key, please try running DeepSeek R1 against 5-10 samples from your dataset to check for unexpected issues.
- Update the configuration file eval/r1/yaml/test.yaml with your dataset:
    # test.yaml
    model: deepseek/deepseek-r1
    category: test
    datasets:
      - {YOUR_DATASET_NAME}
    eval_dir: eval/r1
    dataset_size: 10
    dataset_seed: 42
    developer_role: system
- Run the evaluation script:
python eval/r1/eval.py --yaml "eval/r1/yaml/test.yaml"
- Review the results in eval/r1/test/{YOUR_DATASET_NAME}.json and make sure there are no unexpected issues with the dataset generation, model's instruction following, or the scoring function.
- Include the results in your PR description.
-
Review Process:
- Address reviewer feedback promptly
- Keep discussions constructive
- Once approved, your changes will be merged into main
Join our community discussion in the #reasoning-gym channel on the GPU-Mode Discord server.