SWE-Bench harness development tool #521

Open
wants to merge 5 commits into develop

Conversation

victorxheng

Codegen Harness and Evaluator for SWE Bench Development Tool @ Treehacks

This pull request contains a harness and evaluator for SWE-bench, enabling developers to test and evaluate their codegen models against the SWE-bench leaderboard.

It integrates directly with the Codegen agentic framework and can serve as a foundation for further development.

Setup

Remember to install all the dependencies for the environment.

Usage

Edit agent.py, your codegen agent

This file contains the main logic for the agent.

The agent taps into tree-sitter through Codegen. You can modify it by adding tools, extending its capabilities, refining its prompts, and more.

It is invoked in the harness script.
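
A rough sketch of the shape such an agent file might take (the Tool and Agent classes and the run_agent signature are illustrative placeholders, not Codegen's actual API):

# agent.py sketch: a tool-extensible agent with a single entry point
# that harness.py can invoke per SWE-bench instance.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Tool:
    """One capability exposed to the agent, e.g. code search or file edits."""
    name: str
    description: str
    fn: Callable[..., str]

@dataclass
class Agent:
    system_prompt: str = "You are a software engineering agent."
    tools: list[Tool] = field(default_factory=list)

    def add_tool(self, tool: Tool) -> None:
        # Extending the agent means registering more tools like this one.
        self.tools.append(tool)

def run_agent(entry: dict) -> str:
    """Check out entry["base_commit"] of entry["repo"], let the agent
    address entry["problem_statement"], and return a unified diff."""
    agent = Agent()
    # Tools built on Codegen's tree-sitter analysis would be registered
    # here, and the agent loop would produce the patch returned below.
    return ""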

Run harness.py to run the agent

This script will gather the correct dataset, run the agent, and save the results.
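
A minimal sketch of that flow, assuming the SWE-bench Lite dataset on Hugging Face and the hypothetical run_agent entry point from the agent.py sketch above:

# harness.py sketch: load the dataset, run the agent on each instance,
# and save predictions in the JSON shape swebench's evaluator expects.
import json
from datasets import load_dataset
from agent import run_agent  # hypothetical entry point, see above

def main() -> None:
    dataset = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
    predictions = []
    for entry in dataset:
        predictions.append({
            "instance_id": entry["instance_id"],
            "model_patch": run_agent(entry),
            "model_name_or_path": "codegen-agent",  # arbitrary run label
        })
    with open("predictions/results/predictions.json", "w") as f:
        json.dump(predictions, f, indent=2)

if __name__ == "__main__":
    main()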

Run report.py to generate a report

This script generates a report from the results: it loops through all of the results and evaluates each one. Currently, there is an error in the docker image.

There are currently example predictions in the predictions/results folder.
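
A minimal sketch of what report.py's loop might look like (the per-instance JSON files and the "resolved" field are assumptions about the results format, not the actual layout):

# report.py sketch: tally per-instance results into a summary.
import json
from pathlib import Path

def generate_report(results_dir: str = "predictions/results") -> None:
    resolved, total = 0, 0
    for path in sorted(Path(results_dir).glob("*.json")):
        result = json.loads(path.read_text())
        total += 1
        if result.get("resolved"):  # hypothetical field name
            resolved += 1
        print(f"{path.stem}: {'PASS' if result.get('resolved') else 'FAIL'}")
    if total:
        print(f"Resolved {resolved}/{total} instances ({resolved / total:.1%})")

if __name__ == "__main__":
    generate_report()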

victorxheng requested review from codegen-team and a team as code owners on February 16, 2025, 16:46
CLAassistant commented Feb 16, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
2 out of 3 committers have signed the CLA.

✅ jemeza-codegen
✅ tomcodgen
❌ victorxheng
You have signed the CLA already but the status is still pending? Let us recheck it.

jayhack (Contributor) left a comment


@victorxheng this is awesome!

I left some comments in there. At a high level, one of my biggest takeaways is that this would be much simpler if we used Modal for the docker images, running the agents, etc.

Modal is baked into this project in other ways, and I think it's a natural complement. @jemeza-codegen curious to get your take as well.

If we clean out the .json files and the tests pass, however, I think it's good to merge; we want to iterate on this quickly.

Thank you for submitting this and congrats on TreeHacks!



# Get the diff between the current state and the original commit
model_patch = diff_versus_commit(codebase.repo_path, base_commit)
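
For context, a plausible implementation of this helper (a sketch; the PR's actual diff_versus_commit may differ):

# Shell out to git and diff the working tree against the base commit.
import subprocess

def diff_versus_commit(repo_path: str, commit: str) -> str:
    """Return a unified diff of the working tree relative to commit."""
    return subprocess.run(
        ["git", "-C", repo_path, "diff", commit],
        capture_output=True, text=True, check=True,
    ).stdout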
Contributor:
I think we should .gitignore generated outputs

return passed, log_text


def main_check_docker_images():
Contributor:
This is pretty interesting. I think spinning up sandboxes for eval can be done with Modal, which is built for spiky compute just like this. Are you familiar, @victorxheng? We have discussed this internally a lot.

victorxheng (Author):
Yes, I was going to get it working with Modal for TreeHacks but didn't have enough time. It would work much better that way.
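
For reference, a rough sketch of the Modal direction being discussed (the image setup and the evaluate_instance body are placeholder assumptions, not the eventual implementation):

# Fan evaluation out to one Modal container per prediction instead of
# managing local docker images.
import json
import pathlib
import modal

image = modal.Image.debian_slim().pip_install("swebench")
app = modal.App("swe-bench-eval")

@app.function(image=image, timeout=30 * 60)
def evaluate_instance(prediction: dict) -> dict:
    # The real body would apply prediction["model_patch"] at the base
    # commit and run the instance's tests via swebench.
    return {"instance_id": prediction["instance_id"], "resolved": False}

@app.local_entrypoint()
def main() -> None:
    predictions = json.loads(
        pathlib.Path("predictions/results/predictions.json").read_text()
    )
    for result in evaluate_instance.map(predictions):
        print(result)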

jayhack (Contributor) commented Feb 17, 2025

Also @victorxheng we are happy to take it from here if you don't want to clean it up etc., up to you!

victorxheng (Author) commented

@jayhack A lot of this code can definitely be cleaned up (it was a very messy write-up for TreeHacks) and iterated on. I'd love to help out but might not have enough time, so I'd love to see where you can take this. The evaluation part needs the most work, and having a good way to iterate on the coding agent and evaluate it would help a lot in developing on top of Codegen.

jemeza-codegen mentioned this pull request Feb 21, 2025
jemeza-codegen added a commit that referenced this pull request Feb 21, 2025
# Motivation
Adds a SWE Bench Harness to the codegen agent.

# Content
- Loads the SWE Bench dataset
- For each entry in the dataset, a Modal instance is created where an agent can run
- The output of each agent is stored and tested on Modal using `swebench`
- Documentation in the README

Contributions from:
- @victorxheng : #521

# Please check the following before marking your PR as ready for review
- [x] I have updated the documentation or added new documentation as needed

---------

Co-authored-by: jemeza-codegen <[email protected]>