SWE-Bench harness development tool #521
base: develop
Conversation
@victorxheng this is awesome!
I left some comments in there. At a high level, one of my biggest takeaways is that this would be much simplified if we use Modal for the docker images, running agents, etc.
It is baked into this project in other ways and I think it's a natural complement. @jemeza-codegen curious to get your takes as well.
If we clean out the .json files and tests pass, I think it's good to merge; we want to iterate on this quickly.
Thank you for submitting this and congrats on TreeHacks!
# Get the diff between the current state and the original commit
model_patch = diff_versus_commit(codebase.repo_path, base_commit)
@victorxheng this can be Codebase.get_diff()
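A sketch of what that simplification could look like, assuming `get_diff` accepts the base commit (verify the actual signature):

```python
# Hedged sketch of the suggested change; confirm Codebase.get_diff's
# signature before relying on this.
model_patch = codebase.get_diff(base_commit)
```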
I think we should .gitignore generated outputs.
return passed, log_text
def main_check_docker_images():
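For context, a check like this might look roughly like the sketch below, using the Docker SDK; this is an illustration rather than the PR's implementation of `main_check_docker_images`, and the image-name filter is a guess at the SWE-bench naming convention:

```python
import docker  # pip install docker

def check_swebench_images(name_filter: str = "sweb.eval") -> list[str]:
    """List locally available evaluation images (illustrative sketch only)."""
    client = docker.from_env()
    found: list[str] = []
    for image in client.images.list():
        # Each image may carry several repo:tag names; keep the matching ones.
        found.extend(tag for tag in image.tags if name_filter in tag)
    return found
```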
This is pretty interesting. I think spinning up sandboxes for eval can be done with Modal, which is built for spiky compute just like this. Are you familiar with it, @victorxheng? We have discussed this internally a lot.
Yes, I was going to get it to work with Modal for TreeHacks but didn't have enough time. It would work much better that way.
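For reference, the kind of Modal setup being discussed might look roughly like this; it is a sketch of the fan-out pattern, not the PR's code, and the image contents, timeout, and result fields are placeholders:

```python
import modal

image = modal.Image.debian_slim().pip_install("swebench")
app = modal.App("swebench-eval")

@app.function(image=image, timeout=30 * 60)
def evaluate_instance(instance_id: str, model_patch: str) -> dict:
    # Placeholder body: apply the patch and run the instance's tests here.
    return {"instance_id": instance_id, "resolved": False}

@app.local_entrypoint()
def main():
    # Example input; real predictions would come from the harness output.
    predictions = [("astropy__astropy-12907", "<patch text>")]
    for result in evaluate_instance.starmap(predictions):
        print(result)
```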
Also @victorxheng we are happy to take it from here if you don't want to clean it up etc., up to you!
@jayhack A lot of this code can definitely be cleaned up (it was a very messy write-up for TreeHacks) and iterated on. I would love to help out, but I might not have enough time; I would definitely love to see where you can take this. The evaluation part needs the most work, and having a good way to iterate on the coding agent and evaluate it would help a lot in developing on top of Codegen.
# Motivation

Adds a SWE Bench harness to the codegen agent.

# Content

- Loads the SWE Bench dataset
- For each entry in the dataset, a Modal instance is created where an agent can run
- The output of each agent is stored and tested on Modal using `swebench`
- Documentation in the README

Contributions from:
- @victorxheng: #521

# Please check the following before marking your PR as ready for review

- [x] I have updated the documentation or added new documentation as needed

Co-authored-by: jemeza-codegen <[email protected]>
Codegen Harness and Evaluator for SWE Bench Development Tool @ Treehacks
This pull request contains a harness and evaluator for SWE Bench, enabling developers to test and evaluate their Codegen agents against the SWE Bench leaderboard.
It integrates directly into the Codegen agentic framework and can be extended further.
Setup
Remember to install all the dependencies for the environment.
Usage
Edit `agent.py`, your Codegen agent
This file contains the main logic for the agent.
The agent taps into tree-sitter-based code analysis through Codegen. You can modify it by adding additional tools, extending its capabilities, adjusting its prompts, and more (a hypothetical tool sketch follows below).
It is invoked in the harness script.
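As a loose illustration of the "additional tools" idea, a tool can be a small helper over the parsed codebase that the agent is allowed to call. The names and attributes below are hypothetical and not the actual Codegen tool API:

```python
# Hypothetical custom tool; the real tool-registration mechanism in agent.py
# may differ, so treat this as a sketch of the shape rather than the API.
def find_function_definitions(codebase, name: str) -> list[str]:
    """Return the file paths of functions whose name matches `name`."""
    # `codebase.functions` and `.filepath` are assumed attributes of the
    # Codegen object model; adjust to whatever agent.py actually exposes.
    return [f.filepath for f in codebase.functions if f.name == name]
```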
Run `harness.py` to run the agent
This script will gather the correct dataset, run the agent, and save the results.
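At a high level, the harness has to turn each dataset entry into a prediction in the format the SWE-bench evaluator expects. A minimal sketch of that loop is below; the dataset split, model name, and output path are assumptions, and `run_agent` is a stand-in for the actual agent invocation:

```python
import json
from datasets import load_dataset  # Hugging Face `datasets` package

def run_agent(repo: str, base_commit: str, problem_statement: str) -> str:
    """Placeholder for the real agent call in agent.py; returns a unified diff."""
    return ""

# Load the public SWE-bench Lite test split and run the agent on each task.
dataset = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

predictions = []
for task in dataset:
    patch = run_agent(task["repo"], task["base_commit"], task["problem_statement"])
    predictions.append({
        "instance_id": task["instance_id"],
        "model_name_or_path": "codegen-agent",  # assumed label
        "model_patch": patch,
    })

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```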
Run `report.py` to generate a report
This script generates a report from the results, looping through each result and evaluating it. Currently, there is an error in the Docker image.
There are currently example predictions in the `predictions/results` folder.
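A report pass over those result files might look roughly like the sketch below; the file layout and the `resolved` field are assumptions rather than the actual format written by `report.py`:

```python
import json
from pathlib import Path

# Illustrative sketch: tally resolved vs. unresolved instances from
# per-instance JSON result files in the predictions/results folder.
results_dir = Path("predictions/results")
total = resolved = 0
for result_file in results_dir.glob("*.json"):
    data = json.loads(result_file.read_text())
    total += 1
    resolved += bool(data.get("resolved"))

print(f"Resolved {resolved}/{total} instances")
```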