SWE-Bench harness development tool #521
base: develop
Conversation
@victorxheng this is awesome!
I left some comments in there. At a high level, one of my biggest takeaways is that this would be much simplified if we use Modal for the docker images, running agents, etc.
It is baked into this project in other ways and I think it's a natural complement. @jemeza-codegen curious to get your takes as well.
If we clean out the .json files and tests pass, I think it's good to merge; we want to iterate on this quickly.
Thank you for submitting this and congrats on TreeHacks!
# Get the diff between the current state and the original commit
model_patch = diff_versus_commit(codebase.repo_path, base_commit)
@victorxheng this can be Codebase.get_diff()
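A sketch of what that simplification could look like, assuming `get_diff` accepts the base commit (verify the actual signature):

```python
# Hedged sketch of the suggested change; confirm Codebase.get_diff's
# signature before relying on this.
model_patch = codebase.get_diff(base_commit)
```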
I think we should .gitignore generated outputs.
return passed, log_text
def main_check_docker_images():
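For context, a check like this might look roughly like the sketch below, using the Docker SDK; this is an illustration rather than the PR's implementation of `main_check_docker_images`, and the image-name filter is a guess at the SWE-bench naming convention:

```python
import docker  # pip install docker

def check_swebench_images(name_filter: str = "sweb.eval") -> list[str]:
    """List locally available evaluation images (illustrative sketch only)."""
    client = docker.from_env()
    found: list[str] = []
    for image in client.images.list():
        # Each image may carry several repo:tag names; keep the matching ones.
        found.extend(tag for tag in image.tags if name_filter in tag)
    return found
```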
This is pretty interesting. I think spinning up sandboxes for eval can be done with Modal, which is built for spiky compute just like this. Are you familiar with it, @victorxheng? We have discussed this internally a lot.
Yes, I was going to get it to work with Modal for TreeHacks but didn't have enough time. It would work much better that way.
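For reference, the kind of Modal setup being discussed might look roughly like this; it is a sketch of the fan-out pattern, not the PR's code, and the image contents, timeout, and result fields are placeholders:

```python
import modal

image = modal.Image.debian_slim().pip_install("swebench")
app = modal.App("swebench-eval")

@app.function(image=image, timeout=30 * 60)
def evaluate_instance(instance_id: str, model_patch: str) -> dict:
    # Placeholder body: apply the patch and run the instance's tests here.
    return {"instance_id": instance_id, "resolved": False}

@app.local_entrypoint()
def main():
    # Example input; real predictions would come from the harness output.
    predictions = [("astropy__astropy-12907", "<patch text>")]
    for result in evaluate_instance.starmap(predictions):
        print(result)
```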
Also @victorxheng we are happy to take it from here if you don't want to clean it up etc., up to you!
@jayhack A lot of this code can definitely be cleaned up (it was a very messy write-up for TreeHacks) and iterated on. I would love to help out, but I might not have enough time; I would definitely love to see where you can take this. The evaluation part needs the most work, and having a good way to iterate on the coding agent and evaluate it would help a lot in developing on top of Codegen.
# Motivation

Adds a SWE Bench harness to the codegen agent.

# Content

- Loads the SWE Bench dataset
- For each entry in the dataset, a Modal instance is created where an agent can run
- The output of each agent is stored and tested on Modal using `swebench`
- Documentation in the README

Contributions from:
- @victorxheng: #521

# Please check the following before marking your PR as ready for review

- [x] I have updated the documentation or added new documentation as needed

Co-authored-by: jemeza-codegen <[email protected]>
Codegen Harness and Evaluator for SWE Bench Development Tool @ Treehacks
This pull request contains a harness and evaluator for SWE Bench, enabling developers to test and evaluate their Codegen agents against the SWE Bench leaderboard.
It integrates directly into the Codegen agentic framework and can be extended further.
Setup
Remember to install all the dependencies for the environment.
Usage
Edit `agent.py`, your Codegen agent
This file contains the main logic for the agent.
The agent taps into tree-sitter-based code analysis through Codegen. You can modify it by adding additional tools, extending its capabilities, adjusting its prompts, and more (a hypothetical tool sketch follows below).
It is invoked in the harness script.
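As a loose illustration of the "additional tools" idea, a tool can be a small helper over the parsed codebase that the agent is allowed to call. The names and attributes below are hypothetical and not the actual Codegen tool API:

```python
# Hypothetical custom tool; the real tool-registration mechanism in agent.py
# may differ, so treat this as a sketch of the shape rather than the API.
def find_function_definitions(codebase, name: str) -> list[str]:
    """Return the file paths of functions whose name matches `name`."""
    # `codebase.functions` and `.filepath` are assumed attributes of the
    # Codegen object model; adjust to whatever agent.py actually exposes.
    return [f.filepath for f in codebase.functions if f.name == name]
```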
Run `harness.py` to run the agent
This script will gather the correct dataset, run the agent, and save the results.
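At a high level, the harness has to turn each dataset entry into a prediction in the format the SWE-bench evaluator expects. A minimal sketch of that loop is below; the dataset split, model name, and output path are assumptions, and `run_agent` is a stand-in for the actual agent invocation:

```python
import json
from datasets import load_dataset  # Hugging Face `datasets` package

def run_agent(repo: str, base_commit: str, problem_statement: str) -> str:
    """Placeholder for the real agent call in agent.py; returns a unified diff."""
    return ""

# Load the public SWE-bench Lite test split and run the agent on each task.
dataset = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

predictions = []
for task in dataset:
    patch = run_agent(task["repo"], task["base_commit"], task["problem_statement"])
    predictions.append({
        "instance_id": task["instance_id"],
        "model_name_or_path": "codegen-agent",  # assumed label
        "model_patch": patch,
    })

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```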
Run `report.py` to generate a report
This script generates a report from the results, looping through each result and evaluating it. Currently, there is an error in the Docker image.
There are currently example predictions in the `predictions/results` folder.
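A report pass over those result files might look roughly like the sketch below; the file layout and the `resolved` field are assumptions rather than the actual format written by `report.py`:

```python
import json
from pathlib import Path

# Illustrative sketch: tally resolved vs. unresolved instances from
# per-instance JSON result files in the predictions/results folder.
results_dir = Path("predictions/results")
total = resolved = 0
for result_file in results_dir.glob("*.json"):
    data = json.loads(result_file.read_text())
    total += 1
    resolved += bool(data.get("resolved"))

print(f"Resolved {resolved}/{total} instances")
```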