Code to evaluate WebArena #13

Open
shuyanzhou opened this issue May 21, 2024 · 5 comments

Comments

@shuyanzhou

Hi,

Thanks for the great work. I am wondering if you have plans to release the code to run WebArena?

@zehuichen123
Collaborator

Hi,
We directly adopt the evaluation code from AgentTuning :)

@shuyanzhou
Author

Thank you for the response. I am wondering, though, whether you perform multi-turn prompting to get a single action?
image

@zehuichen123
Collaborator

During inference, the model directly outputs JSON, or whatever format the system prompt requests. The chat-format data is used for training only.
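
For readers trying to reproduce this, a minimal sketch of pulling one action out of a JSON-format reply might look like the following. The function name `extract_json_action` and the `action` / `element_id` keys are hypothetical and only for illustration; the real schema is whatever the system prompt in the AgentTuning evaluation code requests.

```python
import json
import re


def extract_json_action(model_output: str) -> dict:
    """Return the first JSON object found in a model reply.

    Hypothetical helper: the reply may contain prose around the JSON,
    so every '{' is tried as a possible start of the object.
    """
    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", model_output):
        try:
            # raw_decode parses one JSON value starting at the given index
            obj, _end = decoder.raw_decode(model_output, match.start())
        except json.JSONDecodeError:
            continue  # not valid JSON from this brace, try the next one
        if isinstance(obj, dict):
            return obj
    raise ValueError("no JSON object found in model output")


# Example reply; the keys below are assumptions, not the real action schema.
reply = 'Sure.\n{"action": "click", "element_id": 12}'
action = extract_json_action(reply)
```

If the parsed object does not match the expected schema, counting the step as a failed action rather than re-prompting keeps the run single-turn.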

@shuyanzhou
Author

Thank you very much for the info. We attempted to reproduce the result with the default prompt, but the success rate (SR) is only 0.61%. Would you mind sharing the recorded trajectories so that we can compare them and see what may be going wrong on our end?

@shuyanzhou shuyanzhou reopened this Jun 3, 2024
@wang-qiuchen

Hello, our project was evaluated in January 2024, so you might need to switch to an earlier official version: web-arena-x/webarena@14f91d9. The website Docker images we used were downloaded from the official address https://github.com/web-arena-x/webarena/tree/main/environment_docker#wikipedia-website.
Unfortunately, our task machines were recycled after the project was completed, so the log files were lost.
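
As a rough sketch of pinning a local checkout to that commit, something like the following could be used. The helper names (`checkout_commands`, `pin_webarena`) and the destination directory are hypothetical; only the repository and the commit hash come from the comment above.

```python
import subprocess

# Repository and commit taken from the comment above.
WEBARENA_REPO = "https://github.com/web-arena-x/webarena.git"
PINNED_COMMIT = "14f91d9"


def checkout_commands(dest: str = "webarena") -> list[list[str]]:
    """Build the git commands that clone the repo and check out the pinned commit."""
    return [
        ["git", "clone", WEBARENA_REPO, dest],
        ["git", "-C", dest, "checkout", PINNED_COMMIT],
    ]


def pin_webarena(dest: str = "webarena") -> None:
    """Run the commands; raises CalledProcessError if any git step fails."""
    for cmd in checkout_commands(dest):
        subprocess.run(cmd, check=True)
```

Calling `pin_webarena()` needs network access and `git` on the PATH; the Docker images still have to be fetched separately from the official address.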
