Commit
Merge remote-tracking branch 'origin/main'
# Conflicts:
#	paper_by_key/paper_benchmark.md
#	paper_by_key/paper_dataset.md
#	paper_by_key/paper_framework.md
#	paper_by_key/paper_grounding.md
#	paper_by_key/paper_learning.md
#	paper_by_key/paper_model.md
#	paper_by_key/paper_reasoning.md
#	paper_by_key/paper_reinforcement learning.md
#	paper_by_key/paper_safety.md
Showing 5 changed files with 64 additions and 46 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,46 @@ | ||
# Papers with Keyword: Mind2Web | ||
|
||
- [From Grounding to Planning: Benchmarking Bottlenecks in Web Agents](https://arxiv.org/abs/2409.01927) | ||
- Segev Shlomov, Ben Wiesel, Aviad Sela, Ido Levy, Liane Galanti, Roy Abitbol | ||
- 🏛️ Institutions: IBM | ||
- 📅 Date: September 3, 2024 | ||
- 📑 Publisher: arXiv | ||
- 💻 Env: [Web] | ||
- 🔑 Key: [benchmark], [planning], [grounding], [Mind2Web dataset], [web navigation] | ||
- 📖 TLDR: This paper analyzes performance bottlenecks in web agents by evaluating grounding and planning separately, isolating the impact of each on navigation performance. Using an enhanced version of the Mind2Web dataset, the study finds that planning is a significant bottleneck, while also reporting advances in grounding and in task-specific benchmarking of elements such as UI component recognition. Based on these experiments, the authors propose a refined evaluation framework aimed at improving web agents' contextual adaptability and accuracy in complex web environments. | ||
|
||
- [Identifying User Goals from UI Trajectories](https://arxiv.org/abs/2406.14314) | ||
- Omri Berkovitch, Sapir Caduri, Noam Kahlon, Anatoly Efros, Avi Caciularu, Ido Dagan | ||
- 🏛️ Institutions: Google Research, Bar-Ilan University | ||
- 📅 Date: June 20, 2024 | ||
- 📑 Publisher: arXiv | ||
- 💻 Env: [GUI] | ||
- 🔑 Key: [evaluation metric], [intent identification], [Android-In-The-Wild], [Mind2Web] | ||
- 📖 TLDR: This paper introduces the task of goal identification from observed UI trajectories, aiming to infer the user's intended task based on their GUI interactions. It proposes a novel evaluation metric to assess whether two task descriptions are paraphrases within a specific UI environment. Experiments utilizing the Android-In-The-Wild and Mind2Web datasets reveal that state-of-the-art models, such as GPT-4 and Gemini-1.5 Pro, underperform compared to humans, indicating significant room for improvement. | ||
|
||
- [WebCanvas: Benchmarking Web Agents in Online Environments](https://arxiv.org/abs/2406.12373) | ||
- Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, Zhengyang Wu | ||
- 🏛️ Institutions: Zhejiang University, iMean AI, University of Washington | ||
- 📅 Date: June 18, 2024 | ||
- 📑 Publisher: arXiv | ||
- 💻 Env: [Web] | ||
- 🔑 Key: [framework], [dataset], [benchmark], [Mind2Web-Live], [key-node evaluation] | ||
- 📖 TLDR: This paper presents WebCanvas, an online evaluation framework for web agents designed to address the dynamic nature of web interactions. It introduces a key-node-based evaluation metric to capture critical actions or states necessary for task completion while disregarding noise from insignificant events or changed web elements. The framework includes the Mind2Web-Live dataset, a refined version of the original Mind2Web static dataset, containing 542 tasks with 2,439 intermediate evaluation states. Despite advancements, the best-performing model achieves a task success rate of 23.1%, highlighting substantial room for improvement. | ||
|
||
- [GPT-4V(ision) is a Generalist Web Agent, if Grounded](https://osu-nlp-group.github.io/SeeAct/) | ||
- Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, Yu Su | ||
- 🏛️ Institutions: OSU | ||
- 📅 Date: January 1, 2024 | ||
- 📑 Publisher: ICML 2024 | ||
- 💻 Env: [Web] | ||
- 🔑 Key: [framework], [dataset], [benchmark], [grounding], [SeeAct], [Multimodal-Mind2Web], [Mind2Web] | ||
- 📖 TLDR: This paper explores the capability of GPT-4V(ision), a multimodal model, as a web agent that performs tasks across various websites by following natural language instructions. It introduces the **SeeAct** framework, which enables GPT-4V to navigate, interpret, and interact with elements on websites. Evaluated on the **Mind2Web** benchmark and in an online test environment, the framework performs well on complex web tasks by integrating grounding strategies, such as element attributes and image annotations, to improve HTML element targeting. However, grounding remains challenging, leaving room for further improvement. | ||
|
||
- [Mind2Web: Towards a Generalist Agent for the Web](https://arxiv.org/abs/2306.06070) | ||
- Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, Yu Su | ||
- 🏛️ Institutions: OSU | ||
- 📅 Date: June 9, 2023 | ||
- 📑 Publisher: NeurIPS 2023 | ||
- 💻 Env: [Web] | ||
- 🔑 Key: [dataset], [benchmark], [model], [Mind2Web], [MindAct] | ||
- 📖 TLDR: *Mind2Web* presents a dataset and benchmark for generalist web agents that perform language-guided tasks across varied websites. Featuring over 2,000 tasks from 137 sites spanning 31 domains, it emphasizes open-ended, realistic tasks in authentic, unsimplified web settings. The study proposes the *MindAct* framework, which makes LLMs practical on large, complex HTML pages by first using a small LM to rank candidate elements, so that only the top-ranked candidates are passed to the LLM, improving the efficiency and versatility of web agents in diverse contexts. |
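The rank-then-select idea described in the MindAct entry above can be illustrated with a minimal Python sketch. This is not the authors' implementation: `token_overlap` is a toy stand-in for a trained small-LM ranker, `pick_first` is a placeholder for the LLM selection step, and all function names, the sample task, and the candidate strings are hypothetical.

```python
from typing import Callable, List

# Hypothetical type aliases for this sketch (not part of any Mind2Web API).
Scorer = Callable[[str, str], float]
Selector = Callable[[str, List[str]], str]


def token_overlap(task: str, element: str) -> float:
    """Toy relevance score: fraction of the element's tokens that occur in the task.

    Assumption: stands in for a small LM that scores (task, element) pairs.
    """
    task_tokens = set(task.lower().split())
    element_tokens = set(element.lower().split())
    return len(task_tokens & element_tokens) / max(len(element_tokens), 1)


def rank_candidates(task: str, candidates: List[str], score: Scorer, top_k: int = 3) -> List[str]:
    """Stage 1: sort candidates by score (highest first) and keep the top_k."""
    ranked = sorted(candidates, key=lambda c: score(task, c), reverse=True)
    return ranked[:top_k]


def two_stage_select(task: str, candidates: List[str], score: Scorer,
                     select: Selector, top_k: int = 3) -> str:
    """Stage 2: hand only the shortlist to a (typically more expensive) selector."""
    shortlist = rank_candidates(task, candidates, score, top_k)
    return select(task, shortlist)


def pick_first(task: str, shortlist: List[str]) -> str:
    """Placeholder for an LLM call: returns the top-ranked candidate."""
    return shortlist[0]


task = "search for flights to Paris"
candidates = [
    "button search flights",
    "link privacy policy",
    "input destination city",
    "link careers",
]
shortlist = rank_candidates(task, candidates, token_overlap, top_k=2)
chosen = two_stage_select(task, candidates, token_overlap, pick_first, top_k=2)
```

The point of the split is that the expensive selector only ever sees `top_k` elements rather than the full page, which is the efficiency argument the TLDR makes.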