Papers with Keyword: reasoning

  • Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

    • Yiheng Xu, Zekun Wang, Junli Wang, Dunjie Lu, Tianbao Xie, Amrita Saha, Doyen Sahoo, Tao Yu, Caiming Xiong
    • 🏛️ Institutions: HKU, NTU, Salesforce
    • 📅 Date: Dec 5, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [GUI]
    • 🔑 Key: [model], [dataset], [planning], [reasoning], [Aguvis], [visual grounding]
    • 📖 TLDR: This paper introduces Aguvis, a unified pure-vision framework for autonomous GUI agents that operates across platforms. It relies on image-based observations, grounds natural-language instructions to visual elements, and employs a consistent action space to ensure cross-platform generalization. Explicit planning and reasoning are integrated within the model, enhancing its ability to autonomously navigate and interact with complex digital environments. A large-scale dataset of GUI agent trajectories is constructed, incorporating multimodal reasoning and grounding. Comprehensive experiments show that Aguvis surpasses previous state-of-the-art methods in both offline and real-world online scenarios, making it the first fully autonomous pure-vision GUI agent able to perform tasks without relying on external closed-source models. All datasets, models, and training recipes are open-sourced to facilitate future research. (A minimal sketch of the unified action-space idea follows this entry.)
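The "consistent action space" idea can be made concrete with a short sketch: whatever the platform, the model's output is parsed into the same small set of primitives. The names below (`Click`, `TypeText`, `Scroll`, `vlm_predict`) and the output format are illustrative assumptions, not the Aguvis interface itself.

```python
# Illustrative unified GUI action space; names and formats are assumptions,
# not the actual Aguvis API.
from dataclasses import dataclass
from typing import Union

@dataclass
class Click:          # grounded to normalized screen coordinates
    x: float
    y: float

@dataclass
class TypeText:       # text entry into the currently focused element
    text: str

@dataclass
class Scroll:         # vertical scroll amount, platform-agnostic
    dy: float

Action = Union[Click, TypeText, Scroll]

def vlm_predict(screenshot: bytes, instruction: str) -> str:
    """Placeholder for a vision-language model emitting e.g. 'click 0.42 0.17'."""
    raise NotImplementedError

def parse_action(raw: str) -> Action:
    """Map the model's text output into the platform-agnostic action space."""
    op, *args = raw.split()
    if op == "click":
        return Click(float(args[0]), float(args[1]))
    if op == "type":
        return TypeText(" ".join(args))
    if op == "scroll":
        return Scroll(float(args[0]))
    raise ValueError(f"unrecognized action: {raw}")
```

Because the parsed actions carry no platform detail, the same model output could in principle be executed by web, desktop, or mobile backends.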
  • Language Agents: Foundations, Prospects, and Risks

    • Yu Su, Diyi Yang, Shunyu Yao, Tao Yu
    • 🏛️ Institutions: OSU, Stanford, Princeton, HKU
    • 📅 Date: November 2024
    • 📑 Publisher: EMNLP 2024
    • 💻 Env: [Misc]
    • 🔑 Key: [survey], [tutorial], [reasoning], [planning], [memory], [multi-agent systems], [safety]
    • 📖 TLDR: This tutorial provides a comprehensive exploration of language agents—autonomous systems powered by large language models capable of executing complex tasks through language instructions. It delves into their theoretical foundations, potential applications, associated risks, and future directions, covering topics such as reasoning, memory, planning, tool augmentation, grounding, multi-agent systems, and safety considerations.
  • AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

    • Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, Jonathan Berant
    • 🏛️ Institutions: Tel Aviv University
    • 📅 Date: October 21, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Web]
    • 🔑 Key: [benchmark], [dataset], [planning and reasoning]
    • 📖 TLDR: AssistantBench is a benchmark designed to test web agents on realistic, time-consuming web-based tasks. Covering 214 tasks across various domains, the paper also introduces SPA (See-Plan-Act), an agent framework that adds multi-step planning and memory retention. Results emphasize realistic task completion and show that current agents achieve only modest success, with significant improvements needed for complex information synthesis and execution across multiple web domains. (A sketch of a See-Plan-Act loop follows this entry.)
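The See-Plan-Act pattern, observe the page, revise a persistent plan and memory, then act, can be summarized in a few lines. The sketch below assumes hypothetical `llm`, `browser_observe`, and `browser_execute` helpers; it is a minimal illustration, not the paper's released agent.

```python
# See-Plan-Act loop with persistent memory (illustrative sketch; the helpers
# below are hypothetical stubs, not the AssistantBench codebase).
from dataclasses import dataclass, field
from typing import List

def llm(prompt: str) -> str:
    """Placeholder for a language-model call."""
    raise NotImplementedError

def browser_observe() -> str:
    """Placeholder returning the current page as text (e.g. an accessibility tree)."""
    raise NotImplementedError

def browser_execute(action: str) -> None:
    """Placeholder that performs a click/type/navigate command."""
    raise NotImplementedError

@dataclass
class SeePlanActAgent:
    task: str
    memory: List[str] = field(default_factory=list)   # retained across steps and sites

    def step(self) -> str:
        page = browser_observe()                       # See
        plan = llm(f"Task: {self.task}\nMemory: {self.memory}\n"
                   f"Page: {page}\nRevise the plan and note any new facts:")
        self.memory.append(plan)                       # Plan (+ memory retention)
        action = llm(f"Plan: {plan}\nPage: {page}\nNext browser action:")
        browser_execute(action)                        # Act
        return action
```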
  • MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

    • Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, Zirui Wang, Afshin Dehghan, Peter Grasch, Yinfei Yang
    • 🏛️ Institutions: Apple
    • 📅 Date: September 30, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Misc]
    • 🔑 Key: [model], [MM1.5], [vision language model], [visual grounding], [reasoning], [data-centric], [analysis]
    • 📖 TLDR: This paper introduces MM1.5, a family of multimodal large language models (MLLMs) ranging from 1B to 30B parameters, including dense and mixture-of-experts variants. MM1.5 enhances capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. The authors employ a data-centric training approach, utilizing high-quality OCR data and synthetic captions for continual pre-training, alongside an optimized visual instruction-tuning data mixture for supervised fine-tuning. Specialized variants, MM1.5-Video and MM1.5-UI, are designed for video understanding and mobile UI comprehension, respectively. Extensive empirical studies provide insights into the training processes, offering guidance for future MLLM development.
  • WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks

    • Léo Boisvert, Megh Thakkar, Maxime Gasse, Massimo Caccia, Thibault Le Sellier De Chezelles, Quentin Cappart, Nicolas Chapados, Alexandre Lacoste, Alexandre Drouin
    • 🏛️ Institutions: ServiceNow Research, Mila, Polytechnique Montréal, Université de Montréal
    • 📅 Date: July 7, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Web]
    • 🔑 Key: [benchmark], [planning], [reasoning], [WorkArena++]
    • 📖 TLDR: This paper introduces WorkArena++, a benchmark comprising 682 tasks that simulate realistic workflows performed by knowledge workers. It evaluates web agents' capabilities in planning, problem-solving, logical/arithmetic reasoning, retrieval, and contextual understanding. The study reveals challenges faced by current large language models and vision-language models in serving as effective workplace assistants, providing a resource to advance autonomous agent development.
  • MMInA: Benchmarking Multihop Multimodal Internet Agents

    • Ziniu Zhang, Shulin Tian, Liangyu Chen, Ziwei Liu
    • 🏛️ Institutions: NTU
    • 📅 Date: April 15, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Web]
    • 🔑 Key: [benchmark], [framework], [multihop web browsing], [multimodal tasks], [long-range reasoning]
    • 📖 TLDR: The MMInA benchmark is designed to evaluate agents' capacity to complete complex, multihop web tasks by navigating and extracting information across evolving real-world websites. Composed of 1,050 tasks across diverse domains, MMInA challenges agents with realistic, multimodal information retrieval and reasoning tasks, such as comparative shopping and travel inquiries. Despite recent advances, agents show difficulties in handling tasks requiring sequential steps across multiple sites, underscoring the need for enhanced multimodal and memory-augmented models.
  • Tur[k]ingBench: A Challenge Benchmark for Web Agents

    • Kevin Xu, Yeganeh Kordi, Kate Sanders, Yizhong Wang, Adam Byerly, Jingyu Zhang, Benjamin Van Durme, Daniel Khashabi
    • 🏛️ Institutions: JHU, Brown, UW
    • 📅 Date: March 18, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Web]
    • 🔑 Key: [benchmark], [dataset], [multi-modal reasoning], [TurkingBench], [Turking]
    • 📖 TLDR: This paper introduces Tur[k]ingBench, a benchmark comprising 158 web-grounded tasks designed to evaluate AI agents' capabilities in complex web-based environments. Unlike prior benchmarks that utilize synthetic web pages, Tur[k]ingBench leverages natural HTML pages from crowdsourcing platforms, presenting tasks with rich multi-modal contexts. The benchmark includes 32.2K instances, each with diverse inputs, challenging models to interpret and interact with web pages effectively. Evaluations of state-of-the-art models reveal significant room for improvement, highlighting the need for advanced web-based agents capable of handling real-world web interactions.
  • Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study

    • Weihao Tan, Ziluo Ding, Wentao Zhang, Boyu Li, Bohan Zhou, Junpeng Yue, Haochong Xia, Jiechuan Jiang, Longtao Zheng, Xinrun Xu, Yifei Bi, Pengjie Gu, Xinrun Wang, Börje F. Karlsson, Bo An, Zongqing Lu
    • 🏛️ Institutions: NTU, BAAI, PKU
    • 📅 Date: March 5, 2024
    • 📑 Publisher: arXiv
    • 💻 Env: [Desktop]
    • 🔑 Key: [framework], [Cradle], [General Computer Control], [multimodal], [keyboard and mouse control], [long-term memory], [reasoning], [self-improvement]
    • 📖 TLDR: This paper introduces Cradle, a framework designed to achieve General Computer Control (GCC): enabling agents to perform any computer task using only screen images (and possibly audio) as input and producing keyboard and mouse operations as output. The authors deploy Cradle in the complex AAA game Red Dead Redemption II, demonstrating its ability to follow the main storyline and complete real missions with minimal reliance on prior knowledge or resources. (A skeleton of the screenshot-in, keyboard/mouse-out loop follows this entry.)
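Reduced to its interface, the GCC setting is a loop that maps raw screenshots to keyboard and mouse events while keeping a long-term memory of what worked. The skeleton below only illustrates that contract; `capture_screen`, `vlm`, and `send_input` are assumed stubs, not Cradle's actual modules.

```python
# Skeleton of a General-Computer-Control loop in the spirit of Cradle.
# All helpers are hypothetical stubs, not the released Cradle code.
from typing import List

def capture_screen() -> bytes:
    """Placeholder: grab the current screenshot via a screen-capture library."""
    raise NotImplementedError

def vlm(image: bytes, prompt: str) -> str:
    """Placeholder: multimodal model reasoning over a screenshot."""
    raise NotImplementedError

def send_input(event: str) -> None:
    """Placeholder: replay a keyboard/mouse event string to the OS."""
    raise NotImplementedError

def gcc_loop(goal: str, steps: int = 100) -> List[str]:
    memory: List[str] = []     # long-term memory of reflections on past actions
    history: List[str] = []
    for _ in range(steps):
        frame = capture_screen()
        # Reason over the frame, the goal, and recent experience, then emit
        # the next low-level keyboard/mouse event as text.
        event = vlm(frame, f"Goal: {goal}\nMemory: {memory[-5:]}\nNext input event:")
        send_input(event)
        history.append(event)
        # Self-improvement: reflect on the outcome and store the lesson.
        memory.append(vlm(capture_screen(), f"Did '{event}' move toward: {goal}?"))
    return history
```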
  • GAIA: a benchmark for General AI Assistants

    • Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, Thomas Scialom
    • 🏛️ Institutions: Meta AI, Hugging Face
    • 📅 Date: November 21, 2023
    • 📑 Publisher: arXiv
    • 💻 Env: [Misc]
    • 🔑 Key: [benchmark], [multi-modality], [tool use], [reasoning]
    • 📖 TLDR: GAIA is a benchmark developed for evaluating general-purpose AI assistants. It aims to test assistant models across multiple modalities and complex reasoning tasks in real-world settings, including scenarios that require tool usage and open-ended question answering. With a dataset comprising 466 questions across various domains, GAIA highlights gaps between current AI performance and human capability, presenting a significant challenge for large language models such as GPT-4.
  • ReAct: Synergizing Reasoning and Acting in Language Models

    • Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao
    • 🏛️ Institutions: Princeton, Google Research
    • 📅 Date: October 6, 2022
    • 📑 Publisher: ICLR 2023
    • 💻 Env: [Misc]
    • 🔑 Key: [framework], [reasoning], [ReAct]
    • 📖 TLDR: This paper introduces ReAct, a framework that prompts large language models to generate reasoning traces and task-specific actions in an interleaved manner. By combining reasoning and acting, ReAct improves the model's ability to perform complex tasks in language understanding and interactive decision-making. The approach is validated across various benchmarks, demonstrating improved performance and interpretability over existing methods. (A minimal sketch of the interleaved loop follows this entry.)
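The interleaving ReAct describes, a thought, then an action, then an observation fed back into the context, fits in a short loop. This is a minimal sketch under assumed conventions (`call_llm`, the toy `TOOLS` table, a `Finish[...]` stop marker), not the authors' implementation.

```python
# Minimal ReAct-style loop (illustrative sketch, not the paper's code).
# `call_llm` and the toy tools are hypothetical placeholders.
def call_llm(prompt: str) -> str:
    """Stand-in for a language-model call; replace with a real client."""
    raise NotImplementedError

TOOLS = {
    "search": lambda q: f"(search results for {q!r})",
    "lookup": lambda q: f"(lookup result for {q!r})",
}

def react(question: str, max_steps: int = 8) -> str:
    trace = f"Question: {question}\n"
    for _ in range(max_steps):
        # The model continues the trace with an interleaved Thought / Action pair.
        step = call_llm(trace + "Thought:")
        trace += f"Thought:{step}\n"
        if "Finish[" in step:                                   # model signals completion
            return step.split("Finish[", 1)[1].rstrip("]\n ")
        if "Action:" in step:
            name, _, arg = step.split("Action:", 1)[1].strip().partition("[")
            tool = TOOLS.get(name.strip(), lambda a: "unknown tool")
            trace += f"Observation: {tool(arg.rstrip(']'))}\n"  # observation fed back
    return trace
```

The point the sketch preserves is that each observation is appended to the same trace the model reasons over, so later thoughts can condition on earlier tool results.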