This repo contains the code for our paper "Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents".
Our paper tackles the critical question: “How to scale inference-time compute for language agents?” The solution lies in using LLMs as a world model of the internet to predict the outcomes of actions on websites. Our method, WebDreamer, employs LLM-based simulation for speculative planning on the web, surpassing reactive baselines while offering greater safety and flexibility compared to tree search methods.

| Benchmark | Observation (O) | Method | Completion Rate | Success Rate |
|---|---|---|---|---|
| VisualWebArena | Screenshot+SoM | Gemini-1.5-Pro + Reactive (Koh et al., 2024a) | - | 12.0% |
| | | GPT-4 + Reactive (Koh et al., 2024a) | - | 16.4% |
| | | GPT-4o + Reactive (Koh et al., 2024a) | - | 17.7% † |
| | | GPT-4o + Tree Search (Koh et al., 2024b) | - | 26.4% |
| | | GPT-4o + WebDreamer | - | 23.6% (↑33.3%) |
| Mind2Web-live | HTML | GPT-4 + Reactive (Pan et al., 2024b) | 48.8% | 23.1% |
| | | Claude-3-Sonnet + Reactive (Pan et al., 2024b) | 47.9% | 22.1% |
| | | Gemini-1.5-Pro + Reactive (Pan et al., 2024b) | 44.6% | 22.3% |
| | | GPT-4-turbo + Reactive (Pan et al., 2024b) | 44.3% | 21.1% |
| | | GPT-3.5-turbo + Reactive (Pan et al., 2024b) | 40.2% | 16.5% |
| | | GPT-4o + Reactive (Pan et al., 2024b) | 47.6% | 22.1% |
| | | GPT-4o + WebDreamer | 49.9% | 25.0% (↑13.1%) |

Compared to the GPT-4o reactive baselines, WebDreamer improves success rate by a relative 33.3% on VisualWebArena (from 17.7% to 23.6%) and by a relative 13.1% on Mind2Web-live (from 22.1% to 25.0%).
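
The percentages in parentheses are relative gains over the GPT-4o reactive baseline, not absolute differences in success rate. A quick check of that arithmetic, using only the success rates from the table (the function name `relative_gain` is ours, for illustration):

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# Success rates (%) copied from the results table.
vwa_gain = relative_gain(17.7, 23.6)  # VisualWebArena: GPT-4o reactive vs. WebDreamer
m2w_gain = relative_gain(22.1, 25.0)  # Mind2Web-live: GPT-4o reactive vs. WebDreamer

print(f"VisualWebArena: +{vwa_gain:.1f}%")  # +33.3%
print(f"Mind2Web-live:  +{m2w_gain:.1f}%")  # +13.1%
```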
![image](https://private-user-images.githubusercontent.com/15921425/388620361-0afbc22d-b1eb-4026-a167-e1852cde7677.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkxNzQ0NjYsIm5iZiI6MTczOTE3NDE2NiwicGF0aCI6Ii8xNTkyMTQyNS8zODg2MjAzNjEtMGFmYmMyMmQtYjFlYi00MDI2LWExNjctZTE4NTJjZGU3Njc3LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEwVDA3NTYwNlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTVkZjBiNDJmNGQ3NDFmN2U1NDY1ODM5YmQwYzJkNGM0NzI0OGEzMDMwYWMzNmY3YzE1NGYzZDFhMmY1ZGQ1OTUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.1XwOknkh6-3DFSHlPpfFiOFQKHDfCD9AlnJt62K0txg)
WebDreamer explores the search space through simulation, which greatly reduces its reliance on real-world interactions while maintaining robust performance.
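
The idea of exploring through simulation can be sketched as a simple loop: for each candidate action, imagine its outcome, score the imagined outcome against the task, and only execute the best-scoring action on the real site. The sketch below is a toy illustration under our own assumptions, not the repo's implementation; `plan_one_step`, `toy_simulate`, and `toy_score` are hypothetical names, and the stubs stand in for the LLM calls.

```python
from typing import Callable, List


def plan_one_step(
    candidates: List[str],
    simulate: Callable[[str], str],
    score: Callable[[str], float],
    num_of_sim: int = 3,
) -> str:
    """Return the candidate whose simulated outcomes score highest on average."""
    best_action, best_value = candidates[0], float("-inf")
    for action in candidates:
        # Simulate the action several times; no real website is touched here.
        values = [score(simulate(action)) for _ in range(num_of_sim)]
        value = sum(values) / len(values)
        if value > best_value:
            best_action, best_value = action, value
    return best_action


# Toy stand-ins for the LLM-based world model and scorer (assumptions).
def toy_simulate(action: str) -> str:
    return f"page after: {action}"


def toy_score(outcome: str) -> float:
    return 1.0 if "red blanket" in outcome else 0.0


chosen = plan_one_step(
    ["click the element Home & Kitchen", "type 'red blanket' in the search bar"],
    toy_simulate,
    toy_score,
)
print(chosen)
```

Only the chosen action would then be executed on the real website, which is where the reduction in real-world interactions comes from.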
- main: Different modules of WebDreamer that can be played with independently.
- vwa: Code to reproduce our experiments on VisualWebArena. 🚧
- mind2web-live: Code to reproduce our experiments on Mind2Web-live. 🚧
The world model module predicts webpage changes in multiple formats (change description, accessibility tree, HTML).
import os
from openai import OpenAI
# WebWorldModel and encode_image are provided by the modules in main.

world_model = WebWorldModel(OpenAI(api_key=os.environ["OPENAI_API_KEY"]))
screenshot_path = "demo_data/shopping_0.png"
# Encode the screenshot as a base64 data URL
screenshot = encode_image(screenshot_path)
screenshot = "data:image/jpeg;base64," + screenshot
action_description = "type 'red blanket' in the search bar and click search"
task = "Buy the least expensive red blanket (in any size) from 'Blankets & Throws' category."
# Imagine k=3 steps ahead, with predicted states as accessibility trees
imagination = world_model.multiple_step_change_prediction(
    screenshot, screenshot_path, task, action_description,
    format='accessibility', k=3
)
- screenshot_path: Path to the screenshot of the webpage.
- task: Description of the goal to achieve on the webpage.
- action_description: Initial action to perform.
- format: Desired output format for webpage state changes:
  - 'change' for textual descriptions.
  - 'accessibility' for an accessibility tree structure.
  - 'html' for HTML structure of the predicted page.
- k: Number of imagination steps to simulate.
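
The world model example passes the screenshot as a base64 data URL. For readers preparing their own inputs, here is a minimal sketch of how such an encoding step could look. It is an assumption about what a helper like `encode_image` does, not the repo's code; `bytes_to_data_url` and `file_to_data_url` are hypothetical names.

```python
import base64
from pathlib import Path


def bytes_to_data_url(raw: bytes, mime: str = "image/png") -> str:
    """Hypothetical helper: wrap raw image bytes in a base64 data URL."""
    encoded = base64.b64encode(raw).decode("utf-8")
    return f"data:{mime};base64,{encoded}"


def file_to_data_url(image_path: str, mime: str = "image/png") -> str:
    """Hypothetical helper: read an image file and return it as a data URL."""
    return bytes_to_data_url(Path(image_path).read_bytes(), mime)


# Tiny demonstration on in-memory bytes, so no image file is needed.
demo_url = bytes_to_data_url(b"abc")
print(demo_url)  # data:image/png;base64,YWJj
```

With a real screenshot, `file_to_data_url("demo_data/shopping_0.png")` would produce a string of the same shape; it is usually safest to pick the MIME type that matches the file's actual format.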
from PIL import Image
# evaluate_simulation is provided by the modules in main.

screenshot_path = "demo_data/shopping_0.png"
screenshots = [Image.open(screenshot_path)]
actions = ["None"]  # no previous actions yet
# Candidate next actions to simulate and score
action_description_list = [
    "type 'red blanket' in the search bar",
    "click the element Home & Kitchen",
    "type 'kobe' in the search bar",
    "type 'the ohio state university' in the search bar"
]
task = "Buy the least expensive red blanket (in any size)"
scores, simulations = evaluate_simulation(
    screenshots,
    actions,
    task,
    "https://www.amazon.com",
    action_description_list,
    num_of_sim=3,
    num_workers=50,
    n=10,
    steps=2
)
- screenshots: List of PIL.Image screenshots representing webpage states.
- actions: List of actions performed by the agent.
- task: Description of the goal to achieve on the webpage.
- url: The current webpage URL.
- action_description_list: List of action descriptions to evaluate.
- num_of_sim: Number of simulations per action.
- steps: Number of imagination steps per simulation.
- num_workers: Number of parallel workers for simulations.
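
Once each candidate action has been simulated several times, the per-simulation scores need to be combined into one value per action before choosing. The sketch below averages them and picks the maximum. The data layout (a dict from action description to a list of scores) and the numbers are illustrative assumptions; the actual return format of `evaluate_simulation` is not documented here and may differ.

```python
from statistics import mean
from typing import Dict, List, Tuple


def pick_best_action(scores_per_action: Dict[str, List[float]]) -> Tuple[str, float]:
    """Average each action's simulation scores and return the top action."""
    averaged = {a: mean(s) for a, s in scores_per_action.items() if s}
    best = max(averaged, key=lambda a: averaged[a])
    return best, averaged[best]


# Made-up scores for illustration: three simulations per action (num_of_sim=3).
example_scores = {
    "type 'red blanket' in the search bar": [1.0, 0.5, 1.0],
    "click the element Home & Kitchen": [0.5, 0.5, 0.0],
    "type 'kobe' in the search bar": [0.0, 0.0, 0.0],
}

best_action, best_score = pick_best_action(example_scores)
print(best_action, round(best_score, 2))
```

Averaging over several simulations is one simple way to smooth out the variance of individual imagined trajectories; other aggregations (for example, taking the minimum for a more conservative choice) are equally possible.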
import random
from PIL import Image
# select_actions is provided by the modules in main.

screenshot_path = "demo_data/shopping_0.png"
screenshots = [Image.open(screenshot_path)]
actions = ["None"] # previous actions so far
action_description = "type 'red skirt' in the search bar"
task = "Buy the least expensive red skirt (in any size) on Amazon."
action_description_list = [
    "type 'red skirt' in the search bar",
    "click the element Women Clothes",
    "type 'kobe' in the search bar",
    "type 'the ohio state university' in the search bar"
]
# Shuffle the candidates so their order carries no information
random.shuffle(action_description_list)
selected_actions = select_actions(screenshots, actions, task, "https://www.amazon.com", action_description_list)
# Map selected indices back to action descriptions
selected_actions = [action_description_list[int(i)] for i in selected_actions]
- screenshots: List of PIL.Image screenshots representing webpage states.
- actions: List of previously executed actions.
- task: Description of the goal to achieve on the webpage.
- url: The current webpage URL.
- action_description_list: List of action descriptions to evaluate.
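
The controller example maps the returned indices back to action descriptions with a plain list comprehension. If the indices come from model output, it may be worth guarding against malformed or out-of-range values. A minimal defensive sketch, assuming the selection is an iterable of integers or integer-like strings (`indices_to_actions` is a hypothetical helper, not part of the repo):

```python
from typing import Iterable, List


def indices_to_actions(selected: Iterable, candidates: List[str]) -> List[str]:
    """Map selected indices to candidates, skipping invalid or repeated ones."""
    chosen: List[str] = []
    for raw in selected:
        try:
            idx = int(raw)
        except (TypeError, ValueError):
            continue  # not an integer-like value
        if 0 <= idx < len(candidates) and candidates[idx] not in chosen:
            chosen.append(candidates[idx])
    return chosen


candidates = [
    "type 'red skirt' in the search bar",
    "click the element Women Clothes",
    "type 'kobe' in the search bar",
]
# "7" is out of range, "x" is not a number, and the second "1" is a repeat.
kept = indices_to_actions(["1", "0", "7", "x", "1"], candidates)
print(kept)
```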
@article{DBLP:journals/corr/abs-2411-06559,
author = {Yu Gu and Boyuan Zheng and Boyu Gou and Kai Zhang and Cheng Chang and Sanjari Srivastava and Yanan Xie and Peng Qi and Huan Sun and Yu Su},
title = {Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents},
journal = {CoRR},
volume = {abs/2411.06559},
year = {2024},
url = {https://arxiv.org/abs/2411.06559},
eprinttype= {arXiv},
eprint = {2411.06559},
}