Hello LocalLLaMA! I wanted to share my new project, WebLlama, with you. Alongside the project, I am also releasing Llama-3-8B-Web, a strong action model for building web agents that can not only follow instructions, but also talk to you.
GitHub Repository:
https://github.com/McGill-NLP/webllama
Model on
Huggingface:
https://huggingface.co/McGill-NLP/Llama-3-8B-Web
An adorable mascot for our project!
Both the readme and the Hugging Face model card go over the motivation, the training process, and how to use the model for inference. Note that you still need a platform for executing the agent's actions (e.g. Playwright or BrowserGym) and a ranker model for selecting relevant elements from the HTML page. However, a lot of that is shown in the training script, which is explained in the modeling readme, so I won't go into detail here.
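To give a rough idea of the inference side, here's a minimal sketch of loading the model with the standard transformers text-generation pipeline. The prompt below is just a placeholder: a real input needs the WebLINX-style context (ranked HTML candidates, action history, dialogue) that the repo's scripts build for you, so treat this as illustrative rather than the official usage.

```python
# Minimal sketch (assumptions flagged): load Llama-3-8B-Web and generate an action.
from transformers import pipeline

agent = pipeline(
    "text-generation",
    model="McGill-NLP/Llama-3-8B-Web",
    device_map="auto",   # spread across available GPUs if possible
    torch_dtype="auto",
)

# Placeholder prompt: a real one would contain the ranked candidate elements,
# the action history, and the user dialogue in the WebLINX format.
prompt = "..."
out = agent(prompt, max_new_tokens=64, return_full_text=False)
print(out[0]["generated_text"])  # expected to be an action string the executor can parse
```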
With that out of the way, here's a summary from the repository:
WebLlama: The goal of our project is to build effective human-centric agents for browsing the web. We don't want to replace users, but to equip them with powerful assistants.
Modeling: We build on top of cutting-edge libraries for training Llama agents on web navigation tasks. We will provide training scripts, optimized configs, and instructions for training cutting-edge Llamas.
Evaluation: Benchmarks for testing Llama models on real-world web browsing. This includes human-centric browsing through dialogue (WebLINX), and we will soon add more benchmarks for automatic web navigation (e.g. Mind2Web).
Data: Our first model is finetuned on over 24K instances of web interactions, including click, textinput, submit, and dialogue acts (there's an illustrative sketch of what these look like right after this summary). We want to continuously curate, compile and release datasets for training better agents.
Deployment: We want to make it easy to integrate Llama models with existing deployment
platforms, including Playwright, Selenium, and BrowserGym. We are currently focusing on making this a
reality.
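Since the summary mentions both the action types in the data and the Playwright/Selenium/BrowserGym side, here's a small sketch of how a predicted action might be parsed and executed with Playwright's sync API. The action strings, the parser, and the uid-to-selector mapping are my own illustrative assumptions; the actual serialization follows WebLINX and the repo's tooling.

```python
# Hedged sketch: parse an action string and execute it with Playwright.
# The action format, parser, and selector mapping are illustrative assumptions,
# not the project's actual serialization.
import re
from playwright.sync_api import sync_playwright

ACTION_RE = re.compile(r'^(?P<intent>\w+)\((?P<args>.*)\)$')
ARG_RE = re.compile(r'(\w+)="([^"]*)"')

def parse(action: str):
    """Split e.g. 'click(uid="search-btn")' into an intent and a dict of args."""
    m = ACTION_RE.match(action)
    return m.group("intent"), dict(ARG_RE.findall(m.group("args")))

def execute(page, intent, args, uid_to_selector):
    """Translate a parsed action into a Playwright call."""
    selector = uid_to_selector.get(args.get("uid", ""))
    if intent == "click":
        page.click(selector)
    elif intent == "textinput":
        page.fill(selector, args["text"])
    elif intent == "submit":
        page.press(selector, "Enter")
    elif intent == "load":
        page.goto(args["url"])
    elif intent == "say":
        print("Agent says:", args["utterance"])  # dialogue act, no browser call

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    # The uid -> CSS selector mapping would normally come from the ranker/candidate step.
    uid_to_selector = {"more-info": "a"}
    intent, args = parse('click(uid="more-info")')
    execute(page, intent, args, uid_to_selector)
    browser.close()
```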
One thing that's quite interesting is how well the model performs against zero-shot GPT-4V (with a screenshot added, since it supports vision) and other finetuned models (GPT-3.5 finetuned via the API, and MindAct, which was trained on Mind2Web and is finetuned on WebLINX too). Here are the results:
The overall score is a combination of IoU (for actions that target an element) and F1 (for text/URL outputs). The 29% here gives an intuition for how well a model would perform in the real world: obviously 100% is not needed to get a good agent, but an agent getting 100% would definitely be great!
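For anyone curious what those two ingredients look like, here's a rough, self-contained sketch of an element IoU and a token-level F1. The exact metric definitions in WebLINX may differ (e.g. how text similarity is computed), so this is only meant to build intuition.

```python
# Rough sketch of the two ingredients in the overall score, as I understand them:
# IoU over element bounding boxes, and token-level F1 for text/URLs.
from collections import Counter

def iou(box_a, box_b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def token_f1(pred: str, ref: str) -> float:
    """Token-overlap F1 between predicted and reference text."""
    p, r = Counter(pred.split()), Counter(ref.split())
    overlap = sum((p & r).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(p.values()), overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))              # ~0.143
print(token_f1("weather in montreal", "montreal weather"))  # 0.8
```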
I thought this would be a great place to share and discuss this new project, since there are so many great discussions happening here about Llama training/inference. For example, RoPE scaling was invented in this very subreddit!
Also, I think WebLlama's potential will be pretty big for local use, since it's probably much better to perform tasks using a locally hosted model that you can easily audit, versus an agent offered by a company, which would be expensive to run, would have higher latency, and might not be as secure/private since it has access to your entire browsing history.
Happy to answer questions in the replies!