Hello LocalLLaMA! I wanted to share my new project, WebLlama, with you. Alongside the project, I am also releasing Llama-3-8B-Web, a strong action model for building web agents that can not only follow instructions, but also talk to you.
GitHub Repository:
https://github.com/McGill-NLP/webllama
Model on
Huggingface:
https://huggingface.co/McGill-NLP/Llama-3-8B-Web
An adorable mascot for our project!
Both the readme and the Hugging Face model card go over the motivation, the training process, and how to use the model for inference. Note that you still need a platform for executing the agent's actions (e.g. Playwright or BrowserGym) and a ranker model for selecting relevant elements from the HTML page. However, a lot of that is shown in the training script, which is explained in the modeling readme, so I won't go into detail here.
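To give a rough idea of the inference side, here's a minimal sketch of loading the model with the standard transformers text-generation pipeline. The prompt below is just a placeholder: a real input needs the WebLINX-style context (ranked HTML candidates, action history, dialogue) that the repo's scripts build for you, so treat this as illustrative rather than the official usage.

```python
# Minimal sketch (assumptions flagged): load Llama-3-8B-Web and generate an action.
from transformers import pipeline

agent = pipeline(
    "text-generation",
    model="McGill-NLP/Llama-3-8B-Web",
    device_map="auto",   # spread across available GPUs if possible
    torch_dtype="auto",
)

# Placeholder prompt: a real one would contain the ranked candidate elements,
# the action history, and the user dialogue in the WebLINX format.
prompt = "..."
out = agent(prompt, max_new_tokens=64, return_full_text=False)
print(out[0]["generated_text"])  # expected to be an action string the executor can parse
```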
With that out of the way, here's a summary from the repository:
WebLlama: The goal of our project is to build effective human-centric agents for browsing the web. We don't want to replace users, but to equip them with powerful assistants.
Modeling: We build on top of cutting-edge libraries for training Llama agents on web navigation tasks. We will provide training scripts, optimized configs, and instructions for training cutting-edge Llamas.
Evaluation: Benchmarks for testing Llama models on real-world web browsing. This includes human-centric browsing through dialogue (WebLINX), and we will soon add more benchmarks for automatic web navigation (e.g. Mind2Web).
Data: Our first model is finetuned on over 24K instances of web interactions, including click, textinput, submit, and dialogue acts (there's an illustrative sketch of what these look like right after this summary). We want to continuously curate, compile and release datasets for training better agents.
Deployment: We want to make it easy to integrate Llama models with existing deployment
platforms, including Playwright, Selenium, and BrowserGym. We are currently focusing on making this a
reality.
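Since the summary mentions both the action types in the data and the Playwright/Selenium/BrowserGym side, here's a small sketch of how a predicted action might be parsed and executed with Playwright's sync API. The action strings, the parser, and the uid-to-selector mapping are my own illustrative assumptions; the actual serialization follows WebLINX and the repo's tooling.

```python
# Hedged sketch: parse an action string and execute it with Playwright.
# The action format, parser, and selector mapping are illustrative assumptions,
# not the project's actual serialization.
import re
from playwright.sync_api import sync_playwright

ACTION_RE = re.compile(r'^(?P<intent>\w+)\((?P<args>.*)\)$')
ARG_RE = re.compile(r'(\w+)="([^"]*)"')

def parse(action: str):
    """Split e.g. 'click(uid="search-btn")' into an intent and a dict of args."""
    m = ACTION_RE.match(action)
    return m.group("intent"), dict(ARG_RE.findall(m.group("args")))

def execute(page, intent, args, uid_to_selector):
    """Translate a parsed action into a Playwright call."""
    selector = uid_to_selector.get(args.get("uid", ""))
    if intent == "click":
        page.click(selector)
    elif intent == "textinput":
        page.fill(selector, args["text"])
    elif intent == "submit":
        page.press(selector, "Enter")
    elif intent == "load":
        page.goto(args["url"])
    elif intent == "say":
        print("Agent says:", args["utterance"])  # dialogue act, no browser call

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    # The uid -> CSS selector mapping would normally come from the ranker/candidate step.
    uid_to_selector = {"more-info": "a"}
    intent, args = parse('click(uid="more-info")')
    execute(page, intent, args, uid_to_selector)
    browser.close()
```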
One thing that's quite interesting is how well the model performs against zero-shot GPT-4V (with a screenshot added, since it supports vision) and other finetuned models (GPT-3.5 finetuned via the API, and MindAct, which was trained on Mind2Web and is finetuned on WebLINX too). Here are the results:
The overall score is a combination of IoU (for actions that target an element) and F1 (for text/URL outputs). The 29% here gives an intuition for how well a model would perform in the real world: obviously 100% is not needed to get a good agent, but an agent getting 100% would definitely be great!
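For anyone curious what those two ingredients look like, here's a rough, self-contained sketch of an element IoU and a token-level F1. The exact metric definitions in WebLINX may differ (e.g. how text similarity is computed), so this is only meant to build intuition.

```python
# Rough sketch of the two ingredients in the overall score, as I understand them:
# IoU over element bounding boxes, and token-level F1 for text/URLs.
from collections import Counter

def iou(box_a, box_b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def token_f1(pred: str, ref: str) -> float:
    """Token-overlap F1 between predicted and reference text."""
    p, r = Counter(pred.split()), Counter(ref.split())
    overlap = sum((p & r).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(p.values()), overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))              # ~0.143
print(token_f1("weather in montreal", "montreal weather"))  # 0.8
```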
I thought this would be a great place to share and discuss this new project, since there are so many great discussions happening here about Llama training/inference. For example, RoPE scaling was invented in this very subreddit!
Also, I think WebLlama's potential will be pretty big for local use, since it's probably much better to perform tasks using a locally hosted model that you can easily audit, versus an agent offered by a company, which would be expensive to run, would have higher latency, and might not be as secure/private since it has access to your entire browsing history.
Happy to answer questions in the replies!