The local UI-grounding specialist for hybrid AI agents.
Drop in a screenshot + text target, get a strict JSON bbox. 2B params. MLX-native. Apache 2.0.
If you're on Apple Silicon with ≥16 GB RAM and you need generic, max-accuracy UI grounding, use mlx-community/UI-TARS-1.5-7B-4bit. It's the obvious default: ~94% on ScreenSpot-v2, MLX-native, and it drops into mlx-vlm directly. It was trained with ByteDance research-lab compute that you couldn't reproduce on a budget.
browserground is for two narrower jobs:
UI-TARS is a finished model. You can use it; you can't easily extend it. The training pipeline is closed, the data mix is proprietary, the base is non-trivial to swap.
browserground is the opposite: a template. Open base (Qwen3-VL-2B), open training scripts, open data mix. Total recipe cost: $5 of L40S time + 26k examples + a public LoRA. Swap in your dashboard's screenshots, your customer app, or your internal tooling, and ship a domain-trained grounder over a weekend. The 60% generic ScreenSpot-v2 score isn't the deliverable; the recipe is. A 60% baseline on generic screens can reach 85-95% on your own product's narrow surface, because the test distribution finally matches the training distribution.
| Model | Disk @ 4-bit | RAM at inference |
|---|---|---|
| UI-TARS-1.5-7B-MLX | ~4 GB | ~5-6 GB |
| browserground 4-bit MLX | ~1 GB | ~2 GB |
2 GB matters when you're on an 8 GB Mac, or when your agent already runs a 7B planner + an OCR model + an embedding model and you need a small grounder in the same RAM budget. Plus strict JSON output (100% parseable, no regex on prose) — small win, but real.
A direct head-to-head benchmark of browserground vs UI-TARS-1.5-7B-MLX on the same Apple Silicon hardware is forthcoming.
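Because the output is strict JSON, a consumer can parse it with a plain JSON parser and turn the box into a click point. The sketch below shows one way to do that. `parse_bbox` and `bbox_center` are hypothetical helper names, not part of the browserground package, and the sketch assumes `(x1, y1)` is the top-left corner and `(x2, y2)` the bottom-right. Whether the coordinates are pixels or normalized values is not stated here, so check that before clicking.

```python
import json


def parse_bbox(raw: str) -> tuple:
    """Parse a response such as '{"bbox_2d": [x1, y1, x2, y2]}' into a 4-tuple."""
    data = json.loads(raw)  # raises json.JSONDecodeError on anything that is not JSON
    box = data["bbox_2d"]
    if not isinstance(box, list) or len(box) != 4:
        raise ValueError(f"expected 4 coordinates, got {box!r}")
    x1, y1, x2, y2 = box
    # Assumption: (x1, y1) is the top-left corner, (x2, y2) the bottom-right.
    if x2 <= x1 or y2 <= y1:
        raise ValueError(f"degenerate box: {box!r}")
    return (x1, y1, x2, y2)


def bbox_center(box: tuple) -> tuple:
    """Midpoint of the box, a natural place to click."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


box = parse_bbox('{"bbox_2d": [344, 612, 478, 658]}')
print(bbox_center(box))  # (411.0, 635.0)
```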
When not to use browserground:
- You're on a Mac with ≥16 GB RAM and want max generic accuracy → use UI-TARS-1.5-7B-MLX
- You're not going to fine-tune for your product, and accuracy is the only thing that matters → use UI-TARS-1.5-7B-MLX
- You need a complete agent toolkit, not a piece → look at ByteDance's full UI-TARS stack
When to use browserground:
- You want to ship a custom UI grounder trained on your product's screenshots without spending lab-scale money: use the recipe in this repo as a template
- You're squeezing into a tight RAM budget (8 GB Mac, multi-model hybrid stack)
- You want a CLI / npm / pip / Ollama distribution layer with daemon, HTTP REST, batch, confidence-routed cloud fallback, and eval-on-your-data, and you specifically want it on top of an open recipe you can re-run
Full per-split numbers (60% breakdown): mobile-app buttons 78%, text-labelled targets ~74%, icon-only ~41%. On labelled-button-heavy workloads (the common browser case), real-world accuracy is closer to the high end. Icons get fixed in v0.4 with more icon-rich training data.
Today, most AI agents route every screenshot to a cloud frontier model (GPT-4V, Claude Vision, Gemini) — just to figure out where to click. That's a $0.01–0.05 multimodal call adding 800ms–2s of round-trip latency, repeated 20–50 times per agent run. The bill compounds. The latency compounds. And screenshots full of private UI leave your machine.
A general-purpose 200B-parameter LLM is overkill for the question "where is the Submit button?" — that's a narrow vision task. The right architecture is a hybrid one: cheap fast specialist local models for the dedicated tasks they handle better, and the cloud LLM only for the planning and reasoning it's actually uniquely good at.
That's exactly what browserground is — the click-grounding specialist.
| | Pure-cloud (status quo) | Hybrid (with browserground + confidence routing) |
|---|---|---|
| Per-screenshot cost on the common case | $0.01–0.05 | $0 (local), cloud only on low-confidence escalations |
| Tokens billed by cloud per step | 1,500+ multimodal | ~40 text on the local path |
| Screenshots leave machine | yes | no for the local path |
| Rate limits | yes | no for the local path |
| Per-call latency (local path) | 800ms–2s round-trip | target ~1.5–3s MLX / ~10–14s transformers¹ |
¹ MLX numbers are targets for the 4-bit build that just shipped — first independent benchmarks land in v0.4. Transformers numbers are measured on MacBook Air M5 via MPS.
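To make the compounding concrete, the per-call figures quoted above can be multiplied out over one agent run. This is back-of-the-envelope arithmetic, not a measurement. `cloud_spend` is a hypothetical helper, and the 10% escalation rate used for the hybrid case is an assumption chosen only for illustration.

```python
def cloud_spend(calls: int, price_per_call: float, escalation_rate: float = 1.0) -> float:
    """Cloud spend in dollars for one agent run.

    escalation_rate is the fraction of screenshots that reach the cloud:
    1.0 for pure-cloud, lower when a local grounder handles the common case.
    """
    return calls * price_per_call * escalation_rate


# Pure-cloud bounds: 20-50 calls per run at $0.01-0.05 and 0.8-2 s each.
print(f"pure-cloud cost:    ${cloud_spend(20, 0.01):.2f} to ${cloud_spend(50, 0.05):.2f} per run")
print(f"pure-cloud latency: {20 * 0.8:.0f}s to {50 * 2.0:.0f}s per run")

# Hybrid with an ASSUMED 10% escalation rate (illustrative, not measured).
print(f"hybrid cost:        ${cloud_spend(20, 0.01, 0.10):.2f} to ${cloud_spend(50, 0.05, 0.10):.2f} per run")
```

Under these assumptions a pure-cloud run costs $0.20 to $2.50 and spends 16 to 100 seconds waiting on round trips. In the hybrid case the cloud bill scales directly with the escalation rate.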
Three packaged builds, one install for every stack:
| Build | Use it for | Install |
|---|---|---|
| MLX 4-bit (1.8 GB) | Apple Silicon, fastest | npm install -g browserground (auto) or pip install "browserground[mlx]" |
| GGUF Q4_K_M + f16 mmproj | Ollama, llama.cpp, cross-platform | ollama run renezander030/browserground |
| PEFT LoRA (67 MB on Qwen3-VL-2B base) | transformers, training, fine-tuning | pip install "browserground[transformers]" |
Plus the CLI surface every agent stack actually needs:
- `browserground parse <img> --target "..."`: single shot, strict JSON
- `browserground parse <img> --targets queries.txt --jsonl`: batch mode
- `browserground parse <img> --target "..." --confidence --alternatives 2`: confidence + diverse alternates
- `browserground serve`: Unix-socket daemon (model stays loaded)
- `browserground serve --http :8401`: HTTP REST daemon (POST /api/ground)
- `browserground eval <dir> <targets.json> --out report.json`: run accuracy + format-OK + p50/p95 latency on your own labelled data
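The three metrics the eval command reports can be sketched as follows. This is not the implementation behind `browserground eval`. It assumes the ScreenSpot-style criterion that a prediction is a hit when the centre of the predicted box falls inside the ground-truth box, and the sample dictionary layout (`raw`, `truth`, `latency_s`) is invented for this illustration.

```python
import json
import statistics


def evaluate(samples: list) -> dict:
    """Score grounding outputs.

    Each sample is a dict with:
      raw       - the model's raw output string
      truth     - ground-truth box [x1, y1, x2, y2]
      latency_s - wall-clock seconds for the call
    """
    format_ok = 0
    hits = 0
    for s in samples:
        try:
            x1, y1, x2, y2 = json.loads(s["raw"])["bbox_2d"]
        except (ValueError, KeyError, TypeError):
            continue  # unparseable output is a format failure and a miss
        format_ok += 1
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        tx1, ty1, tx2, ty2 = s["truth"]
        if tx1 <= cx <= tx2 and ty1 <= cy <= ty2:
            hits += 1
    # 99 cut points: index 49 is the 50th percentile, index 94 the 95th.
    cuts = statistics.quantiles(
        [s["latency_s"] for s in samples], n=100, method="inclusive"
    )
    n = len(samples)
    return {
        "accuracy": hits / n,
        "format_ok": format_ok / n,
        "p50_s": cuts[49],
        "p95_s": cuts[94],
    }


samples = [
    {"raw": '{"bbox_2d": [10, 10, 50, 30]}', "truth": [0, 0, 60, 40], "latency_s": 1.0},
    {"raw": '{"bbox_2d": [200, 200, 240, 220]}', "truth": [0, 0, 60, 40], "latency_s": 2.0},
    {"raw": "The button is near the top", "truth": [0, 0, 60, 40], "latency_s": 3.0},
    {"raw": '{"bbox_2d": [5, 5, 15, 15]}', "truth": [0, 0, 60, 40], "latency_s": 4.0},
]
report = evaluate(samples)
print(report["accuracy"], report["format_ok"])  # 0.5 0.75
```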
npm install -g browserground
browserground parse screenshot.png --target "Submit button"
# {"bbox_2d": [344, 612, 478, 658]}

Daemon mode for fast subsequent calls:
browserground serve &
browserground parse a.png --target "Chrome icon"
browserground parse b.png --target "the back arrow"
browserground stop

HTTP daemon (REST):
browserground serve --http :8401 &
curl -s -X POST localhost:8401/api/ground \
-H 'Content-Type: application/json' \
-d '{"image_path":"/abs/path/screen.png","target":"Submit button"}'

Batch, confidence, and eval modes: see docs/cli.md.
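The same REST call can be made from Python with only the standard library. The sketch below mirrors the curl request shown above: same endpoint, same two JSON fields. The function names are hypothetical, and the sketch assumes the daemon answers with a JSON body. The exact response schema is not documented here.

```python
import json
import urllib.request


def build_ground_request(image_path: str, target: str,
                         base_url: str = "http://localhost:8401") -> urllib.request.Request:
    """Build the POST /api/ground request without sending it."""
    payload = json.dumps({"image_path": image_path, "target": target}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/ground",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ground_via_http(image_path: str, target: str,
                    base_url: str = "http://localhost:8401",
                    timeout: float = 30.0) -> dict:
    """Send the request and decode the reply (assumed to be JSON)."""
    req = build_ground_request(image_path, target, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage (needs `browserground serve --http :8401` running):
# result = ground_via_http("/abs/path/screen.png", "Submit button")
```

Separating request construction from sending keeps the payload easy to inspect or unit-test without a running daemon.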
pip install "browserground[mlx]" # Apple Silicon (recommended)
pip install "browserground[transformers]"  # CUDA / CPU / MPS

from browserground import ground, click_xy
res = ground("screenshot.png", "the green Subscribe button")
print(res["bbox_2d"], res.get("confidence"))
x, y = click_xy("screenshot.png", "the back arrow")

ollama pull renezander030/browserground
ollama run renezander030/browserground "Locate: Submit button" /path/to/screen.png

mkdir -p .claude/skills/browserground
curl -sL https://raw.githubusercontent.com/renezander030/browserground/main/plugins/claude-code/SKILL.md \
> .claude/skills/browserground/SKILL.md

# Add to ~/.codex/AGENTS.md
tools:
  - name: browserground
    command: browserground parse "$IMAGE_PATH" --target "$TARGET"
    description: Locate a UI element on a screenshot. Returns {"bbox_2d":[x1,y1,x2,y2]}.

from browser_use import Agent, Controller
from browserground_adapter import register
controller = Controller()
register(controller)  # adds `click_target("the Submit button")` action

Drop-in adapter: plugins/browser-use/browserground_adapter.py.
from browserground_skyvern import ground_with_fallback
bbox = ground_with_fallback(
    screenshot_path, target,
    confidence_threshold=0.55,
    cloud_fallback=your_cloud_grounding_fn,
)

Adapter + integration notes: plugins/skyvern/.
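The routing idea behind a confidence threshold is simple enough to sketch in a few lines. This is an illustration of the pattern, not the code in plugins/skyvern/. `ground_or_escalate` and the stub grounders are hypothetical, the local result shape (`bbox_2d` plus an optional `confidence`) is borrowed from the Python example above, and the cloud function is assumed to return a box directly.

```python
from typing import Callable


def ground_or_escalate(
    screenshot_path: str,
    target: str,
    local_fn: Callable[[str, str], dict],
    cloud_fn: Callable[[str, str], list],
    confidence_threshold: float = 0.55,
) -> dict:
    """Return the local box when it is confident enough, else ask the cloud."""
    local = local_fn(screenshot_path, target)
    confidence = local.get("confidence")
    # A missing confidence is treated as low confidence (a choice made for this sketch).
    if confidence is not None and confidence >= confidence_threshold:
        return {"bbox_2d": local["bbox_2d"], "source": "local", "confidence": confidence}
    return {"bbox_2d": cloud_fn(screenshot_path, target), "source": "cloud", "confidence": confidence}


# Stub grounders so the sketch runs without a model or a network.
def confident_local(path, target):
    return {"bbox_2d": [344, 612, 478, 658], "confidence": 0.91}


def unsure_local(path, target):
    return {"bbox_2d": [0, 0, 10, 10], "confidence": 0.30}


def fake_cloud(path, target):
    return [340, 610, 480, 660]


print(ground_or_escalate("screen.png", "Submit button", confident_local, fake_cloud)["source"])  # local
print(ground_or_escalate("screen.png", "Submit button", unsure_local, fake_cloud)["source"])  # cloud
```

Recording the `source` alongside the box makes it easy to measure the real escalation rate on your own traffic and tune the threshold from there.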
- Base: Qwen/Qwen3-VL-2B-Instruct
- Method: LoRA rank 32 (34.9 M trainable params, 1.6% of base) on all linear modules of the LM
- Training mix (26k records): 6k OS-Atlas macOS desktop + 6k Android (aw_mobile) + 6k UIBert mobile + 8k wave-ui browser
- Schedule: 1 epoch, bf16, LR 1e-4 cosine, batch 1 × grad-accum 8, ~4.5 hr on a single RTX A6000
- Output: strict JSON {"bbox_2d": [x1, y1, x2, y2]}; system prompt + LoRA produce 100% parseable output
- Packaging: MLX 4-bit (Apple Silicon), GGUF Q4_K_M + f16 mmproj (Ollama / llama.cpp), PEFT adapter (transformers)
Training scripts and eval JSONs: renezander030/imgparse-tier1 (private — request access).
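A few derived quantities follow directly from the figures in the list above. The arithmetic below is a sanity check rather than output from the training scripts. It assumes every record is seen exactly once, and the percentage of trainable parameters is rounded, so the implied base size is approximate.

```python
# Training mix, as listed above.
mix = {
    "OS-Atlas macOS desktop": 6_000,
    "Android (aw_mobile)": 6_000,
    "UIBert mobile": 6_000,
    "wave-ui browser": 8_000,
}
total_records = sum(mix.values())

effective_batch = 1 * 8                       # batch 1 x grad-accum 8
optimizer_steps = total_records // effective_batch
seconds_per_step = 4.5 * 3600 / optimizer_steps

# 34.9 M trainable params quoted as 1.6% of the base.
implied_base_params = 34.9e6 / 0.016

print(total_records)                          # 26000
print(optimizer_steps)                        # 3250
print(f"{seconds_per_step:.1f} s/step")       # 5.0 s/step
print(f"{implied_base_params / 1e9:.2f} B")   # 2.18 B
```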
The gap to UI-TARS-1.5-7B on ScreenSpot-v2 (60% vs ~94%) is compute + data, not architecture. Concrete recipe to close it:
| Lever | v0.3 (this) | v0.5+ target |
|---|---|---|
| Training records | 26k | 250k–500k (10–20× more) |
| Epochs | 1 | 3–5 |
| Adapter size | LoRA rank 32 (1.6% of base) | rank 128 or full fine-tune |
| Icon-rich data | thin | balanced — closes the 41% icon split |
| Training stages | SFT only | SFT → DPO with preference data |
| Compute spend | $2.20 | ~$200–500 |
This is reproducible — the training scripts in imgparse-tier1 are the template. The current v0.3 is the recipe-validated milestone at the cheap end of the spectrum; the same code scales linearly to the higher-budget tier.
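As a rough check on the budget row, the v0.3 spend can be extrapolated linearly in sample passes (records × epochs). This is an assumption-laden estimate, not a quote. It takes the $2.20 figure from the table as the baseline and models only data volume and epochs. The larger adapter and the DPO stage are not modelled and would add to the total.

```python
BASELINE_RECORDS = 26_000
BASELINE_EPOCHS = 1
BASELINE_COST_USD = 2.20  # v0.3 compute spend from the table above


def linear_cost(records: int, epochs: int) -> float:
    """Cost if spend grows in proportion to records x epochs (an assumption)."""
    passes_ratio = (records * epochs) / (BASELINE_RECORDS * BASELINE_EPOCHS)
    return BASELINE_COST_USD * passes_ratio


low = linear_cost(250_000, 3)   # about 28.8x the baseline sample passes
high = linear_cost(500_000, 5)  # about 96.2x the baseline sample passes
print(f"${low:.0f} to ${high:.0f}")  # $63 to $212
```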
- Icon UI accuracy (~41%) lags text UI (~74%) — icons under-represented in the 26k training mix (fixed in v0.4)
- English-only training data
- No mouse-action prediction (only location — pair with an action predictor for full computer-use loops)
- MLX latency numbers are targets, not yet independently benchmarked at v0.3 release
Apache 2.0.
@misc{browserground-2026,
title = {browserground: Qwen3-VL-2B LoRA for hybrid AI agent UI grounding},
author = {Zander, René},
year = {2026},
url = {https://huggingface.co/renezander030/browserground}
}