-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Add x/followers domain skill (followers/profile scraping) #397
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
MoayadAlismail
wants to merge
1
commit into
browser-use:main
Choose a base branch
from
MoayadAlismail:add-x-followers-skill
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,105 @@ | ||
| # X (Twitter) — Followers & profile scraping | ||
|
|
||
| Companion to `posting.md`. Covers reading other users' followers/following lists and profile data via DOM. The private GraphQL endpoints work too, but DOM extraction is more robust and avoids hammering APIs. | ||
|
|
||
| ## Auth check (do this first) | ||
|
|
||
| `auth_token` is HttpOnly so it never appears in `document.cookie`. Check for `ct0` + `twid` instead, and confirm `https://x.com/home` does not redirect to `https://x.com/`: | ||
|
|
||
| ```python | ||
| js(r''' | ||
| var cookies = document.cookie.split(";").map(s => s.trim().split("=")[0]); | ||
| var has_auth = cookies.includes("ct0") && cookies.includes("twid"); | ||
| var profile_link = document.querySelector('[data-testid="AppTabBar_Profile_Link"]'); | ||
| return JSON.stringify({has_auth, handle: profile_link ? profile_link.getAttribute("href") : null}); | ||
| ''') | ||
| ``` | ||
|
|
||
| If unauthenticated, X redirects follower URLs to `/i/jf/onboarding/web?redirect_after_login=...&mode=login`. Stop and ask the user — do not type credentials. | ||
|
|
||
| ## Public follower list cap (important) | ||
|
|
||
| For users you do not own, the public `/<user>/followers` list stops paginating after a bounded slice — there is no "Load more" button, no spinner, the scroll height simply stops growing. This is an X-imposed per-request limit, not a soft-block on your account. The exact size is not fixed at the often-quoted ~50–60: on a large account (Ridd, ~May 2026) the main tab yielded **73 unique** before it stopped. | ||
|
|
||
| The separate **`/<user>/verified_followers`** tab returns another ~20 entries with partial overlap, so harvesting both yields **~85–90 unique** users on a big account. `following` behaves similarly. | ||
|
|
||
| If you need the full follower graph, you need owner access or a third-party data source. Do **not** retry from a fresh tab hoping for more — the cap is on the request, not the session. | ||
|
|
||
| ## DOM extraction (preferred over GraphQL) | ||
|
|
||
| Every follower row is `article[data-testid="UserCell"]`. Inside: | ||
|
|
||
| - 3 anchors with `href="/<handle>"` (avatar, name, handle) — first one is the canonical | ||
| - 1 anchor with `href="https://t.co/..."` for the bio URL (if user has one) | ||
| - `innerText` is `"<name>\n@<handle>\n<Follow|Following|Follow back>\n<bio lines>"` | ||
|
|
||
| ```python | ||
| js(r''' | ||
| var cells = document.querySelectorAll('[data-testid="UserCell"]'); | ||
| var out = []; | ||
| cells.forEach(c => { | ||
| var anchors = c.querySelectorAll('a[role="link"]'); | ||
| var handle_href = null, ext_url = null; | ||
| anchors.forEach(a => { | ||
| var h = a.getAttribute("href") || ""; | ||
| if (h.startsWith("/") && !handle_href) handle_href = h; | ||
| if (h.startsWith("http") && !ext_url) ext_url = h; | ||
| }); | ||
| var lines = (c.innerText||"").split("\n").map(s=>s.trim()).filter(Boolean); | ||
| var bio_lines = lines.filter(l => l !== "Follow" && l !== "Following" && l !== "Follow back"); | ||
| out.push({handle: handle_href, name: bio_lines[0], at: bio_lines[1], | ||
| bio: bio_lines.slice(2).join(" | "), ext_url}); | ||
| }); | ||
| return JSON.stringify(out); | ||
| ''') | ||
| ``` | ||
|
|
||
| X virtualizes the list — old rows unmount as you scroll. **Use small scroll steps (≈800px) with a 1.4s wait** so freshly-rendered rows are captured before they recycle. Bigger steps (2000px) skip past entire batches. | ||
|
|
||
| ```python | ||
| import time | ||
| seen = {} | ||
| no_growth = 0 | ||
| while no_growth < 4 and len(seen) < 100: | ||
| for c in json.loads(js(EXTRACT)): | ||
| if c.get("at") and c["at"] not in seen: | ||
| seen[c["at"]] = c | ||
| new = len(seen) - prev; prev = len(seen) | ||
| no_growth = no_growth + 1 if new == 0 else 0 | ||
| js("window.scrollBy(0, 800)") | ||
| time.sleep(1.4) | ||
| ``` | ||
|
|
||
| ## Profile page selectors | ||
|
|
||
| Stable as of late 2025: | ||
|
|
||
| | What | Selector | | ||
| |---|---| | ||
| | Bio | `[data-testid="UserDescription"]` | | ||
| | URL field | `[data-testid="UserUrl"]` (text only; full URL needs `.title` or hover) | | ||
| | Location | `[data-testid="UserLocation"]` | | ||
| | Join date | `[data-testid="UserJoinDate"]` | | ||
| | Followers / Following counts | `a[href$="/followers"]`, `a[href$="/following"]`, `a[href$="/verified_followers"]` — read `.innerText` | | ||
| | DM button (open DMs) | `[data-testid="sendDMFromProfile"]` — present only when DMs are open | | ||
| | Tweet article | `article[data-testid="tweet"]` | | ||
| | Tweet body | `[data-testid="tweetText"]` (inside the article) | | ||
| | Pinned indicator | `[data-testid="socialContext"]` inside the article ("Pinned" label) | | ||
|
|
||
| ## GraphQL capture (when you actually need it) | ||
|
|
||
| `drain_events()` returns at most ~500 events, so a long page load can flush the Followers GraphQL request out of the buffer before you read it. Two workarounds: | ||
|
|
||
| - Call `cdp("Network.enable")` and `drain_events()` immediately after a small scroll (just a few hundred px) so only a handful of new requests fire. | ||
| - Or use `Network.getResponseBody` against a known requestId captured the moment after a single user-initiated scroll. | ||
|
|
||
| When you do see them, the relevant endpoints are GET requests to `https://x.com/i/api/graphql/<queryId>/Followers` / `FollowersYouKnow` / `VerifiedFollowers`. The query is encoded in `?variables=...&features=...&fieldToggles=...` (URL-encoded JSON). `userId` lives inside `variables`, and `cursor` is in `variables.cursor` after the first page. Required headers when replaying: `authorization` (the static Bearer), `x-csrf-token` (= `ct0` cookie value), `x-twitter-active-user: yes`, `x-twitter-auth-type: OAuth2Session`, `cookie`. The same ~50–60 cap applies — replaying with cursors does not bypass it. | ||
|
|
||
| ## Gotchas | ||
|
|
||
| - **`/home` vs `/`** — logged-in users see `x.com/home`; the redirect to `x.com/` is a fast auth check. | ||
| - **Profile loads but no tweet rows** — usually still rendering. `wait_for_load` + 2.5s `time.sleep` is enough; bigger waits don't help. | ||
| - **Account that exists but shows zero info** — could be suspended, locked, or shadow-protected. `current_tab().get("title")` will say "Profile" or contain the handle even when content is hidden. | ||
| - **t.co wrapping** — every external URL on X is wrapped in `https://t.co/<id>`. The visible domain text (e.g. "jhey.dev") is the real target; the `href` is the t.co redirect. Either is fine to record; use `http_get(t_co_url)` and follow redirects when you need the canonical. | ||
| - **Don't collect avatars.** Profile photos are bait for face-based filtering. Bio + bio URL + tweets are sufficient signal for evaluating a designer. | ||
| - **Shared daemon → tab stealing.** If a second agent (or the user) is driving the same daemon, `new_tab()` / `js()` operate on the daemon's "current tab", which the other actor keeps re-activating — your followers page silently becomes their tweet thread mid-scroll. Symptom: `location.href` flips to an unrelated URL between calls. Fix: own a dedicated target and pin every call to it. `tid = cdp("Target.createTarget", url="https://x.com/<user>/followers")["targetId"]` (pass the URL directly so you never touch the current tab), poll readiness with `js("document.readyState", target_id=tid)`, then run all extraction/scroll as `js(expr, target_id=tid)`. To navigate that same pinned tab later, `Page.navigate` over its attached `sessionId` rather than `goto_url`. | ||
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
P1: Follower-harvesting loop uses
prevbefore initialization, causing aNameErroron the first iteration.Prompt for AI agents