Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
54 changes: 54 additions & 0 deletions domain-skills/google-trends/scraping.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
# Google Trends + Glimpse — Scraping & Trend Extraction

Two layers on `trends.google.com`: **raw Google Trends** (relative 0-100 index — Google's real data) and the **Glimpse** browser extension (paid) which overlays *modeled absolute monthly volume*, YoY growth, related-with-growth, breakout discovery, alerts, and CSV export. If Glimpse is installed in the connected Chrome, its data renders automatically — no separate site.

## URL patterns (explore)

- Single term: `https://trends.google.com/trends/explore?date=today%205-y&geo=US&q=<urlenc>`
- Compare (≤5 terms, one normalized axis): `...&q=term1,term2,term3` (comma-separated, each URL-encoded)
- Time window via `date=`: `today%205-y` (5y), `today%2012-m` (12mo), **`all`** (2004→present — best for structural breaks), `now%207-d`, or custom `YYYY-MM-DD%20YYYY-MM-DD`
- Geo via `geo=`: `US`, `GB`, … ; **omit `geo` entirely for Worldwide**
- Property: `&gprop=youtube|news|images|froogle`; category: `&cat=<id>`

## Glimpse overlay — detect & time it

- When Glimpse injects, the **tab title gains a " - Glimpse" suffix** — cheap "is it active?" check via `page_info()["title"]`.
- Glimpse fetches its volume model *after* the Trends chart paints. `wait_for_load()` is NOT enough — add `wait(5)`–`wait(7)`, or poll until the volume text appears.

## Extraction (single term, from innerText)

```python
EXTRACT = r'''
const out={};
let txt=document.body.innerText.replace(/[−–—]/g,'-'); // normalize unicode minus FIRST
let m=txt.match(/([\d.]+\s*[KMB])\s+searches past month/i);
out.volume=m?m[1].replace(/\s+/g,''):(/Undetermined volume/i.test(txt)?'undet':null);
m=txt.match(/(-?\d+(?:\.\d+)?)%\s*past year/i); out.yoy=m?(m[1]+'%'):null; // -? captures DOWN terms
out.dir=null;
if(out.yoy){ out.dir=out.yoy.startsWith('-')?'down':'up';
const leaf=[...document.querySelectorAll('*')].find(e=>e.children.length===0 && e.textContent.trim().replace('−','-')===out.yoy);
if(leaf){const c=getComputedStyle(leaf).color;const mm=c.match(/\d+/g);
if(mm){const r=+mm[0],g=+mm[1];out.color=(r>g+15)?'red':(g>r+15)?'green':'gray';}}}
return out;
'''
data = js(EXTRACT) # -> {volume:'44K', yoy:'50%', dir:'up', color:'green'}
```

- Poll: loop `wait(1.2); js(EXTRACT)` up to ~12× until `volume` or `yoy` is non-null.
- **Direction = sign AND color.** Green `rgb(118,191,80)` = up, red = down. The percent is rendered *unsigned* in some views, so the unicode-minus normalization + color check are BOTH needed — a naive `\d+%` silently misreads down-terms as up.
- Low-volume terms render **"Undetermined volume"** (no absolute estimate) but still show a YoY trajectory; down-terms frequently have no absolute volume at all.

## Glimpse features worth scripting

- **People Also Search**: related queries sortable by Search Volume, each with a growth bar — rising-related discovery for a seed term (channel switch: Google / YouTube / Amazon).
- **Discover Trends** (top nav): category → drill-down table *Keyword / Graph-5Y / Growth-YoY / Search Volume*, paginated, + **Download CSV**. Surfaces breakout terms you didn't guess. Click a category by matching its row text+count; results **lazy-load (~9s)** — wait before reading innerText. Row text parses as repeating `keyword \n "Get Alerts" \n "N%"`.
- **Get Alerts**: one click opens a "Tracking" popover with **Auto pre-checked + ✓ Saved** — i.e. one click == a saved auto-alert on significant movement. The "Saved" text lags ~1-2s; verify via the button flipping to "✓ Tracking", not an immediate innerText check.
- **Export Data**: CSV of the monthly series (highest fidelity; needs download handling).

## Traps

- **Glimpse absolute volume is a MODEL, not ground truth** — independent verification rates its absolute-volume accuracy as unreliable. Trust *relative shape, direction, head-to-head*; treat absolute numbers as rough.
- **Raw Trends is a relative 0-100 index** (each point ÷ the term's own peak) — never absolute. **Compare mode normalizes ALL terms to the single highest peak**, so a high-volume term visually crushes low-volume ones — only compare terms of similar magnitude.
- **Reuse one controlled tab + loop `goto_url`** (Page.navigate forces a fresh load and re-triggers Glimpse) rather than `new_tab` per query — no tab clutter, never touches the user's tabs.
- The chart is **not recharts and exposes no JSON / `__react*` props** — the monthly series isn't cleanly in the DOM (SVG path only). For exact numbers use Export Data CSV; for shape, screenshot.
- The Glimpse nav's **"Discover Trends" / "Tracking Dashboard" are overlays**, not URL routes — click the nav `div` by exact text; "Tracking Dashboard" can be finicky to trigger.
15 changes: 15 additions & 0 deletions domain-skills/reddit/scraping.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,21 @@ Fails on:
- Private / quarantined subreddits (401)
- NSFW posts without an authenticated session
- Anti-scraping 429s under load — back off or switch to the browser path
- **`http_get()` against `www.reddit.com/.json` now returns `403 Blocked`** (datacenter-IP block), even with a browser `User-Agent`. A same-origin `fetch()` from a `www.reddit.com` tab also returns an HTML block page. **Fix: use `old.reddit.com`** — navigate a tab there, then same-origin `fetch()` the `.json` endpoints (served with the user's cookies, not blocked).

### Listing endpoints (discover concerns / mine a sub, not just one post)

```python
# on an old.reddit.com tab — same-origin fetch dodges the datacenter block
goto_url("https://old.reddit.com/"); wait_for_load(); wait(2)
# /r/<sub>/{top,hot,new,controversial}.json?limit=100&t={hour,day,week,month,year,all}
expr = ('fetch("/r/ExperiencedDevs/top.json?limit=100&t=year",{headers:{Accept:"application/json"}})'
'.then(r=>r.json()).then(j=>JSON.stringify(j.data.children.map(c=>('
'{t:c.data.title,s:c.data.score,c:c.data.num_comments,f:c.data.link_flair_text||"",self:(c.data.selftext||"").slice(0,600),id:c.data.id}))))')
posts = json.loads(js(expr)) # note: avoid the word "return" in js() expressions using await/promises — it triggers a sync-IIFE wrap that breaks top-level await; use .then() chains instead
```

Rank by `num_comments` (engagement = nerve hit). To fan out over many subs/sorts in one `js()` call, wrap the per-URL fetches in `Promise.all([...].map(u => fetch(u)...)).then(a=>JSON.stringify(a))` — all arrow/implicit-return, no `return` keyword.

## Path 2: Browser DOM extraction (logged-in)

Expand Down