geekproxy/businesswire-scraper

Businesswire scraper

Periodic monitor of businesswire.com press releases for a configured set of search queries, with full HTML and metadata saved locally.

Architecture

The /newsroom page on businesswire.com sits behind Akamai Bot Manager with strict sensor_data validation, so direct headless scraping of the search UI does not work. The search modal itself is only a thin client-side filter over the ~10 most recent releases that the page happens to embed. Instead, we use two officially published feeds (both listed in the site's own robots.txt):

  1. MRSS feed at feed.businesswire.com/mrss/home/?rss=... : the most recent ~50 releases with title, description, link and pubDate. This is the live source, polled every 15 minutes. The subdomain is also Akamai-protected, but curl_cffi with impersonate=chrome131 over a US residential proxy passes it cleanly.

  2. S3-hosted daily sitemap at bw-prod-sitemap.s3.us-east-1.amazonaws.com/... — every release of a given day. Public bucket, no Akamai. Used in backfill mode if the live MRSS window missed something.
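
As a rough illustration of what the live source yields, the sketch below pulls the four fields out of an RSS document. The scraper itself uses feedparser. This version swaps in the standard library's xml.etree.ElementTree so it runs without dependencies. SAMPLE_FEED and parse_items are hypothetical names, and the sample item is made up.

```python
import xml.etree.ElementTree as ET

# Made-up feed with the same four fields the scraper reads from each item.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Example Corp Announces Results</title>
      <description>Example Corp (NASDAQ: EXMP) today reported results.</description>
      <link>https://www.businesswire.com/news/home/20260514366956/en/Example</link>
      <pubDate>Fri, 15 May 2026 06:00:00 UT</pubDate>
    </item>
  </channel>
</rss>"""

FIELDS = ("title", "description", "link", "pubDate")


def parse_items(xml_text):
    """Return one dict per <item>, with missing fields as empty strings."""
    root = ET.fromstring(xml_text)
    return [
        {field: (item.findtext(field) or "").strip() for field in FIELDS}
        for item in root.iter("item")
    ]


if __name__ == "__main__":
    for entry in parse_items(SAMPLE_FEED):
        print(entry["pubDate"], entry["title"])
```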

Query matching is done locally: a release is kept when at least one configured query is a case-insensitive substring of its title or description. The MRSS window spans roughly 5–10 hours of releases, so consecutive 15-minute passes overlap heavily and a release should not slip through between them.
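
A minimal sketch of the matching rule described above. The function name match_queries is hypothetical, and scraper.py may normalise text differently.

```python
def match_queries(title, description, queries):
    """Return every query that is a case-insensitive substring of title or description."""
    # Join with a newline so a query cannot match across the title/description boundary.
    haystack = f"{title}\n{description}".casefold()
    return [q for q in queries if q.strip() and q.casefold() in haystack]


if __name__ == "__main__":
    queries = ["NASDAQ: ADBE", "oil"]
    print(match_queries("Adobe Reports Results", "Adobe (Nasdaq: ADBE) today reported", queries))
    # A short query also matches inside longer words, which is why the filter is called naive.
    print(match_queries("Turmoil in markets", "", queries))
```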

Release pages themselves are heavily protected. They're downloaded with the two-stage Akamai bypass from our 2026 guide:

  • Stage 1. Patchright (Chromium with binary stealth patches) opens the homepage over a sticky-US residential proxy, simulates real mouse and scroll motion, then visits one of the new release pages. Akamai's sensor fires under a real browser and we collect a valid _abck cookie set.
  • Stage 2. Cookies are transferred into a curl_cffi session with impersonate=chrome131 over the same sticky proxy, and every new release page is downloaded in parallel. Browser engine and TLS impersonation target match (Chromium ↔ chrome131), which is required to avoid Akamai's "browser version mismatch" anomaly check.
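
The hand-off between the two stages can be sketched roughly as follows. This assumes the browser cookies arrive as a list of dicts in the Playwright-style shape returned by context.cookies(), and that curl_cffi is installed. missing_akamai_cookies and build_session are hypothetical helper names, and whether all four cookies are strictly required is an assumption.

```python
AKAMAI_COOKIE_NAMES = frozenset({"_abck", "ak_bmsc", "bm_sz", "bm_sv"})


def missing_akamai_cookies(browser_cookies):
    """Return the Akamai cookie names that the warmup did not produce."""
    present = {c.get("name") for c in browser_cookies}
    return set(AKAMAI_COOKIE_NAMES - present)


def build_session(browser_cookies, proxy_url):
    """Copy browser cookies into a curl_cffi session on the same sticky proxy."""
    # Imported lazily so the cookie check above works without curl_cffi installed.
    from curl_cffi import requests

    session = requests.Session(
        impersonate="chrome131",  # must match the Chromium engine used in stage 1
        proxies={"http": proxy_url, "https": proxy_url},
    )
    for c in browser_cookies:
        session.cookies.set(
            c["name"], c["value"],
            domain=c.get("domain", ""), path=c.get("path", "/"),
        )
    return session


if __name__ == "__main__":
    warm = [{"name": "_abck", "value": "x", "domain": ".businesswire.com", "path": "/"}]
    print(sorted(missing_akamai_cookies(warm)))
```

Checking for missing cookies before building the session gives a cheap signal that the warmup needs more time.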

Pipeline of one pass

config.yaml ──> fetch_mrss() — curl_cffi+chrome131 over rotating residential
                  │
                  ▼
              feedparser ─> [title, description, link, pubDate] x ~50
                  │
                  ▼
              for each item: match query in title+description, extract
              release_id from /news/home/{id}/...; drop dupes via SQLite
                  │
                  ▼
              any new ?  ──no──> log "nothing new" and exit
                  │
                  ▼
              Patchright warmup (sticky-US, homepage + 1 release page)
                  │
                  ▼
              collect cookies (_abck, ak_bmsc, bm_sz, bm_sv)
                  │
                  ▼
              curl_cffi(impersonate=chrome131) + cookies + sticky proxy
                  │
                  ▼
              parallel download of every new release page
                  │
                  ▼
              save HTML + JSON metadata, insert into SQLite

Typical pass: 30–60 s for 0–10 new releases.
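
The release_id step in the diagram can be approximated with a single regular expression. This is a sketch under the assumption that the ID is the run of digits after /news/home/. extract_release_id is a hypothetical name.

```python
import re

# The release ID is assumed to be the digit run in /news/home/{id}/...
RELEASE_ID_RE = re.compile(r"/news/home/(\d+)/")


def extract_release_id(url):
    """Return the release ID from a release URL, or None if the URL has another shape."""
    match = RELEASE_ID_RE.search(url)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_release_id("https://www.businesswire.com/news/home/20260514366956/en/..."))
```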

Layout

businesswire-scraper/
├── config.yaml         queries + proxy + concurrency + feed URLs
├── scraper.py          single-pass entry point
├── requirements.txt
├── articles/           output: articles/YYYY/MM/DD/{release_id}.{html,json}
├── logs/scraper.log
├── db.sqlite           dedup + index
├── venv/               Python virtualenv
└── probe_*.py          reconnaissance scripts kept for reference

SQLite schema

articles(release_id PK, url, title, published_at, first_seen_at, html_path, size_bytes)
query_hits(query, release_id, seen_at, PRIMARY KEY(query, release_id))

query_hits records every (query, release_id) pair — useful when a release matches multiple queries.
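
A possible rendering of this schema with Python's sqlite3 module. The column names and keys come from the schema above. The column types, the helper names, and the use of INSERT OR IGNORE for dedup are assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    release_id    TEXT PRIMARY KEY,
    url           TEXT NOT NULL,
    title         TEXT,
    published_at  TEXT,
    first_seen_at TEXT,
    html_path     TEXT,
    size_bytes    INTEGER
);
CREATE TABLE IF NOT EXISTS query_hits (
    query      TEXT NOT NULL,
    release_id TEXT NOT NULL,
    seen_at    TEXT,
    PRIMARY KEY (query, release_id)
);
"""


def open_db(path=":memory:"):
    """Open the database and create both tables if they do not exist yet."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def is_new(conn, release_id):
    """True when the release has not been stored by an earlier pass."""
    row = conn.execute(
        "SELECT 1 FROM articles WHERE release_id = ?", (release_id,)
    ).fetchone()
    return row is None


def record_hit(conn, query, release_id, seen_at):
    """Store a (query, release_id) pair once; repeats are ignored by the primary key."""
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO query_hits (query, release_id, seen_at) VALUES (?, ?, ?)",
            (query, release_id, seen_at),
        )


if __name__ == "__main__":
    db = open_db()
    record_hit(db, "oil", "20260514366956", "2026-05-15T14:18:54Z")
    record_hit(db, "oil", "20260514366956", "2026-05-15T14:33:54Z")
    print(db.execute("SELECT COUNT(*) FROM query_hits").fetchone()[0])
```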

Install

git clone https://github.com/geekproxy/businesswire-scraper.git
cd businesswire-scraper
python3 -m venv venv
./venv/bin/pip install -r requirements.txt
./venv/bin/patchright install chromium

cp config.yaml.example config.yaml
cp .env.example .env
$EDITOR .env           # put your residential proxy credentials here

Patchright bundles its own Chromium and matching stealth patches.

Credentials can live either in config.yaml under proxy.username / proxy.password, or in .env as BW_PROXY_USERNAME / BW_PROXY_PASSWORD. .env wins if both are set. .env is gitignored.
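
The precedence rule might look like the sketch below. It assumes .env has already been loaded into the process environment and that the override applies per field. resolve_proxy_credentials is a hypothetical name.

```python
import os


def resolve_proxy_credentials(config, env=None):
    """Return (username, password), preferring BW_PROXY_* variables over config.yaml."""
    env = os.environ if env is None else env
    proxy = config.get("proxy", {})
    # Environment values win; config.yaml is the fallback.
    username = env.get("BW_PROXY_USERNAME") or proxy.get("username")
    password = env.get("BW_PROXY_PASSWORD") or proxy.get("password")
    if not username or not password:
        raise ValueError("proxy credentials missing from both .env and config.yaml")
    return username, password


if __name__ == "__main__":
    cfg = {"proxy": {"username": "cfg_user", "password": "cfg_pass"}}
    print(resolve_proxy_credentials(cfg, env={"BW_PROXY_USERNAME": "env_user",
                                              "BW_PROXY_PASSWORD": "env_pass"}))
```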

Configure

Edit config.yaml:

  • queries: free-form substrings (tickers like NASDAQ: ADBE, topics like oil, cybersecurity, ...). Case-insensitive.
  • proxy.username, proxy.password: Geekproxy credentials.
  • proxy.country: two-letter ISO 3166-1 country code to pin (default us). Must match the Accept-Language claimed by the client (en-US).
  • proxy.sticky_port_range: pool of sticky-session ports. On Geekproxy each port in the range is a distinct sticky session (different IP). The scraper probes them and picks the first one that returns the businesswire homepage quickly. Raise the upper bound if your pool often serves slow IPs.
  • proxy.health_check_timeout: seconds to wait per port during probing.
  • source.mrss_url: the MRSS feed URL from businesswire's robots.txt. Stable as far as we know; update if businesswire reissues the feed key.
  • source.english_only: filter to English releases (/en/ in URL).
  • runtime.article_concurrency: parallel downloads in stage 2.
  • runtime.challenge_wait_ms: how long to let Akamai sensor run on the homepage during warmup. Raise if you see "warmup did not yield Akamai cookies".
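
The sticky-port probing controlled by proxy.sticky_port_range and proxy.health_check_timeout can be sketched as below. The probe callable is injected so the selection logic stays independent of any HTTP client. pick_sticky_port and the probe signature are hypothetical, not the scraper's actual interface.

```python
import time


def pick_sticky_port(ports, probe, timeout):
    """Return the first port whose probe succeeds within `timeout` seconds, else None.

    `probe(port, timeout)` should fetch the homepage through that port and
    return True on success; exceptions count as a failed probe.
    """
    for port in ports:
        started = time.monotonic()
        try:
            ok = probe(port, timeout)
        except Exception:
            ok = False
        if ok and time.monotonic() - started <= timeout:
            return port
    return None


if __name__ == "__main__":
    # Fake probe for illustration: pretend only port 10002 has a responsive IP.
    print(pick_sticky_port(range(10000, 10011), lambda port, t: port == 10002, 5.0))
```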

Run

Live pass:

./venv/bin/python3 scraper.py

Backfill from S3 daily sitemap (URL slug used for query matching, since sitemap has no title/description):

./venv/bin/python3 scraper.py --backfill 2026-05-04

Log goes to logs/scraper.log and stdout. Live pass typically takes 30–60 s. Backfill scales with how many sitemap URLs match the query slugs.
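
Slug-based matching in backfill mode could work along these lines. How scraper.py actually normalises queries and slugs is not documented here, so the slugify rule below is an assumption, the function names are hypothetical, and the URL is made up.

```python
import re
from urllib.parse import unquote, urlparse


def slugify(text):
    """Lower-case and collapse every run of non-alphanumerics into one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def match_slug(url, queries):
    """Return the queries whose slug form appears in the last path segment of the URL."""
    path = unquote(urlparse(url).path).rstrip("/")
    slug = slugify(path.rsplit("/", 1)[-1])
    return [q for q in queries if slugify(q) and slugify(q) in slug]


if __name__ == "__main__":
    url = ("https://www.businesswire.com/news/home/20260504000001/en/"
           "Example-Corp-Expands-Artificial-Intelligence-Platform")
    print(match_slug(url, ["artificial intelligence", "NASDAQ: ADBE"]))
```

A ticker that appears only in the release body never reaches the slug, so it does not match here.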

Schedule every 15 min via cron

*/15 * * * * cd /root/businesswire-scraper && ./venv/bin/python3 scraper.py >> logs/cron.log 2>&1

Multiple passes do not collide: dedup is in SQLite, file paths are deterministic, and an in-flight Patchright run is its own process so overlap is safe.

What goes into articles/{date}/{release_id}.json

{
  "release_id": "20260514366956",
  "url": "https://www.businesswire.com/news/home/20260514366956/en/...",
  "title": "BearingPoint publishes its Annual Report ...",
  "description": "BearingPoint, the €1+ billion management and ...",
  "published": "Fri, 15 May 2026 06:00:00 UT",
  "matched_queries": ["artificial intelligence"],
  "first_seen_at": "2026-05-15T14:18:54Z"
}

matched_queries lists every configured query that matched this release on the pass that first saved it. For backfill items, an extra "source": "backfill YYYY-MM-DD" field is added.
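
The sketch below shows one way to derive the output paths for a release. Which date the scraper uses for the YYYY/MM/DD directory is not stated, so using the feed's pubDate is an assumption. article_paths is a hypothetical name.

```python
from email.utils import parsedate_to_datetime
from pathlib import PurePosixPath


def article_paths(root, release_id, day):
    """Return (html_path, json_path) under root/YYYY/MM/DD/ for the given date."""
    directory = PurePosixPath(root) / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}"
    return directory / f"{release_id}.html", directory / f"{release_id}.json"


if __name__ == "__main__":
    # RFC 822 style dates such as the feed's pubDate parse with the standard library.
    published = parsedate_to_datetime("Fri, 15 May 2026 06:00:00 UT")
    html_path, json_path = article_paths("articles", "20260514366956", published.date())
    print(html_path)
    print(json_path)
```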

Tuning notes

  • MRSS window. The feed holds the ~50 latest releases (~5–10 hours of output), so 15-minute polling leaves a wide overlap between passes. If you want a bigger safety margin, set up a nightly cron with --backfill $(date -u -d yesterday +%F). This form requires GNU date, and inside a crontab line the % must be escaped as \%.
  • Substring vs. ranked match. This is a naive substring filter. A query like "oil" will match anything mentioning oil; a query like "NASDAQ: ADBE" is very precise because tickers usually appear verbatim in title or description. Choose specificity accordingly.
  • Residential rotation. Live MRSS fetch uses the rotating port (823) because feed.businesswire.com is tolerant of fresh IPs and residential pools occasionally hand out a "cold" IP that times out. Warmup and bulk download use a sticky port from the pool: ports 10000..10010 are distinct sticky sessions on Geekproxy; the scraper probes them, picks one that responds quickly, and uses that single IP for the whole flow. If many downloads start failing mid-pass it re-probes the pool and picks a fresh sticky.
  • _abck lifetime. When many downloads start failing, the script performs a single re-warmup and retries failed items. If problems persist, raise runtime.max_retries or shorten the cron interval.
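
The re-warmup behaviour from the last bullet can be sketched as below. This simplified version downloads sequentially and re-warms whenever any item fails, whereas the scraper downloads in parallel and reacts only when many downloads fail. The function name and the warmup and fetch callables are hypothetical.

```python
def download_with_rewarmup(urls, fetch, warmup, max_retries=1):
    """Download every URL, re-running warmup before each retry of the failed ones.

    `warmup()` returns a session with fresh cookies; `fetch(session, url)`
    returns the page body or raises on a blocked or failed request.
    Returns (results, still_failed).
    """
    session = warmup()
    results = {}
    pending = list(urls)
    for attempt in range(max_retries + 1):
        failed = []
        for url in pending:
            try:
                results[url] = fetch(session, url)
            except Exception:
                failed.append(url)
        pending = failed
        if not pending or attempt == max_retries:
            break
        session = warmup()  # cookies are assumed stale; collect a fresh set
    return results, pending


if __name__ == "__main__":
    sessions = []

    def fake_warmup():
        sessions.append(object())
        return len(sessions)

    def fake_fetch(session, url):
        if session == 1 and url == "b":
            raise RuntimeError("blocked")
        return f"html:{url}"

    print(download_with_rewarmup(["a", "b"], fake_fetch, fake_warmup))
```

With max_retries=1 this performs at most one re-warmup.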

Known limits

  • Naive substring matching. Ranked relevance would require integrating Algolia (used internally for site content, not for press releases) or an external search engine.
  • Backfill uses the sitemap URL slug for matching, since the sitemap has no title/description. The slug usually contains the title in hyphenated form, so most queries match correctly, but a query that only appears in the body (e.g. a ticker hidden inside the release text) will not be caught at the sitemap stage.
  • We do not currently track query coverage history in the JSON files. If a release matches a new query on a later pass, its JSON is not re-tagged: it keeps only the queries matched when the release was first saved. If you need the full history, query the query_hits table.
