Businesswire scraper

Periodic monitor of businesswire.com press releases for a configured set of search queries, with full HTML and metadata saved locally.

Architecture

/newsroom on businesswire sits behind Akamai Bot Manager with strict sensor_data validation, so direct headless scraping of the search UI does not work. The search modal itself is only a thin client-side filter over the ~10 most recent releases the page happens to embed. Instead we use two officially published feeds (both listed in businesswire's own robots.txt):

MRSS feed at feed.businesswire.com/mrss/home/?rss=... (the live source, polled every 15 minutes). It carries the most recent ~50 releases with title, description, link and pubDate. The subdomain is also Akamai-protected, but curl_cffi with impersonate=chrome131 over a US residential proxy passes it cleanly.
S3-hosted daily sitemap at bw-prod-sitemap.s3.us-east-1.amazonaws.com/... — every release of a given day. Public bucket, no Akamai. Used in backfill mode if the live MRSS window missed something.
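As a rough illustration of what one MRSS item yields, the sketch below pulls the four fields out of an RSS document. It uses the standard library's xml.etree instead of feedparser (which the scraper itself uses) so that it runs without dependencies. The sample feed, its release id and the function name parse_feed_items are made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the general shape of an RSS 2.0 feed; not real feed output.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>Acme Corp Announces Cybersecurity Partnership</title>
      <description>Acme Corp (NASDAQ: ACME) today announced a partnership.</description>
      <link>https://www.businesswire.com/news/home/20260514000001/en/Acme-Corp-Announces-Cybersecurity-Partnership</link>
      <pubDate>Fri, 15 May 2026 06:00:00 UT</pubDate>
    </item>
  </channel>
</rss>"""

FIELDS = ("title", "description", "link", "pubDate")


def parse_feed_items(xml_text):
    """Return one dict per <item>, with missing fields as empty strings."""
    root = ET.fromstring(xml_text)
    return [
        {field: (item.findtext(field) or "").strip() for field in FIELDS}
        for item in root.iter("item")
    ]
```

Each dict then feeds directly into the local query matching step.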

Query matching is done locally: a release is kept when at least one configured query is a case-insensitive substring of its title or description. The MRSS window is ~5–10 hours, so consecutive 15-minute passes overlap heavily and a release should not slip through between them.
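The matching rule can be sketched in a few lines. This is a minimal illustration, not the code in scraper.py; the function name matched_queries is hypothetical.

```python
def matched_queries(queries, title, description):
    """Return every query that is a case-insensitive substring of title or description."""
    # The newline keeps a query from matching across the title/description boundary.
    haystack = f"{title}\n{description}".casefold()
    return [q for q in queries if q.casefold() in haystack]
```

A release is kept when the returned list is non-empty. Because this is plain substring matching, a short query such as "oil" also matches words like "turmoil".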

Release pages themselves are heavily protected. They're downloaded with the two-stage Akamai bypass from our 2026 guide:

Stage 1. Patchright (Chromium with binary stealth patches) opens the homepage over a sticky-US residential proxy, simulates real mouse and scroll motion, then visits one of the new release pages. Akamai's sensor fires under a real browser and we collect a valid _abck cookie set.
Stage 2. Cookies are transferred into a curl_cffi session with impersonate=chrome131 over the same sticky proxy, and every new release page is downloaded in parallel. Browser engine and TLS impersonation target match (Chromium ↔ chrome131), which is required to avoid Akamai's "browser version mismatch" anomaly check.
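The hand-off between the two stages amounts to reshaping the browser's cookie list into something an HTTP client accepts. The sketch below assumes Playwright-style cookie dicts (a list of dicts with name, value and domain keys, which is what a Playwright-compatible browser context returns). The helper names are hypothetical and the real script may transfer cookies differently.

```python
def cookies_for_http_client(browser_cookies, domain_suffix="businesswire.com"):
    """Flatten Playwright-style cookie dicts into a {name: value} mapping.

    Only cookies scoped to domain_suffix (or a subdomain of it) are kept.
    """
    jar = {}
    for cookie in browser_cookies:
        domain = cookie.get("domain", "").lstrip(".")
        if domain == domain_suffix or domain.endswith("." + domain_suffix):
            jar[cookie["name"]] = cookie["value"]
    return jar


def has_akamai_cookie(jar):
    """True when the warmup produced a non-empty _abck cookie."""
    return bool(jar.get("_abck"))
```

The resulting dict has the shape that requests-style clients take as cookies. If has_akamai_cookie returns False, the warmup did not get far enough for the sensor to fire.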

Pipeline of one pass

config.yaml ──> fetch_mrss() — curl_cffi+chrome131 over rotating residential
                  │
                  ▼
              feedparser ─> [title, description, link, pubDate] x ~50
                  │
                  ▼
              for each item: match query in title+description, extract
              release_id from /news/home/{id}/...; drop dupes via SQLite
                  │
                  ▼
              any new ?  ──no──> log "nothing new" and exit
                  │
                  ▼
              Patchright warmup (sticky-US, homepage + 1 release page)
                  │
                  ▼
              collect cookies (_abck, ak_bmsc, bm_sz, bm_sv)
                  │
                  ▼
              curl_cffi(impersonate=chrome131) + cookies + sticky proxy
                  │
                  ▼
              parallel download of every new release page
                  │
                  ▼
              save HTML + JSON metadata, insert into SQLite

Typical pass: 30–60 s for 0–10 new releases.
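The release_id extraction step in the diagram can be sketched with a regular expression. This assumes the id is the run of digits after /news/home/, as in the example URL later in this README; the names are illustrative.

```python
import re

# Assumes the id is purely numeric, as in /news/home/20260514366956/en/...
RELEASE_ID_RE = re.compile(r"/news/home/(\d+)/")


def extract_release_id(url):
    """Return the release id embedded in a release URL, or None if absent."""
    match = RELEASE_ID_RE.search(url)
    return match.group(1) if match else None
```

Items whose id is None, or whose id is already in SQLite, would be dropped before the warmup stage.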

Layout

businesswire-scraper/
├── config.yaml         queries + proxy + concurrency + feed URLs
├── scraper.py          single-pass entry point
├── requirements.txt
├── articles/           output: articles/YYYY/MM/DD/{release_id}.{html,json}
├── logs/scraper.log
├── db.sqlite           dedup + index
├── venv/               Python virtualenv
└── probe_*.py          reconnaissance scripts kept for reference
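To illustrate the articles/YYYY/MM/DD/{release_id}.{html,json} layout, here is a sketch that derives both paths from a release id and its pubDate string. Which date the scraper actually uses for the directory is not stated here; this sketch assumes the published date in UTC, and the function name article_paths is hypothetical.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime
from pathlib import Path


def article_paths(root, release_id, published):
    """Return (html_path, json_path) under root/YYYY/MM/DD/ for one release.

    published is an RFC 822 style date string as found in the feed's pubDate.
    Assumption: the directory is keyed on the published date in UTC.
    """
    dt = parsedate_to_datetime(published).astimezone(timezone.utc)
    day_dir = Path(root) / f"{dt:%Y}" / f"{dt:%m}" / f"{dt:%d}"
    return day_dir / f"{release_id}.html", day_dir / f"{release_id}.json"
```

Because the path depends only on the release id and its date, re-running a pass writes to the same location instead of creating duplicates.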

SQLite schema

articles(release_id PK, url, title, published_at, first_seen_at, html_path, size_bytes)
query_hits(query, release_id, seen_at, PRIMARY KEY(query, release_id))

query_hits records every (query, release_id) pair — useful when a release matches multiple queries.
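A minimal sketch of how the two tables could be created and used for dedup with Python's sqlite3 module. The column types and the helper names are assumptions; only the table names, column names and primary keys come from the schema above.

```python
import sqlite3

# Column types are assumed; the schema above only names the columns and keys.
SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    release_id    TEXT PRIMARY KEY,
    url           TEXT NOT NULL,
    title         TEXT,
    published_at  TEXT,
    first_seen_at TEXT,
    html_path     TEXT,
    size_bytes    INTEGER
);
CREATE TABLE IF NOT EXISTS query_hits (
    query      TEXT NOT NULL,
    release_id TEXT NOT NULL,
    seen_at    TEXT,
    PRIMARY KEY (query, release_id)
);
"""


def open_db(path=":memory:"):
    """Open the database and create both tables if they do not exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def is_new(conn, release_id):
    """True when the release has not been stored yet."""
    row = conn.execute(
        "SELECT 1 FROM articles WHERE release_id = ?", (release_id,)
    ).fetchone()
    return row is None


def record_hits(conn, release_id, queries, seen_at):
    """Insert one row per (query, release_id); repeats are ignored by the primary key."""
    conn.executemany(
        "INSERT OR IGNORE INTO query_hits (query, release_id, seen_at) VALUES (?, ?, ?)",
        [(q, release_id, seen_at) for q in queries],
    )
    conn.commit()
```

The composite primary key is what makes repeated passes idempotent: recording the same pair twice leaves a single row.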

Install

git clone https://github.com/geekproxy/businesswire-scraper.git
cd businesswire-scraper
python3 -m venv venv
./venv/bin/pip install -r requirements.txt
./venv/bin/patchright install chromium

cp config.yaml.example config.yaml
cp .env.example .env
$EDITOR .env           # put your residential proxy credentials here

Patchright bundles its own Chromium and matching stealth patches.

Credentials can live either in config.yaml under proxy.username / proxy.password, or in .env as BW_PROXY_USERNAME / BW_PROXY_PASSWORD. .env wins if both are set. .env is gitignored.
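The precedence rule can be sketched as follows. It assumes the .env file has already been loaded into the process environment and that config.yaml has been parsed into a dict; the function name resolve_proxy_credentials is hypothetical.

```python
import os


def resolve_proxy_credentials(config, env=None):
    """Return (username, password), preferring environment values over config.yaml.

    config is the parsed config.yaml as a dict; env defaults to os.environ.
    """
    env = os.environ if env is None else env
    proxy_cfg = config.get("proxy") or {}
    username = env.get("BW_PROXY_USERNAME") or proxy_cfg.get("username")
    password = env.get("BW_PROXY_PASSWORD") or proxy_cfg.get("password")
    if not username or not password:
        raise ValueError("proxy credentials are missing from both .env and config.yaml")
    return username, password
```

Passing env explicitly makes the precedence easy to check without touching the real environment.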

Configure

Edit config.yaml:

queries: free-form substrings (tickers like NASDAQ: ADBE, topics like oil, cybersecurity, ...). Case-insensitive.
proxy.username, proxy.password: Geekproxy credentials.
proxy.country: ISO-2 country to pin (default us). Must match the Accept-Language claimed by the client (en-US).
proxy.sticky_port_range: pool of sticky-session ports. On Geekproxy each port in the range is a distinct sticky session (different IP). The scraper probes them and picks the first one that returns the businesswire homepage quickly. Raise the upper bound if your pool often serves slow IPs.
proxy.health_check_timeout: seconds to wait per port during probing.
source.mrss_url: the MRSS feed URL from businesswire's robots.txt. Stable as far as we know; update if businesswire reissues the feed key.
source.english_only: filter to English releases (/en/ in URL).
runtime.article_concurrency: parallel downloads in stage 2.
runtime.challenge_wait_ms: how long to let the Akamai sensor run on the homepage during warmup. Raise if you see "warmup did not yield Akamai cookies".
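The sticky-port selection described under proxy.sticky_port_range can be sketched like this. The probe callable is a stand-in for "fetch the homepage through this port with the health-check timeout"; it and the function name pick_sticky_port are hypothetical, not part of the scraper's API.

```python
import time


def pick_sticky_port(ports, probe, max_latency_s):
    """Return the first port whose probe succeeds within max_latency_s, else None.

    probe(port) should return True on a successful homepage fetch and may raise
    on timeouts or connection errors.
    """
    for port in ports:
        started = time.monotonic()
        try:
            ok = probe(port)
        except Exception:
            continue  # treat any probe error as "this session is unusable"
        if ok and time.monotonic() - started <= max_latency_s:
            return port
    return None
```

A None result means the whole pool was slow or dead, which is the case where raising the upper bound of the range helps.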

Run

Live pass:

./venv/bin/python3 scraper.py

Backfill from the S3 daily sitemap (the URL slug is used for query matching, since the sitemap has no title or description):

./venv/bin/python3 scraper.py --backfill 2026-05-04

Log goes to logs/scraper.log and stdout. Live pass typically takes 30–60 s. Backfill scales with how many sitemap URLs match the query slugs.
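Slug matching in backfill mode could look roughly like the sketch below. It assumes both the slug and the query are normalized by lowercasing and collapsing every run of non-alphanumeric ASCII characters into a single space; the actual normalization in scraper.py may differ, and the function names are hypothetical.

```python
import re
from urllib.parse import unquote, urlparse

_NON_ALNUM = re.compile(r"[^a-z0-9]+")


def _normalize(text):
    """Lowercase and turn every run of non-alphanumerics into one space."""
    return _NON_ALNUM.sub(" ", text.casefold()).strip()


def slug_matches(query, url):
    """True when the normalized query is a substring of the URL's last path segment."""
    slug = unquote(urlparse(url).path.rstrip("/").rsplit("/", 1)[-1])
    needle = _normalize(query)
    return bool(needle) and needle in _normalize(slug)
```

A topic query such as "artificial intelligence" matches a slug containing Artificial-Intelligence, while a ticker like "NASDAQ: ADBE" matches only if it happens to appear in the headline.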

Schedule every 15 min via cron

*/15 * * * * cd /root/businesswire-scraper && ./venv/bin/python3 scraper.py >> logs/cron.log 2>&1

Multiple passes do not collide: dedup is in SQLite, file paths are deterministic, and an in-flight Patchright run is its own process so overlap is safe.

What goes into `articles/{date}/{release_id}.json`

{
  "release_id": "20260514366956",
  "url": "https://www.businesswire.com/news/home/20260514366956/en/...",
  "title": "BearingPoint publishes its Annual Report ...",
  "description": "BearingPoint, the €1+ billion management and ...",
  "published": "Fri, 15 May 2026 06:00:00 UT",
  "matched_queries": ["artificial intelligence"],
  "first_seen_at": "2026-05-15T14:18:54Z"
}

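matched_queries lists every configured query that matched this release on the pass that first saved it; matches found on later passes are recorded only in the query_hits table. For backfill items, an extra "source": "backfill YYYY-MM-DD" field is added.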

Tuning notes

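MRSS window. The feed holds the ~50 latest releases (~5–10 hours of output), so 15-minute polling should not miss anything under normal publishing volume. If you want a bigger safety margin, set up a nightly cron with --backfill $(date -u -d yesterday +%F). In a crontab line the percent sign must be escaped as \%, because cron treats an unescaped % as a newline. Note that date -d yesterday is GNU date syntax.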
Substring vs. ranked match. This is a naive substring filter. A query like "oil" will match anything mentioning oil; a query like "NASDAQ: ADBE" is very precise because tickers usually appear verbatim in title or description. Choose specificity accordingly.
Residential rotation. Live MRSS fetch uses the rotating port (823) because feed.businesswire.com is tolerant of fresh IPs and residential pools occasionally hand out a "cold" IP that times out. Warmup and bulk download use a sticky port from the pool: ports 10000..10010 are distinct sticky sessions on Geekproxy; the scraper probes them, picks one that responds quickly, and uses that single IP for the whole flow. If many downloads start failing mid-pass it re-probes the pool and picks a fresh sticky.
_abck lifetime. When many downloads start failing, the script performs a single re-warmup and retries failed items. If problems persist, raise runtime.max_retries or shorten the cron interval.
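The "single re-warmup, then retry the failures" behaviour can be sketched as below. The fetch and warmup callables, the failure_ratio threshold and the sequential loop are all assumptions made to keep the example small; the real script downloads in parallel and may trigger the re-warmup on different criteria.

```python
def download_with_rewarmup(urls, fetch, warmup, failure_ratio=0.5):
    """Download urls, re-warming once if too many fail.

    warmup() returns a cookie set; fetch(url, cookies) returns the page or raises.
    Returns (results, failed) where failed lists URLs that never succeeded.
    """
    cookies = warmup()
    results, failed = {}, []
    for url in urls:
        try:
            results[url] = fetch(url, cookies)
        except Exception:
            failed.append(url)

    # One re-warmup at most: fresh cookies, then a single retry of the failures.
    if urls and len(failed) / len(urls) >= failure_ratio:
        cookies = warmup()
        still_failed = []
        for url in failed:
            try:
                results[url] = fetch(url, cookies)
            except Exception:
                still_failed.append(url)
        failed = still_failed
    return results, failed
```

Anything still in failed after the retry is left for the next pass, since it is not yet in SQLite and will be treated as new again.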

Known limits

Matching is naive substring only. Ranked relevance would require integrating Algolia (which businesswire uses internally for site content, not for press releases) or an external search service.
Backfill uses the sitemap URL slug for matching, since the sitemap has no title/description. The slug usually contains the title in hyphenated form, so most queries match correctly, but a query that only appears in the body (e.g. a ticker hidden inside the release text) will not be caught at the sitemap stage.
We do not currently track query coverage history in the JSON. If a release matches a new query on a later pass, it is not re-tagged: the JSON keeps only the queries from the first match. If you need the full history, query the query_hits table.
