Periodic monitor of businesswire.com press releases for a configured set of search queries, with full HTML and metadata saved locally.
The /newsroom page on businesswire sits behind Akamai Bot Manager with
strict sensor_data validation, so direct headless scraping of the search
UI does not work. The search modal itself is only a thin client-side
filter over the ~10 most recent releases the page happens to embed.
Instead we use two officially published feeds (both listed in their own
robots.txt):
- MRSS feed at feed.businesswire.com/mrss/home/?rss=... with the most
  recent ~50 releases (title, description, link and pubDate). This is
  the live source, polled every 15 minutes. The subdomain is also
  Akamai-protected, but curl_cffi with impersonate=chrome131 over a US
  residential proxy passes it cleanly.
- S3-hosted daily sitemap at bw-prod-sitemap.s3.us-east-1.amazonaws.com/...
  with every release of a given day. Public bucket, no Akamai. Used in
  backfill mode if the live MRSS window missed something.
Query matching is done locally: a release is kept when at least one configured query is a case-insensitive substring of its title or description. The MRSS window is ~5–10 hours, so a 15-minute pass never loses anything.
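The matching rule above can be sketched in a few lines. This is a minimal illustration, not the code from scraper.py; the function name matched_queries is hypothetical.

```python
def matched_queries(title: str, description: str, queries: list[str]) -> list[str]:
    """Return every configured query that occurs in the title or description.

    Matching is a plain case-insensitive substring test, as described above.
    Empty or whitespace-only queries are skipped so they cannot match everything.
    """
    haystacks = (title.casefold(), description.casefold())
    hits = []
    for query in queries:
        needle = query.casefold()
        if needle.strip() and any(needle in text for text in haystacks):
            hits.append(query)
    return hits
```

A release is kept when the returned list is non-empty; the list itself is what would end up in matched_queries in the saved metadata.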
Release pages themselves are heavily protected. They're downloaded with the two-stage Akamai bypass from our 2026 guide:
- Stage 1. Patchright (Chromium with binary stealth patches) opens
  the homepage over a sticky-US residential proxy, simulates real
  mouse and scroll motion, then visits one of the new release pages.
  Akamai's sensor fires under a real browser and we collect a valid
  _abck cookie set.
- Stage 2. Cookies are transferred into a curl_cffi session with
  impersonate=chrome131 over the same sticky proxy, and every new
  release page is downloaded in parallel. Browser engine and TLS
  impersonation target match (Chromium ↔ chrome131), which is required
  to avoid Akamai's "browser version mismatch" anomaly check.
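The hand-off between the two stages boils down to picking the Akamai cookies out of the browser context. The sketch below assumes the cookies arrive in the Playwright-style shape (a list of dicts with "name" and "value" keys), which Patchright mirrors; the helper name akamai_cookies is hypothetical and the real script may carry over more cookies than these four.

```python
# Cookie names taken from the flow description in this README.
AKAMAI_COOKIE_NAMES = ("_abck", "ak_bmsc", "bm_sz", "bm_sv")


def akamai_cookies(browser_cookies: list[dict]) -> dict[str, str]:
    """Reduce browser cookies to a name -> value dict of Akamai cookies.

    Raises if _abck is missing, because without it the warmup did not
    produce a usable session.
    """
    jar = {
        cookie["name"]: cookie["value"]
        for cookie in browser_cookies
        if cookie["name"] in AKAMAI_COOKIE_NAMES
    }
    if "_abck" not in jar:
        raise RuntimeError("warmup did not yield Akamai cookies")
    return jar
```

The resulting dict would then be handed to the curl_cffi session that uses impersonate=chrome131 and the same sticky proxy as the browser.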
config.yaml ──> fetch_mrss() — curl_cffi+chrome131 over rotating residential
│
▼
feedparser ─> [title, description, link, pubDate] x ~50
│
▼
for each item: match query in title+description, extract
release_id from /news/home/{id}/...; drop dupes via SQLite
│
▼
any new ? ──no──> log "nothing new" and exit
│
▼
Patchright warmup (sticky-US, homepage + 1 release page)
│
▼
collect cookies (_abck, ak_bmsc, bm_sz, bm_sv)
│
▼
curl_cffi(impersonate=chrome131) + cookies + sticky proxy
│
▼
parallel download of every new release page
│
▼
save HTML + JSON metadata, insert into SQLite
Typical pass: 30–60 s for 0–10 new releases.
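The release_id step in the flow above can be illustrated with a small regex helper. This is a sketch that assumes the ID is the numeric path segment right after /news/home/, as the sample URL in this README suggests; the names are hypothetical.

```python
import re
from typing import Optional

# Assumed URL shape: /news/home/{release_id}/...
RELEASE_ID_RE = re.compile(r"/news/home/(\d+)/")


def extract_release_id(url: str) -> Optional[str]:
    """Return the numeric release ID from a release URL, or None if absent."""
    match = RELEASE_ID_RE.search(url)
    return match.group(1) if match else None
```

Items whose ID is already present in SQLite would be dropped before the Patchright warmup is started.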
businesswire-scraper/
├── config.yaml queries + proxy + concurrency + feed URLs
├── scraper.py single-pass entry point
├── requirements.txt
├── articles/ output: articles/YYYY/MM/DD/{release_id}.{html,json}
├── logs/scraper.log
├── db.sqlite dedup + index
├── venv/ Python virtualenv
└── probe_*.py reconnaissance scripts kept for reference
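The deterministic output paths in the layout above could be produced along these lines. A minimal sketch: which date fills YYYY/MM/DD (publication date or first-seen date) is not stated here, so the caller passes it in explicitly, and both function names are hypothetical.

```python
import json
from datetime import date
from pathlib import Path


def release_paths(root: Path, day: date, release_id: str) -> tuple[Path, Path]:
    """Return the (html, json) paths for one release under root/YYYY/MM/DD/."""
    folder = root / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}"
    return folder / f"{release_id}.html", folder / f"{release_id}.json"


def save_release(root: Path, day: date, release_id: str, html: bytes, meta: dict) -> int:
    """Write the HTML and its JSON metadata; return the HTML size in bytes."""
    html_path, json_path = release_paths(root, day, release_id)
    html_path.parent.mkdir(parents=True, exist_ok=True)
    html_path.write_bytes(html)
    json_path.write_text(json.dumps(meta, ensure_ascii=False, indent=2), encoding="utf-8")
    return len(html)
```

Because the path depends only on the date and the release ID, writing the same release twice overwrites the same pair of files instead of creating duplicates.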
articles(release_id PK, url, title, published_at, first_seen_at, html_path, size_bytes)
query_hits(query, release_id, seen_at, PRIMARY KEY(query, release_id))
query_hits records every (query, release_id) pair — useful when a
release matches multiple queries.
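For reference, the two tables could be created and used for dedup roughly as follows. The column names come from the schema above; the column types and the helper names open_db and record_hit are assumptions.

```python
import sqlite3

# Column names follow the README; the SQL types are assumed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    release_id    TEXT PRIMARY KEY,
    url           TEXT NOT NULL,
    title         TEXT,
    published_at  TEXT,
    first_seen_at TEXT,
    html_path     TEXT,
    size_bytes    INTEGER
);
CREATE TABLE IF NOT EXISTS query_hits (
    query      TEXT NOT NULL,
    release_id TEXT NOT NULL,
    seen_at    TEXT,
    PRIMARY KEY (query, release_id)
);
"""


def open_db(path: str) -> sqlite3.Connection:
    """Open the database and create both tables if they do not exist yet."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_hit(conn: sqlite3.Connection, query: str, release_id: str, seen_at: str) -> bool:
    """Store a (query, release_id) pair; return True only if it was new."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO query_hits (query, release_id, seen_at) VALUES (?, ?, ?)",
        (query, release_id, seen_at),
    )
    conn.commit()
    return cur.rowcount == 1
```

The composite primary key makes repeated passes idempotent: the same pair seen again 15 minutes later is silently ignored.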
git clone https://github.com/geekproxy/businesswire-scraper.git
cd businesswire-scraper
python3 -m venv venv
./venv/bin/pip install -r requirements.txt
./venv/bin/patchright install chromium
cp config.yaml.example config.yaml
cp .env.example .env
$EDITOR .env # put your residential proxy credentials here
Patchright bundles its own Chromium and matching stealth patches.
Credentials can live either in config.yaml under proxy.username /
proxy.password, or in .env as BW_PROXY_USERNAME / BW_PROXY_PASSWORD.
.env wins if both are set. .env is gitignored.
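The precedence rule can be expressed as a small resolver. This sketch assumes the .env values have already been loaded into a mapping (for example the process environment) and that precedence is applied per field; the function name is hypothetical.

```python
from typing import Mapping, Optional


def resolve_proxy_credentials(
    config: dict, env: Mapping[str, str]
) -> tuple[Optional[str], Optional[str]]:
    """Return (username, password), preferring .env values over config.yaml."""
    proxy = config.get("proxy") or {}
    username = env.get("BW_PROXY_USERNAME") or proxy.get("username")
    password = env.get("BW_PROXY_PASSWORD") or proxy.get("password")
    return username, password
```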
Edit config.yaml:
- queries: free-form substrings (tickers like NASDAQ: ADBE, topics like
  oil, cybersecurity, ...). Case-insensitive.
- proxy.username, proxy.password: Geekproxy credentials.
- proxy.country: ISO-2 country to pin (default us). Must match the
  Accept-Language claimed by the client (en-US).
- proxy.sticky_port_range: pool of sticky-session ports. On Geekproxy
  each port in the range is a distinct sticky session (different IP).
  The scraper probes them and picks the first one that returns the
  businesswire homepage quickly. Raise the upper bound if your pool
  often serves slow IPs.
- proxy.health_check_timeout: seconds to wait per port during probing.
- source.mrss_url: the MRSS feed URL from businesswire's robots.txt.
  Stable as far as we know; update if businesswire reissues the feed key.
- source.english_only: filter to English releases (/en/ in URL).
- runtime.article_concurrency: parallel downloads in stage 2.
- runtime.challenge_wait_ms: how long to let the Akamai sensor run on
  the homepage during warmup. Raise if you see "warmup did not yield
  Akamai cookies".
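The interplay of proxy.sticky_port_range and proxy.health_check_timeout can be sketched as a first-healthy-port search. This is an illustration under assumptions: probe is a hypothetical callable that fetches the businesswire homepage through the given port and returns True on success, and the real script may rank ports differently.

```python
import time
from typing import Callable, Iterable, Optional


def pick_sticky_port(
    ports: Iterable[int],
    probe: Callable[[int, float], bool],
    timeout: float,
) -> Optional[int]:
    """Return the first port whose probe succeeds within the timeout, else None."""
    for port in ports:
        started = time.monotonic()
        try:
            ok = probe(port, timeout)
        except Exception:
            # A cold or dead IP counts as a failed probe; move on to the next port.
            ok = False
        if ok and time.monotonic() - started <= timeout:
            return port
    return None
```

With a range of 10000..10010 this would be called as pick_sticky_port(range(10000, 10011), probe, timeout), and the chosen port is then reused for both warmup and bulk download.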
Live pass:
./venv/bin/python3 scraper.py
Backfill from S3 daily sitemap (URL slug used for query matching, since sitemap has no title/description):
./venv/bin/python3 scraper.py --backfill 2026-05-04
Log goes to logs/scraper.log and stdout. Live pass typically takes
30–60 s. Backfill scales with how many sitemap URLs match the query
slugs.
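How a query is compared against a URL slug is not spelled out here, so the following is only one plausible sketch: both the query and the last path segment are reduced to lowercase hyphenated form and compared as substrings. All function names and the sample URL in the usage note are hypothetical.

```python
import re
from urllib.parse import unquote, urlsplit


def slugify(text: str) -> str:
    """Lowercase the text and collapse every non-alphanumeric run into one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def url_slug(url: str) -> str:
    """Return the last path segment of a URL in normalized slug form."""
    path = unquote(urlsplit(url).path)
    return slugify(path.rstrip("/").rsplit("/", 1)[-1])


def slug_matches(url: str, queries: list[str]) -> list[str]:
    """Return the queries whose slug form occurs inside the URL slug."""
    slug = url_slug(url)
    return [q for q in queries if slugify(q) and slugify(q) in slug]
```

For a URL ending in /en/Acme-Oil-Announces-Cybersecurity-Partnership, the queries oil and Cybersecurity Partnership would match while artificial intelligence would not. Like the live filter, this is a naive substring test, so oil also matches a slug containing boiler.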
*/15 * * * * cd /root/businesswire-scraper && ./venv/bin/python3 scraper.py >> logs/cron.log 2>&1
Multiple passes do not collide: dedup is in SQLite, file paths are deterministic, and an in-flight Patchright run is its own process so overlap is safe.
{
"release_id": "20260514366956",
"url": "https://www.businesswire.com/news/home/20260514366956/en/...",
"title": "BearingPoint publishes its Annual Report ...",
"description": "BearingPoint, the €1+ billion management and ...",
"published": "Fri, 15 May 2026 06:00:00 UT",
"matched_queries": ["artificial intelligence"],
"first_seen_at": "2026-05-15T14:18:54Z"
}
matched_queries shows every configured query that matched this release.
For backfill items, an extra "source": "backfill YYYY-MM-DD" field is
added.
- MRSS window. The feed holds ~50 latest releases (~5–10 hours of
  output). 15-minute polling never misses anything. If you want a
  bigger safety margin, set up a nightly cron with
  --backfill $(date -u -d yesterday +%F).
- Substring vs. ranked match. This is a naive substring filter. A
  query like "oil" will match anything mentioning oil; a query like
  "NASDAQ: ADBE" is very precise because tickers usually appear
  verbatim in title or description. Choose specificity accordingly.
- Residential rotation. Live MRSS fetch uses the rotating port (823)
  because feed.businesswire.com is tolerant of fresh IPs and
  residential pools occasionally hand out a "cold" IP that times out.
  Warmup and bulk download use a sticky port from the pool: ports
  10000..10010 are distinct sticky sessions on Geekproxy; the scraper
  probes them, picks one that responds quickly, and uses that single
  IP for the whole flow. If many downloads start failing mid-pass it
  re-probes the pool and picks a fresh sticky.
- _abck lifetime. When many downloads start failing, the script
  performs a single re-warmup and retries failed items. If problems
  persist, raise runtime.max_retries or shorten the cron interval.
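The single re-warmup behaviour described in the notes above might look like this in outline. It is a sequential sketch for readability (the real stage 2 downloads in parallel); fetch and rewarm are hypothetical callables, and the failure_ratio threshold is an assumed stand-in for "many downloads start failing".

```python
from typing import Callable, Iterable


def download_with_rewarmup(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    rewarm: Callable[[], None],
    failure_ratio: float = 0.5,
) -> tuple[dict[str, str], list[str]]:
    """Fetch every URL; if too many fail, re-warm once and retry the failures."""
    urls = list(urls)
    pages: dict[str, str] = {}
    failed: list[str] = []
    for url in urls:
        try:
            pages[url] = fetch(url)
        except Exception:
            failed.append(url)
    if urls and len(failed) / len(urls) >= failure_ratio:
        rewarm()  # one fresh browser warmup, then a single retry round
        still_failed = []
        for url in failed:
            try:
                pages[url] = fetch(url)
            except Exception:
                still_failed.append(url)
        failed = still_failed
    return pages, failed
```

Anything still listed in the second return value after the retry round is left for the next cron pass.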
- Matching is a naive substring test. Ranked relevance would require integrating Algolia (used internally for site content, not for press releases) or an external search.
- Backfill uses the sitemap URL slug for matching, since the sitemap has no title/description. The slug usually contains the title in hyphenated form, so most queries match correctly, but a query that only appears in the body (e.g. a ticker hidden inside the release text) will not be caught at the sitemap stage.
- We do not currently track query coverage history. If a release
  matches a new query on a later pass it is not re-tagged; only the
  first match is kept in the JSON. If you need the full history, query
  the query_hits table.