Find the best price for any product across 7 retailers, in real time.
Live demo: https://dealcast-realtime-scraper.vercel.app
Type any product name. Dealcast simultaneously searches Amazon, Best Buy, Walmart, Newegg, KSP, Bug, and Ivory. For each retailer it tries up to four progressively smarter scraping strategies, streaming every attempt to your browser as it happens. When a tier succeeds you get the product title, price, rating, and review count — plus direct links to all matching candidates at that retailer.
Hebrew queries work natively. English queries are auto-translated for Israeli retailers. If a Hebrew search fails, it falls back to an English retry automatically.
- Real-time waterfall UI: a 4-slot grid per retailer shows each tier firing, its status (live / ok / fail), and elapsed time as Server-Sent Events stream in
- Global best picks: after all sites finish, scans every cached candidate across all retailers to surface the globally cheapest item (with rating ≥ 1.0) and the highest weighted-score item (rating × log₁₀(reviews + 10))
- Top-10 drill-down: click any successful retailer row to expand a ranked list of up to 10 matching candidates with prices and direct links
- Search history: all past queries are persisted in `localStorage`; clicking a chip instantly replays cached results without re-scraping
- 7 retailers: 4 US (Amazon, Best Buy, Walmart, Newegg) + 3 Israeli (KSP, Bug, Ivory), each with its own currency (USD / ₪)
- RTL support: all title text uses `dir="auto"` so Hebrew and Arabic product names render correctly
- CLI mode: run the pipeline headlessly without the frontend
flowchart TB
User([👤 User])
subgraph Edge["☁️ Vercel · Frontend"]
direction TB
FE["Next.js 16 App<br/>page.tsx state machine"]
FEApi["lib/api.ts<br/>EventSource + fetch"]
FE --> FEApi
end
subgraph ACA["☁️ Azure Container Apps · Backend (northeurope, scale-to-zero)"]
direction TB
API["FastAPI<br/>/api/scrape · /api/retailer-top<br/>/api/best-picks · /api/health"]
Pipe["pipeline.run_all<br/>asyncio.gather × 7 retailers"]
Tiers{{"_run_tiers<br/>Basic → Browser → LLM → Firecrawl"}}
Cache[("In-memory<br/>candidate cache<br/>15-min TTL")]
Trans["translation.py<br/>HE ↔ EN auto-detect<br/>+ English fallback"]
Browser["BrowserHelper<br/>persistent Chromium profile"]
API --> Pipe
Pipe --> Tiers
Pipe <--> Cache
Pipe --> Trans
Tiers --> Browser
end
subgraph Sites["🌐 7 Retailers (parallel scraping)"]
direction LR
AM["Amazon.com<br/>USD"]
BB["BestBuy.com<br/>USD"]
WM["Walmart.com<br/>USD"]
NE["Newegg.com<br/>USD"]
KS["KSP.co.il<br/>₪ · JSON API"]
BG["Bug.co.il<br/>₪"]
IV["Ivory.co.il<br/>₪"]
end
subgraph Ext["🤖 External AI services"]
direction TB
Gem["Gemini 2.5 Flash<br/>LLM extraction + translation"]
FC["Firecrawl Cloud<br/>US-region rendering"]
end
subgraph CI["⚙️ CI/CD pipeline"]
direction LR
GH["GitHub Actions<br/>deploy-backend.yml"]
ACR[("Azure Container Registry<br/>dealcast.azurecr.io")]
end
User -- HTTPS --> FE
FEApi == "SSE stream<br/>tier events" ==> API
Trans --> Gem
Tiers -. tier 3 .-> Gem
Tiers -. tier 4 .-> FC
Tiers --> AM & BB & WM & NE & KS & BG & IV
GH == "build &<br/>push" ==> ACR
ACR -. "pull image<br/>on revision" .-> API
classDef azure fill:#0078d4,stroke:#005a9e,color:#fff
classDef vercel fill:#000,stroke:#fff,color:#fff
classDef ext fill:#fbbc04,stroke:#b07a00,color:#000
classDef ci fill:#1b1f23,stroke:#586069,color:#fff
classDef sites fill:#34d399,stroke:#059669,color:#000
class ACA azure
class Edge vercel
class Ext ext
class CI ci
class Sites sites
Concurrency model: all 7 retailers scrape in parallel via asyncio.gather. Inside each retailer, tiers run sequentially — each tier only fires if every previous tier failed. The first successful tier short-circuits and emits a done event for that retailer.
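The concurrency model above can be sketched in a few lines. This is a minimal illustration, not the project's code: the function names echo `pipeline.run_all` and `_run_tiers`, but the signatures, the return shapes, and the stub tiers (`failing`, `succeeding`) are assumptions made for the example.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical tier signature: takes (retailer, query) and returns a result dict or raises.
Tier = Callable[[str, str], Awaitable[dict]]


async def run_tiers(retailer: str, query: str, tiers: list[tuple[str, Tier]]) -> dict:
    """Try tiers in order; the first success short-circuits the remaining tiers."""
    attempts = []
    for name, tier in tiers:
        try:
            result = await tier(retailer, query)
        except NotImplementedError:
            attempts.append((name, "skipped"))  # tier not supported for this retailer
            continue
        except Exception as exc:
            attempts.append((name, f"fail: {exc}"))
            continue
        attempts.append((name, "ok"))
        return {"retailer": retailer, "tier": name, "result": result, "attempts": attempts}
    return {"retailer": retailer, "tier": None, "result": None, "attempts": attempts}


async def run_all(query: str, plan: dict[str, list[tuple[str, Tier]]]) -> list[dict]:
    """Fan out across retailers in parallel; each retailer walks its tiers sequentially."""
    return await asyncio.gather(*(run_tiers(r, query, t) for r, t in plan.items()))


# Stub tiers standing in for real scrapers.
async def failing(retailer: str, query: str) -> dict:
    raise RuntimeError("blocked")


async def succeeding(retailer: str, query: str) -> dict:
    return {"title": query, "price": 1.0}


demo_plan = {
    "newegg": [("basic", failing), ("browser", succeeding), ("llm", succeeding)],
    "ksp": [("basic", succeeding)],
}
outcomes = asyncio.run(run_all("Sony WH-1000XM5", demo_plan))
```

In this sketch the `llm` tier for `newegg` never runs, because `browser` succeeded first. `asyncio.gather` returns results in the order the coroutines were passed, regardless of which retailer finishes first.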
Data plane: the backend streams Server-Sent Events as each tier starts, succeeds, or fails — the frontend's waterfall UI animates live. After every site reports done, the frontend issues one extra REST call to /api/best-picks which scans the in-memory candidate cache across all 7 retailers to compute the globally cheapest item and the highest weighted-score item.
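The two best-pick rules can be illustrated as follows. The `Candidate` shape and the `best_picks` function are hypothetical, and the sketch compares raw price numbers without any USD / ₪ conversion, which is an assumption of the example rather than a statement about the backend.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:  # hypothetical shape of a cached candidate
    retailer: str
    title: str
    price: float
    rating: Optional[float]
    reviews: int


def weighted_score(c: Candidate) -> float:
    """rating x log10(reviews + 10); an item with zero reviews scores exactly its rating."""
    return (c.rating or 0.0) * math.log10(c.reviews + 10)


def best_picks(candidates: list[Candidate]) -> dict:
    # Only items with a rating of at least 1.0 qualify for the "cheapest" pick.
    rated = [c for c in candidates if c.rating is not None and c.rating >= 1.0]
    cheapest = min(rated, key=lambda c: c.price, default=None)
    top = max(candidates, key=weighted_score, default=None)
    return {"cheapest": cheapest, "top_rated": top}


demo = [
    Candidate("amazon", "A", 329.99, 4.6, 12000),
    Candidate("newegg", "B", 299.00, 4.8, 90),
    Candidate("walmart", "C", 250.00, None, 0),  # unrated, so excluded from "cheapest"
]
picks = best_picks(demo)
```

The `+ 10` inside the logarithm keeps the score positive for items with few reviews, while the logarithm stops a very large review count from swamping the rating.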
| Tier | Library | Approach | Typical use |
|---|---|---|---|
| 1 — Basic | Scrapling | Plain HTTP with realistic headers | Israeli retailers (server-rendered PHP/HTML) |
| 2 — Browser | CloakBrowser | Stealth Playwright with persistent profile + humanised input | US retailers behind anti-bot |
| 3 — LLM | Scrapegraph-ai + Gemini 2.5 Flash | AI-driven extraction from browser-rendered HTML | Unusual/changing layouts |
| 4 — Firecrawl | firecrawl-py | US-based cloud rendering via Firecrawl API | Geo-blocked content (Amazon from non-US IPs) |
KSP exception: KSP.co.il exposes a JSON search API. The Basic tier hits it directly; Browser/LLM/Firecrawl tiers are skipped via NotImplementedError.
Israeli retailer fallback: if a Hebrew-translated search returns no results, the pipeline retries with the English query before emitting a final done event. The waterfall accumulates segments from both attempts.
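A rough sketch of the fallback described above. `search_with_fallback`, the `search` callable, and the segment dictionaries are hypothetical names, and detecting Hebrew by Unicode block is an assumption; the real `detect_language` in `translation.py` may work differently.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical search callable: takes a query string and returns a list of hits.
SearchFn = Callable[[str], Awaitable[list[dict]]]


def is_hebrew(text: str) -> bool:
    """True if any character falls in the Unicode Hebrew block (U+0590 to U+05FF)."""
    return any("\u0590" <= ch <= "\u05ff" for ch in text)


async def search_with_fallback(
    search: SearchFn, hebrew_query: str, english_query: str
) -> tuple[list[dict], list[dict]]:
    """Search with the Hebrew query first; retry in English only if nothing came back."""
    segments = []
    results = await search(hebrew_query)
    segments.append({"query": hebrew_query, "hits": len(results)})
    if not results and english_query != hebrew_query:
        results = await search(english_query)
        segments.append({"query": english_query, "hits": len(results)})
    return results, segments


# Stub search: finds nothing for Hebrew queries and one item for anything else.
async def fake_search(query: str) -> list[dict]:
    return [] if is_hebrew(query) else [{"title": query, "price": 899.0}]


results, segments = asyncio.run(
    search_with_fallback(fake_search, "אוזניות סוני", "Sony headphones")
)
```

Here `segments` records both attempts, mirroring how the waterfall accumulates segments from the Hebrew and English passes before the final done event.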
| Retailer | Region | Currency | Scraping method |
|---|---|---|---|
| Amazon.com | US | USD | Browser → Firecrawl (geo-locked pricing) |
| BestBuy.com | US | USD | Browser → LLM → Firecrawl |
| Walmart.com | US | USD | Browser (with warmup + persistent cookies) |
| Newegg.com | US | USD | Basic → Browser → LLM → Firecrawl |
| KSP.co.il | IL | ILS | Basic (JSON API only) |
| Bug.co.il | IL | ILS | Basic → Browser → LLM → Firecrawl |
| Ivory.co.il | IL | ILS | Basic → Browser (card shortcut — SERP has price) |
- Python 3.12+
- Node.js 20+ with `pnpm`
- API keys: `GEMINI_API_KEY` (for tier 3 + translation), `FIRECRAWL_API_KEY` (for tier 4)
python -m venv .venv && source .venv/bin/activate
pip install -r backend/requirements.txt
playwright install chromium # downloads the browser binary
cp .env.example .env # fill in your API keys
uvicorn backend.api:app --reload --port 8000

cd frontend
pnpm install
pnpm dev # http://localhost:3000

The frontend connects EventSource directly to http://localhost:8000. Do not proxy SSE through Next.js rewrites: the dev proxy buffers the entire stream and the waterfall stays blank until the backend closes the connection.
source .venv/bin/activate
python -m backend "Sony WH-1000XM5"

| Variable | Required | Description |
|---|---|---|
| `GEMINI_API_KEY` | Yes | Gemini 2.5 Flash; powers tier 3 (LLM extraction) and Hebrew ↔ English translation |
| `FIRECRAWL_API_KEY` | Yes | Firecrawl; powers tier 4 cloud rendering |
| `SCRAPER_LLM_PROVIDER` | No | Default `gemini`. Passed to Scrapegraph-ai as `google_genai/<model>` |
| `ALLOWED_ORIGINS` | No | Comma-separated CORS origins. Default `http://localhost:3000` |
| `NEXT_PUBLIC_BACKEND_URL` | No | Frontend only. Default `http://localhost:8000` |
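For illustration, the backend variables in the table could be read at startup roughly like this. The `Settings` class and `load_settings` function are hypothetical; only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass
class Settings:  # hypothetical container for the backend variables
    gemini_api_key: str
    firecrawl_api_key: str
    llm_provider: str
    allowed_origins: list[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Fail fast on missing required keys; apply the documented defaults otherwise."""
    missing = [k for k in ("GEMINI_API_KEY", "FIRECRAWL_API_KEY") if not env.get(k)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
    origins = env.get("ALLOWED_ORIGINS", "http://localhost:3000")
    return Settings(
        gemini_api_key=env["GEMINI_API_KEY"],
        firecrawl_api_key=env["FIRECRAWL_API_KEY"],
        llm_provider=env.get("SCRAPER_LLM_PROVIDER", "gemini"),
        # Split the comma-separated list and drop surrounding whitespace.
        allowed_origins=[o.strip() for o in origins.split(",") if o.strip()],
    )


demo_settings = load_settings({
    "GEMINI_API_KEY": "g-key",
    "FIRECRAWL_API_KEY": "f-key",
    "ALLOWED_ORIGINS": "https://a.example, https://b.example",
})
```

Passing the environment as a mapping keeps the function easy to exercise in tests without touching the real process environment.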
Real-Time-Scraper/
├── backend/
│ ├── api.py # FastAPI: /api/scrape (SSE), /api/retailer-top, /api/best-picks, /api/health
│ ├── pipeline.py # run_all / run_site / _run_tiers — orchestration + candidate cache
│ ├── models.py # Pydantic: ProductResult, ScrapeOutcome, CardCandidate
│ ├── matching.py # Token-weighted similarity with Hebrew tokenisation support
│ ├── translation.py # detect_language + translate_query via Gemini (cached in-process)
│ ├── scrapers/ # HOW — one file per tier (basic, browser, llm, firecrawl, _browser)
│ └── sites/ # WHAT — one file per retailer (URL builders + HTML/JSON parsers)
│
├── frontend/
│ ├── app/ # Next.js App Router (page.tsx state machine, layout, globals.css)
│ ├── components/ # HeroIdle, ResultsHeader, ResultsView, Waterfall, ResultRow, …
│ └── lib/ # api.ts (EventSource + REST), history.ts, picks.ts, types.ts
│
├── Dockerfile # python:3.12-slim + playwright install --with-deps chromium
├── .github/workflows/ # deploy-backend.yml — builds + pushes to ACR on push to main
└── DEPLOY.md # Full Azure Container Apps + Vercel deployment guide
Scrapers vs Sites separation: sites/X.py is the what (CSS selectors, URL templates, and parse logic: pure HTML string in, structured fields out). scrapers/X.py is the how (which HTTP client, anti-bot strategy, and headers to use). Adding a new retailer means adding one file in sites/; adding a new scraping tier means adding one file in scrapers/.
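The split between what and how might look like this in miniature. Everything here is hypothetical: the `shop.example` URL, the `class="price"` markup, and the function names are invented for the example and are not taken from the repository.

```python
import re
from typing import Callable, Optional
from urllib.parse import quote_plus


# --- sites/example_shop.py: the "what" (pure functions, no network access) ---
def build_search_url(query: str) -> str:
    return f"https://shop.example/search?q={quote_plus(query)}"


def parse_first_price(html: str) -> Optional[float]:
    """Extract the first price found in an element with class="price"."""
    m = re.search(r'class="price"[^>]*>\s*\$?([\d,]+(?:\.\d+)?)', html)
    return float(m.group(1).replace(",", "")) if m else None


# --- scrapers/basic.py: the "how" (owns transport, knows nothing about markup) ---
def scrape(query: str, fetch: Callable[[str], str]) -> Optional[float]:
    """Fetch the search page with the given transport, then hand the HTML to the site parser."""
    return parse_first_price(fetch(build_search_url(query)))


# A stub transport replaces the real HTTP client, so the parser is testable offline.
def fake_fetch(url: str) -> str:
    return '<div><span class="price">$1,299.50</span></div>'


price = scrape("Sony WH-1000XM5", fake_fetch)
```

Because the site functions take strings and return plain values, swapping the transport (plain HTTP, a browser, a cloud renderer) does not touch the parsing code.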
| Layer | Technology |
|---|---|
| Frontend framework | Next.js 16 (App Router), React 19, TypeScript |
| Styling | Tailwind v4, inline styles (design-spec fidelity) |
| Backend framework | FastAPI, Python 3.12, Uvicorn |
| Streaming | Server-Sent Events (sse-starlette) |
| Data validation | Pydantic v2 |
| Tier 1 scraping | Scrapling |
| Tier 2 browser | CloakBrowser (stealth Playwright) |
| Tier 3 LLM | Scrapegraph-ai + Gemini 2.5 Flash |
| Tier 4 cloud | Firecrawl |
| Translation | Gemini 2.5 Flash (via Google AI SDK) |
| Frontend hosting | Vercel (free tier) |
| Backend hosting | Azure Container Apps (North Europe, scale-to-zero) |
| Container registry | Azure Container Registry (Basic) |
| CI/CD | GitHub Actions |
The app is live. The backend image is built by GitHub Actions (no local Docker push needed) and deployed to Azure Container Apps. A full step-by-step guide with CLI commands is in DEPLOY.md.
To redeploy the backend after a code change:
# Push to main — GitHub Actions builds and pushes the new image to ACR automatically.
# Then roll the Container App to the new image:
az containerapp update --name dealcast-backend --resource-group dealcast-rg \
  --image dealcast.azurecr.io/dealcast-backend:latest

Frontend changes deploy automatically on push to main via Vercel's GitHub integration.