Gallind/Dealcast-realtime-scraper

Dealcast — Real-Time Price Scraper

Find the best price for any product across 7 retailers, in real time.

Live demo: https://dealcast-realtime-scraper.vercel.app


What it does

Type any product name. Dealcast simultaneously searches Amazon, Best Buy, Walmart, Newegg, KSP, Bug, and Ivory. For each retailer it tries up to four progressively smarter scraping strategies, streaming every attempt to your browser as it happens. When a tier succeeds you get the product title, price, rating, and review count — plus direct links to all matching candidates at that retailer.

Hebrew queries work natively. English queries are auto-translated to Hebrew for Israeli retailers; if the translated search returns no results, the pipeline automatically retries with the original English query.
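
As a rough illustration of the language routing described above, the sketch below detects Hebrew by Unicode range and decides whether a query needs translating for a given retailer. The function names and the "he"/"en" labels are assumptions made for this example; the project's own detection lives in translation.py and may work differently.

```python
def looks_hebrew(text: str) -> bool:
    """True if any character is in the Hebrew Unicode block (U+0590 to U+05FF)."""
    return any("\u0590" <= ch <= "\u05ff" for ch in text)


def needs_translation(query: str, site_language: str) -> bool:
    """True when the query language differs from the retailer's language.

    site_language is "he" for Israeli retailers and "en" for US ones.
    These labels are chosen for this sketch only.
    """
    query_language = "he" if looks_hebrew(query) else "en"
    return query_language != site_language
```

With this heuristic, an English query sent to an Israeli retailer is flagged for translation, while a Hebrew query is passed through unchanged.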


Features

  • Real-time waterfall UI — a 4-slot grid per retailer shows each tier firing, its status (live / ok / fail), and elapsed time as Server-Sent Events stream in
  • Global best picks — after all sites finish, scans every cached candidate across all retailers to surface the globally cheapest item (with rating ≥ 1.0) and the highest weighted-score item (rating × log₁₀(reviews + 10))
  • Top-10 drill-down — click any successful retailer row to expand a ranked list of up to 10 matching candidates with prices and direct links
  • Search history — all past queries are persisted in localStorage; clicking a chip instantly replays cached results without re-scraping
  • 7 retailers — 4 US (Amazon, Best Buy, Walmart, Newegg) + 3 Israeli (KSP, Bug, Ivory), each with its own currency (USD / ₪)
  • RTL support — all title text uses dir="auto" so Hebrew and Arabic product names render correctly
  • CLI mode — run the pipeline headlessly without the frontend
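
The two "global best picks" rules from the feature list can be sketched as follows. The Candidate class and the best_picks function are hypothetical stand-ins, not the project's models, and the sketch assumes prices are already comparable (it does no USD/₪ conversion).

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    retailer: str
    title: str
    price: float
    rating: Optional[float]
    reviews: Optional[int]


def weighted_score(rating: float, reviews: int) -> float:
    """rating × log10(reviews + 10), as given in the feature list."""
    return rating * math.log10(reviews + 10)


def best_picks(
    candidates: list[Candidate],
) -> tuple[Optional[Candidate], Optional[Candidate]]:
    """Return (cheapest item with rating >= 1.0, item with the highest weighted score)."""
    rated = [c for c in candidates if c.rating is not None]
    eligible = [c for c in rated if c.rating >= 1.0]
    cheapest = min(eligible, key=lambda c: c.price, default=None)
    top = max(
        rated,
        key=lambda c: weighted_score(c.rating, c.reviews or 0),
        default=None,
    )
    return cheapest, top
```

The + 10 inside the logarithm keeps the score positive for items with zero reviews: log₁₀(10) = 1, so an unreviewed item scores exactly its rating.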

Architecture

flowchart TB
    User([👤 User])

    subgraph Edge["☁️ Vercel · Frontend"]
        direction TB
        FE["Next.js 16 App<br/>page.tsx state machine"]
        FEApi["lib/api.ts<br/>EventSource + fetch"]
        FE --> FEApi
    end

    subgraph ACA["☁️ Azure Container Apps · Backend (northeurope, scale-to-zero)"]
        direction TB
        API["FastAPI<br/>/api/scrape · /api/retailer-top<br/>/api/best-picks · /api/health"]
        Pipe["pipeline.run_all<br/>asyncio.gather × 7 retailers"]
        Tiers{{"_run_tiers<br/>Basic → Browser → LLM → Firecrawl"}}
        Cache[("In-memory<br/>candidate cache<br/>15-min TTL")]
        Trans["translation.py<br/>HE ↔ EN auto-detect<br/>+ English fallback"]
        Browser["BrowserHelper<br/>persistent Chromium profile"]
        API --> Pipe
        Pipe --> Tiers
        Pipe <--> Cache
        Pipe --> Trans
        Tiers --> Browser
    end

    subgraph Sites["🌐 7 Retailers (parallel scraping)"]
        direction LR
        AM["Amazon.com<br/>USD"]
        BB["BestBuy.com<br/>USD"]
        WM["Walmart.com<br/>USD"]
        NE["Newegg.com<br/>USD"]
        KS["KSP.co.il<br/>₪ · JSON API"]
        BG["Bug.co.il<br/>₪"]
        IV["Ivory.co.il<br/>₪"]
    end

    subgraph Ext["🤖 External AI services"]
        direction TB
        Gem["Gemini 2.5 Flash<br/>LLM extraction + translation"]
        FC["Firecrawl Cloud<br/>US-region rendering"]
    end

    subgraph CI["⚙️ CI/CD pipeline"]
        direction LR
        GH["GitHub Actions<br/>deploy-backend.yml"]
        ACR[("Azure Container Registry<br/>dealcast.azurecr.io")]
    end

    User -- HTTPS --> FE
    FEApi == "SSE stream<br/>tier events" ==> API
    Trans --> Gem
    Tiers -. tier 3 .-> Gem
    Tiers -. tier 4 .-> FC
    Tiers --> AM & BB & WM & NE & KS & BG & IV

    GH == "build &<br/>push" ==> ACR
    ACR -. "pull image<br/>on revision" .-> API

    classDef azure fill:#0078d4,stroke:#005a9e,color:#fff
    classDef vercel fill:#000,stroke:#fff,color:#fff
    classDef ext fill:#fbbc04,stroke:#b07a00,color:#000
    classDef ci fill:#1b1f23,stroke:#586069,color:#fff
    classDef sites fill:#34d399,stroke:#059669,color:#000
    class ACA azure
    class Edge vercel
    class Ext ext
    class CI ci
    class Sites sites

Concurrency model: all 7 retailers scrape in parallel via asyncio.gather. Inside each retailer, tiers run sequentially — each tier only fires if every previous tier failed. The first successful tier short-circuits and emits a done event for that retailer.
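
A minimal sketch of this model, assuming each tier is an async callable that returns a result dict or None. try_tiers and scrape_all are hypothetical stand-ins for _run_tiers and run_all, and the event dictionaries are invented for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Optional

Tier = Callable[[str], Awaitable[Optional[dict]]]
Emit = Callable[[dict], None]


async def try_tiers(
    query: str, tiers: list[tuple[str, Tier]], emit: Emit
) -> Optional[dict]:
    """Run tiers in order; stop at the first one that returns a result."""
    for name, tier in tiers:
        emit({"tier": name, "status": "live"})
        try:
            result = await tier(query)
        except Exception:
            result = None  # any error counts as a failed tier in this sketch
        if result is not None:
            emit({"tier": name, "status": "ok"})
            return result  # short-circuit: later tiers never fire
        emit({"tier": name, "status": "fail"})
    return None


async def scrape_all(
    query: str, sites: dict[str, list[tuple[str, Tier]]], emit: Emit
) -> dict[str, Optional[dict]]:
    """Scrape every site concurrently; tiers inside a site stay sequential."""

    async def run_site(site: str, tiers: list[tuple[str, Tier]]):
        def site_emit(event: dict) -> None:
            emit({"site": site, **event})

        result = await try_tiers(query, tiers, site_emit)
        site_emit({"status": "done"})
        return site, result

    pairs = await asyncio.gather(*(run_site(s, t) for s, t in sites.items()))
    return dict(pairs)
```

Because asyncio.gather awaits all site coroutines together, a slow retailer does not delay events from the others; only the tiers within one retailer wait on each other.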

Data plane: the backend streams Server-Sent Events as each tier starts, succeeds, or fails — the frontend's waterfall UI animates live. After every site reports done, the frontend issues one extra REST call to /api/best-picks which scans the in-memory candidate cache across all 7 retailers to compute the globally cheapest item and the highest weighted-score item.
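
For readers unfamiliar with the wire format, the sketch below shows how a single Server-Sent Event is framed as text. The backend uses sse-starlette, which handles this framing itself; the function name, the event name "tier", and the payload fields here are assumptions for illustration only.

```python
import json


def format_sse(event: str, data: dict) -> str:
    """Frame one Server-Sent Event: an event line, a data line, then a blank line."""
    # ensure_ascii=False keeps Hebrew titles readable instead of \u escapes.
    payload = json.dumps(data, ensure_ascii=False)
    return f"event: {event}\ndata: {payload}\n\n"
```

The trailing blank line is what tells the browser's EventSource that the event is complete, which is why a buffering proxy that holds back output delays every update.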


Scraping pipeline

| Tier | Library | Approach | Typical use |
|---|---|---|---|
| 1 (Basic) | Scrapling | Plain HTTP with realistic headers | Israeli retailers (server-rendered PHP/HTML) |
| 2 (Browser) | CloakBrowser | Stealth Playwright with persistent profile + humanised input | US retailers behind anti-bot |
| 3 (LLM) | Scrapegraph-ai + Gemini 2.5 Flash | AI-driven extraction from browser-rendered HTML | Unusual/changing layouts |
| 4 (Firecrawl) | firecrawl-py | US-based cloud rendering via Firecrawl API | Geo-blocked content (Amazon from non-US IPs) |

KSP exception: KSP.co.il exposes a JSON search API. The Basic tier hits it directly; the Browser, LLM, and Firecrawl tiers raise NotImplementedError for KSP and are skipped.
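
One way to picture the skip mechanism is the sketch below, where a site that supports only the Basic tier raises NotImplementedError from the others and the tier loop records a skip. The class, the function, and the "skipped" label are hypothetical and synchronous for brevity; the real pipeline is async and may report skips differently.

```python
from typing import Optional

TIER_ORDER = ("basic", "browser", "llm", "firecrawl")


class JsonApiOnlySite:
    """Hypothetical site that only supports the Basic tier (like KSP)."""

    def __init__(self, api_results: dict[str, dict]):
        self._api_results = api_results  # stands in for the JSON search API

    def basic(self, query: str) -> Optional[dict]:
        return self._api_results.get(query)

    def browser(self, query: str) -> Optional[dict]:
        raise NotImplementedError

    def llm(self, query: str) -> Optional[dict]:
        raise NotImplementedError

    def firecrawl(self, query: str) -> Optional[dict]:
        raise NotImplementedError


def run_tiers(site, query: str) -> tuple[Optional[dict], list[tuple[str, str]]]:
    """Try each tier in order, treating NotImplementedError as a skip."""
    log: list[tuple[str, str]] = []
    for name in TIER_ORDER:
        try:
            result = getattr(site, name)(query)
        except NotImplementedError:
            log.append((name, "skipped"))
            continue
        if result is not None:
            log.append((name, "ok"))
            return result, log
        log.append((name, "fail"))
    return None, log
```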

Israeli retailer fallback: if a Hebrew-translated search returns no results, the pipeline retries with the English query before emitting a final done event. The waterfall accumulates segments from both attempts.
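
The retry can be sketched as a loop over two queries that keeps a record of both attempts. search_with_fallback, translate, and search are hypothetical names for this example; in the project, translation goes through Gemini and the search is the tiered scrape.

```python
from typing import Callable


def search_with_fallback(
    english_query: str,
    translate: Callable[[str], str],
    search: Callable[[str], list],
) -> tuple[list, list[tuple[str, int]]]:
    """Try the Hebrew translation first, then the original English query.

    Returns the first non-empty result list plus a log of every attempt,
    mirroring how the waterfall accumulates segments from both tries.
    """
    attempts: list[tuple[str, int]] = []
    for query in (translate(english_query), english_query):
        results = search(query)
        attempts.append((query, len(results)))
        if results:
            return results, attempts
    return [], attempts
```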


Supported retailers

| Retailer | Region | Currency | Scraping method |
|---|---|---|---|
| Amazon.com | US | USD | Browser → Firecrawl (geo-locked pricing) |
| BestBuy.com | US | USD | Browser → LLM → Firecrawl |
| Walmart.com | US | USD | Browser (with warmup + persistent cookies) |
| Newegg.com | US | USD | Basic → Browser → LLM → Firecrawl |
| KSP.co.il | IL | ILS | Basic (JSON API only) |
| Bug.co.il | IL | ILS | Basic → Browser → LLM → Firecrawl |
| Ivory.co.il | IL | ILS | Basic → Browser (card shortcut: SERP has price) |

Local setup

Prerequisites

  • Python 3.12+
  • Node.js 20+ with pnpm
  • API keys: GEMINI_API_KEY (for tier 3 + translation), FIRECRAWL_API_KEY (for tier 4)

1. Backend

python -m venv .venv && source .venv/bin/activate
pip install -r backend/requirements.txt
playwright install chromium   # downloads the browser binary
cp .env.example .env          # fill in your API keys
uvicorn backend.api:app --reload --port 8000

2. Frontend

cd frontend
pnpm install
pnpm dev   # http://localhost:3000

The frontend connects EventSource directly to http://localhost:8000. Do not proxy SSE through Next.js rewrites — the dev proxy buffers the entire stream and the waterfall stays blank until the backend closes the connection.

3. CLI (no frontend needed)

source .venv/bin/activate
python -m backend "Sony WH-1000XM5"

Environment variables

| Variable | Required | Description |
|---|---|---|
| GEMINI_API_KEY | Yes | Gemini 2.5 Flash. Powers tier 3 (LLM extraction) and Hebrew ↔ English translation |
| FIRECRAWL_API_KEY | Yes | Firecrawl. Powers tier 4 cloud rendering |
| SCRAPER_LLM_PROVIDER | No | Default gemini. Passed to Scrapegraph-ai as google_genai/<model> |
| ALLOWED_ORIGINS | No | Comma-separated CORS origins. Default http://localhost:3000 |
| NEXT_PUBLIC_BACKEND_URL | No | Frontend only. Default http://localhost:8000 |
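
As an illustration of how the backend variables could be read, the sketch below validates the two required keys and splits ALLOWED_ORIGINS on commas. The Settings class and load_settings function are hypothetical; only the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    gemini_api_key: str
    firecrawl_api_key: str
    llm_provider: str
    allowed_origins: list[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read backend settings, failing fast when a required key is missing."""
    missing = [k for k in ("GEMINI_API_KEY", "FIRECRAWL_API_KEY") if not env.get(k)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    origins = env.get("ALLOWED_ORIGINS", "http://localhost:3000")
    return Settings(
        gemini_api_key=env["GEMINI_API_KEY"],
        firecrawl_api_key=env["FIRECRAWL_API_KEY"],
        llm_provider=env.get("SCRAPER_LLM_PROVIDER", "gemini"),
        # Tolerate spaces and trailing commas in the comma-separated list.
        allowed_origins=[o.strip() for o in origins.split(",") if o.strip()],
    )
```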

Project structure

Real-Time-Scraper/
├── backend/
│   ├── api.py          # FastAPI: /api/scrape (SSE), /api/retailer-top, /api/best-picks, /api/health
│   ├── pipeline.py     # run_all / run_site / _run_tiers — orchestration + candidate cache
│   ├── models.py       # Pydantic: ProductResult, ScrapeOutcome, CardCandidate
│   ├── matching.py     # Token-weighted similarity with Hebrew tokenisation support
│   ├── translation.py  # detect_language + translate_query via Gemini (cached in-process)
│   ├── scrapers/       # HOW — one file per tier (basic, browser, llm, firecrawl, _browser)
│   └── sites/          # WHAT — one file per retailer (URL builders + HTML/JSON parsers)
│
├── frontend/
│   ├── app/            # Next.js App Router (page.tsx state machine, layout, globals.css)
│   ├── components/     # HeroIdle, ResultsHeader, ResultsView, Waterfall, ResultRow, …
│   └── lib/            # api.ts (EventSource + REST), history.ts, picks.ts, types.ts
│
├── Dockerfile          # python:3.12-slim + playwright install --with-deps chromium
├── .github/workflows/  # deploy-backend.yml — builds + pushes to ACR on push to main
└── DEPLOY.md           # Full Azure Container Apps + Vercel deployment guide

Scrapers vs Sites separation: sites/X.py is the what (CSS selectors, URL templates, parse logic — pure HTML-string-in, structured-fields-out). scrapers/X.py is the how (which HTTP client, anti-bot strategy, headers). Adding a new retailer = add one file in sites/. Adding a new scraping tier = add one file in scrapers/.
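
The split can be illustrated with a toy retailer: the "what" is a URL builder plus a pure parser, and the "how" is whatever fetch function a tier supplies. The shop.example domain, the HTML markup, and all names below are made up for this sketch and do not correspond to any real site module.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional
from urllib.parse import quote_plus


@dataclass
class ParsedCard:
    title: str
    price: float


# --- "what": site knowledge (URL template + parsing), no network access ---
def build_search_url(query: str) -> str:
    return "https://shop.example/search?q=" + quote_plus(query)


_CARD = re.compile(
    r'<h2 class="title">(?P<title>.*?)</h2>\s*'
    r'<span class="price">(?P<price>[\d.,]+)</span>',
    re.S,
)


def parse_first_card(html: str) -> Optional[ParsedCard]:
    """Pure function: HTML string in, structured fields out."""
    match = _CARD.search(html)
    if match is None:
        return None
    price = float(match["price"].replace(",", ""))
    return ParsedCard(title=match["title"].strip(), price=price)


# --- "how": a tier only decides how the HTML is fetched ---
def scrape(query: str, fetch: Callable[[str], str]) -> Optional[ParsedCard]:
    return parse_first_card(fetch(build_search_url(query)))
```

Because the parser never touches the network, it can be tested against saved HTML, and the same site module can be reused by a plain HTTP tier or a browser tier by swapping the fetch function.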


Tech stack

| Layer | Technology |
|---|---|
| Frontend framework | Next.js 16 (App Router), React 19, TypeScript |
| Styling | Tailwind v4, inline styles (design-spec fidelity) |
| Backend framework | FastAPI, Python 3.12, Uvicorn |
| Streaming | Server-Sent Events (sse-starlette) |
| Data validation | Pydantic v2 |
| Tier 1 scraping | Scrapling |
| Tier 2 browser | CloakBrowser (stealth Playwright) |
| Tier 3 LLM | Scrapegraph-ai + Gemini 2.5 Flash |
| Tier 4 cloud | Firecrawl |
| Translation | Gemini 2.5 Flash (via Google AI SDK) |
| Frontend hosting | Vercel (free tier) |
| Backend hosting | Azure Container Apps (North Europe, scale-to-zero) |
| Container registry | Azure Container Registry (Basic) |
| CI/CD | GitHub Actions |

Production deployment

The app is live. The backend image is built by GitHub Actions (no local Docker push needed) and deployed to Azure Container Apps. A full step-by-step guide with CLI commands is in DEPLOY.md.

To redeploy the backend after a code change:

# Push to main — GitHub Actions builds and pushes the new image to ACR automatically.
# Then roll the Container App to the new image:
az containerapp update --name dealcast-backend --resource-group dealcast-rg \
  --image dealcast.azurecr.io/dealcast-backend:latest

Frontend changes deploy automatically on push to main via Vercel's GitHub integration.
