A self-hosted MCP server that exposes web fetch and scrape tools to AI agents, routing every request through your own HTTP or SOCKS5 proxy.
Built for a specific use case: you already have a proxy subscription, you want Claude Desktop or Cursor to use it, and you do not want to pay a managed scraping API on top of your existing proxy bill.
- Exposes four MCP tools to any compatible client:
fetch_url,scrape_page,check_proxy,list_fingerprints. - Routes all HTTP traffic through a single upstream proxy configured via environment variables.
- Supports HTTP, HTTPS, and SOCKS5 proxies out of the box.
- Applies geo-matched fingerprint headers (User-Agent, Accept-Language, timezone hints) so a request through a German IP does not arrive with en-US locale.
- Enforces a per-request timeout and a token-bucket rate limiter so a runaway agent cannot burn your proxy quota in a retry storm.
- Blocks tool calls that resolve to private, loopback, link-local, or reserved IP ranges. The server is exposed to an LLM, and prompt injection is real.
- It does not manage proxy rotation across a pool. One upstream proxy, one sticky session, by design.
- It does not render JavaScript. Use the optional Playwright extra or wrap a headless browser yourself if you need that.
- It does not cache responses.
git clone https://github.com/YOUR-ACCOUNT/self-mcp-scraper
cd self-mcp-scraper
python3 -m venv .venv
source .venv/bin/activate
pip install -e .Copy .env.example to .env and fill in your proxy credentials, or pass the same variables via your MCP client's env block. The minimum viable config is just PROXY_HOST and PROXY_PORT.
PROXY_SCHEME=socks5
PROXY_HOST=proxy.example.com
PROXY_PORT=1080
PROXY_USER=alice
PROXY_PASS=secret
DEFAULT_FINGERPRINT_COUNTRY=USBefore wiring up an MCP client, confirm the tools work against your proxy:
python examples/standalone_probe.pyYou should see your proxy's exit IP under check_proxy, a 200 from example.com, and extracted headings from the scrape.
Edit ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or the equivalent on Windows and Linux. See examples/claude_desktop_config.json for a template. Replace the path to the binary with the absolute path to .venv/bin/self-mcp-scraper inside your clone.
Restart Claude Desktop. The four tools should appear in the slash menu under the server name self-mcp-scraper.
Add the mcpServers block from examples/cursor_config.json to your Cursor settings. Restart Cursor.
Raw HTTP fetch. Accepts url, method, headers, body, timeout_seconds, country. Returns status, final_url, headers, body, truncated, bytes.
Fetches a URL and extracts fields with CSS selectors. Accepts url, selectors (map of field name to selector), country, timeout_seconds. If selectors is empty, returns the page title and text body.
Calls api.ipify.org through the proxy and returns the exit IP. Use this to confirm your proxy is reachable and to verify the IP geolocates where you expect.
Returns the ISO country codes for which a matching fingerprint preset exists.
Passing country: "DE" attaches Accept-Language: de-DE,de;q=0.9,en;q=0.7, a Chrome/131 User-Agent, and Europe/Berlin-appropriate hints. Currently supported: US, GB, DE, FR, ES, IT, NL, PL, RU, UA, TR, AZ, CN, JP, KR, IN, BR, MX, CA, AU, AE, SG, ZA. Pull requests to add more are welcome.
A shared token bucket across all tool calls. Default: burst of 10, sustained 2 req/s. Tune via RATE_LIMIT_CAPACITY and RATE_LIMIT_REFILL_PER_SECOND. When the bucket is empty, calls block until a token refills. This is cheaper than discovering a runaway agent through your proxy invoice.
MAX_RESPONSE_BYTES bounds the body returned to the agent (default 5 MB). Larger responses are truncated and flagged with truncated: true. Without a cap, a single page can blow your model context budget.
Stdio by default, which is what Claude Desktop, Cursor, Cline, and most local MCP clients want. For remote clients that need HTTP:
self-mcp-scraper --transport http --port 8765Streamable HTTP support depends on the installed mcp package version. If it is unavailable on your install, the server will tell you on start.
The server blocks URLs that resolve to private, loopback, link-local, multicast, or reserved IP ranges. This is the minimum acceptable SSRF guard for a tool exposed to an LLM. It will not save you from a targeted attacker, but it will stop a prompt-injected agent from fetching http://169.254.169.254/latest/meta-data/.
Do not commit your .env. The .gitignore already excludes it.
MIT. See LICENSE.