| Field | Value |
|---|---|
| Project | NoSQL Injection CLI Scanner |
| Artifact | Implementation Plan v1.0 |
| Runtime | Python 3.10+ (CPython), Windows / Linux / macOS |
| Stack | argparse, requests, beautifulsoup4, rich, urllib3 |
| Target App | Custom Express/MongoDB app at website/ (localhost:3000) |
| Version | 1.0 Planning Baseline |
| Date | 2026-04-22 |
We are building a Python CLI vulnerability scanner that automatically detects NoSQL injection flaws in MongoDB-backed web applications. The tool crawls a target website, discovers forms and API inputs, injects categorized NoSQL payloads (operator injection, regex extraction, logic bypasses) via both URL-encoded and JSON content types, and analyzes server responses to flag authentication bypasses, data leaks, and abnormal behaviors. This matters because MongoDB/Node.js stacks are widely deployed yet under-tested for injection flaws — this tool gives security researchers and developers a repeatable, automated way to audit their applications without manual curl testing.
- Invoke: User runs `python scanner.py --url http://localhost:3000 --max-depth 2 --output report.json --verbose` from the `cli/` directory.
- Validate: Scanner validates the URL, confirms the target is reachable, and establishes a baseline response fingerprint.
- Crawl: Scanner crawls internal pages within scope (up to `--max-depth`), collecting all discoverable HTML forms and their field structures.
- Inject: For each discovered form, payloads from `payloads.json` are injected into each field. Each payload is tested as both `application/x-www-form-urlencoded` and `application/json` where applicable.
- Analyze: Response analysis engine compares each response against the baseline, checking for status code changes, body content anomalies (e.g., `"ok": true` where `"ok": false` was expected), redirect behavior changes, and timing anomalies.
- Report: Scanner writes a structured JSON report and prints a colorized console summary with severity-scored findings.
- CLI argument parsing with `argparse` (`--url`, `--max-depth`, `--timeout`, `--output`, `--verbose`, `--content-type`, `--delay`)
- URL validation and normalization (scheme enforcement, trailing-slash normalization)
- Recursive page crawler with scope restriction (same-origin only) and configurable depth limit
- HTML form discovery via BeautifulSoup: extracting `action`, `method`, input `name`/`type` attributes
- Curated `payloads.json` with categorized vectors:
  - Operator injection: `$ne`, `$gt`, `$regex`, `$in`, `$nin`, `$exists`
  - Logic bypass: `' || 1==1`, `$where` clauses
  - Time-based blind: `sleep()` payloads for timing side-channel detection
  - Derived from `nosqli_wordlist.txt` and the NoSQL Injection reference guide
- Dual content-type testing: each payload submitted as both `application/x-www-form-urlencoded` and `application/json`
- Response analysis heuristics:
  - Status code delta: 401 → 200 indicates bypass
  - Body content matching: `"ok": true`, `"Login successful"`, `"count":` changes
  - Redirect detection: 3xx or client-side redirect indicators
  - Timing analysis: response time > baseline + threshold indicates blind injection
  - Error signature detection: MongoDB/Mongoose error strings in response body
- Severity scoring: Critical / High / Medium / Low / Info
- Structured JSON report matching the contract in README.md
- Colorized console summary via `rich`
- Session-aware scanning (cookie jar persistence across requests to handle authenticated flows)
- Logging with Python `logging` module at configurable verbosity levels
- Robust error handling: timeouts, connection errors, malformed HTML, unreachable targets
- Automated data exploitation/dumping — the scanner detects and proves, but does not extract database contents beyond proof-of-concept confirmation
- Server-Side JavaScript Injection (SSJI) evaluation: `$where` payloads are included for detection, but arbitrary JS execution chains are not evaluated
- GUI/Web interface: CLI only
- Non-MongoDB databases — no CouchDB, Cassandra, or Redis-specific payloads
- Proxy/interceptor integration — no Burp Suite or ZAP plugin mode
- Distributed/multi-threaded scanning — single-threaded sequential scanning for V1
- Custom authentication flows — scanner supports cookie-based sessions; OAuth/JWT/SAML are deferred
- Scanner correctly identifies the known auth bypass on `POST /api/vuln/login` when injecting `{"username": {"$gt": ""}, "password": {"$gt": ""}}`, rated Critical/High
- Scanner correctly identifies search injection on `GET/POST /api/vuln/search` when injecting operator objects, rated High/Medium
- Scanner produces zero false positives when run against `POST /api/secure/login` and `GET /api/secure/search` (secure mode rejects all operator payloads with 400)
- Scanner discovers at least the login form on `/login.html` and search form on `/search.html` via crawling
- JSON report output validates against the report contract defined in `README.md`
- Scanner never stores or leaks target credentials beyond the scan session
- All payloads are loaded from the external `payloads.json` file, with no hardcoded attack strings in source
- Scanner includes a mandatory `--i-agree-to-legal-terms` flag or disclaimer prompt before scanning non-localhost targets
- No dependency with known CVEs at time of release
- Full scan of the target app (2 pages, ~30 payloads × 2 content types × 2 forms) completes in under 60 seconds with default timeout (5s per request)
- Individual request timeout default: 5 seconds
- Memory usage under 100MB for any scan
- `pip install -r requirements.txt` from a clean venv succeeds on Python 3.10+
- `python scanner.py --help` prints complete usage documentation
- Scanner exits with code 0 on success, 1 on scan errors, 2 on argument errors
- Console output uses color-coded severity levels (red=Critical, orange=High, yellow=Medium, cyan=Low)
- Progress indicators show current phase (Crawling → Injecting → Analyzing → Reporting)
- Verbose mode logs every request/response pair for lab analysis and debugging
1. CLI Entry Point (scanner.py)
The main entry point. Parses arguments via argparse, orchestrates the scan pipeline (validate → crawl → inject → analyze → report), and handles top-level error boundaries. Owns the decision of which phases to run and in what order.
2. URL Validator (core/validator.py)
Validates and normalizes the target URL. Ensures scheme is http/https, resolves trailing slashes, and performs a reachability check (HEAD request) before the scan begins. Returns a canonical base URL for the crawler.
3. Web Crawler (core/crawler.py)
Recursively discovers pages within the target's origin up to --max-depth. Maintains a visited-URL set to avoid cycles. For each discovered page, delegates to the form parser. Uses requests.Session for cookie persistence. Respects robots.txt as a courtesy (configurable).
4. Form Parser (core/form_parser.py)
Uses BeautifulSoup to extract all <form> elements from a page. For each form, captures: action URL (resolved to absolute), HTTP method, input field names and types, and any hidden fields. Returns structured FormTarget objects for the injection engine.
5. Payload Manager (core/payloads.py)
Loads and categorizes payloads from payloads.json. Exposes payloads grouped by category (operator_injection, logic_bypass, regex_extraction, time_based, blind_bruteforce). Each payload entry includes: the payload value, a category tag, expected behavior description, and applicable content types.
6. Injection Engine (core/injector.py)
Takes a FormTarget and a list of payloads. For each injectable field, substitutes the payload and submits the form. Tests each payload under both application/x-www-form-urlencoded and application/json content types. Records the full request/response pair for analysis. Supports configurable inter-request delay.
7. Response Analyzer (core/analyzer.py)
Compares each injection response against a baseline response (captured by submitting the form with benign data). Implements heuristic checks: status code delta, body content keyword matching, response length delta, redirect detection, timing anomaly detection (for time-based blind), and MongoDB error string scanning. Assigns a severity and confidence score to each finding.
8. Report Generator (core/reporter.py)
Collects all findings from the analyzer and produces: (a) a structured JSON report file matching the contract in README.md, and (b) a colorized console summary table using rich. Includes scan metadata (target, start/end time, total forms/payloads tested, findings count by severity).
9. Utilities (core/utils.py)
Shared helpers: logging configuration, URL resolution, timing utilities, constant definitions (e.g., known MongoDB error signatures), and the disclaimer/legal prompt.
cli/
├── scanner.py # Entry point — CLI argument parsing and pipeline orchestration
├── requirements.txt # Python dependencies
├── payloads.json # Categorized NoSQL injection payloads
├── core/
│ ├── __init__.py
│ ├── validator.py # URL validation and reachability check
│ ├── crawler.py # Recursive same-origin page crawler
│ ├── form_parser.py # BeautifulSoup-based HTML form extractor
│ ├── payloads.py # Payload loader and category manager
│ ├── injector.py # Form submission engine (URL-encoded + JSON)
│ ├── analyzer.py # Response comparison and heuristic scoring
│ ├── reporter.py # JSON report writer + Rich console summary
│ ├── utils.py # Shared helpers, constants, logging setup
│ └── models.py # Data classes: FormTarget, Finding, ScanResult, Payload
├── references/
│ ├── NoSQL Injection.md # Reference guide (existing)
│ └── nosqli_worlist.txt # Raw wordlist (existing)
├── general plan.md # Plan template (existing)
└── README.md # Usage documentation (existing)
Goals: Confirm all requirements, validate the vulnerable target app is functional, and establish the development environment.
Work Packages:
- Verify Python 3.10+ is installed and create the virtual environment (`python -m venv .venv`)
- Populate `requirements.txt` with pinned versions: `requests==2.32.*`, `beautifulsoup4==4.12.*`, `rich==13.*`, `urllib3==2.*`
- Start the target website (`cd website && npm install && node src/server.js`) and confirm it responds on `http://localhost:3000`
- Manually verify the vulnerable endpoints still behave as documented in MODE_AUTH_FLOW_TEST_RESULTS.md:
  - `POST /api/vuln/login` with `username[$ne]=&password[$ne]=` → 200 + `"ok": true`
  - `POST /api/vuln/login` with legit creds → 200 + `"ok": true`
  - `POST /api/vuln/login` with wrong creds → 401 + `"ok": false`
  - `POST /api/secure/login` with `password[$ne]=` → 400 + `"must be non-empty strings"`
- Create the `cli/core/` package directory with `__init__.py`
- Confirm `payloads.json` can be loaded and parsed (currently `{}`, will be populated in Phase 1)
Deliverables:
- Working venv with all dependencies installed
- Confirmed target app responses matching expected behavior
- Empty `cli/core/` package structure created
Exit Criteria:
- `pip install -r requirements.txt` succeeds
- Target app responds correctly to both legit and injection payloads
- All team members have the same environment running
References: R1, R2, R3
Goals: Build the data layer and shared infrastructure that all other components depend on.
Work Packages:
- Create `core/models.py`. Define the following dataclasses:

  ```python
  from dataclasses import dataclass
  from typing import Any


  @dataclass
  class FormTarget:
      url: str            # Absolute action URL
      method: str         # GET or POST
      fields: list[dict]  # [{"name": "username", "type": "text"}, ...]
      page_url: str       # Page where the form was found


  @dataclass
  class Payload:
      value: Any           # The payload (string, dict, or list)
      category: str        # operator_injection | regex_extraction | logic_bypass | time_based | blind_bruteforce
      description: str     # Human-readable explanation
      content_types: list  # ["urlencoded", "json"] or subset


  @dataclass
  class Finding:
      severity: str          # critical | high | medium | low | info
      confidence: str        # confirmed | tentative | speculative
      endpoint: str          # e.g., /api/vuln/login
      form_page: str         # e.g., /login.html
      method: str            # POST, GET
      content_type: str      # The content type that triggered the finding
      payload: Any           # The exact payload that triggered it
      payload_category: str  # Category from Payload
      evidence: str          # Human-readable evidence description
      baseline_status: int   # Expected status code
      actual_status: int     # Observed status code
      response_snippet: str  # First 500 chars of response body
      response_time_ms: float


  @dataclass
  class ScanResult:
      target: str
      scan_start: str  # ISO 8601
      scan_end: str    # ISO 8601
      pages_crawled: int
      forms_discovered: int
      payloads_tested: int
      findings: list[Finding]
  ```
- Create `core/utils.py`. Implement:
  - `setup_logging(verbose: bool) -> logging.Logger`: configures root logger with Rich handler
  - `resolve_url(base: str, relative: str) -> str`: uses `urllib.parse.urljoin`
  - `MONGODB_ERROR_SIGNATURES: list[str]`: `["MongoError", "MongoServerError", "CastError", "ValidationError", "BSONTypeError", "$where", "mapReduce"]`
  - `SUCCESS_INDICATORS: list[str]`: `['"ok": true', '"ok":true', '"Login successful"', '"message": "Login successful"']`
  - `FAILURE_INDICATORS: list[str]`: `['"ok": false', '"ok":false', '"Invalid credentials"']`
  - `print_banner()`: prints ASCII art tool name and version
- Populate `payloads.json`. Curate payloads derived from nosqli_wordlist.txt and NoSQL Injection.md. Structure:

  ```json
  {
    "operator_injection": [
      { "value": {"$ne": ""}, "description": "Not-equal empty string bypass", "content_types": ["urlencoded", "json"] },
      { "value": {"$gt": ""}, "description": "Greater-than empty string bypass", "content_types": ["urlencoded", "json"] },
      { "value": {"$ne": null}, "description": "Not-equal null bypass", "content_types": ["json"] },
      { "value": {"$regex": ".*"}, "description": "Wildcard regex match-all", "content_types": ["json"] },
      { "value": {"$regex": "^a"}, "description": "Regex prefix extraction probe (letter a)", "content_types": ["json"] },
      { "value": {"$exists": true}, "description": "Field existence check bypass", "content_types": ["json"] },
      { "value": {"$in": ["admin", "root", "administrator", "Admin"]}, "description": "In-list common username brute force", "content_types": ["json"] },
      { "value": {"$nin": ["impossible_value_12345"]}, "description": "Not-in exclusion bypass", "content_types": ["json"] }
    ],
    "logic_bypass": [
      { "value": "' || 1==1", "description": "JS logical OR tautology", "content_types": ["urlencoded"] },
      { "value": "' || 1==1//", "description": "JS logical OR tautology with comment", "content_types": ["urlencoded"] },
      { "value": "' || 1==1%00", "description": "JS logical OR with null byte terminator", "content_types": ["urlencoded"] },
      { "value": "true, $where: '1 == 1'", "description": "$where clause injection via string concatenation", "content_types": ["urlencoded"] },
      { "value": ", $where: '1 == 1'", "description": "$where clause injection (comma prefix)", "content_types": ["urlencoded"] }
    ],
    "regex_extraction": [
      { "value": {"$regex": "^admin"}, "description": "Regex probe: starts with admin", "content_types": ["json"] },
      { "value": {"$regex": "^.{1,50}$"}, "description": "Regex probe: match strings 1-50 chars", "content_types": ["json"] }
    ],
    "time_based": [
      { "value": {"$where": "sleep(3000)"}, "description": "Time-based blind: 3 second sleep", "content_types": ["json"], "expected_delay_ms": 3000 },
      { "value": "';sleep(3000);", "description": "JS injection sleep via string break", "content_types": ["urlencoded"], "expected_delay_ms": 3000 }
    ],
    "auth_bypass_combo": [
      { "value": {"username": {"$gt": ""}, "password": {"$gt": ""}}, "description": "Full auth bypass: both fields gt empty", "content_types": ["json"], "is_full_body": true },
      { "value": {"username": {"$ne": null}, "password": {"$ne": null}}, "description": "Full auth bypass: both fields ne null", "content_types": ["json"], "is_full_body": true },
      { "value": {"username": {"$ne": "foo"}, "password": {"$ne": "bar"}}, "description": "Full auth bypass: both fields ne arbitrary", "content_types": ["json"], "is_full_body": true },
      { "value": {"username": {"$in": ["admin", "root", "administrator"]}, "password": {"$gt": ""}}, "description": "Admin brute force via $in + $gt bypass", "content_types": ["json"], "is_full_body": true }
    ]
  }
  ```
- Create `core/payloads.py`. Implement:
  - `load_payloads(filepath: str) -> dict[str, list[Payload]]`: loads and validates `payloads.json`, returns categorized `Payload` objects
  - `get_all_payloads(categories: dict) -> list[Payload]`: flattens all categories into a single list
  - `get_payloads_for_content_type(payloads: list[Payload], content_type: str) -> list[Payload]`: filters payloads by content type
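The loader and filter functions described above could look roughly like the sketch below. It is only an illustration under a few assumptions: it parses JSON text rather than a file path so it stays self-contained (`load_payloads_from_text` is a hypothetical helper name), it redefines a trimmed `Payload` locally, and it drops optional keys such as `expected_delay_ms` and `is_full_body`, which a real loader would need to carry through.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Payload:
    value: Any
    category: str
    description: str
    content_types: list = field(default_factory=list)


REQUIRED_KEYS = {"value", "description", "content_types"}


def load_payloads_from_text(text: str) -> dict[str, list[Payload]]:
    """Parse payloads.json content into Payload objects grouped by category."""
    raw = json.loads(text)
    if not isinstance(raw, dict):
        raise ValueError("payload file must be a JSON object keyed by category")
    categories: dict[str, list[Payload]] = {}
    for category, entries in raw.items():
        loaded = []
        for entry in entries:
            # Reject entries that lack the fields every payload must carry.
            missing = REQUIRED_KEYS - entry.keys()
            if missing:
                raise ValueError(f"{category}: entry missing {sorted(missing)}")
            loaded.append(
                Payload(entry["value"], category, entry["description"],
                        list(entry["content_types"]))
            )
        categories[category] = loaded
    return categories


def get_all_payloads(categories: dict[str, list[Payload]]) -> list[Payload]:
    """Flatten every category into one list, preserving file order."""
    return [p for plist in categories.values() for p in plist]


def get_payloads_for_content_type(payloads: list[Payload], content_type: str) -> list[Payload]:
    """Keep only payloads that declare support for the given content type."""
    return [p for p in payloads if content_type in p.content_types]


# A three-entry excerpt of the structure shown above.
SAMPLE = '''
{
  "operator_injection": [
    {"value": {"$ne": ""}, "description": "Not-equal empty string bypass", "content_types": ["urlencoded", "json"]},
    {"value": {"$ne": null}, "description": "Not-equal null bypass", "content_types": ["json"]}
  ],
  "logic_bypass": [
    {"value": "' || 1==1", "description": "JS logical OR tautology", "content_types": ["urlencoded"]}
  ]
}
'''
```

A file-based `load_payloads(filepath)` would then be a thin wrapper that reads the file and delegates to this function.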
Deliverables:
- `core/models.py` with all dataclasses
- `core/utils.py` with logging, URL helpers, and constants
- `core/payloads.py` with loader and filter functions
- Populated `payloads.json` with 20+ categorized payloads
Exit Criteria:
- `from core.models import FormTarget, Payload, Finding, ScanResult` works
- `load_payloads("payloads.json")` returns populated payload categories
- All constants are centrally defined and importable
References: R1, R2, R3
Goals: Implement the input validation layer and recursive page crawler that discovers all forms within the target scope.
Work Packages:
- Create `core/validator.py`. Implement:
  - `validate_url(url: str) -> str`: checks scheme is http/https, normalizes trailing slash, validates hostname format. Raises `ValueError` on invalid input.
  - `check_reachability(url: str, timeout: int) -> tuple[bool, int, float]`: sends HEAD request, returns (reachable, status_code, response_time_ms). Handles `ConnectionError`, `Timeout`, `TooManyRedirects`.
  - `get_base_origin(url: str) -> str`: extracts `scheme://host:port` for scope comparison.
- Create `core/crawler.py`. Implement:
  - `Crawler` class with:
    - `__init__(self, base_url: str, max_depth: int, timeout: int, session: requests.Session)`
    - `crawl(self) -> list[CrawledPage]`: BFS/DFS traversal starting from `base_url`
    - `_fetch_page(self, url: str) -> str | None`: GET request, return HTML body or None on error
    - `_extract_links(self, html: str, page_url: str) -> list[str]`: find all `<a href>` links, resolve to absolute, filter to same origin
    - `_is_in_scope(self, url: str) -> bool`: checks if URL shares the same origin as base
  - `CrawledPage` dataclass: `url: str`, `html: str`, `depth: int`
  - Visited-URL deduplication via `set`
  - Skip non-HTML resources (check Content-Type header before parsing)
  - Log each discovered page at INFO level
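As a rough illustration of the validation and scope rules above, the sketch below uses only `urllib.parse`. The normalization choices are assumptions rather than settled design: trailing slashes are stripped, the host is lowercased, and default ports (80/443) are filled in so that origins compare consistently. `is_in_scope` is written as a free function here, standing in for the crawler's `_is_in_scope` method.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def validate_url(url: str) -> str:
    """Return a canonical URL or raise ValueError for anything not http/https."""
    parts = urlsplit(url.strip())
    if parts.scheme not in DEFAULT_PORTS:
        raise ValueError(f"unsupported or missing scheme: {url!r}")
    if not parts.hostname:
        raise ValueError(f"missing hostname: {url!r}")
    # Assumed normalization: drop trailing slashes, lowercase the host.
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))


def get_base_origin(url: str) -> str:
    """Reduce a URL to scheme://host:port, filling in the default port."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    return f"{parts.scheme}://{parts.hostname}:{port}"


def is_in_scope(url: str, base_url: str) -> bool:
    """True only when url shares the exact origin of base_url."""
    try:
        return get_base_origin(url) == get_base_origin(base_url)
    except (KeyError, ValueError):
        # mailto:, javascript:, malformed ports and similar are out of scope.
        return False
```

Comparing full origins (scheme, host, and port) rather than hostnames alone keeps a second service on another port of the same machine out of the crawl.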
Deliverables:
- `core/validator.py`: URL validation and reachability
- `core/crawler.py`: recursive crawler with scope enforcement
Exit Criteria:
- `validate_url("http://localhost:3000")` returns normalized URL
- `validate_url("not-a-url")` raises `ValueError`
- `Crawler("http://localhost:3000", max_depth=2, ...).crawl()` discovers at least `/`, `/login.html`, `/search.html`
- Crawler does not visit external links or non-HTML resources
References: R1, R4
Goals: Extract all injectable form targets from crawled pages.
Work Packages:
- Create `core/form_parser.py`. Implement:
  - `parse_forms(html: str, page_url: str) -> list[FormTarget]`: uses BeautifulSoup to find all `<form>` tags
  - For each form, extract:
    - `action` attribute → resolve to absolute URL using `page_url` as base
    - `method` attribute → default to `GET` if missing, normalize to uppercase
    - All `<input>`, `<textarea>`, `<select>` elements → capture `name`, `type`, `value` (default value if present)
    - Hidden fields are preserved (they may contain CSRF tokens or mode selectors)
  - Return list of `FormTarget` objects
  - Handle edge cases: forms with no action (submit to current page), forms with empty method, nested forms (invalid HTML but handle gracefully)
- Integration test: Run crawler → form parser pipeline against `http://localhost:3000` and verify:
  - Login form discovered with fields: `username` (text), `password` (password), `send-json` (checkbox)
  - Search form discovered with fields: `query` (text), `send-json` (checkbox)
  - Action URLs resolve correctly (`/api/vuln/login`, `/api/vuln/search` or relative)
Important
The login and search forms in the target app submit via JavaScript (login.js, search.js), not via standard HTML form actions. The <form> tags may have no action attribute, or the action may not match the actual API endpoint. The form parser should extract what it finds from the HTML, but the injection engine (Phase 4) must allow manual endpoint override via CLI flag (--endpoints) or fall back to discovered API path inference by examining the JS files for fetch/XHR URLs.
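The extraction rules for forms can be sketched without third-party dependencies. Note the swap: the plan calls for BeautifulSoup, while this illustration uses the standard library's `html.parser.HTMLParser` so it runs as-is, and it returns plain dicts shaped like `FormTarget` instead of the dataclass. `FormCollector` is a hypothetical name.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormCollector(HTMLParser):
    """Collect forms and their named fields from one HTML page."""

    def __init__(self, page_url: str):
        super().__init__()
        self.page_url = page_url
        self.forms: list[dict] = []
        self._current = None  # form currently being filled, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # A missing or empty action means "submit to the current page".
            action = attrs.get("action") or self.page_url
            self._current = {
                "url": urljoin(self.page_url, action),
                "method": (attrs.get("method") or "GET").upper(),
                "fields": [],
                "page_url": self.page_url,
            }
            self.forms.append(self._current)
        elif tag in ("input", "textarea", "select") and self._current is not None:
            name = attrs.get("name")
            if name:  # unnamed controls are never submitted
                default_type = "text" if tag == "input" else tag
                self._current["fields"].append({
                    "name": name,
                    "type": attrs.get("type") or default_type,
                    "value": attrs.get("value") or "",
                })

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def parse_forms(html: str, page_url: str) -> list[dict]:
    collector = FormCollector(page_url)
    collector.feed(html)
    return collector.forms
```

Because the target's forms submit through JavaScript, the `url` recovered this way may be the page itself rather than the API endpoint, which is why the manual `--endpoints` override matters.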
Deliverables:
- `core/form_parser.py`: BeautifulSoup-based form extractor
- Verified form discovery against the live target
Exit Criteria:
- `parse_forms(html, "http://localhost:3000/login.html")` returns at least one `FormTarget` with username and password fields
- All form field names are correctly captured
- Absolute action URLs are correctly resolved
References: R1, R4
Goals: Build the core scanning engine that submits payloads to discovered forms and records responses.
Work Packages:
- Create `core/injector.py`. Implement:
  - `Injector` class with:
    - `__init__(self, session: requests.Session, timeout: int, delay: float, verbose: bool)`
    - `establish_baseline(self, form: FormTarget) -> BaselineResponse`: submit the form with benign data (empty strings or placeholder values) to capture the "normal" response (status code, body length, key content markers, response time)
    - `inject_form(self, form: FormTarget, payloads: list[Payload]) -> list[InjectionResult]`: for each payload:
      - For each injectable field (skip checkboxes, hidden fields with fixed values)
      - For each applicable content type (urlencoded, json):
        - Build the request body with the payload substituted into the current field, benign values in other fields
        - For `is_full_body` payloads (auth_bypass_combo), use the payload as the entire request body
        - Submit and record: status code, response body, response headers, response time, any cookies set
        - Apply inter-request delay if configured
    - `_build_urlencoded_body(self, form: FormTarget, field_name: str, payload_value) -> dict`: for operator payloads like `{"$ne": ""}`, encode as `field_name[$ne]=`
    - `_build_json_body(self, form: FormTarget, field_name: str, payload_value) -> dict`: substitute payload object directly into JSON body
  - `BaselineResponse` dataclass: `status_code`, `body_length`, `body_hash`, `response_time_ms`, `key_markers` (list of found SUCCESS/FAILURE indicators)
  - `InjectionResult` dataclass: `form`, `field_name`, `payload`, `content_type`, `status_code`, `body`, `headers`, `response_time_ms`, `cookies`
- URL-encoded operator encoding logic: When sending operator injection payloads via urlencoded content type, the injector must format them correctly for Express's `qs` parser:
  - `{"$ne": ""}` → `field_name[$ne]=`
  - `{"$gt": ""}` → `field_name[$gt]=`
  - `{"$regex": ".*"}` → `field_name[$regex]=.*`
  - `{"$in": ["admin", "root"]}` → `field_name[$in][]=admin&field_name[$in][]=root`
- Session cookie management: The injector must use a shared `requests.Session` so that if a payload achieves login (sets a session cookie), subsequent requests to authenticated endpoints (like `/api/vuln/search`) can be tested in the authenticated context. The injector should also support resetting the session between payload groups to avoid cross-contamination (logout + new session).
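The bracket-notation mapping for urlencoded operator payloads can be expressed as a small pure function. This is a minimal sketch: `encode_operator_pairs` is a hypothetical helper (not the planned `_build_urlencoded_body` signature), it returns key/value pairs rather than a dict so that repeated `[$in][]` keys survive, and it handles only one level of operator nesting.

```python
from urllib.parse import urlencode


def encode_operator_pairs(field_name: str, payload_value) -> list[tuple[str, str]]:
    """Flatten one field's payload into qs-style bracket-notation pairs."""
    if not isinstance(payload_value, dict):
        # Plain string payloads are sent as the field value unchanged.
        return [(field_name, str(payload_value))]
    pairs: list[tuple[str, str]] = []
    for operator, operand in payload_value.items():
        key = f"{field_name}[{operator}]"
        if isinstance(operand, list):
            # Lists become repeated key[] entries, one per element.
            pairs.extend((f"{key}[]", str(item)) for item in operand)
        else:
            pairs.append((key, "" if operand is None else str(operand)))
    return pairs


def build_urlencoded_body(fields: dict) -> str:
    """Encode a {field_name: payload_or_benign_value} mapping as a request body."""
    pairs: list[tuple[str, str]] = []
    for name, value in fields.items():
        pairs.extend(encode_operator_pairs(name, value))
    return urlencode(pairs)
```

`urlencode` percent-encodes the brackets and the `$` sign (for example `username%5B%24ne%5D=`); this sketch assumes the server-side parser decodes keys before interpreting the brackets, which should be confirmed against the target during Phase 4 testing.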
Deliverables:
- `core/injector.py`: injection engine with dual content-type support
- Baseline capture and session management
Exit Criteria:
- Injector correctly sends `POST /api/vuln/login` with `{"username": {"$gt": ""}, "password": {"$gt": ""}}` as JSON and receives 200
- Injector correctly sends `POST /api/vuln/login` with `username[$ne]=&password[$ne]=` as urlencoded and receives 200
- Injector correctly sends the same payloads to `/api/secure/login` and receives 400
- Baseline response for login form correctly captures 401 status with `"ok": false`
- Inter-request delay is respected
References: R1, R2, R3, R5
Goals: Implement the intelligence layer that determines whether an injection response indicates a vulnerability.
Work Packages:
- Create `core/analyzer.py`. Implement:
  - `ResponseAnalyzer` class with:
    - `__init__(self, baseline: BaselineResponse, time_threshold_ms: float = 2500)`
    - `analyze(self, result: InjectionResult) -> Finding | None`: runs all heuristic checks and returns a Finding if any trigger, or None if benign
    - `_check_status_code_delta(self, result) -> tuple[bool, str]`: returns True if status changed from failure (4xx) to success (2xx/3xx). Evidence: `"Status changed from {baseline} to {actual}"`
    - `_check_auth_bypass(self, result) -> tuple[bool, str]`: searches response body for SUCCESS_INDICATORS when baseline contained FAILURE_INDICATORS. Evidence: `"Auth bypass detected: response contains 'Login successful'"`
    - `_check_data_leak(self, result) -> tuple[bool, str]`: checks if response body contains significantly more data than baseline (body length ratio > 1.5x, or new JSON keys like `"results"`, `"count"` appear). Evidence: `"Data leak: response body {X}x larger than baseline"`
    - `_check_error_signature(self, result) -> tuple[bool, str]`: scans response body for MONGODB_ERROR_SIGNATURES. Evidence: `"MongoDB error exposed: {matched_signature}"`
    - `_check_timing_anomaly(self, result) -> tuple[bool, str]`: if payload is time_based category and response_time > baseline + expected_delay * 0.8, flag. Evidence: `"Time-based blind: response took {X}ms (baseline: {Y}ms)"`
    - `_check_redirect_change(self, result) -> tuple[bool, str]`: if baseline had no redirect but injection response has 3xx or Location header. Evidence: `"Unexpected redirect to {location}"`
- Severity scoring logic:

  | Condition | Severity | Confidence |
  |---|---|---|
  | Auth bypass confirmed (status 401→200 + success indicator in body) | critical | confirmed |
  | Auth bypass partial (status change but no success indicator) | high | tentative |
  | Data leak detected (significant body length increase) | high | confirmed |
  | MongoDB error exposed | medium | confirmed |
  | Time-based blind (timing confirms) | high | tentative |
  | Redirect change observed | medium | speculative |
  | Status code change only (no other indicators) | low | speculative |

- Finding deduplication: If the same endpoint+field combination triggers on multiple similar payloads (e.g., `$ne` and `$gt` both bypass login), keep the finding with the highest severity/confidence and note the others as corroborating evidence.
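Two of the rules above, the timing threshold and the deduplication policy, are simple enough to sketch directly. The code below is illustrative only: it uses a trimmed `Finding` with two hypothetical fields (`field_name` and `corroborating`) that the planned dataclass does not yet define, and it ranks findings by severity first, then confidence.

```python
from dataclasses import dataclass, field
from typing import Any

SEVERITY_RANK = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}
CONFIDENCE_RANK = {"speculative": 0, "tentative": 1, "confirmed": 2}


@dataclass
class Finding:
    endpoint: str
    field_name: str  # hypothetical: needed to group by endpoint+field
    severity: str
    confidence: str
    payload: Any
    corroborating: list = field(default_factory=list)  # hypothetical


def is_timing_anomaly(response_ms: float, baseline_ms: float, expected_delay_ms: float) -> bool:
    """Flag when the response exceeds baseline plus 80% of the injected delay."""
    return response_ms > baseline_ms + expected_delay_ms * 0.8


def _rank(finding: Finding) -> tuple[int, int]:
    return (SEVERITY_RANK[finding.severity], CONFIDENCE_RANK[finding.confidence])


def deduplicate_findings(findings: list[Finding]) -> list[Finding]:
    """Keep the strongest finding per endpoint+field; fold the rest into it."""
    best: dict[tuple[str, str], Finding] = {}
    for candidate in findings:
        key = (candidate.endpoint, candidate.field_name)
        current = best.get(key)
        if current is None:
            best[key] = candidate
            continue
        if _rank(candidate) > _rank(current):
            winner, loser = candidate, current
        else:
            winner, loser = current, candidate
        # Record the weaker payload (and anything it absorbed) as evidence.
        winner.corroborating.extend([loser.payload, *loser.corroborating])
        best[key] = winner
    return list(best.values())
```

The 0.8 factor leaves some slack for a sleep that returns slightly early, while a 3000 ms payload still has to add roughly 2.4 s over baseline before it is flagged.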
Deliverables:
- `core/analyzer.py`: multi-heuristic response analyzer with severity scoring
- Finding deduplication logic
Exit Criteria:
- Analyzer correctly flags the `$ne`/`$gt` auth bypass on `/api/vuln/login` as critical / confirmed
- Analyzer correctly produces no findings for the same payloads against `/api/secure/login`
- Time-based analysis correctly flags delayed responses (tested with time_based payloads)
- MongoDB error exposure correctly detected when errors leak in response body
References: R1, R2, R5
Goals: Produce the final JSON report and a professional console summary.
Work Packages:
- Create `core/reporter.py`. Implement:
  - `generate_json_report(scan_result: ScanResult, output_path: str)`: writes JSON matching the contract:

    ```json
    {
      "target": "http://localhost:3000",
      "scan_time": "2026-04-22T10:15:00Z",
      "scan_duration_seconds": 42,
      "summary": {
        "pages_crawled": 3,
        "forms_tested": 2,
        "payloads_tested": 48,
        "potential_findings": 4,
        "by_severity": { "critical": 1, "high": 2, "medium": 1, "low": 0, "info": 0 }
      },
      "findings": [ ... ]
    }
    ```
  - `print_console_summary(scan_result: ScanResult)`: uses `rich` to display:
    - Header banner with scan metadata
    - Summary statistics table
    - Findings table with columns: Severity, Endpoint, Payload Category, Evidence, Confidence
    - Color-coded severity (Critical=red bold, High=red, Medium=yellow, Low=cyan, Info=dim)
    - Footer with report file path
- Verbose output mode: When `--verbose` is active, print each request/response pair as it happens (method, URL, content-type, payload, status code, response time, first 200 chars of body).
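To make the report contract concrete, here is a minimal sketch of how the summary block could be derived from raw scan data. `build_report` is a hypothetical helper that takes findings as plain dicts; it also assumes timestamps carry an explicit `+00:00` offset, because `datetime.fromisoformat` only accepts a trailing `Z` from Python 3.11 onward and the plan targets 3.10+.

```python
import json
from collections import Counter
from datetime import datetime

SEVERITIES = ("critical", "high", "medium", "low", "info")


def build_report(target, scan_start, scan_end, pages_crawled,
                 forms_tested, payloads_tested, findings):
    """Assemble the report dict; findings are dicts with a 'severity' key."""
    start = datetime.fromisoformat(scan_start)
    end = datetime.fromisoformat(scan_end)
    counts = Counter(f["severity"] for f in findings)
    return {
        "target": target,
        "scan_time": scan_start,
        "scan_duration_seconds": int((end - start).total_seconds()),
        "summary": {
            "pages_crawled": pages_crawled,
            "forms_tested": forms_tested,
            "payloads_tested": payloads_tested,
            "potential_findings": len(findings),
            # Always emit every severity so consumers can rely on the keys.
            "by_severity": {s: counts.get(s, 0) for s in SEVERITIES},
        },
        "findings": findings,
    }


def write_report(report: dict, output_path: str) -> None:
    with open(output_path, "w", encoding="utf-8") as fh:
        json.dump(report, fh, indent=2)
```

Emitting zero counts for unused severities keeps the schema stable, which simplifies validating the output against the contract.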
Deliverables:
- `core/reporter.py`: JSON file writer + Rich console table
- Verbose request/response logging
Exit Criteria:
- JSON report validates against the contract schema
- Console output is readable, color-coded, and includes all findings
- Verbose mode logs every request/response pair with timing
References: R1, R6
Goals: Wire everything together in scanner.py and implement the full scan pipeline.
Work Packages:
- Implement `scanner.py`. Full argument parser and orchestration:

  ```text
  usage: scanner.py [-h] --url URL [--max-depth N] [--timeout SECS]
                    [--delay SECS] [--output FILE] [--verbose]
                    [--content-type {both,urlencoded,json}]
                    [--endpoints ENDPOINT [ENDPOINT ...]]
                    [--categories CATEGORY [CATEGORY ...]]
  ```

  Arguments:
  - `--url` (required): Target base URL
  - `--max-depth` (default: 2): Crawl depth limit
  - `--timeout` (default: 5): Per-request timeout in seconds
  - `--delay` (default: 0.1): Delay between injection requests in seconds
  - `--output` (default: `report.json`): Output report file path
  - `--verbose` / `-v`: Enable verbose request/response logging
  - `--content-type` (default: `both`): Which content types to test (`both`, `urlencoded`, `json`)
  - `--endpoints`: Manual endpoint override. Skip crawling and inject directly into these endpoints (e.g., `POST:/api/vuln/login:username,password`)
  - `--categories`: Filter payload categories (e.g., `operator_injection time_based`)
- Scan pipeline in `main()`:

  ```python
  import sys
  from datetime import datetime, timezone

  import requests

  # parse_args, parse_manual_endpoints, and deduplicate_findings are assumed to be
  # defined in scanner.py or imported from the core package with the other helpers.


  def main():
      args = parse_args()
      logger = setup_logging(args.verbose)
      print_banner()
      start_time = datetime.now(timezone.utc).isoformat()

      # Phase 1: Validate
      base_url = validate_url(args.url)
      reachable, status, rtt = check_reachability(base_url, args.timeout)
      if not reachable:
          logger.error(f"Target unreachable: {base_url}")
          sys.exit(1)

      # Phase 2: Load payloads
      payloads = load_payloads("payloads.json")
      if args.categories:
          payloads = {k: v for k, v in payloads.items() if k in args.categories}

      # Phase 3: Discover forms
      session = requests.Session()
      pages = []  # stays empty when crawling is skipped via --endpoints
      if args.endpoints:
          forms = parse_manual_endpoints(args.endpoints)
      else:
          crawler = Crawler(base_url, args.max_depth, args.timeout, session)
          pages = crawler.crawl()
          forms = []
          for page in pages:
              forms.extend(parse_forms(page.html, page.url))

      # Phase 4: Inject and analyze
      injector = Injector(session, args.timeout, args.delay, args.verbose)
      analyzer_findings = []
      total_injections = 0
      all_payloads = get_all_payloads(payloads)
      for form in forms:
          baseline = injector.establish_baseline(form)
          results = injector.inject_form(form, all_payloads)
          total_injections += len(results)
          response_analyzer = ResponseAnalyzer(baseline)
          for result in results:
              finding = response_analyzer.analyze(result)
              if finding:
                  analyzer_findings.append(finding)

      # Phase 5: Report
      scan_result = ScanResult(
          target=base_url,
          scan_start=start_time,
          scan_end=datetime.now(timezone.utc).isoformat(),
          pages_crawled=len(pages),
          forms_discovered=len(forms),
          payloads_tested=total_injections,
          findings=deduplicate_findings(analyzer_findings),
      )
      generate_json_report(scan_result, args.output)
      print_console_summary(scan_result)
  ```
- Error handling wrapper: Wrap the entire pipeline in try/except. Handle `KeyboardInterrupt` (print partial results), `ConnectionError` (retry once then fail), and unexpected errors (log traceback, exit 1).
- Exit codes: 0 = success (scan completed, findings or not), 1 = scan error (unreachable target, etc.), 2 = argument error.
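The `--endpoints` format (`METHOD:/path:field1,field2`) needs a small parser before the injector can use it. The sketch below is one possible shape, with assumptions labeled: it takes an explicit `base_url` argument (the pipeline above calls `parse_manual_endpoints` with only the spec list), accepts only GET and POST, treats every field as type `text`, and returns dicts shaped like `FormTarget`.

```python
from urllib.parse import urljoin


def parse_manual_endpoint(spec: str, base_url: str) -> dict:
    """Turn 'POST:/api/vuln/login:username,password' into a form-target dict."""
    # Split the method off the front and the field list off the back, so a
    # colon inside the path does not break parsing.
    method, _, rest = spec.partition(":")
    path, _, field_csv = rest.rpartition(":")
    method = method.upper()
    names = [n.strip() for n in field_csv.split(",") if n.strip()]
    if method not in ("GET", "POST") or not path.startswith("/") or not names:
        raise ValueError(f"expected METHOD:/path:field1,field2 but got {spec!r}")
    return {
        "url": urljoin(base_url, path),
        "method": method,
        "fields": [{"name": n, "type": "text"} for n in names],
        "page_url": base_url,  # no source page exists in manual mode
    }


def parse_manual_endpoints(specs: list[str], base_url: str) -> list[dict]:
    return [parse_manual_endpoint(s, base_url) for s in specs]
```

Raising `ValueError` on a malformed spec lets the CLI layer translate it into exit code 2, consistent with the argument-error convention.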
Deliverables:
- Complete `scanner.py` with argument parsing and pipeline orchestration
- Manual endpoint mode (`--endpoints`)
- Error handling and exit codes
Exit Criteria:
- `python scanner.py --help` prints full usage
- `python scanner.py --url http://localhost:3000 --output report.json` completes a full scan and produces a valid report
- `python scanner.py --url http://localhost:3000 --endpoints "POST:/api/vuln/login:username,password" --verbose` scans only the login endpoint
- Scanner exits with code 2 on bad arguments, code 1 on connection failure
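The argument list for this phase maps onto `argparse` fairly directly, and `argparse` already exits with status 2 on invalid arguments, which matches the exit-code convention. The sketch below is one way to declare the parser; `build_parser` is a hypothetical name, the help strings are illustrative, and `--timeout` is typed as `float` on the assumption that fractional values such as `0.001` should be accepted.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="scanner.py",
        description="NoSQL injection scanner for MongoDB-backed web applications",
    )
    parser.add_argument("--url", required=True, help="Target base URL")
    parser.add_argument("--max-depth", type=int, default=2, metavar="N",
                        help="Crawl depth limit")
    parser.add_argument("--timeout", type=float, default=5, metavar="SECS",
                        help="Per-request timeout in seconds")
    parser.add_argument("--delay", type=float, default=0.1, metavar="SECS",
                        help="Delay between injection requests in seconds")
    parser.add_argument("--output", default="report.json", metavar="FILE",
                        help="Output report file path")
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="Log every request/response pair")
    parser.add_argument("--content-type", choices=["both", "urlencoded", "json"],
                        default="both", help="Which content types to test")
    parser.add_argument("--endpoints", nargs="+", metavar="ENDPOINT",
                        help="Skip crawling, e.g. POST:/api/vuln/login:username,password")
    parser.add_argument("--categories", nargs="+", metavar="CATEGORY",
                        help="Restrict payload categories")
    return parser
```

Dashes in option names become underscores on the parsed namespace, so `--max-depth` is read as `args.max_depth` and `--content-type` as `args.content_type`.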
References: R1, R4
Goals: Validate the scanner produces correct results against both vulnerable and secure flows.
Work Packages:
-
Validate against vulnerable flow:
- Run full scan against
http://localhost:3000(vulnerable mode endpoints) - Confirm the report contains:
- At least one critical finding for auth bypass on
/api/vuln/login - At least one high finding for search injection on
/api/vuln/search
- At least one critical finding for auth bypass on
- Verify payloads that triggered include:
$ne,$gt,$regexoperator injections - Verify evidence descriptions are clear and actionable
- Run full scan against
-
Validate against secure flow:
- Run scan with
--endpoints "POST:/api/secure/login:username,password" "POST:/api/secure/search:query" - Confirm the report contains zero findings (secure endpoints reject all operator payloads with 400)
- Verify no false positives from timing noise or status code coincidences
- Run scan with
- Edge case testing:
  - Test against unreachable URL → clean error message, exit code 1
  - Test with `--timeout 0.001` → timeout handling works gracefully
  - Test with empty `payloads.json` → scanner reports no payloads to test
  - Test with `--max-depth 0` → scanner tests only the root page
  - Test with malformed HTML target → parser handles gracefully
- Performance validation:
  - Full scan completes in under 60 seconds
  - Memory usage stays under 100MB
Deliverables:
- Validated scan reports for both vulnerable and secure flows
- Edge case test results documented
- Performance benchmarks recorded
Exit Criteria:
- Zero false negatives against the vulnerable flow (all known vulnerabilities detected)
- Zero false positives against the secure flow
- All edge cases handled gracefully without crashes
References: R1, R2, R5
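The vulnerable-flow and secure-flow checks above could be automated against the JSON report. The sketch below assumes each finding is an object with `endpoint` and `severity` keys under a top-level `findings` array; that shape is an assumption for illustration, and the real contract is whatever the README defines.

```python
def has_finding(report: dict, endpoint: str, severity: str) -> bool:
    """True if the report holds a finding of this severity on this endpoint."""
    return any(
        f.get("severity") == severity and f.get("endpoint", "").endswith(endpoint)
        for f in report.get("findings", [])
    )


def check_vulnerable_report(report: dict) -> list[str]:
    """Return the unmet expectations for the vulnerable flow (empty = pass)."""
    expected = [("/api/vuln/login", "critical"), ("/api/vuln/search", "high")]
    return [
        f"missing {severity} finding on {endpoint}"
        for endpoint, severity in expected
        if not has_finding(report, endpoint, severity)
    ]


def check_secure_report(report: dict) -> list[str]:
    """Return every finding on the secure flow; any entry is a false positive."""
    return [
        f"unexpected {f.get('severity')} finding on {f.get('endpoint')}"
        for f in report.get("findings", [])
    ]
```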
Goals: Polish the tool for release, finalize documentation, and ensure reproducible builds.
Work Packages:
- Finalize `requirements.txt` with exact pinned versions from the working venv (`pip freeze`)
- Update `README.md` with:
  - Installation instructions (venv setup, pip install)
  - Complete usage examples (basic scan, verbose mode, manual endpoints, category filtering)
  - Report output description with annotated example
  - Troubleshooting section (common errors and solutions)
- Add inline docstrings to all public functions and classes
- Add `--version` flag to the CLI
- Verify clean install: delete venv, recreate, install deps, run scan; everything works from scratch
- Create example reports: Include a sample `example_report.json` in the repo showing expected output
Deliverables:
- Pinned `requirements.txt`
- Updated `README.md` with full documentation
- All modules documented with docstrings
- Example report file
- Clean install verified
Exit Criteria:
- `pip install -r requirements.txt && python scanner.py --help` works from a fresh venv
- README covers all CLI options and includes worked examples
- Example report demonstrates the output format
References: R1, R6
- Accidental scanning of production systems — user mistypes URL or forgets scope restrictions; could trigger WAF alerts or legal issues
- Credential leakage in reports — session cookies or leaked credentials from successful bypasses appear in JSON reports
- Payload self-injection — if scanner processes its own output or error messages contain payloads, could cause confusion
- Dependency supply chain — compromised PyPI packages in the dependency chain
- Excessive resource consumption — unbounded crawling or aggressive request rates could DoS the target
- Legal disclaimer: Scanner prints a warning on startup: "This tool should only be used on systems you own or have explicit authorization to test."
- Localhost default scope: When targeting non-localhost origins, require explicit `--confirm-external` flag
- Report sanitization: Truncate response bodies in reports to 500 characters; strip `Set-Cookie` headers from stored responses
- Dependency pinning: All dependencies pinned to specific versions in `requirements.txt`
- Request rate limiting: Default 100ms delay between requests (`--delay` flag), with minimum floor of 50ms
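Three of these controls are small enough to sketch directly. The helper names below (`check_scope`, `sanitize_response`, `effective_delay`) and the exact set of hosts treated as local are assumptions for illustration; the limits themselves (500 characters, 50ms floor) come from the list above.

```python
from urllib.parse import urlparse

LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}  # assumption: what counts as local
MAX_BODY_CHARS = 500
MIN_DELAY_SECONDS = 0.05


def is_local_target(url: str) -> bool:
    """True if the URL's host is one of the known loopback names."""
    return urlparse(url).hostname in LOCAL_HOSTS


def check_scope(url: str, confirm_external: bool = False) -> None:
    """Refuse non-localhost targets unless the user explicitly opted in."""
    if not is_local_target(url) and not confirm_external:
        raise PermissionError(
            f"{url} is not a localhost target; pass --confirm-external to scan it"
        )


def sanitize_response(headers: dict, body: str) -> tuple[dict, str]:
    """Drop Set-Cookie headers and truncate the body before storing it."""
    clean = {k: v for k, v in headers.items() if k.lower() != "set-cookie"}
    return clean, body[:MAX_BODY_CHARS]


def effective_delay(requested: float) -> float:
    """Apply the 50ms floor to the user-supplied --delay value."""
    return max(requested, MIN_DELAY_SECONDS)
```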
- Run `pip-audit` or `safety check` against installed dependencies before release
- Manual review of all payloads in `payloads.json` to ensure they are detection-only (no destructive operations like `db.dropDatabase()`)
- Verify scanner does not follow redirects to external domains during crawling
- Request timeout — controls how long each injection attempt waits; directly impacts total scan time
- Inter-request delay — prevents overwhelming the target; trades scan speed for target stability
- Crawl depth — limits the number of pages discovered; deeper crawls find more forms but take longer
- Payload count — more payloads = more thorough coverage but longer scans
- Content-type testing — testing both urlencoded and JSON doubles the request count
| Setting | Default | Rationale |
|---|---|---|
| `--timeout` | 5s | Sufficient for localhost; prevents hanging on unresponsive endpoints |
| `--delay` | 0.1s | 100ms is respectful to target while keeping scan time reasonable |
| `--max-depth` | 2 | Captures most forms without deep-crawling large sites |
| `--content-type` | `both` | Maximizes detection coverage; necessary since Express handles both formats |
| Time-based threshold | 2500ms | Sleep payloads use 3000ms; 2500ms threshold accounts for network jitter |
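The time-based threshold in the last row can be read as a simple comparison. The sketch below assumes the check is applied to the extra latency over the form's baseline rather than to the raw response time; the plan does not say which, and `is_time_anomaly` is a hypothetical name.

```python
SLEEP_PAYLOAD_MS = 3000   # delay requested by the sleep payloads
TIME_THRESHOLD_MS = 2500  # detection threshold from the table above


def is_time_anomaly(elapsed_ms: float, baseline_ms: float,
                    threshold_ms: float = TIME_THRESHOLD_MS) -> bool:
    """Flag a response whose latency exceeds the baseline by the threshold."""
    return (elapsed_ms - baseline_ms) >= threshold_ms
```

With a 40ms baseline, a 3050ms response is flagged, while a 900ms response caused by ordinary jitter is not.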
- Per-request timeout: Hard timeout of `--timeout` seconds on every HTTP request; `requests.Timeout` caught and logged
- Max pages cap: Crawler stops after 50 pages regardless of depth (prevents infinite crawling on large sites)
- Max payloads per form: Warn if more than 100 payloads are being tested per form (likely misconfiguration)
- Response body cap: Read at most 1MB of response body per request (prevents memory issues on large responses)
- Graceful degradation: If a form fails baseline capture (e.g., requires auth), skip it with a warning instead of aborting the entire scan
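One way to enforce the response body cap is to consume the body in chunks and stop once the limit is reached. `read_capped` is a hypothetical helper; it accepts any iterable of byte chunks so it can be exercised without a network connection.

```python
MAX_BODY_BYTES = 1024 * 1024  # 1MB cap from the list above


def read_capped(chunks, limit: int = MAX_BODY_BYTES) -> bytes:
    """Concatenate byte chunks, stopping as soon as `limit` bytes are held."""
    buf = bytearray()
    for chunk in chunks:
        # Take only as much of this chunk as still fits under the limit.
        buf.extend(chunk[: limit - len(buf)])
        if len(buf) >= limit:
            break  # stop pulling further chunks from the source
    return bytes(buf)
```

With `requests`, the chunks would typically come from a streamed response, for example `read_capped(resp.iter_content(chunk_size=8192))` on a response obtained with `stream=True`, so the remainder of an oversized body is never downloaded into memory.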
| Key | Type | Default | Description |
|---|---|---|---|
| `--url` | string | (required) | Target base URL |
| `--max-depth` | int | 2 | Maximum crawl depth |
| `--timeout` | float | 5.0 | Per-request timeout (seconds) |
| `--delay` | float | 0.1 | Inter-request delay (seconds) |
| `--output` | string | `report.json` | Output report file path |
| `--verbose` | bool | false | Enable detailed request/response logging |
| `--content-type` | enum | `both` | Content types to test: both, urlencoded, json |
| `--endpoints` | list[str] | [] | Manual endpoint specifications (skip crawling) |
| `--categories` | list[str] | all | Payload categories to use |
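The table maps almost one-to-one onto an `argparse` parser. This is a sketch rather than the final `parse_args()`; representing "all categories" as `None` is an assumption chosen to match the `if args.categories:` check in the pipeline.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the configuration table."""
    p = argparse.ArgumentParser(
        prog="scanner.py", description="NoSQL injection CLI scanner"
    )
    p.add_argument("--url", required=True, help="Target base URL")
    p.add_argument("--max-depth", type=int, default=2, help="Maximum crawl depth")
    p.add_argument("--timeout", type=float, default=5.0,
                   help="Per-request timeout (seconds)")
    p.add_argument("--delay", type=float, default=0.1,
                   help="Inter-request delay (seconds)")
    p.add_argument("--output", default="report.json",
                   help="Output report file path")
    p.add_argument("--verbose", action="store_true",
                   help="Enable detailed request/response logging")
    p.add_argument("--content-type", choices=["both", "urlencoded", "json"],
                   default="both", help="Content types to test")
    p.add_argument("--endpoints", nargs="*", default=[],
                   help="Manual endpoint specifications (skip crawling)")
    # Assumption: None means "use every payload category".
    p.add_argument("--categories", nargs="*", default=None,
                   help="Payload categories to use")
    return p
```

`argparse` exposes `--max-depth` as `args.max_depth` and `--content-type` as `args.content_type`, and it exits with status 2 on invalid input, which matches the argument-error exit code.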
No secrets are required by the scanner itself. If the target requires authentication before scanning protected endpoints, the user should:
- Use `--endpoints` mode with manually specified endpoints
- Or pre-authenticate via browser and export cookies to a file (future V2 feature)
No credential storage mechanism is needed for V1.
- All code changes via pull request to `main`
- PR checks:
  - Python syntax validation (`python -m py_compile` on all `.py` files)
  - `pip install -r requirements.txt` succeeds
  - Basic smoke test: `python scanner.py --help` exits 0
- Full integration test against the vulnerable target app
- Dependency audit (`pip-audit`)
- Clean install test from fresh venv
- Report output validation against schema
- `main` branch is protected
- Feature branches named `feature/<description>`
- At least one reviewer approval required before merge
- Squash merges preferred for clean history
Risk 1 (Technical): The target app's forms use JavaScript-based submission (fetch API), so standard HTML form parsing may miss the actual API endpoints.
Mitigation: Implement the `--endpoints` manual override mode; additionally, add JS file scanning to extract fetch/XHR URLs as a best-effort enhancement.
Risk 2 (Security): A user accidentally runs the scanner against a production system they don't own, causing legal liability.
Mitigation: Print a mandatory legal disclaimer on startup; require the `--confirm-external` flag for non-localhost targets.
Risk 3 (Operational): The target app's cookie-based session management may cause false positives if a bypass payload sets a session that affects subsequent requests.
Mitigation: Reset the session (clear cookies) between payload groups; establish a fresh baseline before each form's injection round.
Risk 4 (UX/Adoption): False positives erode user trust and make the tool unreliable for real assessments.
Mitigation: Multi-factor heuristic scoring (require both a status code change AND a body content indicator for "confirmed" confidence); clear confidence labels (confirmed/tentative/speculative) in reports.
Risk 5 (Technical): Time-based blind detection is inherently noisy because network latency can cause false positives and false negatives.
Mitigation: Use a conservative threshold of roughly payload delay × 0.8 (2500ms for the 3000ms sleep payloads); run time-based payloads last; recommend local-network testing for reliable timing results.
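The multi-factor scoring from Risk 4 could look something like the sketch below. Only the rule for "confirmed" (status code change and body indicator together) comes from the plan; how single signals map to "tentative" and "speculative" is an assumption made for illustration, as is the `confidence_label` name.

```python
from typing import Optional


def confidence_label(status_changed: bool, body_indicator: bool,
                     timing_anomaly: bool = False) -> Optional[str]:
    """Map the observed signals for one injection result to a confidence label."""
    if status_changed and body_indicator:
        return "confirmed"      # both independent signals agree (from the plan)
    if status_changed or body_indicator:
        return "tentative"      # assumption: one strong signal on its own
    if timing_anomaly:
        return "speculative"    # assumption: timing alone is the weakest evidence
    return None                 # nothing observed, so no finding
```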
| Week | Phases | Description |
|---|---|---|
| Week 1 | Phase 0 + Phase 1 | Environment setup, models, payloads, utilities |
| Week 2 | Phase 2 + Phase 3 | URL validation, crawler, form parser |
| Week 3 | Phase 4 | Injection engine with dual content-type support |
| Week 4 | Phase 5 + Phase 6 | Response analyzer, severity scoring, report generation |
| Week 5 | Phase 7 | CLI entry point, pipeline integration |
| Week 6 | Phase 8 + Phase 9 | Testing, validation, documentation, release |
- `python scanner.py --url http://localhost:3000 --output report.json` completes without errors
- JSON report correctly identifies auth bypass on `/api/vuln/login` as critical
- JSON report correctly identifies search injection on `/api/vuln/search` as high
- Scanner produces zero false positives against `/api/secure/login` and `/api/secure/search`
- `payloads.json` contains 20+ categorized payloads derived from wordlist and reference materials
- Console output is color-coded with rich formatting and includes progress indicators
- `--verbose` mode logs every request/response pair
- `--endpoints` manual mode works for targeted scanning without crawling
- All errors (timeout, connection, malformed HTML) are handled gracefully
- `README.md` contains complete usage documentation with examples
- `requirements.txt` contains pinned dependencies that install cleanly
- Report output matches the JSON contract defined in the README
| ID | Title | Location |
|---|---|---|
| R1 | Project README — CLI Scanner | README.md |
| R2 | NoSQL Injection Reference Guide | NoSQL Injection.md |
| R3 | NoSQL Injection Wordlist | nosqli_worlist.txt |
| R4 | Vulnerable Target — Server Config | server.js |
| R5 | Mode Auth Flow Test Results | MODE_AUTH_FLOW_TEST_RESULTS.md |
| R6 | Project Tasks Checklist | TASKS.md |
| R7 | NoSQL Testing Guide | NOSQLI_TESTING_GUIDE.md |
| R8 | Vulnerable Auth Controller | vulnAuth.js |
| R9 | Vulnerable Search Controller | vulnSearch.js |
| R10 | Vulnerable Routes | vulnRoutes.js |
- Create the `cli/core/` package: `mkdir cli/core && touch cli/core/__init__.py` to establish the module structure.
- Write `core/models.py`: Define all dataclasses (`FormTarget`, `Payload`, `Finding`, `ScanResult`, `BaselineResponse`, `InjectionResult`). This is the zero-dependency starting point that all other modules import.
- Populate `payloads.json`: Curate the full payload catalog from the wordlist and reference guide. Validate it loads correctly with `json.load()`.
- Write `core/utils.py`: Implement logging setup, URL helpers, and all constant lists (error signatures, success/failure indicators). Import and test from a REPL.
- Write `requirements.txt`: Add `requests>=2.32`, `beautifulsoup4>=4.12`, `rich>=13.0`, `urllib3>=2.0` and run `pip install -r requirements.txt` to confirm clean installation.
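As a starting point for `core/models.py`, two of the dataclasses might be sketched as follows. The `ScanResult` field names follow the keyword arguments used in the `main()` pipeline; the `Finding` fields and the `to_json` helper are hypothetical placeholders until the report contract is settled.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Finding:
    """Hypothetical field set; the real one follows the README's JSON contract."""
    endpoint: str
    severity: str
    payload: str
    evidence: str


@dataclass
class ScanResult:
    """Field names mirror the ScanResult(...) call in the pipeline."""
    target: str
    scan_start: str
    scan_end: str
    pages_crawled: int
    forms_discovered: int
    payloads_tested: int
    findings: list[Finding] = field(default_factory=list)


def to_json(result: ScanResult) -> str:
    """Serialize a scan result, including nested findings, to a JSON string."""
    return json.dumps(asdict(result), indent=2)
```

Because these models use only the standard library, they can be imported and round-tripped through `json` from a REPL before any other module exists.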