ec521-null-pointers/dynamic-analysis

NPM Dynamic Analysis Tool

Dynamic analysis sandbox for detecting malicious behavior in npm packages. This tool executes npm packages in a hardened Docker container, traces system calls (strace), captures network traffic (tcpdump), and analyzes the results for indicators of compromise (IOCs). It also includes a Machine Learning classifier to assign risk scores to analyzed packages.

Project Structure

.
├── src/                      # Core source code
│   ├── orchestrator.py       # Main analysis engine (Docker + Strace + PCAP)
│   ├── ml_analyzer.py        # Machine Learning model (Random Forest)
│   ├── Dockerfile            # Sandbox environment definition
│   ├── run_analysis.sh       # Script running inside the container
│   └── preload.js            # Node.js hooks for runtime analysis
├── scripts/                  # Helper scripts for batch processing
│   ├── analyze_list.py       # Parallel analysis of a package list (Main Entrypoint)
│   ├── dataset_runner.py     # Recursive dataset analyzer
│   ├── run_local_analysis.py # Analyze a single local zip file
│   └── batch_ml_analyzer.py  # Run ML on existing JSON logs
├── data/                     # Datasets and input lists
├── logs/                     # Generated JSON analysis reports
├── output/                   # Temporary output (PCAP files)
└── models/                   # Trained ML models

Features

  • System Call Tracing: Monitors file access, process execution, and environment variable reads.
  • Network Analysis: Captures and analyzes PCAP files for suspicious connections, DNS queries, and HTTP payloads.
  • Heuristic Detection:
    • Reverse Shells: Detects socket -> dup2 -> execve chains.
    • Persistence: Flags write attempts to .bashrc, /etc/cron, etc.
    • Evasion: Detects sleep calls used to delay execution.
    • Data Exfiltration: Scans HTTP payloads for sensitive keywords (AWS keys, tokens).
  • ML Risk Scoring: Random Forest classifier trained to predict risk levels (LOW, MEDIUM, HIGH) based on behavioral features.
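As an illustration of the reverse-shell heuristic listed above, the following is a minimal sketch of how a socket -> dup2 -> execve chain could be spotted in strace output. It is not the logic in src/orchestrator.py; the function name find_reverse_shells, the line regex, and the assumption that each line starts with a PID (as with strace -f) are all hypothetical. It also ignores descriptors inherited across fork, which a real detector would need to track.

```python
import re

# Matches lines such as: 101 dup2(3, 0) = 0
# The leading PID is optional; the layout is assumed to follow `strace -f`.
LINE = re.compile(
    r"^(?:(?P<pid>\d+)\s+)?(?P<call>\w+)\((?P<args>.*)\)\s+=\s+(?P<ret>-?\d+)"
)


def find_reverse_shells(lines):
    """Return (pid, program) pairs where a socket was duplicated onto
    stdin/stdout/stderr before the same process called execve."""
    sockets = {}     # pid -> fds returned by socket()
    redirected = {}  # pid -> standard fds that now point at a socket
    hits = []
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue
        pid = match.group("pid") or "0"
        call, args = match.group("call"), match.group("args")
        ret = int(match.group("ret"))
        if ret < 0:
            continue  # failed syscalls cannot be part of the chain
        if call == "socket":
            sockets.setdefault(pid, set()).add(ret)
        elif call in ("dup2", "dup3"):
            parts = [p.strip() for p in args.split(",")]
            if len(parts) >= 2 and parts[0].isdigit() and parts[1].isdigit():
                old_fd, new_fd = int(parts[0]), int(parts[1])
                if old_fd in sockets.get(pid, ()) and new_fd in (0, 1, 2):
                    redirected.setdefault(pid, set()).add(new_fd)
        elif call == "execve" and redirected.get(pid):
            program = args.split(",")[0].strip().strip('"')
            hits.append((pid, program))
    return hits
```

Calling find_reverse_shells(open("trace.log")) on a saved trace (the file name is only an example) yields one tuple per suspicious execve.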

Prerequisites

  • Docker: Must be installed and running.
  • Python 3: With scapy, colorama, scikit-learn, and numpy installed.
  • Linux Environment: Required for strace and tcpdump compatibility within the container.

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd dynamic-analysis
  2. Install Python dependencies:

    pip install -r requirements.txt

Usage

1. Analyze a List of Packages (Recommended)

To analyze a specific list of packages (e.g., from a text file) in parallel and automatically generate ML risk scores:

python3 scripts/analyze_list.py --workers 4
  • Input: Reads package names from data/malicious-recent-analysis.txt.
  • Process: Finds corresponding zip files in data/malicious-software-packages-dataset, runs them in the sandbox, and generates reports.
  • Output:
    • JSON reports in logs/.
    • PCAP files in output/.
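The parallel run described above can be pictured as a worker pool that maps one analysis function over the package list. This is a sketch only, not the code in scripts/analyze_list.py: analyze_all and the analyze_one callback are hypothetical names, and a thread pool is assumed because each worker would mostly wait on a Docker container.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def analyze_all(packages, analyze_one, workers=4):
    """Run analyze_one(name) for every package using a pool of workers.

    A failure in one package is recorded and does not stop the others.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(analyze_one, name): name for name in packages}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:  # keep going if one sandbox run fails
                results[name] = {"error": str(exc)}
    return results
```

The workers argument plays the role of the --workers flag; a higher value starts more sandboxes at once, so it should be sized against available memory.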

2. Analyze a Single Package

To analyze a single package by name (downloaded from the npm registry) or from a local path:

# From npm registry
python3 src/orchestrator.py express

# Local directory
python3 src/orchestrator.py ./path/to/package

3. Analyze a Local Zip File

To analyze a specific zip file (handling extraction automatically):

python3 scripts/run_local_analysis.py path/to/package.zip
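Because the archives being analyzed may be hostile, the extraction step is worth doing defensively. The sketch below shows one way to unpack a zip into a temporary directory while rejecting entries that would land outside it. It is an assumption about what such a step could look like, not the behavior of scripts/run_local_analysis.py; extract_package is a hypothetical name.

```python
import os
import tempfile
import zipfile


def extract_package(zip_path, dest=None):
    """Extract zip_path into dest (a fresh temp dir by default) and return it.

    Raises ValueError if any entry would resolve outside the destination.
    """
    dest = os.path.realpath(dest or tempfile.mkdtemp(prefix="pkg_"))
    with zipfile.ZipFile(zip_path) as archive:
        for member in archive.namelist():
            target = os.path.realpath(os.path.join(dest, member))
            # Every entry must resolve to a path under the destination.
            if os.path.commonpath([dest, target]) != dest:
                raise ValueError(f"unsafe path in archive: {member}")
        archive.extractall(dest)
    return dest
```

The returned directory can then be handed to the sandbox in the same way as a local package path.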

Machine Learning Module

The tool uses a Random Forest classifier (src/ml_analyzer.py) to assess risk.

Training the Model

To train the model, you need two directories: one containing JSON reports of benign packages and one for malicious packages.

python3 src/ml_analyzer.py --train --benign-dir ./data/benign_reports --malicious-dir ./data/malicious_reports

This saves the model to models/risk_model.pkl.
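To make the training step more concrete, here is a rough sketch of how two report directories could be turned into a labeled dataset and fitted with scikit-learn's RandomForestClassifier. The report keys in FEATURE_KEYS and the helper names are hypothetical, since the JSON schema and the real feature set of src/ml_analyzer.py are not described here; only the classifier type and the output path come from this README.

```python
import json
import pickle
from pathlib import Path

# Hypothetical report keys; substitute the fields the reports actually contain.
FEATURE_KEYS = ("file_writes", "network_connections", "processes_spawned")


def extract_features(report):
    """Turn one JSON report into a fixed-length list of event counts."""
    return [len(report.get(key, [])) for key in FEATURE_KEYS]


def load_dataset(benign_dir, malicious_dir):
    """Label benign reports 0 and malicious reports 1."""
    features, labels = [], []
    for label, directory in ((0, benign_dir), (1, malicious_dir)):
        for path in sorted(Path(directory).glob("*.json")):
            with open(path) as handle:
                features.append(extract_features(json.load(handle)))
            labels.append(label)
    return features, labels


def train(benign_dir, malicious_dir, model_path="models/risk_model.pkl"):
    # Imported lazily so the helpers above work without scikit-learn.
    from sklearn.ensemble import RandomForestClassifier

    features, labels = load_dataset(benign_dir, malicious_dir)
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(features, labels)
    Path(model_path).parent.mkdir(parents=True, exist_ok=True)
    with open(model_path, "wb") as handle:
        pickle.dump(model, handle)
    return model
```

Count-based features are used here only because they are easy to derive from any report; the choice of 100 trees and a fixed random seed is likewise illustrative.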

Running ML on Existing Logs

If you have already run the dynamic analysis and have a folder full of JSON reports in logs/, you can generate a CSV summary without re-running the sandbox:

python3 scripts/batch_ml_analyzer.py

Output CSV Format:

Filename     Package Name   Risk Level  Malicious Probability  High Threshold  Medium Threshold
report.json  malicious-pkg  HIGH        0.9200                 0.8500          0.4500
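The relationship between the probability and the two threshold columns can be sketched as follows. This assumes the risk level is chosen by comparing the malicious probability against the high and medium thresholds with inclusive comparisons; the function names are hypothetical and the actual rule in scripts/batch_ml_analyzer.py may differ.

```python
def risk_level(probability, high, medium):
    """Map a malicious probability to LOW, MEDIUM, or HIGH."""
    if probability >= high:
        return "HIGH"
    if probability >= medium:
        return "MEDIUM"
    return "LOW"


def summary_row(filename, package, probability, high, medium):
    """Build one CSV row in the column order shown above."""
    return [
        filename,
        package,
        risk_level(probability, high, medium),
        f"{probability:.4f}",
        f"{high:.4f}",
        f"{medium:.4f}",
    ]
```

With the sample values above, a probability of 0.92 against a high threshold of 0.85 is classified as HIGH.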

Configuration

Key settings can be modified in src/orchestrator.py:

  • DOCKER_TIMEOUT: Max runtime for analysis (default: 600s).
  • MAX_MEMORY: Memory limit for the sandbox (default: 4g).
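As a rough picture of how these two settings might be applied, the sketch below builds a docker run command with a memory cap and enforces the timeout from Python. The image name, the function names, and the exact set of Docker flags are assumptions for illustration; only the two constants and their defaults come from this README.

```python
import subprocess

DOCKER_TIMEOUT = 600  # seconds
MAX_MEMORY = "4g"


def build_docker_cmd(image, package, memory=MAX_MEMORY):
    """Assemble a docker run command with a memory limit."""
    return ["docker", "run", "--rm", "--memory", memory, image, package]


def run_sandbox(image, package, timeout=DOCKER_TIMEOUT):
    """Run the sandbox; return the exit code, or None on timeout."""
    try:
        proc = subprocess.run(
            build_docker_cmd(image, package),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Only the docker client process is killed here; the container
        # itself may need to be stopped separately.
        return None
    return proc.returncode
```

For example, run_sandbox("npm-sandbox", "express") would start a container from a hypothetical image named npm-sandbox and give up after DOCKER_TIMEOUT seconds.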
