NPM Dynamic Analysis Tool

Dynamic analysis sandbox for detecting malicious behavior in npm packages. This tool executes npm packages in a hardened Docker container, traces system calls (strace), captures network traffic (tcpdump), and analyzes the results for indicators of compromise (IOCs). It also includes a Machine Learning classifier to assign risk scores to analyzed packages.

Project Structure

.
├── src/                      # Core source code
│   ├── orchestrator.py       # Main analysis engine (Docker + Strace + PCAP)
│   ├── ml_analyzer.py        # Machine Learning model (Random Forest)
│   ├── Dockerfile            # Sandbox environment definition
│   ├── run_analysis.sh       # Script running inside the container
│   └── preload.js            # Node.js hooks for runtime analysis
├── scripts/                  # Helper scripts for batch processing
│   ├── analyze_list.py       # Parallel analysis of a package list (Main Entrypoint)
│   ├── dataset_runner.py     # Recursive dataset analyzer
│   ├── run_local_analysis.py # Analyze a single local zip file
│   └── batch_ml_analyzer.py  # Run ML on existing JSON logs
├── data/                     # Datasets and input lists
├── logs/                     # Generated JSON analysis reports
├── output/                   # Temporary output (PCAP files)
└── models/                   # Trained ML models

Features

System Call Tracing: Monitors file access, process execution, and environment variable reads.
Network Analysis: Captures and analyzes PCAP files for suspicious connections, DNS queries, and HTTP payloads.
Heuristic Detection:
- Reverse Shells: Detects socket -> dup2 -> execve chains.
- Persistence: Flags write attempts to .bashrc, /etc/cron, etc.
- Evasion: Detects sleep calls used to delay execution.
- Data Exfiltration: Scans HTTP payloads for sensitive keywords (AWS keys, tokens).
ML Risk Scoring: Random Forest classifier trained to predict risk levels (LOW, MEDIUM, HIGH) based on behavioral features.
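
The reverse shell heuristic above can be illustrated with a short sketch. It is not the code in src/orchestrator.py. It assumes input in the style of `strace -f` output (one syscall per line, prefixed by the PID), and the function name `find_reverse_shell_pids` is hypothetical. It flags a process that creates a socket, duplicates it onto stdin, stdout, or stderr with dup2, and then calls execve.

```python
import re
from collections import defaultdict

# Assumed line shape: "<pid> <syscall>(<args>) = <return value>"
SYSCALL_RE = re.compile(
    r"^(?P<pid>\d+)\s+(?P<name>\w+)\((?P<args>.*)\)\s+=\s+(?P<ret>-?\d+)"
)


def find_reverse_shell_pids(strace_lines):
    """Return PIDs showing a socket -> dup2 -> execve chain, in order of detection."""
    sockets = defaultdict(set)  # pid -> file descriptors returned by socket()
    redirected = set()          # pids that dup2'd a socket onto fd 0, 1, or 2
    flagged = []
    for line in strace_lines:
        match = SYSCALL_RE.match(line)
        if not match:
            continue
        pid, name = int(match["pid"]), match["name"]
        args, ret = match["args"], int(match["ret"])
        if ret < 0:
            continue  # failed syscalls cannot contribute to the chain
        if name == "socket":
            sockets[pid].add(ret)
        elif name == "dup2":
            try:
                old_fd, new_fd = (int(part) for part in args.split(",")[:2])
            except ValueError:
                continue  # decorated or unexpected arguments; skip this line
            if old_fd in sockets[pid] and new_fd in (0, 1, 2):
                redirected.add(pid)
        elif name == "execve" and pid in redirected and pid not in flagged:
            flagged.append(pid)
    return flagged
```

This sketch does not follow file descriptors inherited across fork or clone, so a chain split between a parent and its child would be missed.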

Prerequisites

Docker: Must be installed and running.
Python 3: With scapy, colorama, scikit-learn, and numpy installed.
Linux Environment: Required for strace and tcpdump compatibility within the container.

Installation

Clone the repository:

git clone <repository-url>
cd dynamic-analysis

Install Python dependencies:
```
pip install -r requirements.txt
```

Usage

1. Analyze a List of Packages (Recommended)

To analyze a specific list of packages (e.g., from a text file) in parallel and automatically generate ML risk scores:

python3 scripts/analyze_list.py --workers 4

Input: Reads package names from data/malicious-recent-analysis.txt.
Process: Finds corresponding zip files in data/malicious-software-packages-dataset, runs them in the sandbox, and generates reports.
Output:
- JSON reports in logs/.
- PCAP files in output/.
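
As a rough illustration of what a `--workers` option implies, the sketch below fans a list of package names out over a thread pool. It is not the code in scripts/analyze_list.py: `analyze_package` is a hypothetical placeholder for the real sandbox run, and the choice of a thread pool is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def read_package_names(text):
    """Return the non-empty, stripped lines of a package list."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def analyze_package(name):
    # Hypothetical placeholder: the real tool would run the sandbox here.
    return {"package": name, "status": "analyzed"}


def analyze_all(names, workers=4, analyze=analyze_package):
    """Run `analyze` over all names with at most `workers` concurrent tasks."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input names.
        return list(pool.map(analyze, names))
```

Threads suit this kind of workload because each task mostly waits on an external process, so the interpreter lock is not a bottleneck.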

2. Analyze a Single Package

To analyze a single package, either by name (downloaded from the npm registry) or from a local path:

# From npm registry
python3 src/orchestrator.py express

# Local directory
python3 src/orchestrator.py ./path/to/package

3. Analyze a Local Zip File

To analyze a specific zip file (handling extraction automatically):

python3 scripts/run_local_analysis.py path/to/package.zip

Machine Learning Module

The tool uses a Random Forest classifier (src/ml_analyzer.py) to assess risk.

Training the Model

To train the model, you need two directories: one containing JSON reports of benign packages and one for malicious packages.

python3 src/ml_analyzer.py --train --benign-dir ./data/benign_reports --malicious-dir ./data/malicious_reports

This saves the model to models/risk_model.pkl.

Running ML on Existing Logs

If you have already run the dynamic analysis and have a folder full of JSON reports in logs/, you can generate a CSV summary without re-running the sandbox:

python3 scripts/batch_ml_analyzer.py

Output CSV Format:

| Filename | Package Name | Risk Level | Malicious Probability | High Threshold | Medium Threshold |
| --- | --- | --- | --- | --- | --- |
| report.json | malicious-pkg | HIGH | 0.9200 | 0.8500 | 0.4500 |
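
The columns above suggest that the risk level is derived by comparing the malicious probability against the two thresholds. A minimal sketch of that mapping follows. The function name `risk_level` is hypothetical, the default thresholds are taken from the sample row only, and the use of inclusive comparisons is an assumption about how src/ml_analyzer.py behaves.

```python
def risk_level(probability, high_threshold=0.85, medium_threshold=0.45):
    """Map a malicious probability to LOW, MEDIUM, or HIGH."""
    # Assumption: a probability equal to a threshold counts as meeting it.
    if probability >= high_threshold:
        return "HIGH"
    if probability >= medium_threshold:
        return "MEDIUM"
    return "LOW"
```

With the sample row, `risk_level(0.92)` returns "HIGH" because 0.92 is at least 0.85.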

Configuration

Key settings can be modified in src/orchestrator.py:

DOCKER_TIMEOUT: Max runtime for analysis (default: 600s).
MAX_MEMORY: Memory limit for the sandbox (default: 4g).
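
To show how two settings like these are commonly applied, the sketch below passes the memory limit to `docker run --memory` and enforces the timeout through `subprocess.run`. This is an assumption about usage, not the actual logic in src/orchestrator.py; the function names and the image argument are hypothetical.

```python
import subprocess

DOCKER_TIMEOUT = 600  # seconds (documented default)
MAX_MEMORY = "4g"     # documented default


def build_docker_command(image, package, memory=MAX_MEMORY):
    """Build a `docker run` argument list with a memory cap."""
    return ["docker", "run", "--rm", "--memory", memory, image, package]


def run_sandbox(image, package, timeout=DOCKER_TIMEOUT):
    """Run the container; return None if the timeout is exceeded."""
    cmd = build_docker_command(image, package)
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # Only the docker client process is killed here; the container itself
        # may keep running and need a separate `docker kill`.
        return None
```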
