Merged
56 commits
076bd90
Add v0.7.0 adversarial sample build script
malx-labs Apr 21, 2026
b029cb5
Update adversarial sample build script to include v.0.7.1 binaries. T…
malx-labs Apr 21, 2026
bff2839
Add malformed_import_table.full.c
malx-labs Apr 21, 2026
c2879f3
Add invalid_section_alignment_full.c first draft
malx-labs Apr 21, 2026
79e3168
Add corrupted_data_directorie.full.c first draft code
malx-labs Apr 21, 2026
a368451
Add truncated_rich_header.full.c first draft
malx-labs Apr 21, 2026
ba7c7e2
Add franken_malformed_pe: intentionally ridden with irregular structure
malx-labs Apr 21, 2026
be57201
Franken PE now fires all desired heuristics. TODO: refactor output to…
malx-labs Apr 21, 2026
d5ed37c
Reformat section analysis structure to include data_directories: rem…
malx-labs Apr 22, 2026
f89d1f2
New heuristics firing on the heuristic_rich sample so captured 1 addi…
malx-labs Apr 22, 2026
c57ea87
Expand the contract testing layer model with explanations. Add franke…
malx-labs Apr 22, 2026
26c0419
Distill contract testing strategy directory structure to core folders…
malx-labs Apr 22, 2026
c250628
Fix typo in contract testing strategy
malx-labs Apr 22, 2026
46a2e62
Add a generators README and re-structure c generator directory
malx-labs Apr 22, 2026
02d0c30
Improve contract test runner output for clarity
malx-labs Apr 22, 2026
67c881a
Add franken malformed PE integration tests
malx-labs Apr 22, 2026
774b88e
Add a performance test for the Franken malformed PE: result = 0.0028s
malx-labs Apr 22, 2026
df355e5
Add performance guarantee documentation
malx-labs Apr 24, 2026
f9eb377
Fix typo
malx-labs Apr 24, 2026
6ec747c
Add malformed_import_table fixture, contract and supporting documenta…
malx-labs Apr 24, 2026
4e1aa08
Consolidate structs to ensure invalid_section_alignment passes the re…
malx-labs Apr 24, 2026
93ef37c
Add in the complementing documentation for invalid_section_alignment …
malx-labs Apr 24, 2026
968e5d9
Add truncated_rich_header and corrupted_data_directories fixtures to …
malx-labs Apr 25, 2026
653ac4d
Add packed_lookalike_full and upx_name_only fixtures and supporting d…
malx-labs Apr 25, 2026
a4a95b9
Add broken_rva_addresses and overlapping_sections to the contract tes…
malx-labs Apr 25, 2026
d88d076
Added adversarial string fixtures: crypto (including base58 validity …
malx-labs Apr 25, 2026
3fb580f
Optimise bare-domain homoglyph handling and improve engine throughput…
malx-labs Apr 25, 2026
aad073f
Add invalid optional header PE32/PE32+ binaries, franked PE32 binary …
malx-labs Apr 26, 2026
28306c7
Add adversarial fixtures and snapshots to lock in contract tests
malx-labs Apr 28, 2026
f4d398b
Add corresponding C adversarial source code
malx-labs Apr 28, 2026
347f211
Improved extractor accuracy across bare domain/string url, and hashes…
malx-labs Apr 28, 2026
4715062
Get project test coverage to 100%
malx-labs Apr 29, 2026
d08ad7d
Punycode logic now aligns with idna spec. Domain decodes and exceptio…
malx-labs Apr 29, 2026
40a6fd1
Add pe dense (worst-case PE scanning) and PE typical (39KB) performan…
malx-labs Apr 30, 2026
ce4ff91
Add test corpus for study, pe_dense source code and additional chaos_…
malx-labs Apr 30, 2026
1c131c3
final draft of v0.7.1 changelog and readme re-structure
malx-labs Apr 30, 2026
bc4eefe
Added v0.7.1 version highlights to README and fix formatting issue
malx-labs Apr 30, 2026
218e016
Added contracts folder under README architecture
malx-labs Apr 30, 2026
3e78456
Updated performance guarantees based on latest v0.7.1 statistics
malx-labs Apr 30, 2026
3882a49
Updated crypto strings adversarial appendix copy
malx-labs Apr 30, 2026
da9f043
Consolidate contract safe testing layer3 entries
malx-labs Apr 30, 2026
c812707
Remove hr markdown
malx-labs Apr 30, 2026
d7199e2
Remove bracket
malx-labs Apr 30, 2026
8797963
Consolidate layer 3 fixture summary
malx-labs Apr 30, 2026
9b766fa
Remove trailing term
malx-labs Apr 30, 2026
b643143
Fixture appendices final edit
malx-labs May 1, 2026
cf7caad
Updated performance badge and linked to supporting documentation
malx-labs May 1, 2026
42b92c1
Link performance summary svg
malx-labs May 1, 2026
8db7192
Centre performance summary svg
malx-labs May 1, 2026
25e7864
Change placement of performance graph
malx-labs May 1, 2026
6ac4fba
Tweak performance graph layout
malx-labs May 1, 2026
354b67d
Enhance the domain metadata by adding decoded_unicode, contains confu…
malx-labs May 1, 2026
518213c
Add last remaining fixture appendix. Rename bare_domain punycode help…
malx-labs May 1, 2026
0ea5ea7
Updated pypi readme and performance stat
malx-labs May 1, 2026
0e242e6
Add throughput performance badge
malx-labs May 1, 2026
cf9d493
Add throughput performance badge refactor
malx-labs May 1, 2026
154 changes: 154 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,154 @@
# **v0.7.1 — Heuristics Engine Expansion & Structural Analysis Improvements**
**Released: 2026‑05‑??**

v0.7.1 delivers a major upgrade to IOCX’s **PE heuristics engine**, **extractor correctness**, and **adversarial‑input resilience**. This release introduces six new structural heuristics, broad extractor hardening, and a significantly expanded adversarial test suite — including **full adversarial coverage for every IOC category**.

---

# **Extractor Hardening**

This release strengthens multiple IOC extractors with improved correctness, boundary handling, and adversarial‑text resilience. Updates span the **bare domain**, **strict URL**, **crypto**, and **hash** extractors, plus improved **URL normalisation**.

## **Bare Domain Extractor**

### **Improvements**
- Expanded **TLD allow‑list** (e.g., `.ly`, `.gg`, `.sh`, `.app`, `.dev`, `.xyz`, `.online`) for broader real‑world coverage.
- Strengthened **BAD_TLD deny‑list** to prevent file extensions, config keys, and log fields from being misclassified as domains.
- Refined **boundary detection** to reduce false positives in noisy or punctuation‑heavy text.
- Added **punycode + IDN homoglyph analysis**, including Unicode decoding, script classification, and confusable‑character detection.
- Improved regex structure for **stability and predictable linear performance**, eliminating pathological backtracking cases.

### **Impact**
- Higher recall for legitimate domains across modern TLDs.
- Significant reduction in false positives from filepaths, dotted identifiers, and structured logs.
- Richer, homoglyph‑aware metadata for downstream analysis and phishing detection.
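
The punycode and homoglyph analysis described above can be illustrated with a minimal sketch. This is not IOCX's implementation: the function names and the returned keys are hypothetical, and the "script" is approximated by the first word of each character's Unicode name rather than by the Unicode Script property.

```python
import unicodedata


def decode_label(label: str) -> str:
    """Decode one ACE ("xn--") label to Unicode; return it unchanged on failure."""
    if not label.lower().startswith("xn--"):
        return label
    try:
        return label[4:].encode("ascii").decode("punycode")
    except UnicodeError:
        # Malformed punycode: keep the raw label rather than raising.
        return label


def label_scripts(label: str) -> set:
    """Approximate the scripts in a label, e.g. {"LATIN"} or {"LATIN", "CYRILLIC"}."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in label if ch.isalpha()}


def analyse_domain(domain: str) -> dict:
    """Hypothetical helper: decode each label and flag labels that mix scripts."""
    decoded_labels = [decode_label(label) for label in domain.split(".")]
    scripts = [label_scripts(label) for label in decoded_labels]
    return {
        "decoded": ".".join(decoded_labels),
        "scripts": sorted(set().union(*scripts)),
        "mixed_script": any(len(found) > 1 for found in scripts),
    }


# Build an ACE label from a lookalike that swaps Latin "a" for Cyrillic U+0430.
ace = "xn--" + "p\u0430ypal".encode("punycode").decode("ascii")
print(analyse_domain(ace + ".com"))
```

A label that mixes Latin and Cyrillic letters is a common signal for lookalike domains, which is why the sketch checks scripts per label instead of across the whole domain.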

---

## **Strict URL Extractor**

### Improvements
- Added support for `ftp`, `ftps`, and `sftp`.
- RFC‑compliant **userinfo parsing** (`user:pass@host`).
- Full **punycode** domain support.
- Improved **IPv6** handling (including zone indices).
- More robust host matching aligned with the updated domain extractor.
- Cleaner separation of path/query/fragment parsing.

### Impact
- More complete URL extraction.
- Fewer truncated or malformed URLs.
- Better handling of obfuscated or credential‑embedded URLs.
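
As a rough illustration of the userinfo and path/query/fragment separation listed above, the following sketch uses the standard library's `urlsplit`. The `split_url` helper, its return keys, and the exact scheme allow-list (including `http`/`https`) are assumptions for the example, not the extractor's actual code.

```python
from urllib.parse import urlsplit

# Assumed allow-list: the three schemes named above plus http/https.
ALLOWED_SCHEMES = {"http", "https", "ftp", "ftps", "sftp"}


def split_url(raw: str):
    """Hypothetical helper: split a URL into components, or return None if unusable."""
    try:
        parts = urlsplit(raw)
        port = parts.port  # raises ValueError for non-numeric or out-of-range ports
    except ValueError:
        return None
    if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
        return None
    return {
        "scheme": parts.scheme,
        "username": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": port,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


print(split_url("sftp://alice:s3cret@files.example.com:2222/drop/a.bin?x=1#top"))
```

Keeping userinfo separate from the host makes credential-embedded URLs easy to report without confusing `user:pass` for part of the hostname.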

---

## **Crypto Extractor**

### Improvements
- Added **full Base58Check validation** for Bitcoin:
  - Double‑SHA256 checksum verification.
  - Version‑byte validation (`0x00`, `0x05`).
  - Rejects malformed Base58 sequences.
- Preserved Bech32/Taproot and ETH detection.

### Impact
- Dramatic reduction in Base58 false positives.
- Only cryptographically valid BTC addresses are extracted.
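
The Base58Check steps listed above (alphabet check, version byte, double-SHA256 checksum) can be sketched as follows. The function names are hypothetical and this is a minimal illustration of the technique, not the extractor's code; it only covers legacy addresses with version bytes `0x00` and `0x05`.

```python
import hashlib

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
ALLOWED_VERSIONS = {0x00, 0x05}  # version bytes named in the changelog


def base58_decode(text: str):
    """Decode Base58 text to bytes; return None if a character is outside the alphabet."""
    value = 0
    for ch in text:
        digit = B58_ALPHABET.find(ch)
        if digit < 0:
            return None
        value = value * 58 + digit
    body = value.to_bytes((value.bit_length() + 7) // 8, "big")
    # Each leading "1" encodes one leading zero byte.
    leading_zeros = len(text) - len(text.lstrip("1"))
    return b"\x00" * leading_zeros + body


def is_valid_btc_base58check(address: str) -> bool:
    """Check length, version byte, and the 4-byte double-SHA256 checksum."""
    raw = base58_decode(address)
    if raw is None or len(raw) != 25:
        return False
    payload, checksum = raw[:-4], raw[-4:]
    if payload[0] not in ALLOWED_VERSIONS:
        return False
    digest = hashlib.sha256(hashlib.sha256(payload).digest()).digest()
    return digest[:4] == checksum


print(is_valid_btc_base58check("1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa"))
```

Because a random Base58-looking string passes the checksum with probability of roughly 1 in 2^32, this check is what removes most Base58 false positives.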

---

## **Hash Extractor**

### Improvements
- Increased short‑hex minimum length from **8 → 10** characters.
- Strict MD5/SHA1/SHA256/SHA512 detection unchanged.

### Impact
- Fewer false positives from small hex tokens.
- Behaviour remains aligned with adversarial fixtures.
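
To make the effect of the higher minimum length concrete, here is a small sketch. The regex, the `short_hex` label, and the `classify_hex_tokens` helper are assumptions for illustration; the changelog does not specify how the extractor classifies tokens internally.

```python
import re

# Lengths of the strict digest types named above.
DIGEST_LENGTHS = {32: "md5", 40: "sha1", 64: "sha256", 128: "sha512"}


def classify_hex_tokens(text: str, min_short_len: int = 10):
    """Hypothetical helper: find standalone hex runs and label them by length."""
    pattern = re.compile(
        r"(?<![0-9A-Za-z])[0-9a-fA-F]{%d,}(?![0-9A-Za-z])" % min_short_len
    )
    results = []
    for match in pattern.finditer(text):
        token = match.group(0).lower()
        kind = DIGEST_LENGTHS.get(len(token), "short_hex")
        results.append((kind, token))
    return results


sample = "id=deadbeef digest=d41d8cd98f00b204e9800998ecf8427e tag=deadbeef01"
print(classify_hex_tokens(sample))
```

With the threshold at 10, the 8-character `deadbeef` is ignored while the 10-character `deadbeef01` is still reported.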

---

## **URL Normalisation**

- `normalise_url()` now wraps `urlparse()` in safe error handling.
- Malformed URLs return `None` instead of raising.

### Impact
- More robust behaviour on adversarial URL input.
- Prevents crashes during bulk extraction.
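
A minimal sketch of the wrapping pattern described above, assuming the normaliser only needs a scheme and a network location. The name `normalise_url_sketch` is hypothetical and the real `normalise_url()` may apply further normalisation steps.

```python
from urllib.parse import urlparse


def normalise_url_sketch(raw: str):
    """Return a normalised URL string, or None when the input cannot be parsed."""
    try:
        parsed = urlparse(raw)
    except ValueError:
        # urlparse raises ValueError for inputs such as an unclosed IPv6 bracket.
        return None
    if not parsed.scheme or not parsed.netloc:
        return None
    # urlparse lowercases the scheme; geturl() reassembles the components.
    return parsed.geturl()


print(normalise_url_sketch("http://[::1"))            # malformed IPv6 literal
print(normalise_url_sketch("HTTP://example.com/a"))
```

Returning `None` lets a bulk extraction loop skip a bad URL and continue instead of aborting on the first malformed input.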

---

# **Heuristics Engine Expansion (PE Structural Analysis)**

To support the expanded adversarial PE corpus, v0.7.1 introduces **six new deterministic heuristics** for detecting malformed or inconsistent PE structures:

- **Section overlap detection**
  `_analyse_section_overlap`
- **Section alignment validation**
  `_analyse_section_alignment`
- **Optional‑header consistency checks**
  `_analyse_optional_header_consistency`
- **Entrypoint → section mapping validation**
  `_analyse_entrypoint_mapping`
- **Data‑directory anomaly detection**
  `_analyse_data_directory_anomalies`
- **Import‑directory validity checks**
  `_analyse_import_directory_validity`

### Impact
- Clearer, reason‑coded anomaly reporting.
- No false positives on benign binaries.
- Deterministic behaviour across malformed PE structures.
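
Two of the heuristics above, section overlap and section alignment, can be sketched roughly as follows. The `Section` type, the function names, and the reason codes are hypothetical and do not reflect the internals of `_analyse_section_overlap` or `_analyse_section_alignment`; the sketch works on virtual addresses only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    name: str
    virtual_address: int
    virtual_size: int


def find_overlaps(sections):
    """Report every pair of sections whose virtual address ranges intersect."""
    findings = []
    # Sort first so the output order does not depend on the input order.
    ordered = sorted(sections, key=lambda s: (s.virtual_address, s.name))
    for i, first in enumerate(ordered):
        first_end = first.virtual_address + first.virtual_size
        for second in ordered[i + 1:]:
            if second.virtual_address < first_end:
                findings.append(
                    {"reason": "section_overlap", "sections": [first.name, second.name]}
                )
    return findings


def find_misaligned(sections, section_alignment):
    """Report sections whose start is not a multiple of the declared alignment."""
    if section_alignment <= 0:
        return [{"reason": "invalid_section_alignment", "sections": []}]
    return [
        {"reason": "section_misaligned", "sections": [s.name]}
        for s in sorted(sections, key=lambda s: (s.virtual_address, s.name))
        if s.virtual_address % section_alignment != 0
    ]


sections = [
    Section(".text", 0x1000, 0x2000),
    Section(".data", 0x2000, 0x1000),  # starts inside .text
    Section(".rsrc", 0x4100, 0x0100),  # not a multiple of 0x1000
]
print(find_overlaps(sections))
print(find_misaligned(sections, 0x1000))
```

Sorting before comparison keeps the findings in a stable order, which is one simple way to get deterministic, reason-coded output.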

---

# **Added**

### **1. Full adversarial fixtures for *all* IOC categories**
New adversarial string corpora added for:

- **crypto wallets** (BTC/ETH, reversed, embedded, noisy, base58‑adjacent)
- **domains** (Unicode homoglyphs, mixed‑script lookalikes)
- **URLs** (broken schemes, nested encodings, truncated fragments)
- **IPs** (malformed IPv4/IPv6, concatenated segments, invalid scopes)
- **filepaths** (MAX_PATH‑breaking Windows paths, malformed UNC prefixes)
- **hashes** (near‑miss hex sequences, truncated digests)
- **base64** (invalid padding, embedded noise, extremely long runs)
- **emails** (Unicode variants, malformed local parts)

Each fixture includes a deterministic snapshot.
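
One way to obtain a deterministic snapshot is to serialise results with sorted keys and a fixed encoding for non-JSON types. The sketch below assumes bytes are hex-encoded and sets are emitted as sorted lists; `to_snapshot` is a hypothetical helper, not the project's snapshot writer.

```python
import json


def to_snapshot(result) -> str:
    """Serialise a result so that equal inputs always yield identical text."""

    def encode(obj):
        if isinstance(obj, (bytes, bytearray)):
            return bytes(obj).hex()
        if isinstance(obj, (set, frozenset)):
            return sorted(obj)
        raise TypeError(f"unsupported type: {type(obj).__name__}")

    return json.dumps(result, sort_keys=True, indent=2, default=encode)


first = to_snapshot({"b": b"\x00\xff", "a": {"y", "x"}})
second = to_snapshot({"a": {"x", "y"}, "b": b"\x00\xff"})
print(first == second)
```

A contract test can then compare the freshly generated text against the stored snapshot file byte for byte.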

### **2. Expanded adversarial PE corpus**
Fixtures include:

- broken RVAs
- overlapping/misaligned sections
- corrupted data directories
- malformed import tables
- invalid optional headers (PE32 & PE32+)
- truncated Rich headers
- packed‑lookalike binaries
- franken‑PE hybrids

### **3. Heuristics engine upgrades**
- New structural heuristics (see above)
- Unified internal analysis structure (`sections` + `data_directories`)
- Deterministic, JSON‑safe anomaly reporting

---

# **Fixed**

- Improved stability when parsing malformed or adversarial PE files.
- More robust handling of malformed URLs during normalisation.

---

# **Notes**

- Updated snapshot for `heuristic_rich.full.exe` to reflect new heuristics.
- Previous snapshot predated directory‑range and RVA‑validation logic.

---
4 changes: 2 additions & 2 deletions Makefile
@@ -84,7 +84,7 @@ dev: $(STAMP_DEV)
# ===========================
.PHONY: test
test: dev
$(PYTHON) -m pytest -q -m "not integration and not fuzz and not robustness and not performance"
$(PYTHON) -m pytest -q -m "not integration and not fuzz and not robustness and not performance and not contract"

# ----------------------------------------
# Integration tests only
@@ -132,7 +132,7 @@ test-coverage: dev
.PHONY: test-contract
test-contract: dev
@echo "Running contract tests..."
$(PYTEST) -m contract $(CONTRACT_DIR)
$(PYTEST) -m contract $(CONTRACT_DIR) -sv

# ----------------------------------------
# Static analysis and SCA
34 changes: 31 additions & 3 deletions README-pypi.md
@@ -24,7 +24,37 @@ IOCX is a fast, safe, deterministic engine for extracting Indicators of Compromi

It performs **pure static analysis** — no execution, no sandboxing, no risk.

## What's new in v0.7.0
## What's new in v0.7.1

### **Bare Domain Extractor Overhaul**
- Expanded **TLD allow‑list** and strengthened **BAD_TLD deny‑list**
- Refined boundary rules to reduce false positives in noisy text
- Added **punycode decoding**, Unicode script classification, and homoglyph/confusable detection
- Hardened regex for **predictable linear performance** under adversarial input
- New metadata fields:
  - `punycode`, `punycode_decodes_to_unicode`
  - `decoded_unicode`
  - `contains_confusables`
  - `script`

### **Performance guarantees**
- **~150-300 MB/s** for individual detectors (domains, crypto, filepaths, IPs)
- **Strict linear scaling** across all detectors
- Pathological punycode, IPv6, and filepath inputs complete in **< 15 ms**
- End‑to‑end engine throughput: **20-30 MB/s**

### **Heuristic engine and adversarial fixture expansion**
- Deterministic heuristics for section overlap and alignment, optional header consistency, entrypoint mapping, data directory anomalies, and import directory validity
- Adversarial fixtures covering all new heuristics and IOC subsystems

### **Documentation updates**
- New adversarial appendices
- New performance guarantees
- Expanded schema‑contract guidance

## Recent changes

### v0.7.0

- **Deterministic heuristic engine**

@@ -46,8 +76,6 @@ Deep hex‑encoding of nested byte structures prevents JSON serialization failur

New appendices and deterministic‑output guidance.

## Recent changes

### v0.6.0

- Stable JSON schema across all analysis levels