Comprehensive documentation for PatternForge - a fast, deterministic glob-pattern discovery engine for hierarchical data.
- Introduction
- Quick Start
- Core Concepts
- Single-Field Pattern Matching
- Structured Multi-Field Matching
- Configuration Reference
- Performance and Scaling
- Best Practices
- Troubleshooting
- Advanced Topics
PatternForge automatically discovers minimal wildcard patterns that match your data. Given a set of "include" items (what you want) and "exclude" items (what you don't want), it generates precise glob patterns using wildcards (*).
Key Features:
- Zero false positives in EXACT mode (guaranteed not to match excludes)
- Automatic pattern discovery - no manual regex writing
- Multi-field support - patterns across structured data (module/instance/pin)
- Scalable - handles 10,000+ rows efficiently
- Interpretable - generates human-readable wildcard patterns
Use Cases:
- Regression triage (find patterns in failing tests)
- Hardware signal selection (netlists, pin lists)
- Log file filtering
- Database queries on hierarchical data
- Test case categorization
```bash
# Clone the repository
git clone https://github.com/your-org/patternforge
cd patternforge

# Install in development mode
pip install -e .

# Or add to PYTHONPATH
export PYTHONPATH=/path/to/patternforge/src:$PYTHONPATH
```

```python
from patternforge.engine.solver import propose_solution

# Your data
include = [
    "chip/cpu/l1_cache/bank0",
    "chip/cpu/l1_cache/bank1",
    "chip/cpu/l2_cache/bank0",
]
exclude = [
    "chip/gpu/l1_cache/bank0",
    "chip/cpu/registers/file0",
]

# Generate patterns - defaults to EXACT mode (zero false positives)
solution = propose_solution(include, exclude)

# Inspect results
print(f"Pattern: {solution.raw_expr}")  # *cpu*cache*
print(f"Coverage: {solution.metrics['covered']}/{solution.metrics['total_positive']}")
print(f"False positives: {solution.metrics['fp']}")  # 0 in EXACT mode

# Customize with direct parameters - no classes needed!
solution = propose_solution(
    include, exclude,
    mode="EXACT",    # Zero false positives (string, not enum!)
    max_patterns=5,  # At most 5 patterns
    w_fp=2.0,        # Penalize false positives heavily
)
```

See: examples/01_quick_start.py for complete runnable examples
Or from the command line:

```bash
# Create input files
echo -e "chip/cpu/l1_cache/bank0\nchip/cpu/l1_cache/bank1" > include.txt
echo -e "chip/gpu/l1_cache/bank0" > exclude.txt

# Generate patterns
PYTHONPATH=src python -m patternforge.cli propose \
    --include include.txt \
    --exclude exclude.txt \
    --format json \
    --out solution.json

# View results
jq '.expr, .raw_expr, .metrics' solution.json
```

PatternForge discovers five types of wildcard patterns:
Prefix:
- Anchored at start
- 1 wildcard
- Example: `video/*` matches `video/display/pixel0`
- Use when: All targets share a common prefix

Suffix:
- Anchored at end
- 1 wildcard
- Example: `*/fail` matches all paths ending in `fail`
- Use when: All targets share a common suffix

Substring:
- Matches anywhere
- 2 wildcards
- Example: `*cache*` matches `chip/cpu/l1_cache/bank0`
- Use when: Target keyword appears anywhere

Multi-segment:
- Multiple ordered keywords
- Variable wildcards
- Example: `*cpu*execute*alu*` requires all three keywords in order
- Use when: You need multiple keywords in a specific order (more precise)

Exact:
- No wildcards
- Full string match
- Example: `chip/cpu/core0` matches only that exact path
- Use when: Items must be matched literally
See: examples/02_single_field_patterns.py for detailed examples of each type
Wildcard Rules:
- `*` matches any substring (including `/` and the empty string)
- Everything else is literal
- Patterns without a leading `*` are anchored at the start
- Patterns without a trailing `*` are anchored at the end
Examples:
| Pattern | String | Match? |
|---|---|---|
| `video*` | `video/display/pixel0` | ✓ (prefix) |
| `video*` | `other/video/display` | ✗ (not anchored at start) |
| `*video*` | `other/video/display` | ✓ (substring) |
| `*/pixel0` | `video/display/pixel0` | ✓ (suffix) |
| `*/pixel0` | `video/display/pixel0/x` | ✗ (not anchored at end) |
| `*cpu*cache*` | `chip/cpu/l1_cache/bank0` | ✓ (multi-segment, ordered) |
| `*cache*cpu*` | `chip/cpu/l1_cache/bank0` | ✗ (wrong order) |
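These rules are exactly what you get by translating a pattern into an anchored regular expression. A minimal sketch, assuming only the semantics above (this is not PatternForge's internal matcher):

```python
import re

def glob_match(pattern: str, s: str) -> bool:
    # Each literal segment is escaped; each * becomes ".*" (any substring,
    # including "/" and the empty string). fullmatch anchors both ends.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, s) is not None

assert glob_match("video*", "video/display/pixel0")              # prefix
assert not glob_match("video*", "other/video/display")           # anchored at start
assert glob_match("*cpu*cache*", "chip/cpu/l1_cache/bank0")      # ordered segments
assert not glob_match("*cache*cpu*", "chip/cpu/l1_cache/bank0")  # wrong order
```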
EXACT Mode:
- Guarantee: Zero false positives
- Trade-off: May sacrifice some coverage to avoid FP
- Use when: Precision matters more than recall
- Speed: Slower (explores more combinations)

```python
solution = propose_solution(include, exclude, mode="EXACT")
```

APPROX Mode:
- Behavior: Allows some false positives for simpler patterns
- Trade-off: Faster, but may match exclude items
- Use when: Speed matters and some FP are acceptable
- Speed: 2-10x faster

```python
solution = propose_solution(include, exclude, mode="APPROX")
```

See: examples/01_quick_start.py Example 3
For hierarchical paths or flat strings where you want to match entire strings.
```python
from patternforge.engine.solver import propose_solution

include = [
    "regress/nightly/ipA/test_fifo/rand_smoke/fail",
    "regress/nightly/ipA/test_dma/burst_test/fail",
    "regress/nightly/ipB/test_cache/assoc16/fail",
]
exclude = [
    "regress/nightly/ipA/test_fifo/rand_smoke/pass",
    "regress/nightly/ipB/test_cache/assoc16/pass",
]

# Generate patterns (defaults to EXACT mode)
solution = propose_solution(include, exclude)

# Result: Pattern that matches all "fail" paths but no "pass" paths
print(solution.raw_expr)  # likely: *fail*
```

Sometimes it's easier to specify what you DON'T want:
```python
# Find everything EXCEPT debug paths
solution = propose_solution(
    include,          # All your paths
    exclude,          # Debug paths
    invert="always",  # Force inverted mode
)

# global_inverted=True means: "everything EXCEPT <pattern>"
print(solution.global_inverted)  # True
print(solution.raw_expr)         # *debug*
# Interpretation: Match everything EXCEPT paths with "debug"
```

Inversion Strategies:
- `"never"`: Normal mode (find what matches)
- `"always"`: Complement mode (everything except the pattern)
- `"auto"`: Solver chooses based on cost (default)
See: examples/02_single_field_patterns.py Example 6
Tokenization controls how patterns are generated from your data. It doesn't affect matching (patterns still use wildcard rules), only candidate generation.
`classchange` (default): splits on character class changes (alpha, digit, other):
- `SRAMController_512x64` → `[sram, controller, 512, x, 64]`
- `chip/cpu/core0` → `[chip, cpu, core, 0]`
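A minimal sketch of the splitting rule, assuming a regex-based reimplementation (the real tokenizer is `make_split_tokenizer("classchange")` in `patternforge.engine.tokens`, shown under Advanced Topics; note the library's global default `min_token_len` is 3):

```python
import re

# Runs of uppercase, capitalized/lowercase words, or digits; CamelCase
# boundaries like "SRAMController" split into "SRAM" + "Controller".
_RUN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")

def classchange_tokens(s: str, min_token_len: int = 1) -> list[str]:
    return [t.lower() for t in _RUN.findall(s) if len(t) >= min_token_len]

print(classchange_tokens("SRAMController_512x64"))
# ['sram', 'controller', '512', 'x', '64']
print(classchange_tokens("chip/cpu/core0"))
# ['chip', 'cpu', 'core', '0']
```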
`char`: splits into individual characters:
- `chip/cpu/core0` → `[c, h, i, p, /, c, p, u, /, c, o, r, e, 0]`
- Useful for finding character-level patterns
- Automatically uses `min_token_len=1` (ignores the global `min_token_len` setting)
```python
# Use classchange for CamelCase, numbers (RECOMMENDED)
solution = propose_solution(include, exclude, splitmethod='classchange')

# Use char for character-level patterns (advanced use)
solution = propose_solution(include, exclude, splitmethod='char')
```

See: examples/02_single_field_patterns.py Example 8
For data with multiple fields (like database rows, hardware signals with module/instance/pin, etc.).
```python
from patternforge.engine.solver import propose_solution_structured

include_rows = [
    {"module": "SRAM_512x64", "instance": "chip/cpu/l1_cache/bank0", "pin": "DIN[0]"},
    {"module": "SRAM_512x64", "instance": "chip/cpu/l1_cache/bank1", "pin": "DOUT[0]"},
]
exclude_rows = [
    {"module": "SRAM_512x64", "instance": "chip/cpu/l1_cache/bank0", "pin": "CLK"},
    {"module": "SRAM_512x64", "instance": "chip/gpu/shader0/cache", "pin": "DIN[0]"},
]

solution = propose_solution_structured(include_rows, exclude_rows)

# Atoms have a 'field' attribute showing which field they match
for pattern in solution['patterns']:
    print(f"{pattern['field']}: {pattern['text']}")

# Output might be:
# pin: *DIN* | *DOUT*
# instance: *cpu*
```

Key Differences from Single-Field:
- Patterns apply to individual fields, not a concatenated string
- Each pattern has a `field` attribute
- More precise (can filter on field combinations)
See: examples/03_structured_patterns.py for 7 comprehensive examples
Prefer patterns on certain fields over others:
```python
solution = propose_solution_structured(
    include_rows,
    exclude_rows,
    w_field={
        "pin": 3.0,       # Strongly prefer pin patterns
        "module": 1.5,    # Moderately prefer module patterns
        "instance": 0.3,  # Discourage instance patterns (too broad)
    },
)
```

How it works:
- `w_field` multiplies pattern scores during candidate generation
- Higher weight = more likely to select patterns on that field
- Unspecified fields default to 1.0
- Typical range: 0.1 (discourage) to 5.0 (strongly prefer)
See: examples/04_performance_scaling.py Test Suite 5
Different fields may need different tokenization:
```python
solution = propose_solution_structured(
    include_rows,
    exclude_rows,
    splitmethod={
        "instance": "char",       # character-level split for path strings
        "module": "classchange",  # split CamelCase
        "pin": "char",            # character-level split for bus notation DIN[0]
    },
    min_token_len={               # Can also be per-field!
        "instance": 2,
        "module": 3,
        "pin": 2,
    },
)
```

See: examples/03_structured_patterns.py, which uses per-field tokenization
Use None (or NaN in DataFrames) to create "don't care" fields in excludes:
```python
exclude_rows = [
    {"module": None, "instance": "debug/*", "pin": None},
    # Excludes ANY module/pin on debug instances
]

# With pandas DataFrame
import pandas as pd
import numpy as np

df_exclude = pd.DataFrame({
    'module': [np.nan, '*SRAM*'],
    'instance': ['debug/*', np.nan],
    'pin': [np.nan, 'CLK'],
})
# Row 1: Exclude debug instances regardless of module/pin
# Row 2: Exclude CLK pin on any SRAM module, any instance
```

See: STRUCTURED_SOLVER_GUIDE.md Section 4
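Conceptually, a None/NaN field always matches. A minimal sketch of that interpretation (hypothetical helper, reusing the illustrative `glob_match`; the real evaluation happens inside the structured solver):

```python
def row_excluded(row: dict, exclude_rows: list[dict]) -> bool:
    # A row is excluded if every non-None field of some exclude row matches.
    # None (or NaN from a DataFrame) is a "don't care" that matches anything.
    return any(
        all(pat is None or glob_match(pat, row[field])
            for field, pat in ex.items())
        for ex in exclude_rows
    )
```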
All configuration is done via direct keyword arguments to propose_solution() and propose_solution_structured().
```python
solution = propose_solution(include, exclude,
    # Quality & Mode
    mode="EXACT",               # or "APPROX" (case-insensitive)
    effort="medium",            # "low", "medium", "high", "exhaustive"
    invert="auto",              # "never", "auto", "always"

    # Tokenization (can be string or dict for per-field)
    splitmethod="classchange",  # or "char", or {"field": "method"}
    min_token_len=3,            # or {"field": int}

    # Candidate generation
    per_word_substrings=16,     # Substrings per token
    max_multi_segments=3,       # Multi-segment pattern limit
    allowed_patterns=None,      # or ["prefix", "suffix", "substring"]

    # Hard constraints (budgets)
    max_candidates=4000,        # Max candidates to consider
    max_patterns=None,          # Max patterns (int or 0<float<1 for %)
    max_fp=0,                   # Max FP (int or percentage)
    max_fn=None,                # Max FN (int or percentage)

    # Soft penalties (weights) - scalar or dict
    w_field={"pin": 2.0},       # Field preference (structured only, always dict)
    w_fp=1.0,                   # FP cost (scalar or dict)
    w_fn=1.0,                   # FN cost (scalar or dict)
    w_pattern=0.05,             # Pattern complexity cost
    w_op=0.02,                  # Operator complexity cost
    w_wc=0.01,                  # Wildcard cost
    w_len=0.001,                # Length cost

    # Advanced
    allow_complex_expressions=False,  # Allow conjunctive expressions
)
```

`mode`: "EXACT" or "APPROX" (case-insensitive)
- EXACT: Zero false positives (guaranteed)
- APPROX: Allows some FP for simpler patterns (2-10x faster)
`effort`: "low", "medium", "high", "exhaustive"
- Controls quality vs speed trade-off
- low: Fastest, single-field only
- medium: Balanced (default)
- high: Best quality, slower
- exhaustive: Try everything (small datasets only)
`invert`: "never", "auto", "always"
- never: Normal mode (find what matches)
- always: Complement mode (everything except pattern)
- auto: Solver chooses based on cost (default)
These are hard constraints that the solver MUST satisfy:
`max_candidates`: int (default: 4000)
- Maximum candidate patterns to consider
- Higher = more thorough search, slower
`max_patterns`: int or float (default: None)
- Maximum patterns in solution
- `8` = at most 8 patterns
- `0.10` = at most 10% of include rows
- `None` = no limit
`max_fp`: int or float (default: 0 in EXACT mode)
- Maximum false positives allowed
- `0` = zero FP (automatic in EXACT mode)
- `5` = at most 5 false positives
- `0.01` = at most 1% FP
- `None` = no limit
`max_fn`: int or float (default: None)
- Maximum false negatives allowed
- `0` = full coverage required
- `0.05` = at most 5% FN
- `None` = no limit
Percentage Interpretation:
- `0 < value < 1`: percentage of include rows
- `value >= 1`: absolute count
- `0`: zero (exact)
- `None`: no limit
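The same rule, as a small sketch (hypothetical helper; the exact rounding of fractional budgets is an assumption):

```python
import math

def resolve_budget(value, n_include_rows):
    # None = unlimited; 0 < value < 1 = fraction of include rows;
    # value >= 1 (or 0) = absolute count. Rounding here is illustrative.
    if value is None:
        return math.inf
    if 0 < value < 1:
        return math.floor(value * n_include_rows)
    return int(value)

resolve_budget(0.10, 800)  # -> 80 (10% of include rows)
resolve_budget(8, 800)     # -> 8
resolve_budget(None, 800)  # -> inf (no limit)
```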
These affect the cost function used during greedy selection:
`w_field`: dict[str, float] (structured only, default: {})
- Field preference during candidate generation
- Higher weight = prefer patterns on this field
- Example: `{"pin": 3.0, "module": 1.5, "instance": 0.3}`
- Range: 0.1 (discourage) to 5.0 (strongly prefer)
`w_fp`: float or dict (default: 1.0)
- Cost per false positive
- Can be per-field: `{"module": 2.0, "pin": 1.0}`
- Higher = penalize FP more heavily
`w_fn`: float or dict (default: 1.0)
- Cost per false negative
- Can be per-field for structured data
`w_pattern`: float or dict (default: 0.05)
- Cost per pattern in solution
- Higher = prefer fewer patterns (simpler solutions)
`w_op`: float or dict (default: 0.02)
- Cost per boolean operation (OR between patterns)
`w_wc`: float or dict (default: 0.01)
- Cost per wildcard character
`w_len`: float or dict (default: 0.001)
- Cost per character length
Weight Usage:
- All weights can be scalar (applied globally) or dict (per-field for structured data)
- `w_field` is always a dict (structured mode only)
- For the penalty weights, higher weight = higher penalty = less likely to be used
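To see how the knobs interact, here is a hedged sketch of the kind of cost the greedy selector minimizes, assuming scalar weights (the real cost function is internal and may differ in detail):

```python
def solution_cost(fp, fn, patterns,
                  w_fp=1.0, w_fn=1.0, w_pattern=0.05,
                  w_op=0.02, w_wc=0.01, w_len=0.001):
    # Error terms plus complexity terms: each pattern, each OR between
    # patterns, each wildcard, and each character adds a small penalty.
    n_ops = max(len(patterns) - 1, 0)  # ORs joining the patterns
    n_wildcards = sum(p.count("*") for p in patterns)
    n_chars = sum(len(p) for p in patterns)
    return (w_fp * fp + w_fn * fn
            + w_pattern * len(patterns)
            + w_op * n_ops
            + w_wc * n_wildcards
            + w_len * n_chars)

# With the default weights, false positives are costly, so the more
# complex zero-FP set wins:
solution_cost(fp=2, fn=0, patterns=["*fail*"])                    # ~2.08
solution_cost(fp=0, fn=0, patterns=["*cpu*fail*", "*dma*fail*"])  # ~0.20
```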
`allowed_patterns`: list[str] or None (default: None)
- Restrict pattern types
- Values: `"exact"`, `"substring"`, `"prefix"`, `"suffix"`, `"multi"`
- Example: `["prefix", "suffix"]` (only prefix/suffix patterns)
- `None` = allow all types
`min_token_len`: int or dict (default: 3)
- Minimum token length to consider
- Can be per-field: `{"module": 3, "pin": 2}`
`splitmethod`: str or dict (default: "classchange")
- Tokenization method
- `"classchange"`: Split on character class changes (alpha/digit/other) - RECOMMENDED
- `"char"`: Split into individual characters (advanced - for character-level patterns)
- Can be per-field: `{"instance": "char", "module": "classchange"}`
- Note: `splitmethod='char'` automatically uses `min_token_len=1`
`per_word_substrings`: int (default: 16)
- Top N substrings per token to consider

`max_multi_segments`: int (default: 3)
- Maximum segments in multi-segment patterns like `*a*b*c*`
`allow_complex_expressions`: bool (default: False)
- Allow conjunctive expressions (A & B) and subtraction (A - B)
- More expressive but potentially more complex solutions
PatternForge automatically selects the best algorithm based on dataset size:
| N (rows) | F (fields) | Effort | Algorithm | Complexity |
|---|---|---|---|---|
| < 100 | < 5 | exhaustive | EXHAUSTIVE | O(N²) |
| < 1k | < 8 | medium/high | BOUNDED | O(N) |
| ≥ 1k or ≥ 8 | any | any | SCALABLE | O(N) |
| any | any | low | SCALABLE | O(N) |
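A sketch of that dispatch as code (illustrative only; the actual selection is internal):

```python
def pick_algorithm(n_rows: int, n_fields: int, effort: str) -> str:
    # Mirrors the table above: "low" effort and large inputs go SCALABLE,
    # tiny inputs with exhaustive effort go EXHAUSTIVE, the rest BOUNDED.
    if effort == "low" or n_rows >= 1000 or n_fields >= 8:
        return "SCALABLE"
    if effort == "exhaustive" and n_rows < 100 and n_fields < 5:
        return "EXHAUSTIVE"
    return "BOUNDED"
```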
BOUNDED Algorithm:
- Row-centric expression generation
- Suitable for small-medium datasets
- Caps: 1000-2000 total expressions, 50-100 per row
SCALABLE Algorithm:
- Pattern-centric with lazy multi-field construction
- Suitable for large datasets
- Caps: 20-200 patterns per field
From examples/04_performance_scaling.py:
Single-Field:

| N | Time | Atoms |
|---|---|---|
| 10 | ~0.01s | 1-2 |
| 1,000 | ~0.2s | 2-5 |
| 10,000 | ~2s | 5-15 |

Structured Multi-Field:

| N | Time | Atoms |
|---|---|---|
| 10 | ~0.01s | 1-3 |
| 1,000 | ~0.5s | 3-8 |
| 10,000 | ~5s | 5-20 |

Mode Comparison (N=1000):

| Mode | Time | FP | Atoms |
|---|---|---|---|
| EXACT | 0.3s | 0 | 5 |
| APPROX | 0.1s | 2 | 3 |

Speedup: 3x faster
See: examples/04_performance_scaling.py for comprehensive benchmarks
Use Single-Field When:
- Data is flat hierarchical paths (no natural field structure)
- You want patterns on full string concatenation
- Simpler problem, don't need field-level precision
Use Structured When:
- Data has natural field structure (module/instance/pin, database columns)
- You need per-field patterns
- Want to prefer/discourage specific fields
- More precise filtering requirements
Use EXACT When:
- Zero false positives required
- Precision > recall
- Dataset size manageable (< 10k rows)
Use APPROX When:
- Speed matters
- Some false positives acceptable
- Large datasets (> 10k rows)
- Exploratory analysis
High weights (2.0-5.0):
- Fields with discriminating patterns
- Fields that should dominate selection
- Example: the `pin` field in hardware signals
Low weights (0.1-0.5):
- Fields with many similar values
- Fields that create too-broad patterns
- Example: `instance` paths that match too much
Default weight (1.0):
- Neutral preference
`max_patterns`:
- Start with `None` (no limit)
- If you get too many patterns, set it to 5-10 for simplicity
- Or use a percentage: `0.05` for at most 5% of rows
`max_fp` / `max_fn`:
- EXACT mode: `max_fp=0` (automatic)
- APPROX mode: `max_fp=0.01` for 1% tolerance
- Full coverage: `max_fn=0` (all three budgets combined in the example below)
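A typical combination of these budgets (illustrative values, using the documented keyword arguments):

```python
# APPROX mode with guardrails: tolerate up to 1% FP, require full
# coverage, and keep the solution to at most 10 patterns.
solution = propose_solution(
    include, exclude,
    mode="APPROX",
    max_fp=0.01,
    max_fn=0,
    max_patterns=10,
)
```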
Symptoms: Matches too many items, high FP
Solutions:
- Use EXACT mode instead of APPROX
- Add more exclude examples
- Increase field weights (structured mode)
- Decrease the `max_fp` budget
- Increase the `w_fp` weight (see the snippet below)
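For example, combining the EXACT-mode and weight fixes (illustrative values):

```python
# Tighten precision: force zero FP and make any FP-prone candidate costly.
solution = propose_solution(include, exclude, mode="EXACT", w_fp=3.0)
```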
Symptoms: Too many patterns, low coverage
Solutions:
- Use APPROX mode
- Decrease the `max_patterns` budget
- Increase the `w_pattern` cost
- Add a `max_fn` budget to allow some FN
- Use a lower effort level
Symptoms: Takes too long to complete
Solutions:
- Use APPROX mode (2-10x faster)
- Use `effort="low"`
- Reduce `max_candidates` (default 4000)
- For large datasets (>10k rows), use APPROX mode with low effort, as shown below
- Check dataset size: structured mode is slower above ~5k rows
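For example (illustrative values):

```python
# Trade quality for speed on a large dataset.
solution = propose_solution(
    include, exclude,
    mode="APPROX",
    effort="low",
    max_candidates=1000,
)
```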
Symptoms: (Structured mode) Patterns on wrong fields
Solutions:
- Adjust `w_field` weights:

```python
solution = propose_solution_structured(
    include, exclude,
    w_field={"desired_field": 3.0, "wrong_field": 0.3},
)
```

- Increase `w_fp` for the wrong field:

```python
solution = propose_solution_structured(
    include, exclude,
    w_fp={"wrong_field": 5.0, "desired_field": 1.0},
)
```
- Add more exclude examples that use wrong field
Symptoms: Empty solution or no patterns
Solutions:
- Check whether include and exclude overlap (unsolvable; see the check after this list)
- Relax budgets (`max_patterns=None`, or add `max_fn` tolerance)
- Lower `min_token_len` (default 3)
- Check the tokenization method (try both `classchange` and `char`)
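The overlap check in the first bullet is plain Python:

```python
# Items in both sets make a zero-FP, full-coverage solution impossible.
overlap = set(include) & set(exclude)
if overlap:
    print(f"{len(overlap)} items appear in both include and exclude, e.g.:")
    for item in sorted(overlap)[:5]:
        print(" ", item)
```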
For complete control over tokenization, provide a custom `token_iter`:

```python
from patternforge.engine.tokens import Token, make_split_tokenizer, iter_structured_tokens_with_fields

# Per-field custom tokenizers
tk_module = make_split_tokenizer("classchange", min_token_len=3)
tk_instance = make_split_tokenizer("char", min_token_len=2)
tk_pin = make_split_tokenizer("char", min_token_len=2)

field_tokenizers = {"module": tk_module, "instance": tk_instance, "pin": tk_pin}

token_iter = list(iter_structured_tokens_with_fields(
    include_rows,
    field_tokenizers,
    field_order=["module", "instance", "pin"],
))

solution = propose_solution_structured(
    include_rows,
    exclude_rows,
    token_iter=token_iter,  # Pass custom tokens
)
```

See: README.md "Advanced: custom tokenizers (per-field)" section
The evaluator supports complex boolean logic:
```python
from patternforge.engine.solver import evaluate_expr

patterns = {
    "P1": "*cpu*",
    "P2": "*cache*",
    "P3": "*debug*",
}

# AND
metrics = evaluate_expr("P1 & P2", patterns, include, exclude)

# OR
metrics = evaluate_expr("P1 | P2", patterns, include, exclude)

# NOT
metrics = evaluate_expr("!P3", patterns, include, exclude)

# Complex
metrics = evaluate_expr("(P1 & P2) & !P3", patterns, include, exclude)
```

Save and load solutions:
```python
from patternforge import io

# Convert solution to dict for saving
solution_dict = solution.to_json()
io.save_solution(solution_dict, "solution.json")

# Load
loaded = io.load_solution("solution.json")

# Extract just patterns for evaluation
patterns_dict = {pattern["id"]: pattern["text"] for pattern in loaded["patterns"]}
```

Direct pandas/polars support:
```python
import pandas as pd

df_include = pd.DataFrame({
    'module': ['SRAM_512x64', 'SRAM_1024x32'],
    'instance': ['cpu/l1_cache/bank0', 'cpu/l2_cache/bank0'],
    'pin': ['DIN[0]', 'DOUT[0]'],
})

solution = propose_solution_structured(df_include, df_exclude)
# Auto-detects columns as fields!
```

Generate human-readable explanations:
```python
from patternforge.engine.explain import explain_text, explain_dict

# Text format
explanation = explain_text(solution, include, exclude)
print(explanation)

# Dict format (programmatic)
summary = explain_dict(solution, include, exclude)
print(summary["expr"], summary["metrics"])
```

- Examples: See the `examples/` directory for runnable code
- Structured Guide: See `STRUCTURED_SOLVER_GUIDE.md` for a multi-field deep dive
- README: See `README.md` for CLI usage and quick reference
- Tests: See `tests/` for extensive usage patterns
- API Reference: See source code docstrings in `src/patternforge/engine/`
- Check this guide and examples first
- Review the troubleshooting section
- Run the relevant example file to see expected behavior
- Check the issue tracker: https://github.com/anthropics/patternforge/issues
Contributions welcome! Please:
- Add tests for new features
- Update documentation
- Add example if demonstrating new capability
- Follow existing code style