[WEB-4211] feat: skip specific folders to not parse#120
[WEB-4211] feat: skip specific folders to not parse#120MohamedAklamaash wants to merge 4 commits into
Conversation
Greptile SummaryThis PR replaces exact set-membership directory filtering with case-insensitive regex substring matching (via a module-level
Confidence Score: 4/5The core filtering logic is functional, but the open-ended substring matching (no word boundaries or anchors) will silently skip legitimate project directories like The switch from exact membership to unanchored substring matching is intentional per the PR description, but it causes macos_extraction_helpers.py and constants.py — the combination of new broadly-named skip entries (Public, Templates, Videos) with unanchored case-insensitive matching is most likely to swallow legitimate project paths. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Directory Entry] --> B{should_skip_path}
B --> C[Iterate path.parts]
C --> D{SKIP_PATTERN match on part}
D -- match --> E[Skip - return True]
D -- no match --> F{system_dirs check}
F -- in system_dirs --> E
F -- not in system_dirs --> G[Process - return False]
subgraph Pattern Construction
H[SKIP_DIRS frozenset] --> I[re.escape each entry]
I --> J[join with pipe]
J --> K[re.compile with re.IGNORECASE]
K --> L[SKIP_PATTERN]
end
L --> D
Reviews (4): Last reviewed commit: "fix: fixed type error" | Re-trigger Greptile |
| """ | ||
| # Skip common project directories (check all path parts for nested matches) | ||
| if any(part in SKIP_DIRS for part in path.parts): | ||
| if SKIP_PATTERN.search(str(path)) is not None: |
There was a problem hiding this comment.
Searching the entire stringified path instead of individual parts can produce false positives where a skip keyword straddles a path separator. The macOS sibling function uses
any(SKIP_PATTERN.search(part) for part in path.parts), which checks each component independently. Applying the same approach here keeps the two implementations consistent.
| if SKIP_PATTERN.search(str(path)) is not None: | |
| if any(SKIP_PATTERN.search(part) for part in path.parts): |
| from typing import List, Dict, Optional, Tuple, Callable | ||
|
|
||
| from .constants import MAX_CONFIG_FILE_SIZE, SKIP_DIRS | ||
| from scripts.coding_discovery_tools.macos_extraction_helpers import SKIP_PATTERN |
There was a problem hiding this comment.
Windows helper importing macOS module
windows_extraction_helpers.py now imports SKIP_PATTERN from macos_extraction_helpers via an absolute path. This creates a cross-platform coupling: if macos_extraction_helpers ever gains a macOS-specific import (e.g. AppKit, subprocess calling dscl), this import will fail on Windows. SKIP_PATTERN is derived solely from SKIP_DIRS (a plain constant), so it belongs in constants.py alongside SKIP_DIRS and should be imported from there by both platforms.
| @@ -26,7 +26,9 @@ | |||
| from pathlib import Path | |||
| from typing import Optional, List, Dict | |||
There was a problem hiding this comment.
There was a problem hiding this comment.
Skip dirs obj is being shared b/w windows and mac os as of now, we should move out of this idea and create a different obj for windows all together.
There was a problem hiding this comment.
That makes sense — platform-specific skip patterns give you more flexibility to diverge the behavior later (e.g., Windows-only paths like AppData, $Recycle.Bin that don't exist on macOS). In that case, defining a separate SKIP_DIRS / pattern in windows_extraction_helpers.py (or a windows_constants.py) and removing the import from macos_extraction_helpers entirely is the right call. The cross-platform coupling comment is no longer applicable here.
Tip: You can give me more context with .greptile/rules.md so I get this right next time.
| """ | ||
| # Skip common project directories (check all path parts for nested matches) | ||
| if any(part in SKIP_DIRS for part in path.parts): | ||
| if SKIP_PATTERN.search(path.name) is not None: |
There was a problem hiding this comment.
Only last path component checked — nested skip dirs not caught
should_skip_path was changed from iterating all path.parts to checking only path.name. A path like C:\Users\alice\node_modules\project\src\rules.mdc has rules.mdc as its .name, so the check passes and the file is processed despite being inside node_modules. The macOS counterpart at line 91 in macos_extraction_helpers.py correctly iterates every component: any(SKIP_PATTERN.search(part) for part in path.parts). The Windows version should do the same.
| SKIP_DIRS = frozenset[str]({ | ||
| '.git', 'node_modules', 'venv', '__pycache__', '.venv', 'vendor', '.idea', '.vscode', 'Library', '.Trash', '.cache', | ||
| 'Photos', 'Music', 'Movies', 'Pictures', 'Public', 'Templates', 'Videos' | ||
| }) |
There was a problem hiding this comment.
frozenset[str](...) uses the generic-alias call syntax introduced in Python 3.9. On Python 3.8, frozenset[str] raises TypeError: 'type' object is not subscriptable at import time, breaking all platforms. The type annotation adds no runtime value here; the idiomatic form is just frozenset({...}).
| SKIP_DIRS = frozenset[str]({ | |
| '.git', 'node_modules', 'venv', '__pycache__', '.venv', 'vendor', '.idea', '.vscode', 'Library', '.Trash', '.cache', | |
| 'Photos', 'Music', 'Movies', 'Pictures', 'Public', 'Templates', 'Videos' | |
| }) | |
| SKIP_DIRS = frozenset({ | |
| '.git', 'node_modules', 'venv', '__pycache__', '.venv', 'vendor', '.idea', '.vscode', 'Library', '.Trash', '.cache', | |
| 'Photos', 'Music', 'Movies', 'Pictures', 'Public', 'Templates', 'Videos' | |
| }) |
8ed8e9f to
032217b
Compare
| # Skip common project directories (check all path parts for nested matches) | ||
| if any(part in SKIP_DIRS for part in path.parts): | ||
| if SKIP_PATTERN.search(path) is not None: |
There was a problem hiding this comment.
SKIP_PATTERN.search(path) passes a Path object to re.search(), which only accepts a str or bytes-like argument. On every call to should_skip_path this raises TypeError: expected string or bytes-like object, completely breaking directory filtering on Windows. The macOS sibling already does this correctly by iterating path.parts (a tuple of strings).
| # Skip common project directories (check all path parts for nested matches) | |
| if any(part in SKIP_DIRS for part in path.parts): | |
| if SKIP_PATTERN.search(path) is not None: | |
| # Skip common project directories (check all path parts for nested matches) | |
| if any(SKIP_PATTERN.search(part) for part in path.parts): |
# Conflicts: # scripts/coding_discovery_tools/constants.py
| @@ -37,8 +37,9 @@ | |||
| MAX_SEARCH_DEPTH = 10 # Maximum directory depth to search recursively | |||
| SKIP_DIRS = frozenset[str]({ | |||
There was a problem hiding this comment.
frozenset[str](...) still raises TypeError on Python 3.8 because built-in collection generics are not subscriptable there. This file is imported by the extraction helpers before traversal starts, so a Python 3.8 runtime can fail at module import instead of running discovery. The runtime type parameter is not needed here, and the other frozenset constants in this file already use the compatible form.
| SKIP_DIRS = frozenset[str]({ | |
| SKIP_DIRS = frozenset({ |
vigneshsubbiah16
left a comment
There was a problem hiding this comment.
🛡️ Automated Security Review (consensus)
0 findings — 0 high-confidence, 0 to triage. Reviewers: Cursor, Claude, Semgrep, Gitleaks.
No exploitable security issues identified. Changes affect directory-traversal filtering only (case-insensitive substring skip matching, new skip entries); no auth, secrets, injection, or unsafe deserialization paths introduced. Semgrep and Gitleaks reported clean; Claude found no security issues.
Previously acknowledged (not re-flagged)
- Unanchored substring skip matching (
public/,templates/,.git_backup/, etc.) — Maintainer (2026-05-06): accepted as expected; directories whose names contain skip keywords are unlikely to hold tools/skills files; will monitor and adjustSKIP_DIRSif coverage gaps appear.
🤖 consensus review · reviewers: Cursor,Claude,Semgrep,Gitleaks · head 7cff01d3 · 2026-07-01T13:56Z
How parse works now
pHotos, name_photos kinda folders are also skipped from being parsed (even if they are present as subdirs with dirs that we have access to parse)
Note
Low Risk
Changes only affect which directories are traversed during config/rules discovery; no auth or data-handling logic, with a small chance of skipping paths whose names merely contain a skip token as a substring.
Overview
Discovery walks now treat
SKIP_DIRSas case-insensitive substring matches on each path component (via a sharedSKIP_PATTERNregex built from the skip list), so folders likepHotosorname_photosare skipped the same way as exact names likePhotos.PublicandTemplateswere added toSKIP_DIRS. macOSshould_skip_path, Windowsshould_skip_path, and the Windows Cursor CLI settings walker all useSKIP_PATTERNinstead of exactin SKIP_DIRSchecks.Reviewed by Cursor Bugbot for commit 7cff01d. Bugbot is set up for automated code reviews on this repo. Configure here.
Greptile Summary
This PR broadens directory skipping during coding discovery. The main changes are:
Confidence Score: 4/5
This is close, but the import-time compatibility break should be fixed before merging.
SKIP_DIRSstill uses syntax that can fail before discovery starts on Python 3.8.Pathobject into regex search.scripts/coding_discovery_tools/constants.py
Important Files Changed
Reviews (5): Last reviewed commit: "Merge remote-tracking branch 'origin/mai..." | Re-trigger Greptile