Skip to content

[WEB-4211] feat: skip specific folders to not parse#120

Open
MohamedAklamaash wants to merge 4 commits into
mainfrom
ak/skip-dirs
Open

[WEB-4211] feat: skip specific folders to not parse#120
MohamedAklamaash wants to merge 4 commits into
mainfrom
ak/skip-dirs

Conversation

@MohamedAklamaash

@MohamedAklamaash MohamedAklamaash commented May 5, 2026

Copy link
Copy Markdown
Contributor

How parse works now

  • instead of checking exact string match, now we search by building a parse tree using aho-corasik algo and do substring search along with ignoring case so that cases like

pHotos, name_photos kinda folders are also skipped from being parsed (even if they are present as subdirs with dirs that we have access to parse)


Note

Low Risk
Changes only affect which directories are traversed during config/rules discovery; no auth or data-handling logic, with a small chance of skipping paths whose names merely contain a skip token as a substring.

Overview
Discovery walks now treat SKIP_DIRS as case-insensitive substring matches on each path component (via a shared SKIP_PATTERN regex built from the skip list), so folders like pHotos or name_photos are skipped the same way as exact names like Photos.

Public and Templates were added to SKIP_DIRS. macOS should_skip_path, Windows should_skip_path, and the Windows Cursor CLI settings walker all use SKIP_PATTERN instead of exact in SKIP_DIRS checks.

Reviewed by Cursor Bugbot for commit 7cff01d. Bugbot is set up for automated code reviews on this repo. Configure here.

Greptile Summary

This PR broadens directory skipping during coding discovery. The main changes are:

  • Adds more directory names to the shared skip list.
  • Builds a case-insensitive regex from the skip list.
  • Applies regex-based skip checks in macOS and Windows traversal paths.

Confidence Score: 4/5

This is close, but the import-time compatibility break should be fixed before merging.

  • SKIP_DIRS still uses syntax that can fail before discovery starts on Python 3.8.
  • The Windows path checks now operate on path components rather than passing a Path object into regex search.

scripts/coding_discovery_tools/constants.py

Important Files Changed

Filename Overview
scripts/coding_discovery_tools/constants.py Adds skip-list entries, but the skip-list construction can still fail during import on Python 3.8.
scripts/coding_discovery_tools/macos_extraction_helpers.py Adds the shared case-insensitive skip pattern and applies it to each path component.
scripts/coding_discovery_tools/windows_extraction_helpers.py Updates Windows skip checks to search each path component with the shared pattern.
scripts/coding_discovery_tools/windows/cursor_cli/settings_extractor.py Uses the shared skip pattern when walking Cursor CLI settings directories.

Reviews (5): Last reviewed commit: "Merge remote-tracking branch 'origin/mai..." | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.

@MohamedAklamaash MohamedAklamaash requested a review from anonpran May 5, 2026 11:06
@MohamedAklamaash MohamedAklamaash self-assigned this May 5, 2026
@greptile-apps

greptile-apps Bot commented May 5, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR replaces exact set-membership directory filtering with case-insensitive regex substring matching (via a module-level re.compile pattern) and extends SKIP_DIRS with personal-media folders (Photos, Music, Movies, Pictures, Public, Templates, Videos). Both the macOS and Windows helpers are updated to use the new compiled pattern; Windows files import the pattern directly from macos_extraction_helpers.

  • constants.py adds seven new skip-dir entries and wraps SKIP_DIRS in frozenset[str](...), a Python 3.9+ generic-alias form.
  • macos_extraction_helpers.py builds SKIP_PATTERN at import time from SKIP_DIRS and switches should_skip_path to SKIP_PATTERN.search(part) per path component.
  • Both Windows files (windows_extraction_helpers.py and windows/cursor_cli/settings_extractor.py) import SKIP_PATTERN from the macOS module; the dev has acknowledged this coupling and plans to introduce a separate Windows-side pattern.

Confidence Score: 4/5

The core filtering logic is functional, but the open-ended substring matching (no word boundaries or anchors) will silently skip legitimate project directories like public/ or templates/ that are standard in many web frameworks.

The switch from exact membership to unanchored substring matching is intentional per the PR description, but it causes should_skip_path to drop real project directories whose names merely contain a skip-keyword (e.g. public/ in Rails/Next.js, my_library/). This is a present behavioural change — the tool will miss files in those folders — rather than a theoretical risk. The remaining issues (Python 3.9 syntax, cross-platform import) were flagged in earlier review rounds and are on the dev's radar.

macos_extraction_helpers.py and constants.py — the combination of new broadly-named skip entries (Public, Templates, Videos) with unanchored case-insensitive matching is most likely to swallow legitimate project paths.

Important Files Changed

Filename Overview
scripts/coding_discovery_tools/constants.py Adds personal-media directories to SKIP_DIRS and wraps it in frozensetstr — the generic-alias subscript syntax is a Python 3.9+ feature that raises TypeError on 3.8.
scripts/coding_discovery_tools/macos_extraction_helpers.py Introduces a module-level SKIP_PATTERN compiled from SKIP_DIRS with re.IGNORECASE and switches should_skip_path to substring search; no anchors means common project folders (public/, templates/) are silently skipped.
scripts/coding_discovery_tools/windows_extraction_helpers.py Imports SKIP_PATTERN from the macOS module (cross-platform coupling flagged in prior threads); the generator variable path shadows the function parameter — benign in Python 3 but confusing.
scripts/coding_discovery_tools/windows/cursor_cli/settings_extractor.py Switches from SKIP_DIRS set-membership to SKIPDIR_PATTERN.search on entry.name; cross-platform import from macos_extraction_helpers flagged in prior threads, dev acknowledged and plans to split.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Directory Entry] --> B{should_skip_path}
    B --> C[Iterate path.parts]
    C --> D{SKIP_PATTERN match on part}
    D -- match --> E[Skip - return True]
    D -- no match --> F{system_dirs check}
    F -- in system_dirs --> E
    F -- not in system_dirs --> G[Process - return False]

    subgraph Pattern Construction
        H[SKIP_DIRS frozenset] --> I[re.escape each entry]
        I --> J[join with pipe]
        J --> K[re.compile with re.IGNORECASE]
        K --> L[SKIP_PATTERN]
    end

    L --> D
Loading

Reviews (4): Last reviewed commit: "fix: fixed type error" | Re-trigger Greptile

Comment thread scripts/coding_discovery_tools/constants.py
"""
# Skip common project directories (check all path parts for nested matches)
if any(part in SKIP_DIRS for part in path.parts):
if SKIP_PATTERN.search(str(path)) is not None:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Searching the entire stringified path instead of individual parts can produce false positives where a skip keyword straddles a path separator. The macOS sibling function uses any(SKIP_PATTERN.search(part) for part in path.parts), which checks each component independently. Applying the same approach here keeps the two implementations consistent.

Suggested change
if SKIP_PATTERN.search(str(path)) is not None:
if any(SKIP_PATTERN.search(part) for part in path.parts):

from typing import List, Dict, Optional, Tuple, Callable

from .constants import MAX_CONFIG_FILE_SIZE, SKIP_DIRS
from scripts.coding_discovery_tools.macos_extraction_helpers import SKIP_PATTERN

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Windows helper importing macOS module

windows_extraction_helpers.py now imports SKIP_PATTERN from macos_extraction_helpers via an absolute path. This creates a cross-platform coupling: if macos_extraction_helpers ever gains a macOS-specific import (e.g. AppKit, subprocess calling dscl), this import will fail on Windows. SKIP_PATTERN is derived solely from SKIP_DIRS (a plain constant), so it belongs in constants.py alongside SKIP_DIRS and should be imported from there by both platforms.

@@ -26,7 +26,9 @@
from pathlib import Path
from typing import Optional, List, Dict

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Same cross-platform coupling as windows_extraction_helpers.py

This file also imports SKIP_PATTERN directly from macos_extraction_helpers. Moving the compiled pattern (or its construction) to constants.py would remove this dependency for both Windows files at once.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skip dirs obj is being shared b/w windows and mac os as of now, we should move out of this idea and create a different obj for windows all together.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense — platform-specific skip patterns give you more flexibility to diverge the behavior later (e.g., Windows-only paths like AppData, $Recycle.Bin that don't exist on macOS). In that case, defining a separate SKIP_DIRS / pattern in windows_extraction_helpers.py (or a windows_constants.py) and removing the import from macos_extraction_helpers entirely is the right call. The cross-platform coupling comment is no longer applicable here.

Tip: You can give me more context with .greptile/rules.md so I get this right next time.

"""
# Skip common project directories (check all path parts for nested matches)
if any(part in SKIP_DIRS for part in path.parts):
if SKIP_PATTERN.search(path.name) is not None:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Only last path component checked — nested skip dirs not caught

should_skip_path was changed from iterating all path.parts to checking only path.name. A path like C:\Users\alice\node_modules\project\src\rules.mdc has rules.mdc as its .name, so the check passes and the file is processed despite being inside node_modules. The macOS counterpart at line 91 in macos_extraction_helpers.py correctly iterates every component: any(SKIP_PATTERN.search(part) for part in path.parts). The Windows version should do the same.

Comment on lines +38 to +41
SKIP_DIRS = frozenset[str]({
'.git', 'node_modules', 'venv', '__pycache__', '.venv', 'vendor', '.idea', '.vscode', 'Library', '.Trash', '.cache',
'Photos', 'Music', 'Movies', 'Pictures', 'Public', 'Templates', 'Videos'
})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 frozenset[str](...) uses the generic-alias call syntax introduced in Python 3.9. On Python 3.8, frozenset[str] raises TypeError: 'type' object is not subscriptable at import time, breaking all platforms. The type annotation adds no runtime value here; the idiomatic form is just frozenset({...}).

Suggested change
SKIP_DIRS = frozenset[str]({
'.git', 'node_modules', 'venv', '__pycache__', '.venv', 'vendor', '.idea', '.vscode', 'Library', '.Trash', '.cache',
'Photos', 'Music', 'Movies', 'Pictures', 'Public', 'Templates', 'Videos'
})
SKIP_DIRS = frozenset({
'.git', 'node_modules', 'venv', '__pycache__', '.venv', 'vendor', '.idea', '.vscode', 'Library', '.Trash', '.cache',
'Photos', 'Music', 'Movies', 'Pictures', 'Public', 'Templates', 'Videos'
})

Comment on lines +74 to +75
# Skip common project directories (check all path parts for nested matches)
if any(part in SKIP_DIRS for part in path.parts):
if SKIP_PATTERN.search(path) is not None:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 SKIP_PATTERN.search(path) passes a Path object to re.search(), which only accepts a str or bytes-like argument. On every call to should_skip_path this raises TypeError: expected string or bytes-like object, completely breaking directory filtering on Windows. The macOS sibling already does this correctly by iterating path.parts (a tuple of strings).

Suggested change
# Skip common project directories (check all path parts for nested matches)
if any(part in SKIP_DIRS for part in path.parts):
if SKIP_PATTERN.search(path) is not None:
# Skip common project directories (check all path parts for nested matches)
if any(SKIP_PATTERN.search(part) for part in path.parts):

Comment thread scripts/coding_discovery_tools/macos_extraction_helpers.py
@MohamedAklamaash MohamedAklamaash changed the title feat: skip specific folders to not parse [WEB-4211] feat: skip specific folders to not parse May 5, 2026
# Conflicts:
#	scripts/coding_discovery_tools/constants.py
@MohamedAklamaash MohamedAklamaash requested a review from a team July 1, 2026 13:43
@@ -37,8 +37,9 @@
MAX_SEARCH_DEPTH = 10 # Maximum directory depth to search recursively
SKIP_DIRS = frozenset[str]({

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Python 3.8 import break

frozenset[str](...) still raises TypeError on Python 3.8 because built-in collection generics are not subscriptable there. This file is imported by the extraction helpers before traversal starts, so a Python 3.8 runtime can fail at module import instead of running discovery. The runtime type parameter is not needed here, and the other frozenset constants in this file already use the compatible form.

Suggested change
SKIP_DIRS = frozenset[str]({
SKIP_DIRS = frozenset({

@vigneshsubbiah16 vigneshsubbiah16 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛡️ Automated Security Review (consensus)

0 findings — 0 high-confidence, 0 to triage. Reviewers: Cursor, Claude, Semgrep, Gitleaks.

No exploitable security issues identified. Changes affect directory-traversal filtering only (case-insensitive substring skip matching, new skip entries); no auth, secrets, injection, or unsafe deserialization paths introduced. Semgrep and Gitleaks reported clean; Claude found no security issues.

Previously acknowledged (not re-flagged)

  • Unanchored substring skip matching (public/, templates/, .git_backup/, etc.) — Maintainer (2026-05-06): accepted as expected; directories whose names contain skip keywords are unlikely to hold tools/skills files; will monitor and adjust SKIP_DIRS if coverage gaps appear.

🤖 consensus review · reviewers: Cursor,Claude,Semgrep,Gitleaks · head 7cff01d3 · 2026-07-01T13:56Z

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants