Implement: .cgrignore File Support (pathspec integration) #38

@lexasub

Implementation Guide: .cgrignore File Support

🎯 Goal

Add support for a .cgrignore file (similar to .gitignore) so that users can declare which files and directories are excluded from indexing.

📋 Current State

What exists:

  • exclude_patterns list in config (hardcoded or in JSON)
  • pathspec library in requirements.txt (NOT USED)
  • walk_source_files() function with hardcoded exclusions

What's missing:

  • ❌ .cgrignore file support
  • ❌ Glob pattern matching (only exact directory names are matched today)
  • ❌ Per-project ignore files
  • ❌ Negation patterns (!important.py)

🔧 Implementation Plan

Step 1: Create .cgrignore Parser

File: ast_rag/utils/ignore_parser.py (new)

"""
.cgrignore file parser for AST-RAG.

Format: Same as .gitignore
- One pattern per line
- # for comments
- ! for negation
- ** for matching across directories
- * for wildcard matching
"""

import os
from pathlib import Path
from typing import Optional
import pathspec


class CgrIgnoreParser:
    """Parser for .cgrignore files."""
    
    def __init__(self, root_path: str):
        self.root_path = Path(root_path)
        self.spec: Optional[pathspec.PathSpec] = None
        self.patterns: list[str] = []
    
    def load(self, ignore_file: Optional[str] = None) -> None:
        """
        Load .cgrignore file.
        
        Args:
            ignore_file: Path to ignore file. If None, looks for .cgrignore in root.
        """
        if ignore_file is None:
            ignore_file = os.path.join(self.root_path, ".cgrignore")
        
        if not os.path.exists(ignore_file):
            # No ignore file, use defaults
            self._load_defaults()
            return
        
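        # Read with an explicit encoding so behaviour does not depend on the platform default
        with open(ignore_file, "r", encoding="utf-8") as f: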
            lines = f.readlines()
        
        # Filter out comments and empty lines
        patterns = []
        for line in lines:
            line = line.strip()
            if line and not line.startswith("#"):
                patterns.append(line)
        
        self.patterns = patterns
        self.spec = pathspec.PathSpec.from_lines("gitwildmatch", patterns)
    
    def _load_defaults(self) -> None:
        """Load default ignore patterns."""
        self.patterns = [
            ".git/",
            "__pycache__/",
            "node_modules/",
            "target/",
            "build/",
            "dist/",
            ".gradle/",
            ".idea/",
            ".vscode/",
            "venv/",
            ".venv/",
            "*.pyc",
            "*.pyo",
            "*.class",
            "*.o",
            "*.so",
            "*.dll",
        ]
        self.spec = pathspec.PathSpec.from_lines("gitwildmatch", self.patterns)
    
    def should_ignore(self, file_path: str) -> bool:
        """
        Check if a file should be ignored.
        
        Args:
            file_path: Absolute or relative path to check
        
        Returns:
            True if file should be ignored
        """
        if self.spec is None:
            self._load_defaults()
        
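        # Make path relative to root
        path = Path(file_path)
        try:
            rel_path = path.relative_to(self.root_path)
        except ValueError:
            # Path is not under root, don't ignore
            return False
        
        # Use forward slashes, and mark directories with a trailing slash:
        # pathspec matches a pattern such as "build/" against a directory
        # path only when that path itself ends in "/".
        rel = rel_path.as_posix()
        if path.is_dir():
            rel += "/"
        
        return self.spec.match_file(rel)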
    
    def get_patterns(self) -> list[str]:
        """Return loaded patterns."""
        return self.patterns.copy()
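
One pathspec behaviour is worth keeping in mind when this parser is used to prune directories: a trailing-slash pattern such as `build/` matches paths inside the directory, and matches the directory itself only when the candidate path also ends in `/`. The sketch below checks this directly. It assumes `pathspec` is installed, and the paths are made up for illustration.

```python
import pathspec

spec = pathspec.PathSpec.from_lines("gitwildmatch", ["build/", "*.pyc"])

# A file inside an ignored directory matches the directory pattern.
inside = spec.match_file("build/gen.py")

# The bare directory name does not match a trailing-slash pattern...
bare_dir = spec.match_file("build")

# ...but the same name with a trailing slash does.
slashed_dir = spec.match_file("build/")

# A pattern without a slash applies at any depth.
nested_pyc = spec.match_file("src/pkg/mod.pyc")

print(inside, bare_dir, slashed_dir, nested_pyc)
```

In practice this means a directory walk has to pass directory paths with a trailing slash if ignored directories are to be skipped before they are descended into.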

Step 2: Update walk_source_files()

File: ast_rag/services/parsing/parser_manager.py

Modify walk_source_files() to use the ignore parser:

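import os
from pathlib import Path
from typing import Optional

from ast_rag.utils.ignore_parser import CgrIgnoreParser  # module added in Step 1

# EXT_TO_LANG is assumed to be defined already in this module


def walk_source_files(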
    root: str,
    exclude_dirs: Optional[list[str]] = None,
    ignore_file: Optional[str] = None,
) -> list[tuple[str, str]]:
    """
    Recursively enumerate all source files under root.
    
    Args:
        root: Root directory to walk
        exclude_dirs: Additional directories to exclude (legacy, kept for compatibility)
        ignore_file: Path to .cgrignore file (default: root/.cgrignore)
    
    Returns:
        List of (absolute_file_path, language) tuples
    """
    # Normalise root so returned paths are absolute, as documented, and can
    # always be made relative to the parser's root
    root = os.path.abspath(root)
    
    # Initialize ignore parser
    ignore_parser = CgrIgnoreParser(root)
    ignore_parser.load(ignore_file)
    
    # Legacy exact-name exclusions (kept for backward compatibility)
    legacy_excludes = set(exclude_dirs or [])
    
    result: list[tuple[str, str]] = []
    
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into skipped directories:
        # hidden dirs, legacy excludes, and anything matched by .cgrignore
        dirnames[:] = [
            d for d in dirnames
            if not d.startswith(".")
            and d not in legacy_excludes
            and not ignore_parser.should_ignore(os.path.join(dirpath, d))
        ]
        
        for fname in filenames:
            file_path = os.path.join(dirpath, fname)
            
            # Check if file should be ignored
            if ignore_parser.should_ignore(file_path):
                continue
            
            ext = Path(fname).suffix.lower()
            lang = EXT_TO_LANG.get(ext)
            if lang:
                result.append((file_path, lang))
    
    return result
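
The in-place filtering of dirnames is what makes the walk skip whole subtrees. A minimal, standard-library-only sketch of that mechanism follows; `walk_pruned` and the sample file layout are hypothetical and exist only to illustrate the idea, with a plain name set standing in for the ignore parser.

```python
import os
import tempfile


def walk_pruned(root: str, skip: set[str]) -> list[str]:
    """Return file paths under root (relative, '/'-separated), never entering dirs named in skip."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Slice assignment mutates the list that os.walk itself holds,
        # so removed names are never descended into.
        dirnames[:] = [d for d in dirnames if d not in skip]
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            found.append(rel.replace(os.sep, "/"))
    return sorted(found)


with tempfile.TemporaryDirectory() as tmp:
    for rel in ("src/main.py", "build/gen.py", "src/build/old.py"):
        path = os.path.join(tmp, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as f:
            f.write("")
    kept = walk_pruned(tmp, {"build"})

print(kept)  # ['src/main.py']
```

Rebinding the name instead (`dirnames = [...]`) would leave os.walk's own list untouched and the excluded directories would still be traversed.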

Step 3: Update CLI to Accept Ignore File

File: ast_rag/cli.py

Add --ignore-file option:

@app.command("index")
def index_project(
    root: str = typer.Argument(".", help="Root directory to index"),
    # ... existing options
    ignore_file: Optional[str] = typer.Option(
        None, 
        "--ignore-file", 
        "-i",
        help="Path to .cgrignore file (default: .cgrignore in root)",
    ),
) -> None:
    """Index a codebase."""
    cfg = _load_config()
    
    # Merge exclude_patterns from config with .cgrignore
    files = walk_source_files(
        root, 
        exclude_dirs=cfg.exclude_patterns,
        ignore_file=ignore_file,
    )
    
    # ... rest of indexing

Step 4: Update Config Schema

File: ast_rag/dto/config.py

Add ignore_file to config:

class ProjectConfig(BaseModel):
    # ... existing fields
    ignore_file: Optional[str] = None  # Path to .cgrignore
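
Steps 3 and 4 introduce two ways to name the ignore file (the --ignore-file option and the ignore_file config field) without stating which one wins. One plausible rule, sketched here with a hypothetical helper `resolve_ignore_file` that is not part of the existing code, is: CLI option first, then the config field, then .cgrignore in the project root.

```python
import os
from typing import Optional


def resolve_ignore_file(
    root: str,
    cli_value: Optional[str] = None,
    config_value: Optional[str] = None,
) -> str:
    """Pick the ignore file path: CLI option, then config field, then root/.cgrignore."""
    for candidate in (cli_value, config_value):
        if candidate:
            return candidate
    return os.path.join(root, ".cgrignore")


print(resolve_ignore_file("proj", cli_value="custom.ignore", config_value="cfg.ignore"))
print(resolve_ignore_file("proj", config_value="cfg.ignore"))
print(resolve_ignore_file("proj"))
```

Whatever order is chosen, documenting it alongside the option's help text would avoid surprises when both sources are set.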

Step 5: Create Sample .cgrignore

File: .cgrignore.example (new in repo root)

# .cgrignore - Files and directories to exclude from AST-RAG indexing
# Format: Same as .gitignore

# Version control
.git/
.svn/
.hg/

# Build artifacts
build/
dist/
target/
*.o
*.so
*.dll
*.pyc
*.pyo
*.class

# Dependencies
node_modules/
vendor/
__pycache__/
.venv/
venv/
env/

# IDE
.idea/
.vscode/
*.swp
*.swo
*~

# Test fixtures (optional - uncomment if needed)
# test/fixtures/
# tests/data/

# Documentation (optional)
# docs/
# *.md

# Large data files
*.csv
*.json
*.parquet

Step 6: Documentation

File: docs/IGNORE_FILES.md (new)

# .cgrignore File Format

AST-RAG supports `.cgrignore` files to exclude files and directories from indexing.

## Location

Place a `.cgrignore` file in your project root. AST-RAG will automatically load it. To use a file in a different location, pass its path with `--ignore-file`.

## Format

The format is identical to `.gitignore`:

# Comments must be on their own line; trailing comments are not supported

# Ignore all .pyc files
*.pyc
# Ignore build directories at any depth
build/
# But don't ignore important.py
!important.py
# Ignore the test directory in the project root only
/test/


## Patterns

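| Pattern | Meaning |
|---------|---------|
| `*.ext` | Ignore all files with extension `.ext`, at any depth |
| `dir/` | Ignore every directory named `dir`, at any depth, and its contents |
| `/dir/` | Ignore `dir` only in the project root |
| `!file` | Negation: re-include `file` if an earlier pattern ignored it |
| `**/dir` | Match `dir` in any directory |
| `dir/**` | Match everything under `dir` |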

## Example

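# Ignore everything inside test/, but keep the critical test config.
# "test/*" is used instead of "test/" because, as with .gitignore, a file
# cannot be re-included once its parent directory is excluded.
test/*
!test/config.py

# Ignore other test files
tests/
*_test.py

# Ignore documentation
docs/
*.md

# But keep README
!README.md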


## Fallback

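If no `.cgrignore` file exists, AST-RAG falls back to a built-in pattern list:
- Version control and caches: `.git/`, `__pycache__/`
- Dependencies and virtual environments: `node_modules/`, `venv/`, `.venv/`
- Build artifacts: `build/`, `dist/`, `target/`, `.gradle/`
- IDE files: `.idea/`, `.vscode/`
- Compiled files: `*.pyc`, `*.pyo`, `*.class`, `*.o`, `*.so`, `*.dll`

These defaults are not merged into an existing `.cgrignore`; once the file exists, its patterns replace them.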
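
Separately from the documentation file itself, the pattern semantics it describes can be spot-checked against pathspec. The following is a sketch with made-up paths, assuming `pathspec` is installed; it covers comment handling, negation, root anchoring, and the fact that a trailing `# ...` on a pattern line is not treated as a comment.

```python
import pathspec

lines = [
    "# comment lines and blank lines are skipped by pathspec",
    "",
    "*.md",
    "!README.md",
    "/test/",
    "*.pyc # not a comment: this whole line is one pattern",
]
spec = pathspec.PathSpec.from_lines("gitwildmatch", lines)

checks = {
    # Matched by "*.md" at any depth.
    "docs/guide.md": spec.match_file("docs/guide.md"),
    # Re-included by the later "!README.md".
    "README.md": spec.match_file("README.md"),
    # "/test/" is anchored to the root...
    "test/helper.py": spec.match_file("test/helper.py"),
    # ...so a nested test directory is not affected.
    "src/test/helper.py": spec.match_file("src/test/helper.py"),
    # The line with the trailing text does not behave like "*.pyc".
    "mod.pyc": spec.match_file("mod.pyc"),
}

for path, ignored in checks.items():
    print(f"{path}: {'ignored' if ignored else 'kept'}")
```

Checks like these could also be turned into unit tests for the parser so that the documented table and the actual matching behaviour stay in sync.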

🧪 Testing

import os
import tempfile

# Import path taken from Step 2
from ast_rag.services.parsing.parser_manager import walk_source_files


def test_cgrignore():
    # Create temp directory with .cgrignore
    with tempfile.TemporaryDirectory() as tmpdir:
        # Write .cgrignore
        with open(os.path.join(tmpdir, ".cgrignore"), "w") as f:
            f.write("*.pyc\nbuild/\n")
        
        # Create test files. build/gen.py has an indexable extension, so only
        # the "build/" pattern can keep it out of the results.
        for rel in ("src/main.py", "build/gen.py", "build/out.pyc"):
            path = os.path.join(tmpdir, *rel.split("/"))
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "w") as f:
                f.write("print('hello')")
        
        # Test walk_source_files
        files = walk_source_files(tmpdir)
        # Compare paths relative to tmpdir so the temp directory's own name
        # cannot influence the assertions
        rel_paths = {
            os.path.relpath(f[0], tmpdir).replace(os.sep, "/") for f in files
        }
        
        assert rel_paths == {"src/main.py"}

📁 Files to Create/Modify

Create:

  1. ast_rag/utils/ignore_parser.py - New ignore parser class
  2. .cgrignore.example - Example ignore file
  3. docs/IGNORE_FILES.md - Documentation

Modify:

  1. ast_rag/services/parsing/parser_manager.py - Update walk_source_files()
  2. ast_rag/cli.py - Add --ignore-file option
  3. ast_rag/dto/config.py - Add ignore_file field
  4. requirements.txt - Ensure pathspec>=0.12 is present

⏱️ Estimated Time

  • 2-3 hours for implementation
  • 1 hour for testing
  • 30 minutes for documentation

Labels: enhancement, cli, configuration
Priority: Low
Total Time: 3.5-4.5 hours (implementation, testing, and documentation)
