Add regex-based fallback parser for non-tree-sitter languages #2

Open
2dmaster wants to merge 5 commits into graph-memory:main from 2dmaster:feat/regex-language-fallback

2dmaster commented May 7, 2026

Summary

Tree-sitter grammars are currently only wired up for TypeScript/JavaScript,
so code_search returns no symbols for the ~40 other extensions already
mapped in EXT_TO_LANGUAGE. This PR adds a regex-based fallback parser
that activates when no tree-sitter grammar is registered for the detected
language.

What's new

  • RegexLanguageMapper interface (parallel to the existing tree-sitter
    LanguageMapper) that operates directly on source text.
  • createRegexMapper(opts) factory in
    src/lib/parsers/languages/regex-mapper.ts that builds mappers from
    { symbols, imports, docCommentLine } patterns.
  • registerRegexLanguage() / getRegexMapper() /
    isRegexLanguageSupported() in the registry, parallel to the tree-sitter
    API.
  • Built-in patterns for 20 languages: Python, Go, Rust, Ruby, Java, Kotlin,
    C, C++, C#, Swift, PHP, Lua, GDScript, GLSL, Dart, Shell, SQL, Scala,
    Elixir, Haskell.
  • New extension mappings: .gd → gdscript, .gdshader /
    .gdshaderinc → glsl, plus .glsl / .vert / .frag / .geom /
    .tesc / .tese / .comp / .hlsl → glsl.
  • parseCodeFile() falls back to the regex path when tree-sitter is
    unavailable for the language. Tree-sitter remains primary when registered.
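
As a rough illustration of the factory described above, here is a minimal sketch of how a mapper could be built from { symbols, imports, docCommentLine } patterns. Only createRegexMapper, RegexLanguageMapper, and the three option keys come from this PR. The type shapes, the `parse` method, the `doc` field, and the dedup key are assumptions made for the example, and the Rust-like patterns are illustrative rather than the built-in ones.

```typescript
// Hypothetical shapes: the PR names the option keys but not their types.
interface SymbolPattern {
  kind: string;  // e.g. "function", "struct"
  regex: RegExp; // line-anchored, non-global; capture group 1 is the name
}

interface RegexMapperOptions {
  symbols: SymbolPattern[];
  imports: RegExp[];       // capture group 1 is the imported module
  docCommentLine?: RegExp; // matches one doc-comment line
}

interface ParsedSymbol {
  name: string;
  kind: string;
  line: number; // 1-based line number
  doc: string;  // doc-comment lines directly above the definition
}

interface RegexLanguageMapper {
  parse(source: string): { symbols: ParsedSymbol[]; imports: string[] };
}

function createRegexMapper(opts: RegexMapperOptions): RegexLanguageMapper {
  return {
    parse(source) {
      const symbols: ParsedSymbol[] = [];
      const imports: string[] = [];
      if (source.length === 0) return { symbols, imports };

      const seen = new Set<string>();
      let pendingDoc: string[] = [];

      source.split(/\r?\n/).forEach((text, index) => {
        // Collect doc-comment lines until a non-comment line appears.
        if (opts.docCommentLine?.test(text)) {
          pendingDoc.push(text.trim());
          return;
        }
        for (const re of opts.imports) {
          const mod = re.exec(text)?.[1];
          if (mod) imports.push(mod);
        }
        for (const { kind, regex } of opts.symbols) {
          const name = regex.exec(text)?.[1];
          if (!name) continue; // skip matches with a missing name group
          // Assumed dedup rule: one entry per kind/name/line.
          const key = `${kind}:${name}:${index + 1}`;
          if (seen.has(key)) continue;
          seen.add(key);
          symbols.push({ name, kind, line: index + 1, doc: pendingDoc.join("\n") });
        }
        pendingDoc = []; // doc comments attach only to the next line
      });
      return { symbols, imports };
    },
  };
}

// Illustrative Rust-like patterns, not the ones shipped in the PR.
const rustLike = createRegexMapper({
  symbols: [
    { kind: "function", regex: /^\s*(?:pub\s+)?fn\s+([A-Za-z_]\w*)/ },
    { kind: "struct", regex: /^\s*(?:pub\s+)?struct\s+([A-Za-z_]\w*)/ },
  ],
  imports: [/^\s*use\s+([\w:]+)/],
  docCommentLine: /^\s*\/\/\//,
});

const result = rustLike.parse(
  [
    "use std::fmt;",
    "",
    "/// Adds two numbers.",
    "pub fn add(a: i32, b: i32) -> i32 { a + b }",
  ].join("\n"),
);
// result.imports is ["std::fmt"]; result.symbols holds one entry for `add` on line 4.
```

The patterns are deliberately non-global so that `exec` and `test` keep no state between lines.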

Trade-offs

The fallback is best-effort: line-anchored regex catches function/class/
struct/enum/trait definitions and import statements. It does not extract
inheritance edges (extends/implements) or class-member containment —
those return empty arrays. Tree-sitter mappers are still preferred when
available, so this is purely additive (no behaviour change for TS/JS).
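
The precedence described above can be sketched as follows. This is a simplified stand-in, not the PR's code: registerRegexLanguage, getRegexMapper, isRegexLanguageSupported, parseCodeFile, and EXT_TO_LANGUAGE are names from this PR, while the `Mapper` and `ParseResult` shapes, the `via` field, the `extendsEdges` / `contains` field names, and the single-pattern GDScript mapper are assumptions made for the example.

```typescript
// Hypothetical result shape; `via` exists only to make the chosen path visible.
interface ParseResult {
  symbols: { name: string; kind: string; line: number }[];
  imports: string[];
  extendsEdges: string[]; // stays empty on the regex path
  contains: string[];     // stays empty on the regex path
  via: "tree-sitter" | "regex" | "none";
}

interface Mapper {
  parse(source: string): Omit<ParseResult, "via">;
}

const treeSitterMappers = new Map<string, Mapper>(); // stand-in for the tree-sitter registry
const regexMappers = new Map<string, Mapper>();

function registerRegexLanguage(language: string, mapper: Mapper): void {
  regexMappers.set(language, mapper);
}
function getRegexMapper(language: string): Mapper | undefined {
  return regexMappers.get(language);
}
function isRegexLanguageSupported(language: string): boolean {
  return regexMappers.has(language);
}

// Trimmed-down extension table for the example.
const EXT_TO_LANGUAGE: Record<string, string> = {
  ".ts": "typescript",
  ".gd": "gdscript",
  ".gdshader": "glsl",
};

function parseCodeFile(path: string, source: string): ParseResult {
  const dot = path.lastIndexOf(".");
  const ext = dot >= 0 ? path.slice(dot).toLowerCase() : "";
  const language = EXT_TO_LANGUAGE[ext];
  const empty = { symbols: [], imports: [], extendsEdges: [], contains: [] };
  if (!language) return { ...empty, via: "none" };

  // Tree-sitter stays primary whenever a grammar is registered.
  const primary = treeSitterMappers.get(language);
  if (primary) return { ...primary.parse(source), via: "tree-sitter" };

  // Otherwise fall back to the regex mapper, if one exists.
  const fallback = getRegexMapper(language);
  if (fallback) return { ...fallback.parse(source), via: "regex" };

  return { ...empty, via: "none" };
}

// A deliberately tiny GDScript mapper: functions only, no edges.
registerRegexLanguage("gdscript", {
  parse(source) {
    const symbols: { name: string; kind: string; line: number }[] = [];
    source.split(/\r?\n/).forEach((text, index) => {
      const name = /^\s*func\s+([A-Za-z_]\w*)/.exec(text)?.[1];
      if (name) symbols.push({ name, kind: "function", line: index + 1 });
    });
    return { symbols, imports: [], extendsEdges: [], contains: [] };
  },
});

const player = parseCodeFile("player.gd", "extends Node\nfunc _ready():\n\tpass");
// player.via is "regex"; player.symbols holds `_ready` on line 2; player.extendsEdges is empty.
```

Because the tree-sitter lookup runs first, registering a grammar for a language later shadows its regex mapper without any change to callers.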

Test plan

  • Unit tests for createRegexMapper (line numbers, dedup, doc-comment
    attachment, empty source, missing-name groups, isExported flag).
  • Per-language assertions for Python, Go, Rust, GDScript, GLSL.
  • CI run requested: I authored this PR via the GitHub API without a
    local checkout, so npm test and npm run build haven't been run
    locally. Please run CI to verify.

Background

Started from a real-world hit: indexing a Godot project (.gd /
.gdshader) — the files showed up in files_list but code_search
returned nothing because no parser was registered for those languages.

2dmaster and others added 3 commits May 8, 2026 01:12
Tree-sitter grammars are currently only registered for TypeScript/JavaScript,
so code_search returns nothing for the ~40 other extensions already mapped in
EXT_TO_LANGUAGE. This change adds a regex-based fallback that runs when no
tree-sitter grammar is registered for the detected language, plus built-in
patterns for 20 common languages: Python, Go, Rust, Ruby, Java,
Kotlin, C/C++, C#, Swift, PHP, Lua, GDScript, GLSL, Dart, Shell, SQL, Scala,
Elixir, Haskell.

The fallback is best-effort: line-anchored regex extracts function/class/
struct/enum definitions and import statements. Less accurate than true AST
parsing, but enough to make code_search surface results.

Also adds extension mappings for Godot (.gd, .gdshader, .gdshaderinc) and
common shader files (.glsl, .vert, .frag, .geom, .tesc, .tese, .comp, .hlsl).

- Tree-sitter AST mappers: Python, Go, Rust, Java, PHP, Ruby, C#, C/C++, Bash, GDScript
- Regex parsers for Godot scene/resource/project/extension file types
- GDScript WASM committed to wasm/ with postinstall.js copying it into
  node_modules/@vscode/tree-sitter-wasm/wasm after npm install
- Added .glsl to default codeInclude for Godot 4 external shader support
- Ignored .mcp.json and .notes/ in .gitignore

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…o/rust/gdscript

- multi-config: update expected default codeInclude to include Godot + glsl extensions
- regex-parser: python/go/rust/gdscript moved to tree-sitter — remove their regex mapper
  tests and flip isRegexLanguageSupported assertions to false
- code-parser-advanced: use .xyz extension for unsupported-language test (.py now parsed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

codecov Bot commented May 12, 2026

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

ℹ️ You can also turn on project coverage checks and project coverage reporting on Pull Request comment

Thanks for integrating Codecov - We've got you covered ☂️

2dmaster and others added 2 commits May 12, 2026 15:21
Covers Python, Go, Rust, Java, PHP, Ruby, C#, C/C++, and Bash via
parseCodeFile integration tests — fixes codecov/patch failure on PR.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- GDScript: func, class_name, enum, signal, var, const, inner class, extends edge (7% → 80%)
- Godot scene/resource/project/extension regex mappers (regex-patterns.ts 40% → 100%)
- C# extended: namespace, enum, property, method, field (68% → 88%)
- C++ extended: template class, enum, class methods extracted via class body (69% → 80%)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>