Add regex-based fallback parser for non-tree-sitter languages #2
Open
2dmaster wants to merge 5 commits into
Tree-sitter grammars are currently only registered for TypeScript/JavaScript, so code_search returns nothing for the ~40 other extensions already mapped in EXT_TO_LANGUAGE. This change adds a regex-based fallback that runs when no tree-sitter grammar is registered for the detected language, plus built-in patterns for ~16 common languages including Python, Go, Rust, Ruby, Java, Kotlin, C/C++, C#, Swift, PHP, Lua, GDScript, GLSL, Dart, Shell, SQL, Scala, Elixir, Haskell.

The fallback is best-effort: line-anchored regex extracts function/class/struct/enum definitions and import statements. It is less accurate than true AST parsing, but enough to make code_search surface results.

Also adds extension mappings for Godot (.gd, .gdshader, .gdshaderinc) and common shader files (.glsl, .vert, .frag, .geom, .tesc, .tese, .comp, .hlsl).
- Tree-sitter AST mappers: Python, Go, Rust, Java, PHP, Ruby, C#, C/C++, Bash, GDScript
- Regex parsers for Godot scene/resource/project/extension file types
- GDScript WASM committed to wasm/ with postinstall.js copying it into node_modules/@vscode/tree-sitter-wasm/wasm after npm install
- Added .glsl to default codeInclude for Godot 4 external shader support
- Ignored .mcp.json and .notes/ in .gitignore

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…o/rust/gdscript

- multi-config: update expected default codeInclude to include Godot + glsl extensions
- regex-parser: python/go/rust/gdscript moved to tree-sitter; remove their regex mapper tests and flip isRegexLanguageSupported assertions to false
- code-parser-advanced: use .xyz extension for unsupported-language test (.py now parsed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Covers Python, Go, Rust, Java, PHP, Ruby, C#, C/C++, and Bash via parseCodeFile integration tests — fixes codecov/patch failure on PR. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- GDScript: func, class_name, enum, signal, var, const, inner class, extends edge (7% → 80%)
- Godot scene/resource/project/extension regex mappers (regex-patterns.ts 40% → 100%)
- C# extended: namespace, enum, property, method, field (68% → 88%)
- C++ extended: template class, enum, class methods, extracted via class body (69% → 80%)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Tree-sitter grammars are currently only wired up for TypeScript/JavaScript, so code_search returns no symbols for the ~40 other extensions already mapped in EXT_TO_LANGUAGE. This PR adds a regex-based fallback parser that activates when no tree-sitter grammar is registered for the detected language.
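
The selection order described above can be sketched roughly as follows. This is an illustration only, not the PR's code: `parseWithFallback`, the two `Map` registries, and the result shape are hypothetical names, and the extension table is a small sample of the mappings mentioned in this PR.

```typescript
// Hypothetical shapes for illustration; the PR's real types may differ.
interface ParsedSymbol { name: string; kind: string; line: number }
interface ParseResult { symbols: ParsedSymbol[]; via: "tree-sitter" | "regex" | "none" }
type Parser = (source: string) => ParsedSymbol[];

// Small sample of an extension table; the real EXT_TO_LANGUAGE is much larger.
const EXT_TO_LANGUAGE: Record<string, string> = {
  ".ts": "typescript",
  ".gd": "gdscript",
  ".gdshader": "glsl",
  ".glsl": "glsl",
};

// Two independent registries (assumed): one per parsing strategy.
const treeSitterParsers = new Map<string, Parser>();
const regexParsers = new Map<string, Parser>();

function parseWithFallback(fileName: string, source: string): ParseResult {
  const dot = fileName.lastIndexOf(".");
  const ext = dot >= 0 ? fileName.slice(dot).toLowerCase() : "";
  const language = EXT_TO_LANGUAGE[ext];
  if (language === undefined) return { symbols: [], via: "none" };

  // Tree-sitter stays primary whenever a grammar is registered.
  const primary = treeSitterParsers.get(language);
  if (primary) return { symbols: primary(source), via: "tree-sitter" };

  // Otherwise use the regex mapper, if one exists for the language.
  const fallback = regexParsers.get(language);
  if (fallback) return { symbols: fallback(source), via: "regex" };

  return { symbols: [], via: "none" };
}

// Toy regex parser for GDScript functions, registered as the fallback.
regexParsers.set("gdscript", (source) => {
  const out: ParsedSymbol[] = [];
  source.split(/\r?\n/).forEach((text, index) => {
    const name = /^\s*func\s+([A-Za-z_]\w*)/.exec(text)?.[1];
    if (name !== undefined) out.push({ name, kind: "function", line: index + 1 });
  });
  return out;
});
```

With only the toy regex parser registered, a `.gd` file goes through the regex path; registering a tree-sitter parser for `gdscript` later would take precedence without touching the fallback.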
What's new
- RegexLanguageMapper interface (parallel to the existing tree-sitter LanguageMapper) that operates directly on source text.
- createRegexMapper(opts) factory in src/lib/parsers/languages/regex-mapper.ts that builds mappers from { symbols, imports, docCommentLine } patterns.
- registerRegexLanguage() / getRegexMapper() / isRegexLanguageSupported() in the registry, parallel to the tree-sitter API.
- Built-in patterns for Python, Go, Rust, Ruby, Java, Kotlin, C, C++, C#, Swift, PHP, Lua, GDScript, GLSL, Dart, Shell, SQL, Scala, Elixir, Haskell.
- Extension mappings: .gd → gdscript, .gdshader/.gdshaderinc → glsl, plus .glsl/.vert/.frag/.geom/.tesc/.tese/.comp/.hlsl → glsl.
- parseCodeFile() falls back to the regex path when tree-sitter is unavailable for the language. Tree-sitter remains primary when registered.
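
To make the factory idea concrete, here is a minimal sketch of how a mapper could be built from `{ symbols, imports, docCommentLine }` patterns. It is not the code in regex-mapper.ts: `createRegexMapperSketch`, the option and result types, and the GDScript patterns are all assumptions made for illustration, and the sketch assumes capture group 1 holds the name and that patterns do not use the `g` flag.

```typescript
// Hypothetical option/result shapes, loosely following the PR description.
interface SymbolPattern { kind: string; regex: RegExp } // group 1 = symbol name
interface RegexMapperOptions {
  symbols: SymbolPattern[];
  imports: RegExp[];       // group 1 = imported path or module
  docCommentLine?: RegExp; // group 1 = comment text
}
interface ExtractedSymbol { name: string; kind: string; line: number; doc?: string }
interface RegexParseResult { symbols: ExtractedSymbol[]; imports: string[] }

function createRegexMapperSketch(opts: RegexMapperOptions) {
  return {
    parse(source: string): RegexParseResult {
      const symbols: ExtractedSymbol[] = [];
      const imports: string[] = [];
      const seen = new Set<string>(); // dedup key: kind + name
      let pendingDoc: string[] = [];  // doc-comment lines awaiting a symbol

      source.split(/\r?\n/).forEach((text, index) => {
        const docText = opts.docCommentLine?.exec(text)?.[1];
        if (docText !== undefined) {
          pendingDoc.push(docText.trim());
          return;
        }
        for (const pattern of opts.imports) {
          const target = pattern.exec(text)?.[1];
          if (target !== undefined && !imports.includes(target)) imports.push(target);
        }
        for (const { kind, regex } of opts.symbols) {
          const name = regex.exec(text)?.[1];
          if (name === undefined) continue; // no match, or pattern lacks a name group
          const key = `${kind}:${name}`;
          if (seen.has(key)) continue;
          seen.add(key);
          const symbol: ExtractedSymbol = { name, kind, line: index + 1 };
          if (pendingDoc.length > 0) symbol.doc = pendingDoc.join("\n");
          symbols.push(symbol);
          break;
        }
        pendingDoc = []; // doc comments only attach to the very next line
      });
      return { symbols, imports };
    },
  };
}

// Illustrative GDScript patterns (assumed, not the PR's built-in set).
const gdscriptSketch = createRegexMapperSketch({
  symbols: [
    { kind: "function", regex: /^\s*(?:static\s+)?func\s+([A-Za-z_]\w*)/ },
    { kind: "class", regex: /^\s*class_name\s+([A-Za-z_]\w*)/ },
  ],
  imports: [/\bpreload\(\s*"([^"]+)"\s*\)/],
  docCommentLine: /^\s*##\s?(.*)$/,
});
```

Keeping the per-language input as plain pattern data means a new language needs only a handful of regexes, at the cost of the accuracy limits discussed under Trade-offs.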
Trade-offs
The fallback is best-effort: line-anchored regex catches function/class/struct/enum/trait definitions and import statements. It does not extract inheritance edges (extends/implements) or class-member containment; those return empty arrays. Tree-sitter mappers are still preferred when available, so this is purely additive (no behaviour change for TS/JS).
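
One concrete way a line-anchored pattern can be less accurate than an AST: it cannot tell code from text inside a multi-line string. The sketch below uses a hypothetical Lua function pattern (`LUA_FUNCTION` and `findLuaFunctions` are illustrative names, not part of this PR) and reports a "function" that only exists inside a long string literal.

```typescript
// Illustrative pattern only; the PR's built-in Lua patterns may differ.
const LUA_FUNCTION = /^\s*(?:local\s+)?function\s+([A-Za-z_][\w.:]*)/;

function findLuaFunctions(source: string): { name: string; line: number }[] {
  const found: { name: string; line: number }[] = [];
  source.split(/\r?\n/).forEach((text, index) => {
    const name = LUA_FUNCTION.exec(text)?.[1];
    if (name !== undefined) found.push({ name, line: index + 1 });
  });
  return found;
}

// Line 2 sits inside a Lua long string ([[ ... ]]), so it is not real code.
const luaSample = [
  "local help = [[",
  "function not_real()",
  "]]",
  "function real()",
  "end",
].join("\n");

// The regex reports both lines: not_real (line 2) and real (line 4).
console.log(findLuaFunctions(luaSample));
```

For search purposes an occasional false positive like this is usually tolerable, which is the trade the fallback makes in exchange for covering languages without a grammar.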
Test plan
- createRegexMapper (line numbers, dedup, doc-comment attachment, empty source, missing-name groups, isExported flag).
- … local checkout, so npm test and npm run build haven't been run locally. Please run CI to verify.
Background
Started from a real-world hit: while indexing a Godot project (.gd/.gdshader), the files showed up in files_list but code_search returned nothing because no parser was registered for those languages.