In this plan (foundation, v0.1.0-rc1) the engine is unmodified from the
pinned upstream commit recorded in UPSTREAM.md. The only deltas vs upstream
are:
- Upstream's `.github/workflows` have been removed (we run our own in the monorepo root's `.github/workflows/`).
- Upstream's `.git/` directory is absent (flat copy, see `UPSTREAM.md`).

Plan 2 (bio-languages) will add per-language patches for R, Julia, MATLAB, Perl — each documented as a numbered entry below when it lands.
- WASM grammar: `engine/src/extraction/wasm/tree-sitter-r.wasm` from r-lib/tree-sitter-r v1.2.0
- Extension map: `.R`, `.r` → `r`
- Extractor: `engine/src/extraction/languages/r.ts`
- Tests: added `R Extraction` describe block in `engine/__tests__/extraction.test.ts`
- Notes: R functions are anonymous expressions assigned to identifiers via `binary_operator` (`<-`, `=`, `->>`). The extractor uses a `visitNode` hook to intercept `binary_operator` nodes: rhs = `function_definition` → function node; plain rhs → variable node. No class support (S3/S4/R5/R6 are runtime constructs). `library()`/`require()` are extracted as plain function calls.
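
The function-versus-variable rule in the notes above can be sketched roughly as follows. This is only an illustration, not the code in `r.ts`: `NodeLike`, `n`, `RSymbol`, and `classifyRAssignment` are hypothetical names, the node shape is a simplified stand-in for a tree-sitter node, and only the left-assignment operators are handled (right-assignment forms would need the operands swapped).

```typescript
// Minimal stand-in for a tree-sitter syntax node (illustrative only; the
// engine's real node type is not shown in this document).
interface NodeLike {
  type: string;
  text: string;
  children: NodeLike[];
}

// Tiny builder so the sketch can run without loading a WASM grammar.
const n = (type: string, text: string, children: NodeLike[] = []): NodeLike => ({
  type,
  text,
  children,
});

type RSymbol = { kind: "function" | "variable"; name: string };

// Left-assignment operators only (assumption made to keep the sketch short).
const LEFT_ASSIGN = new Set(["<-", "="]);

// Hypothetical helper: decide what a `binary_operator` node contributes.
function classifyRAssignment(node: NodeLike): RSymbol | null {
  if (node.type !== "binary_operator" || node.children.length !== 3) return null;
  const [lhs, op, rhs] = node.children;
  if (!LEFT_ASSIGN.has(op.text) || lhs.type !== "identifier") return null;
  // rhs = function_definition -> function node; anything else -> variable node.
  const kind = rhs.type === "function_definition" ? "function" : "variable";
  return { kind, name: lhs.text };
}

// Hand-built tree for `square <- function(x) x^2`.
const square = n("binary_operator", "square <- function(x) x^2", [
  n("identifier", "square"),
  n("<-", "<-"),
  n("function_definition", "function(x) x^2"),
]);
console.log(classifyRAssignment(square)); // { kind: 'function', name: 'square' }
```

Returning `null` for anything that is not an identifier assignment lets a caller fall through to default traversal for ordinary binary expressions such as `a + b`.
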
- WASM grammar: `engine/src/extraction/wasm/tree-sitter-julia.wasm` from tree-sitter/tree-sitter-julia v0.25.0
- Extension map: `.jl` → `julia`
- Extractor: `engine/src/extraction/languages/julia.ts`
- Tests: added `Julia Extraction` describe block in `engine/__tests__/extraction.test.ts`
- Notes: Julia's tree-sitter grammar uses no named fields (all `childForFieldName` calls return null). The extractor uses `visitNode` throughout. Key AST quirks:
  - `function_definition`: children by index: `[0]`=`function`, `[1]`=`signature(call_expression(identifier, argument_list))`, `[2]`=`block`, `[3]`=`end`
  - Short-form functions `f() = expr` are parsed as `assignment` with a `call_expression` on the LHS (not a dedicated `short_function_definition` node)
  - `const_statement` wraps an inner `assignment` node
  - `macro_definition` has the same structure as `function_definition`
  - Call edges work via the standard `call_expression` in `callTypes`; `visitFunctionBody` recursively finds them
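
Because there are no named fields, name lookup has to walk children by position. A minimal sketch of that idea, assuming the child layout listed above; `NodeLike`, `n`, and `juliaFunctionName` are hypothetical names for illustration and are not taken from `julia.ts`:

```typescript
// Minimal stand-in for a tree-sitter syntax node (illustrative only).
interface NodeLike {
  type: string;
  text: string;
  children: NodeLike[];
}

// Tiny builder so the sketch can run without loading a WASM grammar.
const n = (type: string, text: string, children: NodeLike[] = []): NodeLike => ({
  type,
  text,
  children,
});

// Hypothetical helper: find the declared name for long-form functions,
// macros, and short-form `f() = expr` assignments, using child indices only.
function juliaFunctionName(node: NodeLike): string | null {
  let call: NodeLike | undefined;
  if (node.type === "function_definition" || node.type === "macro_definition") {
    // children: [0]=function keyword, [1]=signature, [2]=block, [3]=end
    const signature = node.children[1];
    call = signature?.children[0]; // signature wraps a call_expression
  } else if (node.type === "assignment") {
    // Short form: the LHS is itself a call_expression.
    call = node.children[0];
  }
  if (!call || call.type !== "call_expression") return null;
  const callee = call.children[0];
  return callee && callee.type === "identifier" ? callee.text : null;
}

// Hand-built tree for `function greet(name) ... end`.
const greetCall = n("call_expression", "greet(name)", [
  n("identifier", "greet"),
  n("argument_list", "(name)"),
]);
const longForm = n("function_definition", "function greet(name) end", [
  n("function", "function"),
  n("signature", "greet(name)", [greetCall]),
  n("block", ""),
  n("end", "end"),
]);
console.log(juliaFunctionName(longForm)); // greet
```

An `assignment` whose left-hand side is a plain identifier returns `null` here, which is what would let a caller treat it as a variable rather than a function.
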
- WASM grammar: `engine/src/extraction/wasm/tree-sitter-perl.wasm` from tree-sitter-perl/tree-sitter-perl v1.0.2
- Extension map: `.pl`, `.pm`, `.t` → `perl`
- Extractor: `engine/src/extraction/languages/perl.ts`
- Tests: added `Perl Extraction` describe block in `engine/__tests__/extraction.test.ts` (10 tests)
- Notes: Perl grammar (tree-sitter-perl v1.0.2) key AST facts confirmed empirically:
  - `subroutine_declaration_statement`: namedChild[0]=`bareword` (name), namedChild[1]=`block` (body); no named fields
  - `function_call_expression`: child[0] has type `function` (callee); no named fields
  - `ambiguous_function_call_expression`: like `function_call_expression` but wraps e.g. `print foo()`; child[0] type=`function`
  - `method_call_expression`: `$obj->method()` or `Class->method()`; child[2] type=`method` (method name)
  - `use_statement`: namedChild[0] is a `package` node holding the module name
  - `require_expression`: namedChild[0] is a `bareword` holding the module name
  - `package_statement`: namedChild[0] is a `package` node holding the package name (re-uses `package` type for both the keyword and the identifier)
  - The grammar does NOT use named fields (`childForFieldName` returns null); all extraction uses `namedChild(i)` or `child(i)` by index
  - `resolveName` hook is used to extract callee names from all three call node types
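
The callee lookup for the three call node types might look something like the sketch below. It assumes the child positions listed above; `NodeLike`, `n`, and `resolvePerlCallee` are hypothetical names, and the real `resolveName` hook's signature is not shown in this document.

```typescript
// Minimal stand-in for a tree-sitter syntax node (illustrative only).
interface NodeLike {
  type: string;
  text: string;
  children: NodeLike[];
}

// Tiny builder so the sketch can run without loading a WASM grammar.
const n = (type: string, text: string, children: NodeLike[] = []): NodeLike => ({
  type,
  text,
  children,
});

// Hypothetical helper: pick the callee name by child index, since the
// grammar exposes no named fields.
function resolvePerlCallee(node: NodeLike): string | null {
  switch (node.type) {
    case "function_call_expression":
    case "ambiguous_function_call_expression": {
      // child[0] carries the callee and has type `function`.
      const callee = node.children[0];
      return callee && callee.type === "function" ? callee.text : null;
    }
    case "method_call_expression": {
      // children: [0]=invocant, [1]=`->`, [2]=method name (type `method`).
      const method = node.children[2];
      return method && method.type === "method" ? method.text : null;
    }
    default:
      return null;
  }
}

// Hand-built tree for `$obj->run()`.
const methodCall = n("method_call_expression", "$obj->run()", [
  n("scalar", "$obj"),
  n("->", "->"),
  n("method", "run"),
]);
console.log(resolvePerlCallee(methodCall)); // run
```

Checking the child's type as well as its position keeps the helper from returning an unrelated token if a future grammar release reorders children.
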
- WASM grammar: `engine/src/extraction/wasm/tree-sitter-matlab.wasm` built locally from acristoffers/tree-sitter-matlab (upstream ships only Python wheels)
- Extension map: `.m` is shared with Objective-C; disambiguated by `detectLanguage(filePath, content)` content heuristic that checks for ObjC markers (`@interface`, `@implementation`, `#import`, `#include`) in the first 4 KB. `EXTENSION_MAP` still maps `.m` → `objc` as the default; the heuristic overrides to `matlab` only when no ObjC markers are found.
- Extractor: `engine/src/extraction/languages/matlab.ts`
- Tests: extraction + disambiguation tests in `engine/__tests__/extraction.test.ts` (`MATLAB Extraction` describe block; 14 tests total)
- Notes: MATLAB grammar (acristoffers/tree-sitter-matlab) key AST facts confirmed empirically:
  - `function_definition`: field `'name'` → function identifier; `function_output` optional named child for return values; `function_arguments` named child for params; `block` named child for body
  - Three function forms: `function greet()`, `function result = hello(name)`, `function [a,b] = swap(x,y)`; all produce `function_definition` nodes with the same structure
  - `function_call`: field `'name'` → callee identifier; used for call edges via `callTypes: ['function_call']`
  - `assignment`: fields `'left'` and `'right'`; top-level identifier-lhs assignments → variable nodes
  - Grammar does NOT use `call_expression` (unlike most other languages); uses `function_call` instead
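
The `.m` disambiguation described under the extension map can be approximated as below. This is a sketch under stated assumptions rather than the engine's `detectLanguage`: `sniffDotM`, `OBJC_MARKERS`, and `SNIFF_LIMIT` are hypothetical names, the limit is applied to characters rather than bytes, and files with other extensions simply return `null`.

```typescript
// Markers named in the extension-map entry above.
const OBJC_MARKERS = ["@interface", "@implementation", "#import", "#include"];

// "First 4 KB" is approximated here as the first 4096 characters (assumption).
const SNIFF_LIMIT = 4096;

// Hypothetical helper: decide between Objective-C and MATLAB for a `.m` file.
// Returns null for other extensions so a caller can fall back to its own map.
function sniffDotM(filePath: string, content: string): "objc" | "matlab" | null {
  if (!filePath.endsWith(".m")) return null;
  const head = content.slice(0, SNIFF_LIMIT);
  const looksLikeObjC = OBJC_MARKERS.some((marker) => head.includes(marker));
  // MATLAB is chosen only when no ObjC marker appears in the sniffed prefix.
  return looksLikeObjC ? "objc" : "matlab";
}

console.log(sniffDotM("swap.m", "function [a,b] = swap(x,y)\n  a = y; b = x;\nend\n")); // matlab
console.log(sniffDotM("AppDelegate.m", "#import <UIKit/UIKit.h>\n@implementation AppDelegate\n")); // objc
```

One consequence of a prefix-only check is that a marker appearing after the limit is not seen, so a file with a very long leading comment would be classified as MATLAB in this sketch.
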