diff --git a/CLAUDE.md b/CLAUDE.md index f637dd6a..5d01bd54 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -81,6 +81,8 @@ Defined in `src/types.ts`. Both extractors and resolvers must use these exact st - `instructions-template.ts` no longer holds an instructions body — it exports only the ``/`` markers. The installer **stopped writing** a `## CodeGraph` block into each agent's instructions file (`CLAUDE.md` / `~/.codex/AGENTS.md` / `~/.config/opencode/AGENTS.md` / `~/.gemini/GEMINI.md` / `.cursor/rules/codegraph.mdc` / Kiro steering doc) because it duplicated the MCP `initialize` instructions verbatim (issue #529). Each target's `install` (self-heal on upgrade) and `uninstall` use the markers to **strip** a block a previous install left behind. `server-instructions.ts` is the single source of truth for agent-facing guidance. - All installer changes need matching coverage in `__tests__/installer-targets.test.ts` — there are ~47 parameterized contract tests covering install idempotency, sibling preservation, uninstall reverses install, byte-equal re-runs returning `unchanged`, and partial-state recovery for Codex. +To add a new language, follow the cookbook at [`docs/ADDING-A-LANGUAGE.md`](docs/ADDING-A-LANGUAGE.md). + ### Cursor MCP working-directory quirk Cursor launches MCP subprocesses with the wrong cwd and doesn't pass `rootUri` in `initialize`. The installer injects `--path` into Cursor's MCP args — absolute path for local installs, `${workspaceFolder}` for global installs. If you touch Cursor wiring, preserve this. @@ -263,3 +265,4 @@ publish actions on shared state. Write the files, hand the user the commands. - The **last main commit** — `git log --first-parent main -1 --format='%ai %h %s'`. A comment after the last release but before a fix on main may already be addressed there but unreleased. - The **current branch's tip** — your own unmerged work obviously can't be what the comment is reacting to. 
Always disambiguate "released," "merged-but-unreleased," and "in-progress" before agreeing that a user-reported problem is unfixed (or that a fix is incomplete). A user saying "your fix only covers X" about a recent PR is usually pointing at the *released* shortcomings — your in-flight branch may already address them but they have no way to know that. +- For contributor-facing guidance (PR workflow, commit conventions), see `CONTRIBUTING.md`. diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 00000000..07f86a71 --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,248 @@ +# Contributing to CodeGraph + +Thanks for your interest in contributing! This guide covers everything you need to +get started. + +## Table of Contents + +- [Code of Conduct](#code-of-conduct) +- [Getting Started](#getting-started) +- [Development Setup](#development-setup) +- [Project Architecture](#project-architecture) +- [Making Changes](#making-changes) +- [Adding a New Language](#adding-a-new-language) +- [Testing](#testing) +- [Commit Messages](#commit-messages) +- [Pull Requests](#pull-requests) +- [Reporting Issues](#reporting-issues) + +--- + +## Code of Conduct + +Be respectful and constructive. We're all here to build something useful. + +## Getting Started + +### Prerequisites + +- **Node.js** >= 20.0.0, < 25.0.0 (Node 25.x has a V8 WASM JIT bug — see + [#81](https://github.com/colbymchenry/codegraph/issues/81)) +- **npm** (ships with Node) +- **Git** + +### Fork and Clone + +```bash +# Fork via GitHub UI, then: +git clone https://github.com//codegraph.git +cd codegraph +git remote add upstream https://github.com/colbymchenry/codegraph.git +``` + +### Install and Verify + +```bash +npm install +npm run build +npm test +``` + +If all tests pass, you're ready to go. 
+ +## Development Setup + +### Useful Commands + +| Command | What it does | +|---|---| +| `npm run build` | Compile TypeScript + copy assets into `dist/` | +| `npm run dev` | Watch mode — rebuilds on file change | +| `npm run clean` | Remove `dist/` | +| `npm test` | Run the full test suite (vitest) | +| `npm run test:watch` | Run tests in watch mode | +| `npm run cli` | Build then run the local CLI binary | + +### Running a Single Test + +```bash +npx vitest run __tests__/extraction.test.ts -t "Groovy" +``` + +### Project Structure + +``` +src/ +├── index.ts # Public API (CodeGraph class) +├── types.ts # Core type definitions (NodeKind, EdgeKind, Language) +├── db/ # SQLite database layer (schema, queries) +├── extraction/ # Tree-sitter parsing and per-language extractors +│ ├── languages/ # One file per language (python.ts, go.ts, ...) +│ ├── tree-sitter.ts # Core extraction engine +│ └── wasm/ # Vendored grammar .wasm files +├── resolution/ # Cross-file reference resolution +│ └── frameworks/ # Framework-specific resolvers (Express, Django, ...) +├── graph/ # Graph traversal (BFS/DFS, impact radius) +├── context/ # Context building for AI consumption +├── mcp/ # MCP server implementation +├── installer/ # Multi-agent installer (Claude, Cursor, Codex, opencode) +├── search/ # FTS5 query parsing +├── sync/ # File watcher and git hooks +└── bin/ # CLI entry point +__tests__/ # Tests mirror the module they cover +docs/ # Design docs, benchmarks, cookbooks +``` + +## Making Changes + +### Branching + +Always branch off `upstream/main`: + +```bash +git checkout -b feat/my-feature upstream/main +``` + +Use descriptive branch names: `feat/`, `fix/`, `docs/`, `refactor/`. + +### What to Edit + +- **Types** (`src/types.ts`): `NodeKind`, `EdgeKind`, and `Language` are + runtime-iterable `const` arrays. Changing them affects the entire pipeline. +- **Extractors** (`src/extraction/languages/`): Each language has its own file + exporting a `LanguageExtractor` config. 
See + [Adding a New Language](#adding-a-new-language). +- **MCP tools** (`src/mcp/`): Changes to tool behavior require updating + `server-instructions.ts`, the single source of truth for agent-facing guidance. + `instructions-template.ts` exports only the block markers, and the installer no + longer writes a guidance block into `.cursor/rules/codegraph.mdc`. +- **Installer** (`src/installer/`): Adding a new agent target is one new file in + `targets/` + one entry in `registry.ts`. All changes need test coverage in + `__tests__/installer-targets.test.ts`. + +### Build Verification + +Before committing, always run: + +```bash +npm run build && npm test +``` + +TypeScript strict mode is fully enabled — the compiler catches a lot. The build +also copies `.wasm` files and `schema.sql` into `dist/`; if you add new assets, +make sure `copy-assets` picks them up. + +## Adding a New Language + +There's a dedicated cookbook for this — see +[`docs/ADDING-A-LANGUAGE.md`](docs/ADDING-A-LANGUAGE.md). It walks through: + +1. Sourcing a tree-sitter `.wasm` grammar +2. Probing the AST before writing code +3. Registering the language (one new file + two registry lines) +4. Writing the extractor (two patterns: `LanguageExtractor` config vs custom class) +5. Testing and PR checklist + +## Testing + +### Philosophy + +Tests use **real files** and **real SQLite** — there is no DB mocking. Each test +creates a temp directory with `fs.mkdtempSync` and cleans up in `afterAll`/`afterEach`. 
+ +### Running Tests + +```bash +npm test # full suite +npx vitest run __tests__/extraction.test.ts # single file +npx vitest run -t "TypeScript" # filter by test name +``` + +### Writing Tests + +- Place tests in `__tests__/` mirroring the module they cover +- Use `extractFromSource(filename, code)` for extraction unit tests +- Use `it.runIf(process.platform === 'win32')(...)` for platform-gated tests +- Clean up temp dirs in `afterEach`/`afterAll` +- Don't skip or mock the database — integration tests must hit real SQLite + +### Evaluation Tests + +The `__tests__/evaluation/` directory has retrieval quality benchmarks. Run +with: + +```bash +npm run eval # builds first, then runs the evaluation runner +``` + +These are not part of the standard test suite and are run separately. + +## Commit Messages + +We follow [Conventional Commits](https://www.conventionalcommits.org/): + +``` +type(scope): description +``` + +**Types:** `feat`, `fix`, `docs`, `chore`, `refactor`, `test`, `perf` + +**Common scopes:** `extraction`, `resolution`, `mcp`, `cli`, `installer`, +`watcher`, `npm`, `release` + +**Examples:** + +``` +feat(extraction): add Groovy language support +fix(resolution): stream node-kind scans in synthesis to fix OOM +docs(readme): link to the website & docs site +chore: update vitest to v2.1.9 +``` + +- Use the imperative mood ("add" not "added") +- Keep the subject line under 72 characters +- Reference issues with `Closes #123` or `Relates to #123` in the body + +## Pull Requests + +### Before Opening + +1. Rebase on the latest `upstream/main` +2. Run `npm run build && npm test` — everything must pass +3. Run `npx tsc --noEmit` — no type errors +4. 
Keep changes focused — one concern per PR + +### PR Description + +Include: + +- **What** changed and **why** +- Test plan (what you ran, what passed) +- Any known limitations or follow-up work +- For language additions: grammar source, version, license, and sha256 if + vendored (see [docs/ADDING-A-LANGUAGE.md](docs/ADDING-A-LANGUAGE.md) §8) + +### Review Process + +- Maintainers may request changes — please respond to feedback +- Keep PRs up to date with `upstream/main` via rebase (not merge commits) +- Squash commits if the history is messy + +## Reporting Issues + +When filing a bug report, please include: + +- CodeGraph version (`codegraph --version` or `npx @colbymchenry/codegraph --version`) +- Node.js version (`node --version`) +- Operating system and version +- Steps to reproduce +- Expected vs actual behavior +- Relevant log output or error messages + +For feature requests, describe the use case and why existing functionality +doesn't cover it. + +## License + +By contributing, you agree that your contributions will be licensed under the +[MIT License](LICENSE). diff --git a/README.md b/README.md index 1a9800ee..405b7bc0 100644 --- a/README.md +++ b/README.md @@ -628,6 +628,8 @@ is written): | Lua | `.lua` | Full support (functions, methods with receivers, local variables, `require` imports, call edges) | | Luau | `.luau` | Full support (everything in Lua, plus `type`/`export type` aliases, typed signatures, and Roblox instance-path `require`) | +Want to add another language? See [`docs/ADDING-A-LANGUAGE.md`](docs/ADDING-A-LANGUAGE.md) — it walks through sourcing a tree-sitter grammar, probing the AST, choosing between the OO and self-contained extractor patterns, and the worked examples in the existing extractors. + ## Troubleshooting **"CodeGraph not initialized"** — Run `codegraph init` in your project directory first. @@ -653,6 +655,12 @@ is written): +## Contributing + +Contributions are welcome! 
See [`CONTRIBUTING.md`](CONTRIBUTING.md) for development setup, testing conventions, and PR guidelines. + +Want to add a new language? See [`docs/ADDING-A-LANGUAGE.md`](docs/ADDING-A-LANGUAGE.md) — it walks through sourcing a tree-sitter grammar, probing the AST, choosing an extractor pattern, and the worked examples in the existing extractors. + ## License MIT diff --git a/docs/ADDING-A-LANGUAGE.md b/docs/ADDING-A-LANGUAGE.md new file mode 100644 index 00000000..89119a95 --- /dev/null +++ b/docs/ADDING-A-LANGUAGE.md @@ -0,0 +1,503 @@ +# Adding a Language + +This is a cookbook for adding a new language to CodeGraph. It assumes you have a +working dev setup (`npm install` and `npm test` pass). + +There are two patterns. **Pick the one that matches the language you're adding.** + +| Language shape | Pattern | Examples | +|---|---|---| +| Procedural / OO with named functions, classes, methods | **`LanguageExtractor` config** | `python.ts`, `ruby.ts`, `r.ts` | +| Declarative / template / configuration / no named functions | **Custom extractor class** | `hcl-extractor.ts`, `liquid-extractor.ts`, `sql-extractor.ts` | + +The two patterns share the same setup steps (1–4) and only diverge at the extractor +itself (step 5). + +--- + +## 1. Source a tree-sitter wasm grammar + +CodeGraph parses everything via [`web-tree-sitter`](https://www.npmjs.com/package/web-tree-sitter), +so the grammar has to be available as a `.wasm` file. Three options, in order of +preference: + +### 1a. Already in `tree-sitter-wasms` + +The [`tree-sitter-wasms`](https://www.npmjs.com/package/tree-sitter-wasms) npm package +ships pre-built wasms for 30+ common languages. Check `node_modules/tree-sitter-wasms/out/` +after a fresh install: + +```bash +ls node_modules/tree-sitter-wasms/out/ | grep +``` + +If your grammar is there, you're done with this step — just reference the filename. + +### 1b. A pre-built `.wasm` released somewhere else + +Many grammars publish wasms in their GitHub releases (e.g. 
r-lib/tree-sitter-r) or +in a separate npm package (e.g. `@tree-sitter-grammars/tree-sitter-hcl` ships +`tree-sitter-hcl.wasm` directly in the tarball). + +```bash +# GitHub release +curl -sL -o src/extraction/wasm/tree-sitter-foo.wasm \ + https://github.com/.../releases/download/vX.Y.Z/tree-sitter-foo.wasm + +# Inside an npm tarball +mkdir -p /tmp/foo && cd /tmp/foo +curl -sL https://registry.npmjs.org/tree-sitter-foo/-/tree-sitter-foo-X.Y.Z.tgz | tar xz +cp package/tree-sitter-foo.wasm /src/extraction/wasm/ +``` + +Verify the sha256 against the upstream release manifest before committing. + +### 1c. Build from source + +If only the C source is published (e.g. DerekStride/tree-sitter-sql), build the wasm +locally with `tree-sitter-cli`. Recent versions ship their own wasi-sdk and don't need +Docker or local emcc: + +```bash +mkdir /tmp/foo && cd /tmp/foo +curl -sL https://github.com/.../releases/download/vX.Y.Z/tree-sitter-foo.tar.gz | tar xz +npx --yes tree-sitter-cli@latest build --wasm +cp tree-sitter-foo.wasm /src/extraction/wasm/ +``` + +### Where the wasm lives + +- Grammars from the `tree-sitter-wasms` package are loaded directly from there at runtime. +- Other grammars must be **vendored** under `src/extraction/wasm/` so they ship in the + npm package. The build's `copy-assets` script copies every `.wasm` from that + directory into `dist/extraction/wasm/`. + +**License check.** Tree-sitter grammars are usually MIT or Apache-2.0 — confirm before +committing the wasm and note the source/version in the file's header comment so the +provenance is recoverable later. + +--- + +## 2. Probe the AST + +Don't guess at node types. 
Parse a representative sample and dump the tree: + +```js +// scratch/probe.mjs +import { Parser, Language } from 'web-tree-sitter'; +await Parser.init(); +const lang = await Language.load('./src/extraction/wasm/tree-sitter-foo.wasm'); +const parser = new Parser(); +parser.setLanguage(lang); + +const sample = ` +// realistic code here — cover every construct you plan to extract +`; + +const tree = parser.parse(sample); +function dump(n, d = 0, max = 4) { + if (d > max) return; + const text = n.text.length > 60 ? n.text.slice(0, 60).replace(/\n/g, '\\n') + '...' : n.text.replace(/\n/g, '\\n'); + console.log(`${' '.repeat(d)}${n.type} "${text}"`); + for (let i = 0; i < n.namedChildCount; i++) dump(n.namedChild(i), d + 1, max); +} +dump(tree.rootNode); +``` + +```bash +node scratch/probe.mjs +``` + +Cover every construct you plan to extract: function definitions, classes, methods, +imports, assignments, calls, references. Watch for surprises: + +- Some grammars wrap names in extra layers (`identifier > simple_identifier`) +- Field names (`childForFieldName`) often differ from what the docs imply +- Operator nodes can be named, unnamed, or both — call `child(i)` vs `namedChild(i)` + and inspect + +Save the probe output before you start coding — you'll refer to it constantly. + +--- + +## 3. Register the language + +Adding a language is **one new file plus two registry lines**. The per-language +registry (`src/extraction/languages/`) is the single source of truth — extension +maps, include globs, grammar config, and the EXTRACTORS lookup are all derived +from it. + +**Step 3a — Create `src/extraction/languages/foo.ts`** with a `LanguageDef`: + +```ts +import type { LanguageDef } from './types'; +import type { LanguageExtractor } from '../tree-sitter-types'; + +// Path A languages (procedural / OO — Python, Ruby, R) define a +// LanguageExtractor here and reference it from the def below. 
+export const fooExtractor: LanguageExtractor = { + functionTypes: ['function_definition'], + classTypes: ['class_definition'], + // ... see Section 5a for the full shape +}; + +export const FOO_DEF: LanguageDef = { + name: 'foo', + displayName: 'Foo', + extensions: ['.foo'], + includeGlobs: ['**/*.foo'], + grammar: { + wasmFile: 'tree-sitter-foo.wasm', + vendored: true, // omit if the wasm lives in `tree-sitter-wasms` + extractor: fooExtractor, + }, + // For Path B languages (HCL / SQL / Liquid — non-OO), set + // customExtractor instead of (or in addition to) `extractor`: + // customExtractor: (filePath, source) => new FooExtractor(filePath, source).extract(), +}; +``` + +**Step 3b — Register in `src/extraction/languages/registry.ts`** (2 lines): + +```ts +import { FOO_DEF } from './foo'; // alphabetical +// ... +const ALL_DEFS: readonly LanguageDef[] = [ + // ... existing definitions, alphabetical + FOO_DEF, + // ... +]; +``` + +**Step 3c — Add `'foo'` to the `Language` union in `src/types.ts`** (1 line): + +```ts +export type Language = + | 'typescript' + | ... + | 'foo' // ← add here + | 'unknown'; +``` + +That's it. `DEFAULT_CONFIG.include`, `EXTENSION_MAP`, the `EXTRACTORS` lookup, +and `getLanguageDisplayName()` are all derived from the registry — no parallel +lists to keep in sync. + +The `Language` union update is the only spot that touches a shared file. New +languages registered only via the registry (without a `Language` union entry) +also work at runtime — the union is mostly for TypeScript narrowing in +language-specific resolution code. + +> **Why per-file?** Two PRs adding two different languages used to collide on +> the same `EXTRACTORS` map, the same `EXTENSION_MAP`, the same `Language` +> union, and the same `WASM_GRAMMAR_FILES` table. With per-file `LanguageDef`s, +> two language PRs only conflict if their alphabetical positions in `registry.ts` +> happen to land on the same line — almost never. 
See `src/extraction/languages/` +> for ~20 worked examples. + +**`CLAUDE.md`** — append the language to the "Supported Languages" line so the +LLM-readable architecture doc stays in sync. + +--- + +## 4. Type-check before writing the extractor + +Run `npx tsc --noEmit` now. If it's not clean, the wiring is wrong — fix that +before adding extraction logic, otherwise type errors will pile up. + +--- + +## 5a. Path A — Plug into `LanguageExtractor` + +Use this when the language has named function/class/method declarations (Python, Ruby, +Java, R, etc.). Create `src/extraction/languages/.ts`: + +```ts +import type { LanguageExtractor } from '../tree-sitter-types'; + +export const fooExtractor: LanguageExtractor = { + // Map AST node types → graph kinds. Empty array = "this kind doesn't + // exist in this language." + functionTypes: ['function_definition'], + classTypes: ['class_definition'], + methodTypes: ['function_definition'], // often the same node, dispatched by context + interfaceTypes: [], + structTypes: [], + enumTypes: [], + typeAliasTypes: [], + importTypes: ['import_statement'], + callTypes: ['call'], + variableTypes: ['assignment'], + + // Field names tree-sitter exposes for extractors to read. + nameField: 'name', + bodyField: 'body', + paramsField: 'parameters', + returnField: 'return_type', + + // Optional hooks — implement what you need: + getSignature: (node, source) => { ... }, + isExported: (node, source) => { ... }, + isAsync: (node) => { ... }, + + // Escape hatch: take over a specific node type entirely. Return true to + // tell the core "I handled this, skip default dispatch." + visitNode: (node, ctx) => { + // R uses this to handle `name <- function() {}` because tree-sitter's + // function_definition has no name field — the name is on the LHS of + // the enclosing assignment. 
+ return false; + }, +}; +``` + +Reference it from your `LanguageDef` (Section 3a): + +```ts +// in src/extraction/languages/foo.ts +export const FOO_DEF: LanguageDef = { + name: 'foo', + // ... + grammar: { wasmFile: 'tree-sitter-foo.wasm', vendored: true, extractor: fooExtractor }, +}; +``` + +The core (`TreeSitterExtractor` in `src/extraction/tree-sitter.ts`) does the rest: +walks the AST, dispatches based on your `*Types` arrays, calls your hooks, manages +the scope stack, and emits nodes/edges. + +**Worked example: R** (`src/extraction/languages/r.ts`). R's `function_definition` +has no name (it's anonymous), so `functionTypes` is empty and the `visitNode` hook +intercepts `binary_operator` assignments and emits the function manually via +`ctx.createNode('function', name, ...)`. + +## 5b. Path B — Custom extractor class + +Use this when the language is declarative (HCL, SQL, dbt) or has a fundamentally +different shape than functions/classes/methods (Liquid templates, Pascal `.dfm` form +files). Create `src/extraction/-extractor.ts`: + +```ts +import { Node, Edge, ExtractionResult, ExtractionError, UnresolvedReference } from '../types'; +import { generateNodeId, getNodeText } from './tree-sitter-helpers'; +import { getParser } from './grammars'; + +export class FooExtractor { + private filePath: string; + private source: string; + private nodes: Node[] = []; + private edges: Edge[] = []; + private unresolvedReferences: UnresolvedReference[] = []; + private errors: ExtractionError[] = []; + + constructor(filePath: string, source: string) { + this.filePath = filePath; + this.source = source; + } + + extract(): ExtractionResult { + const startTime = Date.now(); + const parser = getParser('foo'); + if (!parser) { + this.errors.push({ message: 'foo grammar not loaded', severity: 'error', code: 'grammar_unavailable' }); + return this.result(startTime); + } + const tree = parser.parse(this.source); + if (!tree) { ... 
return this.result(startTime); } + + try { + const fileNodeId = this.createFileNode(); + // Walk the AST, emit nodes via this.nodes.push and this.edges.push + // Emit references via this.unresolvedReferences.push so the resolver + // pass can match them across files. + ... + return this.result(startTime); + } finally { + tree.delete(); // ← important: tree-sitter trees back onto WASM memory + } + } + + private result(startTime: number): ExtractionResult { + return { + nodes: this.nodes, + edges: this.edges, + unresolvedReferences: this.unresolvedReferences, + errors: this.errors, + durationMs: Date.now() - startTime, + }; + } +} +``` + +Wire the dispatch via `customExtractor` in your `LanguageDef` (Section 3a): + +```ts +// in src/extraction/languages/foo.ts +import { FooExtractor } from '../foo-extractor'; +import type { LanguageDef } from './types'; + +export const FOO_DEF: LanguageDef = { + name: 'foo', + displayName: 'Foo', + extensions: ['.foo'], + includeGlobs: ['**/*.foo'], + // For languages that need a tree-sitter parser AND a custom extractor + // (HCL, SQL): set both `grammar` and `customExtractor`. The grammar + // entry only registers the wasm so the parser is available; the + // customExtractor takes the dispatch. + grammar: { wasmFile: 'tree-sitter-foo.wasm', vendored: true, extractor: { /* skeleton */ } }, + customExtractor: (filePath, source) => new FooExtractor(filePath, source).extract(), +}; +``` + +The dispatch in `src/extraction/tree-sitter.ts` reads `customExtractor` off +the language def — no per-language `if` branches to maintain. + +**Worked examples:** + +- `src/extraction/hcl-extractor.ts` — Terraform / HCL. Block-based DDL. Each + top-level block becomes a node whose qualified name matches the Terraform + reference form (`var.X`, `local.X`, `module.X`, `aws_s3_bucket.foo`) so the + resolver can match references across files automatically. +- `src/extraction/sql-extractor.ts` — SQL DDL. 
CREATE TABLE / VIEW / FUNCTION / + TRIGGER / TYPE / SCHEMA → graph nodes; foreign keys, view source tables, + trigger target tables and executed functions → edges. +- `src/extraction/liquid-extractor.ts` — Shopify Liquid templates. Regex-based + (no tree-sitter) since the template grammar isn't useful for code intelligence. + +--- + +## 6. Pick `NodeKind` and `EdgeKind` values + +`NodeKind` and `EdgeKind` are fixed unions in `src/types.ts`. Map your language's +constructs onto the closest existing kind rather than introducing new ones — +adding a new kind is a cross-cutting change that touches search, resolution, and +context-building code. + +Common mappings used by recent extractors: + +| Language construct | NodeKind | +|---|---| +| Function / procedure / standalone routine | `function` | +| Method on a class | `method` | +| Class / type / table / declarative resource | `class` | +| Trait / mixin | `trait` | +| Interface / protocol | `interface` | +| Module / package / file-level scope / Terraform module | `module` | +| Namespace / schema / SQL schema / Terraform provider | `namespace` | +| Variable / Terraform variable | `variable` | +| Constant / Terraform local / R top-level binding | `constant` | +| Type alias / SQL composite type | `type_alias` | +| Enum (any) | `enum` | +| Import / library / source / require | `import` | +| Output / re-export / Terraform output | `export` | + +Edges are usually one of: + +| Edge | When | +|---|---| +| `contains` | Parent contains child (file → block, class → method) | +| `calls` | Function/method invokes another | +| `imports` | File pulls in another module/file | +| `references` | Generic mention of another symbol (FK, lookup, attribute access) | +| `extends` / `implements` | Inheritance relationships | + +Emit references through `unresolvedReferences` (with `referenceName` set to a +qualified name that matches what you put on the target node's `qualifiedName`) — +the resolver pass matches them across files using the 
`name-matcher` and +`import-resolver` modules. + +--- + +## 7. Tests + +Tests live in `__tests__/extraction.test.ts`, grouped by language with a +`describe(' Extraction', ...)` block. Use `extractFromSource` directly +for unit-style tests: + +```ts +import { extractFromSource } from '../src/extraction'; + +describe('Foo Extraction', () => { + describe('Language detection', () => { + it('should detect Foo files', () => { + expect(detectLanguage('main.foo')).toBe('foo'); + }); + }); + + describe('Function extraction', () => { + it('should extract a top-level function', () => { + const code = `function add(a, b) a + b`; + const result = extractFromSource('main.foo', code); + const fn = result.nodes.find((n) => n.kind === 'function' && n.name === 'add'); + expect(fn).toBeDefined(); + }); + }); +}); +``` + +Cover the AST shapes you saw in the probe, especially the surprising ones. Pay +particular attention to: + +- The smallest possible valid program (`expect(...).toBeDefined()` for the file node) +- Each node-kind mapping (one test per emitted kind) +- Reference forms (call edges, FK / cross-file references, imports) +- Anything you intentionally skipped (anonymous lambdas, dynamic imports, etc.) + with a negative assertion so the omission is documented + +Run the suite serialized to avoid the file-watcher tests' parallel flakiness: + +```bash +npx vitest run --no-file-parallelism +``` + +End-to-end smoke test from a fresh fixture before opening the PR: + +```bash +SMOKE=$(mktemp -d) && cat > "$SMOKE/main.foo" <<'EOF' +... realistic input ... +EOF +cd "$SMOKE" && git init -q +node /dist/bin/codegraph.js init "$SMOKE" +node /dist/bin/codegraph.js index "$SMOKE" +node /dist/bin/codegraph.js status "$SMOKE" +cd "$SMOKE" && node /dist/bin/codegraph.js query "" +``` + +The `status` call should report your file under "Files by Language", and `query` +should turn up the symbols you expect at the right line numbers. + +--- + +## 8. 
Open the PR + +Include in the PR description: + +- The grammar source + version + license + sha256 (if vendored) +- A small worked example showing what gets extracted +- The full test plan (`npm test`, `tsc`, `npm run build`, CLI smoke) +- Any known limitations (constructs not supported, AST quirks, things the grammar + itself can't parse) + +Don't claim support for constructs the grammar can't actually parse — this happens +more often than you'd expect (e.g. `tree-sitter-sql` errors out on `CREATE +PROCEDURE` because procedure-body syntax varies sharply across dialects). Say what +works, say what doesn't, and let reviewers decide. + +--- + +## Reference: existing extractors as templates + +Read these in source order if your language is similar to one of them: + +- **Procedural / OO:** `src/extraction/languages/python.ts` (small, easy to read), + `ruby.ts` (with bare-call detection), `kotlin.ts` (extension functions), + `r.ts` (no `def` keyword — uses `visitNode` hook for assignments) +- **Declarative / config:** `src/extraction/hcl-extractor.ts` (Terraform reference + graph), `sql-extractor.ts` (DDL with FK / view source extraction) +- **Embedded / template:** `src/extraction/svelte-extractor.ts` (delegates to JS + for `