feat(objectscript): InterSystems IRIS ObjectScript language support#467

Open
isc-tdyar wants to merge 1 commit into DeusData:main from isc-tdyar:pr/objectscript-language-support

What does this PR do?

Adds ObjectScript (InterSystems IRIS / Caché) as a supported language, per the discussion in #462.

ObjectScript powers large healthcare, finance, and enterprise systems and has no support in CBM (or most code-graph tools). This PR makes those codebases indexable and resolves the call-dispatch patterns that are structurally invisible to text search.

Refs #462

⚠️ Read first — grammar is a separate dependency

Per your note on #462, this PR contains only the CBM source changes — not the vendored grammar. The two tree-sitter grammars come from intersystems/tree-sitter-objectscript (MIT licensed, maintained by the language vendor): objectscript_udl (.cls) and objectscript_routine (.mac/.inc/.rtn/.int).

Consequence: the build will not link until the grammar is vendored at internal/cbm/vendored/grammars/objectscript_udl/ and …/objectscript_routine/ — it fails on the missing tree_sitter_objectscript_udl() / _routine() symbols. CI will be red until then. That is intentional, matching your plan to audit and vendor the grammar independently. The grammar shims (grammar_objectscript_*.c) that declare those factories are included — only the generated parser.c/scanner.c are omitted.

What's in the PR

Definition extraction — Class, Method, ClassMethod, Property, Parameter, Index, Trigger (with body text), XData, Storage, Query members → graph nodes; base classes from the Extends clause.

Four call-dispatch patterns (all resolved statically at index time):

| Pattern | Example | Notes |
| --- | --- | --- |
| Explicit cross-class | ##class(Pkg.Class).Method() | from AST |
| Relative-dot self-call | ..Method() | the dominant intra-class form; large impact on CALLS completeness |
| Macro expansion | $$$Macro | resolved via a per-project table built from .inc files |
| Type inference | Set x = ##class(P).%New() then x.Save() | from %New/%OpenId + declared return types |
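To make the macro-expansion row concrete, here is a minimal sketch of a per-project macro table: it records simple `#define NAME body` lines from an .inc file and resolves a `$$$NAME` reference against them. The type and function names (`macro_table`, `macro_table_add_line`, `macro_table_resolve`) and the fixed-size limits are assumptions for illustration, not the PR's actual data structures, and parameterized macros and line continuations are deliberately skipped.

```c
#include <ctype.h>
#include <string.h>

#define MAX_MACROS 64
#define MAX_NAME   64
#define MAX_BODY   256

/* Hypothetical table layout; the PR's real table is not shown. */
typedef struct {
    char name[MAX_NAME];
    char body[MAX_BODY];
} macro_entry;

typedef struct {
    macro_entry entries[MAX_MACROS];
    size_t count;
} macro_table;

/* Record one .inc line if it has the form "#define NAME body".
 * Returns 1 when an entry was added, 0 otherwise. */
int macro_table_add_line(macro_table *t, const char *line) {
    while (*line == ' ' || *line == '\t') line++;
    if (strncmp(line, "#define", 7) != 0 || !isspace((unsigned char)line[7]))
        return 0;
    line += 7;
    while (*line == ' ' || *line == '\t') line++;

    /* Macro name: letters, digits, and '%' in this sketch. */
    size_t n = 0;
    while (line[n] && (isalnum((unsigned char)line[n]) || line[n] == '%')) n++;
    if (n == 0 || n >= MAX_NAME || t->count >= MAX_MACROS) return 0;
    if (line[n] == '(') return 0; /* parameterized macro: out of scope here */

    macro_entry *e = &t->entries[t->count];
    memcpy(e->name, line, n);
    e->name[n] = '\0';

    /* Body: the rest of the line, truncated if it exceeds MAX_BODY - 1. */
    line += n;
    while (*line == ' ' || *line == '\t') line++;
    size_t len = strcspn(line, "\r\n");
    if (len >= MAX_BODY) len = MAX_BODY - 1;
    memcpy(e->body, line, len);
    e->body[len] = '\0';

    t->count++;
    return 1;
}

/* Resolve a "$$$Name" reference; returns the body, or NULL if unknown. */
const char *macro_table_resolve(const macro_table *t, const char *ref) {
    if (strncmp(ref, "$$$", 3) != 0) return NULL;
    ref += 3;
    for (size_t i = 0; i < t->count; i++) {
        if (strcmp(t->entries[i].name, ref) == 0) return t->entries[i].body;
    }
    return NULL;
}
```

With a table like this, a macro whose body is itself a `##class(...)` call can be expanded first and then handled by the explicit cross-class pattern.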

Ensemble production topology (pass_ensemble_routing.c) — EnsembleItem nodes per production component and ROUTES_TO edges resolved from ProductionDefinition XData; plus WorkMgr .Queue("##class(X).method") dispatch. All static — no live IRIS instance required.
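As an illustration of the WorkMgr case, the string passed to .Queue() has to be split into a class name and a method name before an edge can be emitted. The sketch below shows one way to do that; `parse_class_method_target` is a hypothetical helper, not a function from this PR, and it assumes the lowercase `##class(` spelling used in the example above.

```c
#include <string.h>

/* Split "##class(Pkg.Class).Method" into class and method names.
 * Returns 1 on success, 0 if the text is malformed or a buffer is too small.
 * Hypothetical helper for illustration only. */
int parse_class_method_target(const char *s,
                              char *cls, size_t cls_cap,
                              char *method, size_t method_cap) {
    static const char prefix[] = "##class(";
    const size_t prefix_len = sizeof(prefix) - 1;
    if (strncmp(s, prefix, prefix_len) != 0) return 0;
    s += prefix_len;

    /* The class name runs up to ')' and must be followed by '.'. */
    const char *close = strchr(s, ')');
    if (!close || close == s || close[1] != '.') return 0;

    size_t cls_len = (size_t)(close - s);
    const char *m = close + 2;
    size_t m_len = strcspn(m, "("); /* tolerate a trailing argument list */
    if (m_len == 0 || cls_len >= cls_cap || m_len >= method_cap) return 0;

    memcpy(cls, s, cls_len);
    cls[cls_len] = '\0';
    memcpy(method, m, m_len);
    method[m_len] = '\0';
    return 1;
}
```

Returning 0 on anything unexpected keeps the sketch conservative: a target that cannot be parsed simply produces no edge.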

Two design points for your review

  1. Public API unchanged. ObjectScript needs two per-project tables (a $$$macro table and a method-return-type table) that single-file extraction can't build alone. Rather than widen the public cbm_extract_file() signature (which would ripple NULL, NULL through every call site), I added an internal cbm_extract_file_ex() that carries the tables; cbm_extract_file() is a thin wrapper that delegates with NULL, NULL. Only the pipeline passes that build the tables call _ex.
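The wrapper arrangement described in point 1 can be sketched as follows. Only the two function names come from this PR; the parameter lists, the `cbm_result` struct, and the two table types are placeholders, since the real signatures are not shown here.

```c
#include <stddef.h>

/* Placeholder types: the real table and result types are not shown in the PR. */
typedef struct { int placeholder; } cbm_macro_table;   /* $$$macro table */
typedef struct { int placeholder; } cbm_return_types;  /* method return types */

typedef struct {
    int used_macro_table;   /* 1 if a macro table was supplied */
    int used_return_types;  /* 1 if a return-type table was supplied */
} cbm_result;

/* Internal entry point: carries the two optional per-project tables.
 * The body is a stub that only records which tables were passed. */
cbm_result cbm_extract_file_ex(const char *source, size_t len,
                               const cbm_macro_table *macros,
                               const cbm_return_types *rtypes) {
    cbm_result r;
    (void)source;
    (void)len;
    r.used_macro_table = (macros != NULL);
    r.used_return_types = (rtypes != NULL);
    return r;
}

/* Public entry point: same (assumed) signature as before, delegating
 * with no tables so existing call sites need no changes. */
cbm_result cbm_extract_file(const char *source, size_t len) {
    return cbm_extract_file_ex(source, len, NULL, NULL);
}
```

Existing callers keep using `cbm_extract_file()`; only a pipeline pass that has built the tables would call the `_ex` variant with non-NULL pointers.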

  2. Extension collisions with Apex and BitBake (both added since I started this work). .mac/.int/.rtn map to ObjectScript routine directly. The two collisions are resolved by content sniffing, following the existing cbm_disambiguate_m() (.m MATLAB-vs-ObjC) pattern, and default to the existing language on any doubt so neither Apex nor BitBake regresses:

    • .cls (vs Apex): a Class <Uppercase…> header line → ObjectScript UDL, else Apex. Edge case: a .cls whose Class line sits beyond the first 4 KB (e.g. a very large license banner) would fall through to Apex.
    • .inc (vs BitBake): a ROUTINE <Uppercase> header or an ObjectScript preprocessor directive (#define / #def1arg / #;) → ObjectScript routine, else BitBake. (#define and #def1arg never collide with BitBake syntax, where # only starts a comment.)

    These are the spots most likely to need your input — happy to adjust the heuristics or the generalization however you prefer.
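For reviewers weighing the heuristics, the two sniffing rules above can be approximated as below. This is a sketch under stated assumptions, not the code in language.c: the function and enum names are hypothetical, matches are only tried at the start of a line, and the 4 KB window is applied to .inc as well, which the description above states only for .cls.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define SNIFF_LIMIT 4096 /* only the first 4 KB are inspected */

typedef enum {
    LANG_APEX,
    LANG_BITBAKE,
    LANG_OBJECTSCRIPT_UDL,
    LANG_OBJECTSCRIPT_ROUTINE
} lang_guess;

typedef int (*line_pred)(const char *p, const char *end);

/* True when [p, end) is `kw`, then blanks, then an uppercase letter. */
static int keyword_then_upper(const char *p, const char *end, const char *kw) {
    size_t n = strlen(kw);
    if ((size_t)(end - p) <= n || memcmp(p, kw, n) != 0) return 0;
    p += n;
    if (*p != ' ' && *p != '\t') return 0;
    while (p < end && (*p == ' ' || *p == '\t')) p++;
    return p < end && isupper((unsigned char)*p);
}

/* True when [p, end) starts with directive `d` followed by a blank. */
static int is_directive(const char *p, const char *end, const char *d) {
    size_t n = strlen(d);
    return (size_t)(end - p) > n && memcmp(p, d, n) == 0 &&
           (p[n] == ' ' || p[n] == '\t');
}

static int is_udl_class_line(const char *p, const char *end) {
    return keyword_then_upper(p, end, "Class");
}

static int is_routine_line(const char *p, const char *end) {
    return keyword_then_upper(p, end, "ROUTINE") ||
           is_directive(p, end, "#define") ||
           is_directive(p, end, "#def1arg") ||
           ((end - p) >= 2 && p[0] == '#' && p[1] == ';');
}

/* Apply `pred` to each line within the sniff window. */
static int any_line_matches(const char *buf, size_t len, line_pred pred) {
    if (len > SNIFF_LIMIT) len = SNIFF_LIMIT;
    const char *p = buf, *end = buf + len;
    while (p < end) {
        const char *eol = memchr(p, '\n', (size_t)(end - p));
        if (!eol) eol = end;
        if (pred(p, eol)) return 1;
        p = (eol < end) ? eol + 1 : end;
    }
    return 0;
}

/* .cls: ObjectScript UDL only on positive evidence, otherwise Apex. */
lang_guess sniff_cls(const char *buf, size_t len) {
    return any_line_matches(buf, len, is_udl_class_line)
               ? LANG_OBJECTSCRIPT_UDL : LANG_APEX;
}

/* .inc: ObjectScript routine only on positive evidence, otherwise BitBake. */
lang_guess sniff_inc(const char *buf, size_t len) {
    return any_line_matches(buf, len, is_routine_line)
               ? LANG_OBJECTSCRIPT_ROUTINE : LANG_BITBAKE;
}
```

Both functions fall back to the pre-existing language unless a positive match is found, which mirrors the "default to the existing language on any doubt" rule, including the edge case where the Class line sits past the 4 KB window.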

EnsembleItem label (your Q2 on #462)

I used a domain-specific EnsembleItem node label and ROUTES_TO edge type. If you'd prefer a generic label (ServiceComponent / WorkflowNode) with ensemble_item as a property, I'm glad to rename — just let me know before merge.

Tests

tests/test_extraction.c gains the ObjectScript suite: UDL class/method extraction, all four dispatch patterns, Ensemble topology parsing, macro expansion, trigger body text, and Export-XML transcoding. The Export-XML transcoder tests are grammar-independent and pass today; the grammar-dependent tests pass once the grammar is vendored. No other test files are touched.

Scope / roadmap

This PR is the foundation. If it's well received, two separate follow-up PRs would complete the story (each with its own issue): (a) cross-version version_tag + diff_versions, and (b) ObjectScript-tuned semantic embeddings. They're deliberately excluded here to keep this reviewable.

Checklist

  • Every commit is signed off (git commit -s) — DCO
  • Tests pass locally (make -f Makefile.cbm test) — passes once the grammar is vendored; see note above
  • Lint passes (make -f Makefile.cbm lint-ci)
  • New behavior is covered by tests

Add ObjectScript (InterSystems IRIS / Caché) as a supported language,
covering the UDL class format (.cls), MAC/INT routines (.mac/.int/.rtn),
include/macro files (.inc), and IRIS Studio Export XML.

Definition extraction (extract_defs.c): Class, Method, ClassMethod,
Property, Parameter, Index, Trigger (with body text), XData, Storage,
and Query members as graph nodes; base classes from the Extends clause.

Call dispatch resolution (extract_calls.c) — four ObjectScript patterns
that are structurally invisible to text search:
  1. ##class(Pkg.Class).Method()    explicit cross-class call
  2. ..Method()                     relative-dot self-call (the dominant
                                    intra-class form; large impact on
                                    CALLS completeness)
  3. $$$Macro                       macro expansion via a per-project
                                    table built from .inc files
  4. type inference from %New/%OpenId + declared return types

Ensemble production topology (pass_ensemble_routing.c): EnsembleItem
nodes per production component and ROUTES_TO edges resolved from
ProductionDefinition XData, plus WorkMgr .Queue("##class(X).method")
dispatch — all parsed statically at index time, no live IRIS required.

Language detection (language.c): .mac/.int/.rtn map to ObjectScript
routine directly; .cls (shared with Apex) and .inc (shared with BitBake)
are disambiguated by content, defaulting to the existing language on any
doubt so neither Apex nor BitBake detection regresses.

The two new per-project tables (macros, return types) are threaded
through a new internal cbm_extract_file_ex() so the public
cbm_extract_file() signature is unchanged.

The tree-sitter grammars are NOT vendored in this PR; they are a
dependency to be vendored separately from
https://github.com/intersystems/tree-sitter-objectscript (MIT).
The build will not link until the grammar is present.

Refs DeusData#462

Signed-off-by: Thomas Dyar <tdyar@intersystems.com>

isc-tdyar force-pushed the pr/objectscript-language-support branch from 593b161 to 578d36f on June 14, 2026 16:37