Add generic sqlite provider for snomed, rxnorm, and loinc #112

Open
jmandel wants to merge 9 commits into HealthIntersections:main from jmandel:generic-sqlite-provider

jmandel commented Feb 13, 2026

PR Report: Generic SQLite Provider for SNOMED, LOINC, and RxNorm

Date: 2026-02-13
Branch: generic-sqlite-provider
Base: upstream/main

1. Purpose

This branch builds a single SQLite-backed terminology path for SNOMED, LOINC, and RxNorm, while keeping worker compatibility with existing server abstractions.

The intent is not to introduce a parallel server architecture. The intent is to make the existing worker pipeline operate against one generic, metadata-driven provider/runtime for major vocabularies.

2. Scope and Test Configuration

All current validation in this report uses:

  • Endpoint: /r4
  • Config: tx/fixtures/sample-all-sqlite-v0.yml
  • Official mini-run setup: tx/fixtures/test-cases-setup-all-sqlite-v0.json

Configured sources:

  • sqlite-v0!:sct_intl_20250201.v0.db
  • sqlite-v0:sct_us_20250301.v0.db
  • sqlite-v0:loinc_281_full.v0.db
  • sqlite-v0:rxnorm_02022026.v0.db

! marks the default source for that code system in multi-version setups.
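As an illustration of the source syntax above, here is a minimal sketch of parsing one entry into its type, default flag, and file name. The `SourceSpec` shape and `parseSource` name are hypothetical and are not taken from the branch.

```typescript
// Hypothetical parsed form of one configured source entry.
interface SourceSpec {
  type: string;       // e.g. "sqlite-v0"
  isDefault: boolean; // true when the type carries a trailing "!"
  file: string;       // e.g. "sct_intl_20250201.v0.db"
}

// Splits "<type>[!]:<file>" at the first colon.
function parseSource(entry: string): SourceSpec {
  const sep = entry.indexOf(":");
  if (sep <= 0 || sep === entry.length - 1) {
    throw new Error(`Malformed source entry: ${entry}`);
  }
  let type = entry.slice(0, sep);
  const isDefault = type.endsWith("!");
  if (isDefault) {
    type = type.slice(0, -1); // drop the "!" marker
  }
  return { type, isDefault, file: entry.slice(sep + 1) };
}
```

With the configured sources listed above, only the `sct_intl_20250201.v0.db` entry would parse with `isDefault` set.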

3. Architecture Implemented

3.1 Unified loader/runtime

  • One source type: sqlite-v0.
  • Loader no longer hard-branches by snomed/loinc/rxnorm.
  • Runtime behavior is read from cs_config metadata (runtime.* keys).
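To make the metadata-driven idea concrete, the sketch below folds `cs_config` rows into a nested runtime config. It assumes the rows arrive as key/value string pairs and that nesting is expressed with dots. The specific key names in the usage note are illustrative guesses, not the branch's actual keys.

```typescript
// Assumed row shape: one key/value pair per cs_config row.
type ConfigRow = { key: string; value: string };
type ConfigTree = { [name: string]: string | ConfigTree };

// Collects "runtime.a.b" = "x" rows into { a: { b: "x" } }; other keys are ignored.
function buildRuntimeConfig(rows: ConfigRow[]): ConfigTree {
  const prefix = "runtime.";
  const root: ConfigTree = {};
  for (const { key, value } of rows) {
    if (!key.startsWith(prefix)) continue;
    const parts = key.slice(prefix.length).split(".");
    const leaf = parts.pop();
    if (!leaf) continue; // skip a bare "runtime." key
    let node = root;
    for (const part of parts) {
      const next = node[part];
      if (typeof next === "object") {
        node = next;
      } else {
        const child: ConfigTree = {};
        node[part] = child; // replaces a scalar if two keys conflict
        node = child;
      }
    }
    node[leaf] = value;
  }
  return root;
}
```

For example, rows with the hypothetical keys `runtime.search.mode` and `runtime.hierarchy.fallbackRecursive` would yield a tree with `search` and `hierarchy` branches, which the loader could consult instead of branching on the vocabulary name.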

3.2 Metadata-driven behavior

Behavior is driven by metadata for:

  • version handling
  • hierarchy traversal and closure usage
  • search mode and FTS tables
  • property filters
  • designation handling and display policy
  • implicit ValueSet handling rules

Specialization is still possible, but only via metadata tags and a registry, not loader hardcoding.

Current state:

  • SNOMED and RxNorm run on generic runtime behavior.
  • LOINC keeps one minimal specialization for implicit /vs/* URL semantics.
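A tag-keyed registry along these lines could look like the sketch below. The `Specialization` interface, the tag name `loinc-implicit-vs`, and the function names are all hypothetical. The only external fact relied on is that LOINC implicit ValueSet URLs take the form `http://loinc.org/vs/<id>`.

```typescript
// Hypothetical hook: a specialization only overrides implicit ValueSet URL handling.
interface Specialization {
  resolveImplicitValueSet(url: string): string | null;
}

const specializations = new Map<string, Specialization>();

function registerSpecialization(tag: string, impl: Specialization): void {
  specializations.set(tag, impl);
}

// Generic behavior: no implicit ValueSet URLs are recognized.
const genericBehavior: Specialization = {
  resolveImplicitValueSet: () => null,
};

// Picks the first registered specialization named by the database's metadata tags.
function specializationFor(tags: string[]): Specialization {
  for (const tag of tags) {
    const impl = specializations.get(tag);
    if (impl) return impl;
  }
  return genericBehavior;
}

// Hypothetical tag for the LOINC /vs/* rule; returns the id after the prefix.
registerSpecialization("loinc-implicit-vs", {
  resolveImplicitValueSet(url: string): string | null {
    const prefix = "http://loinc.org/vs/";
    return url.startsWith(prefix) && url.length > prefix.length
      ? url.slice(prefix.length)
      : null;
  },
});
```

The point of the design is that the loader only ever calls `specializationFor(tags)`; adding a vocabulary-specific rule means registering a tag, not editing the loader.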

3.3 SQLite schema and indexing strategy

Unified schema includes (high level):

  • concept
  • designation
  • property_def
  • concept_literal
  • concept_link
  • closure
  • cs_config

Indexing/model choices:

  • full closure table precomputed during import
  • broad FTS over display/designation/literal text
  • targeted relational indexes for lookup/expand/validate joins
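For readers unfamiliar with closure tables, here is a minimal in-memory sketch of precomputing one from direct parent links. It is not the importer's code: the row shapes are assumptions, and this version emits only strict ancestor/descendant pairs (no self rows), which may differ from what the branch stores.

```typescript
type ParentEdge = { child: string; parent: string };
type ClosureRow = { ancestor: string; descendant: string };

// Walks upward from every child and emits one row per reachable ancestor.
function buildClosure(edges: ParentEdge[]): ClosureRow[] {
  const parents = new Map<string, string[]>();
  for (const { child, parent } of edges) {
    const list = parents.get(child) ?? [];
    list.push(parent);
    parents.set(child, list);
  }

  const rows: ClosureRow[] = [];
  for (const descendant of parents.keys()) {
    const seen = new Set<string>(); // guards against diamonds and cycles
    const stack = [...(parents.get(descendant) ?? [])];
    while (stack.length > 0) {
      const ancestor = stack.pop() as string;
      if (seen.has(ancestor)) continue;
      seen.add(ancestor);
      rows.push({ ancestor, descendant });
      stack.push(...(parents.get(ancestor) ?? []));
    }
  }
  return rows;
}
```

Once such rows are stored, a subsumption check becomes a single indexed lookup on the ancestor/descendant pair rather than a recursive walk at query time. That is the trade-off against the larger artifact size.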

3.4 Worker compatibility and optional capabilities

Existing worker contracts remain usable. We added optional capabilities that fall back safely:

  • filterPage(...) (batched filter path)
  • locateMany(...) (batched concept locate)

If unsupported by a provider, workers continue using legacy per-item paths.
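The fallback pattern can be sketched as follows. The `Provider` interface here is a simplified, synchronous stand-in: the real `locateMany(...)` signature is elided in this report, so the parameter and return types below are assumptions.

```typescript
interface Concept {
  code: string;
  display: string;
}

// Simplified provider contract; the optional method models the new capability.
interface Provider {
  locate(code: string): Concept | null;
  locateMany?(codes: string[]): Map<string, Concept>;
}

// Uses the batched path when present, otherwise the legacy per-item path.
function locateAll(provider: Provider, codes: string[]): Map<string, Concept> {
  if (typeof provider.locateMany === "function") {
    return provider.locateMany(codes);
  }
  const found = new Map<string, Concept>();
  for (const code of codes) {
    const concept = provider.locate(code);
    if (concept) found.set(code, concept);
  }
  return found;
}
```

Because the check is a feature test on the provider object, providers that predate the batching interfaces need no changes.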

4. Codebase Changes

4.1 Legacy removal

Removed the legacy terminology classes and legacy importer modules for SNOMED, LOINC, and RxNorm, so the SQLite runtime is the primary path on this branch.

4.2 Importer corrections and runtime hardening

Key fixes applied while validating against official and sampled traffic:

  • fixed searchFilter(...) argument order in expand worker path
  • fixed shared ValueSet/$validate-code crash path (messages propagation)
  • fixed RxNorm importer TTY collapse bug by preserving all RXCUI+TTY pairs
  • aligned SNOMED display derivation with mainline behavior expectations (first active designation order)
  • tightened hierarchy fallback defaults (fallbackRecursive=false)

5. Data Artifacts and Size

Current SQLite outputs in this branch:

| Vocabulary | File | Size (bytes) |
| --- | --- | --- |
| SNOMED INT 20250201 | sct_intl_20250201.v0.db | 929,325,056 |
| SNOMED US 20250301 | sct_us_20250301.v0.db | 941,453,312 |
| LOINC 2.81 | loinc_281_full.v0.db | 855,674,880 |
| RxNorm 02022026 | rxnorm_02022026.v0.db | 314,200,064 |

Available baseline artifacts in mainline cache path (FHIRsmith/data/terminology-cache) for comparable versions:

| Vocabulary | Mainline artifact | Size (bytes) | Branch artifact | Size (bytes) |
| --- | --- | --- | --- | --- |
| SNOMED INT 20250201 | sct_intl_20250201.cache | 861,602,379 | sct_intl_20250201.v0.db | 929,325,056 |
| LOINC 2.81 | loinc-2.81.db | 542,756,864 | loinc_281_full.v0.db | 855,674,880 |
| RxNorm 02022026 | rxnorm_02022026.db | 214,675,456 | rxnorm_02022026.v0.db | 314,200,064 |

Notes:

  • US SNOMED 20250301 does not have a direct same-version baseline artifact in the mainline cache directory.
  • Branch artifacts include full closure + broad FTS surfaces by default, which increases size.

6. Correctness Results

6.1 Official terminology mini-runner (R4)

Artifact: captured/official-term-mini-results-r4.all-sqlitev0-20260213-prrefresh.json

  • total: 54
  • raw: 42 pass / 12 fail
  • xfail: 10
  • effective: 52 pass / 2 fail

The 2 non-xfail failures are both SNOMED tests pinned to xsct 20250814, which is outside the configured loaded versions.
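The relationship between the raw and effective counts can be sketched as below, assuming that "xfail" marks a test as an expected failure and that only failing xfail tests are forgiven. The `TestOutcome` shape is hypothetical; the mini-runner's actual result format is not shown in this report.

```typescript
type TestOutcome = { id: string; passed: boolean; expectedFail: boolean };

interface RunSummary {
  total: number;
  rawPass: number;
  rawFail: number;
  xfail: number;
  effectivePass: number;
  effectiveFail: number;
}

// Expected failures move from the fail column to the pass column.
function summarizeRun(outcomes: TestOutcome[]): RunSummary {
  const rawPass = outcomes.filter((o) => o.passed).length;
  const rawFail = outcomes.length - rawPass;
  const xfail = outcomes.filter((o) => !o.passed && o.expectedFail).length;
  return {
    total: outcomes.length,
    rawPass,
    rawFail,
    xfail,
    effectivePass: rawPass + xfail,
    effectiveFail: rawFail - xfail,
  };
}
```

Under that reading, the figures above are consistent: 42 raw passes plus 10 expected failures give 52 effective passes, and 12 raw failures minus 10 leave 2.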

6.2 Sampled replay (R4-focused sampled NDJSON)

Artifacts:

  • captured/snomed-replay-allsqlite-v0-20260213-prrefresh.json
  • captured/loinc-replay-allsqlite-v0-20260213-prrefresh.json
  • captured/rxnorm-replay-allsqlite-v0-20260213-prrefresh.json

Results:

| Vocabulary | Total | Intended Match | Intended Fail |
| --- | --- | --- | --- |
| SNOMED | 180 | 165 | 15 |
| LOINC | 180 | 170 | 10 |
| RxNorm | 180 | 163 | 17 |

Mismatch classification (latest classified artifacts):

  • SNOMED: external CTS ValueSets not loaded (6), invalid displayLanguage=english (3), prod/dev disagreement (2), captured-body defects (2), other-needs-triage (2)
  • LOINC: invalid displayLanguage=english (5), prod/dev disagreement (4), captured-body defects (1)
  • RxNorm: captured-body defects (9), external CTS ValueSets not loaded (5), 422 VALUESET_TOO_COSTLY replacing sampled 500 (3)

Classified artifacts:

  • captured/snomed-replay-allsqlite-v0-20260213-prrefresh.classified.json
  • captured/loinc-replay-allsqlite-v0-20260213-prrefresh.classified.json
  • captured/rxnorm-replay-allsqlite-v0-20260213-prrefresh.classified.json

6.3 Concrete behavior improvement example

RxNorm query:

  • operation: ValueSet/$expand
  • filter: property=TTY, op="=", value=SBD
  • text filter: tylenol

Observed:

  • mainline legacy path previously produced 500 Invalid search filter
  • this branch returns 200 with expansion.total=13 (active Tylenol SBD concepts)

This comes from:

  • worker filter argument-order fix
  • importer fix preserving all TTY literals per RXCUI (no TTY collapse)
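To show why collapsing TTY values breaks a `TTY = SBD` filter, the sketch below contrasts an index that keeps one TTY per RXCUI with one that keeps every RXCUI+TTY pair. This is one plausible shape of the bug, not the importer's actual code. The row type and the sample RXCUIs in the usage note are made up.

```typescript
type ConsoRow = { rxcui: string; tty: string };

// Collapsing variant: a later row for the same RXCUI overwrites the earlier TTY.
function ttyCollapsed(rows: ConsoRow[]): Map<string, string> {
  const byCui = new Map<string, string>();
  for (const row of rows) byCui.set(row.rxcui, row.tty);
  return byCui;
}

// Preserving variant: every distinct RXCUI+TTY pair is kept.
function ttyPreserved(rows: ConsoRow[]): Map<string, Set<string>> {
  const byCui = new Map<string, Set<string>>();
  for (const row of rows) {
    const ttys = byCui.get(row.rxcui) ?? new Set<string>();
    ttys.add(row.tty);
    byCui.set(row.rxcui, ttys);
  }
  return byCui;
}

// Property filter over the preserving index.
function codesWithTty(index: Map<string, Set<string>>, tty: string): string[] {
  return [...index].filter(([, ttys]) => ttys.has(tty)).map(([cui]) => cui);
}
```

If a made-up RXCUI "100" has rows with TTY `SBD` and then `SY`, the collapsed index records only `SY` and the `SBD` filter misses it, while the preserving index still matches.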

7. Performance Results

Performance artifacts used:

  • captured/perf-snomed-main-vs-generic-20260213h.json
  • captured/perf-loinc-main-vs-generic-20260213h.json
  • captured/perf-rxnorm-main-vs-generic-20260213h.json

Run shape:

  • sampled NDJSON per vocabulary (180 requests each)
  • repeats 6, warmup 1
  • expansion cache on and off
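For reference, the summary statistics reported below (p50, p95, mean, max) can be computed from per-request timings as in this sketch. It uses the nearest-rank percentile definition, which is an assumption: the perf script may interpolate or aggregate across repeats differently.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

interface LatencySummary {
  p50: number;
  p95: number;
  mean: number;
  max: number;
}

function summarizeLatency(samples: number[]): LatencySummary {
  const total = samples.reduce((sum, x) => sum + x, 0);
  return {
    p50: percentile(samples, 50),
    p95: percentile(samples, 95),
    mean: total / samples.length,
    max: Math.max(...samples),
  };
}
```

Reporting mean and max alongside the percentiles matters for this comparison, since a handful of very slow requests can dominate the mean while leaving p50 and p95 almost untouched.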

7.1 Overall timings (ms)

| Vocabulary | Cache | Main p50 | Main p95 | Main mean | Main max | Branch p50 | Branch p95 | Branch mean | Branch max | Branch faster queries |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SNOMED | on | 1.504 | 5.424 | 6.687 | 770.464 | 2.079 | 8.383 | 3.294 | 47.523 | 41/180 |
| SNOMED | off | 1.289 | 4.810 | 7.853 | 1071.855 | 1.777 | 7.381 | 2.882 | 48.263 | 38/180 |
| LOINC | on | 3.139 | 45.741 | 7.733 | 143.219 | 1.292 | 4.415 | 2.885 | 64.279 | 177/180 |
| LOINC | off | 1.960 | 28.817 | 5.087 | 125.875 | 1.451 | 7.029 | 3.322 | 103.865 | 135/180 |
| RxNorm | on | 0.985 | 2.265 | 1.174 | 7.061 | 1.189 | 6.597 | 2.101 | 53.907 | 40/180 |
| RxNorm | off | 0.733 | 1.403 | 0.847 | 5.656 | 0.860 | 3.592 | 1.383 | 13.948 | 32/180 |

7.2 Operation-level uncached p50 median delta (branch - main)

| Vocabulary | Operation | N | Delta (ms) |
| --- | --- | --- | --- |
| SNOMED | ValueSet/$validate-code | 84 | +0.338 |
| SNOMED | CodeSystem/$validate-code | 73 | +0.348 |
| SNOMED | ValueSet/$expand | 18 | +0.189 |
| SNOMED | ValueSet/$batch-validate-code | 4 | +3.596 |
| LOINC | ValueSet/$validate-code | 100 | -0.426 |
| LOINC | CodeSystem/$validate-code | 66 | -0.478 |
| LOINC | ValueSet/$expand | 12 | +0.679 |
| RxNorm | CodeSystem/$validate-code | 154 | +0.134 |
| RxNorm | ValueSet/$validate-code | 14 | -0.035 |
| RxNorm | ValueSet/$expand | 12 | +9.237 |

Interpretation:

  • LOINC is broadly faster in sampled p50/p95, with remaining hot spots concentrated in _incomplete/large expand paths.
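  • SNOMED is near parity but still slower at p50 and p95 in both cache modes (p50 2.079 vs 1.504 cached, 1.777 vs 1.289 uncached); its mean and max are lower than mainline's.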
  • RxNorm validate paths are near parity; _incomplete expand remains the main lag pattern.

8. Non-DB Pipeline Findings and Fixes

This effort also surfaced and fixed issues outside pure DB schema/import concerns:

  • worker filter call-order bug affecting valid filter queries
  • shared validate worker crash path
  • optional batching interfaces (filterPage, locateMany) with strict fallback behavior
  • request-scope provider memoization to reduce repeated provider setup work

These are generic pipeline improvements and not tied to one terminology.
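As an illustration of request-scope memoization, the sketch below creates one cache per request and builds each provider at most once within it. The class name, the string cache key, and the factory signature are hypothetical, not the branch's actual code.

```typescript
// One instance per incoming request; discarded when the request completes.
class RequestProviderCache<P> {
  private readonly providers = new Map<string, P>();
  private readonly create: (key: string) => P;

  constructor(create: (key: string) => P) {
    this.create = create;
  }

  // Returns the provider for a key, building it only on first use in this request.
  get(key: string): P {
    const existing = this.providers.get(key);
    if (existing !== undefined) return existing;
    const created = this.create(key);
    this.providers.set(key, created);
    return created;
  }
}
```

A natural key would combine system URL and version (for example `http://snomed.info/sct|<version>`), so that a batch validation touching the same code system many times pays the setup cost once. Scoping the cache to the request avoids having to invalidate it when sources change.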

9. Trade-offs and Remaining Gaps

  • Database artifacts are larger than the available mainline cache artifacts for the equivalent SNOMED INT, LOINC, and RxNorm files (roughly 8%, 58%, and 46% larger, respectively).
  • Two official SNOMED mini-runner failures remain because the focused config intentionally omits xsct 20250814.
  • Some sampled mismatches remain attributable to captured-request defects and external ValueSets not present in local scope.
  • Tail latency remains for a small set of large _incomplete expands.

10. Repro Commands

Official mini-runner:

bun scripts/official-terminology-mini-runner.ts \
  --path /r4 \
  --setup tx/fixtures/test-cases-setup-all-sqlite-v0.json \
  --out captured/official-term-mini-results-r4.all-sqlitev0-latest.json

Sample replay:

node scripts/replay-sampled-terminology.js \
  --input /home/jmandel/hobby/FHIRsmith/captured/snomed.ndjson \
  --out captured/snomed-replay-allsqlite-v0-latest.json \
  --path /r4 \
  --library tx/fixtures/sample-all-sqlite-v0.yml \
  --intended-source prod

Perf comparison:

node scripts/perf-sampled-main-vs-convergence.js \
  --input /home/jmandel/hobby/FHIRsmith/captured/snomed.ndjson \
  --out captured/perf-snomed-main-vs-generic-latest.json \
  --repeats 6 \
  --warmup 1 \
  --port-base 9720 \
  --main-root /home/jmandel/hobby/FHIRsmith-main \
  --conv-root /home/jmandel/hobby/FHIRsmith-tx-mainline-convergence \
  --main-library tx/fixtures/test-cases.yml \
  --conv-library tx/fixtures/sample-all-sqlite-v0.yml \
  --endpoint-path /r4 \
  --fhir-version 4.0 \
  --expansion-cache both

11. Summary

The branch demonstrates that SNOMED, LOINC, and RxNorm can run through one generic SQLite provider/runtime with metadata-driven behavior and minimal specialization, while staying compatible with existing worker abstractions.

Results are mixed but clear:

  • strong LOINC performance gains on sampled traffic
  • near-parity SNOMED/RxNorm in many paths with specific lag clusters still present
  • concrete correctness improvements on previously failing RxNorm filter behavior
  • explicit remaining gaps documented and reproducible

jmandel changed the title from "Add generic sqlite provider for snomed, rxnorm, and loin" to "cAdd generic sqlite provider for snomed, rxnorm, and loin" on Feb 13, 2026
jmandel changed the title from "cAdd generic sqlite provider for snomed, rxnorm, and loin" to "Add generic sqlite provider for snomed, rxnorm, and loinc" on Feb 13, 2026