Automated cross-publisher standards index built and maintained by Steve LLamb
MSRBot.io is a live, automated (and hand-curated) Media Standards Registry (MSR) of media technology documents. It extracts, validates, and links documents across SMPTE, ISO, ITU, AES, and many other publishers, SDOs, and industry groups.
MSRBot.io began in 2020 as a response to a long-standing gap in how the media and entertainment industry tracks its own standards, best practices, specifications, and other important documents and publications, along with the references they contain. Regular maintenance of these critically important documents required understanding their tangled dependency trees, because nested references are sometimes circular and often cross organizations.
Documents from SMPTE, ISO, ITU, AES and others have always been interconnected — yet their references lived scattered across the internet as generated or scanned PDFs, HTML pages, TXT files, sometimes hidden behind paywalls, or trapped in inconsistent formats. MSRBot.io was built to solve that: an open, automated registry that maps those relationships, extracts structured metadata, and preserves a living history of the standards ecosystem.
What started as a personal tool to make sense of reference trees has grown into a self-maintaining system that reveals the lineage, dependencies, and context of the world’s media technology documents.
See docs/buildlog.md for details of v1.0.0 released on Nov 26, 2025.
All badges are generated from live JSON at api/stats.json. Explore the full API at msrbot.io/api/.
- Published historical range: 1896 → present
- Automation uptime: 100% since August 2025 (SMPTE)
- Publishers covered: SMPTE, NIST, ISO, ITU, AES, and more
- Core data stored as JSON: `src/main/data`
- Schema for data: `src/main/schemas`
- Main document dataset: `documents.json`
- Document lineages: Master Suite Index (MSI)
- Document reference maps: Master Reference Index (MRI)
- API Explorer: msrbot.io/api/
- Live API Stats: api/stats.json
- JSON Schema: api/schemas/documents.schema.json
- Public site generated from `main` at https://msrbot.io
- Change Log: msrbot.io/changelog/ (source)
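As a rough illustration of consuming the live stats listed above, the sketch below requests `api/stats.json` and returns the parsed payload. The base URL, the `loadStats` name, and the injectable `fetchImpl` parameter are assumptions made for this example; the payload shape is not described here, so the function returns it untouched.

```javascript
// Hypothetical helper: the base URL and the injectable `fetchImpl` are
// assumptions for illustration, not part of the MSRBot.io codebase.
async function loadStats(fetchImpl = fetch, baseUrl = "https://msrbot.io/") {
  const url = new URL("api/stats.json", baseUrl).toString();
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error(`Request for ${url} failed with status ${response.status}`);
  }
  // The payload shape is not documented here, so it is returned as-is.
  return response.json();
}
```

On Node 20 the global `fetch` is used by default, so `loadStats().then(console.log)` should be enough to inspect the current payload.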
Portals are curated, topic-oriented landing pages that aggregate related documents, resources, and explanatory context across publishers, suites, collections, and document types.
Unlike Suites or Collections, which are derived directly from publication structure and numbering, Portals are intentionally editorial. They are designed to provide practical entry points into complex subject areas (such as Digital Cinema, IMF, or Accessibility) without requiring prior knowledge of specific standards bodies or document identifiers.
Each Portal may include:
- A narrative overview and background context.
- A curated, non-exhaustive list of relevant documents resolved to their latest applicable versions.
- Cross-publisher coverage (e.g., SMPTE, ISO, ITU, AES).
- Structured resource links (organizations, tools, references).
- Search, filtering, and sorting consistent with Suites and Collections.
Portals are rendered as first-class pages with stable URLs (e.g. /dcinema/) and are intended to complement — not replace — authoritative publisher documentation.
MSRBot.io organizes documents using three complementary concepts, each serving a distinct purpose:
| Concept | Primary Basis | Scope | Purpose |
|---|---|---|---|
| Suites | Formal multipart standards (shared lineage / numbering) | Single publisher | Represent authoritative multipart standards and their evolution over time |
| Collections | Related documents grouped by title or theme | Single publisher | Group related documents that are not formally multipart |
| Portals | Curated topic areas | Cross-publisher | Provide navigable, contextual entry points across standards ecosystems |
Suites and Collections are derived directly from publisher-defined structures and identifiers. Portals, by contrast, are curated to support discovery, orientation, and cross-domain understanding, particularly in areas where relevant documents span multiple organizations and formats.
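The difference between derived and curated groupings can be sketched in a few lines. Everything below is illustrative: the record fields (`publisher`, `suite`, `collection`), the placeholder `docId` values, and the portal object are hypothetical and do not mirror the real `documents.json` schema.

```javascript
// Illustrative records only: these field names and docIds are hypothetical.
const docs = [
  { docId: "EX.A-1", publisher: "PublisherA", suite: "A" },
  { docId: "EX.A-2", publisher: "PublisherA", suite: "A" },
  { docId: "EX.B-guide", publisher: "PublisherB", collection: "Guides" },
];

// A portal is an editorial list of docIds and may span publishers.
const portal = { slug: "example-topic", docIds: ["EX.A-2", "EX.B-guide"] };

// Suites are derived from the data: group by publisher plus suite label.
function groupSuites(records) {
  const suites = new Map();
  for (const doc of records) {
    if (!doc.suite) continue;
    const key = `${doc.publisher}/${doc.suite}`;
    if (!suites.has(key)) suites.set(key, []);
    suites.get(key).push(doc.docId);
  }
  return suites;
}

// Portals are resolved by lookup, so membership is whatever the curator listed.
function resolvePortal(p, records) {
  const byId = new Map(records.map((doc) => [doc.docId, doc]));
  return p.docIds.map((id) => byId.get(id)).filter(Boolean);
}

const publishersInPortal = new Set(resolvePortal(portal, docs).map((d) => d.publisher));
console.log(groupSuites(docs), publishersInPortal.size);
```

In this sketch the suite stays within one publisher because its key includes the publisher, while the portal reaches across two.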
MSRBot.io updates itself through a chain of automated GitHub Actions. When appropriate, PRs generate MSR Build Preview review links.
See `docs/samples.md` for full workflow details and live run sample links.
| Stage | Purpose | Trigger | Key Output |
|---|---|---|---|
| Extract | Pulls and parses provider metadata (SMPTE/IETF) | Scheduled + Manual | documents.json |
| MSI | Builds document lineages | PR merge to main / Manual | masterSuiteIndex.json |
| MRI | Maps references across all docs | After MSI | masterReferenceIndex.json |
| MSR | Builds and publishes the site | Push to main / Manual | https://msrbot.io/ |
| URL Validate | Checks and normalizes links | After MRI / Weekly (Sat) | url_validate_audit.json |
| PR Build Preview | Builds MSR preview prior to publication | PR updates + upstream workflow runs | https://msrbot.io/pr/###/ |
%%{init: {'flowchart': {'curve': 'linear'}}}%%
graph LR
subgraph Pipeline
direction LR
A[Extract] --> B[MSI] --> C[MRI] --> E[URL Validate]
end
M[Push to main] --> D[MSR]
A -.-> P[PR Build Preview]
B -.-> P
C -.-> P
S[Site/Template PR] -.-> P
Dotted lines indicate PR-triggered preview builds. Extract, MSI, MRI, and site/template PRs all generate a preview.
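The solid arrows in the pipeline can be read as a simple "runs after" chain. The sketch below models that chain to answer which stages follow a given one. The stage names come from the table above, while the `runsAfter` object and the `downstreamOf` helper are assumptions for illustration; the real dependencies live in the GitHub Actions workflow files.

```javascript
// Stage names come from the pipeline table; the object shape is an assumption.
const runsAfter = {
  Extract: null, // scheduled or manual
  MSI: "Extract", // in practice via a PR merge to main
  MRI: "MSI",
  "URL Validate": "MRI",
};

// Walk the chain to list every stage that follows the given starting stage.
function downstreamOf(stage) {
  const result = [];
  let current = stage;
  for (;;) {
    const next = Object.keys(runsAfter).find((name) => runsAfter[name] === current);
    if (!next) return result;
    result.push(next);
    current = next;
  }
}

console.log(downstreamOf("Extract"));
```

The MSR site build is left out of the chain on purpose, since the table lists it as triggered by pushes to main rather than by an upstream stage.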
| Day | Time (UTC) | Pacific (PST) | Workflow |
|---|---|---|---|
| Monday | 04:15 | Sunday 20:15 | Extract Documents - SMPTE |
| Tuesday | 04:45 | Monday 20:45 | Extract Documents - IETF |
| Saturday | 04:15 | Friday 20:15 | Validate Document URLs |
| Sunday | 09:00 | Sunday 01:00 | PR Preview Sweeper |
| Sunday | 09:30 | Sunday 01:30 | Branch Sweeper |
PST shown above (UTC-8). During daylight saving (PDT, UTC-7), add 1 hour.
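The day rollover in the schedule table is plain offset arithmetic. As a small worked sketch (the `toPacific` helper is hypothetical), a weekly UTC slot can be shifted by a fixed number of hours, -8 for PST or -7 for PDT:

```javascript
const DAYS = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"];

// Shift a weekly UTC slot by a fixed offset in hours (PST = -8, PDT = -7).
function toPacific(day, timeUtc, offsetHours = -8) {
  const [hours, minutes] = timeUtc.split(":").map(Number);
  const total = DAYS.indexOf(day) * 1440 + hours * 60 + minutes + offsetHours * 60;
  const week = 7 * 1440;
  const wrapped = ((total % week) + week) % week; // keep the result inside one week
  const d = Math.floor(wrapped / 1440);
  const h = Math.floor((wrapped % 1440) / 60);
  const m = wrapped % 60;
  return `${DAYS[d]} ${String(h).padStart(2, "0")}:${String(m).padStart(2, "0")}`;
}

console.log(toPacific("Monday", "04:15")); // Sunday 20:15 (PST)
console.log(toPacific("Monday", "04:15", -7)); // Sunday 21:15 (PDT)
```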
Event-driven workflows run on upstream completion or repository events:
- `Build MSRBot.io Site and Test` (`push` to `main`)
- `Build MasterSuite Index` (PR merge to `main`)
- `Build MasterReference Index` (after MSI)
- `Validate Document URLs` (after MRI)
- `PR Build Preview (MSRBot.io site)` (`pull_request` and extract/MSI/MRI/URL Validate workflow runs)
URL validation throttle behavior:
- Daily throttle only considers prior runs where `Run URL validation` executed successfully.
- Skip-only successful runs (for example, upstream open-PR marker skips) do not trigger throttle.
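One way to picture the throttle rule above is as a filter over prior runs. This is a minimal sketch, not the workflow's actual logic: the run record fields (`conclusion`, `validationStepRan`, `finishedAt`) are invented names, and treating "daily" as a rolling 24-hour window is an assumption.

```javascript
// Hypothetical run records: `conclusion`, `validationStepRan`, and `finishedAt`
// are invented field names for this sketch, as is the rolling 24-hour window.
const DAY_MS = 24 * 60 * 60 * 1000;

function shouldThrottle(priorRuns, now, windowMs = DAY_MS) {
  return priorRuns.some(
    (run) =>
      run.conclusion === "success" &&
      run.validationStepRan === true && // skip-only runs leave this false
      now - Date.parse(run.finishedAt) < windowMs
  );
}

const now = Date.parse("2025-01-02T00:00:00Z");
const skipOnly = { conclusion: "success", validationStepRan: false, finishedAt: "2025-01-01T20:00:00Z" };
const realRun = { conclusion: "success", validationStepRan: true, finishedAt: "2025-01-01T20:00:00Z" };

console.log(shouldThrottle([skipOnly], now)); // false
console.log(shouldThrottle([skipOnly, realRun], now)); // true
```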
Requires Node 20 + npm.
Run scripts with:
npm run extract
npm run extract-smpte
npm run extract-ietf
npm run build-msi
npm run build-mri
npm run seed-backfill-ietf
npm run validate-url
npm run normalize-url
npm run canonicalize
npm run validate
npm run validate -- --warn
npm run docs-sort
npm run docs-validate
npm run docs-fix
npm run review-refs -- list
npm run review-refs -- resolve {docId}
npm run keywords-sync
npm run keywords-sync -- --write
npm run build
npm run local-server

Quick reference:
- `extract` / `extract-smpte`: run SMPTE document extraction.
- `extract-ietf`: run IETF document extraction.
- `build-msi`: build Master Suite Index (lineages/suites metadata).
- `build-mri`: build Master Reference Index (cross-doc reference map).
- `seed-backfill-ietf`: backfill missing IETF seeds (RFC + `IETF.draft-*`) from MRI presence-audit (`--write` to apply + canonicalize).
- `validate`: schema + registry validation (`--warn` for keyword warn-only mode).
- `docs-sort`: sort `documents.json` by `docId` (validator-compatible order).
- `docs-validate`: run document validation flow.
- `docs-fix`: run `docs-sort` then `docs-validate`.
- `review-refs`: list/resolve reference review flags (`reviewRequired`) in `documents.json`.
- `validate-url`: run URL reachability/audit checks.
- `normalize-url`: apply URL normalization/backfill from URL audit.
- `canonicalize`: normalize/sort registry JSON output format.
- `keywords-sync`: detect (or `--write` append) controlled keyword updates.
- `build-index`: build search index artifacts.
- `build-stats`: build API/site stats artifact.
- `build`: build full static site output.
- `local-server`: start local HTTP server to preview the built site.
- `audit`: generate document audit report.
For the full command and flag reference (including build-mri, build-msi, audit, validate-url, and runtime env vars), see docs/commands.md.
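To make the `docs-sort` and `docs-validate` pairing concrete, here is a minimal sketch of sorting records by `docId` and then checking the order. The helper names and the placeholder `docId` values are hypothetical, and plain string comparison is an assumption; the project's validator may define ordering differently.

```javascript
// Plain code-unit comparison is an assumption; the real validator may order differently.
function sortByDocId(docs) {
  return [...docs].sort((a, b) => (a.docId < b.docId ? -1 : a.docId > b.docId ? 1 : 0));
}

// True when every record's docId is >= the one before it.
function isSortedByDocId(docs) {
  return docs.every((doc, i) => i === 0 || docs[i - 1].docId <= doc.docId);
}

const unsorted = [{ docId: "EX.C" }, { docId: "EX.A" }, { docId: "EX.B" }];
const sorted = sortByDocId(unsorted);

console.log(isSortedByDocId(unsorted), isSortedByDocId(sorted)); // false true
```

Copying the array before sorting keeps the input untouched, which makes a dry-run comparison between the two orders straightforward.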
- `npm run extract`: convenience alias for SMPTE extraction (currently equivalent to `extract-smpte`).
- `npm run extract-smpte`: explicit SMPTE extraction.
- `npm run extract-ietf`: explicit IETF extraction.
- Under the hood, extraction now requires an explicit provider flag:
  - `node src/main/scripts/extractDocs.js --provider smpte`
  - `node src/main/scripts/extractDocs.js --provider ietf`
- If additional providers are added, use explicit scripts per provider (recommended naming: hyphen style, e.g. `extract-iso`, `extract-itu`) and keep workflow calls aligned to those script names.
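A script that insists on an explicit provider might guard its arguments along these lines. This is a hypothetical sketch and not the contents of `extractDocs.js`; the `parseProvider` name, the provider list, and the error wording are assumptions.

```javascript
// Hypothetical sketch; not the actual contents of extractDocs.js.
const KNOWN_PROVIDERS = ["smpte", "ietf"];

// Read the value following --provider and reject missing or unknown providers.
function parseProvider(argv) {
  const index = argv.indexOf("--provider");
  const value = index >= 0 ? argv[index + 1] : undefined;
  if (!value || !KNOWN_PROVIDERS.includes(value)) {
    throw new Error(`Expected --provider <${KNOWN_PROVIDERS.join("|")}>`);
  }
  return value;
}

console.log(parseProvider(["--provider", "ietf"])); // ietf
```

Failing fast on a missing flag avoids silently defaulting to one provider, which matters once more than one extraction script shares the same entry point.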
- Shared reference parsing/resolution lives in `src/main/lib/referencing.js` and is reused across providers.
- `badRefs` reports only citations that cannot be parsed into a canonical `docId`.
- Mixed reference layouts (anchor + prose risk) are currently flagged on `references.bibliographic$meta` via:
  - `reviewRequired: true`
  - `flag: "MIXED_REF_LAYOUT_RISK ..."`
- `npm run review-refs -- list` reports review flags across all docs/providers and both reference types (`references.normative$meta` and `references.bibliographic$meta`), plus `badRefs.latest` correlation.
- `npm run review-refs -- resolve <DOCID...>` clears review flags on both reference types for the provided `docId` values after manual review.
- Parseable refs that are not yet present as source documents are tracked in MRI with unresolved presence state (`sourcePresent: false`) and should be backfilled via data updates or targeted `refMap` rules.
- Use `npm run seed-backfill-ietf` to identify missing IETF seed URLs (RFC + drafts) from MRI presence-audit; use `npm run seed-backfill-ietf -- --write` to append, dedupe, and canonicalize `src/main/input/seedUrls.ietf.json`.
- Prefer href-based normalization rules in `parseRefId` for stable web patterns (for example, Unicode versions, Bugzilla issue links) and use `src/main/input/refMap.json` for curated/manual edge mappings.
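The list and resolve behavior described above can be sketched as two small functions over document records. The layout is assumed: each record is taken to carry a `references` object with `normative$meta` and `bibliographic$meta` keys, and "clearing" a flag is modeled as deleting `reviewRequired` and `flag`. Neither assumption is verified against the real `review-refs` script.

```javascript
// Assumed layout: `references` holds "normative$meta" and "bibliographic$meta".
const META_KEYS = ["normative$meta", "bibliographic$meta"];

// Collect every review flag across both reference types.
function listReviewFlags(docs) {
  const flagged = [];
  for (const doc of docs) {
    for (const key of META_KEYS) {
      const meta = doc.references?.[key];
      if (meta?.reviewRequired === true) {
        flagged.push({ docId: doc.docId, type: key, flag: meta.flag ?? "" });
      }
    }
  }
  return flagged;
}

// Clear flags on both reference types for the given docIds (mutates the records).
function resolveReviewFlags(docs, docIds) {
  const targets = new Set(docIds);
  for (const doc of docs) {
    if (!targets.has(doc.docId)) continue;
    for (const key of META_KEYS) {
      const meta = doc.references?.[key];
      if (meta) {
        delete meta.reviewRequired;
        delete meta.flag;
      }
    }
  }
  return docs;
}

const sample = [
  {
    docId: "EX.A",
    references: { "bibliographic$meta": { reviewRequired: true, flag: "MIXED_REF_LAYOUT_RISK ..." } },
  },
  { docId: "EX.B", references: {} },
];
console.log(listReviewFlags(sample).length); // 1
```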
- Source of truth for allowed keywords is `src/main/config/site.json` under `controlledKeywords`.
- `src/main/schemas/documents.schema.json` intentionally does not enforce a hard keyword enum.
- Keyword conformance is validated in `src/main/scripts/documents.validate.js` during `npm run validate`.
- Ingested IETF keywords are normalized to project style (Title Case with preserved acronyms/common forms such as `JSON`, `URN`, `B-Chain`, `DCinema`, `DCP*`, `SHA-1`).
- Validation mode can be selected at runtime:
  - Strict (default): `npm run validate` or `npm run validate -- --error`
  - Warn-only for unknown keywords: `npm run validate -- --warn`
- Extract workflows (`extract-docs-smpte.yml`, `extract-docs-ietf.yml`) run validation in warn mode for unknown keywords (`KEYWORD_VALIDATION_MODE=warn`), while build/local defaults remain strict unless overridden.
- Use keyword sync to review and optionally add new observed keywords:
  - Dry run: `npm run keywords-sync`
  - Write updates to `site.json`: `npm run keywords-sync -- --write`
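The strict and warn-only modes differ only in what happens when an unknown keyword is found. A minimal sketch of that split follows; the `checkKeywords` name, the `keywords` field on each record, the return shape, and the error text are all assumptions rather than the behavior of `documents.validate.js`.

```javascript
// Hypothetical sketch; function name, record fields, and messages are assumptions.
function checkKeywords(docs, controlledKeywords, mode = "error") {
  const allowed = new Set(controlledKeywords);
  const unknown = [];
  for (const doc of docs) {
    for (const keyword of doc.keywords ?? []) {
      if (!allowed.has(keyword)) unknown.push({ docId: doc.docId, keyword });
    }
  }
  if (unknown.length > 0 && mode !== "warn") {
    const list = unknown.map((u) => `${u.docId}: ${u.keyword}`).join(", ");
    throw new Error(`Unknown keywords: ${list}`);
  }
  return unknown; // in warn mode the caller can log these and continue
}

const records = [{ docId: "EX.A", keywords: ["JSON", "Mystery"] }];
console.log(checkKeywords(records, ["JSON", "URN"], "warn"));
```

Returning the unknown entries in warn mode also gives a sync step something to work from, since the same list is what a dry run would report.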
Issues and pull requests are welcome.
For questions or collaboration inquiries, contact Steve LLamb.
MSRBot.io aggregates factual metadata and references about publicly released standards, best practices, and other documents (e.g., from SMPTE, ISO, ITU, AES, and many others) via https://github.com/PrZ3r/MSRBot.io/.
All metadata is derived from publicly available information and is provided for research and interoperability purposes only. Original standards and other documents remain the intellectual property and copyright of their respective publishers, as applicable.