OpenMontage - Agent Guide

Start here. This is the complete operating guide and agent contract for OpenMontage.

For architecture, key files, and conventions see PROJECT_CONTEXT.md.

First Interaction — Onboarding

When the user's first message is vague, exploratory, or asks what you can do ("make me a video", "what can you do?", "help me create something", "I want to make content"), read the onboarding skill before doing anything else:

Read: skills/meta/onboarding.md

This skill teaches you to run discovery, classify the user's setup, present capabilities in plain language, and offer starter prompts tailored to their available tools. The goal: get the user from "curious" to "making a video" in under 60 seconds.

Skip onboarding when the user arrives with a specific, actionable request (e.g., "Make a 60-second explainer about black holes"). Go directly to Rule Zero.

Reference Video Entry Point

When the user provides a video URL or local video file as inspiration — for example:

  • "Can you make a video like this?"
  • "I love this YouTube Short. Make me something similar."
  • "Use this Reel as a reference."

— do not treat this as a generic web-search or prompt-writing request.

This is a first-class workflow in OpenMontage.

Required behavior

  1. Read: skills/meta/video-reference-analyst.md
  2. Run the reference analysis workflow using the local analysis tools (video_analyzer, transcript extraction, scene detection, frame sampling)
  3. Produce a grounded summary of what the reference is doing:
    • content
    • pacing
    • structure
    • style
    • what makes it work
  4. Then run normal capability audit and pipeline selection
  5. Present 2-3 differentiated concepts for the user's version — not a carbon copy

Important distinction

  • Reference-driven request: "make me something like this" -> use video-reference-analyst.md
  • Source-footage request: "edit this footage" / "cut this into clips" -> use source_media_review and the appropriate footage-led pipeline

If a model misses this distinction, it will often fall back to plain search + guesswork. That is incorrect for OpenMontage.

Rule Zero — All Production Goes Through a Pipeline

Every video production request MUST go through the pipeline system. No exceptions.

When the user asks to make, create, produce, or generate any video content — a trailer, explainer, clip, animation, or any other video — the agent must:

  1. Identify the pipeline. Match the request to one of the pipelines in pipeline_defs/. If unclear, ask the user.
  2. Read the pipeline manifest. pipeline_defs/<pipeline>.yaml — know the stages, tools, and quality gates.
  3. Run preflight. Discover available tools via the registry. Present the capability menu.
  4. Execute stage by stage. For EACH stage, read the stage director skill (skills/pipelines/<pipeline>/<stage>-director.md) BEFORE doing any work in that stage.
  5. Read Layer 3 skills before calling tools. Before using any tool with an agent_skills field, read the referenced skill in .agents/skills/. These contain provider-specific prompting guidance, parameter optimization, and quality techniques that dramatically improve output.

Do NOT:

  • Write ad-hoc Python scripts to call tools directly
  • Skip the pipeline and go straight to API calls
  • Generate assets without reading the stage director skill first
  • Use a tool without checking its Layer 3 skill for prompting guidance
  • Bypass preflight, checkpoints, or review

The intelligence is in the skills, not in improvised code. An agent that reads the director skills and Layer 3 knowledge will produce significantly better output than one that calls tools directly with generic prompts.

What OpenMontage Is

OpenMontage is an instruction-driven video production system. The AI agent IS the intelligence — it reads instructions (pipeline manifests + stage director skills + meta skills) and drives the pipeline using tools.

Agent reads pipeline manifest (YAML) -> reads stage director skill (MD)
-> uses tools (Python BaseTool subclasses) -> self-reviews (meta skill)
-> checkpoints (Python utility) -> presents to human for approval

Python = tools + persistence. No orchestration logic, creative decisions, review logic, or checkpoint policy in Python code. The agent makes those decisions guided by instructions.

Core loop:

  1. Select a pipeline.
  2. Run preflight.
  3. Discover real tools from the registry.
  4. Present the user with concepts, tool plan, production plan, and cost.
  5. Execute stage by stage with checkpoints.

Decision Communication Contract

For any meaningful production decision, the agent must communicate the decision before acting. The user should never have to infer which provider, model, or render path was chosen after the fact.

Announce Before Execution

Before any paid or consequential generation call, state:

  • the exact tool name,
  • the provider,
  • the model or provider variant,
  • the reason it was chosen,
  • whether it is a sample or a batch run.

Ask Before Major Changes

The agent must ask the user before changing any major production choice, including:

  • switching provider,
  • switching model family or provider variant,
  • switching from video-led to still-led treatment,
  • switching composition engine when that changes the output character,
  • dropping narration, music, or other approved creative elements,
  • changing from sample mode to batch mode.

Minor prompt refinements inside an already approved provider/model path do not require separate approval unless they materially change the creative direction.

Present Both Composition Runtimes (HARD RULE)

When both Remotion and HyperFrames are available on the machine (check video_compose.get_info()["render_engines"]), the agent MUST present both options to the user before locking render_runtime at the proposal stage. The agent MAY recommend one with rationale — but silently picking a "default" is forbidden even when the pipeline manifest or a director skill suggests one.

The presentation MUST include, for each runtime:

  1. A one-sentence plain-language description of what it is best at for this specific brief.
  2. A one-sentence honest tradeoff (why it might not be the right pick here).
  3. The agent's recommendation and the reason, tied to the brief's delivery_promise and visual approach.

Then wait for explicit user approval before advancing. Record the full shortlist — BOTH runtimes plus any "ffmpeg" option that applies — as options_considered in the render_runtime_selection decision logged in decision_log. A decision log entry with only one runtime considered when both were available is a CRITICAL reviewer finding.

Exception: if only one runtime is available on the machine, the agent proceeds with it but MUST say so explicitly ("HyperFrames isn't installed on this machine; I'm proceeding with Remotion. Install HyperFrames if you want the alternative."). The render_runtime_selection decision still records the unavailable option as rejected_because: "runtime not available on this machine".

This rule applies to every pipeline that invokes video_compose — not just Wave 1. A pipeline's director skill may recommend a runtime, but that recommendation is input to the conversation with the user, not a decision.
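As an illustration of the logging requirement above, here is a minimal sketch of how a reviewer might check a render_runtime_selection entry. The helper name and the entry shape (the runtime and rejected_because keys inside options_considered) are assumptions made for this example, not the project's confirmed schema.

```python
from typing import Dict, List


def runtime_decision_findings(available: Dict[str, bool], decision: dict) -> List[str]:
    """Return reviewer findings for a render_runtime_selection entry.

    `available` maps runtime name -> bool. The entry shape used here
    (runtime / rejected_because keys) is an assumption for illustration.
    """
    options = decision.get("options_considered", [])
    considered = {opt["runtime"] for opt in options}
    findings = []

    both = available.get("remotion", False) and available.get("hyperframes", False)
    if both and not {"remotion", "hyperframes"} <= considered:
        findings.append("critical: both runtimes available, but not both considered")

    # An unavailable runtime should still be recorded, with a rejection reason.
    for name in ("remotion", "hyperframes"):
        if available.get(name, False):
            continue
        recorded = any(
            opt["runtime"] == name and opt.get("rejected_because") for opt in options
        )
        if not recorded:
            findings.append(f"missing rejected option for unavailable runtime: {name}")
    return findings


decision = {
    "decision": "render_runtime_selection",
    "selected": "remotion",
    "options_considered": [
        {"runtime": "remotion"},
        {"runtime": "hyperframes",
         "rejected_because": "runtime not available on this machine"},
    ],
}
print(runtime_decision_findings({"remotion": True, "hyperframes": False}, decision))
```

The check only covers the two rules stated in this section; it does not replace the reviewer meta skill.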

Escalate Blockers Explicitly

When a blocker occurs, the agent must surface it immediately using this structure:

  1. What was attempted
  2. What failed
  3. Whether the issue is auth, provider access, tool bug, or prompt/design quality
  4. What options exist next
  5. Which option the agent recommends, with reasoning

Do not continue with a substitute path until the user approves.
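The five-part structure above can be pictured as a small record that refuses to render when a part is missing or inconsistent. This is a sketch only: BlockerReport and its field names are hypothetical and do not come from the OpenMontage codebase.

```python
from dataclasses import dataclass
from typing import List

# The four cause categories named in step 3 of the structure.
CAUSES = ("auth", "provider access", "tool bug", "prompt/design quality")


@dataclass
class BlockerReport:
    attempted: str
    failed: str
    cause: str
    options: List[str]
    recommendation: str
    reasoning: str

    def render(self) -> str:
        """Format the report in the order the user should read it."""
        if self.cause not in CAUSES:
            raise ValueError(f"cause must be one of {CAUSES}")
        if self.recommendation not in self.options:
            raise ValueError("recommendation must be one of the listed options")
        return "\n".join([
            f"1. What was attempted: {self.attempted}",
            f"2. What failed: {self.failed}",
            f"3. Cause: {self.cause}",
            f"4. Options: {'; '.join(self.options)}",
            f"5. Recommendation: {self.recommendation} ({self.reasoning})",
        ])


report = BlockerReport(
    attempted="generate the opening clip with the approved provider",
    failed="the request was rejected with an authentication error",
    cause="auth",
    options=["fix the API key and retry", "switch provider (needs approval)"],
    recommendation="fix the API key and retry",
    reasoning="keeps the approved provider and model path",
)
print(report.render())
```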

Recommendation Style

When asking the user to choose, do not just list options. The agent should:

  • provide the shortlist,
  • explain the tradeoffs briefly,
  • recommend one option,
  • wait for approval before proceeding.

No Unilateral Substitutions

If the approved path is blocked, the agent may investigate and prepare alternatives, but may not execute those alternatives without user approval.

This applies especially to:

  • provider swaps,
  • model swaps,
  • fallback tools,
  • prompt-only substitutes for reference-driven generation,
  • still-image animatics in place of true motion.

Orchestrator

The agent itself orchestrates the production state machine:

research -> proposal -> script -> scene_plan -> assets -> edit -> compose

The agent:

  1. Reads the pipeline manifest (pipeline_defs/*.yaml) to know the process
  2. Calls checkpoint.get_next_stage() to find where to resume
  3. Reads the stage's director skill (skills/pipelines/<pipeline>/<stage>-director.md) to know HOW
  4. Uses tools (tools/) for concrete capabilities
  5. Self-reviews using the reviewer meta skill (skills/meta/reviewer.md)
  6. Checkpoints via the checkpoint protocol (skills/meta/checkpoint-protocol.md)
  7. Presents to human for approval when human_approval_default: true

Infrastructure files:

  • lib/checkpoint.py — read/write checkpoints, stage validation
  • tools/cost_tracker.py — budget governance
  • lib/pipeline_loader.py — manifest loading and helpers

Project Directory Convention

Every production run creates a project workspace under projects/. This directory is gitignored — all generated assets are regenerable.

projects/<project-name>/
├── artifacts/          # JSON artifacts from each stage (research_brief, script, scene_plan, etc.)
├── assets/
│   ├── images/         # Generated images (PNG)
│   ├── video/          # Generated video clips (MP4)
│   ├── audio/          # Narration segments + final mix (MP3/WAV)
│   ├── music/          # Background music track (MP3)
│   └── subtitles.srt   # Generated subtitles
└── renders/
    └── final.mp4       # Final rendered video (the deliverable)

Naming convention: Use kebab-case derived from the video title (e.g., hidden-math-of-nature, how-music-rewires-brain).

Create the project directory at pipeline initialization, before any stage runs. All tools and agents should write outputs to these paths — never to the repo root or ad-hoc locations.
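A minimal sketch of the workspace convention, assuming a simple slug rule (lowercase, non-alphanumeric runs collapsed to hyphens). The function names are hypothetical, and the demo runs in a temporary directory rather than a real checkout.

```python
import re
import tempfile
from pathlib import Path

# Subdirectories taken from the layout shown above.
SUBDIRS = (
    "artifacts",
    "assets/images",
    "assets/video",
    "assets/audio",
    "assets/music",
    "renders",
)


def project_slug(title: str) -> str:
    """Derive a kebab-case project name from a video title."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def create_project_workspace(repo_root: Path, title: str) -> Path:
    """Create projects/<project-name>/ with the expected subdirectories."""
    project = repo_root / "projects" / project_slug(title)
    for sub in SUBDIRS:
        (project / sub).mkdir(parents=True, exist_ok=True)
    return project


# Demo in a throwaway directory so nothing touches a real repo.
with tempfile.TemporaryDirectory() as tmp:
    project = create_project_workspace(Path(tmp), "Hidden Math of Nature")
    created = sorted(
        p.relative_to(project).as_posix() for p in project.rglob("*") if p.is_dir()
    )
    print(project.name, created)
```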

Music Library

Users can place royalty-free music tracks in music_library/ (gitignored). The asset director will check this folder before falling back to API-based music generation.

music_library/
├── ambient_track.mp3
├── cinematic_epic.mp3
└── ...

If the folder has tracks, the proposal and asset stages should present them as options alongside generated music. See the proposal-director and asset-director skills for details.

Available Pipelines

| Pipeline | Best For | Stability |
| --- | --- | --- |
| animated-explainer | Topic to fully generated explainer | production |
| talking-head | Footage-led speaker videos | beta |
| screen-demo | Screen recordings and walkthroughs | production |
| clip-factory | Many clips from one long source | beta |
| podcast-repurpose | Podcast highlights and derivatives | beta |
| cinematic | Trailer, teaser, and mood-led edits | production |
| animation | Motion-graphics and animation-first videos | production |
| character-animation | Local rigged cartoon characters and reusable character acting | beta |
| hybrid | Source footage plus support visuals | production |
| avatar-spokesperson | Presenter-led avatar or lip-sync videos | production |
| localization-dub | Subtitle, dub, and translated variants | beta |
| framework-smoke | Test: minimal 2-stage smoke test | test |

Beta pipelines have not been fully audited. They work, but expect rough edges. Mention this when the user selects one.

Mandatory Preflight

Do this before any creative work. Use provider_menu_summary() first — it's the human-ready rollup. The raw support_envelope() dump is a firehose (megabytes of JSON on a well-configured machine); pasting it into chat will bury the user.

python -c "
from tools.tool_registry import registry
import json
registry.discover()
print(json.dumps(registry.provider_menu_summary(), indent=2))
"

The summary returns four fields the agent should translate into plain language:

  • composition_runtimes — booleans for ffmpeg, remotion, hyperframes. This is the source of truth for the "Present Both Composition Runtimes (HARD RULE)" check.
  • capabilities[] — one entry per capability family with configured / total counts and provider lists. Ready-made for the "N of M configured" menu.
  • setup_offers[] — unavailable tools whose install is a 1-minute env-var fix. Lead with these when offering upgrades.
  • runtime_warnings[] — specific signals like "hyperframes: npm package not resolvable". Surface these to the user verbatim — they're the kind of silent-failure bugs that break the governance contract.

Then, for deeper inspection (only when the summary isn't enough):

# Full menu — grouped available/unavailable per capability.
python -c "from tools.tool_registry import registry; import json; registry.discover(); print(json.dumps(registry.provider_menu(), indent=2))"

# Raw envelope — every tool's full contract. Slow/firehose; use for debugging only.
python -c "from tools.tool_registry import registry; import json; registry.discover(); print(json.dumps(registry.support_envelope(), indent=2))"

Then:

  1. Read the selected manifest in pipeline_defs/.
  2. Check every required_tools entry against the registry.
  3. Check fallback_tools for unavailable tools.
  4. Report one of: passed, degraded, or blocked.
  5. Do not start production until the user understands the real capability envelope.
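One way to read the passed / degraded / blocked report, sketched under an assumption: "degraded" is taken here to mean that every missing required tool has at least one available fallback. The manifest may define these terms differently, and the function name and argument shapes are hypothetical.

```python
from typing import Dict, List, Set


def preflight_status(
    required_tools: List[str],
    fallback_tools: Dict[str, List[str]],
    available: Set[str],
) -> str:
    """Classify a pipeline as passed, degraded, or blocked.

    Assumed reading: a missing required tool with an available fallback
    degrades the run; a missing tool with no usable fallback blocks it.
    """
    status = "passed"
    for tool in required_tools:
        if tool in available:
            continue
        if any(fb in available for fb in fallback_tools.get(tool, [])):
            status = "degraded"
        else:
            return "blocked"
    return status


required = ["video_compose", "tts_selector", "video_selector"]
fallbacks = {"video_selector": ["image_selector"]}

print(preflight_status(required, fallbacks, {"video_compose", "tts_selector", "video_selector"}))
print(preflight_status(required, fallbacks, {"video_compose", "tts_selector", "image_selector"}))
print(preflight_status(required, fallbacks, {"video_compose", "image_selector"}))
```

Whatever the exact definitions, the result is something to report to the user, not a licence to proceed on a degraded path without telling them.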

Provider Menu (Mandatory at Preflight)

Already fetched via provider_menu_summary() above. Read that output and present it to the user as a capability menu, not as a flat tool list. Use provider_menu() directly only when you need the per-tool detail the summary collapses.

How to present:

YOUR CAPABILITIES

  Video Generation:  0/13 configured
  Image Generation:  1/7 configured
  Text-to-Speech:    1/3 configured
  Music Generation:  1/1 configured
  Composition:       3/3 configured (FFmpeg, video_stitch, video_trimmer)

  You can produce videos now with images + TTS + FFmpeg.
  Quick upgrades available — see below.
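A sketch of how the ratio lines in such a menu could be derived from the summary. The key names inside each capabilities entry (capability, configured, total) and the label map are assumptions; check the real provider_menu_summary() output before relying on them.

```python
# Display labels for capability keys that do not title-case cleanly (assumed).
LABELS = {"tts": "Text-to-Speech"}


def format_capability_menu(summary: dict) -> str:
    """Render 'N/M configured' lines plus any runtime warnings, verbatim."""
    lines = ["YOUR CAPABILITIES", ""]
    for entry in summary.get("capabilities", []):
        key = entry["capability"]
        label = LABELS.get(key, key.replace("_", " ").title())
        lines.append(f"  {label}: {entry['configured']}/{entry['total']} configured")
    for warning in summary.get("runtime_warnings", []):
        lines.append(f"  WARNING: {warning}")
    return "\n".join(lines)


summary = {
    "capabilities": [
        {"capability": "video_generation", "configured": 0, "total": 13},
        {"capability": "image_generation", "configured": 1, "total": 7},
        {"capability": "tts", "configured": 1, "total": 3},
    ],
    "runtime_warnings": ["hyperframes: npm package not resolvable"],
}
menu = format_capability_menu(summary)
print(menu)
```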

For EACH capability with unavailable providers, read the install_instructions field from the menu output and present setup options grouped by effort:

QUICK SETUP OPTIONS (1-minute each — set an env var in .env)

  Video Generation (0/13 -> unlock the biggest upgrade):
    Each unavailable provider lists its own install_instructions.
    Read them from the provider_menu output and present grouped by env var.
    Example: if 3 tools need FAL_KEY, group them: "FAL_KEY unlocks 3 providers"

  Image Generation (1/7 -> more style options):
    Same pattern — read install_instructions from each unavailable tool.

  Text-to-Speech (1/3):
    Same pattern.

LOCAL OPTIONS (free, needs hardware):
  Tools with runtime=LOCAL or runtime=LOCAL_GPU — read from the menu.

Already Available:
  List what's working. The user should feel good about what they have.

Rules:

  • Do NOT hardcode provider names, API key names, or setup URLs in your prompts. Read them from the registry's install_instructions field on each tool.
  • Always show the ratio: "X of Y configured" — this makes breadth visible.
  • Group by capability, not by individual tool.
  • Show what they CAN do now, then what they COULD unlock.
  • If the user declines setup, proceed with the best available path — no nagging.
  • If a tool shares an env var with others, group them (read from dependencies field).
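The grouping rule can be sketched as follows. It assumes each unavailable tool exposes a name and a dependencies list of env var names, which may not match the real field format. The tool names and the second key below are placeholders for the example; in practice every name comes from the registry.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_env_var(unavailable_tools: List[dict]) -> Dict[str, List[str]]:
    """Map each env var to the tools it would unlock."""
    groups = defaultdict(list)
    for tool in unavailable_tools:
        for env_var in tool.get("dependencies", []):
            groups[env_var].append(tool["name"])
    return dict(groups)


def setup_lines(groups: Dict[str, List[str]]) -> List[str]:
    """One line per env var, biggest unlock first."""
    ordered = sorted(groups.items(), key=lambda item: -len(item[1]))
    return [
        f"{env_var} unlocks {len(names)} provider{'s' if len(names) != 1 else ''}"
        for env_var, names in ordered
    ]


tools = [
    {"name": "provider_a", "dependencies": ["FAL_KEY"]},
    {"name": "provider_b", "dependencies": ["FAL_KEY"]},
    {"name": "provider_c", "dependencies": ["FAL_KEY"]},
    {"name": "provider_d", "dependencies": ["OTHER_KEY"]},
]
lines = setup_lines(group_by_env_var(tools))
print(lines)
```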

Setup Offer Protocol

When tools are UNAVAILABLE but can be fixed with simple configuration, offer the user setup help instead of silently working around the limitation. Many tools are one env var away from working.

| Fix Complexity | Action |
| --- | --- |
| 1-minute fix (env var) | Offer to help configure now; read install_instructions from the tool |
| 5-minute fix (install) | Explain what to install and why; read install_instructions from the tool |
| Complex fix (GPU, model download) | Note the limitation, explain what it would unlock, move on |

Rules:

  • Always tell the user what they're missing AND what they'd gain
  • Show the cost difference (free local vs. paid API)
  • If the user declines setup, proceed with the best available path — no nagging
  • Group related fixes (tools sharing the same env var dependency)

Composition Runtimes (Inside video_compose)

video_compose has three render engines / runtimes. They are parallel, not ranked — the choice is made at proposal and locked in edit_decisions.render_runtime. Check which are available:

python -c "
from tools.tool_registry import registry
registry.discover()
info = registry._tools['video_compose'].get_info()
print('Render engines:', info.get('render_engines'))
print('Remotion note:', info.get('remotion_note'))
print('HyperFrames note:', info.get('hyperframes_note'))
"

| Engine | Used For | Requires |
| --- | --- | --- |
| FFmpeg | Video-only cuts, concat, trim, subtitle burn | ffmpeg binary (always available) |
| Remotion | React-based composition: still images → animated video, text cards, stat cards, charts, callouts, comparisons, transitions with spring physics, word-level caption burn, TalkingHead avatar | Node.js (npx) + remotion-composer/ + node_modules |
| HyperFrames | HTML/CSS/GSAP composition: kinetic typography, product promos, launch reels, website-to-video, registry-block-driven scenes, SVG character rigs | Node.js ≥ 22 + FFmpeg + npx (consumed via npx hyperframes) |

render_runtime is locked at proposal (proposal_packet.production_plan.render_runtime) and carried through edit_decisions unchanged. video_compose routes based on this field; silent runtime swaps are forbidden. If the chosen runtime becomes unavailable at compose time, surface a structured blocker per "Escalate Blockers Explicitly" above. See skills/core/hyperframes.md for the Remotion-vs-HyperFrames decision matrix.

Critical Rule: Motion-Required Requests

For any request where the deliverable inherently depends on motion rather than static coverage, treat motion as a hard requirement. Examples:

  • sci-fi trailers,
  • cinematic teasers built from generated clips,
  • hype edits,
  • avatar or agent videos,
  • any brief whose promise depends on moving shots rather than still frames.

For these requests:

  • The render_runtime chosen at proposal (Remotion, HyperFrames, or FFmpeg) must be confirmed available up front if the planned visual treatment depends on it.
  • Still-image fallback is forbidden. Do not quietly convert the job into a Ken Burns teaser, animatic, or slide-based video.
  • FFmpeg-only fallback is forbidden when it changes the approved deliverable from motion-led video to still-led video.
  • Silent runtime swap is forbidden. If render_runtime="hyperframes" was locked and HyperFrames is unavailable, do NOT route to Remotion instead. Surface the blocker, propose options, get user approval, log a render_runtime_selection decision — then proceed.
  • Bubble critical issues immediately. If the chosen runtime is unavailable, fails to render, or provider clip generation fails in a way that blocks the approved treatment, stop and tell the user before proceeding.
  • Do not spend more tokens or time on downgraded output unless the user explicitly approves the downgrade as an animatic or proof-of-concept.

When Remotion is available, the agent should design production plans around it:

  • Explainer videos with flat-motion-graphics playbook -> Remotion animated scenes, not Ken Burns
  • Data-driven videos -> Remotion stat cards and charts, not static image screenshots
  • Any pipeline using still images -> Remotion spring animations, not FFmpeg pan-and-zoom
  • Screen demos of a CLI/terminal/install flow -> TerminalScene (synthetic screen recording), not OS-level capture. See .agents/skills/synthetic-screen-recording/SKILL.md. Faster, deterministic, privacy-safe. Use real capture (screen_recorder, cap_recorder, playwright-recording) only when the demo is a real app UI or requires unpredictable live behavior.

Remotion scene types available in remotion-composer/

See remotion-composer/SCENE_TYPES.md for the authoritative list and their cut schemas. Current scene types usable via cut.type: text_card, stat_card, callout, comparison, hero_title, terminal_scene, anime_scene, bar_chart, line_chart, pie_chart, kpi_grid, progress_bar. Overlay types include section_title, stat_reveal, hero_title, provider_chip.

When Remotion is NOT available and render_runtime="remotion" was NOT locked, video_compose may use FFmpeg Ken Burns motion on still images. This still works but produces less engaging visuals. Mention this tradeoff in the proposal. When render_runtime="remotion" IS locked and Remotion is unavailable, that's a blocker — escalate, don't silently swap.

When render_runtime="hyperframes" is locked and HyperFrames is unavailable (Node < 22, missing ffmpeg/npx, or hyperframes doctor reports issues), that's also a blocker. Do not substitute Remotion or FFmpeg without user approval + a logged render_runtime_selection decision.

Routing is automatic — video_compose reads edit_decisions.render_runtime and dispatches to the matching engine (_render_via_hyperframes, _remotion_render, or _render_via_ffmpeg). But the agent must know both Remotion and HyperFrames exist at proposal time so it can design the visual approach intentionally. Don't default to Remotion for motion-graphics-heavy concepts that HTML/GSAP would express more naturally, and don't default to HyperFrames for briefs that reuse the existing React scene stack.
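The "no silent swap" rule can be illustrated with a small resolver that either returns the locked runtime or raises a blocker. This is not the actual video_compose dispatch code; RuntimeBlocker and resolve_runtime are hypothetical names used only for this sketch.

```python
from typing import Dict

KNOWN_RUNTIMES = ("ffmpeg", "remotion", "hyperframes")


class RuntimeBlocker(Exception):
    """Raised instead of silently swapping to another engine."""


def resolve_runtime(locked: str, available: Dict[str, bool]) -> str:
    """Return the locked runtime, or raise if it cannot be honoured."""
    if locked not in KNOWN_RUNTIMES:
        raise ValueError(f"unknown render_runtime: {locked!r}")
    if not available.get(locked, False):
        others = sorted(name for name, ok in available.items() if ok)
        raise RuntimeBlocker(
            f"render_runtime={locked!r} is locked but unavailable; "
            f"alternatives that need user approval: {others}"
        )
    return locked


available = {"ffmpeg": True, "remotion": True, "hyperframes": False}
print(resolve_runtime("remotion", available))
try:
    resolve_runtime("hyperframes", available)
except RuntimeBlocker as blocker:
    print("BLOCKER:", blocker)
```

The point of raising rather than falling back is that the alternatives reach the user as options, never as a decision already taken.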

Capability Discovery

OpenMontage uses two layers for capability choice:

  • selector tools: capability-level routing such as tts_selector and video_selector
  • provider tools: concrete tools discovered via the registry that call a specific backend

Always inspect the registry first:

python -c "from tools.tool_registry import registry; import json; registry.discover(); print(json.dumps(registry.capability_catalog(), indent=2))"
python -c "from tools.tool_registry import registry; import json; registry.discover(); print(json.dumps(registry.provider_catalog(), indent=2))"

For finalist tools inspect:

  • capability
  • provider
  • usage_location
  • supports
  • fallback_tools
  • related_skills

Do not rely on memory or old docs when the registry can answer it.

Tool Families

Do not maintain hardcoded tool lists. Always query the registry at runtime:

# See all tools grouped by capability (TTS, video_generation, image_generation, etc.)
python -c "from tools.tool_registry import registry; import json; registry.discover(); print(json.dumps(registry.capability_catalog(), indent=2))"

# See all tools grouped by provider (elevenlabs, openai, ffmpeg, etc.)
python -c "from tools.tool_registry import registry; import json; registry.discover(); print(json.dumps(registry.provider_catalog(), indent=2))"

Key capability families to look for in the output:

  • tts — Text-to-speech providers. Route via tts_selector.
  • video_generation — Video generation providers (cloud, local GPU, stock). Route via video_selector.
  • image_generation — Image generation providers (cloud, local GPU, stock). Route via image_selector.
  • music_generation — Music and sound effect generation.
  • video_post — Composition, stitching, trimming (FFmpeg-based, always local).
  • audio_processing — Mixing, enhancement (FFmpeg-based, always local).
  • analysis — Transcription, scene detection, frame sampling.
  • avatar — Talking head and lip sync generation.
  • character_animation — Local character specs, SVG rigs, pose libraries, action timelines, previews, and QA.
  • enhancement — Upscale, background removal, face enhance, color grading.

Each tool in the registry declares best_for, install_instructions, runtime (LOCAL, API, LOCAL_GPU, HYBRID), and status. Read these fields — do not assume tool strengths from memory.

Tool Class Naming Convention

All tool classes use PascalCase without a "Tool" suffix. When importing tools in Python:

| Module | Class Name | NOT |
| --- | --- | --- |
| tools.audio.music_gen | MusicGen | MusicGenTool |
| tools.video.video_compose | VideoCompose | VideoComposeTool |
| tools.audio.audio_mixer | AudioMixer | AudioMixerTool |
| tools.tts.elevenlabs_tts | ElevenLabsTTS | ElevenLabsTTSTool |
| tools.analysis.transcriber | Transcriber | TranscriberTool |
| tools.subtitle.subtitle_gen | SubtitleGen | SubtitleGenTool |

When in doubt, check: grep "^class " tools/<path>.py

All tools are called via .execute(params_dict), which returns a ToolResult with .success, .data, and .error. Do not call .run().
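A sketch of the calling convention using local stand-ins. ToolResult and EchoTool below are simplified placeholders written for this example; the real classes live in tools/base_tool.py and may carry more fields.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToolResult:
    """Stand-in with the three attributes named in this guide."""
    success: bool
    data: Any = None
    error: Optional[str] = None


class EchoTool:
    """Hypothetical tool used only to demonstrate the calling convention."""

    def execute(self, params: dict) -> ToolResult:
        if "text" not in params:
            return ToolResult(success=False, error="missing required param: text")
        return ToolResult(success=True, data={"echo": params["text"]})


def run_tool(tool, params: dict):
    """Call .execute() and surface failures instead of ignoring them."""
    result = tool.execute(params)
    if not result.success:
        raise RuntimeError(f"tool failed: {result.error}")
    return result.data


print(run_tool(EchoTool(), {"text": "hello"}))
```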

Selector Pattern

Three selector tools abstract multi-provider capabilities. Selectors auto-discover providers from the registry. Adding a new provider tool automatically makes it available through the selector — no selector code changes needed.

| Selector | Routes to | How it discovers |
| --- | --- | --- |
| tts_selector | All tools with capability="tts" (ElevenLabs, Google TTS, OpenAI, Piper) | registry.get_by_capability("tts") |
| image_selector | All tools with capability="image_generation" (FLUX, Google Imagen, DALL-E, Recraft, etc.) | registry.get_by_capability("image_generation") |
| video_selector | All tools with capability="video_generation" | registry.get_by_capability("video_generation") |

Selectors route based on: user preference > availability > discovery order. They adapt input schemas between providers transparently.
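The routing priority can be sketched as a small function. It models only the stated ordering (user preference, then availability, then discovery order) and not schema adaptation; pick_provider and the provider names are hypothetical.

```python
from typing import List, Optional, Set


def pick_provider(
    discovered: List[str],
    available: Set[str],
    preferred: Optional[str] = None,
) -> Optional[str]:
    """Pick a provider: user preference > availability > discovery order."""
    if preferred in available and preferred in discovered:
        return preferred
    for name in discovered:  # list order stands in for discovery order
        if name in available:
            return name
    return None  # nothing usable: report this to the user


discovered = ["provider_a", "provider_b", "provider_c"]
available = {"provider_b", "provider_c"}

print(pick_provider(discovered, available))
print(pick_provider(discovered, available, preferred="provider_c"))
print(pick_provider(discovered, available, preferred="provider_a"))
```

In the last call the preferred provider is unavailable, so the sketch falls through to the first available one. Under this guide's communication rules, that kind of provider switch should be announced to the user and approved rather than happen quietly.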

User-Facing Planning Protocol

Before committing to execution, present:

  1. 4-5 concept directions when the brief is still open.
  2. Recommended pipeline.
  3. Recommended tool path.
  4. Alternative tool paths that are actually available.
  5. Cost estimate and quality tradeoffs.
  6. Music plan — mandatory for every pipeline that has audio. See below.
  7. Production plan by stage.
  8. Approval gate before asset generation.

If a user prefers a specific vendor and that tool is available, surface it directly. Do not hide provider choice.

Music Plan (Mandatory)

Music is a critical part of any video. Surface the music situation to the user at proposal/idea time — do not silently defer it to the asset stage where a failure becomes expensive.

Check music availability in this order and present the options:

  1. User music library (music_library/): Check if this folder exists and contains tracks. If so, list available tracks with durations and let the user pick one.
  2. Music generation APIs: Check which music tools are available via the registry (registry.get_by_capability("music_generation")). Report their status honestly — include quota status if known.
  3. Royalty-free sources: Note if the user can provide their own track (e.g., from YouTube Audio Library, Jamendo, or other free sources). Offer the music_library/ drop path.

Always present the user with explicit choices:

  • Use a track from their library (which one?)
  • Provide a different track (drop it in music_library/)
  • Generate one via API (if available — name the provider and cost)
  • Proceed without music

If no music source is available: Tell the user explicitly. Do NOT let this surface as a surprise at the asset stage.

Record the music decision in the proposal/brief artifact so the asset director knows what to do.
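A minimal sketch of building the explicit choice list, assuming the library holds MP3 or WAV files. Track durations are left out because reading them needs an audio library; the function names are hypothetical and the demo uses a temporary folder.

```python
import tempfile
from pathlib import Path
from typing import List

AUDIO_SUFFIXES = {".mp3", ".wav"}  # assumed set of accepted formats


def list_library_tracks(library: Path) -> List[str]:
    """Return audio file names in the library, or [] if it does not exist."""
    if not library.is_dir():
        return []
    return sorted(
        p.name
        for p in library.iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_SUFFIXES
    )


def music_options(tracks: List[str], generator_available: bool) -> List[str]:
    """Build the explicit choices to present at proposal time."""
    options = [f"Use library track: {name}" for name in tracks]
    options.append("Provide a different track (drop it in music_library/)")
    if generator_available:
        options.append("Generate one via API (name the provider and cost)")
    options.append("Proceed without music")
    return options


# Demo against a throwaway folder standing in for music_library/.
with tempfile.TemporaryDirectory() as tmp:
    library = Path(tmp) / "music_library"
    library.mkdir()
    (library / "ambient_track.mp3").write_bytes(b"")
    (library / "notes.txt").write_text("not audio")
    tracks = list_library_tracks(library)
    options = music_options(tracks, generator_available=False)
    print(options)
```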

Pipeline Asset Expectations

Each pipeline manifest's tools_available field declares what tools a stage can use. Use selectors for multi-provider capabilities — the selector handles routing to whatever is available. Read the pipeline manifest for the authoritative list per stage.

Stage Agents

Each stage produces one canonical artifact that becomes the contract for the next stage. The stage director skill teaches the agent HOW to produce it.

| Stage | Director Skill | Canonical output | Core quality bar |
| --- | --- | --- | --- |
| idea | *-director.md | brief | Clear hook, target platform, duration, tone, and user intent |
| script | *-director.md | script | Structured sections, valid timing, coherent narration |
| scene_plan | *-director.md | scene_plan | Ordered scenes, timings, asset requirements |
| assets | *-director.md | asset_manifest | Provenance, paths, model/tool metadata, scene linkage |
| edit | *-director.md | edit_decisions | Concrete cuts, overlays, subtitle/music decisions |
| compose | *-director.md | render_report | Output paths, encoding profile, verification notes |

Stage contract rules:

  • A completed or awaiting-human checkpoint must include the stage's canonical artifact.
  • Canonical artifacts must validate against the JSON schema in schemas/artifacts/.
  • Non-canonical outputs such as media files belong in stage-specific directories.
  • Tools should record seeds/model versions for reproducibility.

Reviewer Protocol

The reviewer is a meta skill (skills/meta/reviewer.md) — advisory, never directly blocks progression.

  • Self-review after every stage execution, before checkpointing.
  • Load review_focus items from the pipeline manifest for the current stage.
  • Maximum two review rounds. After that, pass with warnings and move on.
  • Findings categorized: critical (must fix), suggestion (should fix), nitpick (nice-to-have).
  • Critical findings -> fix and re-review. Suggestions -> note and proceed.
  • Check playbook quality_rules as constraints, not suggestions.
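The round limit can be sketched as a loop. In OpenMontage the agent performs review and fixes by following the reviewer skill, not Python code, so this only illustrates the control flow. The callables, the finding shape (a severity key), and the status strings are assumptions.

```python
from typing import Callable, List, Tuple


def self_review(
    artifact: dict,
    review: Callable[[dict], List[dict]],
    fix: Callable[[dict, List[dict]], dict],
    max_rounds: int = 2,
) -> Tuple[dict, str, List[dict]]:
    """Review up to max_rounds times; fix critical findings between rounds."""
    findings: List[dict] = []
    for round_number in range(1, max_rounds + 1):
        findings = review(artifact)
        critical = [f for f in findings if f["severity"] == "critical"]
        if not critical:
            # Suggestions and nitpicks are noted, not blocking.
            return artifact, "passed", findings
        if round_number < max_rounds:
            artifact = fix(artifact, critical)
    # Rounds exhausted: pass with warnings and move on.
    return artifact, "passed_with_warnings", findings


def review_brief(artifact: dict) -> List[dict]:
    if "hook" in artifact:
        return [{"severity": "suggestion", "note": "tighten the hook"}]
    return [{"severity": "critical", "note": "missing hook"}]


def add_hook(artifact: dict, critical: List[dict]) -> dict:
    return {**artifact, "hook": "placeholder hook"}


print(self_review({}, review_brief, add_hook)[1])
print(self_review({}, review_brief, lambda a, c: a)[1])
```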

Human Checkpoint Protocol

The checkpoint protocol meta skill (skills/meta/checkpoint-protocol.md) teaches the agent when to pause:

  • Read human_approval_default from the pipeline manifest per stage
  • Creative stages (idea, script, scene_plan) typically require approval
  • Technical stages (assets, edit, compose) typically auto-proceed
  • When approval is required: present artifact summary, review findings, and cost snapshot
  • Wait for human to approve, request revision, or abort

Communication Protocol

Agents coordinate through canonical JSON artifacts, checkpoints, pipeline manifests, and the tool registry.

Primary files:

  • Artifact schemas: schemas/artifacts/
  • Checkpoint schema: schemas/checkpoints/checkpoint.schema.json
  • Pipeline manifest schema: schemas/pipelines/pipeline_manifest.schema.json
  • Pipeline manifests: pipeline_defs/
  • Style playbooks: styles/*.yaml (validated by schemas/styles/playbook.schema.json)
  • Tool contract: tools/base_tool.py
  • Tool registry: tools/tool_registry.py
  • Stage director skills: skills/pipelines/<pipeline>/<stage>-director.md
  • Meta skills: skills/meta/*.md

Checkpoint rules:

  • Checkpoints live at pipelines/<project_id>/checkpoint_<stage>.json.
  • status may be completed, failed, awaiting_human, or in_progress.
  • completed and awaiting_human checkpoints must include the canonical artifact.
  • Invalid checkpoints or invalid canonical artifacts are contract violations and should fail fast.
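Two of the checkpoint rules above lend themselves to a quick check. This sketch is not a substitute for validating against schemas/checkpoints/checkpoint.schema.json, and the checkpoint shape it assumes (status plus an artifacts mapping) is a guess made for illustration.

```python
from typing import List

VALID_STATUSES = {"completed", "failed", "awaiting_human", "in_progress"}
NEEDS_ARTIFACT = {"completed", "awaiting_human"}


def checkpoint_errors(checkpoint: dict, canonical_artifact: str) -> List[str]:
    """Return contract violations for one checkpoint (empty list if none)."""
    errors = []
    status = checkpoint.get("status")
    if status not in VALID_STATUSES:
        errors.append(f"invalid status: {status!r}")
    elif status in NEEDS_ARTIFACT and canonical_artifact not in checkpoint.get("artifacts", {}):
        errors.append(
            f"{status} checkpoint is missing canonical artifact: {canonical_artifact}"
        )
    return errors


good = {"stage": "script", "status": "completed", "artifacts": {"script": {}}}
missing = {"stage": "script", "status": "awaiting_human", "artifacts": {}}

print(checkpoint_errors(good, "script"))
print(checkpoint_errors(missing, "script"))
```

Failing fast here means stopping on a non-empty result instead of carrying an invalid checkpoint into the next stage.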

Pipeline manifest rules:

  • Pipelines are declarative YAML manifests in pipeline_defs/.
  • Stages declare: skill (director skill path), produces, tools_available, review_focus, success_criteria, human_approval_default.
  • Adding a new pipeline requires a manifest + stage director skills.

Tool rules:

  • Every production tool must inherit from BaseTool.
  • Tool discovery flows through the registry, not ad hoc imports.
  • Support-envelope reporting is the source of truth for capability, status, and resource requirements.

Style Playbooks

| Playbook | Best For |
| --- | --- |
| clean-professional | Corporate, educational, SaaS |
| flat-motion-graphics | Social media, TikTok, startups |
| minimalist-diagram | Technical deep-dives, architecture |

Layer Map

OpenMontage has three instruction layers:

  1. tools/: What exists, what is available, cost, runtime, fallback, related skills.
  2. skills/: How OpenMontage wants those tools used in pipelines.
  3. .agents/skills/: Raw vendor or technology knowledge.

Reading order:

  1. registry / tool contract — discover what's available
  2. relevant pipeline or creative skill (Layer 2) — know HOW to use it in this context
  3. underlying vendor skill (Layer 3) — mandatory before calling any generation tool

Prefer skills over source code for tool usage. Skills exist precisely so you don't need implementation details in the common case. Layer 2 tells you what and when. Layer 3 tells you how. For authoring prompts, choosing parameters, or understanding usage patterns, you should be reading skills — not .py files.

Exception: debugging, audits, and verifying the governance contract. When a skill and a tool disagree, or when something behaves differently than the skill claims, reading the tool source is fair game — that's often the only way to catch a silent-availability bug or a stale doc string. An audit that refuses to look at the implementation will miss exactly the bugs that matter most. If you do read source to debug, consider whether the finding belongs in a skill update afterward so the next agent doesn't need to repeat the dive.

Layer 3 is not optional. Every generation tool (video, image, TTS, music) has an agent_skills field listing its Layer 3 skills. These skills contain provider-specific prompt engineering, parameter tuning, and quality techniques. Read them before writing prompts. The difference between a generic prompt and a skill-informed prompt is the difference between "usable" and "cinematic."

Example: Before calling kling_video, read the skill listed in its agent_skills field (ai-video-gen) to get Kling-specific prompt structure, camera direction syntax, and the quality keywords the model responds to best.
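
One way to make the "read Layer 3 first" rule mechanical is to compare a tool's `agent_skills` list against the skills already read in the session. The sketch below assumes the registry entry is available as a plain dict with an `agent_skills` list. That dict shape and the function name are illustrative, not the registry's actual return type.

```python
from typing import Iterable, List


def unread_layer3_skills(tool_info: dict, skills_read: Iterable[str]) -> List[str]:
    """Return the tool's Layer 3 skills that have not been read yet."""
    already_read = set(skills_read)
    return [s for s in tool_info.get("agent_skills", []) if s not in already_read]


# Example entry mirroring the kling_video case above (dict shape is assumed).
kling_info = {"name": "kling_video", "agent_skills": ["ai-video-gen"]}

pending = unread_layer3_skills(kling_info, skills_read=[])
if pending:
    print(f"Read before prompting {kling_info['name']}: {', '.join(pending)}")
```

If the returned list is non-empty, the agent should stop and read those skills before writing any prompt for the tool.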

Layer 3 skills, by category

The .agents/skills/ directory is large. When you're not coming in through a tool's agent_skills pointer, use this table to find the right file by what you're trying to do:

| Category | Skills |
| --- | --- |
| Composition runtime | remotion, remotion-best-practices, synthetic-screen-recording (fake terminal/UI demos via Remotion TerminalScene) |
| Animation knowledge (generic) | gsap-core, gsap-timeline, gsap-plugins (SplitText / MorphSVG / DrawSVG / MotionPath / Flip / CustomEase), gsap-utils, gsap-react, gsap-performance, gsap-scrolltrigger, gsap-frameworks, framer-motion (Disney 12 principles), lottie-bodymovin (Lottie export) |
| Character animation | character-rigging, svg-character-animation, pose-library-design, canvas-procedural-animation, character-animation-qa |
| Image generation | bfl-api, flux-best-practices |
| Video generation | seedance-2-0 (preferred premium default: cinematic, trailer, multi-shot, synced audio, lip-sync), ai-video-gen, ltx2 |
| Audio | elevenlabs, music, sound-effects, acestep, text-to-speech, setup-api-key |
| Avatar / lip-sync | avatar-video, heygen, create-video, faceswap, video-translate, speech-to-text, agents |
| Capture | playwright-recording (browser flows), ffmpeg (post) |
| Visualization | beautiful-mermaid, d3-viz, manim-composer, manimce-best-practices, manimgl-best-practices |
| Media editing | video-edit, video-download, video-understand, video_toolkit, visual-style |

When in doubt, read the category's meta routing file first:

  • Picking an animation runtime? → skills/meta/animation-runtime-selector.md routes between Remotion primitives, GSAP plugins, framer-motion, Lottie, Manim, D3.
  • Picking a screen-recording mode (real capture vs synthetic terminal)? → pipeline_defs/screen-demo.yaml + skills/pipelines/screen-demo/idea-director.md.

Quick Lookup

| Question | Where to look |
| --- | --- |
| What tools exist? | tools/tool_registry.py and registry.support_envelope() |
| What providers are available for a capability? | registry.capability_catalog() |
| What tools exist for a vendor? | registry.provider_catalog() |
| How does a tool actually work? | the tool's usage_location from the registry |
| How should this pipeline stage behave? | skills/pipelines/<pipeline>/... |
| What is the checkpoint/review policy? | skills/meta/ |

What Not To Do

  • Do not bypass the pipeline. Never write ad-hoc scripts to call tools directly. All production goes through pipeline stages with director skills. See Rule Zero.
  • Do not call generation tools without reading their Layer 3 skill. Check the tool's agent_skills field, read the referenced skill, then craft your prompts using that guidance.
  • Do not skip stage director skills. Before executing any pipeline stage, read its director skill. The skill contains the quality bar, the workflow, and the review criteria.
  • Do not use deleted legacy names such as tts_cloud, tts_engine, or video_gen.
  • Do not hardcode provider names, API key names, or setup URLs. Read them from the registry's install_instructions and dependencies fields.
  • Do not begin asset generation before user approval on the production plan.
  • Do not hide degraded paths. Record substitutions and blocked options explicitly.
  • Do not present a single unavailable tool in isolation. Always show the full capability picture: "X of Y providers configured for this capability."
  • Do not skip the Provider Menu at preflight. The user must see what they have AND what they could unlock.
  • Do not change provider, model, or render path without telling the user first and getting approval when the change is material.
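
The "X of Y providers configured" rule in the list above can be sketched as a small formatter. The input shape (a mapping from provider name to a configured flag) and the placeholder provider names are assumptions made for illustration. In practice the names would come from the registry rather than being written out.

```python
from typing import Dict


def capability_summary(capability: str, providers: Dict[str, bool]) -> str:
    """Summarise how many providers are configured for one capability."""
    configured = sum(1 for is_configured in providers.values() if is_configured)
    return f"{configured} of {len(providers)} providers configured for {capability}"


# Placeholder provider names; real names should be read from the registry.
example = {"provider_a": True, "provider_b": False, "provider_c": False}
print(capability_summary("video_generation", example))
```

Reporting the ratio, instead of naming a single missing tool, shows the user both what already works and what they could still unlock.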