packages/torque-eval: dataset scoring + pairwise comparison utilities built on top of Torque exports.
scripts/generate-stackblitz-templates.ts and stackblitz-templates/*: browser playgrounds that must stay in sync with published APIs.
Shared data lives in data/ and examples in examples/; keep them lightweight and anonymized.
Ways of Working
Clarify intent – Restate requirements, note assumptions, and point to the relevant files or docs before modifying code.
Design in the open – For behavioral shifts or new public APIs, outline the approach (data flow, error handling, backward compatibility) before implementing.
Use the mocks re-exported from ai/test to stand up deterministic providers; they expose call logs (mock.doGenerateCalls) so you can assert prompts without patching global state.
Example for a language-model dependency:
import { MockLanguageModelV2 } from "ai/test";
import { scoreDataset } from "@qforge/torque-eval";

const mockJudge = new MockLanguageModelV2({
  doGenerate: async () => ({
    content: [
      {
        type: "text",
        text: JSON.stringify({
          quality: 10,
          coherence: 9,
          adherence: 9,
          notes: "passes rubric",
        }),
      },
    ],
    finishReason: "stop",
    usage: { inputTokens: 0, outputTokens: 0, totalTokens: 0 },
    warnings: [],
  }),
});

await scoreDataset({
  dataset,
  sampleSize: 1,
  judgeModel: mockJudge, // behaves like a LanguageModel from ai-sdk
});
Pass plain helper functions (e.g., async (prompt) => JSON.stringify({...})) only in ultra-light tests; MockLanguageModelV2 (and its streaming sibling) gives better parity with production code.
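For the ultra-light case, a function judge that still keeps a call log might look like the sketch below. It assumes the judge can be a plain async (prompt) => string function, as the note above suggests. makeRecordingJudge, JudgeFn, and the calls array are hypothetical names used for illustration; they are not part of @qforge/torque-eval or ai/test.

```typescript
// Hypothetical helper for illustration only; not an API of torque-eval or ai/test.
type JudgeFn = (prompt: string) => Promise<string>;

interface RecordingJudge {
  judge: JudgeFn;
  calls: string[]; // every prompt the judge received, in call order
}

function makeRecordingJudge(scores: Record<string, unknown>): RecordingJudge {
  const calls: string[] = [];
  const judge: JudgeFn = async (prompt) => {
    // Record the prompt locally instead of patching global state.
    calls.push(prompt);
    // Always return the same canned scores so the test stays deterministic.
    return JSON.stringify(scores);
  };
  return { judge, calls };
}

// Usage sketch: build the judge, hand `judge` to the code under test,
// then assert on `calls` afterwards.
const { judge, calls } = makeRecordingJudge({
  quality: 10,
  coherence: 9,
  adherence: 9,
  notes: "passes rubric",
});
```

This covers only prompt assertions. Anything that depends on finish reasons, token usage, or streaming is better served by the ai/test mocks.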
Quality Bar
Every change must include automated coverage or a justification for gaps.
Keep prompts, instructions, and fixtures free of secrets or user data.
Large data files (>1MB) should live in data/ and be gitignored if generated.
Public API changes require changelog/README updates, plus migration notes if they are breaking.
When to Escalate to a Human
You need new third-party services, paid APIs, or environment variables.
A change could break template backwards compatibility or published npm contracts.
A deterministic behavior (RNG, sampling, caching) must change.
Security or privacy concerns arise, or you are unsure how to anonymize sample data.
Deliverables Checklist
Code builds and tests pass via Bun
Docs/templates reflect the change
Repro steps and verification commands included in PR/summary