Tutorial for writing a new converter plugin from scratch. Prereqs:
read architecture.md first so the contract and
output layout are in your head.
Decide which file extension(s) you're handling and which upstream tool
does the heavy lifting. For PDFs, the realistic options as of 2026-05 are
documented in tool-survey.md. Settle on ONE
upstream tool per plugin: wrapping multiple converters behind a
single plugin makes versioning and dep-conflict debugging painful.
Naming: package = dikw-converter-<format>, module =
dikw_converter_<format>, engine name = the upstream tool's name
(e.g. marker).
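
The naming rule is mechanical enough to write down as code. A minimal sketch, assuming a hypothetical helper plugin_names (it is not part of dikw-core) and a format token such as pdf:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PluginNames:
    package: str  # distribution name, e.g. on PyPI
    module: str   # importable Python module
    engine: str   # engine name, taken from the upstream tool


def plugin_names(fmt: str, upstream_tool: str) -> PluginNames:
    """Derive the three names from a format token and an upstream tool name."""
    fmt = fmt.strip().lower().lstrip(".")
    if not fmt.isidentifier():
        raise ValueError(f"format must be a plain token like 'pdf', got {fmt!r}")
    return PluginNames(
        package=f"dikw-converter-{fmt}",
        module=f"dikw_converter_{fmt}",
        engine=upstream_tool,
    )


print(plugin_names("pdf", "marker"))
# PluginNames(package='dikw-converter-pdf', module='dikw_converter_pdf', engine='marker')
```

Note that the format drives the package and module names, while the engine name follows the upstream tool, so the two can differ (pdf vs marker).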

Start by copying the example package:

    cp -r packages/dikw-converter-example packages/dikw-converter-pdf

Then rename everything example → pdf (and Example → Pdf):

- packages/dikw-converter-pdf/pyproject.toml: name, the entry-points table, the engine name on the right side.
- packages/dikw-converter-pdf/src/dikw_converter_example/ → src/dikw_converter_pdf/.
- __init__.py: class name ExampleConverter → MarkerConverter, name = "example" → name = "marker", extensions = (".example",) → extensions = (".pdf",).

The Protocol is:

    from pathlib import Path

    class MarkerConverter:
        name = "marker"
        extensions = (".pdf",)

        def convert(self, input_path: Path, output_dir: Path) -> None:
            ...

convert() must:

- Call output_dir.mkdir(parents=True, exist_ok=True).
- Write <input_path.stem>.md into output_dir: the converted prose, with image-style asset references to anything else you write.
- Create output_dir / "assets" and write extracted images, the original input file (provenance), etc.
- Ensure every asset is image-referenced from the md (the image form ![alt](path) works for any path, even non-images).

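
The requirements above can be checked mechanically before you hand output to dikw-core. The following is a minimal sketch, not dikw-core's own md_inspect: check_output is a hypothetical helper, and its regex only recognises image references whose target contains no whitespace.

```python
import re
from pathlib import Path

# Markdown image syntax ![alt](target); anything after the target (a title) is ignored.
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)")


def check_output(output_dir: Path, stem: str) -> list[str]:
    """Return contract problems; an empty list means the layout looks fine."""
    md_path = output_dir / f"{stem}.md"
    if not md_path.is_file():
        return [f"missing markdown file: {md_path.name}"]

    text = md_path.read_text(encoding="utf-8")
    referenced = {m.group(1) for m in IMAGE_REF.finditer(text)}

    # Collect every file under assets/ as a path relative to output_dir.
    assets_dir = output_dir / "assets"
    written: set[str] = set()
    if assets_dir.is_dir():
        written = {
            p.relative_to(output_dir).as_posix()
            for p in assets_dir.rglob("*")
            if p.is_file()
        }

    problems = [f"orphan asset (never referenced): {p}" for p in sorted(written - referenced)]
    problems += [
        f"referenced but not written: {r}"
        for r in sorted(referenced - written)
        if r.startswith("assets/")
    ]
    return problems
```

Calling check_output(out, "tiny") at the end of a test gives a readable failure message instead of a rejection at import time.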
Sketch using marker-pdf as the upstream:

    from pathlib import Path

    class MarkerConverter:
        name = "marker"
        extensions = (".pdf",)

        def convert(self, input_path: Path, output_dir: Path) -> None:
            # Lazy imports: keep PyTorch / model loading out of module import.
            from marker.converters.pdf import PdfConverter
            from marker.models import create_model_dict

            output_dir.mkdir(parents=True, exist_ok=True)
            assets_dir = output_dir / "assets"
            assets_dir.mkdir(exist_ok=True)

            # Run marker.
            models = create_model_dict()
            converter = PdfConverter(artifact_dict=models)
            rendered = converter(str(input_path))
            markdown_text, _, images = rendered.markdown, rendered.metadata, rendered.images

            # Write extracted images to assets/ and rewrite refs.
            # _rewrite_image_paths is a helper you supply; it is not defined in this sketch.
            for img_name, img_pil in images.items():
                img_pil.save(assets_dir / img_name)
            rewritten = _rewrite_image_paths(markdown_text, prefix="assets/")

            # Copy original PDF as provenance.
            original_dest = assets_dir / input_path.name
            original_dest.write_bytes(input_path.read_bytes())

            # Write the md with an image-style ref to the original appended.
            # The alt text "original" is illustrative.
            body = rewritten + f"\n\n![original](assets/{input_path.name})\n"
            (output_dir / f"{input_path.stem}.md").write_text(body, encoding="utf-8")

Lazy-import upstream tools inside convert() so a dikw client status
or a markdown-only dikw client import never triggers PyTorch /
model-weight loading. dikw-core's discovery instantiates your class
once, but convert() is what does the heavy work.
Use the stub's test layout as a template. The minimum:

    from pathlib import Path

    from dikw_converter_pdf import MarkerConverter

    def test_protocol_attributes() -> None:
        c = MarkerConverter()
        assert c.name == "marker"
        assert c.extensions == (".pdf",)

    def test_convert_produces_md_and_keeps_original(tmp_path: Path) -> None:
        input_pdf = Path(__file__).parent / "fixtures" / "tiny.pdf"
        out = tmp_path / "tiny"
        MarkerConverter().convert(input_pdf, out)
        assert (out / "tiny.md").exists()
        assert (out / "assets" / "tiny.pdf").exists()
        md = (out / "tiny.md").read_text(encoding="utf-8")
        # The original PDF must be image-referenced from the md; the alt text is not checked.
        assert "](assets/tiny.pdf)" in md

Add a tiny fixture PDF to tests/fixtures/. CI runs uv run pytest;
keep fixtures small enough to live in git (< 100 KB ideally).
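
If you want the size guideline enforced rather than remembered, a small guard can live next to the tests. This is a sketch only: oversized_fixtures and MAX_FIXTURE_BYTES are hypothetical names, and the limit simply encodes the 100 KB suggestion above.

```python
from pathlib import Path

MAX_FIXTURE_BYTES = 100 * 1024  # the "< 100 KB ideally" guideline


def oversized_fixtures(fixtures_dir: Path, limit: int = MAX_FIXTURE_BYTES) -> list[str]:
    """Return fixture files whose size is at or above the limit, sorted by path."""
    return sorted(
        p.relative_to(fixtures_dir).as_posix()
        for p in fixtures_dir.rglob("*")
        if p.is_file() and p.stat().st_size >= limit
    )
```

In a test, call it with Path(__file__).parent / "fixtures" and assert that the result is an empty list.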

    # Install dikw-core (editable from a sibling checkout, or from PyPI):
    pip install -e ../dikw-core
    # Install your plugin in editable mode:
    pip install -e packages/dikw-converter-pdf
    # Start a server in one terminal:
    dikw serve
    # In another:
    dikw client import paper.pdf

Verify the import lands under <base>/sources/paper/:

    <base>/sources/paper/
    ├── paper.md
    └── assets/
        ├── paper.pdf      # the original
        ├── figure-1.png
        └── …

If md_inspect rejects with asset_missing or orphan asset, check
that the md references every file you wrote (image syntax, not
regular markdown link).
Run conversion twice on the same input:

    mkdir /tmp/a /tmp/b
    python -c "from dikw_converter_pdf import MarkerConverter; from pathlib import Path; MarkerConverter().convert(Path('paper.pdf'), Path('/tmp/a'))"
    python -c "from dikw_converter_pdf import MarkerConverter; from pathlib import Path; MarkerConverter().convert(Path('paper.pdf'), Path('/tmp/b'))"
    diff -r /tmp/a /tmp/b

Empty diff is ideal. If diffs show up only in image binaries (PIL metadata timestamps, say), normalise the save call. If they show up in the markdown body, the upstream tool has non-determinism: pin its seed or document the cost.
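
To see at a glance whether differences land in the markdown or in the assets, the two runs can also be compared from Python. A sketch under stated assumptions: classify_diff and its report keys are hypothetical names, and files are compared by the SHA-256 of their bytes.

```python
import hashlib
from pathlib import Path


def _digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def classify_diff(run_a: Path, run_b: Path) -> dict[str, list[str]]:
    """Split differing files into markdown, assets, and files present in only one run."""
    a, b = _digests(run_a), _digests(run_b)
    report: dict[str, list[str]] = {"markdown": [], "assets": [], "only_in_one": []}
    for rel in sorted(a.keys() | b.keys()):
        if rel not in a or rel not in b:
            report["only_in_one"].append(rel)
        elif a[rel] != b[rel]:
            report["markdown" if rel.endswith(".md") else "assets"].append(rel)
    return report
```

Entries under "assets" point at the save call, entries under "markdown" point at the upstream tool, and entries under "only_in_one" usually mean unstable file naming.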
Each package is independently versioned and released. The full release
mechanics, PyPI Pending Publisher setup, and rollback procedure live in
docs/release-process.md; the author-side
checklist is:
- Bump the version in packages/dikw-converter-<format>/pyproject.toml per SemVer.
- Write a CHANGELOG entry as the new top section of packages/dikw-converter-<format>/CHANGELOG.md, following the Keep a Changelog format already in place. The release pipeline reads this block as the GitHub Release body and fails if it's missing.
- Run the local gate:

      uv run python scripts/check-package.py dikw-converter-<format>

  This rebuilds the wheel + sdist, runs tests/packaging/ scoped to your package (artifact validation, entry-point discoverability, twine check --strict), and prints the would-be release notes.
- Tag and push:

      git tag dikw-converter-<format>-vX.Y.Z
      git push origin dikw-converter-<format>-vX.Y.Z

  .github/workflows/release.yml matches the tag, re-runs the gate, builds, validates with twine check, and publishes via the PyPI trusted publisher (OIDC) plus a GitHub Release with the artifacts attached.
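
Because a missing CHANGELOG block fails the release, it can help to preview what a reader of the top section would see. The following is a sketch of one way to extract it, not the pipeline's actual parser: top_changelog_section is a hypothetical helper, and it assumes Keep a Changelog style level-2 headings such as "## [0.2.0] - 2026-05-01". It does not skip an "Unreleased" section.

```python
import re

# A release heading is a level-2 heading; the version may be wrapped in brackets.
_RELEASE_HEADING = re.compile(r"^## \[?(?P<version>[^\]\s]+)\]?.*$", re.MULTILINE)


def top_changelog_section(changelog: str) -> tuple[str, str]:
    """Return (version, body) for the first release section in the changelog."""
    headings = list(_RELEASE_HEADING.finditer(changelog))
    if not headings:
        raise ValueError("no release section found in CHANGELOG")
    first = headings[0]
    # The section runs until the next release heading, or to the end of the file.
    end = headings[1].start() if len(headings) > 1 else len(changelog)
    body = changelog[first.end():end].strip()
    if not body:
        raise ValueError(f"release section for {first.group('version')} is empty")
    return first.group("version"), body
```

Comparing the returned version with the one in pyproject.toml before tagging catches the common mistake of bumping one and forgetting the other.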
First-ever release of a brand-new package name requires a one-time
PyPI Pending Publisher setup so PyPI knows to trust the workflow.
See docs/release-process.md for the form to fill in.
- Heavy ML imports at module top. Pushes the dep cost to dikw client startup. Keep them inside convert().
- Hardcoded model paths. Use the upstream tool's cache_dir conventions or environment variables; users will hit you with bug reports otherwise.
- Mutating input_path. Don't write back into the user's input tree. Read-only.
- Symlinks / hard links into the user's tree. dikw-core's importer rejects symlinks at pre-flight; symlinks under assets/ would silently break.
- Multiple Converter classes in one entry-point. One entry-point = one Converter. Ship multiple entry-points if you have multiple engines in the same package.
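
The link pitfall is easy to guard against in a test. A minimal sketch, assuming a POSIX-like filesystem where st_nlink is meaningful; linked_assets is a hypothetical helper, not something dikw-core provides:

```python
import os
from pathlib import Path


def linked_assets(assets_dir: Path) -> list[str]:
    """Return assets that are symlinks or that share an inode with another path."""
    flagged = []
    for p in sorted(assets_dir.rglob("*")):
        rel = p.relative_to(assets_dir).as_posix()
        if p.is_symlink():
            flagged.append(f"symlink: {rel}")
        elif p.is_file() and os.stat(p).st_nlink > 1:
            flagged.append(f"hard link: {rel}")
    return flagged
```

Copying bytes, as the sketch's write_bytes(input_path.read_bytes()) does, always produces an independent file, so an empty result from linked_assets is the expected outcome.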