Metadata Profiling (`profile`)

Tier: Advanced · Command covered: `profile`

Per-command flag reference lives in /docs/help/profile.md. This page is the workflow layer — what profile produces, which metadata vocabulary to target, and how the pieces compose.

qsv profile turns a CSV (local path or URL) into a .metadata.json document that re-expresses the dataset in a standard metadata vocabulary — DCAT-US v3, DCAT-AP v3, Croissant 1.1, or Geoconnex — plus a CKAN-shaped block that DataPusher+ consumes. It is qsv's FAIRification command: point it at data and get publish-ready, harvestable metadata.

Under the hood it runs the same statistical + frequency analysis DataPusher+ (DP+) runs in CKAN, builds a Jinja2 evaluation context from the results, and — when a CKAN scheming YAML spec is supplied — evaluates the spec's formula / suggestion_formula templates against that context. The Jinja2 helpers/filters are a native Rust port of DP+'s jinja2_helpers.py, built on minijinja.

The five output blocks

profile emits a single <input>.metadata.json carrying up to five top-level blocks:

| Block | Source | Purpose |
| --- | --- | --- |
| `dpp` | inferred signals | lat/lon/date columns, file size, row count, encoding (the legacy DataPusher+ inference block) |
| `stats` | `qsv stats` | per-column summary statistics |
| `frequency` | `qsv frequency` | per-column value counts |
| `ckan` | derived | CKAN-shaped package + resources block DP+ uses to prepopulate CKAN |
| `projection` | active profile | the dataset re-expressed in the chosen metadata vocabulary |

Skip blocks you don't need: --no-projection drops the projection block (keeping dpp/stats/frequency/ckan); --no-ckan drops the CKAN block.
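A consumer that needs to know which blocks a given run produced can inspect the top-level keys of the output. The following is a minimal sketch, assuming the block names in the table appear verbatim as top-level JSON keys; `summarize_blocks` is a hypothetical helper, not part of qsv, and the sample document is a placeholder rather than real `profile` output.

```python
import json

# Block names taken from the table above; assumed to be top-level JSON keys.
EXPECTED_BLOCKS = ("dpp", "stats", "frequency", "ckan", "projection")


def summarize_blocks(doc: dict) -> dict:
    """Map each expected block name to True/False depending on presence."""
    return {name: name in doc for name in EXPECTED_BLOCKS}


# Placeholder standing in for the parsed contents of data.csv.metadata.json
# after a run with --no-projection; real blocks carry actual content.
sample = json.loads('{"dpp": {}, "stats": {}, "frequency": {}, "ckan": {}}')

for name, present in summarize_blocks(sample).items():
    print(f"{name:<11}{'present' if present else 'absent'}")
```

In real use, `sample` would come from `json.load` on the emitted `.metadata.json` file.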

Stats are computed in-process. profile does not shell out to qsv schema — it builds its own stats cache internally (mode = ProfileSchema: schema stats + quartiles + mode) so the descriptive-statistics projection always surfaces the full extended-stat set on a fresh run. It does shell out to qsv frequency, qsv count, and (when a spec declares validators) qsv validate.

Choosing a profile

--profile <name|path> selects the projection vocabulary. Bundled names:

| `--profile` | Vocabulary | Consumed by | Validator (`--validate`) |
| --- | --- | --- | --- |
| `dcat-us-v3` (default) | DCAT-US v3 JSON-LD | data.gov harvesters | vendored GSA JSON Schema |
| `dcat-ap-v3` | DCAT-AP v3 | EU data portals | `pyshacl` over bundled SHACL shapes |
| `croissant` | Croissant 1.1 JSON-LD | mlcommons / Hugging Face / Kaggle | `mlcroissant` |
| `geoconnex` | Geoconnex JSON-LD | Internet of Water tooling | `pyshacl` over bundled SHACL shapes |

geoconnex is gated behind the geoconnex cargo feature — on by default in qsv (via distrib_features), opt-in for qsvdp (-F datapusher_plus,geoconnex).

--profile also accepts a path to a custom YAML profile. Embedded names always win over same-named files, so give custom profiles a non-clashing name. See resources/profiles/README.md for the schema and authoring guide.
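The precedence rule is easy to trip over, so here is an illustrative sketch of it. This is not qsv's actual resolution code: `resolve_profile` is a hypothetical function, and it assumes the argument is compared against the bundled names exactly as typed.

```python
# Bundled profile names listed in the table above.
BUNDLED = {"dcat-us-v3", "dcat-ap-v3", "croissant", "geoconnex"}


def resolve_profile(arg: str) -> tuple:
    """Return ("embedded", name) or ("file", path) for a --profile argument.

    Illustrative only: mirrors the documented rule that an embedded name
    beats a same-named file on disk.
    """
    if arg in BUNDLED:
        # Even if a file called e.g. "croissant" exists in the working
        # directory, the embedded profile is the one that gets used.
        return ("embedded", arg)
    return ("file", arg)


print(resolve_profile("croissant"))
print(resolve_profile("./my-org-dcat.yaml"))
```

Under this reading, a custom profile saved as `croissant` would never be loaded, which is why a distinct file name such as `my-org-dcat.yaml` is the safer choice.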

Validation

--validate checks the emitted projection against the active profile's declared validators (see table above). By default violations are appended to projection_warnings. Add --strict to fail the command on JSON Schema violations or non-Info external-validator findings (Required/Recommended severities) instead of just warning.

RFC 4180 structural failures from qsv validate (emitted when a spec declares validators) are always appended as warnings, regardless of --strict.
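The interaction between `--strict`, severities, and finding sources can be summarized as a small decision function. This is a sketch of the rules as stated above, not qsv's implementation; the `Finding` type and its `source` labels (`json_schema`, `external`, `rfc4180`) are invented here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    source: str    # "json_schema", "external", or "rfc4180" (illustrative labels)
    severity: str  # "Info", "Recommended", or "Required"


def should_fail(findings: list, strict: bool) -> bool:
    """Decide whether a run aborts, following the rules described above."""
    if not strict:
        return False  # without --strict, every finding is only a warning
    for finding in findings:
        if finding.source == "json_schema":
            return True  # any JSON Schema violation fails a strict run
        if finding.source == "external" and finding.severity != "Info":
            return True  # Required / Recommended external findings fail
        # "rfc4180" findings fall through: they always stay warnings
    return False


findings = [Finding("rfc4180", "Required"), Finding("external", "Info")]
print(should_fail(findings, strict=True))
```

With only an RFC 4180 finding and an Info-level external finding, the strict run in the example still succeeds.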

External validators (pyshacl, mlcroissant) are Python tools you must have installed for --validate to run them. For bundled profiles they always run, because the profile content is vetted at qsv release time. For a profile loaded from an arbitrary YAML file, the external validator declared by validation.external is not spawned unless you pass --allow-external-validator — otherwise the run emits a Recommended-severity warning instead, so an untrusted YAML can't silently execute arbitrary commands.
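The gating rule for external validators, plus a preflight check that the tools are installed, can be sketched as follows. Both functions are hypothetical helpers written for this page. The sketch assumes the validators are reachable as `pyshacl` and `mlcroissant` executables on `PATH`; how qsv itself locates them is not covered here.

```python
import shutil


def external_validator_action(bundled: bool, allow_external: bool) -> str:
    """Return "run" or "warn" for a profile that declares validation.external."""
    if bundled or allow_external:
        return "run"   # bundled profiles are vetted; custom ones need the opt-in flag
    return "warn"      # Recommended-severity warning; nothing is spawned


def missing_tools(names: list) -> list:
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# Custom YAML profile without --allow-external-validator.
print(external_validator_action(bundled=False, allow_external=False))

# Preflight before a --validate run; the list may be empty on your machine.
print(missing_tools(["pyshacl", "mlcroissant"]))
```

Running the preflight before a `--validate` run in CI makes a missing Python tool show up as an explicit message rather than as an unexpected validation result.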

URL inputs & DCAT discovery

When `<input>` is a URL whose response carries DCAT markup (HTTP `Link: rel=describedBy`), `profile` discovers the publisher's stated metadata and merges it as a base layer beneath the inferred projection. Disable this with `--no-dcat-discovery`; tune the per-probe timeout with `--dcat-discovery-timeout <secs>` (default 5). Stdin and URL inputs are materialized to a tempfile so the rest of the pipeline sees a normal file path. For piped input, the `input` field of the output reads `stdin`.
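To check ahead of time whether a publisher's response advertises metadata this way, the `Link` header can be scanned for `describedBy` relations. The sketch below is a simplified parser written for illustration; it is not how qsv performs discovery, and the example header and its target URL are made up.

```python
import re

# Matches one `<target>; param=value; ...` entry of an HTTP Link header.
# Simplification: commas or semicolons inside quoted parameter values
# are not handled.
_LINK_RE = re.compile(r"<([^>]*)>((?:\s*;\s*[^;,]+)*)")


def describedby_targets(link_header: str) -> list:
    """Return the targets of all rel="describedby" links in a Link header."""
    targets = []
    for url, params in _LINK_RE.findall(link_header):
        for param in params.split(";"):
            key, _, value = param.strip().partition("=")
            # A rel value may hold several space-separated relation types.
            rels = value.strip('"').lower().split()
            if key.strip().lower() == "rel" and "describedby" in rels:
                targets.append(url)
    return targets


# Hypothetical header value, for illustration only.
header = (
    '<https://data.example.gov/datasets/sample.jsonld>; rel="describedBy"; '
    'type="application/ld+json", <https://data.example.gov/>; rel="up"'
)
print(describedby_targets(header))
```

Relation types are compared case-insensitively, so `describedBy` and `describedby` are treated the same.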

Seeding & overriding values

--initial-context <json> provides seed values for the package / resource dicts plus optional JSON-Pointer overrides for the final projection. Top-level keys: package, resource, dataset_info. Wrap any leaf as {"value": ..., "force": true} to mark it as overriding both URL-discovered DCAT markup and qsv's own inference:

- `dataset_info` entries override their target path verbatim.
- `package` / `resource` entries route through the active profile's `field_mappings:` table (e.g. `package.title` with `force=true` lands at `/projection/dct:title`, beating inference and discovery).
- Forced values for slots the profile doesn't surface are silently dropped (no-op).

See tests/resources/profile/dcat-init-context.README.md for a fully-populated example. (This flag replaces the older --package-meta / --resource-meta flags.)
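An initial-context document can be assembled programmatically. The sketch below is a minimal example under stated assumptions: only `package.title` is named on this page, so the other field names (`notes`, `name`) are common CKAN fields chosen for illustration and may not be surfaced by every profile. `dataset_info` is left out because the shape of its entries is not spelled out here.

```python
import json


def forced(value):
    """Wrap a leaf so it overrides both discovered DCAT markup and inference."""
    return {"value": value, "force": True}


context = {
    "package": {
        # Forced: beats URL-discovered markup and qsv's own inference.
        "title": forced("City Tree Inventory"),
        # Plain seed value (field name is an assumption).
        "notes": "Street trees surveyed by the parks department.",
    },
    "resource": {
        "name": "tree-inventory.csv",  # field name is an assumption
    },
}

payload = json.dumps(context, indent=2)
print(payload)
```

Saving `payload` to a file such as `publisher.json` gives a document in the shape that `--initial-context` describes.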

CKAN scheming specs (formulas)

--spec <yaml> supplies a CKAN scheming YAML spec. profile then evaluates the spec's formula / suggestion_formula Jinja2 templates against the analysis context to compute derived fields — spatial/temporal extents, accrual periodicity, etc. Without a spec, only the inferred dpp block is emitted and no formulas are evaluated. For an example spec, see DP+'s dataset-druf.yaml.

Examples

```bash
# Quick: dpp/stats/frequency + default DCAT-US v3 projection.
qsv profile data.csv                       # → data.csv.metadata.json

# Pipe stdin; output defaults to stdin.metadata.json.
cat data.csv | qsv profile

# URL input: discover the publisher's DCAT markup and merge it as a base layer.
qsv profile https://data.example.gov/datasets/sample.csv

# Seed publisher/contact info; write to a chosen output path.
qsv profile data.csv --initial-context publisher.json -o data.metadata.json

# data.gov-style harvest: validate against DCAT-US v3 JSON Schema,
# abort on violations, wrap in a Catalog envelope.
qsv profile data.csv --validate --strict --catalog -o data.metadata.json

# DCAT-AP v3 for EU portals (pyshacl validates the bundled SHACL shapes).
qsv profile open-data.csv --profile dcat-ap-v3 --validate --strict

# Croissant JSON-LD for an ML dataset (mlcroissant validates the output).
qsv profile train.csv --profile croissant --validate -o train.croissant.json

# Embed per-column value-frequency RecordSets in the Croissant projection.
qsv profile train.csv --profile croissant --croissant-frequency

# Geoconnex JSON-LD for hydrologic data (needs the `geoconnex` feature).
qsv profile gages.csv --profile geoconnex --validate --strict

# Evaluate a CKAN scheming spec: Jinja2 formulas compute derived fields.
qsv profile data.csv --spec dataset-druf.yaml -o data.metadata.json

# CKAN-only output: drop the projection block, keep dpp/stats/frequency/ckan.
qsv profile data.csv --no-projection --spec dataset-druf.yaml

# Custom YAML profile from disk (use a non-clashing name).
qsv profile data.csv --profile ./my-org-dcat.yaml --validate
```

Croissant descriptive statistics

The croissant profile renders per-column descriptive statistics as Croissant annotations (Median, FirstQuartile, ThirdQuartile, Mode, ArithmeticMean, StandardDeviation, Variance, Minimum, Maximum, Range, Sum, …). These come from the in-process extended stats, so they appear on a fresh run without needing a pre-built --everything stats cache. Add --croissant-frequency to also embed per-column value-frequency distributions as inline cr:RecordSets (one <col>-frequency RecordSet of {value, count, percentage} rows per column); the raw counts always remain in the top-level frequency block regardless.
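To make the `{value, count, percentage}` row shape concrete, the sketch below derives such rows from a column of raw values. It illustrates the idea only. The ordering, the rounding, and the 0-100 percentage scale are assumptions, and the actual JSON-LD emitted by `--croissant-frequency` wraps the rows in Croissant RecordSet structure that is not reproduced here.

```python
from collections import Counter


def frequency_rows(values: list) -> list:
    """Build {value, count, percentage} rows, most frequent value first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [
        {"value": value, "count": count, "percentage": round(100 * count / total, 2)}
        for value, count in counts.most_common()
    ]


def frequency_recordset_name(column: str) -> str:
    """Name pattern from the text above: <col>-frequency."""
    return f"{column}-frequency"


species = ["oak", "oak", "elm", "oak"]
print(frequency_recordset_name("species"))
for row in frequency_rows(species):
    print(row)
```

For the four sample values, `oak` accounts for 3 of 4 rows (75.0) and `elm` for 1 of 4 (25.0).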

Binary variants

profile is feature-gated (profile cargo feature). It is present in qsv (full) and qsvdp (DataPusher+ optimized). It is not in qsvlite. Note qsvdp enables profile but not schema — profile doesn't need schema, since it computes stats in-process.
