Joel Natividad edited this page May 29, 2026 · 3 revisions

Metadata Profiling (profile)

Tier: Advanced · Command covered: profile

Per-command flag reference lives in /docs/help/profile.md. This page is the workflow layer — what profile produces, which metadata vocabulary to target, and how the pieces compose.

qsv profile turns a CSV (local path or URL) into a .metadata.json document that re-expresses the dataset in a standard metadata vocabulary — DCAT-US v3, DCAT-AP v3, Croissant 1.1, or Geoconnex — plus a CKAN-shaped block that DataPusher+ consumes. It is qsv's FAIRification command: point it at data and get publish-ready, harvestable metadata.

Under the hood it runs the same statistical + frequency analysis DataPusher+ (DP+) runs in CKAN, builds a Jinja2 evaluation context from the results, and — when a CKAN scheming YAML spec is supplied — evaluates the spec's formula / suggestion_formula templates against that context. The Jinja2 helpers/filters are a native Rust port of DP+'s jinja2_helpers.py, built on minijinja.

The five output blocks

profile emits a single <input>.metadata.json carrying up to five top-level blocks:

| Block | Source | Purpose |
| --- | --- | --- |
| dpp | inferred signals | lat/lon/date columns, file size, row count, encoding (the legacy DataPusher+ inference block) |
| stats | qsv stats | per-column summary statistics |
| frequency | qsv frequency | per-column value counts |
| ckan | derived | CKAN-shaped package + resources block DP+ uses to prepopulate CKAN |
| projection | active profile | the dataset re-expressed in the chosen metadata vocabulary |

Skip blocks you don't need: --no-projection drops the projection block (keeping dpp/stats/frequency/ckan); --no-ckan drops the CKAN block.
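
As a rough illustration of consuming the output, the sketch below reports which of the five blocks a metadata.json contains. It assumes the top-level JSON keys match the block names listed above (dpp, stats, frequency, ckan, projection); the helper names present_blocks and load_metadata are hypothetical and not part of qsv.

```python
import json

# Block names as listed on this page; assumed to be the top-level JSON keys.
EXPECTED_BLOCKS = ("dpp", "stats", "frequency", "ckan", "projection")


def present_blocks(metadata: dict) -> dict:
    """Map each expected block name to True/False depending on presence."""
    return {name: name in metadata for name in EXPECTED_BLOCKS}


def load_metadata(path: str) -> dict:
    """Read a <input>.metadata.json file produced by qsv profile."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    # Stand-in document shaped like the output of a --no-projection run.
    sample = {"dpp": {}, "stats": {}, "frequency": {}, "ckan": {}}
    for name, found in present_blocks(sample).items():
        print(f"{name}: {'present' if found else 'absent'}")
```

For a real run, pass load_metadata("data.csv.metadata.json") to present_blocks in place of the stand-in dict.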

Stats are computed in-process. profile does not shell out to qsv schema — it builds its own stats cache internally (mode = ProfileSchema: schema stats + quartiles + mode) so the descriptive-statistics projection always surfaces the full extended-stat set on a fresh run. It does shell out to qsv frequency, qsv count, and (when a spec declares validators) qsv validate.

Choosing a profile

--profile <name|path> selects the projection vocabulary. Bundled names:

| --profile | Vocabulary | Consumed by | Validator (--validate) |
| --- | --- | --- | --- |
| dcat-us-v3 (default) | DCAT-US v3 JSON-LD | data.gov harvesters | vendored GSA JSON Schema |
| dcat-ap-v3 | DCAT-AP v3 | EU data portals | pyshacl over bundled SHACL shapes |
| croissant | Croissant 1.1 JSON-LD | mlcommons / Hugging Face / Kaggle | mlcroissant |
| geoconnex | Geoconnex JSON-LD | Internet of Water tooling | pyshacl over bundled SHACL shapes |

geoconnex is gated behind the geoconnex cargo feature — on by default in qsv (via distrib_features), opt-in for qsvdp (-F datapusher_plus,geoconnex).

--profile also accepts a path to a custom YAML profile. Embedded names always win over same-named files, so give custom profiles a non-clashing name. See resources/profiles/README.md for the schema and authoring guide.

Validation

--validate checks the emitted projection against the active profile's declared validators (see table above). By default violations are appended to projection_warnings. Add --strict to fail the command on JSON Schema violations or non-Info external-validator findings (Required/Recommended severities) instead of just warning.

RFC 4180 structural failures from qsv validate (emitted when a spec declares validators) are always appended as warnings, regardless of --strict.
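
The interaction between --strict and finding severity can be summarized as a small decision function. This is a sketch of the rules as described above, not qsv's implementation; the Finding type, the source labels (json_schema, external, rfc4180), and command_fails are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    source: str         # hypothetical labels: "json_schema", "external", "rfc4180"
    severity: str = ""  # "Required", "Recommended", or "Info" for external findings


def command_fails(findings: list, strict: bool) -> bool:
    """Return True if a --validate run should fail under the rules on this page."""
    if not strict:
        return False  # without --strict, every finding is only a warning
    for finding in findings:
        if finding.source == "rfc4180":
            continue  # structural failures stay warnings regardless of --strict
        if finding.source == "json_schema":
            return True  # any JSON Schema violation fails a strict run
        if finding.source == "external" and finding.severity != "Info":
            return True  # Required/Recommended external findings fail
    return False
```

Under these assumptions, a strict run containing only Info-level external findings and RFC 4180 warnings still succeeds.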

External validators (pyshacl, mlcroissant) are Python tools you must have installed for --validate to run them. For bundled profiles they always run, because the profile content is vetted at qsv release time. For a profile loaded from an arbitrary YAML file, the external validator declared by validation.external is not spawned unless you pass --allow-external-validator — otherwise the run emits a Recommended-severity warning instead, so an untrusted YAML can't silently execute arbitrary commands.

URL inputs & DCAT discovery

When <input> is a URL whose response carries DCAT markup (HTTP Link: rel=describedBy), profile discovers the publisher's stated metadata and merges it as a base layer beneath the inferred projection. Disable this with --no-dcat-discovery; tune the per-probe timeout with --dcat-discovery-timeout <secs> (default 5). Stdin and URL inputs are materialized to a tempfile so the rest of the pipeline sees a normal file path; for piped input, the output's input field contains the literal stdin.
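
To check by hand whether a server advertises such markup, one can inspect the Link response header for a describedBy relation. The sketch below is a simplified parser for that header, not qsv's discovery logic (which is not shown on this page); the function name describedby_targets and the example URLs are hypothetical, and the regex does not handle commas inside quoted parameter values.

```python
import re

# Matches one link-value: <target> followed by its ;-separated parameters.
LINK_VALUE = re.compile(r"<([^>]*)>([^,]*)")


def describedby_targets(link_header: str) -> list:
    """Return the targets of all link-values whose rel includes describedby."""
    targets = []
    for match in LINK_VALUE.finditer(link_header):
        target, params = match.group(1), match.group(2)
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.lower() != "rel":
                continue
            # rel may hold several space-separated relation types,
            # compared case-insensitively.
            rels = value.strip().strip('"').lower().split()
            if "describedby" in rels:
                targets.append(target)
    return targets


if __name__ == "__main__":
    header = (
        '<https://data.example.gov/sample.jsonld>; rel="describedBy"; '
        'type="application/ld+json", <https://data.example.gov/next>; rel="next"'
    )
    print(describedby_targets(header))
```

An empty result suggests there is nothing for discovery to merge from this header.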

Seeding & overriding values

--initial-context <json> provides seed values for the package / resource dicts plus optional JSON-Pointer overrides for the final projection. Top-level keys: package, resource, dataset_info. Wrap any leaf as {"value": ..., "force": true} to mark it as overriding both URL-discovered DCAT markup and qsv's own inference:

  • dataset_info entries override their target path verbatim.
  • package / resource entries route through the active profile's field_mappings: table (e.g. package.title force=true lands at /projection/dct:title, beating inference and discovery).
  • Forced values for slots the profile doesn't surface are silently dropped (no-op).

See tests/resources/profile/dcat-init-context.README.md for a fully-populated example. (This flag replaces the older --package-meta / --resource-meta flags.)
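
A seed file for --initial-context can also be generated programmatically. The sketch below builds one using the top-level keys and the {"value": ..., "force": true} wrapper described above. The forced and build_initial_context helpers, the notes field, the sample values, and the publisher.json filename are illustrative assumptions; the README referenced above remains the authoritative guide to the expected shape.

```python
import json


def forced(value):
    """Wrap a leaf so it overrides discovered DCAT markup and qsv's inference."""
    return {"value": value, "force": True}


def build_initial_context(title: str, notes: str) -> dict:
    """Assemble a seed document with the three top-level keys named above."""
    return {
        "package": {
            "title": forced(title),  # forced: should beat inference and discovery
            "notes": notes,          # plain seed value (assumed field name)
        },
        "resource": {},      # left empty in this sketch
        "dataset_info": {},  # JSON-Pointer overrides would go here
    }


if __name__ == "__main__":
    context = build_initial_context("Sample Dataset", "Illustrative description.")
    with open("publisher.json", "w", encoding="utf-8") as fh:
        json.dump(context, fh, indent=2)
```

Per the rules above, a forced value only takes effect when the active profile surfaces a slot for it.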

CKAN scheming specs (formulas)

--spec <yaml> supplies a CKAN scheming YAML spec. profile then evaluates the spec's formula / suggestion_formula Jinja2 templates against the analysis context to compute derived fields — spatial/temporal extents, accrual periodicity, etc. Without a spec, only the inferred dpp block is emitted and no formulas are evaluated. For an example spec, see DP+'s dataset-druf.yaml.

Examples

# Quick: dpp/stats/frequency + default DCAT-US v3 projection.
qsv profile data.csv                       # → data.csv.metadata.json

# Pipe stdin; output defaults to stdin.metadata.json.
cat data.csv | qsv profile

# URL input: discover the publisher's DCAT markup and merge it as a base layer.
qsv profile https://data.example.gov/datasets/sample.csv

# Seed publisher/contact info; write to a chosen output path.
qsv profile data.csv --initial-context publisher.json -o data.metadata.json

# data.gov-style harvest: validate against DCAT-US v3 JSON Schema,
# abort on violations, wrap in a Catalog envelope.
qsv profile data.csv --validate --strict --catalog -o data.metadata.json

# DCAT-AP v3 for EU portals (pyshacl validates the bundled SHACL shapes).
qsv profile open-data.csv --profile dcat-ap-v3 --validate --strict

# Croissant JSON-LD for an ML dataset (mlcroissant validates the output).
qsv profile train.csv --profile croissant --validate -o train.croissant.json

# Embed per-column value-frequency RecordSets in the Croissant projection.
qsv profile train.csv --profile croissant --croissant-frequency

# Geoconnex JSON-LD for hydrologic data (needs the `geoconnex` feature).
qsv profile gages.csv --profile geoconnex --validate --strict

# Evaluate a CKAN scheming spec: Jinja2 formulas compute derived fields.
qsv profile data.csv --spec dataset-druf.yaml -o data.metadata.json

# CKAN-only output: drop the projection block, keep dpp/stats/frequency/ckan.
qsv profile data.csv --no-projection --spec dataset-druf.yaml

# Custom YAML profile from disk (use a non-clashing name).
qsv profile data.csv --profile ./my-org-dcat.yaml --validate

Croissant descriptive statistics

The croissant profile renders per-column descriptive statistics as Croissant annotations (Median, FirstQuartile, ThirdQuartile, Mode, ArithmeticMean, StandardDeviation, Variance, Minimum, Maximum, Range, Sum, …). These come from the in-process extended stats, so they appear on a fresh run without needing a pre-built --everything stats cache. Add --croissant-frequency to also embed per-column value-frequency distributions as inline cr:RecordSets (one <col>-frequency RecordSet of {value, count, percentage} rows per column); the raw counts always remain in the top-level frequency block regardless.
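
To make the {value, count, percentage} row shape concrete, the sketch below computes such rows for a single column in plain Python. It illustrates the shape only: how qsv orders rows, rounds percentages, or caps the number of distinct values is not stated here, and frequency_rows is a hypothetical helper that takes each percentage as a share of the column's total values.

```python
from collections import Counter


def frequency_rows(values) -> list:
    """Build {value, count, percentage} rows for one column's values."""
    counts = Counter(values)
    total = sum(counts.values())
    return [
        {"value": value, "count": count, "percentage": 100.0 * count / total}
        for value, count in counts.most_common()  # most frequent first
    ]


if __name__ == "__main__":
    for row in frequency_rows(["a", "a", "b", "a"]):
        print(row)
```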

Binary variants

profile is feature-gated (profile cargo feature). It is present in qsv (full) and qsvdp (DataPusher+ optimized). It is not in qsvlite. Note that qsvdp enables profile but not schema; profile doesn't need schema, since it computes stats in-process.

