Metadata Profiling
Tier: Advanced
Command covered: profile
Per-command flag reference lives in `/docs/help/profile.md`. This page is the workflow layer: what `profile` produces, which metadata vocabulary to target, and how the pieces compose.
qsv profile turns a CSV (local path or URL) into a .metadata.json document that re-expresses the dataset in a standard metadata vocabulary — DCAT-US v3, DCAT-AP v3, Croissant 1.1, or Geoconnex — plus a CKAN-shaped block that DataPusher+ consumes. It is qsv's FAIRification command: point it at data and get publish-ready, harvestable metadata.
Under the hood it runs the same statistical + frequency analysis DataPusher+ (DP+) runs in CKAN, builds a Jinja2 evaluation context from the results, and — when a CKAN scheming YAML spec is supplied — evaluates the spec's formula / suggestion_formula templates against that context. The Jinja2 helpers/filters are a native Rust port of DP+'s jinja2_helpers.py, built on minijinja.
profile emits a single <input>.metadata.json carrying up to five top-level blocks:
| Block | Source | Purpose |
|---|---|---|
| `dpp` | inferred signals | lat/lon/date columns, file size, row count, encoding; the legacy DataPusher+ inference block |
| `stats` | `qsv stats` | per-column summary statistics |
| `frequency` | `qsv frequency` | per-column value counts |
| `ckan` | derived | CKAN-shaped package + resources block DP+ uses to prepopulate CKAN |
| `projection` | active profile | the dataset re-expressed in the chosen metadata vocabulary |
Skip blocks you don't need: --no-projection drops the projection block (keeping dpp/stats/frequency/ckan); --no-ckan drops the CKAN block.
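As a rough illustration of the layout described above, the following Python sketch reports which of the five top-level blocks a parsed `.metadata.json` contains. The block names come from the table. The helper name `present_blocks` and the sample document are hypothetical, and the sketch assumes the blocks appear as top-level keys under exactly those names.

```python
import json

# Top-level block names as listed in the table above.
KNOWN_BLOCKS = ("dpp", "stats", "frequency", "ckan", "projection")


def present_blocks(metadata: dict) -> list[str]:
    """Return the known top-level blocks found in a parsed metadata document."""
    return [name for name in KNOWN_BLOCKS if name in metadata]


# Hypothetical stand-in for a document produced with --no-projection.
sample = json.loads('{"dpp": {}, "stats": {}, "frequency": {}, "ckan": {}}')
print(present_blocks(sample))
```

For a real run, replace `sample` with the result of `json.load` on the emitted file.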
Stats are computed in-process. `profile` does not shell out to `qsv schema`; it builds its own stats cache internally (mode = `ProfileSchema`: schema stats + quartiles + mode) so the descriptive-statistics projection always surfaces the full extended-stat set on a fresh run. It does shell out to `qsv frequency`, `qsv count`, and (when a spec declares validators) `qsv validate`.
--profile <name|path> selects the projection vocabulary. Bundled names:
| `--profile` | Vocabulary | Consumed by | Validator (`--validate`) |
|---|---|---|---|
| `dcat-us-v3` (default) | DCAT-US v3 JSON-LD | data.gov harvesters | vendored GSA JSON Schema |
| `dcat-ap-v3` | DCAT-AP v3 | EU data portals | `pyshacl` over bundled SHACL shapes |
| `croissant` | Croissant 1.1 JSON-LD | mlcommons / Hugging Face / Kaggle | `mlcroissant` |
| `geoconnex` | Geoconnex JSON-LD | Internet of Water tooling | `pyshacl` over bundled SHACL shapes |
geoconnex is gated behind the geoconnex cargo feature — on by default in qsv (via distrib_features), opt-in for qsvdp (-F datapusher_plus,geoconnex).
--profile also accepts a path to a custom YAML profile. Embedded names always win over same-named files, so give custom profiles a non-clashing name. See resources/profiles/README.md for the schema and authoring guide.
--validate checks the emitted projection against the active profile's declared validators (see table above). By default violations are appended to projection_warnings. Add --strict to fail the command on JSON Schema violations or non-Info external-validator findings (Required/Recommended severities) instead of just warning.
RFC 4180 structural failures from `qsv validate` (emitted when a spec declares `validators`) are always appended as warnings, regardless of `--strict`.
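A strict validation run could be wired into a CI step along the following lines. This is a minimal sketch, not part of qsv: the helper names `strict_profile_cmd` and `passes_strict` are hypothetical, and it assumes that "fail the command" under `--strict` surfaces as a non-zero exit status.

```python
import subprocess


def strict_profile_cmd(csv_path: str, profile: str = "dcat-us-v3") -> list[str]:
    """Build the argv for a strict validation run using the flags described above."""
    return ["qsv", "profile", csv_path, "--profile", profile, "--validate", "--strict"]


def passes_strict(csv_path: str, profile: str = "dcat-us-v3") -> bool:
    """Run qsv and report success.

    Assumption: a strict validation failure shows up as a non-zero exit status.
    """
    result = subprocess.run(
        strict_profile_cmd(csv_path, profile), capture_output=True, text=True
    )
    return result.returncode == 0


print(strict_profile_cmd("data.csv"))
```

Calling `passes_strict` requires `qsv` on `PATH`; building the argv as a list avoids shell-quoting issues with unusual file names.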
External validators (pyshacl, mlcroissant) are Python tools you must have installed for --validate to run them. For bundled profiles they always run, because the profile content is vetted at qsv release time. For a profile loaded from an arbitrary YAML file, the external validator declared by validation.external is not spawned unless you pass --allow-external-validator — otherwise the run emits a Recommended-severity warning instead, so an untrusted YAML can't silently execute arbitrary commands.
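Because the external validators must already be installed, a pre-flight check can save a wasted run. The sketch below maps each bundled profile to the validator named in the table above and looks for it on `PATH`. It assumes each tool is reachable as an executable of the same name, which may not match how qsv actually locates it; `missing_validator` is a hypothetical helper.

```python
import shutil

# Profile -> external validator, taken from the profile table.
# Assumption: each validator is an executable of the same name on PATH.
EXTERNAL_VALIDATORS = {
    "dcat-ap-v3": "pyshacl",
    "croissant": "mlcroissant",
    "geoconnex": "pyshacl",
}


def missing_validator(profile: str, which=shutil.which):
    """Return the name of the missing tool, or None if nothing is missing.

    Profiles without an external validator (e.g. dcat-us-v3) return None.
    """
    tool = EXTERNAL_VALIDATORS.get(profile)
    if tool is None:
        return None
    return None if which(tool) else tool


print(missing_validator("dcat-us-v3"))
```

The `which` parameter exists only so the lookup can be swapped out in tests.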
When <input> is a URL whose response carries DCAT markup (HTTP Link: rel=describedBy), profile discovers the publisher's stated metadata and merges it as a base layer beneath the inferred projection. Disable with --no-dcat-discovery; tune the per-probe timeout with --dcat-discovery-timeout <secs> (default 5). Stdin and URL inputs are materialized to a tempfile so the rest of the pipeline sees a normal file path; the output's input field reads stdin for piped input.
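To make the discovery signal concrete, here is a simplified Python sketch that extracts `rel=describedBy` targets from an HTTP `Link` header value. It is not qsv's implementation: `described_by_links` is a hypothetical helper, the example URLs are made up, and the parser does not handle commas or semicolons inside quoted parameter values.

```python
import re


def described_by_links(link_header: str) -> list[str]:
    """Return target URLs whose rel parameter includes 'describedby' (case-insensitive)."""
    targets = []
    # Each link-value looks like: <url>; param=value; param="value"
    for match in re.finditer(r'<([^>]*)>((?:\s*;\s*[^;,]+)*)', link_header):
        url, params = match.group(1), match.group(2)
        rel = re.search(r'\brel\s*=\s*"?([^";]+)"?', params, flags=re.IGNORECASE)
        # A rel value may hold several space-separated relation types.
        if rel and "describedby" in rel.group(1).lower().split():
            targets.append(url)
    return targets


# Hypothetical header value for illustration.
header = (
    '<https://data.example.gov/datasets/sample.jsonld>; rel="describedBy"; '
    'type="application/ld+json", <https://data.example.gov/next>; rel="next"'
)
print(described_by_links(header))
```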
--initial-context <json> provides seed values for the package / resource dicts plus optional JSON-Pointer overrides for the final projection. Top-level keys: package, resource, dataset_info. Wrap any leaf as {"value": ..., "force": true} to mark it as overriding both URL-discovered DCAT markup and qsv's own inference:
- `dataset_info` entries override their target path verbatim.
- `package` / `resource` entries route through the active profile's `field_mappings:` table (e.g. `package.title` with `force=true` lands at `/projection/dct:title`, beating inference and discovery).
- Forced values for slots the profile doesn't surface are silently dropped (no-op).
See tests/resources/profile/dcat-init-context.README.md for a fully-populated example. (This flag replaces the older --package-meta / --resource-meta flags.)
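The following Python sketch assembles an initial-context document with one forced leaf. Only the top-level keys (`package`, `resource`) and the `{"value": ..., "force": true}` wrapper come from the description above. The `forced` helper, the field names other than `package.title`, and all values are hypothetical. `dataset_info` is left out because its key format is not shown here; see the README referenced above.

```python
import json


def forced(value):
    """Wrap a leaf so it is marked as overriding discovery and inference."""
    return {"value": value, "force": True}


initial_context = {
    "package": {
        "title": forced("Sample Dataset"),        # hypothetical title, forced
        "notes": "Seed description, not forced",  # hypothetical key and value
    },
    "resource": {
        "name": "sample.csv",                     # hypothetical key and value
    },
}

payload = json.dumps(initial_context, indent=2)
print(payload)
```

Saving the printed JSON to a file such as `publisher.json` gives something that can be passed to `--initial-context`.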
--spec <yaml> supplies a CKAN scheming YAML spec. profile then evaluates the spec's formula / suggestion_formula Jinja2 templates against the analysis context to compute derived fields — spatial/temporal extents, accrual periodicity, etc. Without a spec, only the inferred dpp block is emitted and no formulas are evaluated. For an example spec, see DP+'s dataset-druf.yaml.
# Quick: dpp/stats/frequency + default DCAT-US v3 projection.
qsv profile data.csv # → data.csv.metadata.json
# Pipe stdin; output defaults to stdin.metadata.json.
cat data.csv | qsv profile
# URL input: discover the publisher's DCAT markup and merge it as a base layer.
qsv profile https://data.example.gov/datasets/sample.csv
# Seed publisher/contact info; write to a chosen output path.
qsv profile data.csv --initial-context publisher.json -o data.metadata.json
# data.gov-style harvest: validate against DCAT-US v3 JSON Schema,
# abort on violations, wrap in a Catalog envelope.
qsv profile data.csv --validate --strict --catalog -o data.metadata.json
# DCAT-AP v3 for EU portals (pyshacl validates the bundled SHACL shapes).
qsv profile open-data.csv --profile dcat-ap-v3 --validate --strict
# Croissant JSON-LD for an ML dataset (mlcroissant validates the output).
qsv profile train.csv --profile croissant --validate -o train.croissant.json
# Embed per-column value-frequency RecordSets in the Croissant projection.
qsv profile train.csv --profile croissant --croissant-frequency
# Geoconnex JSON-LD for hydrologic data (needs the `geoconnex` feature).
qsv profile gages.csv --profile geoconnex --validate --strict
# Evaluate a CKAN scheming spec: Jinja2 formulas compute derived fields.
qsv profile data.csv --spec dataset-druf.yaml -o data.metadata.json
# CKAN-only output: drop the projection block, keep dpp/stats/frequency/ckan.
qsv profile data.csv --no-projection --spec dataset-druf.yaml
# Custom YAML profile from disk (use a non-clashing name).
qsv profile data.csv --profile ./my-org-dcat.yaml --validate

The croissant profile renders per-column descriptive statistics as Croissant annotations (Median, FirstQuartile, ThirdQuartile, Mode, ArithmeticMean, StandardDeviation, Variance, Minimum, Maximum, Range, Sum, …). These come from the in-process extended stats, so they appear on a fresh run without needing a pre-built --everything stats cache. Add --croissant-frequency to also embed per-column value-frequency distributions as inline cr:RecordSets (one <col>-frequency RecordSet of {value, count, percentage} rows per column); the raw counts always remain in the top-level frequency block regardless.
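To show the shape of the `{value, count, percentage}` rows, here is a small Python sketch that computes them for one column. It illustrates the row layout only and is not how qsv produces them: `frequency_rows` is a hypothetical helper, and expressing the percentage on a 0-100 scale rounded to two decimals, ordered by descending count, is an assumption.

```python
from collections import Counter


def frequency_rows(values):
    """Build one {value, count, percentage} row per distinct value.

    Assumptions: percentage is on a 0-100 scale, rounded to two decimals,
    and rows are ordered by descending count.
    """
    counts = Counter(values)
    total = sum(counts.values())
    return [
        {"value": value, "count": count, "percentage": round(100 * count / total, 2)}
        for value, count in counts.most_common()
    ]


rows = frequency_rows(["a", "b", "a", "a"])
print(rows)
```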
profile is feature-gated (profile cargo feature). It is present in qsv (full) and qsvdp (DataPusher+ optimized). It is not in qsvlite. Note qsvdp enables profile but not schema — profile doesn't need schema, since it computes stats in-process.
- `/docs/help/profile.md`: canonical flag reference
- `resources/profiles/README.md`: profile YAML schema & authoring guide
- Validation & Schema: `validate` / `schema` / `sniff`
- Aggregation & Statistics → stats: the stats `profile` computes
- Aggregation & Statistics → frequency: the frequency pass `profile` runs
- Integrations (CKAN, DuckDB, Python, CI/CD): CKAN / DataPusher+ context
- Binary Variants: which builds include `profile`
- `tests/test_profile.rs`: extensive worked examples