feat(schema): metadata field for dataFiles #2683
|
This feature was already proposed last year in this PR: #1967. For the same reason as that time, I think we should not allow arbitrary metadata in multiple places to avoid ambiguity between the responsibilities of |
I see, thank you for referencing that issue; it helps to bring that discussion back, as I have arrived at the same point my predecessor reached in the end. I am very open to other suggestions, but as of now it's the only option I can see.
|
The changes look fine to me. The concern around unclear responsibilities between dataset metadata and datafile metadata makes sense, but I think it's mainly an issue if we assume a 1:1 relationship between datasets and datafiles, which I don't think is the case for everyone. To me, datafile metadata feels like a good fit for file-specific details. That being said, I'm not ignoring the risk of users mixing the responsibilities between
I'd love to discuss the potential risks further, but my main point is that optional dataFile metadata is a good addition.
|
I understand the concern about having metadata in both datasets and files, but I think there is value in associating metadata with a specific file, particularly when a dataset contains multiple files. I see ESS using this feature in the near future, specifically for derived datasets.
|
If possible, please hold off on merging until the next SciCat meeting. There might be different opinions and suggestions.
e23333b to c95852b
sbliven
left a comment
Feedback from today's meeting was supportive of this feature, but with some changes.
- Add a configuration option for a JSON schema that would be checked for new requests. The default schema should not accept any properties:

      {
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "type": "object",
        "properties": {},
        "additionalProperties": false
      }

  We're already using ajv for validation in some other parts, so a new environment variable to override this schema would make sense. For example, to get the unrestricted behavior you could set DATAFILE_METADATA_SCHEMA=datafile_schema.json, pointing to a similar schema but with "additionalProperties": true. (var spelling subject to change)
- It should be clearly documented that Dataset.scientificMetadata is the expected field for most metadata, and that Datafile.metadata will not be searchable by users across datasets.
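To make the default-deny behavior concrete, here is a minimal sketch in TypeScript. It is not the SciCat implementation and does not use ajv: `checkMetadata` is a hand-rolled stand-in that only looks at `properties` and `additionalProperties`, and the names `JsonObjectSchema`, `DEFAULT_SCHEMA`, `resolveSchema` and `checkMetadata` are all hypothetical. How the override schema would be loaded from the file named in the environment variable is left out.

```typescript
// Sketch only: every name here is hypothetical, not taken from the SciCat code base.
type JsonObjectSchema = {
  $schema?: string;
  type: "object";
  properties: Record<string, unknown>;
  additionalProperties: boolean;
};

// Default proposed in the review: an object that accepts no properties at all.
const DEFAULT_SCHEMA: JsonObjectSchema = {
  $schema: "https://json-schema.org/draft/2020-12/schema",
  type: "object",
  properties: {},
  additionalProperties: false,
};

// Use the override schema if one was configured, otherwise the default.
function resolveSchema(override?: JsonObjectSchema): JsonObjectSchema {
  return override ?? DEFAULT_SCHEMA;
}

// Stand-in for a real validator such as ajv. It only enforces
// "metadata is a plain object" and the additionalProperties flag.
function checkMetadata(schema: JsonObjectSchema, metadata: unknown): string[] {
  if (typeof metadata !== "object" || metadata === null || Array.isArray(metadata)) {
    return ["metadata must be an object"];
  }
  if (schema.additionalProperties) {
    return [];
  }
  return Object.keys(metadata)
    .filter((key) => !Object.prototype.hasOwnProperty.call(schema.properties, key))
    .map((key) => `unexpected property: ${key}`);
}
```

With the default schema any non-empty metadata object is rejected, while an override with "additionalProperties": true accepts arbitrary keys, which matches the two behaviors described above.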
|
I added a review based on my notes from the meeting today. I also had a couple personal comments I wanted to bring up.
|
@ApiProperty({
  type: Object,
  required: false,
  description: "File metadata.",
There was a problem hiding this comment.
-  description: "File metadata.",
+  description: "File-specific metadata. The Dataset field scientificMetadata should be preferred for aggregate metadata, as it is searchable and displayed more prominently to users.",
@HayenNico Would a description like this be clearer?
(Also in the schema file)
There was a problem hiding this comment.
Yes, this description is a lot better for the schema.
For the documentation pages, I'd propose we put forward some additional guidance on when to use this field over scientificMetadata. My thinking would be that while this can be used for file-level granularity, there should generally be some related aggregate in the dataset for every entry in dataFile.metadata. For example, the ILL use case seems to need "sample type" and "sample subtype", which should be complemented by aggregate sample metadata in the dataset (or in a sample entry + sampleIds reference in the dataset).
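As an illustration of that guidance, a small helper could derive the dataset-level aggregate from the file-level entries. This is a sketch under assumptions: the function name `aggregateFileMetadata` and the key `sampleType` are hypothetical, and nothing like this exists in the PR.

```typescript
// Sketch only: hypothetical helper, not part of SciCat.
type FileMetadata = Record<string, unknown>;

// Collect the distinct values of one file-level key, in first-seen order,
// so they can be stored as an aggregate in the dataset's scientificMetadata.
function aggregateFileMetadata(
  files: { metadata?: FileMetadata }[],
  key: string,
): unknown[] {
  const values: unknown[] = [];
  for (const file of files) {
    const value = file.metadata?.[key];
    if (value !== undefined && values.indexOf(value) === -1) {
      values.push(value);
    }
  }
  return values;
}

// Files without metadata, or without the key, are simply skipped.
const sampleTypes = aggregateFileMetadata(
  [
    { metadata: { sampleType: "powder" } },
    { metadata: { sampleType: "crystal" } },
    { metadata: { sampleType: "powder" } },
    {},
  ],
  "sampleType",
);
```

The resulting list could then be written to the dataset, so that the searchable scientificMetadata stays the primary source and dataFile.metadata only adds file-level detail.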
There was a problem hiding this comment.
Thanks for the feedback and the description. To answer some of your concerns:
1 - If unused, the metadata field is not stored at the MongoDB level, since it's optional; Mongoose does that by default if you try to create an empty metadata field. This means no storage overhead is to be expected from this change.
2 - As you mentioned, since it's an optional field, it would not require any migration.
|
@sbliven and @HayenNico thank you so much for the contribution.
c95852b to 81a6281
|
Thank you all for the feedback! The latest commit contains the proposed changes, addressing the issues above. I also suggest reading the added documentation, datafiles_metadata.md.
HayenNico
left a comment
Looks great, especially the new documentation section. No further concerns on my end ;)
f18b1b6 to 6a684f6
Description
A metadata field for dataFile, to hold file-related variables.

    [
      {
        "dataFileList": [
          {
            "path": "string",
            "size": 0,
            "time": "2026-04-14T12:05:20.052Z",
            "chk": "string",
            "uid": "string",
            "gid": "string",
            "perm": "string",
            "type": "string",
            "metadata": {}
          }
        ]
      }
    ]

Motivation
When a dataset contains a large number of files, it is very useful to collect file-specific metadata to help tell them apart.
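A minimal TypeScript sketch of that idea, assuming a trimmed-down file type: `DataFileSketch` keeps only a few fields from the example above, and the sample values, the `sampleType` key and the helper `filesWithMetadata` are hypothetical.

```typescript
// Sketch only: trimmed to the fields needed here; not the actual DTO.
interface DataFileSketch {
  path: string;
  size: number;
  time: string;
  metadata?: Record<string, unknown>; // the new optional field
}

const exampleFiles: DataFileSketch[] = [
  { path: "run1/a.dat", size: 1024, time: "2026-04-14T12:05:20.052Z", metadata: { sampleType: "powder" } },
  { path: "run1/b.dat", size: 2048, time: "2026-04-14T12:05:20.052Z", metadata: { sampleType: "crystal" } },
  { path: "run1/readme.txt", size: 12, time: "2026-04-14T12:05:20.052Z" }, // no metadata at all
];

// Return the paths of all files whose metadata has the given key/value pair.
function filesWithMetadata(files: DataFileSketch[], key: string, value: unknown): string[] {
  return files
    .filter((file) => file.metadata?.[key] === value)
    .map((file) => file.path);
}

const powderFiles = filesWithMetadata(exampleFiles, "sampleType", "powder");
```

Because the field is optional, files that carry no metadata simply never match.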

The following is a use case example:
Changes:
additional field metadata in the dataFile DTO, schema and interface
Tests included
Documentation