Skip to content

feat(schema): metadata field for dataFiles#2683

Open
abdellah257 wants to merge 2 commits into
SciCatProject:masterfrom
abdellah257:datafile-metadata
Open

feat(schema): metadata field for dataFiles#2683
abdellah257 wants to merge 2 commits into
SciCatProject:masterfrom
abdellah257:datafile-metadata

Conversation

@abdellah257
Copy link
Copy Markdown
Contributor

Description

a metadata field for dataFile, to include file related variables

[
  {
    "dataFileList": [
      {
        "path": "string",
        "size": 0,
        "time": "2026-04-14T12:05:20.052Z",
        "chk": "string",
        "uid": "string",
        "gid": "string",
        "perm": "string",
        "type": "string",
        "metadata": {}
      }
    ]
  }
]

Motivation

In case of large number of files in a dataset it's very useful to collect file specific metadata to help seperate between them.
The following is a use case example:
image

Changes:

additional field metadata in the dataFile DTO, schema and interface

Tests included

  • Included for each change/fix?
  • Passing?

Documentation

  • swagger documentation updated (required for API changes)
  • official documentation updated

@abdellah257 abdellah257 requested a review from a team as a code owner April 14, 2026 12:07
@abdellah257 abdellah257 changed the title feat: metadata field for dataFiles feat(schema): metadata field for dataFiles Apr 14, 2026
@HayenNico
Copy link
Copy Markdown
Member

This feature was already proposed last year in this PR: #1967. For the same reason as that time, I think we should not allow arbitrary metadata in multiple places to avoid ambiguity between the responsibilities of Dataset and DataFile.

@abdellah257
Copy link
Copy Markdown
Contributor Author

This feature was already proposed last year in this PR: #1967. For the same reason as that time, I think we should not allow arbitrary metadata in multiple places to avoid ambiguity between the responsibilities of Dataset and DataFile.

I see, thank you for referencing that issue, it helps bringing back that discussion, as I can get to the same point my predecessor got to at the end. I' am very open to other suggestion, but as of now it's the only option I can see.
I can also argue that the point of conflicting a Dataset and a DataFile is not clear especially at ILL, due to how general a Dataset is, it can have up to 64 000 files, with a set of these having a very specific function for later processing ....
Including this information in a Dataset as a map<file, metadata> at scientific_metadata is also not a viable option due the sizes of our datasets, and also conflicts with the purpose of scientific_metadata in the first place.

@Junjiequan
Copy link
Copy Markdown
Member

The changes look fine to me. The concern around unclear responsibilities between dataset metadata and datafile metadata makes sense, but I think it's mainly an issue if we assume a 1:1 relationship between datasets and datafiles, which I dont think is the case for everyone. To me, datafile metadata feels like a good fit for file specific details.

That being said, I'm not ignoring risks of users mixing the responsiblities between dataset and dataFile. I just think there's more practical way to address it without restrcting the change, we could:

  • Add a clear field description that indicates datafile metadata is not a substitute for dataset metadata
  • Document use case boundries for each

I'd love to discuss more about the potential risks, but my main point is that optional dataFile metadata is a good addition

@nitrosx
Copy link
Copy Markdown
Member

nitrosx commented Apr 16, 2026

I understand the worries to have metadata both in datasets and files, but I think that there is value to have metadata associated with a specific file, specifically when your datasets contains multiples files.
A clear documentation with examples should be able to address all the questions from how to use it, to data curation.
Also, we clearly have a use case from a supporting facility and it will not impact existing deployments if the feature is not used or, even better, disabled.

I see ESS using this feature in the near future, specifically for derived dataset.
I support the feature. Of course, I will be more than happy to discuss further the topic, if that help clarifying and dissipate the worries expressed above.

@Junjiequan Junjiequan changed the title feat(schema): metadata field for dataFiles feat(schema): metadata field for dataFiles Apr 17, 2026
@Junjiequan
Copy link
Copy Markdown
Member

Junjiequan commented Apr 17, 2026

If possible, please wait for the merge until the next scicat meeting. There might be different opinions and suggestions

Copy link
Copy Markdown
Member

@sbliven sbliven left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feedback from today's meeting was supportive of this feature, but with some changes.

  1. Add a configuration option for a JSON schema that would be checked for new requests. The default schema should not accept any properties:
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {},
  "additionalProperties": false
}

We're already using ajv for validation in some other parts, so a new environmental variable to override this schema would make sense. For example, to get the unrestricted behavior you could override DATAFILE_METADATA_SCHEMA=datafile_schema.json with a similar schema but setting "additionalProperties": true. (var spelling subject to change)

  1. It should be clearly documented that Dataset.scientificMetadata is the expected field for most metadata, and that Datafile.metadata will not be searchable by users across datasets.

@sbliven
Copy link
Copy Markdown
Member

sbliven commented May 26, 2026

I added a review based on my notes from the meeting today. I also had a couple personal comments I wanted to bring up.

  1. Even if not used, this does have some storage overhead (about 15 bytes per DataFile). We currently have around 1 million DataFiles at PSI; I think 15MB of additional database is fine.

  2. Does this change need a migration script? My feeling is no (it's an optional field), but I though I should bring it up.

Comment thread src/common/dto/datafile.dto.ts Outdated
@ApiProperty({
type: Object,
required: false,
description: "File metadata.",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
description: "File metadata.",
description: "File-specific metadata. The Dataset field scientificMetadata should be preferred for aggregate metadata, as it is searchable and displayed more prominently to users.",

@HayenNico Would a description like this be clearer?

(Also in the schema file)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this description is a lot better for the schema.

For the documentation pages, I'd propose we put forward some additional guidance on when to use this field over scientificMetadata. My thinking would be that while this can be used for file-level granularity, there should generally be some related aggregate in the dataset for every entry in dataFile.metadata. For example, the ILL use case seems to need "sample type" and "sample subtype", which should be complemented by aggregate sample metadata in the dataset (or in a sample entry + sampleIds reference in the dataset).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback and the description, to answer some of your concerns:
1 - If not used, the metadata field can be skipped at mongoDB level since it's optional, Mongoose does that by deafault if you try to create an empty metadata field. which would mean no storage overhead is to be expected due to this change.

2 - as you mentioned since it's an optional field it would not require any migration.

@nitrosx
Copy link
Copy Markdown
Member

nitrosx commented May 27, 2026

@sbliven and @HayenNico thank you so much for the contribution.
I think it is acceptable to have documentation with best practices and use cases.
Also as we said yesterday, we should clearly state that at the moment, files metadata cannot be searched through the datasets nor any other way. They are only for consumption.

@abdellah257
Copy link
Copy Markdown
Contributor Author

Thank you all for the feedback ! In the latest commit you can find the proposed changes with the issues above being addressed. I also suggest reading the added documentation datafiles_metadata.md

Copy link
Copy Markdown
Member

@HayenNico HayenNico left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, especially the new documentation section. No further concerns on my end ;)

@abdellah257 abdellah257 requested a review from sbliven June 4, 2026 14:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants