
file listings

jvandegriff edited this page May 4, 2026 · 37 revisions

See also dataset schema issue tag.

File listings and events lists are similar types of data which can be represented in HAPI. For a while HAPI has been able to represent files, using the "stringType" metadata to identify strings as URIs. File listings were then colloquially just a time and a URI. But we would like to represent file listings as a specific schema in HAPI. This document explores this.

Events lists are more generally just a time stamp and a message associated with the time stamp. Often events lists will have a time range with a start and end time.

We start with a base class which is just a "Listing of Times":

  • time isotime

"File Listing" required elements:

  • startDate - start date/time of coverage (isotime, required)
  • uri (string, required, stringType used for base URI)

Optional and recommended elements:

  • modificationDate (isotime, recommended)
  • size (recommended); units must be in B and values formatted as an integer, unless the file size is greater than 9,007,199,254,740,991 (~9 PB), in which case scientific notation should be used (if the file size is greater than ~9 PB, the exact number of bytes cannot be communicated)
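
The size rule above can be sketched in Python as follows. This is only an illustration: the helper name `format_size` and the number of digits kept in the scientific-notation form are assumptions, not part of the proposal.

```python
MAX_EXACT = 9_007_199_254_740_991  # 2**53 - 1, largest integer a double holds exactly


def format_size(num_bytes: int) -> str:
    """Format a file size in bytes for one row of a file listing (hypothetical helper)."""
    if num_bytes < 0:
        raise ValueError("file size cannot be negative")
    if num_bytes <= MAX_EXACT:
        # Plain integer: no unit suffix, no thousands separators.
        return str(num_bytes)
    # Above ~9 PB the exact byte count cannot be carried by a double,
    # so fall back to scientific notation (digit count is an assumption).
    return f"{float(num_bytes):.15e}"
```

For example, `format_size(1234)` gives the plain integer string, while a size such as `2**60` bytes comes back in exponent form.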

Optional elements (if present, use these keywords):

  • checksum (stringType used to constrain checkSumAlgorithm)
  • creationDate (isotime)
  • accessDate (isotime)
  • stopDate - (isotime) stop date/time of coverage

Comment on column names: we tried to be consistent with the names used in the info response.

Ordering of "fileListing" columns:

  • required columns must be present in the order given (time, then fileURI)
  • optional columns must follow required columns and can be in any order
  • any number of user-added columns can be present (other than the listed optional columns) and these can be interleaved among the optional columns
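
A rough sketch of how a client or verifier might apply these ordering rules. The column names are taken from the lists above (`time`, `fileURI`, and the optional keywords) and are still under discussion, so treat them and the function name `classify_columns` as assumptions.

```python
REQUIRED = ["time", "fileURI"]  # required order from the bullet above
OPTIONAL = {"modificationDate", "size", "checksum",
            "creationDate", "accessDate", "stopDate"}


def classify_columns(names: list[str]) -> dict[str, list[str]]:
    """Check the required columns come first, then split the rest into
    recognized optional columns and user-added columns."""
    if names[:len(REQUIRED)] != REQUIRED:
        raise ValueError(f"listing must start with {REQUIRED}, got {names[:len(REQUIRED)]}")
    rest = names[len(REQUIRED):]
    # Optional and user-added columns may be interleaved in any order.
    return {
        "optional": [n for n in rest if n in OPTIONAL],
        "user": [n for n in rest if n not in OPTIONAL],
    }
```

Because optional and user-added columns can be interleaved, the only ordering check is on the leading required columns.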

Need a new stringType for checksum - see ticket #273. There are curated lists of hash algorithm names (for use in HTTP headers, for example):

A long would be helpful here, but that should be a separate discussion (and maybe we also add float, for HAPI 4.0; also complex numbers?).

Question (analysis needed): how many units-processing libraries use the same strings for these file size units?

Examples of standards for prefixes used with file sizes

We will eventually have to specify which standard we use for these prefixes.

Events Lists are also extensions of String Listings:

  • time - start of time coverage (isotime, required)
  • stopDate - stop of time coverage (isotime, required) (documentation should note that this is the same as time for an instantaneous event)
  • label (required)

Some example extensions to Event List:

  • latitude
  • longitude

Example proposed output, note x_parameterSchema

{
    "HAPI": "3.2",
    "x_createdAt": "2017-02-21T17:27Z",
    "modificationDate": "2026-01-01T00:00Z",
    "x_parameterSchema": "list>fileList>jpgFileList",
    "parameters": [
        {
            "length": 20,
            "name": "Time",
            "type": "isotime",
            "x_format": "$Y-$m-$dT$H:$M:$SZ",
            "fill": null,
            "units": "UTC",
            "timeStampLocation" : "begin"
        },
        {
            "description": "Picture of the creek, unmodified",
            "fill": null,
            "name": "fileURI",
            "length": 26,
            "type": "string",
            "units": null,
            "stringType": {
                "uri": {
                    "base": "https://cottagesystems.com/data/hapi/pics/",
                    "mediaType": "image/jpeg"
                }
            }
        },
        {
            "description": "File modification time",
            "name": "modificationDate",
            "type": "isotime",
            "fill": null,
            "x_format": "$Y-$m-$dT$H:$MZ",
            "length": 17,
            "units": "UTC"
        },
        {
            "description": "File size in kibibytes",
            "name": "fileSize",
            "fill": null,
            "type": "integer",
            "units": "KiB"
        }
    ],
    "sampleStartDate": "2023-01-01T00:00Z",
    "sampleStopDate": "2023-02-01T00:00Z",
    "startDate": "2022-11-01T00:00Z",
    "stopDate": "2026-03-06T00:00Z",
    "cadence": "PT10M",
    "status": {
        "code": 1200,
        "message": "OK"
    }
}

One issue is how to deal with the units on the file size. We could use IEEE units, which seem to be similar to (the same as?) what is used in VO units and astropy units: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9714443

See also:

Message sent 2026-04-06 to HAPI dev mailing list with status update:

For a summary of where we are now: We would like there to be a schema to indicate that a HAPI response is a listing of files that are available as URIs. (We have not provided or encouraged this so far because we don’t want providers just offering a file listing and saying they made their data available via HAPI.) If people do list files using HAPI, we would prefer that they all use the same format, so that it becomes possible to interpret file listings interoperably from any HAPI service. Therefore, we will offer a schema that, if followed, will allow clients to: a) know that they are getting a file listing, and b) interpret such a listing from any server with computer precision using a single client.

The most basic file listing will be a HAPI dataset that has only 2 required columns:

  1. a time column as the first column (required by HAPI for any dataset); for a file listing, this represents the start time of the data in the file
  2. filename as a URI; this is a string column that has a special string sub-type of URI (this URI sub-type is part of the existing HAPI spec as of version 3.2) and holds a link to the file. See here for URI string types: https://github.com/hapi-server/data-specification/blob/master/hapi-3.2.0/HAPI-data-access-spec-3.2.0.md#3616-the-stringtype-object

There can be optional elements after this for: file size, end time of data in the file, file modification time, file creation time, last file access time, checksum. If any of these items are included, there are constraints that must be followed for them to be recognized by HAPI. Following any of these optional but constrained items, a dataset may include any number of other, additional columns relevant for these files, such as wavelength, frequency range, observed target, DOI, image type, quality flag, data version, processing level, etc. HAPI does not place any restriction on the number or structure of these additional columns. They just need to be valid HAPI parameters. Any “x_” items in these parameters are of course allowed, as always.
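
As an illustration of what a single client could do with the minimal two-column listing, here is a small Python sketch. It assumes CSV output, and it assumes that a relative value is turned into a full URI by plain concatenation onto the `base` in the `stringType` object; the function name `resolve_listing` and any file names passed to it are made up for the example.

```python
import csv
import io


def resolve_listing(info: dict, csv_text: str) -> list[tuple[str, str]]:
    """Return (start time, full URI) pairs for a minimal two-column file listing."""
    string_type = info["parameters"][1].get("stringType")
    # Use the base only when stringType is an object that carries one;
    # anything else is treated as having no base.
    base = ""
    if isinstance(string_type, dict):
        base = string_type.get("uri", {}).get("base", "")
    # Assumption: relative values are resolved by simple concatenation.
    return [(row[0], base + row[1]) for row in csv.reader(io.StringIO(csv_text)) if row]
```

Any extra columns after the first two are simply ignored by this sketch, which matches the idea that a basic client only needs the time and the URI.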

2026-04-20

Discussion about fileSize:

  • JavaScript does not even have integers, so what should size be? Pandering to JSON and JavaScript is hard since it doesn't have integers (or comments!)
  • Current thinking: use double and recommend that it be shown as an integer with as full precision as possible so that you get the exact value; if you are above 2GB (more digits than fits in double)
  • JavaScript: may lose precision for integers larger than 9007199254740991 (2^53 - 1)
  • see this binary presentation converter: https://www.binaryconvert.com/result_double.html
  • If a double is in this range: +/- 9,007,199,254,740,991 then represent it exactly, and this value will be represented exactly as a double
  • Discussed and abandoned: We could suggest that people add their own x_exactFileSize as a clandestine long by actually being a string type JSON; such as "123456789012345" (quotes make it a string to JSON, and then it requires special parsing, like a BigInt)
  • What about making fileSize a string?
  • Will summarize and clean this up tomorrow.
  • This is useful to show that most file sizes (much bigger than 2GB) would be precisely represented: https://www.binaryconvert.com/convert_double.html
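
The precision point in this discussion can be checked directly. The short Python sketch below (the helper name `survives_double` is just for illustration) tests whether an integer byte count is unchanged after a round trip through a 64-bit double.

```python
MAX_SAFE = 2**53 - 1  # 9007199254740991


def survives_double(n: int) -> bool:
    """True if integer n is unchanged after a round trip through a 64-bit float."""
    return int(float(n)) == n


print(survives_double(2 * 1024**3))   # 2 GiB -> True
print(survives_double(MAX_SAFE))      # True
print(survives_double(MAX_SAFE + 2))  # 2**53 + 1 -> False
```

So sizes far above 2 GB are still carried exactly; exactness is only lost beyond 2^53 - 1.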

See also: https://github.com/hapi-server/data-specification/issues/218

Sample info response for a file listing

{
   "HAPI": "3.3",
   "status": { "code": 1200, "message": "OK"},
   "$schema": "https://hapi-server.org/schemas/HAPI-3.2.json#info-fileListing",
   "startDate": "1998-001Z",
   "stopDate" : "2017-100Z",
   "parameters": [
       { "name": "time",
         "type": "isotime",
         "units": "UTC",
         "fill": null,
         "length": 24 },
       { "name": "fileURI",
         "type": "string",
         "stringType": {"uri": { "base": "https://sample.com/listing", "mediaType": "image/fits" } },
         "fill": null,
         "description": "solar images at 580 nm",
         "label": "filename"},
       { "name": "checksum",
         "type": "string",
         "length": 32,
         "stringType": {"checksum": { "algorithm": "md5" } },
         "fill": null,
         "description": "pre-calculated checksum using MD5 algorithm"},
       { "name": "stopDate",
         "type": "isotime",
         "length": 24,
         "units": "UTC",
         "fill": null,
         "description": "end date and time when the image was taken; integration times range from 10s to 30s",
         "label": "image stop date"}
   ]
}
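
A possible client-side use of the checksum column, sketched in Python. The `checksum` stringType shown above is a proposal, so the lookup path into the parameter object is an assumption, as is the idea that the algorithm name can be passed straight to `hashlib.new`.

```python
import hashlib


def checksum_matches(data: bytes, expected_hex: str, param: dict) -> bool:
    """Compare the digest of downloaded bytes with the value from the checksum column."""
    # Assumed layout: {"stringType": {"checksum": {"algorithm": "md5"}}}
    algorithm = param["stringType"]["checksum"]["algorithm"]
    digest = hashlib.new(algorithm, data).hexdigest()
    return digest == expected_hex.strip().lower()
```

Reading the algorithm from the metadata, rather than hard-coding MD5, is what would let one client handle listings from servers that choose different hash algorithms.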

How to handle duration of files and events

How should we handle the fact that event listings and file listings involve content that has an intrinsic time range? Regular HAPI data content has each row associated with a point in time, at least with respect to the query for data.

We decided to keep the query mechanism and rules the same, and will just add a statement about the need to expand a query time range to include potential edge cases, something like: Because event lists and file listings refer to items with an implied duration, a HAPI query for items in this kind of list may need to be expanded, since the query will return only items whose start time falls in the query range. If a server wants to communicate a duration, the stopDate should be used.
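
A minimal Python sketch of the widening idea, under the assumption that the client knows some upper bound on item duration (`max_duration` here, which is not something the proposal defines). `server_query` only mimics the rule that a query returns rows whose start time falls in the requested range; both function names are hypothetical.

```python
from datetime import datetime, timedelta

Row = tuple[datetime, datetime]  # (start, stop) of one file or event


def server_query(rows: list[Row], start: datetime, stop: datetime) -> list[Row]:
    """Mimic a HAPI request: return rows whose start time is in [start, stop)."""
    return [r for r in rows if start <= r[0] < stop]


def items_overlapping(rows: list[Row], start: datetime, stop: datetime,
                      max_duration: timedelta) -> list[Row]:
    """Widen the query start by max_duration, then keep rows that overlap [start, stop)."""
    candidates = server_query(rows, start - max_duration, stop)
    # Drop rows that ended at or before the originally requested start.
    return [r for r in candidates if r[1] > start]
```

With this approach, an item that starts before the requested range but is still in progress at its start is picked up by the widened query and kept by the client-side filter, while the server rules stay unchanged.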

How to handle duplicate times in file listings or event lists

Repeated time tags are allowed in fileListing or eventList data schemas. Equivalently, we could say that time values in these schemas must never decrease.

We just noticed that the HAPI spec never actually states that HAPI times must only ever increase. So we need to add that to the spec! The definitions of "monotonically increasing" vary, so we will avoid that language. The spec should say that values can only ever increase, with no duplicates.
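
The two rules (strictly increasing for regular data, never decreasing for listings) could be checked along these lines. This sketch assumes all time strings in a response use the same fixed-width ISO 8601 format, so that string comparison matches time order; the function name `times_ok` is hypothetical.

```python
def times_ok(times: list[str], allow_repeats: bool) -> bool:
    """Check time ordering: never decreasing if repeats are allowed,
    strictly increasing otherwise."""
    pairs = list(zip(times, times[1:]))
    if allow_repeats:
        return all(a <= b for a, b in pairs)
    return all(a < b for a, b in pairs)
```

A response with a repeated time tag would pass with `allow_repeats=True` (file or event listing) and fail with `allow_repeats=False` (regular data).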

Comments on case and capitalization

Three places where we have specific capitalization:

  • http query parameters: we use snake case, such as include_parameters
  • camelCase everywhere else
  • AlertCamelCase for the name of the first column, the Time parameter (sort of, since it's only one word)

Defining the schema for what the parameters are

Like the unitsSchema and coordinateSystemSchema, we will use parameterSchema as the keyword.

Other options: datasetSchema - this means keywords outside the parameters have extra requirements

Could datasetSchema be an array? So far, these potential values are envisioned:

Should it just be called "dataType"? Do we need to worry about other usage of "schema"? We have "stringType" already.

AI summary for 2026-04-29

Meeting Summary for HAPI FileListings - working meeting

Quick recap

The team established file size representation standards and discussed document formatting requirements for file sizes, algorithm names, and schema conventions. They explored design considerations for event lists and file listings, including the handling of time ranges and the potential for shared base classes. The group concluded by addressing documentation updates regarding time tags and data schemas, while also discussing server-side data constraints and API query requirements.

Next steps

  • Jon to update documentation to allow repeated time tags for event and file listings
  • Jon to modify data section to state "Data time values must be monotonically increasing unless specified otherwise"
  • Bob to update verifier to check for "not decreasing" instead of "increasing" for time tags
  • Jon to add keyword in info response to indicate parameter schema for file and event listings
  • Team to discourage but allow file listings that mix multiple datasets
  • Team to remove "file" from file URI and file size parameter names
  • Team to maintain consistent capitalization across API, metadata, and parameter names

Summary

File Size Representation Standards

The team discussed file size representation standards, focusing on units and formatting. They agreed that units should always be in bytes, with values formatted as integers, unless the file size exceeds 9 petabytes, in which case exponential notation should be used. Bob emphasized that the verifier would enforce these rules, while Jon captured the decisions in the wiki page.

Document Formatting and Schema Standards

The team discussed and refined text in a document, focusing on formatting and content related to file sizes, algorithm names, and schema conventions. They agreed to modify certain phrases, including replacing specific numbers with tilde symbols and adjusting the placement of parenthetical statements. Jeremy suggested listing algorithm names directly in the document to avoid external dependencies, and the team decided to include examples for clarity. They also discussed the handling of user-added columns and emphasized the importance of encouraging users to develop their own schemas for additional columns while maintaining consistency for file listings.

Event and File Listing Design

Jon and Jeremy discussed the design of a base class for event lists and file listings. They debated whether to have a common base class for both, but ultimately decided against it due to the different requirements of events and file listings. They also discussed the minimal requirements for a VO event, noting that the current structure includes RA and DEC information which may not be relevant for all use cases.

Event and File Listing Challenges

The team discussed the challenges of listing events and file listings, particularly regarding the need for descriptions and the handling of time ranges. They explored the possibility of using a base class for file listings as a subclass of events lists, but Jeremy expressed concerns about the implications of this change. The group agreed that additional API flags might be necessary to better communicate the nature of different types of listings, such as file listings, event listings, or availability listings. They also considered relaxing the rule about HAPI servers returning exact start and stop times for certain datasets.

Event Data Schema Overlap Discussion

The team discussed handling event listings and data schemas, focusing on how to manage overlapping events and time ranges. Bob proposed allowing the start and stop parameters in API queries to refer to different columns than the primary time index, which would better accommodate events with multiple stop times. Jeremy suggested that some events might be described by ranges rather than instances, and Jon agreed that the API should be flexible enough to handle both cases. The team also considered the potential server load of implementing complex queries and agreed that widening queries to include overlapping data might be a simpler solution.

Server Data Constraints and Time Handling

The team discussed implementing constraints on server-side data, particularly focusing on handling time-based queries and data with intrinsic duration. They agreed to maintain the current approach of requiring widened time ranges for event lists and file listings, rather than adding server complications. The group also debated whether to allow non-monotonic data, with Jeremy suggesting that allowing repetitions could be safe while Jon raised concerns about potential hash key issues. They concluded that while duplicate times would be allowed, they must be accompanied by stop dates to describe durations.

Documentation Updates for Time Tags

The team discussed changes to the documentation regarding time tags and data schemas. They agreed to allow repeated time tags in event lists and file listings, though discouraged for file listings. They decided to use "dataset type" instead of "dataset schema" to describe the overall constraints for a dataset. The team also aligned on using consistent capitalization for parameter names, with "time" as an exception to maintain consistency with existing usage. They left open the question of how to declare a particular dataset type in the info response.

AI-generated content may be inaccurate or misleading. Always check for accuracy.

2026-05-04

For identifying an info response, we are settling on datasetSchema, since there may be requirements levied on dataset-level options. For example, if we add a schema for ground-based datasets, we could require the location or geoLocation.

List of potential schemas:

  • fileListing
  • FAIR
  • eventList
  • groundMagnetometer (and other possible measurement types)
  • spaceMagnetometer

What about multiple schemas in one info response? A file listing schema could

info response:

{
   "HAPI": "3.3",
   "status": { "code": 1200, "message": "OK"},
   "startDate": "1998-001Z",
   "stopDate" : "2017-100Z",
   "coordinateSystemSchema" : { "schemaName": "SPASE", "schemaURI": "TBD"},
   "datasetSchema": { "fileListing" },
   "datasetSchema": { "fileListing": { "parameters" : { "startDate" : "startDate", "uri": "uri" } } },

   "datasetSchema": { "fileListing": { "parameters" : { "startDate" : "Time", "uri": "fileURL" },
                                        "dataset":    { "geoLocation" : "x_placeAsLatAndLon" } } },
   # or just give the mappings
   "datasetSchema": { "fileListing": { "startDate" : "Time", "uri": "fileURL" } },
   # for ground magnetometers
   "datasetSchema": { "groundMagnetometer": { "vectorField" : "field" } },
   "datasetSchema": { "groundMagnetometer": { "vectorField" : [ "bx", "by", "bz"] } },
   "datasetSchema": { "groundMagnetometer": { "vectorFieldBaselineSubtracted" : "fieldBGSubtr" } },
   "datasetSchema": { "groundMagnetometer": { "vectorFieldBaselineSubtracted" : [ "bx_b", "by_b", "bz_b"] } },

   "datasetSchema": { "FAIR": { "dataset": {"licenseURL": "x_APL_license_URL" } } },
   "datasetSchema": { "FAIR": { "dataset": {"licenseURL": "x_license_URL" } } }
   # we concluded this is not appropriate (people should use the right HAPI keyword names!)
}

After discussion, we realized that HAPI server creators control the dataset-level keyword names, so we should not allow people to use non-HAPI-compliant dataset keywords. I.e., we don't want separate parameters and dataset elements, just the parameters one, so we don't need a title for it.

So just this then:

{
   "HAPI": "3.3",
   "status": { "code": 1200, "message": "OK"},
   "startDate": "1998-001Z",
   "stopDate" : "2017-100Z",
   "coordinateSystemSchema" : { "schemaName": "SPASE", "schemaURI": "TBD"},
   "datasetSchema": "fileListing",
   "datasetSchema": { "fileListing": {} },  # allowed?, but same as above
   "datasetSchema": { "fileListing": { "parameterMap" : { "startDate" : "Time", "uri": "fileURL" } } },
   # or for data:
   "datasetSchema": { "groundMagnetometer": { "vectorField" : "field" } },
   "datasetSchema": { "groundMagnetometer": { "vectorField" : [ "bx", "by", "bz"] } },
   "datasetSchema": { "groundMagnetometer": { "vectorFieldBaselineSubtracted" : "fieldBGSubtr" } },
   "datasetSchema": { "groundMagnetometer": { "vectorFieldBaselineSubtracted" : [ "bx_b", "by_b", "bz_b"] } },
}
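
To make the alternatives above concrete, here is a Python sketch of how a client might read the `datasetSchema` value in its string form, its empty-object form, and its `parameterMap` form. Since the keyword is still being settled, the accepted shapes and the function name `parse_dataset_schema` are assumptions based only on the examples above.

```python
def parse_dataset_schema(value) -> tuple[str, dict]:
    """Return (schema name, parameter map) from a datasetSchema value."""
    if isinstance(value, str):
        # Short form: "datasetSchema": "fileListing"
        return value, {}
    if isinstance(value, dict) and len(value) == 1:
        # Object form: {"fileListing": {...}} with an optional parameterMap
        name, options = next(iter(value.items()))
        return name, dict((options or {}).get("parameterMap", {}))
    raise ValueError("unrecognized datasetSchema value")
```

Under this reading, the string form and the empty-object form give the same result, and the parameter map tells the client which actual column names play the `startDate` and `uri` roles.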
