Proposal draft: packed lookup entries #41

Currently, when multiple names/prefixes/datatypes must be defined in the stream before a single statement, each falls into a separate `RdfStreamRow`. For example (from [Nanopub Registry](https://registry.petapico.org/np/RALP8h_jozbq7ixA4vjMVGBM3XaG9WVQ2cIhpuwlry4Q4.jelly.txt)):

```java
rows {
  name {
    id: 0
    value: "sig"
  }
}
rows {
  name {
    id: 0
    value: "hasAlgorithm"
  }
}
# and here goes the quad
```

If the entries are added sequentially (and they often are), we could perhaps squash them into:

```java
rows {
  name {
    id: 0
    value: "sig"
    value: "hasAlgorithm"
  }
}
# and here goes the quad
```

This would be done by marking the `value` field of the entry messages as `repeated`.
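To make the intended semantics concrete, here is a minimal sketch of how a decoder might expand such a packed entry. It rests on assumptions that are not spelled out above: that `id: 0` means "previous ID + 1", and that each further value in the same entry takes the next consecutive ID. The function `apply_packed_entry` and its parameters are hypothetical and are not part of any existing Jelly implementation.

```python
def apply_packed_entry(lookup: dict[int, str], last_id: int,
                       entry_id: int, values: list[str]) -> int:
    """Insert all values of one packed entry into the lookup table.

    Assumption: entry_id == 0 means "continue from last_id + 1", and
    every following value gets the next consecutive ID.
    Returns the last ID that was assigned.
    """
    next_id = entry_id if entry_id != 0 else last_id + 1
    for value in values:
        lookup[next_id] = value
        last_id = next_id
        next_id += 1
    return last_id


# Hypothetical usage: an empty name lookup receiving the packed entry above.
names: dict[int, str] = {}
last = apply_packed_entry(names, last_id=0, entry_id=0,
                          values=["sig", "hasAlgorithm"])
print(names, last)
```

Under these assumptions the result is the same lookup state that the two separate rows would have produced, so only the wire encoding changes.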

Squashing would save 4 bytes for each squashed entry: 2 for the tag and LEN of `RdfStreamRow`, and 2 for the tag and LEN of `Rdf(Name|Prefix|Datatype)Entry` (assuming each length fits in a single-byte varint). Further savings could be achieved if we processed triples in minibatches (maybe introduce such an API in `ProtoEncoder`?), where we'd have to assume that the dictionaries are large enough to hold all entries needed for the minibatch. This should not be a problem for batches of, let's say, 10 statements.
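The 4-byte figure can be sanity-checked by hand-encoding both variants with a tiny Protobuf wire-format writer. This is only a sketch: the field numbers (`rows = 1`, `name = 9`, `value = 2`) are assumptions about the schema, and `id: 0` is treated as omitted from the wire because it is the default value. Any field number below 16 yields a one-byte tag, so the result does not depend on the exact numbers.

```python
def varint(n: int) -> bytes:
    """Encode a non-negative integer as a Protobuf varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def len_field(field_no: int, payload: bytes) -> bytes:
    """Encode a length-delimited (wire type 2, LEN) field."""
    return varint((field_no << 3) | 2) + varint(len(payload)) + payload


# Assumed field numbers, for illustration only.
ROWS, NAME, VALUE = 1, 9, 2


def separate(values: list[str]) -> bytes:
    """One RdfStreamRow with one name entry per value (current layout)."""
    return b"".join(
        len_field(ROWS, len_field(NAME, len_field(VALUE, v.encode("utf-8"))))
        for v in values
    )


def packed(values: list[str]) -> bytes:
    """A single row whose name entry carries all values (proposed layout)."""
    entry = b"".join(len_field(VALUE, v.encode("utf-8")) for v in values)
    return len_field(ROWS, len_field(NAME, entry))


values = ["sig", "hasAlgorithm"]
print(len(separate(values)), len(packed(values)))  # 27 23
```

For the two names in the example, the separate layout takes 27 bytes and the packed one 23, which is the 4-byte saving for the one squashed entry.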

I'd have to run some scripts on the RiverBench datasets to see what the savings would be in concrete terms. TODO: test it with different minibatch sizes, from 1 up to, let's say, 16.
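A rough shape for such a script could look like the sketch below. It is a simplification under stated assumptions: it takes, per statement, the number of new lookup entries of one kind (names only, say), treats all entries inside a minibatch as sequential and therefore squashable, and uses the 4 bytes per squashed entry estimated above. The function `estimated_savings` is hypothetical, and the input list is a made-up toy sequence, not RiverBench data.

```python
def estimated_savings(new_entries_per_statement: list[int],
                      batch_size: int,
                      bytes_per_squash: int = 4) -> int:
    """Estimate bytes saved by packing lookup entries within minibatches.

    Assumption: within one minibatch, all n new entries collapse into a
    single packed entry, saving bytes_per_squash for each of the n - 1
    entries that no longer need their own row.
    """
    total = 0
    for start in range(0, len(new_entries_per_statement), batch_size):
        batch = new_entries_per_statement[start:start + batch_size]
        n = sum(batch)
        if n > 1:
            total += bytes_per_squash * (n - 1)
    return total


# Toy input (not a measurement): new name entries needed by each statement.
toy_counts = [2, 0, 1, 3, 0, 1]
for size in (1, 2, 4, 8, 16):
    print(size, estimated_savings(toy_counts, size))
```

A real measurement would replace the toy list with counts extracted from actual streams and would have to handle names, prefixes, and datatypes separately.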
