
[Question] Tokenizer is not counted in submission size #43

@DouglasOrr

The submission size calculation doesn't count the persisted tokenizer itself. Is this right? I have two objections. First, the size otherwise includes everything necessary (modulo generic Python requirements) to fully specify the model and run inference with it, so it seems a shame to lose this property by omitting the tokenizer. Second, not counting the tokenizer gives large-vocabulary models an artificial advantage, since they get the token strings "for free".

(However, I appreciate this may be a pragmatic decision: although the fineweb_1024_bpe.vocab file is small, fineweb_1024_bpe.model is large, on the order of 150 kB compressed, and is presumably larger than it needs to be. I presume the model isn't uniquely reconstructable from the vocab?)
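
To make the proposal concrete, here is a minimal sketch of how the tokenizer files could be folded into the size total. It is not the challenge's actual size calculation: the function names, the `base_size` parameter, and the choice of zlib at level 9 as the compressor are all assumptions made for illustration.

```python
import zlib
from pathlib import Path


def compressed_size(data: bytes, level: int = 9) -> int:
    """Return the zlib-compressed size of `data` in bytes.

    zlib is an assumed stand-in for whatever compressor the
    submission size calculation really uses.
    """
    return len(zlib.compress(data, level))


def submission_size_with_tokenizer(base_size: int, tokenizer_paths: list[str]) -> int:
    """Add the compressed size of each tokenizer file to `base_size`.

    `base_size` is a hypothetical name for the size already counted
    (code plus weights). A missing file raises FileNotFoundError.
    """
    total = base_size
    for path in tokenizer_paths:
        total += compressed_size(Path(path).read_bytes())
    return total
```

Calling `submission_size_with_tokenizer(base_size, ["fineweb_1024_bpe.model"])` would then charge each submission for its tokenizer, so a larger vocabulary pays for its extra strings.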
