Clarification on how OCR annotations are used during training #105

@JerryPW

Description

Hi, thank you for releasing this excellent work.

While reading the paper, one point remained unclear to me: how the OCR annotations are actually incorporated into training.

From the paper, I understand the following:

- PaddleOCR is applied to images from OBELICS and Zero250M
- The recognized text is tokenized
- 100 fine-grained tags are constructed for each image
- OCR data is introduced in Stage 2 together with video supervision

However, the paper does not seem to explicitly describe how these OCR-derived tags are optimized in the training objective (e.g., which loss term is applied to them).
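For concreteness, here is one interpretation I considered (purely my assumption, not something the paper states): the tokenized OCR tags are simply appended to the image's token sequence and supervised with the standard next-token cross-entropy, with no separate loss term. A minimal sketch of that loss over the OCR tokens:

```python
import math

def ocr_token_loss(probs_per_step, target_tokens):
    """Average negative log-likelihood over the OCR tag tokens.

    Hypothetical formulation (my assumption, not confirmed by the paper):
    the OCR tokens are treated like ordinary text targets under the
    autoregressive language-modeling objective.

    probs_per_step: list of dicts mapping token -> model probability
                    at each position where an OCR token is the target
    target_tokens:  the tokenized OCR tags for one image
    """
    nll = 0.0
    for probs, tok in zip(probs_per_step, target_tokens):
        # Standard cross-entropy term for this position; the floor
        # avoids log(0) if the model assigns the token no mass.
        nll += -math.log(probs.get(tok, 1e-9))
    return nll / len(target_tokens)

# Toy example: two OCR tokens given probabilities 0.5 and 0.25,
# so the loss is (ln 2 + ln 4) / 2 = 1.5 * ln 2.
loss = ocr_token_loss(
    [{"stop": 0.5, "go": 0.5}, {"sign": 0.25, "stop": 0.75}],
    ["stop", "sign"],
)
```

Is something along these lines what happens, or do the 100 fine-grained tags get their own prediction head and loss?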
