Hi, thank you for releasing this excellent work.
While reading the paper, I found one point that is still unclear to me: how the OCR annotations are actually incorporated into training.
From the paper, I understand the following (my reading of the pipeline is sketched in code after this list):

- PaddleOCR is applied to images from OBELICS and Zero250M
- the recognized text is tokenized
- 100 fine-grained tags are constructed for each image
- OCR data is introduced in Stage 2 together with video supervision
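For concreteness, here is a minimal sketch of how I currently read that preprocessing pipeline. Everything beyond the PaddleOCR call is my own guess: `build_tags`, the tokenizer choice, and the truncation rule are hypothetical, not from the paper.

```python
# Sketch of my reading of the OCR preprocessing (PaddleOCR 2.x-style API).
from paddleocr import PaddleOCR
from transformers import AutoTokenizer

ocr_engine = PaddleOCR(use_angle_cls=True, lang="en")   # PaddleOCR, as named in the paper
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # hypothetical tokenizer choice

def build_tags(image_path: str, num_tags: int = 100):
    """Run OCR, tokenize the recognized text, keep up to `num_tags` tags (my guess)."""
    result = ocr_engine.ocr(image_path, cls=True)
    # result[0] is a list of [bbox, (text, confidence)] entries (or None if no text)
    texts = [line[1][0] for line in (result[0] or [])]
    token_ids = tokenizer("\n".join(texts), add_special_tokens=False)["input_ids"]
    # Truncate to a fixed budget of 100 fine-grained tags per image. The paper
    # may instead select tags by frequency, confidence, or position -- unclear to me.
    return token_ids[:num_tags]
```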
However, the paper does not seem to explicitly describe how these OCR-derived tags enter the training objective, e.g., whether they serve as additional language-modeling targets, feed a separate prediction head, or contribute an auxiliary loss.
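Is the objective something like the following? This is purely a guess to make the question concrete, assuming the 100 tags act as multi-label targets over the tokenizer vocabulary, trained with an auxiliary BCE loss alongside the usual LM loss; `TagHead`, `vision_features`, and `lambda_ocr` are all hypothetical names, not from the paper.

```python
# Speculative sketch of one plausible formulation -- not the paper's method.
import torch
import torch.nn as nn

class TagHead(nn.Module):
    """Predicts OCR-derived tags from pooled vision features (hypothetical)."""
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, vision_features):       # (batch, hidden_dim)
        return self.proj(vision_features)     # (batch, vocab_size) logits

def tag_loss(logits, tag_ids, vocab_size):
    """Multi-label BCE over the vocabulary: each of the 100 tags is a positive."""
    targets = torch.zeros(logits.size(0), vocab_size, device=logits.device)
    targets.scatter_(1, tag_ids, 1.0)         # tag_ids: (batch, 100) token ids
    return nn.functional.binary_cross_entropy_with_logits(logits, targets)

# total_loss = lm_loss + lambda_ocr * tag_loss(...)   # lambda_ocr is a guess
```

Alternatively, the tags might simply be appended to the language-modeling targets and trained with the standard next-token cross-entropy. A pointer to which formulation (if either) is actually used would be very helpful.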