Implement pipeline and new schema for image datasets#158
Implement pipeline and new schema for image datasets#158MajikalExplosions wants to merge 7 commits intomainfrom
Conversation
|
@MajikalExplosions could you check the failing pre-commit checks and Python unit tests? Thanks! |
neubig
left a comment
There was a problem hiding this comment.
submitting a review comment so I pop it off my review stack, but please re-request review when tests are passing!
|
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment, like Feel free to include any additional details that might help me get this PR into a better state. You can manage your notification settings |
neubig
left a comment
There was a problem hiding this comment.
Hey, basically looks good! One comment though, I think we should try to fix the schema to support things we need instead of using getattr and hasattr
|
|
||
| # Handle nested image observation | ||
| image_path = None | ||
| if hasattr(event, "image_observation") and event.image_observation: |
There was a problem hiding this comment.
I'm a bit confused, where would this nested image observation come from? I didn't see it in other parts of the code.
In general, "getattr" and "hasattr" are kinda anti-patterns in Python programming. They are indicative of not strictly adhering to type definitions, and can cause all kinds of tricky runtime errors. Let's try to write this without using these.
Summary
Implements multimodal (image + text) fine-tuning support, enabling 9 datasets to preserve visual information for training vision-language models instead of converting images to text descriptions.
Implementation
schema/observation/image.py: Schema enhancements that:
ImageAnnotation:content_description,clickable,editableNone)agents/openhands/std_to_sft.py: Multimodal SFT conversion that:
<image>tokens in conversation text at appropriate positions_image_pathmetadata[clickable],[editable])imagesarrayDataset converters (raw_to_standardized.py): Updated 9 image datasets.
content_descriptionwith resource ID/hint/tooltip, andclickable/editablefrom raw data fieldstexttocontent_description, marks all elements asclickable=Truecontent_descriptionwith XPath informationdatasets/openhands/screenshots/{trajectory_id}/directory structureTesting
In progress. Some verification is done on the output.
Notes
count(<image> tokens) == len(images array))