Thank you for your excellent work.
I have reviewed the open-source dataset. Apart from the cocktail task, most of the other CoT texts appear to be simple visual localization tasks (pick something).
I would like to know whether you trained long-range or general planning skills only on the cocktail task (or the other real-robot tasks showcased), i.e. in the "Scene description, Plan, What I have done, Now I need to" format.
Have you provided the synthetic long-range planning data in the open-source datasets, as mentioned in Table 11 of the paper ("Examples of synthetic vision-language data for long-horizon tasks")?
If I missed something, I apologize. Looking forward to your reply, thank you.