You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
MoVA is a really impressive work! I am working on a similar idea of using the text instruction to guide the fusion of image tokens in MLLMs. However, I encountered an issue thesedays: the LLaVA-665K finutuning dataset contains a lot of multi-turn conversations which means one sample can involve multiple instructions . In this case, do we need to split each multi-turn conversation sample into multiple single-turn conversation samples (since we can only encode one text instruction for one sample in a forward computation)?
Thanks!
The text was updated successfully, but these errors were encountered:
During training, we keep the original data format and directly concatenate these multi-round questions into a single question for instruction-aware extraction.
Dear authors:
MoVA is a really impressive work! I am working on a similar idea of using the text instruction to guide the fusion of image tokens in MLLMs. However, I encountered an issue thesedays: the LLaVA-665K finutuning dataset contains a lot of multi-turn conversations which means one sample can involve multiple instructions . In this case, do we need to split each multi-turn conversation sample into multiple single-turn conversation samples (since we can only encode one text instruction for one sample in a forward computation)?
Thanks!
The text was updated successfully, but these errors were encountered: