Multi-Modal Support for Enhanced Retrieval #12
Comments
Sounds like a good enhancement - especially useful for indexing Blender/Photoshop/visual-based tasks. Highly encourage you to take a crack at implementing it!
Sounds great! I’ll try it out and reach out if I have any questions.
Microsoft Florence 2 might be a good option:
Yeah, I've played with moondream, and when I did, it performed quite poorly on screenshots. I had a short interaction with the creator, and it sounded like he was considering tackling screenshots, but the project was focused on scenes (photographs etc.) at the time. I've been keeping an eye out. Closed models (OpenAI / Anthropic) can look at a screenshot and build an HTML page from it to some degree, which tells me they understand screenshots pretty well and would perform well here. Maybe fine-tuning moondream on screenshots, using a larger model, would be possible.
A few hours ago Hugging Face released an article on how to fine-tune Florence.
Is there a plan to incorporate image embeddings along with OCR and metadata-based retrieval? Utilizing the CLIP model from Candle to generate image embeddings could provide clearer context and improve the accuracy of xrem’s results. If performance is a concern, downscaling images before embedding could be a viable solution.
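To make the downscaling and retrieval idea concrete, here is a minimal, dependency-free sketch. It does not call Candle or any CLIP implementation: the embeddings are assumed to already exist as plain `f32` vectors, and the function names (`downscaled_size`, `cosine_similarity`, `rank_by_similarity`) are hypothetical, not part of this project. The 224-pixel target used in the usage note is an assumption based on the input size of common CLIP ViT variants.

```rust
/// Compute a target size whose longest side is at most `max_side`,
/// preserving the aspect ratio. Images that already fit are left as-is.
fn downscaled_size(width: u32, height: u32, max_side: u32) -> (u32, u32) {
    let longest = width.max(height);
    if longest == 0 || longest <= max_side {
        return (width, height);
    }
    let scale = max_side as f64 / longest as f64;
    // Clamp to at least 1 pixel so very thin images do not collapse to 0.
    let w = ((width as f64 * scale).round() as u32).max(1);
    let h = ((height as f64 * scale).round() as u32).max(1);
    (w, h)
}

/// Cosine similarity between two embeddings.
/// Returns `None` for mismatched lengths, empty input, or a zero-norm vector.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Rank stored frame embeddings against a query embedding, best match first.
/// Each result is (index into `frames`, similarity score).
fn rank_by_similarity(query: &[f32], frames: &[Vec<f32>]) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = frames
        .iter()
        .enumerate()
        .filter_map(|(i, emb)| cosine_similarity(query, emb).map(|s| (i, s)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored
}
```

With this sketch, `downscaled_size(1920, 1080, 224)` gives the size to resize a full-HD screenshot to before embedding, and `rank_by_similarity` could be combined with the existing OCR and metadata scores rather than replacing them.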