Tiancheng Gu, Kaicheng Yang, Chaoyi Zhang, Yin Xie, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng
- [2025/02/18]: ✨ The RealSyn dataset has been released on 🤗 Hugging Face.
- [2025/02/18]: ✨ The RealSyn paper has been submitted to arXiv.
Contrastive Language-Image Pre-training (CLIP) demonstrates promising performance on a wide variety of benchmarks. However, a substantial volume of non-paired data, such as multimodal interleaved documents, remains underutilized for vision-language representation learning.
To fully leverage these unpaired documents, we first establish a Real-World Data Extraction pipeline to extract high-quality images and texts.
Then we design a hierarchical retrieval method to efficiently associate each image with multiple semantically relevant realistic texts. To further enhance fine-grained visual information, we propose an image semantic augmented generation module for synthetic text production. Furthermore, we employ a semantic balance sampling strategy to improve dataset diversity, enabling better learning of long-tail concepts. Based on these innovations, we construct RealSyn, a dataset combining realistic and synthetic texts, available in three scales: 15M, 30M, and 100M. Extensive experiments demonstrate that RealSyn effectively advances vision-language representation learning and exhibits strong scalability.
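To make the idea of balancing a dataset across semantic concepts more concrete, here is a minimal sketch in Python. It is not the RealSyn implementation: the text above does not specify the sampling rule, so this example assumes each sample already carries a cluster label (for instance from clustering text embeddings) and simply caps how many samples are kept per cluster. The function name `balanced_sample` and the parameter `per_cluster_cap` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balanced_sample(cluster_ids, per_cluster_cap, seed=0):
    """Keep at most `per_cluster_cap` samples from each cluster.

    Assumption: `cluster_ids[i]` is a precomputed semantic cluster label
    for sample i. Head clusters are randomly downsampled to the cap,
    while tail clusters smaller than the cap are kept in full.
    Returns the sorted indices of the kept samples.
    """
    rng = random.Random(seed)

    # Group sample indices by their cluster label.
    members_by_cluster = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        members_by_cluster[cluster].append(index)

    kept = []
    for members in members_by_cluster.values():
        if len(members) > per_cluster_cap:
            members = rng.sample(members, per_cluster_cap)
        kept.extend(members)
    return sorted(kept)


# Toy data: one frequent concept, one medium concept, one rare concept.
cluster_ids = ["dog"] * 8 + ["cat"] * 3 + ["okapi"] * 1
kept_indices = balanced_sample(cluster_ids, per_cluster_cap=3)
kept_counts = Counter(cluster_ids[i] for i in kept_indices)
print(kept_counts)
```

In this toy run the frequent concept is reduced to the cap while the rare concept survives untouched, which is the general effect a balance-oriented sampler aims for: long-tail concepts make up a larger share of the sampled data than of the raw data.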
We ran LDA with 30 topics on 1M randomly sampled image-realistic text pairs. The above figure presents the proportions and examples for six topics: animal, food, airplane, flower, automotive, and landmark.

We present the image-text similarity and text token distribution of 15M samples from YFCC15, LAION, RealSyn-R1 (the most relevant retrieved realistic text), and RealSyn-S1 (the semantic augmented synthetic text based on RealSyn-R1).
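The two statistics mentioned above can be sketched as follows. This is an illustrative example only, not the project's analysis code: it assumes image and text embeddings are already available as plain lists of floats (in practice they would come from a CLIP-style encoder), and it counts tokens by whitespace splitting rather than with a model tokenizer. The helper names `cosine_similarity` and `token_length_histogram` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Define similarity with a zero vector as 0.
    return dot / (norm_u * norm_v)


def token_length_histogram(captions, bin_width=10):
    """Histogram of caption lengths, keyed by the lower edge of each bin.

    Assumption: tokens are whitespace-separated words, which only
    approximates what a real text tokenizer would produce.
    """
    histogram = Counter()
    for caption in captions:
        token_count = len(caption.split())
        histogram[(token_count // bin_width) * bin_width] += 1
    return dict(sorted(histogram.items()))


# Toy embeddings standing in for encoder outputs.
image_embedding = [1.0, 0.0]
matching_text_embedding = [1.0, 0.0]
unrelated_text_embedding = [0.0, 1.0]
print(cosine_similarity(image_embedding, matching_text_embedding))
print(cosine_similarity(image_embedding, unrelated_text_embedding))

# Toy captions: one short, one longer.
captions = ["a red bird", "a " * 12]
print(token_length_histogram(captions))
```

Computing these two quantities per dataset and plotting their distributions side by side gives the kind of comparison described in the paragraph above.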
We randomly select 0.2M samples and count the unique entities in their captions to assess the data diversity of different datasets.

This project would not have been possible without the invaluable contributions of the following individuals, who were instrumental in data scraping and collection:
Contributor | Email |
---|---|
Bin Qin | [email protected] |
Lan Wu | [email protected] |
If you find this repository useful, please use the following BibTeX entry for citation.
@misc{gu2025realsyn,
title={RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm},
author={Tiancheng Gu and Kaicheng Yang and Chaoyi Zhang and Yin Xie and Xiang An and Ziyong Feng and Dongnan Liu and Weidong Cai and Jiankang Deng},
year={2025},
eprint={2502.12513},
archivePrefix={arXiv},
primaryClass={cs.CV}
}