RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm

Tiancheng Gu, Kaicheng Yang, Chaoyi Zhang, Yin Xie, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng


📣 News

  • [2025/02/18]:✨The RealSyn Dataset has been released on 🤗Hugging Face.
  • [2025/02/18]:✨The RealSyn paper has been submitted to arXiv.

💡 Introduction

Contrastive Language-Image Pre-training (CLIP) demonstrates promising performance on a wide variety of benchmarks. However, a substantial volume of non-paired data, such as multimodal interleaved documents, remains underutilized for vision-language representation learning.

To fully leverage these unpaired documents, we first establish a Real-World Data Extraction pipeline to extract high-quality images and texts.

Then we design a hierarchical retrieval method to efficiently associate each image with multiple semantically relevant realistic texts. To further enhance fine-grained visual information, we propose an image semantic augmented generation module for synthetic text production. Furthermore, we employ a semantic balance sampling strategy to improve dataset diversity, enabling better learning of long-tail concepts. Based on these innovations, we construct RealSyn, a dataset combining realistic and synthetic texts, available in three scales: 15M, 30M, and 100M. Extensive experiments demonstrate that RealSyn effectively advances vision-language representation learning and exhibits strong scalability.
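As a rough illustration of the hierarchical retrieval idea described above, the following minimal sketch first picks the text cluster closest to an image embedding and then ranks only the texts inside that cluster. It is not the RealSyn implementation: the function names, the two-level cluster layout, and the toy 2-D embeddings are all assumptions made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def hierarchical_retrieve(image_vec, centroids, clusters, top_k=3):
    """Return the top_k (score, text) pairs for one image embedding.

    centroids: cluster-centre vectors of the text embeddings (hypothetical).
    clusters:  list parallel to centroids, each a list of (text, text_vec).
    """
    # Coarse step: choose the text cluster whose centroid is closest to the image.
    best = max(range(len(centroids)), key=lambda i: cosine(image_vec, centroids[i]))
    # Fine step: score and rank only the texts inside that cluster.
    scored = [(cosine(image_vec, vec), text) for text, vec in clusters[best]]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy data: two clusters of sentence embeddings in 2-D (illustrative only).
centroids = [[1.0, 0.0], [0.0, 1.0]]
clusters = [
    [("a dog running on grass", [0.9, 0.1]), ("a puppy asleep", [0.8, 0.3])],
    [("a plane taking off", [0.1, 0.9]), ("airport runway at dusk", [0.2, 0.8])],
]
results = hierarchical_retrieve([0.95, 0.05], centroids, clusters, top_k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

Restricting the fine-grained ranking to one cluster keeps the per-image cost proportional to the cluster size rather than to the full text corpus, which is the usual motivation for a two-level search.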

💻 Dataset Information

Topic Assessment

We ran LDA with 30 topics on 1M randomly sampled image-realistic text pairs. The figure above presents the proportions and examples for six topics: animal, food, airplane, flower, automotive, and landmark.

Richness Assessment

We present the image-text similarity and text token distributions of 15M samples from YFCC15, LAION, RealSyn-R1 (the most relevant retrieved realistic text), and RealSyn-S1 (the semantic augmented synthetic text based on RealSyn-R1).
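The two statistics above can be sketched in a few lines. This is only an illustration under stated assumptions: the embeddings are made-up 2-D vectors rather than outputs of a real image or text encoder, whitespace splitting stands in for the actual tokenizer, and `mean_similarity` and `token_length_histogram` are hypothetical helper names.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_similarity(pairs):
    """Average cosine similarity over (image_vec, text_vec) pairs."""
    return sum(cosine(img, txt) for img, txt in pairs) / len(pairs)


def token_length_histogram(captions, bin_size=5):
    """Count captions per token-length bin (whitespace tokens as a stand-in)."""
    hist = Counter()
    for caption in captions:
        n_tokens = len(caption.split())
        hist[(n_tokens // bin_size) * bin_size] += 1
    return dict(sorted(hist.items()))


# Toy inputs (illustrative only).
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
captions = [
    "a dog",
    "a small brown dog running across a wide green field",
    "two planes",
]
avg_sim = mean_similarity(pairs)
hist = token_length_histogram(captions)
print(avg_sim, hist)
```

Each histogram key is the lower edge of a bin, so with `bin_size=5` the key `10` covers captions of 10 to 14 tokens.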

Diversity Assessment

We randomly select 0.2M samples from each dataset and count the number of unique entities in their captions to assess data diversity.
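A minimal sketch of this kind of diversity measurement follows. The entity extractor is deliberately left as a plug-in: `toy_extractor` and its small vocabulary are placeholders invented for this example, and the source does not say which entity extraction tool was actually used.

```python
import random


def count_unique_entities(captions, extract_entities, sample_size=200_000, seed=0):
    """Sample captions and count distinct entities found by extract_entities.

    extract_entities is any callable mapping a caption to a list of entity
    strings; the choice of extractor is an assumption left to the caller.
    """
    rng = random.Random(seed)
    k = min(sample_size, len(captions))
    sampled = rng.sample(captions, k)
    seen = set()
    for caption in sampled:
        # Lower-case so that "Dog" and "dog" count as one entity.
        seen.update(entity.lower() for entity in extract_entities(caption))
    return len(seen)


# Placeholder extractor: keeps words found in a tiny hand-written vocabulary.
VOCAB = {"dog", "cat", "plane", "flower"}


def toy_extractor(caption):
    return [word for word in caption.lower().split() if word in VOCAB]


captions = ["A dog chases a cat", "The dog sleeps", "A plane above a flower"]
n_unique = count_unique_entities(captions, toy_extractor)
print(n_unique)
```

Using the same sample size and the same extractor for every dataset is what makes the resulting counts comparable.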

📃 Performance Comparison

Linear probe

Zero-shot Transfer

Zero-shot Retrieval

Dataset Contributors

This project would not have been possible without the invaluable contributions of the following individuals, who have been instrumental in data scraping and collection:

Contributor Email
Bin Qin [email protected]
Lan Wu [email protected]

Citation

If you find this repository useful, please use the following BibTeX entry for citation.

@misc{gu2025realsyn,
      title={RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm}, 
      author={Tiancheng Gu and Kaicheng Yang and Chaoyi Zhang and Yin Xie and Xiang An and Ziyong Feng and Dongnan Liu and Weidong Cai and Jiankang Deng},
      year={2025},
      eprint={2502.12513},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
