Cross-modal retrieval

Cross-modal retrieval is a subfield of information retrieval that enables users to search for and retrieve information across different data modalities, such as text, images, audio, and video.[1] Unlike traditional information retrieval systems that match queries and documents within the same modality (e.g., text-to-text search), cross-modal retrieval bridges different types of media to facilitate more flexible information access.[2][3][4]
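
Many modern systems implement this by embedding queries and documents from every modality into a single shared vector space, so that retrieval reduces to nearest-neighbour search over that space. The sketch below illustrates only the query-time procedure; encode_text and encode_image are hypothetical placeholders for pretrained modality-specific encoders (for example, CLIP-style text and image encoders), and the vectors are random for illustration.

```python
import numpy as np

def cosine_similarity(query_vec, doc_matrix):
    """Cosine similarity between one query vector and every row of doc_matrix."""
    query_vec = query_vec / np.linalg.norm(query_vec)
    doc_matrix = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return doc_matrix @ query_vec

# Hypothetical stand-ins for pretrained encoders that map each modality
# into the same d-dimensional shared space; here they return random vectors.
d = 512
rng = np.random.default_rng(0)

def encode_text(text):
    return rng.standard_normal(d)       # placeholder text embedding

def encode_image(image_id):
    return rng.standard_normal(d)       # placeholder image embedding

# Offline: embed the image collection once and store the vectors.
image_ids = ["img_001.jpg", "img_002.jpg", "img_003.jpg"]
image_index = np.stack([encode_image(i) for i in image_ids])

# Online: embed the text query and rank images by similarity to it.
query = "a dog catching a frisbee on the beach"
scores = cosine_similarity(encode_text(query), image_index)
print([image_ids[i] for i in np.argsort(-scores)])
```

In practice the document embeddings are computed offline and stored in an approximate nearest-neighbour index, so only the query has to be encoded at search time.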

Overview

Cross-modal retrieval addresses scenarios where the query and the target documents belong to different modalities. Common applications include:
  - Text-to-image retrieval: finding images that match a textual query.[1][2]
  - Image-to-text retrieval: finding captions or documents that describe a given image.[2][3]
  - Audio-to-video retrieval: finding video segments using an audio query.[5]
  - Text-to-video retrieval: finding video clips that match a textual description.[6]

Technical Challenges

Cross-modal retrieval presents several challenges:
  - Heterogeneity gap: different modalities have different statistical properties and feature representations, so raw features cannot be compared directly (a minimal sketch of bridging this gap follows the list).
  - Semantic gap: low-level features must be aligned with the high-level semantics that the modalities share.
  - Noisy correspondence: paired training data collected from the web often contains mismatched pairs.[4]
  - Scalability: searching large multimedia collections requires efficient indexing and nearest-neighbour search.
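
A common first step toward bridging the heterogeneity gap is to attach modality-specific projection layers that map raw features of different dimensionality into one shared, normalized space. The sketch below only illustrates the shapes involved; the projection matrices W_text and W_image are random stand-ins for parameters that would normally be learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Raw features from different modalities typically have different
# dimensionalities and statistics (the heterogeneity gap), e.g.:
text_features = rng.standard_normal((4, 768))    # 4 captions, 768-d text features
image_features = rng.standard_normal((4, 2048))  # 4 images, 2048-d visual features

# Modality-specific projections (random stand-ins for learned weights)
# map both feature types into a common 256-d space.
d_shared = 256
W_text = rng.standard_normal((768, d_shared)) * 0.02
W_image = rng.standard_normal((2048, d_shared)) * 0.02

def project(x, w):
    z = x @ w
    return z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-normalize rows

text_emb = project(text_features, W_text)
image_emb = project(image_features, W_image)

# Only after projection are cross-modal similarities well defined.
similarity = text_emb @ image_emb.T   # 4 x 4 text-image similarity matrix
print(similarity.shape)
```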

Approaches

Modern cross-modal retrieval systems employ various techniques:
  - Shared embedding spaces: dual encoders map each modality into a common vector space in which similarities can be computed directly, as in CLIP-style models (see the training sketch below).[3][6]
  - Cross-attention models: a joint network scores each query-document pair by attending across modalities, which is typically more accurate but more expensive, since every candidate must be scored together with the query.
  - Generative approaches: generating one modality from the other is used to improve matching.[2]
  - Cross-modal hashing: compact binary codes enable efficient retrieval over very large collections.
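
Dual-encoder systems of this kind are commonly trained with a symmetric contrastive objective (often called InfoNCE), which pulls matched text-image pairs together and pushes apart the mismatched pairs within a batch. The following is a self-contained numerical sketch of that loss, not the training code of any particular system.

```python
import numpy as np

def info_nce_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (text, image) pairs.

    text_emb, image_emb: (batch, d) unit-normalized embeddings, where row i
    of each matrix corresponds to the same underlying item.
    """
    logits = text_emb @ image_emb.T / temperature   # (batch, batch) similarity scores
    labels = np.arange(len(logits))                 # matching pairs lie on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)        # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the text-to-image and image-to-text directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

# Toy batch of 8 matched pairs in a 64-d shared space.
rng = np.random.default_rng(0)
t = rng.standard_normal((8, 64)); t /= np.linalg.norm(t, axis=1, keepdims=True)
v = rng.standard_normal((8, 64)); v /= np.linalg.norm(v, axis=1, keepdims=True)
print(info_nce_loss(t, v))
```

Because every other item in the batch serves as a negative example, larger batches generally provide a stronger training signal.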

Applications

Cross-modal retrieval has numerous practical applications, including multimedia search engines that return images or videos for textual queries, e-commerce product search from photographs or descriptions, organization and search of large photo and video archives, and retrieval of video segments from spoken or textual queries.

References

  1. Hendriksen, Mariya; Vakulenko, Svitlana; Kuiper, Ernst; de Rijke, Maarten (2023). "Scene-centric vs. object-centric image-text cross-modal retrieval: a reproducibility study". In Kamps, Jaap; Goeuriot, Lorraine; Crestani, Fabio; Maistro, Maria; Joho, Hideo; Davis, Brian; Gurrin, Cathal; Kruschwitz, Udo; Caputo, Annalina (eds.). Advances in Information Retrieval. Lecture Notes in Computer Science. Vol. 13982. Cham: Springer Nature Switzerland. pp. 68–85. doi:10.1007/978-3-031-28241-6_5. ISBN 978-3-031-28240-9.
  2. Gu, Jiuxiang; Cai, Jianfei; Joty, Shafiq; Niu, Li; Wang, Gang (2018). "Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models" (PDF). Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, Utah, USA: IEEE. pp. 7181–7189.
  3. Jain, Aashi; Guo, Mandy; Srinivasan, Krishna; Chen, Ting; Kudugunta, Sneha; Jia, Chao; Yang, Yinfei; Baldridge, Jason (2021). "MURAL: Multimodal, Multitask Representations Across Languages". In Moens, Marie-Francine; Huang, Xuanjing; Specia, Lucia; Yih, Scott Wen-tau (eds.). Findings of the Association for Computational Linguistics: EMNLP 2021. Punta Cana, Dominican Republic: Association for Computational Linguistics. pp. 3449–3463. doi:10.18653/v1/2021.findings-emnlp.293. ISBN 978-1-955917-10-0.
  4. Huang, Zhenyu; Niu, Guocheng; Liu, Xiao; Ding, Wenbiao; Xiao, Xinyan; Wu, Hua; Peng, Xi (2021). "Learning with Noisy Correspondence for Cross-Modal Matching". Advances in Neural Information Processing Systems. Vancouver, Canada: Curran Associates, Inc. pp. 29406–29419.
  5. Jin, Qin; Schulam, Peter Franz; Rawat, Shourabh; Burger, Susanne; Ding, Duo; Metze, Florian (2012). "Event-based Video Retrieval Using Audio". Interspeech 2012. Porto Alegre, Brazil: ISCA. pp. 2085–2088. doi:10.21437/Interspeech.2012-556.
  6. Fang, Han; Xiong, Pengfei; Xu, Luhui; Chen, Yu (2021). "CLIP2Video: Mastering Video-Text Retrieval via Image CLIP". arXiv:2106.11097 [cs.CV].