Automatic image annotation

Output of DenseCap "dense captioning" software (Johnson et al., 2016), analysing a photograph of a man riding an elephant

Automatic image annotation (also known as automatic image tagging or linguistic indexing) is the process by which a computer system automatically assigns metadata in the form of captioning or keywords to a digital image. This application of computer vision techniques is used in image retrieval systems to organize and locate images of interest from a database.

This method can be regarded as a type of multi-class image classification with a very large number of classes, as large as the vocabulary size. Typically, image analysis in the form of extracted feature vectors, together with the training annotation words, is used by machine learning techniques to apply annotations automatically to new images. The first methods learned the correlations between image features and training annotations; later techniques used machine translation to try to translate between the textual vocabulary and a 'visual vocabulary' of clustered image regions known as blobs. Work following these efforts has included classification approaches, relevance models, and so on.
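
A minimal sketch of one such learning technique, nearest-neighbour tag transfer in the spirit of the baseline of Makadia et al. (listed under further reading): annotation words are propagated from the visually closest training images. The feature vectors and tag sets below are toy placeholders, not a real dataset.

```python
import numpy as np

# Hypothetical training data: one global feature vector per image,
# plus the annotation words a human assigned to it.
train_features = np.array([
    [0.9, 0.1, 0.0],   # e.g. a colour/texture descriptor of image 1
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.7],
])
train_tags = [
    {"elephant", "grass"},
    {"elephant", "sky"},
    {"beach", "sea"},
]

def annotate(query, k=2, n_tags=2):
    """Transfer the most frequent tags of the k visually nearest images."""
    dists = np.linalg.norm(train_features - query, axis=1)
    neighbours = np.argsort(dists)[:k]
    counts = {}
    for i in neighbours:
        for tag in train_tags[i]:
            counts[tag] = counts.get(tag, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)[:n_tags]

print(annotate(np.array([0.85, 0.15, 0.05])))  # -> ['elephant', ...]
```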

The main advantage of automatic image annotation over content-based image retrieval (CBIR) is that queries can be specified more naturally by the user.[1] CBIR, at present, generally requires users to search by image concepts such as color and texture, or by providing example queries. Certain image features in example images may override the concept on which the user is actually focusing. The traditional methods of image retrieval, such as those used by libraries, have relied on manually annotated images, which is expensive and time-consuming, especially given the large and constantly growing image databases in existence.
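
Once images carry annotation words, a keyword query reduces to a text lookup rather than visual matching. A minimal sketch, using illustrative file names and tags:

```python
# Toy annotation store: image -> keywords produced by an annotator.
annotations = {
    "img_001.jpg": {"elephant", "man", "grass"},
    "img_002.jpg": {"beach", "sea", "sky"},
    "img_003.jpg": {"elephant", "sky"},
}

# Build an inverted index: keyword -> set of matching images.
inverted = {}
for image, tags in annotations.items():
    for tag in tags:
        inverted.setdefault(tag, set()).add(image)

# A naturally specified keyword query; no example image is required.
print(sorted(inverted.get("elephant", set())))  # -> ['img_001.jpg', 'img_003.jpg']
```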

Related Research Articles

An image retrieval system is a computer system used for browsing, searching and retrieving images from a large database of digital images. Most traditional and common methods of image retrieval add metadata such as captions, keywords, titles or descriptions to the images so that retrieval can be performed over the annotation words. Manual image annotation is time-consuming, laborious and expensive; to address this, a large amount of research has been done on automatic image annotation. Additionally, the rise of social web applications and the semantic web has inspired the development of several web-based image annotation tools.

<span class="mw-page-title-main">Content-based image retrieval</span> Method of image retrieval

Content-based image retrieval, also known as query by image content and content-based visual information retrieval (CBVIR), is the application of computer vision techniques to the image retrieval problem, that is, the problem of searching for digital images in large databases. Content-based image retrieval is opposed to traditional concept-based approaches.

Michael S. Lew is a scientist in multimedia information search and retrieval at Leiden University, Netherlands. He has published over a dozen books and 150 scientific articles in the areas of content based image retrieval, computer vision, and deep learning. Notably, he had the most cited paper in the ACM Transactions on Multimedia, one of the top 10 most cited articles in the history of the ACM SIGMM, and the most cited article from the ACM International Conference on Multimedia Information Retrieval in 2008 and also in 2010. He was the opening keynote speaker for the 9th International Conference on Visual Information Systems, the Editor-in-Chief of the International Journal of Multimedia Information Retrieval (Springer), the co-founder of influential conferences such as the International Conference on Image and Video Retrieval, and the IEEE Workshop on Human Computer Interaction. He was also a founding member of the international advisory committee for the TRECVID video retrieval evaluation project, chair of the steering committee for the ACM International Conference on Multimedia Retrieval and a member of the ACM SIGMM Executive Committee. In addition, his work on convolutional fusion networks in deep learning won the best paper award at the 23rd International Conference on Multimedia Modeling. His work is frequently cited in both scientific and popular news sources.

ACM Multimedia (ACM-MM) is the Association for Computing Machinery (ACM)'s annual conference on multimedia, sponsored by the SIGMM special interest group on multimedia in the ACM. SIGMM specializes in the field of multimedia computing, from underlying technologies to applications, theory to practice, and servers to networks to devices.

<span class="mw-page-title-main">James Z. Wang</span> Chinese-American computer scientist

James Ze Wang is a Chinese-American computer scientist. He is a distinguished professor of the College of Information Sciences and Technology at Pennsylvania State University. He is also an affiliated professor of the Molecular, Cellular, and Integrative Biosciences Program; the Computational Science Graduate Minor; and the Social Data Analytics Graduate Program. He is co-director of the Intelligent Information Systems Laboratory. He was a visiting professor of the Robotics Institute at Carnegie Mellon University from 2007 to 2008. In 2011 and 2012, he served as a program manager in the Office of International Science and Engineering at the National Science Foundation. He is the second son of Chinese mathematician Wang Yuan.

In computer vision, the bag-of-words model, sometimes called the bag-of-visual-words model, can be applied to image classification or retrieval by treating image features as words. In document classification, a bag of words is a sparse vector of occurrence counts of words; that is, a sparse histogram over the vocabulary. In computer vision, a bag of visual words is a vector of occurrence counts of a vocabulary of local image features.
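
A minimal sketch of building such a histogram, assuming local descriptors (e.g. 128-dimensional SIFT-like vectors) have already been extracted; scikit-learn's k-means stands in here for whatever clustering method the vocabulary is built with:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Placeholder local descriptors pooled from many training images.
training_descriptors = rng.normal(size=(1000, 128))

# Step 1: build the visual vocabulary by clustering the descriptors.
vocab_size = 50
kmeans = KMeans(n_clusters=vocab_size, n_init=10, random_state=0)
kmeans.fit(training_descriptors)

# Step 2: represent a new image as occurrence counts of visual words.
image_descriptors = rng.normal(size=(80, 128))  # descriptors of one image
words = kmeans.predict(image_descriptors)
histogram = np.bincount(words, minlength=vocab_size)
print(histogram)  # a sparse histogram over the 50-word vocabulary
```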

Caltech 101 is a data set of digital images created in September 2003 and compiled by Fei-Fei Li, Marco Andreetto, Marc'Aurelio Ranzato and Pietro Perona at the California Institute of Technology. It is intended to facilitate computer vision research and is most applicable to techniques involving image recognition, classification and categorization. Caltech 101 contains a total of 9,146 images, split between 101 distinct object categories and a background category. Provided with the images are a set of annotations describing the outlines of each image, along with a MATLAB script for viewing.
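
A minimal sketch of walking a Caltech 101-style layout, assuming the usual one-directory-per-category arrangement after unpacking (the root path below is an assumption):

```python
from pathlib import Path

root = Path("101_ObjectCategories")  # assumed unpacked dataset root
for category_dir in sorted(root.iterdir()):
    if category_dir.is_dir():
        images = list(category_dir.glob("*.jpg"))
        print(f"{category_dir.name}: {len(images)} images")
```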

In computer vision, the problem of object categorization from image search is the problem of training a classifier to recognize categories of objects, using only the images retrieved automatically with an Internet search engine. Ideally, automatic image collection would allow classifiers to be trained with nothing but the category names as input. This problem is closely related to that of content-based image retrieval (CBIR), where the goal is to return better image search results rather than training a classifier for image recognition.

<span class="mw-page-title-main">Object detection</span> Computer technology related to computer vision and image processing

Object detection is a computer technology related to computer vision and image processing that deals with detecting instances of semantic objects of a certain class in digital images and videos. Well-researched domains of object detection include face detection and pedestrian detection. Object detection has applications in many areas of computer vision, including image retrieval and video surveillance.

<span class="mw-page-title-main">Global Memory Net</span>

Global Memory Net (GMNet) is a world digital library of cultural, historical, and heritage image collections. It is directed by Ching-chih Chen, Professor Emeritus of Simmons College, Boston, Massachusetts and supported by the National Science Foundation (NSF)'s International Digital Library Program (IDLP). The goal of GMNet is to provide a global collaborative network that provides universal access to educational resources to a worldwide audience. GMNet provides multilingual and multimedia content and retrieval, as well as links directly to major resources, such as OCLC, Internet Archive, Million Book Project, and Google.

Multilinear principal component analysis (MPCA) is a multilinear extension of principal component analysis (PCA) that is used to analyze M-way arrays, also informally referred to as "data tensors". M-way arrays may be modeled by linear tensor models, such as CANDECOMP/PARAFAC, or by multilinear tensor models, such as multilinear principal component analysis (MPCA) or multilinear independent component analysis (MICA). The origin of MPCA can be traced back to the tensor rank decomposition introduced by Frank Lauren Hitchcock in 1927; to the Tucker decomposition; and to Peter Kroonenberg's "3-mode PCA" work. In 2000, De Lathauwer et al. restated Tucker and Kroonenberg's work in clear and concise numerical computational terms in their SIAM paper "Multilinear Singular Value Decomposition" (HOSVD) and in their paper "On the Best Rank-1 and Rank-(R1, R2, ..., RN) Approximation of Higher-order Tensors".
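
A minimal sketch of the HOSVD step mentioned above, computed with plain NumPy on a small random 3-way data tensor:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5, 6))  # a toy M-way array (here M = 3)

def unfold(tensor, mode):
    """Mode-n unfolding: move axis `mode` first, flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

# One factor matrix per mode: the left singular vectors of each unfolding.
factors = [np.linalg.svd(unfold(X, m), full_matrices=False)[0]
           for m in range(X.ndim)]

# Core tensor: contract X with the transposed factors along every mode.
core = X
for m, U in enumerate(factors):
    core = np.moveaxis(np.tensordot(U.T, core, axes=(1, m)), 0, m)

print([U.shape for U in factors], core.shape)  # orthonormal factors + core
```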

<span class="mw-page-title-main">AlexNet</span> Convolutional neural network

AlexNet is the name of a convolutional neural network (CNN) architecture, designed by Alex Krizhevsky in collaboration with Ilya Sutskever and Geoffrey Hinton, who was Krizhevsky's Ph.D. advisor at the University of Toronto.

Shih-Fu Chang is a Taiwanese American computer scientist and electrical engineer noted for his research on multimedia information retrieval, computer vision, machine learning, and signal processing.

Jiebo Luo is a Chinese-American computer scientist, the Albert Arendt Hopeman Professor of Engineering and Professor of Computer Science at the University of Rochester. He is interested in artificial intelligence, data science and computer vision.

<span class="mw-page-title-main">Michael J. Black</span> American-born computer scientist

Michael J. Black is an American-born computer scientist working in Tübingen, Germany. He is a founding director at the Max Planck Institute for Intelligent Systems where he leads the Perceiving Systems Department in research focused on computer vision, machine learning, and computer graphics. He is also an Honorary Professor at the University of Tübingen.

Song-Chun Zhu is a Chinese computer scientist and applied mathematician known for his work in computer vision, cognitive artificial intelligence and robotics. Zhu currently works at Peking University and was previously a professor in the Departments of Statistics and Computer Science at the University of California, Los Angeles. Zhu also previously served as Director of the UCLA Center for Vision, Cognition, Learning and Autonomy (VCLA).

<span class="mw-page-title-main">Edward Y. Chang</span> American computer scientist

Edward Y. Chang is a computer scientist, academic, and author. He is an adjunct professor of Computer Science at Stanford University and, since 2019, Visiting Chair Professor of Bioinformatics and Medical Engineering at Asia University.

References

  1. "Archived copy" (PDF). i.yz.yamagata-u.ac.jp. Archived from the original (PDF) on 8 August 2014. Retrieved 13 January 2022.{{cite web}}: CS1 maint: archived copy as title (link)

Further reading

Y Mori; H Takahashi & R Oka (1999). "Image-to-word transformation based on dividing and vector quantizing images with words". Proceedings of the International Workshop on Multimedia Intelligent Storage and Retrieval Management. CiteSeerX 10.1.1.31.1704.
P Duygulu; K Barnard; N de Freitas & D Forsyth (2002). "Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary". Proceedings of the European Conference on Computer Vision. pp. 97–112. Archived from the original on 2005-03-05.
J Li & J Z Wang (2006). "Real-time Computerized Annotation of Pictures". Proc. ACM Multimedia. pp. 911–920.
J Z Wang & J Li (2002). "Learning-Based Linguistic Indexing of Pictures with 2-D MHMMs". Proc. ACM Multimedia. pp. 436–445.
J Li & J Z Wang (2008). "Real-time Computerized Annotation of Pictures". IEEE Transactions on Pattern Analysis and Machine Intelligence.
J Li & J Z Wang (2003). "Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach". IEEE Transactions on Pattern Analysis and Machine Intelligence. pp. 1075–1088.
K Barnard; D A Forsyth (2001). "Learning the Semantics of Words and Pictures". Proceedings of International Conference on Computer Vision. pp. 408–415. Archived from the original on 2007-09-28.
D Blei; A Ng & M Jordan (2003). "Latent Dirichlet allocation" (PDF). Journal of Machine Learning Research. 3: 993–1022. Archived from the original (PDF) on March 16, 2005.
G Carneiro; A B Chan; P Moreno & N Vasconcelos (2006). "Supervised Learning of Semantic Classes for Image Annotation and Retrieval" (PDF). IEEE Transactions on Pattern Analysis and Machine Intelligence. pp. 394–410.
R W Picard & T P Minka (1995). "Vision Texture for Annotation". Multimedia Systems.
C Cusano; G Ciocca & R Schettini (2004). Santini, Simone & Schettini, Raimondo (eds.). "Image Annotation Using SVM". Internet Imaging V. 5304: 330–338. Bibcode:2003SPIE.5304..330C. doi:10.1117/12.526746. S2CID 16246057.
R Maree; P Geurts; J Piater & L Wehenkel (2005). "Random Subwindows for Robust Image Classification". Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition. pp. 1:34–30.
J Jeon; R Manmatha (2004). "Using Maximum Entropy for Automatic Image Annotation" (PDF). Int'l Conf on Image and Video Retrieval (CIVR 2004). pp. 24–32.
J Jeon; V Lavrenko & R Manmatha (2003). "Automatic image annotation and retrieval using cross-media relevance models" (PDF). Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 119–126.
V Lavrenko; R Manmatha & J Jeon (2003). "A model for learning the semantics of pictures" (PDF). Proceedings of the 16th Conference on Advances in Neural Information Processing Systems NIPS.
R Jin; J Y Chai; L Si (2004). "Effective Automatic Image Annotation via A Coherent Language Model and Active Learning" (PDF). Proceedings of MM'04.
D Metzler & R Manmatha (2004). "An inference network approach to image retrieval" (PDF). Proceedings of the International Conference on Image and Video Retrieval. pp. 42–50.
S Feng; R Manmatha & V Lavrenko (2004). "Multiple Bernoulli relevance models for image and video annotation" (PDF). IEEE Conference on Computer Vision and Pattern Recognition. pp. 1002–1009.
J Y Pan; H-J Yang; P Duygulu; C Faloutsos (2004). "Automatic Image Captioning" (PDF). Proceedings of the 2004 IEEE International Conference on Multimedia and Expo (ICME'04). Archived from the original (PDF) on 2004-12-09.
Quan Hoang Lam; Quang Duy Le; Kiet Van Nguyen; Ngan Luu-Thuy Nguyen (2020). "UIT-ViIC: A Dataset for the First Evaluation on Vietnamese Image Captioning". Proceedings of the 2020 International Conference on Computational Collective Intelligence (ICCCI 2020). arXiv:2002.00175. doi:10.1007/978-3-030-63007-2_57.
J Fan; Y Gao; H Luo; G Xu (2004). "Automatic Image Annotation by Using Concept-Sensitive Salient Objects for Image Content Representation". Proceedings of the 27th annual international conference on Research and development in information retrieval. pp. 361–368.
A Oliva & A Torralba (2001). "Modeling the shape of the scene: a holistic representation of the spatial envelope" (PDF). International Journal of Computer Vision. 42: 145–175.
A Yavlinsky, E Schofield & S Rüger (2005). "Automated Image Annotation Using Global Features and Robust Nonparametric Density Estimation" (PDF). Int'l Conf on Image and Video Retrieval (CIVR, Singapore, Jul 2005). Archived from the original (PDF) on 2005-12-20.
N Vasconcelos & A Lippman (2001). "Statistical Models of Video Structure for Content Analysis and Characterization" (PDF). IEEE Transactions on Image Processing. pp. 1–17.
Ilaria Bartolini; Marco Patella & Corrado Romani (2010). "Shiatsu: Semantic-based Hierarchical Automatic Tagging of Videos by Segmentation Using Cuts". 3rd ACM International Multimedia Workshop on Automated Information Extraction in Media Production (AIEMPro10).
Yohan Jin; Latifur Khan; Lei Wang & Mamoun Awad (2005). "Image annotations by combining multiple evidence & wordNet". 13th Annual ACM International Conference on Multimedia (MM 05). pp. 706–715.
Changhu Wang; Feng Jing; Lei Zhang & Hong-Jiang Zhang (2006). "Image annotation refinement using random walk with restarts". 14th Annual ACM International Conference on Multimedia (MM 06).
Changhu Wang; Feng Jing; Lei Zhang & Hong-Jiang Zhang (2007). "Content-Based Image Annotation Refinement". IEEE Conference on Computer Vision and Pattern Recognition (CVPR 07). doi:10.1109/CVPR.2007.383221.
Ilaria Bartolini & Paolo Ciaccia (2007). "Imagination: Exploiting Link Analysis for Accurate Image Annotation". Springer Adaptive Multimedia Retrieval. doi:10.1007/978-3-540-79860-6_3.
Ilaria Bartolini & Paolo Ciaccia (2010). "Multi-dimensional Keyword-based Image Annotation and Search". 2nd ACM International Workshop on Keyword Search on Structured Data (KEYS 2010).
Emre Akbas & Fatos Y. Vural (2007). "Automatic Image Annotation by Ensemble of Visual Descriptors". IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) 2007, Workshop on Semantic Learning Applications in Multimedia. doi:10.1109/CVPR.2007.383484. hdl:11511/16027.
Ameesh Makadia and Vladimir Pavlovic and Sanjiv Kumar (2008). "A New Baseline for Image Annotation" (PDF). European Conference on Computer Vision (ECCV).

Chong Wang and David Blei and Li Fei-Fei (2009). "Simultaneous Image Classification and Annotation" (PDF). Conf. on Computer Vision and Pattern Recognition (CVPR).
Matthieu Guillaumin and Thomas Mensink and Jakob Verbeek and Cordelia Schmid (2009). "TagProp: Discriminative Metric Learning in Nearest Neighbor Models for Image Auto-Annotation" (PDF). Intl. Conf. on Computer Vision (ICCV).
Yashaswi Verma & C. V. Jawahar (2012). "Image Annotation Using Metric Learning in Semantic Neighbourhoods" (PDF). European Conference on Computer Vision (ECCV). Archived from the original (PDF) on 2013-05-14. Retrieved 2014-02-26.
Venkatesh N. Murthy & Subhransu Maji and R. Manmatha (2015). "Automatic Image Annotation Using Deep Learning Representations" (PDF). International Conference on Multimedia (ICMR).
Sarin, Supheakmungkol; Fahrmair, Michael; Wagner, Matthias & Kameyama, Wataru (2012). "Leveraging Features from Background and Salient Regions for Automatic Image Annotation". Journal of Information Processing. Vol. 20. pp. 250–266.
N. B. Marvasti & E. Yörük and B. Acar (2018). "Computer-Aided Medical Image Annotation: Preliminary Results With Liver Lesions in CT". IEEE Journal of Biomedical and Health Informatics.