Automatic image annotation

Last updated July 20, 2025

Automatic image annotation (also known as automatic image tagging or linguistic indexing) is the process by which a computer system automatically assigns metadata in the form of captioning or keywords to a digital image. This application of computer vision techniques is used in image retrieval systems to organize and locate images of interest from a database.

This method can be regarded as a type of multi-class image classification with a very large number of classes - as large as the vocabulary size. Typically, image analysis in the form of extracted feature vectors and the training annotation words are used by machine learning techniques to attempt to automatically apply annotations to new images.^[1] The first methods learned the correlations between image features and training annotations. Subsequently, techniques were developed using machine translation to to attempt to translate the textual vocabulary into the 'visual vocabulary,' represented by clustered regions known as blobs. Subsequent work has included classification approaches, relevance models, and other related methods.

The advantages of automatic image annotation versus content-based image retrieval (CBIR) are that queries can be more naturally specified by the user.^[2] At present, Content-Based Image Retrieval (CBIR) generally requires users to search by image concepts such as color and texture or by finding example queries. However, certain image features in example images may override the concept that the user is truly focusing on. Traditional methods of image retrieval, such as those used by libraries, have relied on manually annotated images, which is expensive and time-consuming, especially given the large and constantly growing image databases in existence.

Types of Image Annotation

2D Bounding Boxes : It defines the boundaries of objects in a two-dimensional space using graphic representations.
Object Detection : This deals with detection of instances of semantic objects pertaining to a certain class (humans, buildings, or cars) with respect to digital images and videos.
Key Point Annotation : The Key Point image data annotation recognizes facial gestures, human poses, expressions, emotions, body language, and sentiments through the connection of multiple dots.
Polygon Annotation : This involves marking and drawing shapes on a digital image. It allows marking objects within an image based on their position and orientation.
3D Cuboid Annotation : This is used for detecting and recognizing 3D objects in images.
Semantic Segmentation : A semantic segmentation technique is used in computer vision to segment images. An image dataset is semantically segmented to locate all categories and classes.
Image Classification : Images or objects are classified within images as per custom multi-level taxonomies like land use, crops, etc.
Skeletal Annotation : This is used to highlight body movement and alignment.^[3]

References

↑ Barrat, Sabine; Tabbone, Salvatore (2010-05-01). "Modeling, classifying and annotating weakly annotated images using Bayesian network". Journal of Visual Communication and Image Representation. 21 (4): 355–363. doi:10.1016/j.jvcir.2010.02.010. ISSN 1047-3203.
↑ "Archived copy" (PDF). i.yz.yamagata-u.ac.jp. Archived from the original (PDF) on 8 August 2014. Retrieved 13 January 2022.{{cite web}}: CS1 maint: archived copy as title (link)
↑ "Types of Image Annotation". cogitotech.com/computer-vision/image-annotation/.

Datta, Ritendra; Dhiraj Joshi; Jia Li; James Z. Wang (2008). "Image Retrieval: Ideas, Influences, and Trends of the New Age". ACM Computing Surveys. 40 (2): 1–60. doi:10.1145/1348246.1348248. S2CID 7060187.
Nicolas Hervé; Nozha Boujemaa (2007). "Image annotation : which approach for realistic databases ?" (PDF). ACM International Conference on Image and Video Retrieval. Archived from the original (PDF) on 2011-05-20.
M Inoue (2004). "On the need for annotation-based image retrieval" (PDF). Workshop on Information Retrieval in Context. pp. 44–46. Archived from the original (PDF) on 2014-08-08.

Quan Hoang Lam; Quang Duy Le; Kiet Van Nguyen; Ngan Luu-Thuy Nguyen (2020). "UIT-ViIC: A Dataset for the First Evaluation on Vietnamese Image Captioning". Proceedings of the 2020 International Conference on Computational Collective Intelligence (ICCCI 2020). arXiv: 2002.00175 . doi:10.1007/978-3-030-63007-2_57.

Natural scene annotation

J Fan; Y Gao; H Luo; G Xu (2004). "Automatic Image Annotation by Using Concept-Sensitive Salient Objects for Image Content Representation". Proceedings of the 27th annual international conference on Research and development in information retrieval. pp. 361–368.

Relevant low-level global filters

A Oliva & A Torralba (2001). "Modeling the shape of the scene: a holistic representation of the spatial envelope" (PDF). International Journal of Computer Vision. pp. 42:145–175.

Global image features and nonparametric density estimation

A Yavlinsky, E Schofield & S Rüger (2005). "Automated Image Annotation Using Global Features and Robust Nonparametric Density Estimation" (PDF). Int'l Conf on Image and Video Retrieval (CIVR, Singapore, Jul 2005). Archived from the original (PDF) on 2005-12-20.

Video semantics

N Vasconcelos & A Lippman (2001). "Statistical Models of Video Structure for Content Analysis and Characterization" (PDF). IEEE Transactions on Image Processing. pp. 1–17.

Ilaria Bartolini; Marco Patella & Corrado Romani (2010). "Shiatsu: Semantic-based Hierarchical Automatic Tagging of Videos by Segmentation Using Cuts". 3rd ACM International Multimedia Workshop on Automated Information Extraction in Media Production (AIEMPro10).

Image Annotation Refinement

Yohan Jin; Latifur Khan; Lei Wang & Mamoun Awad (2005). "Image annotations by combining multiple evidence & wordNet". 13th Annual ACM International Conference on Multimedia (MM 05). pp. 706–715.

Changhu Wang; Feng Jing; Lei Zhang & Hong-Jiang Zhang (2006). "Image annotation refinement using random walk with restarts". 14th Annual ACM International Conference on Multimedia (MM 06).

Changhu Wang; Feng Jing; Lei Zhang & Hong-Jiang Zhang (2007). "content-based image annotation refinement". IEEE Conference on Computer Vision and Pattern Recognition (CVPR 07). doi:10.1109/CVPR.2007.383221.

Ilaria Bartolini & Paolo Ciaccia (2007). "Imagination: Exploiting Link Analysis for Accurate Image Annotation". Springer Adaptive Multimedia Retrieval. doi:10.1007/978-3-540-79860-6_3.

Ilaria Bartolini & Paolo Ciaccia (2010). "Multi-dimensional Keyword-based Image Annotation and Search". 2nd ACM International Workshop on Keyword Search on Structured Data (KEYS 2010).

Automatic Image Annotation by Ensemble of Visual Descriptors

Emre Akbas & Fatos Y. Vural (2007). "Automatic Image Annotation by Ensemble of Visual Descriptors". Intl. Conf. on Computer Vision (CVPR) 2007, Workshop on Semantic Learning Applications in Multimedia. doi:10.1109/CVPR.2007.383484. hdl: 11511/16027 .

A New Baseline for Image Annotation

Ameesh Makadia and Vladimir Pavlovic and Sanjiv Kumar (2008). "A New Baseline for Image Annotation" (PDF). European Conference on Computer Vision (ECCV).

Simultaneous Image Classification and Annotation

Chong Wang and David Blei and Li Fei-Fei (2009). "Simultaneous Image Classification and Annotation" (PDF). Conf. on Computer Vision and Pattern Recognition (CVPR).

TagProp: Discriminative Metric Learning in Nearest Neighbor Models for Image Auto-Annotation

Matthieu Guillaumin and Thomas Mensink and Jakob Verbeek and Cordelia Schmid (2009). "TagProp: Discriminative Metric Learning in Nearest Neighbor Models for Image Auto-Annotation" (PDF). Intl. Conf. on Computer Vision (ICCV).

Image Annotation Using Metric Learning in Semantic Neighbourhoods

Yashaswi Verma & C. V. Jawahar (2012). "Image Annotation Using Metric Learning in Semantic Neighbourhoods" (PDF). European Conference on Computer Vision (ECCV). Archived from the original (PDF) on 2013-05-14. Retrieved 2014-02-26.

Automatic Image Annotation Using Deep Learning Representations

Venkatesh N. Murthy & Subhransu Maji and R. Manmatha (2015). "Automatic Image Annotation Using Deep Learning Representations" (PDF). International Conference on Multimedia (ICMR).

Holistic Image Annotation using Salient Regions and Background Image Information

Sarin, Supheakmungkol; Fahrmair, Michael; Wagner, Matthias & Kameyama, Wataru (2012). Leveraging Features from Background and Salient Regions for Automatic Image Annotation. Journal of Information Processing. Vol. 20. pp. 250–266.

Medical Image Annotation using bayesian networks and active learning

N. B. Marvasti & E. Yörük and B. Acar (2018). "Computer-Aided Medical Image Annotation: Preliminary Results With Liver Lesions in CT". IEEE Journal of Biomedical and Health Informatics.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Barrat, Sabine; Tabbone, Salvatore (2010-05-01). "Modeling, classifying and annotating weakly annotated images using Bayesian network". Journal of Visual Communication and Image Representation. 21 (4): 355–363. doi:10.1016/j.jvcir.2010.02.010. ISSN 1047-3203.

[2] "Archived copy" (PDF). i.yz.yamagata-u.ac.jp. Archived from the original (PDF) on 8 August 2014. Retrieved 13 January 2022.{{cite web}}: CS1 maint: archived copy as title (link)

[3] "Types of Image Annotation". cogitotech.com/computer-vision/image-annotation/.

[1]

[2]

[3]

Automatic image annotation

Contents

Types of Image Annotation

See also

References

Further reading