Monk Skin Tone Scale

The Monk Skin Tone Scale is an open-source, 10-shade scale describing human skin color, developed by the sociologist Ellis Monk in partnership with Google.[1] It is intended to replace the Fitzpatrick scale in fields such as computer vision research, after an IEEE study found the Fitzpatrick scale to be "poorly predictive of skin tone" and advised that it "not be used as such in evaluations of computer vision applications."[2] In particular, the Fitzpatrick scale was found to under-represent darker shades of skin relative to the global human population.

The ten orbs of the Monk Skin Tone Scale

Predecessor

Computer vision researchers initially adopted the Fitzpatrick scale as a metric for evaluating how well a given collection of photos of people sampled the global population.[3] However, the Fitzpatrick scale was developed to predict the risk of skin cancer in lighter-skinned people, and did not initially include darker skin tones at all; two darker tones were later added to the original four to make the scale more inclusive. Even so, research has found that Fitzpatrick skin type correlates more strongly with self-reported race than with objective measurements of skin tone,[2] and that computer vision models trained using the Fitzpatrick scale perform poorly on images of people with darker skin.[4]

Use

The Monk scale includes 10 skin tones. Though other scales (such as those used by cosmetics companies) may include many more shades,[5] Monk argues that ten tones balance diversity with ease of use, and can be applied more consistently across different users than a scale with more tones:

Usually, if you got past 10 or 12 points on these types of scales [and] ask the same person to repeatedly pick out the same tones, the more you increase that scale, the less people are able to do that. Cognitively speaking, it just becomes really hard to accurately and reliably differentiate. [4]
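As an illustration of how a tool might bucket a measured color into a small fixed scale like this one, the sketch below assigns an RGB sample to the nearest of ten reference tones. The RGB values here are invented placeholders, not Google's published swatches, and `nearest_monk_tone` is a hypothetical helper; a production implementation would use the official hex values and a perceptual color space such as CIELAB rather than raw RGB distance.

```python
# PLACEHOLDER reference tones, light (1) to dark (10) -- illustrative only,
# not the official Monk Skin Tone swatch values.
MST_PLACEHOLDER_RGB = {
    1: (246, 237, 228),
    2: (243, 231, 219),
    3: (247, 234, 208),
    4: (234, 218, 186),
    5: (215, 189, 150),
    6: (160, 126, 86),
    7: (130, 92, 67),
    8: (96, 65, 52),
    9: (58, 49, 42),
    10: (41, 36, 32),
}

def nearest_monk_tone(rgb):
    """Return the tone index (1-10) whose reference color is closest to rgb,
    using squared Euclidean distance in RGB space."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(MST_PLACEHOLDER_RGB,
               key=lambda tone: sq_dist(rgb, MST_PLACEHOLDER_RGB[tone]))
```

A small fixed palette like this is what makes the consistency argument above concrete: with only ten well-separated reference points, the nearest-neighbour assignment is stable even when the measured color is noisy.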

The primary intended application of the scale is in evaluating datasets for training computer vision models. Other proposed applications include increasing the diversity of image search results, so that an image search for "doctor" returns images of doctors with a broad range of skin tones. [4]
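The dataset-auditing use case amounts to checking how a dataset's images are distributed across the ten tone buckets. The sketch below assumes each image already carries a tone annotation (an integer from 1 to 10); the `mst_coverage` helper and its 5% threshold are illustrative, not part of any published methodology.

```python
from collections import Counter

def mst_coverage(labels, threshold=0.05):
    """Given per-image tone annotations (integers 1-10), return the fraction
    of images in each tone bucket and the list of tones falling below
    `threshold`. The 5% default is an arbitrary illustration."""
    counts = Counter(labels)
    total = len(labels)
    fractions = {tone: counts.get(tone, 0) / total for tone in range(1, 11)}
    under = [tone for tone, frac in fractions.items() if frac < threshold]
    return fractions, under

# Example: a dataset heavily skewed toward lighter tones.
labels = [1] * 40 + [2] * 30 + [3] * 20 + [8] * 6 + [10] * 4
fractions, under_represented = mst_coverage(labels)
# under_represented → [4, 5, 6, 7, 9, 10]
```

An audit like this surfaces exactly the failure mode described above for Fitzpatrick-era datasets: entire bands of darker tones with little or no representation.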

Google has cautioned against equating the shades in the scale with race, noting that skin tone varies widely within racial groups.[6]

The Monk scale is licensed under the Creative Commons Attribution 4.0 International license. [7]


References

  1. Monk, Ellis (2023-05-04). "The Monk Skin Tone Scale". SocArXiv. doi:10.31235/osf.io/pdf4c.
  2. Howard, John J.; Sirotin, Yevgeniy B.; Tipton, Jerry L.; Vemury, Arun R. (2021-10-27). "Reliability and Validity of Image-Based and Self-Reported Skin Phenotype Metrics". IEEE Transactions on Biometrics, Behavior, and Identity Science. 3 (4): 550–560. arXiv:2106.11240. doi:10.1109/TBIOM.2021.3123550. ISSN 2637-6407. S2CID 235490065.
  3. Hazirbas, Caner; Bitton, Joanna; Dolhansky, Brian; Pan, Jacqueline; Gordo, Albert; Ferrer, Cristian Canton (2021-06-19). "Casual Conversations: A Dataset for Measuring Fairness in AI". Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE: 2289–2293. doi:10.1109/CVPRW53098.2021.00258. ISBN 978-1-6654-4899-4. S2CID 235691583.
  4. Vincent, James (2022-05-11). "Google is using a new way to measure skin tones to make search results more inclusive". The Verge. Retrieved 2023-06-10.
  5. "The Scale". Skin Tone Research @ Google. Retrieved 2023-06-10.
  6. "Recommended Practices". Skin Tone Research @ Google. Retrieved 2023-06-10.
  7. "Get Started". Skin Tone Research @ Google. Retrieved 2023-06-10.