Monk Skin Tone Scale

[Image: The ten orbs of the Monk Skin Tone Scale]

The Monk Skin Tone Scale is an open-source, 10-shade scale describing human skin color, developed by Ellis Monk in partnership with Google.[1] It is meant to replace the Fitzpatrick scale in fields such as computer vision research, after an IEEE study found the Fitzpatrick scale to be "poorly predictive of skin tone" and advised it "not be used as such in evaluations of computer vision applications."[2] In particular, the Fitzpatrick scale was found to under-represent darker shades of skin relative to the global human population.

The following table shows the 10 categories of the Monk Skin Tone Scale alongside the six categories of the Fitzpatrick scale, grouped into broad skin tone categories:[3]

Skin tone group    Monk skin tone scale    Fitzpatrick scale
Light              1–3                     I–II
Medium             4–6                     III–IV
Dark               7–10                    V–VI
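
This grouping lends itself to a direct programmatic mapping. The following is a minimal sketch that simply transcribes the table above; the function names and error handling are illustrative and not part of any published library:

```python
def monk_group(tone: int) -> str:
    """Map a Monk Skin Tone Scale value (1-10) to a broad group."""
    if not 1 <= tone <= 10:
        raise ValueError(f"Monk tone must be between 1 and 10, got {tone}")
    if tone <= 3:
        return "Light"
    if tone <= 6:
        return "Medium"
    return "Dark"


def fitzpatrick_group(skin_type: int) -> str:
    """Map a Fitzpatrick type (1-6, i.e. I-VI) to the same broad group."""
    if not 1 <= skin_type <= 6:
        raise ValueError(f"Fitzpatrick type must be between 1 and 6, got {skin_type}")
    if skin_type <= 2:
        return "Light"
    if skin_type <= 4:
        return "Medium"
    return "Dark"


# Example: Monk tone 5 and Fitzpatrick type III fall in the same broad group.
assert monk_group(5) == fitzpatrick_group(3) == "Medium"
```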

Predecessor

Computer vision researchers initially adopted the Fitzpatrick scale as a metric to evaluate how well a given collection of photos of people sampled the global population.[4] However, the Fitzpatrick scale was developed to predict the risk of skin cancer in lighter-skinned people, and did not initially include darker skin tones at all; two darker tones were later added to the original four to make the scale more inclusive. Even so, research has found that Fitzpatrick skin type ratings correlate more strongly with self-reported race than with objective measurements of skin tone,[2] and that computer vision models trained on datasets evaluated with the Fitzpatrick scale perform poorly on images of people with darker skin.[5]

Use

The Monk scale includes 10 skin tones. Though other scales (such as those used by cosmetics companies) may include many more shades,[6] Monk argues that 10 tones balance diversity with ease of use, and can be applied more consistently by different users than a scale with more tones:

Usually, if you got past 10 or 12 points on these types of scales [and] ask the same person to repeatedly pick out the same tones, the more you increase that scale, the less people are able to do that. Cognitively speaking, it just becomes really hard to accurately and reliably differentiate.[5]
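
The reliability concern Monk describes can be made concrete as a test-retest agreement rate: the same annotator labels the same images twice, and one measures how often the two passes match. This is an illustrative sketch, not a method from the cited studies:

```python
def retest_agreement(first_pass: list[int], second_pass: list[int]) -> float:
    """Fraction of images for which an annotator chose the same tone in
    two separate labeling passes (exact-match test-retest agreement)."""
    if len(first_pass) != len(second_pass):
        raise ValueError("Both passes must cover the same images")
    matches = sum(a == b for a, b in zip(first_pass, second_pass))
    return matches / len(first_pass)


# Hypothetical repeat annotations by one rater over five images;
# finer-grained scales tend to push this number down.
print(retest_agreement([3, 5, 7, 2, 9], [3, 5, 6, 2, 9]))  # 0.8
```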

The primary intended application of the scale is in evaluating datasets for training computer vision models. Other proposed applications include increasing the diversity of image search results, so that an image search for "doctor" returns images of doctors with a broad range of skin tones.[5]
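
As an illustration of the dataset-evaluation use case, the following hypothetical sketch assumes each image in a dataset carries a human-annotated Monk tone, and flags tones that are under-represented. The data and the 5% coverage threshold are invented for the example:

```python
from collections import Counter


def tone_distribution(annotations: list[int]) -> dict[int, float]:
    """Fraction of images per Monk tone (1-10). Tones absent from the
    dataset appear with frequency 0.0, which is what an audit looks for."""
    counts = Counter(annotations)
    total = len(annotations)
    return {tone: counts.get(tone, 0) / total for tone in range(1, 11)}


# Hypothetical per-image Monk annotations for a small face dataset.
dataset_tones = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 7, 8]
dist = tone_distribution(dataset_tones)

# Flag tones falling below the 5% coverage threshold.
underrepresented = [tone for tone, freq in dist.items() if freq < 0.05]
print(f"Under-represented Monk tones: {underrepresented}")  # [9, 10]
```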

Google has cautioned against equating the shades in the scale with race, noting that skin tone can vary widely within racial groups.[7]

The Monk scale is licensed under the Creative Commons Attribution 4.0 International license.[8]

See also

Fitzpatrick scale

References

  1. Monk, Ellis (2023-05-04). "The Monk Skin Tone Scale". SocArXiv. doi:10.31235/osf.io/pdf4c.
  2. Howard, John J.; Sirotin, Yevgeniy B.; Tipton, Jerry L.; Vemury, Arun R. (2021-10-27). "Reliability and Validity of Image-Based and Self-Reported Skin Phenotype Metrics". IEEE Transactions on Biometrics, Behavior, and Identity Science. 3 (4): 550–560. arXiv:2106.11240. doi:10.1109/TBIOM.2021.3123550. ISSN 2637-6407. S2CID 235490065.
  3. Heldreth, Courtney M.; Monk, Ellis P.; Clark, Alan T.; Schumann, Candice; Eyee, Xango; Ricco, Susanna (2024-03-31). "Which Skin Tone Measures Are the Most Inclusive? An Investigation of Skin Tone Measures for Artificial Intelligence". ACM Journal on Responsible Computing. 1 (1): 1–21. doi:10.1145/3632120. Retrieved 2024-04-15.
  4. Hazirbas, Caner; Bitton, Joanna; Dolhansky, Brian; Pan, Jacqueline; Gordo, Albert; Ferrer, Cristian Canton (2021-06-19). "Casual Conversations: A dataset for measuring fairness in AI". 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE. pp. 2289–2293. doi:10.1109/CVPRW53098.2021.00258. ISBN 978-1-6654-4899-4. S2CID 235691583.
  5. Vincent, James (2022-05-11). "Google is using a new way to measure skin tones to make search results more inclusive". The Verge. Retrieved 2023-06-10.
  6. "The Scale". Skin Tone Research @ Google. Retrieved 2023-06-10.
  7. "Recommended Practices". Skin Tone Research @ Google. Retrieved 2023-06-10.
  8. "Get Started". Skin Tone Research @ Google. Retrieved 2023-06-10.