The Monk Skin Tone Scale is an open-source, 10-shade scale describing human skin color, developed by Ellis Monk in partnership with Google and released in May 2022. [1] It is meant to replace the Fitzpatrick scale in fields such as computer vision research, after an IEEE study found the Fitzpatrick scale to be "poorly predictive of skin tone" and advised that it "not be used as such in evaluations of computer vision applications." [2] In particular, the Fitzpatrick scale was found to under-represent darker shades of skin relative to the global human population.
The following table shows the 10 categories of the Monk Skin Tone Scale alongside the six categories of the Fitzpatrick scale, grouped into broad skin tone categories: [3]
| Skin tone group | Monk scale levels (of 10) | Monk allocation | Fitzpatrick scale levels (of 6) | Fitzpatrick allocation |
|---|---|---|---|---|
| Light | 1–3 | 30% | I–II | 33% |
| Medium | 4–6 | 30% | III–IV | 33% |
| Dark | 7–10 | 40% | V–VI | 33% |
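As an illustration of how these groupings might be applied in practice, the following sketch tallies a list of per-image Monk levels into the three groups above and reports each group's share. The `labels` list is hypothetical; the group boundaries follow the table.

```python
from collections import Counter

# Hypothetical per-image Monk Skin Tone annotations (levels 1-10).
labels = [1, 2, 4, 7, 9, 10, 3, 5, 8, 6, 7, 10]

def group(level: int) -> str:
    """Map a Monk level to the broad groups used in the table above."""
    if 1 <= level <= 3:
        return "Light"
    if 4 <= level <= 6:
        return "Medium"
    if 7 <= level <= 10:
        return "Dark"
    raise ValueError(f"not a Monk level: {level}")

counts = Counter(group(lv) for lv in labels)
total = sum(counts.values())
for name in ("Light", "Medium", "Dark"):
    print(f"{name}: {100.0 * counts[name] / total:.0f}%")
```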
Computer vision researchers initially adopted the Fitzpatrick scale as a metric to evaluate how well a given collection of photos of people sampled the global population. [4] However, the Fitzpatrick scale was developed to predict the risk of skin cancer in lighter-skinned people, and did not initially include darker skin tones at all. Two additional tones for darker skin were later added to the original four to make the scale more inclusive. Despite this extension, research has found that the Fitzpatrick scale correlated more with self-reported race than with objective measurements of skin tone, [2] and that computer vision models trained using the Fitzpatrick scale perform poorly on images of people with darker skin. [5]
The Monk scale includes 10 skin tones. Though other scales (such as those used by cosmetics companies) may include many more shades, [6] Monk claims that a 10-tone scale balances diversity with ease of use, and can be applied more consistently by different users than a scale with more tones:
Usually, if you got past 10 or 12 points on these types of scales [and] ask the same person to repeatedly pick out the same tones, the more you increase that scale, the less people are able to do that. Cognitively speaking, it just becomes really hard to accurately and reliably differentiate. [5]
The primary intended application of the scale is in evaluating datasets for training computer vision models. Other proposed applications include increasing the diversity of image search results, so that an image search for "doctor" returns images of doctors with a broad range of skin tones. [5]
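One way such an evaluation could work is to assign each sampled skin color to the nearest of the ten Monk swatches. The sketch below does this with Euclidean distance in RGB; the swatch values here are a placeholder light-to-dark ramp, not the officially published colors, and real systems would typically compare in a perceptual color space such as CIELAB.

```python
import numpy as np

# Placeholder swatches: a synthetic light-to-dark ramp standing in for
# the ten official Monk Skin Tone colors (illustrative values only).
SWATCHES = np.linspace([246, 237, 228], [41, 36, 32], num=10)  # RGB, levels 1-10

def nearest_monk_level(rgb) -> int:
    """Return the Monk level (1-10) whose swatch is closest in RGB space."""
    distances = np.linalg.norm(SWATCHES - np.asarray(rgb, dtype=float), axis=1)
    return int(np.argmin(distances)) + 1

print(nearest_monk_level([210, 180, 160]))  # a medium-light sample color
```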
Google has cautioned against equating the shades in the scale with race, noting that skin tone can vary widely within race. [7]
The Monk scale is licensed under the Creative Commons Attribution 4.0 International license. [8]
Image registration is the process of transforming different sets of data into one coordinate system. Data may be multiple photographs, data from different sensors, times, depths, or viewpoints. It is used in computer vision, medical imaging, military automatic target recognition, and compiling and analyzing images and data from satellites. Registration is necessary in order to be able to compare or integrate the data obtained from these different measurements.
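As a minimal illustration of registration, the sketch below fits a 2-D affine transform to matched point pairs by least squares. The correspondences are made up; a real pipeline would obtain them from feature matching and use robust estimation such as RANSAC.

```python
import numpy as np

# Hypothetical matched landmarks: src[i] in image A corresponds to dst[i] in image B.
src = np.array([[10.0, 12.0], [40.0, 15.0], [25.0, 60.0], [70.0, 55.0]])
dst = np.array([[13.5, 20.1], [43.2, 24.8], [27.0, 69.5], [71.8, 66.2]])

# Solve dst ~= [src | 1] @ M for the 3x2 affine parameter matrix M.
A = np.hstack([src, np.ones((len(src), 1))])
M, *_ = np.linalg.lstsq(A, dst, rcond=None)

# Register: map source points into the destination coordinate system.
registered = A @ M
print(np.round(registered, 2))
```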
Video quality is a characteristic of a video passed through a video transmission or processing system that describes perceived video degradation. Video processing systems may introduce some amount of distortion or artifacts in the video signal that negatively impact the user's perception of the system. For many stakeholders in video production and distribution, ensuring video quality is an important task.
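Objective metrics are commonly used to quantify such degradation. One standard full-reference metric is peak signal-to-noise ratio (PSNR), sketched below for 8-bit frames; the random frames stand in for a reference frame and its degraded version.

```python
import numpy as np

def psnr(reference: np.ndarray, degraded: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two same-sized frames."""
    mse = np.mean((reference.astype(float) - degraded.astype(float)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
ref = rng.integers(0, 256, (480, 640), dtype=np.uint8)      # stand-in frame
noisy = np.clip(ref + rng.normal(0, 5, ref.shape), 0, 255).astype(np.uint8)
print(f"{psnr(ref, noisy):.1f} dB")
```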
The Fitzpatrick scale is a numerical classification schema for human skin color. It was developed in 1975 by American dermatologist Thomas B. Fitzpatrick as a way to estimate the response of different types of skin to ultraviolet (UV) light. It was initially developed on the basis of skin color to determine the correct dose of UVA for PUVA therapy. When initial testing based only on hair and eye color resulted in UVA doses that were too high for some patients, the scale was altered to rely on the patient's reports of how their skin responds to the sun; it was also extended to a wider range of skin types. The Fitzpatrick scale remains a recognized tool for dermatological research into human skin pigmentation.
Articulated body pose estimation in computer vision is the study of algorithms and systems that recover the pose of an articulated body, which consists of joints and rigid parts, using image-based observations. It is one of the longest-lasting problems in computer vision because of the complexity of the models that relate observation with pose, and because of the variety of situations in which it would be useful.
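To make "joints and rigid parts" concrete, the sketch below computes the 2-D positions of a two-segment articulated limb from its joint angles (forward kinematics); the link lengths and angles are arbitrary. Pose estimation works in the opposite direction, recovering such angles from images.

```python
import numpy as np

def limb_positions(theta1, theta2, l1=1.0, l2=0.8):
    """Joint positions of a planar two-link limb; angles in radians.
    theta2 is measured relative to the first link."""
    elbow = np.array([l1 * np.cos(theta1), l1 * np.sin(theta1)])
    wrist = elbow + np.array([l2 * np.cos(theta1 + theta2),
                              l2 * np.sin(theta1 + theta2)])
    return elbow, wrist

elbow, wrist = limb_positions(np.pi / 4, -np.pi / 6)
print(elbow, wrist)
```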
In computer vision, the bag-of-words model, sometimes called the bag-of-visual-words model, can be applied to image classification or retrieval by treating image features as words. In document classification, a bag of words is a sparse vector of occurrence counts of words; that is, a sparse histogram over the vocabulary. In computer vision, a bag of visual words is a vector of occurrence counts of a vocabulary of local image features.
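A minimal sketch of this pipeline, assuming local descriptors (e.g., SIFT-like 128-dimensional vectors) have already been extracted: cluster them into a visual vocabulary, then describe each image by its histogram of nearest visual words. The random descriptors here stand in for real features.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in local descriptors, pooled over a training set of images.
train_descriptors = rng.normal(size=(1000, 128))

# Build a small visual vocabulary by k-means clustering.
k = 50
vocab = KMeans(n_clusters=k, n_init=10, random_state=0).fit(train_descriptors)

def bow_histogram(descriptors: np.ndarray) -> np.ndarray:
    """Histogram of visual-word occurrences for one image."""
    words = vocab.predict(descriptors)
    return np.bincount(words, minlength=k)

image_descriptors = rng.normal(size=(120, 128))  # descriptors of one image
print(bow_histogram(image_descriptors))
```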
Caltech 101 is a data set of digital images created in September 2003 and compiled by Fei-Fei Li, Marco Andreetto, Marc'Aurelio Ranzato and Pietro Perona at the California Institute of Technology. It is intended to facilitate computer vision research and is most applicable to techniques involving image recognition, classification and categorization. Caltech 101 contains a total of 9,146 images, split between 101 distinct object categories and a background category. Provided with the images are a set of annotations describing the outlines of each image, along with a Matlab script for viewing.
Automated tissue image analysis or histopathology image analysis (HIMA) is a process by which computer-controlled automatic test equipment is used to evaluate tissue samples, using computations to derive quantitative measurements from an image to avoid subjective errors.
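A toy example of deriving one such quantitative measurement: threshold a grayscale tissue image and report the stained-area fraction. Real HIMA systems use calibrated stains and far more sophisticated segmentation; the image and threshold here are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in for a grayscale tissue image (0 = dark stain, 255 = background).
image = rng.integers(0, 256, (512, 512), dtype=np.uint8)

stain_mask = image < 100      # arbitrary intensity threshold
fraction = stain_mask.mean()  # stained pixels / total pixels
print(f"stained-area fraction: {fraction:.1%}")
```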
The MNIST database is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, it was not well-suited for machine learning experiments. Furthermore, the black and white images from NIST were normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced grayscale levels.
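The resulting format is easy to inspect; the sketch below loads MNIST through Keras (one of several common distribution channels for the dataset) and confirms the 28x28 shape.

```python
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape, x_test.shape)  # (60000, 28, 28) (10000, 28, 28)
print(y_train[:10])                 # digit labels 0-9
```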
A convolutional neural network (CNN) is a regularized type of feed-forward neural network that learns features by itself via filter optimization. This type of deep learning network has been applied to process and make predictions from many different types of data, including text, images and audio. Convolution-based networks are the de facto standard in deep learning-based approaches to computer vision and image processing, and have only recently been replaced, in some cases, by newer deep learning architectures such as the transformer. Vanishing gradients and exploding gradients, seen during backpropagation in earlier neural networks, are prevented by using regularized weights over fewer connections. For example, a single neuron in a fully connected layer would require 10,000 weights to process an image sized 100 × 100 pixels, whereas a 5 × 5 convolution kernel requires only 25 shared weights, applied across every 5 × 5 tile of the image. Higher-layer features are extracted from wider context windows, compared to lower-layer features.
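This parameter arithmetic can be checked directly; a small sketch using PyTorch: a dense neuron over a 100 × 100 input holds 10,000 weights (plus a bias), while a 5 × 5 convolution kernel holds 25 shared weights (plus a bias) regardless of image size.

```python
import torch.nn as nn

dense = nn.Linear(100 * 100, 1)        # one fully connected neuron
conv = nn.Conv2d(1, 1, kernel_size=5)  # one 5x5 convolution kernel

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dense))  # 10001 = 10000 weights + 1 bias
print(count(conv))   # 26 = 25 shared weights + 1 bias
```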
DeepDream is a computer vision program created by Google engineer Alexander Mordvintsev that uses a convolutional neural network to find and enhance patterns in images via algorithmic pareidolia, thus creating a dream-like appearance reminiscent of a psychedelic experience in the deliberately overprocessed images.
The ImageNet project is a large visual database designed for use in visual object recognition software research. More than 14 million images have been hand-annotated by the project to indicate what objects are pictured, and in at least one million of the images, bounding boxes are also provided. ImageNet contains more than 20,000 categories, with a typical category, such as "balloon" or "strawberry", consisting of several hundred images. The database of annotations of third-party image URLs is freely available directly from ImageNet, though the actual images are not owned by ImageNet. Since 2010, the ImageNet project has run an annual software contest, the ImageNet Large Scale Visual Recognition Challenge, where software programs compete to correctly classify and detect objects and scenes. The challenge uses a "trimmed" list of one thousand non-overlapping classes.
Dorin Comaniciu is a Romanian-American computer scientist. He is the Senior Vice President of Artificial Intelligence and Digital Innovation at Siemens Healthcare.
An event camera, also known as a neuromorphic camera, silicon retina or dynamic vision sensor, is an imaging sensor that responds to local changes in brightness. Event cameras do not capture images using a shutter as conventional (frame) cameras do. Instead, each pixel inside an event camera operates independently and asynchronously, reporting changes in brightness as they occur, and staying silent otherwise.
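A simplified simulation of this pixel behaviour: compare successive log-intensity frames and emit an event wherever the change exceeds a contrast threshold. Real sensors operate asynchronously per pixel; this frame-based approximation and its threshold are only illustrative.

```python
import numpy as np

def events_from_frames(prev, curr, threshold=0.2):
    """Return (+1/-1) events where log-brightness changed by more than threshold."""
    eps = 1e-6  # avoid log(0)
    diff = np.log(curr.astype(float) + eps) - np.log(prev.astype(float) + eps)
    events = np.zeros_like(diff, dtype=np.int8)
    events[diff > threshold] = 1    # brightness increased
    events[diff < -threshold] = -1  # brightness decreased
    return events                   # 0 means the pixel stays silent

rng = np.random.default_rng(0)
prev = rng.integers(1, 256, (4, 4), dtype=np.uint8)  # stand-in frames
curr = rng.integers(1, 256, (4, 4), dtype=np.uint8)
print(events_from_frames(prev, curr))
```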
Michael J. Black is an American-born computer scientist working in Tübingen, Germany. He is a founding director at the Max Planck Institute for Intelligent Systems where he leads the Perceiving Systems Department in research focused on computer vision, machine learning, and computer graphics. He is also an Honorary Professor at the University of Tübingen.
Rita Cucchiara is an Italian electrical and computer engineer, and a professor of computer engineering and science in the Enzo Ferrari Department of Engineering at the University of Modena and Reggio Emilia (UNIMORE) in Italy. She teaches the courses “Computer Architecture” and “Computer Vision and Cognitive Systems”. Cucchiara's research focuses on artificial intelligence, specifically deep network technologies and computer vision for human behavior understanding (HBU) and visual, language and multimodal generative AI. She is the scientific coordinator of the AImage Lab at UNIMORE and director of the Artificial Intelligence Research and Innovation Center (AIRI) as well as the ELLIS Unit at Modena. From 2018 to 2021 she was founder and director of AIIS, the Italian national laboratory for artificial intelligence and intelligent systems of CINI. She was also president of the CVPL from 2016 to 2018, and has been an IAPR Fellow since 2006 and an ELLIS Fellow since 2020.
80 Million Tiny Images is a dataset intended for training machine learning systems. It contains 79,302,017 32×32 pixel color images, scaled down from images extracted from the World Wide Web in 2008 using automated web search queries on a set of 75,062 non-abstract nouns derived from WordNet. The words in the search terms were then used as labels for the images. The researchers used seven web search resources for this purpose: Altavista, Ask.com, Flickr, Cydral, Google, Picsearch and Webshots.
Video super-resolution (VSR) is the process of generating high-resolution video frames from given low-resolution video frames. Unlike single-image super-resolution (SISR), the main goal is not only to restore fine details while preserving coarse ones, but also to preserve motion consistency.
A vision transformer (ViT) is a transformer designed for computer vision. A ViT breaks down an input image into a series of patches, serialises each patch into a vector, and maps it to a smaller dimension with a single matrix multiplication. These vector embeddings are then processed by a transformer encoder as if they were token embeddings.
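The patch-and-project step can be written in a few lines of NumPy; the dimensions here are illustrative (a 32 × 32 RGB image, 8 × 8 patches, 64-dimensional embeddings), and the projection matrix is random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))  # H x W x C input image
P, D = 8, 64                          # patch size, embedding dimension

# Break the image into non-overlapping P x P patches and flatten each one.
patches = image.reshape(32 // P, P, 32 // P, P, 3).transpose(0, 2, 1, 3, 4)
tokens = patches.reshape(-1, P * P * 3)  # 16 patch vectors of 192 values each

# A single matrix multiplication maps each patch vector to the model dimension.
W = rng.normal(size=(P * P * 3, D))
embeddings = tokens @ W
print(embeddings.shape)  # (16, 64) -> processed by the transformer encoder
```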