This is a list of image and video datasets for machine learning research, forming part of the broader list of datasets for machine-learning research. These datasets consist primarily of images or videos for tasks such as object detection, facial recognition, and multi-label classification.
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
80 Million Tiny Images | 80 million 32×32 images labelled with 75,062 non-abstract nouns. | | 80,000,000 | image, label | | 2008 | [1] | Torralba et al.
JFT-300M | Dataset internal to Google Research. 300M images with 375M labels in 18,291 categories. | | 300,000,000 | image, label | | 2017 | [2] | Google Research
Places | 10+ million images in 400+ scene classes, with 5,000 to 30,000 images per class. | | 10,000,000 | image, label | | 2018 | [3] | Zhou et al.
Ego 4D | A massive-scale, egocentric dataset and benchmark suite collected across 74 worldwide locations and 9 countries, with over 3,670 hours of daily-life activity video. | Object bounding boxes, transcriptions, labeling. | 3,670 video hours | video, audio, transcriptions | Multimodal first-person task | 2022 | [4] | K. Grauman et al. |
Wikipedia-based Image Text Dataset | 37.5 million image-text examples with 11.5 million unique images across 108 Wikipedia languages. | | 11,500,000 | image, caption | Pretraining, image captioning | 2021 | [5] | Srinivasan et al., Google Research
Visual Genome | Images and their descriptions. | | 108,000 | images, text | Image captioning | 2016 | [6] | R. Krishna et al.
Berkeley 3-D Object Dataset | 849 images taken in 75 different scenes. About 50 different object classes are labeled. | Object bounding boxes and labeling. | 849 | labeled images, text | Object recognition | 2014 | [7] [8] | A. Janoch et al. |
Berkeley Segmentation Data Set and Benchmarks 500 (BSDS500) | 500 natural images, explicitly separated into disjoint train, validation and test subsets + benchmarking code. Based on BSDS300. | Each image segmented by five different subjects on average. | 500 | Segmented images | Contour detection and hierarchical image segmentation | 2011 | [9] | University of California, Berkeley |
Microsoft Common Objects in Context (COCO) | Complex everyday scenes of common objects in their natural context. | Object highlighting, labeling, and classification into 91 object types. | 2,500,000 | Labeled images, text | Object recognition | 2015 | [10] [11] [12] | T. Lin et al.
SUN Database | Very large scene and object recognition database. | Places and objects are labeled. Objects are segmented. | 131,067 | Images, text | Object recognition, scene recognition | 2014 | [13] [14] | J. Xiao et al. |
ImageNet | Labeled object image database, used in the ImageNet Large Scale Visual Recognition Challenge | Labeled objects, bounding boxes, descriptive words, SIFT features | 14,197,122 | Images, text | Object recognition, scene recognition | 2009 (2014) | [15] [16] [17] | J. Deng et al. |
LSUN | Large-scale Scene UNderstanding. 10 scene categories (bedroom, etc) and 20 object categories (airplane, etc) | Images and labels. | ~60 million | Images, text | Object recognition, scene recognition | 2015 | [18] [19] [20] | Yu et al. |
Open Images | A Large set of images listed as having CC BY 2.0 license with image-level labels and bounding boxes spanning thousands of classes. | Image-level labels, Bounding boxes | 9,178,275 | Images, text | Classification, Object recognition | 2017 (V7 : 2022) | [21] | |
TV News Channel Commercial Detection Dataset | TV commercials and news broadcasts. | Audio and video features extracted from still images. | 129,685 | Text | Clustering, classification | 2015 | [22] [23] | P. Guha et al. |
Statlog (Image Segmentation) Dataset | The instances were drawn randomly from a database of 7 outdoor images and hand-segmented to create a classification for every pixel. | Many features calculated. | 2310 | Text | Classification | 1990 | [24] | University of Massachusetts |
Caltech 101 | Pictures of objects. | Detailed object outlines marked. | 9146 | Images | Classification, object recognition | 2003 | [25] [26] | F. Li et al. |
Caltech-256 | Large dataset of images for object classification. | Images categorized and hand-sorted. | 30,607 | Images, Text | Classification, object detection | 2007 | [27] [28] | G. Griffin et al. |
COYO-700M | Image–text-pair dataset | 10 billion pairs of alt-text and image sources in HTML documents in CommonCrawl | 746,972,269 | Images, Text | Classification, Image-Language | 2022 | [29] | |
SIFT10M Dataset | SIFT features of Caltech-256 dataset. | Extensive SIFT feature extraction. | 11,164,866 | Text | Classification, object detection | 2016 | [30] | X. Fu et al. |
LabelMe | Annotated pictures of scenes. | Objects outlined. | 187,240 | Images, text | Classification, object detection | 2005 | [31] | MIT Computer Science and Artificial Intelligence Laboratory |
PASCAL VOC Dataset | Large number of images for classification tasks. | Labeling, bounding box included | 500,000 | Images, text | Classification, object detection | 2010 | [32] [33] | M. Everingham et al. |
CIFAR-10 Dataset | Many small, low-resolution, images of 10 classes of objects. | Classes labelled, training set splits created. | 60,000 | Images | Classification | 2009 | [16] [34] | A. Krizhevsky et al. |
CIFAR-100 Dataset | Like CIFAR-10, above, but 100 classes of objects are given. | Classes labelled, training set splits created. | 60,000 | Images | Classification | 2009 | [16] [34] | A. Krizhevsky et al. |
CINIC-10 Dataset | A unified contribution of CIFAR-10 and Imagenet with 10 classes, and 3 splits. Larger than CIFAR-10. | Classes labelled, training, validation, test set splits created. | 270,000 | Images | Classification | 2018 | [35] | Luke N. Darlow, Elliot J. Crowley, Antreas Antoniou, Amos J. Storkey |
Fashion-MNIST | A MNIST-like fashion product database | Classes labelled, training set splits created. | 60,000 | Images | Classification | 2017 | [36] | Zalando SE |
notMNIST | Some publicly available fonts and extracted glyphs from them to make a dataset similar to MNIST. There are 10 classes, with letters A–J taken from different fonts. | Classes labelled, training set splits created. | 500,000 | Images | Classification | 2011 | [37] | Yaroslav Bulatov |
Linnaeus 5 dataset | Images of 5 classes of objects. | Classes labelled, training set splits created. | 8000 | Images | Classification | 2017 | [38] | Chaladze & Kalatozishvili |
11K Hands | 11,076 hand images (1600 × 1200 pixels) of 190 subjects aged 18–75, for gender recognition and biometric identification. | None | 11,076 hand images | Images and (.mat, .txt, and .csv) label files | Gender recognition and biometric identification | 2017 | [39] | M Afifi
CORe50 | Designed specifically for continual/lifelong learning and object recognition, it is a collection of more than 500 videos (30 fps) of 50 domestic objects belonging to 10 different categories. | Classes labelled, training set splits created based on a 3-way, multi-runs benchmark. | 164,866 RGB-D images | images (.png or .pkl) and (.pkl, .txt, .tsv) label files | Classification, Object recognition | 2017 | [40] | V. Lomonaco and D. Maltoni
OpenLORIS-Object | Lifelong/continual robotic vision dataset (OpenLORIS-Object) collected by real robots mounted with multiple high-resolution sensors; the first version includes 121 object instances (40 categories of daily necessities under 20 scenes). The dataset rigorously considers four environmental factors across scenes (illumination, occlusion, object pixel size, and clutter) and explicitly defines difficulty levels for each factor. | Classes labelled, training/validation/testing set splits created by benchmark scripts. | 1,106,424 RGB-D images | images (.png and .pkl) and (.pkl) label files | Classification, Lifelong object recognition, Robotic Vision | 2019 | [41] | Q. She et al.
THz and thermal video data set | This multispectral data set includes terahertz, thermal, visual, near infrared, and three-dimensional videos of objects hidden under people's clothes. | 3D lookup tables are provided that allow you to project images onto 3D point clouds. | More than 20 videos. The duration of each video is about 85 seconds (about 345 frames). | AP2J | Experiments with hidden object detection | 2019 | [42] [43] | Alexei A. Morozov and Olga S. Sushkova |
The following datasets cover object detection and scene understanding for self-driving vehicles and related road and rail transport domains.
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Cityscapes Dataset | Stereo video sequences recorded in street scenes, with pixel-level annotations. Metadata also included. | Pixel-level segmentation and labeling | 25,000 | Images, text | Classification, object detection | 2016 | [44] | Daimler AG et al. |
German Traffic Sign Detection Benchmark Dataset | Images from vehicles of traffic signs on German roads. These signs comply with UN standards and therefore are the same as in other countries. | Signs manually labeled | 900 | Images | Classification | 2013 | [45] [46] | S. Houben et al. |
KITTI Vision Benchmark Dataset | Autonomous vehicles driving through a mid-size city captured images of various areas using cameras and laser scanners. | Many benchmarks extracted from data. | >100 GB of data | Images, text | Classification, object detection | 2012 | [47] [48] [49] | A. Geiger et al. |
FieldSAFE | Multi-modal dataset for obstacle detection in agriculture including stereo camera, thermal camera, web camera, 360-degree camera, lidar, radar, and precise localization. | Classes labelled geographically. | >400 GB of data | Images and 3D point clouds | Classification, object detection, object localization | 2017 | [50] | M. Kragh et al. |
Daimler Monocular Pedestrian Detection dataset | It is a dataset of pedestrians in urban environments. | Pedestrians are box-wise labeled. | Labeled part contains 15560 samples with pedestrians and 6744 samples without. Test set contains 21790 images without labels. | Images | Object recognition and classification | 2006 | [51] [52] [53] | Daimler AG |
CamVid | The Cambridge-driving Labeled Video Database (CamVid) is a collection of videos. | The dataset is labeled with semantic labels for 32 semantic classes. | over 700 images | Images | Object recognition and classification | 2008 | [54] [55] [56] | Gabriel J. Brostow, Jamie Shotton, Julien Fauqueur, Roberto Cipolla |
RailSem19 | RailSem19 is a dataset for understanding scenes for vision systems on railways. | The dataset is labeled semantically and box-wise. | 8500 | Images | Object recognition and classification, scene recognition | 2019 | [57] [58] | Oliver Zendel, Markus Murschitz, Marcel Zeilinger, Daniel Steininger, Sara Abbasi, Csaba Beleznai
BOREAS | BOREAS is a multi-season autonomous driving dataset. It includes data from a Velodyne Alpha-Prime (128-beam) lidar, a FLIR Blackfly S camera, a Navtech CIR304-H radar, and an Applanix POS LV GNSS-INS. | The data is annotated with 3D bounding boxes. | 350 km of driving data | Images, Lidar and Radar data | Object recognition and classification, scene recognition | 2023 | [59] [60] | Keenan Burnett, David J. Yoon, Yuchen Wu, Andrew Zou Li, Haowei Zhang, Shichen Lu, Jingxing Qian, Wei-Kang Tseng, Andrew Lambert, Keith Y.K. Leung, Angela P. Schoellig, Timothy D. Barfoot
Bosch Small Traffic Lights Dataset | It is a dataset of traffic lights. | The labeling includes bounding boxes of traffic lights together with their state (active light). | 5000 images for training and a video sequence of 8334 frames for evaluation | Images | Traffic light recognition | 2017 | [61] [62] | Karsten Behrendt, Libor Novak, Rami Botros
FRSign | It is a dataset of French railway signals. | The labeling includes bounding boxes of railway signals together with their state (active light). | more than 100,000 | Images | Railway signal recognition | 2020 | [63] [64] | Jeanine Harb, Nicolas Rébéna, Raphaël Chosidow, Grégoire Roblin, Roman Potarusov, Hatem Hajri
GERALD | It is a dataset of German railway signals. | The labeling includes bounding boxes of railway signals together with their state (active light). | 5000 | Images | Railway signal recognition | 2023 | [65] [66] | Philipp Leibner, Fabian Hampel, Christian Schindler
Multi-cue pedestrian | Multi-cue onboard pedestrian detection dataset is a dataset for detection of pedestrians. | The dataset is labeled box-wise. | 1092 image pairs with 1776 boxes for pedestrians | Images | Object recognition and classification | 2009 | [67] | Christian Wojek, Stefan Walk, Bernt Schiele
RAWPED | RAWPED is a dataset for detection of pedestrians in the context of railways. | The dataset is labeled box-wise. | 26000 | Images | Object recognition and classification | 2020 | [68] [69] | Tugce Toprak, Burak Belenlioglu, Burak Aydın, Cuneyt Guzelis, M. Alper Selver |
OSDaR23 | OSDaR23 is a multi-sensor dataset for detection of objects in the context of railways. | The dataset is labeled box-wise. | 16874 frames | Images, Lidar, Radar and Infrared | Object recognition and classification | 2023 | [70] [71] | DZSF, Digitale Schiene Deutschland, and FusionSystems
Argoverse | Argoverse is a multi-sensor dataset for detection of objects in the context of roads. | The dataset is annotated box-wise. | 320 hours of recording | Data from 7 cameras and LiDAR | Object recognition and classification, object tracking | 2022 | [72] [73] | Argo AI, Carnegie Mellon University, Georgia Institute of Technology
In computer vision, face images have been used extensively to develop systems for facial recognition, face detection, and many other tasks that use images of faces.
Dataset name | Brief description | Preprocessing | Instances | Format | Default task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Aff-Wild | 298 videos of 200 individuals, ~1,250,000 manually annotated images: annotated in terms of dimensional affect (valence-arousal); in-the-wild setting; color database; various resolutions (average = 640x360) | the detected faces, facial landmarks and valence-arousal annotations | ~1,250,000 manually annotated images | video (visual + audio modalities) | affect recognition (valence-arousal estimation) | 2017 | CVPR [74] IJCV [75] | D. Kollias et al. |
Aff-Wild2 | 558 videos of 458 individuals, ~2,800,000 manually annotated images: annotated in terms of i) categorical affect (7 basic expressions: neutral, happiness, sadness, surprise, fear, disgust, anger); ii) dimensional affect (valence-arousal); iii) action units (AUs 1,2,4,6,12,15,20,25); in-the-wild setting; color database; various resolutions (average = 1030x630) | the detected faces, detected and aligned faces and annotations | ~2,800,000 manually annotated images | video (visual + audio modalities) | affect recognition (valence-arousal estimation, basic expression classification, action unit detection) | 2019 | BMVC [76] FG [77] | D. Kollias et al. |
FERET (facial recognition technology) | 11338 images of 1199 individuals in different positions and at different times. | None. | 11,338 | Images | Classification, face recognition | 2003 | [78] [79] | United States Department of Defense |
Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) | 7,356 video and audio recordings of 24 professional actors. 8 emotions each at two intensities. | Files labelled with expression. Perceptual validation ratings provided by 319 raters. | 7,356 | Video, sound files | Classification, face recognition, voice recognition | 2018 | [80] [81] | S.R. Livingstone and F.A. Russo |
SCFace | Color images of faces at various angles. | Location of facial features extracted. Coordinates of features given. | 4,160 | Images, text | Classification, face recognition | 2011 | [82] [83] | M. Grgic et al. |
Yale Face Database | Faces of 15 individuals in 11 different expressions. | Labels of expressions. | 165 | Images | Face recognition | 1997 | [84] [85] | J. Yang et al. |
Cohn-Kanade AU-Coded Expression Database | Large database of images with labels for expressions. | Tracking of certain facial features. | 500+ sequences | Images, text | Facial expression analysis | 2000 | [86] [87] | T. Kanade et al. |
JAFFE Facial Expression Database | 213 images of 7 facial expressions (6 basic facial expressions + 1 neutral) posed by 10 Japanese female models. | Images are cropped to the facial region. Includes semantic ratings data on emotion labels. | 213 | Images, text | Facial expression recognition | 1998 | [88] [89] | Lyons, Kamachi, Gyoba
FaceScrub | Images of public figures scrubbed from image searching. | Name and m/f annotation. | 107,818 | Images, text | Face recognition | 2014 | [90] [91] | H. Ng et al. |
BioID Face Database | Images of faces with eye positions marked. | Manually set eye positions. | 1521 | Images, text | Face recognition | 2001 | [92] [93] | BioID |
Skin Segmentation Dataset | Randomly sampled color values from face images. | B, G, R, values extracted. | 245,057 | Text | Segmentation, classification | 2012 | [94] [95] | R. Bhatt. |
Bosphorus | 3D Face image database. | 34 action units and 6 expressions labeled; 24 facial landmarks labeled. | 4652 | Images, text | Face recognition, classification | 2008 | [96] [97] | A Savran et al. |
UOY 3D-Face | neutral face, 5 expressions: anger, happiness, sadness, eyes closed, eyebrows raised. | labeling. | 5250 | Images, text | Face recognition, classification | 2004 | [98] [99] | University of York |
CASIA 3D Face Database | Expressions: Anger, smile, laugh, surprise, closed eyes. | None. | 4624 | Images, text | Face recognition, classification | 2007 | [100] [101] | Institute of Automation, Chinese Academy of Sciences |
CASIA NIR | Expressions: anger, disgust, fear, happiness, sadness, surprise. | None. | 480 | Annotated Visible Spectrum and Near Infrared Video captures at 25 frames per second | Face recognition, classification | 2011 | [102] | Zhao, G. et al.
BU-3DFE | neutral face, and 6 expressions: anger, happiness, sadness, surprise, disgust, fear (4 levels). 3D images extracted. | None. | 2500 | Images, text | Facial expression recognition, classification | 2006 | [103] | Binghamton University |
Face Recognition Grand Challenge Dataset | Up to 22 samples for each subject. Expressions: anger, happiness, sadness, surprise, disgust, puffy. 3D Data. | None. | 4007 | Images, text | Face recognition, classification | 2004 | [104] [105] | National Institute of Standards and Technology |
Gavabdb | Up to 61 samples for each subject. Expressions: neutral face, smile, frontal accentuated laugh, frontal random gesture. 3D images. | None. | 549 | Images, text | Face recognition, classification | 2008 | [106] [107] | King Juan Carlos University
3D-RMA | Up to 100 subjects, expressions mostly neutral. Several poses as well. | None. | 9971 | Images, text | Face recognition, classification | 2004 | [108] [109] | Royal Military Academy (Belgium) |
SoF | 112 persons (66 males and 46 females) wearing glasses under different illumination conditions. | A set of synthetic filters (blur, occlusion, noise, and posterization) with different levels of difficulty. | 42,592 (2,662 original images × 16 synthetic variants) | Images, Mat file | Gender classification, face detection, face recognition, age estimation, and glasses detection | 2017 | [110] [111] | Afifi, M. et al.
IMDb-WIKI | IMDb and Wikipedia face images with gender and age labels. | None | 523,051 | Images | Gender classification, face detection, face recognition, age estimation | 2015 | [112] | R. Rothe, R. Timofte, L. V. Gool |
Dataset name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
AVA-Kinetics Localized Human Actions Video | Annotated 80 action classes from keyframes from videos from Kinetics-700. | | 1.6 million annotations. 238,906 video clips, 624,430 keyframes. | Annotations, videos. | Action prediction | 2020 | [113] [114] | Li et al., Perception Team of Google AI
TV Human Interaction Dataset | Videos from 20 different TV shows for predicting social actions: handshake, high five, hug, kiss and none. | None. | 6,766 video clips | video clips | Action prediction | 2013 | [115] | Patron-Perez, A. et al.
Berkeley Multimodal Human Action Database (MHAD) | Recordings of a single person performing 12 actions | MoCap pre-processing | 660 action samples | 8 PhaseSpace Motion Capture, 2 Stereo Cameras, 4 Quad Cameras, 6 accelerometers, 4 microphones | Action classification | 2013 | [116] | Ofli, F. et al. |
THUMOS Dataset | Large video dataset for action classification. | Actions classified and labeled. | 45M frames of video | Video, images, text | Classification, action detection | 2013 | [117] [118] | Y. Jiang et al. |
MEXAction2 | Video dataset for action localization and spotting | Actions classified and labeled. | 1000 | Video | Action detection | 2014 | [119] | Stoian et al. |
Dataset name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
Artificial Characters Dataset | Artificially generated data describing the structure of 10 capital English letters. | Coordinates of lines drawn given as integers. Various other features. | 6000 | Text | Handwriting recognition, classification | 1992 | [120] | H. Guvenir et al. |
Letter Dataset | Upper-case printed letters. | 17 features are extracted from all images. | 20,000 | Text | OCR, classification | 1991 | [121] [122] | D. Slate et al. |
CASIA-HWDB | Offline handwritten Chinese character database. 3755 classes in the GB 2312 character set. | Gray-scaled images with background pixels labeled as 255. | 1,172,907 | Images, Text | Handwriting recognition, classification | 2009 | [123] | CASIA |
CASIA-OLHWDB | Online handwritten Chinese character database, collected using Anoto pen on paper. 3755 classes in the GB 2312 character set. | Provides the sequences of coordinates of strokes. | 1,174,364 | Images, Text | Handwriting recognition, classification | 2009 | [124] [123] | CASIA |
Character Trajectories Dataset | Labeled samples of pen tip trajectories for people writing simple characters. | 3-dimensional pen tip velocity trajectory matrix for each sample | 2858 | Text | Handwriting recognition, classification | 2008 | [125] [126] | B. Williams |
Chars74K Dataset | Character recognition in natural images of symbols used in both English and Kannada | | 74,107 | | Character recognition, handwriting recognition, OCR, classification | 2009 | [127] | T. de Campos
EMNIST dataset | Handwritten characters from 3600 contributors | Derived from NIST Special Database 19. Converted to 28x28 pixel images, matching the MNIST dataset. [128] | 800,000 | Images | character recognition, classification, handwriting recognition | 2016 | EMNIST dataset [129] Documentation [130] | Gregory Cohen, et al. |
UJI Pen Characters Dataset | Isolated handwritten characters | Coordinates of pen position as characters were written given. | 11,640 | Text | Handwriting recognition, classification | 2009 | [131] [132] | F. Prat et al. |
Gisette Dataset | Handwriting samples of the often-confused digits 4 and 9. | Features extracted from images, split into train/test, handwriting images size-normalized. | 13,500 | Images, text | Handwriting recognition, classification | 2003 | [133] | Yann LeCun et al.
Omniglot dataset | 1623 different handwritten characters from 50 different alphabets. | Hand-labeled. | 38,300 | Images, text, strokes | Classification, one-shot learning | 2015 | [134] [135] | American Association for the Advancement of Science |
MNIST database | Database of handwritten digits. | Hand-labeled. | 60,000 | Images, text | Classification | 1994 | [136] [137] | National Institute of Standards and Technology |
Optical Recognition of Handwritten Digits Dataset | Normalized bitmaps of handwritten data. | Size normalized and mapped to bitmaps. | 5620 | Images, text | Handwriting recognition, classification | 1998 | [138] | E. Alpaydin et al. |
Pen-Based Recognition of Handwritten Digits Dataset | Handwritten digits on electronic pen-tablet. | Feature vectors extracted to be uniformly spaced. | 10,992 | Images, text | Handwriting recognition, classification | 1998 | [139] [140] | E. Alpaydin et al. |
Semeion Handwritten Digit Dataset | Handwritten digits from 80 people. | All handwritten digits have been normalized for size and mapped to the same grid. | 1593 | Images, text | Handwriting recognition, classification | 2008 | [141] | T. Srl |
HASYv2 | Handwritten mathematical symbols | All symbols are centered and of size 32px x 32px. | 168233 | Images, text | Classification | 2017 | [142] | Martin Thoma |
Noisy Handwritten Bangla Dataset | Includes Handwritten Numeral Dataset (10 classes) and Basic Character Dataset (50 classes), each dataset has three types of noise: white gaussian, motion blur, and reduced contrast. | All images are centered and of size 32x32. | Numeral Dataset: 23330, Character Dataset: 76000 | Images, text | Handwriting recognition, classification | 2017 | [143] [144] | M. Karki et al. |
Dataset name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
iSAID: Instance Segmentation in Aerial Images Dataset | Precise instance-level annotation carried out by professional annotators, cross-checked and validated by expert annotators complying with well-defined guidelines. | | 655,451 (15 classes) | Images, jpg, json | Aerial Classification, Object Detection, Instance Segmentation | 2019 | [145] [146] | Syed Waqas Zamir, Aditya Arora, Akshita Gupta, Salman Khan, Guolei Sun, Fahad Shahbaz Khan, Fan Zhu, Ling Shao, Gui-Song Xia, Xiang Bai
Aerial Image Segmentation Dataset | 80 high-resolution aerial images with spatial resolution ranging from 0.3 to 1.0. | Images manually segmented. | 80 | Images | Aerial Classification, object detection | 2013 | [147] [148] | J. Yuan et al. |
KIT AIS Data Set | Multiple labeled training and evaluation datasets of aerial images of crowds. | Images manually labeled to show paths of individuals through crowds. | ~ 150 | Images with paths | People tracking, aerial tracking | 2012 | [149] [150] | M. Butenuth et al. |
Wilt Dataset | Remote sensing data of diseased trees and other land cover. | Various features extracted. | 4899 | Images | Classification, aerial object detection | 2014 | [151] [152] | B. Johnson |
MASATI dataset | Maritime scenes of optical aerial images from the visible spectrum. It contains color images in dynamic marine environments, each image may contain one or multiple targets in different weather and illumination conditions. | Object bounding boxes and labeling. | 7389 | Images | Classification, aerial object detection | 2018 | [153] [154] | A.-J. Gallego et al. |
Forest Type Mapping Dataset | Satellite imagery of forests in Japan. | Image wavelength bands extracted. | 326 | Text | Classification | 2015 | [155] [156] | B. Johnson |
Overhead Imagery Research Data Set | Annotated overhead imagery. Images with multiple objects. | Over 30 annotations and over 60 statistics that describe the target within the context of the image. | 1000 | Images, text | Classification | 2009 | [157] [158] | F. Tanner et al. |
SpaceNet | SpaceNet is a corpus of commercial satellite imagery and labeled training data. | GeoTiff and GeoJSON files containing building footprints. | >17533 | Images | Classification, Object Identification | 2017 | [159] [160] [161] | DigitalGlobe, Inc. |
UC Merced Land Use Dataset | These images were manually extracted from large images from the USGS National Map Urban Area Imagery collection for various urban areas around the US. | This is a 21 class land use image dataset meant for research purposes. There are 100 images for each class. | 2,100 | Image chips of 256x256, 30 cm (1 foot) GSD | Land cover classification | 2010 | [162] | Yi Yang and Shawn Newsam |
SAT-4 Airborne Dataset | Images were extracted from the National Agriculture Imagery Program (NAIP) dataset. | SAT-4 has four broad land cover classes, includes barren land, trees, grassland and a class that consists of all land cover classes other than the above three. | 500,000 | Images | Classification | 2015 | [163] [164] | S. Basu et al. |
SAT-6 Airborne Dataset | Images were extracted from the National Agriculture Imagery Program (NAIP) dataset. | SAT-6 has six broad land cover classes, includes barren land, trees, grassland, roads, buildings and water bodies. | 405,000 | Images | Classification | 2015 | [163] [164] | S. Basu et al. |
Dataset name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
SUIM Dataset | The images have been rigorously collected during oceanic explorations and human-robot collaborative experiments, and annotated by human participants. | Images with pixel annotations for eight object categories: fish (vertebrates), reefs (invertebrates), aquatic plants, wrecks/ruins, human divers, robots, and sea-floor. | 1,635 | Images | Segmentation | 2020 | [165] | Md Jahidul Islam et al. |
LIACI Dataset | Images have been collected during underwater ship inspections and annotated by human domain experts. | Images with pixel annotations for ten object categories: defects, corrosion, paint peel, marine growth, sea chest gratings, overboard valves, propeller, anodes, bilge keel and ship hull. | 1,893 | Images | Segmentation | 2022 | [166] | Waszak et al. |
Dataset name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
---|---|---|---|---|---|---|---|---|
NRC-GAMMA | A novel benchmark gas meter image dataset | None | 28,883 | Image, Label | Classification | 2021 | [167] [168] | A. Ebadi, P. Paul, S. Auer, & S. Tremblay |
The SUPATLANTIQUE dataset | Images of scanned official and Wikipedia documents | None | 4908 | TIFF/pdf | Source device identification, forgery detection, classification | 2020 | [169] | C. Ben Rabah et al.
Density functional theory quantum simulations of graphene | Labelled images of raw input to a simulation of graphene | Raw data (in HDF5 format) and output labels from density functional theory quantum simulation | 60744 test and 501473 training files | Labeled images | Regression | 2019 | [170] | K. Mills & I. Tamblyn |
Quantum simulations of an electron in a two dimensional potential well | Labelled images of raw input to a simulation of 2d Quantum mechanics | Raw data (in HDF5 format) and output labels from quantum simulation | 1.3 million images | Labeled images | Regression | 2017 | [171] | K. Mills, M.A. Spanner, & I. Tamblyn |
MPII Cooking Activities Dataset | Videos and images of various cooking activities. | Activity paths and directions, labels, fine-grained motion labeling, activity class, still image extraction and labeling. | 881,755 frames | Labeled video, images, text | Classification | 2012 | [172] [173] | M. Rohrbach et al. |
FAMOS Dataset | 5,000 unique microstructures, all samples have been acquired 3 times with two different cameras. | Original PNG files, sorted per camera and then per acquisition. MATLAB datafiles with one 16384 × 5000 matrix per camera per acquisition. | 30,000 | Images and .mat files | Authentication | 2012 | [174] | S. Voloshynovskiy, et al.
PharmaPack Dataset | 1,000 unique classes with 54 images per class. | Class labeling, many local descriptors, like SIFT and aKaZE, and local feature aggregators, like Fisher Vector (FV). | 54,000 | Images and .mat files | Fine-grain classification | 2017 | [175] | O. Taran and S. Rezaeifar, et al.
Stanford Dogs Dataset | Images of 120 breeds of dogs from around the world. | Train/test splits and ImageNet annotations provided. | 20,580 | Images, text | Fine-grain classification | 2011 | [176] [177] | A. Khosla et al. |
StanfordExtra Dataset | 2D keypoints and segmentations for the Stanford Dogs Dataset. | 2D keypoints and segmentations provided. | 12,035 | Labelled images | 3D reconstruction/pose estimation | 2020 | [178] | B. Biggs et al. |
The Oxford-IIIT Pet Dataset | 37 categories of pets with roughly 200 images of each. | Breed labeled, tight bounding box, foreground-background segmentation. | ~ 7,400 | Images, text | Classification, object detection | 2012 | [177] [179] | O. Parkhi et al. |
Corel Image Features Data Set | Database of images with features extracted. | Many features, including color histogram, co-occurrence texture, and color moments. | 68,040 | Text | Classification, object detection | 1999 | [180] [181] | M. Ortega-Bindenberger et al.
Online Video Characteristics and Transcoding Time Dataset. | Transcoding times for various different videos and video properties. | Video features given. | 168,286 | Text | Regression | 2015 | [182] | T. Deneke et al. |
Microsoft Sequential Image Narrative Dataset (SIND) | Dataset for sequential vision-to-language | Descriptive caption and storytelling given for each photo, and photos are arranged in sequences | 81,743 | Images, text | Visual storytelling | 2016 | [183] | Microsoft Research |
Caltech-UCSD Birds-200-2011 Dataset | Large dataset of images of birds. | Part locations for birds, bounding boxes, 312 binary attributes given | 11,788 | Images, text | Classification | 2011 | [184] [185] | C. Wah et al. |
YouTube-8M | Large and diverse labeled video dataset | YouTube video IDs and associated labels from a diverse vocabulary of 4800 visual entities | 8 million | Video, text | Video classification | 2016 | [186] [187] | S. Abu-El-Haija et al. |
YFCC100M | Large and diverse labeled image and video dataset | Flickr Videos and Images and associated description, titles, tags, and other metadata (such as EXIF and geotags) | 100 million | Video, Image, Text | Video and Image classification | 2016 | [188] [189] | B. Thomee et al. |
Discrete LIRIS-ACCEDE | Short videos annotated for valence and arousal. | Valence and arousal labels. | 9800 | Video | Video emotion elicitation detection | 2015 | [190] | Y. Baveye et al. |
Continuous LIRIS-ACCEDE | Long videos annotated for valence and arousal while also collecting Galvanic Skin Response. | Valence and arousal labels. | 30 | Video | Video emotion elicitation detection | 2015 | [191] | Y. Baveye et al. |
MediaEval LIRIS-ACCEDE | Extension of Discrete LIRIS-ACCEDE including annotations for violence levels of the films. | Violence, valence and arousal labels. | 10900 | Video | Video emotion elicitation detection | 2015 | [192] | Y. Baveye et al. |
Leeds Sports Pose | Articulated human pose annotations in 2000 natural sports images from Flickr. | Rough crop around single person of interest with 14 joint labels | 2000 | Images plus .mat file labels | Human pose estimation | 2010 | [193] | S. Johnson and M. Everingham |
Leeds Sports Pose Extended Training | Articulated human pose annotations in 10,000 natural sports images from Flickr. | 14 joint labels via crowdsourcing | 10000 | Images plus .mat file labels | Human pose estimation | 2011 | [194] | S. Johnson and M. Everingham |
MCQ Dataset | 6 different real multiple choice-based exams (735 answer sheets and 33,540 answer boxes) to evaluate computer vision techniques and systems developed for multiple choice test assessment systems. | None | 735 answer sheets and 33,540 answer boxes | Images and .mat file labels | Development of multiple choice test assessment systems | 2017 | [195] [196] | Afifi, M. et al. |
Surveillance Videos | Real surveillance videos covering a long surveillance period (7 days, 24 hours each). | None | 19 surveillance videos (7 days with 24 hours each). | Videos | Data compression | 2016 | [197] | Taj-Eddin, I. A. T. F. et al.
LILA BC | Labeled Information Library of Alexandria: Biology and Conservation. Labeled images that support machine learning research around ecology and environmental science. | None | ~10M images | Images | Classification | 2019 | [198] | LILA working group |
Can We See Photosynthesis? | 32 videos for eight live and eight dead leaves recorded under both DC and AC lighting conditions. | None | 32 videos | Videos | Liveness detection of plants | 2017 | [199] | Taj-Eddin, I. A. T. F. et al. |
Mathematical Mathematics Memes | Collection of 10,000 memes on mathematics. | None | ~10,000 | Images | Visual storytelling, object detection. | 2021 | [200] | Mathematical Mathematics Memes |
Flickr-Faces-HQ Dataset | Collection of images containing a face each, crawled from Flickr | Pruned with "various automatic filters", cropped and aligned to faces, and had images of statues, paintings, or photos of photos removed via crowdsourcing | 70,000 | Images | Face Generation | 2019 | [201] | Karras et al. |
Fruits-360 dataset | Database with images of 131 fruits, vegetables and nuts. | 100x100 pixels, white background. | 90483 | Images (jpg) | Classification | 2017–2024 | [202] | Mihai Oltean |
In computer vision, the bag-of-words model sometimes called bag-of-visual-words model can be applied to image classification or retrieval, by treating image features as words. In document classification, a bag of words is a sparse vector of occurrence counts of words; that is, a sparse histogram over the vocabulary. In computer vision, a bag of visual words is a vector of occurrence counts of a vocabulary of local image features.
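For a concrete picture of that pipeline, here is a minimal sketch assuming OpenCV's SIFT and scikit-learn's KMeans are available (function names are illustrative): local descriptors from a training set are clustered into a visual vocabulary, and each image is then encoded as a normalized histogram of visual-word occurrences.

```python
import numpy as np
import cv2
from sklearn.cluster import KMeans

def build_vocabulary(image_paths, vocab_size=100):
    """Cluster local SIFT descriptors from many images into a visual vocabulary."""
    sift = cv2.SIFT_create()
    descriptors = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(img, None)
        if desc is not None:
            descriptors.append(desc)
    return KMeans(n_clusters=vocab_size, n_init=10).fit(np.vstack(descriptors))

def bovw_histogram(image_path, vocab):
    """Encode one image as a normalized histogram of visual-word occurrences."""
    sift = cv2.SIFT_create()
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)
    if desc is None:
        return np.zeros(vocab.n_clusters)
    words = vocab.predict(desc)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / hist.sum()  # normalize so image size does not dominate
```

The resulting histograms can be fed to any standard classifier or used for retrieval by comparing histogram distances.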
Object detection is a computer technology related to computer vision and image processing that deals with detecting instances of semantic objects of a certain class in digital images and videos. Well-researched domains of object detection include face detection and pedestrian detection. Object detection has applications in many areas of computer vision, including image retrieval and video surveillance.
Matti Kalevi Pietikäinen is a Finnish computer scientist. He is currently Professor (emer.) in the Center for Machine Vision and Signal Analysis, University of Oulu. His research interests are in texture-based computer vision, face analysis, affective computing, biometrics, and vision-based perceptual interfaces. He was Director of the Center for Machine Vision Research, and Scientific Director of Infotech Oulu.
The MNIST database is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, it was not well-suited for machine learning experiments. Furthermore, the black and white images from NIST were normalized to fit into a 28x28 pixel bounding box and anti-aliased, which introduced grayscale levels.
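The IDX file format used to distribute MNIST is simple enough to parse directly. A minimal sketch, assuming the conventionally named gzipped training files: the image file stores a magic number, the count, and the 28×28 dimensions as big-endian integers, followed by raw uint8 pixels.

```python
import gzip
import struct
import numpy as np

def load_mnist_images(path):
    """Read an IDX image file: magic 2051, count, rows, cols, then uint8 pixels."""
    with gzip.open(path, "rb") as f:
        magic, n, rows, cols = struct.unpack(">IIII", f.read(16))
        assert magic == 2051, "not an IDX image file"
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(n, rows, cols)

def load_mnist_labels(path):
    """Read an IDX label file: magic 2049, count, then uint8 labels."""
    with gzip.open(path, "rb") as f:
        magic, n = struct.unpack(">II", f.read(8))
        assert magic == 2049, "not an IDX label file"
        return np.frombuffer(f.read(), dtype=np.uint8)

# images = load_mnist_images("train-images-idx3-ubyte.gz")   # (60000, 28, 28)
# labels = load_mnist_labels("train-labels-idx1-ubyte.gz")   # (60000,)
```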
A convolutional neural network (CNN) is a regularized type of feed-forward neural network that learns features by itself via filter optimization. This type of deep learning network has been applied to process and make predictions from many different types of data, including text, images and audio. Convolution-based networks are the de facto standard in deep learning-based approaches to computer vision and image processing, and have only recently been replaced, in some cases, by newer deep learning architectures such as the transformer. Vanishing gradients and exploding gradients, seen during backpropagation in earlier neural networks, are prevented by using regularized weights over fewer connections. For example, for each neuron in a fully connected layer, 10,000 weights would be required for processing an image sized 100 × 100 pixels. By contrast, applying cascaded convolution kernels, only 25 shared weights are required to process 5×5-sized tiles. Higher-layer features are extracted from wider context windows, compared to lower-layer features.
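The weight counts quoted above follow from trivial arithmetic: one fully connected neuron attached to a 100 × 100 image needs a weight per pixel, whereas a single 5 × 5 convolution kernel reuses the same 25 weights at every image position.

```python
fully_connected_weights = 100 * 100  # one weight per input pixel for one neuron
conv_kernel_weights = 5 * 5          # one shared 5x5 kernel, slid across the image
print(fully_connected_weights, conv_kernel_weights)  # -> 10000 25
```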
Robust Principal Component Analysis (RPCA) is a modification of the widely used statistical procedure of principal component analysis (PCA) which works well with respect to grossly corrupted observations. A number of different approaches exist for Robust PCA, including an idealized version of Robust PCA, which aims to recover a low-rank matrix L0 from highly corrupted measurements M = L0 + S0. This decomposition into low-rank and sparse matrices can be achieved by techniques such as the Principal Component Pursuit method (PCP), Stable PCP, Quantized PCP, Block based PCP, and Local PCP. Then, optimization methods are used such as the Augmented Lagrange Multiplier method (ALM), the Alternating Direction Method (ADM), Fast Alternating Minimization (FAM), Iteratively Reweighted Least Squares (IRLS) or alternating projections (AP).
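To make the decomposition concrete, here is a minimal NumPy sketch of Principal Component Pursuit solved with a basic augmented-Lagrange-multiplier loop; the λ and μ choices follow common defaults from the RPCA literature, and this is an illustration rather than a tuned implementation.

```python
import numpy as np

def shrink(X, tau):
    """Soft thresholding: proximal operator of the l1 norm."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svt(X, tau):
    """Singular value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(shrink(s, tau)) @ Vt

def rpca_pcp(M, max_iter=500, tol=1e-7):
    """Split M into a low-rank part L and a sparse part S (M ~ L + S)."""
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))          # standard PCP weight on the l1 term
    mu = m * n / (4.0 * np.abs(M).sum())    # common initial penalty parameter
    Y = np.zeros_like(M)                    # Lagrange multipliers
    S = np.zeros_like(M)
    norm_M = np.linalg.norm(M, "fro")
    for _ in range(max_iter):
        L = svt(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        residual = M - L - S
        Y += mu * residual
        if np.linalg.norm(residual, "fro") <= tol * norm_M:
            break
    return L, S
```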
DeepDream is a computer vision program created by Google engineer Alexander Mordvintsev that uses a convolutional neural network to find and enhance patterns in images via algorithmic pareidolia, thus creating a dream-like appearance reminiscent of a psychedelic experience in the deliberately overprocessed images.
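A heavily simplified sketch of that idea, using a pretrained GoogLeNet from torchvision: the chosen layer, step size, and the absence of octaves and jitter are all simplifications relative to the original program, and the input is assumed to be a normalized (1, 3, H, W) tensor.

```python
import torch
from torchvision import models

# Maximize the activations of one inner layer by gradient ascent on the input,
# which amplifies whatever patterns that layer already responds to.
model = models.googlenet(weights="DEFAULT").eval()
activations = {}
model.inception4c.register_forward_hook(
    lambda module, inputs, output: activations.update(value=output)
)

def dream(image, steps=20, lr=0.05):
    image = image.clone().requires_grad_(True)
    for _ in range(steps):
        model(image)
        loss = activations["value"].norm()   # "interestingness" of the chosen layer
        loss.backward()
        with torch.no_grad():
            image += lr * image.grad / (image.grad.abs().mean() + 1e-8)
            image.grad.zero_()
    return image.detach()
```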
Dorin Comaniciu is a Romanian-American computer scientist. He is the Senior Vice President of Artificial Intelligence and Digital Innovation at Siemens Healthcare.
René Vidal is a Chilean electrical engineer and computer scientist who is known for his research in machine learning, computer vision, medical image computing, robotics, and control theory. He is the Herschel L. Seder Professor of the Johns Hopkins Department of Biomedical Engineering, and the founding director of the Mathematical Institute for Data Science (MINDS).
An event camera, also known as a neuromorphic camera, silicon retina or dynamic vision sensor, is an imaging sensor that responds to local changes in brightness. Event cameras do not capture images using a shutter as conventional (frame) cameras do. Instead, each pixel inside an event camera operates independently and asynchronously, reporting changes in brightness as they occur, and staying silent otherwise.
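Event-camera output is therefore typically a stream of (timestamp, x, y, polarity) tuples rather than frames. A minimal sketch (names and resolution illustrative) of accumulating such a stream into a signed event-count image, a common first step before applying frame-based vision methods:

```python
import numpy as np

def events_to_frame(events, height, width):
    """Accumulate (t, x, y, polarity) events into a signed per-pixel count image."""
    frame = np.zeros((height, width), dtype=np.int32)
    for t, x, y, polarity in events:
        frame[y, x] += 1 if polarity > 0 else -1
    return frame

# frame = events_to_frame([(0.001, 10, 20, 1), (0.002, 10, 20, -1)], 260, 346)
```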
Michael J. Black is an American-born computer scientist working in Tübingen, Germany. He is a founding director at the Max Planck Institute for Intelligent Systems where he leads the Perceiving Systems Department in research focused on computer vision, machine learning, and computer graphics. He is also an Honorary Professor at the University of Tübingen.
An energy-based model (EBM) (also called Canonical Ensemble Learning (CEL) or Learning via Canonical Ensemble (LCE)) is an application of the canonical-ensemble formulation of statistical physics to problems of learning from data. The approach appears prominently in generative models (GMs).
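Written as a formula in the Boltzmann/Gibbs form of the canonical ensemble, an energy-based model assigns probability through an energy function and a normalizing constant that is usually intractable:

```latex
p_\theta(x) \;=\; \frac{\exp\{-E_\theta(x)\}}{Z(\theta)},
\qquad
Z(\theta) \;=\; \int \exp\{-E_\theta(x)\}\,\mathrm{d}x .
```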
In the domain of physics and probability, the filters, random fields, and maximum entropy (FRAME) model is a Markov random field model of stationary spatial processes, in which the energy function is the sum of translation-invariant potential functions that are one-dimensional non-linear transformations of linear filter responses. The FRAME model was originally developed by Song-Chun Zhu, Ying Nian Wu, and David Mumford for modeling stochastic texture patterns, such as grasses, tree leaves, brick walls, and water waves. This model is the maximum entropy distribution that reproduces the observed marginal histograms of responses from a bank of filters, where for each filter tuned to a specific scale and orientation, the marginal histogram is pooled over all the pixels in the image domain. The FRAME model has also been proved to be equivalent to the micro-canonical ensemble, which was named the Julesz ensemble. A Gibbs sampler is used to synthesize texture images by drawing samples from the FRAME model.
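In formula form (sign and parameterization conventions vary between papers), the FRAME density couples a potential λ_k to the marginal histogram H_k(I) of each filter's responses pooled over the image domain:

```latex
p(I;\Lambda) \;=\; \frac{1}{Z(\Lambda)}\,
\exp\Big\{ \sum_{k=1}^{K} \big\langle \lambda_k,\, H_k(I) \big\rangle \Big\},
\qquad
H_k(I) = \operatorname{hist}\big( \{ (F_k * I)(x) : x \in \mathcal{D} \} \big),
```

where the F_k are the linear filters in the bank and 𝒟 is the image domain.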
Video super-resolution (VSR) is the process of generating high-resolution video frames from the given low-resolution video frames. Unlike single-image super-resolution (SISR), the main goal is not only to restore more fine details while saving coarse ones, but also to preserve motion consistency.
Small object detection is a particular case of object detection where various techniques are employed to detect small objects in digital images and videos. "Small objects" are objects having a small pixel footprint in the input image. In areas such as aerial imagery, state-of-the-art object detection techniques have underperformed because of the small size of the objects.
Xiaoming Liu is a Chinese-American computer scientist and an academic. He is a Professor in the Department of Computer Science and Engineering, MSU Foundation Professor as well as Anil K. and Nandita Jain Endowed Professor of Engineering at Michigan State University.
Curriculum learning is a technique in machine learning in which a model is trained on examples of increasing difficulty, where the definition of "difficulty" may be provided externally or discovered automatically as part of the training process. This is intended to attain good performance more quickly, or to converge to a better local optimum if the global optimum is not found.
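A minimal sketch of the idea, assuming the per-example difficulty scores are supplied externally (function and variable names are illustrative): training proceeds in stages, and each stage exposes the model to a larger, harder fraction of the data.

```python
import numpy as np

def curriculum_examples(examples, difficulty, n_stages=4, epochs_per_stage=1):
    """Yield training examples easiest-first: stage k uses the easiest k/n_stages
    fraction of the data, so the effective difficulty grows during training."""
    order = np.argsort(difficulty)                    # easiest examples first
    n = len(examples)
    for stage in range(1, n_stages + 1):
        visible = order[: int(np.ceil(n * stage / n_stages))].copy()
        for _ in range(epochs_per_stage):
            np.random.shuffle(visible)
            for idx in visible:
                yield examples[idx]

# for x in curriculum_examples(dataset, difficulty_scores):  # illustrative usage
#     train_step(x)
```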
In computer vision and computer graphics, the 3D Morphable Model (3DMM) is a generative technique that uses methods of statistical shape analysis to model 3D objects. The model follows an analysis-by-synthesis approach over a dataset of 3D example shapes of a single class of objects. The main prerequisite is that all the 3D shapes are in a dense point-to-point correspondence, namely each point has the same semantic meaning over all the shapes. In this way, meaningful statistics can be extracted from the dataset and used to represent new plausible shapes of the object's class. Given a 2D image, its 3D shape can be represented via a fitting process, or novel shapes can be generated by directly sampling from the statistical shape distribution of that class.
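A minimal sketch of the statistical shape model at the core of a 3DMM, assuming the example shapes are already in dense point-to-point correspondence and stacked as rows of a matrix (variable and function names are illustrative): PCA yields a mean shape and deformation modes, and new plausible shapes are drawn by sampling coefficients for those modes.

```python
import numpy as np

def build_model(shapes):
    """PCA over an (M, 3N) matrix whose rows are registered example shapes."""
    mean_shape = shapes.mean(axis=0)
    _, s, Vt = np.linalg.svd(shapes - mean_shape, full_matrices=False)
    stddevs = s / np.sqrt(len(shapes) - 1)    # standard deviation of each mode
    return mean_shape, Vt.T, stddevs          # mean (3N,), modes (3N, K), (K,)

def sample_shape(mean_shape, modes, stddevs, alphas=None, rng=None):
    """Generate one plausible shape: mean plus a weighted sum of deformation modes."""
    if rng is None:
        rng = np.random.default_rng()
    if alphas is None:
        alphas = rng.standard_normal(len(stddevs))   # coefficients ~ N(0, 1)
    return mean_shape + modes @ (alphas * stddevs)
```

Fitting the model to a 2D image then amounts to searching for the coefficients (plus camera and illumination parameters) whose rendered shape best explains the image.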