FaceNet

FaceNet is a facial recognition system developed by Florian Schroff, Dmitry Kalenichenko and James Philbin, a group of researchers affiliated with Google. The system was first presented at the 2015 IEEE Conference on Computer Vision and Pattern Recognition.[1] The system uses a deep convolutional neural network to learn a mapping (also called an embedding) from a set of face images to a 128-dimensional Euclidean space, and assesses the similarity between faces based on the square of the Euclidean distance between the images' corresponding normalized vectors in that space. The system uses the triplet loss as its cost function and introduced a new online triplet mining method. It achieved an accuracy of 99.63%, the highest score to date on the Labeled Faces in the Wild dataset under the unrestricted with labeled outside data protocol.[2]
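
As a concrete illustration of this comparison step, the following is a minimal Python sketch (not the authors' code); the embeddings are assumed to be the network's L2-normalized 128-dimensional outputs, and the threshold value is purely hypothetical:

```python
import numpy as np

def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return float(np.sum((u - v) ** 2))

def same_identity(e1, e2, threshold=1.0):
    """Decide whether two L2-normalized 128-d face embeddings match.

    The threshold here is illustrative only; in practice it would be
    tuned on a validation set.
    """
    return squared_distance(e1, e2) < threshold
```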

Structure

Basic structure

The structure of FaceNet is represented schematically in Figure 1.

Figure 1: Overall structure of the FaceNet face recognition system

For training, researchers used input batches of about 1800 images. For each identity represented in a batch, there were about 40 images of that identity, together with randomly selected images of other identities. These batches were fed to a deep convolutional neural network, which was trained using stochastic gradient descent with standard backpropagation and the Adaptive Gradient (AdaGrad) optimizer. The learning rate was initially set to 0.05 and was lowered while finalizing the model.
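
A minimal sketch of this training setup, using PyTorch as an assumed framework (the original work predates it); the network below is a stand-in, not the NN1/NN2 architectures, and PyTorch's built-in `TripletMarginLoss` uses the plain (non-squared) Euclidean distance, unlike the paper's squared distance:

```python
import torch
from torch import nn

# Stand-in embedding network: any CNN mapping face images to
# 128-dimensional vectors would occupy this slot.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 128),
)

# AdaGrad with the paper's initial learning rate of 0.05.
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.05)
loss_fn = nn.TripletMarginLoss(margin=0.2)  # margin as in the paper

def embed(x):
    # Project onto the unit hypersphere (unit L2 norm), as FaceNet does.
    return nn.functional.normalize(model(x), dim=1)

# One illustrative step on random tensors standing in for a real batch.
anchors = torch.randn(8, 3, 220, 220)
positives = torch.randn(8, 3, 220, 220)
negatives = torch.randn(8, 3, 220, 220)

loss = loss_fn(embed(anchors), embed(positives), embed(negatives))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```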

Structure of the CNN

The researchers used two types of architectures, which they called NN1 and NN2, and explored their trade-offs. The practical difference between the models lies in their number of parameters and FLOPS. The details of the NN1 model are presented in the table below.

Structure of the CNN used in the model NN1 in the FaceNet face recognition system

| Layer | Size-in (rows × cols × #filters) | Size-out (rows × cols × #filters) | Kernel (rows × cols, stride) | Parameters | FLOPS |
|---|---|---|---|---|---|
| conv1 | 220×220×3 | 110×110×64 | 7×7×3, 2 | 9K | 115M |
| pool1 | 110×110×64 | 55×55×64 | 3×3×64, 2 | 0 | |
| rnorm1 | 55×55×64 | 55×55×64 | | 0 | |
| conv2a | 55×55×64 | 55×55×64 | 1×1×64, 1 | 4K | 13M |
| conv2 | 55×55×64 | 55×55×192 | 3×3×64, 1 | 111K | 335M |
| rnorm2 | 55×55×192 | 55×55×192 | | 0 | |
| pool2 | 55×55×192 | 28×28×192 | 3×3×192, 2 | 0 | |
| conv3a | 28×28×192 | 28×28×192 | 1×1×192, 1 | 37K | 29M |
| conv3 | 28×28×192 | 28×28×384 | 3×3×192, 1 | 664K | 521M |
| pool3 | 28×28×384 | 14×14×384 | 3×3×384, 2 | 0 | |
| conv4a | 14×14×384 | 14×14×384 | 1×1×384, 1 | 148K | 29M |
| conv4 | 14×14×384 | 14×14×256 | 3×3×384, 1 | 885K | 173M |
| conv5a | 14×14×256 | 14×14×256 | 1×1×256, 1 | 66K | 13M |
| conv5 | 14×14×256 | 14×14×256 | 3×3×256, 1 | 590K | 116M |
| conv6a | 14×14×256 | 14×14×256 | 1×1×256, 1 | 66K | 13M |
| conv6 | 14×14×256 | 14×14×256 | 3×3×256, 1 | 590K | 116M |
| pool4 | 14×14×256 | 7×7×256 | 3×3×256, 2 | 0 | |
| concat | 7×7×256 | 7×7×256 | | 0 | |
| fc1 | 7×7×256 | 1×32×128 | maxout p=2 | 103M | 103M |
| fc2 | 1×32×128 | 1×32×128 | maxout p=2 | 34M | 34M |
| fc7128 | 1×32×128 | 1×1×128 | | 524K | 0.5M |
| L2 | 1×1×128 | 1×1×128 | | 0 | |
| Total | | | | 140M | 1.6B |
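
As a sanity check on the table, the parameter count of a convolutional layer is (kernel rows × kernel cols × input filters + 1 bias) × output filters, and its FLOPS scale with the number of output positions; a short computation for conv3:

```python
# conv3: 3×3 kernel over 192 input filters, producing 384 output
# filters (plus one bias per output filter).
params_conv3 = (3 * 3 * 192 + 1) * 384
print(params_conv3)  # 663936, i.e. the ~664K in the table

# Each of the 28×28 output positions applies the full kernel once.
flops_conv3 = params_conv3 * 28 * 28
print(flops_conv3)  # 520525824, i.e. the ~521M in the table
```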

Triplet loss function

The triplet loss function minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity.

FaceNet introduced a novel loss function called "triplet loss". This function is defined on triplets of training images of the form $(x_i^a, x_i^p, x_i^n)$. In each triplet, $x_i^a$ (called an "anchor image") denotes a reference image of a particular identity, $x_i^p$ (called a "positive image") denotes another image of the same identity as in image $x_i^a$, and $x_i^n$ (called a "negative image") denotes a randomly selected image of an identity different from the identity in images $x_i^a$ and $x_i^p$.

Let $x$ be some image and let $f(x)$ be the embedding of $x$ in the 128-dimensional Euclidean space. It shall be assumed that the L2-norm of $f(x)$ is unity (the L2-norm of a vector $x$ in a finite-dimensional Euclidean space is denoted by $\lVert x \rVert$). We assemble triplets of images from the training dataset. The goal of training here is to ensure that, after learning, the following condition (called the "triplet constraint") is satisfied by all triplets $(x_i^a, x_i^p, x_i^n)$ in the training data set:

$$\lVert f(x_i^a) - f(x_i^p) \rVert_2^2 + \alpha < \lVert f(x_i^a) - f(x_i^n) \rVert_2^2$$

The variable $\alpha$ is a hyperparameter called the margin, and its value must be set manually. In the FaceNet paper its value was set to 0.2.

Thus, the full form of the function to be minimized is the following, called the triplet loss function:

$$L = \sum_{i=1}^{N} \max\!\left( \lVert f(x_i^a) - f(x_i^p) \rVert_2^2 - \lVert f(x_i^a) - f(x_i^n) \rVert_2^2 + \alpha,\ 0 \right)$$
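
The following is a minimal NumPy sketch of this loss (not the authors' implementation); the inputs are assumed to be batches of L2-normalized 128-dimensional embeddings already arranged into corresponding triplets:

```python
import numpy as np

def triplet_loss(anchors, positives, negatives, alpha=0.2):
    """Triplet loss over a batch of embedding triplets.

    anchors, positives, negatives: arrays of shape (N, 128),
    L2-normalized. alpha: the margin (0.2 in the FaceNet paper).
    """
    # Squared Euclidean distances between corresponding embeddings.
    pos_dist = np.sum((anchors - positives) ** 2, axis=1)
    neg_dist = np.sum((anchors - negatives) ** 2, axis=1)
    # Hinge: only triplets violating the margin contribute to the loss.
    return np.sum(np.maximum(pos_dist - neg_dist + alpha, 0.0))
```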

Selection of triplets

In general, the number of triplets of the form $(x_i^a, x_i^p, x_i^n)$ is very large. To make computations faster, the Google researchers considered only those triplets that violate the triplet constraint. For a given anchor image $x_i^a$, they chose the positive image $x_i^p$ for which $\lVert f(x_i^a) - f(x_i^p) \rVert_2^2$ is maximal (such a positive image was called a "hard positive image") and the negative image $x_i^n$ for which $\lVert f(x_i^a) - f(x_i^n) \rVert_2^2$ is minimal (such a negative image was called a "hard negative image"). Since using the whole training data set to determine the hard positive and hard negative images was computationally expensive and infeasible, the researchers experimented with several methods for selecting the triplets, along the lines of the within-batch sketch below.
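
A sketch of online (within-batch) hard-triplet mining in NumPy; this illustrates the idea of picking hard positives and hard negatives per anchor, while the paper's exact selection strategies (for example, semi-hard negatives) differ in detail:

```python
import numpy as np

def mine_hard_triplets(embeddings, labels):
    """Pick the hard positive and hard negative for each anchor.

    embeddings: (N, 128) L2-normalized vectors for one mini-batch.
    labels: (N,) integer identity ids.
    Returns a list of (anchor, positive, negative) index triples.
    """
    # Pairwise squared Euclidean distances within the batch.
    d = np.sum((embeddings[:, None, :] - embeddings[None, :, :]) ** 2, axis=2)
    triplets = []
    for i, label in enumerate(labels):
        same = labels == label
        same[i] = False  # exclude the anchor itself
        if not same.any():
            continue  # no positive available for this identity
        # Hard positive: same identity, maximum distance to the anchor.
        p = np.where(same)[0][np.argmax(d[i, same])]
        # Hard negative: different identity, minimum distance to the anchor.
        diff = labels != label
        n = np.where(diff)[0][np.argmin(d[i, diff])]
        triplets.append((i, p, n))
    return triplets
```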

Performance

On the widely used Labeled Faces in the Wild (LFW) dataset, the FaceNet system achieved an accuracy of 99.63%, which is the highest score on LFW under the unrestricted with labeled outside data protocol.[2] On YouTube Faces DB the system achieved an accuracy of 95.12%.[1]

References

  1. Florian Schroff; Dmitry Kalenichenko; James Philbin (2015). "FaceNet: A Unified Embedding for Face Recognition and Clustering" (PDF). The Computer Vision Foundation. Retrieved 4 October 2023.
  2. Erik Learned-Miller; Gary Huang; Aruni RoyChowdhury; Haoxiang Li; Gang Hua (April 2016). "Labeled Faces in the Wild: A Survey". Advances in Face Detection and Facial Image Analysis (PDF). Springer. pp. 189–248. Retrieved 5 October 2023.