Federated learning (aka collaborative learning) is a machine learning technique that trains an algorithm across multiple decentralized edge devices or servers holding local data samples, without exchanging their data samples. This approach stands in contrast to traditional centralized machine learning techniques where all data samples are uploaded to one server, as well as to more classical decentralized approaches which assume that local data samples are identically distributed.
Federated learning enables multiple actors to build a common, robust machine learning model without sharing data, thus addressing critical issues such as data privacy, data security, data access rights and access to heterogeneous data. Its applications are spread over a number of industries including defense, telecommunications, IoT, or pharmaceutics.
Federated learning aims at training a machine learning algorithm, for instance deep neural networks, on multiple local datasets contained in local nodes without exchanging data samples. The general principle consists in training local models on local data samples and exchanging parameters (e.g. the weights of a deep neural network) between these local models at some frequency to generate a global model.
Federated learning algorithms may use a central server that orchestrates the different steps of the algorithm and acts as a reference clock, or they may be peer-to-peer, where no such central server exists. In the non peer-to-peer case, a federated learning process can be broken down in multiple rounds, each consisting of 4 general steps.
The main difference between federated learning and distributed learning lies in the assumptions made on the properties of the local datasets,as distributed learning originally aims at parallelizing computing power where federated learning originally aims at training on heterogeneous datasets. While distributed learning also aims at training a single model on multiple servers, a common underlying assumption is that the local datasets are identically distributed and roughly have the same size. None of these hypotheses are made for federated learning; instead, the datasets are typically heterogeneous and their sizes may span several orders of magnitude.
To ensure good task performance of a final, central machine learning model, federated learning relies on an iterative process broken up into an atomic set of client-server interactions known as a federated learning round. Each round of this process consists in transmitting the current global model state to participating nodes, training local models on these local nodes to produce a set of potential model updates at each node, and then aggregating and processing these local updates into a single global update and applying it to the global model.
In the methodology below, we use a central server for this aggregation, while local nodes perform local training depending on the central server's orders. However, other strategies lead to the same results without central servers, in a peer-to-peer approach, using gossip methodologies.
A statistical model (e.g., linear regression, neural network, boosting) is chosen to be trained on local nodes and initialized. Nodes are activated and wait for the central server to give calculation tasks.
For multiple iterations of so-called federated learning rounds, the following steps are performed:
A fraction of local nodes are selected to start training on local data. They all acquire the same current statistical model from the central server. Other nodes wait for the next federated round.
The central server orders selected nodes to undergo training of the model on their local data in a pre-specified fashion (e.g. for some batch updates of gradient descent).
Each node returns the locally learned incremental model updates to the central server. The central server aggregates all results and stores the new model. It also handles failures (e.g., connection lost with a node while training). The system returns to the selection phase.
When a pre-specified termination criterion (e.g. maximal number of rounds or local accuracies higher than some target) has been met, the central server orders the end of the iterative training process. The central server contains a robust model which was trained on multiple heterogeneous data sources.
The way the statistical local outputs are pooled and the way the nodes communicate with each other can change from the centralized model explained in the previous section. This leads to a variety of federated learning approaches: for instance no central orchestrating server, or stochastic communication.
In particular, orchestrator-less distributed networks are one important variation. In this case, there is no central server dispatching queries to local nodes and aggregating local models. Each local node sends its outputs to a several randomly-selected others,which aggregate their results locally. This restrains the number of transactions, thereby sometimes reducing training time and computing cost.
Once the topology of the node network is chosen, one can control different parameters of the federated learning process (in opposition to the machine learning model's own hyperparameters) to optimize learning :
Other model-dependent parameters can also be tinkered with, such as :
Those parameters have to be optimized depending on the constraints of the machine learning application (e.g., available computing power, available memory, bandwidth). For instance, stochastically choosing a limited fraction C of nodes for each iteration diminishes computing cost and may prevent overfitting, in the same way that stochastic gradient descent can reduce overfitting.
In this section, we follow the exposition of Communication-Efficient Learning of Deep Networks from Decentralized Data, H. Brendan McMahan and al. 2017.
To describe the federated strategies, let us introduce some notations:
Deep learning training mainly relies on variants of stochastic gradient descent, where gradients are computed on a random subset of the total dataset and then used to make one step of the gradient descent.
Federated stochastic gradient descentis the direct transposition of this algorithm to the federated setting, but by using a random fraction C of the nodes and using all the data on this node. The gradients are averaged by the server proportionally to the number of training samples on each node, and used to make a gradient descent step.
Federative averaging (FedAvg)is a generalization of FedSGD, which allows local nodes to perform more than one batch update on local data and exchanges the updated weights rather than the gradients. The rationale behind this generalization is that in FedSGD, if all local nodes start from the same initialization, averaging the gradients is strictly equivalent to averaging the weights themselves. Further, averaging tuned weights coming from the same initialization does not necessarily hurt the resulting averaged model's performance.
Federated learning requires frequent communication between nodes during the learning process. Thus, it requires not only enough local computing power and memory, but also high bandwidth connections to be able to exchange parameters of the machine learning model. However, the technology also avoid data communication, which can require significant resources before starting centralized machine learning.
Federated learning raises several statistical challenges :
The main advantage of using federated approaches to machine learning is to ensure data privacy or data secrecy. Indeed, no local data is uploaded externally, concatenated or exchanged. Since the entire database is segmented into local bits, this makes it more difficult to hack into it.
With federated learning, only machine learning parameters are exchanged. In addition, such parameters can be encrypted before sharing between learning rounds to extend privacy. Despite such protective measures, these parameters mays still leak information about the underlying data samples, for instance, by making multiple specific queries on specific datasets. Querying capability of nodes thus is a major attention point, which can be addressed using differential privacy or secure aggregation.
The generated model delivers insights based on the global patterns of nodes. However, if a participating node wishes to learn from global patterns but also adapt outcomes to its peculiar status, the federated learning methodology can be adapted to generate two models at once in a multi-task learning framework.
In the case of deep neural networks, it is possible to share some layers across the different nodes and keep some of them on each local node. Typically, first layers performing general pattern recognition are shared and trained all datasets. The last layers will remain on each local node and only be trained on the local node's dataset.
Western legal frameworks emphasize more and more on data protection and data traceability. White House 2012 Reportrecommended the application of a data minimization principle, which is mentioned in European GDPR. In some cases, it is impossible to transfer data from a country to another (e.g., genomic data), however international consortia are sometimes necessary for scientific advances. In such cases federated learning brings solutions to train a global model while respecting security constraints.
Federated learning has started to emerge as an important research topic in 2015and 2016, with the first publications on federative averaging in telecommunication settings. Recent publications have emphasized the development of resource allocation strategies, especially to reduce communication requirements between node with gossip algorithms. In addition, recent publications continue to work on the federated algorithms robustness to differential privacy attacks.
Federated learning typically applies when individual actors need to train models on larger datasets than their own, but cannot afford to share the data in itself with other (e.g., for legal, strategic or economic reasons). The technology yet requires good connections between local servers and minimum computational power for each node.
One of the first use cases of federated learning was implemented by Googlefor predictive keyboards. Under high regulatory pressure, it showed impossible to upload every user's text message to train the predictive algorithm for word guessing. Besides, such a process would hijack too much of the user's data. Despite the sometimes limited memory and computing power of smartphones, Google has made a compelling use case out of its G-board, as presented during the Google IO 2019 event.
Pharmaceutical research is pivoting towards a new paradigm : real world data use for generating drug leads and synthetic control arms. Generating knowledge on complex biological problems require to gather a lot of data from diverse medical institutions, which are eager to maintain control of their sensitive patient data. Federated learning, especially assisted by high traceability technologies (distributive ledgers) enable researchers to train predictive models on many sensitive data in a transparent way without uploading them. In 2019, French start-up Owkin is pioneering the development of biomedical machine learning models based on such algorithms to capture heterogeneous data from both pharmaceutical companies and medical institutions.
Self driving cars encapsulate many machine learning technologies to function : computer vision for analyzing obstacles, machine learning for adapting their pace to the environment (e.g., bumpiness of the road). Due to the potential high number of self-driving cars and the need for them to quickly respond to real world situations, traditional cloud approach may generate safety risks. Federated learning can represent a solution for limiting volume of data transfer and accelerating learning processes.
Distributed computing is a field of computer science that studies distributed systems. A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another. The components interact with one another in order to achieve a common goal. Three significant characteristics of distributed systems are: concurrency of components, lack of a global clock, and independent failure of components. Examples of distributed systems vary from SOA-based systems to massively multiplayer online games to peer-to-peer applications.
A self-organizing map (SOM) or self-organizing feature map (SOFM) is a type of artificial neural network (ANN) that is trained using unsupervised learning to produce a low-dimensional, discretized representation of the input space of the training samples, called a map, and is therefore a method to do dimensionality reduction. Self-organizing maps differ from other artificial neural networks as they apply competitive learning as opposed to error-correction learning, and in the sense that they use a neighborhood function to preserve the topological properties of the input space.
In machine learning, early stopping is a form of regularization used to avoid overfitting when training a learner with an iterative method, such as gradient descent. Such methods update the learner so as to make it better fit the training data with each iteration. Up to a point, this improves the learner's performance on data outside of the training set. Past that point, however, improving the learner's fit to the training data comes at the expense of increased generalization error. Early stopping rules provide guidance as to how many iterations can be run before the learner begins to over-fit. Early stopping rules have been employed in many different machine learning methods, with varying amounts of theoretical foundation.
Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop a conventional algorithm for effectively performing the task.
Random sample consensus (RANSAC) is an iterative method to estimate parameters of a mathematical model from a set of observed data that contains outliers, when outliers are to be accorded no influence on the values of the estimates. Therefore, it also can be interpreted as an outlier detection method. It is a non-deterministic algorithm in the sense that it produces a reasonable result only with a certain probability, with this probability increasing as more iterations are allowed. The algorithm was first published by Fischler and Bolles at SRI International in 1981. They used RANSAC to solve the Location Determination Problem (LDP), where the goal is to determine the points in the space that project onto an image into a set of landmarks with known locations.
Stochastic gradient descent is an iterative method for optimizing an objective function with suitable smoothness properties. It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient by an estimate thereof. Especially in big data applications this reduces the computational burden, achieving faster iterations in trade for a slightly lower convergence rate.
In machine learning, specifically deep learning, backpropagation is a widely used algorithm in training feedforward neural networks for supervised learning. Generalizations of backpropagation exist for other artificial neural networks (ANNs), and for functions generally – a class of algorithms is referred to generically as "backpropagation". In deep learning, backpropagation computes the gradient of the loss function with respect to the weights of the network for a single input–output example, and does so efficiently, unlike a naive direct computation of the gradient with respect to each weight individually. This efficiency makes it feasible to use gradient methods for training multilayer networks, updating weights to minimize loss; gradient descent, or variants such as stochastic gradient descent, are commonly used. The backpropagation algorithm works by computing the gradient of the loss function with respect to each weight by the chain rule, computing the gradient one layer at a time, iterating backward from the last layer to avoid redundant calculations of intermediate terms in the chain rule; this is an example of dynamic programming.
In machine learning, a common task is the study and construction of algorithms that can learn from and make predictions on data. Such algorithms work by making data-driven predictions or decisions, through building a mathematical model from input data.
A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. This allows it to exhibit temporal dynamic behavior. Unlike feedforward neural networks, RNNs can use their internal state (memory) to process variable length sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition.
A feedforward neural network is an artificial neural network wherein connections between the nodes do not form a cycle. As such, it is different from recurrent neural networks.
Meta learning is a subfield of machine learning where automatic learning algorithms are applied on metadata about machine learning experiments. As of 2017 the term had not found a standard interpretation, however the main goal is to use such metadata to understand how automatic learning can become flexible in solving learning problems, hence to improve the performance of existing learning algorithms or to learn (induce) the learning algorithm itself, hence the alternative term learning to learn.
In machine learning, multiclass or multinomial classification is the problem of classifying instances into one of three or more classes.
Active learning is a special case of machine learning in which a learning algorithm is able to interactively query the user to obtain the desired outputs at new data points. In statistics literature it is sometimes also called optimal experimental design.
In machine learning, feature learning or representation learning is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data. This replaces manual feature engineering and allows a machine to both learn the features and use them to perform a specific task.
In deep learning, a convolutional neural network is a class of deep neural networks, most commonly applied to analyzing visual imagery. They are also known as shift invariant or space invariant artificial neural networks (SIANN), based on their shared-weights architecture and translation invariance characteristics. They have applications in image and video recognition, recommender systems, image classification, medical image analysis, and natural language processing.
In machine learning, the vanishing gradient problem is a difficulty found in training artificial neural networks with gradient-based learning methods and backpropagation. In such methods, each of the neural network's weights receives an update proportional to the partial derivative of the error function with respect to the current weight in each iteration of training. The problem is that in some cases, the gradient will be vanishingly small, effectively preventing the weight from changing its value. In the worst case, this may completely stop the neural network from further training. As one example of the problem cause, traditional activation functions such as the hyperbolic tangent function have gradients in the range (-1, 1), and backpropagation computes gradients by the chain rule. This has the effect of multiplying n of these small numbers to compute gradients of the "front" layers in an n-layer network, meaning that the gradient decreases exponentially with n while the front layers train very slowly.
Apache SINGA is an Apache top-level project for developing an open source machine learning library. It provides a flexible architecture for scalable distributed training, is extensible to run over a wide range of hardware, and has a focus on health-care applications.
In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters are learned.
RevoScaleR is a machine learning package in R created by Microsoft. It is available as part of Machine Learning Server, Microsoft R Client, and Machine Learning Services in Microsoft SQL Server 2016.
In machine learning and statistics, the learning rate is a tuning parameter in an optimization algorithm that determines the step size at each iteration while moving toward a minimum of a loss function. Since it influences to what extent newly acquired information overrides old information, it metaphorically represents the speed at which a machine learning model "learns." The learning rate is often denoted by the character η or α.