Differentially private analysis of graphs

Last updated

Differentially private analysis of graphs [1] studies algorithms for computing accurate graph statistics while preserving differential privacy. Such algorithms are used for data represented in the form of a graph where nodes correspond to individuals and edges correspond to relationships between them. For examples, edges could correspond to friendships, sexual relationships, or communication patterns. A party that collected sensitive graph data can process it using a differentially private algorithm and publish the output of the algorithm. The goal of differentially private analysis of graphs is to design algorithms that compute accurate global information about graphs while preserving privacy of individuals whose data is stored in the graph.

Contents

Variants

Differential privacy imposes a restriction on the algorithm. Intuitively, it requires that the algorithm has roughly the same output distribution on neighboring inputs. If the input is a graph, there are two natural notions of neighboring inputs, edge neighbors and node neighbors, which yield two natural variants of differential privacy for graph data.

Let ε be a positive real number and be a randomized algorithm that takes a graph as input and returns an output from a set . The algorithm is -differentially private if, for all neighboring graphs and and all subsets of ,

where the probability is taken over the randomness used by the algorithm.

Edge differential privacy

Two graphs are edge neighbors if they differ in one edge. An algorithm is -edge-differentially private if, in the definition above, the notion of edge neighbors is used. Intuitively, an edge differentially private algorithm has similar output distributions on any pair of graphs that differ in one edge, thus protecting changes to graph edges.

Node differential privacy

Two graphs are node neighbors if one can be obtained from the other by deleting a node and its adjacent edges. An algorithm is -node-differentially private if, in the definition above, the notion of node neighbors is used. Intuitively, a node differentially private algorithm has similar output distributions on any pair of graphs that differ in one one nodes and edges adjacent to it, thus protecting information pertaining to each individual. Node differential privacy give a stronger privacy protection than edge differential privacy.

Research history

The first edge differentially private algorithm was designed by Nissim, Raskhodnikova, and Smith. [2] The distinction between edge and node differential privacy was first discussed by Hay, Miklau, and Jensen. [3] However, it took several years before first node differentially private algorithms were published in Blocki et al., [4] Kasiviswanathan et al., [5] and Chen and Zhou. [6] In all three papers, the algorithms are for releasing a single statistic, like a triangle count or counts of other subgraphs. Raskhodnikova and Smith gave the first node differentially private algorithm for releasing a vector, specifically, the degree count and the degree distribution. [7]

Related Research Articles

A* is a graph traversal and path search algorithm, which is used in many fields of computer science due to its completeness, optimality, and optimal efficiency. One major practical drawback is its space complexity, as it stores all generated nodes in memory. Thus, in practical travel-routing systems, it is generally outperformed by algorithms that can pre-process the graph to attain better performance, as well as memory-bounded approaches; however, A* is still the best solution in many cases.

<span class="mw-page-title-main">Independent set (graph theory)</span> Unrelated vertices in graphs

In graph theory, an independent set, stable set, coclique or anticlique is a set of vertices in a graph, no two of which are adjacent. That is, it is a set of vertices such that for every two vertices in , there is no edge connecting the two. Equivalently, each edge in the graph has at most one endpoint in . A set is independent if and only if it is a clique in the graph's complement. The size of an independent set is the number of vertices it contains. Independent sets have also been called "internally stable sets", of which "stable set" is a shortening.

<span class="mw-page-title-main">Szemerédi regularity lemma</span>

Szemerédi's regularity lemma is one of the most powerful tools in extremal graph theory, particularly in the study of large dense graphs. It states that the vertices of every large enough graph can be partitioned into a bounded number of parts so that the edges between different parts behave almost randomly.

In computational complexity theory, the unique games conjecture is a conjecture made by Subhash Khot in 2002. The conjecture postulates that the problem of determining the approximate value of a certain type of game, known as a unique game, has NP-hard computational complexity. It has broad applications in the theory of hardness of approximation. If the unique games conjecture is true and P ≠ NP, then for many important problems it is not only impossible to get an exact solution in polynomial time, but also impossible to get a good polynomial-time approximation. The problems for which such an inapproximability result would hold include constraint satisfaction problems, which crop up in a wide variety of disciplines.

In computer science, locality-sensitive hashing (LSH) is an algorithmic technique that hashes similar input items into the same "buckets" with high probability. Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.

In mathematics, the Johnson–Lindenstrauss lemma is a result named after William B. Johnson and Joram Lindenstrauss concerning low-distortion embeddings of points from high-dimensional into low-dimensional Euclidean space. The lemma states that a set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved. The map used for the embedding is at least Lipschitz, and can even be taken to be an orthogonal projection.

In computer science, a property testing algorithm for a decision problem is an algorithm whose query complexity to its input is much smaller than the instance size of the problem. Typically property testing algorithms are used to distinguish if some combinatorial structure S satisfies some property P, or is "far" from having this property, using only a small number of "local" queries to the object.

The exponential mechanism is a technique for designing differentially private algorithms. It was developed by Frank McSherry and Kunal Talwar in 2007. Their work was recognized as a co-winner of the 2009 PET Award for Outstanding Research in Privacy Enhancing Technologies.

<span class="mw-page-title-main">Cynthia Dwork</span> American computer scientist

Cynthia Dwork is an American computer scientist best known for her contributions to cryptography, distributed computing, and algorithmic fairness. She is one of the inventors of differential privacy and proof-of-work.

Differential privacy (DP) is an approach to providing privacy while sharing information about a group of individuals, by describing the patterns within the group while withholding information about specific individuals. This is done by making arbitrary small changes to individual data that do not change the statistics of interest. Thus the data cannot be used to infer much about any individual.

In theoretical computer science, Baker's technique is a method for designing polynomial-time approximation schemes (PTASs) for problems on planar graphs. It is named after Brenda Baker, who announced it in a 1983 conference and published it in the Journal of the ACM in 1994.

<span class="mw-page-title-main">Occam learning</span> Model of algorithmic learning

In computational learning theory, Occam learning is a model of algorithmic learning where the objective of the learner is to output a succinct representation of received training data. This is closely related to probably approximately correct (PAC) learning, where the learner is evaluated on its predictive power of a test set.

In graph theory, the graph removal lemma states that when a graph contains few copies of a given subgraph, then all of the copies can be eliminated by removing a small number of edges. The special case in which the subgraph is a triangle is known as the triangle removal lemma.

Local differential privacy (LDP) is a model of differential privacy with the added requirement that even if an adversary has access to the personal responses of an individual in the database, that adversary will still be unable to learn too much about the user's personal data. This is contrasted with global differential privacy, a model of differential privacy that incorporates a central aggregator with access to the raw data.

Adding controlled noise from predetermined distributions is a way of designing differentially private mechanisms. This technique is useful for designing private mechanisms for real-valued functions on sensitive data. Some commonly used distributions for adding noise include Laplace and Gaussian distributions.

Since the advent of differential privacy, a number of systems supporting differentially private data analyses have been implemented and deployed. This article tracks real-world deployments, production software packages, and research prototypes.

A reconstruction attack is any method for partially reconstructing a private dataset from public aggregate information. Typically, the dataset contains sensitive information about individuals, whose privacy needs to be protected. The attacker has no or only partial access to the dataset, but has access to public aggregate statistics about the datasets, which could be exact or distorted, for example by adding noise. If the public statistics are not sufficiently distorted, the attacker is able to accurately reconstruct a large portion of the original private data. Reconstruction attacks are relevant to the analysis of private data, as they show that, in order to preserve even a very weak notion of individual privacy, any published statistics need to be sufficiently distorted. This phenomenon was called the Fundamental Law of Information Recovery by Dwork and Roth, and formulated as "overly accurate answers to too many questions will destroy privacy in a spectacular way."

Weak supervision, also called semi-supervised learning, is a paradigm in machine learning, the relevance and notability of which increased with the advent of large language models due to large amount of data required to train them. It is characterized by using a combination of a small amount of human-labeled data, followed by a large amount of unlabeled data. In other words, the desired output values are provided only for a subset of the training data. The remaining data is unlabeled or imprecisely labeled. Intuitively, it can be seen as an exam and labeled data as sample problems that the teacher solves for the class as an aid in solving another set of problems. In the transductive setting, these unsolved problems act as exam questions. In the inductive setting, they become practice problems of the sort that will make up the exam. Technically, it could be viewed as performing clustering and then labeling the clusters with the labeled data, pushing the decision boundary away from high-density regions, or learning an underlying one-dimensional manifold where the data reside.

Sofya Raskhodnikova is a Belarusian and American theoretical computer scientist. She is known for her research in sublinear-time algorithms, information privacy, property testing, and approximation algorithms, and was one of the first to study differentially private analysis of graphs. She is a professor of computer science at Boston University.

In the context of quantum computing, the quantum walk search is a quantum algorithm for finding a marked node in a graph.

References

  1. Sofya Raskhodnikova; Adam Smith (2015). "Differentially Private Analysis of Graphs". Kao MY. (Eds) Encyclopedia of Algorithms. Springer, Berlin, Heidelberg. doi:10.1007/978-3-642-27848-8_549-1.
  2. Nissim, Kobbi; Raskhodnikova, Sofya; Smith, Adam (2007). "Smooth sensitivity and sampling in private data analysis". Proceedings of the thirty-ninth annual ACM symposium on Theory of computing. New York, New York, USA: ACM Press. pp. 75–84. doi:10.1145/1250790.1250803. ISBN   9781595936318. S2CID   5642529.
  3. Hay, Michael; Li, Chao; Miklau, Gerome; Jensen, David (2009). "Accurate Estimation of the Degree Distribution of Private Networks". 2009 Ninth IEEE International Conference on Data Mining. IEEE. pp. 169–178. doi:10.1109/icdm.2009.11. ISBN   9781424452422. S2CID   2572996.
  4. Blocki, Jeremiah; Blum, Avrim; Datta, Anupam; Sheffet, Or (2012). "The Johnson-Lindenstrauss Transform Itself Preserves Differential Privacy". 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science. pp. 410–419. arXiv: 1204.2136 . Bibcode:2012arXiv1204.2136B. doi:10.1109/focs.2012.67. ISBN   978-0-7695-4874-6. S2CID   349368.
  5. Kasiviswanathan, Shiva Prasad; Nissim, Kobbi; Raskhodnikova, Sofya; Smith, Adam (2013), "Analyzing Graphs with Node Differential Privacy", Theory of Cryptography, Springer Berlin Heidelberg, pp. 457–476, doi: 10.1007/978-3-642-36594-2_26 , ISBN   9783642365935
  6. Chen, Shixi; Zhou, Shuigeng (2013). "Recursive mechanism". Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. New York, New York, USA: ACM Press. pp. 653–664. doi:10.1145/2463676.2465304. ISBN   9781450320375. S2CID   16257197.
  7. Raskhodnikova, Sofya; Smith, Adam (2016). "Lipschitz Extensions for Node-Private Graph Statistics and the Generalized Exponential Mechanism". 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS). IEEE. pp. 495–504. doi:10.1109/focs.2016.60. ISBN   9781509039333. S2CID   7310416.