Community search

Last updated

Discovering communities in a network, known as community detection/discovery, is a fundamental problem in network science, which attracted much attention in the past several decades. In recent years, with the tremendous studies on big data, another related but different problem, called community search, which aims to find the most likely community that contains the query node, has attracted great attention from both academic and industry areas. It is a query-dependent variant of the community detection problem. A detailed survey of community search can be found at ref., [1] which reviews all the recent studies [2] [3] [4] [5] [6] [7] [8] [9] [10] [11]

Contents

Main advantages

As pointed by the first work on community search [2] published in SIGKDD'2010, many existing community detection/discovery methods consider the static community detection problem, where the graph needs to be partitioned a-priori with no reference to query nodes. While community search often focuses the most-likely communitie containing the query vertex. The main advantages of community search over community detection/discovery are listed as below:

(1) High personalization. [3] [9] [10] Community detection/discovery often uses the same global criterion to decide whether a subgraph qualifies as a community. In other words, the criterion is fixed and predetermined. But in reality, communities for different vertices may have very different characteristics. Moreover, community search allows the query users to specify more personalized query conditions. In addition, the personalized query conditions enable the communities to be interpreted easily.

For example, a recent work, [9] which focuses on attributed graphs, where nodes are often associated with some attributes like keyword, and tries to find the communities, called attributed communities, which exhibit both strong structure and keyword cohesiveness. The query users are allowed to specify a query node and some other query conditions: (1) a value, k, the minimum degree for the expected communities; and (2) a set of keywords, which control the semantic of the expected communities. The communities returned can be easily interpreted by the keywords shared by all the community members. More details can be found from. [11]

(2) High efficiency. With the striking booming of social networks in recent years, there are many real big graphs. For example, the numbers of users in Facebook and Twitter are often billions-scale. As community detection/discovery often finds all the communities from an entire social network, this can be very costly and also time-consuming. In contrast, community search often works on a sub-graph, which is much efficient. Moreover, detecting all the communities from an entire social network is often unnecessary. For real applications like recommendation and social media markets, people often focus on some communities that they are really interested in, rather than all the communities.

Some recent studies [4] [9] have shown that, for million-scale graphs, community search often takes less than 1 second to find a well-defined community, which is generally much faster than many existing community detection/discovery methods. This also implies that, community search is more suitable for finding communities from big graphs.

(3) Support for dynamically evolving graphs. [3] Almost all the graphs in real life are often evolving over time. Since community detection often uses the same global criterion to find communities, they are not sensitive of the updates of nodes and edges in graphs. In other words, the detected communities may loose freshness after a short period of time. On the contrary, community search can handle this easily since it is able to search the communities in an online manner, based on a query request.

Community search often uses some well-defined, fundamental graph metrics to formulate the cohesiveness of communities. The commonly used metrics are k-core (minimum degree), [2] [4] [6] [7] [9] k-truss, [5] [8] k-edge-connected, [12] [13] etc. Among these measures, the k-core metric is the most popular one, and has been used in many recent studies as surveyed in. [1]

Related Research Articles

<span class="mw-page-title-main">R-tree</span> Data structures used in spatial indexing

R-trees are tree data structures used for spatial access methods, i.e., for indexing multi-dimensional information such as geographical coordinates, rectangles or polygons. The R-tree was proposed by Antonin Guttman in 1984 and has found significant use in both theoretical and applied contexts. A common real-world usage for an R-tree might be to store spatial objects such as restaurant locations or the polygons that typical maps are made of: streets, buildings, outlines of lakes, coastlines, etc. and then find answers quickly to queries such as "Find all museums within 2 km of my current location", "retrieve all road segments within 2 km of my location" or "find the nearest gas station". The R-tree can also accelerate nearest neighbor search for various distance metrics, including great-circle distance.

Datalog is a declarative logic programming language. While it is syntactically a subset of Prolog, Datalog generally uses a bottom-up rather than top-down evaluation model. This difference yields significantly different behavior and properties from Prolog. It is often used as a query language for deductive databases. Datalog has been applied to problems in data integration, networking, program analysis, and more.

<span class="mw-page-title-main">MonetDB</span> Open source column-oriented relational database management system

MonetDB is an open-source column-oriented relational database management system (RDBMS) originally developed at the Centrum Wiskunde & Informatica (CWI) in the Netherlands. It is designed to provide high performance on complex queries against large databases, such as combining tables with hundreds of columns and millions of rows. MonetDB has been applied in high-performance applications for online analytical processing, data mining, geographic information system (GIS), Resource Description Framework (RDF), text retrieval and sequence alignment processing.

A bitmap index is a special kind of database index that uses bitmaps.

Query optimization is a feature of many relational database management systems and other databases such as NoSQL and graph databases. The query optimizer attempts to determine the most efficient way to execute a given query by considering the possible query plans.

Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest to a given point. Closeness is typically expressed in terms of a dissimilarity function: the less similar the objects, the larger the function values.

In data analysis, anomaly detection is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority of the data and do not conform to a well defined notion of normal behavior. Such examples may arouse suspicions of being generated by a different mechanism, or appear inconsistent with the remainder of that set of data.

<span class="mw-page-title-main">Georg Gottlob</span> Austrian computer scientist

Georg Gottlob FRS is an Austrian-Italian computer scientist who works in the areas of database theory, logic, and artificial intelligence and is Professor of Informatics at the University of Calabria. He was Professor at the University of Oxford.

Alon Yitzchack Halevy is an Israeli-American computer scientist and a leading researcher in the area of data integration. He was a research scientist at Google from 2005 to 2015, when he left to become head of Recruit Institute of Technology. He left Recruit in 2018 and joined Facebook AI in 2019. Until 2006, he was a professor of computer science at the University of Washington, where his doctoral students included Xin Luna Dong. He received his PhD from Stanford University in 1993, under the joint supervision of Richard Fikes and Edward Feigenbaum.

<span class="mw-page-title-main">Samuel Madden (computer scientist)</span> American computer scientist

Samuel R. Madden is an American computer scientist specializing in database management systems. He is currently a professor of computer science at the Massachusetts Institute of Technology.

<span class="mw-page-title-main">Reverse image search</span> Content-based image retrieval

Reverse image search is a content-based image retrieval (CBIR) query technique that involves providing the CBIR system with a sample image that it will then base its search upon; in terms of information retrieval, the sample image is very useful. In particular, reverse image search is characterized by a lack of search terms. This effectively removes the need for a user to guess at keywords or terms that may or may not return a correct result. Reverse image search also allows users to discover content that is related to a specific sample image or the popularity of an image, and to discover manipulated versions and derivative works.

<span class="mw-page-title-main">Tomasz Imieliński</span> Polish-American computer scientist (born 1954)

Tomasz Imieliński is a Polish-American computer scientist, most known in the areas of data mining, mobile computing, data extraction, and search engine technology. He is currently a professor of computer science at Rutgers University in New Jersey, United States.

Dataspaces are an abstraction in data management that aims to overcome some of the problems encountered in a data integration system. The aim is to reduce the effort required to set up a data integration system by relying on existing matching and mapping generation techniques, and to improve the system in "pay-as-you-go" fashion as it is used. Labor-intensive aspects of data integration are postponed until they are absolutely needed.

<span class="mw-page-title-main">Log-structured merge-tree</span> Data structure

In computer science, the log-structured merge-tree is a data structure with performance characteristics that make it attractive for providing indexed access to files with high insert volume, such as transactional log data. LSM trees, like other search trees, maintain key-value pairs. LSM trees maintain data in two or more separate structures, each of which is optimized for its respective underlying storage medium; data is synchronized between the two structures efficiently, in batches.

<span class="mw-page-title-main">Martin L. Kersten</span> Dutch computer scientist (born 1953)

Martin L. Kersten was a computer scientist with research focus on database architectures, query optimization and their use in scientific databases. He was an architect of the MonetDB system, an open-source column store for data warehouses, online analytical processing (OLAP) and geographic information systems (GIS). He has been (co-) founder of several successful spin-offs of the Centrum Wiskunde & Informatica (CWI).

<span class="mw-page-title-main">Apache SINGA</span> Open-source machine learning library

Apache SINGA is an Apache top-level project for developing an open source machine learning library. It provides a flexible architecture for scalable distributed training, is extensible to run over a wide range of hardware, and has a focus on health-care applications.

Zehra Meral Özsoyoglu is a Turkish-American computer scientist specializing in databases, including research on query languages, database model, and indexes, and applications of databases in science, bioinformatics, and medical informatics. She is the Andrew R. Jennings Professor Emeritus of Computer Science at Case Western Reserve University.

<span class="mw-page-title-main">Gautam Das (computer scientist)</span> Indian computer scientist

Gautam Das is a computer scientist in the field of databases research. He is an ACM Fellow and IEEE Fellow.

Truth discovery is the process of choosing the actual true value for a data item when different data sources provide conflicting information on it.

Tim Kraska is a German computer scientist specializing in data systems and the intersection of systems and machine learning. He is currently an associate professor of computer science at the Massachusetts Institute of Technology.

References

  1. 1 2 Fang, Yixiang; Huang, Xin; Qin, Lu; Zhang, Ying; Zhang, Wenjie; Cheng, Reynold; Lin, Xuemin (2019). "A Survey of Community Search over Big Graphs". arXiv: 1904.12539 [cs.DB].
  2. 1 2 3 Sozio, Mauro; Gionis, Aristides (2010). "The community-search problem and how to plan a successful cocktail party". Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '10. p. 939. doi:10.1145/1835804.1835923. ISBN   9781450300551. S2CID   11484255.
  3. 1 2 3 Cui, Wanyun; Xiao, Yanghua; Wang, Haixun; Lu, Yiqi; Wang, Wei (2013). "Online search of overlapping communities". Proceedings of the 2013 international conference on Management of data - SIGMOD '13. p. 277. doi:10.1145/2463676.2463722. ISBN   9781450320375. S2CID   953025.
  4. 1 2 3 Cui, Wanyun; Xiao, Yanghua; Wang, Haixun; Wang, Wei (2014). "Local search of communities in large graphs". Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. pp. 991–1002. doi:10.1145/2588555.2612179. ISBN   9781450323765. S2CID   4653380.
  5. 1 2 Huang, Xin; Cheng, Hong; Qin, Lu; Tian, Wentao; Yu, Jeffrey Xu (2014). "Querying k-truss community in large and dynamic graphs". Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. pp. 1311–1322. doi:10.1145/2588555.2610495. ISBN   9781450323765. S2CID   207211829.
  6. 1 2 Li, Rong-Hua; Qin, Lu; Yu, Jeffrey Xu; Mao, Rui (2015). "Influential community search in large networks". Proceedings of the VLDB Endowment. 8 (5): 509–520. CiteSeerX   10.1.1.667.4074 . doi:10.14778/2735479.2735484. S2CID   17672355.
  7. 1 2 Barbieri, Nicola; Bonchi, Francesco; Galimberti, Edoardo; Gullo, Francesco (2015). "Efficient and effective community search". Data Mining and Knowledge Discovery. 29 (5): 1406–1433. doi:10.1007/s10618-015-0422-1. S2CID   13440433.
  8. 1 2 Huang, Xin; Lakshmanan, Laks V. S.; Yu, Jeffrey Xu; Cheng, Hong (2015). "Approximate closest community search in networks". Proceedings of the VLDB Endowment. 9 (4): 276–287. arXiv: 1505.05956 . doi:10.14778/2856318.2856323. S2CID   2905457.
  9. 1 2 3 4 5 Fang, Yixiang; Cheng, Reynold; Luo, Siqiang; Hu, Jiafeng (2016). "Effective community search for large attributed graphs". Proceedings of the VLDB Endowment. 9 (12): 1233–1244. doi:10.14778/2994509.2994538. hdl: 10722/232839 .
  10. 1 2 Fang, Yixiang; Cheng, Reynold; Li, Xiaodong; Luo, Siqiang; Hu, Jiafeng (2017). "Effective community search over large spatial graphs". Proceedings of the VLDB Endowment. 10 (6): 709–720. doi:10.14778/3055330.3055337. hdl: 10722/243528 .
  11. 1 2 "Effective Community Search for Large Attributed Graphs".
  12. Chang, Lijun; Lin, Xuemin; Qin, Lu; Yu, Jeffrey Xu; Zhang, Wenjie (2015). "Index-based Optimal Algorithms for Computing Steiner Components with Maximum Connectivity". Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. pp. 459–474. doi:10.1145/2723372.2746486. ISBN   9781450327589. S2CID   18282516.
  13. Hu, Jiafeng; Wu, Xiaowei; Cheng, Reynold; Luo, Siqiang; Fang, Yixiang (2017). "On Minimal Steiner Maximum-Connected Subgraph Queries". IEEE Transactions on Knowledge and Data Engineering. 29 (11): 2455–2469. doi:10.1109/TKDE.2017.2730873. S2CID   40432915.