Author name disambiguation

The author name "Li Li" might refer to a number of people, including the seven listed here.

Author name disambiguation is the process of disambiguation and record linkage applied to the names of individual people. The process could, for example, distinguish individuals with the name "John Smith".

An editor may apply the process to scholarly documents, where the goal is to find all mentions of the same author and cluster them together. Authors of scholarly documents often share names, which makes it hard to tell each author's work apart. Author name disambiguation therefore aims to find all publications that belong to a given author and to distinguish them from publications of other authors who share the same name.

Methods

Considerable research has been conducted into name disambiguation. [1] [2] [3] [4] [5] Typical approaches to author name disambiguation rely on information that helps distinguish between authors, including (but not limited to) information about the authors, such as their name representations, affiliations and email addresses, and information about the publication, such as the year of publication, co-authors and the topic of the paper. This information can be used to train a machine learning classifier that decides whether two author mentions refer to the same author. [6] Much research treats name disambiguation as a clustering problem, i.e., partitioning documents into clusters in which each cluster represents one author. [2] [7] [8] Other research treats it as a classification problem. [9] Some works construct a document graph and use the graph topology to learn document similarity. [8] [10] More recently, several studies [10] [11] have aimed to learn low-dimensional document representations by employing network embedding methods. [12] [13]
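
The pairwise and clustering views described above can be illustrated with a minimal sketch. It is not taken from any of the cited systems. The `Mention` record layout, the feature weights and the threshold are all assumptions chosen for illustration, and a hand-weighted score stands in for the trained classifier. Pairs of mentions that score at or above the threshold are linked, and the connected components of the resulting graph are read off as author clusters.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass(frozen=True)
class Mention:
    """One author mention on one paper (hypothetical record layout)."""
    paper_id: str
    name: str
    affiliation: str
    coauthors: frozenset = field(default_factory=frozenset)
    year: int = 0


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_score(m1, m2):
    """Hand-weighted similarity in [0, 1]; a stand-in for a trained classifier."""
    coauthor_sim = jaccard(m1.coauthors, m2.coauthors)
    same_affiliation = 1.0 if m1.affiliation.lower() == m2.affiliation.lower() else 0.0
    # Papers published close together in time are treated as weakly more similar.
    year_sim = max(0.0, 1.0 - abs(m1.year - m2.year) / 10.0)
    # The weights below are illustrative assumptions, not learned values.
    return 0.5 * coauthor_sim + 0.3 * same_affiliation + 0.2 * year_sim


def cluster(mentions, threshold=0.5):
    """Link pairs scoring at or above threshold; return connected components."""
    parent = list(range(len(mentions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        if pair_score(mentions[i], mentions[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, m in enumerate(mentions):
        groups.setdefault(find(i), []).append(m.paper_id)
    return sorted(sorted(g) for g in groups.values())


# Four mentions of the same name string; all field values are made up.
mentions = [
    Mention("p1", "Li Li", "Univ A", frozenset({"A. Chen", "B. Kumar"}), 2015),
    Mention("p2", "Li Li", "Univ A", frozenset({"B. Kumar"}), 2016),
    Mention("p3", "Li Li", "Univ B", frozenset({"C. Novak"}), 2004),
    Mention("p4", "Li Li", "Univ B", frozenset({"C. Novak", "D. Okoye"}), 2005),
]
clusters = cluster(mentions)
print(clusters)  # [['p1', 'p2'], ['p3', 'p4']]
```

Comparing every pair is quadratic in the number of mentions, so a sketch like this is only practical within a small group of mentions that already share a name string.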

Applications

Some of the ways in which authorship has been indicated for the same person

Author names can be ambiguous for several reasons. Individuals may publish under multiple names because of different transliterations, misspellings, a name change due to marriage, or the use of nicknames, middle names or initials. [14]
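
Some of these variations, such as initials, omitted middle names and accented characters, can be checked mechanically. The sketch below is a heuristic written for illustration only. The function names are hypothetical, and it assumes names are written in "given names, then family name" order. It does not handle misspellings, nicknames or name changes, and a positive result means only that two strings could belong to the same person, not that they do.

```python
import unicodedata


def normalize(name):
    """Lowercase, strip accents and periods, and split a name into tokens."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_accents.lower().replace(".", " ").split()


def token_match(a, b):
    """Two given-name tokens match if equal or if one is the other's initial."""
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def compatible(name1, name2):
    """Heuristic: same last token, and given-name tokens agree position by position."""
    t1, t2 = normalize(name1), normalize(name2)
    if not t1 or not t2 or t1[-1] != t2[-1]:
        return False
    given1, given2 = t1[:-1], t2[:-1]
    # zip stops at the shorter list, so a missing middle name is tolerated.
    return all(token_match(a, b) for a, b in zip(given1, given2))


print(compatible("John A. Smith", "J. Smith"))    # True
print(compatible("John Smith", "Jane Smith"))     # False
print(compatible("José García", "Jose Garcia"))   # True
```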

Motivations for disambiguating individuals include identifying inventors from patents, and researchers across differing publishers, research institutions and time periods. [15] Name disambiguation is also a cornerstone in author-centric academic search and mining systems, such as AMiner (formerly ArnetMiner). [16]

Similar issues

Author name disambiguation is only one record linkage problem in the scholarly data domain. Closely related and potentially mutually beneficial problems include organisation (affiliation) disambiguation [17] and conference or publication venue disambiguation, since data publishers often use different names or aliases for these entities.

Resources

Several well-known benchmarks to evaluate author name disambiguation are listed below, each of which provides publications with some ambiguous names and their ground truths.
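
Given such ground truth, a predicted clustering can be scored. One measure often used for this purpose is pairwise precision, recall and F1, computed over all pairs of publications placed in the same cluster. The sketch below is an illustration rather than the evaluation code of any particular benchmark. The function names and the toy labels are assumptions, and both mappings are assumed to cover the same set of publication identifiers.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """Return the set of unordered item pairs that share a cluster label."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(labels), 2)
        if labels[a] == labels[b]
    }


def pairwise_prf(predicted, truth):
    """Pairwise precision, recall and F1 of predicted clusters against ground truth."""
    pred_pairs = same_cluster_pairs(predicted)
    true_pairs = same_cluster_pairs(truth)
    correct = len(pred_pairs & true_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: publication id -> author label (truth) or cluster id (predicted).
truth = {"p1": "author_a", "p2": "author_a", "p3": "author_a", "p4": "author_b"}
predicted = {"p1": 0, "p2": 0, "p3": 1, "p4": 1}
precision, recall, f1 = pairwise_prf(predicted, truth)
print(precision, recall, f1)
```

In the toy example, the prediction proposes two same-author pairs, of which one is correct, and the ground truth contains three, so precision is 1/2 and recall is 1/3.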

Source Codes

References

  1. De Bonis, Michele; Manghi, Paolo; Falchi, Fabrizio (2023). "Graph-based methods for Author Name Disambiguation: a survey". PeerJ Computer Science. 9: e1536. doi:10.7717/peerj-cs.1536. PMC 10557506. PMID 37810360.
  2. Khabsa, Madian; Treeratpituk, Pucktada; Giles, C. Lee (2015). Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries - JCDL '15. pp. 37–46. doi:10.1145/2756406.2756915. ISBN 9781450335942. S2CID 14068285.
  3. Mann, Gideon S.; Yarowsky, David (2003). "Unsupervised personal name disambiguation". Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003. Vol. 4. pp. 33–40. doi:10.3115/1119176.1119181. S2CID 29759924.
  4. Han, Hui; Giles, Lee; Zha, Hongyuan; Li, Cheng; Tsioutsiouliklis, Kostas (2004). "Two supervised learning approaches for name disambiguation in author citations". Proceedings of the 2004 joint ACM/IEEE conference on Digital libraries - JCDL '04. p. 296. doi:10.1145/996350.996419. ISBN 1581138326. S2CID 1089260.
  5. Huang, Jian; Ertekin, Seyda; Giles, C. Lee (2006). Knowledge Discovery in Databases: PKDD 2006. Lecture Notes in Computer Science. Vol. 4213. pp. 536–544. doi:10.1007/11871637_53. ISBN 978-3-540-45374-1. ISSN 0302-9743. S2CID 14132755.
  6. Treeratpituk, Pucktada; Giles, C. Lee (2009). Disambiguating authors in academic publications using random forests (PDF). Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries. ACM. pp. 39–48. CiteSeerX 10.1.1.147.3500. doi:10.1145/1555400.1555408.
  7. Jie Tang; A.C.M. Fong; Bo Wang; Jing Zhang (2012). "A Unified Probabilistic Framework for Name Disambiguation in Digital Library". IEEE Transactions on Knowledge and Data Engineering. 24 (6). IEEE: 975–987. doi:10.1109/TKDE.2011.13. S2CID 1032074.
  8. Xuezhi Wang; Jie Tang; Hong Cheng; Philip S. Yu (2011). ADANA: Active Name Disambiguation. Proceedings of 2011 IEEE International Conference on Data Mining. Vancouver: IEEE. pp. 794–803. doi:10.1109/ICDM.2011.19. ISBN 978-1-4577-2075-8.
  9. Zeyd Boukhers; Nagaraj Bahubali Asundi (2022). "Whois? Deep Author Name Disambiguation Using Bibliographic Data". Linking Theory and Practice of Digital Libraries. Lecture Notes in Computer Science. Vol. 13541. Padua: Springer. pp. 201–215. arXiv:2207.04772. doi:10.1007/978-3-031-16802-4_16. ISBN 978-3-031-16801-7.
  10. Yutao Zhang; Fanjin Zhang; Peiran Yao; Jie Tang (2018). Name Disambiguation in AMiner: Clustering, Maintenance, and Human in the Loop. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. London: ACM. pp. 1002–1011.
  11. Baichuan Zhang; Mohammad Al Hasan (2017). Name disambiguation in anonymized graphs using network embedding. Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. Singapore: ACM. pp. 1239–1248.
  12. Bryan Perozzi; Rami Al-Rfou; Steven Skiena (2014). Deepwalk: Online learning of social representations. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM. pp. 701–710.
  13. Jiezhong Qiu; Yuxiao Dong; Hao Ma; Jian Li; Kuansan Wang; Jie Tang (2018). Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec. Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. Marina Del Rey: ACM. pp. 459–467.
  14. Smalheiser, Neil R.; Torvik, Vetle I. (2009). "Author name disambiguation". Annual Review of Information Science and Technology. 43: 1–43. doi:10.1002/aris.2009.1440430113.
  15. Morrison, Greg; Riccaboni, Massimo; Pammolli, Fabio (16 May 2017). "Disambiguation of patent inventors and assignees using high-resolution geolocation data". Scientific Data. 4: 170064. Bibcode:2017NatSD...470064M. doi:10.1038/sdata.2017.64. PMC 5433392. PMID 28509897.
  16. Jie Tang; Jing Zhang; Limin Yao; Juanzi Li; Li Zhang; Zhong Su (2008). ArnetMiner: extraction and mining of academic social networks. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM. pp. 990–998.
  17. Zhang, Ziqi; Nuzzolese, Andrea; Gentile, Anna Lisa (2017). Entity Deduplication on ScholarlyData. Proceedings of the Extended Semantic Web Conference. Springer-Verlag. pp. 85–100. doi:10.1007/978-3-319-58068-5_6.
  18. Subramanian, Shivashankar; King, Daniel; Downey, Doug; Feldman, Sergey (21 Mar 2021). "S2AND: A Benchmark and Evaluation System for Author Name Disambiguation". arXiv:2103.07534 [cs.DL].