Entity linking

In natural language processing, entity linking, also referred to as named-entity linking (NEL), [1] named-entity disambiguation (NED), named-entity recognition and disambiguation (NERD), or named-entity normalization (NEN), [2] is the task of assigning a unique identity to entities (such as famous individuals, locations, or companies) mentioned in text. For example, given the sentence "Paris is the capital of France", the goal is to determine that "Paris" refers to the city of Paris and not to Paris Hilton or any other entity that could be referred to as "Paris". Entity linking is different from named-entity recognition (NER) in that NER identifies the occurrence of a named entity in text but does not identify which specific entity it is (see Differences from other techniques).

In entity linking, each named entity is linked to a unique identifier. Often, this identifier corresponds to a Wikipedia page.

Introduction

In entity linking, words of interest (names of persons, locations and companies) are mapped from an input text to corresponding unique entities in a target knowledge base. Words of interest are called named entities (NEs), mentions, or surface forms. The target knowledge base depends on the intended application, but for entity linking systems intended to work on open-domain text it is common to use knowledge bases derived from Wikipedia (such as Wikidata or DBpedia). [2] [3] In this case, each individual Wikipedia page is regarded as a separate entity. Entity linking techniques that map named entities to Wikipedia entities are also called wikification. [4]

Considering again the example sentence "Paris is the capital of France", the expected output of an entity linking system is the Wikipedia pages for Paris and France. The uniform resource locators (URLs) of these pages can be used as unique uniform resource identifiers (URIs) for the entities in the knowledge base. Using a different knowledge base will return different URIs, but for knowledge bases built starting from Wikipedia there exist one-to-one URI mappings. [5]
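
As a minimal illustration of this output, the following Python sketch pairs each mention of the example sentence with its Wikipedia URL and the corresponding Wikidata URI (Q90 and Q142 are the actual Wikidata identifiers of Paris and France); a real system would produce these mappings automatically rather than from a hand-written table:

```python
# Minimal sketch: the expected output of an entity linker for the example
# sentence, expressed as mention -> (Wikipedia URL, Wikidata URI) pairs.
linked_entities = {
    "Paris": (
        "https://en.wikipedia.org/wiki/Paris",
        "http://www.wikidata.org/entity/Q90",
    ),
    "France": (
        "https://en.wikipedia.org/wiki/France",
        "http://www.wikidata.org/entity/Q142",
    ),
}

for mention, (wikipedia_url, wikidata_uri) in linked_entities.items():
    print(f"{mention} -> {wikipedia_url} ({wikidata_uri})")
```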

In most cases, knowledge bases are manually built, [6] but in applications where large text corpora are available, the knowledge base can be inferred automatically from the available text. [7]

Entity linking is a critical step for bridging web data with knowledge bases: it benefits the annotation of the huge amount of raw and often noisy data on the Web and contributes to the vision of the Semantic Web. [8] Other critical steps in this direction include event extraction [9] and event linking. [10]

Applications

Entity linking is beneficial in fields that need to extract abstract representations from text, such as text analysis, recommender systems, semantic search and chatbots. In all these fields, concepts relevant to the application are separated from text and other non-meaningful data. [11] [12]

For example, a common task performed by search engines is to find documents that are similar to one given as input, or to find additional information about the persons mentioned in it. Consider a sentence that contains the expression "the capital of France": without entity linking, a search engine that looks only at the literal content of documents would not be able to directly retrieve documents containing the word "Paris", leading to so-called false negatives (FN). Even worse, the search engine might produce spurious matches, or false positives (FP), such as retrieving documents that refer to "France" as a country.
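
A toy sketch of this idea, with a hypothetical gazetteer-based linker standing in for a real NER-plus-disambiguation pipeline, indexes documents by linked entity identifiers instead of raw tokens and thereby avoids both kinds of error in the example above:

```python
# Toy illustration of entity-based retrieval (hypothetical gazetteer and
# documents; a real linker would use NER plus disambiguation).
def link(text: str) -> set[str]:
    gazetteer = {
        "the capital of france": "wiki/Paris",  # descriptive mention
        "paris": "wiki/Paris",
        "france": "wiki/France",
    }
    text = text.lower()
    entities = set()
    for phrase in sorted(gazetteer, key=len, reverse=True):  # longest match first
        if phrase in text:
            entities.add(gazetteer[phrase])
            text = text.replace(phrase, " ")  # consume the matched span
    return entities

documents = [
    "Paris hosted the 1900 Summer Olympics.",
    "France is a country in Western Europe.",
]

# Index documents by linked entity identifiers rather than raw tokens.
index: dict[str, list[str]] = {}
for doc in documents:
    for entity in link(doc):
        index.setdefault(entity, []).append(doc)

# The query now retrieves the document about Paris (avoiding the false
# negative) without also matching the document about France as a country
# (avoiding the false positive).
for entity in link("the capital of France"):
    print(entity, "->", index[entity])
```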

Many approaches orthogonal to entity linking exist to retrieve documents similar to an input document, for example latent semantic analysis (LSA) or the comparison of document embeddings obtained with doc2vec. However, these techniques do not allow the same fine-grained control offered by entity linking: they return other documents instead of building high-level representations of the original one. For example, obtaining the schematic information about "Paris" presented in Wikipedia infoboxes would be much less straightforward, or sometimes even unfeasible, depending on the query complexity. [13]

Moreover, entity linking has been used to improve the performance of information retrieval systems [2] and to improve search performance on digital libraries. [14] Entity linking is also a key input for semantic search. [15]

Challenges in entity linking

An entity linking system has to deal with a number of challenges before it can perform well in real-life applications. Some of these issues are intrinsic to the task of entity linking, [16] such as text ambiguity, while others, such as scalability and execution time, become relevant when considering real-life usage of such systems.

Differences from other techniques

Entity linking is also known as named-entity disambiguation (NED), and is deeply connected to wikification and record linkage. [20] Definitions are often blurry and vary slightly among authors: Alhelbawy et al. [21] consider entity linking a broader version of NED, since NED assumes that the entity that correctly matches a textual named-entity mention is present in the knowledge base, while entity linking systems might have to deal with cases in which no entry for the named entity is available in the reference knowledge base. Other authors do not make such a distinction, and use the two names interchangeably. [22] [23]

For example, the sentence

Paris is the capital of France.

would be processed by an NER system to obtain the following output:

[Paris]City is the capital of [France]Country.

Named-entity recognition is usually a preprocessing step of an entity linking system, as it can be useful to know in advance which words should be linked to entities of the knowledge base.
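
For instance, a short sketch using the spaCy library (an assumption of this example; any NER system would do) shows the mention spans that such a preprocessing step hands to the linker:

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Paris is the capital of France.")

# Each detected span becomes a mention to be linked in the next stage.
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
# Expected output (model-dependent):
# Paris GPE 0 5
# France GPE 24 30
```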

Entity linking is also distinct from coreference resolution, which finds all the expressions in a text that refer to the same entity. Consider, for example, the sentence

Paris is the capital of France. It is also the largest city in France.

In this example, a coreference resolution algorithm would identify that the pronoun It refers to Paris, and not to France or to another entity. A notable distinction compared to entity linking is that coreference resolution does not assign any unique identity to the words it matches: it simply says whether they refer to the same entity or not. In that sense, predictions from a coreference resolution system could be useful to a subsequent entity linking component.

Approaches to entity linking

Entity linking has been a hot topic in industry and academia for the last decade. However, many of its challenges remain unsolved to this day, and numerous entity linking systems, with widely different strengths and weaknesses, have been proposed. [24]

Broadly speaking, modern entity linking systems can be divided into two categories: text-based approaches, which rely on textual features extracted from large text corpora, and graph-based approaches, which exploit the structure of knowledge graphs to represent the context and the relations of entities.

Often, entity linking systems cannot be strictly assigned to either category, as they make use of knowledge graphs that have been enriched with additional textual features extracted, for example, from the text corpora that were used to build the knowledge graphs themselves. [22] [23]

Representation of the main steps in an entity linking algorithm. Most entity linking algorithms are composed of an initial named-entity recognition step in which named entities are found in the original text (here, Paris and France), and of a subsequent step in which each named entity is linked to its corresponding unique identifier (here, a Wikipedia page). This last step is often done by creating a small set of candidate identifiers for each named entity, and by picking the most promising candidate with respect to a chosen metric.

Text-based entity linking

The seminal work by Cucerzan in 2007 proposed one of the first entity linking systems to appear in the literature, tackling the task of wikification, i.e., linking textual mentions to Wikipedia pages. [25] This system partitions Wikipedia pages into entity, disambiguation, and list pages, which are used to assign a category to each entity. The set of entities present in each entity page is used to build that entity's context. The final entity linking step is a collective disambiguation performed by comparing binary vectors obtained from hand-crafted features and from each entity's context. Cucerzan's entity linking system is still used as a baseline for many recent works. [27]
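
The following toy sketch conveys the flavor of this comparison, scoring candidates by the overlap (the dot product of the corresponding binary vectors) between invented entity contexts and the document context; Cucerzan's actual features and contexts are far richer:

```python
# Toy sketch of disambiguation by context-vector agreement; the contexts
# below are invented for illustration.
doc_context = {"capital", "france", "seine", "city"}

candidate_contexts = {
    "Paris": {"capital", "france", "seine", "city", "louvre"},
    "Paris Hilton": {"actress", "socialite", "hotel", "heiress"},
    "Paris, Texas": {"city", "texas", "lamar", "county"},
}

# The set intersection size equals the dot product of binary vectors.
def score(candidate: str) -> int:
    return len(candidate_contexts[candidate] & doc_context)

best = max(candidate_contexts, key=score)
print(best, {c: score(c) for c in candidate_contexts})
# Paris {'Paris': 4, 'Paris Hilton': 0, 'Paris, Texas': 1}
```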

The work of Rao et al. is a well-known paper in the field of entity linking. [16] The authors propose a two-step algorithm to link named entities to entities in a target knowledge base. First, a set of candidate entities is chosen using string matching, acronyms, and known aliases. Then the best link among the candidates is chosen with a ranking support vector machine (SVM) that uses linguistic features.
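
A minimal sketch of this two-step scheme follows, with an invented alias table and a simple context-overlap score standing in for the ranking SVM and the linguistic features of the original paper:

```python
# Step 1: generate candidates via string matching and known aliases.
# Step 2: rank them (here with a single overlap feature; Rao et al. combine
# many linguistic features inside a ranking SVM). All tables are invented.
aliases = {
    "Paris": ["Paris", "City of Light", "Paname"],
    "Paris Hilton": ["Paris", "Paris Hilton"],
    "France": ["France", "French Republic", "FR"],  # "FR" as an acronym
}

entity_context = {
    "Paris": {"capital", "france", "seine"},
    "Paris Hilton": {"actress", "socialite"},
    "France": {"country", "europe"},
}

def generate_candidates(mention: str) -> list[str]:
    mention = mention.lower()
    return [entity for entity, names in aliases.items()
            if any(mention == name.lower() for name in names)]

def rank(mention: str, context: set[str]) -> list[tuple[float, str]]:
    scored = []
    for entity in generate_candidates(mention):
        overlap = len(entity_context[entity] & context)
        scored.append((float(overlap), entity))
    return sorted(scored, reverse=True)

print(rank("Paris", context={"capital", "france"}))
# [(2.0, 'Paris'), (0.0, 'Paris Hilton')]
```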

Recent systems, such as the one proposed by Tsai et al., [20] employ word embeddings obtained with a skip-gram model as language features, and can be applied to any language as long as a large corpus to build word embeddings is provided. Similarly to most entity linking systems, the linking is done in two steps, with an initial candidate entities selection and a linear ranking SVM as second step.
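
The sketch below, assuming the gensim library is installed, trains skip-gram word embeddings (sg=1) of the kind such systems use as language features; the toy corpus stands in for the large corpora the approach actually requires:

```python
from gensim.models import Word2Vec

# Train skip-gram embeddings (sg=1) on a toy corpus; in practice the corpus
# is large and monolingual, which is what makes the approach portable
# across languages.
corpus = [
    ["paris", "is", "the", "capital", "of", "france"],
    ["france", "is", "a", "country", "in", "europe"],
    ["paris", "hilton", "is", "an", "american", "socialite"],
]
model = Word2Vec(corpus, vector_size=50, window=2, sg=1, min_count=1, epochs=50)

# The resulting vectors can serve as features for candidate ranking, e.g.,
# comparing a mention's context vector with each candidate's vector.
print(model.wv.similarity("paris", "france"))
```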

Various approaches have been tried to tackle the problem of entity ambiguity. In the seminal approach of Milne and Witten, supervised learning is employed using the anchor texts of Wikipedia entities as training data. [28] Other approaches also collected training data based on unambiguous synonyms. [29]
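
Anchor texts make it possible to estimate a simple "commonness" prior P(entity | mention) from link statistics, a core ingredient of this family of approaches; the sketch below uses invented counts:

```python
from collections import Counter

# Invented anchor-text statistics: how often the anchor "paris" links to
# each target page across Wikipedia.
anchor_counts = {
    "paris": Counter({"Paris": 9000, "Paris Hilton": 300, "Paris, Texas": 150}),
}

def commonness(mention: str) -> dict[str, float]:
    # Estimate P(entity | mention) as the fraction of anchors with this
    # surface form that point to each entity.
    counts = anchor_counts[mention.lower()]
    total = sum(counts.values())
    return {entity: n / total for entity, n in counts.items()}

print(commonness("Paris"))
# {'Paris': 0.952..., 'Paris Hilton': 0.031..., 'Paris, Texas': 0.015...}
```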

Graph-based entity linking

Modern entity linking systems do not limit their analysis to textual features generated from input documents or text corpora, but employ large knowledge graphs created from knowledge bases such as Wikipedia. These systems extract complex features which take advantage of the knowledge graph topology, or leverage multi-step connections between entities, which would be hidden by simple text analysis. Moreover, creating multilingual entity linking systems based on natural language processing (NLP) is inherently difficult, as it requires either large text corpora, often absent for many languages, or hand-crafted grammar rules, which are widely different among languages. Han et al. propose the creation of a disambiguation graph (a subgraph of the knowledge base which contains candidate entities). [3] This graph is employed for a purely collective ranking procedure that finds the best candidate link for each textual mention.
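
A minimal sketch of such a disambiguation graph, using the networkx library, with invented candidate sets and relatedness weights (a real system derives both from the knowledge base):

```python
import itertools
import networkx as nx

# One node per candidate entity; edges between candidates of different
# mentions, weighted by how related the two entities are.
candidates = {
    "Paris": ["Paris", "Paris Hilton"],
    "France": ["France"],
}
relatedness = {("Paris", "France"): 0.9, ("Paris Hilton", "France"): 0.1}

graph = nx.Graph()
for mention, entities in candidates.items():
    for entity in entities:
        graph.add_node(entity, mention=mention)

# Connect candidates of *different* mentions with relatedness-weighted edges.
for a, b in itertools.combinations(graph.nodes, 2):
    if graph.nodes[a]["mention"] != graph.nodes[b]["mention"]:
        weight = relatedness.get((a, b)) or relatedness.get((b, a)) or 0.0
        graph.add_edge(a, b, weight=weight)

print(graph.edges(data=True))
```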

Another well-known entity linking approach is AIDA, which uses a series of complex graph algorithms and a greedy algorithm that identifies coherent mentions on a dense subgraph, also taking into account context similarities and vertex importance features, to perform collective disambiguation. [26]

Graph ranking (or vertex ranking) denotes algorithms such as PageRank (PR) and Hyperlink-Induced Topic Search (HITS), whose goal is to assign to each vertex a score representing its relative importance in the overall graph. The entity linking system presented in Alhelbawy et al. employs PageRank to perform collective entity linking on a disambiguation graph, and to understand which entities are more strongly related to each other and would therefore represent a better linking. [21]
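
Continuing the disambiguation-graph sketch above (and reusing its graph variable), a hedged illustration of this idea runs networkx's PageRank over the graph and keeps the best-scoring candidate for each mention; this conveys the principle rather than the full algorithm of Alhelbawy et al.:

```python
# Run PageRank on the weighted disambiguation graph built in the previous
# sketch, then keep, for each mention, the highest-scoring candidate.
scores = nx.pagerank(graph, weight="weight")

best = {}
for entity, score in scores.items():
    mention = graph.nodes[entity]["mention"]
    if mention not in best or score > scores[best[mention]]:
        best[mention] = entity

print(best)  # e.g., {'Paris': 'Paris', 'France': 'France'}
```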

Mathematical entity linking

Mathematical expressions (symbols and formulae) can be linked to semantic entities (e.g., Wikipedia articles [30] or Wikidata items [31] ) labeled with their natural language meaning. This is essential for disambiguation, since symbols may have different meanings (e.g., "E" can be "energy" or "expectation value", etc.). [32] [31] The math entity linking process can be facilitated and accelerated through annotation recommendation, e.g., using the "AnnoMathTeX" system that is hosted by Wikimedia. [33] [34] [35]
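
As a toy illustration of the disambiguation problem, the sketch below picks the meaning of "E" whose definition overlaps most with the surrounding text; the candidate entries are invented, and a real MathEL system would link to actual Wikipedia or Wikidata entries:

```python
# Toy sketch of mathematical entity linking: disambiguate the identifier "E"
# by overlap between the surrounding text and candidate definitions.
candidates = {
    "E": {
        "energy": {"mass", "speed", "light", "joule", "physics"},
        "expectation value": {"random", "variable", "probability", "mean"},
    }
}

context = set("the energy E equals mass times the speed of light squared".split())

def link_symbol(symbol: str) -> str:
    meanings = candidates[symbol]
    return max(meanings, key=lambda m: len(meanings[m] & context))

print(link_symbol("E"))  # energy
```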

To facilitate the reproducibility of mathematical entity linking (MathEL) experiments, the benchmark MathMLben was created. [36] [37] It contains formulae from Wikipedia, the arXiv, and the NIST Digital Library of Mathematical Functions (DLMF). Formula entries in the benchmark are labeled and augmented by Wikidata markup. [31] Furthermore, distributions of mathematical notation were examined in two large corpora from the arXiv [38] and zbMATH [39] repositories, and Mathematical Objects of Interest (MOI) were identified as potential candidates for MathEL. [40]

Besides linking to Wikipedia, Schubotz [37] and Scharpf et al. [31] describe linking mathematical formula content to Wikidata, both in MathML and LaTeX markup. To extend classical citations with mathematical ones (formula citations), they call for a Formula Concept Discovery (FCD) and Formula Concept Recognition (FCR) challenge to develop automated MathEL. Their FCD approach yields a recall of 68% for retrieving equivalent representations of frequent formulae, and 72% for extracting the formula name from the surrounding text, on the NTCIR [41] arXiv dataset. [35]

See also

Natural language processing
Semantic network
Word-sense disambiguation
Text mining
Question answering
Information extraction
Annotation
Semantic similarity
Named-entity recognition
Bibliographic coupling
Sentiment analysis
Statistical semantics
Plagiarism detection
Ontology learning
Knowledge extraction
Explicit semantic analysis
Toponym resolution
Author name disambiguation
Rada Mihalcea
Knowledge graph

References

  1. Hachey, Ben; Radford, Will; Nothman, Joel; Honnibal, Matthew; Curran, James R. (2013). "Evaluating Entity Linking with Wikipedia". Artificial Intelligence. 194: 130–150. doi:10.1016/j.artint.2012.04.005.
  2. Khalid, M. A.; Jijkoun, V.; de Rijke, M. (2008). "The impact of named entity normalization on information retrieval for question answering". Proc. ECIR.
  3. Han, Xianpei; Sun, Le; Zhao, Jun (2011). "Collective entity linking in web text". Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM. pp. 765–774. doi:10.1145/2009916.2010019. ISBN 9781450307574.
  4. Mihalcea, Rada; Csomai, Andras (2007). "Wikify! Linking Documents to Encyclopedic Knowledge". Proc. CIKM.
  5. "Wikipedia Links". 4 May 2023.
  6. Wikidata.
  7. Cohen, Aaron M. (2005). "Unsupervised gene/protein named entity normalization using automatically extracted dictionaries". Proc. ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics. pp. 17–24.
  8. Shen, W.; Wang, J.; Han, J. (2014). "Entity linking with a knowledge base: Issues, techniques, and solutions". IEEE Transactions on Knowledge and Data Engineering. 27 (2): 443–460.
  9. Chang, Y. C.; Chu, C. H.; Su, Y. C.; et al. (2016). "PIPE: a protein–protein interaction passage extraction module for BioCreative challenge". Database. 2016.
  10. Lou, P.; Jimeno Yepes, A.; Zhang, Z.; et al. (2020). "BioNorm: deep learning-based event normalization for the curation of reaction databases". Bioinformatics. 36 (2): 611–620.
  11. Slawski, Bill (16 September 2015). "How Google Uses Named Entity Disambiguation for Entities with the Same Names".
  12. Zhou, Ming; Lv, Weifeng; Ren, Pengjie; Wei, Furu; Tan, Chuanqi (2017). "Entity Linking for Queries by Searching Wikipedia Sentences". Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pp. 68–77. arXiv:1704.02788. doi:10.18653/v1/D17-1007.
  13. Le, Quoc; Mikolov, Tomas (2014). "Distributed Representations of Sentences and Documents". Proceedings of the 31st International Conference on Machine Learning. 32: II-1188–II-1196. arXiv:1405.4053.
  14. Han, Hui; Zha, Hongyuan; Giles, C. Lee (2005). "Name disambiguation in author citations using a K-way spectral clustering method". ACM/IEEE Joint Conference on Digital Libraries (JCDL 2005). pp. 334–343.
  15. "STICS". Archived from the original on 2021-09-01. Retrieved 2015-11-16.
  16. Rao, Delip; McNamee, Paul; Dredze, Mark (2013). "Entity Linking: Finding Extracted Entities in a Knowledge Base". Multi-source, Multilingual Information Extraction and Summarization. Theory and Applications of Natural Language Processing. Springer Berlin Heidelberg. pp. 93–115. doi:10.1007/978-3-642-28569-1_5. ISBN 978-3-642-28568-4.
  17. Parravicini, Alberto; Patra, Rhicheek; Bartolini, Davide B.; Santambrogio, Marco D. (2019). "Fast and Accurate Entity Linking via Graph Embedding". Proceedings of the 2nd Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA). ACM. pp. 10:1–10:9. doi:10.1145/3327964.3328499. ISBN 9781450367899.
  18. Hoffart, Johannes; Altun, Yasemin; Weikum, Gerhard (2014). "Discovering emerging entities with ambiguous names". Proceedings of the 23rd International Conference on World Wide Web. ACM. pp. 385–396. doi:10.1145/2566486.2568003. ISBN 9781450327442.
  19. Doermann, David S.; Oard, Douglas W.; Lawrie, Dawn J.; Mayfield, James; McNamee, Paul (2011). "Cross-Language Entity Linking".
  20. Tsai, Chen-Tse; Roth, Dan (2016). "Cross-lingual Wikification Using Multilingual Embeddings". Proceedings of NAACL-HLT 2016. pp. 589–598. doi:10.18653/v1/N16-1072.
  21. Alhelbawy, Ayman; Gaizauskas, Robert (August 2014). "Collective Named Entity Disambiguation using Graph Ranking and Clique Partitioning Approaches". Proceedings of COLING 2014. Dublin City University and Association for Computational Linguistics. pp. 1544–1555.
  22. Zwicklbauer, Stefan; Seifert, Christin; Granitzer, Michael (2016). "Robust and Collective Entity Disambiguation through Semantic Embeddings". Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM. pp. 425–434. doi:10.1145/2911451.2911535. ISBN 9781450340694.
  23. Hachey, Ben; Radford, Will; Nothman, Joel; Honnibal, Matthew; Curran, James R. (2013). "Evaluating Entity Linking with Wikipedia". Artificial Intelligence. 194: 130–150. doi:10.1016/j.artint.2012.04.005. ISSN 0004-3702.
  24. Ji, Heng; Nothman, Joel; Hachey, Ben; Florian, Radu (2015). "Overview of TAC-KBP2015 Tri-lingual Entity Discovery and Linking". TAC.
  25. Cucerzan, Silviu (June 2007). "Large-Scale Named Entity Disambiguation Based on Wikipedia Data". Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). pp. 708–716.
  26. Hoffart, Johannes; Yosef, Mohamed Amir; Bordino, Ilaria; Fürstenau, Hagen; Pinkal, Manfred; Spaniol, Marc; Taneva, Bilyana; Thater, Stefan; Weikum, Gerhard (2011). "Robust Disambiguation of Named Entities in Text". Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. pp. 782–792.
  27. Kulkarni, Sayali; Singh, Amit; Ramakrishnan, Ganesh; Chakrabarti, Soumen (2009). "Collective annotation of Wikipedia entities in web text". Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). doi:10.1145/1557019.1557073. ISBN 9781605584959.
  28. Milne, David; Witten, Ian H. (2008). "Learning to link with Wikipedia". Proc. CIKM.
  29. Zhang, Wei; Su, Jian; Tan, Chew Lim (2010). "Entity Linking Leveraging Automatically Generated Annotation". Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010).
  30. Kristianto, Giovanni Yoko; Topic, Goran; Aizawa, Akiko; et al. (2016). "Entity Linking for Mathematical Expressions in Scientific Documents". Digital Libraries: Knowledge, Information, and Data in an Open Access Society. Lecture Notes in Computer Science. Vol. 10075. Springer. pp. 144–149. doi:10.1007/978-3-319-49304-6_18. ISBN 978-3-319-49303-9.
  31. Scharpf, Philipp; Schubotz, Moritz; et al. (2018). "Representing Mathematical Formulae in Content MathML using Wikidata". ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2018).
  32. Schubotz, Moritz; Scharpf, Philipp; et al. (2018). "Introducing MathQA: a Math-Aware question answering system". Information Discovery and Delivery. 46 (4). Emerald Publishing Limited: 214–224. arXiv:1907.01642. doi:10.1108/IDD-06-2018-0022.
  33. "AnnoMathTeX Formula/Identifier Annotation Recommender System".
  34. Scharpf, Philipp; Mackerracher, Ian; et al. (17 September 2019). "AnnoMathTeX: a formula identifier annotation recommender system for STEM documents". Proceedings of the 13th ACM Conference on Recommender Systems. pp. 532–533. doi:10.1145/3298689.3347042. ISBN 9781450362436.
  35. Scharpf, Philipp; Schubotz, Moritz; Gipp, Bela (14 April 2021). "Fast Linking of Mathematical Wikidata Entities in Wikipedia Articles Using Annotation Recommendation". Companion Proceedings of the Web Conference 2021. pp. 602–609. arXiv:2104.05111. doi:10.1145/3442442.3452348. ISBN 9781450383134.
  36. "MathMLben formula benchmark".
  37. Schubotz, Moritz; Greiner-Petter, André; Scharpf, Philipp; Meuschke, Norman; Cohl, Howard; Gipp, Bela (2018). "Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual Context". Proceedings of the 18th ACM/IEEE Joint Conference on Digital Libraries. pp. 233–242. arXiv:1804.04956. doi:10.1145/3197026.3197058. ISBN 9781450351782. PMC 8474120. PMID 34584342.
  38. "arXiv preprint repository".
  39. "zbMATH mathematical document library".
  40. Greiner-Petter, André; Schubotz, Moritz; Mueller, Fabian; Breitinger, Corinna; Cohl, Howard S.; Aizawa, Akiko; Gipp, Bela (2020). "Discovering Mathematical Objects of Interest—A Study of Mathematical Notations". Proceedings of the Web Conference 2020. pp. 1445–1456. arXiv:2002.02712. doi:10.1145/3366423.3380218. ISBN 9781450370233.
  41. Aizawa, Akiko; Kohlhase, Michael; Ounis, Iadh; Schubotz, Moritz. "NTCIR-11 Math-2 Task Overview". Proceedings of the 11th NTCIR Conference on Evaluation of Information Access Technologies.