XML retrieval

Last updated

XML retrieval, or XML information retrieval, is the content-based retrieval of documents structured with XML (eXtensible Markup Language). As such it is used for computing relevance of XML documents. [1]

Contents

Queries

Most XML retrieval approaches do so based on techniques from the information retrieval (IR) area, e.g. by computing the similarity between a query consisting of keywords (query terms) and the document. However, in XML-Retrieval the query can also contain structural hints. So-called "content and structure" (CAS) queries enable users to specify what structure the requested content can or must have.

Exploiting XML structure

Taking advantage of the self-describing structure of XML documents can improve the search for XML documents significantly. This includes the use of CAS queries, the weighting of different XML elements differently and the focused retrieval of subdocuments.

Ranking

Ranking in XML-Retrieval can incorporate both content relevance and structural similarity, which is the resemblance between the structure given in the query and the structure of the document. Also, the retrieval units resulting from an XML query may not always be entire documents, but can be any deeply nested XML elements, i.e. dynamic documents. The aim is to find the smallest retrieval unit that is highly relevant. Relevance can be defined according to the notion of specificity, which is the extent to which a retrieval unit focuses on the topic of request. [2]

Existing XML search engines

An overview of two potential approaches is available. [3] [4] The INitiative for the Evaluation of XML-Retrieval (INEX) was founded in 2002 and provides a platform for evaluating such algorithms. [2] Three different areas influence XML-Retrieval: [5]

Traditional XML query languages

Query languages such as the W3C standard XQuery [6] supply complex queries, but only look for exact matches. Therefore, they need to be extended to allow for vague search with relevance computing. Most XML-centered approaches imply a quite exact knowledge of the documents' schemas. [7]

Databases

Classic database systems have adopted the possibility to store semi-structured data [5] and resulted in the development of XML databases. Often, they are very formal, concentrate more on searching than on ranking, and are used by experienced users able to formulate complex queries.

Information retrieval

Classic information retrieval models such as the vector space model provide relevance ranking, but do not include document structure; only flat queries are supported. Also, they apply a static document concept, so retrieval units usually are entire documents. [7] They can be extended to consider structural information and dynamic document retrieval. Examples for approaches extending the vector space models are available: they use document subtrees (index terms plus structure) as dimensions of the vector space. [8]

Data-centric XML datasets

For data-centric XML datasets, the unique and distinct keyword search method, namely, XDMA [9] for XML databases is designed and developed based on dual indexing and mutual summation.

See also

Related Research Articles

Information retrieval (IR) in computing and information science is the process of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.

An image retrieval system is a computer system used for browsing, searching and retrieving images from a large database of digital images. Most traditional and common methods of image retrieval utilize some method of adding metadata such as captioning, keywords, title or descriptions to the images so that retrieval can be performed over the annotation words. Manual image annotation is time-consuming, laborious and expensive; to address this, there has been a large amount of research done on automatic image annotation. Additionally, the increase in social web applications and the semantic web have inspired the development of several web-based image annotation tools.

A query language, also known as data query language or database query language (DQL), is a computer language used to make queries in databases and information systems. A well known example is the Structured Query Language (SQL).

<span class="mw-page-title-main">Metasearch engine</span> Online information retrieval tool

A metasearch engine is an online information retrieval tool that uses the data of a web search engine to produce its own results. Metasearch engines take input from a user and immediately query search engines for results. Sufficient data is gathered, ranked, and presented to the users.

Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per document is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.

Document retrieval is defined as the matching of some stated user query against a set of free-text records. These records could be any type of mainly unstructured text, such as newspaper articles, real estate records or paragraphs in a manual. User queries can range from multi-sentence full descriptions of an information need to a few words.

<span class="mw-page-title-main">Content-based image retrieval</span> Method of image retrieval

Content-based image retrieval, also known as query by image content (QBIC) and content-based visual information retrieval (CBVIR), is the application of computer vision techniques to the image retrieval problem, that is, the problem of searching for digital images in large databases. Content-based image retrieval is opposed to traditional concept-based approaches.

An XML database is a data persistence software system that allows data to be specified, and sometimes stored, in XML format. This data can be queried, transformed, exported and returned to a calling system. XML databases are a flavor of document-oriented databases which are in turn a category of NoSQL database.

<span class="mw-page-title-main">Automatic image annotation</span>

Automatic image annotation is the process by which a computer system automatically assigns metadata in the form of captioning or keywords to a digital image. This application of computer vision techniques is used in image retrieval systems to organize and locate images of interest from a database.

In computer science, an inverted index is a database index storing a mapping from content, such as words or numbers, to its locations in a table, or in a document or a set of documents. The purpose of an inverted index is to allow fast full-text searches, at a cost of increased processing when a document is added to the database. The inverted file may be the database file itself, rather than its index. It is the most popular data structure used in document retrieval systems, used on a large scale for example in search engines. Additionally, several significant general-purpose mainframe-based database management systems have used inverted list architectures, including ADABAS, DATACOM/DB, and Model 204.

Search Engine Results Pages (SERP) are the pages displayed by search engines in response to a query by a user. The main component of the SERP is the listing of results that are returned by the search engine in response to a keyword query.

Search engine indexing is the collecting, parsing, and storing of data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is web indexing.

Query expansion (QE) is the process of reformulating a given query to improve retrieval performance in information retrieval operations, particularly in the context of query understanding. In the context of search engines, query expansion involves evaluating a user's input and expanding the search query to match additional documents. Query expansion involves techniques such as:

A concept search is an automated information retrieval method that is used to search electronically stored unstructured text for information that is conceptually similar to the information provided in a search query. In other words, the ideas expressed in the information retrieved in response to a concept search query are relevant to the ideas contained in the text of the query.

Ranking of query is one of the fundamental problems in information retrieval (IR), the scientific/engineering discipline behind search engines. Given a query q and a collection D of documents that match the query, the problem is to rank, that is, sort, the documents in D according to some criterion so that the "best" results appear early in the result list displayed to the user. Ranking in terms of information retrieval is an important concept in computer science and is used in many different applications such as search engine queries and recommender systems. A majority of search engines use ranking algorithms to provide users with accurate and relevant results.

Vector space model or term vector model is an algebraic model for representing text documents as vectors of identifiers. It is used in information filtering, information retrieval, indexing and relevancy rankings. Its first use was in the SMART Information Retrieval System.

XQuery is a query and functional programming language that queries and transforms collections of structured and unstructured data, usually in the form of XML, text and with vendor-specific extensions for other data formats. The language is developed by the XML Query working group of the W3C. The work is closely coordinated with the development of XSLT by the XSL Working Group; the two groups share responsibility for XPath, which is a subset of XQuery.

<span class="mw-page-title-main">Learning to rank</span> Use of machine learning to rank items

Learning to rank or machine-learned ranking (MLR) is the application of machine learning, typically supervised, semi-supervised or reinforcement learning, in the construction of ranking models for information retrieval systems. Training data consists of lists of items with some partial order specified between items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment for each item. The goal of constructing the ranking model is to rank new, unseen lists in a similar way to rankings in the training data.

The Binary Independence Model (BIM) in computing and information science is a probabilistic information retrieval technique. The model makes some simple assumptions to make the estimation of document/query similarity probable and feasible.

Evaluation measures for an information retrieval (IR) system assess how well an index, search engine or database returns results from a collection of resources that satisfy a user's query. They are therefore fundamental to the success of information systems and digital platforms. The success of an IR system may be judged by a range of criteria including relevance, speed, user satisfaction, usability, efficiency and reliability. However, the most important factor in determining a system's effectiveness for users is the overall relevance of results retrieved in response to a query. Evaluation measures may be categorised in various ways including offline or online, user-based or system-based and include methods such as observed user behaviour, test collections, precision and recall, and scores from prepared benchmark test sets.

References

  1. Winter, Judith; Drobnik, Oswald (November 9, 2007). "An Architecture for XML Information Retrieval in a Peer-to-Peer Environment" (PDF). ACM. Retrieved 2009-02-10.
  2. 1 2 Malik, Saadia; Trotman, Andrew; Lalmas, Mounia; Fuhr, Norbert (2007). "Overview of INEX 2006" (PDF). Proceedings of the Fifth Workshop of the INitiative for the Evaluation of XML Retrieval. Archived from the original (PDF) on October 16, 2008. Retrieved 2009-02-10.
  3. Amer-Yahia, Sihem; Lalmas, Mounia (2006). "XML Search: Languages, INEX and Scoring" (PDF). SIGMOD Rec. 35 (4). doi:10.1145/1228268.1228271. S2CID   17300151 . Retrieved 2009-02-10.[ dead link ]
  4. Pal, Sukomal (June 30, 2006). "XML Retrieval: A Survey". CiteSeerX   10.1.1.109.5986 .
  5. 1 2 Fuhr, Norbert; Gövert, N.; Kazai, Gabriella; Lalmas, Mounia (2003). "INEX: Initiative for the Evaluation of XML Retrieval" (PDF). Proceedings of the First INEX Workshop, Dagstuhl, Germany, 2002. ERCIM Workshop Proceedings, France. Archived from the original (PDF) on November 21, 2008. Retrieved 2009-02-10.
  6. Boag, Scott; Chamberlin, Don; Fernández, Mary F.; Florescu, Daniela; Robie, Jonathan; Siméon, Jérôme (23 January 2007). "XQuery 1.0: An XML Query Language". W3C Recommendation. World Wide Web Consortium. Retrieved 2009-02-10.
  7. 1 2 Schlieder, Torsten; Meuss, Holger (2002). "Querying and Ranking XML Documents". Journal of the American Society for Information Science and Technology. 53 (6): 489–503. doi:10.1002/asi.10060. Archived from the original on June 10, 2007. Retrieved 2009-02-10.
  8. Liu, Shaorong; Zou, Qinghua; Chu, Wesley W. (2004). "Configurable Indexing and Ranking for XML Information Retrieval" (PDF). SIGIR'04. ACM. Retrieved 2009-02-10.
  9. Selvaganesan, S.; Haw, Su-Cheng; Soon, Lay-Ki (2014). "XDMA: A Dual Indexing and Mutual Summation Based Keyword Search Algorithm for XML Databases". International Journal of Software Engineering and Knowledge Engineering. 24 (4): 591–615. doi:10.1142/s0218194014500223.