Information filtering system


An information filtering system is a system that removes redundant or unwanted information from an information stream using semi-automated or computerized methods prior to presentation to a human user. Its main goal is the management of information overload and the increase of the semantic signal-to-noise ratio. To do this, the user's profile is compared to some reference characteristics. These characteristics may originate from the information item (the content-based approach) or from the user's social environment (the collaborative filtering approach).
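As a concrete illustration of the content-based approach, the following minimal sketch compares a user's profile with incoming items purely on their textual content; the bag-of-words representation, the similarity threshold and the sample items are assumptions chosen for illustration, not the method of any particular system.

```python
# Minimal content-based filtering sketch: items and the user profile are
# represented as bag-of-words term-frequency vectors, and items whose cosine
# similarity to the profile falls below a threshold are filtered out.
from collections import Counter
import math

def vectorize(text: str) -> Counter:
    """Turn a text into a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def content_filter(profile_text: str, items: list[str], threshold: float = 0.1) -> list[str]:
    """Keep only the items sufficiently similar to the user's profile."""
    profile = vectorize(profile_text)
    return [item for item in items if cosine(profile, vectorize(item)) >= threshold]

incoming = [
    "new python release improves machine learning tooling",
    "celebrity gossip roundup of the week",
    "survey of information filtering and recommender systems",
]
print(content_filter("machine learning information filtering python", incoming))
```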


Whereas in information transmission signal processing filters are used against syntax-disrupting noise at the bit level, the methods employed in information filtering act at the semantic level.

The range of machine methods employed builds on the same principles as those for information extraction. A notable application can be found in the field of email spam filters. Thus, it is not only the information explosion that necessitates some form of filters, but also inadvertently or maliciously introduced pseudo-information.

On the presentation level, information filtering takes the form of user-preferences-based newsfeeds, etc.

Recommender systems and content discovery platforms are active information filtering systems that attempt to present to the user information items (film, television, music, books, news, web pages) the user is interested in. These systems add information items to the information flowing towards the user, as opposed to removing information items from the information flow towards the user. Recommender systems typically use collaborative filtering approaches or a combination of the collaborative filtering and content-based filtering approaches, although content-based recommender systems do exist.
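For instance, the collaborative filtering step of such a system can be sketched as follows; the rating data, the cosine similarity measure and the function names are illustrative assumptions rather than the approach of any specific recommender system.

```python
# Minimal user-based collaborative filtering sketch: users are compared by the
# similarity of their past ratings, and items liked by similar users are added
# to the target user's stream. The ratings data is purely illustrative.
import math

ratings = {  # user -> {item: rating}
    "alice": {"film_a": 5, "film_b": 3, "film_c": 4},
    "bob":   {"film_a": 4, "film_b": 3, "film_d": 5},
    "carol": {"film_b": 1, "film_c": 2, "film_d": 4},
}

def similarity(u: dict, v: dict) -> float:
    """Cosine similarity over the items both users have rated."""
    shared = u.keys() & v.keys()
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    return dot / (math.sqrt(sum(u[i] ** 2 for i in shared)) *
                  math.sqrt(sum(v[i] ** 2 for i in shared)))

def recommend(user: str, k: int = 2) -> list[str]:
    """Suggest unseen items, weighted by the ratings of similar users."""
    seen = ratings[user]
    scores: dict[str, float] = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = similarity(seen, other_ratings)
        for item, rating in other_ratings.items():
            if item not in seen:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("alice"))  # items alice has not seen, ranked by similar users' ratings
```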

History

Before the advent of the Internet, there were already several methods of filtering information; for instance, governments could control and restrict the flow of information in a given country by means of formal or informal censorship.

Newspaper editors and journalists also act as information filters when they select the most valuable information for their clients: readers of books, magazines and newspapers, radio listeners and TV viewers. A comparable filtering operation takes place in schools and universities, where information is selected on academic criteria for the customers of the service, the students. With the advent of the Internet, anyone can publish anything at low cost. This considerably increases the amount of less useful information and, consequently, quality information is diluted. This problem prompted the design of new filtering methods with which the information required on a specific topic can be obtained easily and efficiently.

Operation

A filtering system of this kind consists of several tools that help people find the most valuable information, so that the limited time one can dedicate to reading, listening or viewing is directed to the most interesting and valuable documents. These filters are also used to organize and structure information in a correct and understandable way, as well as to group the messages arriving in a mailbox. Such filters are essential to the results returned by search engines on the Internet, and filtering functions are continually improved to make the retrieval of web documents and messages more efficient.

Criterion

One of the criteria used at this stage is whether the information is harmful or not, that is, whether understanding is better served with or without it. In this case, the task of information filtering is to reduce or eliminate the harmful information.

Learning System

A content learning system generally consists of three basic stages, sketched in code after the list:

  1. First, a solver that produces solutions to a defined set of tasks.
  2. Next, an evaluation stage that applies assessment criteria to measure how well the previous stage solves those problems.
  3. Finally, an acquisition module whose output is knowledge that is fed back into the solver of the first stage.
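A minimal, hypothetical sketch of such a three-stage loop, with a keyword-rule solver, an accuracy-based evaluator and a naive rule-acquisition step (all names, rules and data chosen only for illustration), might look like this:

```python
# Three-stage loop sketch: a solver filters items with a simple keyword rule set,
# an evaluator measures its accuracy against labelled examples, and an acquisition
# module derives new rules from the solver's mistakes and feeds them back.

labelled = [  # (text, wanted?) pairs acting as the defined set of tasks
    ("cheap pills buy now", False),
    ("meeting agenda for monday", True),
    ("win a free prize now", False),
    ("quarterly report attached", True),
]

def solver(text: str, blocked_words: set[str]) -> bool:
    """Stage 1: decide whether to keep an item, given the current rules."""
    return not any(word in text for word in blocked_words)

def evaluate(blocked_words: set[str]) -> float:
    """Stage 2: fraction of labelled tasks the solver handles correctly."""
    correct = sum(solver(text, blocked_words) == wanted for text, wanted in labelled)
    return correct / len(labelled)

def acquire(blocked_words: set[str]) -> set[str]:
    """Stage 3: derive new rules from unwanted items that slipped through."""
    new_rules = set(blocked_words)
    for text, wanted in labelled:
        if not wanted and solver(text, new_rules):   # unwanted item was kept
            new_rules.add(text.split()[0])           # naive rule: block its first word
    return new_rules

rules: set[str] = set()
for _ in range(3):                                   # solver -> evaluation -> acquisition
    print("accuracy:", evaluate(rules))
    rules = acquire(rules)
print("learned rules:", rules)
```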

Future

Currently the problem is not so much finding the best way to filter information as making these systems learn the information needs of users on their own, automating not only the filtering process but also the construction and adaptation of the filter itself. Fields such as statistics, machine learning, pattern recognition and data mining form the basis for developing information filters that emerge and adapt based on experience. To carry out the learning process, part of the information has to be pre-filtered, meaning there are positive and negative examples, called training data, which can be generated by experts or via feedback from ordinary users.
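As an illustration of such experience-based learning, the following sketch trains a simple classifier on labelled positive and negative examples; scikit-learn and the sample data are assumptions made for the example, and any of the learning methods mentioned below could be substituted.

```python
# Learning a filter from pre-filtered training data (illustrative sketch).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Training data: positive (1) and negative (0) examples, e.g. produced by
# expert labelling or by ordinary user feedback.
texts = [
    "open source machine learning framework released",
    "new results in information retrieval research",
    "unbelievable weight loss trick, click here",
    "you have won a lottery you never entered",
]
labels = [1, 1, 0, 0]

# Bag-of-words features feeding a multinomial naive Bayes model.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

# The trained filter predicts whether unseen items are relevant (1) or not (0).
print(model.predict([
    "survey of machine learning for information filtering",
    "click here for an unbelievable prize",
]))
```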

Error

As new data is entered, the system induces new rules. If we expect these rules to generalize beyond the training data, we have to evaluate the system and measure its ability to correctly predict the categories of new information. This step is simplified by setting aside part of the training data as a separate series called "test data", which is used to measure the error rate. As a general rule, it is important to distinguish between types of errors (false positives and false negatives). For example, in the case of an aggregator of content for children, letting through information not suitable for them, such as material showing violence or pornography, is not of the same gravity as mistakenly discarding some appropriate information. To lower error rates and give these systems learning capabilities closer to those of humans, we require the development of systems that simulate human cognitive abilities, such as natural-language understanding, capturing common-sense meaning and other forms of advanced processing, in order to reach the semantics of the information.
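A small sketch of this evaluation step, with illustrative test labels and an assumed asymmetric cost that penalizes showing harmful content more heavily than over-blocking, could look like this:

```python
# Measuring the two error types on held-out test data (illustrative sketch).
# The positive class means "show the item", so a false positive (harmful content
# shown) is weighted far more heavily than a false negative (safe content blocked),
# as in the children's-aggregator example above.

# (true_label, predicted_label); 1 = safe to show, 0 = should be blocked
test_results = [(1, 1), (1, 0), (0, 0), (0, 1), (1, 1), (0, 0)]

false_positives = sum(1 for true, pred in test_results if true == 0 and pred == 1)  # harmful item shown
false_negatives = sum(1 for true, pred in test_results if true == 1 and pred == 0)  # safe item blocked
error_rate = (false_positives + false_negatives) / len(test_results)

# Asymmetric cost: showing harmful content counts ten times as much as over-blocking.
weighted_cost = 10 * false_positives + 1 * false_negatives

print(f"error rate: {error_rate:.2f}  FP: {false_positives}  FN: {false_negatives}  cost: {weighted_cost}")
```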

Fields of use

Nowadays, there are numerous techniques for building information filters, some of which achieve error rates lower than 10% in various experiments.[citation needed] Among these techniques are decision trees, support vector machines, neural networks, Bayesian networks, linear discriminants, logistic regression, etc. At present, these techniques are used in different applications, not only in the web context, but in areas as varied as voice recognition, classification in astronomy and the evaluation of financial risk.


Related Research Articles

A search engine is an information retrieval system designed to help find information stored on a computer system. The search results are usually presented in a list and are commonly called hits. Search engines help minimize the time required to find information and the amount of information which must be consulted, akin to other techniques for managing information overload.

The Semantic Web, sometimes known as Web 3.0, is an extension of the World Wide Web through standards set by the World Wide Web Consortium (W3C). The goal of the Semantic Web is to make Internet data machine-readable.

Music information retrieval (MIR) is the interdisciplinary science of retrieving information from music. Those involved in MIR may have a background in academic musicology, psychoacoustics, psychology, signal processing, informatics, machine learning, optical music recognition, computational intelligence or some combination of these.

In computer science, a software agent or software AI is a computer program that acts for a user or other program in a relationship of agency, which derives from the Latin agere: an agreement to act on one's behalf. Such "action on behalf of" implies the authority to decide which, if any, action is appropriate. Agents are colloquially known as bots, from robot. They may be embodied, as when execution is paired with a robot body, or as software such as a chatbot executing on a phone or other computing device. Software agents may be autonomous or work together with other agents or people. Software agents interacting with people may possess human-like qualities such as natural language understanding and speech, personality or embody humanoid form.


Collaborative filtering (CF) is a technique used by recommender systems. Collaborative filtering has two senses, a narrow one and a more general one.

A recommender system, or a recommendation system, is a subclass of information filtering system that provides suggestions for items that are most pertinent to a particular user. Typically, the suggestions refer to various decision-making processes, such as what product to purchase, what music to listen to, or what online news to read. Recommender systems are particularly useful when an individual needs to choose an item from a potentially overwhelming number of items that a service may offer.

Automatic summarization is the process of shortening a set of data computationally, to create a subset that represents the most important or relevant information within the original content. Artificial intelligence algorithms are commonly developed and employed to achieve this, specialized for different types of data.


Content-based image retrieval, also known as query by image content (QBIC) and content-based visual information retrieval (CBVIR), is the application of computer vision techniques to the image retrieval problem, that is, the problem of searching for digital images in large databases. Content-based image retrieval is opposed to traditional concept-based approaches.

Corporate taxonomy is the hierarchical classification of entities of interest of an enterprise, organization or administration, used to classify documents, digital assets and other information. Taxonomies can cover virtually any type of physical or conceptual entities at any level of granularity.

Product finders are information systems that help consumers to identify products within a large palette of similar alternative products. Product finders differ in complexity, the more complex among them being a special case of decision support systems. Conventional decision support systems, however, aim at specialized user groups, e.g. marketing managers, whereas product finders focus on consumers.

Sentiment analysis is the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine. With the rise of deep language models, such as RoBERTa, also more difficult data domains can be analyzed, e.g., news texts where authors typically express their opinion/sentiment less explicitly.

Adaptive hypermedia (AH) uses hypermedia which is adaptive according to a user model. In contrast to linear media, where all users are offered a standard series of hyperlinks, adaptive hypermedia (AH) tailors what the user is offered based on a model of the user's goals, preferences and knowledge, thus providing links or content most appropriate to the current user.

Cold start is a potential problem in computer-based information systems which involves a degree of automated data modelling. Specifically, it concerns the issue that the system cannot draw any inferences for users or items about which it has not yet gathered sufficient information.

The concept of the Social Semantic Web subsumes developments in which social interactions on the Web lead to the creation of explicit and semantically rich knowledge representations. The Social Semantic Web can be seen as a Web of collective knowledge systems, which are able to provide useful information based on human contributions and which get better as more people participate. The Social Semantic Web combines technologies, strategies and methodologies from the Semantic Web, social software and the Web 2.0.

Data preprocessing can refer to manipulation or dropping of data before it is used in order to ensure or enhance performance, and is an important step in the data mining process. The phrase "garbage in, garbage out" is particularly applicable to data mining and machine learning projects. Data-gathering methods are often loosely controlled, resulting in out-of-range values, impossible data combinations, and missing values, etc.

User modeling is the subdivision of human–computer interaction which describes the process of building up and modifying a conceptual understanding of the user. The main goal of user modeling is customization and adaptation of systems to the user's specific needs. The system needs to "say the 'right' thing at the 'right' time in the 'right' way". To do so it needs an internal representation of the user. Another common purpose is modeling specific kinds of users, including modeling of their skills and declarative knowledge, for use in automatic software-tests. User-models can thus serve as a cheaper alternative to user testing but should not replace user testing.

A concept search is an automated information retrieval method that is used to search electronically stored unstructured text for information that is conceptually similar to the information provided in a search query. In other words, the ideas expressed in the information retrieved in response to a concept search query are relevant to the ideas contained in the text of the query.

Social information processing is "an activity through which collective human actions organize knowledge." It is the creation and processing of information by a group of people. As an academic field Social Information Processing studies the information processing power of networked social systems.

Expertise finding is the use of tools for finding and assessing individual expertise. In the recruitment industry, expertise finding is the problem of searching for employable candidates with certain required skills set. In other words, it is the challenge of linking humans to expertise areas, and as such is a sub-problem of expertise retrieval.


Content curation is the process of gathering information relevant to a particular topic or area of interest, usually with the intention of adding value through the process of selecting, organizing, and looking after the items in a collection or exhibition. Services or people that implement content curation are called curators. Curation services can be used by businesses as well as end users.
