Robust collaborative filtering


Robust collaborative filtering, or attack-resistant collaborative filtering, refers to algorithms and techniques that make collaborative filtering more resistant to manipulation while preserving recommendation quality as far as possible. The manipulation in question usually takes the form of shilling attacks, also called profile injection attacks. Collaborative filtering predicts a user's rating of an item by finding similar users and looking at their ratings. Because an online system allows an almost unlimited number of user profiles to be created, collaborative filtering becomes vulnerable when many fake profiles are introduced into the system. Several approaches have been suggested to improve the robustness of both model-based and memory-based collaborative filtering. However, robust collaborative filtering is still an active research field, and major applications are yet to come.


Introduction

One of the biggest challenges to collaborative filtering is shilling attacks. Malicious users or competitors may deliberately inject a number of fake profiles into the system (typically 1-5%) in order to degrade recommendation quality or bias the predicted ratings to their own advantage. The main shilling attack strategies include random attacks, average attacks, bandwagon attacks, and segment-focused attacks.

Random attacks insert profiles that give random ratings to a subset of items; average attacks give each item its mean rating. [1] Bandwagon and segment-focused attacks are newer and more sophisticated attack models. Bandwagon attack profiles give random ratings to a subset of items and the maximum rating to very popular items, in an effort to increase the chance that the fake profiles have many neighbors. The segment-focused attack is similar to the bandwagon attack, but it gives the maximum rating to items that are expected to be highly rated by the target user group, rather than to frequently rated items. [2]
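To test how robust a recommender is, the attack models above can be simulated on a toy rating matrix. The following is a minimal sketch under stated assumptions: the `{user: {item: rating}}` layout, the 1-5 rating scale, and the names `item_stats`, `make_attack_profile`, `filler_size`, and `n_popular` are all hypothetical and do not come from the cited papers. The segment-focused attack is left out because it needs knowledge of a target user group.

```python
import random
from statistics import mean


def item_stats(ratings):
    """Return ({item: mean rating}, {item: rating count}) for {user: {item: rating}}."""
    per_item = {}
    for profile in ratings.values():
        for item, rating in profile.items():
            per_item.setdefault(item, []).append(rating)
    means = {item: mean(rs) for item, rs in per_item.items()}
    counts = {item: len(rs) for item, rs in per_item.items()}
    return means, counts


def make_attack_profile(kind, ratings, filler_size, r_min=1, r_max=5,
                        n_popular=1, rng=None):
    """Build one synthetic profile: 'random', 'average' or 'bandwagon'."""
    if kind not in ("random", "average", "bandwagon"):
        raise ValueError(f"unknown attack kind: {kind}")
    rng = rng or random.Random(0)
    means, counts = item_stats(ratings)
    items = sorted(means)
    # Pick the subset of items the fake profile will rate.
    fillers = rng.sample(items, min(filler_size, len(items)))
    if kind == "average":
        # Average attack: each chosen item gets its (rounded) mean rating.
        profile = {item: round(means[item]) for item in fillers}
    else:
        # Random and bandwagon attacks: random ratings on the chosen items.
        profile = {item: rng.randint(r_min, r_max) for item in fillers}
    if kind == "bandwagon":
        # Bandwagon attack: additionally give the maximum rating to the
        # most frequently rated items.
        popular = sorted(items, key=lambda item: (-counts[item], item))[:n_popular]
        for item in popular:
            profile[item] = r_max
    return profile
```

Calling `make_attack_profile` repeatedly and adding the results to the rating matrix under fresh user names gives an attacked data set that a detection method can be evaluated on.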

In general, item-based collaborative filtering is known to be more robust than user-based collaborative filtering. However, item-based collaborative filtering is still not completely immune to bandwagon and segment-focused attacks.

Robust collaborative filtering typically works as follows:

  1. Build a spam user detection model.
  2. Follow the workflow of a regular collaborative filtering system, but use only the rating data of non-spam users.
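The two steps can be sketched as a filter followed by an ordinary predictor. This is only an illustration: the spam detector is passed in as a hypothetical `is_spam` callable, and the item-based predictor (cosine similarity between items, similarity-weighted average) is one common choice rather than the method of any particular paper. The function names and the `{user: {item: rating}}` layout are assumptions.

```python
from math import sqrt


def keep_trusted(ratings, is_spam):
    """Step 1 output applied: drop every profile the detector flags as spam."""
    return {user: profile for user, profile in ratings.items() if not is_spam(user)}


def item_cosine(ratings, a, b):
    """Cosine similarity between items a and b over users who rated both."""
    common = [u for u, p in ratings.items() if a in p and b in p]
    dot = sum(ratings[u][a] * ratings[u][b] for u in common)
    norm_a = sqrt(sum(ratings[u][a] ** 2 for u in common))
    norm_b = sqrt(sum(ratings[u][b] ** 2 for u in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_item_based(ratings, user, item):
    """Step 2: similarity-weighted average of the user's ratings on other items."""
    num = den = 0.0
    for other, rating in ratings[user].items():
        if other == item:
            continue
        sim = item_cosine(ratings, item, other)
        num += sim * rating
        den += abs(sim)
    return num / den if den else None
```

With a detector in hand, a prediction is computed as `predict_item_based(keep_trusted(ratings, is_spam), user, item)`, so flagged profiles never contribute to item similarities.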

User relationships

Figure: Distributions of cosine distance under bandwagon attacks of different sizes

This is a detection method suggested by Gao et al. to make memory-based collaborative filtering more robust. [3] Popular metrics for measuring user similarity in collaborative filtering include the Pearson correlation coefficient, interest similarity, and cosine distance (see memory-based CF for definitions). A recommender system can detect attacks by exploiting the fact that the distributions of these metrics differ when spam users are present in the system. Because a shilling attack injects not a single fake profile but a large number of similar ones, the spam users are unusually similar to one another compared with normal users.
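The effect on a similarity metric can be seen on a toy example. The sketch below computes cosine similarity between user profiles, treating unrated items as 0; that convention, the `{user: {item: rating}}` layout, and the function names are assumptions made for illustration and may differ from the definitions used in [3].

```python
from itertools import combinations
from math import sqrt


def user_cosine(p, q):
    """Cosine similarity of two rating profiles; unrated items count as 0."""
    dot = sum(rating * q[item] for item, rating in p.items() if item in q)
    norm = sqrt(sum(r * r for r in p.values())) * sqrt(sum(r * r for r in q.values()))
    return dot / norm if norm else 0.0


def pairwise_similarities(ratings):
    """Map each unordered user pair to its cosine similarity."""
    return {
        (u, v): user_cosine(ratings[u], ratings[v])
        for u, v in combinations(sorted(ratings), 2)
    }
```

Near-identical injected profiles produce pairs whose similarity is close to 1, which shows up as extra mass at the high end of the similarity distribution.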

The overall system works as follows. Given a rating matrix, it runs a density-based clustering algorithm on the user relationship metrics to detect spam users, and assigns a weight of 0 to spam users and a weight of 1 to normal users. That is, the system considers only ratings from normal users when computing predictions. The rest of the algorithm works exactly as in normal item-based collaborative filtering.
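The weighting step can be illustrated with a deliberately crude density test. This is not the clustering algorithm of [3]: it only borrows the core-point idea from density-based clustering and flags a user as spam when enough other users lie within a similarity threshold. The function names, the `sim_threshold` and `min_neighbors` parameters, and their default values are hypothetical.

```python
from math import sqrt


def user_cosine(p, q):
    """Cosine similarity of two rating profiles; unrated items count as 0."""
    dot = sum(rating * q[item] for item, rating in p.items() if item in q)
    norm = sqrt(sum(r * r for r in p.values())) * sqrt(sum(r * r for r in q.values()))
    return dot / norm if norm else 0.0


def spam_weights(ratings, sim_threshold=0.95, min_neighbors=3):
    """Return {user: 0 or 1}: 0 for users sitting in a dense group of near-duplicates."""
    users = sorted(ratings)
    weights = {}
    for u in users:
        # Count other users whose profile is almost identical to u's.
        close = sum(
            1 for v in users
            if v != u and user_cosine(ratings[u], ratings[v]) >= sim_threshold
        )
        weights[u] = 0 if close >= min_neighbors else 1
    return weights


def weighted_ratings(ratings, weights):
    """Keep only profiles with weight 1 for the downstream item-based CF."""
    return {u: p for u, p in ratings.items() if weights[u] == 1}
```

Because the weights are binary, multiplying each rating by its user's weight is equivalent to removing the flagged profiles, which is what `weighted_ratings` does before the usual item-based prediction runs.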

According to experimental results on MovieLens data, this robust CF approach preserves accuracy relative to normal item-based CF while being more stable. The predictions of normal CF shift by 30-40% when spam user profiles are injected, whereas those of the robust approach shift by only about 5-10%.


References

  1. Bhaskar Mehta, Thomas Hofmann, and Wolfgang Nejdl, Robust Collaborative Filtering, RecSys '07: Proceedings of the 2007 ACM Conference on Recommender Systems, 49-56
  2. Bamshad Mobasher, Robin Burke, Chad Williams, and Runa Bhaumik, Analysis and Detection of Segment-Focused Attacks Against Collaborative Recommendation, Advances in Web Mining and Web Usage Analysis, 2005, 96-118
  3. Min Gao, Bin Ling, Quan Yuan, Qingyu Xiong, and Linda Yang, A Robust Collaborative Filtering Approach Based on User Relationships for Recommender Systems, Mathematical Problems in Engineering, vol. 2014, Article ID 162521