Wordfilter

A wordfilter (sometimes called simply a "filter" or "censor") is a script, typically used on Internet forums or chat rooms, that scans users' posts or comments as they are submitted and automatically changes or censors particular words or phrases.

The most basic wordfilters search only for specific strings of letters, and remove or overwrite them regardless of their context. More advanced wordfilters make some exceptions for context (such as filtering "butt" but not "butter"), and the most advanced wordfilters may use regular expressions.
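
The difference between a context-blind filter and one that makes a simple exception for context can be sketched as follows. This is only an illustration, not the code of any particular forum package; the function names and the "****" mask are assumptions made for the example.

```python
import re


def naive_filter(text: str, word: str, mask: str = "****") -> str:
    """Overwrite every occurrence of `word`, regardless of context."""
    return text.replace(word, mask)


def whole_word_filter(text: str, word: str, mask: str = "****") -> str:
    """Overwrite `word` only where it stands alone as a whole word."""
    # \b matches a word boundary, so "butt" inside "butter" is left alone.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return pattern.sub(mask, text)
```

With the filtered word "butt", the naive version turns "pass the butter" into "pass the ****er", while the whole-word version leaves the sentence unchanged.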

Functions

Wordfilters can serve any of a number of functions.

Removal of vulgar language

A swear filter, also known as a profanity filter or language filter, is a software subsystem that modifies text to remove words deemed offensive by the administrator or community of an online forum. Swear filters are common in custom-programmed chat rooms and online video games, primarily MMORPGs. They are not to be confused with content filtering, which is usually built into Internet browsing programs by third-party developers to filter or block specific websites or types of websites. Swear filters are usually created or implemented by the developers of the Internet service.

Most commonly, wordfilters are used to censor language considered inappropriate by the operators of the forum or chat room. Expletives are typically partially replaced, completely replaced, or replaced by nonsense words. [1] This relieves the administrators or moderators of the task of constantly patrolling the board to watch for such language. This may also help the message board avoid content-control software installed on users' computers or networks, since such software often blocks access to Web pages that contain vulgar language.
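
The three replacement styles mentioned above (partial replacement, complete replacement, and substitution of a nonsense word) might look like this in outline. The helper names, the mild placeholder word "darn", and the nonsense word "smurf" are all assumptions chosen for illustration.

```python
import re
from typing import Callable, Iterable


def partial_mask(word: str) -> str:
    """Keep the first letter and star out the rest."""
    return word[:1] + "*" * (len(word) - 1)


def full_mask(word: str) -> str:
    """Star out the whole word."""
    return "*" * len(word)


def nonsense(word: str) -> str:
    """Swap the word for a fixed nonsense word (hypothetical choice)."""
    return "smurf"


def censor(text: str, banned: Iterable[str], style: Callable[[str], str]) -> str:
    """Apply `style` to every whole-word, case-insensitive match of a banned word."""
    words = [re.escape(w) for w in banned if w]
    if not words:
        return text  # nothing to filter
    pattern = re.compile(r"\b(?:" + "|".join(words) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: style(m.group(0)), text)
```

For instance, `censor("Well, darn it", ["darn"], partial_mask)` gives "Well, d*** it", and passing `nonsense` instead gives "Well, smurf it".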

Filtered phrases may be permanently replaced when the post is saved (example: phpBB 1.x), or the original phrase may be saved but displayed as the censored text. In some software, users can view the text behind the wordfilter by quoting the post.

Swear filters typically take advantage of string replacement functions built into the programming language used to create the program, swapping each word or phrase on a list of inappropriate terms for one of a variety of alternatives.

Some swear filters do a simple search for a string. Others have measures that ignore whitespace, and still others go as far as ignoring all non-alphanumeric characters and then filtering the plain text. This means that if the word "you" were set to be filtered, "y o u" or "y.o!u" would also be filtered.
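
A minimal sketch of the last approach, ignoring every non-alphanumeric character before matching, is given below. It assumes the filtered word itself contains only letters and digits; the function names and the "*" mask character are hypothetical.

```python
def _fold(ch: str) -> str:
    """Lowercase one character, keeping it unchanged if lowercasing changes its length."""
    low = ch.lower()
    return low if len(low) == 1 else ch


def filter_ignoring_separators(text: str, word: str, mask_char: str = "*") -> str:
    """Mask `word` even when spaces or punctuation are inserted between its letters."""
    target = "".join(_fold(ch) for ch in word)
    if not target:
        return text
    # Keep only alphanumeric characters, remembering where each one came from.
    kept = [(_fold(ch), i) for i, ch in enumerate(text) if ch.isalnum()]
    squeezed = "".join(ch for ch, _ in kept)
    chars = list(text)
    start = squeezed.find(target)
    while start != -1:
        first = kept[start][1]                    # index of first matched letter
        last = kept[start + len(target) - 1][1]   # index of last matched letter
        for i in range(first, last + 1):
            chars[i] = mask_char                  # mask letters and separators alike
        start = squeezed.find(target, start + len(target))
    return "".join(chars)
```

With "you" as the filtered word, "y.o!u there" becomes "***** there". The same logic also masks part of "as suspected" when "ass" is filtered, which is the kind of side effect that aggressive normalisation invites.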

Cliché control

Clichés—particular words or phrases constantly reused in posts, also known as "memes"—often develop on forums. Some users find that these clichés add to the fun, but other users find them tedious, especially when overused. Administrators may configure the wordfilter to replace the annoying cliché with a more embarrassing phrase, or remove it altogether.

Vandalism control

Internet forums are sometimes attacked by vandals who try to fill the forum with repeated nonsense messages, or by spammers who try to insert links to their commercial web sites. The site's wordfilter may be configured to remove the nonsense text used by the vandals, or to remove all links to particular websites from posts.
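
Removing links to particular websites could be approached roughly as follows. The URL pattern is deliberately simple, and the function name, the placeholder text, and the `.example` domains are assumptions for the sake of the sketch.

```python
import re
from typing import Iterable
from urllib.parse import urlparse

# Deliberately simple: anything starting with http:// or https:// up to whitespace.
URL_RE = re.compile(r"https?://\S+")


def strip_blocked_links(text: str, blocked_domains: Iterable[str],
                        placeholder: str = "[link removed]") -> str:
    """Replace links whose host is a blocked domain or one of its subdomains."""
    domains = {d.lower() for d in blocked_domains}

    def replace(match: "re.Match[str]") -> str:
        url = match.group(0)
        try:
            host = urlparse(url).hostname or ""
        except ValueError:
            host = ""  # malformed URL: leave it untouched
        blocked = any(host == d or host.endswith("." + d) for d in domains)
        return placeholder if blocked else url

    return URL_RE.sub(replace, text)
```

Matching on the parsed hostname rather than on the raw text means that "notspam.example" is not caught by a rule for "spam.example", while "www.spam.example" is.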

Lameness filter

Lameness filters are text-based wordfilters used by Slash-based websites (such as textboards and imageboards) to stop junk comments from being posted in response to stories.

Circumventing filters

Since wordfilters are automated and look only for particular sequences of characters, users aware of the filters will sometimes try to circumvent them by changing their lettering just enough to avoid the filters. A user trying to avoid a vulgarity filter might replace one of the characters in the offending word with an asterisk, dash, or something similar. Some administrators respond by revising the wordfilters to catch common substitutions; others may make filter evasion a punishable offense of its own. [2] Simple examples of evading a wordfilter include entering symbols between letters, deliberately misspelling words, or using leet. More advanced techniques of wordfilter evasion include the use of images, hidden tags, or Cyrillic characters (a homograph spoofing attack, which replaces characters with similar-looking Unicode characters from a different character set).
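
One way an administrator might revise a filter to catch common substitutions is to map look-alike characters back to plain letters before matching. The table below is a small, hand-picked illustration and is far from exhaustive; the names are hypothetical.

```python
# Illustrative mapping only: a few leet digits/symbols and Cyrillic look-alikes.
SUBSTITUTIONS = str.maketrans({
    "@": "a", "4": "a", "3": "e", "1": "i", "0": "o", "$": "s", "5": "s",
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
    "\u0445": "x",  # Cyrillic small ha
})


def undo_substitutions(text: str) -> str:
    """Return a lowercased copy with known look-alike characters mapped to Latin letters."""
    return text.lower().translate(SUBSTITUTIONS)
```

The result is meant only as a copy to run the filter against, not as text to display: the mapping also rewrites legitimate digits and genuine Cyrillic words, so it increases the risk of false positives.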

Another method is to use a soft hyphen. A soft hyphen only indicates where a word can be split when breaking text lines and is not otherwise displayed. When one is placed partway through a word, the word is broken up and will in some cases not be recognised by the wordfilter.
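
A filter can defend against this by dropping invisible characters from the copy of the text it inspects. The sketch below removes every character in Unicode's "Cf" (format) category, which includes the soft hyphen (U+00AD) and the zero-width space (U+200B); the function name is an assumption.

```python
import unicodedata


def strip_format_chars(text: str) -> str:
    """Drop invisible format characters (Unicode category Cf), such as the soft hyphen."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
```

Because some format characters are meaningful in legitimate text (for example, joiners used in certain scripts and emoji sequences), it is safer to apply this to the text being checked than to the text that is stored or displayed.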

Some more advanced filters, such as those in the online game RuneScape, can detect bypassing. However, the downside of sensitive wordfilters is that legitimate phrases get filtered out as well.

Censorship aspects

Wordfilters are coded into the Internet forums or chat rooms, and operate only on material submitted to the forum or chat room in question. This distinguishes wordfilters from content-control software, which is typically installed on an end user's PC or computer network, and which can filter all Internet content sent to or from the PC or network in question. Since wordfilters alter users' words without their consent, some users still consider them to be censorship, while others consider them an acceptable part of a forum operator's right to control the contents of the forum.

False positives

A common quirk with wordfilters, often considered either comical or aggravating by users, is that they often affect words that are not intended to be filtered. This is a typical problem when short words are filtered. For example, with the word "ass" censored, one may see, "Do you need istance for playing clical music?" instead of "Do you need assistance for playing classical music?" Multiple words may be filtered if whitespace is ignored, resulting in "as suspected" becoming " uspected". Prohibiting a phrase such as "hard on" will result in filtering innocuous statements such as "That was a hard one!" and "Sorry I was hard on you," into "That was a e!" and "Sorry I was you."
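
Administrators who want to spot such problems before users do could run a candidate filter over a list of phrases known to be innocent and report any that come back altered. The following is a rough sketch of that idea; the helper names and the phrase list are assumptions.

```python
from typing import Callable, Iterable, List


def find_false_positives(filter_fn: Callable[[str], str],
                         innocent_phrases: Iterable[str]) -> List[str]:
    """Return the innocent phrases that the filter changes."""
    return [p for p in innocent_phrases if filter_fn(p) != p]


def strip_ass(text: str) -> str:
    """A deliberately naive filter that deletes the string "ass" wherever it occurs."""
    return text.replace("ass", "")


INNOCENT = [
    "Do you need assistance for playing classical music?",
    "Nice weather today.",
]
```

Here `find_false_positives(strip_ass, INNOCENT)` returns only the first phrase, which the naive filter turns into "Do you need istance for playing clical music?".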

Some words that have been filtered accidentally can become replacements for profane words. One example of this is found on the Myst forum Mystcommunity. There, the word 'manuscript' was accidentally censored for containing the word 'anus', which resulted in 'm****cript'. The word was adopted as a replacement swear and carried over when the forum moved, and many substitutes, such as "'scripting", are used (though mostly by the older community members).

Place names may be filtered out unintentionally because they contain portions of swear words. In the early years of the Internet, the British place name Penistone was often filtered out by spam and swear filters. [3]

Implementation

Many games, such as World of Warcraft and, more recently, Habbo Hotel and RuneScape, allow users to turn the filters off. Other games, especially free massively multiplayer online games such as Knight Online, do not have such an option.

Other games such as Medal of Honor and Call of Duty (except Call of Duty: World at War, Call of Duty: Black Ops, Call of Duty: Black Ops 2, and Call of Duty: Black Ops 3) do not give users the option to turn off scripted foul language, while Gears of War does.

In addition to games, profanity filters can be used to moderate user-generated content in forums, blogs, social media apps, children's websites, and product reviews. Many profanity filter APIs, such as WebPurify, help replace swear words with other characters (e.g. "@#$!"). These APIs work by searching for profanity and replacing it.

References

  1. "When the **** did we get a wordfilter?". Retrieved 2006-10-01.
  2. "GameFAQs Terms of Use". GameFAQs. Retrieved 2008-08-04.
  3. Sheerin, Jude (29 March 2010). "How spam filters dictated Canadian magazine's fate". BBC Online. Retrieved 5 April 2011.
