Wordfilter

A wordfilter (sometimes called simply a "filter" or "censor") is a script, typically used on Internet forums or chat rooms, that scans users' posts or comments as they are submitted and automatically changes or censors particular words or phrases.

The most basic wordfilters search only for specific strings of letters, and remove or overwrite them regardless of their context. More advanced wordfilters make some exceptions for context (such as filtering "butt" but not "butter"), and the most advanced wordfilters may use regular expressions.
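The difference between a basic filter and a context-aware one can be sketched in a few lines of Python. This is only an illustration: the block list, the function names, and the choice of asterisks as the replacement are assumptions, not taken from any particular forum software.

```python
import re

# Hypothetical block list, for illustration only.
BLOCKED = ["butt"]

def naive_filter(text: str) -> str:
    """Replace every occurrence of a blocked string, regardless of context."""
    for word in BLOCKED:
        text = text.replace(word, "*" * len(word))
    return text

def boundary_filter(text: str) -> str:
    """Replace a blocked word only when it stands alone as a whole word."""
    for word in BLOCKED:
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        text = pattern.sub("*" * len(word), text)
    return text

print(naive_filter("butter on my butt"))     # ****er on my ****
print(boundary_filter("butter on my butt"))  # butter on my ****
```

The `\b` word-boundary anchors are what let the second version filter "butt" while leaving "butter" alone.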

Functions

Wordfilters can serve any of a number of functions.

Removal of vulgar language

A swear filter, also known as a profanity filter or language filter, is a software subsystem that modifies text to remove words deemed offensive by the administrator or community of an online forum. Swear filters are common in custom-programmed chat rooms and online video games, primarily MMORPGs. This is not to be confused with content filtering, which is usually built into internet browsing programs by third-party developers to filter or block specific websites or types of websites. Swear filters are usually created or implemented by the developers of the Internet service.

Most commonly, wordfilters are used to censor language considered inappropriate by the operators of the forum or chat room. Expletives are typically partially replaced, completely replaced, or replaced by nonsense words. [1] This relieves the administrators or moderators of the task of constantly patrolling the board to watch for such language. This may also help the message board avoid content-control software installed on users' computers or networks, since such software often blocks access to Web pages that contain vulgar language.

Filtered phrases may be permanently replaced as the post is saved (example: phpBB 1.x), or the original phrase may be saved but displayed as the censored text. In some software, users can view the text behind the wordfilter by quoting the post.
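The two storage strategies can be contrasted with a small sketch. Everything here is hypothetical: the `Post` class, the function names, and the mild placeholder word "darn" are assumptions made for illustration and do not describe how phpBB or any other package is implemented.

```python
import re
from dataclasses import dataclass

# Hypothetical one-word filter; "darn" stands in for a real block list.
CENSOR = re.compile(r"\bdarn\b", re.IGNORECASE)

def censor(text: str) -> str:
    return CENSOR.sub("****", text)

@dataclass
class Post:
    stored_text: str

def save_destructive(raw: str) -> Post:
    """Filter at save time: the original wording is lost."""
    return Post(stored_text=censor(raw))

def save_preserving(raw: str) -> Post:
    """Store the original; filtering happens only when the post is shown."""
    return Post(stored_text=raw)

def render(post: Post) -> str:
    """Display path: always apply the filter."""
    return censor(post.stored_text)

def quote(post: Post) -> str:
    """Quote path that (in this sketch) skips the filter."""
    return "> " + post.stored_text
```

With `save_preserving`, a quoting path that forgets to call `censor` exposes the original text, which is one way the quoting behaviour described above can arise.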

Swear filters typically use the string replacement functions built into the programming language in which the program is written to swap a list of inappropriate words and phrases for alternatives, such as strings of symbols (for example "@#$!") or nonsense words.

Some swear filters do a simple search for a string. Others have measures that ignore whitespace, and still others go as far as ignoring all non-alphanumeric characters and then filtering the plain text. This means that if the word "you" was set to be filtered, "y o u" or "y.o!u" would also be filtered.
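One way to ignore whitespace and punctuation is to build a pattern that tolerates any run of non-alphanumeric characters between the letters of the filtered word. The following is a minimal sketch under that assumption; the helper names and the `***` replacement are illustrative.

```python
import re

# Any run of characters that are not ASCII letters or digits.
SEPARATORS = r"[^A-Za-z0-9]*"

def loose_pattern(word: str) -> re.Pattern:
    """Build a pattern matching `word` with optional separators between letters."""
    return re.compile(SEPARATORS.join(re.escape(ch) for ch in word), re.IGNORECASE)

def filter_loose(text: str, word: str) -> str:
    return loose_pattern(word).sub("***", text)

for sample in ("you", "y o u", "y.o!u"):
    print(filter_loose(sample, "you"))  # each line prints ***
```

Because this sketch has no word-boundary check, it would also alter a word such as "young", so looser matching comes at the cost of more accidental hits.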

Cliché control

Clichés—particular words or phrases constantly reused in posts, also known as "memes"—often develop on forums. Some users find that these clichés add to the fun, but other users find them tedious, especially when overused. Administrators may configure the wordfilter to replace the annoying cliché with a more embarrassing phrase, or remove it altogether.

Vandalism control

Internet forums are sometimes attacked by vandals who try to fill the forum with repeated nonsense messages, or by spammers who try to insert links to their commercial web sites. The site's wordfilter may be configured to remove the nonsense text used by the vandals, or to remove all links to particular websites from posts.
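Removing links to particular websites can be approximated with a regular expression built from a list of domains. In the sketch below, the domain `spam.example`, the placeholder text `[link removed]`, and the function name are all assumptions chosen for illustration.

```python
import re

# Hypothetical list of domains whose links should be stripped from posts.
BLOCKED_DOMAINS = ["spam.example"]

def strip_blocked_links(text: str) -> str:
    """Replace links to blocked domains (including subdomains) with a placeholder."""
    for domain in BLOCKED_DOMAINS:
        pattern = re.compile(
            r"(?<![\w.-])"          # do not start in the middle of another host name
            r"(?:https?://)?"       # optional scheme
            r"(?:[\w-]+\.)*"        # optional subdomains such as "www."
            + re.escape(domain)
            + r"(?![\w-])"          # do not match "spam.examples"
            r"(?:/\S*)?",           # optional path up to the next whitespace
            re.IGNORECASE,
        )
        text = pattern.sub("[link removed]", text)
    return text

print(strip_blocked_links("Buy now at http://www.spam.example/deal today"))
# Buy now at [link removed] today
```

The lookbehind keeps an unrelated host such as `notspam.example` from being caught, which is the same kind of accidental match that affects word lists.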

Lameness filter

Lameness filters are text-based wordfilters used by Slash-based websites (such as textboards and imageboards) to stop junk comments from being posted in response to stories. Some of the things they are designed to filter include:

Circumventing filters

Since wordfilters are automated and look only for particular sequences of characters, users aware of the filters will sometimes try to circumvent them by changing their lettering just enough to avoid the filters. A user trying to avoid a vulgarity filter might replace one of the characters in the offending word with an asterisk, dash, or something similar. Some administrators respond by revising the wordfilters to catch common substitutions; others may make filter evasion a punishable offense of its own. [2] A simple example of evading a wordfilter would be entering symbols between letters or using leet. More advanced techniques of wordfilter evasion include the use of images, hidden tags, or Cyrillic characters that resemble Latin letters (a homograph spoofing attack).
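A filter revised to catch common substitutions might normalize the text before matching it. The sketch below assumes a deliberately tiny, hypothetical substitution table covering a few leet digits and symbols and a few Cyrillic look-alikes; a real table would be far larger, and "darn" is only a placeholder for a filtered word.

```python
# Hypothetical, deliberately small table of look-alike characters.
SUBSTITUTIONS = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t",
    "@": "a", "$": "s",
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
})

def normalize(text: str) -> str:
    """Lower-case the text and map look-alike characters to plain Latin letters."""
    return text.lower().translate(SUBSTITUTIONS)

def contains_blocked(text: str, blocked: list[str]) -> bool:
    """Report whether any blocked word appears once the text is normalized."""
    normalized = normalize(text)
    return any(word in normalized for word in blocked)

print(contains_blocked("d4rn it", ["darn"]))       # True
print(contains_blocked("d\u0430rn it", ["darn"]))  # True
```

Mapping digits to letters also changes legitimate numbers, so this kind of normalization increases the chance that innocent text is flagged.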

Another method is to use a soft hyphen. A soft hyphen is only used to indicate where a word can be split when breaking text lines and is not displayed. By placing this halfway in a word, the word gets broken up and will in some cases not be recognised by the wordfilter.
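A filter can defend against this by deleting soft hyphens (U+00AD) before matching. The following minimal sketch assumes a hypothetical helper name and the placeholder word "darn"; it handles only the soft hyphen, not other invisible characters.

```python
SOFT_HYPHEN = "\u00ad"  # invisible unless a line break falls at that point

def matches_after_stripping(text: str, blocked: list[str]) -> bool:
    """Check for blocked words after removing soft hyphens."""
    visible = text.replace(SOFT_HYPHEN, "").lower()
    return any(word in visible for word in blocked)

evasive = "da" + SOFT_HYPHEN + "rn"
print("darn" in evasive)                           # False: a plain search is fooled
print(matches_after_stripping(evasive, ["darn"]))  # True
```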

Some more advanced filters, such as those in the online game RuneScape, can detect attempts to bypass them. However, the downside of sensitive wordfilters is that legitimate phrases get filtered out as well.

Censorship aspects

Wordfilters are coded into the Internet forums or chat rooms, and operate only on material submitted to the forum or chat room in question. This distinguishes wordfilters from content-control software, which is typically installed on an end user's PC or computer network, and which can filter all Internet content sent to or from the PC or network in question. Since wordfilters alter users' words without their consent, some users still consider them to be censorship, while others consider them an acceptable part of a forum operator's right to control the contents of the forum.

False positives

A common quirk with wordfilters, often considered either comical or aggravating by users, is that they often affect words that are not intended to be filtered. This is a typical problem when short words are filtered. For example, with the word "ass" censored, one may see, "Do you need istance for playing clical music?" instead of "Do you need assistance for playing classical music?" Multiple words may be filtered if whitespace is ignored, resulting in "as suspected" becoming " uspected". Prohibiting a phrase such as "hard on" will result in filtering innocuous statements such as "That was a hard one!" and "Sorry I was hard on you," into "That was a e!" and "Sorry I was you."
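The examples above can be reproduced with a plain substring replacement. The sketch below also shows a whole-word variant for comparison; both function names are hypothetical.

```python
import re

def remove_substring(text: str, banned: str) -> str:
    """Delete every occurrence of `banned`, even inside longer words."""
    return text.replace(banned, "")

def remove_whole_word(text: str, banned: str) -> str:
    """Delete `banned` only where it is bounded by non-word characters."""
    return re.sub(r"\b" + re.escape(banned) + r"\b", "", text)

sentence = "Do you need assistance for playing classical music?"
print(remove_substring(sentence, "ass"))
# Do you need istance for playing clical music?
print(remove_whole_word(sentence, "ass"))
# Do you need assistance for playing classical music?

print(remove_substring("That was a hard one!", "hard on"))
# That was a e!
print(remove_whole_word("That was a hard one!", "hard on"))
# That was a hard one!
```

Whole-word matching avoids the first three mishaps, but it would still strip "hard on" from "Sorry I was hard on you," because there the phrase really does appear as separate words.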

Some words that have been filtered accidentally can become replacements for profane words. One example of this is found on the Myst forum Mystcommunity. There, the word 'manuscript' was accidentally censored for containing the word 'anus', which resulted in 'm****cript'. The word was adopted as a replacement swear and carried over when the forum moved, and many substitutes, such as "'scripting", are used (though mostly by the older community members).

Place names may be filtered out unintentionally because they contain portions of swear words. In the early years of the internet, the British place name Penistone was often blocked by spam and swear filters. [3]

Implementation

Many games, such as World of Warcraft and, more recently, Habbo Hotel and RuneScape, allow users to turn the filters off. Other games, especially free massively multiplayer online games such as Knight Online, do not have such an option.

Other games such as Medal of Honor and Call of Duty (except Call of Duty: World at War, Call of Duty: Black Ops, Call of Duty: Black Ops 2, and Call of Duty: Black Ops 3) do not give users the option to turn off scripted foul language, while Gears of War does.

In addition to games, profanity filters can be used to moderate user-generated content in forums, blogs, social media apps, children's websites, and product reviews. Many profanity filter APIs, such as WebPurify, replace swear words with other characters (e.g. "@#$!"). These APIs work by searching the text for profanity and replacing each match.
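A search-and-replace filter of this kind can be sketched with a single compiled pattern and interchangeable replacement strategies. The sketch is not the interface of WebPurify or any other service; the word list (mild placeholder words) and all names are assumptions for illustration.

```python
import re

# Hypothetical block list using mild placeholder words.
BLOCKED = ["darn", "heck"]
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, BLOCKED)) + r")\b", re.IGNORECASE
)

def full_mask(match: re.Match) -> str:
    """Replace the whole word with asterisks of the same length."""
    return "*" * len(match.group(0))

def partial_mask(match: re.Match) -> str:
    """Keep the first letter and mask the rest."""
    word = match.group(0)
    return word[0] + "*" * (len(word) - 1)

def grawlix(match: re.Match) -> str:
    """Replace the word with a fixed string of symbols."""
    return "@#$!"

def censor(text: str, strategy=full_mask) -> str:
    return PATTERN.sub(strategy, text)

print(censor("Darn, what the heck"))                # ****, what the ****
print(censor("Darn, what the heck", partial_mask))  # D***, what the h***
print(censor("Darn, what the heck", grawlix))       # @#$!, what the @#$!
```

Passing a function to `re.sub` lets one matching pass serve any replacement style.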

References

  1. "When the **** did we get a wordfilter?". Retrieved 2006-10-01.
  2. "GameFAQs Terms of Use". GameFAQs. Retrieved 2008-08-04.
  3. Sheerin, Jude (29 March 2010). "How spam filters dictated Canadian magazine's fate". BBC Online. Retrieved 5 April 2011.
