Toloka

Type of site: Crowdsourcing, microwork
Available in: English, Russian, Spanish, French, Arabic, and others [1]
Founded: 2014
Country of origin: Russia, Switzerland [2] [3]
Owner: Yandex Inc.
Founder(s): Olga Megorskaya
URL: toloka.ai

Toloka is a crowdsourcing and microtasking platform launched by Yandex in 2014 [2] to quickly label large amounts of data, which are then used for machine learning and improving search algorithms. [4] The tasks offered are usually simple and do not require any special training from the performer. [2] Most of the tasks are designed to improve algorithms used by modern technologies such as self-driving vehicles, smart web search, voice assistants, and e-commerce.[citation needed] Upon completing a task, the performer receives a reward that depends on the volume of images, videos, and unstructured text processed. [3] The service offers mobile apps for Android and iOS.

About Toloka

Origin of the platform's name

Historically, a toloka was a form of mutual assistance among villagers in Russia, Ukraine, Belarus, Estonia, Latvia, and Lithuania. It was organized in villages to carry out urgent work requiring a large number of workers, such as harvesting, logging, or building houses. Sometimes a toloka was organized for community works, such as building churches, schools, and roads. [3]

Types of tasks and scope of results

Data labeling helps improve search quality and effectively tune the result-ranking algorithms of a search engine. [3]

Machine learning

Training a machine learning algorithm requires labeling large volumes of data with positive and negative examples. Toloka performers receive tasks to determine the presence or absence of specified objects in a content item. [3] [5] In another type of task, the context of a dialogue is given along with a scale on which the performer assesses whether a chatbot's answer in that context is appropriate, interesting, and so on. [6] Another group of tasks in Toloka is translation verification, performed by collecting examples of translations from different performers. [7]
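
When several performers label the same item, their overlapping judgments are typically aggregated into a single training label. The sketch below is a minimal illustration of majority-vote aggregation of binary presence/absence judgments; the task and worker identifiers are hypothetical, and this is not Toloka's actual API.

```python
from collections import Counter, defaultdict

# Hypothetical crowd judgments: (task_id, worker_id, label),
# where label is 1 if the object is present and 0 otherwise.
judgments = [
    ("img_001", "w1", 1), ("img_001", "w2", 1), ("img_001", "w3", 0),
    ("img_002", "w1", 0), ("img_002", "w4", 0), ("img_002", "w5", 0),
]

def majority_vote(judgments):
    """Aggregate overlapping judgments into one label per task.

    Ties resolve to the first value encountered among the most common ones.
    """
    by_task = defaultdict(list)
    for task_id, _, label in judgments:
        by_task[task_id].append(label)
    return {
        task_id: Counter(labels).most_common(1)[0][0]
        for task_id, labels in by_task.items()
    }

print(majority_vote(judgments))  # {'img_001': 1, 'img_002': 0}
```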

Audit and marketing research

These tasks include checking the quality of an online store or delivery service and writing reviews about products and services. Such audits allow requesters to control service quality and identify weaknesses, which can then be addressed to improve the service and eliminate the identified problems. [8] [9]

Users

Toloka users, also known as performers or tolokers, are people who earn money by completing system testing and improvement tasks on the Toloka crowdsourcing platform.[citation needed] In 2018, more than a million people participated in Toloka projects. Most performers are young people under 35, often engineering students or mothers on maternity leave. Performers mainly see Toloka as an additional source of income, but many of them note that they like doing meaningful work and helping clean up the internet. As of March 2022, Toloka had 245,000 monthly active performers in 123 countries, generating over 15 million labels per day. [1] [10]

Requesters

All tasks in Toloka are placed by requesters. The main uses of Toloka are data collection and processing for machine learning, speech technology, computer vision, smart search algorithms, and other projects, as well as content moderation, field tasks, and optimization of internal business processes. [3]

Toloka Research

In May 2019, the service's team began publishing datasets for non-commercial and academic purposes to support the scientific community and attract researchers to Toloka. These datasets address research areas such as linguistics, computer vision, testing of result-aggregation models, and chatbot training. [11] Toloka research has been showcased at a range of conferences, including the Conference on Neural Information Processing Systems (NeurIPS), [12] the International Conference on Machine Learning (ICML), [13] and the International Conference on Very Large Data Bases (VLDB). [14]

Related Research Articles

Artificial intelligence (AI) is the intelligence of machines or software, as opposed to the intelligence of humans or other animals. It is a field of study in computer science that develops and studies intelligent machines. Such machines may be called AIs.

Chatbot: Program that simulates conversation

A chatbot is a software application or web interface that is designed to mimic human conversation through text or voice interactions. Modern chatbots are typically online and use generative artificial intelligence systems that are capable of maintaining a conversation with a user in natural language and simulating the way a human would behave as a conversational partner. Such chatbots often use deep learning and natural language processing, but simpler chatbots have existed for decades.

Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions. Recently, generative artificial neural networks have been able to surpass many previous approaches in performance.

Natural language generation (NLG) is a software process that produces natural language output. A widely cited survey of NLG methods describes NLG as "the subfield of artificial intelligence and computational linguistics that is concerned with the construction of computer systems that can produce understandable texts in English or other human languages from some underlying non-linguistic representation of information".

Human-based computation (HBC), human-assisted computation, ubiquitous human computing or distributed thinking is a computer science technique in which a machine performs its function by outsourcing certain steps to humans, usually as microwork. This approach uses differences in abilities and alternative costs between humans and computer agents to achieve symbiotic human–computer interaction. For computationally difficult tasks such as image recognition, human-based computation plays a central role in training Deep Learning-based Artificial Intelligence systems. In this case, human-based computation has been referred to as human-aided artificial intelligence.
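
As a rough illustration of the idea (the function names and confidence threshold below are hypothetical and not tied to any particular platform), a pipeline can answer confidently on its own and route only uncertain cases to human annotators:

```python
CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff; tune per application

def classify(item, model, ask_human):
    """Use the model when it is confident; otherwise fall back to a person."""
    label, confidence = model(item)
    if confidence >= CONFIDENCE_THRESHOLD:
        return label, "machine"
    return ask_human(item), "human"

# Example with stand-in callables:
model = lambda item: ("cat", 0.55)   # an uncertain model prediction
ask_human = lambda item: "dog"       # a crowd worker's judgment
print(classify("photo.jpg", model, ask_human))  # ('dog', 'human')
```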

Artificial intelligence (AI) has been used in applications throughout industry and academia. AI, like electricity or computers, is a general-purpose technology that has a multitude of applications. It has been used in fields such as language translation, image recognition, credit scoring, and e-commerce, among other domains.

Figure Eight Inc.: American software company

Figure Eight was a human-in-the-loop machine learning and artificial intelligence company based in San Francisco.

Learning to rank or machine-learned ranking (MLR) is the application of machine learning, typically supervised, semi-supervised or reinforcement learning, in the construction of ranking models for information retrieval systems. Training data may, for example, consist of lists of items with some partial order specified between items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment for each item. The goal of constructing the ranking model is to rank new, unseen lists in a similar way to rankings in the training data.
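
A minimal sketch of how such training data can be prepared, assuming graded relevance judgments for one query (the documents and scores below are illustrative): the ordinal scores are converted into pairwise preferences on which a pairwise learning-to-rank model could then be trained.

```python
from itertools import combinations

# One query's candidate documents with graded relevance labels (higher = better).
candidates = [("doc_a", 3), ("doc_b", 1), ("doc_c", 2)]

def pairwise_preferences(candidates):
    """Turn graded judgments into (preferred, other) training pairs."""
    pairs = []
    for (d1, s1), (d2, s2) in combinations(candidates, 2):
        if s1 != s2:
            pairs.append((d1, d2) if s1 > s2 else (d2, d1))
    return pairs

print(pairwise_preferences(candidates))
# [('doc_a', 'doc_b'), ('doc_a', 'doc_c'), ('doc_c', 'doc_b')]
```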

Explainable AI (XAI), often overlapping with Interpretable AI, or Explainable Machine Learning (XML), either refers to an AI system over which it is possible for humans to retain intellectual oversight, or to the methods to achieve this. The main focus is usually on the reasoning behind the decisions or predictions made by the AI which are made more understandable and transparent. XAI counters the "black box" tendency of machine learning, where even the AI's designers cannot explain why it arrived at a specific decision.

Automated machine learning (AutoML) is the process of automating the tasks of applying machine learning to real-world problems.

Crowdsource is a crowdsourcing platform developed by Google intended to improve a host of Google services through the user-facing training of different algorithms.

Deep reinforcement learning is a subfield of machine learning that combines reinforcement learning (RL) and deep learning. RL considers the problem of a computational agent learning to make decisions by trial and error. Deep RL incorporates deep learning into the solution, allowing agents to make decisions from unstructured input data without manual engineering of the state space. Deep RL algorithms are able to take in very large inputs and decide what actions to perform to optimize an objective. Deep reinforcement learning has been used for a diverse set of applications including but not limited to robotics, video games, natural language processing, computer vision, education, transportation, finance and healthcare.
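
For the reinforcement-learning component, a minimal sketch is the tabular Q-learning update rule on a toy environment (illustrative only; deep RL replaces the table with a neural network over raw observations):

```python
import random

# Tabular Q-learning on a tiny two-state toy problem.
states, actions = [0, 1], [0, 1]
Q = {(s, a): 0.0 for s in states for a in actions}
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount, exploration

def step(state, action):
    """Hypothetical environment: action 1 in state 1 pays off."""
    reward = 1.0 if (state == 1 and action == 1) else 0.0
    return reward, random.choice(states)  # next state is random here

state = 0
for _ in range(5000):
    # Epsilon-greedy action selection: explore sometimes, otherwise exploit.
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: Q[(state, a)])
    reward, next_state = step(state, action)
    best_next = max(Q[(next_state, a)] for a in actions)
    # Q-learning update: move Q(s, a) toward the bootstrapped target.
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    state = next_state

print(Q)  # action 1 should have the higher value in state 1
```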

Federated learning: Decentralized machine learning

Federated learning is a machine learning technique that trains an algorithm via multiple independent sessions, each using its own dataset. This approach stands in contrast to traditional centralized machine learning techniques where local datasets are merged into one training session, as well as to approaches that assume that local data samples are identically distributed.
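
A toy sketch of the federated averaging idea, with a deliberately simple "model" (a mean estimate) standing in for neural-network weights: each client trains locally on its own data, and only the resulting parameters are shared and combined, weighted by dataset size.

```python
# Each client keeps its raw data local; only model parameters are exchanged.
clients_data = {
    "client_1": [1.0, 2.0, 3.0],
    "client_2": [10.0, 12.0],
    "client_3": [5.0],
}

def local_update(data):
    """Train locally: here, simply estimate the mean of the client's data."""
    return sum(data) / len(data)

def federated_average(local_params, client_sizes):
    """Average local parameters, weighted by each client's dataset size."""
    total = sum(client_sizes.values())
    return sum(local_params[c] * client_sizes[c] for c in local_params) / total

params = {c: local_update(d) for c, d in clients_data.items()}
sizes = {c: len(d) for c, d in clients_data.items()}
print(federated_average(params, sizes))  # 5.5, the weighted global estimate
```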

Government by algorithm is an alternative form of government or social ordering where the usage of computer algorithms is applied to regulations, law enforcement, and generally any aspect of everyday life such as transportation or land registration. The term "government by algorithm" has appeared in academic literature as an alternative for "algorithmic governance" in 2013. A related term, algorithmic regulation, is defined as setting the standard, monitoring and modifying behaviour by means of computational algorithms – automation of judiciary is in its scope. In the context of blockchain, it is also known as blockchain governance.

Abeba Birhane is an Ethiopian-born cognitive scientist who works at the intersection of complex adaptive systems, machine learning, algorithmic bias, and critical race studies. Birhane's work with Vinay Prabhu uncovered that large-scale image datasets commonly used to develop AI systems, including ImageNet and 80 Million Tiny Images, carried racist and misogynistic labels and offensive images. She has been recognized by VentureBeat as a top innovator in computer vision and named as one of the 100 most influential persons in AI 2023 by TIME magazine.

The Fashion MNIST dataset is a large freely available database of fashion images that is commonly used for training and testing various machine learning systems. Fashion-MNIST was intended to serve as a replacement for the original MNIST database for benchmarking machine learning algorithms, as it shares the same image size, data format and the structure of training and testing splits.
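
Assuming TensorFlow is installed, the dataset can be loaded through Keras' built-in helper; the splits mirror the original MNIST layout (60,000 training and 10,000 test images of 28×28 pixels):

```python
# Requires TensorFlow; the dataset is downloaded on first use.
from tensorflow.keras.datasets import fashion_mnist

(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
print(x_train.shape, x_test.shape)  # (60000, 28, 28) (10000, 28, 28)
print(y_train[:5])  # integer class labels in the range 0-9
```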

Hugging Face, Inc. is a French-American company based in New York City that develops tools for building applications using machine learning. It is most notable for its transformers library built for natural language processing applications and its platform that allows users to share machine learning models and datasets and showcase their work.
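
For example, assuming the transformers package is installed (a default English sentiment model is downloaded on first use), the library's pipeline helper wraps a pretrained model behind a single call:

```python
from transformers import pipeline  # requires the transformers package

# Downloads a default sentiment-analysis model on first use.
classifier = pipeline("sentiment-analysis")
print(classifier("Crowdsourced labels made this model noticeably better."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```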

Generative pre-trained transformer: Type of large language model

Generative pre-trained transformers (GPT) are a type of large language model (LLM) and a prominent framework for generative artificial intelligence. They are artificial neural networks that are used in natural language processing tasks. GPTs are based on the transformer architecture, pre-trained on large data sets of unlabelled text, and able to generate novel human-like content. As of 2023, most LLMs have these characteristics and are sometimes referred to broadly as GPTs.

Edward Y. Chang: American computer scientist

Edward Y. Chang is a computer scientist, academic, and author. He is an adjunct professor of Computer Science at Stanford University, and Visiting Chair Professor of Bioinformatics and Medical Engineering at Asia University, since 2019.

References

  1. "It helps me learn and earn: Toloka reports results of a global survey of Tolokers in 2022". toloka.ai. 2022-03-23. Retrieved 2022-09-16.
  2. "Toloka rolls out 20000 new jobs opportunities for Ghanaians". Ghana Education News. 2021-06-15. Retrieved 2022-09-17.
  3. Alex Woodie (2021-04-27). "Toloka Expands Data Labeling Service". Datanami. Retrieved 2022-09-17.
  4. Daria Baidakova (2021-09-29). "Data-Labeling Instructions: Gateway to Success in Crowdsourcing and Enduring Impact on AI". Data Science Central. Retrieved 2022-09-17.
  5. Frederik Bussler (2021-12-07). "Data labeling will fuel the AI revolution". VentureBeat. Retrieved 2022-09-17.
  6. Kumar Gandharv (2021-04-29). "Why Are Data Labelling Firms Eyeing Indian Market?". Analytics India Magazine. Retrieved 2022-09-17.
  7. Magdalena Konkiewicz (2021-12-16). "Human in the loop in Machine Translation systems". Towards Data Science. Retrieved 2022-09-16.
  8. Magdalena Konkiewicz (2022-03-29). "Evaluating search relevance on-demand with crowdsourcing". Towards Data Science. Retrieved 2022-09-17.
  9. "Guest post: Data Labeling and Its Role in E-commerce Today – Recent Use Cases". TheSequence. 2022-01-16. Retrieved 2022-09-16.
  10. "Olga Megorskaya/Toloka: Practical Lessons About Data Labeling". TheSequence. 2021-10-27. Retrieved 2022-09-16.
  11. "Toloka to present new dataset at prestigious Data-Centric AI workshop launched by Andrew Ng". The AI Journal. Retrieved 2022-09-17.
  12. "Toloka to present new dataset at prestigious Data-Centric AI workshop launched by Andrew Ng". FE News. 2021-11-18. Retrieved 2022-02-10.
  13. "Toloka". icml.cc. Retrieved 2022-02-10.
  14. "VLDB 2021 Challenge". crowdscience.ai. Retrieved 2022-02-10.