Toloka

Type of site: Crowdsourcing, Microwork
Available in: English, Russian, Spanish, French, Arabic, etc. [1]
Founded: 2014
Country of origin: Russia, Switzerland [2] [3]
Owner: Yandex Inc.
Founder(s): Olga Megorskaya
URL: toloka.ai

Toloka is a crowdsourcing platform and microtasking project launched by Yandex in 2014 [2] to quickly label large amounts of data, which are then used for machine learning and for improving search algorithms. [4] The proposed tasks are usually simple and require no special training from the performer. [2] Most tasks are designed to improve algorithms used by modern technologies spanning self-driving vehicles, smart web searches, advanced voice assistants, and e-commerce.[ citation needed ] Upon completing each task, the performer receives a reward based on the volume of data processed, such as images, videos, and unstructured text. [3] The service has two app versions, for Android and iOS.

About Toloka

Origin of the platform's name

A toloka was a form of mutual assistance among villagers of Russia, Ukraine, Belarus, Estonia, Latvia, and Lithuania. It was organized in villages to perform urgent work requiring a large number of workers, such as harvesting, logging, or building houses. A toloka was sometimes also organized for community works, such as building churches, schools, or roads. [3]

Types of tasks and scope of results

Data labeling helps improve search quality and tune the result-ranking algorithms of a search engine. [3]
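To illustrate how crowd-sourced relevance judgments feed into ranking-algorithm tuning, the sketch below computes normalized discounted cumulative gain (NDCG), a standard metric that compares a ranked result list against the ordering implied by human-assigned relevance grades. The grades and list here are invented for the example and are not Toloka data.

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: relevance at higher positions counts more,
    # discounted by log2 of the (1-based) rank plus one.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances):
    # Normalize against the ideal (descending) ordering of the same grades.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

# Hypothetical crowd-assigned relevance grades (0-3), in the engine's ranked order:
print(ndcg([3, 2, 3, 0, 1]))  # ≈ 0.97: close to, but not exactly, the ideal ranking
```

A higher NDCG over many labeled queries indicates the ranking algorithm places highly relevant results nearer the top, which is why large volumes of such judgments are valuable.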

Machine learning

Training a machine learning algorithm requires large volumes of labeled data containing positive and negative examples. Toloka performers receive tasks to determine the presence or absence of computer-defined objects in a content item. [3] [5] In another type of task, performers are given the context of a dialogue and a rating scale, and must assess whether a chatbot's answer in that context is appropriate, interesting, and so on. [6] A further group of tasks in Toloka is translation verification, performed by collecting example translations from different performers.[ citation needed ]
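Because several performers typically answer each task, their responses must be aggregated into a single label. A common baseline is majority voting; the source does not specify Toloka's aggregation method, so the sketch below is a minimal illustration with made-up item names.

```python
from collections import Counter

def aggregate_labels(votes):
    """Majority vote over the label lists collected for each item.

    Returns, per item, the winning label and the share of performers
    who agreed with it (a rough confidence signal)."""
    result = {}
    for item, labels in votes.items():
        winner, count = Counter(labels).most_common(1)[0]
        result[item] = (winner, round(count / len(labels), 2))
    return result

# Hypothetical responses from three performers per image:
votes = {
    "img_001": ["cat", "cat", "dog"],
    "img_002": ["dog", "dog", "dog"],
}
print(aggregate_labels(votes))
```

Low-agreement items can then be routed to additional performers or to more sophisticated aggregation models (e.g. Dawid-Skene) that weight performers by their estimated accuracy.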

Audit and marketing research

These tasks include checking the quality of an online store or delivery service and writing reviews of products and services. Such audits allow businesses to control service quality and identify weaknesses, which can then be addressed to improve the service and eliminate the identified problems.[ citation needed ]

Users

Toloka users, also known as performers or tolokers, are people who earn money by completing system testing and improvement tasks on the Toloka crowdsourcing platform.[ citation needed ] In 2018, more than a million people participated in Toloka projects. Most performers are young people under 35, typically engineering students or mothers on maternity leave. Performers mainly see Toloka as an additional source of income, but many note that they like doing meaningful work and helping clean up the internet. As of March 2022, Toloka had 245,000 monthly active performers in 123 countries, generating over 15 million labels per day. [1] [7]

Requesters

All tasks in Toloka are placed by requesters. The main uses of Toloka are data collection and processing for machine learning, speech technology, computer vision, smart search algorithms, and other projects, as well as content moderation, field tasks, and optimization of internal business processes. [3]

Toloka Research

In May 2019, the service's team started publishing datasets for non-commercial and academic purposes to support the scientific community and attract researchers to Toloka. These datasets are addressed to researchers in fields such as linguistics, computer vision, testing of result-aggregation models, and chatbot training. [8] Toloka research has been showcased at a range of conferences, including the Conference on Neural Information Processing Systems (NeurIPS), [9] the International Conference on Machine Learning (ICML), [10] and the International Conference on Very Large Data Bases (VLDB). [11]

References

  1. "It helps me learn and earn: Toloka reports results of a global survey of Tolokers in 2022". toloka.ai. 2022-03-23. Retrieved 2022-09-16.
  2. "Toloka rolls out 20000 new jobs opportunities for Ghanaians". Ghana Education News. 2021-06-15. Retrieved 2022-09-17.
  3. Alex Woodie (2021-04-27). "Toloka Expands Data Labeling Service". Datanami. Retrieved 2022-09-17.
  4. Daria Baidakova (2021-09-29). "Data-Labeling Instructions: Gateway to Success in Crowdsourcing and Enduring Impact on AI". Data Science Central. Retrieved 2022-09-17.
  5. Frederik Bussler (2021-12-07). "Data labeling will fuel the AI revolution". VentureBeat. Retrieved 2022-09-17.
  6. Kumar Gandharv (2021-04-29). "Why Are Data Labelling Firms Eyeing Indian Market?". Analytics India Magazine. Retrieved 2022-09-17.
  7. "Olga Megorskaya/Toloka: Practical Lessons About Data Labeling". TheSequence. 2021-10-27. Retrieved 2022-09-16.
  8. "Toloka to present new dataset at prestigious Data-Centric AI workshop launched by Andrew Ng". The AI Journal. Retrieved 2022-09-17.
  9. "Toloka to present new dataset at prestigious Data-Centric AI workshop launched by Andrew Ng". FE News. 2021-11-18. Retrieved 2022-02-10.
  10. "Toloka". icml.cc. Retrieved 2022-02-10.
  11. "VLDB 2021 Challenge". crowdscience.ai. Retrieved 2022-02-10.