Toloka

Type of site: Platform
Founded: 2014
Owner: Nebius Group
Founder(s): Olga Megorskaya [1] [2]
Industry: Artificial intelligence, information technology
URL: toloka.ai

Toloka, based in Amsterdam, is a crowdsourcing and generative AI services provider. [1]

The company supports the development of artificial intelligence, from training to evaluation, and provides services related to generative artificial intelligence and large language models. [3] [4]

History

Toloka was founded in 2014 by Olga Megorskaya, a member of the board of directors of Yandex, as a crowdsourcing and microtasking platform. [5] Its primary purpose was data labeling to improve machine learning and search algorithms.

As generative AI evolved, the platform adapted to provide expert data labeling to developers of generative AI applications. [6]

In 2024, the company's Russian operations were sold to Russian investors. [4] [7]

Services

Generative AI

In the generative AI domain, Toloka provides services such as model fine-tuning, reinforcement learning from human feedback, evaluation, and ad hoc datasets, all of which require large volumes of annotation by highly skilled experts.

Machine learning

On Toloka, trainers are tasked with identifying the presence or absence of objects in content, as specified by algorithms. [5] [8] They also assess chatbot responses within given dialogues for relevance and engagement. [9] Additionally, translation verification tasks involve evaluating the accuracy of translations from multiple annotators. For the fine-tuning of large language models (LLMs), experts are required to generate and provide context-based prompts that can be single-turn or multi-turn, serving various domains and purposes.

Natural language processing

In the natural language processing (NLP) domain, Toloka facilitates optical character recognition and classification, sentiment analysis, named-entity recognition, and search relevance evaluation. It also provides transcription and classification of audio data. [5]

Annotators

Toloka mainly works with domain experts, such as physicists, scientists, lawyers, and software engineers, to develop specialized data for models targeting niche tasks. [1] Toloka also works with freelancers, referred to as "Tolokers," who annotate and create data for diverse applications. [1] They perform tasks such as labeling personally identifiable information for AI projects, translating content, summarizing information, and transcribing audio to text. [1]

Upon completion of each task, the performer receives a reward based on the volume of images, videos, and unstructured text processed. [5]

Research

In May 2019, Toloka's research team began publishing datasets for non-commercial and academic purposes to support the scientific community and attract researchers to Toloka. These datasets are aimed at researchers in areas such as linguistics, computer vision, the testing of result aggregation models, and chatbot training. [10]

Toloka research has been showcased at a range of conferences, including the Conference on Neural Information Processing Systems (NeurIPS), [10] the International Conference on Machine Learning (ICML) [11] and the International Conference on Very Large Data Bases (VLDB). [12]

In February 2024, Toloka conducted a tutorial at the AAAI Conference on Artificial Intelligence, focusing on aligning large language models to low-resource languages. [13]

The company participated in BigCode, a joint scientific initiative led by Hugging Face and ServiceNow, where it served as the primary data partner. [14]

Controversies

Enabling arrests of protesters via facial recognition software (March 2024)

In March 2024, Toloka's Russian division was criticized for helping develop the facial recognition software used by Russia to track and arrest protesters after the death of Alexei Navalny. [15] The company's Russian operations were sold in July 2024.

References

  1. Shrivastava, Rashi (July 24, 2024). "The Internet Isn't Big Enough To Train AI. One Fix? Fake Data". Forbes.
  2. Sacolick, Isaac (April 8, 2024). "How to test large language models". InfoWorld.
  3. "AI development from training to evaluation". Bloomberg News. July 16, 2024.
  4. Sawers, Paul (July 21, 2024). "From Yandex's ashes comes Nebius, a 'startup' with plans to be a European AI compute leader". TechCrunch.
  5. Woodie, Alex (April 27, 2021). "Toloka Expands Data Labeling Service". Datanami.
  6. Baidakova, Daria (September 29, 2021). "Data-Labeling Instructions: Gateway to Success in Crowdsourcing and Enduring Impact on AI". Data Science Central.
  7. "Yandex founder to build AI business in Europe after Russia exit". Financial Times. July 16, 2024.
  8. Bussler, Frederik (December 7, 2021). "Data labeling will fuel the AI revolution". VentureBeat.
  9. Gandharv, Kumar (April 29, 2021). "Why Are Data Labelling Firms Eyeing Indian Market?". Analytics India Magazine.
  10. "Toloka to present new dataset at prestigious Data-Centric AI workshop launched by Andrew Ng". FE News. November 18, 2021.
  11. "Toloka". icml.cc.
  12. "VLDB 2021 Challenge". crowdscience.ai.
  13. "The 38th Annual AAAI Conference on Artificial Intelligence". AAAI Conference on Artificial Intelligence.
  14. "BigCode Governance Card". arXiv.
  15. "Dutch Yandex subsidiary helping Russia with facial recognition software". NL Times. March 27, 2024.