A vision-language-action model (VLA) is a foundation model that maps visual observations and natural-language commands to robot actions, allowing robots to be controlled through vision and language.[1]
One method of constructing a VLA is to fine-tune a vision-language model (VLM) on robot trajectory data together with large-scale vision-language data[2] or Internet-scale vision-language tasks.[3]
A notable example of a VLA is RT-2 from Google DeepMind.[4]
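A common way to realize the fine-tuning recipe above is to discretize each robot action into a short sequence of tokens, so that a pretrained VLM can be trained on robot trajectories with an ordinary token-prediction loss. The sketch below is a minimal, illustrative toy version of that idea, not code from RT-2 or any cited system; the `ToyVLA` network, the bin count, and the tensor shapes are placeholder assumptions chosen for brevity.

```python
# Minimal sketch (illustrative only) of fine-tuning a vision-language model
# to output robot actions. ToyVLA is a hypothetical stand-in for a pretrained
# VLM; the bin count, action dimensions, and shapes are assumptions.
# Key idea: continuous actions are discretized into tokens so the model can
# be trained with a standard token-classification objective.

import torch
import torch.nn as nn

NUM_ACTION_BINS = 256   # each action dimension discretized into 256 bins
ACTION_DIM = 7          # e.g. 6-DoF end-effector delta + gripper command

def discretize_action(action: torch.Tensor, low=-1.0, high=1.0) -> torch.Tensor:
    """Map continuous actions in [low, high] to integer token ids per dimension."""
    action = action.clamp(low, high)
    bins = (action - low) / (high - low) * (NUM_ACTION_BINS - 1)
    return bins.round().long()

class ToyVLA(nn.Module):
    """Stand-in for a pretrained VLM: fuses an image and an instruction and
    predicts one discrete action token per action dimension."""
    def __init__(self, vocab_size=1000):
        super().__init__()
        self.image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))
        self.text_encoder = nn.Embedding(vocab_size, 128)
        self.action_head = nn.Linear(256, ACTION_DIM * NUM_ACTION_BINS)

    def forward(self, image, instruction_ids):
        img = self.image_encoder(image)                       # (B, 128)
        txt = self.text_encoder(instruction_ids).mean(dim=1)  # (B, 128)
        logits = self.action_head(torch.cat([img, txt], dim=-1))
        return logits.view(-1, ACTION_DIM, NUM_ACTION_BINS)   # (B, dims, bins)

# One fine-tuning step on a single (image, instruction, action) triple taken
# from a robot trajectory; real training would loop over a full dataset.
model = ToyVLA()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

image = torch.randn(1, 3, 64, 64)                 # camera observation
instruction = torch.randint(0, 1000, (1, 8))      # tokenized language command
action = torch.tensor([[0.1, -0.2, 0.05, 0.0, 0.0, 0.3, 1.0]])  # demonstrated action

targets = discretize_action(action)               # (1, ACTION_DIM) token ids
logits = model(image, instruction)                # (1, ACTION_DIM, NUM_ACTION_BINS)
loss = loss_fn(logits.transpose(1, 2), targets)   # token-classification loss
loss.backward()
optimizer.step()
```

At inference time, the same model is run on a new observation and instruction, the predicted tokens are converted back to continuous values (the inverse of the discretization step), and the resulting action is sent to the robot controller.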