A vision-language-action model (VLA) is a foundation model that maps visual observations and natural-language commands to robot actions, allowing robots to be controlled through vision and language.[1]
One method of constructing a VLA is to fine-tune a vision-language model (VLM) on robot trajectory data together with large-scale vision-language data[2] or Internet-scale vision-language tasks.[3]
A notable example of a VLA is RT-2 from Google DeepMind.[4]
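A common way to realize the fine-tuning recipe above is to discretize each robot action into a short sequence of tokens, so that a pretrained VLM can be trained on robot trajectories with an ordinary token-prediction loss. The sketch below is a minimal, illustrative toy version of that idea, not code from RT-2 or any cited system; the `ToyVLA` network, the bin count, and the tensor shapes are placeholder assumptions chosen for brevity.

```python
# Minimal sketch (illustrative only) of fine-tuning a vision-language model
# to output robot actions. ToyVLA is a hypothetical stand-in for a pretrained
# VLM; the bin count, action dimensions, and shapes are assumptions.
# Key idea: continuous actions are discretized into tokens so the model can
# be trained with a standard token-classification objective.

import torch
import torch.nn as nn

NUM_ACTION_BINS = 256   # each action dimension discretized into 256 bins
ACTION_DIM = 7          # e.g. 6-DoF end-effector delta + gripper command

def discretize_action(action: torch.Tensor, low=-1.0, high=1.0) -> torch.Tensor:
    """Map continuous actions in [low, high] to integer token ids per dimension."""
    action = action.clamp(low, high)
    bins = (action - low) / (high - low) * (NUM_ACTION_BINS - 1)
    return bins.round().long()

class ToyVLA(nn.Module):
    """Stand-in for a pretrained VLM: fuses an image and an instruction and
    predicts one discrete action token per action dimension."""
    def __init__(self, vocab_size=1000):
        super().__init__()
        self.image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))
        self.text_encoder = nn.Embedding(vocab_size, 128)
        self.action_head = nn.Linear(256, ACTION_DIM * NUM_ACTION_BINS)

    def forward(self, image, instruction_ids):
        img = self.image_encoder(image)                       # (B, 128)
        txt = self.text_encoder(instruction_ids).mean(dim=1)  # (B, 128)
        logits = self.action_head(torch.cat([img, txt], dim=-1))
        return logits.view(-1, ACTION_DIM, NUM_ACTION_BINS)   # (B, dims, bins)

# One fine-tuning step on a single (image, instruction, action) triple taken
# from a robot trajectory; real training would loop over a full dataset.
model = ToyVLA()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

image = torch.randn(1, 3, 64, 64)                 # camera observation
instruction = torch.randint(0, 1000, (1, 8))      # tokenized language command
action = torch.tensor([[0.1, -0.2, 0.05, 0.0, 0.0, 0.3, 1.0]])  # demonstrated action

targets = discretize_action(action)               # (1, ACTION_DIM) token ids
logits = model(image, instruction)                # (1, ACTION_DIM, NUM_ACTION_BINS)
loss = loss_fn(logits.transpose(1, 2), targets)   # token-classification loss
loss.backward()
optimizer.step()
```

At inference time, the same model is run on a new observation and instruction, the predicted tokens are converted back to continuous values (the inverse of the discretization step), and the resulting action is sent to the robot controller.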