RoboVQA is a dataset developed by anonymous authors for robot vision and question-answering tasks. It contains 2,000 episodes of robots and humans performing manipulation tasks, recorded as RGB images, depth data, and robot joint states. The dataset supports research in open-vocabulary language understanding and real-time robot control, with a focus on integrating vision and language for task execution. It ships with evaluation scripts and pre-trained models, enabling comparisons across human-robot interaction methods. The dataset's license is not explicitly stated; it is intended primarily for academic use.
A robot or a human performs arbitrary long-horizon requests from a user anywhere within 3 entire office buildings.
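For orientation, below is a minimal sketch of how an RLDS-style episode stream like this might be loaded and inspected with `tensorflow_datasets`. The builder path and the field names (`steps`, `observation`, `image`, `language_instruction`, `action`) are assumptions based on common RLDS conventions, not details confirmed by this card.

```python
# Hedged sketch: loading and inspecting one episode of an RLDS-style dataset.
# The directory path and all field names below are assumptions, not confirmed
# by this card; substitute the actual published location and schema.
import tensorflow_datasets as tfds

builder = tfds.builder_from_directory(
    "gs://some-bucket/robo_vqa/0.1.0"  # hypothetical path
)
ds = builder.as_dataset(split="train")

for episode in ds.take(1):
    # In RLDS, an episode's steps are a nested tf.data.Dataset.
    for step in episode["steps"]:
        rgb = step["observation"]["image"]          # single external RGB cam
        instruction = step["language_instruction"]  # natural-language annotation
        action = step["action"]                     # EEF position command (assumed)
        print(instruction.numpy().decode("utf-8"))
        break
```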
| Field | Value |
|---|---|
| Action Space | EEF Position |
| Control Frequency | 10 Hz |
| Depth Cams | 1 |
| Gripper | Default |
| Has Camera Calibration | True |
| Has Proprioception | True |
| Has Suboptimal | False |
| Language Annotations | Natural |
| RGB Cams | 1 |
| Robot Morphology | 3 embodiments: single-armed robot, single-armed human, single-armed human using grasping tools |
| Scene Type | Tabletop, kitchen (including toy kitchen), other household environments, hallways; anything within 3 entire office buildings |
| Wrist Cams | 0 |
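To make the control-related fields concrete, the sketch below replays one episode's end-effector (EEF) position actions at the card's stated 10 Hz control rate. `send_eef_position` is a hypothetical robot interface, and the assumption that each step's `action` holds a Cartesian EEF target is inferred from the table, not confirmed by the card.

```python
# Hedged sketch: pacing recorded EEF position actions at the card's 10 Hz rate.
# `send_eef_position` is a hypothetical controller hook, not a real API.
import time

CONTROL_HZ = 10          # control frequency from the data card
PERIOD = 1.0 / CONTROL_HZ

def replay_episode(steps, send_eef_position):
    """Stream per-step EEF position targets to a controller at 10 Hz."""
    for step in steps:
        target = step["action"]      # assumed: Cartesian EEF position target
        send_eef_position(target)
        time.sleep(PERIOD)           # naive pacing; a real loop would subtract compute time
```

A real deployment would use a fixed-rate scheduler rather than `time.sleep`, but the loop illustrates how the 10 Hz figure in the table maps onto per-step actions.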