Embodied AI Workshop
CVPR 2023


Overview

Minds live in bodies, and bodies move through a changing world. The goal of embodied artificial intelligence is to create agents, such as robots, which learn to creatively solve challenging tasks requiring interaction with the environment. While this is a tall order, fantastic advances in deep learning and the increasing availability of large datasets like ImageNet have enabled superhuman performance on a variety of AI tasks previously thought intractable. Computer vision, speech recognition and natural language processing have experienced transformative revolutions at passive input-output tasks like language translation and image processing, and reinforcement learning has similarly achieved world-class performance at interactive tasks like games. These advances have supercharged embodied AI, enabling a growing collection of researchers to make rapid progress towards intelligent agents which can:

  • See: perceive their environment through vision or other senses.
  • Talk: hold a natural language dialog grounded in their environment.
  • Listen: understand and react to audio input anywhere in a scene.
  • Act: navigate and interact with their environment to accomplish goals.
  • Reason: consider and plan for the long-term consequences of their actions.

The goal of the Embodied AI workshop is to bring together researchers from computer vision, language, graphics, and robotics to share and discuss the latest advances in embodied intelligent agents. This year's workshop will focus on the three themes of:

  • Foundation Models: Large pretrained models such as CLIP, ViLD, and PaLI, which enable few-shot and zero-shot performance on novel tasks.
  • Generalist Agents: Single learning methods for multiple tasks, such as RT-1, which enable models trained on one task to be expanded to novel tasks.
  • Sim to Real Transfer: Techniques which enable models trained in simulation to be deployed in the real world.
The Embodied AI 2023 workshop will be held in conjunction with CVPR 2023 in Vancouver, British Columbia. It will feature a host of invited talks covering a variety of topics in Embodied AI, many exciting Embodied AI challenges, a poster session, and panel discussions. For more information on the Embodied AI Workshop series, see our Retrospectives paper on the first three years of the workshop.



Timeline

Workshop Announced
March 15, 2023
Paper Submission Deadline
May 26, 2023 (Anywhere on Earth)
Challenge Submission Deadlines
May 2023. Check each challenge for the specific date.
Fourth Annual Embodied AI Workshop at CVPR
Vancouver Convention Center
Monday, June 19, 2023
9:00 AM - 5:30 PM PT
East Ballroom A
Challenge Winners Announced
June 19, 2023 at the workshop. Check each challenge for specifics.


Workshop Schedule

Embodied AI will be a hybrid workshop, with both in-person talks and streaming via Zoom.
  • Workshop Talks: 9:00 AM - 5:30 PM PT - East Ballroom A
  • Poster Session: 12:00 NOON - 1:20 PM PT - West Exhibit Hall, Posters #123 - #148
Zoom information is available on the CVPR virtual platform for registered attendees.
Remote and in-person attendees are welcome to ask questions via Slack:

  • Workshop Introduction: Embodied AI
    East Ballroom A
    9:00 - 9:10 AM PT
    Claudia Pérez D'Arpino
    NVIDIA
  • Navigation & Understanding Challenge Presentations
    (Habitat, MultiON, SoundSpaces, RxR-Habitat, RVSU)
    9:10 - 10:00 AM PT
    • 9:10: RxR-Habitat
    • 9:20: MultiON
    • 9:30: SoundSpaces
    • 9:40: RVSU
    • 9:50: Habitat
  • Navigation & Understanding Challenge Q&A Panel
    10:00 - 10:30 AM PT
  • Invited Talk - Embodied Navigation:
    Robot Learning by Understanding Videos
    10:30 - 11:00 AM PT
    Saurabh Gupta
    UIUC

    Saurabh Gupta is an Assistant Professor in the ECE Department at UIUC. Before starting at UIUC in 2019, he received his Ph.D. from UC Berkeley in 2018 and spent the following year as a Research Scientist at Facebook AI Research in Pittsburgh. His research interests span computer vision, robotics, and machine learning, with a focus on building agents that can intelligently interact with the physical world around them. He received the President's Gold Medal at IIT Delhi in 2011, the Google Fellowship in Computer Vision in 2015, an Amazon Research Award in 2020, and an NSF CAREER Award in 2022. He has also won many challenges at leading computer vision conferences.

    True gains of machine learning in AI sub-fields such as computer vision and natural language processing have come about from the use of large-scale diverse datasets for learning. In this talk, I will discuss if and how we can leverage large-scale div...
  • Invited Talk - Robotics:
    Embodied Reasoning Through Planning with Language and Vision Foundation Models
    11:00 - 11:30 AM PT
    Fei Xia
    Google

    Fei Xia is a Research Scientist at Google Research where he works on the Robotics team. He received his PhD degree from the Department of Electrical Engineering, Stanford University. He was co-advised by Silvio Savarese in SVL and Leonidas Guibas. His mission is to build intelligent embodied agents that can interact with complex and unstructured real-world environments, with applications to home robotics. He has been approaching this problem from three aspects: 1) large-scale and transferable simulation for robotics, 2) learning algorithms for long-horizon tasks, and 3) combining geometric and semantic representations of environments. Most recently, he has been exploring the use of foundation models for robot decision making.

    Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could in principle be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a...
  • Invited Talk - Generalist Agents:
    Building Embodied Autonomous Agents with Multimodal Interaction
    11:30 AM - 12 NOON PT
    Ruslan Salakhutdinov
    CMU

    Russ Salakhutdinov is a UPMC Professor of Computer Science in the Department of Machine Learning at CMU. He received his PhD in computer science from the University of Toronto. After spending two post-doctoral years at MIT, he joined the University of Toronto and later moved to CMU. Russ's primary interests lie in deep learning, machine learning, and large-scale optimization. He is an action editor of the Journal of Machine Learning Research, served as a director of AI research at Apple, served on the senior programme committee of several top-tier learning conferences including NeurIPS and ICML, was a program co-chair for ICML 2019, and will serve as a general chair for ICML 2024. He is an Alfred P. Sloan Research Fellow, Microsoft Research Faculty Fellow, a recipient of the Early Researcher Award, Google Faculty Award, and Nvidia's Pioneers of AI award.

    In this talk I will give an overview of our recent work on how we can design modular agents for visual navigation that can perform tasks specified by natural language instructions, perform efficient exploration and long-term planning, build and utilize 3D semantic maps, while generalizing across domains and tasks.
  • Accepted Papers Poster Session
    West Exhibit Hall - Posters #123 - #148.
    12:00 NOON - 1:20 PM PT
  • Invited Talk - Foundation Models:
    Large Language Models for Solving Long-Horizon Robotic Manipulation Problems
    East Ballroom A
    1:30 - 2:00 PM PT
    Jeannette Bohg
    Stanford
    My long-term research goal is to enable real robots to manipulate any kind of object such that they can perform many different tasks in a wide variety of application scenarios such as in our homes, in hospitals, warehouses, or factories. Many of these t...
  • Invited Talk - Sim to Real
    Toward Foundational Robot Manipulation Skills
    2:00 - 2:30 PM PT
    Dieter Fox
    NVIDIA
    U Washington

    Dieter Fox received his PhD degree from the University of Bonn, Germany. He is a professor in the Allen School of Computer Science & Engineering at the University of Washington, where he heads the UW Robotics and State Estimation Lab. He is also Senior Director of Robotics Research at NVIDIA. His research is in robotics and artificial intelligence, with a focus on learning and estimation applied to problems such as robot manipulation, planning, language grounding, and activity recognition. He has published more than 300 technical papers and is co-author of the textbook "Probabilistic Robotics". Dieter is a Fellow of the IEEE, ACM, and AAAI, and recipient of the IEEE RAS Pioneer Award and the IJCAI John McCarthy Award.

    In this talk, I will discuss our ongoing efforts toward developing the models and generating the kind of data that might lead to foundational manipulation skills for robotics. To generate large amounts of data, we sample many object rearrangement ta...
  • Interaction & Rearrangement Challenge Presentations
    AI2-Rearrangement, ALFRED+TEACh, DialFRED, ManiSkill, TDW-Transport
    2:30 - 3:30 PM PT
    • 2:30: AI2-Rearrangement
    • 2:40: ALFRED+TEACh
    • 2:50: DialFRED
    • 3:00: ManiSkill
    • 3:10: TDW-Transport
    • 3:20: Break
  • Interaction & Rearrangement Challenge Q&A Panel
    3:30 - 4:00 PM PT
  • Invited Talk - External Knowledge
    From goals to grasps: Learning about action from people in video
    4:00 - 4:30 PM PT
    Kristen Grauman
    UT Austin

    Kristen Grauman is a Professor in the Department of Computer Science at the University of Texas at Austin and a Research Director at Facebook AI Research (FAIR). Her research in computer vision and machine learning focuses on video, visual recognition, and action for perception or embodied AI. Before joining UT-Austin in 2007, she received her Ph.D. at MIT. She is an IEEE Fellow, AAAI Fellow, Sloan Fellow, a Microsoft Research New Faculty Fellow, and a recipient of NSF CAREER and ONR Young Investigator awards, the PAMI Young Researcher Award in 2013, the 2013 Computers and Thought Award from the International Joint Conference on Artificial Intelligence (IJCAI), and the Presidential Early Career Award for Scientists and Engineers (PECASE) in 2013. She was inducted into the UT Academy of Distinguished Teachers in 2017. She and her collaborators have been recognized with several Best Paper awards in computer vision, including a 2011 Marr Prize and a 2017 Helmholtz Prize (test of time award). She served for six years as an Associate Editor-in-Chief for the Transactions on Pattern Analysis and Machine Intelligence (PAMI) and for ten years as an Editorial Board member for the International Journal of Computer Vision (IJCV). She also served as a Program Chair of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2015 and a Program Chair of Neural Information Processing Systems (NeurIPS) 2018, and will serve as a Program Chair of the IEEE International Conference on Computer Vision (ICCV) 2023.

  • Invited Speaker Panel
    4:30 - 5:30 PM PT
    Moderator - Anthony Francis
    Logical Robotics
  • Workshop Concludes
    5:30 PM PT


Demos

In association with the Embodied AI Workshop, Meta AI will present a demo of LSC: Language-guided Skill Coordination for Open-Vocabulary Mobile Pick-and-Place, in which a Boston Dynamics Spot robot will follow voice commands for object rearrangement such as "Find the plush in the table and place it in the case." The demo times for LSC are:

  • Expo Meta AI Booth: Tuesday-Thursday, June 20-22, 11:00 AM - 5:00 PM
  • West Exhibit Hall Demo Area: Thursday, June 22, 10:00 AM - 6:00 PM


Challenges

The Embodied AI 2023 workshop is hosting many exciting challenges covering a wide range of topics such as rearrangement, visual navigation, vision-and-language, and audio-visual navigation. More details regarding data, submission instructions, and timelines can be found on the individual challenge websites.

The workshop organizers are awarding each first-place challenge winner $300, sponsored by Apple, Hello Robot, and Logical Robotics.

Challenge winners will be given the opportunity to present a talk at the workshop. Since many challenges can be grouped into similar tasks, we encourage participants to submit models to more than one challenge. The table below describes, compares, and links to each challenge.

Challenge | Task | 2023 Winner | Simulation Platform | Scene Dataset | Observations | Action Space | Interactive Actions? | Stochastic Actuation?
Habitat | ObjectNav | SkillFusion (AIRI) | Habitat | HM3D Semantics | RGB-D, Localization | Continuous | |
Habitat | ImageNav | LQ | Habitat | HM3D Semantics | RGB-D, Localization | Continuous | |
RxR-Habitat | Vision-and-Language Navigation | The GridMM Team | Habitat | Matterport3D | RGB-D | Discrete | |
MultiON | Multi-Object Navigation | | Habitat | HM3D Semantics | RGB-D, Localization | Discrete | |
SoundSpaces | Audio Visual Navigation | AK-lab-tokyotech | Habitat | Matterport3D | RGB-D, Audio Waveform | Discrete | |
SoundSpaces | Active Audio Visual Source Separation | AK-lab-tokyotech | Habitat | Matterport3D | RGB-D, Audio Waveform | Discrete | |
Robotic Vision Scene Understanding | Semantic SLAM | Team SP | Isaac Sim | Active Scene Understanding | RGB-D, Pose Data, Flatscan Laser | Discrete | | Partially
Robotic Vision Scene Understanding | Rearrangement (SCD) | MSC Lab | Isaac Sim | Active Scene Understanding | RGB-D, Pose Data, Flatscan Laser | Discrete | |
TDW-Transport | Rearrangement | | TDW | TDW | RGB-D, Metadata | Discrete | |
AI2-THOR Rearrangement | Rearrangement | TIDEE | AI2-THOR | iTHOR | RGB-D, Localization | Discrete | |
Language Interaction | Instruction Following and Dialogue | Yonsei VnL | AI2-THOR | iTHOR | RGB | Discrete | |
DialFRED | Vision-and-Dialogue Interaction | Team Keio | AI2-THOR | iTHOR | RGB | Discrete | |
ManiSkill | Generalized Manipulation | GXU-LIPE | SAPIEN | PartNet-Mobility, YCB, EGAD | RGB-D, Metadata | Continuous | |


Call for Papers

We invite high-quality 2-page extended abstracts on embodied AI, especially in areas relevant to the themes of this year's workshop:

  • Foundation Models
  • Generalist Agents
  • Sim to Real Transfer
as well as themes related to embodied AI in general:
  • Simulation Environments
  • Visual Navigation
  • Rearrangement
  • Embodied Question Answering
  • Embodied Vision & Language
Accepted papers will be presented as posters or spotlight talks at the workshop. These papers will be made publicly available in a non-archival format, allowing future submission to archival journals or conferences. Paper submissions do not have to be anonymized. Per CVPR rules regarding workshop papers, at least one author must register for CVPR using an in-person registration.

The submission deadline is May 26th (Anywhere on Earth). Papers should be no longer than 2 pages (excluding references) and styled in the CVPR format.

  • Paper submissions have now CLOSED.

Note. The order of the papers is randomized each time the page is refreshed.

Question Generation to Disambiguate Referring Expressions in 3D Environment
Fumiya Matsuzawa, Ryo Nakamura, Kodai Nakashima, Yue Qiu, Hirokatsu Kataoka, Yutaka Satoh
Our paper presents a novel task and method for question generation, aimed to disambiguate referring expressions within 3D indoor environments (3D-REQ).
LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models
Chan Hee Song, Jiaman Wu, Clayton B Washington, Brian M. Sadler, Wei-Lun Chao, Yu Su
In this work, we propose a novel method, LLM-Planner, that harnesses the power of large language models to do few-shot planning for embodied agents.
Look Ma, No Hands! Agent-Environment Factorization of Egocentric Videos
Matthew Chang, Aditya Prakash, Saurabh Gupta
The analysis and use of egocentric videos for robotic tasks is made challenging by occlusion due to the hand and the visual mismatch between the human hand and a robot end-effector.
Boosting Outdoor Vision-and-Language Navigation with On-the-route Objects
Yanjun Sun, Yue Qiu, Yoshimitsu Aoki, Hirokatsu Kataoka
Outdoor Vision-and-Language Navigation (VLN) is a challenging task that requires an agent to navigate using real-world urban environment data and natural language instructions.
Emergence of Implicit System Identification via Embodiment Randomization
Pranav Putta, Gunjan Aggarwal, Roozbeh Mottaghi, Dhruv Batra, Naoki Harrison Yokoyama, Joanne Truong, Arjun Majumdar
We show that embodiment randomization can produce visual navigation agents that are able to generalize to new embodiments in a zero-shot manner.
When Learning Is Out of Reach, Reset: Generalization in Autonomous Visuomotor Reinforcement Learning
Zichen Zhang, Luca Weihs
Episodic training, where an agent's environment is reset to some initial condition after every success or failure, is the de facto standard when training embodied reinforcement learning (RL) agents.
A Hypothetical Framework of Embodied Generalist Agent with Foundation Model Assistance
Weirui Ye, Yunsheng Zhang, Xianfan Gu, Yang Gao
Recent significant advancements in computer vision (CV) and natural language processing (NLP) have showcased the vital importance of leveraging prior knowledge obtained from extensive data for a generalist agent.
Curriculum Learning via Task Selection for Embodied Navigation
Ram Ramrakhya, Dhruv Batra, Aniruddha Kembhavi, Luca Weihs
In this work, we study the use of ACL for training long-horizon embodied AI tasks with sparse rewards using RL.
SalsaBot: Towards a Robust and Generalizable Embodied Agent
Chan Hee Song, Jiaman Wu, Ju-Seung Byun, Zexin Xu, Vardaan Pahuja, Goonmeet Bajaj, Samuel Stevens, Ziru Chen, Yu Su
As embodied agents become more powerful, there arises a need for an agent to collaboratively solve tasks with humans.
Audio Visual Language Maps for Robot Navigation
Chenguang Huang, Oier Mees, Andy Zeng, Wolfram Burgard
While interacting in the world is a multi-sensory experience, many robots continue to predominantly rely on visual perception to map and navigate in their environments.
Exploiting Proximity-Aware Tasks for Embodied Social Navigation
Enrico Cancelli, Tommaso Campari, Luciano Serafini, Angel X Chang, Lamberto Ballan
Learning how to navigate among humans in an occluded and spatially constrained indoor environment is a key ability required for an embodied agent to be integrated into our society.
Reduce, Reuse, Recycle: Modular Multi-Object Navigation
Sonia Raychaudhuri, Tommaso Campari, Unnat Jain, Manolis Savva, Angel X Chang
Our work focuses on the Multi-Object Navigation (MultiON) task, where an agent needs to navigate to multiple objects in a given sequence.
Dynamic-Resolution Model Learning for Object Pile Manipulation
Yixuan Wang, Yunzhu Li, Katherine Rose Driggs-Campbell, Li Fei-Fei, Jiajun Wu
Dynamics models learned from visual observations have been shown to be effective in various robotic manipulation tasks.
DialMAT: Dialogue-Enabled Transformer with Moment-Based Adversarial Training
Kanta Kaneda, Ryosuke Korekata, Yuiga Wada, Shunya Nagashima, Motonari Kambara, Yui Iioka, Haruka Matsuo, Yuto Imai, Takayuki Nishimura, Komei Sugiura
This paper focuses on the DialFRED task, which is the task of embodied instruction following in a setting where an agent can actively ask questions about the task.
Unordered Navigation to Multiple Semantic Targets in Novel Environments
Bernadette Bucher, Katrina Ashton, Bo Wu, Karl Schmeckpeper, Siddharth Goel, Nikolai Matni, Georgios Georgakis, Kostas Daniilidis
We consider the problem of unordered navigation to multiple objects in a novel environment.
SegmATRon: Embodied Adaptive Semantic Segmentation for Indoor Environment
Tatiana Zemskova, Margarita Kichik, Dmitry Yudin, Aleksandr Panov
This paper presents an adaptive transformer model named SegmATRon for embodied image semantic segmentation.
Predicting Motion Plans for Articulating Everyday Objects
Arjun Gupta, Max Shepherd, Saurabh Gupta
Mobile manipulation tasks such as opening a door, pulling open a drawer, or lifting a toilet lid require constrained motion of the end-effector under environmental and task constraints.
EnvironAI: Extending AI Research into the Whole Environment
Jingyi Duan, Song Tong, Hongyi Shi, Honghong Bai, Xuefeng Liang, Kaiping Peng
This paper introduces Environment with AI (EnvironAI) as a complementary perspective to Embodied AI research.
Situated Real-time Interaction with a Virtually Embodied Avatar
Sunny Panchal, Guillaume Berger, Antoine Mercier, Cornelius Böhm, Florian Dietrichkeit, Xuanlin Li, Reza Pourreza, Pulkit Madan, Apratim Bhattacharyya, Mingu Lee, Mark Todorovich, Ingo Bax, Roland Memisevic
Recent advances in large language model fine-tuning datasets and techniques have made them flourish as general dialogue-based assistants that are well-suited to strictly turn-based interactions.
Fully Automated Task Management for Generation, Execution, and Evaluation: A Framework for Fetch-and-Carry Tasks with Natural Language Instructions in Continuous Space
Motonari Kambara, Komei Sugiura
This paper aims to develop a framework that enables a robot to execute tasks based on visual information, in response to natural language instructions for Fetch-and-Carry with Object Grounding (FCOG) tasks.
Generalizing Skill Embeddings Across Body Shapes for Physically Simulated Characters
Sammy Christen, Nina Schmid, Otmar Hilliges
Recent progress in physics-based character animation has enabled learning diverse skills from large motion capture datasets.


Sponsors

The Embodied AI 2023 Workshop is sponsored by the following organizations:

  • Apple
  • Hello Robot
  • Logical Robotics


Organizers

The Embodied AI 2023 workshop is a joint effort by a large set of researchers from a variety of organizations. They are listed below in alphabetical order.
Andrew Szot
GaTech
Angel X. Chang
SFU
Anthony Francis
Logical Robotics
Changan Chen
UT Austin
Chengshu Li
Stanford
Claudia Pérez D’Arpino
NVIDIA
David Hall
CSIRO
Devendra Singh Chaplot
Meta AI
Devon Hjelm
Apple
Jesse Thomason
USC
Lamberto Ballan
U Padova
Luca Weihs
AI2
Matt Deitke
AI2, UW
Mike Roberts
Intel
Mohit Shridhar
UW
Naoki Yokoyama
GaTech
Oleksandr Maksymets
Meta AI
Rin Metcalf
Apple
Sagnik Majumder
UT Austin
Sören Pirk
Adobe
Yonatan Bisk
CMU
Angel X. Chang
SFU
Changan Chen
UT Austin
Chuang Gan
IBM, MIT
David Hall
CSIRO
Devendra Singh Chaplot
Meta AI
Dhruv Batra
GaTech, Meta AI
Fanbo Xiang
UCSD
Govind Thattai
Amazon
Jacob Krantz
Oregon State
Jesse Thomason
USC
Jiayuan Gu
UCSD
Luca Weihs
AI2
Manolis Savva
SFU
Matt Deitke
AI2, UW
Mohit Shridhar
UW
Naoki Yokoyama
GaTech
Oleksandr Maksymets
Meta AI
Roozbeh Mottaghi
FAIR, UW
Ruohan Gao
Stanford
Sagnik Majumder
UT Austin
Sonia Raychaudhuri
SFU
Stone Tao
UCSD
Tommaso Campari
SFU, UNIPD
Unnat Jain
UIUC
Xiaofeng Gao
Amazon
Yonatan Bisk
CMU
Alexander Toshev
Apple
Ali Farhadi
Apple, UW
Aniruddha Kembhavi
AI2, UW
Antonio M. Lopez
UAB-CVC
Devi Parikh
GaTech, Meta AI
Dhruv Batra
GaTech, Meta AI
Fei-Fei Li
Stanford
German Ros
Intel
Jie Tan
Google
Joanne Truong
GaTech
Jose A. Iglesias-Guitian
UDC-CITIC
Jose M. Alvarez
NVIDIA
Manolis Savva
SFU
Roberto Martín-Martín
Stanford
Roozbeh Mottaghi
FAIR, UW
Silvio Savarese
Salesforce, Stanford
Sonia Chernova
GaTech