Minds live in bodies, and bodies move through a changing world. The goal of embodied artificial intelligence is to create agents, such as robots, that learn to creatively solve challenging tasks requiring interaction with the environment. While this is a tall order, fantastic advances in deep learning and the increasing availability of large datasets like ImageNet have enabled superhuman performance on a variety of AI tasks previously thought intractable. Computer vision, speech recognition, and natural language processing have undergone transformative revolutions on passive input-output tasks like language translation and image processing, and reinforcement learning has similarly achieved world-class performance on interactive tasks like games. These advances have supercharged embodied AI, enabling a growing community of researchers to make rapid progress toward intelligent agents that can:
The goal of the Embodied AI workshop is to bring together researchers from computer vision, language, graphics, and robotics to share and discuss the latest advances in embodied intelligent agents. This year's workshop will focus on the three themes of:
Saurabh Gupta is an Assistant Professor in the ECE Department at UIUC. Before starting at UIUC in 2019, he received his Ph.D. from UC Berkeley in 2018 and spent the following year as a Research Scientist at Facebook AI Research in Pittsburgh. His research interests span computer vision, robotics, and machine learning, with a focus on building agents that can intelligently interact with the physical world around them. He received the President's Gold Medal at IIT Delhi in 2011, the Google Fellowship in Computer Vision in 2015, an Amazon Research Award in 2020, and an NSF CAREER Award in 2022. He has also won many challenges at leading computer vision conferences.
Fei Xia is a Research Scientist on the Robotics team at Google Research. He received his PhD from the Department of Electrical Engineering at Stanford University, where he was co-advised by Silvio Savarese in SVL and Leonidas Guibas. His mission is to build intelligent embodied agents that can interact with complex and unstructured real-world environments, with applications to home robotics. He approaches this problem from three aspects: (1) large-scale and transferable simulation for robotics, (2) learning algorithms for long-horizon tasks, and (3) combining geometric and semantic representations of environments. Most recently, he has been exploring the use of foundation models for robot decision making.
Russ Salakhutdinov is a UPMC Professor of Computer Science in the Department of Machine Learning at CMU. He received his PhD in computer science from the University of Toronto. After spending two post-doctoral years at MIT, he joined the University of Toronto and later moved to CMU. Russ's primary interests lie in deep learning, machine learning, and large-scale optimization. He is an action editor of the Journal of Machine Learning Research, has served as director of AI research at Apple, has served on the senior program committees of several top-tier learning conferences including NeurIPS and ICML, was a program co-chair for ICML 2019, and will serve as a general chair for ICML 2024. He is an Alfred P. Sloan Research Fellow, a Microsoft Research Faculty Fellow, and a recipient of the Early Researcher Award, a Google Faculty Award, and Nvidia's Pioneers of AI Award.
Dieter Fox received his PhD degree from the University of Bonn, Germany. He is a professor in the Allen School of Computer Science & Engineering at the University of Washington, where he heads the UW Robotics and State Estimation Lab. He is also Senior Director of Robotics Research at NVIDIA. His research is in robotics and artificial intelligence, with a focus on learning and estimation applied to problems such as robot manipulation, planning, language grounding, and activity recognition. He has published more than 300 technical papers and is co-author of the textbook "Probabilistic Robotics". Dieter is a Fellow of the IEEE, ACM, and AAAI, and recipient of the IEEE RAS Pioneer Award and the IJCAI John McCarthy Award.
Kristen Grauman is a Professor in the Department of Computer Science at the University of Texas at Austin and a Research Director at Facebook AI Research (FAIR). Her research in computer vision and machine learning focuses on video, visual recognition, and action for perception or embodied AI. Before joining UT-Austin in 2007, she received her Ph.D. from MIT. She is an IEEE Fellow, AAAI Fellow, Sloan Fellow, and Microsoft Research New Faculty Fellow, and a recipient of the NSF CAREER and ONR Young Investigator awards, the PAMI Young Researcher Award in 2013, the 2013 Computers and Thought Award from the International Joint Conference on Artificial Intelligence (IJCAI), and the Presidential Early Career Award for Scientists and Engineers (PECASE) in 2013. She was inducted into the UT Academy of Distinguished Teachers in 2017. She and her collaborators have been recognized with several Best Paper awards in computer vision, including a 2011 Marr Prize and a 2017 Helmholtz Prize (test of time award). She served for six years as an Associate Editor-in-Chief for the Transactions on Pattern Analysis and Machine Intelligence (PAMI) and for ten years as an Editorial Board member for the International Journal of Computer Vision (IJCV). She also served as a Program Chair of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2015 and a Program Chair of Neural Information Processing Systems (NeurIPS) 2018, and will serve as a Program Chair of the IEEE International Conference on Computer Vision (ICCV) 2023.
In association with the Embodied AI Workshop, Meta AI will present a demo of LSC: Language-guided Skill Coordination for Open-Vocabulary Mobile Pick-and-Place, in which a Boston Dynamics Spot robot will follow voice commands for object rearrangement such as "Find the plush on the table and place it in the case." The demo times for LSC include:
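To give a flavor of what "language-guided skill coordination" means in practice, here is a minimal, hypothetical sketch of turning a pick-and-place command into a sequence of robot skill calls. Everything in it (the `plan_skills` function, the skill vocabulary, and the naive string parser) is an illustrative assumption for exposition, not Meta AI's actual LSC implementation, which coordinates learned skills on a real robot.

```python
# Hypothetical sketch: mapping a language command to a skill sequence.
# Skill names and the toy parser are assumptions, not the LSC API.
from dataclasses import dataclass


@dataclass
class SkillCall:
    skill: str   # e.g., "navigate", "pick", "place"
    target: str  # open-vocabulary object or receptacle name


def plan_skills(command: str) -> list[SkillCall]:
    """Toy planner for commands shaped like
    'Find the X on the Y and place it in the Z.'
    A real system would use a learned model, not string matching."""
    words = command.rstrip(".").lower().split()
    obj = words[words.index("the") + 1]  # word after the first "the"
    receptacle = words[-1]               # trailing receptacle name
    return [
        SkillCall("navigate", obj),
        SkillCall("pick", obj),
        SkillCall("navigate", receptacle),
        SkillCall("place", receptacle),
    ]


if __name__ == "__main__":
    cmd = "Find the plush on the table and place it in the case."
    for call in plan_skills(cmd):
        print(call)
```

The point of the sketch is the decomposition: a single open-vocabulary command expands into a fixed repertoire of reusable skills (navigate, pick, place), each parameterized by a target named in the command.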
The Embodied AI 2023 workshop is hosting many exciting challenges covering a wide range of topics such as rearrangement, visual navigation, vision-and-language, and audio-visual navigation. More details regarding data, submission instructions, and timelines can be found on the individual challenge websites.
The workshop organizers are awarding each first-place challenge winner $300, sponsored by Apple, Hello Robot, and Logical Robotics.
Challenge winners will be given the opportunity to present a talk at the workshop. Since many of the challenges address similar tasks, we encourage participants to submit models to more than one challenge. The table below describes, compares, and links each challenge.
We invite high-quality 2-page extended abstracts on embodied AI, especially in areas relevant to the themes of this year's workshop:
The submission deadline is May 26th (Anywhere on Earth). Papers should be no longer than 2 pages (excluding references) and styled in the CVPR format.
Note: The order of the papers is randomized each time the page is refreshed.