Workshop Overview

This workshop addresses a critical gap in current AI research: the integration of language and 3D perception. This integration is essential for developing embodied agents and robots, particularly given the recent rise of multimodal LLMs and vision-language-action (VLA) models.

Building on the momentum and insights from our first workshop at CVPR 2025, this second edition continues to provide a platform for discussing the integration of language and 3D perception. The workshop will deepen the collaboration established in the first edition and further advance the state of the art in 3D-LLM/VLA research.

Topics Include:

Integration of language and 3D perception
Large language models (LLMs) in 3D environments
3D vision-language-action (VLA) models
Embodied agents that integrate language, vision, and action
3D scene understanding and generative world models
Robot control and navigation using natural language
Multimodal learning for embodied AI

Call for Papers

Overview

We invite submissions of papers related to the integration of language and 3D perception, with a focus on developing embodied agents and robots. Topics of interest include, but are not limited to:

Language-guided 3D perception and understanding
Large language models (LLMs) for 3D environment understanding
3D vision-language-action (VLA) models
Embodied agents that integrate language, vision, and action
3D scene understanding and generative world models
Robot control and navigation using natural language
Multimodal learning for embodied AI
Datasets and benchmarks for 3D-LLMs and 3D-VLAs
Applications of 3D-LLMs and 3D-VLAs in real-world scenarios
Ethical considerations in developing embodied AI systems

Awards

Congratulations to our seven spotlight papers, including the best paper and runner-up award recipients:

Best Paper Award:
Ψ0: An Open Foundation Model Towards Universal Humanoid Loco-Manipulation

Best Paper Runner-up Awards:
Do 3D Large Language Models Really Understand 3D Spatial Relationships?
PA3FF: Learning Part-Aware Dense 3D Feature Field For Generalizable Articulated Object Manipulation
VLS: Steering Pretrained Robot Policies via Vision–Language Models
Robot Learning from a Physical World Model

Spotlight Papers:
GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation
LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans

Submission Guidelines

Papers can be submitted in any major conference's format, with a length of 2-8 pages (excluding references). All submissions will be peer-reviewed, and accepted papers will be presented at the workshop as posters.

Submissions should be made through the OpenReview submission system.

Important Dates

Paper Submission Deadline: April 26, 2026
Decision Notification: May 10, 2026
Camera Ready Deadline: May 24, 2026
Workshop Date: June 3, 2026

Publication

The workshop will be non-archival. Authors of accepted papers retain the full copyright of their work and are free to submit extended versions to conferences or journals.

Schedule

June 3, 2026 | Room 1CD | CVPR 2026, Denver, CO, USA

12:30pm - 1:00pm | Poster Setup & Poster Session | Authors set up posters and early attendees can browse
1:00pm - 1:45pm | Keynote 1: Ziwei Liu | Nanyang Technological University | Computer Vision, Multimodal AI
1:45pm - 2:30pm | Keynote 2: Yue Wang (Remote) | University of Southern California | Vision, Robotics
2:30pm - 3:15pm | Keynote 3: Leonidas Guibas | Stanford University | 3D Vision, Robotics
3:15pm - 3:30pm | Coffee Break
3:30pm - 4:15pm | Keynote 4: Angela Dai | Technical University of Munich | 3D Vision
4:15pm - 5:00pm | Keynote 5: Ranjay Krishna | University of Washington | Vision, NLP, Robotics
5:00pm - 5:45pm | Keynote 6: Marc Pollefeys | ETH Zurich | 3D Vision
5:45pm - 6:15pm | Closing Poster Session, Best Paper Announcement & Remarks | Poster presentations, best paper announcement, and workshop closing discussion

Accepted Papers

Stress-Aware Reasoning for Robust Vision-Language-Action Agents

LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans (Spotlight)

Ψ0: An Open Foundation Model Towards Universal Humanoid Loco-Manipulation (Best Paper, Spotlight)

LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds

Humanoid Everyday: A Comprehensive Robotic Dataset for Open-World Humanoid Manipulation

Chasing Ghosts: A Simulation-to-Real Olfactory Navigation Stack with Optional Vision Augmentation

FunFact: Building Probabilistic Functional 3D Scene Graphs via Factor-Graph Reasoning

EthosVLA: A Constitutional Safety Framework for Vision-Language-Action Models in Physical 3D Environments

CausalScene: Learning Causal 3D Scene Graphs for Counterfactual Reasoning in Embodied Agents

What Do VLAs Actually Learn through In-Context Failure Conditioning?

Do 3D Large Language Models Really Understand 3D Spatial Relationships? (Runner-up, Spotlight)

Prune-Then-Plan: Step-Level Calibration for Stable Frontier Exploration in Embodied Question Answering

Code3DBench: Single-Image to Executable Low-Poly 3D Code Generation

Video2Assets: Extracting 3D Object Assets from Unconstrained Video

4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation (Spotlight)

Multimodal Causal Subtask Modeling for Scalable VLA Pipelines in Long-Horizon Manipulation

SA-VLA: Spatially-Aware Flow-Matching for Vision-Language-Action Reinforcement Learning

Reference-Free Assessment of Physical Consistency in World Model-based Video Generation

Robot Learning from a Physical World Model (Runner-up, Spotlight)

Boosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations

AbsVLA: Learning Robust Primitive Manipulation Skills for VLA Models in Object-Centric Abstracted States

Autonomous Frontier-Based Exploration with VLM Guidance

SPATIOROUTE: Dynamic Prompt Routing for Zero-Shot Spatial Reasoning

LychSim: A Controllable and Interactive Simulation Framework for Multimodal LLM

LangPose: SE(3)-Equivariant Language Grounding for 3D Vision-Language-Action Models

A Taxonomy-Driven Modular Defense against Non-Canonical Language in Vision-Language-Action Models

Semantic-Aware Guided Drone Exploration for Language-Conditioned 3D Indoor Mapping

Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

VLS: Steering Pretrained Robot Policies via Vision–Language Models (Runner-up, Spotlight)

Token Warping Helps MLLMs Look from Nearby Viewpoints

Explicit Token-Based Adapters for Frozen Vision-Language-Action Models

SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization

Compressing 3D Scene Context for LLMs: Spatial Enriched Graph Attention

Can Embodied Agents Remember What You Said? Evaluating Dialogue-Grounded Embodied Memory in 3D Environments

Dynamic Anchors for Closed-Loop Language-Guided Camera Control in Basketball Scenes

BIT-Nav: Brain-Inspired Trajectory Memory for Embodied Navigation

PA3FF: Learning Part-Aware Dense 3D Feature Field For Generalizable Articulated Object Manipulation (Runner-up, Spotlight)

ConstructGPT: BIM-Conditioned 3D Diffusion and World Models for Automated Construction Site Monitoring and Compliance Checking

Contact Us

Have questions? We're here to help

Email

For general inquiries:

yinihong@cs.stanford.edu

whu@cs.ucla.edu

Paper Submissions

Submit your papers via:

OpenReview Submission System

Workshop Location

CVPR 2026

Room 1CD

Denver, CO, USA

Frequently Asked Questions

What is the paper submission deadline?

The paper submission deadline is April 26, 2026.

Is the workshop in-person or virtual?

The workshop will be held in person on June 3, 2026, at CVPR 2026 in Room 1CD, Denver, CO, USA.

Are the workshop papers archival?

No, the workshop will be non-archival. Authors of accepted papers retain the full copyright of their work and are free to submit extended versions to conferences or journals.

What is the maximum page length for submissions?

Papers can be submitted in any major conference's format, with a length of 2-8 pages (excluding references).

Bridging Language, Vision and Action in 3D Environments

Keynote Speakers

Ziwei Liu

Yue Wang (Remote)

Leonidas Guibas

Angela Dai

Ranjay Krishna

Marc Pollefeys

Organizing Committee

Yining Hong

Wenbo Hu

Jianing Yang

Shengyi Qian

Valts Blukis

Yilun Du

Manling Li

David Fouhey

Joyce Chai

Jiajun Wu

Leonidas Guibas

Fei-Fei Li

Yejin Choi

Sponsors