We appreciate any useful suggestions from peers for improving this paper list or survey. Please raise an issue or send an email to [email protected]. Thanks for your cooperation! We also welcome pull requests for this project!
Before taking action, humans make predictions based on their objectives and observations of the current environment. These predictions manifest in various forms, e.g., textual planning, visual imagination of future scene changes, or even subconscious planning at the action level. Each of these predictive capabilities is critical to the successful completion of tasks. With the development of generative models, agents driven by these models are exhibiting similar predictive capabilities: they complete embodied tasks by making human-like predictions, whether through high-level textual planning, image-based guidance, or future video prediction that drives actions. We refer to these models as World Models. Recently, such models have been widely applied across domains, from developing agents that solve inference tasks to leveraging predictions to drive robots to perform specific actions.
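As a rough intuition for the predict-plan-act loop these world-model-driven agents share, here is a minimal, hypothetical sketch; the names (`ToyWorldModel`, `plan`) and the toy scalar dynamics are illustrative assumptions, not the interface of any paper listed below.

```python
# Minimal sketch of a world-model-driven agent: imagine outcomes with the model,
# pick the action whose imagined future best matches the goal, then act.
# All names and the toy scalar dynamics are illustrative assumptions.

class ToyWorldModel:
    """Stand-in for a learned predictive model (text plans, images, or videos in practice)."""

    def predict(self, state: float, action: float) -> float:
        # Real world models roll out latent states or pixels; here: next state = state + action.
        return state + action


def plan(model: ToyWorldModel, state: float, goal: float, candidates: list) -> float:
    """Choose the candidate action whose imagined next state is closest to the goal."""
    return min(candidates, key=lambda a: abs(model.predict(state, a) - goal))


if __name__ == "__main__":
    model, state, goal = ToyWorldModel(), 0.0, 3.0
    for _ in range(4):
        action = plan(model, state, goal, candidates=[-1.0, 0.0, 1.0])
        state = model.predict(state, action)  # act; the toy "environment" equals the model
        print(f"action={action:+.1f} -> state={state:.1f}")
```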
- [2024.12.29] We release the first version of the paper list for Embodied AI. This page is continually updating!
- Books & Surveys
- Foundation Model
- General World Model
- Autonomous Driving
- Robot Manipulation
- Indoor Navigation
- World Model Safety
- Social Simulation
- Multi-Agent World Model
- Multimodal Large Models: The New Paradigm of Artificial General Intelligence, Publishing House of Electronics Industry (PHE), 2024
Yang Liu, Liang Lin
[Page]
The construction of world models often builds on various foundation models.
- [LLaMA] LLaMA: Open and Efficient Foundation Language Models [arxiv]
- [PaLM] PaLM: Scaling Language Modeling with Pathways [arxiv]
- [PaLM-E] PaLM-E: An Embodied Multimodal Language Model [arxiv]
- [LLaVA] Visual Instruction Tuning [NIPS2023]
- [SDXL] SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis [arxiv]
- [PixArt-Ξ±] PixArt-Ξ±: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis [ICLR2024]
- [Show-o] Show-o: One single transformer to unify multimodal understanding and generation [arxiv]
- [ModelScope] ModelScope Text-to-Video Technical Report [arxiv]
- [VideoCrafter] VideoCrafter1: Open Diffusion Models for High-Quality Video Generation [arxiv]
- [DynamiCrafter] DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors [ECCV2024]
- [CogVideoX] CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer [arxiv]
- [Open-Sora Plan] [project]
A general world model aims to represent and simulate a wide range of situations and interactions, especially those encountered in the real world.
- [Gen-2] Gen-2: Generate novel videos with text, images or video clips. [project]
- [Gen-3-Alpha] Gen-3 Alpha: A New Frontier for Video Generation. [project]
- [Pandora] Pandora: Towards General World Model with Natural Language Actions and Video States [arxiv]
- [Dreamer] Dream to Control: Learning Behaviors by Latent Imagination. [ICLR2020]
- [DreamerV2] Mastering Atari with discrete world models. [ICLR2021]
- [DreamerV3] Mastering Diverse Domains through World Models [arxiv]
- [TD-MPC2] TD-MPC2: Scalable, Robust World Models for Continuous Control [ICLR2024]
- [UniSim] Learning Interactive Real-World Simulators [ICLR2024]
- [General-World-Models-Survey] Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond [project]
- [3D-VLA] 3D-VLA: A 3D Vision-Language-Action Generative World Model [ICML2024]
- Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving. [ICRA]
- [LanguageMPC] LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving. [arxiv]
- [DriveMLM] DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving. [arxiv]
- [DriveLM] DriveLM: Driving with Graph Visual Question Answering. [ECCV2024]
- [LMDrive] LMDrive: Closed-Loop End-to-End Driving with Large Language Models. [CVPR2024]
- [DiLu] DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models. [ICLR2024]
- [DriveVLM] DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models. [CoRL2024]
- [LeapAD] Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving. [NIPS2024]
- [AD-H] AD-H: Autonomous Driving with Hierarchical Agents. [arxiv]
- [Think2Drive] Think2Drive: Efficient Reinforcement Learning by Thinking in Latent World Model for Quasi-Realistic Autonomous Driving (in CARLA-v2). [ECCV2024]
- [GAIA-1] GAIA-1: A generative world model for autonomous driving. [arxiv]
- [MagicDrive] MagicDrive: Street View Generation with Diverse 3D Geometry Control. [ICLR2024]
- [DrivingDiffusion] DrivingDiffusion: Layout-Guided Multi-View Driving Scene Video Generation with Latent Diffusion Model. [ECCV2024]
- [OCCWorld] OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving. [ECCV2024]
- [Vista] Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability. [NIPS2024]
- [ADriver-I] ADriver-I: A General World Model for Autonomous Driving [arxiv]
- [GenAD] GenAD: Generative End-to-End Autonomous Driving. [ECCV2024]
- [DriveDreamer] DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving. [ECCV2024]
- [DriveDreamer-2] DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation. [arxiv]
- [NeMo] Neural Volumetric World Models for Autonomous Driving. [ECCV2024]
- [ViDAR] Visual Point Cloud Forecasting enables Scalable Autonomous Driving. [CVPR2024]
- [Drive-WM] Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving. [CVPR2024]
- [DriveWorld] DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving. [CVPR2024]
- [PlaNet] Learning Latent Dynamics for Planning from Pixels. [ICML2019]
- [Plan2Explore] Planning to Explore via Self-Supervised World Models. [ICML2020]
- [RoboDreamer] Learning Compositional World Models for Robot Imagination. [ICML2024]
- [SWIM] Structured World Models from Human Videos. [RSS2023]
- [FOWM] Finetuning Offline World Models in the Real World. [CoRL2023]
- [STEDIE] Interaction-based Disentanglement of Entities for Object-centric World Models. [ICLR2023]
- [MWM] Masked World Models for Visual Control. [CoRL2022]
- [CEE-US] Curious Exploration via Structured World Models Yields Zero-Shot Object Manipulation. [NeurIPS2022]
- [MV-MWM] Multi-View Masked World Models for Visual Robotic Manipulation. [ICML2023]
- [ContextWM] Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning. [NeurIPS2023]
- [DexSim2Real2] DexSim2Real2: Building Explicit World Model for Precise Articulated Object Dexterous Manipulation. [arxiv]
- [DayDreamer] DayDreamer: World Models for Physical Robot Learning. [CoRL2022]
- [-] Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning. [NeurIPS2023]
- [-] When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning. [NeurIPS2023]
- [Survey] What Foundation Models can Bring for Robot Learning in Manipulation: A Survey. [arxiv]
- [-] Transferring Foundation Models for Generalizable Robotic Manipulation. [arxiv]
- [SculptBot] SculptBot: Pre-Trained Models for 3D Deformable Object Manipulation. [ICRA2024]
- [HiP] Compositional Foundation Models for Hierarchical Planning. [NeurIPS2023]
- [MA] Manipulate-Anything: Automating Real-World Robots using Vision-Language Models. [arxiv]
- [AutoRT] AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents. [arxiv]
- [SuSIE] Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models. [arxiv]
- [MOO] Open-World Object Manipulation using Pre-Trained Vision-Language Models [CoRL2023]
- [Pathdreamer] Pathdreamer: A World Model for Indoor Navigation [ICCV2021]
- [Panogen] Panogen: Text-conditioned panoramic environment generation for vision-and-language navigation [NeurIPS2023]
- [Dreamwalker] Dreamwalker: Mental planning for continuous vision-language navigation [ICCV2023]
- [VLN-SIG] Improving Vision-and-Language Navigation by Generating Future-View Image Semantics [CVPR2023]
- [LFG] Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning [CoRL2023]
- [ViNT] ViNT: A Foundation Model for Visual Navigation [CoRL2023]
- [ENTL] ENTL: Embodied Navigation Trajectory Learner [ICCV2023]
- [GSAI] Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems [arxiv]
- [LlamaGuard] Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations [arxiv]
- [Llama Guard 3 Vision] Meta Llama Guard 3 Vision [huggingface]
- [SPA-VL] SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model [arxiv]
- [SafeBench] SafeBench: A Benchmarking Platform for Safety Evaluation of Autonomous Vehicles [NIPS2022]
- [Beavertails] Beavertails: Towards improved safety alignment of llm via a human-preference dataset [NIPS2023]
- [Salad-bench] Salad-bench: A hierarchical and comprehensive safety benchmark for large language models [ACL2024 Findings]
- [GCG] Universal and transferable adversarial attacks on aligned language models [arxiv]
- [COLD-Attack] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability [arxiv]
- [-] From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking [arxiv]
- [Agent Smith] Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast [arxiv]
- [MLLM-Protector] MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance [paper]
- [Adversarial Tuning] Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs [arxiv]
- [-] Generative Agents: Interactive Simulacra of Human Behavior [paper]
- [$S^3$] $S^3$: Social-network Simulation System with Large Language Model-Empowered Agents [paper]
- [ConsensusLLM] Multi-Agent Consensus Seeking via Large Language Models [paper]
- [SaF] Lyfe Agents: Generative agents for low-cost real-time social interactions [paper]
- [-] Quantifying the Impact of Large Language Models on Collective Opinion Dynamics [paper]
- [CAMEL] CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society [NIPS2023]
- [ToM] Theory of Mind for Multi-Agent Collaboration via Large Language Models [EMNLP2023]
- [-] Can Large Language Models Transform Computational Social Science? [paper]
- [COMBO] COMBO: Compositional World Models for Embodied Multi-Agent Cooperation. [arxiv]