[World-Model-Survey-2024] Paper list and projects for World Model

IranQin/Awesome_World_Model_Papers

Paper list for World Model

We appreciate suggestions from peers for improving this paper list or the survey. Please raise an issue or send an email to [email protected]. Thank you for your cooperation! Pull requests to this project are also welcome!

🏠 About

Before taking action, humans make predictions based on their objectives and observations of the current environment. These predictions manifest in various forms, e.g., textual planning, visual imagination of future scene changes, or even subconscious planning at the action level. Each of these predictive capabilities is critical to the successful completion of tasks. With the development of generative models, agents driven by them are exhibiting similar predictive capabilities: they complete embodied tasks by making human-like predictions, such as high-level textual plans, image-based guidance, or future video prediction that drives actions. We refer to these models as World Models. Recently, world models have been widely applied across domains, from building agents that solve inference tasks to leveraging predictions to drive robots to perform specific actions.
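
As a rough illustration of this observe-predict-act loop, here is a minimal, hypothetical sketch; every name in it (WorldModel, Policy, run_episode, the env interface) is a placeholder introduced for illustration and is not the API of any work listed below.

```python
# A minimal, hypothetical sketch of a world-model-driven agent loop.
# All interfaces below are illustrative placeholders, not the API of any listed work.

class WorldModel:
    def predict(self, observation, goal):
        """Imagine the future (a textual plan, future frames, or latent states)
        conditioned on the current observation and the agent's goal."""
        raise NotImplementedError


class Policy:
    def act(self, observation, prediction):
        """Choose a low-level action guided by the imagined future."""
        raise NotImplementedError


def run_episode(env, world_model, policy, goal, max_steps=100):
    """Interleave prediction and action until the task ends."""
    observation = env.reset()                                # assumed env API: reset() -> observation
    for _ in range(max_steps):
        prediction = world_model.predict(observation, goal)  # predict before acting
        action = policy.act(observation, prediction)         # the prediction guides the action
        observation, done = env.step(action)                 # assumed env API: step() -> (observation, done)
        if done:
            break
```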

💥 Update Log

  • [2024.12.29] We release the first version of the paper list for Embodied AI. This page is continually being updated!
  • Multimodal Large Models: The New Paradigm of Artificial General Intelligence, Yang Liu and Liang Lin, Publishing House of Electronics Industry (PHE), 2024 [Page]

The construction of world models often relies on various foundation models.

Text Generation

  • [LLaMA] LLaMA: Open and Efficient Foundation Language Models [arxiv]
  • [PaLM] PaLM: Scaling Language Modeling with Pathways [arxiv]
  • [PaLM-E] PaLM-E: An Embodied Multimodal Language Model [arxiv]
  • [LLaVA] Visual Instruction Tuning [NeurIPS2023]

Image Generation

  • [SDXL] SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis [arxiv]
  • [PixArt-α] PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis [ICLR2024]
  • [Show-o] Show-o: One Single Transformer to Unify Multimodal Understanding and Generation [arxiv]

Video Generation

  • [ModelScope] ModelScope Text-to-Video Technical Report [arxiv]
  • [VideoCrafter] VideoCrafter1: Open Diffusion Models for High-Quality Video Generation [arxiv]
  • [DynamiCrafter] DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors [ECCV2024]
  • [CogVideoX] CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer [arxiv]
  • [Open-Sora Plan] [project]

A General World Model aims at representing and simulating a wide range of situations and interactions, especially those encountered in the real world (a minimal illustrative sketch of latent imagination follows the list below).

  • [Gen-2] Gen-2: Generate novel videos with text, images or video clips. [project]
  • [Gen-3-Alpha] Gen-3 Alpha: A New Frontier for Video Generation. [project]
  • [Pandora] Pandora: Towards General World Model with Natural Language Actions and Video States [arxiv]
  • [Dreamer] Dream to Control: Learning Behaviors by Latent Imagination. [ICLR2020]
  • [DreamerV2] Mastering Atari with discrete world models. [ICLR2021]
  • [DreamerV3] Mastering Diverse Domains through World Models [arxiv]
  • [TD-MPC2] TD-MPC2: Scalable, Robust World Models for Continuous Control [ICLR2024]
  • [UniSim] Learning Interactive Real-World Simulators [ICLR2024]
  • [General-World-Models-Survey] Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond [project]
  • [3D-VLA] 3D-VLA: A 3D Vision-Language-Action Generative World Model [ICML2024]
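
For intuition about the latent-imagination idea behind entries such as Dreamer, DreamerV2/V3, and TD-MPC2, here is a minimal sketch under simplifying assumptions: the modules, dimensions, and the absence of any training loop are illustrative choices made here, not any paper's actual architecture.

```python
import torch
import torch.nn as nn

# Minimal latent world-model rollout, loosely in the spirit of Dreamer-style
# latent imagination. All modules and dimensions are illustrative placeholders.

class LatentWorldModel(nn.Module):
    def __init__(self, obs_dim=64, latent_dim=32, action_dim=4):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, latent_dim)                   # observation -> latent state
        self.dynamics = nn.Linear(latent_dim + action_dim, latent_dim)  # (latent, action) -> next latent
        self.reward_head = nn.Linear(latent_dim, 1)                     # predicted reward from a latent

    def imagine(self, obs, actions):
        """Roll out imagined latent states and rewards for a candidate action sequence."""
        z = torch.tanh(self.encoder(obs))
        rewards = []
        for a in actions:                                               # each a: tensor of shape (action_dim,)
            z = torch.tanh(self.dynamics(torch.cat([z, a], dim=-1)))    # imagined next latent state
            rewards.append(self.reward_head(z))                         # imagined reward at that state
        return torch.stack(rewards)

# Usage: score a candidate action sequence purely "in imagination",
# without touching the real environment.
model = LatentWorldModel()
obs = torch.randn(64)
plan = [torch.randn(4) for _ in range(5)]
print(model.imagine(obs, plan).sum())  # imagined return of this plan
```

In practice, such models are trained on collected experience, and the imagined returns are then used to select or optimize action sequences.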

Action Generation

  • [-] Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving. [ICRA]
  • [LanguageMPC] LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving. [arxiv]
  • [DriveMLM] DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving. [arxiv]
  • [DriveLM] DriveLM: Driving with Graph Visual Question Answering. [ECCV2024]
  • [LMDrive] LMDrive: Closed-Loop End-to-End Driving with Large Language Models. [CVPR2024]
  • [DiLu] DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models. [ICLR2024]
  • [DriveVLM] DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models. [CoRL2024]
  • [LeapAD] Continuously Learning, Adapting, and Improving: A Dual-Process Approach to Autonomous Driving. [NeurIPS2024]
  • [AD-H] AD-H: Autonomous Driving with Hierarchical Agents. [arxiv]
  • [Think2Drive] Think2Drive: Efficient Reinforcement Learning by Thinking in Latent World Model for Quasi-Realistic Autonomous Driving (in CARLA-v2). [ECCV2024]

Future Generation

  • [GAIA-1] GAIA-1: A generative world model for autonomous driving. [arxiv]
  • [MagicDrive] MagicDrive: Street View Generation with Diverse 3D Geometry Control. [ICLR2024]
  • [DrivingDiffusion] DrivingDiffusion: Layout-Guided Multi-View Driving Scene Video Generation with Latent Diffusion Model. [ECCV2024]
  • [OccWorld] OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving. [ECCV2024]
  • [Vista] Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability. [NeurIPS2024]

Future Generation for Action

  • [ADriver-I] ADriver-I: A General World Model for Autonomous Driving [arxiv]
  • [GenAD] GenAD: Generative End-to-End Autonomous Driving. [ECCV2024]
  • [DriveDreamer] DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving. [ECCV2024]
  • [DriveDreamer-2] DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation. [arxiv]
  • [NeMo] Neural Volumetric World Models for Autonomous Driving. [ECCV2024]
  • [ViDAR] Visual Point Cloud Forecasting enables Scalable Autonomous Driving. [CVPR2024]
  • [Drive-WM] Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving. [CVPR2024]
  • [DriveWorld] DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving. [CVPR2024]
  • [PlaNet] Learning Latent Dynamics for Planning from Pixels. [ICML2019]
  • [Plan2Explore] Planning to Explore via Self-Supervised World Models. [ICML2020]
  • [RoboDreamer] Learning Compositional World Models for Robot Imagination. [ICML2024]
  • [SWIM] Structured World Models from Human Videos. [RSS2023]
  • [FOWM] Finetuning Offline World Models in the Real World. [CoRL2023]
  • [STEDIE] Interaction-based Disentanglement of Entities for Object-centric World Models. [ICLR2023]
  • [MWM] Masked World Models for Visual Control. [CoRL2022]
  • [CEE-US] Curious Exploration via Structured World Models Yields Zero-Shot Object Manipulation. [NeurIPS2022]
  • [MV-MWM] Multi-View Masked World Models for Visual Robotic Manipulation. [ICML2023]
  • [ContextWM] Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning. [NeurIPS2023]
  • [DexSim2Real2] DexSim2Real2: Building Explicit World Model for Precise Articulated Object Dexterous Manipulation. [arxiv]
  • [DayDreamer] DayDreamer: World Models for Physical Robot Learning. [CoRL2022]
  • [-] Leveraging Pre-trained Large Language Models to Construct and Utilize World Models for Model-based Task Planning. [NeurIPS2023]
  • [-] When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning. [NeurIPS2023]

Foundation Model

  • [Survey] What Foundation Models Can Bring for Robot Learning in Manipulation: A Survey. [arxiv]
  • [-] Transferring Foundation Models for Generalizable Robotic Manipulation. [arxiv]
  • [SculptBot] SculptBot: Pre-Trained Models for 3D Deformable Object Manipulation. [ICRA2024]
  • [HiP] Compositional Foundation Models for Hierarchical Planning. [NeurIPS2023]
  • [MA] Manipulate-Anything: Automating Real-World Robots using Vision-Language Models. [arxiv]
  • [AutoRT] AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents. [arxiv]
  • [SuSIE] Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models. [arxiv]
  • [MOO] Open-World Object Manipulation using Pre-Trained Vision-Language Models [CoRL2023]
  • [Pathdreamer] Pathdreamer: A World Model for Indoor Navigation [ICCV2021]
  • [Panogen] Panogen: Text-conditioned panoramic environment generation for vision-and-language navigation [NeurIPS2023]
  • [Dreamwalker] Dreamwalker: Mental planning for continuous vision-language navigation [ICCV2023]
  • [VLN-SIG] Improving Vision-and-Language Navigation by Generating Future-View Image Semantics [CVPR2023]
  • [LFG] Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning [CoRL2023]
  • [ViNT] ViNT: A Foundation Model for Visual Navigation [CoRL2023]
  • [ENTL] ENTL: Embodied Navigation Trajectory Learner [ICCV2023]

Definition

  • [GSAI] Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems [arxiv]

Safety Evaluation

  • [LlamaGuard] Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations [arxiv]
  • [Llama Guard 3 Vision] Meta Llama Guard 3 Vision [huggingface]

Safety Benchmark

  • [SPA-VL] SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model [arxiv]
  • [SafeBench] SafeBench: A Benchmarking Platform for Safety Evaluation of Autonomous Vehicles [NeurIPS2022]
  • [BeaverTails] BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset [NeurIPS2023]
  • [SALAD-Bench] SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models [ACL2024 Findings]

Attack

  • [GCG] Universal and Transferable Adversarial Attacks on Aligned Language Models [arxiv]
  • [COLD-Attack] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability [arxiv]
  • [-] From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking [arxiv]
  • [Agent Smith] Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast [arxiv]

Safety Enhancement

  • [MLLM-Protector] MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance [paper]
  • [Adversarial Tuning] Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs [arxiv]
  • [-] Generative Agents: Interactive Simulacra of Human Behavior [paper]
  • [$S^3$] $S^3$: Social-network Simulation System with Large Language Model-Empowered Agents [paper]
  • [ConsensusLLM] Multi-Agent Consensus Seeking via Large Language Models [paper]
  • [SaF] Lyfe Agents: Generative Agents for Low-Cost Real-Time Social Interactions [paper]
  • [-] Quantifying the Impact of Large Language Models on Collective Opinion Dynamics [paper]
  • [CAMEL] CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society [NeurIPS2023]
  • [ToM] Theory of Mind for Multi-Agent Collaboration via Large Language Models [EMNLP2023]
  • [-] Can Large Language Models Transform Computational Social Science? [paper]
  • [COMBO] COMBO: Compositional World Models for Embodied Multi-Agent Cooperation. [arxiv]

👍 Acknowledgements
