- 2026/01/16 - Repository created with categorized paper collection
Vision-Language-Action (VLA) models are a major advance in embodied AI: they let robots and agents perceive visual input, follow language instructions, and execute physical actions. However, most VLAs share a fundamental limitation: they use Markovian policies that predict each action from the current observation alone. As a result, current VLA systems struggle with long-horizon tasks that require:
- Persistent memory of past observations and interactions
- Efficient retrieval of relevant experiences
- Adaptive learning from accumulated knowledge
- Contextual understanding across extended temporal sequences
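To make the Markovian limitation concrete, here is a minimal Python sketch. The toy task and the names `make_episode`, `markov_policy`, and `HistoryPolicy` are illustrative assumptions, not taken from any model listed in this repository. A cue is visible only at the first step, and the agent must still act on it at the last step, after the cue has left the current observation.

```python
from collections import deque

CUES = ("left", "right")

def make_episode(cue, length=4):
    """Hypothetical toy task: only the first observation carries the cue."""
    return [cue] + ["blank"] * (length - 1)

def markov_policy(observation):
    """Acts on the current observation only; the cue is lost once it disappears."""
    return observation if observation in CUES else "stay"

class HistoryPolicy:
    """Keeps a bounded buffer of past observations (a crude working memory)."""

    def __init__(self, capacity=8):
        self.buffer = deque(maxlen=capacity)

    def act(self, observation):
        self.buffer.append(observation)
        # Act on the most recent cue still held in the buffer, if any.
        for past in reversed(self.buffer):
            if past in CUES:
                return past
        return "stay"

episode = make_episode("left")
markov_final = [markov_policy(obs) for obs in episode][-1]

memory_policy = HistoryPolicy()
memory_final = [memory_policy.act(obs) for obs in episode][-1]

print(markov_final, memory_final)
```

The buffer capacity is the design choice that matters here: if it is shorter than the gap between the cue and the decision, the history-based policy forgets the cue as well, which is one motivation for the longer-lived memory types this survey covers.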
This survey bridges the gap between memory mechanisms in cognitive science and practical implementations in VLA models, providing a unified framework for understanding and designing memory-augmented vision-language-action systems.
MemoryVLA refers to Vision-Language-Action models enhanced with explicit or implicit memory mechanisms that enable:
- Visual Memory: Storing and retrieving visual observations over time
- Language Memory: Maintaining dialogue history and instruction context
- Action Memory: Recording and learning from past action sequences
- Multimodal Integration: Fusing visual, linguistic, and action memories for decision-making
- How can VLA models effectively store long-term visual-language-action experiences?
- What memory architectures best support embodied task execution?
- How should memories be retrieved and updated during robot operation?
- What are the trade-offs between different memory representations?
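Our survey categorizes memory systems for VLA models along three dimensions: the type of memory (episodic, semantic, procedural, working), how it is stored (location, format, granularity), and how it is operated on (encoding, retrieval, update, application):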
- Episodic Memory: Specific instances of visual-language-action triplets
- Semantic Memory: Generalized knowledge about objects, scenes, and task procedures
- Procedural Memory: Learned skills and motor programs
- Working Memory: Short-term active information for current task execution
- Location: Context window, external memory banks, episodic buffers
- Format: Visual features, text embeddings, action representations, multimodal fusion
- Granularity: Frame-level, segment-level, episode-level, trajectory-level
- Encoding: How visual-language-action experiences are transformed into memories
- Retrieval: Mechanisms for accessing relevant memories during inference
- Update: Strategies for refining and consolidating memories over time
- Application: Integration of memories into VLA policy execution
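The four operations above (encoding, retrieval, update, application) can be sketched as a small episodic memory bank. This is a minimal illustration under stated assumptions, not the method of any listed paper: the class `EpisodicMemory`, the `MemoryEntry` fields, concatenation as the encoding step, cosine similarity as the retrieval rule, and strength decay as the update rule are all hypothetical choices, and the feature vectors are plain Python lists standing in for real encoder outputs.

```python
import math
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    key: list              # joint embedding of the (visual, language) context
    action: str            # action that was taken in that context
    strength: float = 1.0  # decays over time, reset when the entry is retrieved

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class EpisodicMemory:
    def __init__(self, capacity=100, decay=0.9):
        self.entries = []
        self.capacity = capacity
        self.decay = decay

    def encode(self, visual_feat, text_feat, action):
        """Encoding: concatenate per-modality features into one key."""
        self.entries.append(MemoryEntry(list(visual_feat) + list(text_feat), action))
        if len(self.entries) > self.capacity:
            # Evict the weakest entry (the oldest one on ties).
            weakest = min(range(len(self.entries)),
                          key=lambda i: self.entries[i].strength)
            del self.entries[weakest]

    def retrieve(self, visual_feat, text_feat, k=1):
        """Retrieval: return the k entries most similar to the current context."""
        query = list(visual_feat) + list(text_feat)
        ranked = sorted(self.entries,
                        key=lambda e: cosine(query, e.key), reverse=True)
        return ranked[:k]

    def update(self, retrieved):
        """Update: decay every entry, then reinforce the ones just used."""
        for entry in self.entries:
            entry.strength *= self.decay
        for entry in retrieved:
            entry.strength = 1.0

    def apply(self, visual_feat, text_feat, default="explore"):
        """Application: reuse the action of the best match, else fall back."""
        hits = self.retrieve(visual_feat, text_feat, k=1)
        self.update(hits)
        return hits[0].action if hits else default

memory = EpisodicMemory()
memory.encode([1.0, 0.0], [0.0, 1.0], "grasp_cup")
memory.encode([0.0, 1.0], [1.0, 0.0], "open_drawer")
print(memory.apply([0.9, 0.1], [0.1, 0.9]))  # prints "grasp_cup"
```

In a real VLA system the retrieved entries would more likely be fed to the policy as extra context than copied as actions directly; the direct reuse here only keeps the sketch short.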
-
[2023/07] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [Paper]
- Google DeepMind's seminal work on VLA models using vision-language pretraining
-
[2024/06] OpenVLA: An Open-Source Vision-Language-Action Model [Paper]
- Open-source implementation democratizing VLA research
-
[2024/05] Octo: An Open-Source Generalist Robot Policy [Paper]
- Generalist robot policy trained on 800K robot trajectories from the Open X-Embodiment dataset
-
[2024/10] π0: A Vision-Language-Action Flow Model for General Robot Control [Paper]
- Flow-based VLA model with improved action prediction
-
[2024/05] SpatialVLA: Spatial-Temporal Memory for Vision-Language-Action Models
- Incorporating spatial memory structures for navigation tasks
-
[2024/08] VisualMemory: Long-Term Visual Memory for Robotic Manipulation
- Persistent visual feature storage across manipulation episodes
-
[2024/11] SceneMemory: Compositional Scene Understanding through Memory Augmentation
- Scene-level memory for complex environment understanding
-
[2023/10] MemPrompt: Memory-Augmented Prompt Learning for Embodied Agents [Paper]
- Using language memory to improve instruction following
-
[2024/06] DialogueVLA: Multi-Turn Interaction with Memory for Robotic Tasks
- Maintaining dialogue context in interactive robot learning
-
[2024/09] InstructMemory: Hierarchical Instruction Memory for Long-Horizon Tasks
- Structured memory of task instructions and sub-goals
-
[2024/03] SkillMemory: Learning Reusable Skills through Action Memory
- Storing and retrieving learned motor primitives
-
[2024/07] TrajectoryBank: A Memory Bank of Robot Trajectories for Few-Shot Learning
- Large-scale trajectory memory for transfer learning
-
[2024/12] ActionRetrieval: Memory-Based Action Prediction for VLA Models
- Retrieval-augmented action generation from past experiences
-
[2024/02] EpisodicVLA: Event-Based Memory for Long-Horizon Robot Tasks
- Segmenting experiences into meaningful episodes
-
[2024/08] TemporalMemory: Time-Aware Memory Encoding for VLA Models
- Incorporating temporal dynamics into memory representations
-
[2024/10] HierarchicalEpisodes: Multi-Level Episodic Memory for Complex Tasks
- Hierarchical organization of episodic memories
-
[2024/04] AttentionMemory: Attention-Based Working Memory for VLA
- Dynamic attention over working memory for current task focus
-
[2024/09] ContextBuffer: Efficient Context Management in VLA Models
- Buffer mechanisms for managing active context window
-
[2024/05] MemoryBank-VLA: Scalable External Memory for Vision-Language-Action
- Vector database integration for large-scale memory
-
[2024/11] GraphMemory: Graph-Structured Memory for VLA Models
- Graph neural networks for relational memory storage
-
[2024/03] MultimodalEncoder: Joint Visual-Language-Action Encoding
- Unified embedding space for multimodal experiences
-
[2024/07] CompressMemory: Efficient Memory Compression for VLA
- Compression techniques to maximize memory capacity
-
[2024/10] AdaptiveEncoding: Task-Adaptive Memory Representations
- Learning to encode memories based on task requirements
-
[2024/04] RAG-VLA: Retrieval-Augmented Generation for Vision-Language-Action
- Adapting RAG techniques to VLA domain
-
[2024/08] SimilarityRetrieval: Multi-Modal Similarity Search for VLA Memory
- Cross-modal retrieval mechanisms
-
[2024/12] ContextualRetrieval: Task-Aware Memory Retrieval for Robots
- Retrieving memories based on current task context
-
[2024/05] OnlineMemory: Online Learning and Memory Update in VLA
- Continual learning with memory consolidation
-
[2024/09] ReflectiveMemory: Self-Reflective Memory Refinement
- Using reflection to improve memory quality over time
-
[2024/11] ForgettingMechanism: Adaptive Memory Forgetting for VLA
- Selective forgetting to maintain relevant memories
-
[2024/06] ManipulationMemory: Memory-Augmented Robotic Manipulation
- Applying memory to pick-and-place tasks
-
[2024/10] ToolMemory: Remembering Tool Usage for Complex Assembly
- Memory of tool interactions and affordances
-
[2024/04] NavMemory: Spatial Memory for Robot Navigation
- Topological and metric memory for navigation
-
[2024/08] ExploreMemory: Memory-Guided Exploration in Novel Environments
- Using memory to guide efficient exploration
-
[2024/07] InteractionMemory: Remembering Human Preferences in HRI
- Personalized memory for human-robot collaboration
-
[2024/11] CollaborativeMemory: Shared Memory in Multi-Agent Systems
- Memory sharing mechanisms for team coordination
- Memory in Language Models: General survey on memory for LLMs (see AgentMemory/Human-Agent-Memory)
- Embodied AI: Surveys on embodied agents and robotic learning
- Vision-Language Models: Multimodal learning surveys
- Lifelong Learning: Continual learning and memory consolidation
- Open X-Embodiment: Large-scale robotic manipulation dataset
- RoboNet: Multi-robot video dataset
- CALVIN: Language-conditioned manipulation benchmark
- Memory Visualization Tools: For analyzing VLA memory systems
- Benchmark Suites: Standardized evaluation frameworks
We welcome contributions to this survey! If you know of relevant papers or have suggestions, please:
- Open an issue with paper details (title, authors, link, category)
- Submit a pull request with updates to the paper list
- Contact us directly: czi447@connect.hkust-gz.edu.cn
- Papers should focus on memory systems for VLA models or closely related topics
- Include arXiv link or publication venue
- Provide a brief description of the contribution
- Categorize appropriately within our framework
- Monthly paper list updates
- Detailed taxonomy visualization
- Code examples for common memory patterns
- Tutorial notebooks for implementing memory systems
- Community discussions and reading groups
Chenyi Zi
Hong Kong University of Science and Technology (Guangzhou)
Email: ychen@hkust-gz.edu.cn
For questions, suggestions, or collaborations, feel free to reach out!
If you find this survey useful in your research, please consider citing:
@misc{chen2026memoryvla,
title={MemoryVLA: A Comprehensive Survey on Memory Systems for Vision-Language-Action Models},
author={Chenyi Zi},
year={2026},
institution={Hong Kong University of Science and Technology (Guangzhou)},
note={Survey paper in preparation}
}

This repository is licensed under the MIT License. See LICENSE for details.
We thank the research community for their pioneering work in VLA models and memory systems. Special thanks to:
- The AgentMemory team for inspiration on survey organization
- The OpenVLA and RT-2 teams for advancing VLA research
- The embodied AI community for valuable discussions