The Problem with Current TTA

Test-Time Adaptation (TTA) for Vision-Language Models (VLMs) often suffers from two limitations: adaptation is local, so knowledge is not accumulated over time, and optimization targets a single modality, ignoring the inherently multi-modal nature of these models. The result is models that are brittle under dynamic, real-world distribution shifts.

The ComMem Architecture

Inspired by the biological brain's complementary memory systems, ComMem introduces a dual-component architecture that balances short-term flexibility with long-term stability:

  • Fast-Adapting Detailed Memory (Hippocampus): This component functions as a dynamic visual cache. It captures high-confidence test samples to provide immediate, instance-specific adaptation, allowing the model to respond quickly to new data distributions.
  • Slow-Integrating Abstract Memory (Neocortex): This component continually refines global textual prototypes. By integrating information over time, it ensures the model maintains a stable, generalized understanding of concepts, preventing the "catastrophic forgetting" often associated with rapid adaptation.
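The two memories described above can be sketched in a few lines of NumPy. This is a minimal illustration, not ComMem's actual implementation: the class names, the entropy-based confidence ranking, the per-class capacity, and the exponential-moving-average (EMA) prototype update are all assumptions chosen to make the fast/slow contrast concrete.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

class DetailedMemory:
    """Fast memory (hypothetical design): a per-class cache of confident test features."""

    def __init__(self, num_classes, capacity=3):
        self.capacity = capacity
        # One slot per class; each entry is an (entropy, feature) pair.
        self.slots = [[] for _ in range(num_classes)]

    def update(self, feature, probs):
        """Store the feature under its predicted class, keeping the most confident entries."""
        label = int(np.argmax(probs))
        # Assumption: lower prediction entropy means higher confidence.
        entropy = float(-np.sum(probs * np.log(probs + 1e-12)))
        slot = self.slots[label]
        slot.append((entropy, l2_normalize(feature)))
        slot.sort(key=lambda entry: entry[0])  # lowest entropy first
        del slot[self.capacity:]               # evict the least confident overflow
        return label

class AbstractMemory:
    """Slow memory (hypothetical design): class text prototypes refined by an EMA."""

    def __init__(self, text_prototypes, momentum=0.99):
        self.prototypes = l2_normalize(np.asarray(text_prototypes, dtype=float))
        self.momentum = momentum

    def update(self, label, feature):
        """Nudge one prototype toward a test feature; high momentum keeps the change slow."""
        mixed = (self.momentum * self.prototypes[label]
                 + (1.0 - self.momentum) * l2_normalize(feature))
        self.prototypes[label] = l2_normalize(mixed)
```

In this sketch the cache changes immediately with each confident sample, while a momentum close to 1 means each prototype moves only slightly per update, which is one simple way to trade short-term flexibility against long-term stability.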

Cross-Modal Consistency

For every test instance, ComMem optimizes both memory systems simultaneously. This joint optimization keeps the visual cache and the textual prototypes consistent with each other. Aligning the two memory streams yields better generalization and robustness than adapting each modality in isolation. The approach was validated on 15 benchmark datasets, showing significant gains under both natural distribution shifts and cross-dataset generalization.
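One way to picture a cross-modal consistency objective is as a penalty on disagreement between the class distribution predicted from the visual cache and the one predicted from the textual prototypes. The sketch below uses a symmetric KL divergence for that penalty. The loss form, the function name consistency_loss, and the parameters temperature, beta, and cache_scale are illustrative assumptions; the text above does not specify ComMem's actual objective.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) between two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def consistency_loss(feature, cache_keys, cache_labels, text_prototypes,
                     temperature=0.1, beta=5.0, cache_scale=10.0):
    """Symmetric KL between cache-based and text-based predictions (assumed loss form).

    feature:         (D,)   unit-norm image feature of the test instance
    cache_keys:      (M, D) unit-norm cached image features
    cache_labels:    (M,)   integer class index of each cached feature
    text_prototypes: (C, D) unit-norm class text embeddings
    """
    num_classes = text_prototypes.shape[0]

    # Textual view: temperature-scaled cosine similarity to each class prototype.
    text_probs = softmax(text_prototypes @ feature / temperature)

    # Visual view: similarity-weighted votes from the cached entries of each class.
    affinities = np.exp(-beta * (1.0 - cache_keys @ feature))  # (M,)
    one_hot = np.eye(num_classes)[cache_labels]                # (M, C)
    cache_probs = softmax(cache_scale * (one_hot.T @ affinities))

    # Zero when the two views agree exactly, growing as they diverge.
    return 0.5 * (kl(text_probs, cache_probs) + kl(cache_probs, text_probs))
```

In a full adaptation loop, a term like this would be minimized with respect to whatever parameters back the cache and the prototypes; the sketch only computes the value of the penalty for one test instance.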