The Challenge of Streaming Audio-Visual Memory
Streaming audio-visual large language models (LLMs) face a significant bottleneck: processing continuous, high-dimensional input streams demands substantial compute and memory. As the stream lengthens, the memory footprint grows linearly with sequence length, which increases latency or forces the model to discard historical context. Traditional compression methods often fail to preserve the fidelity of these inputs, and the resulting information loss degrades model performance over time.
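To make the linear growth concrete, here is a rough back-of-the-envelope sketch. It assumes a transformer decoder whose dominant streaming memory cost is an uncompressed key-value cache. Every number in `StreamConfig` (token rate, layer count, head count, head dimension, precision) is a hypothetical placeholder, not a figure reported for OmniMem or SALMONN.

```python
from dataclasses import dataclass


@dataclass
class StreamConfig:
    # All values are illustrative assumptions, not taken from OmniMem or SALMONN.
    tokens_per_second: int = 50   # audio + visual tokens emitted per second of stream
    num_layers: int = 32
    num_kv_heads: int = 8
    head_dim: int = 128
    bytes_per_value: int = 2      # fp16 storage


def kv_cache_bytes(seconds: float, cfg: StreamConfig) -> int:
    """Bytes needed to cache keys and values for `seconds` of uncompressed stream."""
    tokens = int(seconds * cfg.tokens_per_second)
    # Factor of 2: one key vector and one value vector per head, per layer, per token.
    per_token = 2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim * cfg.bytes_per_value
    return tokens * per_token


cfg = StreamConfig()
for minutes in (1, 10, 60):
    gib = kv_cache_bytes(minutes * 60, cfg) / 2**30
    print(f"{minutes:>3} min of stream -> {gib:.2f} GiB of cache")
```

Under these assumptions, doubling the stream duration doubles the cache, so an hour-long stream costs sixty times as much as a one-minute one. A streaming model therefore has to either compress or discard history.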
Perturbation-Aware Compression
OmniMem addresses this with a perturbation-aware memory compression strategy. Standard lossy compression may treat all input data as equally important; this approach instead explicitly accounts for how sensitive the model is to input noise. Because the compression mechanism is trained to be aware of how perturbations in the latent space affect downstream performance, it can prune redundant information more effectively while preserving the features critical for accurate audio-visual reasoning.
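The following toy sketch illustrates the general idea of using perturbation sensitivity to decide what to keep. It is not OmniMem's method. The text above describes a trained compression mechanism, whereas this example is a training-free stand-in that adds Gaussian noise to one memory entry at a time, measures how much a simple attention readout changes, and keeps the entries whose perturbation matters most. The function names, the single fixed query, and the parameters `sigma`, `trials`, and `keep` are all hypothetical.

```python
import math
import random


def attention_readout(query, memory):
    """Softmax-attention readout of `memory` (a list of vectors) for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * m for q, m in zip(query, vec)) / scale for vec in memory]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(memory[0])
    return [sum(w * vec[d] for w, vec in zip(weights, memory)) / total for d in range(dim)]


def perturbation_sensitivity(query, memory, sigma=0.1, trials=8, seed=0):
    """Mean L2 change in the readout when noise is added to one entry at a time."""
    rng = random.Random(seed)
    baseline = attention_readout(query, memory)
    scores = []
    for i, vec in enumerate(memory):
        change = 0.0
        for _ in range(trials):
            noisy = [x + rng.gauss(0.0, sigma) for x in vec]
            perturbed = memory[:i] + [noisy] + memory[i + 1:]
            change += math.dist(attention_readout(query, perturbed), baseline)
        scores.append(change / trials)
    return scores


def compress(query, memory, keep):
    """Keep the `keep` most perturbation-sensitive entries, preserving stream order."""
    scores = perturbation_sensitivity(query, memory)
    ranked = sorted(range(len(memory)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    return kept, [memory[i] for i in kept]


# Hypothetical data: 16 memory entries of dimension 8, compressed to 4.
rng = random.Random(1)
memory = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
query = [rng.gauss(0.0, 1.0) for _ in range(8)]
kept_indices, compressed = compress(query, memory, keep=4)
print("kept entries:", kept_indices)
```

Entries that barely move the readout when perturbed are treated as redundant and dropped. A single fixed query is a simplification; a streaming model does not know future queries in advance, which is one reason a learned mechanism is a more plausible fit than this probe.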
Performance and Implementation
This technique significantly reduces memory usage in streaming environments, letting models maintain longer context windows without the performance degradation typically associated with aggressive compression. The implementation, integrated with the SALMONN framework, demonstrates that computational efficiency and model accuracy can be balanced in complex multimodal tasks. This matters most for real-time applications, where the model must maintain a coherent understanding of long-duration audio and visual events.