Project Details
Learned Video Coding using Autoregressive Transformers
Applicant
Professor Dr.-Ing. André Kaup
Subject Area
Communication Technology and Networks, High-Frequency Technology and Photonic Systems, Signal Processing and Machine Learning for Information Technology
Term
since 2026
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 567512406
The continued growth of streaming services, social media, and online communication platforms has driven an unprecedented increase in internet traffic, with video content accounting for nearly 80% of global data consumption. As demand for high-resolution formats such as 4K and 8K grows, network infrastructure and storage systems face increasing pressure, making effective video compression crucial for managing this demand. At the same time, recent advances in generative model architectures, including transformers and diffusion models, have achieved remarkable success in image and video generation. However, many of these innovations have yet to be effectively leveraged for video compression, presenting a significant opportunity to improve learned codecs.

Despite recent progress, existing neural video codecs still face several key limitations. Pixel-level motion estimation can be inefficient because it does not align with the downsampled feature representations on which the codec operates. Inter-frame prediction is typically restricted to a small set of reference frames, preventing the model from fully capturing long-range temporal dependencies. Furthermore, current approaches employ a single feature transform in which all extracted features are transmitted, limiting the ability to adapt to different types of image content.

This research project proposes a novel transformer-based approach to video compression that addresses these limitations by unifying spatial and temporal context modeling within a single context model. Integrating intra- and inter-frame prediction into a shared transformer architecture removes the need for explicit motion estimation and enables a more flexible, learned use of temporal information. Since errors no longer propagate between frames, the proposed compression scheme is inherently more robust and avoids drastic failure modes. Unlike previous methods, the proposed model can handle inputs of any size without spatial or temporal partitioning, increasing compression efficiency by exploiting long-range spatial and temporal correlations. In addition, a feature gating mechanism controlled by the transformer allows the model to learn specialized, content-dependent feature transforms, improving visual quality and enabling more flexible rate allocation.

Overall, this research project aims to create a more efficient and adaptable video codec that offers superior rate-distortion performance and robustness compared to state-of-the-art models.
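To make the idea of a unified spatio-temporal context model with feature gating more concrete, the following is a minimal PyTorch-style sketch under stated assumptions; it is an illustration, not the project's actual architecture. The class name SpatioTemporalContextModel, the heads entropy_head and gate_head, and all hyperparameters are hypothetical names introduced here. The sketch shows an autoregressive transformer that predicts entropy parameters (mean and scale) for the latent tokens of the current frame, conditioned jointly on tokens of past frames (inter context) and on previously decoded tokens of the current frame (intra context), plus a gating head that assigns per-channel transmission weights.

    # Hypothetical sketch; names, shapes, and hyperparameters are illustrative assumptions.
    import torch
    import torch.nn as nn


    class SpatioTemporalContextModel(nn.Module):
        """Autoregressive transformer that predicts Gaussian entropy parameters for the
        latent tokens of the current frame from past-frame tokens and earlier
        current-frame tokens, without explicit motion estimation."""

        def __init__(self, latent_dim=192, d_model=384, n_heads=8, n_layers=6):
            super().__init__()
            self.embed = nn.Linear(latent_dim, d_model)
            layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
            self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
            # Entropy parameters: per-channel mean and scale of a Gaussian prior.
            self.entropy_head = nn.Linear(d_model, 2 * latent_dim)
            # Gating head: per-channel, content-dependent transmission weight.
            self.gate_head = nn.Linear(d_model, latent_dim)

        def forward(self, past_tokens, current_tokens):
            # past_tokens:    (B, T_past, latent_dim) tokens of already decoded frames
            # current_tokens: (B, T_cur, latent_dim) current-frame tokens, assumed to be
            #                 shifted by one position (learned start token) so that a
            #                 token's entropy parameters never depend on its own value
            tokens = torch.cat([past_tokens, current_tokens], dim=1)
            x = self.embed(tokens)

            # Causal mask: current-frame tokens attend to all past-frame tokens and to
            # earlier current-frame tokens, but not to later ones.
            total = tokens.size(1)
            mask = torch.triu(
                torch.full((total, total), float("-inf"), device=tokens.device), diagonal=1
            )
            mask[:, : past_tokens.size(1)] = 0.0  # past frames are fully visible

            ctx = self.transformer(x, mask=mask)
            ctx_cur = ctx[:, past_tokens.size(1):]  # context for current-frame tokens

            mean, scale = self.entropy_head(ctx_cur).chunk(2, dim=-1)
            gate = torch.sigmoid(self.gate_head(ctx_cur))  # which channels to transmit
            return mean, nn.functional.softplus(scale), gate

In such a sketch, the predicted mean and scale would parameterize the entropy model used for arithmetic coding of the current frame's latents, while the gate output would down-weight or suppress channels that need not be transmitted for the given content; how gating interacts with rate allocation in the actual project is not specified here.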
DFG Programme
Research Grants
