Project Details
ACTIVUS: Representations and Foundation Models for Actionable Visual Understanding
Applicant
Dr.-Ing. Nikita Araslanov
Subject Area
Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Methods in Artificial Intelligence and Machine Learning
Term
since 2026
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 572932173
Modern computer vision excels at static scene understanding but remains fundamentally limited in dynamic environments, where autonomous systems must anticipate how scene elements move and how they respond to actions. State-of-the-art vision encoders provide precise semantic segmentation or geometric scene properties, yet they do not encode what actions are plausible, how objects interact, or how the world may evolve under user-defined instructions. As a result, bridging perception and action still requires substantial engineering effort. The proposed project, ACTIVUS, addresses this gap by developing actionable visual understanding: deep representations and models that capture how scenes can move and be acted upon.

ACTIVUS is organised into three work areas. WA1 develops actionable representations (ARs), pixel-level embeddings learned from large video collections that encode a statistical prior over 3D motion. ARs treat the input image as context from which to infer how scene components typically move. WA2 introduces virtual interventions, defined as the open set of plausible actions that an agent could perform in a scene. Inferring such interventions requires models to combine semantic and geometric reasoning. WA2 will align ARs with language models under weak supervision, establish a benchmark for evaluating open-vocabulary interventions, and develop a baseline model that maps text prompts (e.g., “open the window”) to latent action operators.

WA3 focuses on world prediction: modelling the temporal evolution of a scene in response to virtual interventions. Using the motion priors encoded in ARs and the semantic interface provided by virtual interventions, WA3 will build models capable of generating geometrically and semantically grounded video predictions under hypothetical actions.

Together, these work areas establish a unified framework for actionable visual understanding, enabling counterfactual reasoning and advancing vision systems toward real-world dynamic interaction.
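To make the division of labour between the three work areas concrete, the following is a minimal, hypothetical Python/PyTorch sketch of how such a pipeline could fit together: an encoder producing pixel-level ARs (WA1), a text-conditioned latent action operator (WA2), and a predictor rolling the intervened representation forward in time (WA3). All class names, tensor shapes, and the toy FiLM-style gating operator are assumptions introduced purely for illustration; nothing here reflects the project's actual architecture.

    # Hypothetical sketch only; module names, shapes, and the gating operator
    # are illustrative assumptions, not the project's actual design.
    import torch
    import torch.nn as nn

    class ActionableEncoder(nn.Module):
        """WA1 (hypothetical): maps an RGB image to pixel-level actionable
        representations (ARs) encoding a prior over plausible 3D motion."""
        def __init__(self, dim: int = 64):
            super().__init__()
            self.backbone = nn.Conv2d(3, dim, kernel_size=3, padding=1)

        def forward(self, image: torch.Tensor) -> torch.Tensor:
            # image: (B, 3, H, W) -> ARs: (B, dim, H, W)
            return self.backbone(image)

    class InterventionOperator(nn.Module):
        """WA2 (hypothetical): turns a text-prompt embedding into a latent
        action operator that modulates the ARs."""
        def __init__(self, text_dim: int = 32, ar_dim: int = 64):
            super().__init__()
            self.to_operator = nn.Linear(text_dim, ar_dim)

        def forward(self, ars: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
            # Broadcast one per-prompt operator over all pixels (FiLM-style gate).
            gate = self.to_operator(text_emb)[:, :, None, None]
            return ars * torch.sigmoid(gate)

    class WorldPredictor(nn.Module):
        """WA3 (hypothetical): rolls the intervened ARs forward to predict
        a short horizon of future frames."""
        def __init__(self, ar_dim: int = 64, horizon: int = 4):
            super().__init__()
            self.horizon = horizon
            self.decode = nn.Conv2d(ar_dim, 3, kernel_size=3, padding=1)

        def forward(self, ars: torch.Tensor) -> torch.Tensor:
            # Returns (B, T, 3, H, W): one decoded frame per prediction step.
            return torch.stack([self.decode(ars) for _ in range(self.horizon)], dim=1)

    if __name__ == "__main__":
        image = torch.randn(1, 3, 64, 64)              # dummy input frame
        text_emb = torch.randn(1, 32)                  # stand-in for "open the window"
        ars = ActionableEncoder()(image)               # WA1: actionable representations
        acted = InterventionOperator()(ars, text_emb)  # WA2: virtual intervention
        video = WorldPredictor()(acted)                # WA3: predicted future frames
        print(video.shape)                             # torch.Size([1, 4, 3, 64, 64])

The point of the sketch is only the interface: WA1's output serves as the shared representation, WA2 acts on it through a latent operator conditioned on language, and WA3 consumes the intervened representation to produce a grounded video prediction.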
DFG Programme
Emmy Noether Independent Research Groups
Major Instrumentation
Server with 2× NVIDIA® RTX PRO 6000 Blackwell GPUs
Instrumentation Group
7030 Dedicated, Decentralised Computing Systems, Process Computers
