Project Details
ACTIVUS: Representations and Foundation Models for Actionable Visual Understanding
Applicant
Dr.-Ing. Nikita Araslanov
Subject Area
Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Methods in Artificial Intelligence and Machine Learning
Term
since 2026
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 572932173
Modern computer vision excels at static scene understanding but remains fundamentally limited in dynamic environments, where autonomous systems must anticipate how scene elements move and how they respond to actions. State-of-the-art vision encoders provide precise semantic segmentation or geometric scene properties, yet they do not encode what actions are plausible, how objects interact, or how the world may evolve under user-defined instructions. As a result, bridging perception and action still requires substantial engineering effort. The proposed project, ACTIVUS, addresses this gap by developing actionable visual understanding: deep representations and models that capture how scenes can move and be acted upon.

ACTIVUS is organised into three work areas. WA1 develops actionable representations (ARs), pixel-level embeddings learned from large video collections that encode a statistical prior over 3D motion. ARs treat the input image as context from which to infer how scene components typically move. WA2 introduces virtual interventions, defined as the open set of plausible actions that an agent could perform in a scene. Inferring such interventions requires models to combine semantic and geometric reasoning. WA2 will align ARs with language models under weak supervision, establish a benchmark for evaluating open-vocabulary interventions, and develop a baseline model that maps text prompts (e.g., “open the window”) to latent action operators.

WA3 focuses on world prediction: modelling the temporal evolution of a scene in response to virtual interventions. Using the motion priors encoded in ARs and the semantic interface provided by virtual interventions, WA3 will build models capable of generating geometrically and semantically grounded video predictions under hypothetical actions.

Together, these work areas establish a unified framework for actionable visual understanding, enabling counterfactual reasoning and advancing vision systems toward real-world dynamic interaction.
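To make the division of labour between the three work areas concrete, the following is a minimal, hypothetical Python/PyTorch sketch of how such a pipeline could fit together: an encoder producing pixel-level ARs (WA1), a text-conditioned latent action operator (WA2), and a predictor rolling the intervened representation forward in time (WA3). All class names, tensor shapes, and the toy FiLM-style gating operator are assumptions introduced purely for illustration; nothing here reflects the project's actual architecture.

    # Hypothetical sketch only; module names, shapes, and the gating operator
    # are illustrative assumptions, not the project's actual design.
    import torch
    import torch.nn as nn

    class ActionableEncoder(nn.Module):
        """WA1 (hypothetical): maps an RGB image to pixel-level actionable
        representations (ARs) encoding a prior over plausible 3D motion."""
        def __init__(self, dim: int = 64):
            super().__init__()
            self.backbone = nn.Conv2d(3, dim, kernel_size=3, padding=1)

        def forward(self, image: torch.Tensor) -> torch.Tensor:
            # image: (B, 3, H, W) -> ARs: (B, dim, H, W)
            return self.backbone(image)

    class InterventionOperator(nn.Module):
        """WA2 (hypothetical): turns a text-prompt embedding into a latent
        action operator that modulates the ARs."""
        def __init__(self, text_dim: int = 32, ar_dim: int = 64):
            super().__init__()
            self.to_operator = nn.Linear(text_dim, ar_dim)

        def forward(self, ars: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
            # Broadcast one per-prompt operator over all pixels (FiLM-style gate).
            gate = self.to_operator(text_emb)[:, :, None, None]
            return ars * torch.sigmoid(gate)

    class WorldPredictor(nn.Module):
        """WA3 (hypothetical): rolls the intervened ARs forward to predict
        a short horizon of future frames."""
        def __init__(self, ar_dim: int = 64, horizon: int = 4):
            super().__init__()
            self.horizon = horizon
            self.decode = nn.Conv2d(ar_dim, 3, kernel_size=3, padding=1)

        def forward(self, ars: torch.Tensor) -> torch.Tensor:
            # Returns (B, T, 3, H, W): one decoded frame per prediction step.
            return torch.stack([self.decode(ars) for _ in range(self.horizon)], dim=1)

    if __name__ == "__main__":
        image = torch.randn(1, 3, 64, 64)              # dummy input frame
        text_emb = torch.randn(1, 32)                  # stand-in for "open the window"
        ars = ActionableEncoder()(image)               # WA1: actionable representations
        acted = InterventionOperator()(ars, text_emb)  # WA2: virtual intervention
        video = WorldPredictor()(acted)                # WA3: predicted future frames
        print(video.shape)                             # torch.Size([1, 4, 3, 64, 64])

The point of the sketch is only the interface: WA1's output serves as the shared representation, WA2 acts on it through a latent operator conditioned on language, and WA3 consumes the intervened representation to produce a grounded video prediction.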
DFG Programme
Emmy Noether Independent Research Groups
Major Instrumentation
Server with 2× NVIDIA® RTX PRO 6000 Blackwell GPUs
Instrumentation Group
7030 Dedicated, Decentralised Computing Systems, Process Computers
