Project Details
GenD²: Genuine Dependency Discovery
Applicant
Professor Dr. Felix Naumann
Subject Area
Data Management, Data-Intensive Systems, Computer Science Methods in Business Informatics
Term
since 2025
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 560743026
Data stored in database tables can be described with metadata. Simple metadata include data types, cardinalities, or value patterns, while more complex metadata include various constraints, such as a uniqueness constraint, which demands that all values in a column or column group be unique, or a functional dependency between two columns, which demands that if two records have the same value in one column, they must also have the same value in the other column. Such complex metadata are often summarized under the term dependencies. Data profiling is the act of extracting such metadata from a given (relational) data instance. Especially for dependencies, efficient discovery in large datasets poses significant computational challenges, which have been the focus of much recent research. While most existing discovery methods have focused on finding all syntactically valid dependencies, most use cases rely on semantically genuine dependencies. A valid dependency is true for the current instance of the data, but need not be true in past or future instances. In fact, the vast majority of discovered dependencies are spurious, i.e., have no inherent meaning. For instance, tens of thousands of functional dependencies are typically valid for a given table, while only a handful of them reflect a genuine real-world dependency. The goal of this project is to develop methods for recognizing genuine dependencies, either directly from a given dataset or among the discovered valid dependencies. Thus, we shall bridge the important gap from syntax to semantics. We hypothesize that a dependency's validity over long stretches of time is a useful signal to predict its genuineness. As datasets often contain errors and as updates may happen only incrementally, we need to relax the definitions of such temporal dependencies. We plan to define these relaxed temporal dependencies and to design algorithms that discover them efficiently.
In a second step, we will use these temporal and further contextual signals to design classification methods to predict genuineness.
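The distinction between valid and genuine dependencies can be illustrated with a minimal Python sketch. It is not the project's method: the function names `fd_holds` and `validity_ratio`, the toy columns, and the idea of scoring a dependency by the fraction of snapshots in which it holds are all assumptions made here for illustration only. The sketch checks a functional dependency between two columns on a single instance, then measures how often it remains valid across several instances of the same table.

```python
from typing import Hashable, Sequence

Row = dict[str, Hashable]


def fd_holds(rows: Sequence[Row], lhs: str, rhs: str) -> bool:
    """Return True if the functional dependency lhs -> rhs is valid on rows.

    Valid means: any two records with the same lhs value also share the
    same rhs value.
    """
    seen: dict[Hashable, Hashable] = {}
    for row in rows:
        # Remember the first rhs value seen for each lhs value and
        # report a violation as soon as a different one shows up.
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


def validity_ratio(snapshots: Sequence[Sequence[Row]], lhs: str, rhs: str) -> float:
    """Hypothetical temporal signal: fraction of snapshots where lhs -> rhs holds."""
    if not snapshots:
        return 0.0
    return sum(fd_holds(s, lhs, rhs) for s in snapshots) / len(snapshots)


# Toy data (invented for this example): two instances of the same table.
snapshot_1 = [
    {"name": "Ada", "zip": "14482", "city": "Potsdam"},
    {"name": "Bob", "zip": "14482", "city": "Potsdam"},
    {"name": "Cy", "zip": "10115", "city": "Berlin"},
]
snapshot_2 = snapshot_1 + [
    {"name": "Ada", "zip": "10117", "city": "Berlin"},
]

# Both dependencies are valid on the first instance ...
both_valid_at_first = (
    fd_holds(snapshot_1, "zip", "city") and fd_holds(snapshot_1, "name", "city")
)
# ... but only zip -> city survives the later insert.
zip_city = validity_ratio([snapshot_1, snapshot_2], "zip", "city")
name_city = validity_ratio([snapshot_1, snapshot_2], "name", "city")
```

In this toy example, `name -> city` is valid on the first instance only because all names happen to be distinct, so it is spurious, whereas `zip -> city` stays valid after the update. A strict check like this one would also reject a genuine dependency after a single erroneous record, which is why relaxed definitions are needed in practice.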
DFG Programme
Research Grants
