Project Details

Fine-grained Data Provenance for Very Expressive Queries

Subject Area: Security and Dependability, Operating-, Communication- and Distributed Systems
Term: from 2018 to 2022
Project identifier: Deutsche Forschungsgemeinschaft (DFG) - Project number 398800066
 
Data provenance uncovers how database queries transform, filter, merge, and aggregate input data to arrive at the final output. With today's steep growth in data volume and query complexity, the inner workings of a query quickly become hard to assess and validate: Where in the input did this piece of output originate? Why did the query emit this item but omit another? How did the query produce this result value, and exactly which query constructs participated in the evaluation? Data provenance answers these and further questions. The answers explain query internals (and bugs), aid data quality assessment, and help to build trust in query results, a critical service to data-dependent science and society.
With provenance, we shift a query's focus from values and their transformation to the dependencies between output and input data. This research proposal is built on the central hypothesis that abstract interpretation provides an ideal framework in which to reason about this shift of focus and to implement it. In abstract interpretation, a program analysis discipline first established in the 1970s, all but one or a few selected aspects of a program's evaluation are ignored. This project will adapt these ideas to develop a view of queries and programs in which input/output dependencies, rather than values, assume the primary role.
The benefits of data provenance grow with the complexity of the query logic it is able to explain. We set out to derive provenance for advanced query language constructs and idioms such as deep nesting, sliding windows, user-defined and built-in functions, and recursion. It is a core goal to embrace practically relevant and complex languages, such as modern variants of SQL, where prior work exhibited significant restrictions.
We will capitalize on the flexibility of abstract interpretation and design abstract domains that explain provenance at various levels of data granularity, down to individual atomic values (table cells, say). Further adaptations of the abstract domain and the query interpretation rules will allow the exploration of new and notoriously difficult types of data provenance (e.g., the provenance of values absent from the output). Abstract interpretation is both a powerful theoretical tool and a practical one. Building on its practical side, we will study parallel provenance derivation for queries over large data volumes and the seamless integration of data provenance into the query compilers of existing modern database systems.
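The shift from values to dependencies can be illustrated with a toy sketch. It is not the project's method: for simplicity it carries dependency sets alongside ordinary values, whereas an abstract interpretation in the sense described above would ignore values wherever possible. All names (`Tracked`, `load`, `total_where`, the `orders` table) are hypothetical and chosen only for illustration. The sketch evaluates a filtered sum and records, at table-cell granularity, which input cells the result depends on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tracked:
    """A value paired with the set of input cells it depends on."""
    value: object
    deps: frozenset  # set of (table name, row index, column name) triples


def load(table_name, rows):
    """Annotate every cell of an input table with its own identity."""
    return [
        {col: Tracked(val, frozenset({(table_name, i, col)}))
         for col, val in row.items()}
        for i, row in enumerate(rows)
    ]


def total_where(rows, sum_col, pred_col, pred):
    """Roughly: SELECT SUM(sum_col) FROM rows WHERE pred(pred_col)."""
    acc = Tracked(0, frozenset())
    for row in rows:
        if pred(row[pred_col].value):
            # The filter cell decided that this row contributes;
            # the summed cell supplies the contributed value.
            acc = Tracked(
                acc.value + row[sum_col].value,
                acc.deps | row[sum_col].deps | row[pred_col].deps,
            )
    return acc


# Hypothetical input data.
orders = load("orders", [
    {"region": "EU", "amount": 10},
    {"region": "US", "amount": 20},
    {"region": "EU", "amount": 5},
])

result = total_where(orders, "amount", "region", lambda r: r == "EU")
print(result.value)
for cell in sorted(result.deps):
    print(cell)
```

In this sketch the dependency set of the result contains only cells from the first and third rows. The second row was filtered out and leaves no trace, which hints at why explaining values absent from the output requires a different abstract domain.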
DFG Programme: Research Grants
 
 

