Project Details
Projekt Print View

Fine-grained Data Provenance for Very Expressive Queries

Subject Area Security and Dependability, Operating-, Communication- and Distributed Systems
Term from 2018 to 2022
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 398800066
 
Final Report Year 2023

Final Report Abstract

Contemporary relational database systems (RDBMSs) come with ever more capable and versatile SQL query processors that allow the evaluation of complex computation and algorithms right next to the data. The more intricate such queries become, the more pressing are questions regarding the correctness and origin (or data provenance) of query results: where in the input did this piece of the output originate? Why did the query emit this item but did omit another? How did the query produce this result and exactly which query constructs participated in the evaluation (can we thus possibly simplify the query or touch less data)? The present project pursued the primary goal to let techniques for the derivation of such data provenance catch up with the significant advances that query languages and their expressiveness have made in the recent decade. These include the staples of data analytics, like grouping and aggregation, or the variety of window functions available in SQL. Beyond these constructs, we studied the computation of provenance for quantifiers, deeply nested (possibly correlated) subqueries and, in particular, queries and functions that use iteration or recursion to process their tabular inputs. The latter marks a significant step forward since recursive constructs, in principle, turn query languages into expressive programming languages. We showed how data provenance (a) uncovers the iterative nature of the resulting computation and (b) can help to locate and fix bugs in recursive SQL queries. Data provenance shifts a query’s traditional focus from values and their transformation to the dependencies between output and input data: a single value—an individual cell or row of a query’s output table, say—will generally depend on an entire set of values in the input tables. We thus devised systematic query transformations that turn a given value-based SQL query into its own dependency-set-processing variant. These transformations were deliberately designed in a compositional fashion to avoid failure in the face of complex SQL queries. The project revolved around this SQL-to-SQL translation strategy since it enabled the use of existing RDBMSs to derive provenance: data never leaves the system and we could build on optimizing query engines that incorporate decades of research and engineering effort. Importantly, since this approach represents the derived provenance information in tabular form, SQL itself again be used to explore the found data dependencies. The shift from values to dependency sets is reminiscient of the abstract interpretation of queries—we were thus able to draw on a body of substantial work performed by the programming languages community. Dependency sets proved to be a notion sufficiently general and flexible to represent a whole variety of data provenance kinds. We derived Where-provenance, Why -provenance, as well as How-provenance for SQL queries. The latter built on the canonical translation of queries into intermediate imperative programs whose execution generated a trace of relevant query constructs. Among other insights, How-provenance managed to uncover the lazy evaluation of existential quantification or the intricacies of (non-terminating) recursive queries—to the best of our knowledge, the present project is the first to derive such detailed provenance for queries of this complexity. Tracing data dependencies through imperative programs also paved a way to integrate provenance derivation deeper into database systems, e.g., directly into the compiling query engines that power the most efficient analytical RDBMSs available today. While the focus of the three-year project remained on the mentioned SQL language–level transformations— which are independent of any particular RDBMS backend and thus are widely applicable—the groundwork for true system–level provenance derivation has certainly been laid.

Publications

 
 

Additional Information

Textvergrößerung und Kontrastanpassung