Fine-Grained Analysis of Data Provenance in Expressive Queries

Applicant: Professor Dr. Torsten Grust

Subject area: Security and Dependability, Operating, Communication, and Distributed Systems

Funding period: 2018 to 2022

Project identifier: Deutsche Forschungsgemeinschaft (DFG) - Project number 398800066

Year of creation: 2023

Summary of Project Results

Contemporary relational database systems (RDBMSs) come with ever more capable and versatile SQL query processors that allow the evaluation of complex computations and algorithms right next to the data. The more intricate such queries become, the more pressing are questions regarding the correctness and origin (or data provenance) of query results: Where in the input did this piece of the output originate? Why did the query emit this item but omit another? How did the query produce this result, and exactly which query constructs participated in the evaluation (can we thus possibly simplify the query or touch less data)?

The primary goal of the present project was to let techniques for the derivation of such data provenance catch up with the significant advances that query languages and their expressiveness have made over the past decade. These include the staples of data analytics, like grouping and aggregation, or the variety of window functions available in SQL. Beyond these constructs, we studied the computation of provenance for quantifiers, deeply nested (possibly correlated) subqueries and, in particular, queries and functions that use iteration or recursion to process their tabular inputs. The latter marks a significant step forward since recursive constructs, in principle, turn query languages into expressive programming languages. We showed how data provenance (a) uncovers the iterative nature of the resulting computation and (b) can help to locate and fix bugs in recursive SQL queries.

Data provenance shifts a query's traditional focus from values and their transformation to the dependencies between output and input data: a single value (an individual cell or row of a query's output table, say) will generally depend on an entire set of values in the input tables. We thus devised systematic query transformations that turn a given value-based SQL query into its own dependency-set-processing variant.
These transformations were deliberately designed in a compositional fashion to avoid failure in the face of complex SQL queries. The project revolved around this SQL-to-SQL translation strategy since it enabled the use of existing RDBMSs to derive provenance: data never leaves the system and we could build on optimizing query engines that incorporate decades of research and engineering effort. Importantly, since this approach represents the derived provenance information in tabular form, SQL itself can again be used to explore the data dependencies found.

The shift from values to dependency sets is reminiscent of the abstract interpretation of queries, so we were able to draw on a substantial body of work by the programming languages community. Dependency sets proved to be a notion sufficiently general and flexible to represent a whole variety of data provenance kinds. We derived Where-provenance, Why-provenance, as well as How-provenance for SQL queries. The latter built on the canonical translation of queries into intermediate imperative programs whose execution generated a trace of relevant query constructs. Among other insights, How-provenance managed to uncover the lazy evaluation of existential quantification and the intricacies of (non-terminating) recursive queries. To the best of our knowledge, the present project is the first to derive such detailed provenance for queries of this complexity.

Tracing data dependencies through imperative programs also paved a way to integrate provenance derivation deeper into database systems, e.g., directly into the compiling query engines that power the most efficient analytical RDBMSs available today. While the focus of the three-year project remained on the SQL language-level transformations mentioned above, which are independent of any particular RDBMS backend and thus widely applicable, the groundwork for true system-level provenance derivation has certainly been laid.
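
The idea of pairing a value-based query with a dependency-processing counterpart, and of keeping the resulting provenance in a table that SQL can query again, can be illustrated with a small sketch. It is hand-written for illustration and is not the project's actual translation: the `orders` table, its contents, the `prov` table, and the function `run_demo` are all hypothetical, the dependencies are tracked at row level only, and SQLite (via Python's standard `sqlite3` module) merely stands in for an arbitrary RDBMS.

```python
import sqlite3

# Hypothetical toy schema and data, invented for this illustration only.
SETUP = """
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount INTEGER);
INSERT INTO orders VALUES
  (1, 'ann', 10), (2, 'bob', 5), (3, 'ann', 7), (4, 'bob', 1), (5, 'cy', 3);
"""

# Value-based query: computes *what* the result is.
VALUE_QUERY = """
SELECT customer, SUM(amount) AS total
FROM orders
WHERE amount > 2
GROUP BY customer
ORDER BY customer
"""

# Hand-written dependency variant (an assumption, not the project's output):
# it mirrors VALUE_QUERY but, instead of aggregating, pairs each output group
# with the id of every input row that was aggregated into it.
DEPENDENCY_QUERY = """
CREATE TABLE prov AS
SELECT customer AS out_customer, id AS in_row
FROM orders
WHERE amount > 2
"""


def run_demo():
    con = sqlite3.connect(":memory:")
    con.executescript(SETUP)

    # Regular evaluation: the values of the query result.
    values = con.execute(VALUE_QUERY).fetchall()

    # Provenance is materialized as an ordinary table inside the database.
    con.execute(DEPENDENCY_QUERY)

    # Collect one dependency set (of input row ids) per output row.
    deps = {}
    for out_customer, in_row in con.execute(
        "SELECT out_customer, in_row FROM prov"
    ):
        deps.setdefault(out_customer, set()).add(in_row)

    # Explore the provenance with plain SQL:
    # which output rows depend on input row 3?
    affected = [
        row[0]
        for row in con.execute(
            "SELECT DISTINCT out_customer FROM prov "
            "WHERE in_row = ? ORDER BY out_customer",
            (3,),
        )
    ]
    con.close()
    return values, deps, affected


if __name__ == "__main__":
    values, deps, affected = run_demo()
    print(values)
    print(deps)
    print(affected)
```

In this sketch only rows that pass the filter are recorded as dependencies. A fuller treatment would presumably also account for the rows and columns inspected by the predicate and the grouping criterion, which this simplification leaves out.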

Project-Related Publications (Selection)

How How Explains What What Computes: How-Provenance for SQL and Query Compilers. In 10th USENIX Workshop on Theory and Practice of Provenance (TaPP), London, UK, July 2018
D. O’Grady, T. Müller, and T. Grust
You say 'what', I hear 'where' and 'why'. Proceedings of the VLDB Endowment, 11(11), 1536-1549.
Müller, Tobias; Dietrich, Benjamin & Grust, Torsten
Detached Provenance Analysis. Ph.D. Thesis, University of Tübingen, Department of Computer Science, March 2020
T. Müller
Functional-Style SQL UDFs With a Capital 'F'. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (May 31, 2020), 1273-1287.
Duta, Christian & Grust, Torsten
Data provenance for recursive SQL queries. Proceedings of the 14th International Workshop on the Theory and Practice of Provenance (June 12, 2022), 1-8.
Dietrich, Benjamin; Müller, Tobias & Grust, Torsten
How, Where, and Why Data Provenance Improves Query Debugging: A Visual Demonstration of Fine-Grained Provenance Analysis for SQL. 2022 IEEE 38th International Conference on Data Engineering (ICDE) (May 2022), 3178-3181.
Müller, Tobias & Engel, Pascal
