Data Profiling and Data Cleansing on Stratosphere
Applicant
Professor Dr. Felix Naumann
Subject Area
Security and Dependability, Operating, Communication and Distributed Systems
Funding
Funded from 2013 to 2017
Project Identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 132320961
Project E follows two main research goals, namely “Low-Latency Cleansing” and “Scalable Data Profiling”. Data cleansing represents a tried and tested application area of Stratosphere, yet achieving low latency and early results enables new application areas. Data profiling serves both as a challenging application area for developers and as a core technology for query planning, data analysis, and data integration and cleansing.
Low-latency cleansing aims at reducing the pipeline-blocking nature that is typical of data cleansing and integration operators. For instance, to answer a query requiring a duplicate-free result, a complex duplicate detection operator is invoked, which might perform multiple sorting steps and possibly some advanced clustering before emitting results. In big-data scenarios where not all data is even initially available, or in user-interactive scenarios, such latency is unacceptable. We plan to investigate specialized cleansing operators that can handle such high-throughput data with minimal effect on the quality of the outcome. For instance, outputting a pre-selection of particularly clean data can “buy” the necessary time to prepare more complex cleansing steps.
Scalable data profiling has the goal of determining metadata about very large datasets, ranging from simple measures, such as tuple counts or column uniqueness, through moderately complex information, such as frequent value patterns and data types, to information that is highly complex to determine, including (conditional) functional dependencies and inclusion dependencies. Section 3.1 gives a comprehensive overview of data profiling tasks. General tools usually cover only a subset of possible data profiling tasks, due to their complexity. Specialized methods are often more efficient, but cover only one type of information and often assume that all data resides in main memory.
Our goal is to address both limitations: First, we cannot assume that main-memory-based approaches suffice; rather, we need to handle data that must be distributed over multiple sites. Second, we plan to combine calculations for various tasks so as to minimize I/O activity. The results of data profiling have at least two immediate uses. First, they serve as input to Stratosphere’s statistics component, which in turn supports query optimization. Second, data profiling results serve as input to data cleansing methods. For instance, knowledge about common value patterns in a column allows the configuration of a normalization operator on that column. Vice-versa, some simple data cleansing.
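The idea of combining calculations for several profiling tasks to minimize I/O can be illustrated with a minimal single-machine sketch. It is not the project's implementation and does not use Stratosphere; the function names `value_pattern` and `profile`, the pattern alphabet (digits become `9`, letters become `A`), and the sample rows are assumptions made for illustration only. The sketch computes the tuple count, column uniqueness, and the most frequent value pattern per column in one pass over the data.

```python
import re
from collections import Counter


def value_pattern(value: str) -> str:
    """Generalize a value: each digit becomes '9', each ASCII letter 'A'."""
    pattern = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", pattern)


def profile(rows, columns):
    """Collect several profiling results in a single pass over the rows.

    Assumption: distinct values fit in memory. A distributed setting
    would need partitioned or approximate data structures instead.
    """
    tuple_count = 0
    distinct = {c: set() for c in columns}
    patterns = {c: Counter() for c in columns}
    for row in rows:  # the data is read exactly once
        tuple_count += 1
        for column, value in zip(columns, row):
            distinct[column].add(value)
            patterns[column][value_pattern(value)] += 1
    return {
        "tuple_count": tuple_count,
        # A column is unique if every tuple carries a different value.
        "unique_columns": [
            c for c in columns if len(distinct[c]) == tuple_count
        ],
        "top_pattern": {
            c: patterns[c].most_common(1)[0][0]
            for c in columns
            if patterns[c]
        },
    }


# Hypothetical sample data: (id, name, zip)
rows = [
    ("1", "Alice", "14482"),
    ("2", "Bob", "10115"),
    ("3", "Alice", "D-14482"),
]
result = profile(rows, ["id", "name", "zip"])
```

In this sample, the dominant pattern found for the `zip` column (`99999`) is the kind of information that could be used to configure a normalization operator on that column, for example to flag the value `D-14482` as deviating from the common format.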
DFG Programme
Research Units
Subproject of
FOR 1306:
Stratosphere - Information Management on the Cloud