Project Details
A Library for Scalable Analytics and Mining in Stratosphere
Applicant
Professor Dr. Felix Naumann, since 8/2014
Subject Area
Security and Dependability, Operating, Communication and Distributed Systems
Funding
Funded from 2013 to 2017
Project Identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 132320961
The Web can be viewed as a conglomerate of “big data” repositories. The sheer size of these repositories makes managing and analyzing the underlying data, whether structured or not, difficult. At such a scale, any data analysis program needs to run in a distributed fashion and use preprocessed input whenever possible. More specifically, to apply data mining and machine learning methods at such a scale, we need to address the following research challenges:
1. Separate the wheat from the chaff: recognize “valuable entities” in these datasets and extract salient features about those entities. Existing Stratosphere operators will be adapted and new operators will be implemented to perform this Web-scale extraction task. We will combine the extracted textual features with knowledge-oriented features derived from knowledge bases that contain the entities. The feature vectors obtained in this way will be used as generic representations of real-world entities and fed into a wide range of prediction, recommendation, and knowledge discovery methods.
2. Divide and conquer: although the above vectors will be rather sparse, there will potentially be hundreds of millions of them. Hence they need to be maintained and manipulated in a distributed and declarative fashion. To this end, we will focus on techniques for efficiently partitioning a dataset of high-dimensional vectors across different machines while considering criteria such as load balancing, topical proximity, and efficient updates. As part of the feature selection process, we plan to provide efficient declarative functions for switching feature values on and off, based on the learning task at hand.
3. Let the data speak for itself: provide a library of machine learning, data mining, and knowledge discovery techniques that can operate on the feature vectors described above.
In contrast to existing efforts, which focus mainly on providing scalable algorithms, we also aim to provide meaningful data (in the form of preprocessed feature vectors) for such algorithms. Stratosphere's Supremo operator model and the Strata data programming language will be at the heart of all our algorithms.
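The second challenge (partitioning sparse feature vectors with load balancing, and switching features on and off per learning task) can be illustrated with a minimal sketch. It is plain Python on a single machine, not Stratosphere code. The names `partition_by_load` and `mask_features`, the dictionary-based vector representation, and the greedy largest-first heuristic are all assumptions made for illustration and are not taken from the project.

```python
from typing import Dict, List, Set

# Hypothetical representation: a sparse vector maps feature names to non-zero values.
SparseVector = Dict[str, float]


def partition_by_load(vectors: Dict[str, SparseVector], num_workers: int) -> List[List[str]]:
    """Assign entity ids to workers so that the number of non-zero
    features per worker stays roughly balanced (greedy heuristic)."""
    loads = [0] * num_workers
    parts: List[List[str]] = [[] for _ in range(num_workers)]
    # Place the largest vectors first; each goes to the currently lightest worker.
    for entity_id in sorted(vectors, key=lambda e: len(vectors[e]), reverse=True):
        target = loads.index(min(loads))
        parts[target].append(entity_id)
        loads[target] += len(vectors[entity_id])
    return parts


def mask_features(vector: SparseVector, enabled: Set[str]) -> SparseVector:
    """Keep only the features switched on for the learning task at hand."""
    return {name: value for name, value in vector.items() if name in enabled}
```

This sketch balances load only. Criteria such as topical proximity or efficient updates, which the abstract also names, would need a different assignment rule.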
DFG Programme
Research Units
Subproject of
FOR 1306:
Stratosphere - Information Management on the Cloud
Former Applicant
Professor Dr. Gjergji Kasneci, until 7/2014