A Library for Scalable Analytics and Mining in Stratosphere

Applicant Professor Dr. Felix Naumann, since 8/2014

Subject Area Security and Dependability, Operating-, Communication- and Distributed Systems

Term from 2013 to 2017

Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 132320961

The Web can be viewed as a conglomerate of “big data” repositories. The sheer size of these repositories makes the management and analysis of the underlying data, whether structured or not, difficult. At such a scale, any data analysis program needs to run in a distributed fashion and use preprocessed input whenever possible. More specifically, to apply data mining and machine learning methods at such a scale, we need to address the following research challenges:

1. Separate the wheat from the chaff: recognize “valuable entities” in these datasets and extract salient features about those entities. Existing Stratosphere operators will be adapted and new operators will be implemented to perform this Web-scale extraction task. We will combine the extracted textual features with knowledge-oriented features derived from knowledge bases that contain the entities. The feature vectors obtained in this way will be used as generic representations of real-world entities and fed into a wide range of prediction, recommendation, and knowledge discovery methods.

2. Divide and conquer: although the above vectors will be rather sparse, there will potentially be hundreds of millions of them. Hence they need to be maintained and manipulated in a distributed and declarative fashion. To this end, we will focus on techniques for efficiently partitioning a dataset of high-dimensional vectors across different machines, while considering various criteria, such as load balancing, topical proximity, and efficient updates. As part of the feature selection process, we plan to provide efficient declarative functions for switching feature values on and off, based on the learning task at hand.

3. Let the data speak for itself: provide a library of machine learning, data mining, and knowledge discovery techniques that can operate on the feature vectors described above.

In contrast to existing efforts, which focus mainly on providing scalable algorithms, we also aim to provide meaningful data (in the form of preprocessed feature vectors) for such algorithms. Stratosphere's Supremo operator model and the Strata data programming language will be at the heart of all our algorithms.
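
To make the second challenge more concrete, the following is a minimal single-process sketch in plain Python, not Stratosphere code. It illustrates two of the ideas named above: switching features on and off for a given learning task, and partitioning sparse vectors across machines with load balancing as the only criterion (topical proximity and efficient updates are not covered). The names `mask_features` and `partition_balanced`, the `text:`/`kb:` feature prefixes, and the sample entities are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set

# A sparse vector maps feature names to values; absent features are zero.
SparseVector = Dict[str, float]


def mask_features(vec: SparseVector, active: Set[str]) -> SparseVector:
    """Keep only the features that are switched on for the current task."""
    return {name: value for name, value in vec.items() if name in active}


def partition_balanced(
    vectors: Dict[str, SparseVector], num_machines: int
) -> List[List[str]]:
    """Greedy load balancing: place the largest vectors first, each on the
    machine that currently holds the fewest non-zero entries."""
    loads = [0] * num_machines
    parts: List[List[str]] = [[] for _ in range(num_machines)]
    # Sort by descending size; ties are broken by entity name for determinism.
    for entity in sorted(vectors, key=lambda e: (-len(vectors[e]), e)):
        target = loads.index(min(loads))
        parts[target].append(entity)
        loads[target] += len(vectors[entity])
    return parts


# Hypothetical entities with textual ("text:") and knowledge-base ("kb:") features.
vectors = {
    "berlin": {"text:capital": 1.0, "text:city": 2.0, "kb:type=City": 1.0},
    "einstein": {"text:physicist": 3.0, "kb:type=Person": 1.0},
    "potsdam": {"text:city": 1.0},
    "hpi": {"text:institute": 1.0, "kb:type=Organization": 1.0},
}

parts = partition_balanced(vectors, 2)
text_only = {
    entity: mask_features(vec, {f for f in vec if f.startswith("text:")})
    for entity, vec in vectors.items()
}
```

In this sketch the load of a machine is simply the number of non-zero entries it stores. A distributed implementation would have to weigh that against the other criteria the project names, which the greedy heuristic here ignores.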

DFG Programme Research Units

Subproject of FOR 1306: Stratosphere - Information Management above the Clouds

Former Applicant Professor Dr. Gjergji Kasneci, until 7/2014
