Project Details
Projekt Print View

Soft-Clustering - From Heuristics To Approximation Algorithms

Subject Area Theoretical Computer Science
Term from 2017 to 2021
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 324506751
 
A clustering is a partition of a set of objects into groups, so-called clusters. The computation of a clustering is a typical problem from machine learning and data mining with applications in several fields, including data compression, image and pattern recognition and signal processing. One of the most well-known clustering problems is the so-called K-means problem. However, in practice it does not always make sense to assign each data point to a unique cluster. That is why so-called soft clustering problems assign each data point to each cluster with some degree of membership. One of the most well-known soft clustering problems is the maximum likelihood estimation (MLE) problem with respect to mixture models. The Expectation Maximization (EM) algorithm is a popular approach to solve this problem in practical applications. It is a generalization of Lloyd's algorithm, which is readily applied when solving the K-means problem. Even though it is regularly used, the EM algorithm is a heuristic with two significant downsides. First, there is no guarantee on the solutions that the EM algorithm computes. Second, there is no known upper bound on its runtime. The main goal of this project is the algorithmic and complexity theoretical analysis of the MLE problem for mixture models. Comparing the MLE problem to the well-analyzed K-means problem, one can identify two substantial structural differences. The distribution of the memberships of points in the sense of a soft clustering, and the cost function in the form of a mixture model. That is why we want to approach the MLE problem from two different directions. To this end, we want to examine several variants "inbetween" the K-means and the MLE problem. The core idea is that these variants do not exhibit both of the previously addressed differences to the K-means problem, but in each case only one of them. In particular, we analyze the fuzzy K-means and the CMLE problem. These problems are, aside from the role they assume in this project as "intermediates" between the K-means and MLE problem, of great practical interest. Analogously to Lloyd's algorithm and the EM algorithm, there are heuristics for the fuzzy K-means and the CMLE problem, which are usually applied in practice. Our goal is to develop algorithms that solve these problems with provable guarantees on the quality of solutions and on the runtime. Additionally, we want to work out structural differences between these problems and the K-means problem, respectively between their optimal solutions. We want to use the insights gained from this analysis for the algorithmic and complexity theoretical examination of the MLE problem.
DFG Programme Research Grants
 
 

Additional Information

Textvergrößerung und Kontrastanpassung