Optimizing k-means clustering using mini-batches and distance bounds.
Access rightsWorldwide access
Shrestha Khimbaja, Sumit, 1994-
MetadataShow full item record
Clustering is a crucial branch of machine learning that groups the input data into different clusters based on the features of the data without the training label. K-means clustering is a widely used iterative clustering technique that groups data into k clusters by repeatedly minimizing a criterion function. The standard k-means algorithm, Lloyd’s algorithm, is simple and easy to implement the algorithm. But it does perform a lot of unnecessary work while computing the distance between the data points and the centers. This increases the runtime of the algorithm and makes it unsuitable for real-life applications. Previous works on k-means optimizations have used the triangle inequality law to generate geometric bounds for the data points which are then used to prevent unnecessary distance computations. Another approach to accelerate the k-means algorithm is to use only using a subset of data instead of the entire data in a single iteration. Due to the reduction in data size used for computation, this method also generates faster clustering results. In this thesis, we are proposing an algorithm that accelerates the k-means algorithm by using geometric bounds on mini-batches. We are combining the triangle inequality based optimization technique with the mini-batch approach by running the k-means algorithm in minibatches for multiple iterations and using the geometric bounds within the mini-batch to reduce the distance computations. The results show that there is a large speedup over the existing optimization techniques in terms of runtime while producing good quality clusters.