Theses/Dissertations - Computer Science

Permanent URI for this collectionhttps://hdl.handle.net/2104/4810

Browse

Now showing 1 - 1 of 1

Optimizing k-means clustering using mini-batches and distance bounds.
(2020-10-22) Shrestha Khimbaja, Sumit, 1994-; Hamerly, Gregory James, 1977-
Clustering is a crucial branch of machine learning that groups the input data into different clusters based on the features of the data without the training label. K-means clustering is a widely used iterative clustering technique that groups data into k clusters by repeatedly minimizing a criterion function. The standard k-means algorithm, Lloyd’s algorithm, is simple and easy to implement the algorithm. But it does perform a lot of unnecessary work while computing the distance between the data points and the centers. This increases the runtime of the algorithm and makes it unsuitable for real-life applications. Previous works on k-means optimizations have used the triangle inequality law to generate geometric bounds for the data points which are then used to prevent unnecessary distance computations. Another approach to accelerate the k-means algorithm is to use only using a subset of data instead of the entire data in a single iteration. Due to the reduction in data size used for computation, this method also generates faster clustering results. In this thesis, we are proposing an algorithm that accelerates the k-means algorithm by using geometric bounds on mini-batches. We are combining the triangle inequality based optimization technique with the mini-batch approach by running the k-means algorithm in minibatches for multiple iterations and using the geometric bounds within the mini-batch to reduce the distance computations. The results show that there is a large speedup over the existing optimization techniques in terms of runtime while producing good quality clusters.

Browse

Browsing Theses/Dissertations - Computer Science by Subject "Clustering. K-means. Mini-batch."