Faster k-means clustering.
Drake, Jonathan, 1989- (author); Hamerly, Gregory James, 1977- (advisor)
Thesis, Baylor University, issued 2013-08; deposited 2013-09-24.
http://hdl.handle.net/2104/8826

Abstract: The popular k-means algorithm is used to discover clusters in vector data automatically. We present three accelerated algorithms that compute exactly the same clusters as the standard method, but much faster. First, we redesign Hamerly's algorithm to use k heaps to avoid checking distance bounds for all n points, with little empirical gain. Second, we use an adaptive number of distance bounds to avoid redundant calculations (Drake and Hamerly 2012). Experiments show the superior performance of adaptive k-means in medium dimension (20 ≤ d ≤ 200) on uniform random data. Finally, we reformulate the triangle inequality to constrain the search space for a point's nearest center to an annular region centered at the origin. On uniform random data, annulus k-means is competitive with or much faster than the other algorithms in low dimension (d < 20), and it outperforms the other algorithms on five of six naturally clustered, real-world datasets tested (d ≤ 74).

Subjects: Machine learning; Clustering.
Language: en-US.
Rights: Baylor University theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. Contact librarywebmaster@baylor.edu for inquiries about permission.
Access: Worldwide access.
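
The annulus pruning mentioned in the abstract can be illustrated with a short sketch. The code below is a hypothetical illustration, not the thesis's algorithm (which combines this pruning with Hamerly-style upper and lower distance bounds): it assumes r is a valid upper bound on the distance from point x to its nearest center, uses NumPy for the norm and binary-search steps, and the function name nearest_center_annulus is invented for this example. The key fact is the triangle inequality: ||x - c|| >= | ||x|| - ||c|| |, so any center whose norm differs from ||x|| by more than r cannot be the nearest center, and sorting the centers by their distance from the origin turns the search into a scan over an annulus [||x|| - r, ||x|| + r].

    import numpy as np

    def nearest_center_annulus(x, centers, r):
        """Index of the center nearest to x, examining only centers whose norm
        lies within r of ||x||.  Assumes r >= distance from x to its nearest
        center (e.g., the distance to x's current assignment)."""
        norms = np.linalg.norm(centers, axis=1)
        order = np.argsort(norms)                  # centers sorted by norm
        sorted_norms = norms[order]
        xnorm = np.linalg.norm(x)
        # Binary search for the window of centers inside the annulus.
        lo = np.searchsorted(sorted_norms, xnorm - r, side="left")
        hi = np.searchsorted(sorted_norms, xnorm + r, side="right")
        best, best_dist = -1, np.inf
        for j in order[lo:hi]:                     # candidates inside the annulus only
            d = np.linalg.norm(x - centers[j])
            if d < best_dist:
                best, best_dist = j, d
        return best

In a full k-means iteration, r would typically come from the point's current assignment, and the sorted center norms would be computed once per iteration rather than once per point; this sketch only shows why the pruning cannot discard the true nearest center.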