Faster k-means clustering.
Date
Authors
Access rights
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
The popular k-means algorithm is used to discover clusters in vector data automatically. We present three accelerated algorithms that compute exactly the same clusters much faster than the standard method. First, we redesign Hamerly’s algorithm to use k heaps to avoid checking distance bounds for all n points, with little empirical gain. Second, we use an adaptive number of distance bounds to avoid redundant calculations (Drake and Hamerly 2012). Experiments show the superior performance of adaptive k-means in medium dimension (20 ≤ d ≤ 200) on uniform random data. Finally, we reformulate the triangle inequality to constrain the search space for a point’s nearest center to an annular region centered at the origin. For uniform random data, annulus k-means is competitive with or much faster than other algorithms in low dimension (d < 20), and it outperforms other algorithms on five of six naturally-clustered, real-world datasets tested (d ≤ 74).