Abstract
Cluster analysis is a widely used statistical tool for assisting researchers in the identification of subgroups within a population based on a set of variables. Research has shown that many clustering algorithms have difficulty correctly grouping individuals within a sample when the subgroups in the population are of very different sizes. Clustering algorithms can also yield inaccurate results when outliers are present in the data, or when some of the variables used in the clustering are not actually associated with subgroup separation. A variety of clustering algorithms have been developed to deal with these problems, including approaches based on the popular Lasso regularization estimator, a trimmed estimator, density-based clustering, the use of medoids rather than centroids, and a robust approach. Prior research has examined several of these methods under some, but not all, of the challenging data scenarios outlined above. The purpose of this simulation study, therefore, was to compare a number of these clustering algorithms when group sizes were unequal, outliers were present, and some of the variables represented noise rather than contributing to cluster separation. Results of the study showed that the robust, trimmed, and density-based clustering methods yielded the most accurate clustering results when all of these issues were present in the data. When none of these issues was present, all of the methods examined here performed similarly well. An empirical example is also presented, and implications of the simulation study are discussed.
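As a minimal illustrative sketch (not the author's simulation code), the kind of data scenario studied here can be mimicked by generating data with very unequal subgroup sizes, appending noise variables, and adding a few outliers, then comparing a centroid-based method (k-means) with a density-based method (DBSCAN) using the adjusted Rand index. The sample sizes, noise levels, and DBSCAN tuning parameters below are arbitrary choices for demonstration only.

```python
# Hypothetical example: compare k-means and DBSCAN on data with unequal
# cluster sizes, noise variables, and outliers. Not the paper's simulation.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Two informative variables, very unequal subgroup sizes (200 vs 20)
X, y = make_blobs(n_samples=[200, 20], n_features=2,
                  cluster_std=1.0, random_state=0)

# Append two pure-noise variables unrelated to subgroup separation
X = np.hstack([X, rng.normal(size=(X.shape[0], 2))])

# Add a handful of outliers far from both clusters; label them -1
outliers = rng.uniform(low=-15, high=15, size=(10, X.shape[1]))
X = np.vstack([X, outliers])
y = np.concatenate([y, np.full(10, -1)])

X = StandardScaler().fit_transform(X)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.9, min_samples=5).fit_predict(X)

# Higher adjusted Rand index indicates closer agreement with the true grouping
print("k-means ARI:", adjusted_rand_score(y, km_labels))
print("DBSCAN  ARI:", adjusted_rand_score(y, db_labels))
```

In such settings the density-based method, which labels isolated points as noise, would often be expected to recover the true grouping more accurately than k-means, consistent with the pattern of results summarized in the abstract.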

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Copyright (c) 2019 W. Holmes Finch (Author)