Advanced Analyses > CLUSTER Command

Type Of Clustering Algorithm

There are six different clustering algorithms available in StatPac. The TY option is used to select the clustering method. Algorithms 1-3 are agglomerative hierarchical clustering algorithms, while algorithms 4-6 are non-hierarchical clustering algorithms. Each of the six algorithms is described below:

Minimum average sum of squares cluster analysis TY=1

With this algorithm, the clusters merged at each stage are chosen so as to minimize the average contribution to the error sum of squares for each member in the cluster. This quantity is the variance within each cluster, so the method resembles average linkage in that it tends to produce clusters of approximately equal variance. Consequently, if the clusters are all of approximately the same density, there will be a tendency for large natural groups to appear as several smaller clusters, or for small natural groups to merge into larger clusters.
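StatPac applies this criterion internally; purely as an illustration of the quantity being minimized, the following Python sketch (all names are hypothetical) evaluates the average within-cluster error sum of squares for a candidate merge:

    import numpy as np

    def avg_sse(cluster):
        # Average contribution to the error sum of squares per member:
        # the variance of the cluster around its centroid.
        centroid = cluster.mean(axis=0)
        return ((cluster - centroid) ** 2).sum() / len(cluster)

    def merge_cost(a, b):
        # TY=1 criterion: average SSE of the cluster formed by merging
        # a and b; at each stage the pair with the smallest value is joined.
        return avg_sse(np.vstack([a, b]))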

Ward's method TY=2

At each stage, this method merges the pair of clusters p and q whose merger produces the smallest increase in the total within-cluster sum of squares over all partitions. This method tends to join clusters with a small number of observations and is biased towards producing clusters with roughly the same number of observations.
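StatPac runs Ward's method directly via TY=2. For readers who want an external point of comparison, a roughly equivalent analysis can be sketched with the SciPy library; the data file name and cluster count below are hypothetical:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.loadtxt("data.txt")     # n observations by p clustering variables
    Z = linkage(X, method="ward")  # Ward minimum-variance agglomeration
    labels = fcluster(Z, t=4, criterion="maxclust")  # cut the tree at 4 clusters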

Centroid method TY=3

At each stage, this method merges the two clusters whose centroids are separated by the smallest squared Euclidean distance. The centroid method is not as sensitive to the presence of outliers, but does not perform as well as the first two methods if there are no outliers.
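As an illustration only, the sketch below (hypothetical names) computes the quantity the centroid method minimizes for one pair of clusters:

    import numpy as np

    def centroid_distance(a, b):
        # Squared Euclidean distance between two cluster centroids; the
        # centroid method merges the pair for which this is smallest.
        diff = a.mean(axis=0) - b.mean(axis=0)
        return float(diff @ diff)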

If there are no outliers, one of the first two methods should be used. The first method performs better than Ward's method under certain types of errors (Milligan, 1980). The three non-hierarchical clustering algorithms are all based on the convergent K-means method (Anderberg, 1973) and differ only in terms of their starting values.

Convergent K-means using minimum average sum of squares centroids TY=4

This algorithm first runs the minimum average sum of squares hierarchical cluster analysis method and uses the centroids from this method as input to the convergent K-means procedure. The distance measure used to allocate an observation to a cluster in the convergent K-means procedure is the Euclidean distance computed from the clustering variables for that observation.

Convergent K-means using Ward method centroids TY=5

This algorithm first runs the Ward hierarchical cluster analysis method and uses the centroids from this method as input to the convergent K-means procedure. The distance measure used to allocate an observation to a cluster in the convergent K-means procedure is the Euclidean distance computed from the clustering variables for that observation.

Convergent K-means using centroids from the centroid method TY=6

This algorithm first runs the centroid hierarchical cluster analysis method and uses the centroids from this method as input to the convergent K-means procedure. The distance measure used to allocate an observation to a cluster in the convergent K-means procedure is the Euclidean distance computed from the clustering variables for that observation.
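Since TY=4 through TY=6 differ only in how the starting centroids are produced, the whole two-stage pipeline can be illustrated with a single sketch. The version below uses SciPy and scikit-learn as stand-ins (method="ward" and method="centroid" approximate the TY=5 and TY=6 seedings; SciPy has no direct counterpart of the TY=1 criterion; all names are hypothetical):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.cluster import KMeans

    def seeded_kmeans(X, k, method="ward"):
        # Stage 1: hierarchical pass to obtain k starting centroids.
        Z = linkage(X, method=method)
        labels = fcluster(Z, t=k, criterion="maxclust")
        seeds = np.vstack([X[labels == c].mean(axis=0) for c in range(1, k + 1)])
        # Stage 2: convergent K-means; observations are reassigned to the
        # nearest centroid (Euclidean distance) until membership stabilizes.
        return KMeans(n_clusters=k, init=seeds, n_init=1).fit(X)

With an explicit init array and n_init=1, scikit-learn runs exactly one K-means pass from the supplied seeds, mirroring the two-stage structure described above.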

Non-hierarchical methods generally perform better than hierarchical methods when non-random starting clusters are used. When random starting clusters are used (for example, when the first p observations are taken as the centroids of the p starting clusters), non-hierarchical clustering methods perform rather poorly; the random-start methods were therefore not implemented in StatPac. K-means procedures appear more robust than any hierarchical method with respect to the presence of outliers, error perturbations of the distance measures, and the choice of distance metric. However, non-hierarchical methods require the number of clusters to be specified in advance. Many studies recommend the following series of steps when running a cluster analysis:

1. Run a cluster analysis using one of the first two hierarchical algorithms (minimum average sum of squares or Ward's method).

2. Remove outliers from the data set. Outliers can be located by examining the distances from the cluster centroids (CC option) or the hierarchical tree diagram (one-observation clusters that are late in merging with other clusters). Outliers often represent under-represented segments of the population and therefore should not be discarded without examination.

3. Delete dormant clustering variables. These can be identified using the decomposition of the sum of squares (DC option).

4. Determine the number of clusters. This can be done using the criterion function column in the decomposition of the sum of squares (DC option), as well as the hierarchical tree diagram (TD option).

5. Once the outliers are discarded, the dormant variables omitted, and the number of clusters determined, run one of the first two non-hierarchical methods (TY=4 or TY=5) several times, varying the number of clusters, as illustrated in the sketch following this list.
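StatPac carries out each of these steps through the options described above. As a conceptual illustration of the same sequence outside StatPac, the sketch below strings the steps together; the file name, cluster counts, and the 3-sigma outlier cutoff are hypothetical choices:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.cluster import KMeans

    X = np.loadtxt("data.txt")

    # Step 1: hierarchical pass (Ward) to form provisional clusters.
    labels = fcluster(linkage(X, method="ward"), t=5, criterion="maxclust")

    # Step 2: flag outliers by their distance from the cluster centroid
    # (analogous to inspecting the CC option output).
    centroids = np.vstack([X[labels == c].mean(axis=0) for c in range(1, 6)])
    dist = np.linalg.norm(X - centroids[labels - 1], axis=1)
    keep = dist < dist.mean() + 3 * dist.std()  # hypothetical 3-sigma cutoff
    X_clean = X[keep]

    # Step 3 (dropping dormant variables) would remove columns of X_clean
    # flagged by the DC output; it is omitted here.

    # Steps 4-5: rerun the seeded K-means pipeline, varying the cluster count.
    for k in (3, 4, 5, 6):
        lab = fcluster(linkage(X_clean, method="ward"), t=k, criterion="maxclust")
        seeds = np.vstack([X_clean[lab == c].mean(axis=0) for c in range(1, k + 1)])
        km = KMeans(n_clusters=k, init=seeds, n_init=1).fit(X_clean)
        print(k, km.inertia_)  # total within-cluster sum of squares for each k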