Cluster analysis can be a powerful data-mining tool for any organisation that needs to identify discrete groups of customers, sales transactions, or other types of behaviours and objects. For example, insurance providers use cluster analysis to detect fraudulent claims, and banks use it for credit scoring.
Cluster analysis, like reduced space analysis (factor analysis), is concerned with data matrices in which the variables have not been partitioned beforehand into criterion versus predictor subsets.
The objective of cluster analysis is to find groups of similar subjects, where “similarity” between each pair of subjects means some global measure over the whole set of characteristics. In this article we discuss various methods of clustering and the key role that distance plays as a measure of the proximity of pairs of points.
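To make the idea of a global proximity measure concrete, here is a minimal sketch that computes the distance between every pair of subjects. It assumes Euclidean distance, which is only one of several possible measures, and the subject names and values are hypothetical, chosen purely for illustration.

```python
import math

# Hypothetical data: each entry is a subject, each list position a characteristic.
subjects = {
    "A": [1.0, 2.0],
    "B": [1.5, 1.8],
    "C": [8.0, 8.0],
}

def euclidean(p, q):
    """Straight-line distance between two profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def pairwise_distances(data):
    """Return {(name_i, name_j): distance} for every unordered pair of subjects."""
    names = sorted(data)
    return {
        (names[i], names[j]): euclidean(data[names[i]], data[names[j]])
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }

distances = pairwise_distances(subjects)
for pair, d in distances.items():
    print(pair, round(d, 3))
```

A small distance marks a pair of subjects as similar across all characteristics at once; here A and B lie close together while C is far from both.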
Basic Questions in Cluster Analysis
The most common use of cluster analysis is classification. Subjects are separated into groups so that each subject is more similar to other subjects in its group than to subjects outside the group.
We will initially focus on clustering procedures that result in the assignment of each subject to one, and only one, class. Subjects within a class are usually assumed to be indistinguishable from one another. Thus, we assume that the underlying structure of the data involves an unordered set of discrete classes. In some cases we may also view these classes as hierarchical in nature, with some classes divided into subclasses. Clustering procedures can be viewed as “pre-classificatory” in the sense that the researcher has not used prior judgment to partition the subjects (rows of the data matrix). However, it is assumed that some of the objects are heterogeneous; that is, that “clusters” exist.
This presupposition of different groups is based on commonalities within the set of independent variables. This assumption is different from the one made in the case of discriminant analysis or automatic interaction detection, where the dependent variable is used to formally define groups of objects and the distinction is not made on the basis of profile resemblance in the data matrix itself.
Thus, given that no information on group definition is formally evaluated in advance, the major questions of cluster analysis can be stated as follows:
- What measure of inter-subject similarity is to be used and how is each variable to be “weighted” in the construction of such a summary measure?
- After inter-subject similarities are obtained, how are the classes to be formed?
- After the classes have been formed, what summary measures of each cluster are appropriate in a descriptive sense; that is, how are the clusters to be defined?
- Assuming that adequate descriptions of the clusters can be obtained, what inferences can be drawn regarding their statistical significance?
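The first three questions can be traced through a small worked sketch. Every choice in it is an assumption made for illustration rather than a recommendation: variables are weighted equally by z-scoring, similarity is Euclidean distance, classes are formed by single-linkage agglomeration stopped at a preset number of clusters, and each cluster is described by its mean profile. The data, the function names, and k=2 are hypothetical. The fourth question, statistical significance, is not addressed here.

```python
import math
from statistics import mean, pstdev

# Hypothetical data: two characteristics measured on very different scales.
data = {
    "s1": [25.0, 30000.0],
    "s2": [27.0, 32000.0],
    "s3": [26.0, 31000.0],
    "s4": [55.0, 90000.0],
    "s5": [58.0, 95000.0],
}

def standardize(rows):
    """Question 1 (weighting): z-score each variable so all count equally.
    Assumes no variable is constant (non-zero standard deviation)."""
    columns = list(zip(*rows.values()))
    stats = [(mean(c), pstdev(c)) for c in columns]
    return {
        name: [(v - m) / s for v, (m, s) in zip(values, stats)]
        for name, values in rows.items()
    }

def distance(p, q):
    """Question 1 (similarity): Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(rows, k):
    """Question 2 (forming classes): repeatedly merge the two closest
    clusters, where closeness is the smallest member-to-member distance,
    until only k clusters remain."""
    clusters = [[name] for name in rows]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(rows[a], rows[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

def centroids(rows, clusters):
    """Question 3 (description): mean profile of each cluster, original units."""
    return [[mean(col) for col in zip(*(rows[n] for n in c))] for c in clusters]

z = standardize(data)
clusters = single_linkage(z, k=2)
summary = centroids(data, clusters)

for members, profile in zip(clusters, summary):
    print(members, profile)
```

Standardizing before computing distances matters in this sketch because the second variable would otherwise dominate the distance purely through its larger units. A different weighting scheme or linkage rule could produce different classes from the same data.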