Evaluating clustering algorithms requires different metrics than those used for classification, as clustering is typically an **unsupervised** task. Below are common metrics used to evaluate clustering algorithms:
### 1. **Internal Evaluation Metrics**
These metrics do not require ground truth labels and evaluate the clustering structure itself, based on the data points and cluster characteristics.
- **Silhouette Score**: Measures how similar a data point is to its own cluster compared to other clusters. Values range from -1 to 1, where higher values indicate better-defined clusters.
$
s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}
$
Where:
- $a(i)$ is the average distance from point $i$ to all other points in its own cluster.
- $b(i)$ is the smallest average distance from point $i$ to the points of any other cluster (i.e., the nearest neighboring cluster).
The reported score is typically the mean of $s(i)$ over all data points.
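In practice this is available directly as scikit-learn's `silhouette_score`; a minimal sketch, assuming a synthetic dataset from `make_blobs`:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated blobs (illustrative assumption)
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Mean silhouette over all points; closer to 1 means better-defined clusters
print(silhouette_score(X, labels))
```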
- **Davies-Bouldin Index**: Measures the average similarity ratio of each cluster with the cluster that is most similar to it. Lower values indicate better clustering quality.
$
DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{d(C_i) + d(C_j)}{d(C_i, C_j)}
$
Where:
- $k$ is the number of clusters.
- $d(C_i)$ is the intra-cluster distance of cluster $i$ (e.g., the average distance of its points to its centroid).
- $d(C_i, C_j)$ is the inter-cluster distance between clusters $i$ and $j$ (e.g., the distance between their centroids).
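scikit-learn exposes this as `davies_bouldin_score`; a minimal sketch on the same kind of synthetic data as above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Lower is better; 0 is the theoretical best
print(davies_bouldin_score(X, labels))
```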
- **Dunn Index**: Evaluates the ratio between the minimum inter-cluster distance and the maximum intra-cluster distance. Higher values indicate better clustering.
$
Dunn = \frac{\min_{i \neq j} d(C_i, C_j)}{\max_k d(C_k)}
$
Where:
- $d(C_i, C_j)$ is the inter-cluster distance between clusters $i$ and $j$ (commonly the minimum distance between their points; centroid distance is another variant).
- $d(C_k)$ is the intra-cluster distance (diameter) of cluster $k$.
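The Dunn index is not built into scikit-learn, so below is a naive NumPy/SciPy sketch using the minimum pairwise inter-cluster distance and the maximum cluster diameter (one of several common variants); `dunn_index` is a hypothetical helper name:

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(X, labels):
    """Naive Dunn index: min inter-cluster distance / max cluster diameter."""
    clusters = [X[labels == k] for k in np.unique(labels) if k != -1]
    # Maximum intra-cluster distance (diameter) over all clusters
    max_diam = max(pdist(c).max() for c in clusters if len(c) > 1)
    # Minimum distance between points in different clusters
    min_inter = min(cdist(clusters[i], clusters[j]).min()
                    for i in range(len(clusters))
                    for j in range(i + 1, len(clusters)))
    return min_inter / max_diam  # higher is better
```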
- **Within-Cluster Sum of Squares (WCSS)**: Also known as inertia, it measures the compactness of the clusters, i.e., how tightly grouped the data points are within clusters. Lower values indicate better clustering.
$
WCSS = \sum_{k=1}^{K} \sum_{i \in C_k} \left \| x_i - \mu_k \right \|^2
$
Where:
- $C_k$ is the set of points in cluster $k$.
- $\mu_k$ is the centroid of cluster $k$.
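For k-means specifically, scikit-learn reports WCSS as the fitted model's `inertia_` attribute; a minimal sketch:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# Sum of squared distances of samples to their closest centroid
print(km.inertia_)
```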
- **Calinski-Harabasz Index** (Variance Ratio Criterion): Measures the ratio of the sum of between-cluster dispersion to within-cluster dispersion. Higher values indicate better-defined clusters.
$
CH = \frac{\text{Tr}(B_k)}{\text{Tr}(W_k)} \times \frac{n - k}{k - 1}
$
Where:
- $\text{Tr}(B_k)$ is the trace of the between-cluster dispersion matrix.
- $\text{Tr}(W_k)$ is the trace of the within-cluster dispersion matrix.
- $n$ is the number of data points, and $k$ is the number of clusters.
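This too has a direct scikit-learn implementation, `calinski_harabasz_score`; a minimal sketch:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Higher is better; rewards dense, well-separated clusters
print(calinski_harabasz_score(X, labels))
```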
### 2. **External Evaluation Metrics**
These metrics require ground truth labels and evaluate how well the clustering results match the actual labeled data.
- **Adjusted Rand Index (ARI)**: Measures the similarity between the true labels and the predicted clusters, adjusted for chance. ARI ranges from -1 to 1, where 1 means perfect agreement, 0 is the expected value for random labeling, and negative values indicate agreement worse than chance.
$
ARI = \frac{\text{RI} - E(\text{RI})}{\max(\text{RI}) - E(\text{RI})}
$
Where:
- $\text{RI}$ is the Rand Index, the fraction of point pairs on which the two labelings agree (grouped together in both or separated in both).
- $E(\text{RI})$ is the expected Rand Index under random clustering.
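Because ARI compares pair assignments rather than label names, it is invariant to permuting cluster IDs; a toy sketch with scikit-learn:

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [2, 2, 0, 0, 1, 1]  # same grouping, different cluster IDs

print(adjusted_rand_score(true_labels, pred_labels))  # 1.0: perfect agreement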
- **Normalized Mutual Information (NMI)**: Measures the amount of information shared between the clustering and the ground truth. It is normalized between 0 and 1, with 1 indicating perfect alignment and 0 indicating no mutual information.
$
NMI = \frac{I(Y; C)}{\sqrt{H(Y) H(C)}}
$
Where:
- $I(Y; C)$ is the mutual information between the true labels $Y$ and predicted clusters $C$.
- $H(Y)$ and $H(C)$ are the entropies of the true labels and the predicted clusters, respectively.
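The square-root normalization above is the geometric-mean variant; scikit-learn's `normalized_mutual_info_score` defaults to arithmetic-mean normalization, so pass `average_method='geometric'` to match the formula:

```python
from sklearn.metrics import normalized_mutual_info_score

true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 0, 1, 2, 2, 2]  # one point assigned to the wrong group

print(normalized_mutual_info_score(true_labels, pred_labels,
                                   average_method='geometric'))
```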
- **Fowlkes-Mallows Index (FMI)**: Measures the similarity between true and predicted clusterings as the geometric mean of pairwise precision and recall. It ranges from 0 to 1, with higher values indicating better clustering.
$
FMI = \sqrt{\frac{TP}{TP + FP} \times \frac{TP}{TP + FN}}
$
Where:
- $TP$ is the number of point pairs grouped together in both the predicted clusters and the ground truth, $FP$ the pairs grouped together only in the prediction, and $FN$ the pairs grouped together only in the ground truth.
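A minimal sketch using scikit-learn's `fowlkes_mallows_score`:

```python
from sklearn.metrics import fowlkes_mallows_score

true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 0, 1, 1, 1, 2]  # one point pulled into the wrong cluster

print(fowlkes_mallows_score(true_labels, pred_labels))  # between 0 and 1
```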
- **Homogeneity, Completeness, and V-Measure**:
- **Homogeneity**: Measures whether each cluster contains only data points of a single ground truth class.
- **Completeness**: Measures whether all data points of a given ground truth class are assigned to the same cluster.
- **V-Measure**: The harmonic mean of homogeneity and completeness. It ranges from 0 to 1, with 1 indicating perfect clustering.
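scikit-learn computes all three at once via `homogeneity_completeness_v_measure`; a minimal sketch:

```python
from sklearn.metrics import homogeneity_completeness_v_measure

true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 0, 0, 1, 2, 2]  # cluster 0 mixes classes 0 and 1

h, c, v = homogeneity_completeness_v_measure(true_labels, pred_labels)
print(h, c, v)
```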
### 3. **Relative Evaluation Metrics**
These metrics compare the performance of different clustering algorithms by varying parameters like the number of clusters.
- **Elbow Method**: Plots the WCSS for different numbers of clusters and looks for the “elbow point,” where the addition of more clusters does not significantly improve the compactness of clusters.
- **Gap Statistic**: Compares the total within-cluster variation for different numbers of clusters against its expected value under a null reference distribution (typically data sampled uniformly at random over the data's range); the $k$ that maximizes the gap is preferred. A sketch of the simpler elbow method follows below.
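A minimal elbow-method sketch with k-means, assuming the same kind of synthetic blob data as in the earlier snippets:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# WCSS (inertia) for k = 1..10; the 'elbow' suggests a reasonable k
wcss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
        for k in range(1, 11)]

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("WCSS")
plt.show()
```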
### 4. **Clustering-specific Metrics for Density-based Algorithms**
- **Cluster Density Metrics**: For density-based clustering algorithms like DBSCAN, which do not necessarily optimize a metric like WCSS, the density of the clusters can be analyzed. This includes metrics like average distance between points in a cluster and their nearest neighbors.
- **Core Points and Noise Ratio**: Evaluates the ratio of core points (points inside dense regions) to noise points (points not assigned to any cluster) for density-based clustering methods like DBSCAN.
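A minimal sketch of the core/noise ratio idea with scikit-learn's DBSCAN (the `eps` and `min_samples` values here are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

core_ratio = len(db.core_sample_indices_) / len(X)  # fraction of core points
noise_ratio = np.mean(db.labels_ == -1)             # DBSCAN labels noise as -1
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)

print(n_clusters, core_ratio, noise_ratio)
```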
### Summary of When to Use These Metrics
- **Internal metrics** like Silhouette Score and Davies-Bouldin Index are useful when you do not have ground truth labels and want to assess the clustering structure.
- **External metrics** like ARI, NMI, and V-Measure are best when you have ground truth labels and want to compare the clustering results to the true class labels.
- **Relative evaluation** (e.g., Elbow Method) is useful for comparing different parameter settings (e.g., the number of clusters).
- **Density-based clustering metrics** are specifically for density-based algorithms like DBSCAN.
Each clustering algorithm has different strengths, so evaluating them using a combination of these metrics can give a more comprehensive understanding of their performance.