Evaluating clustering algorithms requires different metrics than those used for classification, as clustering is typically an **unsupervised** task. Below are common metrics used to evaluate clustering algorithms:

### 1. **Internal Evaluation Metrics**

These metrics do not require ground truth labels; they evaluate the clustering structure itself, based on the data points and cluster characteristics.

- **Silhouette Score**: Measures how similar a data point is to its own cluster compared to other clusters. Values range from -1 to 1, where higher values indicate better-defined clusters.

  $ \text{Silhouette}(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} $

  Where:
  - $a(i)$ is the average distance from point $i$ to all other points in the same cluster.
  - $b(i)$ is the average distance from point $i$ to points in the nearest neighboring cluster.

- **Davies-Bouldin Index**: Measures the average similarity between each cluster and the cluster most similar to it. Lower values indicate better clustering quality.

  $ DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \frac{d(C_i) + d(C_j)}{d(C_i, C_j)} $

  Where:
  - $d(C_i)$ is the intra-cluster distance (within cluster $i$).
  - $d(C_i, C_j)$ is the inter-cluster distance (between clusters $i$ and $j$).

- **Dunn Index**: Evaluates the ratio between the minimum inter-cluster distance and the maximum intra-cluster distance. Higher values indicate better clustering.

  $ Dunn = \frac{\min_{i \neq j} d(C_i, C_j)}{\max_k d(C_k)} $

  Where:
  - $d(C_i, C_j)$ is the distance between the centroids of clusters $i$ and $j$.
  - $d(C_k)$ is the intra-cluster distance of cluster $k$.

- **Within-Cluster Sum of Squares (WCSS)**: Also known as inertia, it measures the compactness of the clusters, i.e., how tightly grouped the data points are within clusters. Lower values indicate better clustering.

  $ WCSS = \sum_{k=1}^{K} \sum_{i \in C_k} \left \| x_i - \mu_k \right \|^2 $

  Where:
  - $C_k$ is the set of points in cluster $k$.
  - $\mu_k$ is the centroid of cluster $k$.

- **Calinski-Harabasz Index** (Variance Ratio Criterion): Measures the ratio of between-cluster dispersion to within-cluster dispersion. Higher values indicate better-defined clusters.

  $ CH = \frac{\text{Tr}(B_k)}{\text{Tr}(W_k)} \times \frac{n - k}{k - 1} $

  Where:
  - $\text{Tr}(B_k)$ is the trace of the between-cluster dispersion matrix.
  - $\text{Tr}(W_k)$ is the trace of the within-cluster dispersion matrix.
  - $n$ is the number of data points, and $k$ is the number of clusters.

### 2. **External Evaluation Metrics**

These metrics require ground truth labels and evaluate how well the clustering results match the actual labeled data.

- **Adjusted Rand Index (ARI)**: Measures the similarity between the true labels and the predicted clusters, adjusted for chance. ARI ranges from -1 to 1, where 1 means perfect agreement, 0 means random labeling, and negative values indicate worse-than-random clustering.

  $ ARI = \frac{\text{RI} - E(\text{RI})}{\max(\text{RI}) - E(\text{RI})} $

  Where:
  - $\text{RI}$ is the Rand Index, which measures the pairwise agreement between predicted and true labels.
  - $E(\text{RI})$ is the expected Rand Index under random clustering.

- **Normalized Mutual Information (NMI)**: Measures the amount of information shared between the clustering and the ground truth. It is normalized between 0 and 1, with 1 indicating perfect alignment and 0 indicating no mutual information.

  $ NMI = \frac{I(Y; C)}{\sqrt{H(Y) H(C)}} $

  Where:
  - $I(Y; C)$ is the mutual information between the true labels $Y$ and predicted clusters $C$.
  - $H(Y)$ and $H(C)$ are the entropies of the true labels and the predicted clusters, respectively.

- **Fowlkes-Mallows Index (FMI)**: Measures the similarity between true clusters and predicted clusters as the geometric mean of pairwise precision and recall. It is between 0 and 1, with higher values indicating better clustering.
  $ FMI = \sqrt{\frac{TP}{TP + FP} \times \frac{TP}{TP + FN}} $

  Where:
  - $TP$, $FP$, and $FN$ refer to true positives, false positives, and false negatives when comparing pairs of points under the predicted clusters and the ground truth labels.

- **Homogeneity, Completeness, and V-Measure**:
  - **Homogeneity**: Measures whether each cluster contains only data points with the same ground truth label.
  - **Completeness**: Measures whether all data points with the same ground truth label are assigned to the same cluster.
  - **V-Measure**: The harmonic mean of homogeneity and completeness. It ranges from 0 to 1, with 1 indicating perfect clustering.

### 3. **Relative Evaluation Metrics**

These metrics compare the performance of different clustering configurations by varying parameters such as the number of clusters.

- **Elbow Method**: Plots the WCSS for different numbers of clusters and looks for the “elbow point,” where adding more clusters no longer significantly improves the compactness of the clusters.
- **Gap Statistic**: Compares the total intra-cluster variation for different numbers of clusters against a reference null distribution (typically generated by random sampling).

### 4. **Clustering-specific Metrics for Density-based Algorithms**

- **Cluster Density Metrics**: For density-based clustering algorithms like DBSCAN, which do not optimize a criterion like WCSS, the density of the clusters can be analyzed. This includes metrics such as the average distance between points in a cluster and their nearest neighbors.
- **Core Points and Noise Ratio**: Evaluates the ratio of core points (points inside dense regions) to noise points (points not assigned to any cluster) for density-based methods like DBSCAN.

### Summary of When to Use These Metrics

- **Internal metrics** like Silhouette Score and Davies-Bouldin Index are useful when you do not have ground truth labels and want to assess the clustering structure.
- **External metrics** like ARI, NMI, and V-Measure are best when you have ground truth labels and want to compare the clustering results to the true class labels.
- **Relative evaluation** (e.g., the Elbow Method) is useful for comparing different parameter settings, such as the number of clusters.
- **Density-based clustering metrics** are intended specifically for density-based algorithms like DBSCAN.

Each clustering algorithm has different strengths, so evaluating with a combination of these metrics gives a more comprehensive picture of performance.
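Most of the internal and external metrics above are implemented in scikit-learn. A minimal sketch, assuming scikit-learn is installed and using synthetic data from `make_blobs` with a K-Means clustering (the dataset and model choices here are illustrative, not prescribed by the metrics):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import (
    silhouette_score,
    davies_bouldin_score,
    calinski_harabasz_score,
    adjusted_rand_score,
    normalized_mutual_info_score,
    fowlkes_mallows_score,
    homogeneity_completeness_v_measure,
)

# Synthetic data with known labels, so both metric families can be shown.
X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
labels = km.labels_

# Internal metrics: need only the data and the predicted labels.
print("WCSS (inertia):    ", km.inertia_)                         # lower is better
print("Silhouette:        ", silhouette_score(X, labels))         # higher is better
print("Davies-Bouldin:    ", davies_bouldin_score(X, labels))     # lower is better
print("Calinski-Harabasz: ", calinski_harabasz_score(X, labels))  # higher is better

# External metrics: compare predicted labels against the ground truth.
print("ARI:", adjusted_rand_score(y_true, labels))
print("NMI:", normalized_mutual_info_score(y_true, labels))
print("FMI:", fowlkes_mallows_score(y_true, labels))
h, c, v = homogeneity_completeness_v_measure(y_true, labels)
print(f"Homogeneity={h:.3f}  Completeness={c:.3f}  V-measure={v:.3f}")
```

Note that the external metrics are invariant to label permutation (cluster 0 matching class 2 is fine), which is why they, rather than classification accuracy, are appropriate for clustering.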