Basic concepts of cluster analysis

Clustering is a non-parametric technique (Zelterman 2015, page 287) that 1) groups similar observations together and 2) separates dissimilar observations into different groups, or clusters. In hierarchical clustering, the clusters are created so that they have a predetermined ordering, i.e. a hierarchy.


Clustering is not a typical statistical method in that it does not test any hypothesis. Clustering helps bring out some features hidden in the data; it is the user who decides if these structures are interesting and worth interpreting in ecological terms (Borcard et al. 2011).


There are at least five families of clustering methods; see Borcard et al. (2011) for details.


Ward’s Minimum Variance Clustering

This method is based on the least-squares criterion of the linear model. The objective is to define groups in such a way that the within-group sum of squares (i.e. the squared error of ANOVA) is minimized. See Murtagh & Legendre (2014) for details.
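The least-squares criterion described above can be written out explicitly. The notation here is an assumption made for illustration (K groups C_1, ..., C_K, observation vectors x_i, group centroids \bar{x}_k); it is the standard textbook form of the within-group sum of squares and is not taken from the cited sources.

```latex
% Within-group sum of squares for a partition into K groups
E = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \bar{x}_k \rVert^2

% Increase in E when two groups A and B (sizes n_A and n_B) are merged
\Delta E(A, B) = \frac{n_A \, n_B}{n_A + n_B} \, \lVert \bar{x}_A - \bar{x}_B \rVert^2
```

At each step, the agglomerative procedure merges the pair of groups with the smallest increase in E.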


Step 1

Load the libraries you need.

library(vegan)       # community ecology functions
library(factoextra)  # fviz_nbclust(), used below to choose the number of clusters


Step 2

Load the data.

# Read the fish data; the first column of the CSV supplies the row names
spe.fish <- read.csv("D:/OneDrive - University of Vermont/Curriculum/19_ Personal webpage/TropicalFreshwaterEcology/_pages/Lectures/data/Cluster_analysis_fish_PR.csv", header = TRUE, row.names = 1)
# Preview the first rows
head(spe.fish)


Step 1. Cluster

Run the cluster analysis.

# Ward hierarchical clustering
# Distance matrix: this must be calculated before running the cluster analysis.
# The most popular distance is the Euclidean distance (Zelterman 2015, page 293).
dist.fish <- dist(spe.fish, method = "euclidean")

# method: the algorithm used to cluster points and define groups
fish_clust <- hclust(dist.fish, method = "ward.D2")

# Display the dendrogram
plot(fish_clust, cex = 0.6, hang = -1)


Step 2. Cluster

Delineate the groups.

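# Redraw the dendrogram with a title and axis labels
plot(fish_clust, cex = 0.6, hang = -1,
     main = "Cluster dendrogram", sub = NULL,
     xlab = "", ylab = "Euclidean distance")

# Outline the k = 3 groups with coloured rectangles
rect.hclust(fish_clust, k = 3, border = 2:6)

# Optional: draw a horizontal cut line at height 100
# abline(h = 100, col = 'red')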
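Groups can also be delineated by cutting the tree at a height rather than by asking for a fixed number of groups, which is what the commented-out abline() call hints at. The sketch below is a minimal illustration on hypothetical toy data (the matrix x and its labels s1 to s9 are invented for this example, not the fish data). It relies on the merge heights stored in the hclust object being in increasing order, as they are for Ward's method.

```r
# Hypothetical toy data: three well-separated groups of three points each
x <- rbind(c(0, 0),   c(0, 1),   c(1, 0),
           c(10, 10), c(10, 11), c(11, 10),
           c(20, 0),  c(20, 1),  c(21, 0))
rownames(x) <- paste0("s", 1:9)

fit <- hclust(dist(x, method = "euclidean"), method = "ward.D2")

n <- nrow(x)
k <- 3

# fit$height holds the n - 1 merge heights in increasing order.
# Cutting anywhere between the merge that leaves k groups and the
# next merge (which would leave k - 1 groups) yields k groups.
h_cut <- mean(fit$height[c(n - k, n - k + 1)])

by_height <- cutree(fit, h = h_cut)  # cut at a height
by_k      <- cutree(fit, k = k)      # cut into a fixed number of groups

# Show the cut line on the dendrogram
plot(fit, cex = 0.6, hang = -1)
abline(h = h_cut, col = "red", lty = 2)
```

Both calls to cutree() return the same membership vector here; cutting by height is convenient when the dendrogram shows a clear gap between merge heights.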


Step 2.1

Identify the group to which each observation belongs.

cut_avg <- cutree(fish_clust, k = 3)
cut_avg
##  2008  2009  2010  2011  2012  2013  2014 2015a 2015b 2015c  2016 
##     1     2     1     1     1     1     1     3     3     3     3


Step 2.2

Number of observations in each group.

table(cut_avg)
## cut_avg
## 1 2 3 
## 6 1 4
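To see the members of each group side by side, the membership vector can be split by group. The sketch below retypes the cut_avg values printed above by hand so that it runs without the data file; the object name members is introduced here only for illustration.

```r
# Membership vector copied from the cutree() output shown above
cut_avg <- c("2008" = 1, "2009" = 2, "2010" = 1, "2011" = 1, "2012" = 1,
             "2013" = 1, "2014" = 1, "2015a" = 3, "2015b" = 3, "2015c" = 3,
             "2016" = 3)

# List the sample labels that fall in each group
members <- split(names(cut_avg), cut_avg)
members

# Group sizes, equivalent to table(cut_avg)
lengths(members)
```

With the real analysis, the retyped vector is not needed: split(names(cut_avg), cut_avg) can be applied directly to the object returned by cutree().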


Optimal number of clusters

To find the optimal number of clusters, it is recommended to base the choice on one of the following methods:


Elbow method

The location of an elbow (knee) in the plot is usually taken as an indicator of the appropriate number of clusters, because beyond that point adding another cluster does not improve the partition much. For these data, the elbow is marked at 5 clusters (the dashed vertical line in the plot).

The elbow method is sometimes ambiguous; an alternative is the average silhouette method.

fviz_nbclust(spe.fish, kmeans, method = "wss") +
  geom_vline(xintercept = 5, linetype = 2) + # add line for better visualisation
  labs(subtitle = "Elbow method") # add subtitle
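To make the quantity behind an elbow plot concrete, the sketch below computes the total within-group sum of squares by hand for a range of k. Two assumptions are made: the data are hypothetical toy points (not the fish data), and the partitions come from cutting a Ward tree rather than from k-means as in the fviz_nbclust() call above, so this is a hierarchical analogue of that plot and not a reproduction of it. The helper within_ss() is written for this example.

```r
# Hypothetical toy data: three well-separated groups of three points each
x <- rbind(c(0, 0),   c(0, 1),   c(1, 0),
           c(10, 10), c(10, 11), c(11, 10),
           c(20, 0),  c(20, 1),  c(21, 0))
rownames(x) <- paste0("s", 1:9)

# Total within-group sum of squares for a given partition:
# squared deviations of each point from its own group centroid
within_ss <- function(x, groups) {
  sum(sapply(split(seq_len(nrow(x)), groups), function(idx) {
    centered <- scale(x[idx, , drop = FALSE], center = TRUE, scale = FALSE)
    sum(centered^2)
  }))
}

fit <- hclust(dist(x), method = "ward.D2")

# Cut the tree into k = 1, ..., 8 groups and record the criterion each time
ks  <- 1:8
wss <- sapply(ks, function(k) within_ss(x, cutree(fit, k = k)))

# Elbow plot: look for the k after which the curve flattens
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-group sum of squares")
```

For these toy points the curve drops sharply up to k = 3 and is nearly flat afterwards, which is the pattern the elbow method looks for.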


Silhouette method

The silhouette method measures the quality of a clustering by determining how well each point lies within its cluster. A high average silhouette width indicates a good clustering.

fviz_nbclust(spe.fish, kmeans, method = "silhouette") +
  labs(subtitle = "Silhouette method")
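The silhouette width of a point compares its mean distance to the other members of its own cluster (a) with its mean distance to the nearest other cluster (b), as (b - a) / max(a, b). The sketch below computes the average width by hand in base R. It rests on several assumptions: the data are hypothetical toy points, the partitions come from a Ward tree instead of k-means, singletons are given a width of 0 (a common convention), and avg_silhouette() is a helper written for this example, not a function from factoextra.

```r
# Hypothetical toy data: three well-separated groups of three points each
x <- rbind(c(0, 0),   c(0, 1),   c(1, 0),
           c(10, 10), c(10, 11), c(11, 10),
           c(20, 0),  c(20, 1),  c(21, 0))
rownames(x) <- paste0("s", 1:9)

# Average silhouette width of a partition (requires at least two groups)
avg_silhouette <- function(d, groups) {
  d <- as.matrix(d)
  n <- nrow(d)
  widths <- numeric(n)
  for (i in seq_len(n)) {
    own <- groups == groups[i]
    if (sum(own) == 1) {
      widths[i] <- 0  # singleton cluster: width set to 0 by convention
      next
    }
    # a: mean distance to the other members of the same cluster
    a <- sum(d[i, own]) / (sum(own) - 1)
    # b: smallest mean distance to the members of any other cluster
    b <- min(tapply(d[i, !own], groups[!own], mean))
    widths[i] <- (b - a) / max(a, b)
  }
  mean(widths)
}

d   <- dist(x)
fit <- hclust(d, method = "ward.D2")

# Evaluate every k from 2 to n - 1 and keep the best one
ks     <- 2:(nrow(x) - 1)
sil    <- sapply(ks, function(k) avg_silhouette(d, cutree(fit, k = k)))
best_k <- ks[which.max(sil)]
best_k
```

Unlike the elbow plot, this criterion has a maximum, so the suggested number of clusters can be read off directly.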


References

Borcard, D., Gillet, F., & Legendre, P. (2011). Numerical ecology with R. New York: Springer.

Legendre, P., & Legendre, L. (2012). Numerical ecology. Elsevier.

Murtagh, F., & Legendre, P. (2014). Ward’s hierarchical agglomerative clustering method: which algorithms implement Ward’s criterion? Journal of Classification, 31(3), 274-295.

Zelterman, D. (2015). Applied multivariate statistics with R. Switzerland: Springer International Publishing.