Clustering is a non-parametric technique (Zelterman 2015, page 287)
that groups similar data points together and separates dissimilar
observations into different groups, or clusters. In hierarchical
clustering, clusters are created so that they have a predetermined
ordering, i.e. a hierarchy.
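To make the idea of a hierarchy concrete, the sketch below cuts one tree at two different levels and checks that the finer partition is nested inside the coarser one. It uses the built-in `USArrests` data as a stand-in (an assumption made only so the example runs on its own; it is not the fish data used in this lecture), and the object names are illustrative.

```r
# Stand-in data: scaled USArrests (assumption; not the lecture's fish data)
x <- scale(USArrests)
hc <- hclust(dist(x, method = "euclidean"), method = "ward.D2")

# Cut the same tree at two levels of the hierarchy
groups_2 <- cutree(hc, k = 2)
groups_3 <- cutree(hc, k = 3)

# Cross-tabulate the two partitions: rows are the 2-group clusters,
# columns are the 3-group clusters
nesting <- table(groups_2, groups_3)
print(nesting)

# TRUE if every 3-group cluster sits inside exactly one 2-group cluster
is_nested <- all(colSums(nesting > 0) == 1)
print(is_nested)
```

If the partitions are nested, each column of the table has non-zero counts in a single row, which is what distinguishes a hierarchical solution from two unrelated partitions.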
Clustering is not a typical statistical method in that it does
not test any hypothesis. It helps bring out features hidden in the
data; it is the user who decides whether these structures are
interesting and worth interpreting in ecological terms
(Borcard et al. 2011).
There are at least five families of clustering methods. For more
details, see Borcard et al. (2011).
The method used here, Ward's hierarchical clustering, is based on the
linear model criterion of least squares. The objective is to define
groups in such a way that the within-group sum of squares (i.e. the
squared error of ANOVA) is minimized. More details in Murtagh &
Legendre (2014).
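The within-group sum of squares can be computed by hand, which may help show what the criterion measures. The sketch below is only an illustration: `within_ss()` is a hypothetical helper written for this example, and the built-in `USArrests` data stand in for the fish data.

```r
# Hypothetical helper: sum of squared distances of each observation
# to the centroid of its own group
within_ss <- function(x, groups) {
  x <- as.matrix(x)
  sum(sapply(split(seq_len(nrow(x)), groups), function(idx) {
    block <- x[idx, , drop = FALSE]
    centered <- sweep(block, 2, colMeans(block))  # subtract group centroid
    sum(centered^2)
  }))
}

x <- scale(USArrests)  # stand-in data (assumption)
hc <- hclust(dist(x), method = "ward.D2")

# Within-group sum of squares for partitions into 1 to 6 groups
ks <- 1:6
wss_by_k <- sapply(ks, function(k) within_ss(x, cutree(hc, k = k)))
print(round(wss_by_k, 2))
```

With a single group the value equals the total sum of squares, and it can only decrease as the tree is cut into more groups, because each split separates observations that were previously pooled around a shared centroid.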
Load the libraries you need.
library(vegan)
library(factoextra)
Data
Load the data.
spe.fish <- read.csv("D:/OneDrive - University of Vermont/Curriculum/19_ Personal webpage/TropicalFreshwaterEcology/_pages/Lectures/data/Cluster_analysis_fish_PR.csv", header = TRUE, row.names = 1)
head(spe.fish)
Run the cluster analysis
# Ward hierarchical clustering
# Distance matrix: this must be calculated before running the cluster
# analysis. The most popular distance is the Euclidean distance
# (Zelterman 2015, page 293).
dist.fish <- dist(spe.fish, method = "euclidean")
# method: the algorithm used to cluster points and define groups
fish_clust <- hclust(dist.fish, method = "ward.D2")
# Display the dendrogram
plot(fish_clust, cex = 0.6, hang = -1)
Delineate the groups
plot(fish_clust, cex = 0.6, hang = -1,
main = "Cluster dendrogram", sub = NULL,
xlab = "", ylab = "Euclidean distance")
rect.hclust(fish_clust, k = 3, border = 2:6)
# abline(h = 100, col = 'red')
Identify which group each observation belongs to
cut_avg <- cutree(fish_clust, k = 3)
cut_avg
## 2008 2009 2010 2011 2012 2013 2014 2015a 2015b 2015c 2016
## 1 2 1 1 1 1 1 3 3 3 3
Number of observations in each group
table(cut_avg)
## cut_avg
## 1 2 3
## 6 1 4
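A tree can also be cut at a given height instead of a given number of groups, using the `h` argument of `cutree()`. The sketch below, again on the stand-in `USArrests` data (an assumption, not the fish data), picks a height that leaves three branches and checks that both calls return the same partition.

```r
x <- scale(USArrests)  # stand-in data (assumption)
hc <- hclust(dist(x), method = "ward.D2")

# Merge heights are stored in increasing order. Any height between the
# third-to-last and second-to-last merges leaves exactly 3 branches.
n_merges <- length(hc$height)
cut_height <- mean(hc$height[c(n_merges - 2, n_merges - 1)])

groups_by_k <- cutree(hc, k = 3)           # cut by number of groups
groups_by_h <- cutree(hc, h = cut_height)  # cut by height

# Both calls should describe the same partition
print(table(groups_by_k, groups_by_h))
same_partition <- all(groups_by_k == groups_by_h)
print(same_partition)
```

Cutting by height is convenient when a horizontal line drawn on the dendrogram is the natural way to describe the chosen level.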
To find the optimal number of clusters, it is recommended to choose it based on a criterion such as the Elbow method or the average silhouette method.
In the Elbow method, the location of a knee (elbow) in the plot is usually taken as an indicator of the appropriate number of clusters, because it means that adding another cluster does not improve the partition much. This method seems to suggest 4 clusters.
The Elbow method is sometimes ambiguous; an alternative is the average silhouette method.
fviz_nbclust(spe.fish, kmeans, method = "wss") +
geom_vline(xintercept = 5, linetype = 2) + # add line for better visualisation
labs(subtitle = "Elbow method") # add subtitle
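The curve behind an elbow plot can be reproduced with base R, which may clarify what is being plotted. This is a minimal sketch on the stand-in `USArrests` data (an assumption); the range of k and the `nstart` value are arbitrary choices for the example.

```r
set.seed(123)          # kmeans uses random starting centers
x <- scale(USArrests)  # stand-in data (assumption)

# Total within-cluster sum of squares for k = 1 to 8
ks <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# Elbow plot: look for the k after which the curve flattens
plot(ks, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The value at k = 1 is the total sum of squares of the data, and the elbow is the point where further increases in k reduce it only slightly.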
The Silhouette method measures the quality of a clustering and determines how well each point lies within its cluster.
fviz_nbclust(spe.fish, kmeans, method = "silhouette") +
labs(subtitle = "Silhouette method")
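The average silhouette width can also be computed directly for partitions taken from a Ward tree, rather than from k-means. The sketch below assumes the `cluster` package is available (it is normally installed with R) and uses the stand-in `USArrests` data. For each observation, the silhouette width compares its mean distance to its own cluster with its mean distance to the nearest other cluster; values close to 1 indicate a well-placed observation.

```r
library(cluster)  # provides silhouette()

x <- scale(USArrests)  # stand-in data (assumption)
d <- dist(x)
hc <- hclust(d, method = "ward.D2")

# Average silhouette width for partitions into 2 to 8 groups
ks <- 2:8
avg_sil <- sapply(ks, function(k) {
  sil <- silhouette(cutree(hc, k = k), d)
  mean(sil[, "sil_width"])
})
names(avg_sil) <- ks
print(round(avg_sil, 3))

# k with the highest average silhouette width
best_k <- ks[which.max(avg_sil)]
print(best_k)
```

The range starts at 2 because the silhouette is not defined for a single cluster.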
Borcard, D., Gillet, F., & Legendre, P. (2011). Numerical ecology with R (Vol. 2, p. 688). New York: Springer.
Legendre, P., & Legendre, L. (2012). Numerical ecology. Elsevier.
Murtagh, F., & Legendre, P. (2014). Ward’s hierarchical agglomerative clustering method: which algorithms implement Ward’s criterion? Journal of Classification, 31(3), 274-295.
Zelterman, D. (2015). Applied multivariate statistics with R. Switzerland: Springer International Publishing.