PCA is an ordination used to reduce a data set with
n cases (e.g., study sites) and
p variables (e.g., environmental variables) to
a smaller number of synthetic variables that represent most of the
information in the set of original data (McCune et al.Ā 2002).
El PCA es una ordenaciĆ³n que se utiliza para reducir un conjunto de datos con n casos (e.g., sitios de estudio) y p variables (e.g., variables ambientales) a un nĆŗmero menor de variables sintĆ©ticas que representan la mayor parte de la informaciĆ³n en el conjunto de datos original (McCune et al.Ā 2002).
Some blogs (e.g., Blog
1) recommend using the āprcompā package to
run PCAs, since it uses the singular value decomposition (SVD),
which has slightly better numerical accuracy compared to other packages
like āprincompā.
Algunos blogs (e.g., Blog 1) recomiendan usar el paquete āprcompā para ejecutar PCA, ya que usa la descomposiciĆ³n de valores singulares (SVD) que tiene una precisiĆ³n numĆ©rica ligeramente mejor comparado con āprincompā.
First we will define these two concepts that are fundamental in
statistical analysis and modeling:
Standardization (or Z-score) refers to
making the empirical distribution Yā¼N(0,1). That is, make the variables
follow a standard normal distribution (mean
(Ī¼) = 0 and standard deviation
(Ļ) = 1).
Standardization of the samples are calculated
as follows: \[z = (x-\mu)/\sigma\]
Normalization (or Scaling) refers to
scaling the variable in a certain range, usually between 0 and
1.
There is more information in this Blog
2
Thus, we must remember that the main objective of a PCA is to find
directions that maximize the variance. If the variance of one variable
is greater than that of others, we make the PCA components biased in
that direction.
So the best we can do is make the variance of all the variables the
same. We do this by Standardizing all the
variables. In contrast, Normalization does not
make all variables have the same variance.
Primero vamos a definir estos dos conceptos que son
fundamentales en anĆ”lisis estadĆstico y modelaje:
Load libraries.
Cargar las librerias que necesitas.
libraries <- c('cluster','data.table','dplyr','factoextra',
'ggfortify','ggplot2','missForest','missMDA','labdsv')
lapply(libraries, require, character.only = TRUE)
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] TRUE
##
## [[5]]
## [1] TRUE
##
## [[6]]
## [1] TRUE
##
## [[7]]
## [1] TRUE
##
## [[8]]
## [1] TRUE
##
## [[9]]
## [1] TRUE
Data
Cargar los datos.
channel <- read.csv("D:/OneDrive - University of Vermont/Curriculum/19_ Personal webpage/TropicalFreshwaterEcology/_pages/Lectures/data/PCA_channel_form.csv",header=TRUE)
head(channel)
## Forma NAN_Am NADBO NAtemp nit NASatO2 Elevacion Ancho Velocidad Rocas
## 1 Trapecio 0.03 2.38 27.33 0.35 92.04 23 16 5 20
## 2 Trapecio 0.03 2.95 27.81 NA 100.03 31 11 0 20
## 3 Trapecio 0.03 3.13 24.27 NA 96.82 35 14 10 30
## 4 Trapecio 1.15 4.73 27.06 7.54 64.35 9 5 2 0
## 5 Trapecio 0.50 8.16 26.60 NA 110.39 43 11 9 10
## 6 Trapecio 0.53 8.57 23.82 NA 106.09 23 11 5 20
## Canto grava arena Limo
## 1 25 30 20 0
## 2 45 20 15 0
## 3 30 20 10 0
## 4 0 0 50 50
## 5 40 10 20 20
## 6 60 20 0 0
Vamos a examinar los datos
summary(channel)
## Forma NAN_Am NADBO NAtemp
## Length:138 Min. :0.0200 Min. : 1.310 Min. :14.67
## Class :character 1st Qu.:0.0400 1st Qu.: 1.930 1st Qu.:24.30
## Mode :character Median :0.2150 Median : 3.000 Median :26.05
## Mean :0.3201 Mean : 6.164 Mean :25.84
## 3rd Qu.:0.5000 3rd Qu.: 8.585 3rd Qu.:27.70
## Max. :1.5000 Max. :34.900 Max. :32.18
## NA's :35
## nit NASatO2 Elevacion Ancho
## Min. : 0.00 Min. : 23.43 Min. : 3.00 Min. : 1.000
## 1st Qu.: 0.40 1st Qu.: 86.24 1st Qu.: 25.25 1st Qu.: 2.000
## Median : 0.92 Median : 94.59 Median : 53.00 Median : 3.000
## Mean : 12.00 Mean : 91.05 Mean : 230.89 Mean : 3.875
## 3rd Qu.: 1.62 3rd Qu.:100.52 3rd Qu.: 269.25 3rd Qu.: 3.000
## Max. :324.11 Max. :122.73 Max. :2370.00 Max. :16.000
## NA's :57 NA's :2
## Velocidad Rocas Canto grava
## Min. : 0.000 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 3.000 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 3.75
## Median :11.000 Median :10.00 Median :25.00 Median :20.00
## Mean : 9.169 Mean :16.27 Mean :25.46 Mean :17.86
## 3rd Qu.:14.000 3rd Qu.:30.00 3rd Qu.:40.00 3rd Qu.:25.00
## Max. :16.000 Max. :90.00 Max. :80.00 Max. :80.00
## NA's :2 NA's :2 NA's :2 NA's :2
## arena Limo
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 10.00 1st Qu.: 0.00
## Median : 15.00 Median : 7.50
## Mean : 19.83 Mean : 20.51
## 3rd Qu.: 25.00 3rd Qu.: 25.00
## Max. :100.00 Max. :100.00
## NA's :2 NA's :2
Muchas veces, los sensores se apagan o funcional mal, o no se pudo recolectar una muestra. Esto genera vacios. R los reconoce como NAs. Podemos llenar estos vacios imputando datos.
df1 <- dplyr::select(channel, Elevacion, Ancho, Velocidad, Rocas,
Canto, grava, arena, Limo)
df1_a <- missForest(df1, maxiter = 4, ntree = 100,
variablewise = TRUE, decreasing = FALSE, verbose=F, replace=TRUE,
classwt = NULL, cutoff = NULL, strata = NULL,
sampsize = NULL, nodesize = NULL, maxnodes = NULL,
xtrue = NA, parallelize = "no")
2.3 Seleccionamos otras variables de interes.
df2_a <- dplyr::select(channel, Elevacion, NAtemp, NASatO2)
Unimos las dos tablas
new_channel <- do.call("merge", c(lapply(list(df1_a$ximp, df2_a), data.frame,
row.names=NULL), by = 0, all = TRUE, sort = FALSE))[-1]
#remove 1 elevation
new_channel <- dplyr::select(new_channel, -Elevacion.y)
Ahora, corremos el PCA usando el dataframe que contiene las variables completas (i.e., sin NAs).
channel.pca <- prcomp(new_channel, center = TRUE, scale =TRUE)
summary(channel.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.8010 1.2814 1.1993 1.0875 0.92746 0.77087 0.70864
## Proportion of Variance 0.3244 0.1642 0.1438 0.1183 0.08602 0.05942 0.05022
## Cumulative Proportion 0.3244 0.4886 0.6324 0.7507 0.83666 0.89609 0.94631
## PC8 PC9 PC10
## Standard deviation 0.62774 0.37510 0.04674
## Proportion of Variance 0.03941 0.01407 0.00022
## Cumulative Proportion 0.98571 0.99978 1.00000
Exploremos los datos con los graficos default que tiene el paquete estadistico.
autoplot(channel.pca)
autoplot(channel.pca, data = channel, colour = 'Forma')
autoplot(channel.pca, data = channel, colour = 'Forma', loadings = TRUE)
autoplot(channel.pca, data = channel, colour = 'Forma', loadings = TRUE,
loadings.colour = 'blue',
loadings.label = TRUE, loadings.label.size = 3)
Vamos a ver la contribucion de cada una de las variables. Podemos ver cuanto explican los primero 3 PC
variance <- (channel.pca$sdev)^2
# Cargar los loadings
loadings <- channel.pca$rotation
round(loadings, 2)[ , 1:3]
## PC1 PC2 PC3
## Elevacion.x 0.38 0.35 -0.22
## Ancho -0.15 -0.48 -0.45
## Velocidad 0.35 0.13 0.42
## Rocas 0.37 0.16 -0.16
## Canto 0.34 -0.38 -0.09
## grava 0.11 -0.23 0.60
## arena -0.26 -0.16 -0.12
## Limo -0.39 0.41 -0.09
## NAtemp -0.37 -0.22 0.38
## NASatO2 0.30 -0.41 -0.10
Tambien, podemos ver cuanto explican todos los PCs
print(channel.pca)
## Standard deviations (1, .., p=10):
## [1] 1.80099889 1.28142364 1.19925760 1.08747167 0.92746374 0.77086804
## [7] 0.70864256 0.62773833 0.37510321 0.04673795
##
## Rotation (n x k) = (10 x 10):
## PC1 PC2 PC3 PC4 PC5
## Elevacion.x 0.3839303 0.3547755 -0.22140395 0.30538299 -0.22055655
## Ancho -0.1450036 -0.4793025 -0.44909356 0.09382497 -0.26086563
## Velocidad 0.3520939 0.1252229 0.42106249 0.01089237 0.02329108
## Rocas 0.3727511 0.1635361 -0.16444976 -0.15407803 0.55914350
## Canto 0.3375196 -0.3770555 -0.08709614 -0.32215810 -0.36958127
## grava 0.1083379 -0.2267439 0.59819672 0.36408176 -0.25505626
## arena -0.2594130 -0.1608611 -0.12443750 0.64315713 0.37798760
## Limo -0.3903427 0.4114782 -0.09133884 -0.28515261 -0.19455570
## NAtemp -0.3669218 -0.2202764 0.38095231 -0.33910778 0.26526008
## NASatO2 0.2978376 -0.4058111 -0.10286271 -0.16667591 0.34261720
## PC6 PC7 PC8 PC9 PC10
## Elevacion.x 0.084393222 -0.16897820 0.119785538 0.69698027 0.001498302
## Ancho 0.240092330 0.26262204 -0.555762735 0.18764912 -0.002220551
## Velocidad -0.480483420 0.18039704 -0.640900901 0.09072128 0.001630017
## Rocas 0.400678063 0.37954324 -0.054227661 -0.04267760 -0.401907113
## Canto -0.365542757 0.17118777 0.360579087 0.02730157 -0.449731444
## grava 0.488091529 -0.12108373 -0.007208548 -0.08093794 -0.349147590
## arena -0.411761083 0.01431205 0.089455195 0.06185671 -0.391455838
## Limo -0.011988579 -0.32599479 -0.288402488 0.01320258 -0.600870180
## NAtemp 0.047549853 0.12982079 0.106014531 0.67528052 -0.002779655
## NASatO2 -0.004413495 -0.74738375 -0.174922994 0.04066842 0.002838301
rownames(loadings) <- colnames(new_channel)
scores <- channel.pca$x
3.3 Ver graficamente lo que explica cada axis.
layout(matrix(1:2, ncol=2))
screeplot(channel.pca)
screeplot(channel.pca, type="lines")
varPercent <- variance/sum(variance) * 100
barplot(varPercent, xlab='PC', ylab='Percent Variance',
names.arg=1:length(varPercent), las=1, col='gray') +
abline(h=1/ncol(new_channel)*100, col="red")
## numeric(0)
4 Otras formas de visualizar los datos.
fviz_eig(channel.pca)
fviz_pca_biplot(channel.pca, repel = TRUE,
col.var = "#2E9FDF", # Variables color
col.ind = "#696969", # Individuals color
select.var = list(contrib = 5))
## Warning: ggrepel: 28 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
4.1 Con las elipses.
fviz_pca_ind(channel.pca, label="none", habillage=channel$Forma,
addEllipses=TRUE, ellipse.level=0.95, palette = "Dark2")
4.2
PCA <- fviz_pca_biplot(channel.pca, label = "var", habillage=channel$Forma,
addEllipses=TRUE, ellipse.level=0.95,
ggtheme = theme_minimal())
# PCA + ggsave("PCA.jpg", width=11, height=8.5)
5. Convertirlo en una data.frame para trabajarlo en ggplot2
data <- data.table(PC1=channel.pca$x[,1], PC2=channel.pca$x[,2], Forma= channel[,1])
data <- data[order(channel$Forma),]
ggplot(data, aes(x=PC1,y=PC2)) +
geom_point(size = 2, aes(color=Forma))
McCune, B., Grace, J. B., & Urban, D. L. (2002). Analysis of ecological communities (Vol. 28). Gleneden Beach, OR: MjM software design.