Principal Component Analysis

Basic concepts of PCAs

Conceptos básicos en PCAs

PCA is an ordination used to reduce a data set with n cases (e.g., study sites) and p variables (e.g., environmental variables) to a smaller number of synthetic variables that represent most of the information in the set of original data (McCune et al. 2002).

El PCA es una ordenación que se utiliza para reducir un conjunto de datos con n casos (e.g., sitios de estudio) y p variables (e.g., variables ambientales) a un número menor de variables sintéticas que representan la mayor parte de la información en el conjunto de datos original (McCune et al. 2002).

“prcomp” package

Some blogs (e.g., Blog 1) recommend using the “prcomp” package to run PCAs, since it uses the singular value decomposition (SVD), which has slightly better numerical accuracy compared to other packages like “princomp”.

Algunos blogs (e.g., Blog 1) recomiendan usar el paquete “prcomp” para ejecutar PCA, ya que usa la descomposición de valores singulares (SVD) que tiene una precisión numérica ligeramente mejor comparado con “princomp”.

Standardization and/or Normalization

First we will define these two concepts that are fundamental in statistical analysis and modeling:

Standardization (or Z-score) refers to making the empirical distribution Y∼N(0,1). That is, make the variables follow a standard normal distribution (mean (μ) = 0 and standard deviation (σ) = 1).
Standardization of the samples are calculated as follows: \[z = (x-\mu)/\sigma\]

Normalization (or Scaling) refers to scaling the variable in a certain range, usually between 0 and 1.

There is more information in this Blog 2

Thus, we must remember that the main objective of a PCA is to find directions that maximize the variance. If the variance of one variable is greater than that of others, we make the PCA components biased in that direction.

So the best we can do is make the variance of all the variables the same. We do this by Standardizing all the variables. In contrast, Normalization does not make all variables have the same variance.

Primero vamos a definir estos dos conceptos que son fundamentales en análisis estadístico y modelaje:

Correlation vs Covariance

Step 1

Load libraries.
Cargar las librerias que necesitas.

libraries <- c('cluster','data.table','dplyr','factoextra',
               'ggfortify','ggplot2','missForest','missMDA','labdsv')

lapply(libraries, require, character.only = TRUE)

## [[1]]
## [1] TRUE
## 
## [[2]]
## [1] TRUE
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] TRUE
## 
## [[5]]
## [1] TRUE
## 
## [[6]]
## [1] TRUE
## 
## [[7]]
## [1] TRUE
## 
## [[8]]
## [1] TRUE
## 
## [[9]]
## [1] TRUE

Step 2

Data
Cargar los datos.

channel <- read.csv("D:/OneDrive - University of Vermont/Curriculum/19_ Personal webpage/TropicalFreshwaterEcology/_pages/Lectures/data/PCA_channel_form.csv",header=TRUE)
head(channel)

##      Forma NAN_Am NADBO NAtemp  nit NASatO2 Elevacion Ancho Velocidad Rocas
## 1 Trapecio   0.03  2.38  27.33 0.35   92.04        23    16         5    20
## 2 Trapecio   0.03  2.95  27.81   NA  100.03        31    11         0    20
## 3 Trapecio   0.03  3.13  24.27   NA   96.82        35    14        10    30
## 4 Trapecio   1.15  4.73  27.06 7.54   64.35         9     5         2     0
## 5 Trapecio   0.50  8.16  26.60   NA  110.39        43    11         9    10
## 6 Trapecio   0.53  8.57  23.82   NA  106.09        23    11         5    20
##   Canto grava arena Limo
## 1    25    30    20    0
## 2    45    20    15    0
## 3    30    20    10    0
## 4     0     0    50   50
## 5    40    10    20   20
## 6    60    20     0    0

Step 2.1

Vamos a examinar los datos

summary(channel)

##     Forma               NAN_Am           NADBO            NAtemp     
##  Length:138         Min.   :0.0200   Min.   : 1.310   Min.   :14.67  
##  Class :character   1st Qu.:0.0400   1st Qu.: 1.930   1st Qu.:24.30  
##  Mode  :character   Median :0.2150   Median : 3.000   Median :26.05  
##                     Mean   :0.3201   Mean   : 6.164   Mean   :25.84  
##                     3rd Qu.:0.5000   3rd Qu.: 8.585   3rd Qu.:27.70  
##                     Max.   :1.5000   Max.   :34.900   Max.   :32.18  
##                                      NA's   :35                      
##       nit            NASatO2         Elevacion           Ancho       
##  Min.   :  0.00   Min.   : 23.43   Min.   :   3.00   Min.   : 1.000  
##  1st Qu.:  0.40   1st Qu.: 86.24   1st Qu.:  25.25   1st Qu.: 2.000  
##  Median :  0.92   Median : 94.59   Median :  53.00   Median : 3.000  
##  Mean   : 12.00   Mean   : 91.05   Mean   : 230.89   Mean   : 3.875  
##  3rd Qu.:  1.62   3rd Qu.:100.52   3rd Qu.: 269.25   3rd Qu.: 3.000  
##  Max.   :324.11   Max.   :122.73   Max.   :2370.00   Max.   :16.000  
##  NA's   :57                                          NA's   :2       
##    Velocidad          Rocas           Canto           grava      
##  Min.   : 0.000   Min.   : 0.00   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 3.000   1st Qu.: 0.00   1st Qu.: 0.00   1st Qu.: 3.75  
##  Median :11.000   Median :10.00   Median :25.00   Median :20.00  
##  Mean   : 9.169   Mean   :16.27   Mean   :25.46   Mean   :17.86  
##  3rd Qu.:14.000   3rd Qu.:30.00   3rd Qu.:40.00   3rd Qu.:25.00  
##  Max.   :16.000   Max.   :90.00   Max.   :80.00   Max.   :80.00  
##  NA's   :2        NA's   :2       NA's   :2       NA's   :2      
##      arena             Limo       
##  Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 10.00   1st Qu.:  0.00  
##  Median : 15.00   Median :  7.50  
##  Mean   : 19.83   Mean   : 20.51  
##  3rd Qu.: 25.00   3rd Qu.: 25.00  
##  Max.   :100.00   Max.   :100.00  
##  NA's   :2        NA's   :2

Step 2.2

Muchas veces, los sensores se apagan o funcional mal, o no se pudo recolectar una muestra. Esto genera vacios. R los reconoce como NAs. Podemos llenar estos vacios imputando datos.

df1 <- dplyr::select(channel, Elevacion, Ancho, Velocidad, Rocas, 
                     Canto, grava, arena, Limo)

df1_a <- missForest(df1, maxiter = 4, ntree = 100,
                    variablewise = TRUE, decreasing = FALSE, verbose=F, replace=TRUE,
                    classwt = NULL, cutoff = NULL, strata = NULL,
                    sampsize = NULL, nodesize = NULL, maxnodes = NULL,
                    xtrue = NA, parallelize = "no")

Step 2.3

2.3 Seleccionamos otras variables de interes.

df2_a <- dplyr::select(channel, Elevacion, NAtemp, NASatO2)

Step 2.4

Unimos las dos tablas

new_channel <- do.call("merge", c(lapply(list(df1_a$ximp, df2_a), data.frame, 
                                         row.names=NULL), by = 0, all = TRUE, sort = FALSE))[-1]
#remove 1 elevation
new_channel <- dplyr::select(new_channel, -Elevacion.y)

Step 3

Ahora, corremos el PCA usando el dataframe que contiene las variables completas (i.e., sin NAs).

channel.pca <- prcomp(new_channel, center = TRUE, scale =TRUE)
summary(channel.pca)

## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.8010 1.2814 1.1993 1.0875 0.92746 0.77087 0.70864
## Proportion of Variance 0.3244 0.1642 0.1438 0.1183 0.08602 0.05942 0.05022
## Cumulative Proportion  0.3244 0.4886 0.6324 0.7507 0.83666 0.89609 0.94631
##                            PC8     PC9    PC10
## Standard deviation     0.62774 0.37510 0.04674
## Proportion of Variance 0.03941 0.01407 0.00022
## Cumulative Proportion  0.98571 0.99978 1.00000

Step 3.1

Exploremos los datos con los graficos default que tiene el paquete estadistico.

autoplot(channel.pca)

autoplot(channel.pca, data = channel, colour = 'Forma')

autoplot(channel.pca, data = channel, colour = 'Forma', loadings = TRUE)

autoplot(channel.pca, data = channel, colour = 'Forma', loadings = TRUE,
         loadings.colour = 'blue', 
         loadings.label = TRUE, loadings.label.size = 3)

Step 4

Vamos a ver la contribucion de cada una de las variables. Podemos ver cuanto explican los primero 3 PC

variance <- (channel.pca$sdev)^2

# Cargar los loadings
loadings <- channel.pca$rotation
round(loadings, 2)[ , 1:3]

##               PC1   PC2   PC3
## Elevacion.x  0.38  0.35 -0.22
## Ancho       -0.15 -0.48 -0.45
## Velocidad    0.35  0.13  0.42
## Rocas        0.37  0.16 -0.16
## Canto        0.34 -0.38 -0.09
## grava        0.11 -0.23  0.60
## arena       -0.26 -0.16 -0.12
## Limo        -0.39  0.41 -0.09
## NAtemp      -0.37 -0.22  0.38
## NASatO2      0.30 -0.41 -0.10

Tambien, podemos ver cuanto explican todos los PCs

print(channel.pca)

## Standard deviations (1, .., p=10):
##  [1] 1.80099889 1.28142364 1.19925760 1.08747167 0.92746374 0.77086804
##  [7] 0.70864256 0.62773833 0.37510321 0.04673795
## 
## Rotation (n x k) = (10 x 10):
##                    PC1        PC2         PC3         PC4         PC5
## Elevacion.x  0.3839303  0.3547755 -0.22140395  0.30538299 -0.22055655
## Ancho       -0.1450036 -0.4793025 -0.44909356  0.09382497 -0.26086563
## Velocidad    0.3520939  0.1252229  0.42106249  0.01089237  0.02329108
## Rocas        0.3727511  0.1635361 -0.16444976 -0.15407803  0.55914350
## Canto        0.3375196 -0.3770555 -0.08709614 -0.32215810 -0.36958127
## grava        0.1083379 -0.2267439  0.59819672  0.36408176 -0.25505626
## arena       -0.2594130 -0.1608611 -0.12443750  0.64315713  0.37798760
## Limo        -0.3903427  0.4114782 -0.09133884 -0.28515261 -0.19455570
## NAtemp      -0.3669218 -0.2202764  0.38095231 -0.33910778  0.26526008
## NASatO2      0.2978376 -0.4058111 -0.10286271 -0.16667591  0.34261720
##                      PC6         PC7          PC8         PC9         PC10
## Elevacion.x  0.084393222 -0.16897820  0.119785538  0.69698027  0.001498302
## Ancho        0.240092330  0.26262204 -0.555762735  0.18764912 -0.002220551
## Velocidad   -0.480483420  0.18039704 -0.640900901  0.09072128  0.001630017
## Rocas        0.400678063  0.37954324 -0.054227661 -0.04267760 -0.401907113
## Canto       -0.365542757  0.17118777  0.360579087  0.02730157 -0.449731444
## grava        0.488091529 -0.12108373 -0.007208548 -0.08093794 -0.349147590
## arena       -0.411761083  0.01431205  0.089455195  0.06185671 -0.391455838
## Limo        -0.011988579 -0.32599479 -0.288402488  0.01320258 -0.600870180
## NAtemp       0.047549853  0.12982079  0.106014531  0.67528052 -0.002779655
## NASatO2     -0.004413495 -0.74738375 -0.174922994  0.04066842  0.002838301

rownames(loadings) <- colnames(new_channel)
scores <- channel.pca$x

3.3 Ver graficamente lo que explica cada axis.

layout(matrix(1:2, ncol=2))
screeplot(channel.pca)
screeplot(channel.pca, type="lines")

varPercent <- variance/sum(variance) * 100
barplot(varPercent, xlab='PC', ylab='Percent Variance',
names.arg=1:length(varPercent), las=1, col='gray') +
abline(h=1/ncol(new_channel)*100, col="red")

## numeric(0)

4 Otras formas de visualizar los datos.

fviz_eig(channel.pca)

fviz_pca_biplot(channel.pca, repel = TRUE,
                col.var = "#2E9FDF", # Variables color
                col.ind = "#696969",  # Individuals color
                select.var = list(contrib = 5))

## Warning: ggrepel: 28 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

4.1 Con las elipses.

fviz_pca_ind(channel.pca, label="none", habillage=channel$Forma,
     addEllipses=TRUE, ellipse.level=0.95, palette = "Dark2")

4.2

PCA <- fviz_pca_biplot(channel.pca, label = "var", habillage=channel$Forma,
               addEllipses=TRUE, ellipse.level=0.95,
               ggtheme = theme_minimal())

# PCA + ggsave("PCA.jpg", width=11, height=8.5)

5. Convertirlo en una data.frame para trabajarlo en ggplot2

data <- data.table(PC1=channel.pca$x[,1], PC2=channel.pca$x[,2], Forma=  channel[,1])
data <- data[order(channel$Forma),]

ggplot(data, aes(x=PC1,y=PC2)) + 
  geom_point(size = 2, aes(color=Forma))

References

McCune, B., Grace, J. B., & Urban, D. L. (2002). Analysis of ecological communities (Vol. 28). Gleneden Beach, OR: MjM software design.