class: center, middle, inverse, title-slide

.title[
# Correlations and Partial Correlations in R
]
.author[
### Pablo E. Gutiérrez-Fonseca
]
.date[
### 2024-11-04
]

---
# Decision Flowchart
---
# What is Correlation?

- A correlation test quantifies how two variables change together.
  - Are the variables closely related?
  - To what extent does one variable predict the other?

---
# When to Use Correlation?

- A correlation test is used to **evaluate** the **association between two or more variables**.
- You have two continuous variables measured from the same observational unit.
- You want to determine if the variables are related (vary in unison).

---
# Key Assumptions:

- Independent, random samples.
- Normality:
  - Pearson correlation (parametric): Assumes both variables follow a normal distribution.
  - Spearman correlation (nonparametric): Does not assume normality.
- Linear relationship: Correlation looks for a linear association between the variables.

---
# Describing Correlations

- **Strength**: Indicates how closely the variables are related.
  - Categories: Weak, Moderate, Strong.
- **Nature**: Shows the direction of the relationship.
  - Direct (Positive): As one variable increases, so does the other.
  - Indirect (Negative): As one variable increases, the other decreases.
- **Significance**: Assesses whether the correlation is meaningful.
  - Based on the p-value, which tells us whether the observed relationship is likely due to chance.

---
# Hypothesis testing with Correlations

- Null Hypothesis ( `\(H_0\)` )
  - **`\(H_0\)`**: There is no correlation in the population.
  - For a two-tailed test, this means there is **no linear relationship** between x and y.

$$ H_0: \rho = 0 $$

  - Decision: fail to reject `\(H_0\)` when `\(p > 0.05\)`.

- Alternative Hypothesis ( `\(H_1\)` )
  - **`\(H_1\)`**: There is a real, non-zero correlation in the population.
  - The alternative hypothesis suggests a **significant linear relationship** between x and y.

$$ H_1: \rho \neq 0 $$

  - Decision: reject `\(H_0\)` when `\(p < 0.05\)`.

---
# Computing Simple Correlations

- There are many kinds of correlation coefficients, but the most commonly used measure of correlation is **Pearson’s correlation (a parametric statistic)**.

`$$r_{XY} = \frac{n \sum XY - \left( \sum X \right) \left( \sum Y \right)}{\sqrt{\left[ n \sum X^2 - \left( \sum X \right)^2 \right] \left[ n \sum Y^2 - \left( \sum Y \right)^2 \right]}}$$`

---
# Computing Simple Correlations

- There are many kinds of correlation coefficients, but the most commonly used measure of correlation is **Pearson’s correlation (a parametric statistic)**.

`$$r_{XY} = \frac{n \sum XY - \left( \sum X \right) \left( \sum Y \right)}{\sqrt{\left[ n \sum X^2 - \left( \sum X \right)^2 \right] \left[ n \sum Y^2 - \left( \sum Y \right)^2 \right]}}$$`

Where:

- `\(r\)` = Pearson correlation coefficient
- `\(n\)` = Number of pairs of observations
- `\(\sum XY\)` = Sum of the products of paired values
- `\(\sum X\)` = Sum of the `\(x\)` scores
- `\(\sum Y\)` = Sum of the `\(y\)` scores
- `\(\sum X^2\)` = Sum of the squared `\(x\)` scores
- `\(\sum Y^2\)` = Sum of the squared `\(y\)` scores

---
# Nature and Interpretation of `\(r\)` Value

.pull-left[
- Perfect **Positive Correlation**: `\(r=+1\)`
  - All points lie on an upward-sloping line.
]

.pull-right[
![](Correlations-and-Partial-Correlations-in-R_files/figure-html/unnamed-chunk-2-1.png)<!-- -->
]

---
# Nature and Interpretation of `\(r\)` Value

.pull-left[
- Perfect Positive Correlation: `\(r=+1\)`
  - All points lie on an upward-sloping line.
- **No Correlation**: `\(r=0\)`
  - Points do not align along any line, or they form a horizontal line.
]

.pull-right[
![](Correlations-and-Partial-Correlations-in-R_files/figure-html/unnamed-chunk-3-1.png)<!-- -->
]

---
# Nature and Interpretation of `\(r\)` Value

.pull-left[
- Perfect Positive Correlation: `\(r=+1\)`
  - All points lie on an upward-sloping line.
- No Correlation: `\(r=0\)`
  - Points do not align along any line, or they form a horizontal line.
- Perfect **Negative Correlation**: `\(r=-1\)`
  - All points lie on a downward-sloping line.
]

.pull-right[
![](Correlations-and-Partial-Correlations-in-R_files/figure-html/unnamed-chunk-4-1.png)<!-- -->
]

---
# Nature and Interpretation of `\(r\)` Value

.pull-left[
- Perfect **Positive Correlation**: `\(r=+1\)`
  - All points lie on an upward-sloping line.
- No Correlation: `\(r=0\)`
  - Points do not align along any line, or they form a horizontal line.
- Perfect **Negative Correlation**: `\(r=-1\)`
  - All points lie on a downward-sloping line.
]

.pull-right[
![](Correlations-and-Partial-Correlations-in-R_files/figure-html/unnamed-chunk-5-1.png)<!-- -->
]

- Note: When `\(|r| = 1\)`, we have a perfect correlation, meaning all points fall precisely along a straight line.

---
# Strength of Correlation

.pull-left[
- Weak Correlation:
  - `\(0.0 \le |r| < 0.3\)`
- Moderate Correlation:
  - `\(0.3 \le |r| < 0.7\)`
- Strong Correlation:
  - `\(0.7 \le |r| < 1.0\)`
- Perfect Correlation:
  - `\(|r| = 1.0\)`
]

.pull-right[
![](Correlations-and-Partial-Correlations-in-R_files/figure-html/unnamed-chunk-6-1.png)<!-- -->
]

---
# Strength of Correlation: Examples

.pull-left[
- Weak Correlation:
  - `\(0.0 \le |r| < 0.3\)`
- Moderate Correlation:
  - `\(0.3 \le |r| < 0.7\)`
- Strong Correlation:
  - `\(0.7 \le |r| < 1.0\)`
- Perfect Correlation:
  - `\(|r| = 1.0\)`
]

.pull-right[
- Husband’s age / Wife’s age:
  - r = .94 (strong positive correlation)
- Husband’s height / Wife’s height:
  - r = .36 (moderate positive correlation)
- Distance of golf putt / percent success:
  - r = -.94 (strong negative correlation)
]

---
# Coefficient of Determination ( `\(r^2\)` )

- `\(r^2\)` quantifies the proportion of variability in one variable that can be explained by its relationship with the other variable.
- It represents the shared variability between the two variables, indicating how much of the total variability they have in common.
- When two variables are correlated, `\(r^2\)` tells us how much variance in one variable can be accounted for by the other.
- This concept is similar to the Effect Size we examined in ANOVA.

---
# Difference between `\(r\)` and `\(r^2\)`

- `\(r\)` (Pearson’s correlation coefficient):
  - Measures how much two variables vary together.
  - Ranges from -1 to 1:
    - -1: Perfect negative correlation
    - 0: No correlation
    - 1: Perfect positive correlation
  - Unitless and applicable across contexts.

---
# Difference between `\(r\)` and `\(r^2\)`

- `\(r\)` (Pearson’s correlation coefficient):
  - Measures how much two variables vary together.
  - Ranges from -1 to 1:
    - -1: Perfect negative correlation
    - 0: No correlation
    - 1: Perfect positive correlation
  - Unitless and applicable across contexts.
- `\(r^2\)` (coefficient of determination):
  - Indicates the proportion of variability explained by the relationship.
  - Ranges from 0 to 1:
    - 0: No variability explained
    - 1: All variability explained
  - Interpreted as the percentage of variance explained.

The sketch on the next slide shows how these quantities are computed in R.
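---
# Computing `\(r\)` and `\(r^2\)` in R

A minimal sketch using simulated data (the variables and numbers here are illustrative only, not from a lecture dataset): `cor()` returns Pearson's `\(r\)`, squaring it gives `\(r^2\)`, and `cor.test()` reports the t statistic, df, and p-value discussed on the next slides.

``` r
# Hypothetical example: 21 paired observations, so df = n - 2 = 19
set.seed(1)
x <- rnorm(21)
y <- 0.6 * x + rnorm(21)

r <- cor(x, y)   # Pearson's correlation coefficient
r^2              # coefficient of determination
cor.test(x, y)   # r together with its t statistic, df, and p-value
```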
---
# p-value of the correlation

The p-value (significance level) of the correlation can be determined:

- by using the correlation coefficient table for **df = n − 2**, where **n is the number of paired observations of x and y**.

---
# p-value of the correlation

The p-value (significance level) of the correlation can be determined:

- by using the correlation coefficient table for **df = n − 2**, where **n is the number of paired observations of x and y**.
- or by calculating the t value as follows:

$$ t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}} $$

In case 2, the corresponding p-value is determined using the t distribution table for df = n − 2.

---
# Understanding significant correlations

- **Significant correlations** indicate that fluctuations in the values are sufficiently regular, making it unlikely that the association occurred by chance.
- However, significance does not imply meaningfulness:
  - A significant correlation does not guarantee that the amount of variance accounted for is practically significant.
- Use **`\(r^2\)` to assess the meaningfulness** of your results:
  - Interpret `\(r^2\)` to understand the proportion of variance explained by the correlation and its practical implications.

---
# Writing Up Your Results

.pull-left[
Example:

- Correlation Results: `\(r_{(19)} = 0.606, p = 0.024, r^2 = 0.37\)`
]

---
# Writing Up Your Results

.pull-left[
Example:

- Correlation Results: `\(r_{(19)} = 0.606, p = 0.024, r^2 = 0.37\)`

Interpretation:

- r:
  - Test statistic: 0.606 (obtained value)
  - Indicates a moderate positive correlation between the variables.
- Degrees of freedom:
  - 19 (calculated as `\(n-2\)`, where `\(n\)` is the number of paired observations).
]

.pull-right[
- p-value:
  - 0.024: Indicates a statistically significant correlation (p < 0.05), suggesting the correlation is unlikely to have arisen by chance.
- `\(r^2\)`:
  - 0.37: Indicates that approximately 37% of the variance in one variable is explained by the other, providing a measure of the correlation's practical significance.
]

---
# Complications and limitations of correlation

---
# Complications and limitations of correlation

.pull-left[
- Sensitivity to outliers:
  - Correlations can be strongly affected by outliers, which may skew the results.
- Influence of outliers:
  - A single outlier can have a disproportionate impact on the correlation coefficient, potentially leading to misleading conclusions.
]

.pull-right[
![](Correlations-and-Partial-Correlations-in-R_files/figure-html/unnamed-chunk-7-1.png)<!-- -->
]

---
# Complications and limitations of correlation

- **Spurious Correlations**:
  - A significant correlation may not indicate a direct relationship between the two variables.
  - It can arise from an indirect correlation driven by a third variable.
  - For example, an observed correlation between ice cream sales and drowning incidents might be spurious, driven by a third factor like temperature.
- **Causation vs. Correlation**:
  - Correlation does not imply causation. Just because two variables are correlated does not mean one causes the other.
- **Context Matters**:
  - Always consider the context and underlying mechanisms that could affect the relationship between the variables.

---
# Multivariate interpretation - What to Do?

---
# Multivariate interpretation - What to Do?

- Check out your Partial Correlations.
  - Purpose: These correlations adjust each correlation to account for the influence of all other variables in the analysis.
  - Benefit: They reveal the unique relationship between variables by removing the influence of others.
  - When to use: Conduct this analysis only on variables that significantly correlate with your variable of interest.
  - A simulated sketch on the next slide shows how controlling for a third variable can dissolve a spurious correlation.
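---
# Spurious Correlation: A Simulated Sketch

A minimal, hypothetical illustration of the ice-cream-and-drowning idea: a lurking variable `z` (say, temperature) drives both `x` and `y`, so `x` and `y` correlate strongly even though neither affects the other. Partialling out `z` with `pcor.test()` from the `ppcor` package (introduced properly in the example later) makes the association vanish.

``` r
library(ppcor)

# Simulated data: z drives both x and y; x and y share no direct link
set.seed(42)
z <- rnorm(100)            # lurking variable, e.g., temperature
x <- 2.0 * z + rnorm(100)  # driven by z, e.g., ice cream sales
y <- 1.5 * z + rnorm(100)  # also driven by z, e.g., swimming incidents

cor(x, y)            # strong positive correlation, but it is spurious
pcor.test(x, y, z)   # near zero once z is controlled for
```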
---
# Partial Correlations

- Keep in Mind:
  - These new partial correlations have been adjusted for covariance with all other variables.
  - The goal is to isolate the unique covariance between the two variables of interest.
- Important Note:
  - Expect these values to be lower than the original correlations!
  - Why? Because we are accounting for the influence of other variables.
  - Focus on relative strength rather than trying to meet an absolute threshold of strength.

---
# Other complications

- Quantifying Relationships: Pearson’s and Partial Correlations quantify *linear* relationships; Spearman’s non-parametric correlation quantifies *monotonic* relationships (a linear association between the ranks).
  - None of these will detect more general nonlinear relationships.
- Misleading Data: Correlations using rates and averages (smoothed data) can be misleading.

---
# Things to remember

- Correlations only tell us about the strength of a linear relationship between two variables.
- We cannot interpret causality from correlation alone.

---
# Spearman correlation formula

The Spearman correlation method computes the correlation between the rank of `\(x\)` and the rank of `\(y\)` variables:

`$$\rho = \frac{\sum (x' - \bar{x'}) (y' - \bar{y'})}{\sqrt{\sum (x' - \bar{x'})^2 \sum (y' - \bar{y'})^2}}$$`

Where `\(x' = rank(x)\)` and `\(y' = rank(y)\)`.

---
# Example

---
# Examining the relationship between CO₂ and Global Temperature

**Load Data**: Import the dataset containing global temperature and CO₂ concentration.

``` r
library(readxl)  # provides read_xlsx()

global_T <- read_xlsx('Lecturer Practice/global mean T.xlsx')
tail(global_T)
```

```
## # A tibble: 6 × 3
##    year temperature Mona_Loa_co2
##   <dbl>       <dbl>        <dbl>
## 1  2013        14.6         397.
## 2  2014        14.7         399.
## 3  2015        14.8         401.
## 4  2016        14.9         404.
## 5  2017        14.8         407.
## 6  2018        14.7         409.
```

---
# Relationship between CO₂ and Temperature

**Test for Normality**: Use Shapiro-Wilk tests to check if temperature and Mona_Loa_co2 are normally distributed.

``` r
shapiro.test(global_T$temperature)
```

```
## 
##  Shapiro-Wilk normality test
## 
## data:  global_T$temperature
*## W = 0.9314, p-value = 0.000002771
```

---
# Relationship between CO₂ and Temperature

**Test for Normality**: Use Shapiro-Wilk tests to check if temperature and Mona_Loa_co2 are normally distributed.

``` r
shapiro.test(global_T$Mona_Loa_co2)
```

```
## 
##  Shapiro-Wilk normality test
## 
## data:  global_T$Mona_Loa_co2
*## W = 0.94085, p-value = 0.005914
```

- Both p-values are below 0.05, so normality is rejected for both variables; a non-parametric (Spearman) correlation is therefore appropriate.

---
# Relationship between CO₂ and Temperature

This code performs a **Spearman correlation** test to examine the relationship between global temperature and CO₂ concentration.

``` r
# Test correlation (non-parametric Spearman, since normality was rejected)
test <- cor.test(global_T$temperature, global_T$Mona_Loa_co2,
                 method = "spearman")
test
```

```
## 
##  Spearman's rank correlation rho
## 
## data:  global_T$temperature and global_T$Mona_Loa_co2
*## S = 2403.6, p-value < 0.00000000000000022
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
*## 0.9332148
```

---
# Coefficient of Determination (R²) for Spearman Correlation

Approximate the proportion of variance in temperature explained by CO₂ concentration.
``` r
rho <- test$estimate  # Spearman correlation coefficient
R_squared <- rho^2
R_squared
```

```
##       rho 
## 0.8708898
```

``` r
# Extract the p-value for use in the plot annotation
p_value <- test$p.value
```

---
# Data visualization

.panelset[
.panel[.panel-name[R Code]

``` r
ggplot(global_T, aes(x = Mona_Loa_co2, y = temperature)) +
  geom_line() +
  geom_point() +
  geom_smooth() +
  # ylim(14, 14.75) +
  labs(x = "CO2 (ppm)", y = "Temperature (C)", title = "") +
  annotate("text", x = 350, y = 14.8,
           label = paste("Spearman's correlation =", round(rho, 2), "\n",
                         "p =", format.pval(p_value)),
           hjust = 1, vjust = 0)
```
]
.panel[.panel-name[Plot]
![](Correlations-and-Partial-Correlations-in-R_files/figure-html/plot_1-1.png)
]
]

---
# How to summarize results

<span style="color:#00796B;">We set out to examine the relationship between CO₂ concentrations and global mean temperatures based on 60 years of observations at the Mauna Loa Observatory. </span><span style="color:#E65100;">A Spearman’s rho correlation</span><span style="color:#1E88E5;"> identified an extremely strong, significant direct correlation ( `\(\rho_{(58)} = 0.93, p < 0.001, r^2 = 0.87\)` ). </span><span style="color:#424242;">This indicates that of the myriad factors that influence global mean temperatures, the concentration of ONE gas alone can account for 87% of the variability witnessed over the past six decades. This strongly suggests that CO₂ emissions have contributed to the warming trends witnessed over the past half century.</span>

---
# Example

---
# Example: Partial Correlation Analysis

---
# Partial Correlation Analysis

- Explore the relationships between variables in the dataset while controlling for the influence of other variables.

``` r
df <- data.frame(
  x = c(1,2,3,4,5,6,7,7,7,8),
  y = c(4,5,6,7,8,9,9,9,10,10),
  z = c(1,3,5,7,9,11,13,15,17,19))
head(df)
```

```
##   x y  z
## 1 1 4  1
## 2 2 5  3
## 3 3 6  5
## 4 4 7  7
## 5 5 8  9
## 6 6 9 11
```

---
# Partial Correlation Analysis

- Load library ppcor

``` r
library(ppcor)
```

---
# Partial Correlation Analysis

.pull-left[
- Calculate Partial Correlation

``` r
pcor(df)
```
]

.pull-right[

```
## $estimate
##           x          y          z
## x 1.0000000 0.76314445 0.58810321
## y 0.7631444 1.00000000 0.05552034
## z 0.5881032 0.05552034 1.00000000
## 
## $p.value
##            x          y          z
## x 0.00000000 0.01673975 0.09578687
## y 0.01673975 0.00000000 0.88718502
## z 0.09578687 0.88718502 0.00000000
## 
## $statistic
##          x         y         z
## x 0.000000 3.1244245 1.9238403
## y 3.124425 0.0000000 0.1471199
## z 1.923840 0.1471199 0.0000000
## 
## $n
## [1] 10
## 
## $gp
## [1] 1
## 
## $method
## [1] "pearson"
```
]

---
# Partial Correlation Analysis

.pull-left[
- Calculate Partial Correlation

``` r
pcor(df)
```

- **$estimate**: The matrix of partial correlations; e.g., the x–y entry (0.763) is the correlation between x and y after controlling for z.
- **$p.value**: The p-value for this partial correlation is 0.01673975.
- **$statistic**: The matrix of t test statistics for each partial correlation.
- The sketch on the next slide shows where the x–y value comes from.
]

.pull-right[

```
## $estimate
##           x          y          z
## x 1.0000000 0.76314445 0.58810321
## y 0.7631444 1.00000000 0.05552034
## z 0.5881032 0.05552034 1.00000000
## 
## $p.value
##            x          y          z
## x 0.00000000 0.01673975 0.09578687
## y 0.01673975 0.00000000 0.88718502
## z 0.09578687 0.88718502 0.00000000
## 
## $statistic
##          x         y         z
## x 0.000000 3.1244245 1.9238403
## y 3.124425 0.0000000 0.1471199
## z 1.923840 0.1471199 0.0000000
## 
## $n
## [1] 10
## 
## $gp
## [1] 1
## 
## $method
## [1] "pearson"
```
]
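---
# Partial Correlation Analysis: Checking by Hand

As a sanity check (a sketch, not part of the original analysis), the x–y entry of `pcor()`'s `$estimate` matrix can be reproduced by correlating the residuals of each variable after regressing out z; this is exactly what "controlling for z" means.

``` r
# Regress z out of x and out of y, then correlate what is left over
res_x <- resid(lm(x ~ z, data = df))  # part of x not explained by z
res_y <- resid(lm(y ~ z, data = df))  # part of y not explained by z
cor(res_x, res_y)  # ~0.763, matching the x-y entry of pcor(df)$estimate

# The same pairwise value, with its t statistic and p-value, directly:
pcor.test(df$x, df$y, df$z)
```

The residual view also explains why partial correlations are usually lower than the raw correlations: any covariance that x and y share through z has already been stripped away.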