EDA

The primary questions of interest are:

1. What is the association between life expectancy and alcohol consumption?
2. Does this association differ by sex?
3. How have life expectancy and alcohol consumption changed over time?

Both box plots show some outliers. This is expected, since some people consume much more alcohol than most and some have a much lower life expectancy than most.

Stacked histogram of alcohol consumption by sex. Use different color schemes than the ggplot default.
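
A minimal sketch of such a plot, assuming ggplot2 and the column names `Alcohol` and `Sex` that appear in the model output. The data frame `toy` is simulated here purely for illustration and is not the report's data; the ColorBrewer palette is one possible replacement for the default colors.

```r
library(ggplot2)

# Simulated stand-in for the report's data; all values are made up.
set.seed(1)
toy <- data.frame(
  Sex = rep(c("Female", "Male"), each = 200),
  Alcohol = c(rexp(200, rate = 1 / 2), rexp(200, rate = 1 / 6))
)

# position = "stack" is the default for geom_histogram; it is spelled out
# here to make the stacking explicit.
p_hist <- ggplot(toy, aes(x = Alcohol, fill = Sex)) +
  geom_histogram(binwidth = 1, position = "stack", colour = "white") +
  scale_fill_brewer(palette = "Set2") +  # replaces the default hue palette
  labs(x = "Alcohol consumption", y = "Count", fill = "Sex")

p_hist
```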

Figure Interpretation: Most people consume fewer than 10 units of alcohol. Females tend to drink less alcohol and males tend to drink more.

Facet plot by year for 2000, 2010, and 2019 showing scatterplots with regression lines of life expectancy and alcohol consumption
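
One way to build this figure is sketched below, assuming ggplot2 and the column names `Year`, `Alcohol`, and `Life_Expectation` from the model output. The data frame `toy` is simulated and only illustrates the mechanics. Passing `formula = y ~ x` explicitly also suppresses the `geom_smooth()` message.

```r
library(ggplot2)

# Simulated stand-in for the report's data; all values are made up.
set.seed(2)
toy <- expand.grid(Year = c(2000, 2005, 2010, 2019), id = 1:50)
toy$Alcohol <- runif(nrow(toy), min = 0, max = 15)
toy$Life_Expectation <- 60 + 0.3 * (toy$Year - 2000) +
  0.5 * toy$Alcohol + rnorm(nrow(toy), sd = 5)

# Keep only the three years shown in the figure.
sel <- subset(toy, Year %in% c(2000, 2010, 2019))

p_facet <- ggplot(sel, aes(x = Alcohol, y = Life_Expectation)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x) +  # one regression line per panel
  facet_wrap(~ Year) +
  labs(x = "Alcohol consumption", y = "Life expectancy")

p_facet
```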

## `geom_smooth()` using formula = 'y ~ x'

Figure Interpretation: In each of these years, life expectancy increases as alcohol consumption increases. Life expectancy also increases from one year to the next.

A linear model of life expectancy as a function of time, adjusted for sex.

## 
## Call:
## lm(formula = Life_Expectation ~ Year + Sex, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.367  -6.413   1.836   6.960  16.682 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -580.19632   38.19150  -15.19   <2e-16 ***
## Year           0.32467    0.01901   17.08   <2e-16 ***
## SexMale       -5.06339    0.21900  -23.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.957 on 6689 degrees of freedom
## Multiple R-squared:   0.11,  Adjusted R-squared:  0.1097 
## F-statistic: 413.2 on 2 and 6689 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Life_Expectation ~ Year + Sex, data = dat[dat$Country == 
##     "Canada", ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.53566 -0.17599  0.03276  0.18809  0.46039 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.621e+02  1.472e+01  -17.81   <2e-16 ***
## Year         1.718e-01  7.326e-03   23.46   <2e-16 ***
## SexMale     -4.465e+00  8.449e-02  -52.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2672 on 37 degrees of freedom
## Multiple R-squared:  0.9891, Adjusted R-squared:  0.9885 
## F-statistic:  1672 on 2 and 37 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Life_Expectation ~ Year + Sex, data = dat[dat$Country == 
##     "China", ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.98607 -0.04110  0.07953  0.17883  0.28329 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.267e+02  1.368e+01  -38.50   <2e-16 ***
## Year         3.009e-01  6.807e-03   44.21   <2e-16 ***
## SexMale     -5.275e+00  7.851e-02  -67.19   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2483 on 37 degrees of freedom
## Multiple R-squared:  0.9943, Adjusted R-squared:  0.994 
## F-statistic:  3235 on 2 and 37 DF,  p-value: < 2.2e-16
## `geom_smooth()` using formula = 'y ~ x'

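Summary: Both the linear models and the plots show that males have lower life expectancy than females (by about 5.1 years across all countries, 4.5 years in Canada, and 5.3 years in China) and that life expectancy increases over the years (by about 0.32, 0.17, and 0.30 years per calendar year, respectively). Canada has a higher life expectancy than China, and both countries have a higher life expectancy than the world average.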
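
As a rough check on this comparison, the fitted 2019 values can be reproduced from the printed coefficients. The sketch below is only an approximation, since the Canada and China coefficients are printed to four significant figures. The row label `Pooled` is a name chosen here for the all-countries model fitted on `dat`.

```r
# Coefficients copied from the three model summaries above
# (Canada and China are rounded in the printout).
coefs <- rbind(
  Pooled = c(intercept = -580.19632, year = 0.32467, male = -5.06339),
  Canada = c(intercept = -262.1,     year = 0.1718,  male = -4.465),
  China  = c(intercept = -526.7,     year = 0.3009,  male = -5.275)
)

year <- 2019

# Female is the reference level, so the SexMale coefficient is added for males.
female <- coefs[, "intercept"] + coefs[, "year"] * year
male   <- female + coefs[, "male"]

round(cbind(Female = female, Male = male), 1)
```

With these rounded coefficients the fitted female values come out near 75 years for the pooled model, 85 for Canada, and 81 for China, which matches the ordering described in the summary.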

A barplot of male and female life expectancy for the 10 countries with largest discrepancies in 2019.
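
The selection step for this figure can be sketched as follows, assuming the discrepancy is the absolute female-minus-male difference in 2019. The country names and values in `toy` are hypothetical, and the column names follow the model output.

```r
library(ggplot2)

# Simulated stand-in: 15 hypothetical countries, one row per sex, year 2019.
set.seed(3)
countries <- paste("Country", LETTERS[1:15])
base_le <- rnorm(15, mean = 72, sd = 4)
sex_gap <- runif(15, min = 2, max = 10)
toy <- data.frame(
  Country = rep(countries, times = 2),
  Sex = rep(c("Female", "Male"), each = 15),
  Year = 2019,
  Life_Expectation = c(base_le + sex_gap, base_le)
)

# One row per country with a female and a male column, then the gap.
d19 <- subset(toy, Year == 2019)
wide <- reshape(d19[, c("Country", "Sex", "Life_Expectation")],
                idvar = "Country", timevar = "Sex", direction = "wide")
wide$gap <- abs(wide$Life_Expectation.Female - wide$Life_Expectation.Male)

# The 10 countries with the largest gap, ordered from largest to smallest.
top10 <- head(wide$Country[order(-wide$gap)], 10)

plot_dat <- subset(d19, Country %in% top10)
plot_dat$Country <- factor(plot_dat$Country, levels = rev(top10))

p_bar <- ggplot(plot_dat, aes(x = Country, y = Life_Expectation, fill = Sex)) +
  geom_col(position = "dodge") +  # side-by-side bars for the two sexes
  coord_flip() +
  labs(x = NULL, y = "Life expectancy (2019)")

p_bar
```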

Figure Interpretation: Females usually have higher life expectancy than males.

A boxplot of life expectancy by alcohol consumption level and sex for the year 2019.
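
The report does not state how the alcohol consumption levels were defined. The sketch below assumes tertiles of `Alcohol` as the cut points, which is only one possible choice, and uses simulated data in `toy`. The variable name `Alcohol_Level` and its labels are hypothetical.

```r
library(ggplot2)

# Simulated stand-in for the 2019 subset; all values are made up.
set.seed(4)
toy <- data.frame(
  Sex = rep(c("Female", "Male"), each = 150),
  Alcohol = c(rexp(150, rate = 1 / 2), rexp(150, rate = 1 / 6))
)
toy$Life_Expectation <- 75 - 5 * (toy$Sex == "Male") +
  0.5 * toy$Alcohol + rnorm(nrow(toy), sd = 5)

# Assumed definition: split consumption into three equally sized groups.
breaks <- quantile(toy$Alcohol, probs = c(0, 1 / 3, 2 / 3, 1))
toy$Alcohol_Level <- cut(toy$Alcohol, breaks = breaks,
                         labels = c("Low", "Medium", "High"),
                         include.lowest = TRUE)

p_box <- ggplot(toy, aes(x = Alcohol_Level, y = Life_Expectation, fill = Sex)) +
  geom_boxplot() +
  labs(x = "Alcohol consumption level", y = "Life expectancy")

p_box
```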

Figure Interpretation: People with a High alcohol consumption level usually have higher life expectancy. People with Low and Median alcohol consumption levels usually have similar, but lower, life expectancy. Females have higher life expectancy than males overall.

A visualization to examine the association of life expectancy with alcohol consumption over time.

Figure Interpretation: In each of these years, life expectancy increases as alcohol consumption increases. Life expectancy also increases from one year to the next.

Advanced Regression

Construct a multiple linear regression model to examine the association between alcohol consumption and life expectancy, adjusted for time and sex. First use time as a linear predictor variable, and then fit another model where you put a cubic regression spline on time. Provide summaries of your models, plots of the linear and non-linear associations, and interpretation of the linear and non-linear associations.

## 
## Call:
## lm(formula = Life_Expectation ~ Year + Sex + Alcohol, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.421  -5.276   2.071   6.275  16.226 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -581.31471   36.40924  -15.97   <2e-16 ***
## Year           0.32453    0.01812   17.91   <2e-16 ***
## SexMale       -8.77330    0.25312  -34.66   <2e-16 ***
## Alcohol        0.54729    0.02111   25.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.539 on 6688 degrees of freedom
## Multiple R-squared:  0.1912, Adjusted R-squared:  0.1909 
## F-statistic: 527.1 on 3 and 6688 DF,  p-value: < 2.2e-16
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Life_Expectation ~ s(Year, bs = "cr") + Sex + Alcohol
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 70.83944    0.15717  450.73   <2e-16 ***
## SexMale     -8.77330    0.25312  -34.66   <2e-16 ***
## Alcohol      0.54729    0.02111   25.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##         edf Ref.df     F p-value    
## s(Year)   1      1 320.8  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.191   Deviance explained = 19.1%
## GCV = 72.965  Scale est. = 72.921    n = 6692

## Analysis of Variance Table
## 
## Model 1: Life_Expectation ~ Year + Sex + Alcohol
## Model 2: Life_Expectation ~ s(Year, bs = "cr") + Sex + Alcohol
##   Res.Df    RSS         Df  Sum of Sq      F    Pr(>F)    
## 1   6688 487698                                           
## 2   6688 487698 5.3624e-09 2.8405e-07 0.7264 5.223e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

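Interpretation: The plots, model summaries, and ANOVA table show no difference between the two models. The smooth term s(Year) has an effective degrees of freedom of 1, so the penalized cubic regression spline reduced to a straight line; both models have the same residual sum of squares (487698) and identical coefficients for Sex and Alcohol. Because the two fits differ by essentially zero degrees of freedom (Df = 5.36e-09), the small p-value in the ANOVA table is a numerical artifact rather than evidence in favor of the spline. Overall, both models indicate that life expectancy rises linearly with year (about 0.32 years per calendar year) and with alcohol consumption (about 0.55 years per unit), and that males have a life expectancy about 8.8 years lower than females, holding the other variables fixed.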
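
The comparison can be sketched on simulated data in which the true effect of `Year` is linear. This assumes the mgcv package, which matches the `s(Year, bs = "cr")` formula in the output; the data frame `toy` and its effect sizes are made up for illustration. Looking at the effective degrees of freedom and the AIC is offered here as a more stable check than the ANOVA table when the two fits are nearly identical.

```r
library(mgcv)

# Simulated stand-in with a linear Year effect; all values are made up.
set.seed(5)
n <- 2000
toy <- data.frame(
  Year = sample(2000:2019, n, replace = TRUE),
  Sex = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  Alcohol = runif(n, min = 0, max = 15)
)
toy$Life_Expectation <- 65 + 0.3 * (toy$Year - 2000) -
  8 * (toy$Sex == "Male") + 0.5 * toy$Alcohol + rnorm(n, sd = 8)

# Model 1: Year as a linear predictor.
fit_lm <- lm(Life_Expectation ~ Year + Sex + Alcohol, data = toy)

# Model 2: penalized cubic regression spline on Year.
fit_gam <- gam(Life_Expectation ~ s(Year, bs = "cr") + Sex + Alcohol,
               data = toy)

# An edf close to 1 means the smooth is effectively a straight line.
edf_year <- summary(fit_gam)$edf
c(edf = edf_year, AIC_lm = AIC(fit_lm), AIC_gam = AIC(fit_gam))
```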