Introduction


Hierarchical linear models (HLM) are an analysis tool we can use to work with data that has a hierarchical structure (i.e. a clustering effect). Example of this data is if you are testing the score results of students in different classes in different schools. The data has a hierarchical structure with the clustering levels being that of the classrooms and the schools. This is a problem for our analyses as it violates the assumption of independence of the data. Additionally between the different clusters we may also see different levels of variance which means that out data is likely to be heteroscedastic meaning we violate another one of our assumptions.

Fortunately we can use HLMs to account for this effect. In this example we’ll look at the effect of Collar Color and examine the clustering effects (e.g. hierarchical structure) of gender and age on this. To do this we’ll be using the package lme4. If you need to install it you can use install.packages("lme4"). Also a draw back of lme4 is that the default version does not report p-values (the creators of the lme4 package are philosophically opposed to p-values, for reasons I’ll not go into here). Fortunately an extension exists that remedies this called lmerTest which you should also install (install.packages('lmerTest')). Now they’re installed, let’s load them in.

library(lme4)
## Loading required package: Matrix
library(lmerTest)
## 
## Attaching package: 'lmerTest'
## The following object is masked from 'package:lme4':
## 
##     lmer
## The following object is masked from 'package:stats':
## 
##     step

1. Generate one or more hypotheses to be tested. State the null and alternative hypotheses.


We’ll be testing if Mediator Skills are associated with CollarColor after accounting for the effects of Gender and Age.

Participants collar color is associated with mediator skills.

And the null..

There is no relationship between collar color and mediator skills.


As usual we will do our data cleaning.

# setwd
setwd('~/Documents/upwork/Tammy_Ferrante')
# load in data (df stands for "data frame")
df <- read.csv('LeadershipData.csv')

# pre processing
# count how many rows have zero variance
length(df[df$ZeroVar == 1,]$ResponseId)
## [1] 13
# drop these rows from the data frame
df <- df[!df$ZeroVar == 1,]
# how many rows are we removing?
length(df[!complete.cases(df), ]$ResponseId)
## [1] 71
# remove the rows
df <- df[complete.cases(df),]
# since we have removed a lot of rows it is sometimes useful to reset the index
row.names(df) <- NULL
# make gender and country factors
df$Gender <- as.factor(df$Gender)
df$Age <- as.factor(df$Age)
df$CollarColor <- as.factor(df$CollarColor)

Let’s check the CollarColor and Gender groups and make sure there’s enough data in each

summary(df$CollarColor)
##   1   2   3   4   5 
## 204 206 328 496  11
summary(df$Gender)
##   1   2 
## 573 672
summary(df$Age)
##   1   2   3   4   5   6 
## 186 647 240 118  45   9

5 is pretty low (only 11 samples). It is interesting data (executives) but let’s drop it as it will likely skew the results. Also only 9 samples in 6 for Age so let’s drop that too.

df <- df[!df$CollarColor == 5,]
df <- df[!df$Age == 6,]
df$CollarColor <- droplevels(df$CollarColor)
df$Age<- droplevels(df$Age)
row.names(df) <- NULL

Now let’s check our groups with some quick plots.

# divde graphics into 1x3 panel
op <- par(mfrow = c(1, 3))
# make plots
boxplot(MediatorDomain~Gender, data=df, main='Mediator Skills and Gender', xlab='Gender', ylab='Score')
boxplot(MediatorDomain~Age, data=df, main='Mediator Skills and Age', xlab='Age', ylab='Score')
boxplot(MediatorDomain~CollarColor, data=df, main='Mediator Skills and Collar Color', xlab='Collar Color', ylab='Score')

par(op)

Now let’s build our model so we can check assumptions and remove outliers. Don’t worry about understanding the model format yet. We’ll get to that in question 4.

fit_HLM <- lmer(MediatorDomain ~ CollarColor + (1 | Gender) + (1 | Age), data=df)

Okay now let’s use Cook’s distance again to remove the outliers.

cooksd <- cooks.distance(fit_HLM)
sample_size <- nrow(df)
plot(cooksd, pch="*", cex=2, main="Influential Obs by Cooks distance")  # plot cook's distance
abline(h = 4/sample_size, col="red")  # add cutoff line
text(x=1:length(cooksd)+1, y=cooksd, 
     labels=ifelse(cooksd>4/sample_size, names(cooksd),""), col="red")  # add labels

# drop rows
influential <- as.numeric(names(cooksd)[(cooksd > (4/sample_size))])
df <- df[-influential,]
# don't forget to reset the index!
row.names(df) <- NULL
# how many outliers did we drop? 
length(influential)
## [1] 187

Don’t forget to refit the model now that we removed the outliers.

fit_HLM <- lmer(MediatorDomain ~ CollarColor + (1 | Gender) + (1 | Age), data=df)

3. Address all relevant statistical assumptions and provide a written summary of the findings, including methods for resolving any violations.


Once again it’s time to check the model assumptions.

  1. Independence of observations
  2. The residuals are normally distributed
  3. The residuals are homoscedastic (i.e. the variance of data in groups should be the same)

Because we’re using lme4 we can’t just diagnostics using the normal plot(fit) methods for everything. We’ll to do it a little different but the tests are all the same.

However first let’s check the independence of observations. This is taken care of in our study because we’re accounting for this in our HLM model. But for the sake of completeness, again like in the last exercise we’ll just make sure there’s no duplicate participants.

# How many duplicate IDs?
length(df[duplicated(df$ResponseId),]$ResponseID)
## [1] 0

Now let’s check for the normality of the residuals. We’ll have to do this manually this time.

qqnorm(resid(fit_HLM))

Looks really good. Now let’s check that the data is homoscedastic. We can just use plot(fit_HLM) for this.

plot(fit_HLM)

No worries! Let’s move on to the analysis.

4. Run the HLM using your statistical software of choice.


Now we get to run the HLM using the lmer function. Notice that we’re adding our random effects to this model using (1 | Gender) and (1 | Age). These are the hierarchical clusters we discussed in the introduction i.e. the random effects of the model. These can be thought of as a special kind of interaction terms. More information on these terms can be found here The part we’re interested is fit as normal (CollarColor) and is the fixed effect of the model.

fit_HLM <- lmer(MediatorDomain ~ CollarColor + (1 | Gender) + (1 | Age), data=df)
summary(fit_HLM)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: MediatorDomain ~ CollarColor + (1 | Gender) + (1 | Age)
##    Data: df
## 
## REML criterion at convergence: 2688
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.79682 -0.71612  0.05872  0.77679  2.19951 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Age      (Intercept) 0.01132  0.1064  
##  Gender   (Intercept) 0.01618  0.1272  
##  Residual             0.76649  0.8755  
## Number of obs: 1038, groups:  Age, 5; Gender, 2
## 
## Fixed effects:
##               Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)  7.828e+00  1.268e-01 3.041e+00  61.744 8.20e-06 ***
## CollarColor2 2.784e-01  1.010e-01 1.033e+03   2.756  0.00596 ** 
## CollarColor3 2.792e-01  9.005e-02 1.031e+03   3.101  0.00198 ** 
## CollarColor4 3.984e-01  8.455e-02 9.722e+02   4.712 2.81e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) CllrC2 CllrC3
## CollarColr2 -0.404              
## CollarColr3 -0.455  0.575       
## CollarColr4 -0.496  0.627  0.701

As you can see it’s clear that there is an effect of collar color on mediator skills after accounting for gender and age in our analysis. Notice that the variance of the random effects is also pretty close to zero. This suggests they are having very little impact on the model.

Visualising multilevel regressions

Now because we have 2 groups of Gender and 5 groups of Age we are effectively looking at the variance of 10 different groups (2*5 = 10). For fun let’s have a visual look at all the different regression lines using ggplot2.

library(ggplot2)

df$mixed <- factor(paste0('G', df$Gender, '-A', df$Age))
df$CollarColor_num <- as.numeric(df$CollarColor)
  
ggplot(df,aes(x=CollarColor_num,y=MediatorDomain,color=mixed)) +
  stat_smooth(method = "lm") + 
  geom_point(data=df,aes(x=CollarColor_num,y=MediatorDomain)) +
  facet_wrap(~mixed,nrow=4)

Notice we have a pretty similar line in most of these cases. This suggests to me that the random effects aren’t having a huge impact on our analysis.

5. Provide a summary of the overall findings


We found that there was a relationship between CollarColor and MediatorDomain after accounting for both Gender and Age. Consequently we could reject our null hypothesis that “there is no relationship between collar color and mediator skills”. Additionally the random effects in the model showed very little variance and therefore were not an important factor in the model.

a. Data issues and how they were resolved


The data was cleaned and processed. We removed categorical groups with very few samples and we removed all rows that contained zero variance in their responses and rows containing missing values were removed. Additionally we checked for duplicated ResponseIDs and removed data outliers using 4/N threshold of Cook’s distance.

b. Descriptive summary of the data


summary(df$MediatorDomain)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.833   7.500   8.167   8.103   8.833  10.000
summary(df$CollarColor)
##   1   2   3   4 
## 147 161 279 451
summary(df$Gender)
##   1   2 
## 481 557
summary(df$Age)
##   1   2   3   4   5 
## 137 576 204  91  30
# divde graphics into 1x3 panel
op <- par(mfrow = c(1, 3))
# make plots of clean data
boxplot(MediatorDomain~Gender, data=df, main='Mediator Skills and Gender', xlab='Gender', ylab='Score')
boxplot(MediatorDomain~Age, data=df, main='Mediator Skills and Age', xlab='Age', ylab='Score')
boxplot(MediatorDomain~CollarColor, data=df, main='Mediator Skills and Collar Color', xlab='Collar Color', ylab='Score')

par(op)

c. Results of statistical assumption tests


We used R to build diagnostic plots to test statistical assumptions:

  1. Independence of observations
  2. The residuals are normally distributed
  3. The residuals are homoscedastic (i.e. the variance of data in groups should be the same)

We found no violation of these assumptions and were able to continue the analysis without transforming the data.

d. HLM results with a comparison of models (e.g., intercept only, random intercept only, an addition of fixed effects to random intercept only model)


Well here’s the HLM results.

summary(fit_HLM)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: MediatorDomain ~ CollarColor + (1 | Gender) + (1 | Age)
##    Data: df
## 
## REML criterion at convergence: 2688
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.79682 -0.71612  0.05872  0.77679  2.19951 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Age      (Intercept) 0.01132  0.1064  
##  Gender   (Intercept) 0.01618  0.1272  
##  Residual             0.76649  0.8755  
## Number of obs: 1038, groups:  Age, 5; Gender, 2
## 
## Fixed effects:
##               Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)  7.828e+00  1.268e-01 3.041e+00  61.744 8.20e-06 ***
## CollarColor2 2.784e-01  1.010e-01 1.033e+03   2.756  0.00596 ** 
## CollarColor3 2.792e-01  9.005e-02 1.031e+03   3.101  0.00198 ** 
## CollarColor4 3.984e-01  8.455e-02 9.722e+02   4.712 2.81e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) CllrC2 CllrC3
## CollarColr2 -0.404              
## CollarColr3 -0.455  0.575       
## CollarColr4 -0.496  0.627  0.701

Now let’s do the comparisons between these different versions of the model. First let’s build them.

# intercept only
fit_int <- lm(MediatorDomain ~ 1, data=df)
# random intercept only
fit_ran <- lmer(MediatorDomain ~ (1 | Gender) + (1 | Age), data=df)
# add fixed effects
fit_fix.ran <- lmer(MediatorDomain ~ CollarColor + (1 | Gender) + (1 | Age), data=df)

Now let’s compare the 3 models using AIC

AIC(fit_int)
## [1] 2713.6
AIC(fit_ran)
## [1] 2707.417
AIC(fit_fix.ran)
## [1] 2701.976

e. Limitations of the analysis and results


This analysis has not taken into account the possible interaction effects of potential covariates such as country, education. Further analysis could expand by accounting for all potential sources of variance.