Analysis Part 3 - Factorial MANOVA

1. Generate one or more hypotheses to be tested. State the null and alternative hypotheses.

Again this is easy as it’s already been implied by the variables chosen. Our hypotheses are..

Participants job level is related to leadership qualities

Participants gender is related to leadership qualities

And the respective nulls..

There is no relationship between job level and leadership qualities

There is no relationship between gender and leadership qualities

2. Provide a brief overview of any missing data, outliers or anomalies and how those cases were handled.

Because we’ve already gone through it twice now, we’ll skip over this part quite quickly. Let’s load in the data and process it as we did before.

# setwd
setwd('~/Documents/upwork/Tammy_Ferrante')
# load in data (df stands for "data frame")
df <- read.csv('LeadershipData.csv')

# pre processing
# count how many rows have zero variance
length(df[df$ZeroVar == 1,]$ResponseId)

## [1] 13

# drop these rows from the data frame
df <- df[!df$ZeroVar == 1,]
# how many rows are we removing?
length(df[!complete.cases(df), ]$ResponseId)

## [1] 71

# remove the rows
df <- df[complete.cases(df),]
# since we have removed a lot of rows it is sometimes useful to reset the index
row.names(df) <- NULL
# make collarcolor a factor
df$CollarColor <- as.factor(df$CollarColor)
df$Gender <- as.factor(df$Gender)

Let’s check the CollarColor groups and make sure there’s enough data in each

summary(df$Gender)

##   1   2 
## 573 672

summary(df$CollarColor)

##   1   2   3   4   5 
## 204 206 328 496  11

Like last time, CollarColor 5 is pretty low (only 11 samples) so let’s drop it as it will likely skew the results.

df <- df[!df$CollarColor == 5,]
df$CollarColor <- droplevels(df$CollarColor)
row.names(df) <- NULL

Now let’s check our groups with a quick plot.

# divde graphics into 1x3 panel
op <- par(mfrow = c(1, 3))
# make Collar Color plots
boxplot(ValuesDomain~CollarColor, data=df, main='Leadership Values', xlab='Collar Color', ylab='Score')
boxplot(MediatorDomain~CollarColor, data=df, main='Mediator Skills', xlab='Collar Color', ylab='Score')
boxplot(FulfillmentDomain~CollarColor, data=df, main='Leadership Fulfillment', xlab='Collar Color', ylab='Score')

# make Gender plots
boxplot(ValuesDomain~Gender, data=df, main='Leadership Values', xlab='Gender', ylab='Score')
boxplot(MediatorDomain~Gender, data=df, main='Mediator Skills', xlab='Gender', ylab='Score')
boxplot(FulfillmentDomain~Gender, data=df, main='Leadership Fulfillment', xlab='Gender', ylab='Score')

Now let’s build our model so we can check assumptions and remove outliers. Notice we use cbind() to merge the three dependent variables together to build the MANOVA. The onyl difference to the last one is that we are now including a second dependent variable (gender) and also looking at the interaction effect of CollarColor:Gender.

fit <- manova(cbind(ValuesDomain, MediatorDomain, FulfillmentDomain) ~ CollarColor*Gender, data = df)

Okay now let’s use Cook’s distance again to remove the outliers just like last time.

# calculate cooksd
cooksd <- cooks.distance(fit)
sample_size <- nrow(df)
influential <- as.numeric(1:length(cooksd[,1]))[(cooksd>4/sample_size)[,1] | (cooksd>4/sample_size)[,2] | (cooksd>4/sample_size)[,3]]

# drop rows
df <- df[-influential,]
# don't forget to reset the index!
row.names(df) <- NULL
print(paste(as.character(length(influential)), 'outliers removed'))

## [1] "117 outliers removed"

3. Address all relevant statistical assumptions and provide a written summary of the findings, including methods for resolving any violations.

Alright, time to check the model assumptions. As before, the assumptions for MANOVA are exactly the same as ANOVA:

Independence of observations
The residuals are normally distributed
The residuals are homoscedastic (i.e. the variance of data in groups should be the same)

And like last time, the only difference is, checking the assumptions for a MANOVA can be a real pain in the ass! Because if you remember from part 2..

Reason being R does not have good diagnostic support for mlm objects. Consequently we need to make a model to test for each dependent variable and category! Luckily we’re only checking one categorical independent variable so that’s just three plots. We’ll do this quickly this time using the plot(lm) to quickly produce our diagnostics.

However first let’s check the independence of observations. Again like in the last exercise we’ll just make sure there’s no duplicate participants.

# How many duplicate IDs?
length(df[duplicated(df$ResponseId),]$ResponseID)

## [1] 0

Now let’s quickly make our models to check assumptions and produce the diagnostics.

fitVal <- lm(formula = ValuesDomain ~ CollarColor*Gender, data = df)
fitMed <- lm(formula = MediatorDomain ~ CollarColor*Gender, data = df)
fitFul <- lm(formula = FulfillmentDomain ~ CollarColor*Gender, data = df)

And make the diagnostic plots.

op <- par(mfrow = c(2,2))
plot(fitVal)

plot(fitMed)

plot(fitFul)

par(op)

Everything looks fine and nothing is being violated so now we can continue.

4. Run the one-way MANOVA using your statistical software of choice.

Now we get to run the MANOVA. First let’s look at the response overall and then break it down to each variable. First we need to refit the model (as we removed the outliers before).

fit <- manova(cbind(ValuesDomain, MediatorDomain, FulfillmentDomain) ~ CollarColor*Gender, data = df)
summary(fit)

##                      Df   Pillai approx F num Df den Df    Pr(>F)    
## CollarColor           3 0.035675   4.4489      9   3327 8.113e-06 ***
## Gender                1 0.031122  11.8528      3   1107 1.209e-07 ***
## CollarColor:Gender    3 0.024277   3.0159      9   3327  0.001367 ** 
## Residuals          1109                                              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Well it’s pretty convincing that CollarColor is having an affect of some kind and so is Gender. Additionally there’s also a significant interaction effect taking place between CollarColor and Gender. Let’s break it down further and look at each DV in isolation.

summary(aov(fit))

##  Response ValuesDomain :
##                      Df  Sum Sq Mean Sq F value    Pr(>F)    
## CollarColor           3   24.10  8.0336  7.4820 5.837e-05 ***
## Gender                1   26.92 26.9193 25.0709 6.424e-07 ***
## CollarColor:Gender    3   15.49  5.1639  4.8093  0.002475 ** 
## Residuals          1109 1190.76  1.0737                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response MediatorDomain :
##                      Df  Sum Sq Mean Sq F value    Pr(>F)    
## CollarColor           3   12.52  4.1722  4.1085 0.0065343 ** 
## Gender                1   11.49 11.4857 11.3105 0.0007970 ***
## CollarColor:Gender    3   19.08  6.3617  6.2646 0.0003238 ***
## Residuals          1109 1126.19  1.0155                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response FulfillmentDomain :
##                      Df  Sum Sq Mean Sq F value    Pr(>F)    
## CollarColor           3   13.82  4.6059  4.3572 0.0046345 ** 
## Gender                1   25.93 25.9328 24.5322 8.444e-07 ***
## CollarColor:Gender    3   18.37  6.1242  5.7934 0.0006268 ***
## Residuals          1109 1172.32  1.0571                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Clearly there is a strong effect here for both CollarColor and Gender. Additionally there is also a significant interaction effect between CollarColor and Gender for the each response. Time to break it down further with a post-hoc analysis.

5. Conduct the most appropriate post-hoc analysis.

Again, Tukey doesn’t support multiple dependent variables but that’s okay. We can just break down the responses to their individual regression values.

TukeyVal <- TukeyHSD(x=aov(fitVal))
TukeyMed <- TukeyHSD(x=aov(fitMed))
TukeyFul <- TukeyHSD(x=aov(fitFul))

Again let’s produce some nice post-hoc analyses using the function we wrote in the previous report.

# usually load all libraries at the start but this one I'm loading now to show where it's used.
library(multcompView)

plot.tukeyGroup <- function(TUKEY, DV, variable){
  # I need to group the treatments that are not different each other together.
  generate_label_df <- function(TUKEY, variable){
   
       # Extract labels and factor levels from Tukey post-hoc 
       Tukey.levels <- TUKEY[[variable]][,4]
       Tukey.labels <- data.frame(multcompLetters(Tukey.levels)['Letters'])
       
       #I need to put the labels in the same order as in the boxplot :
       Tukey.labels$treatment=rownames(Tukey.labels)
       Tukey.labels=Tukey.labels[order(Tukey.labels$treatment) , ]
       return(Tukey.labels)
       }
   
  # Apply the function on my dataset
  LABELS=generate_label_df(TUKEY, variable)
   
  # A panel of colors to draw each group with the same color :
  my_colors=c( rgb(143,199,74,maxColorValue = 255),rgb(242,104,34,maxColorValue = 255),
               rgb(111,145,202,maxColorValue = 255),rgb(254,188,18,maxColorValue = 255),
               rgb(74,132,54,maxColorValue = 255),rgb(236,33,39,maxColorValue = 255),
               rgb(165,103,40,maxColorValue = 255))
   
  # Draw the basic boxplot
  a=boxplot(df[[DV]] ~ df[[variable]], ylim=c(min(df[[DV]]), 1.1*max(df[[DV]])),
            col=my_colors[as.numeric(LABELS[,1])] , ylab="Score" , main=DV)
   
  # I want to write the letter over each box. Over is how high I want to write it.
  over=0.1*max( a$stats[nrow(a$stats),] )
   
  #Add the labels
  text(c(1:nlevels(df[[variable]])), a$stats[nrow(a$stats),]+over, LABELS[,1],
      col=my_colors[as.numeric(LABELS[,1])])
}

Now let’s apply it to our tests. Note that we do not need to do this for Gender as it has only two levels. These post hoc analyese are only useful if you have more than two levels in your factors.

plot.tukeyGroup(TukeyVal, "ValuesDomain", "CollarColor")

plot.tukeyGroup(TukeyMed, "MediatorDomain", "CollarColor")

plot.tukeyGroup(TukeyFul, "FulfillmentDomain", "CollarColor")

Now because we have a positive interaction effect, we know the effects of CollarColor operate differently for each gender. Let’s illustrate this in a plot so we can really get an idea of what is going on.

# first seperate the dataframes by Gender
df2 <- split(df, df$Gender)
# divde graphics into 1x3 panel
op <- par(mfrow = c(1, 3))
# make Collar Color plots for gender 1
print('Male')

## [1] "Male"

color = 'blue'
boxplot(ValuesDomain~CollarColor, data=df2[[1]], main='Leadership Values', xlab='Collar Color', ylab='Score', col=color)
boxplot(MediatorDomain~CollarColor, data=df2[[1]], main='Mediator Skills', xlab='Collar Color', ylab='Score', col=color)
boxplot(FulfillmentDomain~CollarColor, data=df2[[1]], main='Leadership Fulfillment', xlab='Collar Color', ylab='Score', col=color)

# make Collar Color plots for gender 2
print('Female')

## [1] "Female"

color = 'orange'
boxplot(ValuesDomain~CollarColor, data=df2[[2]], main='Leadership Values', xlab='Collar Color', ylab='Score', col=color)
boxplot(MediatorDomain~CollarColor, data=df2[[2]], main='Mediator Skills', xlab='Collar Color', ylab='Score', col=color)
boxplot(FulfillmentDomain~CollarColor, data=df2[[2]], main='Leadership Fulfillment', xlab='Collar Color', ylab='Score', col=color)

6. Provide a summary of the overall findings

We have performed a factorial MANOVA on multiple dependent variables and have shown that CollarColor has a significant effect on leadership qualities and Gender also has a significant effect on leadership qualities. Additionally there is a significant interaction effect between Gender and CollarColor suggesting that the effect of CollarColor on leadership qualities is different bewteen each gender.

summary(fit)

##                      Df   Pillai approx F num Df den Df    Pr(>F)    
## CollarColor           3 0.035675   4.4489      9   3327 8.113e-06 ***
## Gender                1 0.031122  11.8528      3   1107 1.209e-07 ***
## CollarColor:Gender    3 0.024277   3.0159      9   3327  0.001367 ** 
## Residuals          1109                                              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Because of the clear results we are able to reject our null hypotheses that “There is no relationship between job level and leadership qualities” and “There is no relationship between gender and leadership qualities”.

a. Data issues and how they were resolved

The data was cleaned and processed. We removed categorical groups with very few samples and we removed all rows that contained zero variance in their responses and rows containing missing values were removed. Additionally we checked for duplicated ResponseIDs and removed data outliers using 4/N threshold of Cook’s distance.

b. Descriptive summary of the data

summary(df$ValuesDomain)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.750   7.250   8.125   7.993   8.750  10.000

summary(df$MediatorDomain)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.917   7.417   8.167   8.099   8.833  10.000

summary(df$FulfillmentDomain)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   7.364   8.091   8.037   8.818  10.000

summary(df$CollarColor)

##   1   2   3   4 
## 154 171 302 490

summary(df$Gender)

##   1   2 
## 509 608

# divde graphics into 1x3 panel
op <- par(mfrow = c(1, 3))
# make Collar plots
boxplot(ValuesDomain~CollarColor, data=df, main='Leadership Values', xlab='Collar Color', ylab='Score')
boxplot(MediatorDomain~CollarColor, data=df, main='Mediator Skills', xlab='Collar Color', ylab='Score')
boxplot(FulfillmentDomain~CollarColor, data=df, main='Leadership Fulfillment', xlab='Collar Color', ylab='Score')

# make Gender plots
boxplot(ValuesDomain~Gender, data=df, main='Leadership Values', xlab='Gender', ylab='Score')
boxplot(MediatorDomain~Gender, data=df, main='Mediator Skills', xlab='Gender', ylab='Score')
boxplot(FulfillmentDomain~Gender, data=df, main='Leadership Fulfillment', xlab='Gender', ylab='Score')

c. Results of statistical assumption tests

We used R to build diagnostic plots to test the assumptions of MANOVA:

Independence of observations
The residuals are normally distributed
The residuals are homoscedastic (i.e. the variance of data in groups should be the same)

We found no violation of these assumptions and were able to continue the analysis without transforming the data.

op <- par(mfrow = c(2,2))
plot(fitVal)

plot(fitMed)

plot(fitFul)

par(op)

d. One-way MANOVA results and decided to retain or reject the null hypothesis

MANOVA:

summary(fit)

##                      Df   Pillai approx F num Df den Df    Pr(>F)    
## CollarColor           3 0.035675   4.4489      9   3327 8.113e-06 ***
## Gender                1 0.031122  11.8528      3   1107 1.209e-07 ***
## CollarColor:Gender    3 0.024277   3.0159      9   3327  0.001367 ** 
## Residuals          1109                                              
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

for each DV:

summary(aov(fit))

##  Response ValuesDomain :
##                      Df  Sum Sq Mean Sq F value    Pr(>F)    
## CollarColor           3   24.10  8.0336  7.4820 5.837e-05 ***
## Gender                1   26.92 26.9193 25.0709 6.424e-07 ***
## CollarColor:Gender    3   15.49  5.1639  4.8093  0.002475 ** 
## Residuals          1109 1190.76  1.0737                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response MediatorDomain :
##                      Df  Sum Sq Mean Sq F value    Pr(>F)    
## CollarColor           3   12.52  4.1722  4.1085 0.0065343 ** 
## Gender                1   11.49 11.4857 11.3105 0.0007970 ***
## CollarColor:Gender    3   19.08  6.3617  6.2646 0.0003238 ***
## Residuals          1109 1126.19  1.0155                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response FulfillmentDomain :
##                      Df  Sum Sq Mean Sq F value    Pr(>F)    
## CollarColor           3   13.82  4.6059  4.3572 0.0046345 ** 
## Gender                1   25.93 25.9328 24.5322 8.444e-07 ***
## CollarColor:Gender    3   18.37  6.1242  5.7934 0.0006268 ***
## Residuals          1109 1172.32  1.0571                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There was a clear significant relationship between CollarColor and leadership qualities and Gender and leadership qualities. Consequently we are able to reject noth our null hypotheses that “There is no relationship between job level and leadership qualities” and “There is no relationship between gender and leadership qualities”.

e. Limitations of the analysis and results

This analysis has not taken into account the possible interaction effects of other potential covariates such as country, age and employment. Further analysis could expand on this study by investigating the effect of additional covariates.