1. Generate one or more hypotheses to be tested. State the null and alternative hypotheses.


Well, this one is easy as it's already implied by the variables we chose.

Participants' job level is related to leadership qualities

And the null:

There is no relationship between job level and leadership qualities

2. Provide a brief overview of any missing data, outliers or anomalies and how those cases were handled.


Because we've already gone through it in the last analysis, we'll skip over this part quickly. Let's load in the data and process it as we did before.

# setwd
setwd('~/Documents/upwork/Tammy_Ferrante')
# load in data (df stands for "data frame")
df <- read.csv('LeadershipData.csv')

# pre processing
# count how many rows have zero variance
length(df[df$ZeroVar == 1,]$ResponseId)
## [1] 13
# drop these rows from the data frame
df <- df[df$ZeroVar != 1,]
# how many rows contain missing values?
length(df[!complete.cases(df), ]$ResponseId)
## [1] 71
# remove the rows
df <- df[complete.cases(df),]
# since we have removed a lot of rows it is sometimes useful to reset the index
row.names(df) <- NULL
# make collarcolor a factor
df$CollarColor <- as.factor(df$CollarColor)

Let's check the CollarColor groups and make sure there's enough data in each.

summary(df$CollarColor)
##   1   2   3   4   5 
## 204 206 328 496  11

Level 5 is pretty low (only 11 samples). It's interesting data (executives), but let's drop it, as such a small group will likely skew the results.

df <- df[df$CollarColor != 5,]
df$CollarColor <- droplevels(df$CollarColor)
row.names(df) <- NULL

Now let's check our groups with a quick plot. I notice you've used a bar plot for this. That's okay, but bar plots are better suited to frequency data. For this type of data you're much better off producing a box plot, as we did in the last analysis, since it shows details such as outliers, variance and the median.

# divide graphics into a 1x3 panel
op <- par(mfrow = c(1, 3))
# make box plots for each dependent variable
boxplot(ValuesDomain~CollarColor, data=df, main='Leadership Values', xlab='Collar Color', ylab='Score')
boxplot(MediatorDomain~CollarColor, data=df, main='Mediator Skills', xlab='Collar Color', ylab='Score')
boxplot(FulfillmentDomain~CollarColor, data=df, main='Leadership Fulfillment', xlab='Collar Color', ylab='Score')
# reset the graphics panel
par(op)

Now let’s build our model so we can check assumptions and remove outliers. Notice we use cbind() to merge the three dependent variables together to build the MANOVA.

fit <- manova(cbind(ValuesDomain, MediatorDomain, FulfillmentDomain) ~ CollarColor, data = df)

Okay, now let's use Cook's distance again to remove the outliers. This time we won't make a plot, as having three dependent variables makes this a little tricky (the cooksd matrix will have three columns, so it becomes a little more advanced to manipulate). It's possible, but really not worth the effort for this exercise. Also, because of this difference, notice that the way I extract the influential (outlier) indexes is different too.

cooksd <- cooks.distance(fit)
sample_size <- nrow(df)
# flag a row if Cook's distance exceeds 4/N for any of the three DVs
influential <- as.numeric(1:length(cooksd[,1]))[(cooksd > 4/sample_size)[,1] |
                                                (cooksd > 4/sample_size)[,2] |
                                                (cooksd > 4/sample_size)[,3]]

The above line is a little tricky, so let's break it down. We're basically getting a numeric vector of all the cooksd row indexes (i.e. numbers 1:1234, one per row remaining in df) using as.numeric(1:length(cooksd[,1])). We then subset this by following that bit of code with square brackets ([]). Inside the brackets we put a conditional statement that says: for each row, if any of the three columns in cooksd (one column for each dependent variable) contains a value greater than our 4/N cutoff, then include that index value. The | means OR in R. To be honest, I wouldn't expect you to follow this if you're reasonably new to R, as it's rather messy indexing, but that's fine as long as you kinda see where it's coming from.
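For reference, here's an equivalent but more compact way to get the same indexes (a sketch; either version selects the same rows):

# equivalent: flag rows where any of the three DV columns exceeds the 4/N cutoff
influential <- which(apply(cooksd > 4/sample_size, 1, any))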

Anyway, now that we have the outlier indexes we can use that vector to drop those rows from the data frame.

# drop rows
df <- df[-influential,]
# don't forget to reset the index!
row.names(df) <- NULL
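One small R gotcha worth flagging before moving on: if influential ever came back empty, df[-influential, ] would silently return an empty data frame instead of leaving df untouched. A defensive version (a sketch) guards against that:

# guard: -integer(0) selects zero rows, so only subset when outliers were found
if (length(influential) > 0) {
  df <- df[-influential, ]
  row.names(df) <- NULL
}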

3. Address all relevant statistical assumptions and provide a written summary of the findings, including methods for resolving any violations.


Alright, time to check the model assumptions. The assumptions for a MANOVA are essentially the same as for an ANOVA:

  1. Independence of observations
  2. The residuals are normally distributed
  3. The residuals are homoscedastic (i.e. the variance of data in groups should be the same)

The only difference is that checking the assumptions for a MANOVA can be a real pain in the ass! The reason being that R does not have good diagnostic support for mlm objects. Consequently, we need to fit a separate model for each dependent variable and category! Luckily we're only checking one categorical independent variable, so that's just three sets of plots. We'll do this quickly this time, using plot() on each lm fit to produce our diagnostics.

First, however, let's check the independence of observations. Like in the last exercise, we'll just make sure there are no duplicate participants.

# How many duplicate IDs?
sum(duplicated(df$ResponseId))
## [1] 0

Now let’s quickly make our models to check assumptions and produce the diagnostics.

fitVal <- lm(formula = ValuesDomain ~ CollarColor, data = df)
fitMed <- lm(formula = MediatorDomain ~ CollarColor, data = df)
fitFul <- lm(formula = FulfillmentDomain ~ CollarColor, data = df)

And make the diagnostic plots.

op <- par(mfrow = c(2,2))
plot(fitVal)

plot(fitMed)

plot(fitFul)

par(op)
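The plots are the main check here, but if you want numbers to back them up, here's a minimal sketch (not part of the original workflow) using base R's shapiro.test() for residual normality and bartlett.test() for homogeneity of variance. Note that Shapiro-Wilk becomes very sensitive at large sample sizes, so treat it as a guide rather than a verdict.

# normality of residuals for each univariate model
shapiro.test(residuals(fitVal))
shapiro.test(residuals(fitMed))
shapiro.test(residuals(fitFul))
# homogeneity of variance across the CollarColor groups
bartlett.test(ValuesDomain ~ CollarColor, data = df)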

Everything looks fine and nothing appears to be violated, so we can continue. Time for the fun part.

4. Run the one-way MANOVA using your statistical software of choice.


Now we get to run the MANOVA. We'll look at the response overall and then break it down to each variable. First we need to refit the model, since we removed the outliers after building it.

fit <- manova(cbind(ValuesDomain, MediatorDomain, FulfillmentDomain) ~ CollarColor, data = df)
summary(fit)
##               Df   Pillai approx F num Df den Df   Pr(>F)    
## CollarColor    3 0.040508   5.1009      9   3354 7.01e-07 ***
## Residuals   1118                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
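Pillai's trace is the default here (and generally the most robust choice), but if you'd like to cross-check the result with another multivariate statistic, summary() accepts a test argument:

# the same hypothesis evaluated with Wilks' lambda instead of Pillai's trace
summary(fit, test = "Wilks")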

Well, it's pretty convincing that CollarColor is having an effect of some kind! Let's break it down further and look at each DV in isolation.

summary(aov(fit))
##  Response ValuesDomain :
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## CollarColor    3   30.2 10.0656  9.4139 3.806e-06 ***
## Residuals   1118 1195.4  1.0692                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response MediatorDomain :
##               Df  Sum Sq Mean Sq F value   Pr(>F)   
## CollarColor    3   13.36  4.4549  4.4157 0.004273 **
## Residuals   1118 1127.93  1.0089                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response FulfillmentDomain :
##               Df Sum Sq Mean Sq F value   Pr(>F)   
## CollarColor    3   16.4  5.4677  5.1266 0.001591 **
## Residuals   1118 1192.4  1.0666                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
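As a quick aside (not in the original brief), you can turn these tables into a rough effect size by hand: eta squared is just SS_effect / (SS_effect + SS_residual).

# eta squared for ValuesDomain, from the ANOVA table above
30.2 / (30.2 + 1195.4)   # ~0.025, i.e. roughly 2.5% of the variance

So while highly significant, the effects are small in absolute terms.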

Indeed, the effect appears to be significant across all the dependent variables, especially ValuesDomain. Time to break it down further with a post-hoc analysis.

5. Conduct the most appropriate post-hoc analysis.

Unfortunately TukeyHSD() doesn't support multiple dependent variables, but that's okay: we can just run it on each of the individual univariate models.

TukeyVal <- TukeyHSD(x=aov(fitVal))
TukeyMed <- TukeyHSD(x=aov(fitMed))
TukeyFul <- TukeyHSD(x=aov(fitFul))
# show results
TukeyVal
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = fitVal)
## 
## $CollarColor
##            diff          lwr        upr     p adj
## 2-1  0.58753852  0.294365420  0.8807116 0.0000018
## 3-1  0.20966472 -0.053624785  0.4729542 0.1708741
## 4-1  0.25200261  0.006104969  0.4979002 0.0421399
## 3-2 -0.37787380 -0.629566022 -0.1261816 0.0006840
## 4-2 -0.33553591 -0.568973827 -0.1020980 0.0013004
## 4-3  0.04233789 -0.152249248  0.2369250 0.9438707
TukeyMed
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = fitMed)
## 
## $CollarColor
##            diff         lwr        upr     p adj
## 2-1  0.37090334  0.08612276 0.65568393 0.0046144
## 3-1  0.20419006 -0.05156238 0.45994251 0.1690299
## 4-1  0.28694291  0.04808445 0.52580136 0.0109937
## 3-2 -0.16671328 -0.41120043 0.07777387 0.2961382
## 4-2 -0.08396044 -0.31071584 0.14279497 0.7762872
## 4-3  0.08275284 -0.10626395 0.27176964 0.6732564
TukeyFul
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = fitFul)
## 
## $CollarColor
##            diff         lwr         upr     p adj
## 2-1  0.37538938  0.08258303  0.66819572 0.0055278
## 3-1  0.05135189 -0.21160824  0.31431202 0.9584985
## 4-1  0.19716647 -0.04842356  0.44275650 0.1651088
## 3-2 -0.32403749 -0.57541484 -0.07266013 0.0051965
## 4-2 -0.17822291 -0.41136879  0.05492298 0.2012601
## 4-3  0.14581458 -0.04852913  0.34015830 0.2158458

That's a lot of information to process, so maybe it would be easier if we divided the CollarColor levels into groups based on whether or not they differ from each other. We can actually achieve this quite easily using built-in functions.

# Tukey test representation, let's just plot one for now.
plot(TukeyVal)
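Under the hood, the grouping we want comes from the multcompView package (loaded properly in the next block). As a quick sketch, you can feed its multcompLetters() the adjusted p-values from a Tukey result directly; comparisons that share a letter are not significantly different:

# a sketch: turn the Tukey adjusted p-values into compact grouping letters
multcompView::multcompLetters(TukeyVal$CollarColor[, "p adj"])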

That plot is nice, but let's go one better! I reckon we can get it all on a nice box plot and colour-coordinate the groups! Much easier to read! Let's do it by making one big function. I'm just using one someone else wrote (found online) that I've modified slightly.

# usually we'd load all libraries at the start, but I'm loading this one now to show where it's used.
library(multcompView)

plot.tukeyGroup <- function(TUKEY, DV, variable){
  # group together the treatments that are not different from each other
  generate_label_df <- function(TUKEY, variable){

       # extract the adjusted p-values from the Tukey post-hoc results
       Tukey.levels <- TUKEY[[variable]][,4]
       # convert them to grouping letters
       Tukey.labels <- data.frame(multcompLetters(Tukey.levels)['Letters'])

       # put the labels in the same order as in the box plot
       Tukey.labels$treatment <- rownames(Tukey.labels)
       Tukey.labels <- Tukey.labels[order(Tukey.labels$treatment), ]
       return(Tukey.labels)
       }

  # apply the function to our Tukey results
  LABELS <- generate_label_df(TUKEY, variable)

  # a panel of colours so each group is drawn with the same colour
  my_colors <- c(rgb(143,199,74,maxColorValue = 255), rgb(242,104,34,maxColorValue = 255),
                 rgb(111,145,202,maxColorValue = 255), rgb(254,188,18,maxColorValue = 255),
                 rgb(74,132,54,maxColorValue = 255), rgb(236,33,39,maxColorValue = 255),
                 rgb(165,103,40,maxColorValue = 255))

  # map each group letter to a colour index; as.factor() is needed because
  # data.frame() no longer converts strings to factors by default (R >= 4.0)
  group_index <- as.numeric(as.factor(LABELS[,1]))

  # draw the basic box plot
  a <- boxplot(df[[DV]] ~ df[[variable]], ylim = c(min(df[[DV]]), 1.1*max(df[[DV]])),
               col = my_colors[group_index], ylab = "Score", main = DV)

  # write the group letter over each box; 'over' is how high above the box to write it
  over <- 0.1*max(a$stats[nrow(a$stats),])

  # add the labels
  text(c(1:nlevels(df[[variable]])), a$stats[nrow(a$stats),] + over, LABELS[,1],
       col = my_colors[group_index])
}

Great, now let's apply it to our Tukey tests.

plot.tukeyGroup(TukeyVal, "ValuesDomain", "CollarColor")

plot.tukeyGroup(TukeyMed, "MediatorDomain", "CollarColor")

plot.tukeyGroup(TukeyFul, "FulfillmentDomain", "CollarColor")

Now that is much more readable!! We can easily see how the groups relate: boxes that share a letter are not significantly different from each other.

6. Provide a summary of the overall findings


We have performed a MANOVA on multiple dependent variables and have shown that CollarColor has a significant effect on leadership qualities.

summary(fit)
##               Df   Pillai approx F num Df den Df   Pr(>F)    
## CollarColor    3 0.040508   5.1009      9   3354 7.01e-07 ***
## Residuals   1118                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Because of the clear results we are able to reject the null hypothesis that "There is no relationship between job level and leadership qualities".

a. Data issues and how they were resolved


The data was cleaned and processed. We removed the categorical group with very few samples (CollarColor 5), dropped all rows that contained zero variance in their responses, and removed rows containing missing values. Additionally, we checked for duplicated ResponseIds and removed outliers using the 4/N threshold of Cook's distance.

b. Descriptive summary of the data


summary(df$ValuesDomain)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   7.250   8.125   8.015   8.750  10.000
summary(df$MediatorDomain)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.083   7.417   8.167   8.112   8.833  10.000
summary(df$FulfillmentDomain)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   7.364   8.136   8.053   8.818  10.000
summary(df$CollarColor)
##   1   2   3   4 
## 154 177 303 488
# divide graphics into a 1x3 panel
op <- par(mfrow = c(1, 3))
# make box plots for each dependent variable
boxplot(ValuesDomain~CollarColor, data=df, main='Leadership Values', xlab='Collar Color', ylab='Score')
boxplot(MediatorDomain~CollarColor, data=df, main='Mediator Skills', xlab='Collar Color', ylab='Score')
boxplot(FulfillmentDomain~CollarColor, data=df, main='Leadership Fulfillment', xlab='Collar Color', ylab='Score')
# reset the graphics panel
par(op)

c. Results of statistical assumption tests


We used R to build diagnostic plots to test the assumptions of MANOVA:

  1. Independence of observations
  2. The residuals are normally distributed
  3. The residuals are homoscedastic (i.e. the variance of data in groups should be the same)

We found no violation of these assumptions and were able to continue the analysis without transforming the data.

op <- par(mfrow = c(2,2))
plot(fitVal)

plot(fitMed)

plot(fitFul)

par(op)

d. One-way MANOVA results and decision to retain or reject the null hypothesis


MANOVA:

summary(fit)
##               Df   Pillai approx F num Df den Df   Pr(>F)    
## CollarColor    3 0.040508   5.1009      9   3354 7.01e-07 ***
## Residuals   1118                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And for each DV:

summary(aov(fit))
##  Response ValuesDomain :
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## CollarColor    3   30.2 10.0656  9.4139 3.806e-06 ***
## Residuals   1118 1195.4  1.0692                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response MediatorDomain :
##               Df  Sum Sq Mean Sq F value   Pr(>F)   
## CollarColor    3   13.36  4.4549  4.4157 0.004273 **
## Residuals   1118 1127.93  1.0089                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response FulfillmentDomain :
##               Df Sum Sq Mean Sq F value   Pr(>F)   
## CollarColor    3   16.4  5.4677  5.1266 0.001591 **
## Residuals   1118 1192.4  1.0666                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There was a clear significant relationship between CollarColor and leadership qualities. Consequently, we are able to reject the null hypothesis that "There is no relationship between job level and leadership qualities".
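As a side note, if you ever need that overall p-value programmatically (say, for a report), you can pull it straight out of the summary object:

# extract the overall MANOVA p-value from the summary stats table
summary(fit)$stats["CollarColor", "Pr(>F)"]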

e. Limitations of the analysis and results


This analysis has not taken into account the possible effects of potential covariates such as country, age and gender. Further analysis could expand on this by performing a MANCOVA.
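To sketch what that might look like (purely hypothetical: Age and Gender are assumed column names, not variables we've verified in LeadershipData.csv):

# hypothetical sketch: a MANCOVA adding covariates to the same model
fitCov <- manova(cbind(ValuesDomain, MediatorDomain, FulfillmentDomain) ~
                   CollarColor + Age + Gender, data = df)
summary(fitCov)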