Note: I’ve written this up using 3 separate ANCOVA models (one for each of the 3 dependent variables). In reality it would be better to run a single MANCOVA on all of them at once, but since the questions ask about ANCOVA I have stuck with that method.

1. Generate one or more hypotheses to be tested. State the null and alternative hypotheses.


I thought the following alternative hypothesis would be interesting to test:

Leadership qualities increase with age (with age comes wisdom)

The corresponding null hypothesis is:

Leadership qualities are unrelated to age

We’re going to use education as a covariate for our ANCOVA, as we might suspect that older people tend to have more education, which could be skewing the results. Let’s investigate.

2. Generate and present the appropriate descriptive statistics and visuals, and synthesize the characteristics of the data.


Well, let’s take a quick look at our data by making some nice plots in R. The first step, of course, is to read in and process the data.

# set our working directory to where our data is kept
setwd('~/Documents/upwork/Tammy_Ferrante')
# load in data (df stands for "data frame")
df <- read.csv('LeadershipData.csv')

Because education and age are categorical, let’s make them factors.

df$Age <- as.factor(df$Age)
df$Education <- as.factor(df$Education)

a. Identify missing data and remove it within the program

Firstly, let’s see how many participants had “zero variance” in their responses (i.e. clicked the same response for everything), as we’ll want to drop these from the analysis.

# count how many rows have zero variance
length(df[df$ZeroVar == 1,]$ResponseId)
## [1] 13
# drop these rows from the data frame
df <- df[!df$ZeroVar == 1,]
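
As an aside, this data set conveniently ships with a ZeroVar flag. If it didn’t, you could derive one yourself. Here’s a hedged sketch assuming the raw item responses live in columns named Q1 through Q20 (hypothetical names; substitute the real questionnaire columns):

# hypothetical sketch: flag respondents whose item responses never vary
item_cols <- paste0("Q", 1:20)  # assumed item column names, not in the real data
zero_var_flag <- as.integer(apply(df[, item_cols], 1, function(r) var(r, na.rm = TRUE) == 0))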

To remove the zero-variance rows I have used df <- df[!df$ZeroVar == 1,]. In R, an exclamation mark (!) means “not”, so this line tells R to keep only the rows where ZeroVar is not equal to 1. Now we can remove the missing data quite easily using the function complete.cases(), which lets us drop rows that have missing values (NA) in any column.

# how many rows are we removing?
length(df[!complete.cases(df), ]$ResponseId)
## [1] 71
# remove the rows
df <- df[complete.cases(df),]
# since we have removed a lot of rows it is sometimes useful to reset the index
row.names(df) <- NULL

Now that we have gotten rid of the junk, let’s have a quick visual look at our data by making some box plots.

# divide graphics into 1x3 panel
op <- par(mfrow = c(1, 3))
# make age plots
boxplot(ValuesDomain~Age, data=df, main='Leadership Values', xlab='Age group', ylab='Score')
boxplot(MediatorDomain~Age, data=df, main='Mediator Skills', xlab='Age group', ylab='Score')
boxplot(FulfillmentDomain~Age, data=df, main='Leadership Fulfillment', xlab='Age group', ylab='Score')

# make education plots
boxplot(ValuesDomain~Education, data=df, main='Leadership Values', xlab='Education Level', ylab='Score')
boxplot(MediatorDomain~Education, data=df, main='Mediator Skills', xlab='Education Level', ylab='Score')
boxplot(FulfillmentDomain~Education, data=df, main='Leadership Fulfillment', xlab='Education Level', ylab='Score')

# return graphics to normal
par(op)

Finally, let’s check our group sizes:

summary(df$Age)
##   1   2   3   4   5   6 
## 186 647 240 118  45   9
summary(df$Education)
##   2   3   4   5   6 
##   8  96 327 554 260

Well, there are a few issues there. Age has only 9 members in group 6, and Education has only 8 in group 2! Let’s drop these groups from the analysis as they could skew the results.

df <- df[!df$Age == 6,]
df <- df[!df$Education == 2,]
df$Age <- droplevels(df$Age)
df$Education <- droplevels(df$Education)
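
Since we’re about to fit a factorial model, it’s also worth cross-tabulating the two factors to spot any sparsely populated Age × Education cells (a small extra check, not in the original write-up):

# cross-tabulate Age and Education to check cell sizes for the factorial design
table(df$Age, df$Education)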

b. Identify any outliers in the data and discuss how those cases were handled

You’ll notice there were a few outliers in the above box plots, so now we need to identify them and remove them from the data. Of course, we can’t identify outliers with respect to a model before we actually have a model, so let’s build the linear regression models we will use for our ANCOVA. I’ll explain these in more detail later.

# build models
fitVal <- lm(formula = ValuesDomain ~ Age*Education, data = df)
fitFul <- lm(formula = FulfillmentDomain ~ Age*Education, data = df)
fitMed <- lm(formula = MediatorDomain ~ Age*Education, data = df)

Now we’re going to use something called “Cook’s distance” to remove outliers from our data. Cook’s distance measures how far, on average, the predicted y-values move if the observation in question is dropped from the data set. We’ll use the traditional cut-off of 4/N. I’ll do a detailed example of this process for one of the models and then we’ll repeat it for all the models.

cooksd <- cooks.distance(fitVal)
sample_size <- nrow(df)
plot(cooksd, pch="*", cex=2, main="Influential Obs by Cooks distance")  # plot cook's distance
abline(h = 4/sample_size, col="red")  # add cutoff line
text(x=1:length(cooksd)+1, y=cooksd, 
     labels=ifelse(cooksd>4/sample_size, names(cooksd),""), col="red")  # add labels
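
To make the definition concrete, here’s a small sketch (not part of the original analysis) that recomputes Cook’s distance for the first observation from first principles: refit the model without that row, see how much every fitted value shifts, and scale by the number of parameters times the residual variance.

# sketch: recompute Cook's distance for observation 1 from its definition
p   <- length(coef(fitVal))                       # number of model parameters
mse <- sum(resid(fitVal)^2) / df.residual(fitVal) # residual variance estimate
fit_drop1 <- update(fitVal, data = df[-1, ])      # refit without observation 1
yhat_full <- predict(fitVal, newdata = df)        # fitted values from full model
yhat_drop <- predict(fit_drop1, newdata = df)     # fitted values without obs 1
sum((yhat_full - yhat_drop)^2) / (p * mse)        # should match cooksd[1]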

We’ll drop all the values above the threshold. Let’s do this for all of the models.

# calculate cooks distance
cooksdV <- cooks.distance(fitVal)
cooksdM <- cooks.distance(fitMed)
cooksdF <- cooks.distance(fitFul)
# make index list of influential rows to remove
influential <- as.numeric(names(cooksdV)[(cooksdV > (4/sample_size))])
influential <- append(influential, as.numeric(names(cooksdM)[(cooksdM > (4/sample_size))]))
influential <- append(influential, as.numeric(names(cooksdF)[(cooksdF > (4/sample_size))]))
influential <- sort(unique(influential))
length(influential)
## [1] 87

So we have 87 outliers we are dropping based on Cook’s distance. Let’s drop them and refit the models.

# remove by row name rather than position, since the row names
# have gaps after the earlier group drops
df <- df[!(row.names(df) %in% influential),]
# rebuild models
fitVal <- lm(formula = ValuesDomain ~ Age*Education, data = df)
fitMed <- lm(formula = MediatorDomain ~ Age*Education, data = df)
fitFul <- lm(formula = FulfillmentDomain ~ Age*Education, data = df)

3. Address all relevant statistical assumptions and provide a written summary of the findings, including methods for resolving any violations.


ANCOVA, like ANOVA, is a type of linear regression. Consequently it has the same assumptions as ANOVA:

  1. Independence of observations
  2. The residuals are normally distributed
  3. The residuals are homoscedastic (i.e. the variance of data in groups should be the same)

Let’s investigate these. For the sake of brevity I will perform these checks for one of our ANCOVA models but the others should be checked too.

Independence of observations


Independence of observations means each participant is only counted as one observation. Fortunately this data set has a ResponseId column we can use to check this. I’m going to do this for one of the models, but you may wish to repeat it for each model.

# How many duplicate IDs?
length(df[duplicated(df$ResponseId),]$ResponseId)
## [1] 0

There are no duplicate IDs, so this assumption is satisfied.

The residuals are normally distributed


Remember the residuals are simply how much your values deviate from what your model predicts.

Figure: an example of residuals (from sthda.com)
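
In code, a residual is simply the observed value minus the fitted value. A quick sanity check using the fitVal model built earlier:

# residuals are just observed minus fitted values
all.equal(as.numeric(resid(fitVal)),
          as.numeric(df$ValuesDomain - fitted(fitVal)))  # TRUE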

In ANOVA, these residuals should be normally distributed. We can check this by producing a Q-Q (quantile-quantile) plot. A Q-Q plot is a graphical tool that helps us assess whether a set of data plausibly came from some theoretical distribution, such as a normal or exponential distribution. It is a scatter plot created by plotting two sets of quantiles against one another: if both sets of quantiles came from the same distribution, we should see the points forming a roughly straight line.

# extract residuals
fitVal.resid <- resid(fitVal)
# make Q-Q plot
qqnorm(fitVal.resid)
qqline(fitVal.resid)

In our case the line is quite straight, without any heavy tails. Consequently we can conclude that the residuals are normally distributed.
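
If you’d like a formal test to back up the visual check (an optional extra, not part of the original analysis), a Shapiro–Wilk test on the residuals is one option; a p-value above 0.05 is consistent with normality:

# Shapiro-Wilk normality test on the model residuals
shapiro.test(fitVal.resid)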

The residuals are homoscedastic


Now we want to make sure the residuals are homoscedastic and not heteroscedastic. Basically, heteroscedastic data have unequal variance across treatment groups or across the range of a predictor. Here’s an example of heteroscedastic data.

Figure: an example of heteroscedastic data (from Wikipedia)
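
If you’d like to generate an example yourself, here’s a tiny simulation (purely illustrative, unrelated to the study data) where the noise grows with the predictor:

# simulate data whose spread increases with x (heteroscedastic)
set.seed(42)
x <- runif(200, 1, 10)
y <- 2 * x + rnorm(200, sd = x)   # noise standard deviation scales with x
plot(x, y, main = "Simulated heteroscedastic data")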

We can check for this graphically or with a statistical test; I’ll run you through both. First, the graphical method. R actually has some built-in plotting functions for checking all of these assumptions (including a Q-Q plot), which you can access simply by plotting a model.

plot(fitVal, 1)

Residuals vs Fitted: this plot checks for linearity. If the data is homoscedastic you would expect to see a straight line with randomly distributed points around it. In our case it’s quite clear our data is homoscedastic, but let’s be doubly sure!

plot(fitVal, 3)

Scale-Location: also called a Spread-Location plot, this shows whether the residuals are spread equally along the range of the predictors, making it a good way to check the assumption of equal variance (homoscedasticity). It’s good if you see a horizontal line with equally (randomly) spread points. Well, it looks pretty good! But just to be certain, let’s do a statistical test.

We’re going to use a Breusch–Pagan test, which tests for heteroscedasticity in a linear regression model. It’s available through the lmtest package (install it with install.packages('lmtest') if needed).

lmtest::bptest(fitVal)
## 
##  studentized Breusch-Pagan test
## 
## data:  fitVal
## BP = 32.704, df = 19, p-value = 0.026

The p-value is less than our 0.05 cutoff, so our data is likely heteroscedastic. Often this can be fixed with a simple log transformation. I looked at all the other models as well, and they also require transformation. For more information on data transformation, I have a tutorial you can view here.

fitVal <- lm(formula = log(ValuesDomain) ~ Age*Education, data = df)
fitMed <- lm(formula = log(MediatorDomain) ~ Age*Education, data = df)
fitFul <- lm(formula = log(FulfillmentDomain) ~ Age*Education, data = df)

And, for brevity, let’s just re-check the Values model…

lmtest::bptest(fitVal)
## 
##  studentized Breusch-Pagan test
## 
## data:  fitVal
## BP = 26.962, df = 19, p-value = 0.1055

Looks good!

4. Run the factorial ANCOVA using R or Python

Great, time to finally run our ANCOVA analyses! Let’s run the ANCOVA for each of the leadership metrics we’re interested in. Remember, the formula we put in the models was DV ~ Age*Education. In R formulas, a*b is actually shorthand for a + b + a:b (see the quick check after the list below). This tells R that we are interested in:

  1. the effect of age on leadership
  2. the effect of education on leadership
  3. the interaction effect of Age:Education on leadership (i.e. whether the effect of our covariate, education, depends on age group)
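
Here’s that quick check. As a small sketch (illustrative only), we can confirm the two formulas produce identical models:

# illustrative check: Age*Education expands to main effects plus interaction
fitExpanded <- lm(log(ValuesDomain) ~ Age + Education + Age:Education, data = df)
all.equal(coef(fitVal), coef(fitExpanded))  # TRUE: identical coefficients

Now, the ANCOVA tables: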
anova(fitVal)
## Analysis of Variance Table
## 
## Response: log(ValuesDomain)
##                 Df  Sum Sq  Mean Sq F value   Pr(>F)   
## Age              4  0.3602 0.090060  3.5747 0.006622 **
## Education        3  0.2542 0.084734  3.3632 0.018153 * 
## Age:Education   12  0.3299 0.027491  1.0912 0.363539   
## Residuals     1123 28.2929 0.025194                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fitMed)
## Analysis of Variance Table
## 
## Response: log(MediatorDomain)
##                 Df  Sum Sq  Mean Sq F value  Pr(>F)  
## Age              4  0.2901 0.072521  3.3070 0.01051 *
## Education        3  0.1664 0.055462  2.5291 0.05591 .
## Age:Education   12  0.1957 0.016308  0.7437 0.70904  
## Residuals     1123 24.6267 0.021929                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fitFul)
## Analysis of Variance Table
## 
## Response: log(FulfillmentDomain)
##                 Df  Sum Sq  Mean Sq F value    Pr(>F)    
## Age              4  0.4539 0.113485  4.7627 0.0008194 ***
## Education        3  0.0810 0.026986  1.1326 0.3347664    
## Age:Education   12  0.3357 0.027974  1.1740 0.2967339    
## Residuals     1123 26.7587 0.023828                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In all cases it’s clear that age has a significant effect on leadership qualities, while education only has an effect on Values. We can also see that the interaction effect (Age:Education) is not significant, indicating that the effect of the covariate does not depend on the level of the ’treatment’ group (i.e. age), and thus that the same slope describes the relationship between leadership and education for each age group. If the Age:Education interaction were significant, the slopes of the separate regressions would differ, meaning there could be no common adjustment for the covariate across all the treatment groups, and the analysis should not proceed.
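
As an optional follow-up (not part of the original write-up), you could also report effect sizes alongside the p-values, for example partial eta-squared via the effectsize package (assuming it is installed):

# partial eta-squared for the Values model
# requires: install.packages("effectsize")
effectsize::eta_squared(aov(fitVal), partial = TRUE)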

5. Conduct post-hoc analyses if appropriate.

Well, we now know that there are differences in leadership qualities based on age, but let’s do a post-hoc analysis to see where these differences lie. We’re going to use something called Tukey’s HSD (honestly significant difference) test.

TukeyHSD(x=aov(fitVal), 'Age', conf.level=0.95)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = fitVal)
## 
## $Age
##            diff          lwr        upr     p adj
## 2-1 0.013157616 -0.024785012 0.05110024 0.8781762
## 3-1 0.021948882 -0.022304596 0.06620236 0.6565863
## 4-1 0.032483023 -0.021218765 0.08618481 0.4639963
## 5-1 0.096666350  0.022504462 0.17082824 0.0035245
## 3-2 0.008791266 -0.025056014 0.04263855 0.9543311
## 4-2 0.019325407 -0.026184483 0.06483530 0.7740451
## 5-2 0.083508734  0.015045213 0.15197225 0.0078974
## 4-3 0.010534141 -0.040356697 0.06142498 0.9799749
## 5-3 0.074717468  0.002564984 0.14686995 0.0381377
## 5-4 0.064183327 -0.014120521 0.14248718 0.1659721

This compares all the groups! You’re interested in the adjusted p-values. In this case we can see that differences are present between groups 5-1, 5-2 and 5-3. We can repeat this for the other models.

TukeyHSD(x=aov(fitMed), 'Age', conf.level=0.95)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = fitMed)
## 
## $Age
##           diff          lwr        upr     p adj
## 2-1 0.00726517 -0.028133909 0.04266425 0.9805989
## 3-1 0.01772974 -0.023557130 0.05901661 0.7667015
## 4-1 0.03453871 -0.015563090 0.08464050 0.3267902
## 5-1 0.07907838  0.009888065 0.14826870 0.0157636
## 3-2 0.01046457 -0.021113699 0.04204284 0.8948762
## 4-2 0.02727354 -0.015185520 0.06973259 0.4006733
## 5-2 0.07181321  0.007939263 0.13568716 0.0185013
## 4-3 0.01680897 -0.030670317 0.06428825 0.8698360
## 5-3 0.06134864 -0.005966974 0.12866426 0.0935720
## 5-4 0.04453968 -0.028514938 0.11759429 0.4557145
TukeyHSD(x=aov(fitFul), 'Age', conf.level=0.95)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = fitFul)
## 
## $Age
##           diff          lwr        upr     p adj
## 2-1 0.01406007 -0.022839486 0.05095962 0.8361951
## 3-1 0.02741547 -0.015621441 0.07045238 0.4094671
## 4-1 0.05625087  0.004025389 0.10847635 0.0274135
## 5-1 0.09078370  0.018660593 0.16290682 0.0054614
## 3-2 0.01335540 -0.019561384 0.04627219 0.8020971
## 4-2 0.04219080 -0.002067981 0.08644958 0.0701835
## 5-2 0.07672364  0.010142242 0.14330504 0.0145211
## 4-3 0.02883540 -0.020656406 0.07832720 0.5028951
## 5-3 0.06336823 -0.006800713 0.13353718 0.0988545
## 5-4 0.03453284 -0.041618369 0.11068404 0.7284738
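
For a visual summary of these pairwise comparisons (an optional extra), the TukeyHSD result has a built-in plot method that draws the family-wise confidence intervals; intervals that do not cross zero correspond to significant differences:

# plot family-wise confidence intervals for the Age comparisons (Values model)
plot(TukeyHSD(aov(fitVal), 'Age', conf.level = 0.95))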

6. Provide a summary of the overall findings


a. Data issues and how they were resolved


The data was cleaned and processed: we removed categorical groups with very few samples, rows with zero variance in their responses, and rows containing missing values. Additionally, we checked for duplicated ResponseIds and removed outliers using the 4/N threshold on Cook’s distance.

b. Descriptive summary of the data


summary(df$ValuesDomain)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   7.250   8.000   7.935   8.750  10.000
summary(df$MediatorDomain)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.000   7.333   8.167   8.034   8.833  10.000
summary(df$FulfillmentDomain)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.727   7.273   8.091   7.972   8.818  10.000
summary(df$Age)
##   1   2   3   4   5 
## 167 600 226 107  43
summary(df$Education)
##   3   4   5   6 
##  85 297 515 246

And some plots visualizing the same data.

# divide graphics into 1x3 panel
op <- par(mfrow = c(1, 3))
# make age plots
boxplot(ValuesDomain~Age, data=df, main='Leadership Values', xlab='Age group', ylab='Score')
boxplot(MediatorDomain~Age, data=df, main='Mediator Skills', xlab='Age group', ylab='Score')
boxplot(FulfillmentDomain~Age, data=df, main='Leadership Fulfillment', xlab='Age group', ylab='Score')

# make education plots
boxplot(ValuesDomain~Education, data=df, main='Leadership Values', xlab='Education Level', ylab='Score')
boxplot(MediatorDomain~Education, data=df, main='Mediator Skills', xlab='Education Level', ylab='Score')
boxplot(FulfillmentDomain~Education, data=df, main='Leadership Fulfillment', xlab='Education Level', ylab='Score')

# return graphics to normal
par(op)

c. Results of statistical assumption tests

We used R to build diagnostic plots to test the assumptions of ANCOVA:

  1. Independence of observations
  2. The residuals are normally distributed
  3. The residuals are homoscedastic (i.e. the variance of data in groups should be the same)

We found one violation: a Breusch–Pagan test showed the residuals were heteroscedastic, which we resolved by log-transforming the dependent variables. The remaining assumptions were satisfied.

Example:

op <- par(mfrow = c(2,2))
plot(fitVal)

par(op)

d. Factorial ANCOVA results and decision to retain or reject the null hypothesis (address main effects and interaction effects)

anova(fitVal)
## Analysis of Variance Table
## 
## Response: log(ValuesDomain)
##                 Df  Sum Sq  Mean Sq F value   Pr(>F)   
## Age              4  0.3602 0.090060  3.5747 0.006622 **
## Education        3  0.2542 0.084734  3.3632 0.018153 * 
## Age:Education   12  0.3299 0.027491  1.0912 0.363539   
## Residuals     1123 28.2929 0.025194                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fitMed)
## Analysis of Variance Table
## 
## Response: log(MediatorDomain)
##                 Df  Sum Sq  Mean Sq F value  Pr(>F)  
## Age              4  0.2901 0.072521  3.3070 0.01051 *
## Education        3  0.1664 0.055462  2.5291 0.05591 .
## Age:Education   12  0.1957 0.016308  0.7437 0.70904  
## Residuals     1123 24.6267 0.021929                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fitFul)
## Analysis of Variance Table
## 
## Response: log(FulfillmentDomain)
##                 Df  Sum Sq  Mean Sq F value    Pr(>F)    
## Age              4  0.4539 0.113485  4.7627 0.0008194 ***
## Education        3  0.0810 0.026986  1.1326 0.3347664    
## Age:Education   12  0.3357 0.027974  1.1740 0.2967339    
## Residuals     1123 26.7587 0.023828                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We found that age had a significant effect on leadership qualities and were therefore able to reject the null hypothesis that “leadership qualities are unrelated to age”. Furthermore, we found no significant interaction effect between age and level of education.

e. Limitations of the analysis and results

We only investigated two independent variables and did not test other potential covariates such as gender or country. Additionally, ANCOVA is not the ideal analysis for this study given that we have multiple dependent variables; a better approach would be a MANCOVA, which considers all the DVs together, as sketched below.
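
For reference, such a MANCOVA could look something like the sketch below (illustrative only; it was not run as part of this analysis):

# sketch: fit all three (log-transformed) leadership scores as one multivariate response
fitMan <- manova(cbind(log(ValuesDomain), log(MediatorDomain), log(FulfillmentDomain))
                 ~ Age * Education, data = df)
summary(fitMan, test = "Pillai")  # multivariate test for each term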