BIOL703 - Justin

Create dummy data

The first thing I like to do when planning analysis is to go through it using dummy data. So we’re going to create a data frame in R containing data simmilar to what you may expect from your results. Let’s pretend you sampled 20 rock pools. Let’s make a 20 row data frame (for each rock pool) with the following 6 columns (representing your variables):

Rockpool ID
Diameter
Depth
shelter (categorical)
distance
diversity (response)

set.seed(420)
df <- data.frame(rockpoolID=as.factor(1:20),
                        diameter=c(rnorm(10, 60, 30),rnorm(10, 160, 30)),
                        depth=c(rnorm(10, 30, 5),rnorm(10, 60, 10)),
                        shelter=as.factor(sample(c(1,2,3), replace=TRUE, size=20)),
                        distance=c(rnorm(20, 400, 100)),
                        diversity=as.integer(c(rnorm(10, 10, 5),rnorm(10, 30, 5))))

You don’t really need to follow the above code so much but I’ll explain it anyway. So I’ve used two functions to create the data, rnorm() and sample(). rnorm() generates random data with a normal distribution (e.g. rnorm(10, 60, 40) means we are creating 10 data points with a mean around 60 and a standard deviation of 40). sample(c(1,2,3), replace=TRUE, size=20) basically says sample randomly sample 1, 2 and 3 20 times with a equal probability of selection (good for categorical data). Also notice I’ve used as.integer() on diversity as diversity count will be a whole number and not a float.

So let’s print the data frame and see the result.. You’ll probably see some patterns but I’ve set the data up like that on purpose for the point of this exercise.

df

##    rockpoolID  diameter    depth shelter distance diversity
## 1           1  68.03133 30.28572       3 376.9318        17
## 2           2  31.89878 29.55033       3 372.0529        11
## 3           3  77.88618 34.33180       2 459.0162         7
## 4           4  50.63869 28.26795       1 292.0820        12
## 5           5  71.93937 31.07823       3 191.3815        12
## 6           6  44.55611 30.27582       1 388.2214        10
## 7           7  43.44403 23.43898       3 294.1491         2
## 8           8  46.18499 39.50024       2 255.7723        11
## 9           9 -28.46374 26.26912       1 396.8166         4
## 10         10  36.98562 24.26433       1 313.1342         8
## 11         11 188.15563 63.33320       1 460.4730        22
## 12         12 151.38708 53.40020       3 479.1626        19
## 13         13 201.45068 48.11403       1 446.0004        30
## 14         14 176.41338 53.65589       2 252.9160        30
## 15         15 147.32704 70.91895       3 533.5327        33
## 16         16 155.17029 58.60173       2 416.5810        24
## 17         17 176.08291 53.77797       3 349.5985        21
## 18         18 178.91670 64.75640       1 460.7931        33
## 19         19 167.49320 47.52024       3 505.7611        34
## 20         20 153.43184 31.05091       3 473.6726        32

Now to answer your question about MANOVA. In this case (if I have interpreted your experiemnt correctly) you would not use MANOVA for your analysis. This is beacuse a MANOVA requires at least 2 dependent variables (DV) where as you have 1 DV and 4 independent variables (IV). So considering this I would use a multivariate regression model. Basically we just build a linear regression model with all your model IVs. MANOVA is another flavour of multivariate regression so you weren’t far off.

So let’s built the model. suing using the lm function to build it. Out formula is diversity ~ diameter*depth + shelter + distance. In R formulas, a*b is actually short hand for a + b + a:b. This foruma basically tells R that we are interested in: 1. the effect of diameter on diveristy 2. the effect of depth on diveristy 3. the interaction effect of diamter:depth on diveristy (maybe the imapct of diamter is differet at different depths?) 4. The effect shelter on diveristy 5. the effect of distance on diversity

This model is an example and maybe you would be curious of some other possible interactin effects? I programmed in the data in our example but I reckon the impact of shelter in your real data would be very interesting ;)

fit <- lm(data=df, formula = diversity ~ diameter*depth + shelter + distance)
summary(fit)

## 
## Call:
## lm(formula = diversity ~ diameter * depth + shelter + distance, 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3949 -2.9702  0.4022  4.0322  7.4130 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    -18.092254  15.706510  -1.152   0.2701  
## diameter         0.256762   0.132701   1.935   0.0751 .
## depth            0.682788   0.582198   1.173   0.2619  
## shelter2        -4.530684   5.099133  -0.889   0.3904  
## shelter3        -1.183964   3.595110  -0.329   0.7472  
## distance         0.010267   0.017243   0.595   0.5618  
## diameter:depth  -0.004154   0.003756  -1.106   0.2888  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.678 on 13 degrees of freedom
## Multiple R-squared:  0.8053, Adjusted R-squared:  0.7154 
## F-statistic: 8.962 on 6 and 13 DF,  p-value: 0.0005297

So let’s look at the results. Well there’s not actually a lot going on here. Diameter may be having an effect on diversity but the P values is still pretty low (0.0751 .), although remember this is a dummy dataset and not real. Let’s plot all the IVs and see what they look like. We’ll use the library ggplot2 to do this. Note: I’m loading modules as I go in this tutorial but it’s much better practice to load them at the start of your script. Let’s load it now.

library(ggplot2)

First we’ll build a function to generate a plot.

plotIV <- function(df, iv){  
  ggplot(df, aes_string(x=iv, y="diversity")) + 
  geom_point() +
  stat_smooth(method = "lm", color='grey45')
}

And now we’ll plot the results for each IV using a function called lapply. This just takes the elements of a list and applies a function you specify to each one and saves the reults to another list (in this case plot.ls which will hold all our plots). The argument df = df is just there to tell it which dataframe to use. If you’re confused by this, let me know as this is actually a reasonably complex example to be introduced to lapply() and I can make a more simple tutorial on these vectorised functions if people would like to see them.

# making df$shelter numeric as stat_smooth does not like factor variables
df$shelter <- as.numeric(df$shelter)
# make all the plots
plot.ls <- c(lapply(names(df)[2:5], plotIV, df=df))

Cool, now let’s plot them all together as one plot using the gridExtra library.

library(gridExtra)

# calculate how many plots we have (4)
n <- length(plot.ls)
# calculate how many columns we should use to arrange them
ncol <- floor(sqrt(n))
# plot the 4 plots as a single plot on a 2x2 grid
do.call("grid.arrange", c(plot.ls, ncol=ncol))

Well looking at those plots, I’m suspicious that there is something going on and we’re just not dectecting due to having quite low statistical power. Let’s do all these steps again but instead of 20 rock pools, let’s pretend we sampled 200!

# make data for 200 rock pools
df <- data.frame(rockpoolID=as.factor(1:200),
                 diameter=c(rnorm(100, 60, 30),rnorm(100, 160, 30)),
                 depth=c(rnorm(100, 30, 5),rnorm(100, 60, 10)),
                 shelter=as.factor(sample(c(1,2,3), replace=TRUE, size=200)),
                 distance=c(rnorm(200, 400, 100)),
                 diversity=as.integer(c(rnorm(100, 10, 5),rnorm(100, 30, 5))))

# run the linear model
fit <- lm(data=df, formula = diversity ~ diameter*depth + shelter + distance)
summary(fit)

## 
## Call:
## lm(formula = diversity ~ diameter * depth + shelter + distance, 
##     data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.9099  -3.8661  -0.0769   4.1214  19.1899 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.258e+01  3.886e+00  -3.238  0.00142 ** 
## diameter        1.565e-01  3.165e-02   4.945 1.65e-06 ***
## depth           5.652e-01  9.693e-02   5.830 2.29e-08 ***
## shelter2        1.515e+00  1.068e+00   1.419  0.15747    
## shelter3        1.038e+00  1.030e+00   1.008  0.31494    
## distance       -4.052e-03  4.236e-03  -0.957  0.33992    
## diameter:depth -1.879e-03  6.924e-04  -2.714  0.00724 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.831 on 193 degrees of freedom
## Multiple R-squared:  0.7255, Adjusted R-squared:  0.7169 
## F-statistic:    85 on 6 and 193 DF,  p-value: < 2.2e-16

# make all the plots (don't need to repeat the grid column calculations)
df$shelter <- as.numeric(df$shelter)
plot.ls <- c(lapply(names(df)[2:5], plotIV, df=df))
do.call("grid.arrange", c(plot.ls, ncol=ncol))

Well that changed things! Now it’s evident that diamter is having an effect (1.65e-06 ***) and depth is also having an even bigger effect (2.29e-08 ***). We’re also detecting an interation effect between diameter and depth (0.00724 **). Remember!!! This is all just dummy data so don’t take any of this seriously (you can see the two normal distributions I made to make the data as two seperate splatters which is why we’re seeing an effect at all).

Also I’m not suggestimng you up your sampling to 200, but it’s worth keeping in mind how much difference statistical power can make and in these smaller experiments can be significant.