Potential Analysis for BIOL703

So first thing I’ll do is create some dummy data that will be somewhat similar to what we can expect from your experiments. I find this a good step when planning experiments as it forces you to think about how to organise your data and to think through the analysis. Let’s start with your growth data.

Plant Growth

# this line just "sets a seed" which means all random generation will be the same each time so this example will be reproducible on your computer.
set.seed(999)

# generating a data.frame of fake growth data
df_growth <- data.frame(plantID=as.factor(rep(1:30,4)),
                        soil=rep(c(rep("A",15),rep("B",15)),4),
                        time=as.numeric(c(rep(1,30),rep(2,30),rep(3,30),rep(4,30))),
                        height=c(rnorm(15, 5, 2),rnorm(15, 6, 2),
                          rnorm(15, 10, 2),rnorm(15, 12, 2),
                          rnorm(15, 15, 2),rnorm(15, 18, 2),
                          rnorm(15, 17, 2),rnorm(15, 22, 2)))

So in the above lines we’ve created a dataframe with 120 rows and 4 columns. We’ve made 30 plant IDs (repeated 4 times for the 4 measurements); two soil treatments; a time column indicating the measurement interval; and a height column, which is the dependent variable you’re measuring. There are a few functions in there: rep() simply repeats whatever you give it n times, and rnorm() generates random data from a normal distribution (e.g. rnorm(15, 10, 2) means we are creating 15 data points with a mean of 10 and a standard deviation of 2). I’ve put a few in to simulate some growth data and made the results for each soil type a little different so we actually have a measurable effect in this example.
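
If rep() and rnorm() are new to you, here is a small sketch of what each does on its own. The seed and the object name x are just for illustration and aren’t part of the analysis above.

```r
# rep() repeats its input: "times" repeats the whole vector,
# "each" repeats it element by element
rep("A", 3)                    # "A" "A" "A"
rep(c("A", "B"), times = 2)    # "A" "B" "A" "B"
rep(c("A", "B"), each = 2)     # "A" "A" "B" "B"

# rnorm(n, mean, sd) draws n values from a normal distribution
set.seed(1)                    # illustrative seed only
x <- rnorm(1000, mean = 10, sd = 2)
mean(x)                        # should be close to 10
sd(x)                          # should be close to 2
```

With only 15 draws, as in the dummy data, the sample mean and standard deviation will wander further from the values you asked for, which is exactly the kind of noise real measurements have.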

Okay so now we have that, let’s plot the data and see how it looks. We’ll need the ggplot2 library to do this.

# loading ggplot2; in an actual script it's better practice to load all packages at the start
library(ggplot2)

ggplot(df_growth, aes(x=time, y=height, colour=soil)) +
  geom_point() +
  # colour=soil already splits the data by soil, so one smoother draws a line per soil type
  stat_smooth(method = "loess")

Note that I used method = "loess", which is a smoothing function for the line. You might just want linear trend lines, in which case you would use method = "lm". I like the smoothed lines when exploring data as I tend to notice things a bit more.

Anyway, the data looks good so let’s do an analysis. Because we have height as a function of time, we are going to build a linear regression model of the dataset and then perform an analysis of variance (ANOVA) to compare the regression lines for soil types A and B. We fit the model with the following code:

# run a linear regression model
fit1 <- lm(formula = height ~ time*soil, data = df_growth)

So we now have our model saved as fit1. Notice that the formula I’ve used is height ~ time*soil. In R formulas, a*b is shorthand for a + b + a:b. This tells R that we are interested in:

1. the effect of time on growth
2. the effect of soil type on growth
3. the interaction effect of time:soil on growth (e.g. the effect of soil may change over time)
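
If you want to convince yourself of the shorthand, R can expand a formula into its terms for you. This is just a sketch; the object names f_short and f_long are ones I’ve made up here.

```r
f_short <- height ~ time * soil
f_long  <- height ~ time + soil + time:soil

# terms() expands a formula into its individual model terms
attr(terms(f_short), "term.labels")   # "time" "soil" "time:soil"
attr(terms(f_long),  "term.labels")   # "time" "soil" "time:soil"
```

Both formulas expand to the same three terms, so either spelling gives the same model.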

Cool so let’s run our ANOVA on our model. This is very easy.

# run anova
anova(fit1)

## Analysis of Variance Table
## 
## Response: height
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## time        1 3507.6  3507.6 749.658 < 2.2e-16 ***
## soil        1  164.3   164.3  35.111 3.264e-08 ***
## time:soil   1   49.7    49.7  10.630  0.001461 ** 
## Residuals 116  542.8     4.7                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And there are your results. It’s telling you the effect of time on growth is super significant (duh), that the soil effect is significant, and that the interaction is significant (the soils are having a different effect over time in our dummy data).

Germination data

Now let’s have a look at some potential germination data. Let’s start by generating some more fake data.

# generate data
df_germ <- data.frame(plantID=as.factor(1:30),soil=c(rep("A",15),rep("B",15)),
                 germ=as.factor(c(sample(c(0,1), replace=TRUE, size=15, prob=c(0.2,0.8)),
                                  sample(c(0,1), replace=TRUE, size=15, prob=c(0.5,0.5)))))
# convert to count data (easier for plotting)
tab_germ <- split(df_germ, df_germ$soil)
tab_germ <- data.frame(soil=c("A","B"), germ=c(table(tab_germ[[1]]$germ)[2],
                                              table(tab_germ[[2]]$germ)[2]))

So first up I have made a dataframe with 30 rows and 3 columns. We have the 30 plant IDs (1:30), the soil type ("A" or "B") and the binary outcome for germination (1 for germination, 0 for no germination). The function I have used to generate this data is sample() and I have used it differently for the two soil types. sample(c(0,1), replace=TRUE, size=15, prob=c(0.2,0.8)) basically says: sample from 0 and 1 a total of 15 times, with a 0.2 probability of selecting 0 and a 0.8 probability of selecting 1. Notice that for soil type B I’ve used a different probability vector as I want us to have different results to compare.
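
To see what the prob argument is doing, you can draw a much larger sample and check the proportion of 1s. A quick sketch (the seed and the name draws are only for illustration):

```r
set.seed(1)  # illustrative seed only
draws <- sample(c(0, 1), replace = TRUE, size = 10000, prob = c(0.2, 0.8))

mean(draws)    # proportion of 1s; should land close to 0.8
table(draws)   # counts of 0s and 1s
```

With only 15 draws per soil type the observed proportion can sit quite a way from the probability you asked for, which matters later when we think about sample size.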

After this I have made a table-like dataframe with how many successful germinations there were for each soil type (i.e. counted how many 1s).

print(tab_germ)

##   soil germ
## 1    A   13
## 2    B    9
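
As an aside, the same kind of count table can be built a bit more directly with table(). Here is a minimal sketch on a tiny hand-made data frame; toy, counts and tab_toy are names and values I’ve made up for illustration, not your data.

```r
toy <- data.frame(soil = rep(c("A", "B"), each = 5),
                  germ = factor(c(1, 1, 1, 0, 1,   0, 1, 0, 0, 1), levels = c(0, 1)))

# cross-tabulate soil type against germination outcome
counts <- table(toy$soil, toy$germ)
counts

# keep only the number of successful germinations (the "1" column)
tab_toy <- data.frame(soil = rownames(counts), germ = as.vector(counts[, "1"]))
tab_toy   # soil A has 4 germinations, soil B has 2
```

Setting the factor levels explicitly means the "1" column still exists even if one of the outcomes never occurs in a small sample.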

And let’s plot this data to see what it looks like:

# bar plot
ggplot(data=tab_germ, aes(x=soil, y=germ, fill=soil)) +
  geom_bar(stat="identity", width=0.5) +
  guides(fill="none")

So it looks like something is going on in our dummy data (surprise, surprise). Now you’re right that we can’t just compare the total number of germinations for each soil group, as that’s just two numbers; effectively we don’t have any replicates. However, we do have our binary outcome data, and in this setting every plant is a replicate. It is still tricky to see how to analyse this, however, as both the treatment (soil) and the outcome (germination) are categorical.

Luckily there is a tool built exactly for this situation called “logistic regression”, which is made for data with binary outcomes (1 or 0) and can handle categorical independent variables. It would definitely be worth reading up on it. All right, let’s build the model and see the results:

# logistic regression model
fit2 <- glm(formula= germ ~ soil, data=df_germ, family=binomial(link="logit"))
summary(fit2)

## 
## Call:
## glm(formula = germ ~ soil, family = binomial(link = "logit"), 
##     data = df_germ)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0074  -0.8815   0.5350   1.0108   1.0108  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   1.8718     0.7596   2.464   0.0137 *
## soilB        -1.4663     0.9245  -1.586   0.1127  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 34.795  on 29  degrees of freedom
## Residual deviance: 31.971  on 28  degrees of freedom
## AIC: 35.971
## 
## Number of Fisher Scoring iterations: 4
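
The estimates in the coefficient table are on the log-odds scale, which is hard to read directly. Here is a small sketch of converting them back to probabilities with plogis() (the inverse logit), using the two estimates printed above; the names b0, b1, p_A and p_B are just mine.

```r
b0 <- 1.8718    # (Intercept): log-odds of germination in soil A
b1 <- -1.4663   # soilB: change in log-odds going from soil A to soil B

p_A <- plogis(b0)        # estimated germination probability in soil A
p_B <- plogis(b0 + b1)   # estimated germination probability in soil B
round(c(A = p_A, B = p_B), 3)

exp(b1)                  # odds ratio for soil B relative to soil A
```

These work out to roughly 0.87 and 0.60, which are the raw proportions from the count table (13 of 15 and 9 of 15). With a single categorical predictor the model simply reproduces the group proportions, so what it adds is the standard errors and the test.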

Well, it seems from our results we can’t conclude that there is a significant effect of soil type on germination (p = 0.11, which is above 0.05). However, as I mentioned in a previous email, this is likely due to our poor statistical power. For fun let’s run the same analysis for germination but instead of 15 plants in each group, let’s increase that by a factor of 10 and use 150 in each group.

# generate data (note the parentheses: 1:30*10 would give 10, 20, ..., 300, i.e. only 30 distinct IDs)
df_germ <- data.frame(plantID=as.factor(1:(30*10)),soil=c(rep("A",15*10),rep("B",15*10)),
                 germ=as.factor(c(sample(c(0,1), replace=TRUE, size=15*10, prob=c(0.2,0.8)),
                                  sample(c(0,1), replace=TRUE, size=15*10, prob=c(0.5,0.5)))))
# convert to count data (easier for plotting)
tab_germ <- split(df_germ, df_germ$soil)
tab_germ <- data.frame(soil=c("A","B"), germ=c(table(tab_germ[[1]]$germ)[2],
                                              table(tab_germ[[2]]$germ)[2]))

Now let’s plot it again:

# bar plot
ggplot(data=tab_germ, aes(x=soil, y=germ, fill=soil)) +
  geom_bar(stat="identity", width=0.5) +
  guides(fill="none")

And then run the logistic regression once more:

# logistic regression model
fit2 <- glm(formula= germ ~ soil, data=df_germ, family=binomial(link="logit"))
summary(fit2)

## 
## Call:
## glm(formula = germ ~ soil, family = binomial(link = "logit"), 
##     data = df_germ)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7757  -1.0990   0.6805   0.6805   1.2579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.3451     0.2016   6.671 2.54e-11 ***
## soilB        -1.5323     0.2599  -5.895 3.74e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 397.45  on 299  degrees of freedom
## Residual deviance: 359.49  on 298  degrees of freedom
## AIC: 363.49
## 
## Number of Fisher Scoring iterations: 4

And look at that! Now that we’ve increased the sample size we are able to detect the effect of soil on germination. In both cases the underlying effect was the same, but we needed more samples to see the signal within the noise.
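
If you want to put a rough number on “statistical power”, one option is to simulate the experiment many times and count how often the soil effect comes out significant. This is only a sketch under the same assumptions as the dummy data (germination probabilities of 0.8 and 0.5). sim_power() is a helper I’ve made up here, not a built-in function, and I’ve used rbinom() rather than sample() to draw the 0/1 outcomes.

```r
# Hypothetical helper: estimate power by repeatedly simulating the experiment
sim_power <- function(n_per_group, p_A = 0.8, p_B = 0.5, n_sims = 500, alpha = 0.05) {
  soil <- rep(c("A", "B"), each = n_per_group)
  p_values <- replicate(n_sims, {
    # rbinom(n, 1, p) draws n 0/1 outcomes with probability p of a 1
    germ <- c(rbinom(n_per_group, 1, p_A), rbinom(n_per_group, 1, p_B))
    # small groups occasionally germinate 100%, which makes glm() warn
    fit <- suppressWarnings(glm(germ ~ soil, family = binomial(link = "logit")))
    # p-value for the soil effect
    coef(summary(fit))["soilB", "Pr(>|z|)"]
  })
  mean(p_values < alpha)   # share of simulated experiments with p < alpha
}

set.seed(42)  # illustrative seed only
power_15  <- sim_power(15)
power_150 <- sim_power(150)
c(n15 = power_15, n150 = power_150)
```

With 150 plants per group nearly every simulated experiment should detect the effect, while with 15 per group a good share of them will miss it. Trying a few values of n_per_group in between is a cheap way to see what sample size your real experiment might need.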