A few of you have asked me about transforming your data, so I thought I would write a quick tutorial on the subject.

Why transform data?

Transforming data is one step in addressing data that do not fit model assumptions. Most parametric tests require that the residuals be normally distributed and homoscedastic; if these assumptions aren’t met, we can’t trust our results. There is nothing wrong with transforming variables, but you must be careful about how you report results from analyses of transformed variables. For example, looking at the turbidity of water across three locations, you might report, “Locations showed a significant difference in log-transformed turbidity.” To present means or other summary statistics, you might present the mean of the transformed values, or back-transform the means to their original units.

As a reminder, your residuals are the differences between your observed values and the values your model predicts.
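If you want to see this concretely, here’s a minimal sketch with a made-up x and y (not part of the chick data we’re about to use) showing that R’s resid() is exactly observed minus fitted:

# tiny made-up example, purely to illustrate what a residual is
x <- 1:10
y <- x * 2 + rnorm(10)  # a noisy straight line
m <- lm(y ~ x)
# residuals are observed minus fitted; this should print TRUE
all.equal(resid(m), y - fitted(m))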

Chick weight example

This example uses one of the datasets available in the datasets package: chick weights over time for chicks on different diets. Let’s load it in.

library(datasets)
# get the chick weight data
df <- ChickWeight
# print the first few rows
head(df)
##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1

And let’s have a quick look at the data by making a plot.

library(ggplot2)

ggplot(df, aes(x = Time, y = weight, colour = Diet)) +
  geom_point() +
  stat_smooth(method = "lm")

Checking assumptions

Well, the data look good, so let’s analyse them. Let’s build a simple regression model.

fit <- lm(formula = weight ~ Diet*Time, data = df)

An important assumption of regression models (including ANOVA) is that the variances (and therefore standard deviations) of the treatment groups are independent of the group means. This is the assumption of homoscedasticity, or “homogeneity of group variances”. In the data above, you might suspect that weight becomes more variable over time, but it’s not always immediately apparent. We can use some procedures that make it easier to judge whether the assumptions have been met for these data. We’ll do this by making some diagnostic plots, which is very easy to do in R.

The default plot method for this object produces four separate graphs, and, as the comments in the code show, we can display all four on the same graphics device in a 2 × 2 array.

# divides graphics device into 2 x 2 grid
op <- par(mfrow = c(2, 2))
# default diagnostic plots
plot(fit)

# restore graphics device to original state
par(op)

Alright let’s unpack these plots.

Residuals vs Fitted

plot(fit, 1)

This plot checks for linearity and equal spread of the residuals. Here it shows that the data are heteroscedastic: it’s quite clear that the variance of the residuals increases with the fitted values. Ideally you would see a roughly horizontal line with points scattered evenly around it, showing no relationship.
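If you ever want to customise this plot, the same information can be rebuilt by hand; here’s a minimal ggplot2 sketch using the model’s fitted values and residuals:

# residuals vs fitted, rebuilt by hand with ggplot2
ggplot(data.frame(fitted = fitted(fit), resid = resid(fit)),
       aes(x = fitted, y = resid)) +
  geom_point() +
  # the dashed line marks zero residual; points should scatter evenly around it
  geom_hline(yintercept = 0, linetype = "dashed")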

Normal Q-Q

plot(fit, 2)

The Q-Q plot, or quantile-quantile plot, is a graphical tool to help us assess whether a set of data plausibly came from some theoretical distribution, such as a Normal or an exponential. A Q-Q plot is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles came from the same distribution, we should see the points forming a roughly straight diagonal line. In our data the tails are quite heavy. Normal Q-Q plots that exhibit this behaviour usually mean your data have more extreme values than would be expected if they truly came from a Normal distribution.
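You don’t have to rely on plot(fit, 2) either: base R’s qqnorm() and qqline() draw the same comparison directly from the residuals:

# Q-Q plot of the residuals against theoretical Normal quantiles
qqnorm(resid(fit))
# reference line through the first and third quartiles
qqline(resid(fit))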

Scale-Location

plot(fit, 3)

It’s also called a Spread-Location plot. It shows whether the residuals are spread equally along the range of fitted values, which is how you can check the assumption of equal variance (homoscedasticity). It’s good if you see a horizontal line with equally (randomly) spread points.
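Under the hood this plot shows the square root of the absolute standardised residuals against the fitted values, so a minimal hand-rolled version looks like this:

# scale-location by hand: sqrt(|standardised residuals|) vs fitted values
plot(fitted(fit), sqrt(abs(rstandard(fit))),
     xlab = "Fitted values", ylab = "sqrt(|standardised residuals|)")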

Residuals vs Leverage

plot(fit, 5)

This plot helps us find influential cases, if there are any. Not all outliers are influential in linear regression analysis (whatever “outlier” means). Even though some observations have extreme values, they might not be influential in determining the regression line. That is, the results wouldn’t be much different whether we included or excluded them from the analysis.
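One way to quantify influence is Cook’s distance, which is what plot(fit, 5) uses for its contour lines. A common rule of thumb (a convention, not something specific to these data) is to look more closely at observations whose Cook’s distance exceeds 4/n:

# Cook's distance for each observation
cd <- cooks.distance(fit)
# indices of observations above the conventional 4/n cutoff
which(cd > 4 / nrow(df))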

Transforming data

There are many ways you can transform data. I wish I could tell you how to choose, but the best way is really just to try a few and see how they go. Some common ones are the square root, cube root, and log. In this example I’m going to use a log transformation.
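As an aside, if trial and error feels unsatisfying, the Box-Cox procedure from the MASS package (a suggestion on my part, not something you must use) estimates a power transformation of the response from the data; a lambda near 0 on the resulting profile plot points towards a log transformation.

library(MASS)
# profile likelihood across power transformations of weight;
# the lambda with the highest likelihood is the suggested power
boxcox(weight ~ Diet * Time, data = df)

With the transformation chosen, let’s rebuild the model.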

fit <- lm(formula = log(weight) ~ Diet*Time, data = df)

And plot our diagnostics.

# divides graphics device into 2 x 2 grid
op <- par(mfrow = c(2, 2))
# default diagnostic plots
plot(fit)

# restore graphics device to original state
par(op)

You can see it’s not perfect, but it’s a lot better! I’m satisfied, so let’s do our ANOVA.

anova(fit)
## Analysis of Variance Table
## 
## Response: log(weight)
##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
## Diet        3   8.600   2.867   57.986 < 2.2e-16 ***
## Time        1 158.357 158.357 3203.299 < 2.2e-16 ***
## Diet:Time   3   1.644   0.548   11.082 4.433e-07 ***
## Residuals 570  28.178   0.049                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
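One last reporting note, tying back to the caveat at the start: means of log-transformed data back-transform to geometric means, not arithmetic ones. As a quick sketch, here are per-diet geometric mean weights:

# back-transform the per-diet means of log(weight);
# exp() of a mean of logs is the geometric mean
exp(tapply(log(df$weight), df$Diet, mean))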