In this analysis we are going to act as a company that harvests abalone in Tasmania. To save money on processing, we want to be able to predict the age of abalone from physical measurements. The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings under a microscope - a tedious and time-consuming task. Being able to predict abalone age from other physical characteristics would free up labour from this intensive process.
We are using a large dataset of abalone measurements produced by Sam Waugh, which can be downloaded from here. The dataset is clean and well formatted, is very large for its field, and provides more than enough data points to answer our research questions.
Here is the metadata for the dataset.
Name           | Data Type  | Measurement Unit | Description
---------------|------------|------------------|-----------------------------
Sex            | nominal    | --               | M, F, and I (infant)
Length         | continuous | mm               | Longest shell measurement
Diameter       | continuous | mm               | Perpendicular to length
Height         | continuous | mm               | With meat in shell
Whole weight   | continuous | grams            | Whole abalone
Shucked weight | continuous | grams            | Weight of meat
Viscera weight | continuous | grams            | Gut weight (after bleeding)
Shell weight   | continuous | grams            | After being dried
Rings          | integer    | --               | +1.5 gives the age in years
This is a multivariate problem, so we will be looking at a whole range of factors that may predict abalone age. Our hypothesis is:
Abalone age can be predicted from physical characteristics.
And the null hypothesis:
Abalone age cannot be predicted from physical characteristics.
First we need to set up our workspace and read in our data.
# setup workspace
setwd("~/Development/freelancing/upwork/Tammy_Ferrante/multivariate")
# load data
df <- read.csv('abalone.data', header=FALSE) # the raw .data file has no header row
colnames(df) <- c('sex', 'length', 'diameter', 'height', 'whole.weight', 'shucked.weight',
'viscera.weight', 'shell.weight', 'rings')
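It’s worth a quick peek at the first few rows to confirm the file was read and the column names applied as expected (a small optional check, not part of the original write-up):
# confirm the data and the column names look right
head(df)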
Now let’s scan the dataset and check whether there are any missing values. I don’t think there will be, seeing as this dataset comes from a research paper.
# how many rows have missing data?
sum(!complete.cases(df))
## [1] 0
There’s no missing data so let’s continue.
First things first! As this is a multivariate problem, we have a whole bunch of independent variables with different units and therefore on different scales. Before we continue we need to standardise our data so everything is on the same scale, which means subtracting the mean and dividing by the standard deviation. We’ll do this using the scale() function on all variables except rings (our response) and sex (which is categorical).
# standardise data
scaled.df <- df
scaled.df$sex <- NULL # drop the categorical sex column before scaling
scaled.df <- as.data.frame(scale(scaled.df))
# sub back in the original rings and sex data
scaled.df$rings <- df$rings
scaled.df$sex <- df$sex
# rename to df
df <- scaled.df
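As a quick check that the standardisation worked (an optional verification), any scaled column should now have a mean of roughly 0 and a standard deviation of 1:
# standardised columns should be centred on 0 with unit variance
c(mean = mean(df$length), sd = sd(df$length))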
Let’s use Cook’s distance to find anomalies in the data. First let’s fit a model using the formula rings ~ . (in this context "." means "all other variables"), so we’re fitting a linear regression model with all the independent variables as predictors. Remember, for abalone the number of rings gives the age.
fit <- lm(rings ~ ., data=df)
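If you want to see exactly what the "." expanded to, you can pull the term labels out of the fitted model (just a verification step):
# the "." shorthand expands to every other column in df
attr(terms(fit), "term.labels")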
Now let’s check for outliers using Cook’s distance.
cooksd <- cooks.distance(fit)
sample_size <- nrow(df)
plot(cooksd, pch="*", cex=2, main="Influential Obs by Cooks distance") # plot cook's distance
abline(h = 4/sample_size, col="red") # add cutoff line
text(x=1:length(cooksd)+1, y=cooksd,
labels=ifelse(cooksd>4/sample_size, names(cooksd),""), col="red") # add labels
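Before removing anything, you can count how many observations fall above the cutoff line (a quick preview of the same number that length(influential) reports below):
# how many points exceed the 4/n cutoff?
sum(cooksd > 4/sample_size, na.rm = TRUE)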
Now let’s remove them from the dataframe and use length(influential) to see how many we removed.
# remove influential data points
influential <- as.numeric(names(cooksd)[(cooksd > (4/sample_size))])
df <- df[-influential,]
# reindex
row.names(df) <- NULL
# how many data points did we remove?
length(influential)
## [1] 260
260 rows were removed. Great, now let’s refit the model and continue.
fit <- lm(rings ~ ., data=df)
As usual, let’s check our assumptions using R’s diagnostic plots.
op <- par(mfrow = c(2,2))
plot(fit)
par(op)
You might notice some pretty odd patterns in the Scale-Location and Residuals vs Fitted plots. This is not a big deal: it happens because the dependent variable (rings) only takes whole-number values, so it is not truly continuous and the residuals fall into visible bands. Despite this the plots look good, so let’s produce our model.
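As a quick aside (not part of the modelling itself), you can confirm that rings only takes whole-number values with a frequency table:
# rings is a count variable, hence the banding in the diagnostic plots
table(df$rings)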
First up let’s fit our model with all the variables one last time.
fit <- lm(rings ~ ., data=df)
Now we want to find which model best predicts rings (i.e. age). There are lots of ways to do this, but because we don’t have too many variables we can simply try every possible model and choose the one with the lowest AIC score. This is not feasible with many variables, because the number of candidate models grows exponentially with each variable you add. Fortunately we’re not quite at the limit of our computational ability yet.
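As a sanity check on the size of that search space: with 8 predictors (sex plus the seven measurements) there are 2^8 - 1 = 255 possible non-empty models, which matches the 255 rows returned below.
# number of candidate models with 8 predictors
p <- 8
2^p - 1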
We can use the ols_step_all_possible() function from the olsrr library to do this. It returns a dataframe of results for every possible model, scored with a variety of comparison metrics including AIC.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
compare_models <- ols_step_all_possible(fit)
compare_models
## # A tibble: 255 x 6
## Index N Predictors `R-Square` `Adj. R-Square` `Mallow's Cp`
## <int> <int> <chr> <dbl> <dbl> <dbl>
## 1 1 1 shell.weight 0.444 0.444 1604.
## 2 2 1 height 0.428 0.427 1771.
## 3 3 1 diameter 0.400 0.400 2048.
## 4 4 1 length 0.378 0.378 2264.
## 5 5 1 whole.weight 0.343 0.343 2608.
## 6 6 1 viscera.weight 0.315 0.315 2887.
## 7 7 1 sex 0.239 0.238 3649.
## 8 8 1 shucked.weight 0.231 0.231 3726.
## 9 9 2 shucked.weight she… 0.528 0.528 777.
## 10 10 2 whole.weight shuck… 0.521 0.520 850.
## # … with 245 more rows
Now let’s find the model with the lowest AIC score.
compare_models$predictors[which.min(compare_models$aic)]
## [1] "diameter height whole.weight shucked.weight viscera.weight shell.weight sex"
So our best model is diameter + height + whole.weight + shucked.weight + viscera.weight + shell.weight + sex. Let’s have a look at the model.
fit.best <- lm(rings ~ diameter + height + whole.weight + shucked.weight + viscera.weight + shell.weight + sex, data = df)
summary(fit.best)
##
## Call:
## lm(formula = rings ~ diameter + height + whole.weight + shucked.weight +
## viscera.weight + shell.weight + sex, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5203 -1.1090 -0.2177 0.8892 6.9969
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.93088 0.05075 195.676 < 2e-16 ***
## diameter 0.94131 0.08407 11.197 < 2e-16 ***
## height 0.71293 0.07702 9.257 < 2e-16 ***
## whole.weight 4.79398 0.36287 13.211 < 2e-16 ***
## shucked.weight -4.44650 0.18059 -24.622 < 2e-16 ***
## viscera.weight -1.25461 0.12751 -9.840 < 2e-16 ***
## shell.weight 0.75892 0.15665 4.845 1.32e-06 ***
## sexI -0.76441 0.08038 -9.509 < 2e-16 ***
## sexM 0.07747 0.06598 1.174 0.24
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.666 on 3907 degrees of freedom
## Multiple R-squared: 0.6066, Adjusted R-squared: 0.6058
## F-statistic: 753.1 on 8 and 3907 DF, p-value: < 2.2e-16
This is the model that best explains the variance in rings. So now we can just take some simple measurements of an abalone rather than going through the tedious process of counting the shell rings under a microscope. As a final step, let’s check our model by splitting the data into a training set and a testing set to assess its predictive performance.
set.seed(999) # set seed so random values are produced the same way
ind <- sample(c(TRUE, FALSE), length(df$rings), replace=TRUE, prob=c(0.7, 0.3))
df_train <- df[ind,]
df_test <- df[!ind,]
Now let’s build the model on the training set and test its predictive power on the rest of the dataset.
# train model
fit.best <- lm(rings ~ diameter + height + whole.weight + shucked.weight + viscera.weight + shell.weight + sex, data = df_train)
# predict rings values
df_test$rings_predicted <- predict(fit.best, df_test)
Now let’s plot the predicted values against the observed values. If our model were perfect we would see a straight diagonal line (it will not be, because our model is not perfect).
plot(df_test$rings, df_test$rings_predicted)
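To make the comparison easier to judge, you could also overlay the diagonal that a perfect model would fall on (an optional extra on the same plot):
# the y = x line of perfect predictions
abline(a = 0, b = 1, col = "red")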
Well, our model is pretty good. It’s certainly not perfect, but this is about the best we can do using multivariate regression on these variables. Let’s calculate the mean absolute error of the model (i.e. the average absolute amount by which the predictions are off). We can do this with the mae() function from the hydroGOF library.
library(hydroGOF)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
mae(df_test$rings_predicted, df_test$rings)
## [1] 1.238378
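If you would rather not load an extra package, the same value can be computed straight from the definition of mean absolute error:
# mean absolute error by hand; should match mae() above
mean(abs(df_test$rings_predicted - df_test$rings))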
So on average we’re off by about 1.2 years, which is not too bad. From all of this we are clearly able to reject our null hypothesis that abalone age cannot be predicted from physical characteristics.
We looked at the best predictors of abalone age from physical characteristics and found the best model based on AIC. We were able to reject our null hypothesis that abalone age cannot be predicted from physical characteristics.
This model is limited in that it does not account for any potential interaction effects between the independent variables. Further analysis could expand this study by investigating the correlations and interactions between the independent variables. Additionally, we used only one model scoring criterion (AIC) to choose our best model; further studies could compare the results of other criteria such as R-squared values or BIC.
The company can now use this model to estimate the age of abalone, saving a great deal of time and labour costs.