In this analysis we are going to act as a company that harvests abalone in Tasmania. To save money on processing, we want to be able to predict the age of abalone from physical measurements. The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings under a microscope - a tedious and time-consuming task. Being able to predict abalone age from other physical characteristics would free up labour from this intensive process.
We are using a large dataset of abalone measurements produced by Sam Waugh, which can be downloaded from here. The dataset is clean and well formatted, is very large for its field, and provides more than enough data points to answer our research questions.
Here is the metadata for the dataset.
Name           | Data Type  | Measurement Unit | Description
---------------|------------|------------------|-----------------------------
Sex            | nominal    | --               | M, F, and I (infant)
Length         | continuous | mm               | Longest shell measurement
Diameter       | continuous | mm               | Perpendicular to length
Height         | continuous | mm               | With meat in shell
Whole weight   | continuous | grams            | Whole abalone
Shucked weight | continuous | grams            | Weight of meat
Viscera weight | continuous | grams            | Gut weight (after bleeding)
Shell weight   | continuous | grams            | After being dried
Rings          | integer    | --               | +1.5 gives the age in years
This is a multivariate problem, so we will be looking at a whole range of factors that may predict abalone age. Our hypothesis is:
Abalone age can be predicted from physical characteristics.
And the null hypothesis:
Abalone age cannot be predicted from physical characteristics.
First we need to set up our workspace and read in our data.
# setup workspace
setwd("~/Development/freelancing/upwork/Tammy_Ferrante/multivariate")
# load data
df <- read.csv('abalone.data', header=FALSE) # the raw .data file has no header row
colnames(df) <- c('sex', 'length', 'diameter', 'height', 'whole.weight', 'shucked.weight',
'viscera.weight', 'shell.weight', 'rings')
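It’s worth a quick peek at the first few rows to confirm the file was read and the column names applied as expected (a small optional check, not part of the original write-up):
# confirm the data and the column names look right
head(df)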
Now let’s scan the dataset and check whether there are any missing values. I don’t think there will be, seeing as this dataset comes from a research paper.
# how many rows have missing data?
sum(!complete.cases(df))
## [1] 0
There’s no missing data so let’s continue.
First things first! As this is a multivariate problem, we have a whole bunch of independent variables with different units and therefore on different scales. Before we continue we need to standardise our data so everything is on the same scale, which means subtracting the mean and dividing by the standard deviation. We’ll do this using the scale() function on all variables except rings (our response) and sex (which is categorical).
# standardise data
scaled.df <- df
scaled.df$sex <- NULL # drop the categorical sex column before scaling
scaled.df <- as.data.frame(scale(scaled.df))
# sub back in the original rings and sex data
scaled.df$rings <- df$rings
scaled.df$sex <- df$sex
# rename to df
df <- scaled.df
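As a quick check that the standardisation worked (an optional verification), any scaled column should now have a mean of roughly 0 and a standard deviation of 1:
# standardised columns should be centred on 0 with unit variance
c(mean = mean(df$length), sd = sd(df$length))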
Let’s use Cook’s distance to find anomalies in the data. First let’s fit a model using the formula rings ~ . (in this context "." means "all other variables"), so we’re fitting a linear regression model with all the independent variables as predictors. Remember, for abalone the number of rings gives the age.
fit <- lm(rings ~ ., data=df)
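If you want to see exactly what the "." expanded to, you can pull the term labels out of the fitted model (just a verification step):
# the "." shorthand expands to every other column in df
attr(terms(fit), "term.labels")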
Now let’s check for outliers using Cook’s distance.
cooksd <- cooks.distance(fit)
sample_size <- nrow(df)
plot(cooksd, pch="*", cex=2, main="Influential Obs by Cooks distance") # plot cook's distance
abline(h = 4/sample_size, col="red") # add cutoff line
text(x=1:length(cooksd)+1, y=cooksd,
labels=ifelse(cooksd>4/sample_size, names(cooksd),""), col="red") # add labels
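Before removing anything, you can count how many observations fall above the cutoff line (a quick preview of the same number that length(influential) reports below):
# how many points exceed the 4/n cutoff?
sum(cooksd > 4/sample_size, na.rm = TRUE)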
Now let’s remove them from the dataframe and use length(influential) to see how many we removed.
# remove influential data points
influential <- as.numeric(names(cooksd)[(cooksd > (4/sample_size))])
df <- df[-influential,]
# reindex
row.names(df) <- NULL
# how many data points did we remove?
length(influential)
## [1] 260
260 rows were removed. Great, now let’s refit the model and continue.
fit <- lm(rings ~ ., data=df)
As usual, let’s check our assumptions using R’s diagnostic plots.
op <- par(mfrow = c(2,2))
plot(fit)
par(op)
You might notice some pretty odd patterns in the Scale-Location and Residuals vs Fitted plots. This is not a big deal: it happens because the dependent variable (rings) only takes whole-number values, so it is not truly continuous and the residuals fall into visible bands. Despite this the plots look good, so let’s produce our model.
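As a quick aside (not part of the modelling itself), you can confirm that rings only takes whole-number values with a frequency table:
# rings is a count variable, hence the banding in the diagnostic plots
table(df$rings)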
First up let’s fit our model with all the variables one last time.
fit <- lm(rings ~ ., data=df)
Now we want to find which model best predicts rings (i.e. age). There are lots of ways to do this, but because we don’t have too many variables we can simply try every possible model and choose the one with the lowest AIC score. This is not feasible with many variables, because the number of candidate models grows exponentially with each variable you add. Fortunately we’re not quite at the limit of our computational ability yet.
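As a sanity check on the size of that search space: with 8 predictors (sex plus the seven measurements) there are 2^8 - 1 = 255 possible non-empty models, which matches the 255 rows returned below.
# number of candidate models with 8 predictors
p <- 8
2^p - 1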
We can use the ols_step_all_possible() function from the olsrr library to do this. It returns a dataframe of results for every possible model, scored with a variety of comparison metrics including AIC.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
compare_models <- ols_step_all_possible(fit)
compare_models
## # A tibble: 255 x 6
## Index N Predictors `R-Square` `Adj. R-Square` `Mallow's Cp`
## <int> <int> <chr> <dbl> <dbl> <dbl>
## 1 1 1 shell.weight 0.444 0.444 1604.
## 2 2 1 height 0.428 0.427 1771.
## 3 3 1 diameter 0.400 0.400 2048.
## 4 4 1 length 0.378 0.378 2264.
## 5 5 1 whole.weight 0.343 0.343 2608.
## 6 6 1 viscera.weight 0.315 0.315 2887.
## 7 7 1 sex 0.239 0.238 3649.
## 8 8 1 shucked.weight 0.231 0.231 3726.
## 9 9 2 shucked.weight she… 0.528 0.528 777.
## 10 10 2 whole.weight shuck… 0.521 0.520 850.
## # … with 245 more rows
Now let’s find the model with the lowest AIC score.
compare_models$predictors[which.min(compare_models$aic)]
## [1] "diameter height whole.weight shucked.weight viscera.weight shell.weight sex"
So our best model is diameter + height + whole.weight + shucked.weight + viscera.weight + shell.weight + sex. Let’s have a look at the model.
fit.best <- lm(rings ~ diameter + height + whole.weight + shucked.weight + viscera.weight + shell.weight + sex, data = df)
summary(fit.best)
##
## Call:
## lm(formula = rings ~ diameter + height + whole.weight + shucked.weight +
## viscera.weight + shell.weight + sex, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5203 -1.1090 -0.2177 0.8892 6.9969
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.93088 0.05075 195.676 < 2e-16 ***
## diameter 0.94131 0.08407 11.197 < 2e-16 ***
## height 0.71293 0.07702 9.257 < 2e-16 ***
## whole.weight 4.79398 0.36287 13.211 < 2e-16 ***
## shucked.weight -4.44650 0.18059 -24.622 < 2e-16 ***
## viscera.weight -1.25461 0.12751 -9.840 < 2e-16 ***
## shell.weight 0.75892 0.15665 4.845 1.32e-06 ***
## sexI -0.76441 0.08038 -9.509 < 2e-16 ***
## sexM 0.07747 0.06598 1.174 0.24
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.666 on 3907 degrees of freedom
## Multiple R-squared: 0.6066, Adjusted R-squared: 0.6058
## F-statistic: 753.1 on 8 and 3907 DF, p-value: < 2.2e-16
This is the model that best explains the variance in rings. So now we can just take some simple measurements of an abalone rather than going through the tedious process of counting the shell rings under a microscope. As a final step, let’s check our model by splitting the data into a training set and a testing set to assess its predictive performance.
set.seed(999) # set seed so random values are produced the same way
ind <- sample(c(TRUE, FALSE), length(df$rings), replace=TRUE, prob=c(0.7, 0.3))
df_train <- df[ind,]
df_test <- df[!ind,]
Now let’s build the model on the training set and test its predictive power on the rest of the dataset.
# train model
fit.best <- lm(rings ~ diameter + height + whole.weight + shucked.weight + viscera.weight + shell.weight + sex, data = df_train)
# predict rings values
df_test$rings_predicted <- predict(fit.best, df_test)
Now let’s plot the predicted values against the observed values. If our model were perfect we would see a straight diagonal line (it will not be, because our model is not perfect).
plot(df_test$rings, df_test$rings_predicted)
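To make the comparison easier to judge, you could also overlay the diagonal that a perfect model would fall on (an optional extra on the same plot):
# the y = x line of perfect predictions
abline(a = 0, b = 1, col = "red")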
Well, our model is pretty good. It’s certainly not perfect, but this is about the best we can do using multivariate regression on these variables. Let’s calculate the mean absolute error of the model (i.e. the average absolute amount by which the predictions are off). We can do this with the mae() function from the hydroGOF library.
library(hydroGOF)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
mae(df_test$rings_predicted, df_test$rings)
## [1] 1.238378
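If you would rather not load an extra package, the same value can be computed straight from the definition of mean absolute error:
# mean absolute error by hand; should match mae() above
mean(abs(df_test$rings_predicted - df_test$rings))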
So on average we’re off by about 1.2 years, which is not too bad. From all of this we are clearly able to reject our null hypothesis that abalone age cannot be predicted from physical characteristics.
We looked at the best predictors of abalone age from physical characteristics and found the best model based on AIC. We were able to reject our null hypothesis that abalone age cannot be predicted from physical characteristics.
This model is limited in that it does not account for any potential interaction effects between the independent variables. Further analysis could expand this study by investigating the correlations and interactions between the independent variables. Additionally, we used only one model scoring criterion (AIC) to choose our best model; further studies could compare the results of other criteria such as R-squared values or BIC.
The company can now use this model to estimate the age of abalone, saving a great deal of time and labour costs.