Personally I tend to use the base R packages for most data manipulation. The tidyverse packages are quite popular in beginner tutorials and with the MQ R-Users Group, but I’m yet to be convinced that the data reading functions are worth using. I think using the base functions leads to more readable code and better coding practices in general. dplyr and ggplot2 are definitely worth using though, as they’re awesome!

# load packages
library(ggplot2)
library(lubridate)
library(stringr)
library(gridExtra)
# set working directory
setwd("~/Development/other_people_help/Vanessa/bird_call_analysis")
# read in the data
df <- read.csv("ALL_VA_&_BB.csv")

R data cleaning and management tips

I like to use column names that are as simple and descriptive as possible. It makes processing a whole lot easier. I also find it’s good practice to use all lower case for variable names.

colnames(df) <- c('pair','behaviour','behaviour_type','date','month','time','hour','daylight')
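
If you’d rather not type the names out by hand, here’s a rough sketch of doing the same tidy-up programmatically. The helper name clean_names and the example headers are made up for illustration (they’re not from your file), so check the result against your real column names.

```r
# Hypothetical helper: lower-case names and use underscores as separators.
clean_names <- function(x) {
  x <- gsub("[^A-Za-z0-9]+", "_", x)  # runs of non-alphanumerics become "_"
  x <- gsub("^_+|_+$", "", x)          # trim leading/trailing underscores
  tolower(x)
}

# Made-up headers, similar to what read.csv produces from spaced names
example <- data.frame(Pair = 1, Behaviour.Type = "Circling", Date. = "17/08/2018")
colnames(example) <- clean_names(colnames(example))
colnames(example)
```

Typing the names out, like I did above, is still the safest option when the original headers are a real mess.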

Fixing data reading errors

I noticed that the “Circling” behaviour type has an issue where it is followed by an odd hex code (“Circling <a0>”), which is causing it not to display as a label. In the string this is represented as “\xa0”, which is a ‘non-breaking space’ according to Google (I don’t know a lot about Unicode special characters). I think this is something that has come from Excel. Here’s an example:

df$behaviour_type[84:88]
## [1] Ruffle feathers  Circling \xa0    Circling \xa0    Wing display    
## [5] E - repeated woo
## 20 Levels: A - ooom B - ooom Body contact C - single woo ... Wing display

Let’s clean the strings using “gsub”. “gsub” works like this: gsub(pattern, replacement, string)

gsub(" \xa0", "", "Circling \xa0")
## [1] "Circling"

To apply this to each value in df$behaviour_type we will use a function called sapply. sapply takes a vector (in this case df$behaviour_type) and applies a function that we set to every element. In this case we will apply gsub to every element.

df$behaviour_type <- sapply(df$behaviour_type, function(x) gsub(" \xa0","",x))

Now the “Circling” strings are clean and will display correctly.

df$behaviour_type[84:88]
## [1] "Ruffle feathers"  "Circling"         "Circling"         "Wing display"    
## [5] "E - repeated woo"
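
As an aside, gsub is already vectorised, so you can hand it a whole column without sapply. The sketch below uses a made-up vector rather than your data. I’ve added fixed=TRUE and useBytes=TRUE as a precaution, since a raw “\xa0” byte isn’t valid UTF-8 and can trip up the matching on some systems. Whether you actually need them depends on your locale.

```r
# made-up example values, one with the stray byte on the end
x <- c("Ruffle feathers", "Circling \xa0", "Wing display")
# fixed = TRUE: treat the pattern as plain text, not a regular expression
# useBytes = TRUE: match byte-by-byte so the invalid character is not a problem
cleaned <- gsub(" \xa0", "", x, fixed = TRUE, useBytes = TRUE)
cleaned
```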

Datetimes

Another thing I’ve noticed is that you’ve got columns for date, month, time and hour. I’m going to introduce you to datetimes! This is a great way of working with time data. With datetimes you would only need one column to represent all of this data. First up let’s make a datetime column for your data. Because of the nature of your data it makes sense to store this as a datetime object with the local timezone specified.

df$dtAEST <- as.POSIXct(paste(df$date, df$time), format='%d/%m/%Y %H:%M:%S', tz='Australia/Sydney')

Let me break that above line down. We’re using a function called as.POSIXct, which is what will turn our string into a datetime object. A datetime object is a special class of number, measured as seconds since 1970-01-01 (in UTC). To turn your date and time columns into a POSIXct datetime object we first need to paste them together to make a datetime string. E.g.

paste(df$date[1], df$time[1])
## [1] "17/08/2018 0:00:40"

Then the next argument, '%d/%m/%Y %H:%M:%S', tells R what format the string is in. These are strptime format codes, which are used in lots of programming languages. See ?strptime for more info. Finally tz='Australia/Sydney' indicates the timezone we want the datetimes to be displayed in. The underlying structure is still seconds since 1970-01-01 UTC though. Now because we have the new datetime column we can drop the other duplicate data.

# drop the data
df[,c('date','month','time','hour')] <- NULL
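
To see what “seconds since 1970-01-01 UTC” means in practice, here’s a small sketch using the first timestamp from above. The variable names are just for illustration. The number underneath doesn’t change with the timezone, only how it gets printed.

```r
x <- as.POSIXct("17/08/2018 0:00:40", format = "%d/%m/%Y %H:%M:%S",
                tz = "Australia/Sydney")
# the underlying value: seconds since 1970-01-01 00:00:00 UTC
secs <- as.numeric(x)
# the same instant, printed in two different timezones
local_str <- format(x, "%Y-%m-%d %H:%M:%S", tz = "Australia/Sydney")
utc_str   <- format(x, "%Y-%m-%d %H:%M:%S", tz = "UTC")
c(local = local_str, utc = utc_str)
```

Sydney is ten hours ahead of UTC in August, so the UTC version lands on the previous afternoon.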

Plotting

Your plots look sick. The first plot is fine. The Circling label comes up now too, since we cleaned the data.

# graph which shows the count of all behaviour types from june to september
ggplot(df, aes(x=behaviour_type)) +
  geom_bar() + 
  xlab('Behaviour Type') +
  ylab('Count') +
  coord_flip()

Now your hour plot… But we deleted hours? Ah, but we can extract that data from the dtAEST column now. We’re going to use the function hour from lubridate. For example…

hour(df$dtAEST)
##   [1]  0  0  0  0  0  0  0  0  1  1  1  1  1  1  2  2  2  2  2  2  2  3  3  3  3
##  [26]  3  3  3  3  3  3  4  4  4  4  4  4  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [51]  5  6  6  6  6  6  6  6  6  6  6  7  7  7  7  8  8  8  9 10 10 10 11 13 13
##  [76] 13 13 13 15 15 16 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 17 18 18 18
## [101] 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18
## [126] 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18
## [151] 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 19 19 19 19 19 19 19 19 19 19
## [176] 19 19 19 19 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
## [201] 20 21 21 21 21 21 21 21 22 22 22 22 22 22 23 23 23

And if we want to display it as you had it before we can zero pad it using str_pad and paste ‘h’ onto the end with paste0.

paste0(str_pad(hour(df$dtAEST), 2, pad='0'),'h')
##   [1] "00h" "00h" "00h" "00h" "00h" "00h" "00h" "00h" "01h" "01h" "01h" "01h"
##  [13] "01h" "01h" "02h" "02h" "02h" "02h" "02h" "02h" "02h" "03h" "03h" "03h"
##  [25] "03h" "03h" "03h" "03h" "03h" "03h" "03h" "04h" "04h" "04h" "04h" "04h"
##  [37] "04h" "05h" "05h" "05h" "05h" "05h" "05h" "05h" "05h" "05h" "05h" "05h"
##  [49] "05h" "05h" "05h" "06h" "06h" "06h" "06h" "06h" "06h" "06h" "06h" "06h"
##  [61] "06h" "07h" "07h" "07h" "07h" "08h" "08h" "08h" "09h" "10h" "10h" "10h"
##  [73] "11h" "13h" "13h" "13h" "13h" "13h" "15h" "15h" "16h" "17h" "17h" "17h"
##  [85] "17h" "17h" "17h" "17h" "17h" "17h" "17h" "17h" "17h" "17h" "17h" "17h"
##  [97] "17h" "18h" "18h" "18h" "18h" "18h" "18h" "18h" "18h" "18h" "18h" "18h"
## [109] "18h" "18h" "18h" "18h" "18h" "18h" "18h" "18h" "18h" "18h" "18h" "18h"
## [121] "18h" "18h" "18h" "18h" "18h" "18h" "18h" "18h" "18h" "18h" "18h" "18h"
## [133] "18h" "18h" "18h" "18h" "18h" "18h" "18h" "18h" "18h" "18h" "18h" "18h"
## [145] "18h" "18h" "18h" "18h" "18h" "18h" "18h" "18h" "18h" "18h" "18h" "18h"
## [157] "18h" "18h" "18h" "18h" "18h" "18h" "18h" "18h" "18h" "19h" "19h" "19h"
## [169] "19h" "19h" "19h" "19h" "19h" "19h" "19h" "19h" "19h" "19h" "19h" "20h"
## [181] "20h" "20h" "20h" "20h" "20h" "20h" "20h" "20h" "20h" "20h" "20h" "20h"
## [193] "20h" "20h" "20h" "20h" "20h" "20h" "20h" "20h" "20h" "21h" "21h" "21h"
## [205] "21h" "21h" "21h" "21h" "22h" "22h" "22h" "22h" "22h" "22h" "23h" "23h"
## [217] "23h"

Kind of a bit more to go through for the same thing. But it’s useful to be able to make a column like this programmatically and not have to do it manually (did you do this manually?). It’s good practice to have the minimum amount of data in your stored data file and then add all the extras in via scripting.

df$hour <- paste0(str_pad(hour(df$dtAEST), 2, pad='0'),'h')
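
If you want to stick to base R here, format can produce the same labels straight from the datetime, since %H is the zero-padded hour in the strptime codes. A minimal sketch with made-up times:

```r
times <- as.POSIXct(c("2018-08-17 00:00:40", "2018-08-17 18:15:00"),
                    tz = "Australia/Sydney")
# %H gives the zero-padded hour; the trailing h is copied through literally
hour_labels <- format(times, "%Hh")
hour_labels
```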

Now let’s make the plot you did before (which is really cool btw).

# graph which shows the count of all behaviour type at each hour
ggplot(df, aes(x=hour)) +
  geom_bar() +
  xlab('Hour of Day') +
  ylab('Count') +
  coord_flip()

Filtering

I’m not a big fan of the tidyverse piping. I find it makes the code kind of hard to read and harder to see what’s going on. But a lot of people think it’s the best. It’s probably just how I’ve learnt to code. I personally find it easier to just filter in a couple of steps like this… But it’s the same result so you do you!

# filter for "July"
df_sub <- df[month(df$dtAEST) == 7,]
# filter for behaviour
df_sub <- df_sub[df_sub$behaviour_type %in% c("A - ooom","B - ooom"),]
# keep just the columns you want
df_sub <- df_sub[,c('behaviour_type', 'dtAEST', 'hour')]

More plots

This one looks fine :)

# this plot ONLY lets you state the x OR the y, not both, and then does a count of the variables. 
ggplot(df_sub, aes(x=behaviour_type)) +
  geom_bar() +
  xlab('Behaviour Type') +
  ylab('Count') +
  coord_flip()

The problematic plot!

Now the plot you’re having issues with…

ggplot(df_sub, aes(x=hour, y=behaviour_type, fill=behaviour_type)) +
  geom_col(position = "dodge")

This plot isn’t working for two reasons. You set y as behaviour_type, which overrides the count data. Also, using position “dodge” might not be ideal for this data as it will look odd with missing data. Better would be to use geom_bar and just have behaviour_type as the fill so you get a stacked style. Also scale_x_discrete(drop=FALSE) stops ggplot dropping factor levels with no data.

ggplot(df_sub ,aes(x=hour, fill=behaviour_type)) + 
  geom_bar() +
  scale_x_discrete(drop=FALSE) +
  labs(fill='Behaviour Type')

Now… notice that we still don’t have all the hours. This is because of the filtering we did on the dataset before. When we created df_sub, the hour column was a character and not a factor! Now that we’ve filtered the data there are many hours missing. We should have made that column a factor when we created it. That way it would remember the levels even if some of them have no data points. Let’s make that column again as a factor, then do the filtering and try the plot again!

# make the hour column but this time make a factor
df$hour <- as.factor(paste0(str_pad(hour(df$dtAEST), 2, pad='0'),'h'))
# Note: we could have just gone...
# df$hour <- as.factor(df$hour)
# but I'm just illustrating what we should have done. 

# now let's filter again
# filter for "July"
df_sub <- df[month(df$dtAEST) == 7,]
# filter for behaviour
df_sub <- df_sub[df_sub$behaviour_type %in% c("A - ooom","B - ooom"),]
# keep just the columns you want
df_sub <- df_sub[,c('behaviour_type', 'dtAEST', 'hour')]

And now try the plot again!

ggplot(df_sub ,aes(x=hour, fill=behaviour_type)) + 
  geom_bar() +
  scale_x_discrete(drop=FALSE) +
  labs(fill='Behaviour Type') +
  ggtitle('July "oooms"')
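
One catch: as.factor only knows about hours that appear somewhere in the full dataset. In the hour output earlier there are no calls at 12 or 14, so “12h” and “14h” still won’t get a slot on the axis. If you want all 24 hours regardless, one option is to spell out the levels yourself. Here’s a sketch with a made-up vector (all_hours and hours_f are my own names):

```r
# every hour label from "00h" to "23h"
all_hours <- sprintf("%02dh", 0:23)
# made-up hour labels standing in for df$hour
hours <- c("00h", "13h", "13h", "18h")
hours_f <- factor(hours, levels = all_hours)
# levels with no data are kept, with a count of zero
table(hours_f)[c("12h", "13h")]
```

With your data that would be something like df$hour <- factor(df$hour, levels = all_hours), done before the filtering.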

Functions - making everything more efficient

I need to introduce you to a very important concept for programmers called DRY (don’t repeat yourself). You want your code to be as DRY as possible and never WET (write everything twice). For example, we just made a bit of code for subsetting the dataframe and producing our plot. We do not want to have to write this again for every month. What we can do is make a function to do this for us.

Let’s make a function to do the plotting of oooms for a given month…

plot_oooms <- function(df, month_name){
  # month_name is a full month name, e.g. 'July'
  # (called month_name so it doesn't shadow lubridate's month() function)
  # make sure behaviour is a factor
  df$behaviour_type <- as.factor(df$behaviour_type)
  # do our filtering
  df <- df[month(df$dtAEST, label=TRUE, abbr=FALSE) == month_name,]
  df <- df[df$behaviour_type %in%
             c("A - ooom","B - ooom"),]
  # keep both ooom types as levels so the legend stays the same for every month
  df$behaviour_type <- factor(df$behaviour_type, levels=c("A - ooom","B - ooom"))
  # make the plot
  p <- ggplot(df ,aes(x=hour, fill=behaviour_type)) + 
    geom_bar() +
    scale_x_discrete(drop=FALSE) +
    scale_fill_discrete(drop=FALSE) +
    labs(fill='Behaviour Type') +
    ggtitle(paste(month_name,'"oooms"'))
  return(p)
}

Now we can call this function to make plots for us for selected months… For example:

plot_oooms(df, 'September')

Now even better… We can call this function using an apply function (remember when we used sapply before?) and do it for all the months we want.

our_months <- c('July','August','September')
plot_ls <- lapply(our_months, function(m) plot_oooms(df, m))

Now we have all our plots… Let’s plot them together in a grid! First work out how many columns we need…

n <- length(plot_ls)
ncol <- floor(sqrt(n))

And now plot everything together :)

do.call("grid.arrange", c(plot_ls, ncol=ncol))
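
A note on the column sum: floor(sqrt(n)) rounds down, so our three plots end up in a single column. If you’d prefer a squarer grid, rounding up is one alternative. This is just a sketch of the arithmetic, and the n_plots values are made up:

```r
n_plots <- c(3, 4, 6, 9)
cols_floor   <- floor(sqrt(n_plots))    # what we used above
cols_ceiling <- ceiling(sqrt(n_plots))  # alternative: round up instead
rows_ceiling <- ceiling(n_plots / cols_ceiling)  # rows needed to fit them all
data.frame(n_plots, cols_floor, cols_ceiling, rows_ceiling)
```

Either works with grid.arrange. It just depends on whether you want the plots stacked or side by side.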

You can define functions you want to use in a separate R file if you want and just read the functions in with source("./path/to/script.r"). I sent you an email that gives you access to my PhD code repo to see how I organise things. Have a look at my cleaning pipeline in “scripts/data_processing/penguin_cleaning_pipeline.R”. It is pretty short but is doing heaps of work to process my penguin data. It calls functions located in “scripts/penguin_clean.R”.
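
Here’s a tiny sketch of how source works, using a temporary file so it runs anywhere. The function add_h and the file are made up for the example. In practice the file would be one of your own scripts.

```r
# write a one-function script to a temporary file
script_path <- tempfile(fileext = ".R")
writeLines("add_h <- function(x) paste0(x, 'h')", script_path)
# source() runs the file, which defines add_h in your workspace
source(script_path)
add_h("07")
```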