Chapter 5 Data Wrangling

This class is about loading, cleaning, inspecting, merging and summarising data using the Tidyverse library for data science. The majority of most data science projects is getting the data into the best form for plotting or analysis, what is called data cleaning, data munging or data wrangling.

5.1 Class material

There are no slides for this class, it is taught entirely in workshop format. A core skill you’ll be practicing is searching for help on the task you want to complete. If you feel stuck search for “tidyverse” and a description of the data wrangling task you’re trying to do.

5.2 Exercises

Find the folder /datawrangle in the google drive

  • review the project organisation and files
  • open the file datawrangle_exercises.R
  • complete the tasks described

5.3 Checklist

You should leave being familiar with these concepts, and known where to look up how to implement them. Indented bullets are more advanced topics (not always covered in the exercise)

  • Load data from CSV
    • load data from an excel file
    • load data from a TSV file (tab separated values)
  • Recognise common tidyverse functions (below)
  • Use the pipe operator %>%
  • Mutate: new columns by combining old ones
    • Use string functions on column values (e.g. separate on characters, take substrings)
  • Select: select columns
  • Filter: select rows by values
  • group_by and summarise
    • Understand this is a specific example of a general method of split-apply-combine
  • rename: to change column names
  • convert variable types, e.g. using as.numeric
  • join: to merge data frames, requires a common key between data frames
    • understand inner joins, left and right joins

Almost any data wrangling task you can imagine can be done, you just need to find the right function or functions. So the final item for this classes checklist is

  • practice searching for solutions to your data wrangling problems

5.4 Resources