Chapter 5 Data Wrangling

This class is about loading, cleaning, inspecting, merging and summarising data using the tidyverse, a collection of R packages for data science. In most data science projects, the majority of the work is getting the data into the best form for plotting or analysis. This is called data cleaning, data munging or data wrangling.

5.1 Class material

There are no slides for this class; it is taught entirely in workshop format. A core skill you’ll be practising is searching for help on the task you want to complete. If you feel stuck, search for “tidyverse” plus a description of the data wrangling task you’re trying to do.

5.2 Exercises

Find the folder /datawrangle in the Google Drive, then:

review the project organisation and files
open the file datawrangle_exercises.R
complete the tasks described

5.3 Checklist

You should leave being familiar with these concepts, and knowing where to look up how to implement them. Indented bullets are more advanced topics (not always covered in the exercises).

Load data from a CSV file
- load data from an Excel file
- load data from a TSV file (tab-separated values)
Recognise common tidyverse functions (below)
Use the pipe operator %>%
mutate: create new columns by combining old ones
- use string functions on column values (e.g. separate on characters, take substrings)
select: select columns
filter: select rows by values
group_by and summarise
- understand that this is a specific example of a general method, split-apply-combine
rename: change column names
Convert variable types, e.g. using as.numeric
join: merge data frames, which requires a common key between the data frames
- understand inner joins, left and right joins
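
As a rough illustration of how the checklist items fit together, here is a minimal sketch. The data, file and column names (id, name_group, score, age) are made up for this example and are not taken from datawrangle_exercises.R; it assumes the dplyr, readr and tidyr packages are installed.

```r
library(dplyr)
library(readr)
library(tidyr)

# Hypothetical data: write a small CSV to a temporary file so the
# example runs on its own, without any course files
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,name_group,score",
  "1,ana_control,10",
  "2,ben_treatment,14",
  "3,cat_control,12",
  "4,dev_treatment,not recorded"
), csv_path)

# Load data from CSV, reading every column as text so that the
# type conversion further down is explicit
scores <- read_csv(csv_path, col_types = cols(.default = col_character()))

cleaned <- scores %>%
  rename(participant = id) %>%                              # change a column name
  separate(name_group, into = c("name", "condition"),
           sep = "_") %>%                                   # split one column on a character
  mutate(score = suppressWarnings(as.numeric(score))) %>%   # "not recorded" becomes NA
  filter(!is.na(score)) %>%                                 # keep rows with a usable score
  select(participant, condition, score)                     # keep only these columns

# Split-apply-combine: split by condition, apply mean() and n(),
# combine into one row per condition
summary_by_condition <- cleaned %>%
  group_by(condition) %>%
  summarise(mean_score = mean(score), n = n())

# A second, hypothetical table sharing the key column "participant"
ages <- tibble(participant = c("1", "2", "5"), age = c(21, 34, 28))

# Inner join keeps only participants present in both tables;
# left join keeps every row of `cleaned` and fills missing ages with NA
inner <- inner_join(cleaned, ages, by = "participant")
left  <- left_join(cleaned, ages, by = "participant")
```

Printing inner and left side by side is a quick way to see the difference between the join types: the participant with no matching row in ages disappears from the inner join but is kept, with a missing age, in the left join.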

Almost any data wrangling task you can imagine can be done; you just need to find the right function or functions. So the final item on this class's checklist is:

practice searching for solutions to your data wrangling problems

5.4 Resources

Slides on Data Cleaning https://cghlewis.github.io/ncme-data-cleaning-workshop/slides.html
Data Skills for Reproducible Science: Data Wrangling
Claudia A Engel: Data Wrangling with R
R for Data Science: Part II Wrangle, especially Chapter 12 Tidy Data and Chapter 13 Relational Data
datacarpentry.org Data Wrangling with dplyr and tidyr
RStudio: data wrangling cheatsheet (PDF)