Chapter 5 Data Wrangling
This class covers loading, cleaning, inspecting, merging and summarising data using the tidyverse, a collection of R packages for data science. Most of the work in a typical data science project goes into getting the data into the best form for plotting or analysis, a process known as data cleaning, data munging or data wrangling.
5.1 Class material
There are no slides for this class; it is taught entirely in workshop format. A core skill you’ll be practicing is searching for help on the task you want to complete. If you feel stuck, search for “tidyverse” plus a description of the data wrangling task you’re trying to do.
5.2 Exercises
Find the folder /datawrangle in the Google Drive
- review the project organisation and files
- open the file datawrangle_exercises.R
- complete the tasks described
5.3 Checklist
You should leave being familiar with these concepts, and know where to look up how to implement them. Indented bullets are more advanced topics (not always covered in the exercises).
- Load data from CSV
  - load data from an Excel file
  - load data from a TSV file (tab separated values)
- Recognise common tidyverse functions (below)
- Use the pipe operator %>%
- Mutate: new columns by combining old ones
  - Use string functions on column values (e.g. separate on characters, take substrings)
- Select: select columns
- Filter: select rows by values
- group_by and summarise
  - Understand this is a specific example of a general method of split-apply-combine
- rename: to change column names
- convert variable types, e.g. using as.numeric
- join: to merge data frames, requires a common key between data frames
  - understand inner joins, left and right joins
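As a rough illustration of how the checklist items fit together, here is a minimal sketch in R. The data, file names and column names (scores, groups, participant, total and so on) are all invented for this example and are not taken from the class exercises. It loads dplyr, readr and stringr individually; library(tidyverse) would load all three. For Excel files, readxl::read_excel() plays the same role as read_csv() does here.

```r
library(dplyr)    # data verbs; also provides the %>% pipe
library(readr)    # read_csv(), read_tsv()
library(stringr)  # string helpers such as str_sub()

# Hypothetical example data, written to temporary files so the sketch is self-contained
scores_path <- tempfile(fileext = ".csv")
writeLines(c("id,score_a,score_b",
             "p01,10,4",
             "p02,7,9",
             "p03,3,5"), scores_path)

groups_path <- tempfile(fileext = ".tsv")
writeLines(c("participant\tgroup",
             "p01\tcontrol",
             "p02\ttreatment",
             "p04\ttreatment"), groups_path)

# Load: every column is read as text here so the type conversion step is visible
scores <- read_csv(scores_path, col_types = cols(.default = col_character()))
groups <- read_tsv(groups_path, col_types = cols(.default = col_character()))

wrangled <- scores %>%
  rename(participant = id) %>%                          # rename: new_name = old_name
  mutate(
    score_a = as.numeric(score_a),                      # convert variable types
    score_b = as.numeric(score_b),
    total = score_a + score_b,                          # new column from old ones
    id_number = as.numeric(str_sub(participant, 2, 3))  # substring of a column value
  ) %>%
  select(participant, id_number, total)                 # select: keep these columns

# filter: keep rows by value
high_scorers <- wrangled %>% filter(total >= 10)

# join: "participant" is the common key between the two data frames
joined_left  <- left_join(wrangled, groups, by = "participant")   # keeps every row of wrangled
joined_inner <- inner_join(wrangled, groups, by = "participant")  # keeps only matching rows

# group_by + summarise: split by group, apply mean() and n(), combine into one table
by_group <- joined_inner %>%
  group_by(group) %>%
  summarise(mean_total = mean(total), n = n())
```

In this toy data p03 has no entry in groups and p04 has no entry in scores, so the left join keeps p03 with a missing group while the inner join drops it. Printing joined_left and joined_inner side by side is a quick way to see the difference.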
Almost any data wrangling task you can imagine can be done; you just need to find the right function or functions. So the final item on this class’s checklist is
- practice searching for solutions to your data wrangling problems
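Alongside web searches, R itself can help you find a function when you can guess part of its name. The following is a small sketch, assuming dplyr is installed; the search pattern is only an example.

```r
library(dplyr)

# List every function in the attached packages whose name ends in "_join"
join_functions <- apropos("_join$")
print(join_functions)

# Interactive follow-ups (commented out because they open the help viewer):
# ?left_join
# help.search("join", package = "dplyr")
```

Once you have a candidate function name, its help page and examples are usually the quickest way to check whether it does what you need.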
5.4 Resources
- Slides on Data Cleaning https://cghlewis.github.io/ncme-data-cleaning-workshop/slides.html
- Data Skills for Reproducible Science: Data Wrangling
- Claudia A Engel: Data Wrangling with R
- R for Data Science: Part II Wrangle, especially Chapter 12 Tidy Data and Chapter 13 Relational Data
- datacarpentry.org Data Wrangling with dplyr and tidyr
- RStudio: data wrangling cheatsheet (PDF)