- Database queries: SQL
10.14 Context
SQL, you can pronouce it S-Q-L or “sequel”, is a database system. It is old, in internet terms, but not going anywhere, and core to many data science jobs.
10.15 Key information
There are different flavours, such as MySQL which is used by WordPress websites, or PostgreSQL / postgres.
The common core of SQL is that that it is a ‘relational’ database management systems. It is used for storing information and saying how it is connected to other bits of information. Compare this to the spreadsheet/CSV, which stores information in a list.
Note that SQL is designed for databases with potentially billions of rows of data. It demands working habits where you can’t “look” at the data all at once, as you would with a spreadsheet.
SQL has it’s own terminology but to start off we will think about two things
- Tables: these are rows and columns of data.
- Queries: these are commands we use to select, pull out and/or combine tables
Tables have ‘fields’ (the variable names which define what information is stored, you may be used to thinking of these as the columns of your data)
Tables also have ‘records’ (the stored information, which you be used to thinking of as the rows of your data)
10.16 Exercises
In class, I will illustrate the key concepts of SQL with an example from my own research
- Stafford, T. (2018). Female chess players outperform expectations when playing men. Psychological Science, 29(3), 429-436. (Preprint).
Then we will look at this interactive turorial:
Specifically, we will do
- SELECT basics https://sqlzoo.net/wiki/SELECT_basics
- JOIN https://sqlzoo.net/wiki/The_JOIN_operation
10.17 Checklist
By the end of the class, you should
- Know what SQL is and why it is important for data science
- Understand key concepts: tables, queries, fields, rows
- Recognise and solve problems involving SQL commands such as SELECT, FROM, WHERE, wildcard (%), JOIN, ON,
10.18 Resources
- Select Star SQL - interactive tutorial
- w3schools.com JOIN
- Interactive tutorial sqlzoo.net/wiki/SELECT_basics
- SQL Database access dplyr
- SQLite is great for R and Shiny. The dbmisc package may help a bit.
- example SQL data from Kaggle: SQLLite database on Mental Health in the Tech Industry
- askedith.ai - natural language interface to SQL