Package and Data Loading
As this is a largely introductory workshop, we will only be using packages from the tidyverse. These are core packages used across data science, statistics and other applied uses of R, and are therefore important for data manipulation and cleaning. As we will be using a data set which contains large numbers, set scipen to 999 using the options() function so that these numbers are printed in full rather than in scientific notation.
library(tidyverse)
options(scipen = 999)
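To see what the scipen option changes, here is a minimal sketch using a made-up number (it is not taken from the workshop data). It compares how R formats the same value under the default setting and under scipen = 999, then restores the previous setting.

```r
# Illustration only: an invented large number, not a value from WBD_1999.
big_number <- 6000000000

old <- options(scipen = 0)            # R's default penalty for scientific notation
default_text <- format(big_number)    # "6e+09"

options(scipen = 999)                 # strongly prefer fixed notation
fixed_text <- format(big_number)      # "6000000000"

options(old)                          # restore the setting that was active before

default_text
fixed_text
```

In the workshop itself you only need the single options(scipen = 999) line shown above; the save-and-restore step is just to keep this demonstration tidy.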
To learn how to apply the techniques covered in today’s practical, we will be using data from World Bank Open Data. In particular, we will be using a collection of variables from 1999, selected to give us plenty of room to explore!
The data can be downloaded as a .zip file here, from GitHub, or from the .zip file provided with the calendar invite. It can then be loaded using the read_csv() function with the following code. Note: the file location, in this case “data/WBD_1999.csv”, should reflect where the data is stored on your computer.
WBD_1999 <- read_csv("data/WBD_1999.csv")
This data set contains 26 variables, including:
Section 1: Viewing and Summarizing Data
Viewing and summarizing data are important first steps in understanding the data you wish to use in any project. There are several functions you can use to view the data you are working with, including:
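A few commonly used options are sketched below on a small invented tibble. The tibble toy and its column names are hypothetical, and these may not be exactly the functions the workshop has in mind; they are simply standard ways of inspecting a data set in R.

```r
library(tidyverse)

# Hypothetical miniature data set; names and values are invented for illustration.
toy <- tibble(
  country   = c("A", "B", "C", "D"),
  continent = c("Africa", "Asia", "Asia", "Europe"),
  pop       = c(1200000, 35000000, 8000000, 500000)
)

head(toy, 2)    # the first two rows
glimpse(toy)    # one line per column: its type and first few values
str(toy)        # base R view of the object's structure
dim(toy)        # number of rows, then number of columns
names(toy)      # column names
# View(toy)     # spreadsheet-style viewer in RStudio (interactive sessions only)
```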
Exercise 1: Using one of the techniques mentioned, or one which you know of, view the data set
Viewing your data set is useful for understanding the type of data you are working with. Summarizing it with the summary() function goes further: for quantitative variables you get the minimum, maximum, mean, median, quartiles and number of missing values, and for factor variables you get the frequency of each category.
Exercise 2: Use the summary() function to generate a summary of the data set.
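As a rough illustration of what summary() reports, the sketch below uses an invented tibble with one factor column and one numeric column, each containing a missing value. The names toy, continent and lit_rate are hypothetical.

```r
library(tidyverse)

# Invented values; one NA in each column so the missing-value counts show up.
toy <- tibble(
  continent = factor(c("Africa", "Asia", "Asia", NA)),
  lit_rate  = c(55.2, 91.0, NA, 78.4)
)

summary(toy)                              # every column at once

num_summary <- summary(toy$lit_rate)      # Min, quartiles, Median, Mean, Max, NA's
fct_summary <- summary(toy$continent)     # count per level, plus NA's

num_summary
fct_summary
```

Note that a character column is only summarized by its length and class, which is one reason to convert categorical variables to factors before summarizing.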
Section 2: Data Manipulation 1: Data Types
For the purpose of today’s session, we will look at the conversion of variables into factor & numeric variables.
For this you will need:
- as.numeric()
- as.factor()
Exercise 3: Start by converting the Continent variable from a character variable to a factor variable, using the as.factor() function.
This groups together observations that share the same continent. A limitation of converting variables to factors is that it can result in a loss of information, especially if you round numeric values to reduce the number of factor levels created. Statistical analyses can also be affected by using categorical rather than numeric variables.
Exercise 4: Convert the lit.rate.per variable from numeric to integer using the as.integer() function, then convert it to a factor variable using the as.factor() function. Make sure to save the result to a new variable, lit.rate.fact.
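The grouping behaviour can be seen on a small invented example. The tibble toy below is hypothetical; the same pattern should carry over to the Continent column of the workshop data.

```r
library(tidyverse)

# Invented values for illustration.
toy <- tibble(continent = c("Asia", "Africa", "Asia", "Europe", "Africa"))

class(toy$continent)                        # "character" before conversion

toy$continent <- as.factor(toy$continent)   # convert in place

class(toy$continent)                        # "factor" after conversion
levels(toy$continent)                       # the distinct groups, sorted alphabetically
table(toy$continent)                        # number of observations in each group
```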
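The sketch below walks through the same two-step conversion on a short invented vector (lit_rate is a hypothetical stand-in for lit.rate.per). It also shows two details worth knowing: as.integer() drops the decimal part rather than rounding, and calling as.numeric() directly on a factor returns the internal level codes, not the original values.

```r
# Invented values for illustration.
lit_rate <- c(55.9, 91.2, 91.7, 78.4)

lit_int  <- as.integer(lit_rate)   # decimals are dropped: 55 91 91 78
lit_fact <- as.factor(lit_int)     # one level per distinct integer
nlevels(lit_fact)                  # 3 levels: "55", "78", "91"

# Going back from factor to numeric:
codes  <- as.numeric(lit_fact)                 # level codes: 1 3 3 2
values <- as.numeric(as.character(lit_fact))   # original integers: 55 91 91 78
```

If rounding to the nearest whole number is what you want, round() before as.integer() is one option.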
Section 3: Missing Data
Exercise 5: Count the number of missing observations for the variables Continent, lit.rate.per & ed.years
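One possible approach, sketched on an invented tibble (toy and its column names are hypothetical), is to combine is.na() with sum() for a single variable, or with colSums() to count missing values in every column at once.

```r
library(tidyverse)

# Invented values for illustration.
toy <- tibble(
  continent = c("Asia", NA, "Europe", "Africa"),
  lit_rate  = c(NA, 91.0, NA, 78.4),
  ed_years  = c(9, 12, NA, 6)
)

n_missing_lit <- sum(is.na(toy$lit_rate))   # missing values in one variable
n_missing_all <- colSums(is.na(toy))        # missing values per column

n_missing_lit
n_missing_all

# A tidyverse alternative that returns a one-row tibble of counts:
toy %>% summarise(across(everything(), ~ sum(is.na(.x))))
```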
Exercise 6: Count the number of missing values in Brazil (Row 28), Togo (Row 240) & the World (Row 261), across the entire data set
Exercise 7: After copying the data set to a new object (WBD_1999.new, for example), remove all the missing cases from lit.rate.per, then from ed.years. How many observations remain?
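Counting missing values by row follows the same idea as counting by column. The sketch below uses an invented tibble (toy is hypothetical) and selects rows by their position, which is how the row numbers in the exercise could be used.

```r
library(tidyverse)

# Invented values for illustration.
toy <- tibble(
  continent = c("Asia", NA, "Europe", "Africa"),
  lit_rate  = c(NA, 91.0, NA, 78.4),
  ed_years  = c(9, 12, NA, 6)
)

missing_row3    <- sum(is.na(toy[3, ]))      # one row, selected by position
missing_per_row <- rowSums(is.na(toy))       # every row at once

missing_row3
missing_per_row
rowSums(is.na(toy[c(2, 3), ]))               # several chosen rows together
```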
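A minimal sketch of removing missing cases one variable at a time, on an invented tibble (toy, toy_new and the column names are hypothetical). Two equivalent tools are shown: filter() with !is.na() from dplyr, and drop_na() from tidyr.

```r
library(tidyverse)

# Invented values for illustration.
toy <- tibble(
  country  = c("A", "B", "C", "D"),
  lit_rate = c(NA, 91.0, 60.3, 78.4),
  ed_years = c(9, 12, NA, 6)
)

toy_new <- toy                                     # work on a copy, keep the original

toy_new <- toy_new %>% filter(!is.na(lit_rate))    # drop rows missing lit_rate
rows_after_lit <- nrow(toy_new)

toy_new <- toy_new %>% drop_na(ed_years)           # then drop rows missing ed_years
rows_after_ed <- nrow(toy_new)

rows_after_lit
rows_after_ed
```

Checking nrow() after each step shows how many observations each variable costs you.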
Section 4: Selecting Variables
Exercise 8: Create a new data set which contains only the variables Country Name (2), Continent (4), Pop (6), ed.years (15), labour (16), lit.rate.per (17) and co2 (18)
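Columns can be selected either by name or by position with select(). The sketch below uses an invented tibble whose layout is hypothetical; note that a column name containing a space, such as Country Name, has to be wrapped in backticks.

```r
library(tidyverse)

# Invented values; the column layout is made up for illustration.
toy <- tibble(
  id             = 1:3,
  `Country Name` = c("A", "B", "C"),
  code           = c("AA", "BB", "CC"),
  continent      = c("Asia", "Africa", "Europe"),
  pop            = c(1200000, 35000000, 500000)
)

by_name     <- toy %>% select(`Country Name`, continent, pop)   # select by name
by_position <- toy %>% select(2, 4, 5)                          # select by column number

names(by_name)
names(by_position)
```

Selecting by name tends to be more robust than selecting by position, since it keeps working if the column order changes.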
Exercise 9: Using the data set produced in Exercise 8, create the following data sets:
Section 5: Transforming and Mutating Data
Exercise 10: Using the data set generated in Exercise 8, create the new variable co2_pp, calculating the CO2 emissions per person
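New variables computed from existing ones can be added with mutate(). The sketch below uses an invented tibble (toy and its values are hypothetical) and assumes that co2 and pop are in units where a simple division gives emissions per person; check the units in the real data before interpreting the result.

```r
library(tidyverse)

# Invented values for illustration; units are assumed, not taken from the data.
toy <- tibble(
  country = c("A", "B"),
  pop     = c(1000000, 250000),
  co2     = c(5000000, 500000)
)

toy <- toy %>% mutate(co2_pp = co2 / pop)   # add the per-person column

toy$co2_pp   # 5 2
```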
Section 6: Bringing everything together
Using the skills covered in today’s session, attempt to complete the tasks below!