Beginner Practical Worksheet

Package and Data Loading

As this is a largely introductory workshop, we will only be using packages that fall under the tidyverse umbrella. These are core packages used across data science, statistics and applied uses of R, and are therefore important for data manipulation and cleaning in R. As we will be using a data set which contains large numbers, make sure to set scipen to 999 using the options() function, so that these numbers are printed in full rather than in scientific notation.

  library(tidyverse)
  
  options(scipen = 999)
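
To see what this option changes, the short sketch below formats a made-up large number (not a value from the World Bank data) under R's default setting and again with scipen set to 999. The variable names are purely illustrative.

```r
# A made-up large value, roughly the size of a national GDP figure
big_number <- 125000000000

# Default setting: R switches to scientific notation for wide numbers
options(scipen = 0)
default_print <- format(big_number)

# A large scipen penalty makes R prefer fixed (full) notation
options(scipen = 999)
fixed_print <- format(big_number)

default_print
fixed_print
```

The option only affects how numbers are displayed; the stored values are unchanged.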

Furthermore, to learn how to apply the techniques covered in today’s practical, we will be using data from World Bank Open Data. In particular, we will be using a collection of variables from 1999, selected to give us plenty of room to explore!

This data is available as a downloadable .zip file, from Github, or in the .zip file provided with the calendar invite. It can then be loaded with the read_csv() function, using the following code. Note: the file location, in this case “data/WBD_1999.csv”, should reflect where the data is stored on your computer.

WBD_1999 <- read_csv("data/WBD_1999.csv")

This data set contains 26 variables, including:

Row Numbers (...1)
Country Name (Country Name)
Country Code (Country Code)
Continent (Continent)
Year (Year)
Population (Pop)
Female Population (Pop.fe)
Male Population (Pop.ma)
Birth Rate, crude per 1000 people (birthrate)
Death Rate, crude per 1000 people (deathrate)
Life Expectancy at Birth in years (lifeexp)
Female Life Expectancy at Birth in years (lifeexp.fe)
Male Life Expectancy at Birth in years (lifeexp.ma)
Educational Spending, percentage of GDP (ed.spend)
Compulsory Education Duration in Years (ed.years)
Labour Force Total (labour)
Literacy Rate in adults, percentage (lit.rate.per)
CO2 Emissions, kt (co2)
Gross Domestic Product, $ (gdp)
Unemployment, percentage of total labour force (unemp)
Female Unemployment, percentage of total labour force (unemp.fe)
Male Unemployment, percentage of total labour force (unemp.ma)
Health Expenditure per capita, $ (health.exp)
Hospital Beds per 1000 people (medbeds)
Number of Surgical Procedures per 1000 people (surg.pro)
Number of Nurses & Midwives per 1000 people (nurse.midwi)

Section 1: Viewing and Summarizing Data

Viewing and summarizing data are important first steps in understanding the data you wish to use in any project. There are several functions you can use to view the data you are working with, including:

view() - view the entire data set in an RStudio tab
head() / tail() - view the first or last six observations from your data set in the Console Tab
names() / colnames() - view the column names of the data set in question
str() - view the structure of the data set (including: dimensions, variable types and details)
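
As a rough illustration of what these functions return, the sketch below runs them on a small made-up data frame called toy, which stands in for the real data set (its columns and values are invented). view() is left out because it opens an interactive tab rather than printing to the Console.

```r
# A small made-up data frame standing in for WBD_1999
toy <- data.frame(
  country   = c("A", "B", "C", "D", "E", "F", "G", "H"),
  continent = c("Europe", "Asia", "Asia", "Africa",
                "Europe", "Africa", "Asia", "Europe"),
  pop       = c(5.1e6, 8.2e7, 1.3e9, 2.4e7, 6.0e7, 1.1e7, 9.9e6, 3.8e7)
)

head(toy)      # first six rows
tail(toy, 3)   # last three rows (the second argument sets the row count)
names(toy)     # column names
str(toy)       # dimensions, column types and a preview of each column
```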

Exercise 1: Using one of the techniques mentioned, or one which you know of, view the data set

## Insert the data set between the brackets

head()
tail()

view()

names()
colnames()

str()

Although viewing your data set is useful for understanding the type of data you are using, summarizing it with the summary() function gives you the minimum, maximum, mean, median and quartiles, as well as the number of missing values, for each quantitative variable, and the frequency of each group for categorical (factor) variables.

Exercise 2: Use the summary() function, to generate a summary of the data set.

summary() ## Insert the data set
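
The sketch below shows the kind of output summary() gives for a factor column and for a numeric column with a missing value. The toy data frame and its values are invented for illustration; note that a character column would only report its length and class, which is why continent is stored as a factor here.

```r
# Made-up data: one factor column and one numeric column with a missing value
toy <- data.frame(
  continent = as.factor(c("Europe", "Asia", "Asia", "Africa", "Europe")),
  lifeexp   = c(78.2, 65.4, NA, 52.1, 80.0)
)

# Whole data frame: counts per group for the factor, quartiles for the numeric
summary(toy)

# A single numeric column returns named values that can be reused
lifeexp_summary <- summary(toy$lifeexp)
lifeexp_summary[["Median"]]
lifeexp_summary[["NA's"]]
```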

Section 2: Data Manipulation 1: Data Types

For the purpose of today’s session, we will look at the conversion of variables into factor & numeric variables.

For this you will need:

- as.numeric()
- as.factor()

Exercise 3: Start by converting the Continent variable from a character to a factor variable, using the as.factor() function

## Replace the XXX with the variable to convert to a factor
WBD_1999$XXX <- as.factor(WBD_1999$XXX)

This groups together observations that share the same continent. A limitation of converting variables to factors is that it can result in a loss of information, especially if you round a variable to reduce the number of factor levels created. Statistical analyses can also be affected by using categorical rather than numeric variables.
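
To make the grouping concrete, here is a minimal sketch using a made-up character vector (not the real Continent column). It shows the levels created by as.factor() and how many observations fall into each.

```r
# Made-up continent labels stored as plain text
continent_chr <- c("Europe", "Asia", "Asia", "Africa", "Europe", "Asia")

continent_fct <- as.factor(continent_chr)

levels(continent_fct)   # the distinct groups, sorted alphabetically
table(continent_fct)    # number of observations in each group
```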

Exercise 4: Convert the lit.rate.per variable from numeric to integer using the as.integer() function, then convert it to a factor variable using the as.factor() function. Make sure to save the result to a new variable, lit.rate.fact.
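
Before trying this on the real data, it may help to see the same two steps on a made-up vector. The sketch below uses invented values and illustrative names; it also shows why as.numeric() applied directly to a factor does not give back the original numbers.

```r
# Made-up literacy rates, including a missing value
lit_rate <- c(99.7, 54.2, 54.9, 87.5, NA)

# as.integer() drops the decimal part: 54.9 becomes 54, not 55
lit_rate_int <- as.integer(lit_rate)
lit_rate_fct <- as.factor(lit_rate_int)

levels(lit_rate_fct)    # one level per distinct integer value
nlevels(lit_rate_fct)

# as.numeric() on a factor returns the internal level codes,
# so go through as.character() to recover the stored values
codes  <- as.numeric(lit_rate_fct)
values <- as.numeric(as.character(lit_rate_fct))
codes
values
```

Because 54.2 and 54.9 both become 54, the two observations end up in the same level, which is the loss of information described above.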

Section 3: Missing Data

Exercise 5: Count the number of missing observations for the variables Continent, lit.rate.per & ed.years

## Replace XXX with the variable of your choice
sum(is.na(XXX))
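
As a small worked example, the sketch below counts missing values in a made-up data frame (toy, with invented values). It also shows colSums(), which is not required for the exercise but counts every column in one call.

```r
# Made-up data with some missing values
toy <- data.frame(
  continent = c("Europe", NA, "Asia", "Africa"),
  lit.rate  = c(99.1, NA, NA, 61.3),
  ed.years  = c(10, 9, NA, 8)
)

# is.na() marks each missing value as TRUE; sum() counts the TRUEs
missing_lit <- sum(is.na(toy$lit.rate))
missing_lit

# Missing values in every column at once
missing_all <- colSums(is.na(toy))
missing_all
```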

Exercise 6: Count the number of missing values for Brazil (Row 28), Togo (Row 240) and the World (Row 261), across the entire data set

## Replace XXX with the data set, and specify the row and column as required 
## Dataset[row,column]
rowSums(is.na(XXX[,]))
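
The same idea on a made-up data frame is sketched below: leaving the column position empty after the comma keeps all columns, so rowSums() counts the missing values across each selected row. The row numbers here refer to the toy data only.

```r
# Made-up data with some missing values
toy <- data.frame(
  country  = c("A", "B", "C", "D"),
  lit.rate = c(99.1, NA, NA, 61.3),
  ed.years = c(10, 9, NA, 8)
)

# One row, all columns
missing_row2 <- rowSums(is.na(toy[2, ]))
missing_row2

# Several rows at once
missing_rows <- rowSums(is.na(toy[c(2, 3), ]))
missing_rows
```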

Exercise 7: After copying the data set to a new variable name (WBD_1999.new, for example), remove all the missing cases from lit.rate.per, then from ed.years. How many observations remain?

WBD_1999.new <- WBD_1999

## Remember to use complete.cases() to remove NA cases
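
A minimal sketch of this two-step removal on made-up data is shown below. complete.cases() returns TRUE for each non-missing value, and that result is used to keep only the matching rows; the data frame and object names are illustrative.

```r
# Made-up data with missing values in two columns
toy <- data.frame(
  country  = c("A", "B", "C", "D", "E"),
  lit.rate = c(99.1, NA, NA, 61.3, 75.0),
  ed.years = c(10, 9, NA, 8, NA)
)

# Step 1: keep rows where lit.rate is present
toy_step1 <- toy[complete.cases(toy$lit.rate), ]

# Step 2: from those, keep rows where ed.years is also present
toy_step2 <- toy_step1[complete.cases(toy_step1$ed.years), ]

nrow(toy_step1)
nrow(toy_step2)
```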

Section 4: Selecting Variables

Exercise 8: Create a new data set which contains only the variables Country Name (2), Continent (4), Pop (6), ed.years (15), labour (16), lit.rate.per (17) and co2 (18)
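
One possible approach is dplyr's select() function (dplyr is loaded as part of the tidyverse). The sketch below uses a small made-up tibble and shows selection by name and by column position; names containing a space, such as Country Name, need backticks.

```r
library(dplyr)

# Made-up data with a subset of column names similar to the worksheet's
toy <- tibble(
  `Country Name` = c("A", "B", "C"),
  `Country Code` = c("AAA", "BBB", "CCC"),
  Continent      = c("Europe", "Asia", "Africa"),
  Pop            = c(5.1e6, 8.2e7, 2.4e7)
)

# Select by name (backticks around names that contain spaces)
by_name <- select(toy, `Country Name`, Continent, Pop)

# Select the same columns by position
by_position <- select(toy, 1, 3, 4)

names(by_name)
names(by_position)
```

Selecting by name is usually easier to read back later; positions are shorter but break if the column order changes.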

Exercise 9: Using the data set produced in Exercise 8, create the following data sets:

Containing all countries in Europe
Containing countries with a literacy rate of over 75%
Containing countries with between 6 and 10 years of compulsory education
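
Row subsets like these can be built with dplyr's filter() function. The sketch below applies the three kinds of condition to a made-up tibble (invented values, illustrative object names); adapt the data set and variable names for the exercise.

```r
library(dplyr)

# Made-up data standing in for the Exercise 8 data set
toy <- tibble(
  country      = c("A", "B", "C", "D", "E"),
  Continent    = c("Europe", "Asia", "Europe", "Africa", "Asia"),
  lit.rate.per = c(99.1, 54.0, NA, 76.5, 88.2),
  ed.years     = c(10, 5, 9, 6, 12)
)

# Match a category exactly
europe <- filter(toy, Continent == "Europe")

# Numeric threshold; rows where the value is NA are dropped
literate <- filter(toy, lit.rate.per > 75)

# A range: conditions separated by commas must all be true
mid_ed <- filter(toy, ed.years >= 6, ed.years <= 10)

nrow(europe)
nrow(literate)
nrow(mid_ed)
```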

Section 5: Transforming and Mutating Data

Exercise 10: Using the data set generated in Exercise 8, create the new variable co2_pp, calculating the CO2 emissions per person

## Replace XXX with the data set, YYY with the new variable name, and ZZ with the variables

WBD_1999.new2 <- mutate(XXX, 
                        YYY = ZZ/ZZ)
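
Here is a sketch of the same template filled in for a made-up tibble. Since co2 is recorded in kilotonnes, dividing by population gives kilotonnes per person; the second new column, which is an optional extra and not part of the exercise, converts this to tonnes per person. All values and the names ending in _kt and _tonnes are invented for illustration.

```r
library(dplyr)

# Made-up population and CO2 figures
toy <- tibble(
  country = c("A", "B"),
  Pop     = c(1000000, 50000000),
  co2     = c(5000, 100000)      # kilotonnes, as in the worksheet's co2 variable
)

toy_pp <- mutate(toy,
                 co2_pp_kt     = co2 / Pop,          # kilotonnes per person
                 co2_pp_tonnes = co2 * 1000 / Pop)   # tonnes per person

toy_pp
```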

Section 6: Bringing everything together

Using the skills covered in today’s session, attempt to complete the tasks below!

Generate a limited data set containing (with only the variables required):
- Countries from Africa, South America and Asia,
- With a Categorical Life Expectancy above 60 years and Literacy rate over 60%

Generate a limited data set containing (with only the variables required):
- Observations classified as “Aggregated Nations”
- With a categorical unemployment level above 10%
- And additional variables calculating the proportion of Males & Females in the Population

Generate a limited data set containing (with only the variables required):
- Countries from North America, Oceania, or Europe
- Calculate an estimate for the number of unemployed, both male & female
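
As a starting point, the sketch below chains the three steps (filter, select, mutate) on a made-up tibble. It uses %in% to match several continents at once and turns a percentage into a count by multiplying by the labour force and dividing by 100. The values, the object names and the choice of the overall unemployment rate are assumptions for illustration, not the intended solution.

```r
library(dplyr)

# Made-up data; unemp is a percentage of the total labour force
toy <- tibble(
  country   = c("A", "B", "C", "D"),
  Continent = c("Europe", "Asia", "Oceania", "North America"),
  labour    = c(2000000, 50000000, 1000000, 8000000),
  unemp     = c(10, 4, 5, 7.5)
)

wanted <- c("North America", "Oceania", "Europe")

# Step 1: keep rows whose continent is in the wanted list
step1 <- filter(toy, Continent %in% wanted)

# Step 2: keep only the variables needed
step2 <- select(step1, country, Continent, labour, unemp)

# Step 3: convert the percentage into an estimated head count
step3 <- mutate(step2, unemployed = labour * unemp / 100)

step3
```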