Package and Data Loading
As this is a largely introductory workshop, we will only be using packages from the tidyverse. These are core packages used across data science, statistics and applied work in R, and are therefore important for data manipulation and cleaning. As the data set we will be using contains some very large numbers, make sure to set scipen to 999 using the options() function.
library(tidyverse)
options(scipen = 999)
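With scipen set to 999, large values (such as the GDP figures in this data set) print in full rather than in scientific notation. As a quick illustration:
1e15   # with scipen = 999 this prints as 1000000000000000 rather than 1e+15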
To learn how to apply the techniques covered in today’s practical, we will be using data from the World Bank Open Data. In particular, we will be using a collection of variables from 1999, selected to give us plenty of room to explore!
This data is accessible through the downloadable .zip file linked here, through GitHub, or from the .zip file provided with the calendar invite. It can then be loaded using the read_csv() function, as shown below. Note: the file path, in this case "../data/WBD_1999.csv", should reflect where the data is stored on your computer.
WBD_1999 <- read_csv("../data/WBD_1999.csv")
This data set contains 26 variables covering population, health, education and economic indicators for each country.
Section 1: Viewing and Summarizing Data
Viewing and summarizing data are important first steps in understanding the data you wish to use in any project. There are several different functions you can use to get a first look at the data you are working with, including head(), tail(), view(), names() and str().
Exercise 1: Using one of the techniques mentioned, or another that you know of, view the data set.
#head(WBD_1999)   # first six rows
#tail(WBD_1999)   # last six rows
#view(WBD_1999)   # open the full data set in the viewer
names(WBD_1999)   # variable (column) names
## [1] "...1" "Country Name" "Country Code" "Continent" "year"
## [6] "Pop" "Pop.fe" "Pop.ma" "birthrate" "deathrate"
## [11] "lifeexp" "lifeexp.fe" "lifeexp.ma" "ed.spend" "ed.years"
## [16] "labour" "lit.rate.per" "co2" "gdp" "unemp"
## [21] "unemp.fe" "unemp.ma" "health.exp" "medbeds" "surg.pro"
## [26] "nurse.midwi"
str(WBD_1999)
## spec_tbl_df [264 × 26] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ...1 : num [1:264] 1 2 3 4 5 6 7 8 9 10 ...
## $ Country Name: chr [1:264] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## $ Country Code: chr [1:264] "AFG" "ALB" "DZA" "ASM" ...
## $ Continent : chr [1:264] "Asia" "Europe" "Africa" "Oceania" ...
## $ year : num [1:264] 1999 1999 1999 1999 1999 ...
## $ Pop : num [1:264] 20170844 3108778 30623406 57053 64370 ...
## $ Pop.fe : num [1:264] 9784257 1543050 15103751 NA NA ...
## $ Pop.ma : num [1:264] 10386587 1565728 15519655 NA NA ...
## $ birthrate : num [1:264] 48.4 17.3 20.2 NA 12.6 ...
## $ deathrate : num [1:264] 12.04 5.98 5.05 NA NA ...
## $ lifeexp : num [1:264] 55.4 73.6 70.2 NA NA ...
## $ lifeexp.fe : num [1:264] 56.6 76.6 71.6 NA NA ...
## $ lifeexp.ma : num [1:264] 54.2 70.8 68.8 NA NA ...
## $ ed.spend : num [1:264] NA 3.59 NA NA NA ...
## $ ed.years : num [1:264] 6 8 10 NA 10 4 11 9 10 10 ...
## $ labour : num [1:264] 4906827 1330854 8670493 NA NA ...
## $ lit.rate.per: num [1:264] NA NA NA NA NA ...
## $ co2 : num [1:264] 821 2985 92045 NA 513 ...
## $ gdp : num [1:264] NA 3212119044 48639108361 NA 1239876305 ...
## $ unemp : num [1:264] NA 18.4 NA NA NA ...
## $ unemp.fe : num [1:264] NA 21.4 NA NA NA ...
## $ unemp.ma : num [1:264] NA 16.4 NA NA NA ...
## $ health.exp : logi [1:264] NA NA NA NA NA NA ...
## $ medbeds : num [1:264] NA 3.03 NA NA 3.2 ...
## $ surg.pro : logi [1:264] NA NA NA NA NA NA ...
## $ nurse.midwi : num [1:264] NA 4.08 NA NA 3.17 ...
## - attr(*, "spec")=
## .. cols(
## .. ...1 = col_double(),
## .. `Country Name` = col_character(),
## .. `Country Code` = col_character(),
## .. Continent = col_character(),
## .. year = col_double(),
## .. Pop = col_double(),
## .. Pop.fe = col_double(),
## .. Pop.ma = col_double(),
## .. birthrate = col_double(),
## .. deathrate = col_double(),
## .. lifeexp = col_double(),
## .. lifeexp.fe = col_double(),
## .. lifeexp.ma = col_double(),
## .. ed.spend = col_double(),
## .. ed.years = col_double(),
## .. labour = col_double(),
## .. lit.rate.per = col_double(),
## .. co2 = col_double(),
## .. gdp = col_double(),
## .. unemp = col_double(),
## .. unemp.fe = col_double(),
## .. unemp.ma = col_double(),
## .. health.exp = col_logical(),
## .. medbeds = col_double(),
## .. surg.pro = col_logical(),
## .. nurse.midwi = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Although viewing your data set is useful for understanding the type of data you are working with, summarizing it with the summary() function gives you the minimum, maximum, mean, median, quartiles and number of missing values for each quantitative variable, as well as frequency counts for factor variables.
Exercise 2: Use the summary() function to generate a summary of the data set.
summary(WBD_1999)
## ...1 Country Name Country Code Continent
## Min. : 1.00 Length:264 Length:264 Length:264
## 1st Qu.: 66.75 Class :character Class :character Class :character
## Median :132.50 Mode :character Mode :character Mode :character
## Mean :132.50
## 3rd Qu.:198.25
## Max. :264.00
##
## year Pop Pop.fe Pop.ma
## Min. :1999 Min. : 9346 Min. : 39545 Min. : 35129
## 1st Qu.:1999 1st Qu.: 1188092 1st Qu.: 1291039 1st Qu.: 1301637
## Median :1999 Median : 7992324 Median : 5165368 Median : 4860090
## Mean :1999 Mean : 240195206 Mean : 130589623 Mean : 132513783
## 3rd Qu.:1999 3rd Qu.: 46398877 3rd Qu.: 31518470 3rd Qu.: 30712597
## Max. :1999 Max. :6034566869 Max. :2995108533 Max. :3036816014
## NA's :1 NA's :24 NA's :24
## birthrate deathrate lifeexp lifeexp.fe
## Min. : 7.80 Min. : 1.924 Min. :38.63 Min. :39.16
## 1st Qu.:14.05 1st Qu.: 6.415 1st Qu.:59.66 1st Qu.:60.86
## Median :22.71 Median : 8.631 Median :69.19 Median :72.19
## Mean :24.64 Mean : 9.227 Mean :66.43 Mean :68.92
## 3rd Qu.:32.77 3rd Qu.:11.499 3rd Qu.:73.62 3rd Qu.:76.63
## Max. :53.82 Max. :24.366 Max. :80.57 Max. :83.90
## NA's :16 NA's :17 NA's :20 NA's :20
## lifeexp.ma ed.spend ed.years labour
## Min. :38.10 Min. : 1.195 Min. : 0.000 Min. : 35359
## 1st Qu.:57.80 1st Qu.: 3.299 1st Qu.: 7.000 1st Qu.: 1200604
## Median :66.29 Median : 4.152 Median : 9.000 Median : 4367000
## Mean :64.05 Mean : 4.419 Mean : 8.646 Mean : 120717076
## 3rd Qu.:71.06 3rd Qu.: 5.148 3rd Qu.:10.000 3rd Qu.: 27488669
## Max. :77.70 Max. :14.539 Max. :16.000 Max. :2715639768
## NA's :20 NA's :112 NA's :58 NA's :31
## lit.rate.per co2 gdp unemp
## Min. :50.40 Min. : 7 Min. : 13687141 Min. : 0.700
## 1st Qu.:61.79 1st Qu.: 1536 1st Qu.: 3268276022 1st Qu.: 5.270
## Median :78.71 Median : 16490 Median : 15710148244 Median : 7.990
## Mean :76.58 Mean : 819503 Mean : 1001144539190 Mean : 8.984
## 3rd Qu.:89.90 3rd Qu.: 176126 3rd Qu.: 192469941577 3rd Qu.:11.853
## Max. :99.59 Max. :24059187 Max. :32543593572600 Max. :32.400
## NA's :215 NA's :19 NA's :25 NA's :135
## unemp.fe unemp.ma health.exp medbeds
## Min. : 0.500 Min. : 1.000 Mode:logical Min. : 0.200
## 1st Qu.: 5.150 1st Qu.: 4.844 NA's:264 1st Qu.: 3.647
## Median : 9.998 Median : 7.743 Median : 5.585
## Mean :10.721 Mean : 8.455 Mean : 5.621
## 3rd Qu.:13.563 3rd Qu.:10.860 3rd Qu.: 7.485
## Max. :33.300 Max. :31.900 Max. :14.800
## NA's :151 NA's :151 NA's :190
## surg.pro nurse.midwi
## Mode:logical Min. : 0.0596
## NA's:264 1st Qu.: 1.9053
## Median : 3.7000
## Mean : 4.4476
## 3rd Qu.: 6.2906
## Max. :15.0208
## NA's :203
Section 2: Data Manipulation 1: Data Types
For the purposes of today’s session, we will look at converting variables into factor and numeric variables.
For this you will need:
- as.numeric()
- as.factor()
Exercise 3: Start by converting the Continent variable from a character to a factor variable, using the as.factor() function.
WBD_1999$Continent <- as.factor(WBD_1999$Continent)
This groups together observations that share the same continent. The limitation of this is that it can result in a loss of information, especially if you round a numeric variable to reduce the number of factor levels created. Additionally, statistical analyses can be affected by treating a variable as categorical rather than numeric.
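To see the groups created by the conversion in Exercise 3, you can inspect the factor levels and their counts (an optional check, not part of the exercises):
levels(WBD_1999$Continent)   # the distinct continent values
table(WBD_1999$Continent)    # number of observations in each continent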
Exercise 4: Convert the lit.rate.per variable from numeric to integer using the as.integer() function, then convert it to a factor using the as.factor() function. Make sure to save the result to a new variable, lit.rate.fact.
WBD_1999$lit.rate.fact <- as.factor(as.integer(WBD_1999$lit.rate.per))
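One common pitfall to be aware of: calling as.numeric() directly on a factor returns the underlying level codes rather than the original values. If you ever need to turn lit.rate.fact back into numbers, a safe sketch is:
# as.numeric(WBD_1999$lit.rate.fact)               # returns the level codes (1, 2, 3, ...), usually not what you want
as.numeric(as.character(WBD_1999$lit.rate.fact))   # convert to character first to recover the original values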
Section 3: Missing Data
Exercise 5: Count the number of missing observations for the variables Continent, lit.rate.per and ed.years. (Three different approaches are shown below, one per variable.)
sum(is.na(WBD_1999$Continent))
## [1] 0
summary(WBD_1999$lit.rate.per)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 50.40 61.79 78.71 76.58 89.90 99.59 215
table(is.na(WBD_1999$ed.years))
##
## FALSE TRUE
## 206 58
Exercise 6: Count the number of missing values for Brazil (row 28), Togo (row 240) and the World (row 261), across the entire data set.
rowSums(is.na(WBD_1999[28,]))
## [1] 5
rowSums(is.na(WBD_1999[240,]))
## [1] 9
rowSums(is.na(WBD_1999[261,]))
## [1] 6
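Just as rowSums(is.na()) counts the missing values in each row, colSums(is.na()) counts them in each column, which is a quick way to see which variables are most affected (an optional check, not one of the exercises):
colSums(is.na(WBD_1999))   # number of missing values in each variable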
Exercise 7: After copying the data set to a new variable name (WBD_1999.new, for example), remove all the missing cases from lit.rate.per and then from ed.years. How many observations remain?
WBD_1999.new <- WBD_1999[complete.cases(WBD_1999[,"lit.rate.per"]),]
WBD_1999.new <- WBD_1999.new[complete.cases(WBD_1999.new[,"ed.years"]),]
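To answer the second part of the exercise, nrow() returns the number of remaining observations; tidyr’s drop_na() (loaded as part of the tidyverse) is an equivalent one-step alternative, sketched here:
nrow(WBD_1999.new)   # number of observations remaining after removing the missing cases
# WBD_1999.new <- drop_na(WBD_1999, lit.rate.per, ed.years)   # tidyverse equivalent of the two complete.cases() steps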
Section 4: Selecting Variables
Exercise 8: Create a new data set which contains only the variables Country Name (2), Continent (4), Pop (6), ed.years (15), labour (16), lit.rate.per (17) and co2 (18).
# selecting the variables by name ...
WBD_1999.new2 <- WBD_1999[,c("Country Name", "Continent", "Pop", "ed.years", "labour",
                             "lit.rate.per", "co2")]
# ... or, equivalently, by column position
WBD_1999.new2 <- WBD_1999[,c(2,4,6,15:18)]
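Since the tidyverse is loaded, dplyr’s select() is an equivalent and often more readable alternative (variable names containing spaces need backticks):
WBD_1999.new2 <- select(WBD_1999, `Country Name`, Continent, Pop, ed.years, labour,
                        lit.rate.per, co2)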
Exercise 9: Using the data set produced in Exercise 8, create the following data sets: one containing only European countries, one containing only countries with a literacy rate of at least 75%, and one containing only countries where the average years of education is greater than 6 but less than 10.
WBD_1999_europe <- filter(WBD_1999.new2, Continent == "Europe")
WBD_1999_lit75 <- filter(WBD_1999.new2, lit.rate.per >= 75)
WBD_1999_edu <- filter(WBD_1999.new2, ed.years > 6 & ed.years < 10)
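Note that filter() combines multiple comma-separated conditions with a logical AND, so the last filter could equivalently be written as:
WBD_1999_edu <- filter(WBD_1999.new2, ed.years > 6, ed.years < 10)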
Section 5: Transforming and Mutating Data
Exercise 10: Using the data set generated in Exercise 8, create a new variable, co2_pp, giving the CO2 emissions per person.
WBD_1999.new2 <- mutate(WBD_1999.new2,
co2_pp = co2/Pop)
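As an optional follow-up (not part of the exercise), the new variable can be combined with arrange() to see which countries emit the most CO2 per person:
arrange(WBD_1999.new2, desc(co2_pp))   # sort rows from highest to lowest CO2 per person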
Section 6: Bringing Everything Together
Using the skills covered in today’s session, attempt to complete the tasks below!
WBD_1999.ext1 <- WBD_1999 %>%
filter(`Continent` == "Africa" | `Continent` == "South America" | `Continent` == "Asia" ) %>%
filter(lifeexp > 60) %>%
filter(ed.years > 7)
WBD_1999.ext1 <- WBD_1999.ext1[,c(2,4,11,15)]
WBD_1999.ext2 <- WBD_1999 %>%
filter(`Continent` == "Aggregated Nations") %>%
filter(unemp > 10) %>%
mutate(male_prop = Pop.ma/Pop) %>%
mutate(fe_prop = Pop.fe/Pop)
WBD_1999.ext2 <- WBD_1999.ext2[,c(2,4,6,20,28,29)]
WBD_1999.ext3 <- WBD_1999 %>%
filter(`Continent` == "Europe" | `Continent` == "North America" | `Continent` == "Oceania" ) %>%
mutate(male_unemp_num = Pop.ma*(unemp.ma/100)) %>%
mutate(fe_unemp_num = Pop.fe*(unemp.fe/100))
WBD_1999.ext3 <- WBD_1999.ext3[,c(2,28,29)]
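As a final tip, repeated | comparisons on the same variable can be written more compactly with %in%. For example, the continent filter in the last task above is equivalent to:
filter(WBD_1999, Continent %in% c("Europe", "North America", "Oceania"))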