Package and Data Loading
As this is a largely introductory workshop, we will only be using packages from the tidyverse. These are core packages used across data science, statistics and applied work in R, and are therefore important for data manipulation and cleaning. As the data set we will be using contains some very large numbers, make sure to set scipen to 999 using the options() function.
library(tidyverse)
options(scipen = 999)
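With scipen set to 999, large values (such as the GDP figures in this data set) print in full rather than in scientific notation. As a quick illustration:
1e15   # with scipen = 999 this prints as 1000000000000000 rather than 1e+15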
To learn how to apply the techniques covered in today’s practical, we will be using data from the World Bank Open Data. In particular, we will be using a collection of variables from 1999, selected to give us plenty of room to explore!
This data is accessible through the downloadable .zip file linked here, through GitHub, or from the .zip file provided with the calendar invite. It can then be loaded using the read_csv() function, as shown below. Note: the file path, in this case "../data/WBD_1999.csv", should reflect where the data is stored on your computer.
WBD_1999 <- read_csv("../data/WBD_1999.csv")
This data set contains 26 variables covering population, health, education and economic indicators for each country.
Section 1: Viewing and Summarizing Data
Viewing and summarizing data are important first steps in understanding the data you wish to use in any project. There are several different functions you can use to get a first look at the data you are working with, including head(), tail(), view(), names() and str().
Exercise 1: Using one of the techniques mentioned, or another that you know of, view the data set.
#head(WBD_1999)   # first six rows
#tail(WBD_1999)   # last six rows
#view(WBD_1999)   # open the full data set in the viewer
names(WBD_1999)   # variable (column) names
## [1] "...1" "Country Name" "Country Code" "Continent" "year"
## [6] "Pop" "Pop.fe" "Pop.ma" "birthrate" "deathrate"
## [11] "lifeexp" "lifeexp.fe" "lifeexp.ma" "ed.spend" "ed.years"
## [16] "labour" "lit.rate.per" "co2" "gdp" "unemp"
## [21] "unemp.fe" "unemp.ma" "health.exp" "medbeds" "surg.pro"
## [26] "nurse.midwi"
str(WBD_1999)
## spec_tbl_df [264 × 26] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ ...1 : num [1:264] 1 2 3 4 5 6 7 8 9 10 ...
## $ Country Name: chr [1:264] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## $ Country Code: chr [1:264] "AFG" "ALB" "DZA" "ASM" ...
## $ Continent : chr [1:264] "Asia" "Europe" "Africa" "Oceania" ...
## $ year : num [1:264] 1999 1999 1999 1999 1999 ...
## $ Pop : num [1:264] 20170844 3108778 30623406 57053 64370 ...
## $ Pop.fe : num [1:264] 9784257 1543050 15103751 NA NA ...
## $ Pop.ma : num [1:264] 10386587 1565728 15519655 NA NA ...
## $ birthrate : num [1:264] 48.4 17.3 20.2 NA 12.6 ...
## $ deathrate : num [1:264] 12.04 5.98 5.05 NA NA ...
## $ lifeexp : num [1:264] 55.4 73.6 70.2 NA NA ...
## $ lifeexp.fe : num [1:264] 56.6 76.6 71.6 NA NA ...
## $ lifeexp.ma : num [1:264] 54.2 70.8 68.8 NA NA ...
## $ ed.spend : num [1:264] NA 3.59 NA NA NA ...
## $ ed.years : num [1:264] 6 8 10 NA 10 4 11 9 10 10 ...
## $ labour : num [1:264] 4906827 1330854 8670493 NA NA ...
## $ lit.rate.per: num [1:264] NA NA NA NA NA ...
## $ co2 : num [1:264] 821 2985 92045 NA 513 ...
## $ gdp : num [1:264] NA 3212119044 48639108361 NA 1239876305 ...
## $ unemp : num [1:264] NA 18.4 NA NA NA ...
## $ unemp.fe : num [1:264] NA 21.4 NA NA NA ...
## $ unemp.ma : num [1:264] NA 16.4 NA NA NA ...
## $ health.exp : logi [1:264] NA NA NA NA NA NA ...
## $ medbeds : num [1:264] NA 3.03 NA NA 3.2 ...
## $ surg.pro : logi [1:264] NA NA NA NA NA NA ...
## $ nurse.midwi : num [1:264] NA 4.08 NA NA 3.17 ...
## - attr(*, "spec")=
## .. cols(
## .. ...1 = col_double(),
## .. `Country Name` = col_character(),
## .. `Country Code` = col_character(),
## .. Continent = col_character(),
## .. year = col_double(),
## .. Pop = col_double(),
## .. Pop.fe = col_double(),
## .. Pop.ma = col_double(),
## .. birthrate = col_double(),
## .. deathrate = col_double(),
## .. lifeexp = col_double(),
## .. lifeexp.fe = col_double(),
## .. lifeexp.ma = col_double(),
## .. ed.spend = col_double(),
## .. ed.years = col_double(),
## .. labour = col_double(),
## .. lit.rate.per = col_double(),
## .. co2 = col_double(),
## .. gdp = col_double(),
## .. unemp = col_double(),
## .. unemp.fe = col_double(),
## .. unemp.ma = col_double(),
## .. health.exp = col_logical(),
## .. medbeds = col_double(),
## .. surg.pro = col_logical(),
## .. nurse.midwi = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Although viewing your data set is useful for understanding the type of data you are working with, summarizing it with the summary() function gives you the minimum, maximum, mean, median, quartiles and number of missing values for each quantitative variable, as well as frequency counts for factor variables.
Exercise 2: Use the summary() function to generate a summary of the data set.
summary(WBD_1999)
## ...1 Country Name Country Code Continent
## Min. : 1.00 Length:264 Length:264 Length:264
## 1st Qu.: 66.75 Class :character Class :character Class :character
## Median :132.50 Mode :character Mode :character Mode :character
## Mean :132.50
## 3rd Qu.:198.25
## Max. :264.00
##
## year Pop Pop.fe Pop.ma
## Min. :1999 Min. : 9346 Min. : 39545 Min. : 35129
## 1st Qu.:1999 1st Qu.: 1188092 1st Qu.: 1291039 1st Qu.: 1301637
## Median :1999 Median : 7992324 Median : 5165368 Median : 4860090
## Mean :1999 Mean : 240195206 Mean : 130589623 Mean : 132513783
## 3rd Qu.:1999 3rd Qu.: 46398877 3rd Qu.: 31518470 3rd Qu.: 30712597
## Max. :1999 Max. :6034566869 Max. :2995108533 Max. :3036816014
## NA's :1 NA's :24 NA's :24
## birthrate deathrate lifeexp lifeexp.fe
## Min. : 7.80 Min. : 1.924 Min. :38.63 Min. :39.16
## 1st Qu.:14.05 1st Qu.: 6.415 1st Qu.:59.66 1st Qu.:60.86
## Median :22.71 Median : 8.631 Median :69.19 Median :72.19
## Mean :24.64 Mean : 9.227 Mean :66.43 Mean :68.92
## 3rd Qu.:32.77 3rd Qu.:11.499 3rd Qu.:73.62 3rd Qu.:76.63
## Max. :53.82 Max. :24.366 Max. :80.57 Max. :83.90
## NA's :16 NA's :17 NA's :20 NA's :20
## lifeexp.ma ed.spend ed.years labour
## Min. :38.10 Min. : 1.195 Min. : 0.000 Min. : 35359
## 1st Qu.:57.80 1st Qu.: 3.299 1st Qu.: 7.000 1st Qu.: 1200604
## Median :66.29 Median : 4.152 Median : 9.000 Median : 4367000
## Mean :64.05 Mean : 4.419 Mean : 8.646 Mean : 120717076
## 3rd Qu.:71.06 3rd Qu.: 5.148 3rd Qu.:10.000 3rd Qu.: 27488669
## Max. :77.70 Max. :14.539 Max. :16.000 Max. :2715639768
## NA's :20 NA's :112 NA's :58 NA's :31
## lit.rate.per co2 gdp unemp
## Min. :50.40 Min. : 7 Min. : 13687141 Min. : 0.700
## 1st Qu.:61.79 1st Qu.: 1536 1st Qu.: 3268276022 1st Qu.: 5.270
## Median :78.71 Median : 16490 Median : 15710148244 Median : 7.990
## Mean :76.58 Mean : 819503 Mean : 1001144539190 Mean : 8.984
## 3rd Qu.:89.90 3rd Qu.: 176126 3rd Qu.: 192469941577 3rd Qu.:11.853
## Max. :99.59 Max. :24059187 Max. :32543593572600 Max. :32.400
## NA's :215 NA's :19 NA's :25 NA's :135
## unemp.fe unemp.ma health.exp medbeds
## Min. : 0.500 Min. : 1.000 Mode:logical Min. : 0.200
## 1st Qu.: 5.150 1st Qu.: 4.844 NA's:264 1st Qu.: 3.647
## Median : 9.998 Median : 7.743 Median : 5.585
## Mean :10.721 Mean : 8.455 Mean : 5.621
## 3rd Qu.:13.563 3rd Qu.:10.860 3rd Qu.: 7.485
## Max. :33.300 Max. :31.900 Max. :14.800
## NA's :151 NA's :151 NA's :190
## surg.pro nurse.midwi
## Mode:logical Min. : 0.0596
## NA's:264 1st Qu.: 1.9053
## Median : 3.7000
## Mean : 4.4476
## 3rd Qu.: 6.2906
## Max. :15.0208
## NA's :203
Section 2: Data Manipulation 1: Data Types
For the purposes of today’s session, we will look at converting variables into factor and numeric variables.
For this you will need:
- as.numeric()
- as.factor()
Exercise 3: Start by converting the Continent variable from a character to a factor variable, using the as.factor() function.
WBD_1999$Continent <- as.factor(WBD_1999$Continent)
This groups together observations that share the same continent. The limitation of this is that it can result in a loss of information, especially if you round a numeric variable to reduce the number of factor levels created. Additionally, statistical analyses can be affected by treating a variable as categorical rather than numeric.
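To see the groups created by the conversion in Exercise 3, you can inspect the factor levels and their counts (an optional check, not part of the exercises):
levels(WBD_1999$Continent)   # the distinct continent values
table(WBD_1999$Continent)    # number of observations in each continent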
Exercise 4: Convert the lit.rate.per variable from numeric to integer using the as.integer() function, then convert it to a factor using the as.factor() function. Make sure to save the result to a new variable, lit.rate.fact.
WBD_1999$lit.rate.fact <- as.factor(as.integer(WBD_1999$lit.rate.per))
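One common pitfall to be aware of: calling as.numeric() directly on a factor returns the underlying level codes rather than the original values. If you ever need to turn lit.rate.fact back into numbers, a safe sketch is:
# as.numeric(WBD_1999$lit.rate.fact)               # returns the level codes (1, 2, 3, ...), usually not what you want
as.numeric(as.character(WBD_1999$lit.rate.fact))   # convert to character first to recover the original values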
Section 3: Missing Data
Exercise 5: Count the number of missing observations for the variables Continent, lit.rate.per and ed.years. (Three different approaches are shown below, one per variable.)
sum(is.na(WBD_1999$Continent))
## [1] 0
summary(WBD_1999$lit.rate.per)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 50.40 61.79 78.71 76.58 89.90 99.59 215
table(is.na(WBD_1999$ed.years))
##
## FALSE TRUE
## 206 58
Exercise 6: Count the number of missing values for Brazil (row 28), Togo (row 240) and the World (row 261), across the entire data set.
rowSums(is.na(WBD_1999[28,]))
## [1] 5
rowSums(is.na(WBD_1999[240,]))
## [1] 9
rowSums(is.na(WBD_1999[261,]))
## [1] 6
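Just as rowSums(is.na()) counts the missing values in each row, colSums(is.na()) counts them in each column, which is a quick way to see which variables are most affected (an optional check, not one of the exercises):
colSums(is.na(WBD_1999))   # number of missing values in each variable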
Exercise 7: After copying the data set to a new variable name (WBD_1999.new, for example), remove all the missing cases from lit.rate.per and then from ed.years. How many observations remain?
WBD_1999.new <- WBD_1999[complete.cases(WBD_1999[,"lit.rate.per"]),]
WBD_1999.new <- WBD_1999.new[complete.cases(WBD_1999.new[,"ed.years"]),]
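To answer the second part of the exercise, nrow() returns the number of remaining observations; tidyr’s drop_na() (loaded as part of the tidyverse) is an equivalent one-step alternative, sketched here:
nrow(WBD_1999.new)   # number of observations remaining after removing the missing cases
# WBD_1999.new <- drop_na(WBD_1999, lit.rate.per, ed.years)   # tidyverse equivalent of the two complete.cases() steps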
Section 4: Selecting Variables
Exercise 8: Create a new data set which contains only the variables Country Name (2), Continent (4), Pop (6), ed.years (15), labour (16), lit.rate.per (17) and co2 (18).
# selecting the variables by name ...
WBD_1999.new2 <- WBD_1999[,c("Country Name", "Continent", "Pop", "ed.years", "labour",
                             "lit.rate.per", "co2")]
# ... or, equivalently, by column position
WBD_1999.new2 <- WBD_1999[,c(2,4,6,15:18)]
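Since the tidyverse is loaded, dplyr’s select() is an equivalent and often more readable alternative (variable names containing spaces need backticks):
WBD_1999.new2 <- select(WBD_1999, `Country Name`, Continent, Pop, ed.years, labour,
                        lit.rate.per, co2)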
Exercise 9: Using the data set produced in Exercise 8, create the following data sets: one containing only European countries, one containing only countries with a literacy rate of at least 75%, and one containing only countries where the average years of education is greater than 6 but less than 10.
WBD_1999_europe <- filter(WBD_1999.new2, Continent == "Europe")
WBD_1999_lit75 <- filter(WBD_1999.new2, lit.rate.per >= 75)
WBD_1999_edu <- filter(WBD_1999.new2, ed.years > 6 & ed.years < 10)
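Note that filter() combines multiple comma-separated conditions with a logical AND, so the last filter could equivalently be written as:
WBD_1999_edu <- filter(WBD_1999.new2, ed.years > 6, ed.years < 10)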
Section 5: Transforming and Mutating Data
Exercise 10: Using the data set generated in Exercise 8, create a new variable, co2_pp, giving the CO2 emissions per person.
WBD_1999.new2 <- mutate(WBD_1999.new2,
co2_pp = co2/Pop)
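As an optional follow-up (not part of the exercise), the new variable can be combined with arrange() to see which countries emit the most CO2 per person:
arrange(WBD_1999.new2, desc(co2_pp))   # sort rows from highest to lowest CO2 per person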
Section 6: Bringing Everything Together
Using the skills covered in today’s session, attempt to complete the tasks below!
WBD_1999.ext1 <- WBD_1999 %>%
filter(`Continent` == "Africa" | `Continent` == "South America" | `Continent` == "Asia" ) %>%
filter(lifeexp > 60) %>%
filter(ed.years > 7)
WBD_1999.ext1 <- WBD_1999.ext1[,c(2,4,11,15)]
WBD_1999.ext2 <- WBD_1999 %>%
filter(`Continent` == "Aggregated Nations") %>%
filter(unemp > 10) %>%
mutate(male_prop = Pop.ma/Pop) %>%
mutate(fe_prop = Pop.fe/Pop)
WBD_1999.ext2 <- WBD_1999.ext2[,c(2,4,6,20,28,29)]
WBD_1999.ext3 <- WBD_1999 %>%
filter(`Continent` == "Europe" | `Continent` == "North America" | `Continent` == "Oceania" ) %>%
mutate(male_unemp_num = Pop.ma*(unemp.ma/100)) %>%
mutate(fe_unemp_num = Pop.fe*(unemp.fe/100))
WBD_1999.ext3 <- WBD_1999.ext3[,c(2,28,29)]
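As a final tip, repeated | comparisons on the same variable can be written more compactly with %in%. For example, the continent filter in the last task above is equivalent to:
filter(WBD_1999, Continent %in% c("Europe", "North America", "Oceania"))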