Data in R

Session Objectives and Transferable Skills

To understand the importance of data manipulation and cleaning for good data analytic practices.
To understand how to select, manipulate and transform data under both standardized and conditional situation.
To understand how to clean data through identifying or removing missing observations.

Data Summarization, using summary(),
Data Manipulation, changing data types, transforming and mutating data,
Data Selection, based on static or conditional situations,
Data Cleaning, through identification, removal and replacement of missing observations,

Schedule

Introduction (5 minutes)

Part 1: (40 minutes)

Introduction to Data & Big Data (Theory)
Data Loading, Viewing & Summaries (Practical, Section 1)
Introduction to Data Types and Missing data (Theory)
Data Manipulation 1: Changing Data Types (Practical, Section 2)
Missing Values: Identification and Removal (Practical, Section 3)

Break (5 minutes)

Part 2: (35 minutes)

Introduction to Selection of Variables (Theory)
Selection of Variables and Observations (Practical, Section 4)
Manipulation through Mutation and Transformation (Theory)
Data Manipulation 2: Mutation & Transformation (Practical, Section 5)

Review and Conclusion (5 minutes)

Introduction

“House Rules”
Who am I?
Contact Me
- Find me online: thomasjwise.com
- Email me: thomas.j.wise@outlook.com
- Connect with me: www.linkedin.com/in/tjwise213/
- Follow me: @thomasj_wise

Part 1 Intro to Data

Single or Multiple variables, providing insight for qualitative or quantitative purposes.
Prior to analysis, raw data must be cleaned and manipulated into a usable format!

Big Data & Society

Active Data Generation
- Company/Product Reviews
- Company & Organisation subscriptions
Passive Data Generation
- Terms of Service Collection (Facebook, Google Maps)
- Location based information (Check ins, CCTV)
Through manipulation, merging and cleaning of this data, analyses can be run to help understand how society operates.

Data Loading

tidyverse package: readr
Additional packages: readxl, haven, xml2
csv files: read_csv()
delim files (those without delimiters): read_delim()
Excel Spreadsheets: read_excel(), read_xls(), read_xlsx
SAS Data Files: read_sas()
SPSS Data Files: read_sav(), read_spss()
Stata Data Files: read_stata()
Unsure about how to load in a data type? Check it on Google!
You prefix any function with ? to bring up the help menu and see what additional parameters can be used!

Data Viewing & Summaries

Once loaded, you can view the data in multiple ways.

view() - Views the entire data set in a Tab
head() - View the first six observations
tails() - View the last six observations
names() / colnames() - View the column names of the data set
str() - View the data set’s structure

Data Types

Numeric: 1, 2, 3, can present as “double”
Character: “one”, “two”, “three”
Logical: “TRUE”, “FALSE”
Factors: Nominal Variables
Vector: A string of elements
Matrix: A multiple vectors, formed into columns and rows, which share the same data type
Arrays: Similar to matrices, but with more than 2 dimensions
Data Frames: Multiple vectors, formed into columns and rows, each column indicates a different variable, each row a different observation
List: An ordered collection of objects, can be a merge of one type or many different ones.

Data Types

Conversion Functions
- To Numeric: as.numeric()
- To Integer: as.integer()
- To Character: as.character()
- To Logical: as.logical()
- To Factor: as.factor()
These will work in most cases, but do look up additional information regarding dates (as.date()), as well as data structures (matrices, data frames etc) as these can be more complex!

Missing Data: Identification

Missing values are typically represented by NA. How to identify the number of NAs per variable (column)

Using summary()
Using sum(is.na())
Using table(is.na())

How to identify the number of NAs per observation (row)

Using rowSums(is.na())

Handling Missing Data

Handling Missing Data is a highly debated topic, you can choose to:

Exclude the value: Removal
Impute the value: Mean, Median or Predicted Value (through regression or other analysis)

For today’s introduction we will focus on Removal, with an extension for those interested in Replacement online, as this combines knowledge from all of today’s sections.

Missing Data: Removal

To remove missing values from individual variables, the entire observation (row) of that missing value must be removed. This can be done by:

Using na.omit()
Using complete.cases()

You can also remove observation which surpass a threshold of missingness (for example 50% missing values), this however will be covered later!

Break

Part 2 Selecting Data in R

When manipulating data you may want to:

Select a specific Column(s)
Select a specific observation
Select a specific group of observations
Select observations which meet specific conditional requirements

Selecting Variables 1: Selecting Columns (Basic)

When selecting a variables (columns) you can:

Name the column specifically: WBD_1999$ed.year or WBD_1999[,ed.years]
Specify a range of columns by name: WBD_1999[,c(ed.years, lit.rate.per)]
Specify the column by number: WBD_1999[,15] (selecting ed.years)
Specify a continuous range of columns by number: WBD_1999[,15:16] (selecting ed.year & labour)
Specify a range of columns by number: WBD_1999[,c(15,17)] (selecting: ed.year & lit.rate.per)
Specify complement set of variables (everything but): WBD_1999[,!1]

Selecting Variables 2: Selecting Columns (Advanced)

If you are unsure of the exact names, column numbers or have too many to count/list. BUT know that all contain, are prefixed or suffixed with something you can use:

First, use select() function with:
Starts with, prefixed with: starts_with()
Ends with, suffixed by: ends_with()
Contains: contains()
Exactly Matches: matches()
Contains a numerical range: num_range()

Selecting Observations 1: Static Selection

When selecting observations, you can similarly:

Select a row by number or position: WBD_1999[1,] (selects the first row)
Select multiple rows by position: WBD_1999[c(1,3,5),]
Select continuous rows by position: WBD_1999[1:5,]

Selecting Observations 2: Conditional Selection

You can also select observations based upon pre-determined conditions,

For example, all those countries who’s ed.year are equal or above 10

This can be achieved through:

Using the filter() function: filter(WBD_1999, ed.years >= 10)
Using the subset() function: subset(WBD_1999, WBD_1999[,15] >= 10)
Using raw base R functions: WBD_1999[WBD_1999$ed.years > 10,]

Selecting Observations 2: Selecting Operators

== is exactly equal to
> is greater than
< is less than
>= is greater than or equal to
<= is less than or equal to
!= is not equal to
& union (and)
| or

Transforming and Mutating Data

Transformations and Mutations of Data are achieved through the mutate() function

How to structure mutate(), with an example lets calculate the GDP per person

WBD_1999 <- mutate(WBD_1999, GDP_PP = gdp / Pop)

Useful Mutate Functions

Mathematical Functions (+, -, /, log())
Summary Functions (mean(), median(), sd())
Conditional Functions (if_else(), recode())
Cumulative Functions (cumsum(), cummean(), etc)
Rank Functions (dense_rank(), row_number(), cume_dist())

Section 6 & Extension Topics

Section 6: Bringing everything together, will develop your skills from today’s session
Check online for additional extension topic: Replacing Missing Values

Questions, Comments and More

Find me online: thomasjwise.com
Email me: thomas.j.wise@outlook.com
Connect with me: www.linkedin.com/in/tjwise213/
Follow me: @thomasj_wise