R basics

Morning checklist
1. Intro to RStudio
2. Data types and Data structures
3. Overview using our gapminder dataset
Summary

Morning checklist

Good morning!
Setup Etherpad: http://pad.software-carpentry.org/2016-01-30-UTA
Workshop website: https://annawilliford.github.io/2016-01-30-UTA
Our DATA file for today:
- https://annawilliford.github.io/2016-01-30-UTA/data/gapminderData.csv
R RESOURCES to explore:

1. Intro to RStudio

Why R:
- Statistics, graphics, general purpose programming
- Thousands of packages available - these are collections of functions that implement all kinds of analyses of data sets from various disciplines
- Open source software - use for free, modify as you wish
- Active community of developers - new packages are being written as we speak
RStudio as an interface for R (IDE - integrated development environment)
- The 4 windows of RStudio:
  - Top left = text editor, write your commands/code here
  - Bottom left = R console, run (execute) your commands here
  - Top right = things that R keeps track of:
    - History tab records all the commands you type in R console
    - Environment tab keeps track of all objects you create in the current session
    - Both records can be saved for later as .Rhistory and .RData files
  - Bottom right = several helpful tabs - play with them to see what they do
- Navigation Commands
  - getwd(), setwd()
  - Challenge 1.1 (see also Rcommands.R file for challenges and answers)
```
TASK: Use the setwd() function to navigate to the `SoftwareCarpentry_Spr16` folder.
```
- Command flow
  - Console to R script; R script to console
  - Make Sunday_AM (dir.create()), go to Sunday_AM, save file “Rcommands.R”, make data folder
- More on console
  - incomplete commands; Esc to return to > when stuck with +
  - output is not reusable - need to assign the output to a variable to reuse
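The folder setup described above can be sketched as follows. This is a minimal example, assuming a temporary directory stands in for wherever you keep your workshop files; `workshop_dir`, `old_wd` and `data_made` are illustrative names that are not part of the lesson.

```r
# Use a temporary location so the sketch does not touch your real folders
workshop_dir <- file.path(tempdir(), "Sunday_AM")

dir.create(workshop_dir, showWarnings = FALSE)  # make Sunday_AM
old_wd <- setwd(workshop_dir)                   # go to Sunday_AM; setwd() returns the previous directory
getwd()                                         # confirm where you are now

dir.create("data", showWarnings = FALSE)        # make the data folder inside Sunday_AM
data_made <- dir.exists("data")                 # TRUE if the folder is there

setwd(old_wd)                                   # go back to where you started
```

Saving the value returned by `setwd()` makes it easy to return to the original folder afterwards.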
Variables and Assignment
- Assignment operator: <- (Alt+dash)
- cost <- 34; cost2 <- cost*2; cost <- cost + 2
- Challenge 1.2
```
TASK: What will be the value of each variable after each statement in the following code?

mass <- 47.5
age <- 122
mass <- mass * 2.3
age <- age - 20
```
- variable names:
  - NO SPACES in names
  - DO NOT START with numbers
  - names should be MEANINGFUL - help yourself and others to understand your code!
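The assignment example above can be run line by line. A small sketch, using the same variable names as the lesson, shows that a variable computed from another one does not change when the original is reassigned:

```r
cost <- 34          # cost is 34
cost2 <- cost * 2   # cost2 is 68, computed from the current value of cost
cost <- cost + 2    # cost is now 36

cost                # 36
cost2               # still 68: it is not recomputed when cost changes
```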
Manage environment
- ls() - list all objects in your environment
- rm(objectName) - remove object objectName
- rm(list=ls()) - remove all objects, clear environment
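A short sketch of these commands in action; `tmp_a`, `tmp_b`, `a_exists` and `b_exists` are throwaway names chosen for illustration:

```r
tmp_a <- 1
tmp_b <- "two"
ls()                          # the listing now includes tmp_a and tmp_b

rm(tmp_a)                     # remove a single object
a_exists <- exists("tmp_a")   # FALSE: tmp_a is gone
b_exists <- exists("tmp_b")   # TRUE: tmp_b is untouched
```

`rm(list=ls())` is left out of the sketch on purpose, since it would wipe every object in your current session.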
Help functions for functions
- ?plot
- args(setwd)
- help(mean)
R packages: To use functions from packages that are not installed by default, do 2 things:
- install.packages("knitr") - get package
- library(knitr) - load package into R

2. Data types and Data structures

Let’s take a look at the example of a dataset, gapminderData.csv

To do that, recall your bash lesson. Open Git Bash, navigate to the Sunday_AM/data folder, then download the file.
- curl -O https://annawilliford.github.io/2016-01-30-UTA/data/gapminderData.csv
- Open in Excel

This is a familiar, table-like dataset suitable for different statistical analyses. Much of the data you are working with can probably be represented in this format. Here we have a record of population size, life expectancy and other kinds of information for different countries.

Any ideas about how this kind of dataset can be represented in R? In R this whole dataset is a single object (or data structure) built out of smaller pieces. Think of a castle built of legos. We want to understand how to build a castle and how to take it apart. Our example of a castle is this dataset (in R it is known as a data frame); it is relatively simple, but this is as far as we will go today. Let’s start from the smallest lego pieces and build our dataset from them.

Smallest units in R: single-element data structures

Let’s assign the value 45 to a variable named age. We just created the smallest lego piece (smallest object) in R:

age <- 45
length(age)

## [1] 1

str(age)

##  num 45

Variables can hold values of various types. Most common data types:

numeric (double + integer)
character
logical
complex
…

For example: What data type is stored in score variable?

score<-79
is.integer(score)

## [1] FALSE

typeof(score)

## [1] "double"

typeof(is.integer(score))

## [1] "logical"

The last expression is an example of a nested function call. Nested function calls are very common in R, but can be difficult to read at first. You can always split a nested call into a series of single function calls. Remember that the expression inside the innermost parentheses is the argument (input) for the function that will be evaluated first.

Challenge 2.1: Learn how to read the output of nested functions

TASK: Break the following expression into multiple single function calls. You will need to assign the output of each function to a variable that will serve as an input (argument) for the next function. What is the value of each variable? What does each function do? Assign: `score<-79`

is.logical(is.numeric(typeof(is.integer(score))))

Challenge 2.1: Answer

score <- 79
step1 <- is.integer(score)
print(step1)

## [1] FALSE

step2 <- typeof(step1)
print(step2)

## [1] "logical"

step3 <- is.numeric(step2)
print(step3)

## [1] FALSE

step4 <- is.logical(step3)
print(step4)

## [1] TRUE

## Or as a single step:
print(is.logical(is.numeric(typeof(is.integer(score)))))

## [1] TRUE

Sometimes you will need to convert between data types. There are functions that do that: as.integer(), as.character(), and so on. The conversion between data types is not always possible - why? Let’s see what happens here:

score <- 79
typeof(score)

## [1] "double"

score <- as.integer(score)
typeof(score)

## [1] "integer"

#but can we convert character to integer?
name <- "Sasha"
typeof(name)

## [1] "character"

name <- as.integer(name)

## Warning: NAs introduced by coercion

# the data type will be changed, but no value will be assigned
typeof(name)

## [1] "integer"

print(name)   # NA = missing value

## [1] NA
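A few more conversions are sketched below to show which ones succeed and which produce `NA`. The variable names are illustrative only; `suppressWarnings()` is used just to keep the sketch quiet and is not required.

```r
whole    <- as.integer(3.9)       # 3: the fractional part is dropped, not rounded
from_txt <- as.numeric("3.14")    # 3.14: text that looks like a number converts cleanly
as_text  <- as.character(79)      # "79": any number can become text
flag     <- as.logical("TRUE")    # TRUE

# Text that does not look like a number cannot be converted and becomes NA
failed <- suppressWarnings(as.integer("Sasha"))
is.na(failed)                     # TRUE
```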

Data structures with multiple elements

The small objects can be combined to build larger objects. Look at the gapminder dataset. Our smallest objects can be used to represent a single element in the dataset, like individual year, or individual country, but what would be the simplest object that you can make with multiple elements?

Vectors: collection of elements of the same data type

what part of our dataset can be represented by a vector?
how to create: use the combine (concatenate) function, c()

###let's make a vector
v<-c(1:3, 45)
v

## [1]  1  2  3 45

##examine object
typeof(v) # tells you the data type of vector elements

## [1] "double"

length(v) # what does this do?

## [1] 4

str(v)    # tells you the structure of the object VERY USEFUL

##  num [1:4] 1 2 3 45

##view
head(v, n=2)  #look at the first 2 elements

## [1] 1 2

#what would `tail()` do?
tail(v, n=3)  #look at the last 3 elements

## [1]  2  3 45

##manipulate
v <- c(v,56)  #add element to vector
#vectorization: no loop is required to perform operation on each vector element
v1 <- 2*v   # multiply each vector element by 2
v1

## [1]   2   4   6  90 112

# let's try to add vectors
v2<-c(1:5)
v3 <- v1+v2
v3

## [1]   3   6   9  94 117

# you can name vectors; find out what `names()` function does
# change data type
v3 <- as.character(v3)  #also known as coercion
str(v3)

##  chr [1:5] "3" "6" "9" "94" "117"
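Because a vector holds a single data type, `c()` silently converts mixed inputs to the most general type among them. The sketch below illustrates that rule and the `names()` function mentioned above; `menu_cost` and the `mixed` variables are hypothetical names used only here.

```r
# Mixed inputs are coerced to one common type
mixed1 <- typeof(c(1L, 2.5))    # "double": integer is promoted to double
mixed2 <- typeof(c(1, "a"))     # "character": numbers become text
mixed3 <- typeof(c(TRUE, 2L))   # "integer": TRUE becomes 1

# names() attaches a label to each element, which can then be used to subset
menu_cost <- c(4.99, 2.99, 3.29)
names(menu_cost) <- c("chicken", "soup", "salad")
menu_cost["soup"]               # the element labeled "soup", 2.99
```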

Matrices: 2-dimensional vectors that contain elements of the same data type

how to create: use matrix() function

m <- matrix(c(1:18), 3,6)
m

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    4    7   10   13   16
## [2,]    2    5    8   11   14   17
## [3,]    3    6    9   12   15   18

# try functions that we used for vectors - do they work on matrices?
# new to 2D structures
dim(m)  # tells you number of rows and columns in your matrix

## [1] 3 6
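A short sketch of how the vector functions behave on the matrix above, plus two-index subsetting. The `byrow` argument is a standard `matrix()` option that the lesson does not cover; `m_byrow` is an illustrative name.

```r
m <- matrix(1:18, nrow = 3, ncol = 6)   # filled column by column (the default)

length(m)     # 18: total number of elements, not rows or columns
nrow(m)       # 3
ncol(m)       # 6
m[2, 3]       # 8: row 2, column 3
m[, 1]        # 1 2 3: the whole first column

# Fill row by row instead
m_byrow <- matrix(1:18, nrow = 3, ncol = 6, byrow = TRUE)
m_byrow[1, ]  # 1 2 3 4 5 6
```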

Factors: special vectors used to represent categorical data

what part of our dataset can be represented by a factor?
to create: use factor()

f <- factor(c("M","F","F","F")) #4 observations, the first one for male, other 3 for female
str(f) # what are these numbers in the output?

##  Factor w/ 2 levels "F","M": 2 1 1 1

typeof(f)   # factors are of integer data type! Levels are numbered in alphabetical order

## [1] "integer"

#sometimes it is important to reorder levels
f <- factor(f, levels=c("M","F"))
str(f)

##  Factor w/ 2 levels "M","F": 1 2 2 2
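The numbers in the `str()` output are integer codes that point into the vector of levels. A minimal sketch, using a few extra base R functions (`levels()`, `table()`) that the lesson does not introduce; `f2` and `counts` are illustrative names:

```r
f <- factor(c("M", "F", "F", "F"))

levels(f)           # "F" "M": the labels, sorted alphabetically by default
as.integer(f)       # 2 1 1 1: each value stored as a position in levels(f)
counts <- table(f)  # how many observations fall in each level

# Reordering the levels changes the codes, not the labels
f2 <- factor(f, levels = c("M", "F"))
as.integer(f2)      # 1 2 2 2
as.character(f2)    # "M" "F" "F" "F"
```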

Lists : generic vectors - collection of elements with different data types

what part of our dataset can be represented by a list?
to create: use list() function

l<-list("Afghanistan", 1952, 8769855)
print(l)

## [[1]]
## [1] "Afghanistan"
## 
## [[2]]
## [1] 1952
## 
## [[3]]
## [1] 8769855

typeof(l)

## [1] "list"

str(l)

## List of 3
##  $ : chr "Afghanistan"
##  $ : num 1952
##  $ : num 8769855

length(l)

## [1] 3
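List elements can also be given names, which makes them easier to pull out. The sketch below reuses the values from the list above; the element names `country`, `year` and `pop` and the variable `rec` are assumptions made for illustration.

```r
rec <- list(country = "Afghanistan", year = 1952, pop = 8769855)

names(rec)                # "country" "year" "pop"
rec$year                  # 1952: extract one element by name
rec[["country"]]          # "Afghanistan": [[ ]] returns the element itself
class(rec[["country"]])   # "character"
class(rec["country"])     # "list": [ ] returns a smaller list
```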

CHALLENGE 2.2

TASK: Try to create a list named `myOrder` that contains the following
data structures as list elements:

-- Element 1 is a character vector of length 4 that lists the menu items
you ordered from the restaurant: chicken, soup, salad, tea.

-- Element 2 is a factor that describes menu items as "liquid" or "solid".

-- Element 3 is a vector that records the cost of each menu item:
4.99, 2.99, 3.29, 1.89.

*Hint: Define your elements first, then create a list with them.

CHALLENGE 2.2: Answer

menuItems<-c("chicken", "soup", "salad", "tea")
menuType<-factor(c("solid", "liquid", "solid", "liquid"))
menuCost<-c(4.99, 2.99, 3.29, 1.89)
       
myOrder<-list(menuItems, menuType, menuCost)

Now apply the following functions to the list you created. Try to predict the output before you run the command.

length(myOrder)
str(myOrder)
print(myOrder)

Data frames
- Let’s go back to gapminder dataset. Could you make an informative guess about how this data structure can be represented in R?
- Yes! It is a list of vectors of equal length. Let’s look at our myOrder list to see if we can make a data frame out of it. Is the list we just made suitable for a data frame? Yes, the elements of the list are vectors of equal size (but they do not have to be list elements).
Previously we used list() to combine our elements:
- myOrder<-list(menuItems, menuType, menuCost)
Now let’s combine with data.frame() function. How? Give it a different name, myOrder_df.
```
myOrder_df<-data.frame(menuItems, menuType, menuCost)
#now view it!
myOrder_df
```
```
##   menuItems menuType menuCost
## 1   chicken    solid     4.99
## 2      soup   liquid     2.99
## 3     salad    solid     3.29
## 4       tea   liquid     1.89
```
```
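#and check with `str()` - anything different compared to `str(myOrder)`
#output? What is happening with data types?
#(the output below is from R older than 4.0.0; since R 4.0.0 `data.frame()`
#keeps character vectors as chr unless you pass stringsAsFactors=TRUE)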
str(myOrder_df)
```
```
## 'data.frame':    4 obs. of  3 variables:
##  $ menuItems: Factor w/ 4 levels "chicken","salad",..: 1 3 2 4
##  $ menuType : Factor w/ 2 levels "liquid","solid": 2 1 2 1
##  $ menuCost : num  4.99 2.99 3.29 1.89
```
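Whether character columns become factors depends on the `stringsAsFactors` argument of `data.frame()`, whose default changed from `TRUE` to `FALSE` in R 4.0.0. A minimal sketch that sets it explicitly, so the result is the same in any R version; `df_chr` and `df_fct` are illustrative names:

```r
menuItems <- c("chicken", "soup", "salad", "tea")
menuCost  <- c(4.99, 2.99, 3.29, 1.89)

# Keep text columns as character vectors
df_chr <- data.frame(menuItems, menuCost, stringsAsFactors = FALSE)
class(df_chr$menuItems)   # "character"

# Convert text columns to factors
df_fct <- data.frame(menuItems, menuCost, stringsAsFactors = TRUE)
class(df_fct$menuItems)   # "factor"

dim(df_chr)               # 4 2: four rows, two columns either way
```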

General subsetting rules

Let’s talk about how to take your dataset apart. In general, you can access every element of your data set. You must be able to do that to manipulate and analyze your data. There are three general ways to subset the data:

### 1. By position index
## 1a. Use `[]` operator
v<-c(1:10)
v

##  [1]  1  2  3  4  5  6  7  8  9 10

## see what happens here
v[2]

## [1] 2

v[c(3:6)]

## [1] 3 4 5 6

v[-c(3:5)]

## [1]  1  2  6  7  8  9 10

## 1b. Use `which` function - extracts the position indices of the
## elements with a specified values:
v<-c(1,3,5,5,7,5)
v1<-v[which(v==5)]  #get vector elements equal to 5
v1  #Can you explain the output? Try `which(v==5)`

## [1] 5 5 5

## the above works for lists too; notice that [] returns a list, use [[]] to get a vector
## try subsetting myOrder list we created above
##for 2D structures like matrices and dataframes provide 2 indices [row, column]
myOrder_df[1:3, ] #gets first 3 rows

##   menuItems menuType menuCost
## 1   chicken    solid     4.99
## 2      soup   liquid     2.99
## 3     salad    solid     3.29

### 2. By name:
## Use `$` operator to extract columns as vectors 
myOrder_df$menuType

## [1] solid  liquid solid  liquid
## Levels: liquid solid

### 3. By logical vector index: selects elements corresponding to TRUE values
### of logical vector:
v

## [1] 1 3 5 5 7 5

v1<-v[v==5]
v1

## [1] 5 5 5

# how does the above work? Try only `v==5`
v==5   # returns logical vector

## [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE

##Use `myOrder_df` dataframe: select rows that satisfy various conditions
##Display the logical vector to understand the output
df1<-myOrder_df[myOrder_df$menuType=="solid", ]
df1

##   menuItems menuType menuCost
## 1   chicken    solid     4.99
## 3     salad    solid     3.29

df2<-myOrder_df[myOrder_df$menuCost>3, ]
df2

##   menuItems menuType menuCost
## 1   chicken    solid     4.99
## 3     salad    solid     3.29

##Can you explain the output generated here?
df3<-myOrder_df[myOrder_df$menuType=="solid"]
df3

##   menuItems menuCost
## 1   chicken     4.99
## 2      soup     2.99
## 3     salad     3.29
## 4       tea     1.89
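The difference between the last two kinds of subsetting comes down to the comma. The sketch below rebuilds the menu data under the hypothetical name `order_df` so it can run on its own; combining conditions with `&` is a standard R feature not covered above.

```r
order_df <- data.frame(
  menuItems = c("chicken", "soup", "salad", "tea"),
  menuType  = c("solid", "liquid", "solid", "liquid"),
  menuCost  = c(4.99, 2.99, 3.29, 1.89),
  stringsAsFactors = FALSE
)

is_solid <- order_df$menuType == "solid"   # TRUE FALSE TRUE FALSE

rows <- order_df[is_solid, ]   # with a comma: the logical vector selects ROWS 1 and 3
cols <- order_df[is_solid]     # without a comma: it selects COLUMNS 1 and 3

# Combine two conditions with & and keep only some columns
pricey_solid <- order_df[is_solid & order_df$menuCost > 4,
                         c("menuItems", "menuCost")]
```

Here `pricey_solid` holds only the chicken row, because it is the one item that is both solid and more expensive than 4.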

3. Overview using our gapminder dataset

Let’s return to our gapminder dataset that you have downloaded to Sunday_AM/data/gapminderData.csv

Before we examine our data, let’s read(load) this dataset into R. There are multiple ways to read data into R. For table-like formats, here are 2 popular methods:

Method 1: Use read.csv() function
- myData<-read.csv("data/gapminderData.csv")
- Take a look at the new object myData
- A bit hard to see? It is too large to be displayed at the console! Different ways to see what your object looks like:
  - Use head() function:
    - head(myData)
  - Look at the top right window, environment tab - gives you a brief summary of the object and opens an Excel-like view of the dataset
Method 2: Use read.table() function

Challenge 3.1 Learn how to read data into R

TASK: Load our gapminder dataset into R using read.table() function

*Hints: 1. Use help functions to read about read.table():
        `?read.table()` or `args(read.table)`
        2. It might be helpful to compare `args(read.csv)` and `args(read.table)`.
        3. Look at `dim(myData)` output after each try. Is it different? Why?

Challenge 3.1: Answer

myData<-read.table("data/gapminderData.csv", header=TRUE, sep = ",")
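To see why `header` and `sep` matter, the sketch below writes a tiny comma-separated file to a temporary location and reads it three ways. The file contents are made-up placeholder values, not gapminder data, and `csv_path`, `d1`, `d2` and `d3` are illustrative names.

```r
# A toy file: one header line and three data lines (placeholder values)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("country,year,lifeExp",
             "CountryA,2000,70.5",
             "CountryB,2000,65.2",
             "CountryC,2000,72.1"),
           csv_path)

d1 <- read.csv(csv_path)                              # header = TRUE and sep = "," by default
d2 <- read.table(csv_path, header = TRUE, sep = ",")  # same settings, spelled out
d3 <- read.table(csv_path)                            # defaults: no header, whitespace separator

dim(d1)   # 3 3
dim(d2)   # 3 3
dim(d3)   # 4 1: every line, header included, is read as a single text field
```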

Now we know how to read dataframes into R. Let’s use this dataset to go over what we talked about this morning. Some of the details were not covered in class, but it is good to know what else you can do with your dataset. Explore!

Challenge 3.2 Play with gapminder dataset

TASK: Answer the following questions about `myData` object
1. Overall object structure? What function will you use?
2. Can you tell what is the data type of elements in each column?
3. Can you extract 3rd and 5th column of the dataset?
4. Can you extract the list of countries in this dataset?
5. Can you get a part of this dataset that includes information about Sweden?
6. Can you extract all countries for which life expectancy is below 70?
7. Can you make a new column that contains population in units of millions of people?

Challenge 3.2: Answer

1. str()
2. typeof() # typeof() will give you "list" - lets you know that dataframe 
#is really a list of vectors; examine the output of str() for details about 
#column data types
3. myData[ , c(3,5)] # you can use head(myData[ , c(3,5)]) to view top 6 rows 
#of the output
4. names(table(myData$country))  # this is a nested function -> break it up 
#to see what each function does; use help(function) to get help
5. myData[myData$country=="Sweden", ] #rows are selected based on logical 
#vector TRUE values - can you view this vector?
6. myData[myData$lifeExp<70, ] #similar to Q5.
7. myData$PopM<-myData$pop/10^6  #simple way to add a column to a dataframe. 
#You can verify that you added a column with: `head(myData)`
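The same answers can be tried on a small stand-in data frame before running them on the full dataset. In the sketch below, `toy` and all of its values are invented placeholders that only mimic the column names used in the answers; `unique()` is a base R alternative to `names(table(...))` for listing distinct values.

```r
toy <- data.frame(
  country = c("CountryA", "CountryA", "CountryB", "CountryB"),
  year    = c(2002, 2007, 2002, 2007),
  pop     = c(9000000, 9100000, 2500000, 2600000),
  lifeExp = c(80, 81, 55, 56),
  stringsAsFactors = FALSE
)

countries <- unique(toy$country)             # distinct country names
one_country <- toy[toy$country == "CountryA", ]  # rows for a single country
low_rows <- toy[toy$lifeExp < 70, ]          # rows with life expectancy below 70
low_countries <- unique(low_rows$country)    # just the country names from those rows

toy$PopM <- toy$pop / 10^6                   # new column: population in millions
```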

Summary

R is made up of objects and functions [that manipulate objects in various ways]
Objects and functions can be as simple or as complex as you want them to be (built-in + custom-made). Name some data structures/functions you know?
Help functions are super valuable when you write your own code or navigate through scripts written by others. Do you remember what they are?