Morning checklist

1. Intro to RStudio

2. Data types and Data structures

Let’s take a look at an example dataset, gapminderData.csv.

This is a familiar, table-like dataset suitable for different statistical analyses. Much of the data you work with can probably be represented in this format. Here we have a record of population size, life expectancy and other kinds of information for different countries.

Any ideas about how this kind of dataset can be represented in R? In R, this whole dataset is a single object (or data structure) built out of smaller pieces. Think of a castle built of Lego bricks. We want to understand how to build a castle and how to take it apart. Our example of a castle is this dataset (in R it is known as a data frame). It is relatively simple, but this is as far as we will go today. Let’s start from the smallest Lego pieces and build our dataset from them.

Smallest units in R: single-element data structures

Let’s assign the value 45 to a variable named age. We have just created the smallest Lego piece (the smallest object) in R:

age <- 45
length(age)
## [1] 1
str(age)
##  num 45

Variables can hold values of various types. The most common data types are:

  • numeric (double and integer)
  • character
  • logical
  • complex
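
As a quick illustration of the list above, `typeof()` can be applied to one value of each type. This is a minimal sketch; the variable names are made up for the example.

```r
# One example value per data type (names are illustrative only)
x_double    <- 3.14
x_integer   <- 2L        # the L suffix asks for an integer
x_character <- "castle"
x_logical   <- TRUE
x_complex   <- 1 + 2i

typeof(x_double)     # "double"
typeof(x_integer)    # "integer"
typeof(x_character)  # "character"
typeof(x_logical)    # "logical"
typeof(x_complex)    # "complex"

# Both double and integer count as numeric
is.numeric(x_double)   # TRUE
is.numeric(x_integer)  # TRUE
```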

For example: what data type is stored in the score variable?

score<-79
is.integer(score)
## [1] FALSE
typeof(score)
## [1] "double"
typeof(is.integer(score))
## [1] "logical"

The last expression is an example of a nested function call. Nested functions are very common in R, but they can be difficult to read at first. You can always split a nested call into a series of single function calls. Remember that the variable inside the innermost parentheses is the argument (input) for the function that will be evaluated first.
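
To illustrate the splitting on a different expression, here is a small sketch. The names `word`, `step_inner`, and `step_outer` are made up for this example. The innermost function runs first, and its result becomes the input of the next one.

```r
word <- "castle"

# Nested form: substr() is evaluated first, then toupper()
nested <- toupper(substr(word, 1, 3))
nested
## [1] "CAS"

# The same computation as two single calls
step_inner <- substr(word, 1, 3)   # first three characters: "cas"
step_outer <- toupper(step_inner)  # upper-case version: "CAS"
step_outer
## [1] "CAS"
```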

Challenge 2.1: Learn how to read nested function calls

TASK: Break the following expression into multiple single function calls. You will need to assign the output of each function to a variable that will serve as the input (argument) for the next function. What is the value of each variable? What does each function do? Start by assigning `score <- 79`.

is.logical(is.numeric(typeof(is.integer(score))))

Challenge 2.1: Answer

score <- 79
step1 <- is.integer(score)
print(step1)
## [1] FALSE
step2 <- typeof(step1)
print(step2)
## [1] "logical"
step3 <- is.numeric(step2)
print(step3)
## [1] FALSE
step4 <- is.logical(step3)
print(step4)
## [1] TRUE
## Or as a single step:
print(is.logical(is.numeric(typeof(is.integer(score)))))
## [1] TRUE

Sometimes you will need to convert between data types. There are functions that do that: as.integer(), as.character(), and so on. The conversion between data types is not always possible - why? Let’s see what happens here:

score <- 79
typeof(score)
## [1] "double"
score <- as.integer(score)
typeof(score)
## [1] "integer"
#but can we convert character to integer?
name <- "Sasha"
typeof(name)
## [1] "character"
name <- as.integer(name)
## Warning: NAs introduced by coercion
# the data type will be changed, but no value will be assigned
typeof(name)
## [1] "integer"
print(name)   # NA = missing value
## [1] NA
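
A few more conversions, as a hedged sketch of when coercion works and when it does not. The variable names are illustrative only, and `suppressWarnings()` is used here just to keep the output quiet.

```r
ok_int  <- as.integer("42")    # a string that looks like a number converts: 42
trunc_i <- as.integer(3.9)     # doubles are truncated toward zero: 3
as_chr  <- as.character(79)    # "79"
as_lgl  <- as.logical("TRUE")  # TRUE
as_num  <- as.numeric(TRUE)    # 1

# A string with no numeric meaning becomes NA (normally with a warning)
bad <- suppressWarnings(as.integer("Sasha"))
is.na(bad)                     # TRUE
```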

Data structures with multiple elements

The small objects can be combined to build larger objects. Look at the gapminder dataset. Our smallest objects can represent a single element in the dataset, such as an individual year or an individual country, but what would be the simplest object that you can make with multiple elements?

  • Vectors: collection of elements of the same data type
    • what part of our dataset can be represented by a vector?
    • how to create: use the concatenate (combine) function, c()
    ###let's make a vector
    v<-c(1:3, 45)
    v
    ## [1]  1  2  3 45
    ##examine object
    typeof(v) # tells you the data type of vector elements
    ## [1] "double"
    length(v) # what does this do?
    ## [1] 4
    str(v)    # tells you the structure of the object VERY USEFUL
    ##  num [1:4] 1 2 3 45
    ##view
    head(v, n=2)  #look at the first 2 elements
    ## [1] 1 2
    #what would `tail()` do?
    tail(v, n=3)  #look at the last 3 elements
    ## [1]  2  3 45
    ##manipulate
    v <- c(v,56)  #add element to vector
    #vectorization: no loop is required to perform an operation on each vector element
    v1 <- 2*v   # multiply each vector element by 2
    v1
    ## [1]   2   4   6  90 112
    # let's try to add vectors
    v2<-c(1:5)
    v3 <- v1+v2
    v3
    ## [1]   3   6   9  94 117
    # you can name vectors; find out what `names()` function does
    # change data type
    v3 <- as.character(v3)  #also known as coercion
    str(v3)
    ##  chr [1:5] "3" "6" "9" "94" "117"
  • Matrices: 2-dimensional vectors that contain elements of the same data type
    • how to create: use matrix() function
    m <- matrix(c(1:18), 3,6)
    m
    ##      [,1] [,2] [,3] [,4] [,5] [,6]
    ## [1,]    1    4    7   10   13   16
    ## [2,]    2    5    8   11   14   17
    ## [3,]    3    6    9   12   15   18
    # try functions that we used for vectors - do they work on matrices?
    # new to 2D structures
    dim(m)  # tells you number of rows and columns in your matrix
    ## [1] 3 6
  • Factors: special vectors used to represent categorical data
    • what part of our dataset can be represented by a factor?
    • to create: use factor()
    f <- factor(c("M","F","F","F")) #4 observations, the first one for male, other 3 for female
    str(f) # what are these numbers in the output?
    ##  Factor w/ 2 levels "F","M": 2 1 1 1
    typeof(f)   # factors are of integer data type! Levels are numbered in alphabetical order
    ## [1] "integer"
    #sometimes it is important to reorder levels
    f <- factor(f, levels=c("M","F"))
    str(f)
    ##  Factor w/ 2 levels "M","F": 1 2 2 2
  • Lists: generic vectors - collection of elements with different data types
    • what part of our dataset can be represented by a list?
    • to create: use list() function
    l<-list("Afghanistan", 1952, 8769855)
    print(l)
    ## [[1]]
    ## [1] "Afghanistan"
    ## 
    ## [[2]]
    ## [1] 1952
    ## 
    ## [[3]]
    ## [1] 8769855
    typeof(l)
    ## [1] "list"
    str(l)
    ## List of 3
    ##  $ : chr "Afghanistan"
    ##  $ : num 1952
    ##  $ : num 8769855
    length(l)
    ## [1] 3
    • CHALLENGE 2.2

      TASK: Try to create a list named `myOrder` that contains the following
      data structures as list elements:
      
      -- Element 1 is a character vector of length 4 that lists the menu items
      you ordered from the restaurant: chicken, soup, salad, tea.
      
      -- Element 2 is a factor that describes menu items as "liquid" or "solid".
      
      -- Element 3 is a vector that records the cost of each menu item:
      4.99, 2.99, 3.29, 1.89.
      
      *Hint: Define your elements first, then create a list with them.

      CHALLENGE 2.2: Answer

      menuItems<-c("chicken", "soup", "salad", "tea")
      menuType<-factor(c("solid", "liquid", "solid", "liquid"))
      menuCost<-c(4.99, 2.99, 3.29, 1.89)
             
      myOrder<-list(menuItems, menuType, menuCost)
      Now apply the following functions to the list you created. Try to predict the output before you run the command.
      • length(myOrder)
      • str(myOrder)
      • print(myOrder)
  • Data frames
    • Let’s go back to gapminder dataset. Could you make an informative guess about how this data structure can be represented in R?

    • Yes! It is a list of vectors of equal length. Let’s look at our myOrder list to see if we can make a data frame out of it. Is the list we just made suitable for a data frame? Yes, the elements of the list are vectors of equal length (but the vectors do not have to be list elements first; they can be passed to data.frame() directly).

    Previously we used list() to combine our elements:

    • myOrder<-list(menuItems, menuType, menuCost)

    Now let’s combine the same vectors with the data.frame() function. Give the result a different name, myOrder_df.

    myOrder_df<-data.frame(menuItems, menuType, menuCost)
    #now view it!
    myOrder_df
    ##   menuItems menuType menuCost
    ## 1   chicken    solid     4.99
    ## 2      soup   liquid     2.99
    ## 3     salad    solid     3.29
    ## 4       tea   liquid     1.89
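    #and check with `str()` - is anything different compared to the
    #`str(myOrder)` output? What is happening with data types?
    #(the output below is from R < 4.0.0; in R >= 4.0.0 `stringsAsFactors`
    #defaults to FALSE, so menuItems is shown as chr instead of Factor)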
    str(myOrder_df)
    ## 'data.frame':    4 obs. of  3 variables:
    ##  $ menuItems: Factor w/ 4 levels "chicken","salad",..: 1 3 2 4
    ##  $ menuType : Factor w/ 2 levels "liquid","solid": 2 1 2 1
    ##  $ menuCost : num  4.99 2.99 3.29 1.89
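
The claim that a data frame is a list of equal-length vectors can be checked directly. The sketch below rebuilds `myOrder_df` so that it runs on its own. `stringsAsFactors = FALSE` is set explicitly (an addition, not part of the original call) so that the result does not depend on the R version.

```r
menuItems <- c("chicken", "soup", "salad", "tea")
menuType  <- factor(c("solid", "liquid", "solid", "liquid"))
menuCost  <- c(4.99, 2.99, 3.29, 1.89)

myOrder_df <- data.frame(menuItems, menuType, menuCost,
                         stringsAsFactors = FALSE)

typeof(myOrder_df)    # "list": a data frame is built on a list
is.list(myOrder_df)   # TRUE
length(myOrder_df)    # 3, the number of columns (list elements)
dim(myOrder_df)       # 4 rows, 3 columns

# Class of each column: character, factor, numeric
col_classes <- sapply(myOrder_df, class)
col_classes

# Vectors whose lengths do not match (here 4 and 3) are rejected
res <- try(data.frame(a = 1:4, b = 1:3), silent = TRUE)
inherits(res, "try-error")  # TRUE
```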

General subsetting rules

Let’s talk about how to take your dataset apart. In general, you can access every element of your data set. You must be able to do that to manipulate and analyze your data. There are three general ways to subset the data:

### 1. By position index
## 1a. Use `[]` operator
v<-c(1:10)
v
##  [1]  1  2  3  4  5  6  7  8  9 10
## see what happens here
v[2]
## [1] 2
v[c(3:6)]
## [1] 3 4 5 6
v[-c(3:5)]
## [1]  1  2  6  7  8  9 10
## 1b. Use the `which()` function - it returns the position indices of the
## elements that satisfy a condition:
v<-c(1,3,5,5,7,5)
v1<-v[which(v==5)]  #get vector elements equal to 5
v1  #Can you explain the output? Try `which(v==5)`
## [1] 5 5 5
## the above works for lists too; notice that [] returns a list, use [[]] to get a vector
## try subsetting the myOrder list we created above
## for 2D structures like matrices and dataframes, provide 2 indices [row, column]
myOrder_df[1:3, ] #gets first 3 rows
##   menuItems menuType menuCost
## 1   chicken    solid     4.99
## 2      soup   liquid     2.99
## 3     salad    solid     3.29
### 2. By name:
## Use `$` operator to extract columns as vectors 
myOrder_df$menuType
## [1] solid  liquid solid  liquid
## Levels: liquid solid
### 3. By logical vector index: selects elements corresponding to TRUE values
### of logical vector:
v
## [1] 1 3 5 5 7 5
v1<-v[v==5]
v1
## [1] 5 5 5
# how does the above work? Try only `v==5`
v==5   # returns logical vector
## [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE
## Use the `myOrder_df` dataframe: select rows that satisfy various conditions
## Display the logical vector to understand the output
df1<-myOrder_df[myOrder_df$menuType=="solid", ]
df1
##   menuItems menuType menuCost
## 1   chicken    solid     4.99
## 3     salad    solid     3.29
df2<-myOrder_df[myOrder_df$menuCost>3, ]
df2
##   menuItems menuType menuCost
## 1   chicken    solid     4.99
## 3     salad    solid     3.29
##Can you explain the output generated here?
df3<-myOrder_df[myOrder_df$menuType=="solid"]
df3
##   menuItems menuCost
## 1   chicken     4.99
## 2      soup     2.99
## 3     salad     3.29
## 4       tea     1.89
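
The `[]` versus `[[]]` distinction mentioned above, and the last puzzle, can be explored with the sketch below. It rebuilds `myOrder` and `myOrder_df` so that it runs on its own. `is_solid` is a name made up for this example. Because a data frame is a list of columns, a single index inside `[]` (no comma) selects columns rather than rows, which is one way to read the `df3` output.

```r
menuItems <- c("chicken", "soup", "salad", "tea")
menuType  <- factor(c("solid", "liquid", "solid", "liquid"))
menuCost  <- c(4.99, 2.99, 3.29, 1.89)

myOrder    <- list(menuItems, menuType, menuCost)
myOrder_df <- data.frame(menuItems, menuType, menuCost)

# [] keeps the container: the result is still a list (of length 1)
class(myOrder[3])      # "list"
# [[]] pulls out the element itself: here a numeric vector
class(myOrder[[3]])    # "numeric"
myOrder[[3]][2]        # 2.99, the cost of the second item

# With a comma, a logical vector selects rows
is_solid <- myOrder_df$menuType == "solid"   # TRUE FALSE TRUE FALSE
nrow(myOrder_df[is_solid, ])                 # 2

# Without a comma, an index selects columns (list elements)
names(myOrder_df[c(TRUE, FALSE, TRUE)])      # "menuItems" "menuCost"
```

In the `df3` example, the logical vector starts with TRUE, FALSE, TRUE, so the first and third columns are kept and all four rows are returned.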

3. Overview using our gapminder dataset

Let’s return to our gapminder dataset that you have downloaded to Sunday_AM/data/gapminderData.csv

Before we examine our data, let’s read (load) this dataset into R. There are multiple ways to read data into R. For table-like formats, two popular functions are read.table() and read.csv().

Challenge 3.1 Learn how to read data into R

TASK: Load our gapminder dataset into R using the read.table() function

*Hints: 1. Use help functions to read about read.table():
        `?read.table()` or `args(read.table)`
        2. It might be helpful to compare `args(read.csv)` and `args(read.table)`.
        3. Look at `dim(myData)` output after each try. Is it different? Why?

Challenge 3.1: Answer

myData<-read.table("data/gapminderData.csv", header=TRUE, sep = ",")
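
If the real file is not at hand, the effect of `header` and `sep` can be tried on a tiny stand-in file. The sketch below writes three lines to a temporary file and reads it back in three ways. The column names and values are toy placeholders, not the actual gapminder contents, and `tmp`, `d1`, `d2`, and `d3` are names made up for this example.

```r
# Write a small comma-separated file to a temporary location (toy values)
tmp <- tempfile(fileext = ".csv")
writeLines(c("country,year,pop",
             "Afghanistan,1952,8769855",
             "Sweden,1952,7000000"), tmp)

# Defaults of read.table(): no header, whitespace as separator
d1 <- read.table(tmp)
dim(d1)   # 3 rows, 1 column: each whole line is read as a single value

# Tell read.table() about the header row and the comma separator
d2 <- read.table(tmp, header = TRUE, sep = ",")
dim(d2)   # 2 rows, 3 columns

# read.csv() uses header = TRUE and sep = "," by default
d3 <- read.csv(tmp)
dim(d3)   # 2 rows, 3 columns

unlink(tmp)  # remove the temporary file
```

Comparing `dim()` after each attempt, as the hints suggest, shows quickly whether the separator and header were interpreted as intended.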

Now we know how to read dataframes into R. Let’s use this dataset to go over what we talked about this morning. Some of the details were not covered in class, but it is good to know what else you can do with your dataset. Explore!

Challenge 3.2 Play with gapminder dataset

TASK: Answer the following questions about `myData` object
1. What is the overall structure of the object? What function will you use?
2. Can you tell the data type of the elements in each column?
3. Can you extract the 3rd and 5th columns of the dataset?
4. Can you extract the list of countries in this dataset?
5. Can you get a part of this dataset that includes information about Sweden?
6. Can you extract all countries for which life expectancy is below 70?
7. Can you make a new column that contains population in units of millions of people?

Challenge 3.2: Answer

1. str()
2. typeof() # typeof() will give you "list" - lets you know that dataframe 
#is really a list of vectors; examine the output of str() for details about 
#column data types
3. myData[ , c(3,5)] # you can use head(myData[ , c(3,5)]) to view top 6 rows 
#of the output
4. names(table(myData$country))  # this is a nested function -> break it up 
#to see what each function does; use help(function) to get help
5. myData[myData$country=="Sweden", ] #rows are selected based on logical 
#vector TRUE values - can you view this vector?
6. myData[myData$lifeExp<70, ] #similar to Q5.
7. myData$PopM<-myData$pop/10^6  #simple way to add a column to a dataframe. 
#You can verify that you added a column with: `head(myData)`
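
The answers above can also be tried without the CSV file, on a small stand-in data frame. In the sketch below, `toy` and all of its values are made up for illustration. The column order (country, year, pop, continent, lifeExp, gdpPercap) is an assumption about the gapminder file, not something confirmed here.

```r
# A made-up stand-in for myData (values are placeholders, not real data)
toy <- data.frame(
  country   = c("Sweden", "Sweden", "Kenya", "Kenya"),
  year      = c(1952, 2007, 1952, 2007),
  pop       = c(7.0e6, 9.0e6, 6.5e6, 3.5e7),
  continent = c("Europe", "Europe", "Africa", "Africa"),
  lifeExp   = c(72, 81, 42, 54),
  gdpPercap = c(8000, 30000, 800, 1500),
  stringsAsFactors = FALSE
)

str(toy)                                   # Q1 and Q2: structure and column types
cols35    <- toy[ , c(3, 5)]               # Q3: 3rd and 5th columns
countries <- unique(toy$country)           # Q4: an alternative to names(table(...)),
                                           #     in order of first appearance
sweden    <- toy[toy$country == "Sweden", ]         # Q5: rows for Sweden
below70   <- unique(toy$country[toy$lifeExp < 70])  # Q6: country names only
toy$PopM  <- toy$pop / 10^6                # Q7: population in millions
head(toy)
```

Note that `unique()` keeps values in order of first appearance, while `names(table(...))` returns them sorted.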

Summary

You are now ready to use R objects and functions to write your own R scripts.