2.5 Data types

Related Reading: IDS 2.4

2.5.1 Numeric Vectors

The most basic data type in R is the vector. In fact, the variables we created above that held just a single number are actually stored as numeric vectors with one element.

To more explicitly create a vector, you can use the c function in R. For example, let’s create a vector called five that contains the numbers 1 through 5.

five <- c(1,2,3,4,5)

We can print the contents of the vector five just by typing its name

five
#> [1] 1 2 3 4 5

Another common operation on vectors is to get a particular element of a vector. Let me give an example

five[3]
#> [1] 3

This code takes the vector five and returns the third element in the vector. Notice that the above line uses square brackets, [ and ], rather than parentheses.

If you want several different elements from a vector, you can do the following

five[c(1,4)]
#> [1] 1 4

This code takes the vector five and returns the first and fourth elements in the vector.

One more useful function for vectors is the function length. This tells you the number of elements in a vector. For example,

length(five)
#> [1] 5

which means that there are five total elements in the vector five.
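Here is a short sketch of a few more ways to pull elements out of a vector. It uses the : operator and a negative index, neither of which has come up above, so treat it as a preview rather than something you need right now.

```r
five <- c(1, 2, 3, 4, 5)

# 2:4 is shorthand for c(2, 3, 4), so this returns elements two through four
five[2:4]
#> [1] 2 3 4

# a negative index drops that element instead of selecting it
five[-1]
#> [1] 2 3 4 5

# combining length with square brackets returns the last element
five[length(five)]
#> [1] 5
```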

2.5.2 Vector arithmetic

Related Reading: IDS 2.11

The main operations on numeric vectors are +, -, *, / which correspond to addition, subtraction, multiplication, and division. Often, we would like to carry out these operations on vectors.

There are two main cases. The first case is when you try to add a single number (i.e., a scalar) to all the elements in a vector. In this setup, the operation will happen element-wise which means the same number will be added to all numbers in the vector. This will be clear with some examples.

five <- c(1,2,3,4,5)

# adds one to each element in vector
five + 1
#> [1] 2 3 4 5 6

# also adds one to each element in vector
1 + five
#> [1] 2 3 4 5 6

Similar things will happen with the other mathematical operations above. Here are some more examples:

five * 3
#> [1]  3  6  9 12 15

five - 3
#> [1] -2 -1  0  1  2

five / 3
#> [1] 0.3333333 0.6666667 1.0000000 1.3333333 1.6666667

The other interesting case is what happens when you try to apply any of the same mathematical operators to two different vectors.

# just some random numbers
vec2 <- c(8,-3,4,1,7)

five + vec2
#> [1]  9 -1  7  5 12

five - vec2
#> [1] -7  5 -1  3 -2

five * vec2
#> [1]  8 -6 12  4 35

five / vec2
#> [1]  0.1250000 -0.6666667  0.7500000  4.0000000  0.7142857

You can immediately see what happens here. For example, for five + vec2, the first element of five is added to the first element of vec2, the second element of five is added to the second element of vec2 and so on. Similar things happen for each of the other mathematical operations too.
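Because each operation returns another vector, you can chain them together. Here is a small sketch using the same two vectors. The ^ operator (raising to a power) and the sum function are not covered above, so consider them assumptions on my part about what you might find useful.

```r
five <- c(1, 2, 3, 4, 5)
vec2 <- c(8, -3, 4, 1, 7)

# element-by-element average of the two vectors
(five + vec2) / 2
#> [1]  4.5 -0.5  3.5  2.5  6.0

# ^ raises each element to a power
five^2
#> [1]  1  4  9 16 25

# multiply element by element, then add up the products
sum(five * vec2)
#> [1] 53
```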

There’s one other case that might be interesting to consider too. What happens if you try to apply these mathematical operations to two vectors of different lengths? Let’s find out

vec3 <- c(2,6)
five + vec3
#> Warning in five + vec3: longer object length is not a multiple of shorter object length
#> [1]  3  8  5 10  7

You’ll notice that this computes something, but it also issues a warning. The result is the first element of five plus the first element of vec3, the second element of five plus the second element of vec3, the third element of five plus the first element of vec3, the fourth element of five plus the second element of vec3, and the fifth element of five plus the first element of vec3. Since vec3 contains fewer elements than five, the elements of vec3 are recycled. In my experience, this warning often indicates a coding mistake. There are many cases where I want to add the same number to all elements in a vector, and many other cases where I want to add two vectors that have the same length, but I cannot think of any cases where I would want to add two vectors the way it is done here.

The same sort of things will happen with subtraction, multiplication, and division (feel free to try it out).
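To make the recycling concrete, here is a sketch that builds the recycled vector by hand using the rep function (not covered above; the name recycled is just one I made up for this example). It also shows a case worth knowing about: when the longer length is an exact multiple of the shorter one, R recycles without any warning.

```r
five <- c(1, 2, 3, 4, 5)
vec3 <- c(2, 6)

# spell out the recycling by hand: repeat vec3 until it has five elements
recycled <- rep(vec3, length.out = length(five))
recycled
#> [1] 2 6 2 6 2

# same numbers as five + vec3, but with no warning
five + recycled
#> [1]  3  8  5 10  7

# four is a multiple of two, so here R recycles silently
c(1, 2, 3, 4) + vec3
#> [1]  3  8  5 10
```

The silent case is the one to be careful about, since nothing tells you that the two vectors had different lengths.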

2.5.3 More helpful functions in R

This is definitely an incomplete list, but I’ll point you here to some more functions in R that are often helpful along with quick examples of them.

  • seq function — creates a “sequence” of numbers

    seq(2,7)
    #> [1] 2 3 4 5 6 7
  • sum function — computes the sum of a vector of numbers

    sum(c(1,5,8))
    #> [1] 14
  • sort, order, and rev functions — functions for understanding the order or changing the order of a vector

    sort(c(3,1,5))
    #> [1] 1 3 5
    order(c(3,1,5))
    #> [1] 2 1 3
    rev(c(3,1,5))
    #> [1] 5 1 3
  • %% — modulo function (i.e., returns the remainder from dividing one number by another)

    8 %% 3
    #> [1] 2
    1 %% 3
    #> [1] 1
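A few of these functions are easier to understand when you see them work together. The sketch below is only illustrative: the vector x is made up, and the by and decreasing arguments are options of seq and sort that are not discussed above.

```r
x <- c(3, 1, 5)

# order gives the positions that would sort x, so indexing by it sorts x
x[order(x)]
#> [1] 1 3 5

# sort from largest to smallest
sort(x, decreasing = TRUE)
#> [1] 5 3 1

# seq can count in steps other than one
seq(2, 10, by = 2)
#> [1]  2  4  6  8 10

# %% works element-wise: a 0 marks the even numbers
seq(1, 6) %% 2
#> [1] 1 0 1 0 1 0
```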

Practice: The function seq contains an optional argument length.out. Try running the following code and seeing if you can figure out what length.out does.

seq(1,10,length.out=5)
seq(1,10,length.out=10)
seq(1,10,length.out=20)

2.5.4 Other types of vectors

There are other types of vectors in R too. Probably the main two other types of vectors are character vectors and logical vectors. We’ll talk about character vectors here and defer logical vectors until later. Character vectors are often referred to as strings.

We can create a character vector as follows

string1 <- "econometrics"
string2 <- "class"
string1
#> [1] "econometrics"

The above code creates two character vectors and then prints the first one.

Side Comment: The c in the c function stands for “concatenate”. Concatenate is a computer science word that means to combine two vectors. Probably the best-known version of this is “string concatenation”, which combines two vectors of characters. Here is an example of string concatenation.

c(string1, string2)
#> [1] "econometrics" "class"

Sometimes string concatenation means to put two (or more) strings into the same string. This can be done using the paste function in R.

paste(string1, string2)
#> [1] "econometrics class"

Notice that paste puts in a space between string1 and string2. For practice, see if you can find an argument to the paste function that allows you to remove the space between the two strings.
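One way to see the difference between the two kinds of concatenation is to check the length of each result. This is just a quick sketch; the names combined and pasted are made up for the example.

```r
string1 <- "econometrics"
string2 <- "class"

# c keeps the two strings as separate elements of one vector
combined <- c(string1, string2)
length(combined)
#> [1] 2

# paste glues them into a single string
pasted <- paste(string1, string2)
length(pasted)
#> [1] 1

# both results are still character vectors
class(combined)
#> [1] "character"
class(pasted)
#> [1] "character"
```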

2.5.5 Data Frames

Another very important type of object in R is the data frame. I think it is helpful to think of a data frame as being very similar to an Excel spreadsheet — sort of like a matrix or a two-dimensional array. Each row typically corresponds to a particular observation, and each column typically provides the value of a particular variable for that observation.

Just to give a simple example, suppose that we had firm-level data about the name of the firm, what industry a firm was in, what county they were located in, and their number of employees. I created a data frame like this (it is totally made up, BTW) and show it to you next

firm_data
name                  industry       county  employees
ABC Manufacturing     Manufacturing  Clarke        531
Martin’s Muffins      Food Services  Oconee          6
Down Home Appliances  Manufacturing  Clarke         15
Classic City Widgets  Manufacturing  Clarke        211
Watkinsville Diner    Food Services  Oconee         25

Side Comment: If you are following along on R, I created this data frame using the following code

firm_data <- data.frame(name=c("ABC Manufacturing", "Martin\'s Muffins", "Down Home Appliances", "Classic City Widgets", "Watkinsville Diner"),
                        industry=c("Manufacturing", "Food Services", "Manufacturing", "Manufacturing", "Food Services"),
                        county=c("Clarke", "Oconee", "Clarke", "Clarke", "Oconee"),
                        employees=c(531, 6, 15, 211, 25))

This is also the same data that we loaded earlier in Section 2.3.

Often, we’d like to access a particular column in a data frame. For example, you might want to calculate the average number of employees across all the firms in our data.

Typically, the easiest way to do this is to use the accessor symbol, which is $ in R. This will make more sense with an example:

firm_data$employees
#> [1] 531   6  15 211  25

firm_data$employees just provides the column called “employees” in the data frame called “firm_data”. You can also notice that firm_data$employees is just a numeric vector. This means that you can apply any of the functions that we have been covering on it

mean(firm_data$employees)
#> [1] 157.6

log(firm_data$employees)
#> [1] 6.274762 1.791759 2.708050 5.351858 3.218876

Side Comment: Notice that the functions mean and log behave differently. mean calculates the average over all the elements in the vector firm_data$employees and therefore returns a single number. log calculates the logarithm of each element in the vector firm_data$employees and therefore returns a numeric vector with five elements.
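The $ is the most common way to get a column, but you can also ask for a column by its name inside brackets. Here is a sketch using a smaller, made-up data frame that I am calling firm_subset (a hypothetical name, chosen so that it does not overwrite firm_data if you are following along).

```r
# three of the firms from firm_data, with only two of the columns
firm_subset <- data.frame(name = c("ABC Manufacturing", "Martin's Muffins", "Down Home Appliances"),
                          employees = c(531, 6, 15))

# two other ways to get the same column that $ gives you
firm_subset[["employees"]]
#> [1] 531   6  15
firm_subset[, "employees"]
#> [1] 531   6  15

# arithmetic works element-wise, so you get one number per firm (like log)
firm_subset$employees / 100
#> [1] 5.31 0.06 0.15

# sum, like mean, collapses the column into a single number
sum(firm_subset$employees)
#> [1] 552
```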

Side Comment:

The $ is not the only way to access the elements in a data frame. You can also access them by their position. For example, if you want whatever is in the third row and second column of the data frame, you can get it by

firm_data[3,2]
#> [1] "Manufacturing"

Sometimes it is also convenient to recover a particular row or column by its position in the data frame. Here is an example of recovering the entire fourth row

firm_data[4,]
#>                   name      industry county employees
#> 4 Classic City Widgets Manufacturing Clarke       211

Notice that you just leave the “column index” (which is the second one) blank.

Side Comment: One other thing that sometimes takes some getting used to is that, for programming in general, you have to be very precise. Suppose you were to make a very small typo. R is not going to understand what you mean. See if you can spot the typo in the next line of code.

firm_data$employes
#> NULL
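Notice that R did not stop with an error here; it quietly returned NULL, which can make this kind of typo hard to track down. One way to check your spelling, sketched below, is to compare it against the actual column names. The data frame firm_subset is a made-up stand-in, and the %in% operator has not been covered above.

```r
firm_subset <- data.frame(employees = c(531, 6, 15))

# the misspelled name returns NULL instead of stopping with an error
is.null(firm_subset$employes)
#> [1] TRUE

# check each spelling against the actual column names
"employes" %in% colnames(firm_subset)
#> [1] FALSE
"employees" %in% colnames(firm_subset)
#> [1] TRUE
```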

A few more useful functions for working with data frames are:

  • nrow and ncol — return the number of rows or columns in the data frame

  • colnames and rownames — return the names of the columns or rows
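Here is what these four functions give for the firm data. The first few lines just recreate firm_data so that the example runs on its own.

```r
firm_data <- data.frame(name=c("ABC Manufacturing", "Martin's Muffins", "Down Home Appliances", "Classic City Widgets", "Watkinsville Diner"),
                        industry=c("Manufacturing", "Food Services", "Manufacturing", "Manufacturing", "Food Services"),
                        county=c("Clarke", "Oconee", "Clarke", "Clarke", "Oconee"),
                        employees=c(531, 6, 15, 211, 25))

nrow(firm_data)
#> [1] 5
ncol(firm_data)
#> [1] 4
colnames(firm_data)
#> [1] "name"      "industry"  "county"    "employees"

# we did not name the rows, so R labels them 1 through 5
rownames(firm_data)
#> [1] "1" "2" "3" "4" "5"
```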

2.5.6 Lists

Vectors and data frames are the main two types of objects that we’ll use this semester, but let me give you a quick overview of a few other types of objects. Let’s start with lists. Lists are very generic in the sense that they can carry around complicated data. If you are familiar with any object-oriented programming language like Java or C++, they have the flavor of an “object”, in the object-oriented sense.

I’m not sure if we will see any examples this semester where you have to use a list. But here is an example. Suppose that we wanted to put the vector five and the data frame firm_data, both of which we created earlier, into the same object. We could do it as follows

unusual_list <- list(numbers=five, df=firm_data)

You can access the elements of a list in a few different ways. Sometimes it is convenient to access them via the $

unusual_list$numbers
#> [1] 1 2 3 4 5

Other times, it is convenient to access them via their position in the list

unusual_list[[2]] # notice the double brackets
#>                   name      industry county employees
#> 1    ABC Manufacturing Manufacturing Clarke       531
#> 2     Martin's Muffins Food Services Oconee         6
#> 3 Down Home Appliances Manufacturing Clarke        15
#> 4 Classic City Widgets Manufacturing Clarke       211
#> 5   Watkinsville Diner Food Services Oconee        25
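The difference between single and double brackets is easy to trip over, so here is a small sketch. The list small_list is made up for this example, and the names function is not covered above.

```r
five <- c(1, 2, 3, 4, 5)
small_list <- list(numbers = five, label = "econometrics")

# names shows what you can put after the $
names(small_list)
#> [1] "numbers" "label"

# double brackets also accept a name instead of a position
small_list[["label"]]
#> [1] "econometrics"

# single brackets return a shorter list;
# double brackets return the element itself
class(small_list[1])
#> [1] "list"
class(small_list[[1]])
#> [1] "numeric"
```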

2.5.7 Matrices

Matrices are very similar to data frames, but all of the data in a matrix must be of the same type. Matrices are very useful in some numerical calculations that are beyond the scope of this class. Here is an example of a matrix.

mat <- matrix(c(1,2,3,4), nrow=2, byrow=TRUE)
mat
#>      [,1] [,2]
#> [1,]    1    2
#> [2,]    3    4

You can access elements of a matrix by their position in the matrix, just like for the data frame above.

# first row, second column
mat[1,2]
#> [1] 2
# all rows in second column
mat[,2] 
#> [1] 2 4
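Two details from above are worth seeing in action: what byrow=TRUE is doing, and what "the same type" means in practice. The sketch below uses made-up names (mat_bycol and mixed) and the dim function, which is not covered above.

```r
# without byrow = TRUE, R fills the matrix column by column
mat_bycol <- matrix(c(1, 2, 3, 4), nrow = 2)
mat_bycol
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    2    4

# dim returns the number of rows and then the number of columns
dim(mat_bycol)
#> [1] 2 2

# mixing numbers and strings converts everything to character
mixed <- matrix(c(1, "a", 2, "b"), nrow = 2)
class(mixed[1, 1])
#> [1] "character"
```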

2.5.8 Factors

Sometimes variables in economics are categorical. This sort of variable is somewhat between a numeric variable and a string. In R, categorical variables are called factors.

A good example of a categorical variable is firm_data$industry. It tells you the “category” of the industry that a firm is in.

Oftentimes, we may have to tell R that a variable is a “factor” rather than just a string. Let’s create a variable called industry that contains the industry from firm_data but as a factor.

industry <- as.factor(firm_data$industry)
industry
#> [1] Manufacturing Food Services Manufacturing Manufacturing Food Services
#> Levels: Food Services Manufacturing
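To see in what sense a factor sits between a number and a string, it can help to look underneath it. In the sketch below I rebuild industry directly from the five industry labels so that the example runs on its own; levels, as.integer, and table are functions that have not been covered above.

```r
industry <- as.factor(c("Manufacturing", "Food Services", "Manufacturing",
                        "Manufacturing", "Food Services"))

# the distinct categories, in alphabetical order by default
levels(industry)
#> [1] "Food Services" "Manufacturing"

# behind the scenes, each observation is stored as the number of its level
as.integer(industry)
#> [1] 2 1 2 2 1

# table counts how many observations fall into each category
table(industry)
#> industry
#> Food Services Manufacturing 
#>             2             3 
```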

A useful package for working with factor variables is the forcats package.

2.5.9 Understanding an object in R

Sometimes you will come across a variable and not know exactly what it contains. Some functions that are helpful in this case are

  • class — tells you, err, the class of an object (i.e., its “type”)

  • head — shows you the “beginning” of an object; this is especially helpful for large objects (like some data frames)

  • str — stands for “structure” of an object

Let’s try these out

class(firm_data)
#> [1] "data.frame"
# by default, head shows the first six rows of a data frame,
# but this one has only five rows, so we see all of it
head(firm_data) 
#>                   name      industry county employees
#> 1    ABC Manufacturing Manufacturing Clarke       531
#> 2     Martin's Muffins Food Services Oconee         6
#> 3 Down Home Appliances Manufacturing Clarke        15
#> 4 Classic City Widgets Manufacturing Clarke       211
#> 5   Watkinsville Diner Food Services Oconee        25
str(firm_data)
#> 'data.frame':    5 obs. of  4 variables:
#>  $ name     : chr  "ABC Manufacturing" "Martin's Muffins" "Down Home Appliances" "Classic City Widgets" ...
#>  $ industry : chr  "Manufacturing" "Food Services" "Manufacturing" "Manufacturing" ...
#>  $ county   : chr  "Clarke" "Oconee" "Clarke" "Clarke" ...
#>  $ employees: num  531 6 15 211 25
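Since firm_data is too small for head to make a difference, here is a sketch with a longer object. The vector long_vec is made up for this example, and the n argument and the tail function are not discussed above.

```r
long_vec <- seq(1, 100)

# head shows the first six elements by default
head(long_vec)
#> [1] 1 2 3 4 5 6

# the n argument changes how many elements you see
head(long_vec, n = 3)
#> [1] 1 2 3

# tail does the same thing from the end
tail(long_vec, n = 2)
#> [1]  99 100

# class works on any object, including a single string
class("econometrics")
#> [1] "character"
```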

Practice: Try running class, head, and str on the vector five that we created earlier.