2.5 Data types
Related Reading: IDS 2.4
2.5.1 Numeric Vectors
The most basic data type in R is the vector. In fact, the variables we created above that held just a single number are actually stored as numeric vectors with one element.
To create a vector more explicitly, you can use the c function in R. For example, let’s create a vector called five that contains the numbers 1 through 5.
five <- c(1,2,3,4,5)
We can print the contents of the vector five just by typing its name
five
#> [1] 1 2 3 4 5
Another common operation on vectors is to get a particular element of a vector. Let me give an example
five[3]
#> [1] 3
This code takes the vector five and returns the third element in the vector. Notice that the above line contains square brackets, [ and ], rather than parentheses.
If you want several different elements from a vector, you can do the following
five[c(1,4)]
#> [1] 1 4
This code takes the vector five and returns the first and fourth elements in the vector.
One more useful function for vectors is the function length. This tells you the number of elements in a vector. For example,
length(five)
#> [1] 5
which means that there are five total elements in the vector five.
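As a quick recap of this subsection, here is a minimal sketch that combines c, square-bracket indexing, and length. The vector scores and its values are made up purely for illustration and are not used elsewhere in these notes.

```r
# hypothetical vector, used only for illustration
scores <- c(10, 20, 30, 40)

# third element
third <- scores[3]
third
#> [1] 30

# second and fourth elements
picked <- scores[c(2, 4)]
picked
#> [1] 20 40

# last element, using length to find its position
last <- scores[length(scores)]
last
#> [1] 40
```

Using length(scores) as the index is a handy way to get the last element without counting the elements yourself.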
2.5.2 Vector arithmetic
Related Reading: IDS 2.11
The main operations on numeric vectors are +, -, *, and /, which correspond to addition, subtraction, multiplication, and division. Often, we would like to carry out these operations on vectors.
There are two main cases. The first case is when you try to add a single number (i.e., a scalar) to all the elements in a vector. In this setup, the operation will happen element-wise which means the same number will be added to all numbers in the vector. This will be clear with some examples.
five <- c(1,2,3,4,5)
# adds one to each element in vector
five + 1
#> [1] 2 3 4 5 6
# also adds one to each element in vector
1 + five
#> [1] 2 3 4 5 6
Similar things will happen with the other mathematical operations above. Here are some more examples:
five * 3
#> [1] 3 6 9 12 15
five - 3
#> [1] -2 -1 0 1 2
five / 3
#> [1] 0.3333333 0.6666667 1.0000000 1.3333333 1.6666667
The other interesting case is what happens when you try to apply any of the same mathematical operators to two different vectors.
# just some random numbers
vec2 <- c(8,-3,4,1,7)
five + vec2
#> [1] 9 -1 7 5 12
five - vec2
#> [1] -7 5 -1 3 -2
five * vec2
#> [1] 8 -6 12 4 35
five / vec2
#> [1] 0.1250000 -0.6666667 0.7500000 4.0000000 0.7142857
You can immediately see what happens here. For example, for five + vec2, the first element of five is added to the first element of vec2, the second element of five is added to the second element of vec2, and so on. Similar things happen for each of the other mathematical operations too.
There’s one other case that might be interesting to consider too. What happens if you try to apply these mathematical operations to two vectors of different lengths? Let’s find out
vec3 <- c(2,6)
five + vec3
#> Warning in five + vec3: longer object length is not a
#> multiple of shorter object length
#> [1] 3 8 5 10 7
You’ll notice that this computes something but it also issues a warning. The result is equal to the first element of five plus the first element of vec3, the second element of five plus the second element of vec3, the third element of five plus the first element of vec3, the fourth element of five plus the second element of vec3, and the fifth element of five plus the first element of vec3. In other words, since vec3 contains fewer elements than five, the elements of vec3 are getting recycled. In my experience, this warning often indicates a coding mistake. There are many cases where I want to add the same number to all elements in a vector, and many other cases where I want to add two vectors that have the same length, but I cannot think of any cases where I would want to add two vectors the way that is being carried out here.
The same sort of things will happen with subtraction, multiplication, and division (feel free to try it out).
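One thing worth knowing about recycling is that the warning only appears when the longer length is not a multiple of the shorter one. Here is a small sketch with two made-up vectors (four and two are hypothetical names used only here) showing that R recycles silently when the lengths do line up, which is one reason this kind of mistake can go unnoticed.

```r
# hypothetical vectors, used only for illustration
four <- c(1, 2, 3, 4)
two <- c(10, 20)

# two is recycled as 10, 20, 10, 20; no warning because 4 is a multiple of 2
recycled <- four + two
recycled
#> [1] 11 22 13 24

# the same calculation with the recycling written out explicitly
explicit <- four + rep(two, times = 2)
explicit
#> [1] 11 22 13 24
```

If you do not intend recycling, checking that length gives the same answer for both vectors before combining them is a simple safeguard.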
2.5.3 More helpful functions in R
This is definitely an incomplete list, but I’ll point you here to some more functions in R that are often helpful along with quick examples of them.
seq function: creates a “sequence” of numbers
seq(2,7)
#> [1] 2 3 4 5 6 7
sum function: computes the sum of a vector of numbers
sum(c(1,5,8))
#> [1] 14
sort, order, and rev functions: functions for understanding the order or changing the order of a vector
sort(c(3,1,5))
#> [1] 1 3 5
order(c(3,1,5))
#> [1] 2 1 3
rev(c(3,1,5))
#> [1] 5 1 3
%%: modulo operator (i.e., returns the remainder from dividing one number by another)
8 %% 3
#> [1] 2
1 %% 3
#> [1] 1
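Of the functions above, order is probably the least intuitive, so here is a short sketch of how it relates to sort. It reuses the same numbers as the examples above; the variable names x, positions, and sorted_desc are just illustrative choices.

```r
x <- c(3, 1, 5)

# order returns the positions of the elements from smallest to largest:
# the smallest value (1) is in position 2, then 3 in position 1, then 5 in position 3
positions <- order(x)
positions
#> [1] 2 1 3

# indexing x by those positions gives the same result as sort(x)
x[positions]
#> [1] 1 3 5

# sort can also go from largest to smallest
sorted_desc <- sort(x, decreasing = TRUE)
sorted_desc
#> [1] 5 3 1
```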
Practice: The function seq contains an optional argument length.out. Try running the following code and seeing if you can figure out what length.out does.
seq(1,10,length.out=5)
seq(1,10,length.out=10)
seq(1,10,length.out=20)
2.5.4 Other types of vectors
There are other types of vectors in R too. Probably the main two other types of vectors are character vectors and logical vectors. We’ll talk about character vectors here and defer logical vectors until later. Character vectors are often referred to as strings.
We can create a character vector as follows
string1 <- "econometrics"
string2 <- "class"
string1
#> [1] "econometrics"
The above code creates two character vectors and then prints the first one.
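Just like numeric vectors, character vectors can hold more than one element. Here is a minimal sketch using a made-up vector called courses (a hypothetical name and values); it also uses the base R function nchar, which counts the characters in each string.

```r
# a character vector with three elements (made-up values)
courses <- c("econometrics", "statistics", "micro")

# number of elements in the vector
length(courses)
#> [1] 3

# number of characters in each element
nchar(courses)
#> [1] 12 10  5

# indexing works the same way as for numeric vectors
courses[2]
#> [1] "statistics"
```

Notice the difference between length, which counts elements, and nchar, which counts characters within each element.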
2.5.5 Data Frames
Another very important type of object in R is the data frame. I think it is helpful to think of a data frame as being very similar to an Excel spreadsheet — sort of like a matrix or a two-dimensional array. Each row typically corresponds to a particular observation, and each column typically provides the value of a particular variable for that observation.
Just to give a simple example, suppose that we had firm-level data about the name of the firm, what industry a firm was in, what county they were located in, and their number of employees. I created a data frame like this (it is totally made up, BTW) and show it to you next
firm_data
name | industry | county | employees |
---|---|---|---|
ABC Manufacturing | Manufacturing | Clarke | 531 |
Martin’s Muffins | Food Services | Oconee | 6 |
Down Home Appliances | Manufacturing | Clarke | 15 |
Classic City Widgets | Manufacturing | Clarke | 211 |
Watkinsville Diner | Food Services | Oconee | 25 |
Side Comment: If you are following along in R, I created this data frame using the following code
firm_data <- data.frame(name=c("ABC Manufacturing", "Martin's Muffins", "Down Home Appliances", "Classic City Widgets", "Watkinsville Diner"),
                        industry=c("Manufacturing", "Food Services", "Manufacturing", "Manufacturing", "Food Services"),
                        county=c("Clarke", "Oconee", "Clarke", "Clarke", "Oconee"),
                        employees=c(531, 6, 15, 211, 25))
This is also the same data that we loaded earlier in Section 2.3.
Often, we’d like to access a particular column in a data frame. For example, you might want to calculate the average number of employees across all the firms in our data.
Typically, the easiest way to do this is to use the accessor symbol, which is $ in R. This will make more sense with an example:
firm_data$employees
#> [1] 531 6 15 211 25
firm_data$employees just provides the column called “employees” in the data frame called “firm_data”. You can also notice that firm_data$employees is just a numeric vector. This means that you can apply any of the functions that we have been covering to it
mean(firm_data$employees)
#> [1] 157.6
log(firm_data$employees)
#> [1] 6.274762 1.791759 2.708050 5.351858 3.218876
Side Comment: Notice that the functions mean and log behave differently. mean calculates the average over all the elements in the vector firm_data$employees and therefore returns a single number. log calculates the logarithm of each element in the vector firm_data$employees and therefore returns a numeric vector with five elements.
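The two kinds of functions are often combined. As a hedged illustration (the names employees, total, and shares are my own, and the numbers are copied from the made-up employees column), the sketch below uses sum to get one number and then element-wise division to get each firm's share of total employment.

```r
# same made-up employee counts as in firm_data$employees
employees <- c(531, 6, 15, 211, 25)

# sum collapses the vector to a single number
total <- sum(employees)
total
#> [1] 788

# division is element-wise, so this gives one share per firm
shares <- employees / total
length(shares)
#> [1] 5

# the shares add up to one (up to rounding error)
all.equal(sum(shares), 1)
#> [1] TRUE
```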
Side Comment: The $ is not the only way to access the elements in a data frame. You can also access them by their position. For example, if you want whatever is in the third row and second column of the data frame, you can get it by
firm_data[3,2]
#> [1] "Manufacturing"
Sometimes it is also convenient to recover a particular row or column by its position in the data frame. Here is an example of recovering the entire fourth row
firm_data[4,]
#>                   name      industry county employees
#> 4 Classic City Widgets Manufacturing Clarke       211
Notice that you just leave the “column index” (which is the second one) blank.
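The same idea works in the other direction: leaving the row index blank returns a whole column, and either index can be a vector of positions or a column name. Here is a minimal sketch on a small made-up data frame (small_df and its contents are hypothetical, chosen so the block runs on its own).

```r
# small made-up data frame, used only for illustration
small_df <- data.frame(name = c("A", "B", "C"),
                       employees = c(10, 20, 30))

# a whole column, selected by name with the row index left blank
by_name <- small_df[, "employees"]
by_name
#> [1] 10 20 30

# the same column, selected by position
by_position <- small_df[, 2]

# rows 1 and 3, all columns
two_rows <- small_df[c(1, 3), ]
nrow(two_rows)
#> [1] 2
```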
Side Comment: One other thing that sometimes takes some getting used to is that, for programming in general, you have to be very precise. Suppose you were to make a very small typo. R is not going to understand what you mean. See if you can spot the typo in the next line of code.
firm_data$employes
#> NULL
A few more useful functions for working with data frames are:
nrow and ncol: return the number of rows or columns in the data frame
colnames and rownames: return the names of the columns or rows
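Here is a quick sketch of these four functions on a tiny made-up data frame (tiny_df and its columns are hypothetical names used so the example is self-contained).

```r
# tiny made-up data frame, used only for illustration
tiny_df <- data.frame(firm = c("A", "B", "C"),
                      employees = c(10, 20, 30))

nrow(tiny_df)
#> [1] 3
ncol(tiny_df)
#> [1] 2
colnames(tiny_df)
#> [1] "firm"      "employees"

# when no row names are supplied, R uses "1", "2", "3", ...
rownames(tiny_df)
#> [1] "1" "2" "3"
```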
2.5.6 Lists
Vectors and data frames are the main two types of objects that we’ll use this semester, but let me give you a quick overview of a few other types of objects. Let’s start with lists. Lists are very generic in the sense that they can carry around complicated data. If you are familiar with any object oriented programming language like Java or C++, they have the flavor of an “object”, in the object-oriented sense.
I’m not sure if we will see any examples this semester where you have to use a list. But here is an example. Suppose that we wanted to put the vector that we created earlier, five, and the data frame that we created earlier, firm_data, into the same object. We could do it as follows
unusual_list <- list(numbers=five, df=firm_data)
You can access the elements of a list in a few different ways. Sometimes it is convenient to access them via the $
unusual_list$numbers
#> [1] 1 2 3 4 5
Other times, it is convenient to access them via their position in the list
unusual_list[[2]] # notice the double brackets
#>                   name      industry county employees
#> 1    ABC Manufacturing Manufacturing Clarke       531
#> 2     Martin's Muffins Food Services Oconee         6
#> 3 Down Home Appliances Manufacturing Clarke        15
#> 4 Classic City Widgets Manufacturing Clarke       211
#> 5   Watkinsville Diner Food Services Oconee        25
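To make the point about double brackets a little more concrete, here is a minimal sketch on a made-up list (example_list and its element names are hypothetical). It shows that double brackets also accept a name, and that single brackets return a smaller list rather than the element itself.

```r
# made-up list mixing a numeric vector and a string
example_list <- list(numbers = c(1, 2, 3), label = "demo")

# names of the elements in the list
names(example_list)
#> [1] "numbers" "label"

# double brackets with a name return the element itself
example_list[["numbers"]]
#> [1] 1 2 3

# single brackets return a list that contains the element
class(example_list[1])
#> [1] "list"
class(example_list[[1]])
#> [1] "numeric"
```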
2.5.7 Matrices
Matrices are very similar to data frames, but the data should all be of the same type. Matrices are very useful in some numerical calculations that are beyond the scope of this class. Here is an example of a matrix.
mat <- matrix(c(1,2,3,4), nrow=2, byrow=TRUE)
mat
#>      [,1] [,2]
#> [1,]    1    2
#> [2,]    3    4
You can access elements of a matrix by their position in the matrix, just like for the data frame above.
# first row, second column
mat[1,2]
#> [1] 2
# all rows in second column
mat[,2]
#> [1] 2 4
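The byrow=TRUE argument in the code that created mat matters. As a small sketch (mat_bycol is a hypothetical name), here is what happens with the same numbers when byrow is left at its default, in which case the matrix is filled column by column instead.

```r
# same numbers as mat, but filled column by column (the default)
mat_bycol <- matrix(c(1, 2, 3, 4), nrow = 2)
mat_bycol
#>      [,1] [,2]
#> [1,]    1    3
#> [2,]    2    4

# number of rows and columns
dim(mat_bycol)
#> [1] 2 2

# entire first row
mat_bycol[1, ]
#> [1] 1 3
```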
2.5.8 Factors
Sometimes variables in economics are categorical. This sort of variable is somewhat between a numeric variable and a string. In R, categorical variables are called factors.
A good example of a categorical variable is firm_data$industry. It tells you the “category” of the industry that a firm is in.
Oftentimes, we may have to tell R that a variable is a “factor” rather than just a string. Let’s create a variable called industry that contains the industry from firm_data but as a factor.
industry <- as.factor(firm_data$industry)
industry
#> [1] Manufacturing Food Services Manufacturing Manufacturing
#> [5] Food Services
#> Levels: Food Services Manufacturing
A useful package for working with factor variables is the forcats package.
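Base R also has a few functions that help you see what a factor contains. Here is a minimal sketch on a made-up factor (ind and counts are hypothetical names); it uses only base R functions, not forcats.

```r
# made-up industry labels, converted to a factor
ind <- as.factor(c("Manufacturing", "Food Services", "Manufacturing"))

# the distinct categories, in alphabetical order by default
levels(ind)
#> [1] "Food Services" "Manufacturing"
nlevels(ind)
#> [1] 2

# number of observations in each category
counts <- table(ind)
counts[["Manufacturing"]]
#> [1] 2

# behind the scenes, each observation is stored as an integer code for its level
as.integer(ind)
#> [1] 2 1 2
```

The integer codes are what make a factor sit "somewhat between" a numeric variable and a string.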
2.5.9 Understanding an object in R
Sometimes you may have a variable and not know exactly what it contains. Some functions that are helpful in this case are
class: tells you, err, the class of an object (i.e., its “type”)
head: shows you the “beginning” of an object; this is especially helpful for large objects (like some data frames)
str: stands for “structure” of an object
Let’s try these out
class(firm_data)
#> [1] "data.frame"
# typically would show the first six rows of a data frame,
# but that is the whole data frame here
head(firm_data)
#>                   name      industry county employees
#> 1    ABC Manufacturing Manufacturing Clarke       531
#> 2     Martin's Muffins Food Services Oconee         6
#> 3 Down Home Appliances Manufacturing Clarke        15
#> 4 Classic City Widgets Manufacturing Clarke       211
#> 5   Watkinsville Diner Food Services Oconee        25
str(firm_data)
#> 'data.frame': 5 obs. of 4 variables:
#> $ name : chr "ABC Manufacturing" "Martin's Muffins" "Down Home Appliances" "Classic City Widgets" ...
#> $ industry : chr "Manufacturing" "Food Services" "Manufacturing" "Manufacturing" ...
#> $ county : chr "Clarke" "Oconee" "Clarke" "Clarke" ...
#> $ employees: num 531 6 15 211 25
Practice: Try running class, head, and str on the vector five that we created earlier.
Side Comment: c stands for “concatenate”. Concatenate is a computer science word that means to combine two vectors. Probably the most well-known version of this is “string concatenation”, which combines two character vectors; for example, c(string1, string2) is a character vector with two elements. Sometimes string concatenation instead means putting two (or more) strings into the same string. This can be done using the paste function in R; for example, paste(string1, string2) returns "econometrics class". Notice that paste puts in a space between string1 and string2. For practice, see if you can find an argument to the paste function that allows you to remove the space between the two strings.
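The two meanings of concatenation can be compared side by side. This is only a sketch: it recreates string1 and string2 with the same values as in Section 2.5.4 so that it runs on its own, and the names both and joined are my own.

```r
string1 <- "econometrics"
string2 <- "class"

# c combines them into one character vector with two elements
both <- c(string1, string2)
both
#> [1] "econometrics" "class"
length(both)
#> [1] 2

# paste glues them into a single string, separated by a space by default
joined <- paste(string1, string2)
joined
#> [1] "econometrics class"
length(joined)
#> [1] 1
```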