2.6 R Basic Data types
In the previous examples we worked with numbers, but variables can be assigned other types of information. There are five basic types:
- Integers: whole numbers. If you type a whole number in R, such as 3 or 4, it will be stored as a double unless you explicitly define it as an integer (for example 3L);
- Doubles: real numbers;
- Logicals or Booleans: corresponding to TRUE and FALSE, also abbreviated as T and F respectively;
- Characters: strings of text surrounded by " (for example "hi") or by ' (for example 'by');
- Dates: date-time expressions in R.
2.6.1 Integers
Integer vectors exist so that data can be passed to C or Fortran code which expects them, and so that small integer data can be represented exactly and compactly. Integers are stored in 32 bits, so their range is approximately −2 000 000 000 to +2 000 000 000 (-2*10^9 to +2*10^9, which R could portray as -2e+09 to 2e+09). The exact upper limit is available as .Machine$integer.max.
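As a minimal sketch of what this limit means in practice, the following lines look up the largest integer and show what happens when a calculation goes past it. The variable names max_int and overflow are chosen for this illustration only.

```r
# Largest representable integer (a 32-bit signed value)
max_int = .Machine$integer.max
max_int
# Integer arithmetic that exceeds the limit gives NA (R also issues a warning)
overflow = suppressWarnings(max_int + 1L)
is.na(overflow)
# Adding a double instead promotes the result to double, so nothing is lost
typeof(max_int + 1)
```

If you expect values beyond this range, storing them as doubles is usually the simplest option.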
Be careful. Do not try to change the class of a vector by using the integer function, which creates a new vector of zeros of a given length rather than converting an existing one. Here is a numeric vector of whole numbers that you want to convert into a vector of integers:
x = c(5,3,7,8)
is.integer(x)
## [1] FALSE
is.numeric(x)
## [1] TRUE
To coerce a numeric vector to an integer vector we have to use the as.integer function like this:
x = c(5,3,7,8)
x = as.integer(x)
is.integer(x)
## [1] TRUE
The as.integer function works like trunc when applied to real numbers (it drops the fractional part), and removes the imaginary part when applied to complex numbers:
as.integer(5.7)
## [1] 5
as.integer(5.7 -3i)
## Warning: imaginary parts discarded in coercion
## [1] 5
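Truncation is not the same as rounding, and the difference shows up most clearly with negative numbers. The sketch below compares as.integer with floor and round on a small vector invented for this illustration.

```r
vals = c(-5.7, -0.2, 5.2, 5.7)
truncated = as.integer(vals)  # drops the fractional part: -5 0 5 5
floored = floor(vals)         # rounds down: -6 -1 5 5
rounded = round(vals)         # rounds to the nearest whole number: -6 0 5 6
```

If you want the nearest integer rather than the truncated one, apply round first and then as.integer.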
2.6.2 Integer vs. Double
The two most common numeric classes used in R are integer and double (for double precision floating point numbers). R automatically converts between these two classes when needed for mathematical purposes. As a result, it’s feasible to use R and perform analyses for years without specifying these differences.
By default, when you create a numeric vector using the c() function it will produce a vector of double precision numeric values. To create a vector of integers using c() you must specify this explicitly by placing an L directly after each number.
dbl_var = c(1, 2.5, 4.5)
dbl_var
## [1] 1.0 2.5 4.5
int_var = c(1L, 6L, 10L)
int_var
## [1] 1 6 10
To check whether a vector is made up of integer or double values:
typeof(dbl_var)
## [1] "double"
typeof(int_var)
## [1] "integer"
By default, if you read in data that has no decimal points or you create numeric values using the x = 1:10 method, the numeric values will be coded as integer. If you want to change a double to an integer or vice versa you can use one of the following:
# integers to doubles
as.double(int_var)
## [1] 1 6 10
# doubles to integers
as.integer(dbl_var)
## [1] 1 2 4
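The automatic conversion between the two classes can be seen by checking typeof after a few operations. This is only a sketch; it redefines int_var so that it can be run on its own.

```r
int_var = c(1L, 6L, 10L)
typeof(int_var + 1L)   # integer + integer stays "integer"
typeof(int_var + 1)    # integer + double is promoted to "double"
typeof(int_var / 2L)   # division returns "double"
typeof(1:10)           # the colon operator yields "integer"
1L == 1                # TRUE: the values are compared after promotion
identical(1L, 1)       # FALSE: identical() also compares the storage type
```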
2.6.3 Logical operators
A crucial part of computing involves asking questions about things. Is one thing bigger than another? Are two things the same size? Questions can be joined together using words like ‘and,’ ‘or’ and ‘not.’ Questions in R typically evaluate to TRUE or FALSE, but there is the option of a ‘maybe’ (when the answer is not available, NA). In R, < means ‘less than,’ > means ‘greater than,’ and ! means ‘not’ (see Table 2.2).
You can use T for TRUE and F for FALSE, but you should be aware that T and F might have been assigned as variables. So this is obvious:
TRUE == FALSE
## [1] FALSE
T == F
## [1] FALSE
This, however, is not so obvious:
T = 0
T == FALSE
## [1] TRUE
F = 1
TRUE == F
## [1] TRUE
But now, of course, T is not equal to F:
T != F
## [1] TRUE
To be sure, always write TRUE and FALSE in full, and never use T or F as variable names.
Maybe you noticed in the last code chunk that, in R, TRUE is coded as 1 and FALSE as 0:
TRUE == 1
## [1] TRUE
FALSE == 0
## [1] TRUE
Let’s remove the T and F variables to avoid future errors. Pass their names to rm, not their values: rm(list = c(T,F)) would look for objects called '0' and '1' and leave T and F in place.
rm(T, F)
2.6.4 Real numbers equality
You need to be careful in programming when you want to test whether or not two computed numbers are equal. R will assume that you mean ‘exactly equal,’ and what that means depends upon machine precision. Most numbers are rounded to an accuracy of 53 binary digits.
Typically therefore, two floating point numbers will not reliably be equal unless they were computed by the same algorithm, and not always even then. You can see this by squaring the square root of 2: surely these values are the same?
x = sqrt(2)
x * x == 2
## [1] FALSE
In fact, they are not the same. We can see by how much the two values differ by subtraction:
x * x - 2
## [1] 4.440892e-16
So how do we test for equality of real numbers? The best advice is not to do it. Try instead to use the alternatives ‘less than’ with ‘greater than or equal to,’ or conversely ‘greater than’ with ‘less than or equal to.’ Then you will not go wrong. Sometimes, however, you really do want to test for equality. In those circumstances, do not use double equals to test for equality, but employ the all.equal function instead, which compares the values within a small tolerance.
x = 0.3 - 0.2
y = 0.1
x == y
## [1] FALSE
all.equal(x,y)
## [1] TRUE
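WARNING: Do not use all.equal directly in if expressions, because when the values differ it returns a character string describing the difference rather than FALSE. Use isTRUE(all.equal(...)) instead.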
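A minimal sketch of the safe pattern follows. The names same, different, tol and close_enough are chosen for this illustration, and the tolerance value is an assumption rather than a recommendation.

```r
x = 0.3 - 0.2
y = 0.1
# all.equal() returns TRUE or a character description, so wrap it in isTRUE()
same = isTRUE(all.equal(x, y))
different = isTRUE(all.equal(x, 0.2))
if (same) message("x and y are equal within tolerance")
# The same idea written by hand with an explicit tolerance
tol = 1e-9   # hypothetical tolerance chosen for this illustration
close_enough = abs(x - y) < tol
```

Wrapping the call in isTRUE guarantees a single TRUE or FALSE, which is what if requires.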
2.6.5 Logical arithmetic
Arithmetic involving logical expressions is very useful in programming and in selection of variables. The key thing to understand is that logical expressions evaluate to either true or false (represented in R by TRUE or FALSE), and that R can coerce TRUE or FALSE into numerical values: 1 for TRUE and 0 for FALSE. Suppose that x is a sequence from 0 to 6 like this:
x = 0:6
Now we can ask questions about the contents of the vector called x. Is x less than 4?
x < 4
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE
The answer is yes for the first four values (0, 1, 2 and 3) and no for the last three (4, 5 and 6).
Two important logical functions are all and any. They check an entire vector but return a single logical value: TRUE or FALSE. Are all the x values bigger than 0?
all(x>0)
## [1] FALSE
No. The first x value is a zero. Are any of the x values negative?
any(x<0)
## [1] FALSE
No. The smallest x value is a zero.
We can use the answers of logical functions in arithmetic. We can count the true values of (x<4), using sum:
sum(x<4)
## [1] 4
We can multiply (x<4) by other vectors:
(x<4) * runif(7)
## [1] 0.3638331 0.9908843 0.4828548 0.2473120 0.0000000 0.0000000 0.0000000
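The same coercion makes a few other idioms possible. The sketch below, using variable names invented for this illustration, computes a proportion with mean, selects elements with a logical subscript, and sums only the selected values.

```r
x = 0:6
n_small = sum(x < 4)            # number of TRUE values: 4
prop_small = mean(x < 4)        # proportion of TRUE values: 4/7
small_values = x[x < 4]         # logical subscripts keep the TRUE positions
total_small = sum(x * (x < 4))  # 0 + 1 + 2 + 3 = 6
```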
2.6.6 Characters
In R, character strings are defined by double (or single) quotation marks:
a = "abc"
b = "123"
Numbers can be coerced to characters (as in b above), but non-numeric characters cannot be coerced to numbers:
as.numeric(a)
## Warning: NAs introduced by coercion
## [1] NA
as.numeric(b)
## [1] 123
One of the initially confusing things about character strings is the distinction between the length of a character object (a vector), and the numbers of characters (nchar) in the strings that comprise that object. An example should make the distinction clear:
pets = c("cat","dog","gerbil","terrapin")
Here, pets is a vector comprising four character strings:
length(pets)
## [1] 4
and the individual character strings have 3, 3, 6 and 8 characters, respectively:
nchar(pets)
## [1] 3 3 6 8
When first defined, character strings are not factors:
class(pets)
## [1] "character"
is.factor(pets)
## [1] FALSE
We have to coerce this variable to convert it into a factor:
pets = as.factor(pets)
is.factor(pets)
## [1] TRUE
2.6.7 Characters: letters vectors
There are built-in vectors in R that contain the 26 letters of the alphabet in lower case (letters) and in upper case (LETTERS):
letters
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
LETTERS
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
To discover which number in the alphabet the letter n is, you can use the which function like this:
which(letters == "n")
## [1] 14
For the purposes of printing you might want to suppress the quotes that appear around character strings by default. The function to do this is called noquote:
noquote(letters)
## [1] a b c d e f g h i j k l m n o p q r s t u v w x y z
2.6.8 Characters: Pasting strings together
You can amalgamate individual strings into vectors of character information:
c(a,b)
## [1] "abc" "123"
This shows that the concatenation produces a vector of two strings. It does not convert two 3-character strings into one 6-character string. The R function to do that is paste:
paste(a,b,sep = "")
## [1] "abc123"
The third argument, sep="", means that the two character strings are to be pasted together without any separator between them: the default for paste is to insert a single blank space, like this:
paste(a,b)
## [1] "abc 123"
Notice that you do not lose blanks that are within character strings when you use the sep="" option in paste.
paste(a,b," a longer phrase containing blanks",sep="")
## [1] "abc123 a longer phrase containing blanks"
If one of the arguments to paste is a vector, each of the elements of the vector is pasted to the specified character string to produce an object of the same length as the vector:
d = c(a,b,"new")
e = paste(d, "a longer phrase containing blanks")
e
## [1] "abc a longer phrase containing blanks"
## [2] "123 a longer phrase containing blanks"
## [3] "new a longer phrase containing blanks"
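Two related tools are worth knowing, although they are not used above: paste0, which is a shorthand for paste with sep="", and the collapse argument, which joins the elements of a vector into a single string. The following sketch redefines a, b and d so that it can be run on its own.

```r
a = "abc"
b = "123"
d = c(a, b, "new")
joined = paste0(a, b)                  # same result as paste(a, b, sep = "")
numbered = paste0("item", 1:3)         # vectorised over the numbers
one_string = paste(d, collapse = "-")  # collapses the vector into one string
```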
2.6.9 Characters: Extracting parts of strings
We begin by defining a phrase:
phrase = "the quick brown fox jumps over the lazy dog"
The function called substr is used to extract substrings of a specified number of characters from within a character string. In the next example we extract the first 20 characters from the phrase object:
substr(phrase, 1, 20)
## [1] "the quick brown fox "
The second argument in substr is the number of the character at which extraction is to begin (in this case the first), and the third argument is the number of the character at which extraction is to end (in this case, the 20th).
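Because the start and end positions are ordinary numbers, they can be computed. The sketch below uses nchar to extract the end of the string and shows that substr works on every element of a character vector; the variable names are chosen for this illustration.

```r
phrase = "the quick brown fox jumps over the lazy dog"
n = nchar(phrase)
last_word = substr(phrase, n - 2, n)   # the final three characters
first_word = substr(phrase, 1, 3)
pets = c("cat", "dog", "gerbil", "terrapin")
initials = substr(pets, 1, 1)          # substr is vectorised over strings
```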
2.6.10 Characters: Counting things within strings
Counting the total number of characters in a string could not be simpler; just use the nchar function directly, like this:
nchar(phrase)
## [1] 43
So there are 43 characters including the blanks between the words. To count the numbers of separate individual characters (including blanks) you need to split up the character string into individual characters (43 of them), using strsplit like this:
strsplit(phrase, split = "")
## [[1]]
## [1] "t" "h" "e" " " "q" "u" "i" "c" "k" " " "b" "r" "o" "w" "n" " " "f" "o" "x"
## [20] " " "j" "u" "m" "p" "s" " " "o" "v" "e" "r" " " "t" "h" "e" " " "l" "a" "z"
## [39] "y" " " "d" "o" "g"
The split argument determines the character used to split the string. With split = "" every character is separated. If we use a blank space instead, we separate the string into its individual words:
strsplit(phrase, split = " ")
## [[1]]
## [1] "the" "quick" "brown" "fox" "jumps" "over" "the" "lazy" "dog"
The table function can then be used for counting the number of occurrences of each of the characters:
table(strsplit(phrase, split = ""))
##
##   a b c d e f g h i j k l m n o p q r s t u v w x y z
## 8 1 1 1 1 3 1 1 2 1 1 1 1 1 1 4 1 1 2 1 2 2 1 1 1 1 1
This demonstrates that all of the letters of the alphabet were used at least once within our phrase, and that there were eight blanks within the string called phrase. This suggests a way of counting the number of words in a phrase, given that this will always be one more than the number of blanks (so long as there are no leading or trailing blanks in the string):
nwords = 1+table(strsplit(phrase, split = ""))[1]
nwords
##
## 9
Another solution is to sum all the elements in the table obtained by splitting the phrase with a blank space:
sum(table(strsplit(phrase, split = " ")))
## [1] 9
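A more direct route, sketched here with names chosen for illustration, is to extract the vector of words from the list that strsplit returns and work with it directly.

```r
phrase = "the quick brown fox jumps over the lazy dog"
words = strsplit(phrase, split = " ")[[1]]  # [[1]] extracts the character vector
n_words = length(words)                     # total number of words
n_the = sum(words == "the")                 # occurrences of one particular word
n_unique = length(unique(words))            # number of distinct words
table(words)                                # frequency of every word
```

The [[1]] is needed because strsplit returns a list with one element per input string.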
2.6.11 Characters: Upper- and lower-case text
It is easy to switch between upper and lower cases using the toupper and tolower functions:
toupper(phrase)
## [1] "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG"
tolower(toupper(phrase))
## [1] "the quick brown fox jumps over the lazy dog"
2.6.12 Dates and times
The measurement of time is highly idiosyncratic. Successive years start on different days of the week. There are months with different numbers of days. Leap years have an extra day in February. Americans and Britons put the day and the month in different places: 3/4/2006 is March 4 for the former and April 3 for the latter.
All these things mean that working with dates and times is extremely complicated. Fortunately, R has a robust system for dealing with this complexity.
To see how R handles dates and times, have a look at Sys.time():
Sys.time()
## [1] "2022-03-11 12:59:18 CET"
This description of date and time is strictly hierarchical from left to right: the longest time scale (years) comes first, then month, then day, separated by hyphens, then there is a blank space, followed by the time, with hours first (using the 24-hour clock), then minutes, then seconds, separated by colons. Finally, there is a character string explaining the time zone (CET stands for Central European Time).
This representation of the date and time as a character string is user-friendly and familiar, but it is no good for calculations. For that, we need a single numeric representation of the combined date and time. The convention in R is to base this on seconds (the smallest time scale that is accommodated in Sys.time).
The baseline for expressing today’s date and time in seconds is the start of 1 January 1970 (UTC):
as.numeric(Sys.time())
## [1] 1.647e+09
This is fine for plotting time series graphs, but it is not much good for computing monthly means (e.g. is the mean for June significantly different from the July mean?) or daily means (e.g. is the Monday mean significantly different from the Friday mean?).
To answer questions like these we have to be able to access a broad set of categorical variables associated with the date: the year, the month, the day of the week, and so forth. To accommodate this, R uses the POSIX system for representing times and dates:
class(Sys.time())
## [1] "POSIXct" "POSIXt"
You can think of the class POSIXct, with suffix ‘ct,’ as continuous time (i.e. a number of seconds), and POSIXlt, with suffix ‘lt,’ as list time (i.e. a list of all the various categorical descriptions of the time, including day of the week and so forth). Officially the suffixes stand for calendar time and local time, but the mnemonic describes how the two classes behave. It is hard to remember these acronyms, but it is well worth making the effort. Naturally, you can easily convert from one representation to the other:
time.list = as.POSIXlt(Sys.time())
unlist(time.list)
## sec min hour mday
## "18.3728873729706" "59" "12" "11"
## mon year wday yday
## "2" "122" "5" "69"
## isdst zone gmtoff
## "0" "CET" "3600"
Here you see the eleven components of the list. The time is represented by the number of seconds (sec), minutes (min) and hours (hour, on the 24-hour clock). Next comes the day of the month (mday, starting from 1), then the month of the year (mon, starting at January = 0), then the year (starting at 0 = 1900). The day of the week (wday) is coded from Sunday = 0 to Saturday = 6. The day within the year (yday) is coded from 0 = January 1. Then there is a flag, isdst, which indicates whether daylight saving time is in operation (0 = FALSE in this case). Finally, zone gives the time zone abbreviation and gmtoff the offset from GMT in seconds. The ones you are most likely to use include year (to get yearly mean values), mon (to get monthly means) and wday (to get means for the different days of the week).
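Because several of these components start counting from zero, or from 1900, it is easy to be out by one. The sketch below uses a fixed timestamp in UTC (an assumption made so that the results do not depend on your local time zone) and converts the components to their everyday values.

```r
# A fixed timestamp in UTC, so the results do not depend on the local time zone
stamp = as.POSIXlt("2022-03-11 12:59:18", tz = "UTC")
calendar_year = stamp$year + 1900   # year counts from 1900
calendar_month = stamp$mon + 1      # mon counts from January = 0
week_day = stamp$wday               # Sunday = 0, so 5 is Friday
year_day = stamp$yday               # 1 January = 0
```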
2.6.13 Dates and times: Reading time data from files
It is most likely that your data files contain dates in Excel format, for example 03/09/2014 (a character string showing month/day/year separated by slashes).
df = read.csv(file = "Data/boulder-precip.csv")
attach(df)
head(df)
## ID DATE PRECIP TEMP
## 1 756 8/21/13 0.1 55
## 2 757 8/26/13 0.1 25
## 3 758 8/27/13 0.1 NA
## 4 759 9/1/13 0.0 -999
## 5 760 9/9/13 0.1 15
## 6 761 9/10/13 1.0 25
To convert a factor or a character string into a POSIXlt object, we employ an important function called ‘strip time,’ written strptime.
2.6.14 Dates and times: The strptime function
To convert a factor or a character string into dates using the strptime function, we provide a format statement enclosed in double quotes to tell R exactly what to expect, in what order, and separated by what kind of symbol. For our present example we have month (as a number), then slash, then day of the month, then slash, then year (without the century, making two digits).
Rdate = strptime(DATE, "%m/%d/%y")
class(Rdate)
## [1] "POSIXlt" "POSIXt"
It is always a good idea at this stage to add the R-formatted date to your dataframe:
df = data.frame(df, Rdate)
head(df)
## ID DATE PRECIP TEMP Rdate
## 1 756 8/21/13 0.1 55 2013-08-21
## 2 757 8/26/13 0.1 25 2013-08-26
## 3 758 8/27/13 0.1 NA 2013-08-27
## 4 759 9/1/13 0.0 -999 2013-09-01
## 5 760 9/9/13 0.1 15 2013-09-09
## 6 761 9/10/13 1.0 25 2013-09-10
Now, at last, we can do things with the date information. We might want the mean value of PRECIP for each day of the week. The day of the week is held in Rdate$wday:
tapply(PRECIP,Rdate$wday,mean)
## 0 1 2 3 4 5 6
## 0.5000000 0.2250000 0.3666667 1.2000000 9.8000000 0.8000000 0.1000000
It is hard to remember all the format codes for strip time, but they are roughly mnemonic and they are always preceded by a percent symbol. Here is a list of the format components:
- %a: Abbreviated weekday name
- %A: Full weekday name
- %b: Abbreviated month name
- %B: Full month name
- %c: Date and time, locale-specific
- %d: Day of the month as decimal number (01–31)
- %H: Hours as decimal number (00–23) on the 24-hour clock
- %I: Hours as decimal number (01–12) on the 12-hour clock
- %j: Day of year as decimal number (001–366)
- %m: Month as decimal number (01–12)
- %M: Minute as decimal number (00–59)
- %p: AM/PM indicator in the locale
- %S: Second as decimal number (00–61, allowing for two ‘leap seconds’)
- %U: Week of the year (00–53) using the first Sunday as day 1 of week 1
- %w: Weekday as decimal number (0–6, Sunday is 0)
- %W: Week of the year (00–53) using the first Monday as day 1 of week 1
- %x: Date, locale-specific
- %X: Time, locale-specific
- %Y: Year with century
- %y: Year without century
- %Z: Time zone as a character string (output only)
There is a useful function called weekdays (note the plural) for turning the day number into the appropriate name:
y = strptime("01/02/2020", format="%d/%m/%Y")
weekdays(y)
## [1] "Saturday"
which is converted from:
y$wday
## [1] 6
because the days of the week are numbered from Sunday = 0.
Here is another kind of date, with years in two-digit form (%y), and the months as abbreviated names (%b) using no separators:
other.dates = c("1jan99", "2jan05", "31mar04", "30jul05")
strptime(other.dates, "%d%b%y")
## [1] "1999-01-01 CET" "2005-01-02 CET" "2004-03-31 CEST" "2005-07-30 CEST"
Here is yet another possibility with year, then month in full, then week of the year, then day of the week abbreviated, all separated by a single blank space:
yet.another.date = c("2016 January 2 Mon","2017 February 6 Fri","2018 March 10 Tue")
strptime(yet.another.date,"%Y %B %W %a")
## [1] "2016-01-11 CET" "2017-02-10 CET" "2018-03-06 CET"
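The same format codes work in the opposite direction with format, and it is worth knowing that strptime returns NA rather than an error when a string does not fit the format. The sketch below sets tz = "UTC" (an assumption made so that the result does not depend on your local time zone) and uses variable names chosen for illustration.

```r
# tz = "UTC" keeps the example independent of the local time zone
good = strptime("01/02/2020", format = "%d/%m/%Y", tz = "UTC")
iso = format(good, "%Y-%m-%d")      # back to a character string
day_of_year = format(good, "%j")    # zero-padded day of the year
# An impossible date, or a string that does not match the format, gives NA
bad = strptime("31/02/2020", format = "%d/%m/%Y", tz = "UTC")
wrong_format = strptime("2020-02-01", format = "%d/%m/%Y", tz = "UTC")
```

Checking is.na on the result straight after conversion is a quick way to catch dates that were read with the wrong format.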
2.6.15 Dates and times: Summary
The key thing to understand is the difference between the two representations of dates and times in R. They have unfortunately non-memorable names.
- POSIXlt gives a list containing separate vectors for the year, month, day of the week, day within the year, and suchlike. It is very useful as a categorical explanatory variable (e.g. to get monthly means from data gathered over many years using date$mon).
- POSIXct gives a vector containing the date and time expressed as a continuous variable that you can use in regression models (it is the number of seconds since the beginning of 1970).
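The two representations can be compared side by side in a short sketch. The dates and variable names are invented for this illustration, and tz = "UTC" is an assumption made so that the numbers are the same on every machine.

```r
start = as.POSIXct("2020-02-01 00:00:00", tz = "UTC")
end = as.POSIXct("2020-03-01 00:00:00", tz = "UTC")
seconds_since_1970 = as.numeric(start)   # the continuous representation
# Elapsed time between two POSIXct values (2020 is a leap year)
elapsed_days = as.numeric(difftime(end, start, units = "days"))
as_list = as.POSIXlt(start)              # the list representation
month_index = as_list$mon                # February is 1 (January = 0)
back_again = as.POSIXct(as_list)         # round trip to continuous time
```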
2.6.16 Testing and coercing
The concepts of membership and coercion may be unfamiliar. Membership relates to the class of an object in R. Coercion changes the class of an object. For instance, a logical variable has class logical and mode logical. This is how we create the variable:
lv = c(TRUE,FALSE,TRUE)
We can assess its membership by asking if it is a logical variable using the is.logical function:
is.logical(lv)
## [1] TRUE
It is not a factor, and so it does not have levels:
levels(lv)
## NULL
But we can coerce it to be a two-level factor like this:
fv = as.factor(lv)
fv
## [1] TRUE FALSE TRUE
## Levels: FALSE TRUE
We can coerce a logical variable to be numeric: TRUE evaluates to 1 and FALSE evaluates to zero, like this:
nv = as.numeric(lv)
nv
## [1] 1 0 1
In general, the expression as(object, value) is the way to coerce an object to a particular class. Membership functions ask is.something and coercion functions say as.something.
Objects have a type, and you can test the type of an object using an is.type function (Table 2.3). For instance, mathematical functions expect numeric input and text-processing functions expect character input. Some types of objects can be coerced into other types. A familiar type of coercion occurs when we interpret the TRUE and FALSE of logical variables as numeric 1 and 0, respectively. Factor levels can be coerced to numbers. Numbers can be coerced into characters, but non-numeric characters cannot be coerced into numbers.
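One coercion deserves particular care: converting a factor whose levels look like numbers. The sketch below, with a factor invented for this illustration, shows that as.numeric returns the internal level codes, so the usual approach is to go through as.character first.

```r
f = factor(c("10", "20", "10"))
codes = as.numeric(f)                  # internal level codes: 1 2 1
values = as.numeric(as.character(f))   # the numbers the labels spell out: 10 20 10
chars = as.character(c(1.5, 2))        # numbers to characters
not_a_number = suppressWarnings(as.numeric("abc"))  # NA (R also issues a warning)
```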
2.6.17 Missing values, infinity and things that are not numbers
Calculations can lead to answers that are plus infinity, represented in R by Inf, or minus infinity, which is represented as -Inf:
3/0
## [1] Inf
-12/0
## [1] -Inf
Calculations involving infinity can be evaluated: for instance,
exp(-Inf)
## [1] 0
0/Inf
## [1] 0
Other calculations, however, lead to quantities that are not numbers. These are represented in R by NaN (‘not a number’). Here are some of the classic cases:
0/0
## [1] NaN
Inf-Inf
## [1] NaN
Inf/Inf
## [1] NaN
You need to understand clearly the distinction between NaN and NA (this stands for ‘not available’ and is the missing-value symbol in R). The function is.nan is provided to check specifically for NaN, and is.na also returns TRUE for NaN. Coercing NaN to logical or integer type gives an NA of the appropriate type. There are built-in tests, is.finite and is.infinite, to check whether a number is finite or infinite.
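The four tests can be compared on a single vector. This is a minimal sketch with a vector made up for the purpose.

```r
vals = c(10, Inf, -Inf, NaN, NA)
finite = is.finite(vals)      # TRUE only for ordinary numbers
infinite = is.infinite(vals)  # TRUE for Inf and -Inf
nan = is.nan(vals)            # TRUE for NaN only, not for NA
missing = is.na(vals)         # TRUE for both NaN and NA
```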
Missing values in dataframes are a real source of irritation, because they affect the way that model-fitting functions operate and they can greatly reduce the power of the modelling that we would like to do.
You may want to discover which values in a vector are missing. Here is a simple case:
y = c(4,NA,7)
The missing value question should evaluate to FALSE TRUE FALSE. There are two ways of looking for missing values that you might think should work, but do not. These involve treating NA as if it were an ordinary value or a piece of text and using double equals (==) to test for it. So this does not work:
y == NA
## [1] NA NA NA
because it turns all the values into NA (definitely not what you intended). This does not work either:
y == "NA"
## [1] FALSE NA FALSE
It correctly reports that the numbers are not character strings, but it returns NA for the missing value itself, rather than TRUE as required. This is how you do it properly:
is.na(y)
## [1] FALSE TRUE FALSE
To produce a vector with the NA stripped out, use subscripts with the not (!) operator like this:
y[! is.na(y)]
## [1] 4 7
Some functions do not work with their default settings when there are missing values in the data, and mean is a classic example of this:
x = c(1:8, NA)
mean(x)
## [1] NA
In order to calculate the mean of the non-missing values, you need to specify that the NA values are to be removed, using the na.rm = TRUE argument:
mean(x, na.rm = TRUE)
## [1] 4.5
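Before removing missing values it is often useful to know how many there are and where they sit. The following sketch, using names chosen for this illustration, combines is.na with sum and which, and writes TRUE in full for the na.rm argument.

```r
x = c(1:8, NA)
n_missing = sum(is.na(x))              # how many values are missing
where_missing = which(is.na(x))        # their positions in the vector
mean_present = mean(x, na.rm = TRUE)   # mean of the non-missing values
total_present = sum(x, na.rm = TRUE)   # sum of the non-missing values
clean = x[!is.na(x)]                   # the vector with NA stripped out
```

Writing na.rm = TRUE rather than na.rm = T protects the call against a variable named T having been assigned elsewhere in the session.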