8.5 Model notation

There is a concise notation for specifying the choices made in a model design: which variable is the response, what the explanatory variables are, and which model terms to use. This is the notation you will use when working with computers.

To illustrate, here is the notation for some of the models looked at earlier in this chapter:

  • ccf ~ 1 + temperature
  • wage ~ 1 + sex
  • time ~ 1 + year + sex + year:sex

The ~ symbol (pronounced “tilde”) divides each statement into two parts. On the left of the tilde is the name of the response variable. On the right is a list of model terms. When there is more than one model term, as is typically the case, the terms are separated by a + sign.

The examples show three types of model terms:

  • The symbol 1 stands for the intercept term.
  • A variable name (e.g., sex or temperature) stands for using that variable in a main term.
  • An interaction term is written as two names separated by a colon, for instance year:sex.
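To make the reading rules concrete, here is a minimal Python sketch that takes a model statement apart the way a reader would: split at the tilde, split the right side at each + sign, then name the type of each term. The function names parse_model and describe_term are hypothetical, chosen only for this illustration. They are not part of any statistical package, and the sketch handles only the fully written-out notation shown above.

```python
def parse_model(statement):
    """Split a model statement into its response variable and its terms."""
    # The tilde divides the statement: response on the left, terms on the right.
    response, right_side = statement.split("~")
    # The + sign separates one model term from the next.
    terms = [term.strip() for term in right_side.split("+")]
    return response.strip(), terms


def describe_term(term):
    """Name the type of a single model term."""
    if term == "1":
        return "intercept"
    if ":" in term:
        # Two variable names joined by a colon form an interaction term.
        first, second = [name.strip() for name in term.split(":")]
        return f"interaction between {first} and {second}"
    # A bare variable name is a main term.
    return f"main term for {term}"


response, terms = parse_model("time ~ 1 + year + sex + year:sex")
print("response:", response)
for term in terms:
    print(term, "->", describe_term(term))
```

Running the sketch on the third example lists time as the response, followed by the intercept, the two main terms, and the interaction term.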

Although this notation looks like arithmetic or algebra, it is not. The plus sign does not mean arithmetic addition; it is simply the divider between terms. In English, one uses a comma as the divider, as in “rock, paper, and scissors.” The modeling notation uses + instead: “rock + paper + scissors.” So, in the modeling notation, 1 + age does not mean “arithmetically add 1 to the age.” Instead, it means “two model terms: the intercept and age as a main term.”

Similarly, don’t confuse the tilde with an algebraic equal sign. The model statement is not an equation. So the statement wage ~ 1 + age does not mean “wage equals 1 plus age.” Instead, it means “wage is the response variable and there are two model terms: the intercept and age as a main term.”

In order to avoid accidentally leaving out important terms, the modeling notation includes some shorthand. Two main points will cover most of what you will do:

  • You don’t have to type the 1 term; it will be included by default. So, wage ~ age is the same thing as wage ~ 1 + age. On those very rare occasions when you might want to insist that there be no intercept term, you can indicate this with a minus sign: wage ~ age - 1.

  • Almost always, when you include an interaction term between two variables, you will also include the main terms for those variables. The * sign can be used as shorthand. The model wage ~ 1 + sex + age + sex:age can be written simply as wage ~ sex * age.
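The two shorthand rules can be checked with a small Python sketch that spells out every term a statement implies. The function name expand_shorthand is hypothetical, and the sketch is deliberately limited: it assumes a * joins exactly two variable names and that a minus sign is used only to remove the intercept. Real statistical software handles many more cases.

```python
def expand_shorthand(statement):
    """Return the response variable and the full list of terms a statement implies."""
    response, right_side = statement.split("~")

    # Assumption: a minus sign appears only in the form "- 1" (no intercept).
    pieces = [piece.strip() for piece in right_side.split("-")]
    drop_intercept = "1" in pieces[1:]

    terms = []
    for piece in pieces[0].split("+"):
        piece = piece.strip()
        if "*" in piece:
            # a * b stands for both main terms plus their interaction.
            a, b = [name.strip() for name in piece.split("*")]
            implied = [a, b, f"{a}:{b}"]
        else:
            implied = [piece]
        for term in implied:
            # Skip an explicit 1 here; the intercept is handled below.
            if term != "1" and term not in terms:
                terms.append(term)

    # The intercept is included by default unless it was removed with "- 1".
    if not drop_intercept:
        terms.insert(0, "1")
    return response.strip(), terms


print(expand_shorthand("wage ~ age"))
print(expand_shorthand("wage ~ age - 1"))
print(expand_shorthand("wage ~ sex * age"))
```

Under these assumptions, wage ~ age and wage ~ 1 + age expand to the same term list, wage ~ age - 1 leaves only the main term for age, and wage ~ sex * age produces the four terms of wage ~ 1 + sex + age + sex:age.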