How do I include categorical variables in my regression model?


One common problem researchers face when running a regression analysis is how to include categorical predictors. Unlike continuous variables, which you can add without any prior manipulation, categorical variables require extra work both when setting up the analysis and when interpreting the results.

Let’s start with the simplest case: a binary variable, that is, a categorical variable with two levels. The first step is to transform it into a 0/1 variable, also known as a dummy variable. Imagine a simple regression model where the dependent variable is salary and the only predictor is gender, which has been coded as 1 if “Male” and 2 if “Female.” We will first need to recode it as 0 if “Male” and 1 if “Female” (or vice versa). The category coded 0 is known as the reference group or reference category. The coefficient of the dummy variable is the average difference in the dependent variable between the two levels of the binary predictor: the mean of the group coded 1 minus the mean of the reference group. In our example, the coefficient is the difference in average salary between women and men. With “Male” as the reference group, a positive and significant coefficient indicates that on average women have higher salaries than men; a negative and significant coefficient indicates that men have higher salaries.
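The recoding and the interpretation of the coefficient can be sketched in a few lines of Python. The salary figures below are invented purely for illustration, and the variable names (gender, salary, female) are assumptions, not part of any real dataset. The sketch fits the one-predictor regression by hand and compares the slope with the difference in group means.

```python
# Hypothetical data: gender coded 1 = Male, 2 = Female, as in the text.
gender = [1, 1, 1, 2, 2, 2]
salary = [50.0, 54.0, 52.0, 58.0, 60.0, 62.0]  # made-up salaries

# Recode into a 0/1 dummy: Male -> 0 (reference group), Female -> 1.
female = [0 if g == 1 else 1 for g in gender]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Ordinary least squares with one predictor: slope = S_xy / S_xx.
x_bar, y_bar = mean(female), mean(salary)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(female, salary))
s_xx = sum((x - x_bar) ** 2 for x in female)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Group means, for comparison with the regression estimates.
mean_male = mean([s for s, f in zip(salary, female) if f == 0])
mean_female = mean([s for s, f in zip(salary, female) if f == 1])

print("intercept:", intercept, "slope:", slope)
print("mean difference (Female - Male):", mean_female - mean_male)
```

In this sketch the intercept matches the mean salary of the reference group (men) and the slope matches the mean for women minus the mean for men, so a positive slope here means women earn more on average in the invented data.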

Now, let’s look at the case of having more than two categories. A categorical variable with k categories needs to be transformed into k-1 dummy variables before being entered into the model. This process of creating dichotomous variables from a categorical predictor is known as dummy coding. For the sake of simplicity we will consider the case of a categorical variable with three levels.

We will need to include two dummy variables in the model. For example, let’s consider the categorical variable education (highest level of studies completed), coded as 1 “High School or less”, 2 “College”, and 3 “Advanced graduate degree”. To dummy code this variable, we create a variable called HSorless that is 1 when education is 1 and 0 otherwise. Likewise, we create College, which is 1 if education is 2 and 0 otherwise, and Advanced, which is 1 if education is 3 and 0 otherwise.

We will only use two of the three created dummies in the regression analysis; for instance, we could choose to include HSorless and College and leave Advanced as the reference category. Or we could decide to include College and Advanced and leave HSorless as the reference category. The choice of reference category depends on your research interest. If your interest is to see how having a college or advanced degree contributes to the average salary compared to having a high school education or less, then leaving HSorless as the reference group is the appropriate choice.
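As a minimal sketch of this dummy coding step, the Python below builds the three indicators and then drops the one chosen as the reference category. The education codes are invented for illustration, and drop_reference is a hypothetical helper written for this example, not a function from any library.

```python
# Hypothetical education codes:
# 1 = High School or less, 2 = College, 3 = Advanced graduate degree.
education = [1, 2, 3, 2, 1, 3]

# One 0/1 indicator per level, named as in the text.
level_names = {1: "HSorless", 2: "College", 3: "Advanced"}
dummies = {
    name: [1 if e == code else 0 for e in education]
    for code, name in level_names.items()
}


def drop_reference(all_dummies, reference):
    """Return the k-1 dummies that remain after removing the reference category."""
    return {name: col for name, col in all_dummies.items() if name != reference}


# Keep College and Advanced; HSorless becomes the reference category.
predictors = drop_reference(dummies, reference="HSorless")

for name, column in predictors.items():
    print(name, column)
```

Changing the reference argument to "Advanced" would instead keep HSorless and College as the two predictors.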

The interpretation of the dummy coefficients is similar to the case of the binary variable. The coefficient corresponding to College is the average difference in the dependent variable between this level of education and the reference group. If College and Advanced are the dummies included in the model, the coefficient for College shows the average difference in salary between a person who has completed a college degree and a person with a high school education or less. Likewise, the coefficient for Advanced shows the average difference in salary between a person with an advanced graduate degree and a person in that same reference group.
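To make this interpretation concrete, here is a small sketch that fits the regression with College and Advanced as predictors and HSorless as the reference category. It assumes NumPy is installed and uses its least-squares solver; the salaries are invented for illustration, and no other predictors are included in the model.

```python
import numpy as np

# Hypothetical data: education coded 1, 2, 3 as in the text; made-up salaries.
education = np.array([1, 1, 2, 2, 3, 3])
salary = np.array([40.0, 44.0, 50.0, 54.0, 63.0, 67.0])

# Dummies for the two non-reference levels (HSorless is the reference).
college = (education == 2).astype(float)
advanced = (education == 3).astype(float)

# Design matrix: intercept column plus the two dummies.
X = np.column_stack([np.ones(len(salary)), college, advanced])

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X, salary, rcond=None)
intercept, b_college, b_advanced = coef

# Group means, for comparison with the fitted coefficients.
mean_hs = salary[education == 1].mean()
mean_college = salary[education == 2].mean()
mean_advanced = salary[education == 3].mean()

print("intercept:", intercept)
print("College coefficient:", b_college, "vs", mean_college - mean_hs)
print("Advanced coefficient:", b_advanced, "vs", mean_advanced - mean_hs)
```

With only these dummies in the model, the intercept matches the mean salary of the reference group and each coefficient matches that level's mean minus the reference mean. Once other predictors are added, the coefficients become differences adjusted for those predictors rather than raw differences in means.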

Ayla Myrick