Multiple Regression with Dummy Variables


One limitation of multiple-regression analysis is that it accommodates only quantitative explanatory variables. Dummy variables can be used to incorporate qualitative explanatory variables, such as gender or mother tongue, into a linear model, substantially expanding the range of application of regression analysis.


We have a data set of 3979 individuals that records Wage (in Canadian dollars), Education (in years), Age, Gender and Language. The objective is to measure the impact of the explanatory variables, both quantitative and qualitative, on the explained variable, Wage.

Introducing a Dummy Regressor

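One way of formulating the common-slope model is

Y = α + βX + γD + e

where Y is the explained variable, X is a quantitative explanatory variable, e is the error term, and D, called a dummy-variable regressor or an indicator variable, is coded 1 for men and 0 for women:

D = 1 for men
D = 0 for women

Hence, for women the model becomes

Y = α + βX + e

and for men

Y = (α + γ) + βX + e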
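A minimal sketch of the common-slope model, assuming a tiny made-up data set (the numbers are illustrative and are not taken from the wage data). It codes D as 1 for men and 0 for women and fits the model by ordinary least squares through the normal equations, using only the standard library. The helper names solve and ols are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Illustrative data: (education in years, gender). Not the article's data set.
people = [(10, "female"), (12, "male"), (14, "female"),
          (16, "male"), (18, "female"), (11, "male")]

# Design matrix columns: constant, X, and the dummy D (1 for men, 0 for women).
design = [[1.0, float(x), 1.0 if g == "male" else 0.0] for x, g in people]

# Generate Y from known parameters (no noise) so the fit can be checked.
alpha_true, beta_true, gamma_true = 2.0, 1.5, 3.0
y = [alpha_true * r[0] + beta_true * r[1] + gamma_true * r[2] for r in design]

alpha, beta, gamma = ols(design, y)
print("women: intercept", round(alpha, 6), "slope", round(beta, 6))
print("men:   intercept", round(alpha + gamma, 6), "slope", round(beta, 6))
```

Because the data are generated without noise, the fit recovers the parameters used to build it. The two groups share the slope and differ only in the intercept, which is what the common-slope model asserts.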

The regression summary for the model with a gender dummy (1 for male, 0 for female) is as follows:

Regression Statistics
Multiple R           0.54613717
R Square             0.298265808
Adjusted R Square    0.297736197
Standard Error       6.596309175
Observations         3979

                 Coefficients    Standard Error    t Stat          P-value
Intercept        -8.162885422    0.617311042       -13.22329405    4.26076E-39
education         0.934117644    0.03528227         26.4755543     2.1331E-142
age               0.255995604    0.008665943        29.54042093    1.4645E-173
gender (male)     3.48023027     0.20925238         16.63173562    4.22394E-60

Based on the R-squared value, education, age and gender together explain around 30% of the variation in wage. The very low p-values mean that, for each explanatory variable, we can reject the null hypothesis that its coefficient is zero; all three are therefore statistically significant.
All three coefficients are positive, so each variable is positively associated with the explained variable (wage). Holding the others fixed, one more year of education raises the predicted wage by about 0.93, one more year of age raises it by about 0.26, and being male raises it by about 3.48.
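The figures in the summary table can be cross-checked by hand. The sketch below copies the reported values and recomputes each t statistic as coefficient divided by standard error, and the adjusted R-squared from R-squared, the number of observations and the number of regressors. It only re-derives numbers already in the table and does not refit the model.

```python
# Values copied from the regression summary above.
n_obs = 3979
r_square = 0.298265808
coefficients = {
    # name: (coefficient, standard error)
    "Intercept": (-8.162885422, 0.617311042),
    "education": (0.934117644, 0.03528227),
    "age": (0.255995604, 0.008665943),
    "gender (male)": (3.48023027, 0.20925238),
}

# t statistic = coefficient / standard error
t_stats = {name: coef / se for name, (coef, se) in coefficients.items()}

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1),
# where k is the number of regressors, not counting the intercept.
k = len(coefficients) - 1
adjusted_r_square = 1 - (1 - r_square) * (n_obs - 1) / (n_obs - k - 1)

for name, t in t_stats.items():
    print(f"{name:15s} t = {t:.4f}")
print(f"adjusted R^2 = {adjusted_r_square:.6f}")
```

The recomputed values agree with the "t Stat" column and the "Adjusted R Square" row up to rounding.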

The fitted regression model for men (dummy equal to 1) is:

Wage = (-8.162885422 + 3.48023027) + 0.934117644(education) + 0.255995604(age) + e

whereas for women (dummy equal to 0) the model is:

Wage = -8.162885422 + 0.934117644(education) + 0.255995604(age) + e
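As a worked example, the two fitted equations can be evaluated for one profile. The sketch below uses the coefficients from the summary table; the profile (12 years of education, age 30) is a hypothetical choice for illustration, and predicted_wage is an illustrative helper name.

```python
# Coefficients copied from the regression summary above.
INTERCEPT = -8.162885422
B_EDUCATION = 0.934117644
B_AGE = 0.255995604
B_MALE = 3.48023027


def predicted_wage(education, age, male):
    """Predicted wage; `male` is the dummy (1 for men, 0 for women)."""
    return INTERCEPT + B_EDUCATION * education + B_AGE * age + B_MALE * male


# Hypothetical profile: 12 years of education, 30 years old.
wage_woman = predicted_wage(12, 30, male=0)
wage_man = predicted_wage(12, 30, male=1)
gap = wage_man - wage_woman

print(f"woman: {wage_woman:.2f}")
print(f"man:   {wage_man:.2f}")
print(f"gap:   {gap:.2f}")
```

For this profile the predictions are about 10.73 for a woman and 14.21 for a man. The gap equals the gender coefficient (about 3.48) whatever the education and age, because the dummy shifts only the intercept.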

The next step is to create dummy variables for "Language", the second qualitative explanatory variable, and to use them in the model.

The language variable has three categories: English, French and Other. Therefore, we create two dummy variables, English and French, leaving Other as the reference category.
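A minimal sketch of this coding, assuming the language values are stored as the strings "English", "French" and "Other" (the function name language_dummies is hypothetical). Three categories need only two dummies: with an intercept in the model, a third dummy would be perfectly collinear with the other two plus the constant.

```python
def language_dummies(language):
    """Return the (english, french) dummies; "Other" is the reference category."""
    return (1 if language == "English" else 0,
            1 if language == "French" else 0)


sample = ["English", "French", "Other", "English"]  # illustrative values
for lang in sample:
    english, french = language_dummies(lang)
    print(f"{lang:8s} English={english} French={french}")
```

An individual in the Other category has both dummies equal to 0, so each language coefficient measures a difference relative to Other.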

Models     Explained variable    Explanatory variables
Model 1    Wage                  Education, Age, Male, English
Model 2    Wage                  Education, Age, Male, French

The regression summary for the model with dummy variables for both gender and language is as follows:

Regression Statistics
Multiple R           0.546156901
R Square             0.298287361
Adjusted R Square    0.297404259
Standard Error       6.597867925
Observations         3979

               Coefficients    Standard Error    t Stat          P-value
Intercept      -8.062713603    0.680841355       -11.84227948    8.04394E-32
education       0.934962776    0.035400845        26.41074734    9.2945E-142
age             0.25567294     0.008719145        29.32316635    2.8591E-171
sex_Male        3.480752092    0.209350591        16.62642589    4.59903E-60
Lan-English    -0.113383063    0.326277032        -0.3475055     0.728229995
Lan-French     -0.113348309    0.509894688        -0.222297489   0.824093733

The model now has more regressors, but the language coefficients are small, negative and not statistically significant (p-values of about 0.73 and 0.82), and R-squared is essentially unchanged. This suggests that we should dig deeper, using a different type of model. In the examples above, we assumed that the groups have different intercepts but a common slope, so their regression lines are parallel. The graph below shows the intercept and slope for the above models:

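Running a model that uses the combined education-gender-language terms (one for English and one for French) as regressors, the summary of the results is: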

Regression Statistics
Multiple R           0.340135991
R Square             0.115692492
Adjusted R Square    0.11524767
Standard Error       7.403920023
Observations         3979

                           Coefficients    Standard Error    t Stat         P-value
Intercept                  13.4540538      0.149079651       90.24741943    0
Edu-Gender-Lan: English    0.009856859     0.000440936       22.35440928    2.2983E-104
Edu-Gender-Lan: French     0.008885083     0.001206793       7.362558866    2.1813E-13

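The results show that adjusted R-squared is lower (about 0.12, against about 0.30 before), although both terms are statistically significant. The coefficient on the English term (about 0.0099) is slightly larger than the coefficient on the French term (about 0.0089), which suggests that, for the same amount of education, men who speak English tend to earn somewhat more than men who speak French. The difference is small relative to the standard error of the French coefficient, so it should be read with caution. We may need more variables, such as the type of job or the location of the individuals in the sample. Even so, the models explain part of the variation in wage: education seems to matter, and so do age and gender.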

We can still make further assumptions and modify the model. When you build your model, be careful about what you are looking for, and always think about what the expected value of Y is given the dummies you are using: switching the dummies on and off changes the intercept, the slope, or both, depending on how the model is structured. You can also add a squared education term, on its own or interacted with a dummy, if you think education has a nonlinear effect (squaring a 0/1 dummy itself changes nothing). There is a lot you can do with dummy variables.
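To see how switching a dummy on and off can change both the intercept and the slope, consider a model with an interaction term, Y = a + bX + cD + d(D·X) + e. The sketch below uses hypothetical coefficients chosen only for illustration; they are not estimates from the wage data, and the function names are illustrative.

```python
def expected_y(x, d, a, b, c, delta):
    """E[Y | x, d] for the model Y = a + b*x + c*d + delta*(d*x) + e."""
    return a + b * x + c * d + delta * d * x


def line_for_group(d, a, b, c, delta):
    """Intercept and slope of the regression line when the dummy equals d."""
    return a + c * d, b + delta * d


# Hypothetical coefficients, chosen only for illustration.
params = dict(a=5.0, b=1.0, c=2.0, delta=0.5)

for d in (0, 1):
    intercept, slope = line_for_group(d, **params)
    value = expected_y(10, d, **params)
    print(f"D={d}: intercept={intercept}, slope={slope}, E[Y | x=10]={value}")
```

With D = 0 the line has intercept a and slope b; with D = 1 the intercept becomes a + c and the slope becomes b + d. Dropping the interaction term (setting its coefficient to zero) gives back the common-slope model, where only the intercept moves.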
