# Multiple Regression with Dummy Variables

## Introduction

One limitation of multiple-regression analysis is that it accommodates only quantitative explanatory variables. Dummy variables can be used to incorporate qualitative explanatory variables, such as gender or mother tongue, into a linear model, substantially expanding the range of application of regression analysis.

## Analysis

We have a data set that includes Wage (in Canadian dollars), Education (in years), Age, Gender and Language for 3979 individuals. The objective is to measure the impact of the explanatory variables, both quantitative and qualitative, on the explained variable, Wage.

### Introducing a Dummy Regressor

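One way of formulating the common-slope model is

$$Y_i = \alpha + \beta X_i + \gamma D_i + \varepsilon_i$$

where $D$, called a dummy-variable regressor or an indicator variable, is coded 1 for men and 0 for women:

$$D_i = \begin{cases} 1 & \text{for men} \\ 0 & \text{for women} \end{cases}$$

Hence, for women the model becomes

$$Y_i = \alpha + \beta X_i + \varepsilon_i$$

and for men

$$Y_i = (\alpha + \gamma) + \beta X_i + \varepsilon_i$$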
The regression summary for the model with a gender dummy variable (coded 1 for male and 0 for female) is as follows:

Regression Statistics

| Statistic | Value |
| --- | --- |
| Multiple R | 0.54613717 |
| R Square | 0.298265808 |
| Standard Error | 6.596309175 |
| Observations | 3979 |

| | Coefficients | Standard Error | t Stat | P-value |
| --- | --- | --- | --- | --- |
| Intercept | -8.162885422 | 0.617311042 | -13.22329405 | 4.26076E-39 |
| education | 0.934117644 | 0.03528227 | 26.4755543 | 2.1331E-142 |
| age | 0.255995604 | 0.008665943 | 29.54042093 | 1.4645E-173 |
| gender (male) | 3.48023027 | 0.20925238 | 16.63173562 | 4.22394E-60 |

Based on the R Square value, education, age and gender together explain around 30% of the variation in wage. The very low p-values indicate that, for each explanatory variable, we can reject the null hypothesis that its coefficient is zero, so all three are statistically significant.
All three coefficients are positive, which means that each explanatory variable is positively associated with the explained variable: holding the other variables constant, an increase in education or age, or being male, is associated with a higher wage.
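The figures in the summary can be cross-checked with a few lines of Python. This sketch only uses two standard relationships: each t Stat is the coefficient divided by its standard error, and R Square is the square of Multiple R when the model has an intercept. The numbers are copied from the summary above; the variable names are illustrative.

```python
# (coefficient, standard error, reported t Stat) copied from the summary
summary = {
    "Intercept": (-8.162885422, 0.617311042, -13.22329405),
    "education": (0.934117644, 0.03528227, 26.4755543),
    "age": (0.255995604, 0.008665943, 29.54042093),
    "gender (male)": (3.48023027, 0.20925238, 16.63173562),
}

# t statistic = coefficient / standard error
t_stats = {name: coef / se for name, (coef, se, _) in summary.items()}

# R Square = (Multiple R) squared
multiple_r = 0.54613717
r_squared = multiple_r ** 2

for name, (_, _, reported_t) in summary.items():
    print(f"{name}: computed t = {t_stats[name]:.4f}, reported t = {reported_t:.4f}")
print(f"R Square from Multiple R: {r_squared:.6f}")
```

The computed values agree with the reported ones to within rounding, which confirms that the table was transcribed consistently.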

The fitted regression model for men (dummy variable equal to 1) is:

Wage = (-8.162885422 + 3.48023027) + 0.934117644(education) + 0.255995604(age) + e

whereas for women (dummy variable equal to 0) the model is:

Wage = -8.162885422 + 0.934117644(education) + 0.255995604(age) + e
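As a small illustration of how the two equations are used, the sketch below evaluates the fitted model for a hypothetical person. The coefficients are taken from the regression summary; the function name `predicted_wage` and the example values (12 years of education, age 40) are assumptions made here, not part of the data set.

```python
# Coefficients from the fitted common-slope model
INTERCEPT = -8.162885422
B_EDUCATION = 0.934117644
B_AGE = 0.255995604
B_MALE = 3.48023027


def predicted_wage(education, age, male):
    """Fitted wage (error term omitted); `male` is True for men."""
    d = 1 if male else 0  # dummy regressor: 1 = men, 0 = women
    return INTERCEPT + B_EDUCATION * education + B_AGE * age + B_MALE * d


# Hypothetical person: 12 years of education, 40 years old
wage_woman = predicted_wage(12, 40, male=False)
wage_man = predicted_wage(12, 40, male=True)
gap = wage_man - wage_woman  # equals the dummy coefficient

# Intercept of the men's equation
intercept_men = INTERCEPT + B_MALE
```

Whatever values of education and age are chosen, `gap` equals the gender coefficient, because the model forces the two regression surfaces to be parallel.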

The next step is to create dummy variables for “Language”, the second qualitative explanatory variable, and include them in the model.

The language variable has three categories: English, French and Other. We therefore create two dummy variables, English and French, which leaves Other as the reference category.
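Coding a three-category variable with two dummies can be sketched as follows. The helper `dummy_code` and the column names `is_English` and `is_French` are hypothetical, and the sketch assumes that Other is the omitted (reference) category.

```python
def dummy_code(values, reference):
    """Return one dict of 0/1 indicators per observation,
    with one indicator for every level except the reference level."""
    levels = sorted(set(values) - {reference})
    return [
        {f"is_{level}": int(value == level) for level in levels}
        for value in values
    ]


# Illustrative observations, not rows from the data set
languages = ["English", "French", "Other", "English"]
coded = dummy_code(languages, reference="Other")

for language, row in zip(languages, coded):
    print(language, row)
```

An observation in the reference category has zeros in both columns, so its effect is absorbed by the intercept. Including a third dummy for Other alongside the intercept would make the regressors perfectly collinear.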

| Models | Explained variable | Explanatory variables |
| --- | --- | --- |
| Model 1 | Wage | Education, Age, Male, English |
| Model 2 | Wage | Education, Age, Male, French |

The regression summary for the model with dummy variables for both gender and language is as follows:

Regression Statistics

| Statistic | Value |
| --- | --- |
| Multiple R | 0.546156901 |
| R Square | 0.298287361 |
| Standard Error | 6.597867925 |
| Observations | 3979 |

| | Coefficients | Standard Error | t Stat | P-value |
| --- | --- | --- | --- | --- |
| Intercept | -8.062713603 | 0.680841355 | -11.84227948 | 8.04394E-32 |
| education | 0.934962776 | 0.035400845 | 26.41074734 | 9.2945E-142 |
| age | 0.25567294 | 0.008719145 | 29.32316635 | 2.8591E-171 |
| sex_Male | 3.480752092 | 0.209350591 | 16.62642589 | 4.59903E-60 |
| Lan-English | -0.113383063 | 0.326277032 | -0.3475055 | 0.728229995 |
| Lan-French | -0.113348309 | 0.509894688 | -0.222297489 | 0.824093733 |

The model is now larger, but R Square is essentially unchanged (about 0.2983), and both language coefficients are negative with large p-values (about 0.73 and 0.82), so they are not statistically significant. This suggests that we should dig deeper using a different type of model. In the examples above, we assumed that the groups have different intercepts but an identical slope, so their regression lines are parallel. The graph below shows the intercepts and slopes for the above models:

Running this model, the summary of the results is:

Regression Statistics
Multiple R 0.340135991
R Square 0.115692492