## Introduction

One limitation of multiple-regression analysis is that it accommodates only quantitative explanatory variables. Dummy variables can be used to incorporate qualitative explanatory variables, such as gender or mother tongue, into a linear model, substantially expanding the range of application of regression analysis.

## Analysis

We have a data set that includes Wage (in Canadian dollars), Education (in years), Age, Gender and Language for 3979 individuals. The objective is to measure the impact of the explanatory variables, both quantitative and qualitative, on the explained variable, Wage.

### Introducing a Dummy Regressor

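One way of formulating the common-slope model is

$$Y_i = \alpha + \beta X_i + \gamma D_i + e_i$$

where $D$, called a dummy-variable regressor or an indicator variable, is coded 1 for men and 0 for women:

$$D_i = \begin{cases} 1 & \text{for men} \\ 0 & \text{for women} \end{cases}$$

Hence, for women the model becomes

$$Y_i = \alpha + \beta X_i + e_i$$

and for men

$$Y_i = (\alpha + \gamma) + \beta X_i + e_i$$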
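To see how the dummy coefficient behaves in a fit, here is a minimal sketch that estimates a model of the form Y = alpha + beta*X + gamma*D by ordinary least squares. The data, the coefficient names (`alpha`, `beta`, `gamma`) and the helper functions `solve` and `ols` are all assumptions made for illustration; they are not the wage data or the tool used in this article. Because the made-up data contain no noise, the fit should recover the values used to generate them, with `gamma` equal to the intercept gap between men and women.

```python
# Sketch: fit Y = alpha + beta*X + gamma*D by ordinary least squares.
# The data are made up and noise-free (an assumption for illustration).

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical true values: alpha = 2, beta = 1.5, gamma = 3.
genders = ["male", "female", "female", "male", "female", "male"]
x_values = [10, 12, 16, 14, 9, 18]
d_values = [1 if g == "male" else 0 for g in genders]  # D = 1 for men, 0 for women
y_values = [2 + 1.5 * x + 3 * d for x, d in zip(x_values, d_values)]

design = [[1.0, x, d] for x, d in zip(x_values, d_values)]  # constant, X, D
alpha, beta, gamma = ols(design, y_values)
print(round(alpha, 6), round(beta, 6), round(gamma, 6))
```

With real, noisy data the estimates would only approximate the underlying values, and a statistics package would normally replace the hand-written solver.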

The regression summary with gender coded as a dummy variable (1 for male, 0 for female) is as follows:

| Regression Statistics | |
|---|---|
| Multiple R | 0.54613717 |
| R Square | 0.298265808 |
| Adjusted R Square | 0.297736197 |
| Standard Error | 6.596309175 |
| Observations | 3979 |

| | Coefficients | Standard Error | t Stat | P-value |
|---|---|---|---|---|
| Intercept | -8.162885422 | 0.617311042 | -13.22329405 | 4.26076E-39 |
| education | 0.934117644 | 0.03528227 | 26.4755543 | 2.1331E-142 |
| age | 0.255995604 | 0.008665943 | 29.54042093 | 1.4645E-173 |
| gender (male) | 3.48023027 | 0.20925238 | 16.63173562 | 4.22394E-60 |

Based on the R Square value, education, age and gender together explain around 30% of the variation in wages. The low P-values of the explanatory variables indicate that we can reject the null hypothesis that each coefficient is zero, so all three variables are statistically significant.
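As a quick arithmetic check on the summary above, the sketch below recomputes the t Stat column (coefficient divided by standard error) and the Adjusted R Square from the reported figures. The numbers are copied from the table; the variable names are illustrative, and the adjusted R Square formula used is the standard one, 1 - (1 - R²)(n - 1)/(n - k - 1), which is assumed to be what the regression tool applies.

```python
# Values copied from the regression summary above.
n_obs = 3979
n_regressors = 3  # education, age, gender (male)
r_square = 0.298265808

# name: (coefficient, standard error)
coefficients = {
    "education": (0.934117644, 0.03528227),
    "age": (0.255995604, 0.008665943),
    "gender (male)": (3.48023027, 0.20925238),
}

# t Stat = coefficient / standard error
t_stats = {name: coef / se for name, (coef, se) in coefficients.items()}

# Adjusted R Square = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
adjusted_r_square = 1 - (1 - r_square) * (n_obs - 1) / (n_obs - n_regressors - 1)

for name, t in t_stats.items():
    print(f"{name}: t = {t:.3f}")
print(f"adjusted R Square = {adjusted_r_square:.6f}")
```

The recomputed values agree with the t Stat column and the Adjusted R Square reported in the summary, up to rounding.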

All three coefficients are positive, which means the explanatory variables are positively associated with the explained variable: more education, greater age and being male are each associated with a higher wage.

The fitted regression model for men, using the dummy variable, is:

$$\text{Wage} = (-8.162885422 + 3.48023027) + 0.934117644 \cdot \text{education} + 0.255995604 \cdot \text{age} + e$$

whereas for women the model is:

$$\text{Wage} = -8.162885422 + 0.934117644 \cdot \text{education} + 0.255995604 \cdot \text{age} + e$$
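The two equations differ only in the intercept, which a short sketch can make concrete. The coefficients below are copied from the regression summary; the function name `predicted_wage` and the example person (12 years of education, 30 years old) are hypothetical.

```python
# Coefficients copied from the regression summary above.
INTERCEPT = -8.162885422
B_EDUCATION = 0.934117644
B_AGE = 0.255995604
B_MALE = 3.48023027

def predicted_wage(education, age, male):
    """Fitted wage; `male` is the dummy (1 for men, 0 for women)."""
    return INTERCEPT + B_EDUCATION * education + B_AGE * age + B_MALE * male

# Hypothetical person: 12 years of education, 30 years old.
wage_woman = predicted_wage(12, 30, male=0)
wage_man = predicted_wage(12, 30, male=1)
print(f"woman: {wage_woman:.2f}, man: {wage_man:.2f}, gap: {wage_man - wage_woman:.2f}")
```

Whatever values of education and age are chosen, the gap between the two predictions stays equal to the gender coefficient, which is what a common-slope model implies.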

The next step is to create dummy variables for “Language”, the second qualitative explanatory variable, and add them to the model.

The language variable has three categories: English, French and Other. We therefore create two dummy variables, English and French, leaving Other as the baseline category.
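A minimal sketch of this coding step follows. The helper `make_dummies` and the five-row sample are hypothetical; the point is only that a variable with three categories yields two 0/1 columns, with the omitted category serving as the baseline.

```python
def make_dummies(values, categories, reference):
    """Return one 0/1 column per category except the reference category."""
    kept = [c for c in categories if c != reference]
    return {c: [1 if v == c else 0 for v in values] for c in kept}

languages = ["English", "French", "Other", "English", "Other"]  # made-up sample
dummies = make_dummies(languages, ["English", "French", "Other"], reference="Other")
print(dummies)
# {'English': [1, 0, 0, 1, 0], 'French': [0, 1, 0, 0, 0]}
```

Rows where both columns are 0 belong to the baseline category. Including a third dummy for Other alongside the intercept would make the regressors perfectly collinear.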

| Models | Explained variable | Explanatory variables |
|---|---|---|
| Model 1 | Wage | Education, Age, Male, English |
| Model 2 | Wage | Education, Age, Male, French |

The regression summary with both gender and language coded as dummy variables is as follows:

| Regression Statistics | |
|---|---|
| Multiple R | 0.546156901 |
| R Square | 0.298287361 |
| Adjusted R Square | 0.297404259 |
| Standard Error | 6.597867925 |
| Observations | 3979 |

| | Coefficients | Standard Error | t Stat | P-value |
|---|---|---|---|---|
| Intercept | -8.062713603 | 0.680841355 | -11.84227948 | 8.04394E-32 |
| education | 0.934962776 | 0.035400845 | 26.41074734 | 9.2945E-142 |
| age | 0.25567294 | 0.008719145 | 29.32316635 | 2.8591E-171 |
| sex_Male | 3.480752092 | 0.209350591 | 16.62642589 | 4.59903E-60 |
| Lan-English | -0.113383063 | 0.326277032 | -0.3475055 | 0.728229995 |
| Lan-French | -0.113348309 | 0.509894688 | -0.222297489 | 0.824093733 |
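One way to read the language rows of this summary is to build approximate 95% confidence intervals from the reported coefficients and standard errors. The sketch below does this with the normal critical value 1.96, an approximation assumed here because the sample is large; the regression tool itself may use the exact t distribution.

```python
# Coefficients and standard errors copied from the regression summary above.
estimates = {
    "sex_Male": (3.480752092, 0.209350591),
    "Lan-English": (-0.113383063, 0.326277032),
    "Lan-French": (-0.113348309, 0.509894688),
}

Z_95 = 1.96  # normal approximation; reasonable with 3979 observations

intervals = {}
for name, (coef, se) in estimates.items():
    low, high = coef - Z_95 * se, coef + Z_95 * se
    intervals[name] = (low, high)
    contains_zero = low <= 0 <= high
    print(f"{name}: [{low:.3f}, {high:.3f}] contains zero: {contains_zero}")
```

The intervals for both language dummies straddle zero, consistent with their large P-values, while the interval for the gender dummy lies well above zero.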

The model is now larger, but the language coefficients are small, negative and not statistically significant (P-values of about 0.73 and 0.82), and R Square is essentially unchanged. This implies that we should dig deeper using a different type of model. In the examples above, we assumed that the groups have different intercepts but a common slope, so the fitted lines are parallel. The graph below shows the trajectory of intercept and slope for the above models:

Running this model gives the following summary of results:

| Regression Statistics | |
|---|---|
| Multiple R | 0.340135991 |
| R Square | 0.115692492 |
| Adjusted R Square | 0.11524767 |
| Standard Error | 7.403920023 |
| Observations | 3979 |

| | Coefficients | Standard Error | t Stat | P-value |
|---|---|---|---|---|
| Intercept | 13.4540538 | 0.149079651 | 90.24741943 | 0 |
| Edu-Gender-Lan: English | 0.009856859 | 0.000440936 | 22.35440928 | 2.2983E-104 |
| Edu-Gender-Lan: French | 0.008885083 | 0.001206793 | 7.362558866 | 2.1813E-13 |

The results show that the adjusted R Square is lower than before (about 0.12 versus 0.30). For the same amount of education, being male and speaking English is associated with slightly higher earnings than being male and speaking French (coefficients of about 0.0099 versus 0.0089), although the explanatory power of the model remains low. Adding more variables, such as type of job or location, might help. Even so, the models explain part of the picture: education seems to matter, and so do age and gender.

We can make further assumptions and keep modifying the model. When you build a model, be careful about what you are looking for, and always think about what the expected value of Y is given the dummies you are using. Switching the dummies on and off changes the intercept, the slope, or both, depending on how the model is structured. If you think education has a nonlinear effect, you can add a squared education term and interact it with the dummies; squaring a 0/1 dummy itself changes nothing. There is a lot you can do with dummy variables.
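The effect of switching a dummy on and off can be sketched with a model that includes both the dummy and its interaction with education. The coefficients `a`, `b`, `c` and `d` below are hypothetical values chosen only for illustration; they are not estimates from the wage data.

```python
def expected_wage(education, male, a, b, c, d):
    """E[Y] = a + b*education + c*male + d*(education*male)."""
    return a + b * education + c * male + d * education * male

# Hypothetical coefficients, chosen only for illustration.
coef = dict(a=5.0, b=1.0, c=2.0, d=0.5)

lines = {}
for male in (0, 1):
    intercept = expected_wage(0, male, **coef)
    slope = expected_wage(1, male, **coef) - intercept  # change per year of education
    lines[male] = (intercept, slope)
    print(f"male={male}: intercept={intercept}, slope={slope}")
# male=0: intercept=5.0, slope=1.0
# male=1: intercept=7.0, slope=1.5
```

The plain dummy term shifts the intercept by `c`, while the interaction term shifts the slope by `d`. Dropping the interaction gives back the common-slope model with parallel lines.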