Find a Better Linear Regression Model by Using R: Amusement Park Survey (2)

Kelly Szutu
Published in Analytics Vidhya
5 min read · Apr 2, 2020

Photo by Christina Winter on Unsplash

In the previous article, “Create a Linear Regression Model by Using R: Amusement Park Survey Case (1)”, we built our first model (m1) by transforming one independent variable, and obtained a regression equation that explains about 68% of the variance in the data.

m1 <- lm(overall ~ rides + games + wait + clean + num.child + logdist, data = surveyData)
summary(m1)  # create the report

However, there is still room for improvement, since we always want a model with better performance. In this article, I will cover the data preprocessing ideas I used and the criteria for model selection.

Data Preprocessing

Drop column

Last time we created a new column called logdist to correct the skew in the distance data, so the original distance column can now be dropped (any other useless or unrelated columns can be dropped as well).

head(surveyData) #show the original data
surveyData <- surveyData[,-3]
head(surveyData) #show the data after dropping the third column
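Dropping by numeric position works, but it silently breaks if the column order ever changes. As a sketch using a hypothetical miniature of surveyData, dropping by name is safer:

```r
# Toy data frame standing in for surveyData (made-up values)
df <- data.frame(weekend   = c("yes", "no"),
                 num.child = c(2, 0),
                 distance  = c(9.5, 3.2),
                 logdist   = log(c(9.5, 3.2)))

# Drop the raw distance column by name instead of by position
df <- df[, setdiff(names(df), "distance")]
names(df)  # "weekend" "num.child" "logdist"
```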

Standardize/ Normalize data

To see the relationship between the dependent variable (DV) and the independent variables (IVs), it helps to have everything on the same scale. Since logdist is on a noticeably different scale from the other columns, we standardize each column: (x - mean(x)) / sd(x). R already provides the scale() function for this.

surveyData[,3:8] <- scale(surveyData[,3:8])
head(surveyData)
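To see what scale() actually does, here is a minimal standalone example on a made-up vector: after standardizing, the values have mean 0 and standard deviation 1.

```r
x <- c(2, 4, 6, 8, 10)
z <- as.numeric(scale(x))  # (x - mean(x)) / sd(x)
round(mean(z), 10)  # 0
round(sd(z), 10)    # 1
```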

After normalizing, we build a new model (m2) and see if there is any change.

m2 <- lm(overall ~ rides + games + wait + clean + weekend + logdist + num.child, data = surveyData)
summary(m2)  # create the report

With this small change, we get an R-squared of 0.6786 and an adjusted R-squared of 0.647, almost the same values as model 1. There is still room for improvement.

Change datatype

The datatype of a variable can also affect the model. Here we focus on the num.child column, which can be treated as either continuous (numeric) or discrete (factor, or even binary) data.

is.numeric(surveyData$num.child)

So far we have treated the number of children as numeric: running the code above returns “TRUE”, which means the num.child column in surveyData is numeric. Now we convert it to a factor to see how this influences the results.

surveyData$num.child.factor <- factor(surveyData$num.child)
surveyData$num.child.factor

When we print the new column, it looks identical to the old one. But the last line of the output reads “Levels: 0 1 2 3 4 5”, which tells us the datatype is now a factor with six levels. With this in place, we run the regression model again (m3).
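The same behavior can be checked in isolation with made-up counts: factor() keeps the values but records the distinct levels, and the column stops being numeric.

```r
num.child <- c(0, 2, 1, 0, 5)  # hypothetical counts
f <- factor(num.child)
is.numeric(f)  # FALSE
levels(f)      # "0" "1" "2" "5"
```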

m3 <- lm(overall ~ rides + games + wait + clean + weekend + logdist + num.child.factor, data = surveyData)
summary(m3)  # create the report

The report shows a large improvement: an R-squared of 0.7751 and an adjusted R-squared of 0.77, around 10 percentage points higher than the previous attempts.
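Why does the factor version fit so much better? When lm() sees a factor with k levels, it expands it into k - 1 dummy variables, each with its own coefficient, so the effect of the number of children no longer has to be a straight line in the count. A minimal sketch with made-up values, using model.matrix() to show the expansion:

```r
f <- factor(c(0, 1, 2, 1, 0))  # hypothetical factor with 3 levels
# lm() would fit this design matrix: an intercept plus 2 dummy columns
mm <- model.matrix(~ f)
ncol(mm)  # 3: "(Intercept)", "f1", "f2"
```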

Lastly, we create a new column, has.child, by collapsing num.child into a binary variable and test how it performs.

surveyData$has.child <- factor(surveyData$num.child > 0)
head(surveyData$has.child)

The output is “TRUE” or “FALSE” depending on the condition. We then run the model (m4) with this new variable.
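As a standalone sketch of the conversion (hypothetical counts), the comparison produces a logical vector and factor() turns it into a two-level factor:

```r
num.child <- c(0, 3, 1, 0)       # hypothetical counts
has.child <- factor(num.child > 0)
levels(has.child)  # "FALSE" "TRUE"
```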

m4 <- lm(overall ~ rides + games + wait + clean + weekend + logdist + has.child, data = surveyData)
summary(m4)  # create the report

The result is comparable: R-squared equals 0.7742 and adjusted R-squared equals 0.771, both very close to model 3.

Model Selection: Information Criteria

R-squared and adjusted R-squared are only one reference point. Alongside them, we should check model selection criteria such as AIC and BIC to get a closer look at how well each model really fits.

AIC (Akaike’s information criterion) and BIC (Bayesian information criterion) are measures of the goodness of fit of an estimated statistical model and can also be used for model selection.

AIC(m1); AIC(m2); AIC(m3); AIC(m4) 
BIC(m1); BIC(m2); BIC(m3); BIC(m4)
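As a quick illustration of how these functions are used for comparison, here is a toy example on R's built-in mtcars data (not the survey data). Lower scores are better; adding a genuinely useful predictor lowers both criteria despite the extra parameter.

```r
# Compare two nested models on the built-in mtcars data
m_small <- lm(mpg ~ wt, data = mtcars)
m_big   <- lm(mpg ~ wt + hp, data = mtcars)

AIC(m_small); AIC(m_big)  # m_big scores lower: better fit
BIC(m_small); BIC(m_big)  # BIC agrees despite its stiffer penalty
```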

Conclusion

For model comparison, the model with the lowest AIC and BIC scores is preferred, so we conclude that model 4 is the best model in this case: the binary variable achieves nearly the same fit as the six-level factor with fewer parameters, which might also help avoid overfitting. From the report we can also see that satisfaction with rides, wait, and cleanliness, together with whether a visitor has children, are the four key components influencing the overall satisfaction score, and all of them have a positive relationship with it.

About me

Hey, I’m Kelly, a business analytics graduate student with a journalism and communication background who likes to share the journey of exploring data and interesting findings. If you have any questions, feel free to contact me at kelly.szutu@gmail.com.


Journalist x Data Visualization | Data Analyst x Machine Learning | Python, SQL, Tableau | LinkedIn: www.linkedin.com/in/szutuct/