Every data scientist needs the skills and knowledge of data science techniques, along with the ability to use the right tools to explore data, find insights, and explain the results to other people.
As a junior data scientist just starting my professional career with the Saudi Digital Academy and Coding Dojo “Data Science Immersive Bootcamp”, I selected “House Prices Prediction” as my final project.
The idea of the project is to predict the sale price of a house. The method is:
1- Load the data
2- EDA (Exploratory Data Analysis)
3- Apply machine learning techniques
Load The Data
I used a Jupyter notebook for this project and loaded the data with the pandas library, which provides tools for working with tabular data in Python.
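As a minimal sketch of the loading step: the file name and column names below are assumptions (the data file is not named here), and the tiny in-memory CSV with made-up values only stands in for the real file so the snippet runs on its own.

```python
import io

import pandas as pd

# Stand-in for the real data file: a tiny CSV held in memory (made-up values).
csv_text = """Id,GrLivArea,YearBuilt,YrSold,SalePrice
1,1500,1990,2008,150000
2,1200,1975,2007,120000
3,1800,2005,2009,200000
"""

# With a real file on disk this would be something like pd.read_csv("train.csv"),
# where "train.csv" is a hypothetical file name.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```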
EDA (Exploratory Data Analysis)
I always start with .head() and .shape to get a taste of the data. head() shows the first rows of the data frame, and shape gives its dimensions: in this case 1460 rows and 81 columns.
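A short sketch of that first look, using a small made-up data frame in place of the real one (the column names are assumptions):

```python
import pandas as pd

# Small made-up data frame standing in for the real data.
df = pd.DataFrame({
    "GrLivArea": [1500, 1200, 1800, 900],
    "SalePrice": [150000, 120000, 200000, 90000],
})

first_rows = df.head(2)     # first 2 rows; head() with no argument returns 5
n_rows, n_cols = df.shape   # shape is an attribute (no parentheses), a (rows, columns) tuple

print(first_rows)
print(f"{n_rows} rows and {n_cols} columns")
```

On the real data, df.shape would be the (1460, 81) reported above.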
After collecting some basic information about the data, I used matplotlib and seaborn (two Python plotting libraries) to visualise the data and look for insights that might be helpful.
Fig 1 shows the distribution of the house sale price. The peak is clearly between $100,000 and $200,000.
I must say that this figure only shows the distribution of the sale price, not the maximum and minimum. I wrote some code to find those, along with a little more information, and the result is:
Max of Sale Price is : 755000 $
The year sold is : 2007
The year built is : 1994
Min of Sale Price is : 34900 $
The year sold is : 2009
The year built is : 1920
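The original code is not shown, but one way to produce a report like this is to locate the rows with the highest and lowest price using idxmax() and idxmin(). This is only a sketch: the column names SalePrice, YrSold and YearBuilt are assumptions, describe_extreme is a hypothetical helper, and the values are made up.

```python
import pandas as pd

# Made-up rows standing in for the real data.
df = pd.DataFrame({
    "SalePrice": [150000, 300000, 90000, 200000],
    "YrSold": [2008, 2007, 2009, 2010],
    "YearBuilt": [1990, 2005, 1930, 1975],
})


def describe_extreme(frame, row_label, title):
    """Print price, year sold and year built for one row (hypothetical helper)."""
    row = frame.loc[row_label]
    print(f"{title} of Sale Price is : {row['SalePrice']} $")
    print(f"The year sold is : {row['YrSold']}")
    print(f"The year built is : {row['YearBuilt']}")
    return row


# idxmax() / idxmin() return the index label of the largest / smallest value.
most_expensive = describe_extreme(df, df["SalePrice"].idxmax(), "Max")
cheapest = describe_extreme(df, df["SalePrice"].idxmin(), "Min")
```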
To see the relationship between the data features (columns), a scatter plot is a good place to start.
Fig 2 shows the relation between the ground living area on the x-axis and the sale price on the y-axis. The colour of each point changes according to the number of rooms in the house. We can see that as the ground living area increases, the sale price increases and the colour changes too. This tells us that the bigger the house, the more expensive it is.
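A plot of this kind can be sketched with plain matplotlib. The column names (GrLivArea, TotRmsAbvGrd, SalePrice), the colour map and the data are all assumptions made for illustration, and this is not necessarily how the original figure was drawn.

```python
import matplotlib

matplotlib.use("Agg")  # draw off-screen; this line is not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows standing in for the real data.
df = pd.DataFrame({
    "GrLivArea": [900, 1200, 1500, 1800, 2400],
    "TotRmsAbvGrd": [4, 5, 6, 7, 9],
    "SalePrice": [90000, 120000, 150000, 200000, 320000],
})

fig, ax = plt.subplots()
# c= colours each point by the number of rooms.
points = ax.scatter(df["GrLivArea"], df["SalePrice"],
                    c=df["TotRmsAbvGrd"], cmap="viridis")
fig.colorbar(points, ax=ax, label="Number of rooms")
ax.set_xlabel("Ground living area (sq ft)")
ax.set_ylabel("Sale price ($)")
```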
In Fig 3 we see a similar relation between the basement area and the sale price.
Fig 4 shows the relation between the garage area and the sale price, but here there is no clear relation between them. For example, many houses with a garage area anywhere between 200 and 1,000 square feet all fall in the same $100,000 to $200,000 price range.
Fig 5 shows a box plot that focuses on the foundation material of the house: PConc, CBlock, BrkTil, Wood, Slab and Stone. We can see that houses with a PConc foundation tend to sell for more. The box plot also gives us more information about each foundation. For example, the highest sale price of a house with a PConc foundation is more than $700,000. If you are not familiar with box plots in general, see this link for more information.
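The numbers a box plot draws (quartiles, median, maximum) can also be read off directly with a groupby. This is a sketch on made-up prices, assuming the columns are called Foundation and SalePrice:

```python
import pandas as pd

# Made-up prices for two of the foundation types.
df = pd.DataFrame({
    "Foundation": ["PConc", "PConc", "PConc", "CBlock", "CBlock", "CBlock"],
    "SalePrice": [200000, 250000, 400000, 120000, 150000, 180000],
})

# describe() returns count, mean, std, min, 25%, 50%, 75% and max per group.
summary = df.groupby("Foundation")["SalePrice"].describe()
print(summary[["25%", "50%", "75%", "max"]])
```

The 25% and 75% columns are the edges of the box, and 50% is the median line inside it.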
Before you look at the next box plot, you should know that overall quality rates the overall material and finish of the house from 1 to 10:
10 — Very Excellent
9 — Excellent
8 — Very Good
7 — Good
6 — Above Average
5 — Average
4 — Below Average
3 — Fair
2 — Poor
1 — Very Poor
According to Fig 6, every house with an overall quality rating of 9 or 10 has central air conditioning. Within the same overall quality rating, houses without central air conditioning also sell for less than houses that have it. And no house rated 1 or 2 has central air conditioning.
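Observations like these can be cross-checked with a count table and a table of median prices. The sketch below uses made-up rows and assumes the columns are called OverallQual, CentralAir (with values Y and N) and SalePrice:

```python
import pandas as pd

# Made-up rows standing in for the real data.
df = pd.DataFrame({
    "OverallQual": [2, 5, 5, 5, 9, 9],
    "CentralAir": ["N", "N", "Y", "Y", "Y", "Y"],
    "SalePrice": [60000, 100000, 140000, 160000, 350000, 400000],
})

# How many houses fall in each (quality, air conditioning) cell.
counts = pd.crosstab(df["OverallQual"], df["CentralAir"])

# Median sale price per cell; empty cells show up as NaN.
medians = df.pivot_table(index="OverallQual", columns="CentralAir",
                         values="SalePrice", aggfunc="median")
print(counts)
print(medians)
```

A zero in the counts table means no house has that combination of rating and air conditioning.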
In Fig 7 the x-axis is the month in which the house was sold and the y-axis is the number of houses sold in that month. June has more sales than any other month, and it looks like a lot of sales happen in the summer.
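The counts behind a chart like this can be obtained with value_counts(). A sketch on made-up months, assuming the column is called MoSold and stores the month as a number from 1 to 12:

```python
import pandas as pd

# Made-up sale months (1 = January, ..., 12 = December).
df = pd.DataFrame({"MoSold": [6, 6, 7, 6, 1, 7, 5]})

# Count the sales per month, then order the result by month number.
sales_per_month = df["MoSold"].value_counts().sort_index()
busiest_month = sales_per_month.idxmax()

print(sales_per_month)
print("Busiest month:", busiest_month)
```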
Now let’s see which columns are most correlated with the sale price.
This gives us an overview. But let’s be more specific and look at the top 9 columns most correlated with the sale price.
This gives us a clearer overview. I selected the top 9 columns that correlate with the sale price.
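One way to rank columns by their correlation with the sale price is sketched below. The data is made up, the column names are assumptions, and only the top 2 are taken here because the toy frame is so small (the same call with head(9) would give a top 9).

```python
import pandas as pd

# Made-up rows standing in for the real data.
df = pd.DataFrame({
    "GrLivArea": [900, 1200, 1500, 1800, 2100],
    "OverallQual": [4, 5, 6, 7, 9],
    "MoSold": [6, 1, 7, 3, 6],
    "Foundation": ["CBlock", "CBlock", "PConc", "PConc", "PConc"],
    "SalePrice": [90000, 120000, 150000, 200000, 320000],
})

# Pearson correlation is only defined for numeric columns.
numeric = df.select_dtypes(include="number")
corr_with_price = numeric.corr()["SalePrice"].drop("SalePrice")

top = corr_with_price.sort_values(ascending=False).head(2)
print(top)
```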
Now we can see that some columns are similar. For example, TotalBsmtSF and 1stFlrSF have the same correlation, 0.61. The same goes for garage cars and garage area, which describe the same thing: one looks at the garage’s car capacity and the other at its area. In the next step we will focus on the most correlated columns, and where two of them mean the same thing we will choose only one.
It is always a good idea to look at a pair plot during EDA. It packs a lot of information into a small area with very few lines of code, and sometimes it gives you an insight you might not have thought of.
Machine learning techniques
First I need to make sure that there are no missing values in the data, and prepare the data for the model.
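The exact cleaning steps are not described, so the following is only one common approach, shown on made-up data: count the missing values per column, fill numeric gaps with the median and fill categorical gaps with a placeholder. The column names and the fill strategy are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing numeric and one missing categorical value.
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "GarageType": ["Attchd", None, "Detchd", "Attchd"],
    "SalePrice": [150000, 120000, 200000, 90000],
})

# Number of missing values in each column.
missing_before = df.isnull().sum()
print(missing_before)

# Assumed strategy: median for numbers, a "None" label for categories.
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].median())
df["GarageType"] = df["GarageType"].fillna("None")

missing_after = df.isnull().sum().sum()
print("Missing values left:", missing_after)
```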
Once the data is cleaned and ready to use, it is a good time to start modeling.
To start, I split the data into training and testing sets using train_test_split from the sklearn library. Then I imported LinearRegression from sklearn and fit the model to the training data.
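The split and fit steps can be sketched as follows. The features, the 80/20 split and the random_state are assumptions, and the data is generated randomly so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: price grows with area and quality, plus noise.
rng = np.random.default_rng(0)
area = rng.uniform(800, 2500, size=100)
quality = rng.integers(1, 11, size=100)
price = 100 * area + 15000 * quality + rng.normal(0, 10000, size=100)

X = pd.DataFrame({"GrLivArea": area, "OverallQual": quality})
y = pd.Series(price, name="SalePrice")

# Hold out 20% of the rows for testing (an assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)           # learn the coefficients on the training set
predictions = model.predict(X_test)   # predict prices for the unseen rows
```

Keeping the test rows out of fit() is what makes the later score an honest estimate of how the model handles houses it has not seen.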
Now let’s predict the values and see how accurate the predictions are.
Fig 11 is the linear regression plot, with the true value on the x-axis and the prediction on the y-axis. The line is the perfect fit. Now let’s see what R² value we achieve.
From sklearn we use the r2_score() function.
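As a small illustration of what r2_score() computes, here it is on four made-up prices, next to the same value worked out by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. These numbers are not the project's results.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true and predicted prices.
y_true = np.array([100000, 150000, 200000, 250000])
y_pred = np.array([110000, 140000, 210000, 240000])

score = r2_score(y_true, y_pred)  # note the order: true values first

# The same quantity by hand: 1 - SS_res / SS_tot.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
manual_score = 1 - ss_res / ss_tot

print(score, manual_score)
```

A value of 1 means the predictions match the true values exactly, and a value near 0 means the model does no better than always predicting the mean price.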
Files you may be interested in: