
Regression Modeling

Make sure to read the previous blog in this series before reading this one.

What is a regression model, and what kind of problems is it used for?

A regression model predicts continuous values. For example, regression models make predictions that answer questions like the following:

  • What is the value of a house in California?
  • What is the probability that a user will click on this ad?

They can't be used to predict discrete values. For example:

  • Is a given email message spam or not spam?
  • Is this an image of a dog, a cat, or a hamster?

Such problems are known as classification problems. We will see which machine learning algorithms we can use for classification problems in future blogs.

The dataset used in this blog can be found here.

Linear Regression Model

In a linear regression model, we have only one dependent variable (target variable) and one independent variable (feature).
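Concretely, the model fits a straight line of the form

$$y = mx + b$$

where $m$ is the slope (the regression coefficient) and $b$ is the y-intercept; we will read both values off the fitted model in Step-6 below.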

Step-1

Loading Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import linear_model

We are using the Scikit-Learn library to build our model. By default, this model uses the Ordinary Least Squares method.

Step-2

Loading Dataset.

df = pd.read_csv('homeprices.csv')
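
Before plotting, it helps to take a quick look at the data. A minimal sketch (the column names area and price are assumptions here, matching how the DataFrame is indexed in the plotting code below):

# Quick inspection of the dataset (columns assumed: 'area', 'price')
print(df.head())   # first few rows
print(df.shape)    # (number of rows, number of columns)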

Step-3

We know that in order for linear regression to work, the data we are fitting must be linear. Let's draw a scatter plot to see whether it is.

plt.xlabel('Area')
plt.ylabel('Price')
plt.scatter(df['area'], df['price'], marker='+', color='red')

[Scatter plot of area vs. price]

We can see from the plot that the data is almost linear and can be fit by a straight line.

Step-4

Fitting our linear regression model.

linear_reg_model = linear_model.LinearRegression()
linear_reg_model.fit(df[['area']], df['price'])

You may wonder why we are using df[['area']] instead of df['area']. The thing is, df['area'] returns a pandas Series, while fit expects a 2-dimensional input such as a DataFrame. That's why we used df[['area']]: it returns a DataFrame instead of a Series.
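
A quick way to see the difference, as a minimal sketch:

# df['area'] is 1-D, df[['area']] is 2-D
print(type(df['area']))      # <class 'pandas.core.series.Series'>
print(type(df[['area']]))    # <class 'pandas.core.frame.DataFrame'>
print(df['area'].shape)      # (n,)   -- one dimension
print(df[['area']].shape)    # (n, 1) -- two dimensions, as fit expects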

Step-5

Testing our model on a single value

linear_reg_model.predict([[3300]])

Output

array([628715.75342466])

Our model is making predictions as intended.

Step-6

Checking the calculated regression coefficient and y-intercept.

print(linear_reg_model.coef_)
print(linear_reg_model.intercept_)

Output

[135.78767123]
180616.43835616432

For the equation y = mx + b, linear_reg_model.coef_ gives the m and linear_reg_model.intercept_ gives the b, respectively.
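
We can verify the Step-5 prediction by hand with these values; a minimal sketch:

# Reproduce the prediction for area = 3300 using y = m*x + b
m = linear_reg_model.coef_[0]    # 135.78767123...
b = linear_reg_model.intercept_  # 180616.43835616...
print(m * 3300 + b)              # 628715.7534..., matches predict([[3300]])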

Step-7

Making predictions on the test data.

test_data = pd.read_csv('areas.csv')
preds = linear_reg_model.predict(test_data)
test_data['price'] = preds
plt.scatter(test_data['area'], test_data['price'], color='red')
plt.plot(test_data['area'], preds, color='green')

Here, the plot method draws a straight line while the scatter method draws the points on the plot. We are trying to get an idea of how well our regression line fits the actual data points.

[Predicted prices plotted against area, with the fitted regression line]

Here, we don't have the actual price values, so for the sake of visualization we are treating the predicted values as both actual and predicted. That's why all the points lie exactly on the line, which would not be the case otherwise.

Multivariate Regression Model

When we have more than one independent variable (feature), the data can no longer be fit by a straight line: the dimensionality of our problem increases. So, we use a plane or hyperplane to fit our dataset.
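The straight-line equation from before generalizes accordingly: with $n$ features the model becomes

$$y = b + m_1 x_1 + m_2 x_2 + \dots + m_n x_n$$

where each $m_i$ is the coefficient of feature $x_i$ and $b$ is the intercept. These are exactly the values we will inspect in Step-6 below.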

Step-1

Importing libraries and reading the data.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
USAhousing = pd.read_csv('USA_Housing.csv')

Step-2

We want to see whether our data is linear, whether there is any collinearity or noise, and what its distribution looks like.

sns.pairplot(USAhousing)

Remember! It may take quite some time for larger datasets. Just be patient.

[Pairplot of all USAhousing variables]

We can clearly see that, other than with Price, none of the features show collinearity. There isn't much noise here either.

Let's now see the distribution of the dependent variable, Price.

sns.distplot(USAhousing['Price'])

[Distribution plot of Price]

We can clearly see that it's normally distributed, just like we wanted. So, we can use a regression model here.

sns.heatmap(USAhousing.corr())

[Correlation heatmap of USAhousing variables]

This is another way of seeing the correlation between all the variables. We can see that no variable other than the target variable has any notable correlation with any other variable, which confirms the absence of collinearity. The diagonal shows perfect correlation because every variable has 100% correlation with itself.
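
If you want to read the exact correlation values off the plot, one optional tweak (not in the original) is to annotate the heatmap:

# Annotate each cell with its correlation value for easier reading
sns.heatmap(USAhousing.corr(), annot=True)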

Step-3

Separating features and the target variable.

X = USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
                'Avg. Area Number of Bedrooms', 'Area Population']]
y = USAhousing['Price']

Step-4

Train Test split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)

test_size=0.4 indicates that 40% of the data will be test data while the remaining 60% will be training data.
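
A quick sanity check on the split, as a minimal sketch:

# Verify the 60/40 split: the row counts should be roughly 3:2
print(X_train.shape, X_test.shape)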

Step-5

Creating and training the model.

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)

Step-6

Model Evaluation

# print the intercept
print(lm.intercept_)
coeff_df = pd.DataFrame(lm.coef_, X.columns, columns=['Coefficient'])
print(coeff_df)

[Output: intercept value and coefficient table]

This is how we interpret the coefficients (a quick manual check using them follows the list):

  • Holding all other features fixed, a 1 unit increase in Avg. Area Income is associated with an increase of $21.52.
  • Holding all other features fixed, a 1 unit increase in Avg. Area House Age is associated with an increase of $164883.28.
  • Holding all other features fixed, a 1 unit increase in Avg. Area Number of Rooms is associated with an increase of $122368.67.
  • Holding all other features fixed, a 1 unit increase in Avg. Area Number of Bedrooms is associated with an increase of $2233.80.
  • Holding all other features fixed, a 1 unit increase in Area Population is associated with an increase of $15.15.
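
To see these coefficients in action, we can reproduce one of the model's predictions by hand. A minimal sketch, using the first row of X_test from the split above:

# Manual prediction: intercept + sum of (coefficient * feature value)
row = X_test.iloc[0]
manual_pred = lm.intercept_ + np.dot(lm.coef_, row.values)
print(manual_pred)                   # should match the model's own prediction
print(lm.predict(X_test.iloc[[0]]))  # same value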

Step-7

Predictions from our model.

predictions = lm.predict(X_test)
plt.scatter(y_test, predictions)

[Scatter plot of actual test prices vs. predicted prices]

sns.distplot((y_test-predictions),bins=50);

[Distribution plot of the residuals]

If your residuals have a normal distribution, it's a good indication that choosing a regression model was correct. Otherwise, you may want to go back and see whether another model is better for this problem.
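
One common complementary check for residual normality (not part of the original post) is a Q-Q plot; a minimal sketch using SciPy:

# Q-Q plot: normally distributed residuals fall close to the straight line
import scipy.stats as stats
stats.probplot(y_test - predictions, dist='norm', plot=plt)
plt.show()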

Step-8

Regression Evaluation Metrics

Here are three common evaluation metrics for regression problems:

Mean Absolute Error (MAE) is the mean of the absolute value of the errors:

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

Mean Squared Error (MSE) is the mean of the squared errors:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

Comparing these metrics:

  • MAE is the easiest to understand, because it's the average error.
  • MSE is more popular than MAE, because MSE "punishes" larger errors, which tends to be useful in the real world: larger errors have disproportionately larger squares, and squaring also makes every error positive.
  • RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.

All of these are loss functions, because we want to minimize them.

from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

Output

MAE: 82288.2225191
MSE: 10460958907.2
RMSE: 102278.829223

The smaller the errors (loss), the better the model's performance.
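
As a cross-check against the formulas above, the same metrics can be computed directly with NumPy; a minimal sketch:

# Compute MAE, MSE, and RMSE by hand from the residuals
errors = np.asarray(y_test) - predictions
print('MAE: ', np.mean(np.abs(errors)))        # mean absolute error
print('MSE: ', np.mean(errors ** 2))           # mean squared error
print('RMSE:', np.sqrt(np.mean(errors ** 2)))  # root of the mean squared error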

