January 29, 2023 12:13 pm GMT

Insurance Cost Prediction using Machine Learning with Python.

Machine learning (ML) is a subset of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so.
Machine learning algorithms use historical data as input to predict new output values.

In this project, I built an end-to-end machine learning project using linear regression.
It also covers data cleaning, extensive data visualization, and exploratory data analysis.

Data Description:

The dataset used for this project is an insurance-focused dataset that contains columns such as age, sex, bmi, and region, among others, which were used to predict each person's insurance cost.

Steps

  • Importing the necessary libraries: NumPy, pandas, Matplotlib, seaborn, and scikit-learn were imported.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
%matplotlib inline

  • Loading in the dataset: The CSV file was loaded using the code below:
Insurance = pd.read_csv("https://raw.githubusercontent

  • Information about the data: To get some information about the data, such as the data type of each column, we use the code below:
Insurance.info()

  • Checking the statistical description of the data:
Insurance.describe()

  • Checking for the number of rows and columns present in the dataset:
Insurance.shape

Data Cleaning and preparation:

Working with unclean data leads to inaccurate results, so it's necessary to clean the data before any analysis or prediction is done.

  • Checking for null values:

To check for null values in our dataset, we use the code below:

Insurance.isnull().any()

  • Checking for duplicates:
Insurance.duplicated().any()

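As a small illustration of the two checks above, here is a sketch on a made-up table (the toy DataFrame and its values are hypothetical, not the article's dataset). It counts missing values and repeated rows, then shows one possible cleanup; whether dropping rows is the right choice depends on the data.

```python
import pandas as pd

# Hypothetical toy data: one missing bmi value and one repeated row.
toy = pd.DataFrame({
    "age": [19, 33, 33, 45],
    "bmi": [27.9, 22.7, 22.7, None],
    "charges": [16884.92, 21984.47, 21984.47, 7000.00],
})

null_counts = toy.isnull().sum()         # missing values per column
duplicate_rows = toy.duplicated().sum()  # rows identical to an earlier row

# One possible cleanup: drop repeated rows, then rows with any missing value.
cleaned = toy.drop_duplicates().dropna().reset_index(drop=True)

print(null_counts)
print("duplicate rows:", duplicate_rows)
print("shape after cleaning:", cleaned.shape)
```

Using sum() instead of any() reports how many problems there are, not only whether any exist.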

Exploratory Data Analysis:

Exploratory data analysis helps in understanding the patterns, trends, and metrics in a dataset. It also helps in detecting outliers and anomalous events.

  • Using a correlation matrix to check for correlations among the columns in the dataset:
# numeric_only=True (pandas 1.5+) leaves out text columns such as sex and region
sns.heatmap(Insurance.corr(numeric_only=True))

The correlation matrix shows there's little or no correlation between age and charges.

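  • Checking the distribution of the charges column:
# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True draws a similar plot
sns.histplot(Insurance['charges'], kde=True)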

  • Plotting a pairplot to check the relationship between each pair of columns:
sns.pairplot(Insurance)

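To make the correlation step above more concrete, here is a minimal sketch on made-up numbers (the toy DataFrame is hypothetical). It assumes pandas 1.5 or later, where corr accepts numeric_only, and cross-checks one entry of the matrix against NumPy's corrcoef.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one text column that cannot be correlated.
toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 60],
    "charges": [2000.0, 4100.0, 5900.0, 8200.0, 9800.0],
    "region": ["southwest", "southeast", "northwest", "northeast", "southwest"],
})

# Pearson correlation between every pair of numeric columns.
corr = toy.corr(numeric_only=True)
age_vs_charges = corr.loc["age", "charges"]

# The same coefficient computed directly with NumPy.
by_hand = np.corrcoef(toy["age"], toy["charges"])[0, 1]

print(corr)
print(age_vs_charges, by_hand)
```

Reading individual entries with corr.loc[...] gives the exact coefficient, which can be easier to judge than a heatmap colour.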

Extracting dependent and independent variables:

The dependent variable in this case is charges, while the independent variables are the other columns.

X = Insurance.drop(columns=["charges"])
X.head(5)
y = Insurance["charges"]
y

Splitting the dataset into training and test sets.

To build a machine learning model, you train it on one portion of the data and use the remaining portion to test the model you've built.
So we split our data into training and test sets, using 80 percent to train the model and the other 20 percent to test it.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train.head()

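To see what the split does, here is a small sketch on ten made-up rows (toy_X and toy_y are hypothetical stand-ins for X and y). It checks that 80 percent of the rows go to training, that no row lands in both sets, and that fixing random_state makes the split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy features and target with ten rows.
toy_X = pd.DataFrame({"age": range(20, 30), "bmi": [22.0 + i for i in range(10)]})
toy_y = pd.Series(range(1000, 2000, 100), name="charges")

X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.2, random_state=0)

# Repeating the call with the same random_state gives the same rows.
X_tr2, X_te2, _, _ = train_test_split(toy_X, toy_y, test_size=0.2, random_state=0)

print(len(X_tr), len(X_te))
print("test rows:", list(X_te.index))
```

Features and target are shuffled together, so X_tr and y_tr keep matching row labels.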

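One-hot encoding to transform categorical text data

The data contains some columns that hold text, such as sex, smoker, and region.
Since we can't build the model with text data, we need to convert it into numbers.
Using the sex column as an example, female is represented as 0 and male as 1.
We can do this using one-hot encoding. The test set needs the same encoding before the model can make predictions on it, so both sets are encoded with the code below (this assumes every category appears in both sets, so the resulting columns match):

X_train_ = pd.get_dummies(X_train, columns=["sex", "smoker", "region"], drop_first=True)
X_test_ = pd.get_dummies(X_test, columns=["sex", "smoker", "region"], drop_first=True)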

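Encoding the training and test sets separately can produce different columns when a category is missing from one of them. The sketch below shows one possible safeguard on made-up rows (the train and test frames are hypothetical): the category list is taken from the training data and applied to both sets before calling get_dummies. This is an illustration under those assumptions, not necessarily how the original notebook handles it.

```python
import pandas as pd

# Hypothetical toy splits; the test split contains only one sex and one region.
train = pd.DataFrame({
    "age": [19, 28, 33, 45],
    "sex": ["female", "male", "male", "female"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})
test = pd.DataFrame({
    "age": [52, 23],
    "sex": ["male", "male"],
    "region": ["southeast", "southeast"],
})

cat_cols = ["sex", "region"]
for col in cat_cols:
    # Fix the category list from the training data and reuse it for the test data.
    categories = sorted(train[col].unique())
    train[col] = pd.Categorical(train[col], categories=categories)
    test[col] = pd.Categorical(test[col], categories=categories)

# drop_first=True removes the first category of each column (female, northeast).
train_enc = pd.get_dummies(train, columns=cat_cols, drop_first=True)
test_enc = pd.get_dummies(test, columns=cat_cols, drop_first=True)

print(list(train_enc.columns))
print(test_enc)
```

Because both frames share the same categories, they end up with identical columns even though the test rows cover only some of the values.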

Building and fitting the model.

Here is the most interesting part of this project. Now that we are done with data cleaning and converting text data to numbers, we can build our model using the code below:

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train_, y_train)

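To show what fitting produces, here is a minimal sketch on synthetic data (X_demo, y_demo, and the chosen weights are made up for illustration). The target is built from known weights without noise, so the fitted coefficients can be compared with the values that generated the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3*x0 - 2*x1 + 5, with no noise added.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0

demo_model = LinearRegression().fit(X_demo, y_demo)

# coef_ holds one weight per feature; intercept_ is the constant term.
print(demo_model.coef_, demo_model.intercept_)
```

On the article's model, the same attributes (lm.coef_ and lm.intercept_) could be read alongside X_train_.columns to see which features push the predicted charges up or down.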

Predicting the test set results.

Remember, we trained our model on 80 percent of our data. Now that we've built the model, we can use it to predict the outcome for the 20 percent we set aside.
Here's the code for the prediction using our test data.

predictions = lm.predict(X_test_)

Now let's check how accurate our model is at predicting the test set results.

Model evaluation:

To evaluate the accuracy of our model, we'll use the R2 score.
The R2 score measures the proportion of the variance in the dependent variable (here, charges) that is explained by the model's predictions.

If the value of the R2 score is 1, the model predicts the test data perfectly; if it's 0, the model does no better than always predicting the mean of the observed values, and it can even be negative for a model that does worse than that.
The closer the R2 value is to 1, the better the model fits the data.
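The R2 score can also be worked out by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this on four made-up values (y_true and y_pred are hypothetical) and compares the result with scikit-learn's r2_score, including the two reference cases of a mean-only prediction and a prediction that is worse than the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical observed values and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)         # 0.25 + 0.25 + 0 + 0.25 = 0.75
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 9 + 1 + 1 + 9 = 20
r2_manual = 1 - ss_res / ss_tot                 # 1 - 0.75 / 20 = 0.9625

# Always predicting the mean scores 0; reversed predictions score below 0.
mean_baseline = np.full_like(y_true, y_true.mean())
r2_baseline = r2_score(y_true, mean_baseline)
r2_reversed = r2_score(y_true, y_true[::-1])

print(r2_manual, r2_score(y_true, y_pred))
print(r2_baseline, r2_reversed)
```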

To check our R2 score, we use the code below:

from sklearn.metrics import r2_score
r2_score(y_test, predictions)

Oops
Not a bad model I must say!

View the entire code here:

https://github.com/heyfunmi/Insurance_Cost_Prediction_using_Machine_Learning_with_Python

See you in another project!
Cheers!!


Original Link: https://dev.to/heyfunmi/insurance-cost-prediction-using-machine-learning-with-python-2gma
