Insurance Cost Prediction using Machine Learning with Python.
Machine learning (ML) is a subset of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so.
Machine learning algorithms use historical data as input to predict new output values.
In this project, I built an end-to-end machine learning project using linear regression.
Data cleaning, extensive data visualization, and exploratory data analysis were also done.
Data Description:
The dataset used for this project is an insurance-focused dataset that contains columns such as age, sex, bmi, and region, among others, which were used to determine the cost of each person's insurance.
Steps
- Importing the necessary libraries: NumPy, pandas, Matplotlib, seaborn, and scikit-learn were imported.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
%matplotlib inline
- Loading in the dataset: The CSV file was loaded using the code below:
Insurance = pd.read_csv("https://raw.githubusercontent
- Information about the data: To get some information about the data, such as the type of data in each column, we use the code below:
Insurance.info()
- Checking the statistical description of the data:
Insurance.describe()
- Checking for the number of rows and columns present in the dataset:
Insurance.shape
Data Cleaning and preparation:
Working with unclean data leads to inaccurate results, so it's necessary to carry out data cleaning before any analysis or prediction is done.
- Checking for null values:
To check for null values in our dataset, we use the code below:
Insurance.isnull().any()
- Checking for duplicates:
Insurance.duplicated().any()
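The two checks above only report whether a problem exists. As a minimal sketch of a possible follow-up, the snippet below counts missing values per column and drops duplicate rows. The small `sample` DataFrame is made-up data for illustration, not the insurance dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the real dataset):
# one missing bmi value and one exact duplicate row.
sample = pd.DataFrame({
    "age": [19, 33, 33, 45],
    "bmi": [27.9, 22.7, 22.7, np.nan],
    "charges": [16884.92, 21984.47, 21984.47, 7281.51],
})

null_counts = sample.isnull().sum()             # missing values per column
n_duplicates = int(sample.duplicated().sum())   # rows identical to an earlier row

# Drop duplicate rows first, then any row that still has a missing value.
cleaned = sample.drop_duplicates().dropna().reset_index(drop=True)

print(null_counts)
print("duplicate rows:", n_duplicates)
print("shape after cleaning:", cleaned.shape)
```

Whether to drop or fill missing values depends on the data; dropping is just the simplest choice for a sketch.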
Exploratory Data Analysis:
Exploratory data analysis helps in understanding the patterns, trends, and metrics in a dataset. It also helps in detecting outliers and anomalous events.
- Using a correlation matrix to check for correlations among the columns in the dataset:
sns.heatmap(Insurance.select_dtypes(include="number").corr())
The correlation matrix shows there's little or no correlation between age and charges.
- Checking the distribution pattern of the charges column:
sns.histplot(Insurance['charges'], kde=True)
- Plotting a pairplot to check the relationship between each pair of columns:
sns.pairplot(Insurance);
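A heatmap shows correlations as colours, which makes exact values hard to read. As a rough sketch, the snippet below computes the correlation matrix on the numeric columns only (text columns such as sex cannot be correlated directly) and pulls out a single coefficient. The `toy` DataFrame is invented for illustration and is not the insurance data.

```python
import pandas as pd

# Hypothetical toy data with one text column, loosely shaped like the dataset.
toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 60],
    "sex": ["female", "male", "female", "male", "female"],
    "bmi": [22.0, 31.5, 27.0, 24.5, 29.0],
    "charges": [2100.0, 4800.0, 5200.0, 6100.0, 9000.0],
})

# Keep only numeric columns before computing Pearson correlations.
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()

# Read one coefficient straight from the matrix.
age_vs_charges = corr.loc["age", "charges"]

print(corr.round(2))
print("age vs charges:", round(age_vs_charges, 2))
```

Passing `annot=True` to `sns.heatmap` is another way to see the numbers, since it prints each coefficient inside its cell.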
Extracting dependent and independent variables:
The dependent variable in this case is charges, while the independent variables are the other columns.
X = Insurance.drop(columns=["charges"])
X.head(5)
y = Insurance["charges"]
y
Splitting the dataset into train and test sets.
To build a machine learning model, you train it on one part of the data and use the remaining part to test the model you've built.
So we split our data into train data and test data, using 80 percent to train the model and the other 20 percent to test it.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train.head()
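To see what `test_size=0.2` and `random_state=0` do in practice, here is a small sketch on made-up arrays (the `X_demo` and `y_demo` names are placeholders, not part of the project). It checks the 80/20 row counts and shows that the same seed reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows with 2 features each, and 10 targets.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print("train rows:", X_tr.shape[0], "test rows:", X_te.shape[0])

# Repeating the call with the same random_state gives the same rows again.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print("same test rows:", np.array_equal(X_te, X_te2))
```

Fixing `random_state` makes results repeatable between runs; leaving it unset gives a different split each time.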
One hot encoding to transform categorical text data
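The data contains some columns that hold text, such as sex and region.
Since we can't build the model with text data, we need to convert it into numbers.
Taking the sex column as an example, female is encoded as 0 and male as 1.
We can do this using one-hot encoding. The prediction step later uses X_test_, so the test set needs the same encoding; one way to do it, reindexing to the training columns so both sets match, is:
X_train_ = pd.get_dummies(X_train, columns=["sex", "smoker", "region"], drop_first=True)
X_test_ = pd.get_dummies(X_test, columns=["sex", "smoker", "region"]).reindex(columns=X_train_.columns, fill_value=0)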
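One caveat with `pd.get_dummies` is that it only creates columns for the categories it actually sees, so a train set and a test set encoded separately can end up with different columns. The sketch below, on made-up `train` and `test` frames, shows the mismatch and one possible fix: encode the test set without `drop_first` and reindex it to the training columns.

```python
import pandas as pd

# Hypothetical toy frames: the test set happens to contain only one region.
train = pd.DataFrame({
    "age": [19, 28, 33],
    "region": ["southwest", "northeast", "southeast"],
})
test = pd.DataFrame({
    "age": [45, 52],
    "region": ["southeast", "southeast"],
})

train_enc = pd.get_dummies(train, columns=["region"], drop_first=True, dtype=int)

# Naive approach: same call on the test set. Its only category is dropped,
# so no region columns survive at all.
test_naive = pd.get_dummies(test, columns=["region"], drop_first=True, dtype=int)

# Possible fix: keep all test dummies, then match the training columns,
# filling categories absent from the test set with 0.
test_aligned = (
    pd.get_dummies(test, columns=["region"], dtype=int)
    .reindex(columns=train_enc.columns, fill_value=0)
)

print(list(train_enc.columns))
print(list(test_naive.columns))
print(test_aligned)
```

A fitted encoder such as scikit-learn's `OneHotEncoder` avoids the issue in a different way, because it remembers the categories learned from the training data.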
Building and fitting the model.
This is the most interesting part of the project. Now that we are done with data cleaning and converting text data to numbers, we can build our model using the code below:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train_, y_train)
Predicting the test set results.
Remember, we trained our model on 80 percent of our data. Now that we've built the model, we can use it to predict the outcome for the 20 percent we set aside.
Here's the code for the prediction using our test data.
predictions = lm.predict(X_test_)
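As an alternative to encoding by hand, the encoding and the regression can be bundled into one scikit-learn `Pipeline`, so that the test data always goes through the same transformation as the training data. This is a different technique from the one used in this post, sketched here on synthetic data: the `demo` DataFrame, its made-up formula for charges, and all `_demo` names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (NOT the insurance dataset).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 5, size=n).round(1),
    "smoker": rng.choice(["yes", "no"], size=n),
})
# Made-up linear rule plus noise, only so the model has something to learn.
demo["charges"] = (
    250 * demo["age"]
    + 300 * demo["bmi"]
    + 20000 * (demo["smoker"] == "yes")
    + rng.normal(0, 1000, size=n)
)

X_demo = demo.drop(columns=["charges"])
y_demo = demo["charges"]
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# One-hot encode the text column; pass numeric columns through unchanged.
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(drop="first"), ["smoker"])],
    remainder="passthrough",
)
model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", LinearRegression()),
])

model.fit(X_train_demo, y_train_demo)
predictions_demo = model.predict(X_test_demo)
score_demo = model.score(X_test_demo, y_test_demo)  # R2 on the held-out rows

print("predictions:", len(predictions_demo))
print("R2 on synthetic test data:", round(score_demo, 3))
```

The score printed here says nothing about the real dataset; it only confirms that the pipeline runs end to end.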
Now let's check how accurately our model predicts the test set results.
Model evaluation:
To evaluate our model, we'll use the R2 score.
The R2 score measures the proportion of the variance in the dependent variable (here, charges) that the model's predictions explain.
A score of 1 means the predictions match the actual values perfectly, while a score of 0 means the model does no better than always predicting the mean of the actual values. The score can even be negative if the model does worse than that.
The closer the R2 score on the test set is to 1, the better the model's predictions fit unseen data.
To check our R2 score, we use the code below:
from sklearn.metrics import r2_score
r2_score(y_test, predictions)
Not a bad model I must say!
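To make the R2 score less of a black box, here is a small sketch that computes it by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, and compares the result with scikit-learn's `r2_score`. The helper name `r2_by_hand` and the four sample values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_by_hand(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after predicting
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]                       # made-up actual values, mean 6.0

perfect = r2_by_hand(y_true, y_true)                # 1.0: every prediction exact
mean_only = r2_by_hand(y_true, [6.0] * 4)           # 0.0: always predict the mean
worse = r2_by_hand(y_true, [9.0, 7.0, 5.0, 3.0])    # -3.0: worse than the mean

print(perfect, mean_only, worse)
print(abs(worse - r2_score(y_true, [9.0, 7.0, 5.0, 3.0])) < 1e-12)
```

The three cases show why 1 is the best possible score, why 0 corresponds to a model that ignores its inputs, and that nothing stops the score from going below 0.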
View the entire code here:
https://github.com/heyfunmi/Insurance_Cost_Prediction_using_Machine_Learning_with_Python
See you in another project!
Cheers!!
Original Link: https://dev.to/heyfunmi/insurance-cost-prediction-using-machine-learning-with-python-2gma