January 26, 2022 11:28 am GMT

Personality Prediction Using K-Means Clustering Algorithm and Django Rest Framework (Article Written Partly By GPT3)

Introduction

Artificial Intelligence is revolutionizing the world as we know it through technical transformations. Machine learning is applied in our day-to-day lives, and one of its incredible applications is classifying individuals based on their personality traits. Each person on this planet is unique and carries a unique personality. The availability of large, high-dimensional datasets has paved the way for making marketing campaigns more effective by targeting specific people. Such personality-based communications are highly effective in increasing the popularity and attractiveness of products and services.

In this article, we build a model to predict personality based on 50 questions and deploy the model using Django Rest Framework.

I will also share a snapshot of the code I used to get GPT3 to help me generate content for this article. All the text below, except the 'Steps to implement' section, was generated by GPT3, a natural-language processing system that can generate tweets, pen poetry, summarize emails, answer trivia questions, translate languages and even write its own computer programs.

K-means Clustering Algorithm.

The K-Means clustering algorithm is used in unsupervised machine learning. Put simply, it groups the data into clusters and assigns a label to each cluster. K-Means is an iterative algorithm: it picks K starting centroids, then repeats the assignment and update steps below until the cluster assignments stop changing.

Choose K initial cluster centroids (K is the number of clusters the data will be sorted into)
Assign each data point to the closest cluster (the algorithm calculates the distance between the point and each cluster centroid, then assigns the point to the closest one)
Update the cluster centroids (each centroid becomes the average of all the data points in its cluster)
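
The loop described above can be sketched in a few lines of plain Python. This is only an illustration, not the scikit-learn implementation used later in this article: the function name kmeans, the toy points, and the choice of taking the first K points as starting centroids are all assumptions made to keep the example small and deterministic.

```python
import math


def kmeans(points, k, max_iter=100):
    """Plain k-means on a list of equal-length tuples of numbers."""
    # Initialise the centroids. Taking the first k points keeps the sketch
    # deterministic; real implementations usually pick them at random.
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels, centroids


# Two visibly separate groups of toy points.
points = [(1.0, 1.0), (5.0, 5.0), (1.0, 2.0), (5.0, 4.0), (2.0, 1.0), (4.0, 5.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

With these toy points the loop settles after two passes, because the second assignment step leaves every point in the same cluster as before.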

Applications of K-means Algorithm.

The k-means clustering algorithm is used in a variety of applications. It is used in marketing, advertising, market segmentation, and customer segmentation. It is also used in various scientific fields, for example in determining someone's personality.

Five Personality Traits (OCEAN)

The big five personality traits are:

Openness to experience
Conscientiousness
Extraversion
Agreeableness
Neuroticism.

These are the five traits that human behavior is divided into. People can score between 1 and 5 in each of the five traits. For example, let's say that someone scores low in openness to experience, say 2. This means that the person is not very open to new experiences: they are comfortable with their own ways, prefer to stick to them, and tend to avoid trying new things. Similarly, a person who scores low in extraversion is not very sociable. At a party, they speak little, stay in a corner, and watch other people rather than going to talk to them.

Now that we have a basic understanding of the k-means algorithm and the big five personality traits, let us dive into building the Big Five personality model.

Steps to implement.

Let us start by building and saving our model that will be later used to make predictions for our API.

Dataset.

Our dataset consists of 1,015,342 questionnaire answers collected online by Open Psychometrics. Let's look at how the dataset actually appears. The dataset and the entire code for this blog can be found on my GitHub repo.

We first import the necessary dependencies. If you do not have the libraries installed, kindly do so before proceeding.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
import joblib

Next, we load the dataset using pandas and then we display the number of participants:

# The file is tab-separated
data_raw = pd.read_csv('data-final.csv', sep='\t')
data = data_raw.copy()
pd.options.display.max_columns = 150
# Keep the 50 answer columns plus the country column
data.drop(data.columns[50:107], axis=1, inplace=True)
data.drop(data.columns[51:], axis=1, inplace=True)
print('Number of participants: ', len(data))
data.head()

Number of participants:  1015341

Exploration of Dataset.

Let us begin by checking for missing values and removing the missing values, like so:

print('Is there any missing value? ', data.isnull().values.any())
print('How many missing values? ', data.isnull().values.sum())
data.dropna(inplace=True)
print('Number of participants after eliminating missing values: ', len(data))

Is there any missing value?  True
How many missing values?  89227
Number of participants after eliminating missing values:  1013481

Let us now look at the participant distribution per nationality, by entering the code:

# Participants' nationality distribution
# value_counts() returns a Series indexed by country; filtering the Series
# directly avoids depending on the column name, which differs between pandas versions
countries = data['country'].value_counts()
countries_5000 = countries[countries >= 5000]
plt.figure(figsize=(15,5))
sns.barplot(x=countries_5000.index, y=countries_5000.values)
plt.title('Countries With More Than 5000 Participants')
plt.ylabel('Participants');

Let us now visualize the question and answer distribution of the questionnaires, like so:

# Defining a function to visualize the questions and answers distribution
def vis_questions(groupname, questions, color):
    plt.figure(figsize=(40,60))
    for i in range(1, 11):
        plt.subplot(10,5,i)
        plt.hist(data[groupname[i-1]], bins=14, color= color, alpha=.5)
        plt.title(questions[groupname[i-1]], fontsize=18)
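
The calls to vis_questions use names such as EXT and ext_questions that are not defined anywhere in this article. As a rough sketch of what they might look like, assuming the column names EXT1 to EXT10 seen in the dataset: the helper question_columns and the placeholder question texts below are hypothetical, and the real question wording would have to come from the dataset's documentation.

```python
# Hypothetical reconstruction: these definitions are not shown in the article.
GROUPS = ("EXT", "EST", "AGR", "CSN", "OPN")


def question_columns(prefix, n_questions=10):
    """Return the column names for one trait, e.g. EXT1 .. EXT10."""
    return [f"{prefix}{i}" for i in range(1, n_questions + 1)]


# List of column names for the extroversion questions.
EXT = question_columns("EXT")

# Maps each column name to the title shown above its histogram.
# Placeholder titles only; swap in the real question wording.
ext_questions = {name: f"Question text for {name}" for name in EXT}

print(EXT)
print(ext_questions["EXT1"])
```

The other four groups (EST, AGR, CSN, OPN) would follow the same pattern.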

Q&As Related to Extroversion Personality

print('Q&As Related to Extroversion Personality')
vis_questions(EXT, ext_questions, 'orange')

Q&As Related to Neuroticism Personality

print('Q&As Related to Neuroticism Personality')
vis_questions(EST, est_questions, 'pink')

Q&As Related to Agreeable Personality

print('Q&As Related to Agreeable Personality')
vis_questions(AGR, agr_questions, 'red')

Q&As Related to Conscientious Personality

print('Q&As Related to Conscientious Personality')
vis_questions(CSN, csn_questions, 'purple')

Q&As Related to Open Personality

print('Q&As Related to Open Personality')
vis_questions(OPN, opn_questions, 'blue')

Building the Model

We already know the number of clusters that our model will have: 5, matching the big five personality traits. Let us see how we can arrive at this value using code. For ease of calculation, we scale all the values to the range 0-1 and use only a sample of the first 5000 rows, like so:

df = data.drop('country', axis=1)
columns = list(df.columns)
scaler = MinMaxScaler(feature_range=(0,1))
df = scaler.fit_transform(df)
df = pd.DataFrame(df, columns=columns)
df_sample = df[:5000]
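
To make the scaling step concrete, here is a small dependency-free sketch of the min-max formula, scaled = (x - min) / (max - min), applied to a single column. The function name min_max_scale and the sample answers are made up for illustration; MinMaxScaler applies the same idea to every column, using the minimum and maximum it observes in that column.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Rescale one column of numbers linearly into feature_range."""
    lo, hi = feature_range
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant column has no spread; map every value to the lower bound.
        return [lo for _ in values]
    return [lo + (v - v_min) * (hi - lo) / (v_max - v_min) for v in values]


# Answers on the 1-5 scale used by the questionnaire.
scaled = min_max_scale([1, 2, 3, 4, 5])
print(scaled)
```

An answer of 1 maps to 0, an answer of 5 maps to 1, and the midpoint 3 maps to 0.5.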

Let us now visualize our elbow curve. In cluster analysis, the elbow method is a heuristic used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters, and picking the elbow of the curve as the number of clusters to use.

kmeans = KMeans()
visualizer = KElbowVisualizer(kmeans, k=(2,15))
visualizer.fit(df_sample)
visualizer.show()  # poof() was renamed to show() in Yellowbrick 1.0

As you can see, 5 clusters looks optimal for the data set, and we already know this research is meant to identify 5 different personalities.
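
For intuition about what the elbow curve plots, the sketch below computes the distortion score by hand: the sum of squared distances from each point to its closest centroid, which is the kind of quantity the visualizer plots for each candidate number of clusters. The toy points and hand-picked centroids are assumptions for illustration only; in practice the centroids come from fitting k-means for each k.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def distortion(points, centroids):
    """Sum of squared distances from each point to its closest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


# Two tight pairs of points that are far apart from each other.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# One centroid in the middle of everything versus one centroid per pair.
one_cluster = distortion(points, [(5.0, 1.0)])
two_clusters = distortion(points, [(0.0, 1.0), (10.0, 1.0)])
print(one_cluster, two_clusters)
```

Going from one cluster to two cuts the distortion sharply here, and adding more clusters after that could only shave off the small remainder. That sharp bend followed by a flat tail is the elbow.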

Clustering Participants into 5 Personality Groups

To do this, we use the unscaled data but without the country column.

df_model = data.drop('country', axis=1)

# I define 5 clusters and fit my model
kmeans = KMeans(n_clusters=5)
k_fit = kmeans.fit(df_model)

## SAVE KFIT MODEL
joblib_file = "BigFivePersonalityTestModel.joblib"
joblib.dump(k_fit, joblib_file)

# Predicting the Clusters
pd.options.display.max_columns = 10
predictions = k_fit.labels_
df_model['Clusters'] = predictions
df_model.head()

   EXT1  EXT2  EXT3  EXT4  EXT5  ...  OPN7  OPN8  OPN9  OPN10  Clusters
0   4.0   1.0   5.0   2.0   5.0  ...   5.0   3.0   4.0    5.0         3
1   3.0   5.0   3.0   4.0   3.0  ...   4.0   2.0   5.0    3.0         2
2   2.0   3.0   4.0   4.0   3.0  ...   5.0   3.0   4.0    4.0         2
3   2.0   2.0   2.0   3.0   4.0  ...   4.0   4.0   3.0    3.0         1
4   3.0   3.0   3.0   3.0   5.0  ...   5.0   3.0   5.0    5.0         3

Analysing the Model and Predictions

Let us see how many individuals we have for each cluster, like so:

df_model.Clusters.value_counts()

4    227063
2    212816
3    210075
0    200226
1    163301
Name: Clusters, dtype: int64

Let's group the results according to clusters. That way we can investigate the average answer to each question for each cluster.

That way we can have an intuition about how our model classifies people.

pd.options.display.max_columns = 150
df_model.groupby('Clusters').mean()

Let's now sum up each question group (EXT, EST, ...) and see if we can spot a pattern.

# Summing up the different questions groups
col_list = list(df_model)
ext = col_list[0:10]
est = col_list[10:20]
agr = col_list[20:30]
csn = col_list[30:40]
opn = col_list[40:50]

data_sums = pd.DataFrame()
data_sums['extroversion'] = df_model[ext].sum(axis=1)/10
data_sums['neurotic'] = df_model[est].sum(axis=1)/10
data_sums['agreeable'] = df_model[agr].sum(axis=1)/10
data_sums['conscientious'] = df_model[csn].sum(axis=1)/10
data_sums['open'] = df_model[opn].sum(axis=1)/10
data_sums['clusters'] = predictions
data_sums.groupby('clusters').mean()

          extroversion  neurotic  agreeable  conscientious      open
clusters
0             2.965969  3.645931   3.148628       3.173210  3.245529
1             2.909467  2.525743   2.851802       2.914458  3.120373
2             3.051889  2.984940   3.187544       3.159140  3.243641
3             3.085431  2.423577   3.209064       3.106899  3.327173
4             3.072319  3.426610   3.300147       3.211454  3.352370

Let us now visualize the mean for each of our 5 personality clusters:

# Visualizing the means for each cluster
dataclusters = data_sums.groupby('clusters').mean()
plt.figure(figsize=(22,3))
for i in range(0, 5):
    plt.subplot(1,5,i+1)
    # Row i holds the five trait means of cluster i
    plt.bar(dataclusters.columns, dataclusters.iloc[i, :], color='green', alpha=0.2)
    plt.plot(dataclusters.columns, dataclusters.iloc[i, :], color='red')
    plt.title('Cluster ' + str(i))
    plt.xticks(rotation=45)
    plt.ylim(0,4);

Visualizing the Cluster Predictions

pca = PCA(n_components=2)
# Project only the 50 answer columns, not the cluster labels added earlier
pca_fit = pca.fit_transform(df_model.drop('Clusters', axis=1))
df_pca = pd.DataFrame(data=pca_fit, columns=['PCA1', 'PCA2'])
df_pca['Clusters'] = predictions
df_pca.head()

plt.figure(figsize=(10,10))
sns.scatterplot(data=df_pca, x='PCA1', y='PCA2', hue='Clusters', palette='Set2', alpha=0.8)
plt.title('Personality Clusters after PCA');

Predict Personality

I answered the questions in a Microsoft Excel spreadsheet. Then I loaded that data into this notebook and passed my answers to the model to see which category I fall into.

my_data = pd.read_excel('my_personality_test.xlsx')
my_personality = k_fit.predict(my_data)
print('My Personality Cluster: ', my_personality)

My Personality Cluster:  [2]

Visualizing my personality cluster

col_list = list(my_data)
ext = col_list[0:10]
est = col_list[10:20]
agr = col_list[20:30]
csn = col_list[30:40]
opn = col_list[40:50]

my_sums = pd.DataFrame()
my_sums['extroversion'] = my_data[ext].sum(axis=1)/10
my_sums['neurotic'] = my_data[est].sum(axis=1)/10
my_sums['agreeable'] = my_data[agr].sum(axis=1)/10
my_sums['conscientious'] = my_data[csn].sum(axis=1)/10
my_sums['open'] = my_data[opn].sum(axis=1)/10
my_sums['cluster'] = my_personality
print('Sum of my question groups')
my_sums

my_sum = my_sums.drop('cluster', axis=1)
plt.bar(my_sum.columns, my_sum.iloc[0,:], color='green', alpha=0.2)
plt.plot(my_sum.columns, my_sum.iloc[0,:], color='red')
plt.title('Cluster 2')
plt.xticks(rotation=45)
plt.ylim(0,4);

Now that our model works, let us proceed by turning it into a RESTful API.

Turning the Model into a RESTful API

Following Python best practices, we will create a virtual environment for our project, and install the required packages.

First, create the project directory.

$ mkdir djangoapp
$ cd djangoapp

Now, create a virtual environment and install the required packages.

For macOS and Unix systems:

$ python3 -m venv myenv
$ source myenv/bin/activate
(myenv) $ pip install django djangorestframework requests numpy pandas joblib scikit-learn xlsxwriter openpyxl

For Windows:

$ python3 -m venv myenv
$ myenv\Scripts\activate
(myenv) $ pip install django djangorestframework requests numpy pandas joblib scikit-learn xlsxwriter openpyxl

Setting Up Your Django Application

First, navigate to the directory djangoapp we created and establish a Django project.

(myenv) $ django-admin startproject mainapp

This will auto-generate some files for your project skeleton:

mainapp/
    manage.py
    mainapp/
        __init__.py
        settings.py
        urls.py
        asgi.py
        wsgi.py

Now, navigate to the directory you just created (make sure you are in the same directory as manage.py) and create your app directory.

(myenv) $ python manage.py startapp monitor

This will create the following:

monitor/
    __init__.py
    admin.py
    apps.py
    migrations/
        __init__.py
    models.py
    tests.py
    views.py

In the mainapp/settings.py file, look for the following list and add the app we just created.

INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'rest_framework',  # Django REST Framework, used by the API view
    'monitor',  # new line
]

Make sure you are in the monitor directory, then create a new directory called templates and a new file called urls.py. The directory structure of the monitor application should look like this:

monitor/
    __init__.py
    admin.py
    apps.py
    migrations/
        __init__.py
    templates/
    models.py
    tests.py
    urls.py
    views.py

In your mainapp/urls.py file, include the monitor app's URLs, which we shall create next in the monitor app:

from django.contrib import admin
from django.urls import path, include

urlpatterns = [
    #path('admin/', admin.site.urls),
    path('', include('monitor.urls')),  # monitor app url
]

Now, in the monitor/urls.py file, add the route for our prediction endpoint:

from django.urls import path
from .views import PersonalityPrediction

urlpatterns = [
    path('persona', PersonalityPrediction.as_view(), name='personality'),
]

Let's create another directory to store our machine learning model. I'll also add the dataset to the project for those who want to access the whole dataset. (It is not compulsory to create a data folder.) Be sure to move the joblib file we created earlier to the ml/models folder.

(myenv) $ mkdir ml
(myenv) $ mkdir ml/models
(myenv) $ mkdir ml/data

We also need to tell Django where our machine learning model is located. Add these lines to settings.py file:

import os

MODELS = os.path.join(BASE_DIR, 'ml/models')
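
To see where that setting ends up pointing, here is a small standalone sketch. The BASE_DIR value is a made-up example path; in a real project, settings.py derives it from the location of the settings file, so it normally resolves to the directory that contains manage.py, which is why the ml folder needs to sit next to manage.py.

```python
import os
from pathlib import Path

# Hypothetical project location, standing in for the BASE_DIR that
# Django's generated settings.py computes from its own file path.
BASE_DIR = Path("/srv/djangoapp/mainapp")

# Same expression as in settings.py: os.path.join accepts Path objects.
MODELS = os.path.join(BASE_DIR, "ml/models")

# The full path the app will try to load the saved model from.
MODEL_FILE = os.path.join(MODELS, "BigFivePersonalityTestModel.joblib")
print(MODEL_FILE)
```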

Load the Model through apps.py

Load the saved model in apps.py so that, when the application starts, the trained model is loaded only once. Otherwise, the trained model would be loaded each time an endpoint is called, and the response time would be slower.

Let's update apps.py:

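import os
import joblib
from django.apps import AppConfig
from django.conf import settings

# The class name must match the one used in views.py,
# and name must match the app we created ('monitor')
class PersonalityTestConfig(AppConfig):
    name = 'monitor'
    # Runs once, when this module is first imported
    MODEL_FILE = os.path.join(settings.MODELS, "BigFivePersonalityTestModel.joblib")
    model = joblib.load(MODEL_FILE)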
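
The effect of loading once versus loading per request can be shown with a tiny standalone sketch. It does not use Django or joblib: get_model is a hypothetical stand-in for the expensive joblib.load call, and functools.lru_cache is just one way to get load-once behaviour (the class attribute in apps.py achieves the same thing because a class body runs a single time, at import).

```python
from functools import lru_cache

load_calls = 0  # counts how often the expensive load actually runs


@lru_cache(maxsize=1)
def get_model():
    """Stand-in for joblib.load(MODEL_FILE); returns a dummy object."""
    global load_calls
    load_calls += 1
    return {"name": "BigFivePersonalityTestModel"}


# Two "requests" ask for the model, but it is only loaded the first time.
first = get_model()
second = get_model()
print(load_calls)
```

Both calls return the very same object, so every request after the first skips the loading cost.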

Edit views.py

The view is mainly responsible for processing incoming POST requests.

from rest_framework.views import APIView
from rest_framework.response import Response
from .apps import PersonalityTestConfig
import pandas as pd
import xlsxwriter

# Column names in the order the model was trained on:
# EXT1..EXT10, EST1..EST10, AGR1..AGR10, CSN1..CSN10, OPN1..OPN10
QUESTION_COLUMNS = [
    group + str(i)
    for group in ('EXT', 'EST', 'AGR', 'CSN', 'OPN')
    for i in range(1, 11)
]

# kmeans personality prediction view
class PersonalityPrediction(APIView):
    def post(self, request):
        # get user input, one answer per question
        answers = [request.data.get(name) for name in QUESTION_COLUMNS]

        # load model
        kmeansModel = PersonalityTestConfig.model

        # Write the answers to an Excel file:
        # column names in the first row, answers in the second row
        workbook = xlsxwriter.Workbook('hello.xlsx')
        worksheet = workbook.add_worksheet()
        for col, (name, value) in enumerate(zip(QUESTION_COLUMNS, answers)):
            worksheet.write(0, col, name)
            worksheet.write(1, col, value)
        # Finally, close the Excel file
        # via the close() method.
        workbook.close()

        my_data = pd.read_excel('hello.xlsx', engine='openpyxl')
        my_personality = kmeansModel.predict(my_data)
        print('My Personality Cluster: ', my_personality)

        # Summing up my question groups
        col_list = list(my_data)
        ext = col_list[0:10]
        est = col_list[10:20]
        agr = col_list[20:30]
        csn = col_list[30:40]
        opn = col_list[40:50]

        my_sums = pd.DataFrame()
        my_sums['extroversion'] = my_data[ext].sum(axis=1)/10
        my_sums['neurotic'] = my_data[est].sum(axis=1)/10
        my_sums['agreeable'] = my_data[agr].sum(axis=1)/10
        my_sums['conscientious'] = my_data[csn].sum(axis=1)/10
        my_sums['open'] = my_data[opn].sum(axis=1)/10
        my_sums['cluster'] = my_personality
        print('Sum of my question groups')
        print(my_sums)

        response_dict = {"Prediction": my_sums}
        print(response_dict)
        return Response(response_dict, status=200)

The code above starts by reading the data sent in the request body. It then programmatically creates a new Excel file, which is used for the prediction, and returns the cluster and the trait scores as a response.
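
As a possible alternative to the Excel round trip, and not the approach this article uses, the answers could be turned into a feature row directly. The sketch below is plain Python under that assumption: answers_to_row and trait_means are hypothetical helper names, and the payload is made-up sample data. A row like this could then be passed to the model as model.predict([row]), or wrapped in a DataFrame with the same column names used for training.

```python
GROUPS = ("EXT", "EST", "AGR", "CSN", "OPN")
# EXT1..EXT10, EST1..EST10, ... in the order used for training.
COLUMNS = [f"{g}{i}" for g in GROUPS for i in range(1, 11)]


def answers_to_row(payload):
    """Turn a dict of answers into a list of 50 floats in column order."""
    missing = [name for name in COLUMNS if name not in payload]
    if missing:
        raise ValueError(f"missing answers: {', '.join(missing)}")
    return [float(payload[name]) for name in COLUMNS]


def trait_means(row):
    """Average each block of 10 answers, one block per trait."""
    return {g: sum(row[i * 10:(i + 1) * 10]) / 10 for i, g in enumerate(GROUPS)}


# Sample request body: every answer is 3, except EXT1 which is 5.
payload = {name: 3 for name in COLUMNS}
payload["EXT1"] = 5

row = answers_to_row(payload)
means = trait_means(row)
print(means)
```

Checking for missing keys up front also gives the API a natural place to return an error for incomplete questionnaires instead of failing inside the prediction.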

Testing our model.

To test the system, make the necessary migrations and run the Django server. Open Postman and make a POST request to the server, with the answers to the questionnaire in the body of the request. You should get a response showing your personality cluster and your scores.
To test my live system, make a POST request to the URL below:

https://alienx.tech/api/v1/persona
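
If you prefer a script to Postman, a request like this can be built with Python's standard library. This is only a sketch: the local address below assumes the Django development server is running on its default port with the persona route defined earlier, and the answers (all 3s) are placeholder values. The request is built but not sent; the commented lines show how it could be sent.

```python
import json
import urllib.request

GROUPS = ("EXT", "EST", "AGR", "CSN", "OPN")
# Placeholder answers: 3 for each of the 50 questions.
answers = {f"{g}{i}": 3 for g in GROUPS for i in range(1, 11)}


def build_request(url, answers):
    """Build a JSON POST request carrying the questionnaire answers."""
    body = json.dumps(answers).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local development URL; swap in your own server address.
req = build_request("http://127.0.0.1:8000/persona", answers)
print(req.get_method(), req.full_url)

# To actually send it (the server must be running):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```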

The code below is what I used to prompt GPT3 to help me generate the content in this blog:

import openai

openai.api_key = "VISIT OPENAI TO GET YOUR KEY"
response = openai.Completion.create(
  engine="davinci",
  prompt="The Pixel District Janury 16, 2022\n Title: Personality Segmentation Using K-Means Clustering Algorithm and Django Rest Framework!\n tags: machine-learning, kmeans, gpt3, kmeans code sample Summary:  I am sharing my exprience in implementing kmeans clustering algorithm\n in determining someones personality. I am explaining why kmeans clustering algorithms is, how it works configuration. I am explaining what the big five personality traits are. I am explaining why I think kmeans algorithm is the best to use in finding someones personality based on the  big five personality traits. I am also adding various example codes of the kmeans clustering to find the big five personality traits.\n Full text: ",
  temperature=0.7,
  max_tokens=1655,
  top_p=1,
  frequency_penalty=0,
  presence_penalty=0
)
print(response)

That's it for today. Thanks for staying tuned in!

Original Link: https://dev.to/paulwababu/personality-prediction-using-k-means-clustering-algorithm-and-django-rest-framework-article-written-partly-by-gpt3-22oe
