June 15, 2022 02:13 pm GMT

Fake News Detection with Machine Learning and Flask

Introduction

The world has become increasingly digital, and there is an abundance of data available. Not all of this data can be checked before it is released into the world. As the amount of data grows, some of it will be true and the rest will be false. Not every source can be independently verified, and doing so manually is impossible.

Machine Learning occupies a unique position here: used correctly, it can build a model from a trusted dataset that can then be used to sort through news. This project tries to develop a model that analyzes text to determine whether a piece of news is true or fake.

Diving into the Project

The Data

The data used for this project was obtained from the Fake and real news dataset on Kaggle. For a simple guide on loading data from Kaggle into Google Colab, check out this blog post
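
As a minimal sketch, assuming the dataset's two CSV files (True.csv and Fake.csv on Kaggle) have already been downloaded into the working directory, loading could look like this:

import pandas as pd

# Assumed filenames from the Kaggle "Fake and real news dataset";
# adjust the paths if yours differ
true = pd.read_csv('True.csv')
fake = pd.read_csv('Fake.csv')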

Data Cleaning

After the data has been loaded, a bit of cleaning needs to be done.

true['label'] = 1
fake['label'] = 0

Cleaning at this stage ensures the target is expressed as numbers so that the model can interpret the information: true news is labelled as 1, while fake news is labelled as 0.

To speed up the experiment, only the first 5000 data points from each dataset are used and then combined into a single data frame.

frames = [true.loc[:5000][:], fake.loc[:5000][:]]
df = pd.concat(frames)
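
One thing worth noting: .loc slicing in pandas is inclusive of the end label, so .loc[:5000] actually keeps 5,001 rows from each set; with a default RangeIndex, .iloc[:5000] would keep exactly 5,000.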

X and y datasets are then created, dividing the data frame into features and labels.

X = df.drop('label', axis=1)
y = df['label']

Missing values are then dropped, and a copy of the data frame is created for later use.

df = df.dropna()
df2 = df.copy()
df2.reset_index(inplace=True)

Text Preprocessing

Preprocessing is the process of converting data into a format that a computer can understand and then use. For text data, a common preprocessing step is removing useless data, referred to as stop words. Stop words are commonly used words that programs and search engines have been instructed to ignore. Examples include 'a', 'i', 'me', 'my', 'the', and 'you'.
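
As a quick way to see these, the English stop-word list shipped with nltk (the package used later in this project) can be printed directly, assuming nltk is installed:

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# Show the first few English stop words, e.g. ['i', 'me', 'my', ...]
print(stopwords.words('english')[:10])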

Continuing with the fake news project, preprocessing is done with nltk, a Python package for text preprocessing.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

After importing the required libraries, stemming is the next step. This bit involves removing all punctuation, lowercasing all capitalized characters, removing all stopwords, and then stemming. Stemming is the process where words in the dataset are reduced to their base forms. For example, words like "likes", "liked", "likely", and "liking" are reduced to "like". This is required to eliminate redundancy in the data the model sees.
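
As a small sketch of stemming in isolation, using the same PorterStemmer as the pipeline below:

from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
# Each variant reduces to the base form 'like'
print([ps.stem(w) for w in ['likes', 'liked', 'likely', 'liking']])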

Regex is used in this section; if you're not familiar with it, you can get an introduction here

nltk.download('stopwords')
ps = PorterStemmer()
corpus = []
for i in range(0, len(df2)):
    review = re.sub('[^a-zA-Z]', ' ', df2['text'][i])
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if not word in stopwords.words('english')]
    review = ' '.join(review)
    corpus.append(review)

The next step involves word embedding. Word embedding is a method for extracting features from text data so that machine learning models can work with it. There are different techniques such as Word2Vec, GloVe, and BERT, but TF-IDF (strictly a bag-of-words weighting scheme rather than a learned embedding) is sufficient for this project.

TF-IDF (term frequency-inverse document frequency) is a statistical method for capturing how significant a term is to a document relative to the corpus as a whole. It's ideal for information retrieval and for extracting keywords from a document.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_v = TfidfVectorizer(max_features=5000, ngram_range=(1,3))
X = tfidf_v.fit_transform(corpus).toarray()
y = df2['label']
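
Here max_features=5000 caps the vocabulary at the 5,000 most frequent terms across the corpus, and ngram_range=(1,3) considers unigrams, bigrams, and trigrams as features.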

Once that is done, the next step involves splitting the dataset into train and test sets.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
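
Here test_size=0.2 holds out 20% of the data for evaluation, and random_state=0 fixes the shuffle so the split is reproducible.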

Training and Validating the Model

The data has been split and is ready for modelling. For this project, the PassiveAggressiveClassifier is used. It is an online learning algorithm that stays passive when it classifies an example correctly and turns aggressive (updating its weights) when it makes a mistake, which makes it a good fit for detecting fake news. Other algorithms could be used at this step, such as logistic regression, XGBoost, or neural networks. For a more detailed explanation, check here

from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn import metrics
import numpy as np
import itertools

classifier = PassiveAggressiveClassifier(max_iter=1000)
classifier.fit(X_train, y_train)
pred = classifier.predict(X_test)
score = metrics.accuracy_score(y_test, pred)
print("accuracy:   %0.3f" % score)
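
Here max_iter=1000 is the maximum number of passes over the training data, and metrics.accuracy_score compares the held-out labels against the predictions. The numpy and itertools imports are presumably there for plotting the confusion matrix in the next step.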

A confusion matrix is then used to visualize the results. If you want to learn more about the confusion matrix, you can check out my previous article.
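
The plotting code itself isn't shown in this post; a minimal sketch using scikit-learn's ConfusionMatrixDisplay (an assumption; the original may have used a matplotlib/itertools plot, as the imports above suggest) could look like this:

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, pred)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Fake', 'True']).plot()
plt.show()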

For the validation process, a random article is sampled from the part of the fake set that was not used to build the data frame, preprocessed the same way, and passed to the model:

# Validation
import random

# Sample an article outside the rows used to build the dataset (0-5000);
# len(fake) - 1 keeps the index in bounds, since randint is inclusive
r1 = random.randint(5001, len(fake) - 1)
review = re.sub('[^a-zA-Z]', ' ', fake['text'][r1])
review = review.lower()
review = review.split()
review = [ps.stem(word) for word in review if not word in stopwords.words('english')]
review = ' '.join(review)

# Vectorization
val = tfidf_v.transform([review]).toarray()

# Predict
classifier.predict(val)
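
Since the sample is drawn from the fake set, a correctly behaving model should predict 0 here.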

To save the model, we make use of the pickle package.

import pickle

pickle.dump(classifier, open('model2.pkl', 'wb'))
pickle.dump(tfidf_v, open('tfidfvect2.pkl', 'wb'))

The model and vectorizer are then loaded back to confirm the results.

# Load model and vectorizer
joblib_model = pickle.load(open('model2.pkl', 'rb'))
joblib_vect = pickle.load(open('tfidfvect2.pkl', 'rb'))
val_pkl = joblib_vect.transform([review]).toarray()
joblib_model.predict(val_pkl)

Deploying the model

This section requires some experience with Flask. There are many options for deploying a model, but this one is deployed with Flask. The app.py used can be found on GitHub here and the index.html here
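
The actual app.py lives in the repo linked above. As a rough sketch of the shape such an app could take (the route, template variables, and the 'text' form field below are assumptions, not the post's exact code):

import pickle
import re

from flask import Flask, render_template, request
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

app = Flask(__name__)

# Load the model and vectorizer saved earlier
model = pickle.load(open('model2.pkl', 'rb'))
tfidf_v = pickle.load(open('tfidfvect2.pkl', 'rb'))
ps = PorterStemmer()

def preprocess(text):
    # Mirror the cleaning pipeline used at training time
    # (assumes nltk stopwords have already been downloaded)
    review = re.sub('[^a-zA-Z]', ' ', text)
    review = review.lower().split()
    review = [ps.stem(word) for word in review if word not in stopwords.words('english')]
    return ' '.join(review)

@app.route('/', methods=['GET', 'POST'])
def home():
    result = None
    if request.method == 'POST':
        text = request.form['text']  # 'text' is a hypothetical form field name
        val = tfidf_v.transform([preprocess(text)]).toarray()
        result = 'True news' if model.predict(val)[0] == 1 else 'Fake news'
    return render_template('index.html', result=result)

if __name__ == '__main__':
    app.run(debug=True)

Running python app.py then serves the form at http://localhost:5000 by default.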

The code for this project is available at this repo

Bringing it All Together

This blog post has walked through the steps from downloading the data, to cleaning it, building the model, validating it, and finally deploying it with Flask. Thank you for reading. Any feedback is appreciated.

Original Link: https://dev.to/iyissa/fake-news-detection-with-machine-learning-and-flask-e5l
