How to build a sentiment analysis engine in Python
Intro
This short tutorial shows how to build and train a classifier that distinguishes positive from negative reviews.
As an example dataset, we download Movie Reviews from Kaggle.
This dataset contains 1000 positive and 1000 negative processed reviews.
link: https://www.kaggle.com/nltkdata/movie-review
Classifier
We use BernoulliNB, the Naive Bayes classifier for multivariate Bernoulli models.
Like MultinomialNB, this classifier is suitable for discrete data. The difference is that while MultinomialNB works with occurrence counts, BernoulliNB is designed for binary/boolean features.
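To make the difference concrete, here is a minimal sketch on a made-up count matrix (the data and variable names are assumptions for illustration, not part of the dataset). It relies on BernoulliNB's default binarize=0.0, which turns every count above zero into 1, so replacing counts with presence flags should leave BernoulliNB unchanged while MultinomialNB's estimates shift.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical toy data: rows are documents, columns are word counts.
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 1, 2],
])
y = np.array(["pos", "pos", "neg", "neg"])

# Presence/absence version of the same matrix.
X_binary = (X_counts > 0).astype(int)

# BernoulliNB binarizes its input (binarize=0.0 by default),
# so it only ever sees whether a word occurs at all.
bnb_counts = BernoulliNB().fit(X_counts, y)
bnb_binary = BernoulliNB().fit(X_binary, y)
same_for_bernoulli = np.allclose(
    bnb_counts.predict_proba(X_counts), bnb_binary.predict_proba(X_binary)
)

# MultinomialNB uses the counts themselves, so binarizing changes what it learns.
mnb_counts = MultinomialNB().fit(X_counts, y)
mnb_binary = MultinomialNB().fit(X_binary, y)
same_for_multinomial = np.allclose(
    mnb_counts.feature_log_prob_, mnb_binary.feature_log_prob_
)

print(same_for_bernoulli, same_for_multinomial)
```

This is why a count matrix can be passed to BernoulliNB directly: how often a word occurs is discarded, only whether it occurs is kept.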
CountVectorizer
CountVectorizer converts a collection of text documents to a matrix of token counts.
This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.
If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection then the number of features will be equal to the vocabulary size found by analyzing the data.
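As a small illustration of the description above, the sketch below vectorizes two made-up sentences (the sentences are assumptions, not rows from the dataset). With default settings the tokenizer keeps only tokens of two or more characters, so the single-letter word "a" does not become a feature.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus.
docs = ["a great great film", "a dull film"]

vect = CountVectorizer()
X = vect.fit_transform(docs)

# vocabulary_ maps each learned token to its column index.
vocabulary = sorted(vect.vocabulary_)

print(vocabulary)      # tokens found in the data, in column order
print(issparse(X))     # the result is a scipy sparse matrix
print(X.toarray())     # dense view: one row per document, one column per token
```

The number of columns equals the size of the vocabulary found in the data, which here is three tokens.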
Packages
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score
Read CSV as DataFrame
df = pd.read_csv('movie_review.csv')
DataFrame preview
<bound method NDFrame.head of
       fold_id cv_tag  html_id  sent_id                                               text  tag
0            0  cv000    29590        0  films adapted from comic books have had plenty...  pos
1            0  cv000    29590        1  for starters , it was created by alan moore ( ...  pos
2            0  cv000    29590        2  to say moore and campbell thoroughly researche...  pos
3            0  cv000    29590        3  the book ( or " graphic novel , " if you will ...  pos
4            0  cv000    29590        4  in other words , don't dismiss this film becau...  pos
...        ...    ...      ...      ...                                                ...  ...
64715        9  cv999    14636       20  that lack of inspiration can be traced back to...  neg
64716        9  cv999    14636       21  like too many of the skits on the current inca...  neg
64717        9  cv999    14636       22  after watching one of the " roxbury " skits on...  neg
64718        9  cv999    14636       23   bump unsuspecting women , and . . . that's all .  neg
64719        9  cv999    14636       24  after watching _a_night_at_the_roxbury_ , you'...  neg

[64720 rows x 6 columns]>
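The preview above starts with "bound method NDFrame.head of", which is what pandas prints when head is referenced without parentheses, so the whole frame is summarized instead of the first rows. A minimal sketch of calling it, on a made-up frame that reuses only the text and tag column names (the rows are assumptions):

```python
import pandas as pd

# Hypothetical stand-in for the review data.
df = pd.DataFrame({
    "text": ["good", "fine", "nice", "bad", "poor", "weak"],
    "tag": ["pos", "pos", "pos", "neg", "neg", "neg"],
})

preview = df.head()                     # calling head() returns the first 5 rows
tag_counts = df["tag"].value_counts()   # number of rows per label

print(preview)
print(tag_counts)
```

Checking the label counts this way is a quick sanity test that both classes are present before training.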
Preparing Data
X = df['text']
y = df['tag']
Vectorize Data
vect = CountVectorizer(ngram_range=(1, 2))
X = vect.fit_transform(X)
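The ngram_range=(1, 2) argument asks for both single words and pairs of adjacent words as features. A small sketch on two made-up phrases (assumptions, not dataset rows) shows what that adds compared with the default of single words only:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical phrases chosen so that word order matters.
docs = ["not good", "good"]

unigram_vect = CountVectorizer()                    # default ngram_range=(1, 1)
bigram_vect = CountVectorizer(ngram_range=(1, 2))   # unigrams and bigrams

unigram_vect.fit(docs)
bigram_vect.fit(docs)

unigram_features = sorted(unigram_vect.vocabulary_)
bigram_features = sorted(bigram_vect.vocabulary_)

print(unigram_features)
print(bigram_features)
```

With bigrams, "not good" becomes a feature of its own, which can help a sentiment model tell negated phrases apart, at the cost of a much larger vocabulary.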
Split data into random train and test subsets
X_train, X_test, y_train, y_test = train_test_split(X, y)
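Called without size arguments, train_test_split holds out 25% of the rows and shuffles them differently on each run. In the steps above the vectorizer is also fitted on all rows before the split, so the vocabulary is learned from test rows as well. One possible variation, not part of the original tutorial, is to split the raw text first and wrap the vectorizer and classifier in a pipeline. The sketch below uses made-up sentences, and random_state=0 and stratify are illustrative choices:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical miniature corpus standing in for df['text'] and df['tag'].
texts = [
    "a great film", "wonderful acting", "great story", "a moving film",
    "a dull film", "terrible acting", "dull story", "a boring film",
]
tags = ["pos"] * 4 + ["neg"] * 4

# Split the raw text first; stratify keeps the class balance in both subsets,
# and random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    texts, tags, test_size=0.25, stratify=tags, random_state=0
)

# The pipeline fits the vectorizer on the training rows only.
pipeline = make_pipeline(CountVectorizer(ngram_range=(1, 2)), BernoulliNB())
pipeline.fit(X_train, y_train)

p_test = pipeline.predict(X_test)
print(len(X_train), len(X_test), sorted(y_test))
```

Whether this changes the reported accuracy on the real dataset is not something this sketch measures.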
Train Bayesian Model
model = BernoulliNB()
model.fit(X_train, y_train)
# Predict
p_train = model.predict(X_train)
p_test = model.predict(X_test)
Calculating the Accuracy
accuracy_score computes the accuracy classification score.
In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.
acc_train = accuracy_score(y_train, p_train)
acc_test = accuracy_score(y_test, p_test)
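For a single-label task like this one, accuracy is simply the fraction of predictions that match the true label. A tiny sketch with made-up labels (assumptions for illustration) compares accuracy_score with the same ratio computed by hand:

```python
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels: three of four predictions match.
y_true = ["pos", "neg", "pos", "neg"]
y_pred = ["pos", "neg", "neg", "neg"]

acc = accuracy_score(y_true, y_pred)

# The same value computed directly: matches divided by total.
manual = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(acc, manual)
```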
Result
print(f'Train ACC: {acc_train}, Test ACC: {acc_test}')
Train ACC: 0.9564276885043264, Test ACC: 0.6988875154511743
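Once the model is trained, new text has to pass through the same fitted vectorizer, using transform rather than fit_transform, before it can be classified. The sketch below shows that step on a made-up miniature corpus (the sentences are assumptions and stand in for the real dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical training sentences, three per class.
train_texts = [
    "a great and moving film",
    "wonderful acting and a great story",
    "great fun and a wonderful cast",
    "a dull and boring film",
    "terrible acting and a dull story",
    "dull plot and a terrible cast",
]
train_tags = ["pos", "pos", "pos", "neg", "neg", "neg"]

vect = CountVectorizer(ngram_range=(1, 2))
X_train = vect.fit_transform(train_texts)
model = BernoulliNB().fit(X_train, train_tags)

# New reviews: reuse the fitted vocabulary with transform, never fit again.
new_reviews = [
    "a great film with a wonderful cast",
    "a dull film with a terrible cast",
]
predictions = model.predict(vect.transform(new_reviews))

print(list(predictions))
```

Words that the vectorizer never saw during fitting, such as "with" here, are ignored at prediction time.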
Original Link: https://dev.to/daviducolo/how-to-build-a-sentiment-analysis-engine-in-python-6jn