Sources Contact Advanced Search Tutorials

An Interest In:

Web News this Week

Search Archive

Some of Our Sources

View All Sources

Help Webnuz

Referal links:

May 18, 2022 01:34 pm GMT

SENTIMENT ANALYSIS WITH PYTHON's NLTK LIBRARY

It is commonly known, the success of a company/business/product directly depends on the customer, so if your customer likes your product it's your success. If not then you need to improvise it by making some changes to it.

Question:How will you know whether your product is successful or not?

Well for that you need to analyze your customers and one of the attributes of analyzing your customer is to analyze the sentiment of them on the specific product and this is where the sentiment analysis comes into the picture.

So, let's start with what is sentiment analysis?

Sentiment Analysis we can say it is a process of computationally identifying and categorizing opinions from piece of text, and determine whether the writer's attitude towards the particular topic or product is positive,negative or neutral.

It might be possible not as an individual every time you don't perform a sentiment analysis but you do look for the feedback right like before purchasing a product or downloading an app in your device(phone) either from the App Store or Play Store you can look for feedback of what other customers or users are saying about that product whether is good or bad and you analyze it manually. Consider at the company level, how do they analyze what customers are saying about particular products, they do have more than millions of customers. That is where companies need to perform sentiment analysis to know whether their products are doing good on the market or not.

What is NLTK?

NLTK stands for Natural Language Toolkit, It provides us various text processing libraries with a lot of test datasets. Today you will learn how to process text for sentiment analysis usingNLTK. There are other libraries as well like CoreNLP, spaCy, PyNLPI, and Polyglot.NLTKand spaCy are most widely used. Spacy works well with large information and for advanced NLP.

We are going to use NLTK to perform sentiment analysis to assess if a Twitter post is about Covid-19 or not. You can read more about the datasets we are going to usehere

The first step is to install NLTK in your working environment

pip install nltk

Data Loading

Our datasets have three features ID, text, and target where ID just indexes each tweet, the text is the tweet needed to be classified and the target is the label of the tweet where 1 means the tweet with Covid contents and 0 tweets with no Covid contents.

import pandas as pdtweets =  pd.read_csv("../data/tweets.csv")tweets.head()

Overview of tweets

Let's try to filter some tweets one with Covid contents and another with non-Covid contents.

# %tweet with covid contentstweets[tweets['target'] == 1 ].loc[8,'text']

# %tweet with non covid contentstweets[tweets['target'] == 1].loc[1,'text']

Data Cleaning

The process is usually step-wise, regardless of which language of those sentiments are coming, only some changes can be done to fit the requirements of sentiment analysis but the idea of cleaning text data is the same. We have loaded the datasets it is time to clean those texts.

Removing Punctuation

Punctuation is nonmeaningful when we come in sentiment analysis we should remove from strings to remain with clean sentiments. We can do so by usingremove_punctuationfunction on the snippet below.

# %function to remove punctuation using string librarydef remove_punctuation(text):    '''a function for removing punctuation'''    translator = str.maketrans('', '', string.punctuation)    return text.translate(translator)tweets['text'] = tweets['text'].apply(remove_punctuation)

Converting Text to Lowercase

In sentiment analysis, we should consider converting text to lowercase because they can be a problem for a couple of reasons. Imagine the world "us".It could be a pronoun representing "we" on the sentence or the country "USA". We can do so by using the piece of code below.

tweets['text'] = tweets['text'].apply(lambda x: " ".join(x.lower()  for x in x.split()))

Removing Stop words

A Text may contain words like is, was, when etc. We can remove stopwords from the text. There is no universal list of stop words in NLP, however, the NLTK library provides a list of stop words, so here we will use the snippet below to remove all stopwords from the text.

# %import nltkimport nltk# %download stopwords using nltknltk.download('stopwords')# %remove english stopwordsallstopwords = nltk.corpus.stopwords.words("english")tweets['text'] = tweets['text'].apply(lambda x: " ".join(i for i in x.split() if i not in allstopwords))

Removing Numbers

Numbers or words combined with numbers are very hard to process, you can find a sentence with words like covid-19,software2020orplan7. For this can, we should remove these kinds of words be problems for machines to understand. For our case here is a piece of code to perform such a task.

def remove_numbers(text):    '''a function to remove numbers'''    txt = re.sub(r'\b\d+(?:\.\d+)?\s+', '', text)    return txttweets['text'] = tweets['text'].apply(remove_numbers)

Tokenization

The process of breaking down the text into smaller units is called tokens. If we have a sentence, the idea is to separate each word and build a vocabulary such that we can represent all words uniquely in a list. Numbers, words, etc.. all fall under tokens. NLTK has different methods to perform such tasks. For our case, we will usesent_tokenizeto perform tokenization

nltk.download('punkt')from nltk.tokenize import sent_tokenizetweets['text'] = sent_tokenize(tweets['text'])

Story Generation and Visualization from Tweets

Take a look at visualizations to answer some questions or to tell stories like

What are the most common words in the entire dataset?
What are the most common words in the dataset for each class, respectively?
Which trends are associated with either of the sentiments? Are they compatible with the sentiments?

We will be using a word cloud to visualize our tweets

# create text from all tweetsall_words = ' '.join([text for text in tweets['text']])wordcloud = WordCloud(width=900, height=500, random_state=21, max_font_size=110).generate(all_words)plt.figure(figsize=(15, 7))plt.imshow(wordcloud, interpolation="bilinear")plt.axis('off')plt.show()

Word cloud Visualization for tweets

From left word cloud visualization of all tweets, at the middle the visualization of tweets about Covid and the last(right side) the visualization of tweets with non-Covid contents.

Feature Extraction and Classification

Here we are going to extract tweets and split them into two groups, train group and test group where train group will be used to train our sentiment model and the test group to validate the performance of our model, the splitting task will be done by usingsklearnpython library

we are going to use NLTK methods to perform feature extractions

# %Extracting word featuresdef get_words_in_tweets(tweets):    all = []    for (words, target) in tweets:        all.extend(words)    return alldef get_word_features(wordlist):    wordlist = nltk.FreqDist(wordlist)    features = wordlist.keys()    return featuresw_features = get_word_features(get_words_in_tweets(tweets))def extract_features(document):    document_words = set(document)    features = {}    for word in w_features:        features['containts(%s)' % word] = (word in document_words)

Then, Let's use NLTK Naive Bayes Classifier to classify the extracted tweet word features.

from nltk.classify import SklearnClassifiertraining_set = nltk.classify.apply_features(extract_features,tweets)classifier = nltk.NaiveBayesClassifier.train(training_set)

The final step is to evaluate how far our trained model can perform sentiment classification with unseen tweets.

noncovid_cnt = 0covid_cnt = 0for obj in test_nocovid:    res =  classifier.classify(extract_features(obj.split()))    if(res == 0):        noncovid_cnt = noncovid_cnt + 1for obj in test_covid:    res =  classifier.classify(extract_features(obj.split()))    if(res == 1):        covid_cnt = covid_cnt + 1print('[Non Covid]: %s/%s '  % (len(test_nocovid),noncovid_cnt))print('[Covid]: %s/%s '  % (len(test_covid),covid_cnt))

The final model with NLTK was able to predict collect 530 tweets with Non-Covid contents out of 581, equivalent to 91%, and 410 tweets with Covid contents out of 477, equivalent to 85%.

Well done, now you are familiar with NLTK that allows you to process text, analyze text to gain particular information, you can investigate more on how to make your model improve also you can opt to select other classifiers to compare with Naive Bayes.

Thank you, you can access full codeshere

Relationship Between Neurotech and Natural Language Processing(NLP)

Natural Language Processing is a powerful tool when your solve business challenges, associating with the digital transformation of companies and startups.SarufiandNeurotechoffer high-standard solutions concerning conversational AI(chatbots). Improve your business experience today with NLP solutionsfrom experienced technical expertise.

Hope you find this article useful, sharing is caring.

Original Link: https://dev.to/neurotech_africa/sentiment-analysis-with-pythons-nltk-library-1ihe

Share this article:

View Full Article

Dev To

An online community for sharing and discovering great ideas, having debates, and making friends

More About this Source Visit Dev To