October 4, 2020 03:19 am GMT

3 Natural Language Processing Tools From AWS to Python

Photo by Eric Krull on Unsplash.

Parsing and processing documents can provide a lot of value for almost every department in a company. This is one of the many use cases where natural language processing (or NLP) can come in handy.

NLP is not just for chatbots and Gmail predicting what you are going to write in your email. NLP can also be used to break down, categorize, and analyze documents automatically. For example, perhaps your company is looking to find relationships across all of its contracts, or you're trying to categorize what blog posts or movie scripts are about.

This is where using some form of NLP processing could come in very handy. It can help break down subjects, repetitive themes, pronouns, and more from a document.

Now the question is, how do you do it?

Should you develop a custom neural network from scratch that will break down sentences, words, meaning, sentiment, and so on?

This is probably not the best solution --- at least not for your initial MVP. Instead, there are lots of libraries and cloud services that can be used to help break down documents.

In this article, we are going to look at three options and how you can implement these tools to analyze documents with Python. We are going to look into AWS Comprehend, GCP Natural Language, and TextBlob.

AWS Comprehend

AWS Comprehend is one of the many cloud services AWS provides that let your team take advantage of neural networks and other models without the complexity of building your own.

In this case, AWS Comprehend is an NLP API that can make it very easy to process text.

What is great about AWS Comprehend is that it will automatically break down which entities, key phrases, and syntactic elements are involved in a document. Entities are particularly helpful if you are trying to break down what events, organizations, persons, or products are referenced in a document.

There are plenty of Python libraries that make it easy to break down nouns, verbs, and other parts of speech. However, those libraries aren't built to label which categories those nouns fall into.

Let's look at an example.

For all the code examples in this article, we will be using the text below:

If you have ever worked at a FAANG or even a technology-driven start-up like Instacart, then you have probably realized that data drives everything. To the point that analysts, PMs, and product managers are starting to understand SQL out of necessity. SQL is the language of data, and if you want to interact with data, you need to know it. Do you want to easily figure out the average amount of time a user spends on your product, but don't want to wait for an analyst? You better figure out how to run a query. This ability to run queries easily is also driven by the fact that SQL editors no longer need to be installed. With cloud-based data warehouses come SaaS SQL editors. We will talk about a SaaS SQL editor more in the next section. However, the importance here is you don't have to wait 30 minutes to install an editor and deal with all the hassle of managing it. Now you can just go to a URL and access your team's data warehouse. This has allowed anyone in the company easy access to their data. We know both from anecdotal experience as well as the fact that indeed.com's tracking in 2019 has shown a steady requirement for SQL skill sets for the past 5 years.

We will take that text example and run it through the code below, where it is assigned to the variable plain_text:

```python
import os

import boto3

# AWS credentials read from the environment
S3_API_PUBLIC = os.environ.get("PUBLIC_KEY")
S3_API_SECRET = os.environ.get("SECRET_KEY")

client_comprehend = boto3.client(
    'comprehend',
    region_name='eu-west-1',
    aws_access_key_id=S3_API_PUBLIC,
    aws_secret_access_key=S3_API_SECRET
)

plain_text = ''  # PUT TEST TEXT HERE

# Detect the dominant language first so detect_entities
# can be called with a valid LanguageCode
dominant_language_response = client_comprehend.detect_dominant_language(
    Text=plain_text
)

# Pick the language with the highest confidence score
dominant_language = sorted(
    dominant_language_response['Languages'],
    key=lambda k: k['Score'],
    reverse=True
)[0]['LanguageCode']

if dominant_language not in ['en', 'es']:
    dominant_language = 'en'

response = client_comprehend.detect_entities(
    Text=plain_text,
    LanguageCode=dominant_language
)
print(response)
```

AWS Comprehend output

Once you run the code above, you will get an output like the one below. This is a shortened version, but you can still get a feel for the response. For example, you can see QUANTITY was labeled with "30 minutes" and "5 years" --- both of which are quantities of time:

```
{
   "Entities": [
      {
         "Score": 0.9316830039024353,
         "Type": "ORGANIZATION",
         "Text": "FAANG",
         "BeginOffset": 30,
         "EndOffset": 35
      },
      {
         "Score": 0.7218282222747803,
         "Type": "TITLE",
         "Text": "Instacart",
         "BeginOffset": 76,
         "EndOffset": 85
      },
      {
         "Score": 0.9762992262840271,
         "Type": "TITLE",
         "Text": "SQL",
         "BeginOffset": 581,
         "EndOffset": 584
      },
      {
         "Score": 0.997804582118988,
         "Type": "QUANTITY",
         "Text": "30 minutes",
         "BeginOffset": 801,
         "EndOffset": 811
      },
      {
         "Score": 0.5189864635467529,
         "Type": "ORGANIZATION",
         "Text": "indeed.com",
         "BeginOffset": 1079,
         "EndOffset": 1089
      },
      {
         "Score": 0.9985176920890808,
         "Type": "DATE",
         "Text": "2019",
         "BeginOffset": 1104,
         "EndOffset": 1108
      },
      {
         "Score": 0.6815792322158813,
         "Type": "QUANTITY",
         "Text": "5 years",
         "BeginOffset": 1172,
         "EndOffset": 1179
      }
   ]
}
```

As you can see, AWS Comprehend does a great job of breaking down organizations and other entities. Again, it is not limited to only breaking down entities. However, this feature is one of the more useful ones when attempting to look for relationships between documents.
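Once you have the response, narrowing it down is plain dictionary work. The sketch below filters entities by type and confidence score; the trimmed response dict is hypothetical, standing in for a real detect_entities result:

```python
# Hypothetical, trimmed Comprehend-style response standing in for a live API call
response = {
    "Entities": [
        {"Score": 0.93, "Type": "ORGANIZATION", "Text": "FAANG"},
        {"Score": 0.52, "Type": "ORGANIZATION", "Text": "indeed.com"},
        {"Score": 0.99, "Type": "QUANTITY", "Text": "30 minutes"},
    ]
}

def filter_entities(response, entity_type, min_score=0.7):
    """Keep entity texts of one type whose confidence clears a threshold."""
    return [
        entity["Text"]
        for entity in response["Entities"]
        if entity["Type"] == entity_type and entity["Score"] >= min_score
    ]

print(filter_entities(response, "ORGANIZATION"))  # ['FAANG']
```

Lowering min_score would also let the lower-confidence "indeed.com" match through, which is the usual precision/recall trade-off when working with these scores.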

GCP Natural Language

Google has created a very similar NLP cloud service called Cloud Natural Language.

It offers a lot of similar features, including entity detection, custom entity detection, content classification, and more.

Let's use GCP's version of natural language processing on a string. The code below shows an example of using GCP to detect entities:

```python
import httplib2
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

DISCOVERY_URL = ('https://{api}.googleapis.com/'
                 '$discovery/rest?version={apiVersion}')

def gcp_nlp_example():
    # Authorize an HTTP client with application-default credentials
    credentials = GoogleCredentials.get_application_default().create_scoped(
        ['https://www.googleapis.com/auth/cloud-platform'])
    http = httplib2.Http()
    credentials.authorize(http)

    service = discovery.build('language', 'v1beta1',
                              http=http, discoveryServiceUrl=DISCOVERY_URL)

    service_request = service.documents().analyzeEntities(
        body={
            'document': {
                'type': 'PLAIN_TEXT',
                'content': ''  # PUT TEXT HERE
            }
        })

    response = service_request.execute()
    print(response)
    return 0

gcp_nlp_example()
```

GCP Natural Language output

The GCP output is similar to that of AWS Comprehend. However, you will notice that GCP also breaks down similar words and tries to find metadata that is related to the original word:

```
{
   "entities": [
      {
         "name": "SaaS SQL",
         "type": "OTHER",
         "metadata": {
            "mid": "/m/075st",
            "wikipedia_url": "https://en.wikipedia.org/wiki/SQL"
         },
         "salience": 0.36921546,
         "mentions": [
            {
               "text": {
                  "content": "SQL",
                  "beginOffset": -1
               },
               "type": "COMMON"
            },
            {
               "text": {
                  "content": "SQL",
                  "beginOffset": -1
               },
               "type": "PROPER"
            },
            {
               "text": {
                  "content": "language",
                  "beginOffset": -1
               },
               "type": "COMMON"
            }
         ]
      }
   ]
}
```
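Pulling the useful fields out of that structure is again just dictionary work. Here is a minimal sketch run against a hypothetical trimmed response rather than a live API call:

```python
# Hypothetical, trimmed GCP Natural Language-style response (not a live API call)
response = {
    "entities": [
        {
            "name": "SaaS SQL",
            "type": "OTHER",
            "metadata": {"wikipedia_url": "https://en.wikipedia.org/wiki/SQL"},
            "salience": 0.369,
        },
        {"name": "data", "type": "OTHER", "metadata": {}, "salience": 0.12},
    ]
}

def summarize_entities(response):
    """Return (name, salience, wikipedia_url) tuples sorted by salience."""
    rows = [
        (e["name"], e["salience"], e.get("metadata", {}).get("wikipedia_url"))
        for e in response["entities"]
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Print the highest-salience entities first, with any linked metadata
for name, salience, url in summarize_entities(response):
    print(name, salience, url)
```

Sorting by salience is a convenient way to surface what a document is mostly about before digging into individual mentions.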

TextBlob And Python

Besides using cloud service providers, there are libraries that can also extract information from documents. In particular, the TextBlob library in Python is very useful. Personally, it was the first library I learned to develop NLP pipelines with.

It is far from perfect. However, it does a great job of parsing through documents.

It offers parts-of-speech parsing like AWS Comprehend and GCP Natural Language, as well as sentiment analysis. However, on its own, it won't categorize which entities exist.

It is still a great tool to break down the basic word types.

Using this library, a developer can break down verbs, nouns, and other parts of speech and then look for patterns. What words are commonly used? Which specific phrases or words are attracting readers? Which words commonly appear alongside other nouns?

There are still a lot of questions you can answer and products you can develop depending on your end goal.
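One simple way to start answering those questions: once the noun phrases are in a plain Python list, the standard library's collections.Counter can surface the most common ones. A minimal sketch, where the phrase list is hypothetical and stands in for TextBlob's blob.noun_phrases output:

```python
from collections import Counter

# Hypothetical phrase list standing in for TextBlob's blob.noun_phrases
noun_phrases = [
    "sql", "saas sql", "data warehouse", "sql",
    "saas sql", "sql", "steady requirement",
]

counts = Counter(noun_phrases)

# The three most frequent phrases and how often each appears
for phrase, count in counts.most_common(3):
    print(phrase, count)
```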

Implementing the TextBlob library is very simple.

No need to connect to an API in this case. All you will need to do is import the library and call a few classes.

This is shown in the code below:

```python
from textblob import TextBlob

t = ''  # PUT YOUR TEXT HERE

blob = TextBlob(t)

# Print every noun phrase TextBlob finds in the text
for phrase in blob.noun_phrases:
    print(phrase)
```

TextBlob output

Here is the output of TextBlob. You will see a lot of similar words that are pulled out using both AWS and GCP. However, there isn't all the extra labeling and metadata that come with the APIs. That's what you are paying for (amongst a few other helpful features) with both AWS and GCP:

```
faang
technology-driven start-up
instacart
pms
product managers
sql
sql
average amount
don  t
query.this ability
sql
saas sql
saas sql
don  t
url
team  s data warehouse
easy access
anecdotal experience
indeed.com  s
steady requirement
Sql
```
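Before comparing phrase lists like this across documents, it usually pays to normalize them first. A small sketch, assuming the raw phrases have been collected into a list (the sample values are hand-picked tokenizer artifacts of the kind shown above, and the cleanup rules are illustrative):

```python
import re

# Hand-picked raw phrases with the tokenizer artifacts shown above
raw_phrases = ["sql", "Sql", "saas sql", "don  t", "team  s data warehouse"]

def clean_phrase(phrase):
    """Lowercase and collapse the stray whitespace the tokenizer leaves."""
    return re.sub(r"\s+", " ", phrase).strip().lower()

# Deduplicate after cleaning, since "sql" and "Sql" should count as one
unique_phrases = sorted({clean_phrase(p) for p in raw_phrases})
print(unique_phrases)
# ['don t', 'saas sql', 'sql', 'team s data warehouse']
```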

And with that, we have covered three different ways you can use NLP on your documents or raw text.

NLP Doesn't Have to Be Hard --- Sort Of

NLP is a great tool to help process documents, categorize them, and look for relationships. Thanks to AWS and GCP, many less technical developers can take advantage of some NLP features.

That being said, there are a lot of hard aspects to NLP. For example, developing chatbots that are good at tracking conversations and context isn't an easy task. In fact, there is a great series here on Medium where Adam Geitgey covers just that. You can read more in the article Natural Language Processing Is Fun.

Good luck with whatever your next NLP project is.

If you would like to read more about data science and data engineering, check out the articles and videos below.

4 SQL Tips For Data Scientists

How To Analyze Data Without Complicated ETLs For Data Scientists

What Is A Data Warehouse And Why Use It

Kafka Vs RabbitMQ

SQL Best Practices---Designing An ETL Video

5 Great Libraries To Manage Big Data With Python


Original Link: https://dev.to/seattledataguy/3-natural-language-processing-tools-from-aws-to-python-4g0i
