CUSTOM SWAHILI NAMED ENTITY RECOGNITION USING SPACY
Named Entity Recognition, a potential "game-changer" for many businesses, has helped business operations around the world address complex challenges, since identifying boilerplate text and extracting even standard information from a large corpus can be a difficult and error-prone task.
Financial professionals, business leaders, and innovators are increasingly turning to artificial intelligence (AI) technologies to spend less time discovering data and more time acting on insights from it to improve the future of their businesses. This is where Named Entity Recognition (NER), one of the most useful tools of the AI era, comes into play.
This article discusses NER as leveraged by Neurotech Africa, a leading startup in Africa focused on creating powerful artificial intelligence and NLP algorithms to automate African businesses by providing sarufi solutions. I will explain common use cases and demonstrate how to create a custom Swahili named entity recognition model using the spaCy library.
Meaning: Named Entity Recognition
Named Entity Recognition (NER) is the technique that automatically identifies the important named entities mentioned in an unstructured text document and classifies them into pre-defined categories such as person names, organizations, locations, monetary values, and so on. Consider the image below for more understanding.
Named Entity Recognition is a first step towards information retrieval tasks. It is also known as entity chunking, entity identification, or entity extraction, and is used in fields such as Natural Language Processing (NLP) and Machine Learning.
Myth: Named Entity Recognition isn't the future or important in digital businesses.
The power of Named Entity Recognition, in my opinion, comes from the ease with which basic models can be customized, or even built from scratch, to extract business-specific information from a variety of data sources. This results in high commercial value, which is why NER matters now and will matter even more in the future of digital business.
Let's examine how NER maps onto business use cases.
Saurbh image
Use Cases: Relevance of Named Entity Recognition in Business Operations
The most successful business operations rely on their customers. Artificial intelligence-powered Named Entity Recognition tools can open up possibilities, in Africa and around the world, for driving economic value in business operations through user satisfaction. Here I will showcase some uses of Named Entity Recognition in business operations.
Automating and Simplifying Customer Support
NER can be used to recognize useful entities in customer complaints and feedback so that each message can be routed to the department in charge of the recognized product. This saves time and cost and speeds up customer care and feedback handling, which adds business value. As a typical example, Neurotech provides entity recognition APIs that can be integrated into a business to automate the customer handling process.
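The routing idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the entities are assumed to have already been extracted by some NER model as (text, label) pairs, and the PRODUCT label, the product names, the department names, and the route_ticket helper are all hypothetical.

```python
# Hypothetical routing table: recognized product name -> responsible department.
ROUTING_TABLE = {
    "simupay": "payments-team",
    "kadiplus": "cards-team",
}


def route_ticket(entities, routing_table, default="general-support"):
    """Pick a department from the first recognized product entity.

    `entities` is a list of (text, label) pairs, assumed to come from an
    NER model that tags product names with the (hypothetical) PRODUCT label.
    """
    for text, label in entities:
        if label == "PRODUCT" and text.lower() in routing_table:
            return routing_table[text.lower()]
    return default


# Entities as an NER model might return them for one complaint (made-up data).
complaint_entities = [("Amina", "PER"), ("SimuPay", "PRODUCT")]
print(route_ticket(complaint_entities, ROUTING_TABLE))
```

A complaint with no recognized product falls back to the default queue, so nothing is dropped when the model misses an entity.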
Powering Recommendation Engine Algorithms
Recommendation systems govern how we find fresh content and ideas in an interconnected world. Named Entity Recognition can be used to build algorithms that automatically filter information we might be interested in and help us uncover similar, previously undiscovered content based on our prior behavior. This increases customer engagement with products and brings more business value.
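One simple way to picture this is to compare the entities a user has already engaged with against the entities found in each candidate item. The sketch below uses Jaccard similarity over entity sets, which is my own choice for illustration and not necessarily what a production recommender would use; the catalogue, the item ids, and the helper names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of entity strings."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend(seen_entities, catalogue, top_n=2):
    """Rank catalogue items by entity overlap with what the user has seen."""
    scored = [(jaccard(seen_entities, ents), item_id)
              for item_id, ents in catalogue.items()]
    # Keep only items that share at least one entity with the user's history.
    scored = [(score, item_id) for score, item_id in scored if score > 0]
    # Highest score first; ties broken by item id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item_id for _, item_id in scored[:top_n]]


# Made-up catalogue: item id -> entities an NER model found in that item.
catalogue = {
    "article-1": {"Tanzania", "WHO"},
    "article-2": {"Oxford", "WHO", "COVID-19"},
    "article-3": {"Libya", "Tripoli"},
}
user_history = {"WHO", "COVID-19"}
print(recommend(user_history, catalogue))
```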
Effective and Efficient optimization of search engine algorithms
A search engine's algorithm is a collection of rules that determines how listings are ranked in response to a search query. Instead of examining the millions of articles and websites online for an entered query, a more efficient approach to designing a search engine algorithm would be to run a NER model on the articles once and store the entities associated with them permanently. A query can then be matched against the stored entities, which speeds up the process and increases the business value.
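The "run NER once, store the entities" idea amounts to building an inverted index from entity text to document ids. The following is a minimal sketch, assuming the entities per document have already been extracted; the document ids, the sample entities, and the function names are hypothetical.

```python
from collections import defaultdict


def build_entity_index(doc_entities):
    """Map each lower-cased entity text to the set of documents mentioning it."""
    index = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for text, _label in entities:
            index[text.lower()].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents whose stored entities match the query."""
    return sorted(index.get(query.lower(), set()))


# Entities extracted once per document (made-up sample data).
doc_entities = {
    "doc-1": [("Oxford", "MAHALI"), ("WHO", "SHIRIKA")],
    "doc-2": [("Libya", "MAHALI"), ("Tripoli", "MAHALI")],
    "doc-3": [("WHO", "SHIRIKA")],
}
index = build_entity_index(doc_entities)
print(search(index, "who"))
```

Lookups then cost a single dictionary access per query term instead of a scan over every document.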
Implementation: Creating a Custom Swahili Named Entity Recognition Model using spaCy
I hope you now understand Named Entity Recognition, its importance, and its usage in business operations. Let's dive into our topic and see how to create a simple named entity recognition model for the Swahili language using spaCy. But wait! I see you wondering what spaCy is, right?
Meaning: spaCy
Simply put, spaCy is a Python-based open-source framework for sophisticated natural language processing. It is intended for production use and helps build applications that process and "understand" massive amounts of text. In spaCy, Named Entity Recognition is done by the pipeline component ner, which is easy to use. In short, spaCy is to NLP what NumPy is to data science.
Now, let's get started.
Using the pre-trained spaCy NER model
Here we first explore the trained model called xx_ent_wiki_sm, a multilingual model trained to understand different languages. We use it because some languages, including Swahili, do not have a dedicated spaCy NER language model. This solution was built on spaCy version 3.2.1, the latest version at the time of writing this article.
Let's start by installing the libraries to be used. The code below shows how to install spaCy and download the multi-language model.
    ! pip install -U spacy  # install spaCy and upgrade to the latest version
    ! python -m spacy download xx_ent_wiki_sm  # download the multi-language model
    ! python -m spacy info  # check the info about the installed spaCy
Import the necessary libraries in the project.
    import spacy
    import xx_ent_wiki_sm  # multi-language model
    from tqdm import tqdm  # makes loops show a nice progress bar
    from spacy.tokens import DocBin  # efficiently holds serialized annotations
    from spacy import displacy  # highlights the discovered named entities in a text document
    import warnings

    warnings.filterwarnings("ignore")  # filter warnings
    model = xx_ent_wiki_sm.load()  # load the multi-language model
Test the trained NER model loaded above by giving it text data. Consider the code below.
    text_swahili = "Mimi ni Innocent Charles , mjuzi wa akili bandia na sayansi ya data kutoka kampuni ya IPFsoftwares"  # text data in the Swahili language
    preds = model(text_swahili)  # predict the named entities that might be in the given text
    for preds_show in preds.ents:
        print(preds_show.text, preds_show.label_)  # print each named entity and its label
    displacy.render(preds, style="ent", jupyter=True)  # display it for proper visualization
Magic! Just like that, the model trained in spaCy has done well in recognizing the named entities, as shown in the image below.
Let's explore the pre-defined named entity labels recognized above by the trained spaCy NER model. Consider the code below.
    print("PER Meaning:", spacy.explain("PER"))  # meaning of PER
    print("ORG Meaning:", spacy.explain("ORG"))  # meaning of ORG
    print("MISC Meaning:", spacy.explain("MISC"))  # meaning of MISC
    print("LOC Meaning:", spacy.explain("LOC"))  # meaning of LOC
Nice. The image below shows the meaning of each entity label, so now you know what NER is capable of. It was able to recognize the name and the organization where Innocent Charles might work.
The images and code above show that we were using the already trained NER model from spaCy without fine-tuning.
Now, let's create our own custom NER model using spaCy for the Swahili language.
Training a Custom Swahili NER Model using spaCy by Updating the Existing Pre-trained Multilingual Model
First, prepare the custom data. Here I have prepared some training data and validation data with pre-defined entities as labels. Consider the code below.
    # training data
    Swahili_training_data = [
        ("Maafisa wa WHO wamesema kwa wiki kadhaa ufuatiliaji wa mlipuko huo umeangazia mabara ya Marekani, na idadi ya Jumapili imeonyesha ongezeko la siku moja la zaidi ya maambukizi 116,000 katika eneo Latin Amerika na Amerika ya Kaskazini.",
         {"entities": [[0,7,"MTU"],[11,14,"SHIRIKA"],[88,96,"MAHALI"],[110,118,"SIKU"],[175,182,"IDADI"],[195,208,"MAHALI"],[212,232,"MAHALI"]]}),
        ("Watu wawili waliojitolea walipatiwa chanjo hiyo Alhamisi mjini Oxford ambapo timu ya Chuo kikuu hicho ilitengeneza chanjo hiyo katika kipindi chini ya miezi mitatu.",
         {"entities": [[0,4,"MTU"],[5,11,"IDADI"],[48,56,"SIKU"],[63,69,"MAHALI"],[85,95,"SHIRIKA"]]}),
    ]

    # validation data
    Swahili_validation_data = [
        ("Canada, Russia na nchi nyingine pia wanashughulika kutengeneza chanjo, lakini wataalam wanasema hata kama itapatikana inayofaa hivi karibuni, utengenezaji wa chanjo hiyo na usambazaji wake unaweza kuchukua mwaka mmoja au zaidi.",
         {"entities": [[0,6,"MAHALI"],[8,14,"MAHALI"],[78,86,"MTU"],[206,217,"MUDA"]]}),
        ("Tafiti mbalimbali pia zinaonyesha dawa ya malaria hydroxychloroquine haiponyi virusi hivyo na pengine, ukweli ulivyo, inahatarisha maisha ya wagongwa wa COVID-19.",
         {"entities": [[42,49,"UGONJWA"],[50,68,"DAWA"],[141,149,"MTU"],[153,162,"UGONJWA"]]}),
    ]

    # load the pre-trained model for fine-tuning
    custom_NER_model = xx_ent_wiki_sm.load()
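Because each annotation is a plain [start, end, label] character offset, it can be worth slicing the text with those offsets before training to confirm that every span covers the words you intended. The snippet below is a small sanity check in plain Python; the entity_slices helper is a hypothetical name, and the sample is the second training sentence from the data above.

```python
def entity_slices(text, annotation):
    """Return (substring, label) pairs for each [start, end, label] annotation."""
    return [(text[start:end], label)
            for start, end, label in annotation["entities"]]


# One training example, copied from the article's training data.
sample_text = ("Watu wawili waliojitolea walipatiwa chanjo hiyo Alhamisi mjini "
               "Oxford ambapo timu ya Chuo kikuu hicho ilitengeneza chanjo hiyo "
               "katika kipindi chini ya miezi mitatu.")
sample_annotation = {"entities": [[0, 4, "MTU"], [5, 11, "IDADI"],
                                  [48, 56, "SIKU"], [63, 69, "MAHALI"],
                                  [85, 95, "SHIRIKA"]]}

# Print each labelled substring so off-by-one offsets are easy to spot.
for substring, label in entity_slices(sample_text, sample_annotation):
    print(f"{label:8} {substring!r}")
```

An offset that is off by one shows up immediately as a clipped word or a stray space in the printed substring.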
Double-check that the model is loaded. Consider the code below.
    if custom_NER_model:
        print("Existing Model is Loaded", custom_NER_model)
    else:
        print("Existing Model is not Loaded")
Check the pipeline components and entity labels. Consider the code below.
    print(custom_NER_model.pipe_names)
    print(custom_NER_model.pipe_labels)
Now the magic happens: the code below converts the prepared data into spaCy's binary format (.spacy extension), attaches the custom entity labels to each document, and saves the formatted data to disk.
    db = DocBin()  # efficiently serialize the information
    # training data
    for text, annot in tqdm(Swahili_training_data):  # data in previous format
        doc = custom_NER_model.make_doc(text)  # create doc object
        ents = []
        for start, end, label in annot["entities"]:
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is None:
                print("Skipping entity")
            else:
                ents.append(span)
        doc.ents = ents  # label the text with the ents
        db.add(doc)
    db.to_disk("Swahili_training_data.spacy")  # save the docbin object

    # validation data
    db = DocBin()  # fresh DocBin so the training docs are not written into the validation file
    for text, annot in tqdm(Swahili_validation_data):
        doc = custom_NER_model.make_doc(text)
        ents = []
        for start, end, label in annot["entities"]:
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is None:
                print("Skipping entity")
            else:
                ents.append(span)
        doc.ents = ents
        db.add(doc)
    db.to_disk("Swahili_validation_data.spacy")
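Before training, it can help to confirm that the entities survive serialization. The sketch below repeats the conversion for one example and reads it back from an in-memory DocBin. It makes a few assumptions of its own: a blank multilingual pipeline from spacy.blank("xx") stands in for the downloaded model (only the tokenizer is needed here), the round trip uses to_bytes and from_bytes instead of files on disk, and build_docbin is a hypothetical helper name.

```python
import spacy
from spacy.tokens import DocBin


def build_docbin(nlp, examples):
    """Convert (text, {"entities": [...]}) pairs into a DocBin.

    Returns the DocBin and the number of entities that could not be
    aligned to token boundaries and were therefore skipped.
    """
    db = DocBin()
    skipped = 0
    for text, annot in examples:
        doc = nlp.make_doc(text)
        spans = []
        for start, end, label in annot["entities"]:
            span = doc.char_span(start, end, label=label,
                                 alignment_mode="contract")
            if span is None:
                skipped += 1
            else:
                spans.append(span)
        doc.ents = spans
        db.add(doc)
    return db, skipped


# Assumption: a blank multilingual pipeline is enough, since only tokenization is used.
nlp = spacy.blank("xx")
sample = [
    ("Watu wawili waliojitolea walipatiwa chanjo hiyo Alhamisi mjini Oxford "
     "ambapo timu ya Chuo kikuu hicho ilitengeneza chanjo hiyo katika kipindi "
     "chini ya miezi mitatu.",
     {"entities": [[0, 4, "MTU"], [5, 11, "IDADI"], [48, 56, "SIKU"],
                   [63, 69, "MAHALI"], [85, 95, "SHIRIKA"]]}),
]

db, skipped = build_docbin(nlp, sample)

# Round-trip through bytes, then inspect what a training run would read.
restored = DocBin().from_bytes(db.to_bytes())
docs = list(restored.get_docs(nlp.vocab))
print("skipped entities:", skipped)
for ent in docs[0].ents:
    print(ent.text, ent.label_)
```

Counting skipped entities, rather than only printing a message, makes it easier to notice when many annotations fail to align with the tokenizer.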
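Next, create the config file for training. The file is generated with sensible default hyperparameters for the chosen pipeline and language, which saves defining them manually in code. There are several ways to create a config file, but the CLI is the simplest. Note that a config generated this way defines a new ner component that is trained on the prepared data; to continue training the ner component of xx_ent_wiki_sm itself, the config would need to source that component from the pre-trained pipeline.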
! python -m spacy init config config.cfg --lang xx --pipeline ner --optimize efficiency
Finally, use the spacy train command and the config file to train the model on the prepared data in spaCy format, as shown below.
! python -m spacy train config.cfg --output ./ --paths.train ./Swahili_training_data.spacy --paths.dev ./Swahili_validation_data.spacy
Load the custom Swahili NER model and test it on an unseen Swahili text document.
    model_test = spacy.load("../Notebook/model-best")
    test_preds = model_test("Walinzi wa pwani ya Libya wamekamata wahamiaji 400 waliokuwa wakonjiani katika pwani ya Mediterranean ya nchi hiyo wakielekea Ulaya na kuwarejesha katika mji mkuu wa Tripoli masaa 24 yaliyopita, Shirika la uhamiaji la Umoja wa Mataifa UN limesema Jumapili.")
    for x in test_preds.ents:
        print(x.text, x.label_)
    displacy.render(test_preds, style="ent", jupyter=True)  # display the recognized named entities in the given text
Nice job! We have managed to create a simple custom Swahili NER model using spaCy. In this article you have learned about NER and its business use cases, and seen how to implement NER and create a custom model using spaCy.
Bottom line
Following my recent exposure to NER, I am quite confident in stating that it is a highly helpful technique used in a wide range of business scenarios. However, many difficulties must be considered to make optimal use of NER.
On the other hand, the rapid advancement of deep learning algorithms, as offered by Neurotech Africa and other organizations, has resulted in far more powerful NLP models in recent years. You may consider contacting us now to upscale your business and make the most of it.
Author: Innocent Charles, machine learning data scientist and NLP developer advocate based in Africa, focuses on harnessing the power of data and technology to create smart solutions that address complex challenges around Africa. I'm eager to hear about your experience in the data space! Let's keep in touch on LinkedIn.
Original Link: https://dev.to/neurotech_africa/custom-swahili-named-entity-recognition-using-spacy-5p5