May 28, 2022 03:14 pm GMT

How I built a Deep Learning Powered Search Engine

We use search engines every day. Without them, we would be using the Internet in a completely different way; I can't even imagine what that would be like.

I decided to have a go at making my own small search engine, using Deep Learning to power the search results.

Breaking the problem down

Because making a fully fledged search engine is a very long and expensive task, I decided that my search engine would be for Wikipedia pages only (i.e. given a search query, it will return any relevant Wikipedia pages).

There were 3 main steps to building this search engine:

  • Find/Create a database of Wikipedia pages
  • Index the database of pages
  • Efficiently match search queries to pages in the index

Creating the Database

Instead of trying to find a database of Wikipedia pages online, I decided to create my own, simply because I wanted to build a web crawler!

What is a web crawler?

A web crawler is initially given a few starting URLs to visit. It then visits these URLs and parses through the HTML to find more URLs it can visit. Any URLs it finds are added to a queue, so that they can be visited later.

Search engines, like Google and Bing, use web crawlers to keep adding to their index of website links, so that they can come up in their search results.

Since most of the URLs on Wikipedia pages are URLs to other Wikipedia pages, all I had to do was write a simple, generic web crawler and feed it a few initial Wikipedia pages.

```python
import requests
import os
from tinydb import TinyDB
from html.parser import HTMLParser

db = TinyDB("results.json")

def get_host(url):
    while os.path.dirname(url) not in ["http:", "https:"]:
        url = os.path.dirname(url)
    return url

class Parser(HTMLParser):
    def __init__(self):
        super(Parser, self).__init__(convert_charrefs=True)
        self.url = ""
        self.urls = []
        self.meta_description = ""
        self.title = ""
        self.paragraph_content = ""
        self.paragraph = False
        self.set_description = False
        self.set_title = False

    def set_url(self, url):
        if url[-1] == "/":
            if os.path.dirname(url[:-1]) not in ["http:", "https:"]:
                url = url[:-1]
        self.url = url

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            for attr in attrs:
                if attr[0] == "name" and attr[1] == "description":
                    self.set_description = True
                if self.set_description:
                    if attr[0] == "content":
                        self.meta_description = attr[1]
                        self.set_description = False
        elif tag == "a":
            for attr in attrs:
                if attr[0] == "href":
                    link = attr[1]
                    if link:
                        if link[0] == "/":
                            link = get_host(self.url) + link
                            self.urls.append(link)
                        elif "http://" in link and link.index("http://") == 0:
                            self.urls.append(link)
                        elif "https://" in link and link.index("https://") == 0:
                            self.urls.append(link)
        elif tag == "p" and len(self.paragraph_content) < 100:
            self.paragraph = True
        elif tag == "title":
            self.set_title = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.paragraph = False

    def handle_data(self, data):
        if self.set_title:
            self.title = data
            self.set_title = False
        elif self.paragraph:
            self.paragraph_content += data

    def clear(self):
        self.urls = []
        self.meta_description = ""
        self.title = ""
        self.paragraph_content = ""
        self.paragraph = False
        self.set_description = False
        self.set_title = False

def crawl(start_queue):
    parser = Parser()
    queue = start_queue
    seen_urls = []
    while len(queue) > 0:
        if queue[0] not in seen_urls:
            try:
                print(queue[0])
                page = requests.get(queue[0])
                parser.set_url(queue[0])
                parser.feed(page.text)
                db.insert({
                    "title": parser.title,
                    "description": parser.meta_description,
                    "content": parser.paragraph_content,
                    "url": queue[0]
                })
                seen_urls.append(queue[0])
                queue = queue + parser.urls
                parser.clear()
            except:
                pass
        queue = queue[1:]

crawl(["https://en.wikipedia.org/wiki/Music", "https://en.wikipedia.org/wiki/Cricket", "https://en.wikipedia.org/wiki/Football"])
```

Every time the crawler visits a page, it scans through the HTML for any a tags and adds their href attributes to the queue.

It also records the page's title (found between the title tags), the first 100 characters of the page's article (by looking at the content within the p tags) and the page's meta description (although, after crawling, I found that none of the Wikipedia pages have meta descriptions!).

After the page is scanned through, its URL and recorded details are saved to a local JSON file, using TinyDB.

I ran the crawler for a few minutes and managed to scrape around 1000 pages.

Indexing the Database

To return relevant pages for a user's search query, I was planning to use a KNN algorithm to compare the vector encoding of the search query with the vector encodings of the page contents stored in the database.

Vector encodings of natural language are crucial when it comes to understanding what a user is saying. As you will see later, I decided to use a Transformer model to encode sentences into a vector representation.


With what I had so far, it was possible to build a working search engine with the above method in mind, but it would be extremely inefficient, since each search would involve going through EVERY record in the database, vectorising each article's content and comparing it to the query vector.
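To make that bottleneck concrete, here is a minimal sketch of the brute-force approach, with small toy vectors standing in for the real 768-dimensional sentence embeddings (the function name and data are mine, for illustration only):

```python
import numpy as np

def brute_force_search(query_vec, doc_vecs, k=3):
    """Exhaustive nearest-neighbour search: compare the query against
    EVERY document vector, then sort by distance. O(n) work per query."""
    dists = np.linalg.norm(doc_vecs - query_vec, axis=1)  # euclidean distance to each doc
    return np.argsort(dists)[:k]                          # indices of the k closest docs

# Toy 3-d "embeddings" standing in for real sentence vectors
docs = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.9, 0.1, 0.0],
                 [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.05, 0.0])

print(brute_force_search(query, docs))  # docs 0 and 2 are closest to the query
```

With only four documents this is instant, but the distance computation grows linearly with the database, and every query pays that full cost.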

Increasing Search Efficiency

In order to increase the efficiency of the search, I had an idea to index/preprocess the database, in order to reduce the search space for the KNN algorithm.

The first step was to vectorise the content of all the pages I had scraped and stored in my database.

```python
from tinydb import TinyDB
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

model = SentenceTransformer('sentence-transformers/stsb-roberta-base-v2')
db = TinyDB("results.json")

contents = []
for record in db:
    content = record["content"]
    contents.append(content)

embeddings = model.encode(contents)
```

As you can see, for vectorising sentences, I used the SentenceTransformer library and the "stsb-roberta-base-v2" transformer model, which was fine-tuned for tasks like Neural Search, where a query needs to be matched with relevant documents (the exact task I had at hand).


Now that I had all the vector representations of the pages, I decided to cluster semantically similar pages together, using the K-Means clustering algorithm.

The idea behind this was that, when a search query is entered, the query would be vectorised and be classified into a cluster. Then we could perform KNN with the pages in that cluster only, instead of the whole database, which should improve efficiency.

```python
import pickle

kmeans = KMeans(n_clusters=3)
kmeans.fit(embeddings)

buckets = {}
for record, label in zip(db, kmeans.labels_):
    if label not in buckets:
        buckets[label] = []
    buckets[label].append(dict(record))

pickle.dump(kmeans, open("kmeans-model.save", "wb"))
pickle.dump(buckets, open("buckets.save", "wb"))
```

Clustering pages

```python
from sentence_transformers import SentenceTransformer
import pickle
import numpy as np
from sklearn.neighbors import NearestNeighbors

model = SentenceTransformer('sentence-transformers/stsb-roberta-base-v2')
kmeans = pickle.load(open("kmeans-model.save", "rb"))  # the model saved during clustering
buckets = pickle.load(open("buckets.save", "rb"))

while True:
    query = input("Search: ")
    encoded_query = model.encode(query)
    encoded_query = np.expand_dims(encoded_query, axis=0)
    bucket = kmeans.predict(encoded_query)
    bucket = int(bucket.squeeze())

    embeddings = []
    for result in buckets[bucket]:
        embeddings.append(result["content"])
    embeddings = model.encode(embeddings)

    neigh = NearestNeighbors(n_neighbors=10)
    neigh.fit(embeddings)
    indexes = neigh.kneighbors(encoded_query, return_distance=False)

    for index in indexes.T:
        doc = buckets[bucket][int(index.squeeze())]
        print(doc["title"], doc["url"])
        print("")
```

Matching user query to 10 most relevant pages

As the code above shows, I broke the database down into 3 clusters.

However, I did a lot of tinkering with the number of clusters.

If we group the database into too many clusters, searches would be extremely efficient, but there would be a huge drop in result accuracy. With too few clusters, the opposite happens: accuracy is high but searches are slow.
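One way to guide that tinkering is the classic "elbow" heuristic: fit K-Means for a range of cluster counts and look at where the inertia (within-cluster sum of squares) stops dropping sharply. A small sketch, using synthetic blobs in place of the real page embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy stand-in for the page embeddings: three well-separated blobs in 2-d
embeddings = np.vstack([rng.normal(loc=c, scale=0.1, size=(50, 2))
                        for c in ([0, 0], [5, 5], [10, 0])])

# Inertia for each candidate cluster count; the "elbow" where it
# stops dropping sharply is a reasonable choice of k
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings).inertia_
            for k in range(1, 7)}
for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

On this toy data the inertia collapses between k=1 and k=3 and then flattens out, suggesting 3 clusters; real page embeddings would of course give a much noisier curve.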

I found that, with 3 clusters, the accuracy of the results was really high; however, it was still extremely slow to return results (~7 seconds for each search, with the worst being 32 seconds).

Search: how do i compose music
Musical instrument - Wikipedia https://en.wikipedia.org/wiki/Musical_instrument
Elements of music - Wikipedia https://en.wikipedia.org/wiki/Elements_of_music
Music criticism - Wikipedia https://en.wikipedia.org/wiki/Music_criticism
Contemporary classical music - Wikipedia https://en.wikipedia.org/wiki/Contemporary_music
Accompaniment - Wikipedia https://en.wikipedia.org/wiki/Accompaniment
Musical improvisation - Wikipedia https://en.wikipedia.org/wiki/Musical_improvisation
Musique concrète - Wikipedia https://en.wikipedia.org/wiki/Musique_concr%C3%A8te
Programming (music) - Wikipedia https://en.wikipedia.org/wiki/Programming_(music)
Film score - Wikipedia https://en.wikipedia.org/wiki/Film_score
Song - Wikipedia https://en.wikipedia.org/wiki/Song
Harpsichord - Wikipedia https://en.wikipedia.org/wiki/Harpsichord
Music theory - Wikipedia https://en.wikipedia.org/wiki/Music_theory
Music industry - Wikipedia https://en.wikipedia.org/wiki/Music_industry
Definition of music - Wikipedia https://en.wikipedia.org/wiki/Definitions_of_music
Wolfgang Amadeus Mozart - Wikipedia https://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart
Invention (musical composition) - Wikipedia https://en.wikipedia.org/wiki/Invention_(musical_composition)
Music - Wikipedia https://en.wikipedia.org/wiki/Music
Music - Wikipedia https://en.wikipedia.org/wiki/Music
Aesthetics of music - Wikipedia https://en.wikipedia.org/wiki/Aesthetics_of_music
Musicology - Wikipedia https://en.wikipedia.org/wiki/Musicology

Searches relating to music took the most time, possibly because the database contained a lot of music pages, and so they were all grouped into one big cluster.

With anything above 6 clusters, I found that results were returned quickly; however, the accuracy was poor. For example, I'd search something simple, such as "Liverpool Football", but the engine would fail to return the Liverpool F.C. page, despite it being present in the database.

A better solution

With the trade-off between speed and accuracy in the above solution being way too sensitive, I had to find a better solution.

After a bit of research, I came across ANNOY.

ANNOY stands for "Approximate Nearest Neighbours Oh Yeah" and is a small library, provided by Spotify, to search for points in space that are close to a given query point.

Spotify themselves use ANNOY for their user music recommendations!

Approximate Nearest Neighbours?

You may be wondering why we would want an approximate nearest neighbours algorithm. Why not an exact one?

For KNN to be exact, it has to iterate through each and every datapoint given to it, which is obviously extremely inefficient.

Things can be drastically sped up if a little bit of accuracy is sacrificed, but, in practice, this sacrifice in accuracy does not matter at all. A user would not mind if the second closest datapoint and first closest datapoint are swapped around, since they are both probably good matches to their query.

How ANNOY works

Here is a good article explaining how ANNOY works (written by the man who built ANNOY himself!)

ANNOY works by building loads of binary trees (a forest) from the dataset it's given.

To build a tree, it selects two random points in the vector space and divides the space into two subspaces, by the hyperplane equidistant to the two random points.
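That splitting step can be sketched in a few lines (the function names here are mine, not ANNOY's): the hyperplane equidistant from two points p and q has the vector from p to q as its normal and passes through their midpoint, so which subspace a point lands in is just a dot-product comparison.

```python
import numpy as np

def split_plane(p, q):
    """Hyperplane equidistant from p and q: all x with normal @ x = offset,
    where the normal points from p towards q."""
    normal = q - p
    midpoint = (p + q) / 2
    offset = normal @ midpoint
    return normal, offset

def side(x, normal, offset):
    """Which subspace a point falls in: True = q's side, False = p's side."""
    return (normal @ x) > offset

p, q = np.array([0.0, 0.0]), np.array([4.0, 0.0])
normal, offset = split_plane(p, q)  # the vertical line x = 2 through the midpoint
print(side(np.array([1.0, 3.0]), normal, offset))   # False: closer to p
print(side(np.array([3.0, -1.0]), normal, offset))  # True: closer to q
```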


The process then repeats within each of the new subspaces.


This keeps going until each subspace contains at most a certain number of points, n.

Points that are near each other should end up in the same subspace, since it is unlikely that a hyperplane would separate them into different subspaces.

Now that we have all the subspaces, a binary tree can be constructed.


Each node of the tree represents a hyperplane. So, when given a query vector, we can traverse down the tree, with each hyperplane telling us which side to descend, in order to find the x most relevant points in the vector space.
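As a rough sketch (a toy structure for illustration, not ANNOY's actual internals), such a tree can be walked with one dot product per node:

```python
import numpy as np

# A node is either a leaf (a list of point indices) or an internal split:
# (normal, offset, left_subtree, right_subtree)
def query_tree(node, q):
    """Walk from the root to a leaf, picking a side at each hyperplane."""
    while isinstance(node, tuple):
        normal, offset, left, right = node
        node = right if normal @ q > offset else left
    return node  # the leaf's candidate point indices

# Tiny hand-built tree: first split on x = 2, then split the right side on y = 1
tree = (np.array([1.0, 0.0]), 2.0,
        [0, 1],                                 # leaf: points left of x = 2
        (np.array([0.0, 1.0]), 1.0, [2], [3]))  # right half split on y = 1
print(query_tree(tree, np.array([3.0, 0.5])))  # right of x=2, below y=1 -> [2]
```

The key property is that each query only touches a root-to-leaf path, so lookup cost is logarithmic in the number of points rather than linear.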

ANNOY builds many of these trees to build a forest. The number of trees in the forest is specified by the programmer.

When given a query vector, ANNOY uses a priority queue to search for it through the binary trees in its forest. The priority queue allows the search to focus on the trees that are best for the query (i.e. trees whose hyperplanes are far from the query vector).

After it has finished searching, ANNOY looks at all the common points its trees have found, which form the query vector's "neighbours".


Now the k nearest neighbours can be ranked and returned.
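That final step can be sketched like this (again a toy stand-in rather than ANNOY's code): pool the candidate indices returned by the trees, de-duplicate them, and rank them by exact distance to the query.

```python
import numpy as np

def rank_candidates(query, points, candidate_sets, k=2):
    """Pool the candidate indices from each tree, de-duplicate,
    then rank them by their exact distance to the query."""
    candidates = sorted(set().union(*candidate_sets))
    dists = [np.linalg.norm(points[i] - query) for i in candidates]
    order = np.argsort(dists)
    return [candidates[i] for i in order[:k]]

points = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.9, 0.1]])
# Each tree contributes the points from the leaf the query landed in
tree_results = [{0, 1}, {1, 3}, {0, 3}]
print(rank_candidates(np.array([1.0, 0.05]), points, tree_results))
```

Only the pooled candidates (here 3 of the 4 points) need exact distance computations, which is where the speed-up over brute-force KNN comes from.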


Refactoring Indexing and Search code

It didn't take long to change the code to use ANNOY, thanks to its straightforward API.

```python
from tinydb import TinyDB
from sentence_transformers import SentenceTransformer
from annoy import AnnoyIndex

model = SentenceTransformer('sentence-transformers/stsb-roberta-base-v2')
db = TinyDB("results.json")

descriptions = []
for record in db:
    description = record["content"]
    descriptions.append(description)

embeddings = model.encode(descriptions)

index = AnnoyIndex(embeddings.shape[-1], "euclidean")
vec_idx = 0
for vec in embeddings:
    index.add_item(vec_idx, vec)
    vec_idx += 1

index.build(10)
index.save("index.ann")  # stores the results of the indexing
```

Indexing code

```python
from tinydb import TinyDB
from sentence_transformers import SentenceTransformer
from annoy import AnnoyIndex

model = SentenceTransformer('sentence-transformers/stsb-roberta-base-v2')
db = TinyDB("results.json")

index = AnnoyIndex(768, "euclidean")  # 768 is the embedding size of stsb-roberta-base-v2
index.load("index.ann")

while True:
    query = input("Search: ")
    vec = model.encode(query)
    indexes = index.get_nns_by_vector(vec, 20)
    all_db = db.all()
    for i in indexes:
        print(all_db[i]["title"], all_db[i]["url"])
        print("")
```

User search code

Now let's see the results...

Search: great composers
Igor Stravinsky - Wikipedia https://en.wikipedia.org/wiki/Igor_Stravinsky
Contemporary classical music - Wikipedia https://en.wikipedia.org/wiki/Contemporary_music
Gustav Mahler - Wikipedia https://en.wikipedia.org/wiki/Gustav_Mahler
Symphony - Wikipedia https://en.wikipedia.org/wiki/Symphony
Art music - Wikipedia https://en.wikipedia.org/wiki/Art_music
Symphony No. 5 (Beethoven) - Wikipedia https://en.wikipedia.org/wiki/Symphony_No._5_(Beethoven)
Wolfgang Amadeus Mozart - Wikipedia https://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart
Ludwig van Beethoven - Wikipedia https://en.wikipedia.org/wiki/Ludwig_van_Beethoven
Music of Central Asia - Wikipedia https://en.wikipedia.org/wiki/Central_Asian_music
Big band - Wikipedia https://en.wikipedia.org/wiki/Big_band
Program music - Wikipedia https://en.wikipedia.org/wiki/Program_music
Cello Suites (Bach) - Wikipedia https://en.wikipedia.org/wiki/Bach_cello_suites
Film score - Wikipedia https://en.wikipedia.org/wiki/Film_score
Johann Sebastian Bach - Wikipedia https://en.wikipedia.org/wiki/Johann_Sebastian_Bach
Elements of music - Wikipedia https://en.wikipedia.org/wiki/Elements_of_music
Music of China - Wikipedia https://en.wikipedia.org/wiki/Chinese_classical_music
Organ (music) - Wikipedia https://en.wikipedia.org/wiki/Organ_(music)
Toccata and Fugue in D minor, BWV 565 - Wikipedia https://en.wikipedia.org/wiki/Toccata_and_Fugue_in_D_minor,_BWV_565
Georg Philipp Telemann - Wikipedia https://en.wikipedia.org/wiki/Georg_Philipp_Telemann
Sonata form - Wikipedia https://en.wikipedia.org/wiki/Sonata_form

Search: what are the rules of cricket
No-ball - Wikipedia https://en.wikipedia.org/wiki/No-ball
International cricket - Wikipedia https://en.wikipedia.org/wiki/International_cricket
Toss (cricket) - Wikipedia https://en.wikipedia.org/wiki/Toss_(cricket)
Match referee - Wikipedia https://en.wikipedia.org/wiki/Match_referee
Cricket ball - Wikipedia https://en.wikipedia.org/wiki/Cricket_ball
Board of Control for Cricket in India - Wikipedia https://en.wikipedia.org/wiki/Board_of_Control_for_Cricket_in_India
India national cricket team - Wikipedia https://en.wikipedia.org/wiki/India_national_cricket_team
Caught - Wikipedia https://en.wikipedia.org/wiki/Caught
Substitute (cricket) - Wikipedia https://en.wikipedia.org/wiki/Substitute_(cricket)
International Cricket Council - Wikipedia https://en.wikipedia.org/wiki/International_Cricket_Council
Portal:Cricket - Wikipedia https://en.wikipedia.org/wiki/Portal:Cricket
Delivery (cricket) - Wikipedia https://en.wikipedia.org/wiki/Delivery_(cricket)
Cricketer (disambiguation) - Wikipedia https://en.wikipedia.org/wiki/Cricketer_(disambiguation)
Cricket (disambiguation) - Wikipedia https://en.wikipedia.org/wiki/Cricket_(disambiguation)
Cricket West Indies - Wikipedia https://en.wikipedia.org/wiki/Cricket_West_Indies
Zimbabwe Cricket - Wikipedia https://en.wikipedia.org/wiki/Zimbabwe_Cricket
Bowled - Wikipedia https://en.wikipedia.org/wiki/Bowled
Pakistan Cricket Board - Wikipedia https://en.wikipedia.org/wiki/Pakistan_Cricket_Board
World Cricket League - Wikipedia https://en.wikipedia.org/wiki/World_Cricket_League
West Indies cricket team - Wikipedia https://en.wikipedia.org/wiki/West_Indies_cricket_team

As you can see, the results returned are pretty accurate! The small size of my database obviously doesn't help, but the engine still yielded some good results despite it!

On top of that, these results were produced almost instantly, much, much quicker than the previous solution.

Conclusion

The only thing I could see holding the search results back was the size of the database. If I set out to build a much bigger search engine in the future, I'd look to use databases such as Firebase or MongoDB, and look into how ANNOY could interface with them.

Having said that, I built this project to investigate how deep learning models can be used in document search tasks, and how those searches can be performed efficiently, and I think I've taken a lot away from it.

Thank you for reading and I hope you've learnt something from this too!


Original Link: https://dev.to/ashwinscode/how-i-built-a-deep-learning-powered-search-engine-49mh
