Your Web News in One Place

Help Webnuz

Referal links:

Sign up for GreenGeeks web hosting
July 18, 2022 08:22 pm GMT

How to build your own google search

Got interested ? Yeah we will build an algorithm to make our own search engine.

There are a couple of steps in doing so. First, we do some web scraping and index the data we obtain to store it in the database in a meaningful format. We wont be doing these here, though. What we will do is to search what you want from these indexed data. So lets get started.

Problem at hand

You suddenly get an urge to build your own google search. Your friend is helping you out in the initial stuff, but he is really dumb when it comes to algorithms. So you need to build an algorithm that can search for a word efficiently from a given list of strings.

We shall make some assumptions here. Firstly, we will assume that the strings will be unique and will be comprising of single word each. Also, the characters of the strings will be from a-z.

NOTE: These assumptions are just to deduce the algorithm. The algorithm that will be building can handle all characters and also multi-word strings. These assumptions will just let us deduce the algorithm easily and then we can generalise our algorithm by some minute modifications.

Intuition

Lets first try to think what we can do in this case.

One simple option is to hash each string to a value and use the string as and when it is called. However, there are two problems in this method. We need to store the same part of the string twice. For example, consider these two cases:

[ "app", "apple" ]

Here, it is evident that storing app and apple separately will cause us to use more memory as compared to storing one of them. So the question is can we do better ?

Well, the answer is yes we can, however, there are some trade-offs for this method.

We can use something called a trie.

Tries are efficient information retrieval data structures. We can use these to retrieve data more efficiently than arrays (they dont have that O(1) time complexity, though).

And guess what ?

Tries take lesser memory than hashmaps.

Solution

Now, lets implement our trie. Before we start, though, Id like to give you a basic structure of a TrieNode (basically a node or vertex in the trie).

struct Node{    char val;    Node* next;};class TrieNode{    unordered_map<char, Node*> children;    bool endOfWord;public:    TrieNode() {        for(char c='a'; c<='z'; c++)            children[c] = NULL;        endOfWord = false;    }};

This is what a basic trie node looks like. Here, I took the information in each node as a character. It can be anything ranging from a character to complex objects. The children map will have all the characters in range (here, a-z) as keys and the node for that characters data as the value.

So once we have that in place, now is the time we construct our trie.

First step is to understand how it will work. Basically, we will have a starting node (lets call this a wildcard character). This node will be an empty TrieNode with null pointers as all its children and endOfWord as false.

Now, we keep on adding the strings in our trie. For each string, we add one character to the tries children with the corresponding nodes value. We repeat unless the string has ended. Once it has, we mark the endOfWord of that TrieNode as true.

void insert(string word) {    Node* curr = root;    for(char c : word) {        if(curr->children.find(c) == curr->children.end()) {            curr->children[c] = new Node();            curr = curr->children[c];        }    }    curr->endOfWord = true;}

We do this for each string we get and at the end, we will have our trie completed.

Good, now we have our data in place. But how do we query it ?

If you understood it till now, you shouldve already guessed it. However, if you are dumb (like me), then hang on. Lets discuss this through.

So querying it is quite simple. For the query word, we perform a similar logic and search character by character. If at any moment we see that a character that you need to complete the current word is not present, we can conclude that the word is not present in the trie and need to go further. (Case I)

In case you find the whole word, however, the last character you visited to doesnt have endOfWord as true, then again, the word you are searching for doesnt exist. In all other cases, you can find the word you are searching for. (Case II)

Consider the following example for case I:

  * / \a   b|   |p   e|   |p   a|   |l   r| e

So we have apple and bear in our trie. Now, say, you search for boat.

You see we have a b but, hold on we do not have an o after b. Hence, boat doesnt exist in our trie. Now, say you add boat to it. Itd look something like this:

  * / \a   b|  / \p o   e| |   |  p a   a| |   |  l t   r| e

Now, if you search for boat, you can see you will get all the characters, and t will also have endOfWord as true.

Consider the following example for case I:

  * / \a   b|  / \p o   e| |   |  p a   a| |   |  l t   r| e

Now, say you search for app. You notice, a is present, then p is also present and the next p is also present. However, this p doesnt have endOfWord set to true. Hence, app isnt present in our trie. Now, if you add app to our trie, itd look something like this.

  * / \a   b|  / \p o   e| |   |  p a   a| |   |  l t   r| e

As you can see, we marked p with endOfWord true as well in the above case.

Here is how you implement this in code:

bool search(string word) {    Node* curr = root;    for(char c : word) {    if(curr->children.find(c) == curr->children.end()) {      return false;    }    curr = curr->children[c];    }    return curr->endOfWord;}

Also, lets discuss a bonus thing you can do with tries. You can even predict the words (or characters in this case) that the user is going to type next. It works on the same logic as above with a slight modification.

Here, we just say if the word starts with the prefix string. All you need to do to make this into an autosuggestion feature is to put the words into an array and returning it in place of true (also, change the function return type lol).

bool startsWith(string prefix) {    Node* curr = root;    for(char c : prefix) {        if(curr->children.find(c) == curr->children.end()) {            return false;        }        curr = curr->children[c];    }    return true;}

With that, my lecture is over. But youd ask now, Wait! You mentioned this is better than hashmaps and started giving a lecture without even asking if I was interested. So is it really more efficient than a hasmap ?

So lets try to answer that.

Here, we are traversing through the entire trie and we visit each node in a branch exactly once. So, the time complexity will be O(h) in the worst case, where h is the height of the trie. If you know that the trie has N nodes, then, you can even say it takes O(logN) time (Yay less than linear!!!).

The space needed is almost constant as each node has a fixed number(26 in this case) of children. If the height of tree is h, then, space taken is O(26*h) ~ O(h) which can also be interpreted as O(logN) if the number of nodes are N.

As you can imagine, its pretty efficient as compared the the naive way of using a hashmap to store stuff. Sure, wed get O(1) search time, but wed have taken a lot of memory and we also wouldve to give up on the autosuggest featue xP.

Well, with that, youve successfully planned the algorithm to build the next google search. What are you waiting for ? Go and build it! Dont keep your friend waiting after he did all that hardwork of doing the initial indexing for you.

Thanks for reading it till the end and stay tuned for more amazing content like this one!


Original Link: https://dev.to/samyc2002/how-to-build-your-own-google-search-4i85

Share this article:    Share on Facebook
View Full Article

Dev To

An online community for sharing and discovering great ideas, having debates, and making friends

More About this Source Visit Dev To