Make Notion search great again: Vector Database
In this series we're looking into the implementation of a vector index built from the contents of our company Notion pages, one that allows us not only to search for relevant information but also to let a language model answer our questions directly, with Notion as its knowledge base. In this article, we will see how we've used a vector database to finally achieve this.
Numbers, vectors, and charts are real data unless stated otherwise
Last time we downloaded and processed data from the Notion API. Let's do something with it.
Vector Database
To find semantically similar texts we need to calculate the distance between vectors. While we have just a few short texts we can brute-force it: calculate the distance between our query and each text embedding one by one and see which one is the closest. When we deal with thousands or even millions of entries in our database, however, we need a more efficient way of comparing vectors. Just like for any other way of searching through a lot of entries, an index can help here. To make our life easier we'll use Weaviate DB, a vector database that implements the HNSW vector index to improve the performance of vector search.
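To make the brute-force approach concrete, here is a minimal sketch (names and shapes are ours, not from the real pipeline): compute the cosine distance between the query embedding and every stored embedding, then keep the closest entries. This is O(n) per query, which is exactly the cost an HNSW index avoids at scale.

```typescript
type Entry = { text: string; embedding: number[] };

// Cosine distance = 1 - cosine similarity; 0 means identical direction.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force search: compare the query against every entry and take the top K.
function bruteForceSearch(query: number[], entries: Entry[], topK = 3): Entry[] {
  return [...entries]
    .sort((x, y) => cosineDistance(query, x.embedding) - cosineDistance(query, y.embedding))
    .slice(0, topK);
}
```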
There are a lot of different vector databases you can use. We've used Weaviate DB because it has reasonable defaults, including vector and BM25 indexes working out of the box, and a lot of features that can be enabled with modules (like the rerank mentioned before). You can also consider the Postgres extension pgvector to take advantage of SQL goodness: relations, joins, subqueries, and so on, while Weaviate may be more limited in that regard. Choose wisely!
I may revisit the topic of vector indexes in the future, but in this article I'll just use the database that implements one. To learn more about HNSW itself, and about configuring the vector index in Weaviate DB, see the Weaviate documentation.
Weaviate DB
Weaviate DB is an open-source, scalable vector database that you can easily use in your own projects. The vector goodness is just one Docker container away, and you can run it like this:
docker run -p 8080:8080 -d semitechnologies/weaviate:latest
Weaviate is modular, and there are a number of modules allowing you to add functionality to your database. You can provide the embedding vectors for the database entries yourself, but there are modules that calculate them for you, like the text2vec-openai module that uses the OpenAI API. There are modules allowing you to easily back up your DB data to S3, add rerank functionality to your searches, and more. Enabling a module is as simple as adding an environment variable:
docker run -p 8080:8080 -d \
  -e ENABLE_MODULES=text2vec-openai,backup-s3,reranker-cohere \
  semitechnologies/weaviate:latest
Now, to connect to the database from our TypeScript project:
import weaviate from 'weaviate-ts-client';

const client = weaviate.client({
  scheme: 'http',
  host: 'localhost:8080',
});
All the data in Weaviate DB is stored in classes (equivalent to tables in SQL or collections in MongoDB), containing data objects. Objects have one or more properties of various types, and each object can be represented by exactly one vector. Just like SQL databases, Weaviate is schema-based. We define a class with its name, properties, and additional configuration, such as which modules should be used for vectorization. Here is the simplest class, with one property:
{
  class: 'MagicIndex',
  properties: [
    {
      name: 'content',
      dataType: ['text'],
    },
  ],
}
We can add as many properties as we like. There are a number of types available: integer, float, text, boolean, geoCoordinates (with special ways to query based on location), blob, or lists of most of these, like int[] or text[]:
{
  class: 'MagicIndex',
  properties: [
    { name: 'content', dataType: ['text'] },
    { name: 'tags', dataType: ['text[]'] },
    { name: 'lastUpdated', dataType: ['date'] },
    { name: 'file', dataType: ['blob'] },
    { name: 'location', dataType: ['geoCoordinates'] },
  ],
}
You can also control how, and for which properties, the embeddings are going to be calculated if you don't want to provide them yourself:
{
  class: 'MagicIndex',
  properties: [
    { name: 'content', dataType: ['text'] },
    {
      name: 'metadata',
      dataType: ['text'],
      moduleConfig: {
        'text2vec-openai': {
          skip: true,
        },
      },
    },
  ],
  vectorizer: 'text2vec-openai',
}
In this case, we're going to use the text2vec-openai module to calculate vectors, but only from the content property.
Weaviate stores exactly one vector per object, so if you have more than one vectorized field (or you have class-name vectorization enabled), the embedding is calculated from the concatenated texts. If you want separate vectors for different properties of the document (like different chunks, the title, metadata, etc.), you need separate entries in the database.
Applying a schema is as simple as:
await client.schema
  .classCreator()
  .withClass(classDefinition)
  .do();
Let's see what the data objects look like in our Notion index:
{
  pageTitle: 'Locomotive Kinematics of Quick Brown Foxes: An In-Depth Analysis of Canine Velocity Over Lazy Canid Obstacles',
  chunk: '1',
  originalContent: '# Abstract\nThe paradigm of quick brown foxes leaping over lazy dogs has long fascinated both the scientific community and the general public...',
  content: 'abstract\nthe paradigm of quick brown foxes leaping over lazy dogs has long fascinated both the scientific community and the general public...',
  pageId: 'dfda9d5d-b059-4186-95f4-7cb8cdf42545',
  pageType: 'page',
  pageUrl: 'https://www.notion.so/LeapFoxSolutions/dfda9d5d-b059-4186-95f4-7cb8cdf42545',
  lastUpdated: '2023-04-12T23:20:50.52Z'
}
Let's get the obvious out of the way: we store the page title, its ID, URL, and the last update date. We also vectorize only the content property: the vectorizer ignores the title, originalContent, and so on.
You probably noticed the chunk property though. What is it? For vectors to work best, it is preferable that texts are not too long. They are generally used for texts no longer than a short paragraph, so we split the contents of Notion pages into smaller chunks. We've used LangChain's recursive text splitter. It tries to split the text first by double newlines; if some chunks are still too long, by single newlines, then by spaces, and so on. This way we keep paragraphs together if possible. We've set the target chunk length to 1000 characters with a 200-character overlap.
The length of the chunks and the way you split them can have a huge impact on vector search performance. It is generally assumed that chunk size should be similar to the length of the query (so during the search you compare vectors of similarly sized texts). In our case, chunks 1000 characters long, although pretty big, seem to work best, but your mileage may vary. Additionally, we also make sure that table rows are not sliced in half, to avoid orphaned columns. This is a huge topic and I may revisit it in one of the future posts.
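The recursive splitting idea can be sketched in a few lines. This is a simplified illustration, not LangChain's actual implementation (which also handles the 200-character overlap and more separators): try the "biggest" separator first, merge pieces greedily up to the size limit, and only fall back to smaller separators for pieces that are still too long.

```typescript
// Separators tried in order: paragraphs, then lines, then words.
const SEPARATORS = ['\n\n', '\n', ' '];

function recursiveSplit(text: string, chunkSize = 1000, separators = SEPARATORS): string[] {
  if (text.length <= chunkSize) return [text];
  const [sep, ...rest] = separators;
  if (sep === undefined) {
    // No separators left: hard-cut the text.
    const cut: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) cut.push(text.slice(i, i + chunkSize));
    return cut;
  }
  // Split on the current separator, then merge pieces greedily up to chunkSize.
  const parts = text.split(sep);
  const chunks: string[] = [];
  let current = '';
  for (const part of parts) {
    const candidate = current ? current + sep + part : part;
    if (candidate.length <= chunkSize) {
      current = candidate;
    } else {
      if (current) chunks.push(current);
      current = '';
      // A single piece may itself be too long: recurse with smaller separators.
      if (part.length > chunkSize) chunks.push(...recursiveSplit(part, chunkSize, rest));
      else current = part;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```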
We save each chunk separately in the database, and the chunk property is the index of the chunk. Why is it a string and not a number though? Because we don't vectorize the title property, we save a separate entry for it that looks like this:
{
  pageTitle: 'Locomotive Kinematics of Quick Brown Foxes: An In-Depth Analysis of Canine Velocity Over Lazy Canid Obstacles',
  chunk: 'title',
  originalContent: 'Locomotive Kinematics of Quick Brown Foxes An In-Depth Analysis of Canine Velocity Over Lazy Canid Obstacles',
  ...
}
In the future, we may decide to vectorize more properties of the page than just content and title. We can do that easily, just by adding a new possible value to the chunk property.
What's the deal with the content and originalContent properties? To spare the vectorizer some noise in the data, we prepare a cleaned-up version of each chunk. We remove all special characters, replace multiple whitespaces with a single one, and change the text to lowercase. In our testing, vector search is slightly more accurate with this simple cleanup. We still keep originalContent though, because this is what we pass to rerank and use for traditional, reverse index search.
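The cleanup described above can be sketched as a single function (the exact rules in our pipeline may differ slightly; this is an illustration): strip anything that isn't a letter, digit, or whitespace, collapse whitespace runs, and lowercase.

```typescript
// Prepare a cleaned-up chunk for embedding: drop special characters,
// collapse whitespace, lowercase. The original text is kept separately
// for BM25 search and rerank.
function cleanForEmbedding(text: string): string {
  return text
    .replace(/[^\p{L}\p{N}\s]/gu, '') // keep only letters, digits, whitespace
    .replace(/\s+/g, ' ')             // collapse runs of whitespace into one space
    .trim()
    .toLowerCase();
}
```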
Lastly, we have the pageType property, which is just a result of a Notion quirk: a page in Notion can be either a page or a database. As mentioned in the previous article, we treat both the same way in our index: databases are converted to simple tables.
OK, we have an idea of what data we are going to store in the database, but how do we add, fetch, and query that data?
Weaviate interface
Weaviate offers two interfaces to interact with it, RESTful and GraphQL APIs, and this is reflected in the available TypeScript client methods. We will focus on the GraphQL interface. To get entries from the database, we simply provide a class name and the fields we want to get:
client.graphql
  .get()
  .withClassName('MagicIndex')
  .withFields('pageTitle originalContent pageUrl');
It is recommended that each query is limited and uses cursor-based pagination if necessary:
client.graphql
  .get()
  .withClassName('MagicIndex')
  .withFields('pageTitle originalContent pageUrl')
  .withLimit(50)
  .withAfter(cursor);
Let's add some entries to the database:
await client.data
  .creator()
  .withClassName('MagicIndex')
  .withProperties({
    pageTitle: 'Vulpine Agility vs. Canine Apathy: A Comparative Study',
    chunk: '2',
    originalContent: '## Background\nThough colloquially immortalized in typographical tests, the scenario of a quick brown fox vaulting over a lazy dog presents...',
    content: 'background\nthough colloquially immortalized in typographical tests the scenario of a quick brown fox vaulting over a lazy dog presents...',
    pageId: '1ba0b851-d443-4290-8415-3cd295850d14',
    pageType: 'page',
    pageUrl: 'https://www.notion.so/LeapFoxSolutions/1ba0b851-d443-4290-8415-3cd295850d14',
    lastUpdated: '2023-03-01T12:21:30.12Z'
  })
  .do();
With the vectorizer enabled for the MagicIndex class, that's all we need to do. The entry is added to the database together with its vector representation calculated by OpenAI's ada embedding model. Now we can search for texts about foxes and dogs all day long.
Traditional search
Weaviate allows us to search with traditional reverse index methods too! We have a bag-of-words ranking function called BM25F at our disposal. It's configured with reasonable defaults out of the box. Let's see it in action:
await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withBm25({
    query: 'Can the fox really jump over the dog?',
    properties: ['originalContent'],
  })
  .withLimit(5)
  .withFields('pageTitle originalContent pageUrl _additional { score }')
  .do();
You can see the _additional property that we can request in the query. It can contain various additional data related to the object itself (like its ID) or to the search (like the BM25 score, or the cosine distance in the case of vector search).
Vector search
Of course, a reverse index search will not find many texts that, while talking about brown foxes, don't use those exact words. Thankfully, semantic search is just as easy to perform:
await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withNearText({ concepts: ['Can the fox really jump over the dog?'] })
  .withLimit(5)
  .withFields('pageTitle originalContent pageUrl _additional { distance }')
  .do();
There is some additional magic we can apply to make the search even better, like setting the maximum cosine distance we accept in the search results, or using the autocut feature:
await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withNearText({
    concepts: ['Can the fox really jump over the dog?'],
    distance: 0.25,
  })
  .withAutocut(2)
  .withLimit(10)
  .withFields('pageTitle originalContent pageUrl _additional { distance }')
  .do();
Now, not only do we get only results with a cosine distance of less than 0.25 (that's what the distance setting in the withNearText method does), but additionally, Weaviate's autocut feature will group the results by similar distance and return the first two groups (see the Weaviate documentation for more on how autocut works).
But that's not all. We can also make the search favor some concepts and avoid others:
await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withNearText({
    concepts: ['Can the fox really jump over the dog?'],
    moveAwayFrom: {
      concepts: ['typography'],
      force: 0.45,
    },
    moveTo: {
      concepts: ['scientific'],
      force: 0.85,
    },
  })
  .withFields('pageTitle originalContent pageUrl')
  .do();
While the example with foxes is a little silly, you can imagine many scenarios where this feature can be really useful. Maybe you're looking for ways to fly, but you want to move away from planes and toward animals. Or you may search for a query, but keep the results similar to some other object in the database:
await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withNearText({
    concepts: ['Can the fox really jump over the dog?'],
    moveTo: {
      objects: [{ id: '84ab0371-a73b-4774-8b03-eccb97b640ae' }],
      force: 0.85,
    },
  })
  .withFields('pageTitle originalContent pageUrl')
  .do();
There are many other features that you may want to experiment with. Read more about them in the Weaviate documentation.
Hybrid search
Finally, we can combine the power of vector search with the BM25 index! Here comes hybrid search, which uses both methods and combines them with given weights:
await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withHybrid({
    query: 'Can the fox really jump over the dog?',
  })
  .withLimit(5)
  .withFields('pageTitle originalContent pageUrl _additional { distance score explainScore }')
  .do();
In the _additional.explainScore property, you will find the details of the score contributions from the vector and reverse index searches. By default, the vector search result has a weight of 0.75 and the reverse index 0.25, and those are the values we use in our Notion search. More about how hybrid search works and how to customize the query (including how to change the way vector and reverse index results are combined) can be found in the Weaviate documentation.
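Conceptually, the weighting works like this sketch (Weaviate's real fusion algorithms, rankedFusion and relativeScoreFusion, normalize scores before combining; this only illustrates the 0.75 / 0.25 split on already-normalized scores):

```typescript
type Scored = { id: string; score: number };

// Combine a vector result list and a BM25 result list with an alpha weight:
// final score = alpha * vectorScore + (1 - alpha) * bm25Score.
function hybridCombine(vector: Scored[], bm25: Scored[], alpha = 0.75): Scored[] {
  const combined = new Map<string, number>();
  for (const { id, score } of vector) combined.set(id, alpha * score);
  for (const { id, score } of bm25) {
    combined.set(id, (combined.get(id) ?? 0) + (1 - alpha) * score);
  }
  // Sort descending: documents found by both searches get a boost.
  return [...combined.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```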
Rerank
If we enable the rerank module, we can use it to improve the quality of search results. It works for any search method: vector, BM25, or hybrid:
await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withHybrid({
    query: 'Can the fox really jump over the dog?',
  })
  .withLimit(100)
  .withFields('pageTitle originalContent pageUrl _additional { rerank(property: "originalContent" query: "Can the fox really jump over the dog?") { score } }')
  .do();
Adding a rerank score field to the query makes Weaviate call the rerank module and reorder the results based on the score received. To increase the chance of finding relevant results, we've also increased the limit: now rerank has more texts to work on and can find relevant results even if we had a lot of false positives from the hybrid search.
Summary
To summarize: in our Notion index we've used Weaviate DB with the following modules:
- text2vec-openai, enabling Weaviate to calculate embeddings using the OpenAI API and the ada model
- reranker-cohere, allowing us to use Cohere's reranking model to improve search results
- backup-s3, just to make it easier to back up data and migrate between environments
To get the data to index, we fetch all Notion pages using the search endpoint with an empty query. For each page, we recursively fetch all blocks, which are then parsed by a set of parsers, one specific to each type of block. The result is a markdown-formatted string for each page.
We then split the contents of each page into chunks: 1000 characters long with 200 characters of overlap. We also clean up the texts by removing special characters and multiple whitespaces to improve the performance of vector search.
The data for each page chunk is then inserted into the database with a fairly straightforward schema. We have the index of the chunk and some properties of the Notion page: URL, ID, title, and type. Additionally, we keep both the original, unaltered content and the cleaned-up version, but we calculate embeddings only from the latter.
To find information in the index, we use the hybrid search with a default limit of 100 chunks, with rerank enabled by default.
What worked and what didn't
So, the $100 million question: does it work?
Absolutely! We have a working semantic search that allows us to reliably find information even without using the exact wording of the pages we're looking for. You can search for "parking around the office" or "where to leave my car around the office" or even just "parking?". How to use the coffee machine? What benefits are available in Brainhub? Which member of the team is skilled in martial arts? Who should I talk to if I want a new laptop? What are Brainhub's values?
Not everything works perfectly though. Finding information in large tables (e.g. we have a table of team members: long, with a lot of columns and long texts inside) may be challenging unless you're smart about chunking them, e.g. by ensuring that one row lands in one chunk, even if it's very long, to avoid orphaned columns. Even then the search is not perfect: when asking who is a UX designer on our team, it may find a chunk with one person out of the three UX designers in the table. While this is fine for search (in the search results, you still get the link to the correct page that contains the whole table), it may not be enough for a Q&A bot, which may miss some information because of it.
Another issue is noise. One of the reasons we wanted a better search was the thousands of pages of meeting notes, outdated guidelines, and other mostly irrelevant stuff that lurks in the depths of our Notion workspace. We did implement some mitigations to improve search results and get rid of noise, like lowering the search score of old pages, but it was not enough. The best method was still manually excluding the most problematic areas. That's not ideal, of course; we would like our search engine to figure out what's relevant automatically, so that's something to research further.
In general though, the results are more than satisfactory and, while a lot of small tweaks were needed here and there, we've managed to create a Notion search that actually works.
Original Link: https://dev.to/brainhubeu/make-notion-search-great-again-vector-database-2gnm