August 24, 2022 05:28 pm GMT

Embeddings index components

This article is part of a tutorial series on txtai, an AI-powered semantic search platform.

The main components of txtai are embeddings, pipeline, workflow and an api. This is confirmed with a look at the txtai src tree.

Abbreviated listing of src/txtai:
ann
api
database
embeddings
pipeline
scoring
vectors
workflow

One might ask: why are ann, database, scoring and vectors top-level packages rather than part of the embeddings package? The reason is that each of these packages is modular and can be used on its own! The embeddings package provides the glue between these components, making everything easy to use.

This article will go through a series of examples demonstrating how each component can be used standalone.

Note: This is intended as a deep dive into txtai embeddings components. There are much simpler high-level APIs for standard use cases.

Install dependencies

Install txtai and all dependencies.

# Install txtai
pip install txtai datasets

Load dataset

This example will use the ag_news dataset, which is a collection of news article headlines.

from datasets import load_dataset
dataset = load_dataset("ag_news", split="train")

Approximate nearest neighbor (ANN) and Vectors

In this section, we'll use the ann and vectors packages to build a similarity index over the ag_news dataset.

The first step is vectorizing the text. We'll use a sentence-transformers model.

import numpy as np
from txtai.vectors import VectorsFactory
model = VectorsFactory.create({"path": "sentence-transformers/all-MiniLM-L6-v2"}, None)
embeddings = []
# List of all text elements
texts = dataset["text"]
# Create embeddings buffer, vector model has 384 features
embeddings = np.zeros(dtype=np.float32, shape=(len(texts), 384))
# Vectorize text in batches
batch, index, batchsize = [], 0, 128
for text in texts:
    batch.append(text)
    if len(batch) == batchsize:
        vectors = model.encode(batch)
        embeddings[index : index + vectors.shape[0]] = vectors
        index += vectors.shape[0]
        batch = []
# Last batch
if batch:
    vectors = model.encode(batch)
    embeddings[index : index + vectors.shape[0]] = vectors
# Normalize embeddings
embeddings /= np.linalg.norm(embeddings, axis=1)[:, np.newaxis]
# Print shape
embeddings.shape
(120000, 384)
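
The normalization step matters because, for unit-length vectors, a plain dot product equals cosine similarity. The following is a minimal standard-library sketch of that identity; the helper names (normalize, dot, cosine) and the toy vectors are illustrative assumptions, not part of txtai.

```python
import math

def normalize(vector):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for real embeddings
a, b = [3.0, 4.0], [4.0, 3.0]

# Both values agree (up to floating point rounding)
print(cosine(a, b))
print(dot(normalize(a), normalize(b)))
```

Normalizing once at index time means each later comparison only needs a dot product.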

Next we'll build a vector index using these embeddings!

from txtai.ann import ANNFactory
# Create Faiss index using normalized embeddings
ann = ANNFactory.create({"backend": "faiss"})
ann.index(embeddings)
# Show total
ann.count()
120000

Now let's run a search.

query = model.encode(["best planets to explore for life"])
query /= np.linalg.norm(query)
for uid, score in ann.search(query, 3)[0]:
    print(uid, texts[uid], score)
17752 Rocky Road: Planet hunting gets closer to Earth Astronomers have discovered the three lightest planets known outside the solar system, moving researchers closer to the goal of finding extrasolar planets that resemble Earth. 0.599043607711792
16158 Earth #39;s  #39;big brothers #39; floating around stars Washington - A new class of planets has been found orbiting stars besides our sun, in a possible giant leap forward in the search for Earth-like planets that might harbour life. 0.5688529014587402
45029 Coming Soon: "Good" Jupiters Most of the extrasolar planets discovered to date are gas giants like Jupiter, but their orbits are either much closer to their parent stars or are highly eccentric. Planet hunters are on the verge of confirming the discovery of Jupiter-size planets with Jupiter-like orbits. Solar systems that contain these "good" Jupiters may harbor habitable Earth-like planets as well. 0.5606889724731445
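
Conceptually, the search above is a nearest neighbor lookup: score the query against stored vectors and keep the best matches. As a point of reference, here is a minimal sketch of an exact, brute-force version using only the standard library. The function name and toy vectors are assumptions for illustration; an ANN backend such as Faiss aims to return similar results without necessarily scanning every row.

```python
import heapq

def search(query, vectors, limit):
    """Exact search: score every vector against the query, keep the top `limit`."""
    scores = (
        (uid, sum(q * v for q, v in zip(query, vector)))
        for uid, vector in enumerate(vectors)
    )
    # Returns (id, score) pairs sorted by descending score
    return heapq.nlargest(limit, scores, key=lambda pair: pair[1])

# Toy unit-length vectors standing in for the real embeddings
vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, 0.6]]
query = [1.0, 0.0]

results = search(query, vectors, 3)
for uid, score in results:
    print(uid, score)
```

The result has the same (id, score) shape as the pairs iterated over in the search above, which is what makes the ids usable as a lookup into texts.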

And there it is, a full vector search system without using the embeddings package.

Just as a reminder, the following much simpler code does the same thing with an Embeddings instance.

from txtai.embeddings import Embeddings
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2"})
embeddings.index((x, text, None) for x, text in enumerate(texts))
for uid, score in embeddings.search("best planets to explore for life"):
    print(uid, texts[uid], score)
17752 Rocky Road: Planet hunting gets closer to Earth Astronomers have discovered the three lightest planets known outside the solar system, moving researchers closer to the goal of finding extrasolar planets that resemble Earth. 0.599043607711792
16158 Earth #39;s  #39;big brothers #39; floating around stars Washington - A new class of planets has been found orbiting stars besides our sun, in a possible giant leap forward in the search for Earth-like planets that might harbour life. 0.568852961063385
45029 Coming Soon: "Good" Jupiters Most of the extrasolar planets discovered to date are gas giants like Jupiter, but their orbits are either much closer to their parent stars or are highly eccentric. Planet hunters are on the verge of confirming the discovery of Jupiter-size planets with Jupiter-like orbits. Solar systems that contain these "good" Jupiters may harbor habitable Earth-like planets as well. 0.560688853263855

Database

When the content parameter is enabled, an Embeddings instance stores both vector content and raw content in a database. But the database package can be used standalone too.

from txtai.database import DatabaseFactory
# Load content into database
database = DatabaseFactory.create({"content": True})
database.insert((x, row, None) for x, row in enumerate(dataset))
# Show total
database.search("select count(*) from txtai")
[{'count(*)': 120000}]

The full txtai SQL query syntax is available, including working with dynamically created fields.

database.search("select count(*), label from txtai group by label")
[{'count(*)': 30000, 'label': 0}, {'count(*)': 30000, 'label': 1}, {'count(*)': 30000, 'label': 2}, {'count(*)': 30000, 'label': 3}]

Let's run a query to find text containing the word planets.

for row in database.search("select id, text from txtai where text like '%planets%' limit 3"):
    print(row["id"], row["text"])
100 Comets, Asteroids and Planets around a Nearby Star (SPACE.com) SPACE.com - A nearby star thought to harbor comets and asteroids now appears to be home to planets, too. The presumed worlds are smaller than Jupiter and could be as tiny as Pluto, new observations suggest.
102 Redesigning Rockets: NASA Space Propulsion Finds a New Home (SPACE.com) SPACE.com - While the exploration of the Moon and other planets in our solar system is nbsp;exciting, the first task for astronauts and robots alike is to actually nbsp;get to those destinations.
272 Sharpest Image Ever Obtained of a Circumstellar Disk Reveals Signs of Young Planets MAUNA KEA, Hawaii -- The sharpest image ever taken of a dust disk around another star has revealed structures in the disk which are signs of unseen planets.     Dr...

Since this is just a SQL database, text search is quite limited. The query above simply retrieved rows whose text contains the literal string planets.
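
To make the limitation concrete, here is a small sketch using Python's built-in sqlite3 module with a hypothetical articles table (not txtai's own schema). A LIKE filter only matches the literal substring, so a row about the same topic that never uses the word is missed.

```python
import sqlite3

# In-memory database with a hypothetical table, for illustration only
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, text TEXT)")
connection.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [
        (0, "Astronomers find new planets orbiting a nearby star"),
        (1, "Search for Earth-like worlds that might harbour life"),
        (2, "Stock markets rally after earnings reports"),
    ],
)

# Substring match: only rows containing the exact string "planets" qualify
rows = connection.execute(
    "SELECT id FROM articles WHERE text LIKE '%planets%' ORDER BY id"
).fetchall()
print(rows)
```

Row 1 is clearly related to the query topic, but it is not returned because it does not contain the search string.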

Scoring

Since the original txtai release, there has been a scoring package. The main use case for this package is building a weighted sentence embeddings vector when using word vector models. But this package can also be used standalone to build BM25, TF-IDF and/or SIF text indexes.

from txtai.scoring import ScoringFactory
# Build index
scoring = ScoringFactory.create({"method": "bm25", "terms": True, "content": True})
scoring.index((x, text, None) for x, text in enumerate(texts))
# Show total
scoring.count()
120000

for row in scoring.search("planets explore life earth", 3):
    print(row["id"], row["text"], row["score"])
16327 3 Planets Are Found Close in Size to Earth, Making Scientists Think 'Life' A trio of newly discovered worlds are much smaller than any other planets previously discovered outside of the solar system. 17.768332448130707
16158 Earth #39;s  #39;big brothers #39; floating around stars Washington - A new class of planets has been found orbiting stars besides our sun, in a possible giant leap forward in the search for Earth-like planets that might harbour life. 17.65941968170793
16620 New Planets could advance search for Life Astronomers in Europe and the United States have found two new planets about 20 times the size of Earth beyond the solar system. The discovery might be a giant leap forward in  17.65941968170793

The search above ran a BM25 search across the dataset. BM25 ranks documents by literal keyword matches, so the results are more keyword-driven than those of a vector search. With proper query construction, the results can be decent.
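
For readers unfamiliar with BM25, the following is a compact sketch of the scoring formula in plain Python. The whitespace tokenizer, the k1 and b values (common textbook defaults) and the class name are assumptions for illustration; txtai's scoring package may tokenize and weight terms differently.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer (an illustrative assumption)."""
    return text.lower().split()

class BM25:
    def __init__(self, documents, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.tokens = [tokenize(doc) for doc in documents]
        self.avgdl = sum(len(t) for t in self.tokens) / len(self.tokens)

        # Number of documents each term appears in
        freq = Counter(term for t in self.tokens for term in set(t))
        total = len(self.tokens)
        self.idf = {
            term: math.log(1 + (total - n + 0.5) / (n + 0.5))
            for term, n in freq.items()
        }

    def score(self, query, uid):
        counts = Counter(self.tokens[uid])
        length = len(self.tokens[uid])
        score = 0.0
        for term in tokenize(query):
            tf = counts.get(term, 0)
            if tf:
                # Term frequency saturates and is normalized by document length
                norm = tf + self.k1 * (1 - self.b + self.b * length / self.avgdl)
                score += self.idf[term] * tf * (self.k1 + 1) / norm
        return score

    def search(self, query, limit):
        scores = [(uid, self.score(query, uid)) for uid in range(len(self.tokens))]
        scores = [(uid, score) for uid, score in scores if score > 0]
        return sorted(scores, key=lambda pair: pair[1], reverse=True)[:limit]

documents = [
    "new planets found close to earth",
    "earth like planets might harbour life",
    "stock markets rally on earnings",
]
index = BM25(documents)
results = index.search("planets explore life earth", 3)
for uid, score in results:
    print(uid, round(score, 3))
```

Only documents sharing at least one query term receive a score, which is why the third document never appears in the results.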

Comparing the earlier vector search results with these results is a good lesson in the differences between keyword and vector search.

Database and Scoring

Earlier we showed how the ann and vectors components can be combined to build a vector search engine. Can we combine the database and scoring components to add keyword search to a database? Yes!

def search(query, limit=3):
    # Get similar clauses, if any
    similar = database.parse(query).get("similar")
    return database.search(query, [scoring.search(args[0], limit * 10) for args in similar] if similar else None, limit)
# Rebuild scoring - only need terms index
scoring = ScoringFactory.create({"method": "bm25", "terms": True})
scoring.index((x, text, None) for x, text in enumerate(texts))
for row in search("select id, text, score from txtai where similar('planets explore life earth') and label = 0"):
    print(row["id"], row["text"], row["score"])
15363 NASA to Announce New Class of Planets Astronomers have discovered four new planets in a week's time, an exciting end-of-summer flurry that signals a sharper era in the hunt for new worlds.    While none of these new bodies would be mistaken as Earth's twin, some appear to be noticeably smaller and more solid - more like Earth and Mars - than the gargantuan, gaseous giants identified before... 12.582923259697132
15900 Astronomers Spot Smallest Planets Yet American astronomers say they have discovered the two smallest planets yet orbiting nearby stars, trumping a small planet discovery by European scientists five days ago and capping the latest round in a frenzied hunt for other worlds like Earth.    All three of these smaller planets belong to a new class of "exoplanets" - those that orbit stars other than our sun, the scientists said in a briefing Tuesday... 12.563928231067155
15879 Astronomers see two new planets US astronomers find the smallest worlds detected circling other stars and say it is a breakthrough in the search for life in space. 12.078383982352994

And there it is, scoring-based similarity search with the same syntax as standard txtai vector queries, including additional filters!
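
The search function above requests limit * 10 candidates from the scoring index. A plausible reason, stated here as an assumption rather than documented behavior, is that the SQL filter (label = 0) is applied after keyword scoring, so extra candidates are needed to still fill the requested limit. The sketch below illustrates that effect with toy data and a hypothetical apply_filter helper.

```python
def apply_filter(candidates, labels, label, limit):
    """Keep candidates whose label matches, preserving score order, up to `limit`."""
    matches = [(uid, score) for uid, score in candidates if labels[uid] == label]
    return matches[:limit]

# Toy keyword search results as (id, score), sorted by descending score
ranked = [(0, 9.0), (1, 8.0), (2, 7.0), (3, 6.0), (4, 5.0), (5, 4.0)]

# Toy label lookup standing in for the database rows
labels = {0: 1, 1: 0, 2: 1, 3: 0, 4: 0, 5: 1}

limit = 2

# Fetching only `limit` candidates leaves too few rows after filtering
narrow = apply_filter(ranked[:limit], labels, 0, limit)

# Over-fetching leaves enough candidates to fill the limit
wide = apply_filter(ranked[: limit * 10], labels, 0, limit)

print(narrow)
print(wide)
```

With only two candidates, a single row survives the filter. With the larger candidate pool, both requested rows are returned.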

txtai is built on vector search, machine learning and finding results based on semantic meaning. The functional advantages of vector search over keyword search have been well discussed. The one advantage keyword search has is speed.

Wrapping up

This notebook walked through each of the packages used by an Embeddings index. The Embeddings index makes this all transparent and easy to use. But each of the components stands on its own and can be individually integrated into a project!


Original Link: https://dev.to/neuml/embeddings-index-components-3plg
