Distributed embeddings cluster
This article is part of a tutorial series on txtai, an AI-powered search engine.
The txtai API is a web-based service backed by FastAPI. All txtai functionality is available via the API. The API can also cluster multiple embeddings indices into a single logical index to horizontally scale over multiple nodes.
This notebook installs the txtai API and shows an example of building an embeddings cluster.
Install dependencies
Install txtai and all dependencies.
pip install txtai[api]
Start distributed embeddings cluster
First we'll start multiple API instances that will serve as embeddings index shards. Each shard stores a subset of the indexed data and these shards work in tandem to form a single logical index.
Then we'll start the main API instance that clusters the shards together into a logical instance.
The API instances are all started in the background.
import os
os.chdir("/content")
Shard configuration (index.yml):

writable: true

# Embeddings settings
embeddings:
    method: transformers
    path: sentence-transformers/bert-base-nli-mean-tokens

Cluster configuration (cluster.yml):

# Embeddings cluster
cluster:
    shards:
        - http://127.0.0.1:8001
        - http://127.0.0.1:8002
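For clusters with more than two shards, the cluster configuration can be generated rather than typed by hand. The sketch below is illustrative only: `cluster_config` is a hypothetical helper, not part of txtai, and it assumes every shard runs on the same host with one port per shard, as in this notebook.

```python
def cluster_config(ports, host="127.0.0.1"):
    """Return cluster YAML text with one shard URL per port (hypothetical helper)."""
    lines = ["# Embeddings cluster", "cluster:", "    shards:"]
    lines += [f"        - http://{host}:{port}" for port in ports]
    return "\n".join(lines) + "\n"


# Prints the same two-shard layout used in this notebook
print(cluster_config([8001, 8002]), end="")
```

Writing the returned text to cluster.yml would give the main instance the same shard list shown above.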
# Start embeddings shards
CONFIG=index.yml nohup uvicorn --port 8001 "txtai.api:app" &> shard-1.log &
CONFIG=index.yml nohup uvicorn --port 8002 "txtai.api:app" &> shard-2.log &

# Start main instance
CONFIG=cluster.yml nohup uvicorn --port 8000 "txtai.api:app" &> main.log &

# Wait for startup
sleep 90
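The fixed `sleep 90` above is a simple way to wait for the instances to load. As an alternative, a script can poll each instance until it answers. The following is a minimal sketch under stated assumptions: `wait_until_ready` is a hypothetical helper, and it assumes that any HTTP response from the `/count` endpoint (used later in this article) means the instance has finished starting.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url, timeout=90.0, interval=1.0, opener=urllib.request.urlopen):
    """Poll url until the server answers or timeout seconds pass.

    Returns True if the server answered, False if the timeout was reached.
    The opener parameter exists so the network call can be swapped out in tests.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with opener(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server responded, even if with an error status, so it is up
            return True
        except (urllib.error.URLError, OSError):
            # Connection refused or timed out: the server is not listening yet
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For this notebook, the helper could be called once per instance, for example `wait_until_ready("http://127.0.0.1:8001/count")`, then the same for ports 8002 and 8000.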
Python
Let's first try the cluster out directly in Python. The code below aggregates the two shards into a single cluster and executes actions against the cluster.
from txtai.api import Cluster

cluster = Cluster({"shards": ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]})

data = [
    "US tops 5 million confirmed virus cases",
    "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
    "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
    "The National Park Service warns against sacrificing slower friends in a bear attack",
    "Maine man wins $1M from $25 lottery ticket",
    "Make huge profits without work, earn up to $100,000 a day",
]

# Index data
cluster.add([{"id": x, "text": row} for x, row in enumerate(data)])
cluster.index()

# Test search
uid = cluster.search("feel good story", 1)[0]["id"]
print("Query: feel good story\nResult:", data[uid])
Query: feel good story
Result: Maine man wins $1M from $25 lottery ticket
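The single query above can be extended to a list of queries. The sketch below is one way to do that; `best_matches` is a hypothetical helper that accepts any callable shaped like `search(query, limit)` returning a list of dicts with an "id" key, which is how `cluster.search` is used in the code above.

```python
def best_matches(search, data, queries):
    """Return (query, best matching text) pairs.

    search is any callable with the shape search(query, limit) that returns a
    list of dicts with an "id" key, such as cluster.search in this notebook.
    """
    matches = []
    for query in queries:
        results = search(query, 1)
        # Ids are positions in data because documents were indexed with enumerate()
        text = data[int(results[0]["id"])] if results else None
        matches.append((query, text))
    return matches
```

With the cluster from this section, `best_matches(cluster.search, data, ["feel good story", "climate change"])` would return one pair per query, which can then be printed with `print(f"{query:<20} {text}")`.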
JavaScript
Next, let's run the same logic through the API using JavaScript.
npm install txtai
For this example, we'll clone the txtai.js project to import the example build configuration.
git clone https://github.com/neuml/txtai.js
Run cluster.js
The following script is a JavaScript version of the logic above.
import {Embeddings} from "txtai";
import {sprintf} from "sprintf-js";

const run = async () => {
    try {
        let embeddings = new Embeddings(process.argv[2]);

        let data = ["US tops 5 million confirmed virus cases",
                    "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
                    "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
                    "The National Park Service warns against sacrificing slower friends in a bear attack",
                    "Maine man wins $1M from $25 lottery ticket",
                    "Make huge profits without work, earn up to $100,000 a day"];

        console.log();
        console.log("Querying an Embeddings cluster");
        console.log(sprintf("%-20s %s", "Query", "Best Match"));
        console.log("-".repeat(50));

        for (let query of ["feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk"]) {
            let results = await embeddings.search(query, 1);
            let uid = results[0].id;
            console.log(sprintf("%-20s %s", query, data[uid]));
        }
    }
    catch (e) {
        console.trace(e);
    }
};

run();
Build and run cluster.js
cd txtai.js/examples/node
npm install
npm run build
Next, let's run the code against the main cluster URL.
node dist/cluster.js http://127.0.0.1:8000
Querying an Embeddings cluster
Query                Best Match
--------------------------------------------------
feel good story      Maine man wins $1M from $25 lottery ticket
climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
health               US tops 5 million confirmed virus cases
war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife             The National Park Service warns against sacrificing slower friends in a bear attack
asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate
north america        US tops 5 million confirmed virus cases
dishonest junk       Make huge profits without work, earn up to $100,000 a day
The JavaScript program returns the same result as the Python code above for the shared "feel good story" query. Each query runs against both shards in the cluster, and the main instance aggregates the results.
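The same clustered query can also be issued over plain HTTP without a client library. The sketch below uses only the Python standard library. It assumes the API exposes a `/search` endpoint that takes `query` and `limit` parameters and returns JSON; `search_url` and `search` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request


def search_url(base, query, limit=1):
    """Build a search URL for an API instance (endpoint shape is an assumption)."""
    params = urllib.parse.urlencode({"query": query, "limit": limit})
    return f"{base.rstrip('/')}/search?{params}"


def search(base, query, limit=1):
    """Run a search over HTTP and return the decoded JSON response."""
    with urllib.request.urlopen(search_url(base, query, limit)) as response:
        return json.load(response)
```

Calling `search("http://127.0.0.1:8000", "feel good story")` would query the main instance, and passing a shard URL instead would query that shard alone.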
Queries can be run against each individual shard to see what the queries independently return.
node dist/cluster.js http://127.0.0.1:8001
Querying an Embeddings cluster
Query                Best Match
--------------------------------------------------
feel good story      Maine man wins $1M from $25 lottery ticket
climate change       Beijing mobilises invasion craft along coast as Taiwan tensions escalate
health               US tops 5 million confirmed virus cases
war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife             Beijing mobilises invasion craft along coast as Taiwan tensions escalate
asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate
north america        US tops 5 million confirmed virus cases
dishonest junk       Beijing mobilises invasion craft along coast as Taiwan tensions escalate
node dist/cluster.js http://127.0.0.1:8002
Querying an Embeddings cluster
Query                Best Match
--------------------------------------------------
feel good story      Make huge profits without work, earn up to $100,000 a day
climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
health               Make huge profits without work, earn up to $100,000 a day
war                  Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
wildlife             The National Park Service warns against sacrificing slower friends in a bear attack
asia                 Make huge profits without work, earn up to $100,000 a day
north america        Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
dishonest junk       Make huge profits without work, earn up to $100,000 a day
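Note the differences. Each shard can only return the best match from its own subset of the data, so several queries that match well on the full cluster return an unrelated result on a single shard. The section below runs a count against the full cluster and each shard to show the number of records in each.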
curl http://127.0.0.1:8000/count
printf "\n"
curl http://127.0.0.1:8001/count
printf "\n"
curl http://127.0.0.1:8002/count
6
3
3
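The counts show six records in the full cluster and three in each shard. The per-shard query results earlier are consistent with even ids (0, 2, 4) landing on the first shard and odd ids (1, 3, 5) on the second. The sketch below illustrates that scatter-gather pattern in plain Python. It is an inference from the outputs in this article, not txtai's actual implementation: `shard_for`, `partition` and `merge` are hypothetical names, the modulo routing rule is an assumption, and the scores passed to `merge` are made up for illustration.

```python
import heapq


def shard_for(uid, shards):
    """Pick a shard index for an integer id (assumed rule: id modulo shard count)."""
    return uid % shards


def partition(documents, shards):
    """Split (id, text) pairs into one bucket per shard."""
    buckets = [[] for _ in range(shards)]
    for uid, text in documents:
        buckets[shard_for(uid, shards)].append((uid, text))
    return buckets


def merge(shard_results, limit):
    """Combine per-shard (id, score) lists and keep the highest scoring results."""
    combined = [result for results in shard_results for result in results]
    return heapq.nlargest(limit, combined, key=lambda result: result[1])


data = [
    "US tops 5 million confirmed virus cases",
    "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
    "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
    "The National Park Service warns against sacrificing slower friends in a bear attack",
    "Maine man wins $1M from $25 lottery ticket",
    "Make huge profits without work, earn up to $100,000 a day",
]

buckets = partition(list(enumerate(data)), 2)
print([len(bucket) for bucket in buckets])                   # [3, 3]
print([[uid for uid, _ in bucket] for bucket in buckets])    # [[0, 2, 4], [1, 3, 5]]

# Hypothetical per-shard results as (id, score); the scores are invented
print(merge([[(4, 0.30), (2, 0.10)], [(5, 0.25), (1, 0.05)]], 1))  # [(4, 0.3)]
```

The merge step is why the main instance can return a better match than either shard alone: it compares the top candidates from every shard and keeps the highest scoring ones.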
This notebook showed how a distributed embeddings cluster can be created with txtai. This example can be further scaled out on Kubernetes with StatefulSets, which will be covered in a future tutorial.
Original Link: https://dev.to/neuml/distributed-embeddings-cluster-24gg