
October 14, 2021 02:04 pm GMT

Export and run other machine learning models

This article is part of a tutorial series on txtai, an AI-powered semantic search platform.

txtai primarily has support for Hugging Face Transformers and ONNX models. This enables txtai to hook into the rich model framework available in Python, export this functionality via the API to other languages (JavaScript, Java, Go, Rust) and even export and natively load models with ONNX.

What about other machine learning frameworks? Say we have an existing TF-IDF + Logistic Regression model that has been well tuned. Can this model be exported to ONNX and used in txtai for labeling and similarity queries? Or what about a simple PyTorch text classifier? Yes, both of these can be done!

With the onnxmltools library, traditional models from scikit-learn, XGBoost and others can be exported to ONNX and loaded with txtai. Additionally, Hugging Face's trainer module can train generic PyTorch modules. This notebook will walk through all these examples.

Install dependencies

Install txtai and all dependencies. Since this article uses ONNX exports, we need to install the pipeline extras package.

pip install txtai[pipeline,similarity] datasets

Train a TF-IDF + Logistic Regression model

For this example, we'll load the emotion dataset from Hugging Face datasets and build a TF-IDF + Logistic Regression model with scikit-learn.

The emotion dataset has the following labels:

sadness (0)
joy (1)
love (2)
anger (3)
fear (4)
surprise (5)
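In code, the label list above is naturally a pair of lookup tables. A minimal sketch, using the same `id2label` and `label2id` names that the export code later in this article assigns to the model config:

```python
# Label ids and names for the emotion dataset, as listed above
id2label = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}

# Inverse lookup: label name -> id
label2id = {name: uid for uid, name in id2label.items()}

print(label2id["joy"])
print(id2label[5])
```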

from datasets import load_dataset

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

ds = load_dataset("emotion")

# Train the model
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('lr', LogisticRegression(max_iter=250))
])

pipeline.fit(ds["train"]["text"], ds["train"]["label"])

# Determine accuracy on validation set
results = pipeline.predict(ds["validation"]["text"])
labels = ds["validation"]["label"]

results = [results[x] == label for x, label in enumerate(labels)]
print("Accuracy =", sum(results) / len(ds["validation"]))

Accuracy = 0.8595

86% accuracy - not too bad! While we all get caught up in deep learning and advanced methods, good ole TF-IDF + Logistic Regression is still a solid performer and runs much faster. If that level of accuracy works, no reason to overcomplicate things.
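The accuracy figure is simply the fraction of validation rows where the predicted label equals the expected one. A small sketch of that calculation as a reusable function; the `accuracy` helper name and the toy values are hypothetical and not part of the original notebook:

```python
def accuracy(predictions, labels):
    """Fraction of predictions that equal the expected label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")

    matches = sum(1 for p, y in zip(predictions, labels) if p == y)
    return matches / len(labels)

# Hypothetical toy values, not taken from the emotion dataset
predicted = [0, 1, 1, 3]
expected = [0, 1, 2, 3]

print("Accuracy =", accuracy(predicted, expected))
```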

Export and load with txtai

The next section exports this model to ONNX and shows how the model can be used for similarity queries.

from txtai.pipeline import Labels, MLOnnx, Similarity

def tokenize(inputs, **kwargs):
    if isinstance(inputs, str):
        inputs = [inputs]

    return {"input_ids": [[x] for x in inputs]}

def query(model, tokenizer, multilabel=False):
    # Load models into similarity pipeline
    similarity = Similarity((model, tokenizer), dynamic=False)

    # Add labels to model
    similarity.pipeline.model.config.id2label = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}
    similarity.pipeline.model.config.label2id = dict((v, k) for k, v in similarity.pipeline.model.config.id2label.items())

    inputs = ["that caught me off guard", "I didn t see that coming", "i feel bad", "What a wonderful goal!"]
    scores = similarity("joy", inputs, multilabel)
    for uid, score in scores[:5]:
        print(inputs[uid], score)

# Export to ONNX
onnx = MLOnnx()
model = onnx(pipeline)

# Create labels pipeline using scikit-learn ONNX model
sklabels = Labels((model, tokenize), dynamic=False)

# Add labels to model
sklabels.pipeline.model.config.id2label = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}
sklabels.pipeline.model.config.label2id = dict((v, k) for k, v in sklabels.pipeline.model.config.id2label.items())

# Run test query using model
query(model, tokenize, None)

What a wonderful goal! 0.909473717212677
I didn t see that coming 0.47113093733787537
that caught me off guard 0.42067453265190125
i feel bad 0.019547615200281143

txtai can use a standard text classification model for similarity queries, where the label(s) are a list of fixed queries. The output above shows the best results for the query "joy".
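Conceptually, a similarity query against a classifier comes down to reading off each input's score for the query label and sorting by it. The sketch below illustrates that idea in plain Python. It is not txtai's implementation, and the `rank` helper is a hypothetical name; the scores are the "joy" scores printed above, rounded to four places.

```python
def rank(scores):
    """Return (index, score) pairs sorted from best to worst match."""
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

inputs = ["that caught me off guard", "I didn t see that coming", "i feel bad", "What a wonderful goal!"]

# Score of the "joy" label for each input, rounded from the output above
joy_scores = [0.4207, 0.4711, 0.0195, 0.9095]

for uid, score in rank(joy_scores):
    print(inputs[uid], score)
```

Sorting the per-label scores this way reproduces the result order shown above.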

Train a PyTorch model

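The next section defines a simple PyTorch text classifier. The transformers library has a trainer package that supports training PyTorch models, provided the module follows a few standard naming conventions, such as a forward method that accepts input_ids and labels and returns an output with loss and logits.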

# Set predictable seeds
import os
import random
import torch

import numpy as np

from torch import nn
from torch.nn import CrossEntropyLoss

from transformers import AutoConfig, AutoTokenizer

from txtai.models import Registry
from txtai.pipeline import HFTrainer

from transformers.modeling_outputs import SequenceClassifierOutput

def seed(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True

class Simple(nn.Module):
    def __init__(self, vocab, dimensions, labels):
        super().__init__()

        self.config = AutoConfig.from_pretrained("bert-base-uncased")
        self.labels = labels

        self.embedding = nn.EmbeddingBag(vocab, dimensions)
        self.classifier = nn.Linear(dimensions, labels)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.classifier.weight.data.uniform_(-initrange, initrange)
        self.classifier.bias.data.zero_()

    def forward(self, input_ids=None, labels=None, **kwargs):
        embeddings = self.embedding(input_ids)
        logits = self.classifier(embeddings)

        loss = None
        if labels is not None:
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.labels), labels.view(-1))

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
        )

# Set seed for reproducibility
seed()

# Define model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = Simple(tokenizer.vocab_size, 128, len(ds["train"].unique("label")))

# Train model
train = HFTrainer()
model, tokenizer = train((model, tokenizer), ds["train"], per_device_train_batch_size=8, learning_rate=1e-3, num_train_epochs=15, logging_steps=10000)

# Register custom model to fully support pipelines
Registry.register(model)

# Create labels pipeline using PyTorch model
thlabels = Labels((model, tokenizer), dynamic=False)

# Determine accuracy on validation set
results = [row["label"] == thlabels(row["text"])[0][0] for row in ds["validation"]]
print("Accuracy = ", sum(results) / len(ds["validation"]))

Accuracy =  0.883

88% accuracy this time. Pretty good for such a simple network and something that could definitely be improved upon.
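For intuition, the forward pass of the Simple model amounts to three steps: average the token vectors for each input (nn.EmbeddingBag pools with a mean by default), apply a linear layer to get one logit per label, and score the logits with cross entropy. The dependency-free sketch below walks through those steps for a single input. The helper names and the tiny weights are made up for illustration; the real model learns a 128-dimensional table over the full tokenizer vocabulary.

```python
import math

def mean_pool(token_ids, embedding_table):
    """Average the embedding rows of the given token ids (one 'bag')."""
    rows = [embedding_table[t] for t in token_ids]
    return [sum(column) / len(rows) for column in zip(*rows)]

def linear(vector, weights, bias):
    """One logit per label: dot(weights[label], vector) + bias[label]."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]

def cross_entropy(logits, label):
    """Negative log of the softmax probability assigned to the true label."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[label]

# Hypothetical toy parameters: 4 tokens, 2 dimensions, 3 labels
embedding_table = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
bias = [0.0, 0.0, 0.0]

pooled = mean_pool([1, 3], embedding_table)
logits = linear(pooled, weights, bias)
print(pooled, logits, cross_entropy(logits, 0))
```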

Once again let's run similarity queries using this model.

query(model, tokenizer)

What a wonderful goal! 1.0
that caught me off guard 0.9998751878738403
I didn t see that coming 0.7328283190727234
i feel bad 5.2972134609891875e-19

The top and bottom results match the scikit-learn model, while the two middle results swap places. The scores also differ, which is expected given this is a completely different model.

Pooled embeddings

The PyTorch model above consists of an embeddings layer with a linear classifier on top of it. What if we take that embeddings layer and use it for similarity queries? Let's give it a try.

from txtai.embeddings import Embeddings

class SimpleEmbeddings(nn.Module):
    def __init__(self, embeddings):
        super().__init__()

        self.embeddings = embeddings

    def forward(self, input_ids=None, **kwargs):
        return (self.embeddings(input_ids),)

embeddings = Embeddings({"method": "pooling", "path": SimpleEmbeddings(model.embedding), "tokenizer": "bert-base-uncased"})
print(embeddings.similarity("mad", ["Glad you found it", "Happy to see you", "I'm angry"]))

[(2, 0.8323876857757568), (1, -0.11010512709617615), (0, -0.16152513027191162)]

Definitely looks like the embeddings have stored knowledge. Could these embeddings be good enough to build a semantic search index, especially for sentiment based data, given the training dataset? Possibly. It certainly would run faster than a standard transformer model (see below).
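The scores above fall between -1 and 1, which is what a cosine-style comparison of two vectors produces. As a rough sketch of that comparison, assuming cosine similarity over pooled vectors (the vectors below are invented for illustration and are not outputs of the trained model):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical 3-dimensional pooled vectors, invented for illustration
query = [0.9, -0.2, 0.1]
candidates = {
    "Glad you found it": [-0.3, 0.8, 0.2],
    "I'm angry": [0.8, -0.1, 0.2],
}

# Print candidates from most to least similar to the query vector
for text, vector in sorted(candidates.items(), key=lambda kv: cosine(query, kv[1]), reverse=True):
    print(text, round(cosine(query, vector), 3))
```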

Train a transformer model and compare accuracy/speed

Let's train a standard transformer sequence classifier and compare the accuracy/speed between the two.

train = HFTrainer()
model, tokenizer = train("microsoft/xtremedistil-l6-h384-uncased", ds["train"], logging_steps=2000)

tflabels = Labels((model, tokenizer), dynamic=False)

# Determine accuracy on validation set
results = [row["label"] == tflabels(row["text"])[0][0] for row in ds["validation"]]
print("Accuracy = ", sum(results) / len(ds["validation"]))

Accuracy =  0.93

As expected, the accuracy is better. The model above is a distilled model and even better accuracy can be obtained with a model like "roberta-base" with the tradeoff being increased training/inference time.

Speaking of speed, let's compare the speed of these models.

import time

# Test inputs
inputs = ds["test"]["text"]

print("Testing speed of %d items" % len(inputs))

start = time.time()
r1 = sklabels(inputs, multilabel=None)
print("TF-IDF + Logistic Regression time =", time.time() - start)

start = time.time()
r2 = thlabels(inputs)
print("PyTorch time =", time.time() - start)

start = time.time()
r3 = tflabels(inputs)
print("Transformers time =", time.time() - start, "\n")

# Compare model results
for x in range(5):
    print("index: %d" % x)
    print(r1[x][0])
    print(r2[x][0])
    print(r3[x][0], "\n")

Testing speed of 2000 items
TF-IDF + Logistic Regression time = 1.116208791732788
PyTorch time = 2.2385385036468506
Transformers time = 15.705108880996704

index: 0
(0, 0.7258279323577881)
(0, 1.0)
(0, 0.998250424861908)

index: 1
(0, 0.854256272315979)
(0, 1.0)
(0, 0.9981004595756531)

index: 2
(0, 0.6306578516960144)
(0, 0.9999700784683228)
(0, 0.9981676340103149)

index: 3
(1, 0.554378092288971)
(1, 0.9998960494995117)
(1, 0.9985388517379761)

index: 4
(0, 0.8961835503578186)
(0, 1.0)
(0, 0.9981957077980042)

Wrapping up

This notebook showed how frameworks outside of Transformers and ONNX can be used as models in txtai.

As seen in the section above, TF-IDF + Logistic Regression ran roughly 14 times faster than a distilled Transformers model in this test (about 1.1 seconds versus 15.7 seconds for 2,000 items), and the simple PyTorch network ran about 7 times faster (about 2.2 seconds). Depending on your accuracy requirements, it may make sense to use a simpler model to get better runtime performance.
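The speedup ratios can be recomputed directly from the timings printed in the speed test above. A quick sketch; timings vary between runs and machines, so the result is indicative only:

```python
# Timings in seconds, copied from the speed test output above
timings = {
    "TF-IDF + Logistic Regression": 1.116208791732788,
    "PyTorch": 2.2385385036468506,
    "Transformers": 15.705108880996704,
}

# Speedup of each model relative to the Transformers model
baseline = timings["Transformers"]
speedups = {name: baseline / seconds for name, seconds in timings.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
```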

Original Link: https://dev.to/neuml/export-and-run-other-machine-learning-models-3b86
