As part of our natural language processing (NLP) blog series, we will walk through an example of using a text embedding model to generate vector representations of textual content and demonstrate vector similarity search on the generated vectors. We will deploy a publicly available model on Elasticsearch and use it in an ingest pipeline to generate embeddings from textual documents. We will then show how to use those embeddings in vector similarity search to find semantically similar documents for a given query.
Vector similarity search, or semantic search as it is commonly called, goes beyond traditional keyword-based search and allows users to find semantically similar documents that may not share any keywords with the query, thus providing a wider range of results. Vector similarity search operates on dense vectors and uses k-nearest neighbor (kNN) search to find similar vectors. For this, content in textual form first needs to be converted to its numeric vector representation using a text embedding model.
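To make this concrete before we build anything, here is a minimal sketch of what such a kNN query looks like in Kibana Console. The index name and the dense_vector field here are placeholders for what we will create later in this walkthrough, and the 384-dimensional query vector is truncated for readability:
GET collection-with-embeddings/_knn_search
{
  "knn": {
    "field": "text_embedding.predicted_value",
    "query_vector": [0.051, -0.047, 0.040, …],
    "k": 10,
    "num_candidates": 100
  },
  "_source": ["id", "text"]
}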
We will use a public dataset from the MS MARCO Passage Ranking Task for demonstration. It consists of real questions from the Microsoft Bing search engine and human-generated answers to them. This dataset is a perfect resource for testing vector similarity search, firstly because question answering is one of the most common use cases for vector search, and secondly because the top entries on the MS MARCO leaderboard use vector search in some form.
In our example we will work with a sample of this dataset, use a model to produce text embeddings, and then run vector search on it. We will also do a quick verification of the quality of the produced results.
Deploying NLP models, generating text embeddings, and running vector search
1. Deploy a text embedding model
The first step is to install a text embedding model. For our model we use
msmarco-MiniLM-L12-cos-v5 from Hugging Face. This is a sentence-transformer model that takes a sentence or a paragraph and maps it to a 384-dimensional dense vector. This model is optimized for semantic search and was specifically trained on the MS MARCO Passage dataset, making it suitable for our task. Besides this model, Elasticsearch supports a number of other models for text embedding. The full list can be found here.
We install the model with the Eland Docker agent that we built in the NER example. Running the script below imports the model into our cluster and deploys it:
eland_import_hub_model \
--cloud-id <cloud-id> \
-u <username> -p <password> \
--hub-model-id sentence-transformers/msmarco-MiniLM-L12-cos-v5 \
--task-type text_embedding \
--start
This time, --task-type is set to text_embedding and the --start option is passed to the Eland script, so the model is deployed automatically without our having to start it in the Model Management UI. To speed up inference, you can increase the number of inference threads with the inference_threads parameter.
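As an illustration, and assuming the deployment has first been stopped, restarting it with two inference threads could look like this in Kibana Console (a sketch; the parameter name matches the Elasticsearch version used here):
POST _ml/trained_models/sentence-transformers__msmarco-minilm-l12-cos-v5/deployment/_start?inference_threads=2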
We can test the successful deployment of the model by using this example in Kibana Console:
POST /_ml/trained_models/sentence-transformers__msmarco-minilm-l12-cos-v5/deployment/_infer
{
  "docs": {
    "text_field": "how is the weather in jamaica"
  }
}
We should see the predicted dense vector as the result:
{
  "predicted_value" : [
    0.051237598061561584,
    -0.04680659621953964,
    0.03971194103360176,
    …
  ]
}
2. Load initial data
As mentioned in the introduction, we use the MS MARCO Passage Ranking dataset. The full dataset is quite big, consisting of over 8 million passages. For our example, we use a subset of it that was used in the testing stage of the 2019 TREC Deep Learning Track. The dataset msmarco-passagetest2019-top1000.tsv used for the re-ranking task contains 200 queries, and for each query a list of relevant text passages extracted by a simple IR system. From that dataset, we extracted all unique passages with their ids and put them into a separate tsv file, totaling 182,469 passages. We use this file as our dataset.
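As a rough sketch of that extraction step, assuming the standard tab-separated column layout of the top1000 file (query id, passage id, query text, passage text), the unique passages can be pulled out with a one-liner like the following; the output file name is our own choice:
awk -F'\t' '!seen[$2]++ { print $2 "\t" $4 }' msmarco-passagetest2019-top1000.tsv > msmarco-passages.tsv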
We use Kibana's file upload feature to upload this dataset. Kibana file upload allows us to provide custom names for the fields. Let's call them id with type long for the passage ids, and text with type text for the passage contents. The index name is collection. After the upload, we can see an index named collection with 182,469 documents.
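As a quick sanity check, we can confirm the upload in Kibana Console; the request below should report the 182,469 documents we just loaded:
GET collection/_count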