
How to Generate Sentence Embeddings
As previously mentioned, embeddings are vector representations of words or sentences. Embeddings can be generated from both words and sentences. How you choose to generate embeddings depends on your intended application of the LLM.
Word embeddings are numerical representations of individual words in a continuous vector space. They capture semantic relationships between words, allowing similar words to have vectors close to each other.
Word embeddings can be used in search engines as they support word-level queries by matching embeddings to retrieve relevant documents. They can also be used in text classification to classify documents, emails, or tweets based on word-level features (for example, detecting spam emails or sentiment analysis).
Sentence embeddings are numerical representations of entire sentences in a vector space, designed to capture the overall meaning and context of the sentence. They are used in settings where sentences provide better context like question answering systems where user queries are matched to relevant sentences or documents for more precise retrieval.
For our recipe chatbot, sentence embedding is the best choice.
First, create an empty dataframe that has three columns.
# create an empty dataframe
recipe_sentence_embeddings <- data.frame(
  recipe = character(),
  recipe_vec_embeddings = I(list()),
  recipe_id = character()
)
The recipe column will hold the actual recipe in text form, the recipe_vec_embeddings column will hold the generated sentence embeddings, and the recipe_id column holds a unique ID for each recipe, which will help with indexing and retrieval from the vector database.
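If you are wondering why the embeddings column is wrapped in I(list()), here is a small self-contained sketch with made-up numbers (no real embeddings involved) showing that each row of a list-column can store an entire numeric vector, not just a single value:

```r
# toy dataframe with a list-column: each row stores a whole vector
df <- data.frame(
  recipe = character(),
  recipe_vec_embeddings = I(list()),
  recipe_id = character()
)
df <- rbind(df, data.frame(
  recipe = "Pancakes: mix flour, eggs, and milk.",
  recipe_vec_embeddings = I(list(c(0.12, -0.05, 0.33))),  # toy 3-dim "embedding"
  recipe_id = "recipe1"
))
df$recipe_vec_embeddings[[1]]  # retrieves the full vector stored in row 1
```

This is the same pattern the chatbot dataframe uses, just with a three-number stand-in instead of a real embedding.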
Next, it's helpful to define a progress bar, which you can do like this:
# create a progress bar
pb <- txtProgressBar(min = 1, max = length(chunks), style = 3)
Embedding can take a while, so it's important to keep track of the progress of the process.
Now it's time to generate the embeddings and populate the dataframe.
Write a for loop that iterates once for every chunk:
for (i in 1:length(chunks)) {}
The recipe field is the text of the chunk currently being processed, and the unique chunk ID is generated by pasting the text "recipe" together with the chunk's index.
for (i in 1:length(chunks)) {
  recipe <- as.character(chunks[i])
  recipe_id <- paste0("recipe", i)
}
The textEmbed() function from the text package generates either sentence or word embeddings. It takes in a character variable or a dataframe and produces a tibble of embeddings. You can read the loading instructions at https://www.r-text.org/ for smooth running of the text package.
The batch_size argument defines how many rows of the input are embedded at a time. Setting keep_token_embeddings to FALSE discards the embeddings of individual tokens after processing, and aggregation_from_layers_to_tokens = "concatenate" combines the embeddings from the specified layers into a single, detailed embedding for each token. A token is the smallest unit of text that a model can process.
for (i in 1:length(chunks)) {
  recipe <- as.character(chunks[i])
  recipe_id <- paste0("recipe", i)
  recipe_embeddings <- textEmbed(recipe,
    layers = 10:11,
    aggregation_from_layers_to_tokens = "concatenate",
    aggregation_from_tokens_to_texts = "mean",
    keep_token_embeddings = FALSE,
    batch_size = 1
  )
}
To get sentence embeddings, set the aggregation_from_tokens_to_texts parameter to "mean":
aggregation_from_tokens_to_texts = "mean"
The "mean" operation averages the embeddings of all tokens in a sentence to generate a single vector that represents the entire sentence. This sentence-level embedding captures the overall meaning and semantics of the text, regardless of its token length.
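As a toy illustration of that averaging (made-up numbers, not real model output), mean aggregation is just a column-wise average over the token embeddings:

```r
# three toy token embeddings, each 4-dimensional (one row per token)
token_embeddings <- rbind(
  c(0.2, 0.4, -0.1, 0.0),  # token 1
  c(0.6, 0.0,  0.3, 0.2),  # token 2
  c(0.1, 0.2,  0.1, 0.4)   # token 3
)
# "mean" aggregation: averaging over tokens yields one sentence vector
sentence_embedding <- colMeans(token_embeddings)
sentence_embedding  # approximately c(0.3, 0.2, 0.1, 0.2)
```

Whether the sentence has three tokens or three hundred, the result is always a single vector of the model's embedding dimension.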
# convert tibble to vector
recipe_vec_embeddings <- unlist(recipe_embeddings, use.names = FALSE)
recipe_vec_embeddings <- list(recipe_vec_embeddings)
textEmbed() returns a tibble object. To obtain a plain vector embedding, you first unlist the tibble, dropping its names, and then wrap the result in a list so the whole vector can be stored in a single cell of the dataframe's list-column.
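Here is a minimal sketch of that unlist-then-list step, using a small one-row table as a stand-in for textEmbed() output (real output has many hundreds of dimensions):

```r
# stand-in for textEmbed() output: a one-row table of embedding dimensions
fake_embeddings <- data.frame(Dim1 = 0.12, Dim2 = -0.05, Dim3 = 0.33)

# flatten to a plain, unnamed numeric vector
vec <- unlist(fake_embeddings, use.names = FALSE)
vec  # c(0.12, -0.05, 0.33)

# wrap in a list so the whole vector fits in one dataframe cell
vec_as_cell <- list(vec)
```

Without use.names = FALSE, the vector would carry the column names along, which you don't need for storage or similarity search.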
# append the current chunk's data to the dataframe
recipe_sentence_embeddings <- recipe_sentence_embeddings %>%
  add_row(
    recipe = recipe,
    recipe_vec_embeddings = recipe_vec_embeddings,
    recipe_id = recipe_id
  )
This step appends the newly generated data to the dataframe at the end of each iteration.
# track embedding progress
setTxtProgressBar(pb, i)
To keep track of the embedding progress, use the progress bar defined earlier inside the loop. It updates at the end of every iteration.
Complete Code Block:
# load required libraries
library(text)   # see loading instructions at https://www.r-text.org/ for smooth running
library(dplyr)  # provides %>% and add_row()

# create an empty dataframe to hold the results
recipe_sentence_embeddings <- data.frame(
  recipe = character(),
  recipe_vec_embeddings = I(list()),
  recipe_id = character()
)

# create a progress bar
pb <- txtProgressBar(min = 1, max = length(chunks), style = 3)

# embed the data
for (i in 1:length(chunks)) {
  recipe <- as.character(chunks[i])
  recipe_id <- paste0("recipe", i)
  recipe_embeddings <- textEmbed(recipe,
    layers = 10:11,
    aggregation_from_layers_to_tokens = "concatenate",
    aggregation_from_tokens_to_texts = "mean",
    keep_token_embeddings = FALSE,
    batch_size = 1
  )
  # convert tibble to vector
  recipe_vec_embeddings <- unlist(recipe_embeddings, use.names = FALSE)
  recipe_vec_embeddings <- list(recipe_vec_embeddings)
  # append the current chunk's data to the dataframe
  recipe_sentence_embeddings <- recipe_sentence_embeddings %>%
    add_row(
      recipe = recipe,
      recipe_vec_embeddings = recipe_vec_embeddings,
      recipe_id = recipe_id
    )
  # track embedding progress
  setTxtProgressBar(pb, i)
}
close(pb)
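Once the dataframe is populated, the stored vectors are what retrieval compares against a user's query embedding, commonly with cosine similarity. Here is a small base-R sketch with toy vectors; the cosine_sim() helper is illustrative and not part of the text package:

```r
# cosine similarity between two numeric vectors
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# toy stored embeddings keyed by recipe id
stored <- list(
  recipe1 = c(0.1, 0.9, 0.2),
  recipe2 = c(0.8, 0.1, 0.5)
)
query <- c(0.2, 0.8, 0.1)  # toy query embedding

# score every stored recipe against the query
scores <- sapply(stored, cosine_sim, b = query)
names(which.max(scores))  # "recipe1" — the most similar recipe id
```

In practice a vector database performs this search at scale, but the ranking principle is the same: the recipe whose embedding points in nearly the same direction as the query wins.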