
How to Generate Sentence Embeddings
As previously mentioned, embeddings are vector representations of words or sentences. Embeddings can be generated from both words and sentences. How you choose to generate embeddings depends on your intended application of the LLM.
Word embeddings are numerical representations of individual words in a continuous vector space. They capture semantic relationships between words, allowing similar words to have vectors close to each other.
Word embeddings can be used in search engines as they support word-level queries by matching embeddings to retrieve relevant documents. They can also be used in text classification to classify documents, emails, or tweets based on word-level features (for example, detecting spam emails or sentiment analysis).
Sentence embeddings are numerical representations of entire sentences in a vector space, designed to capture the overall meaning and context of the sentence. They are used in settings where sentences provide better context like question answering systems where user queries are matched to relevant sentences or documents for more precise retrieval.
For our recipe chatbot, sentence embedding is the best choice.
First, create an empty dataframe that has three columns.
# create an empty dataframe
recipe_sentence_embeddings <- data.frame(
  recipe = character(),
  recipe_vec_embeddings = I(list()),
  recipe_id = character()
)
The recipe column will hold the actual recipe in text form, the recipe_vec_embeddings column will hold the generated sentence embeddings, and the recipe_id column holds a unique ID for each recipe, which will help with indexing and retrieval from the vector database.
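If you are wondering why the embeddings column is wrapped in I(list()), here is a small self-contained sketch with made-up numbers (no real embeddings involved) showing that each row of a list-column can store an entire numeric vector, not just a single value:

```r
# toy dataframe with a list-column: each row stores a whole vector
df <- data.frame(
  recipe = character(),
  recipe_vec_embeddings = I(list()),
  recipe_id = character()
)
df <- rbind(df, data.frame(
  recipe = "Pancakes: mix flour, eggs, and milk.",
  recipe_vec_embeddings = I(list(c(0.12, -0.05, 0.33))),  # toy 3-dim "embedding"
  recipe_id = "recipe1"
))
df$recipe_vec_embeddings[[1]]  # retrieves the full vector stored in row 1
```

This is the same pattern the chatbot dataframe uses, just with a three-number stand-in instead of a real embedding.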
Next, it's helpful to define a progress bar, which you can do like this:
# create a progress bar
pb <- txtProgressBar(min = 1, max = length(chunks), style = 3)
Embedding can take a while, so it's important to keep track of the progress of the process.
Now it's time to generate the embeddings and populate the dataframe.
Write a for loop that iterates once for every chunk:
for (i in 1:length(chunks)) {}
The recipe field is the text of the chunk currently being processed, and the unique chunk ID is generated by pasting the text "recipe" together with the chunk's index.
for (i in 1:length(chunks)) {
  recipe <- as.character(chunks[i])
  recipe_id <- paste0("recipe", i)
}
The textEmbed() function from the text package generates either sentence or word embeddings. It takes in a character variable or a dataframe and produces a tibble of embeddings. You can read the loading instructions at https://www.r-text.org/ for smooth running of the text package.
The batch_size argument defines how many rows of the input are embedded at a time. Setting keep_token_embeddings to FALSE discards the embeddings of individual tokens after processing, and aggregation_from_layers_to_tokens = "concatenate" combines the embeddings from the specified layers into a single, detailed embedding for each token. A token is the smallest unit of text that a model can process.
for (i in 1:length(chunks)) {
  recipe <- as.character(chunks[i])
  recipe_id <- paste0("recipe", i)
  recipe_embeddings <- textEmbed(recipe,
    layers = 10:11,
    aggregation_from_layers_to_tokens = "concatenate",
    aggregation_from_tokens_to_texts = "mean",
    keep_token_embeddings = FALSE,
    batch_size = 1
  )
}
To get sentence embeddings, set the aggregation_from_tokens_to_texts parameter to "mean":
aggregation_from_tokens_to_texts = "mean"
The "mean" operation averages the embeddings of all tokens in a sentence to generate a single vector that represents the entire sentence. This sentence-level embedding captures the overall meaning and semantics of the text, regardless of its token length.
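As a toy illustration of that averaging (made-up numbers, not real model output), mean aggregation is just a column-wise average over the token embeddings:

```r
# three toy token embeddings, each 4-dimensional (one row per token)
token_embeddings <- rbind(
  c(0.2, 0.4, -0.1, 0.0),  # token 1
  c(0.6, 0.0,  0.3, 0.2),  # token 2
  c(0.1, 0.2,  0.1, 0.4)   # token 3
)
# "mean" aggregation: averaging over tokens yields one sentence vector
sentence_embedding <- colMeans(token_embeddings)
sentence_embedding  # approximately c(0.3, 0.2, 0.1, 0.2)
```

Whether the sentence has three tokens or three hundred, the result is always a single vector of the model's embedding dimension.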
# convert tibble to vector
recipe_vec_embeddings <- unlist(recipe_embeddings, use.names = FALSE)
recipe_vec_embeddings <- list(recipe_vec_embeddings)
textEmbed() returns a tibble object. To obtain a plain vector embedding, you first unlist the tibble, dropping its names, and then wrap the result in a list so the whole vector can be stored in a single cell of the dataframe's list-column.
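Here is a minimal sketch of that unlist-then-list step, using a small one-row table as a stand-in for textEmbed() output (real output has many hundreds of dimensions):

```r
# stand-in for textEmbed() output: a one-row table of embedding dimensions
fake_embeddings <- data.frame(Dim1 = 0.12, Dim2 = -0.05, Dim3 = 0.33)

# flatten to a plain, unnamed numeric vector
vec <- unlist(fake_embeddings, use.names = FALSE)
vec  # c(0.12, -0.05, 0.33)

# wrap in a list so the whole vector fits in one dataframe cell
vec_as_cell <- list(vec)
```

Without use.names = FALSE, the vector would carry the column names along, which you don't need for storage or similarity search.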
# append the current chunk's data to the dataframe
recipe_sentence_embeddings <- recipe_sentence_embeddings %>%
  add_row(
    recipe = recipe,
    recipe_vec_embeddings = recipe_vec_embeddings,
    recipe_id = recipe_id
  )
This step appends the newly generated data to the dataframe at the end of each iteration.
# track embedding progress
setTxtProgressBar(pb, i)
To keep track of the embedding progress, use the progress bar defined earlier inside the loop. It updates at the end of every iteration.
Complete Code Block:
# load required libraries
library(text)   # see loading instructions at https://www.r-text.org/ for smooth running
library(dplyr)  # provides %>% and add_row()

# create an empty dataframe to hold the results
recipe_sentence_embeddings <- data.frame(
  recipe = character(),
  recipe_vec_embeddings = I(list()),
  recipe_id = character()
)

# create a progress bar
pb <- txtProgressBar(min = 1, max = length(chunks), style = 3)

# embed the data
for (i in 1:length(chunks)) {
  recipe <- as.character(chunks[i])
  recipe_id <- paste0("recipe", i)
  recipe_embeddings <- textEmbed(recipe,
    layers = 10:11,
    aggregation_from_layers_to_tokens = "concatenate",
    aggregation_from_tokens_to_texts = "mean",
    keep_token_embeddings = FALSE,
    batch_size = 1
  )
  # convert tibble to vector
  recipe_vec_embeddings <- unlist(recipe_embeddings, use.names = FALSE)
  recipe_vec_embeddings <- list(recipe_vec_embeddings)
  # append the current chunk's data to the dataframe
  recipe_sentence_embeddings <- recipe_sentence_embeddings %>%
    add_row(
      recipe = recipe,
      recipe_vec_embeddings = recipe_vec_embeddings,
      recipe_id = recipe_id
    )
  # track embedding progress
  setTxtProgressBar(pb, i)
}
close(pb)
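Once the dataframe is populated, the stored vectors are what retrieval compares against a user's query embedding, commonly with cosine similarity. Here is a small base-R sketch with toy vectors; the cosine_sim() helper is illustrative and not part of the text package:

```r
# cosine similarity between two numeric vectors
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# toy stored embeddings keyed by recipe id
stored <- list(
  recipe1 = c(0.1, 0.9, 0.2),
  recipe2 = c(0.8, 0.1, 0.5)
)
query <- c(0.2, 0.8, 0.1)  # toy query embedding

# score every stored recipe against the query
scores <- sapply(stored, cosine_sim, b = query)
names(which.max(scores))  # "recipe1" — the most similar recipe id
```

In practice a vector database performs this search at scale, but the ranking principle is the same: the recipe whose embedding points in nearly the same direction as the query wins.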