In this blog post, I'll show you how to implement a traditional RAG (Retrieval Augmented Generation) system using Spring AI. To set up a traditional RAG system, we need to implement two workflows: one for ingestion and one for retrieval and generation. Spring AI brings all the building blocks, and our task is to wire them together in the typical Spring fashion.
Prerequisites
For a traditional RAG system, we need a vector database to store the embeddings of the text chunks, an embedding model to generate those embeddings, and an LLM to generate the responses.
For the demo application, I use Docker Compose to start a Postgres database with the pgvector extension.
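A minimal compose file for this setup could look like the following sketch (the image and tag are assumptions; I use the official pgvector image, and the credentials match the application properties shown later in this post):

services:
  postgres:
    image: pgvector/pgvector:pg17
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: postgres
    ports:
      - "5432:5432"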
In this blog post, I also wanted to demonstrate that Spring AI integrates with Ollama, a tool that enables you to run models locally. I downloaded bge-m3, an embedding model, and Google's Gemma 3n, an LLM.
ollama pull bge-m3:567m
ollama pull gemma3n:e4b
One advantage of running a local model is privacy. If you work with sensitive data, you don't want to send it to a third-party service; a local model allows you to keep the data in your environment. But don't expect wonders from a local model. It is usually not as powerful as the big models in the cloud, but it might work for your use case. It's worth a try.
Note that Spring AI also supports models running in the Docker Model Runner, an alternative to Ollama for running models locally. You can find more information in the documentation.
I ran this demo application on a Windows 11 machine with an Nvidia RTX 3060 GPU. Performance is not great, but it works. If you have a more powerful GPU, you might get faster responses.
Spring Setup
After starting the Postgres database and pulling the models with Ollama, we take a look at the Spring Boot application.
First, add the necessary dependencies to the pom.xml file:
<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-bom</artifactId>
            <version>1.0.0</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-starter-model-ollama</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-starter-vector-store-pgvector</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-advisors-vector-store</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-rag</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-pdf-document-reader</artifactId>
    </dependency>
</dependencies>
This demo application depends on quite a few libraries: the Ollama starter for the models, the pgvector starter for the vector store, the RAG and advisor modules for the retrieval workflow, and the PDF document reader to extract text from the PDF documents used in this demo.
The application property file contains the configuration for the Ollama models and the vector store.
spring.ai.ollama.embedding.options.model=bge-m3:567m
spring.ai.ollama.chat.options.model=gemma3n:e4b
spring.datasource.url=jdbc:postgresql://localhost:5432/postgres
spring.datasource.username=postgres
spring.datasource.password=postgres
spring.ai.vectorstore.pgvector.index-type=HNSW
spring.ai.vectorstore.pgvector.distance-type=COSINE_DISTANCE
spring.ai.vectorstore.pgvector.dimensions=1024
spring.ai.vectorstore.pgvector.batching-strategy=TOKEN_COUNT
spring.ai.vectorstore.pgvector.max-document-batch-size=10000
spring.ai.vectorstore.pgvector.initialize-schema=true
It is important that the dimensions configuration matches the length of the embedding vectors produced by the embedding model. The BGE-M3 model produces 1024-dimensional vectors, so the value is set to 1024. This value differs for each embedding model.
By default, Spring AI creates a table called vector_store when using pgvector. You can override this by setting the spring.ai.vectorstore.pgvector.table-name property to a different name. Because initialize-schema is enabled, Spring AI creates the table automatically if it does not exist yet.
The index type and distance type are set to HNSW and cosine distance. pgvector supports several index types and distance types; you can find more information on the pgvector project page.
Next, we take a look at how the two workflows are implemented.
Ingestion
The ingestion workflow is responsible for reading the documents, extracting the text, splitting the text into chunks, generating the embeddings, and storing them in the vector database.
This workflow can be a one-time process if you work with static documents that don't change, or a continuous process if documents are added or changed frequently.
Spring AI provides an Extract, Transform, and Load (ETL) framework to implement this workflow in a simple way.
First, the application instantiates a text splitter that splits the documents into chunks. The TokenTextSplitter implementation is a token-based text splitter. For this application, I use the default settings, which means this splitter creates chunks with a maximum size of 800 tokens.
private final TokenTextSplitter splitter = new TokenTextSplitter();
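If the default chunk size does not fit your documents, the splitter can be tuned. Here is a minimal sketch using the five-argument constructor (the values are illustrative, not recommendations):

// chunk size in tokens, minimum chunk size in characters,
// minimum chunk length to embed, maximum number of chunks, keep separators
private final TokenTextSplitter splitter = new TokenTextSplitter(512, 350, 5, 10000, true);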
In other libraries, you find splitters that use an overlap approach, where chunks overlap by a certain number of characters or tokens. This implementation does not do that. Spring AI 1.0.0 only provides the TokenTextSplitter implementation, but you can write your own splitter by extending the abstract TextSplitter class, as shown in the sketch below.
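As an illustration, here is a minimal sketch of a custom splitter with overlap (a hypothetical, character-based implementation; the abstract class only requires the splitText method):

import java.util.ArrayList;
import java.util.List;

import org.springframework.ai.transformer.splitter.TextSplitter;

public class OverlappingTextSplitter extends TextSplitter {

    private final int chunkSize = 1000;
    private final int overlap = 200;

    @Override
    protected List<String> splitText(String text) {
        List<String> chunks = new ArrayList<>();
        // advance by chunkSize minus overlap so consecutive chunks share characters
        int step = chunkSize - overlap;
        for (int start = 0; start < text.length(); start += step) {
            chunks.add(text.substring(start, Math.min(start + chunkSize, text.length())));
        }
        return chunks;
    }
}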
Splitting text is one of the challenges in a RAG system. There is no one-size-fits-all solution; it depends on the type of documents you work with. The TokenTextSplitter is a good starting point, but you might need to implement your own splitter for your specific data set.
The next component the application needs is a vector store. Spring AI automatically configures a VectorStore bean based on the properties defined in the application.properties file.
@Autowired
VectorStore vectorStore;
The implementation of the ingestion workflow is now straightforward. The application downloads a set of PDF documents from the IPCC website with the Spring RestTemplate.
private void ingest() {
    // 'urls' holds the list of PDF document URLs (defined elsewhere in the class)
    for (String url : urls) {
        // download the PDF into memory
        byte[] pdfBytes = this.restTemplate.getForObject(url, byte[].class);
        // wrap the bytes in a Resource; the URL serves as the filename metadata
        List<Document> documents = loadPdfs(new ByteArrayResource(pdfBytes) {
            @Override
            public String getFilename() {
                return url;
            }
        });
        // generate embeddings and store the chunks in the vector database
        this.vectorStore.add(documents);
    }
}
The code then extracts the text from each PDF with the PagePdfDocumentReader, which is part of the spring-ai-pdf-document-reader library. As the configuration below shows, you can ignore certain parts of each page; documents often have headers and footers that you don't want to include in the text. After extracting the text, the code calls the splitter to split the text into chunks.
List<Document> loadPdfs(Resource resourcePdf) {
    // the page margins and deleted text lines are set to 0 here, but these
    // settings can be used to strip headers and footers from each page
    PagePdfDocumentReader pdfReader = new PagePdfDocumentReader(resourcePdf,
            PdfDocumentReaderConfig.builder().withPageTopMargin(0).withPageBottomMargin(0)
                    .withPageExtractedTextFormatter(
                            ExtractedTextFormatter.builder().withNumberOfTopTextLinesToDelete(0)
                                    .withNumberOfBottomTextLinesToDelete(0).build())
                    .withPagesPerDocument(1).build());
    // read one Document per page, then split each into token-based chunks
    return this.splitter.apply(pdfReader.read());
}
All these text chunks are then passed to the VectorStore bean, which generates the embeddings and stores them in the vector database.
this.vectorStore.add(documents);
Retrieval and Generation
This workflow responds to user queries. First, it generates the embeddings for the user query; then it retrieves the relevant text chunks from the vector store with a similarity search; next, it augments the user prompt with the retrieved text chunks; and finally, it sends the augmented prompt to the language model to generate a response.
With Spring AI, the whole process can be implemented in a simple way. First, the application needs to create a ChatClient instance, which is done with the help of the ChatClient.Builder bean that Spring AI automatically configures based on the properties defined in the application.properties file.
private final ChatClient chatClient;
private final ChatClient.Builder chatClientBuilder;

public DemoApplication(ChatClient.Builder chatClientBuilder) {
    this.chatClientBuilder = chatClientBuilder;
    this.chatClient = chatClientBuilder.build();
}
The application then only has to configure an advisor with the vector store to use. The Spring AI Advisors API provides a way to intercept, modify, and enhance data sent to and from Large Language Models (LLMs).
String response = this.chatClient.prompt()
        .advisors(QuestionAnswerAdvisor.builder(this.vectorStore).build())
        .user("What are the main causes of climate change?").call().content();
The QuestionAnswerAdvisor is a predefined advisor that generates the embeddings for the user query, retrieves the relevant text chunks from the vector store, and augments the user prompt with the retrieved text chunks. The advisor uses a predefined prompt template, but you can easily override it with your own by calling promptTemplate(...) on the QuestionAnswerAdvisor.Builder instance.
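For example, here is a sketch of a custom template. I assume, based on the default template in version 1.0.0, that the {query} and {question_answer_context} placeholders are expected:

var template = new PromptTemplate("""
        {query}

        Answer using only the following context. If the context does not
        contain the answer, say that you don't know.

        {question_answer_context}
        """);

var customPromptAdvisor = QuestionAnswerAdvisor.builder(this.vectorStore)
        .promptTemplate(template)
        .build();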
The advisor can also be configured with a similarity threshold (0.0 = any similarity is accepted, 1.0 = an exact match is required), while topK sets the maximum number of text chunks to retrieve from the vector store. The SearchRequest also supports filtering documents. This is important for applications where each user has access to different documents.
var qaAdvisor = QuestionAnswerAdvisor.builder(this.vectorStore)
        .searchRequest(SearchRequest.builder().similarityThreshold(0.6).topK(20).build())
        .build();

String response = this.chatClient.prompt().advisors(qaAdvisor)
        .user("What are the main causes of climate change?").call().content();
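Here is a sketch of such a filter. It assumes the chunks were stored with a hypothetical userId metadata field during ingestion:

var filter = new FilterExpressionBuilder();

var filteredAdvisor = QuestionAnswerAdvisor.builder(this.vectorStore)
        .searchRequest(SearchRequest.builder()
                .similarityThreshold(0.6)
                .topK(20)
                // only consider chunks whose metadata matches the current user
                .filterExpression(filter.eq("userId", "user-42").build())
                .build())
        .build();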
Advanced Concepts
The QuestionAnswerAdvisor is a good starting point for a RAG system, but you can also implement more advanced retrieval and augmentation strategies. Spring AI provides the RetrievalAugmentationAdvisor class that allows you to configure different query transformers, query expanders, and document retrievers.
A common pattern you might see in other RAG systems is a component that rewrites the user query. The idea is to rewrite the user query to make it more specific and remove any irrelevant information. This step runs before the query embeddings are generated. Note that this adds a round trip to the LLM, so it increases the latency of the system.
Advisor retrievalAugmentationAdvisor = RetrievalAugmentationAdvisor.builder()
        // rewrite the user query with an extra LLM call before it is embedded
        .queryTransformers(RewriteQueryTransformer.builder()
                .chatClientBuilder(this.chatClientBuilder.build().mutate()).build())
        .documentRetriever(VectorStoreDocumentRetriever.builder().similarityThreshold(0.5)
                .topK(20).vectorStore(this.vectorStore).build())
        .build();

String response = this.chatClient.prompt().advisors(retrievalAugmentationAdvisor)
        .user("What are the main causes of climate change?").call().content();
The RewriteQueryTransformer uses a predefined prompt that can easily be overridden with promptTemplate(...) if you have a better prompt for your use case.
Another approach is to expand the user query with additional queries. The idea is to generate multiple queries based on the user query and then retrieve the relevant text chunks for each of these queries. This can help to improve the recall of the system. The MultiQueryExpander can be configured with the number of queries to generate and whether the original query should be included in the expanded queries. In this example, the expander generates three additional queries and includes the original user query, so the RetrievalAugmentationAdvisor creates embeddings for four queries and retrieves the text chunks for each of them.
Advisor retrievalAugmentationAdvisor = RetrievalAugmentationAdvisor.builder()
        .queryExpander(MultiQueryExpander.builder()
                .chatClientBuilder(this.chatClientBuilder.build().mutate())
                .includeOriginal(true).numberOfQueries(3).build())
        .documentRetriever(VectorStoreDocumentRetriever.builder().similarityThreshold(0.5)
                .topK(20).vectorStore(this.vectorStore).build())
        .build();

String response = this.chatClient.prompt().advisors(retrievalAugmentationAdvisor)
        .user("What are the main causes of climate change?").call().content();
The prompt that the MultiQueryExpander uses can also be overridden with promptTemplate(...). Like the RewriteQueryTransformer, this adds a round trip to the LLM.
The RetrievalAugmentationAdvisor also supports document post-processors (documentPostProcessors(...)). This code is called after the advisor has retrieved the text chunks from the vector store. A component you might see in other RAG systems is a re-ranker that ranks the retrieved text chunks based on their relevance to the user query. The idea is to remove irrelevant or redundant text chunks before they are sent to the language model. Version 1.0.0 of Spring AI does not ship an implementation of a document post-processor, so you have to write your own by implementing the DocumentPostProcessor interface.
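As an illustration, here is a minimal sketch (a hypothetical class, not a real re-ranker) that removes duplicate chunks and keeps only the five highest-scoring ones. A production re-ranker would typically call a dedicated re-ranking model instead of reusing the vector store scores:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.springframework.ai.document.Document;
import org.springframework.ai.rag.Query;
import org.springframework.ai.rag.postretrieval.document.DocumentPostProcessor;

public class TopFiveDeduplicatingPostProcessor implements DocumentPostProcessor {

    @Override
    public List<Document> process(Query query, List<Document> documents) {
        Set<String> seen = new HashSet<>();
        List<Document> unique = new ArrayList<>();
        for (Document document : documents) {
            // drop chunks with identical text (a simple redundancy filter)
            if (seen.add(document.getText())) {
                unique.add(document);
            }
        }
        // keep the five chunks with the highest similarity score
        unique.sort(Comparator.comparing(Document::getScore,
                Comparator.nullsLast(Comparator.reverseOrder())));
        return unique.subList(0, Math.min(5, unique.size()));
    }
}

Such a post-processor is registered with documentPostProcessors(new TopFiveDeduplicatingPostProcessor()) on the RetrievalAugmentationAdvisor builder.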
Conclusion
In this blog post, I showed you how to implement a basic traditional RAG system using Spring AI. The library provides all the building blocks to implement the ingestion and retrieval workflows in a simple way. Spring AI also supports more advanced concepts like query rewriting, query expansion, and document post-processing.