In this blog post, I'll show you how to implement a traditional RAG (Retrieval Augmented Generation) system using Spring AI. To set up a traditional RAG system, we need to implement two workflows: one for ingestion and one for retrieval and generation. Spring AI brings all the building blocks, and our task is to wire them together in the typical Spring fashion.
Prerequisites
For a traditional RAG system, we need a vector database to store the embeddings of the text chunks, an embedding model to generate those embeddings, and an LLM to generate the responses.
For the demo application, I use Docker Compose to start a Postgres database with the pgvector extension.
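A minimal compose file for this setup could look like the following sketch (the image and tag are assumptions; I use the official pgvector image, and the credentials match the application properties shown later in this post):

services:
  postgres:
    image: pgvector/pgvector:pg17
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
      POSTGRES_DB: postgres
    ports:
      - "5432:5432"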
In this blog post, I also wanted to demonstrate that Spring AI integrates with Ollama, a tool that enables you to run models locally. I downloaded bge-m3, an embedding model, and Google's Gemma 3n, an LLM.
ollama pull bge-m3:567m
ollama pull gemma3n:e4b
One advantage of running a local model is privacy. If you work with sensitive data, you don't want to send it to a third-party service; a local model allows you to keep the data in your environment. But don't expect wonders from a local model. It is usually not as powerful as the big models in the cloud, but it might work for your use case. It's worth a try.
Note that Spring AI also supports models running in the Docker Model Runner, an alternative to Ollama for running models locally. You can find more information in the documentation.
I ran this demo application on a Windows 11 machine with an Nvidia RTX 3060 GPU. Performance is not great, but it works. If you have a more powerful GPU, you might get faster responses.
Spring Setup
After starting the Postgres database and pulling the models with Ollama, we take a look at the Spring Boot application.
First, add the necessary dependencies to the pom.xml file:
<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-bom</artifactId>
            <version>1.0.0</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-starter-model-ollama</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-starter-vector-store-pgvector</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-advisors-vector-store</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-rag</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-pdf-document-reader</artifactId>
    </dependency>
</dependencies>
This demo application depends on quite a few libraries: the Ollama starter for the models, the pgvector starter for the vector store, the RAG and advisor modules for the retrieval workflow, and the PDF document reader to extract text from the PDF documents used in this demo.
The application property file contains the configuration for the Ollama models and the vector store.
spring.ai.ollama.embedding.options.model=bge-m3:567m
spring.ai.ollama.chat.options.model=gemma3n:e4b
spring.datasource.url=jdbc:postgresql://localhost:5432/postgres
spring.datasource.username=postgres
spring.datasource.password=postgres
spring.ai.vectorstore.pgvector.index-type=HNSW
spring.ai.vectorstore.pgvector.distance-type=COSINE_DISTANCE
spring.ai.vectorstore.pgvector.dimensions=1024
spring.ai.vectorstore.pgvector.batching-strategy=TOKEN_COUNT
spring.ai.vectorstore.pgvector.max-document-batch-size=10000
spring.ai.vectorstore.pgvector.initialize-schema=true
It is important that the dimensions configuration matches the length of the embedding vectors produced by the embedding model. The BGE-M3 model produces 1024-dimensional vectors, so the value is set to 1024. This value differs for each embedding model.
By default, Spring AI creates a table called vector_store when using pgvector. You can override this by setting the spring.ai.vectorstore.pgvector.table-name property to a different name. Because initialize-schema is enabled, Spring AI creates the table automatically if it does not exist yet.
The index type and distance type are set to HNSW and cosine distance. pgvector supports several index types and distance types; you can find more information on the pgvector project page.
Next, we take a look at how the two workflows are implemented.
Ingestion
The ingestion workflow is responsible for reading the documents, extracting the text, splitting the text into chunks, generating the embeddings, and storing them in the vector database.
This workflow can be a one-time process if you work with static documents that don't change, or a continuous process if documents are added or changed frequently.
Spring AI provides an Extract, Transform, and Load (ETL) framework to implement this workflow in a simple way.
First, the application instantiates a text splitter that splits the documents into chunks. The TokenTextSplitter implementation is a token-based text splitter. For this application, I use the default settings, which means this splitter creates chunks with a maximum size of 800 tokens.
private final TokenTextSplitter splitter = new TokenTextSplitter();
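If the default chunk size does not fit your documents, the splitter can be tuned. Here is a minimal sketch using the five-argument constructor (the values are illustrative, not recommendations):

// chunk size in tokens, minimum chunk size in characters,
// minimum chunk length to embed, maximum number of chunks, keep separators
private final TokenTextSplitter splitter = new TokenTextSplitter(512, 350, 5, 10000, true);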
In other libraries, you find splitters that use an overlap approach, where chunks overlap by a certain number of characters or tokens. This implementation does not do that. Spring AI 1.0.0 only provides the TokenTextSplitter implementation, but you can write your own splitter by extending the abstract TextSplitter class, as shown in the sketch below.
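As an illustration, here is a minimal sketch of a custom splitter with overlap (a hypothetical, character-based implementation; the abstract class only requires the splitText method):

import java.util.ArrayList;
import java.util.List;

import org.springframework.ai.transformer.splitter.TextSplitter;

public class OverlappingTextSplitter extends TextSplitter {

    private final int chunkSize = 1000;
    private final int overlap = 200;

    @Override
    protected List<String> splitText(String text) {
        List<String> chunks = new ArrayList<>();
        // advance by chunkSize minus overlap so consecutive chunks share characters
        int step = chunkSize - overlap;
        for (int start = 0; start < text.length(); start += step) {
            chunks.add(text.substring(start, Math.min(start + chunkSize, text.length())));
        }
        return chunks;
    }
}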
Splitting text is one of the challenges in a RAG system. There is no one-size-fits-all solution; it depends on the type of documents you work with. The TokenTextSplitter is a good starting point, but you might need to implement your own splitter for your specific data set.
The next component the application needs is a vector store. Spring AI automatically configures a VectorStore bean based on the properties defined in the application.properties file.
@Autowired
VectorStore vectorStore;
The implementation of the ingestion workflow is now straightforward. The application downloads a set of PDF documents from the IPCC website with the Spring RestTemplate.
private void ingest() {
    // 'urls' holds the list of PDF document URLs (defined elsewhere in the class)
    for (String url : urls) {
        // download the PDF into memory
        byte[] pdfBytes = this.restTemplate.getForObject(url, byte[].class);
        // wrap the bytes in a Resource; the URL serves as the filename metadata
        List<Document> documents = loadPdfs(new ByteArrayResource(pdfBytes) {
            @Override
            public String getFilename() {
                return url;
            }
        });
        // generate embeddings and store the chunks in the vector database
        this.vectorStore.add(documents);
    }
}
The code then extracts the text from each PDF with the PagePdfDocumentReader, which is part of the spring-ai-pdf-document-reader library. As the configuration below shows, you can ignore certain parts of each page; documents often have headers and footers that you don't want to include in the text. After extracting the text, the code calls the splitter to split the text into chunks.
List<Document> loadPdfs(Resource resourcePdf) {
    // the page margins and deleted text lines are set to 0 here, but these
    // settings can be used to strip headers and footers from each page
    PagePdfDocumentReader pdfReader = new PagePdfDocumentReader(resourcePdf,
            PdfDocumentReaderConfig.builder().withPageTopMargin(0).withPageBottomMargin(0)
                    .withPageExtractedTextFormatter(
                            ExtractedTextFormatter.builder().withNumberOfTopTextLinesToDelete(0)
                                    .withNumberOfBottomTextLinesToDelete(0).build())
                    .withPagesPerDocument(1).build());
    // read one Document per page, then split each into token-based chunks
    return this.splitter.apply(pdfReader.read());
}
All these text chunks are then passed to the VectorStore bean, which generates the embeddings and stores them in the vector database.
this.vectorStore.add(documents);
Retrieval and Generation
This workflow responds to user queries. First, it generates the embeddings for the user query; then it retrieves the relevant text chunks from the vector store with a similarity search; next, it augments the user prompt with the retrieved text chunks; and finally, it sends the augmented prompt to the language model to generate a response.
With Spring AI, the whole process can be implemented in a simple way. First, the application needs to create a ChatClient instance, which is done with the help of the ChatClient.Builder bean that Spring AI automatically configures based on the properties defined in the application.properties file.
private final ChatClient chatClient;
private final ChatClient.Builder chatClientBuilder;

public DemoApplication(ChatClient.Builder chatClientBuilder) {
    this.chatClientBuilder = chatClientBuilder;
    this.chatClient = chatClientBuilder.build();
}
The application then only has to configure an advisor with the vector store to use. The Spring AI Advisors API provides a way to intercept, modify, and enhance data sent to and from Large Language Models (LLMs).
String response = this.chatClient.prompt()
        .advisors(QuestionAnswerAdvisor.builder(this.vectorStore).build())
        .user("What are the main causes of climate change?").call().content();
The QuestionAnswerAdvisor is a predefined advisor that generates the embeddings for the user query, retrieves the relevant text chunks from the vector store, and augments the user prompt with the retrieved text chunks. The advisor uses a predefined prompt template, but you can easily override it with your own by calling promptTemplate(...) on the QuestionAnswerAdvisor.Builder instance.
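For example, here is a sketch of a custom template. I assume, based on the default template in version 1.0.0, that the {query} and {question_answer_context} placeholders are expected:

var template = new PromptTemplate("""
        {query}

        Answer using only the following context. If the context does not
        contain the answer, say that you don't know.

        {question_answer_context}
        """);

var customPromptAdvisor = QuestionAnswerAdvisor.builder(this.vectorStore)
        .promptTemplate(template)
        .build();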
The advisor can also be configured with a similarity threshold (0.0 = any similarity is accepted, 1.0 = an exact match is required), while topK sets the maximum number of text chunks to retrieve from the vector store. The SearchRequest also supports filtering documents. This is important for applications where each user has access to different documents.
var qaAdvisor = QuestionAnswerAdvisor.builder(this.vectorStore)
        .searchRequest(SearchRequest.builder().similarityThreshold(0.6).topK(20).build())
        .build();

String response = this.chatClient.prompt().advisors(qaAdvisor)
        .user("What are the main causes of climate change?").call().content();
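Here is a sketch of such a filter. It assumes the chunks were stored with a hypothetical userId metadata field during ingestion:

var filter = new FilterExpressionBuilder();

var filteredAdvisor = QuestionAnswerAdvisor.builder(this.vectorStore)
        .searchRequest(SearchRequest.builder()
                .similarityThreshold(0.6)
                .topK(20)
                // only consider chunks whose metadata matches the current user
                .filterExpression(filter.eq("userId", "user-42").build())
                .build())
        .build();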
Advanced Concepts
The QuestionAnswerAdvisor is a good starting point for a RAG system, but you can also implement more advanced retrieval and augmentation strategies. Spring AI provides the RetrievalAugmentationAdvisor class that allows you to configure different query transformers, query expanders, and document retrievers.
A common pattern you might see in other RAG systems is a component that rewrites the user query. The idea is to rewrite the user query to make it more specific and remove any irrelevant information. This step runs before the query embeddings are generated. Note that this adds a round trip to the LLM, so it increases the latency of the system.
Advisor retrievalAugmentationAdvisor = RetrievalAugmentationAdvisor.builder()
        // rewrite the user query with an extra LLM call before it is embedded
        .queryTransformers(RewriteQueryTransformer.builder()
                .chatClientBuilder(this.chatClientBuilder.build().mutate()).build())
        .documentRetriever(VectorStoreDocumentRetriever.builder().similarityThreshold(0.5)
                .topK(20).vectorStore(this.vectorStore).build())
        .build();

String response = this.chatClient.prompt().advisors(retrievalAugmentationAdvisor)
        .user("What are the main causes of climate change?").call().content();
The RewriteQueryTransformer uses a predefined prompt that can easily be overridden with promptTemplate(...) if you have a better prompt for your use case.
Another approach is to expand the user query with additional queries. The idea is to generate multiple queries based on the user query and then retrieve the relevant text chunks for each of these queries. This can help to improve the recall of the system. The MultiQueryExpander can be configured with the number of queries to generate and whether the original query should be included in the expanded queries. In this example, the expander generates three additional queries and includes the original user query, so the RetrievalAugmentationAdvisor creates embeddings for four queries and retrieves the text chunks for each of them.
Advisor retrievalAugmentationAdvisor = RetrievalAugmentationAdvisor.builder()
        .queryExpander(MultiQueryExpander.builder()
                .chatClientBuilder(this.chatClientBuilder.build().mutate())
                .includeOriginal(true).numberOfQueries(3).build())
        .documentRetriever(VectorStoreDocumentRetriever.builder().similarityThreshold(0.5)
                .topK(20).vectorStore(this.vectorStore).build())
        .build();

String response = this.chatClient.prompt().advisors(retrievalAugmentationAdvisor)
        .user("What are the main causes of climate change?").call().content();
The prompt that the MultiQueryExpander uses can also be overridden with promptTemplate(...). Like the RewriteQueryTransformer, this adds a round trip to the LLM.
The RetrievalAugmentationAdvisor also supports document post-processors (documentPostProcessors(...)). This code is called after the advisor has retrieved the text chunks from the vector store. A component you might see in other RAG systems is a re-ranker that ranks the retrieved text chunks based on their relevance to the user query. The idea is to remove irrelevant or redundant text chunks before they are sent to the language model. Version 1.0.0 of Spring AI does not ship an implementation of a document post-processor, so you have to write your own by implementing the DocumentPostProcessor interface.
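As an illustration, here is a minimal sketch (a hypothetical class, not a real re-ranker) that removes duplicate chunks and keeps only the five highest-scoring ones. A production re-ranker would typically call a dedicated re-ranking model instead of reusing the vector store scores:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.springframework.ai.document.Document;
import org.springframework.ai.rag.Query;
import org.springframework.ai.rag.postretrieval.document.DocumentPostProcessor;

public class TopFiveDeduplicatingPostProcessor implements DocumentPostProcessor {

    @Override
    public List<Document> process(Query query, List<Document> documents) {
        Set<String> seen = new HashSet<>();
        List<Document> unique = new ArrayList<>();
        for (Document document : documents) {
            // drop chunks with identical text (a simple redundancy filter)
            if (seen.add(document.getText())) {
                unique.add(document);
            }
        }
        // keep the five chunks with the highest similarity score
        unique.sort(Comparator.comparing(Document::getScore,
                Comparator.nullsLast(Comparator.reverseOrder())));
        return unique.subList(0, Math.min(5, unique.size()));
    }
}

Such a post-processor is registered with documentPostProcessors(new TopFiveDeduplicatingPostProcessor()) on the RetrievalAugmentationAdvisor builder.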
Conclusion
In this blog post, I showed you how to implement a basic traditional RAG system using Spring AI. The library provides all the building blocks to implement the ingestion and retrieval workflows in a simple way. Spring AI also supports more advanced concepts like query rewriting, query expansion, and document post-processing.