This blog post is inspired by this YouTube video from Matt Williams.
Matt shows a retrieval augmented generation (RAG) system that uses a web search to give context to the generation. It uses a local large language model (LLM) running in Ollama. In this blog post, I show you my attempt at building this program in Go.
Run LLM with Ollama ¶
We start with installing Ollama. Follow the instructions on the Ollama website. Ollama runs on macOS, Linux, and Windows. It is a platform and tool for running and experimenting with large language models (LLMs) on your local machine. It offers a command-line tool for downloading, running, and interacting with various language models without an internet connection. The models page shows all the available models that you can download and run.
I'm using Alibaba's Qwen 2.5 7B model for this blog post.
After installing Ollama, you can download the model with the following command:
ollama pull qwen2.5:7b
Before we start the model, we need to increase the context window. By default, Ollama uses a context window size of only 2048 tokens.
This needs to be bigger for a RAG workflow because we embed a lot of text in the context. To increase the context window, create a new file Modelfile somewhere on your system and add the following content:
FROM qwen2.5:7b
PARAMETER num_ctx 32768
32768 is a good starting point for the context window size. According to my information, this model's maximum context window is 128K tokens.
Now, we need to apply this change to the model. Run the following command:
ollama create -f Modelfile qwen2.5:7b
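To check that the parameter was applied, you can print the Modelfile of the newly created model:
ollama show --modelfile qwen2.5:7b
The output should include the PARAMETER num_ctx 32768 line.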
With the following command, we can start the model and interact with it:
ollama run qwen2.5:7b
What we want to do with the program is ask questions about recent events. When we ask the model such questions without giving any context, it cannot provide an answer. This makes sense because every LLM has a training cut-off date, in this case October 2023, so the model was only trained on data available up to that date. It is also possible that the model was simply never trained on this kind of information.
>>> Who won the Puzzle World Championship 2024?
As of my last update in October 2023, I don't have specific information about who won the Puzzle World Championship 2024
because detailed results and winners for that event are not available yet. The exact details would typically be announced
closer to or after the event's conclusion.
...
In the following, we will see how to leverage a web search to give context to the model so that it can answer our question.
Web Search ¶
For web searches, it would be best to have a system that returns results in a format that is easy to process programmatically, such as JSON. One such application is SearXNG. SearXNG is a meta-search engine aggregating results from more than 70 search services and databases. SearXNG can easily be deployed locally with the help of Docker.
An easy way to get started with SearXNG is to clone the searxng-docker repository with the following command.
git clone https://github.com/searxng/searxng-docker.git
Before starting the service, open the searxng/settings.yml file and change it so it looks like this:
# see https://docs.searxng.org/admin/settings/settings.html#settings-use-default-settings
use_default_settings: true

server:
  # base_url is defined in the SEARXNG_BASE_URL environment variable, see .env and docker-compose.yml
  secret_key: "supersecret"  # change this!
  limiter: false  # can be disabled for a private instance
  image_proxy: true

ui:
  static_use_hash: true

redis:
  url: redis://redis:6379/0

search:
  formats:
    - html
    - json
It is important to change the secret_key; SearXNG will not start if you don't change this value. I also disabled the rate limiter (limiter: false). Don't do this if you plan to expose the service to the internet; I only access it locally here, so it's not a problem. As the last change, I added the search formats html and json, because we want the search results in JSON format for easy processing in the Go program.
Note: On the first run, remove cap_drop: - ALL from the docker-compose.yaml file. SearXNG needs to create an ini file; without proper permission, this will fail. After the first run, add the configuration back.
We can now start the service with docker compose:
docker compose up -d redis searxng
The docker compose file also contains a configuration for Caddy, but I don't need it because I only want to connect to the service locally. Caddy is only used to terminate SSL connections. If you plan to expose the service to the internet, you can use Caddy as the reverse proxy.
Test if the service is running by opening the following URL in your browser: http://localhost:8080/. If port 8080 on your system is already in use, change the port in the docker-compose.yaml file.
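You can also check the JSON output directly from the command line by calling SearXNG's /search endpoint, which is the same endpoint the Go program will use later:
curl "http://localhost:8080/search?q=test&format=json"
This should return a JSON document that contains, among other fields, a results array with the individual search results.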
You can find more information about searxng-docker in the documentation.
With the LLM and web search engine in place, we can start building the program that leverages both systems.
Go Program ¶
The workflow of the Go program is as follows:
- Ask the LLM to reformulate the user question into a search query.
- Search the web with the search query.
- Extract the text of the 3 top search results.
- Ask the LLM the user question with the extracted text as context.
- Print the answer.
First, I added the following dependencies to the project:
go get github.com/go-shiori/go-readability
go get github.com/ollama/ollama
go-readability is a library that extracts the main readable content of an HTML page and removes all the clutter like buttons, ads, background images, script, etc. go-readability is a port of the JavaScript project Readability.js from Mozilla.
I also added the ollama package to the project; it contains a client that simplifies the interaction with the LLM. If you would like to access models on Ollama without adding the whole Ollama package to your Go program, check out this blog post, which only uses the Go standard library to interact with Ollama.
main ¶
The main function of the program shows the workflow described above:
func main() {
	ollamaURL, err := url.Parse(ollamaBaseURL)
	if err != nil {
		log.Fatal(err)
	}
	client := api.NewClient(ollamaURL, httpClient)

	query := "Who won the Puzzle World Championship 2024?"

	searchQuery := getSearchQuery(client, query)
	fmt.Println("Query:", searchQuery)

	searchResponse, err := webSearch(searchQuery)
	if err != nil {
		log.Fatal(err)
	}

	searchContext := buildSearchContext(searchResponse.Results)
	answer := getAnswer(client, query, searchContext)
	fmt.Println(answer)
}
The program reformulates the user question into a search query with getSearchQuery, sends the query to the web search engine with webSearch, extracts the text of the top 3 search results with buildSearchContext, and finally asks the LLM the user question with the extracted text as context with getAnswer.
The call to api.NewClient creates a new client to access Ollama. It takes the Ollama URL, which by default is http://localhost:11434, and an instance of http.Client as arguments. Because LLMs running on your local computer can sometimes take a while to answer, set the timeout of the http.Client to an appropriate value. Here, I set it to 10 minutes.
var httpClient = &http.Client{
	Timeout: 10 * time.Minute,
}
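The snippets in this post also reference a few package-level constants that are not shown. Here is a minimal sketch of what they could look like; the concrete values (Ollama URL, SearXNG search endpoint, model name, and number of results) are assumptions derived from the setup described above:
const (
	// Values assumed from the setup described in this post.
	ollamaBaseURL = "http://localhost:11434"       // default Ollama API endpoint
	searchBaseURL = "http://localhost:8080/search" // search endpoint of the local SearXNG instance
	model         = "qwen2.5:7b"                   // the model created above
	maxResults    = 3                              // number of search results used as context
)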
getSearchQuery ¶
The getSearchQuery function reformulates the user question into a search query. It uses the LLM to generate the search query.
func getSearchQuery(client *api.Client, query string) string {
	messages := []api.Message{
		{
			Role:    "system",
			Content: "You are a professional web searcher.",
		},
		{
			Role:    "user",
			Content: "Reformulate the following user prompt into a search query and return it. Nothing else.\n\n" + query,
		},
	}
	response := executeChat(client, messages)
	return strings.Trim(response, "\"")
}
executeChat sends the messages to the LLM running on Ollama and returns the response. You can find the implementation of executeChat here. The program removes any leading and trailing quotes from the response; I noticed that sometimes the answer is quoted, which would result in search results that are not what we expect.
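For completeness, here is a rough sketch of what such a helper could look like with the Ollama Go client; the linked implementation may differ in details, and the model constant is the assumption from above:
// executeChat sends the messages to Ollama and returns the complete answer.
func executeChat(client *api.Client, messages []api.Message) string {
	stream := false
	req := &api.ChatRequest{
		Model:    model,
		Messages: messages,
		Stream:   &stream, // disable streaming so the callback is invoked once with the full answer
	}
	var sb strings.Builder
	err := client.Chat(context.Background(), req, func(resp api.ChatResponse) error {
		sb.WriteString(resp.Message.Content)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
	return sb.String()
}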
webSearch ¶
The webSearch function sends the query from the previous step to SearXNG and returns the response as a SearchResponse struct.
func webSearch(query string) (*SearchResponse, error) {
	encodedQuery := url.QueryEscape(query)
	requestURL := fmt.Sprintf("%s?q=%s&format=json", searchBaseURL, encodedQuery)
	response, err := httpClient.Get(requestURL)
	if err != nil {
		return nil, err
	}
	defer response.Body.Close()

	var searchResponse SearchResponse
	if err := json.NewDecoder(response.Body).Decode(&searchResponse); err != nil {
		return nil, err
	}
	return &searchResponse, nil
}
Accessing SearXNG is straightforward. Just send a GET request with the search query in the q parameter and add &format=json to get the results in JSON format.
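The SearchResponse and Result structs are not shown above. A minimal version only needs to map the fields the program actually uses from SearXNG's JSON response, where each result carries at least a url, a title, and a content snippet:
// Structs matching the parts of the SearXNG JSON response used by the program.
type SearchResponse struct {
	Results []Result `json:"results"`
}

type Result struct {
	URL     string `json:"url"`
	Title   string `json:"title"`
	Content string `json:"content"`
}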
buildSearchContext ¶
This is the function where the application extracts the text from the top 3 search results. It utilizes go-readability to extract the main content of the page. To speed up the process, it fetches the content concurrently.
func buildSearchContext(results []Result) string {
	var wg sync.WaitGroup
	resultChan := make(chan string, maxResults)
	for i, result := range results {
		if i >= maxResults {
			break
		}
		wg.Add(1)
		go func(result Result) {
			defer wg.Done()
			content, err := fetchTextContent(result.URL)
			if err != nil {
				log.Printf("Error fetching content for URL %s: %v", result.URL, err)
				return
			}
			resultChan <- fmt.Sprintf("%s\n%s\n\n", result.URL, content)
		}(result)
	}
	wg.Wait()
	close(resultChan)

	var contextSB strings.Builder
	for res := range resultChan {
		contextSB.WriteString(res)
	}
	return contextSB.String()
}

func fetchTextContent(url string) (string, error) {
	fmt.Println("Fetching content for URL:", url)
	article, err := readability.FromURL(url, 30*time.Second)
	if err != nil {
		return "", err
	}
	return article.TextContent, nil
}
getAnswer ¶
The final puzzle piece is the getAnswer function. It sends the user question and the context to the LLM and returns the answer.
func getAnswer(client *api.Client, query, context string) string {
	messages := []api.Message{
		{
			Role: "user",
			Content: fmt.Sprintf("%s\n\nOnly return answer based on the context. "+
				"If you don't know return I don't know:\n###%s\n###", query, context),
		},
	}
	return executeChat(client, messages)
}
Demo ¶
Now, we can run the program and see if it works. The program should return the answer to the user's question.
go run .
The output looks like this:
Query: Puzzle World Championship 2024 winner
Fetching content for URL: https://www.worldjigsawpuzzle.org/wjpc/2024/individual/final
Fetching content for URL: https://www.worldjigsawpuzzle.org/
Fetching content for URL: https://en.wikipedia.org/wiki/2024_World_Jigsaw_Puzzle_Championship
Based on the provided information, Kristin Thuv from Norway won the Individual event of the 2024 World Jigsaw Puzzle Championship with a time of 00:37:58.
After "Query:" we see the reformulated search query. Then, it lists the 3 URLs from which the program extracted the text. The last line shows the answer from the LLM to the user question.
Conclusion ¶
We have seen how we can leverage a web search engine to give context to a large language model, how we can run an LLM locally with Ollama, and how we can interact with the LLM from a Go program. This simple example shows how easy it is to build a retrieval augmented generation (RAG) workflow that answers questions not covered by the LLM's training data.