
Writing a GPT question-answerer: An implementer’s notes

A robot, thinking. Created by Midjourney.

(note: I wrote this a long time ago. Nowadays we would call this RAG. The website fell over because Pinecone took down my database, and I no longer maintain it. Caveat emptor!)

I recently wrote MRBot, a website that answers questions using information from the blog Marginal Revolution. Writing GPT bots is in vogue nowadays, and I won’t pretend that this is breaking any new ground. But it’s a fun project, and I recommend it.

In my case, I like to check Marginal Revolution for travel advice, as I find that it often has good angles other sites don’t capture. However, using traditional search can be a bit annoying — for instance, if I want advice on travelling to Kenya, I might need to do multiple searches for Kenyan food, and advice on Nairobi, and general advice on travelling to eastern Africa. Then I’ll need to use my brain to filter and synthesize my results.

The benefit of embeddings-based search, which is what MRBot uses, is that it can effectively do all these searches at once. Because it then uses GPT to summarize the results it gets, it can also provide useful context to tie results together, saving my brain a bunch of cognitive effort. It provides a different — and, hopefully, better — search experience.

Here’s an example query and answer:

Q: Where should I eat in Mexico?

A: To eat well in Mexico, follow these tips:

  1. Look for time-specific food, as it’s usually fresh and for locals. For example, in San Miguel, there is barbacoa from 8-10 a.m., carnitas from 11-4, and chorizo after 8 p.m.
  2. Often the best meals are served in places with no names, like roadside stands with plastic tables and chairs.
  3. Roadside restaurants on the edges of towns or between towns serve some of the best food in Mexico.
  4. Focus on breakfast and lunch for your best meals, as dinner is often an afterthought and many good places close by 8 p.m.

Specific recommendations include:

  • Tres Marias, a food village off the highway on the way to Cuernavaca, known for green chilaquiles and chorizo tacos.
  • Coyoacan, a suburb of Mexico City, where you can find excellent and sanitary food stalls (“comedores”) near the Frida Kahlo museum.
  • Pujol in Mexico City, which is a fancy place that does wonderful things with vegetables and offers the Menu de la Tierra.

Remember that the best food experiences in Mexico often come from exploring local markets, street food vendors, and small, family-run establishments.

It’s pretty neat!

The authoritative guide on how to do this is Question answering using embeddings, an IPython notebook provided by OpenAI. I won’t repeat what’s there; instead, you should think of this blog post as a companion guide to that cookbook.

(An embedding, in this context, is a way to transform a passage of words into a numerical representation that a computer can work with. It usually looks like a vector (or list) of 300-1500 floating-point numbers. Embeddings are useful in machine learning because they represent semantic differences as numerical differences, which means you can use them to cluster text or to ‘measure’ how similar/dissimilar different passages of text are.)
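To make that "measuring" idea concrete, here's a small sketch using the open-source embedding model I ended up using (discussed further below). The example sentences are just illustrative:

  from scipy import spatial
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer('all-MiniLM-L6-v2')

  query = model.encode("Where should I eat in Mexico?")
  on_topic = model.encode("Roadside restaurants serve some of the best food in Mexico.")
  off_topic = model.encode("Russia and China have four centuries of conflict and concord.")

  # Cosine similarity: values closer to 1.0 mean the passages are more semantically similar.
  print(1 - spatial.distance.cosine(query, on_topic))   # relatively high
  print(1 - spatial.distance.cosine(query, off_topic))  # relatively low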

I’m going to assume some familiarity with systems design and AI, and I won’t go through all the details of how it works. (To be honest, I don’t grok transformers yet and I don’t know how embedding models are trained, but it turns out the libraries provide good enough abstractions that I don’t need to!)

At a high level, here’s what the project involved:

  • Step one & two: do data pre-processing (done once)
    • First you need to actually scrape data and put it in some retrievable place.
    • Then, you need to decide how to generate embeddings for the data, and store those embeddings somewhere.
  • Step three: retrieval & summarization (done once per query)
    • Whenever you get a query, generate an embedding for that.
    • Then, do similarity search between your query embedding and all the data embeddings, and find the data that’s closest to your query.
    • You could return this on its own, in a Google-like way. However, most people who do this usually add some summary/synthesis by sending this data to GPT.
  • (Optional) Step four: build a user-friendly UI
    • This actually involves no ML knowledge at all, but I dumped some notes on this here anyway.

Note that if you're not doing this as a learning exercise, developing question-answerers manually is slightly overkill nowadays. There are libraries and services out there that can do this for you. LlamaIndex is an example.


Step one: Gather data

The first thing I needed to do was get my data. This required no ML knowledge at all; it’s all traditional scraping and parsing. Fortunately, Marginal Revolution is pretty easy to scrape. I’m not the first one to try.

I wrote a script in Python using BeautifulSoup and dumped each article into a SQLite database. I saved the raw HTML of each post, which I wouldn’t recommend. It’s worthwhile to save URLs and bold/italic information, since GPT can use these in its response, but I could have done that by saving Markdown instead.
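For what it's worth, the scraper is only a few dozen lines. Here's a minimal sketch of its shape; the CSS selectors and schema are illustrative guesses rather than my actual code:

  import sqlite3

  import requests
  from bs4 import BeautifulSoup

  conn = sqlite3.connect("mr.db")
  conn.execute(
      "CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, title TEXT, html TEXT)"
  )

  def scrape_article(url: str) -> None:
      # Fetch one post and store its title and body HTML.
      resp = requests.get(url)
      resp.raise_for_status()
      soup = BeautifulSoup(resp.text, "html.parser")
      title = soup.find("h1").get_text(strip=True)      # selector is a guess
      body = soup.find("div", class_="entry-content")   # selector is a guess
      conn.execute(
          "INSERT OR REPLACE INTO articles VALUES (?, ?, ?)",
          (url, title, str(body)),
      )
      conn.commit()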

Step two: Generate embeddings

Now that I had my data, I needed to generate my embeddings.

If I were to go back and re-do this project, this is the step I would spend the most time revisiting. The quality of the embeddings is the most important part of this project.

Unfortunately, it’s not something that can really be systematized. One thing I’ve been surprised about is how much ML research is an art, not a science. The end-quality of your product or model seems to depend a lot on your hyperparameters and/or design choices. In this case, I found the two key variables were:

  • Which embeddings model I chose to use
  • How I chunked my text

Which embeddings model?

You should collect some basic statistics on your data before deciding which embedding model to use. I say this from bitter experience, because I didn’t. For your benefit, here are the relevant ones anyway:

  • I scraped 34,305 articles.
  • The articles averaged 199 words each. The shortest article had 2 words, and the longest had 4,958. (‘Words’ here are counted by splitting the HTML on spaces.)
  • In total there were around 6.8 million words, which we can ballpark as about 9 million tokens (in English, a word usually comes out to about 4/3 tokens).

The OpenAI cookbook suggests using their text-embedding-ada-002 model. At $0.0004 / 1k tokens, this is fairly cheap, but not free. I needed to embed 9 million tokens, so embedding this using OpenAI would have cost me 4 dollars. You might be happy to pay $4, but sad if your data sample is orders of magnitude larger than mine.

(You also need to pay to use this embeddings model on every query you get, but the cost of calling GPT on every query dominates this, so it’s not really worth worrying about.)

I didn’t do this math before I started working, and I was feeling cheap, so I used an open-source model instead. HuggingFace’s sentence_transformers library makes it very easy to download and use these models. There isn’t a quality dropoff – this article claims that all-MiniLM-L6-v2 is better than ada, and has a more detailed discussion on how embeddings models compare. Based on it, I used all-MiniLM-L6-v2.
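Generating the embeddings themselves is then only a couple of lines. A sketch (the batch size is arbitrary, and the "chunks" are the pieces of text discussed in the next section):

  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer('all-MiniLM-L6-v2')

  def embed_chunks(chunks: list[str]):
      # encode() batches internally and returns one 384-dimensional
      # numpy vector per input string for this model.
      return model.encode(chunks, batch_size=64, show_progress_bar=True)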

This saved me money, but caused me pain. One downside of using an open-source model is that you’ll need to make sure you can generate embeddings at scale in production. This can also make it a pain to deploy (see my notes later). You don’t need to worry about any of this if you use OpenAI’s api.

You might also get great results from customizing embeddings — this OpenAI cookbook demonstrates how you would do that. I didn’t try it.

Chunking

Like GPT, most of these models have a maximum context window. text-embedding-ada-002’s sequence length is 8192 tokens, which is long enough for almost all MR posts. all-MiniLM-L6-v2's is 512 tokens, which is long enough for the average post.

Even if you’re not limited by your embedding model, you’re ultimately limited by GPT. GPT-3.5 has a context window of 4k tokens, and regular GPT-4 has a context window of 8k tokens.

This means you need to break your data up into pieces. How? It depends! It’s up to you whether it makes sense to feed it 10 results of 300 tokens each, or 2 results of 1500 tokens each.

I wanted to get through this and did the simplest thing possible: I split each article into paragraphs, and considered each paragraph a chunk. This works well for some Marginal Revolution articles, like this one, where each paragraph basically covers a different topic.

However, it works less well on others, like this one:

Title: Russia and China

Authored by Philip Snow, the subtitle is Four Centuries of Conflict and Concord. This book is excellent and definitive and serves up plenty of economic history, here is one bit from the opening section:

The trade nonetheless went ahead with surprising placidity. Now and again there were small incidents in the form of cattle-rustling or border raids. In 1742 some Russians were reported to have crossed the frontier in search of fuel, and in 1744 two drunken Russians killed two Chinese traders in a squabble over vodka.

I am looking forward to reading the rest, you can buy it here.

The fact that Cowen recommends the book is in a different paragraph from the one that actually describes the book. So if our embeddings-based search returns one of these paragraphs but not the other, the result will be confusing to GPT and to the reader.

For instance, the paragraph "Authored by Philip Snow, the subtitle is Four Centuries of Conflict and Concord. This book is excellent and definitive and serves up plenty of economic history, here is one bit from the opening section:" will only come up if you search for Philip Snow and not if you search for Russia, or China, or great power conflict.

I decided not to worry about this. Once I'd embedded enough paragraphs, I gave my search a try. The results were awful: paragraphs alone didn't carry enough context. I still didn't want to embed entire articles, so I added the post title to each paragraph's embedding as well. That fixed the problem; it turns out that titles can do a lot to contextualize individual paragraphs of posts.

For instance, in our example above, now we're capturing the paragraph "Russia and China: Authored by Philip Snow, the subtitle is Four Centuries of Conflict and Concord. This book is excellent and definitive and serves up plenty of economic history, here is one bit from the opening section:". This is now much more likely to come up if we search for "great power conflict" or "Chinese foreign policy"!
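The chunking I ended up with is roughly the following sketch; the real version also kept URLs and other metadata alongside each chunk:

  from bs4 import BeautifulSoup

  def chunk_article(title: str, html: str) -> list[str]:
      # Split a post into paragraphs and prefix each with the post title,
      # which gives the embedding model enough context to place the paragraph.
      soup = BeautifulSoup(html, "html.parser")
      paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
      return [f"{title}: {p}" for p in paragraphs if p]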

In my opinion, this is the most fun part of the project to tweak. If I re-did this, I would try splitting up MR posts into those which are high-context and those which are just lists of things, and treating them differently! And if I were working on this at a larger scale, I would write some tooling to make it easy to try different embedding models on different datasets.

Step 2.5: Storage

This is a half-step because you need to figure it out while you’re generating embeddings. To start off with, I stored my embeddings in SQLite (as blobs) while I was generating them. I'd then load them all into memory and iterate over every one of them while calculating similarity.
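If you're wondering what "embeddings as blobs in SQLite" looks like, here's a rough sketch; the table layout is illustrative, not my exact schema:

  import sqlite3

  import numpy as np

  conn = sqlite3.connect("mr.db")
  conn.execute(
      "CREATE TABLE IF NOT EXISTS embeddings (chunk_id INTEGER PRIMARY KEY, chunk TEXT, vector BLOB)"
  )

  def save_embedding(chunk_id: int, chunk: str, vector: np.ndarray) -> None:
      # Store the vector as raw float32 bytes.
      conn.execute(
          "INSERT OR REPLACE INTO embeddings VALUES (?, ?, ?)",
          (chunk_id, chunk, vector.astype(np.float32).tobytes()),
      )
      conn.commit()

  def load_embeddings() -> list[tuple[str, np.ndarray]]:
      # Pull everything back into memory for brute-force similarity search.
      rows = conn.execute("SELECT chunk, vector FROM embeddings").fetchall()
      return [(chunk, np.frombuffer(blob, dtype=np.float32)) for chunk, blob in rows]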

This got to be too slow in production. I've been told there are various hacks that will let you optimize this. I decided to throw money at the problem instead, and switched over to Pinecone.

Step 3: Retrieval

This is actually the easy part! The OpenAI cookbook gives you all the code you need. My modified version of it (swapping out the OpenAI call for a sentence_transformer call) basically looked like this:

  import numpy as np
  import pandas as pd
  from sentence_transformers import SentenceTransformer
  from scipy import spatial  # for calculating vector similarities for search

  model = SentenceTransformer('all-MiniLM-L6-v2')

  def strings_ranked_by_relatedness(
      query: str,
      embeddings: list[tuple[Article, np.ndarray]],  # Article is a small record type for a chunk plus its metadata (not shown)
      relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
      top_n: int = 100
  ) -> tuple[list[str], list[float]]:
      """Returns articles and relatednesses, sorted from most related to least."""
      # Embed the query with the same model used to embed the corpus.
      query_embedding = model.encode(query)
      # Score every stored chunk against the query (cosine similarity by default).
      articles_and_relatednesses = [
          (article, relatedness_fn(query_embedding, e))
          for (article, e) in embeddings
      ]
      articles_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
      articles, relatednesses = zip(*articles_and_relatednesses)
      return articles[:top_n], relatednesses[:top_n]

If your process is anything like mine, your most related articles will probably be a bit of a mess. Here’s a shortened version of what mine looks like, for the query about Mexican food:

Use the below articles from Marginal Revolution to answer the subsequent question.
Do your best to synthesize the results, as the articles may not directly answer the question or might be irrelevant.
Do not add information that is not in the articles.
Give specific examples. If the user’s query is not a question, say ‘I don’t know how to answer that.’
How to eat well anywhere in Mexico: <p>5. Roadside restaurants, on the edges of towns or between towns, serve some of the best food in Mexico or anyhere else for that matter. Some of these restaurants even have names, though you can overlook that in the interests of eating well. </p>
How to eat well anywhere in Mexico: <p>1. Look for time-specific food. In San Miguel for instance, there is barbacoa [barbecue] from 8-10 a.m., carnitas from about 11-4, and wonderful chorizo after 8 p.m. In Mexico, if the food is available only part of the day, it’s almost always good. It’s for locals and there is no storage in these places so it’s also extremely fresh.</p>
How to eat well anywhere in Mexico: <p>You’ll sometimes hear fallacious claims that San Miguel Allende or Guanajuato or other parts of Mexico don’t have superb food. What is true, in many Mexican cities, is that almost every place near the main square is only so-so. Here’s what to do:</p>
[...]
Where to eat right next to Phoenix airport?: <p>Chili or Mexican food would be ideal. Thanks in advance for the help…</p>
Some food notes from Mexico City: <p><a href=“http://pujol.com.mx/“>Pujol</a> does wonderful things with vegetables and is perhaps the best f la Tierra.</p>
Some food notes from Mexico City: <p>They have done away with the food stalls at the Zócalo. In Mexico City calorie-counting menus are common and gelato is being replaced by frozen yogurt (!).
Question: Where should I eat in Mexico?

This is, I think, why most versions of embeddings search call GPT to summarize the results: it's a pain for a human to go through all of that and extract the useful information. But if the goal is just to gather enough usable information for GPT to synthesize, you can feed this in and get something human-readable out the other side.

You'll want to keep the temperature low so that GPT doesn't hallucinate information. I found that GPT-4 was notably better at synthesizing information than GPT-3 was.
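For reference, the synthesis call itself is short. Here's roughly what it looks like with the openai Python client as it existed at the time (the prompt argument is the retrieved-articles-plus-question block quoted above; the system message here is illustrative):

  import openai

  def answer(prompt: str) -> str:
      # Low temperature keeps the model close to the retrieved articles
      # rather than inventing details.
      response = openai.ChatCompletion.create(
          model="gpt-4",
          messages=[
              {"role": "system", "content": "You answer questions using excerpts from Marginal Revolution."},
              {"role": "user", "content": prompt},
          ],
          temperature=0,
      )
      return response["choices"][0]["message"]["content"]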

This does get expensive! I’m just paying for MRBot queries right now, because this is fairly niche. (Happily, OpenAI allows me to set a monthly cap on what I’ll pay them, which I’m using to limit my costs.)

If you were doing this commercially, you’d probably want to charge customers, or have them charge it to their own OpenAI account. If you were doing this in-house, the cost of each query is probably negligible compared to how much time the people working with you will save, but if you care you can implement token tracking and quotas.

Step 4: UI

Hilariously, this step took me longer than the last two (or three) put together. However, I wanted to make a nice website that my friends and I could access from our phones, and that required some traditional software development.

MRBot is a simple React frontend (deployed using Cloudflare Pages) that calls out to a backend Flask service. I spent an absurd amount of time trying to figure out whether the best practices around deploying a Flask app have changed at all in the last 10 years. They haven’t.[1]

[1] With one caveat — I spent some time poking around to see if I could make a nice Docker setup for the application. I could, but only by making Docker images that were 6GB+ in size (!). It turns out that the CUDA libraries used by sentence_transformers are really big. If you instead use OpenAI’s API to use your embeddings, I suspect you don’t need to include these huge libraries in your requirements.txt, which makes deploying with Docker reasonable.

Peculiarities of OpenAI API calls

OpenAI responses are really slow. If you use gunicorn, you’ll want to make sure your workers are using gevent, or you’ll block a worker for the whole time you’re waiting on the API call to return. asyncio would do a reasonable thing here too, but I’m not familiar with it.

The only really interesting thing I’ll note here is that it was a little bit of effort to stream the OpenAI response to the client. However, it’s worth doing this because it’s a notable improvement to the user’s experience. I implemented it using a hacky version of server-side events.
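A minimal version of that streaming setup looks something like the sketch below. It assumes the same era of the openai client and a bare Flask route; a production version would want error handling and the retrieval step wired in:

  import openai
  from flask import Flask, Response, request

  app = Flask(__name__)

  @app.route("/ask")
  def ask():
      prompt = request.args.get("q", "")

      def generate():
          # stream=True yields chunks as the model produces them.
          stream = openai.ChatCompletion.create(
              model="gpt-4",
              messages=[{"role": "user", "content": prompt}],
              temperature=0,
              stream=True,
          )
          for chunk in stream:
              delta = chunk["choices"][0]["delta"].get("content", "")
              if delta:
                  # Server-sent events frame each message as "data: ...\n\n".
                  yield f"data: {delta}\n\n"

      return Response(generate(), mimetype="text/event-stream")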

A plug for Figma

One thing I’ve learned recently while developing projects is that Figma is awesome for prototyping websites. I used to design websites by tweaking raw code and looking at it in Chrome.

In practice, though, you really don’t want to do this more than necessary — it’s annoying to remember the syntax for the changes you want to do, and time-consuming to wait for it all to reload. Nowadays, I prototype websites by using auto layout in Figma and basic component libraries. Once I have a layout I like, I then turn that into HTML, and usually end up with something decent. It’s notably improved my frontend development experience.


A final point about ethics. I added a large disclaimer to the MRBot website clarifying that nothing here should be taken as the view of the authors. As discussed above, I made a lot of creative choices when writing this bot, all of which change its output in ways that don't reflect the authors' intent. All faults are mine, etc. And I don't think it would be right, or legal, to charge for this.

The legal issues here remind me of those around fan remixes or fanfiction though, where technology changed how people could appreciate work faster than copyright law could keep up. If you write a bot that summarizes Wikipedia entries, can you sell that? What if it draws on information from multiple sources, some of which are freely-available and some of which are not? I'd like us to err on the side of experimentation, but others might not.

But other than that, that's it! I hope you learned something from this. Do let me know if you end up doing something similar!