The situation was dire. Napoleon’s army was far from its supply lines, with the harsh Russian winter closing in. Their only hope of shelter lay in the capital city of Moscow, but, arriving there in September 1812, they found the city in flames. Of the half-million men who had set out with Napoleon, only ten thousand would survive the retreat.
The conventional version of this history says that the Russians purposely torched their own city. Writing in War and Peace, however, Count Leo Tolstoy won’t admit such a desperate tactic. He presents a strong case that Moscow, “a town built of wood, where scarcely a day passes without conflagration” would naturally burn once its people – and the fire department – evacuated.
If you ask ChatGPT about all this, it will give the conventional answer. If you ask, “according to Tolstoy,” it will still give the conventional answer, plus some convincing – and incorrect – ideas about the book. That’s because ChatGPT has never read the book!
It’s adorable, because it bluffs about the reading, exactly like a slacking college student. What was the professor thinking, assigning a thousand-page novel?
Now, thanks to Retrieval Augmented Generation (RAG), you can help ChatGPT answer such questions by priming it with selected passages from the novel. So, I did. I wanted to demonstrate that RAG would support more-advanced text analyses:
- Answer questions using passages from a single novel
- Answer questions using passages from two novels and compare them
- Compare passages from two novels based on an unseen question
- Compare passages from two novels based on a similarity search
This week, we’ll cover the basics using War and Peace, and then I’ll share the two-novel results next week.
War and Peace and RAG
The basic idea behind RAG is simply to query a text database for the priming material before handing the problem over to a Large Language Model (LLM) like ChatGPT. To be precise, I am using the OpenAI API to work with the GPT-3.5 model.
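In outline, the generation half looks something like this (a minimal sketch, assuming the current OpenAI Python client and the gpt-3.5-turbo chat model; the retrieved passages come from the database described below):

```python
# Minimal sketch of the "generation" half of RAG: stuff the retrieved
# passages into the prompt, then ask GPT-3.5 to answer from them.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_with_context(question: str, passages: list[str]) -> str:
    context = "\n\n---\n\n".join(passages)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context passages."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```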
The only AI involved in the retrieval step is the “embedding model” that converts the text and the query string into vectors. Apart from that, it’s a text search. You could, conceivably, use old-school text-search techniques to do the job. I haven’t tried that yet. What I did try were these three embedding models:
- text-embedding-3-large
- text-embedding-ada-002
- text-embedding-3-small
The embedding models convert chunks of text into vectors of varying length, hence the “large” and “small” size designations. If you’re an AI person reading this, you already know about mapping words to vectors. Mapping paragraphs to vectors follows the same principle: word vectors are due to Mikolov et al., and the extension to paragraphs is Le and Mikolov’s Distributed Representations of Sentences and Documents.
If you’re a language person, well, you won’t be surprised to learn that the English lexicon can be arranged into a spatial array so that “cat” and “dog” end up together. And, if a 3-D word space is good, a 300-D word space is better!
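Getting one of those vectors is a single API call (a sketch with the OpenAI Python client; swap in any of the three models listed above):

```python
# Turn a piece of text into an embedding vector.
from openai import OpenAI

client = OpenAI()

def embed(text: str, model: str = "text-embedding-ada-002") -> list[float]:
    response = client.embeddings.create(model=model, input=text)
    return response.data[0].embedding

vector = embed("Moscow was burning.")
print(len(vector))  # 1,536 dimensions for ada-002
```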
I downloaded some novels from the Gutenberg project, did some basic text parsing on them, and then converted each into its own vector database. I used the Chroma database natively and with the LangChain library. Other popular vector databases include Milvus and Pinecone.
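Building the Chroma store through LangChain looks roughly like this (a sketch, assuming the langchain-community and langchain-openai packages; the path and the sample document are placeholders):

```python
# Build a persistent Chroma vector store from chunked documents.
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Stand-in for the real chunked chapters.
docs = [
    Document(page_content="Pierre gazed at the glow over the city...",
             metadata={"source": "data/tolstoy/Chapter_255.txt"}),
]

vectordb = Chroma.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(model="text-embedding-ada-002"),
    persist_directory="db/tolstoy",   # hypothetical location
)
```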
I parsed the novels into chapters first, so that Chroma would pick up the chapter headings as metadata. If you use Project Gutenberg, be sure to stop parsing at THE END because there are about 600 lines of legal stuff after that.
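The parsing itself is nothing fancy. A rough sketch of the idea (the file name and the chapter-heading pattern are placeholders and will vary by edition):

```python
# Split the Gutenberg plain text into chapter files, stopping at "THE END"
# so the ~600 lines of license boilerplate get dropped.
import re
from pathlib import Path

raw = Path("data/tolstoy/war_and_peace.txt").read_text(encoding="utf-8")
body = raw.split("THE END")[0]                      # everything after is legal text

chapters = re.split(r"\n(?=CHAPTER\s+[IVXLC]+)", body)
for i, text in enumerate(chapters[1:], start=1):    # chapters[0] is front matter
    Path(f"data/tolstoy/Chapter_{i}.txt").write_text(text, encoding="utf-8")
```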
With the database built, here is the RAG-assisted answer to the question of who burned Moscow:

Based on the context provided, it is implied that the Russians did not intentionally set fire to Moscow. The fire in Moscow was attributed to a combination of factors such as the presence of foreign troops, abandoned wooden buildings, soldiers’ carelessness with pipes and campfires, and the general chaos and looting that ensued. The burning of Moscow was seen as a result of the circumstances rather than a deliberate act of arson by the Russians.
The main thing to tune here is the size of the text chunks in the database. You’ll be passing text chunks to the LLM via the API, and there’s a size limit: GPT-3.5 has a context window of roughly 3,000 words.
The typical RAG use case is simply to find one good chunk with the answer. For the Moscow fire question, that is chunk #1440, which happens to be only 0.12 “cosine distance” away from the query text:
{'source': 'data\\tolstoy\\Chapter_255.txt', 'start_index': 10994}
But what if you’re asking “who is the best Russian general?” or “how many siblings does Natasha have?” Clues will be scattered throughout the text, and you will need more, smaller chunks.
Here, for instance, is the answer that comes back when I ask why Prince Andrew doesn’t marry Natasha right away:

Prince Andrew doesn’t marry Natasha immediately because he decides to conform to his father’s wish to propose and postpone the wedding for a year. This decision is made in the hopes that either his own or Natasha’s feelings may change within that time period. Despite Natasha’s distress and desire to marry sooner, Prince Andrew feels bound by his father’s wish and the decision to delay the wedding.
I would say that smaller is better because the retriever can always fetch multiple chunks from the same neighborhood – as long as they’re at least big enough to be picked up by the vector search. After some experimentation, I settled on a chunk size of 1,200 characters, with the ada-002 embedding, which has 1,536 dimensions.
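In practice, fetching multiple chunks just means asking the retriever for more neighbors. A sketch with LangChain’s retriever interface (the persisted path and the value of k are illustrative):

```python
# Reopen the persisted store and fetch the k nearest chunks for a question.
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

vectordb = Chroma(
    persist_directory="db/tolstoy",   # hypothetical path from earlier
    embedding_function=OpenAIEmbeddings(model="text-embedding-ada-002"),
)
retriever = vectordb.as_retriever(search_kwargs={"k": 6})
passages = [d.page_content
            for d in retriever.invoke("How many siblings does Natasha have?")]
```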
All operations on the API are sized (and priced) in “tokens,” so it’s a good idea to employ the tiktoken counter, and keep an eye on your token limits. My 1,200-character chunks run around 220 tokens.
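Counting is easy (assuming the tiktoken package and the gpt-3.5-turbo encoding):

```python
# Count tokens the same way the API will count (and bill) them.
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

chunk = open("data/tolstoy/Chapter_255.txt", encoding="utf-8").read()[:1200]
print(len(encoding.encode(chunk)))   # a 1,200-character chunk lands near 220 tokens
```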
LangChain’s recursive text splitter does its best to honor sentences and paragraphs, but not semantics. It still feels like taking a favorite novel and chopping it up in a Cuisinart. Greg Kamradt has invented a semantic text splitter, which can detect and split based on topic changes, but its implementation in LangChain isn’t great.
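For reference, the recursive splitter gets wired up roughly like this; add_start_index=True is what produces the start_index metadata shown above, and the overlap value here is just a guess:

```python
# Split a chapter into ~1,200-character chunks with start offsets in the metadata.
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=100,      # assumed overlap, not a tuned value
    add_start_index=True,
)

chapter = Document(
    page_content=open("data/tolstoy/Chapter_255.txt", encoding="utf-8").read(),
    metadata={"source": "data/tolstoy/Chapter_255.txt"},
)
chunks = splitter.split_documents([chapter])
```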
Chroma’s query method takes a string, vectorizes it, and then does a similarity search against the embeddings in the database. Normally, you look to maximize “cosine similarity” but, with Chroma, you must minimize “cosine distance.” You can also call the OpenAI embedding function on your own, and then search by vector directly. Just be sure to use the same embedding model in all cases.
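Querying the collection directly looks something like this (the collection name and path are placeholders; the key point is to attach the same embedding model used to build the database):

```python
# Query Chroma natively: by query text, or by a precomputed vector.
import os
import chromadb
from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-ada-002",
)

chroma_client = chromadb.PersistentClient(path="db/tolstoy")
collection = chroma_client.get_collection(
    "war_and_peace",                  # hypothetical collection name
    embedding_function=openai_ef,
)

question = "Who set fire to Moscow?"

# Let Chroma embed the query string itself...
results = collection.query(query_texts=[question], n_results=4)

# ...or compute the vector yourself and search by embedding directly.
vector = openai_ef([question])[0]
results = collection.query(query_embeddings=[vector], n_results=4)

print(results["distances"][0])        # smaller cosine distance = closer match
```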
That covers the basics and the single-novel case. Next week, we’ll use RAG to compare two novels.
Sidebar: While writing this, I felt the need to review some points from War and Peace, so I pulled the book off the shelf … and then remembered I had just built a searchable database. Old habits die hard.