Unguided RAG for Text Comparison

Last week, we covered the basic RAG setup and had some fun answering questions from War and Peace. Today, we move on to the two-novel cases:

  • Answer questions using passages from two novels and compare them
  • Compare passages from two novels based on an unseen question
  • Compare passages from two novels based on a similarity search

When I first challenged ChatGPT to “compare common tropes and plot devices between War and Peace and Middlemarch,” I had a specific one in mind. Both novels are set in the early nineteenth century, a time when virtually all wealth was inherited – and the non-wealthy were basically slaves. So, everyone is trying either to marry into a rich family or suck up to a wealthy relative.

This trope is common enough that you could have seen it anywhere: our hero stands to inherit a fortune, but there is some intrigue around the dead uncle’s will. Maybe there’s a different version known only to the servants, etc. I put this question to RAG by gathering matching passages from both novels. The typical prompt template would be something like:

“Please answer this question {question} based on these passages from the novel War and Peace {context1} and these passages from the novel Middlemarch {context2}”

What we are looking for, though, is something a little more autonomous, so I simply removed the query text from the prompt template:

“Please compare these two novels with reference to the context provided … passages from the novel War and Peace {context1} and passages from the novel Middlemarch {context2}”

So, the retriever knows what question is to be answered, but ChatGPT is only asked to “compare” and draw its own conclusions. The results are quite good. I won’t share the whole response, but here is the concluding paragraph:

Overall, while both novels touch on similar themes of inheritance and wills, they do so in distinct ways that reflect the respective societies and characters depicted in each work. Middlemarch delves into the personal and familial implications of inheritance, with a focus on individual motivations and moral dilemmas, while War and Peace explores the broader societal and political consequences of inheritance, with a tone that is more ironic and comedic.

Here we see ChatGPT contrasting the tone of the two samples, and drawing a thematic inference. You can even fish for likely tropes, like “Are there instances of women being unfaithful?”
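For the record, here is a minimal sketch of the unguided call in code. The retrieve() helper and the two database handles are hypothetical stand-ins for whatever retrieval step you use; the prompt is the one quoted above:

    from openai import OpenAI

    client = OpenAI()

    question = "Compare common tropes and plot devices around inherited wealth."
    context1 = retrieve(tolstoy_db, question)   # hypothetical: passages from War and Peace
    context2 = retrieve(eliot_db, question)     # hypothetical: passages from Middlemarch

    # Unguided version: the retriever saw the question, but ChatGPT does not.
    prompt = (f"Please compare these two novels with reference to the context "
              f"provided ... passages from the novel War and Peace {context1} "
              f"and passages from the novel Middlemarch {context2}")

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)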

RAG with Unguided Retrieval

Finally, to make RAG fully autonomous, we must dispense with the guiding hand of the query text, and set it loose using only cosine similarity. This script simply trundles through both databases, using Chroma’s query by vector to find similar passages. There is no need to create any new embeddings.

Once a cross-novel match is found, the script retrieves n_results similar passages on each side, and then passes the unguided “compare” prompt to OpenAI.
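A sketch of that loop, assuming two Chroma collections built with the same embedding model and a cosine index; the collection names and variables are mine:

    n_results = 4
    wp = chroma_client.get_collection("war_and_peace")   # assumed collection names
    mm = chroma_client.get_collection("middlemarch")

    # Walk the War and Peace vectors and query Middlemarch by vector --
    # no new embeddings are created.
    best = (2.0, None)                                   # (cosine distance, vector)
    for vec in wp.get(include=["embeddings"])["embeddings"]:
        hit = mm.query(query_embeddings=[vec], n_results=1)
        if hit["distances"][0][0] < best[0]:
            best = (hit["distances"][0][0], vec)

    # Pull n_results passages from each side of the closest cross-novel
    # pair, then drop them into the unguided "compare" prompt.
    context1 = "\n".join(wp.query(query_embeddings=[best[1]],
                                  n_results=n_results)["documents"][0])
    context2 = "\n".join(mm.query(query_embeddings=[best[1]],
                                  n_results=n_results)["documents"][0])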

Between War and Peace and Middlemarch, it settles on some grim material about “empathy and compassion in a time of hardship.” I didn’t feel like quoting it, so instead I tried another novel.

It took me about ten minutes to download, parse, and add Vanity Fair to the mix. In common with War and Peace, it has its own Napoleonic war (different campaign) and a great opportunity for guided search: “Is one or more of the protagonists killed in action?”

The unguided search script, predictably, finds the war. Tolstoy treats the war from a historical perspective while, for Thackeray, it’s just the backdrop for his drama. Here is an excerpt from the response:

On the other hand, War and Peace takes a broader perspective, examining the larger geopolitical forces at play during the Napoleonic Wars. The novel delves into the complexities of international relations, diplomacy, and military strategy, showing how the actions of monarchs, diplomats, and military leaders shape the course of history. While War and Peace also portrays the impact of war on individuals, it does so within the context of larger historical forces and political developments.

The histogram above shows the distribution of cosine similarity results across all 4.8 million pairs of chunks between Middlemarch and War and Peace for ada-002 and 3-small. These are both 1,536-dimensional embeddings. I also experimented with 3-large.
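Reproducing the histogram is cheap once the embeddings are exported, because OpenAI vectors come back unit-normalized, so a plain dot product is the cosine similarity. A sketch for one model, assuming wp_vecs and mm_vecs are NumPy arrays of the chunk embeddings:

    import numpy as np
    import matplotlib.pyplot as plt

    # e.g. ~2,900 x ~1,650 chunks gives the ~4.8 million pairs
    sims = wp_vecs @ mm_vecs.T
    plt.hist(sims.ravel(), bins=100)
    plt.xlabel("cosine similarity")
    plt.show()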

Initially, I preferred ada-002 because it was easier for the script to find similar passages. After working for a while with both, and seeing the histogram, I think maybe a wider variance is better. It means that nearby passages really are similar, while those that aren’t are farther apart.

For instance, 3-small gives a better answer on Natasha’s engagement because it’s more discriminating. Because I’ve read the novel (twice), I can infer where the search is going wrong. Also, I wrote a little utility function that displays which chunks it has found, with their distance metrics and metadata.
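Here is the shape of that utility, for what it’s worth; results is the dict that Chroma’s query() returns:

    def show_hits(results):
        """Print each retrieved chunk with its distance and metadata."""
        for doc, dist, meta in zip(results["documents"][0],
                                   results["distances"][0],
                                   results["metadatas"][0]):
            print(f"{dist:.3f}  {meta.get('source')}  start={meta.get('start_index')}")
            print(f"    {doc[:120]}...")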

Literary Analysis with RAG

The situation was dire. Napoleon’s army was far from its supply lines, with the harsh Russian winter closing in. Their only hope of shelter lay in Moscow but, arriving there in September 1812, they found the city in flames. Of the half-million men who had set out with Napoleon, only ten thousand would survive the retreat.


The conventional version of this history says that the Russians purposely torched their own city. Writing in War and Peace, however, Count Leo Tolstoy won’t admit such a desperate tactic. He presents a strong case that Moscow, “a town built of wood, where scarcely a day passes without conflagration,” would naturally burn once its people – and the fire department – evacuated.

If you ask ChatGPT about all this, it will give the conventional answer. If you ask, “according to Tolstoy,” it will still give the conventional answer, plus some convincing – and incorrect – ideas about the book. That’s because ChatGPT has never read the book!

It’s adorable, because it bluffs about the reading, exactly like a slacking college student. What was the professor thinking, assigning a thousand-page novel?

Now, thanks to Retrieval Augmented Generation (RAG), you can help ChatGPT answer such questions by priming it with selected passages from the novel. So, I did. I wanted to demonstrate that RAG would support more-advanced text analyses: 

  1. Answer questions using passages from a single novel
  2. Answer questions using passages from two novels and compare them
  3. Compare passages from two novels based on an unseen question
  4. Compare passages from two novels based on a similarity search

This week, we’ll cover the basics using War and Peace, and then I’ll share the two-novel results next week.

War and Peace and RAG

The basic idea behind RAG is simply to query a text database for the priming material, before handing the problem over to a Large Language Model (LLM) like ChatGPT. To be precise, I am using the OpenAI API to work with the GPT-3.5 model.

The only AI involved in the retrieval step is the “embedding model” we use to convert the text and the query string into vectors. Apart from that, it’s a text search. You could, conceivably, use old-school text-search techniques to do the job. I haven’t tried that yet. What I tried were these three embedding models:

  • text-embedding-3-large
  • text-embedding-ada-002
  • text-embedding-3-small

The embedding models convert chunks of text into vectors of varying length, hence the “large” and “small” size designations. If you’re an AI person reading this, you already know about mapping words to vectors. Mapping paragraphs to vectors follows the same principle. Both trace back to Mikolov and colleagues; see Le and Mikolov, Distributed Representations of Sentences and Documents.

If you’re a language person, well, you won’t be surprised to learn that the English lexicon can be arranged into a spatial array so that “cat” and “dog” end up together. And, if a three-D word space is good, a 300-D word space is better!
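If you want to see this for yourself, here’s a toy sketch using the OpenAI embeddings endpoint (the word choices are mine):

    from openai import OpenAI
    import numpy as np

    client = OpenAI()

    def embed(text, model="text-embedding-ada-002"):
        return np.array(client.embeddings.create(model=model, input=text).data[0].embedding)

    cat, dog, tax = embed("cat"), embed("dog"), embed("income tax")
    # OpenAI embeddings are unit length, so the dot product is cosine similarity.
    print(cat @ dog, cat @ tax)   # expect the first number to be larger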

I downloaded some novels from the Gutenberg project, did some basic text parsing on them, and then converted each into its own vector database. I used the Chroma database natively and with the LangChain library. Other popular vector databases include Milvus and Pinecone.

I parsed the novels into chapters first, so that Chroma would pick up the chapter headings as metadata. If you use Project Gutenberg, be sure to stop parsing at THE END because there are about 600 lines of legal stuff after that. 
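A rough sketch of that parsing, under some assumptions about the Gutenberg file layout (the chapter-heading regex is illustrative and will need tuning per novel):

    import re
    from pathlib import Path

    raw = Path("war_and_peace.txt").read_text(encoding="utf-8")
    raw = raw.split("THE END")[0]          # drop the ~600 lines of legal stuff
    chapters = re.split(r"\nCHAPTER [IVXLC]+\n", raw)

    Path("data/tolstoy").mkdir(parents=True, exist_ok=True)
    for i, text in enumerate(chapters[1:], start=1):
        Path(f"data/tolstoy/Chapter_{i}.txt").write_text(text, encoding="utf-8")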

Here, for example, is the RAG-assisted answer on the burning of Moscow:

Based on the context provided, it is implied that the Russians did not intentionally set fire to Moscow. The fire in Moscow was attributed to a combination of factors such as the presence of foreign troops, abandoned wooden buildings, soldiers’ carelessness with pipes and campfires, and the general chaos and looting that ensued. The burning of Moscow was seen as a result of the circumstances rather than a deliberate act of arson by the Russians.

The main thing to tune here is the size of the text chunks in the database. You’ll be passing text chunks to the LLM via an API, and there’s a size limit. GPT-3.5 has a context window of roughly 3,000 words (4,096 tokens).

The typical RAG use case is simply to find one good chunk with the answer. For the Moscow fire question, that is chunk #1440, which happens to be only 0.12 “cosine distance” away from the query text:

{'source': 'data\\tolstoy\\Chapter_255.txt', 'start_index': 10994}

But what if you’re asking “who is the best Russian general?” or “how many siblings does Natasha have?” Clues will be scattered throughout the text, and you will need more, smaller chunks. 

Here, for example, is the multi-chunk answer to a question about Natasha’s engagement:

Prince Andrew doesn’t marry Natasha immediately because he decides to conform to his father’s wish to propose and postpone the wedding for a year. This decision is made in the hopes that either his own or Natasha’s feelings may change within that time period. Despite Natasha’s distress and desire to marry sooner, Prince Andrew feels bound by his father’s wish and the decision to delay the wedding.

I would say that smaller is better because the retriever can always fetch multiple chunks from the same neighborhood – as long as they’re at least big enough to be picked up by the vector search. After some experimentation, I settled on a chunk size of 1,200 characters, with the ada-002 embedding, which has 1,536 dimensions.

All operations on the API are sized (and priced) in “tokens,” so it’s a good idea to employ the tiktoken counter, and keep an eye on your token limits. My 1,200-character chunks run around 220 tokens.
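Counting is a one-liner with tiktoken; cl100k_base is the encoding used by GPT-3.5 and these embedding models, and chunks here stands in for your list of chunk strings:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    counts = [len(enc.encode(c)) for c in chunks]   # `chunks`: your chunk strings
    print(max(counts), sum(counts))                 # watch per-request token limits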

LangChain’s recursive text splitter does its best to honor sentences and paragraphs, but not semantics. It still feels like taking a favorite novel and chopping it up in a Cuisinart. Greg Kamradt has invented a semantic text splitter, which can detect and split based on topic changes, but its implementation in LangChain isn’t great.
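A sketch of the splitter setup; the chunk_overlap value is my guess, and add_start_index is what produces the start_index metadata shown earlier:

    from langchain_community.document_loaders import DirectoryLoader, TextLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    docs = DirectoryLoader("data/tolstoy", glob="*.txt", loader_cls=TextLoader).load()
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1200, chunk_overlap=100, add_start_index=True)
    chunks = splitter.split_documents(docs)         # each chunk keeps source metadata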

Chroma’s query method takes a string, vectorizes it, and then does a similarity search against the embeddings in the database. Normally, you look to maximize “cosine similarity” but, with Chroma, you must minimize “cosine distance” (one minus the similarity). You can also call the OpenAI embedding function on your own, and then search by vector directly. Just be sure to use the same embedding model in all cases.
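In code, the two styles look like this, assuming the collection was created with a cosine index (metadata={"hnsw:space": "cosine"}) and an embedding function; the query string is illustrative:

    # Query by string: Chroma embeds the query for you.
    results = collection.query(query_texts=["Who set fire to Moscow?"], n_results=5)

    # Or embed the query yourself and search by vector directly.
    vec = openai_client.embeddings.create(
        model="text-embedding-ada-002",
        input="Who set fire to Moscow?").data[0].embedding
    results = collection.query(query_embeddings=[vec], n_results=5)
    print(results["distances"][0], results["metadatas"][0])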

That covers the basics and the single-novel case. Next week, we’ll use RAG to compare two novels.

Sidebar: While writing this, I felt the need to review some points from War and Peace, so I pulled the book off the shelf … and then remembered I had just built a searchable database. Old habits die hard.

Weighted Factors for Product Selection

Every so often, I will write up a standard quantitative procedure, usually because someone has asked me about it.  For instance, see Pay Plan Math, What Is Accuracy, and Know Your Time Series.  Today, it’s weighted-factor analysis for product selection.  At a high level, this procedure is:

  1. Gather your requirements and selection criteria
  2. Quantify how important each criterion is
  3. Grade the vendor responses
  4. Compute numerical scores

Gather Requirements and Criteria

First, through interviews and maybe some direct observation, discover why we need the product.  In my business, this is generally a software product, but it could be anything.  Next, determine the requirements and selection criteria.

Selection criteria are the features we will evaluate to decide which product is the best fit, whereas requirements are features the product must have to even be considered.  Don’t make the mistake of thinking requirements are just extra-special criteria.

If you’re looking to buy a car, and gas mileage is on the list, then a hybrid will score well on that criterion.  If you’re only looking to buy a hybrid, then that’s the category, and you’re not looking at gas cars at all.

The purpose of requirements is to define the category of product we’re looking for. If you’re writing an RFP, the criteria are what the vendors respond to, and the requirements determine which vendors get the RFP.  When in doubt, send them the RFP anyway and let the vendor figure it out.

For example, if I am selecting cybersecurity software, I might want endpoint protection (EPP), endpoint detection and response (EDR), managed detection and response (MDR), or even a security operations center (SOC).  These all address the same problem, but they’re not the same product.

Quantify Importance of the Criteria

In the chart, I show criteria scored on a scale of 1 to 5, which is typical. Then, for the sake of example, I norm these to a total score of 100.  This is probably overkill, but it’s fun to have 100 as a baseline.  Later, we’ll do the same with the final score.  Clients love simple numbers.

One way to explore the criteria is to do a forced ranking from most important to least important.  This is not amenable to quantitative methods, but it’s a good way to get started.  Spend an hour in front of the whiteboard while the client staff fight it out over the ranking, then let them each do the 1 to 5, and average their responses.

Another way is to give each participant 100 points that he can allocate as desired across the criteria.  This is the most accurate, in terms of understanding tradeoffs, and it makes the math easy.

I like to keep the cost analysis separate from the features.  It is possible to turn the price proposal into another row among the criteria, but no one really thinks this way.  What you’re shooting for is, “this one scored 84 out of a hundred, and it’s $100,000 more than the one that scored 74,” with traceability back to the features that account for the difference.

Grade the Vendor Responses

Maybe you’ve sent an RFP and are now grading the proposals, or maybe you’re doing your own research. Using an RFP is handy because you can include the criteria and let the vendors tell you how they propose to meet them. In either case, you (and the committee) are responsible for assigning a number to indicate how well the product meets each criterion.

Here again, the 1 to 5 scale is popular and easy to use.  Obviously, grades supported by numbers are best.  For gas mileage, you can assign 1, 2, 3, 4, and 5 to specific ranges of MPG.  Something like “vendor support” can be tied to a service-level agreement in hours or minutes.
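For example, a grade tied to MPG ranges might look like this (the cutoffs are illustrative):

    def mpg_grade(mpg):
        """Map miles-per-gallon to a 1-5 grade using illustrative cutoffs."""
        for cutoff, grade in [(50, 5), (40, 4), (30, 3), (20, 2)]:
            if mpg >= cutoff:
                return grade
        return 1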

Compute the Final Score

This is called weighted-factor analysis because each product is scored according to its criteria grades, and the criteria have different weights.  It’s just like computing a weighted average.  Since we’ve normed the weights to 100 and we’re using a 5-point grading scale, the maximum possible total is 500, so we divide the totals by five to produce a score out of 100.  You can present this as a percentage if you want.

In our carefully contrived example, vendor #3 comes out on top even though they had the lowest raw score, because they scored well on the criteria that mattered most.
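Here is the arithmetic in miniature, with numbers contrived (like the example’s) so that the lowest raw scorer wins:

    # Criterion weights, normed to sum to 100.
    weights = {"mileage": 60, "comfort": 30, "styling": 10}

    # Grades on the 1-5 scale; raw totals are 13, 12, and 10.
    grades = {
        "vendor1": {"mileage": 3, "comfort": 5, "styling": 5},
        "vendor2": {"mileage": 4, "comfort": 4, "styling": 4},
        "vendor3": {"mileage": 5, "comfort": 3, "styling": 2},
    }

    for vendor, g in grades.items():
        score = sum(weights[c] * g[c] for c in weights) / 5   # out of 100
        print(vendor, score)   # vendor3 wins with 82, despite the lowest raw total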

When data scientists say that “our precision exceeds our accuracy,” this is what they mean. Do not take this fundamentally subjective numerical score out to two decimal places.  The point of this procedure is not so much to generate a number, but to make the variables explicit.

The idea is that the sum of many small decisions will be more accurate than one big one, particularly if there is consensus among the participants. Everyone on the committee should be able to say why the chosen product scored ten points better than the runner-up.

Also, to be a little bit pragmatic: now everyone has their fingerprints on the decision.  No one can complain that they weren’t consulted, or question how the decision was made.

Funny aside:  One of my first consulting projects was the selection of a networking vendor for Ford Credit. We did the full procedure: interviews, requirements, criteria, an RFP, a selection committee, bidder conferences, sealed bids, etc. Digital Equipment (DEC) won. Remember them? And then some big shot from the Glass House swooped in and gave the contract to IBM. What about our fancy RFP project? Well, it was “defective” because it failed to produce IBM as the winner. There was a saying in those days, “no one gets fired for buying IBM.” It was seen as the safe choice – and the only choice for risk-averse executives.