Literary Analysis with RAG

The situation was dire. Napoleon’s army was far from its supply lines, with the harsh Russian winter closing in. Their only hope of shelter lay in the capital city of Moscow, but, arriving there in September 1812, they found the city in flames. Of the half-million men who had set out with Napoleon, only ten thousand would survive the retreat.

The conventional version of this history says that the Russians purposely torched their own city. Writing in War and Peace, however, Count Leo Tolstoy won’t admit such a desperate tactic. He presents a strong case that Moscow, “a town built of wood, where scarcely a day passes without conflagration” would naturally burn once its people – and the fire department – evacuated.

If you ask ChatGPT about all this, it will give the conventional answer. If you ask, “according to Tolstoy,” it will still give the conventional answer, plus some convincing – and incorrect – ideas about the book. That’s because ChatGPT has never read the book!

It’s adorable, because it bluffs about the reading, exactly like a slacking college student. What was the professor thinking, assigning a thousand-page novel?

Now, thanks to Retrieval Augmented Generation (RAG), you can help ChatGPT answer such questions by priming it with selected passages from the novel. So, I did. I wanted to demonstrate that RAG would support more-advanced text analyses: 

  1. Answer questions using passages from a single novel
  2. Answer questions using passages from two novels and compare them
  3. Compare passages from two novels based on an unseen question
  4. Compare passages from two novels based on a similarity search

This week, we’ll cover the basics using War and Peace, and then I’ll share the two-novel results next week.

War and Peace and RAG

The basic idea behind RAG is simply to query a text database for the priming material, before handing the problem over to a Large Language Model (LLM) like ChatGPT. To be precise, I am using the OpenAI API to work with the GPT 3.5 model.
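To make the flow concrete, here is a minimal sketch of the step that sits between retrieval and generation: stitching the retrieved passages into a prompt. The function name build_prompt, the instruction wording, and the placeholder passages are all hypothetical, and the actual call to the OpenAI API is left out.

```python
def build_prompt(question: str, passages: list[str]) -> str:
    """Combine retrieved passages and the question into one prompt string."""
    # Number the passages so they can be told apart in the prompt.
    context = "\n\n".join(
        f"Passage {i}:\n{text}" for i, text in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical placeholders standing in for chunks returned by the vector search.
retrieved = ["<text of the closest chunk>", "<text of the next closest chunk>"]
prompt = build_prompt("According to Tolstoy, who burned Moscow?", retrieved)
print(prompt)
```

The resulting string is what gets sent to the LLM as the user message, so everything the model "knows" about the book arrives through the passages.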

The only AI involved in the retrieval step is that we use an “embedding model” to convert the text and the query string into vectors. Apart from that, it’s a text search. You could, conceivably, use old-school text search techniques to do the job. I haven’t tried that, yet. What I tried were these three embedding models:

  • text-embedding-3-large
  • text-embedding-ada-002
  • text-embedding-3-small

The embedding models convert chunks of text into vectors of varying length, hence the “large” and “small” size designations. If you’re an AI person reading this, you already know about mapping words to vectors. Mapping paragraphs to vectors follows the same principle. Both are due to Mikolov, et al. See Distributed Representations of Sentences and Documents.

If you’re a language person, well, you won’t be surprised to learn that the English lexicon can be arranged into a spatial array so that “cat” and “dog” end up together. And, if a three-D word space is good, a 300-D word space is better!

I downloaded some novels from Project Gutenberg, did some basic text parsing on them, and then converted each into its own vector database. I used the Chroma database both natively and through the LangChain library. Other popular vector databases include Milvus and Pinecone.

I parsed the novels into chapters first, so that Chroma would pick up the chapter headings as metadata. If you use Project Gutenberg, be sure to stop parsing at THE END because there are about 600 lines of legal stuff after that. 
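Here is a rough sketch of that parsing step, not the exact script I ran. It assumes chapter headings look like "CHAPTER IV" on a line of their own and that the novel closes with a line reading THE END; the sample text and the name split_chapters are made up for illustration.

```python
import re

# Hypothetical sample standing in for a Project Gutenberg plain-text file.
SAMPLE = """CHAPTER I

Text of the first chapter goes here.

CHAPTER II

Text of the second chapter goes here.

THE END

License and legal text that should not reach the database.
"""

# Assumed heading format: the word CHAPTER followed by a Roman numeral.
CHAPTER_HEADING = re.compile(r"^CHAPTER [IVXLCDM]+\s*$", re.MULTILINE)


def split_chapters(raw: str) -> list[dict]:
    """Return one dict per chapter, discarding everything after THE END."""
    # Cut the text at the first line that consists only of "THE END".
    end = re.search(r"^THE END\s*$", raw, re.MULTILINE)
    body = raw[: end.start()] if end else raw

    headings = list(CHAPTER_HEADING.finditer(body))
    chapters = []
    for i, match in enumerate(headings):
        # A chapter runs from its heading to the next heading (or the end).
        stop = headings[i + 1].start() if i + 1 < len(headings) else len(body)
        chapters.append({
            "heading": match.group().strip(),
            "text": body[match.end():stop].strip(),
        })
    return chapters


chapters = split_chapters(SAMPLE)
for chapter in chapters:
    print(chapter["heading"], "|", chapter["text"])
```

Each dict can then be written to its own file, or passed to the database with the heading as metadata.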

Based on the context provided, it is implied that the Russians did not intentionally set fire to Moscow. The fire in Moscow was attributed to a combination of factors such as the presence of foreign troops, abandoned wooden buildings, soldiers’ carelessness with pipes and campfires, and the general chaos and looting that ensued. The burning of Moscow was seen as a result of the circumstances rather than a deliberate act of arson by the Russians.

The main thing to tune, here, is the size of text chunks in the database. You’ll be passing text chunks to the LLM via an API, and there’s a size limit. GPT 3.5 has a window of roughly 3,000 words.

The typical RAG use case is simply to find one good chunk with the answer. For the Moscow fire question, that is chunk #1440, which happens to be only 0.12 “cosine distance” away from the query text:

{'source': 'data\\tolstoy\\Chapter_255.txt', 'start_index': 10994}

But what if you’re asking “who is the best Russian general?” or “how many siblings does Natasha have?” Clues will be scattered throughout the text, and you will need more, smaller chunks. 

Prince Andrew doesn’t marry Natasha immediately because he decides to conform to his father’s wish to propose and postpone the wedding for a year. This decision is made in the hopes that either his own or Natasha’s feelings may change within that time period. Despite Natasha’s distress and desire to marry sooner, Prince Andrew feels bound by his father’s wish and the decision to delay the wedding.

I would say that smaller is better because the retriever can always fetch multiple chunks from the same neighborhood – as long as they’re at least big enough to be picked up by the vector search. After some experimentation, I settled on a chunk size of 1,200 characters, with the ada-002 embedding, which has 1,536 dimensions.

All operations on the API are sized (and priced) in “tokens,” so it’s a good idea to employ the tiktoken counter, and keep an eye on your token limits. My 1,200-character chunks run around 220 tokens.
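As a back-of-the-envelope check, the sketch below works out how many chunks fit in one request. The 220 tokens per chunk comes from my measurement above; the 4,096-token window and the 1,000 tokens set aside for the question, instructions, and answer are assumptions you should replace with your own model's limits.

```python
def max_chunks(context_tokens: int, tokens_per_chunk: int, reserved_tokens: int) -> int:
    """How many whole chunks fit once room is reserved for question and answer."""
    available = context_tokens - reserved_tokens
    return max(available // tokens_per_chunk, 0)


CONTEXT_TOKENS = 4096    # assumed model window
TOKENS_PER_CHUNK = 220   # measured for 1,200-character chunks
RESERVED_TOKENS = 1000   # assumed allowance for prompt wording and the reply

n_chunks = max_chunks(CONTEXT_TOKENS, TOKENS_PER_CHUNK, RESERVED_TOKENS)
chars_per_token = 1200 / TOKENS_PER_CHUNK
print(n_chunks, round(chars_per_token, 2))
```

In practice you would count tokens on the assembled prompt with tiktoken rather than rely on an average, since chunk lengths vary.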

LangChain’s recursive text splitter does its best to honor sentences and paragraphs, but not semantics. It still feels like taking a favorite novel and chopping it up in a Cuisinart. Greg Kamradt has invented a semantic text splitter, which can detect and split based on topic changes, but its implementation in LangChain isn’t great.
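To show the general idea of splitting on natural boundaries, here is a simplified stand-in written from scratch. It is not LangChain's implementation: it just looks for the last paragraph break, then sentence end, then space inside each window, and records a start_index like the one in the metadata above. The function name and the sample text are hypothetical.

```python
def split_text(text: str, chunk_size: int = 1200) -> list[dict]:
    """Cut text into chunks of at most chunk_size characters,
    preferring paragraph breaks, then sentence ends, then spaces."""
    separators = ["\n\n", ". ", " "]
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            window = text[start:end]
            for sep in separators:
                cut = window.rfind(sep)
                if cut > 0:
                    # Keep the separator with the chunk it closes.
                    end = start + cut + len(sep)
                    break
        piece = text[start:end]
        if piece.strip():
            # Point start_index at the first non-blank character.
            lead = len(piece) - len(piece.lstrip())
            chunks.append({"text": piece.strip(), "start_index": start + lead})
        start = end
    return chunks


sample = (
    "First paragraph, first sentence. First paragraph, second sentence.\n\n"
    "Second paragraph, first sentence. Second paragraph, second sentence."
)
chunks = split_text(sample, chunk_size=80)
for chunk in chunks:
    print(chunk["start_index"], repr(chunk["text"]))
```

With an 80-character limit the sample breaks at the paragraph boundary rather than mid-sentence, which is the behavior you want from any splitter.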

Chroma’s query method takes a string, vectorizes it, and then does a similarity search against the embeddings in the database. Normally, you look to maximize “cosine similarity” but, with Chroma, you must minimize “cosine distance.” You can also call the OpenAI embedding function on your own, and then search by vector directly. Just be sure to use the same embedding model in all cases.
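For readers who want to see the arithmetic, here is a small sketch of cosine distance, taken as one minus cosine similarity, and a brute-force nearest-neighbor search over a toy store. The three-dimensional vectors and chunk names are invented for illustration; real embeddings have hundreds or thousands of dimensions, and a vector database uses an index instead of sorting everything.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 minus cosine similarity: 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query_vec: list[float], stored: dict, k: int = 2) -> list[tuple]:
    """Return the k stored ids with the smallest cosine distance to the query."""
    ranked = sorted(stored.items(), key=lambda item: cosine_distance(query_vec, item[1]))
    return [(chunk_id, cosine_distance(query_vec, vec)) for chunk_id, vec in ranked[:k]]


# Hypothetical toy "embeddings" for three chunks.
toy_store = {
    "chunk_fire": [0.9, 0.1, 0.0],
    "chunk_ball": [0.0, 0.2, 0.9],
    "chunk_retreat": [0.6, 0.6, 0.1],
}
query = [1.0, 0.2, 0.0]

results = nearest(query, toy_store)
for chunk_id, distance in results:
    print(chunk_id, round(distance, 3))
```

Sorting by smallest distance gives the same ranking as sorting by largest similarity, so nothing is lost by working in distances.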

That covers the basics and the single-novel case. Next week, we’ll use RAG to compare two novels.

Sidebar: While writing this, I felt the need to review some points from War and Peace, so I pulled the book off the shelf … and then remembered I had just built a searchable database. Old habits die hard.

Choosing the Cutoff Value

If you work with binary classifiers, then you are familiar with the problem of choosing a cutoff value.  While the classifier will predict positives and negatives, under the covers it’s a probability score with an implicit 0.50 threshold.  Since most real-life data is imbalanced, 0.50 will not be the right value.

Finding the right cutoff value, and choosing which accuracy metric to optimize, can be a hassle, so I developed a tool to help me deal with it. In this article, I’ll show how I use the tool for a typical problem involving credit approval.

When training a binary classifier, we generally look at the “receiver operating characteristic” or ROC curve.  This is a plot of the true positive rate versus the false positive rate for all choices of the cutoff value. A nice, plump ROC curve means that the model is fit for purpose, but you still have to choose the cutoff value.
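If you want to see how the curve is built, here is a plain-Python sketch that sweeps the cutoff over every distinct score and integrates with the trapezoid rule. The labels and scores are made-up toy values, and in real work you would more likely reach for a library routine.

```python
def roc_points(y_true: list[int], scores: list[float]) -> list[tuple]:
    """(false positive rate, true positive rate) for each cutoff, highest first."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # a cutoff above every score flags nothing
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points: list[tuple]) -> float:
    """Trapezoid rule over consecutive ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical toy data: 3 positives, 7 negatives.
y_true = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.90, 0.80, 0.70, 0.60, 0.40, 0.35, 0.30, 0.20, 0.10, 0.05]

points = roc_points(y_true, scores)
auc = area_under_curve(points)
print(round(auc, 3))
```

Each point on the curve corresponds to one cutoff value, which is why moving the cutoff amounts to sliding along the curve.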

In this example, we have an ROC with “area under the curve” of 0.76.  This is a good score, but the default 0.50 threshold happens to lie where the curve runs into the lower left corner.  Using the slider on my ROC tool, I can run this point up and down the curve, maximizing whichever accuracy metric I choose. 

To do this, I have the classifier write a list of its predicted probability scores into a file, along with the actuals (y_pred, y_val) and then I read that file into the tool.  If you’re using Scikit, you’ll want predict_proba for this.

In this case, the best balanced accuracy is achieved when the cutoff value is 0.11.  We need balanced accuracy because our exploratory data analysis showed that the data is nine to one imbalanced in favor of negative cases. 
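The search itself is simple enough to sketch. The code below is an illustration, not the tool's source: it scores every cutoff from 0.01 to 0.99 and keeps the one with the highest balanced accuracy. The twenty toy loans are hypothetical, chosen only to mimic a nine-to-one imbalance.

```python
def balanced_accuracy(y_true: list[int], scores: list[float], cutoff: float) -> float:
    """Mean of the true positive rate and the true negative rate."""
    tp = fn = tn = fp = 0
    for actual, score in zip(y_true, scores):
        predicted = 1 if score >= cutoff else 0
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return (tp / (tp + fn) + tn / (tn + fp)) / 2


# Hypothetical toy data, imbalanced nine to one: 2 defaults among 20 loans.
y_val = [1, 1] + [0] * 18
y_pred = [0.30, 0.12,
          0.02, 0.03, 0.03, 0.04, 0.05, 0.05, 0.06, 0.06, 0.07,
          0.07, 0.08, 0.08, 0.09, 0.09, 0.10, 0.15, 0.20, 0.40]

candidates = [i / 100 for i in range(1, 100)]
best = max(candidates, key=lambda c: balanced_accuracy(y_val, y_pred, c))
print(best, round(balanced_accuracy(y_val, y_pred, best), 3))
print(round(balanced_accuracy(y_val, y_pred, 0.50), 3))
```

On data this lopsided, the default 0.50 cutoff catches no defaults at all, so its balanced accuracy is no better than a coin flip.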

In the context of our credit problem, negatives are people who don’t default on their loan.  Our classifier could present 90% naïve “accuracy” simply by calling every case a negative.  We would confidently approve loans for everyone, and then encounter a 10% default rate. 

The tool displays other popular accuracy metrics like precision, recall, and F1 score.  By the way, notice that the true positive rate (TPR) and the false negative rate (FNR) add to unity because these are all the positive cases. The same goes for negatives.  The TPR is also known as “sensitivity.”

For problems like this, balanced accuracy is usually sufficient, but we can take it a step further.  We can ask what is the gain or loss from each decision.  The tool accepts these figures, with the false ones marked as red.

For example, let’s say that a false negative costs us $10,000 in collections and recovery charges, while a true negative means we earn $7,500 in interest.  True positives and false positives will both count as zero, because we declined them.

We can see that our maximum expected value of $2,170 is achieved when the cutoff value is reduced to 0.08.  This is below the optimum for balanced accuracy.  It is accuracy weighted more heavily to avoid false negatives.
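Here is a sketch of the same kind of search using dollar payoffs instead of an accuracy metric. The payoff figures are the ones from the example above; the toy loans and the per-applicant averaging are my assumptions for illustration, so the numbers it prints will not match the tool's. A "positive" here means a predicted default, which we decline.

```python
# Payoff per decision, from the example: declined loans earn nothing.
PAYOFF = {"tp": 0.0, "fp": 0.0, "tn": 7500.0, "fn": -10000.0}


def expected_value(y_true, scores, cutoff, payoff=PAYOFF):
    """Average payoff per applicant when scores at or above cutoff are declined."""
    total = 0.0
    for actual, score in zip(y_true, scores):
        declined = score >= cutoff
        if declined:
            total += payoff["tp"] if actual == 1 else payoff["fp"]
        else:
            total += payoff["fn"] if actual == 1 else payoff["tn"]
    return total / len(y_true)


# Hypothetical toy data: 2 defaults among 20 loans, with low raw scores.
y_val = [1, 1] + [0] * 18
y_pred = [0.30, 0.09,
          0.01, 0.01, 0.02, 0.02, 0.03, 0.03, 0.03, 0.04, 0.04,
          0.04, 0.05, 0.05, 0.05, 0.06, 0.06, 0.07, 0.08, 0.20]

candidates = [i / 100 for i in range(1, 100)]
best = max(candidates, key=lambda c: expected_value(y_val, y_pred, c))
print(best, expected_value(y_val, y_pred, best))
```

Where the optimum lands depends entirely on the payoffs and on how the scores are distributed, so it is worth rerunning the sweep whenever either one changes.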

I hope you enjoy using the tool.  Remember, it’s best practice to do all this with your training or validation dataset, and then commit to a cutoff value for your final test.

Predicting Loan Defaults with AI

I have some time on my hands, so I decided to experiment with some of the new AI assisted code generators.  I wanted something relevant to F&I, and I found this exercise on Coursera.  The “F” side of F&I gets all the attention, but there is plenty of opportunity for AI to rate insurance risk and mechanical breakdown risk.

Note that we are using AI to generate an AI model.  For the Coursera exercise, linear regression is sufficient, but I chose to use neural networks here because they are undeniably machine learning.  See my earlier post on this, “What Is Real AI?”

Today, we’ll look at three popular AI assistants: ChatGPT, GPT-Engineer, and GitHub Copilot.  These are all based on the famous OpenAI large language model, just packaged a little differently.

To start, I worked the problem myself, running several different models.  The goal is to predict the probability of a given loan going bad, based on seventeen variables including credit score, term, and debt to income.  Once I was satisfied with my solution, I turned the problem over to my robot friend, ChatGPT.

ChatGPT

Using the chat window requires you to cut and paste code over to your IDE, so it doesn’t really feel like a code generator.  On the other hand, it’s conversational, so it can tell you its assumptions and you can give clarifying instructions.  Here is the prompt I used:

We need to write a Python script to predict loan defaults using a neural network model, and based on some input data. To start, read the input data from a CSV file and create a data frame. Some of the columns have numeric data and some have categorical data. The last column is the dependent variable. It has values of either zero or one. Next, prepare the data for use in a neural network by running it through an appropriate pipeline. Split off twenty percent of the rows, randomly, for use as a test set. Finally, train the neural network using the remaining eighty percent of the rows. We want to know the probability of a loan default. Test the neural network by comparing its predictions for the test set, and report your results by plotting the ROC curve.

Ordinarily, this would be more interactive, but I wanted to write all the requirements into a single prompt for demonstration purposes, and because that’s the approach preferred by GPT-Engineer.

ChatGPT solved the problem nicely, matching my best score of 0.76.  It plotted its own ROC chart, below, instead of using the Scikit method, even though it did use Scikit for the pipelines.

ChatGPT did several things differently from me, mostly simplifications, and it made some mistakes.  For example, it forgot that its “create neural network” function didn’t include compilation.  The call to compile() can be in or out of the function, so it’s easy to misplace.

The creepiest thing ever?  ChatGPT chose random_state = 42, which is completely arbitrary – and the same number I use!  Like, how did it know?

GPT-Engineer

GPT-Engineer uses the OpenAI API, for which you have to buy tokens.  I think of it as a wrapper for the ChatGPT dialogue above.  GPT-Engineer also runs outside of the IDE.  You invoke it from the command line and pass a file containing the prompt.

The special charm of GPT-Engineer is that it can ask clarifying questions.  Here are the questions it had about my prompt:

  1. The specific format and structure of the input CSV file.
  2. The exact columns that contain numeric and categorical data.
  3. The type of neural network model to be used.
  4. The specific pipeline to be used for preparing the data.
  5. The method for splitting the data into training and test sets.
  6. The method for training the neural network.
  7. The method for testing the neural network and comparing its predictions.
  8. The method for reporting results and plotting the ROC curve.

Some of these questions are reasonable and some are not.  ChatGPT, on its own, figured out how to handle the file, the pipeline, and the structure of the neural net.  In fact, it was kind of fun comparing notes with ChatGPT like, “oh yeah, well, I’m using batch norms.”

I suspect that “code and train a neural net” is asking too much.  GPT-Engineer crushes routine tasks, as my man Arjan demonstrates here.

GPT Engineer is another nail in the coffin of software developers. In this video, I’ll show you how it works. The tool is crazy powerful. Just specify what you want it to build, and then, well, it just builds it.

GitHub Copilot

Microsoft did a nice job of integrating Copilot into Visual Studio and several other IDEs.  You install an extension, and subscribe to the service for $10 a month.  Microsoft has a big ($13 billion) investment in OpenAI, and they own GitHub.  This means an LLM trained not only on human language, but on a giant repository of source code.

Microsoft advertises Copilot as “pair programming,” like having a buddy look over your shoulder, and it works the same way autocomplete works for text.  It can also define a function based on an inline comment like, “read a file from disk and create a dataframe.”

Copilot didn’t really suit me.  I wanted to see how an AI would code differently from me, as ChatGPT had, but Copilot kept serving up my own code from GitHub.  Also, it kept wanting to define functions where it should have just written the line, like pd.read_csv("test.csv").

Conclusion

As I said at the top, part of the fun is having an AI program write an AI program – although, in this case, any decent predictive model would suffice.  ChatGPT is, itself, driven by a large language model (LLM).  So, here we have a large, general, neural network helping me to produce a small, tailored one.

What does all this mean for the industry?  Well, for one thing, it is starting to look bad for software developers.  Arjan suggests it will take out the junior engineers first, but I’m not so sure.

Researchers have long feared that the resources required to build and train foundation models would mean a Big Tech oligopoly.  Technologically, there have been good results in the other direction, with small open-source models.  Commercially, however, this is a race between Microsoft and Google.

Microsoft is also introducing other Copilots, and researchers are hard at work on natural language prompting for all computer tasks.  So, the same way I can prompt GPT-Engineer to write some code, you’ll be able to have an AI do whatever you were planning to do on Excel or Tableau.

Biweekly Payment Magic

A while back, I did some foundational work for a leading biweekly payment service.  That is, the math part, which I will reprise here.  Biweekly works best in a climate of high interest rates and, unfortunately, soon after this project, the Federal Reserve dropped its reference rate to zero.  The Fed’s rate was not persistently above 2% again until recently, and biweekly is once again looking good.

The featured chart shows a scenario first constructed by my erstwhile partner Phil Battista.  I call it the “magic trick” because the customer in this scenario has financed an extra $3,250 with no change to the term, APR, or payment.  Before presenting the trick, here are some basics about biweekly.

Biweekly Payment Plan Basics

In Canada, the banks offer loans with native biweekly payment schedules, and dealers feature them in their advertising.  Here in the States, you have to use a service.  The service collects payments biweekly via direct debit and manages the lender to accelerate the amortization.

Here is an example.  According to recent Cox data, the average price of a new car is now above $49,500 with an APR of 7.0% and a 72-month term.  By the way, this survey does not include luxury brands, and some people are financing up to 84 months.

Below, I have modeled this “average” loan showing monthly versus biweekly payment schedules.  This is showing the amortization only, omitting whatever fees the biweekly service may charge.  You can see that the loan is paid off seven months early.
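For anyone who wants to reproduce the comparison, here is a minimal amortization sketch using the figures quoted above. It assumes simple periodic interest (the APR divided by 12 or by 26), a biweekly payment of exactly half the monthly payment, and no service fees; the function names are my own for this illustration.

```python
def payment(principal: float, annual_rate: float, periods_per_year: int, n_periods: int) -> float:
    """Level payment that retires the loan in n_periods."""
    r = annual_rate / periods_per_year
    return principal * r / (1 - (1 + r) ** -n_periods)


def periods_to_payoff(principal: float, annual_rate: float, periods_per_year: int, pmt: float) -> int:
    """Count payments until the balance is gone (the last one may be partial)."""
    r = annual_rate / periods_per_year
    if pmt <= principal * r:
        raise ValueError("payment does not cover the interest")
    balance, n = principal, 0
    while balance > 0.005:
        balance = balance * (1 + r) - pmt
        n += 1
    return n


monthly = payment(49_500, 0.07, 12, 72)
biweekly_count = periods_to_payoff(49_500, 0.07, 26, monthly / 2)
months_saved = 72 - biweekly_count * 12 / 26

print(round(monthly, 2), biweekly_count, round(months_saved, 1))
```

Under these assumptions the payoff comes between six and seven months ahead of schedule; the exact figure shifts a little with the calendar convention used to convert biweekly periods into months.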

If you’re using longer terms to fit customers into payments, biweekly will shorten the trade cycle a bit.  Also, credit-challenged buyers may be better off with direct debit synched to their paychecks.

Nostalgia Alert: coding for the U.S. Equity project was originally done in C# by my son, Paul, who would have been around fourteen at the time.  We were making an OO model to include all loan and lease instruments as subclasses.  Coding for this article was done by me, in Python, which is 10X easier.

The Magic Trick

If you compare the two charts above, you can see graphically how Phil’s trick works.  Instead of starting your biweekly loan at the same amount and having it end earlier, you start it higher and aim to end on the same date.

The trick works because half the monthly payment is higher than a native biweekly payment would be – by $33 in this example.  The customer makes the equivalent of thirteen monthly payments per year, and the bank loses a little bit of interest income.  Here are the steps:

  1. Increase the amount financed, which will increase the monthly payment.
  2. Increase the term until the monthly payment comes back down to where it was.
  3. Use the biweekly program to bring the term back down to where it was.

Congratulations, you can now finance more product with the same monthly payment.  I covered the concept for menu systems in Six Month Term Bump.  To do goal seeking, as I’ve shown here, you will need some Python (or a precocious teenager).
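Here is a minimal goal-seeking sketch, assuming the average-loan figures from earlier ($49,500 at 7.0% for 72 months), simple periodic interest, and no service fees. It first confirms the gap between half the monthly payment and a native biweekly payment, then uses bisection to find the largest amount that the same biweekly payment retires in 156 periods. The inputs behind the featured chart are not given in this article, so there is no reason to expect the result to match the $3,250 shown there.

```python
def payment(principal: float, annual_rate: float, periods_per_year: int, n_periods: int) -> float:
    """Level payment that retires the loan in n_periods."""
    r = annual_rate / periods_per_year
    return principal * r / (1 - (1 + r) ** -n_periods)


def goal_seek_amount(target_payment: float, annual_rate: float,
                     periods_per_year: int, n_periods: int,
                     lo: float = 0.0, hi: float = 1_000_000.0) -> float:
    """Bisection: find the principal whose level payment equals target_payment."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if payment(mid, annual_rate, periods_per_year, n_periods) > target_payment:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


APR, AMOUNT, MONTHS = 0.07, 49_500, 72
BIWEEKLY_PERIODS = MONTHS * 26 // 12   # 156 biweekly periods in 72 months

monthly = payment(AMOUNT, APR, 12, MONTHS)
native_biweekly = payment(AMOUNT, APR, 26, BIWEEKLY_PERIODS)
gap = monthly / 2 - native_biweekly

new_amount = goal_seek_amount(monthly / 2, APR, 26, BIWEEKLY_PERIODS)
extra = new_amount - AMOUNT

print(round(gap, 2), round(new_amount, 2), round(extra, 2))
```

The gap printed first is the $33 mentioned above, and the extra amount financed is what that gap buys over 156 payments. A closed-form present value would give the same answer; bisection is used here only because it generalizes to cases with fees or odd first periods.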