The Art Project

Can AI be used to match and classify images? Of course! Vision models do this all the time, looking at everything from paint chips to x-rays. In today’s post, I use an established model called ResNet-50 to match and classify post-impressionist artists. For example, Braque and Picasso have a similarity score of 0.70.

The “cosine similarity” between Braque and Picasso is 0.70.

ResNet-50 is a convolutional neural network (CNN) introduced in 2015. Normally, we would use it as the base for image interpretation, and then add layers to learn the specific application. In this case, we are only interested in the coding system it uses, called an “embedding.”

ResNet-50 encodes each image as a list of 2,048 numbers, known as a “vector” in machine learning. The purpose of this vector is not to store the image – the JPEG file already does that – but to encode whatever features the model deems useful.

For this demonstration, I collected examples from fourteen artists. To avoid complications over the choice of subject, I used self-portraits by each artist.

Experiments with CNNs show that they recognize shapes, colors, styles, and textures – everything you would expect from “machine vision.” Our model is not going to know anything about the painters, though – not who cut off an ear, or who moved to Tahiti. It’s just the pixels.

With the fourteen paintings vectorized, we can do things like compute similarity scores. For instance, Braque, Chagall, and Picasso seem to hang together. I also ran a hierarchical clustering analysis.
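The similarity score is simple enough to sketch in plain Python. This is only an illustration, not the script behind the post: the artist keys and the short four-number vectors are made-up stand-ins for the real 2,048-number embeddings, and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_table(embeddings):
    """Map each unordered pair of names to its cosine similarity."""
    names = sorted(embeddings)
    return {
        (p, q): cosine_similarity(embeddings[p], embeddings[q])
        for i, p in enumerate(names)
        for q in names[i + 1:]
    }

# Toy vectors (assumption): placeholders for real ResNet-50 embeddings.
toy_embeddings = {
    "artist_a": [1.0, 0.0, 2.0, 0.0],
    "artist_b": [2.0, 0.0, 4.0, 0.0],
    "artist_c": [0.0, 3.0, 0.0, 1.0],
}

table = similarity_table(toy_embeddings)
closest_pair = max(table, key=table.get)
print(closest_pair, round(table[closest_pair], 2))
```

Because the score depends only on the angle between vectors, artist_a and artist_b come out as the closest pair even though one vector is twice as long as the other.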

It’s hard to imagine what the clustering algorithm “sees” in high-dimensional space so, wherever possible, I try to reduce the data to three dimensions – using principal component analysis (PCA) or UMAP. In this case, because of the small sample, a 3-D chart captures 40% of the variance.

The human eye naturally finds clusters – there are Picasso, Braque, and Chagall down at the bottom, and here is Kandinsky off by himself. Also note that Cezanne, Gauguin, and Schiele are spread out along the Y axis, but together on the X axis.

Unfortunately, these axes are completely arbitrary. ResNet-50 can’t tell us if Z is the “axis of cubism,” or whatever. That’s the knock against neural net reasoning being a “black box.” We can see, though, that the PCA plot roughly agrees with the cluster analysis.

So, that was about two hundred lines of code as a proof of concept, plus some fun charts. If you were really doing this for your MFA, you would want to use many more paintings, and stash them in a vector database. For more on vector databases, see Literary Analysis with RAG.

Today’s featured image nods to a common gaffe in generative AI. Yes, Marc Chagall really did paint a “Self-Portrait with Seven Fingers.”

Notes on GenAI

Generative AI isn’t really my thing. To me, AI means machine learning for quantitative applications, like predictive analytics. Nonetheless, people seem to be having fun over here, so I thought I’d give it a try. Here are some notes from Project Avatar.

Tools in this space overlap a lot. Synthesia, for instance, aims to be one-stop shopping for training videos and the like. The video tools overlap with the image generation tools, while others, like Eleven Labs, have carved out niches where they’re best in class.

Create Your Avatar

With HeyGen, I was able to create custom avatars from photos, video, and generated images. It also features prefab avatars, based on their own actors. For a training video, you can choose from a wide variety of these. If you want to be differentiated, however, you will need to create a custom avatar.

I know of one company that simply adopted “Ada” from Synthesia, and made her the face of their application. Ada is a very popular avatar, so – no differentiation. With Synthesia, you can create custom avatars from video, but not photos.

Avatars from Still Images

Video avatars are generally more expressive, so why would you want to create one from still images? I can think of two reasons. First, making a good video avatar is a lot of work, and you need a model. I did one of myself, and it looks like hell.

Second, you might want to create a completely unique persona – to specifications – that you control. This is how I created Hadley, and here is where it gets interesting. Midjourney has a character reference flag (--cref) so that, once you have the face you want, it can reliably reproduce her in different settings.

Close shot of a 52-year-old man, strong-looking, bald, square jaw with short beard, captured with a 70-200mm f/2.8E FL ED VR lens, with high-key lighting and a shallow depth of field.

Stability AI is arguably better at images but, without the cref feature, it is useless for this application. On the other hand, while Midjourney is great at creffing AI images, it won’t do photos. The image of me in Project Avatar is from HeyGen.

After training on a dozen recent photos, HeyGen can reproduce my likeness accurately about 25% of the time. These are called “looks.” So, “Mark seated in library wearing a dress shirt” is a look. I also did a video avatar, which included cloning my voice.

Synthetic Voice

Eleven Labs has a wide array of professional voices, plus you can clone your own voice. You can generate audio in Eleven Labs, and then upload it to HeyGen for animation. If you’re working from a script, you can paste the script into HeyGen and then link to voices in Eleven Labs using its API. HeyGen also has its own voice library.

Please write a short, motivational speech, roughly 75 words in length, on how to navigate life transitions. Include a narrative hook at the outset, and a punchy conclusion that recaps the hook.

Both systems are a little bit robotic when reading a script. There are some things you can do with the script to improve this. To get the best pacing and intonation, you might want to read the script yourself, and then use speech-to-speech conversion. This, unfortunately, will gum up your automation.

Automate Your Workflow

If you’re doing this at scale, you can’t be the one reading the scripts. This energetic bot is Jordan from LipDub. You can imagine the pace at which he is pumping out these ad spots. LipDub and Akool are both more expressive than HeyGen but, again, you need a live model.

All these tools are richly supplied with APIs so, if you are a non-coder, you can easily string together a workflow using Make (formerly Integromat). Here is my draft workflow for the Bruno channel:

  1. Call ChatGPT to generate topic and prompt
  2. Call Powerful Thinking to generate script
  3. Pass script to Eleven Labs and call TTS to generate audio
  4. Pass audio to HeyGen, selecting Bruno avatar by ID
  5. Optional: generate new “look” for Bruno
  6. Add captions
  7. Post to Instagram reel using Meta API

By “richly,” I mean that HeyGen wants $100 per month for their API – on top of the monthly Creator subscription. CapCut doesn’t have an API, but JSON2Video does. Also, there are libraries in Python that will do captions. Bold captions add a nice touch.

Step two of the workflow calls my custom GPT, Powerful Thinking. This, unfortunately, does not yet have API support, so I was reduced to using Selenium.

Write the Script

Scripts are the easiest thing in the world to generate, and the LLMs (apart from Powerful Thinking) are equipped with APIs. You can give them specific instructions, and they’ll even do product research for you.

I am a ChatGPT guy, myself (the other camp is the Claude people) and I have also used Google’s Notebook LM. Hadley’s libertarian scripts, I write myself.

One last caveat: GenAI is a rapidly evolving space. Synthesia had it to itself for a while, and now there is a raft of new entrants. Eleven Labs faces competition from Murf and Speechify. I am on a list for the LipDub beta. If you want to work in this space, you must be ready to learn a new tool every week.

Project Avatar

Everything in this video is AI generated. My voice and image have been cloned. Even the script was generated, by a Google product called Notebook LM. This post is mostly about Notebook LM, and I’ll also survey some other Gen AI tools.

Notebook LM is basically RAG in a box. If you don’t know what that is, you can read my earlier posts on the topic – or you can watch the video. I thought it would be clever to feed RAG articles to a RAG system, and have it generate a dialogue.

That’s right, Notebook LM will ingest raw source material, and then generate a podcast-style dialogue. The system is meant as a study aid, and you can imagine how powerful that is. Other outputs include a study guide, FAQ page, and timeline. Here is a sample entry from the War and Peace timeline:

October 1805: News arrives of Mack’s defeat. The Pavlograd hussars, including Rostov and Denisov, are stationed near Braunau. Rostov experiences his first taste of battle. He witnesses the horrors of war and feels disillusioned. Prince Andrew serves as an adjutant for Kutuzov.

One challenge with RAG has always been preparing the source materials. This earlier post described the work of parsing and vectorizing several text files. In real world applications, clean source material is hard to find. Notebook LM swallows PDF files with ease.

I was curious about the health concerns around seed oils, so I rounded up some papers from sources like the Journal of Nutrition and Metabolism, and just dumped them into Notebook LM. It prepared a handy summary of each one, plus the outputs listed above. I listened to the dialogue and, of course, you can chat with it, too.

  • Source Summaries
  • FAQ Page
  • Study Guide
  • Table of Contents
  • Timeline
  • Briefing Book
  • Chat Window
  • Dialogue

This is a practical, down-to-earth application of LLM technology. One person I found on Reddit is using Notebook LM to prepare for the CISSP exam. He’s doing what I did with seed oils, hoovering up all the InfoSec papers.

From Podcast to Video

Since the Notebook LM dialogue is audio only, I thought it would be fun to make a video and cast my own avatar for the male voice. That’s not even a real photograph of me. First, I trained a photo avatar on HeyGen, and then requested “Mark wearing dress shirt in library.”

Synthesia is similar to HeyGen, but it’s optimized for training videos. It uses a slideshow format. People like it because, if this is your application, all the tools are in one place. I found HeyGen to be more flexible for things like photo avatars and voice substitution.

Other tools I looked at were Deepbrain (now AI Studio), Wondershare, D-ID, and Creatify. Creatify is optimized for making product advertisements on social media. It can write its own script, based on reading the product’s website.

For my voice, I made an “instant voice clone” on Eleven Labs. I didn’t have the patience to make a “professional” one. The instant clone is good enough and, frankly, a little creepy.

I selected a canned avatar named Georgia to be my partner. Initially, I used the script from Notebook LM, and ran the HeyGen animation in text-to-speech mode. Georgia is native there, and HeyGen was able to use my voice via API from Eleven Labs. HeyGen also supports integration with LMNT, Play.ht, and Cartesia.

This is, by far, the easiest way to do it. When it was time to combine the two videos, I was able to use transcript-based editing in CapCut. Unfortunately, the result was a little bit robotic. The charm of Notebook LM’s dialogue is that it really sounds natural.

Working with the WAV file was more challenging. I used Audacity to split the male and female roles – not easy when they interject “uh-huh” over each other’s lines, but that’s the desired effect.

I left the female voice as-is, ran the male audio track through Eleven Labs to pick up my voice, and then went back to HeyGen – this time, uploading prefab audio for me and Georgia (separately) instead of scripts.

The result from HeyGen was two videos, one for each avatar. While one is speaking, the other sits patiently and makes facial expressions as if listening. The timing works because the split tracks from Audacity are in sync. The last thing to do was combine these, split-screen, in CapCut.

Gen AI and Social Media

My work with AI has always been machine learning for quantitative applications – Python, Scikit, and applied statistics – so it was fun to learn about the crazy things people are doing with Gen AI.

For instance, there is an AI-generated influencer on Instagram. An Italian modeling agency created her, so the story goes, because they were tired of working with real prima donnas.

There is now a cottage industry of avatars on social media, using tools like Creatify to monetize attribution. I thought for a moment about my custom GPT, Powerful Thinking, and its avatar, Bruno. But I couldn’t think of anything for him to sell. Next week, I’ll be back to my regular coding projects.

Unguided RAG for Text Comparison

Last week, we covered the basic RAG setup and had some fun answering questions from War and Peace. Today, we move on to the two-novel cases:

  • Answer questions using passages from two novels and compare them
  • Compare passages from two novels based on an unseen question
  • Compare passages from two novels based on a similarity search

When I first challenged ChatGPT to “compare common tropes and plot devices between War and Peace and Middlemarch,” I had a specific one in mind. Both novels are set in the early nineteenth century, a time when virtually all wealth was inherited – and the non-wealthy were basically slaves. So, everyone is trying either to marry into a rich family or suck up to a wealthy relative.

This trope is common enough that you could have seen it anywhere: our hero stands to inherit a fortune, but there is some intrigue around the dead uncle’s will. Maybe there’s a different version known only to the servants, etc. I put this question to RAG by gathering matching passages from both novels. The typical prompt template would be something like:

“Please answer this question {question} based on these passages from the novel War and Peace {context1} and these passages from the novel Middlemarch {context2}”

What we are looking for, though, is something a little more autonomous, so I simply removed the query text from the prompt template:

“Please compare these two novels with reference to the context provided … passages from the novel War and Peace {context1} and passages from the novel Middlemarch {context2}”
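In code, the difference between the two templates comes down to whether the question is interpolated. Here is a minimal sketch, assuming the retrieved passages arrive as lists of strings; build_prompt and its parameter names are hypothetical, and the elision in the second template is kept as it appears above.

```python
GUIDED_TEMPLATE = (
    "Please answer this question {question} based on these passages "
    "from the novel War and Peace {context1} and these passages "
    "from the novel Middlemarch {context2}"
)

UNGUIDED_TEMPLATE = (
    "Please compare these two novels with reference to the context "
    "provided … passages from the novel War and Peace {context1} "
    "and passages from the novel Middlemarch {context2}"
)

def build_prompt(passages1, passages2, question=None):
    """Fill the guided template if a question is given, else the unguided one."""
    context1 = "\n".join(passages1)
    context2 = "\n".join(passages2)
    if question is None:
        return UNGUIDED_TEMPLATE.format(context1=context1, context2=context2)
    return GUIDED_TEMPLATE.format(
        question=question, context1=context1, context2=context2
    )

# Placeholder passages (assumption): real ones would come from the retriever.
wp_passages = ["<War and Peace passage>"]
mm_passages = ["<Middlemarch passage>"]

print(build_prompt(wp_passages, mm_passages, question="Who inherits the estate?"))
print(build_prompt(wp_passages, mm_passages))
```

The retriever would still be called with the question in both cases; only the text sent to the LLM changes.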

So, the retriever knows what question is to be answered, but ChatGPT is only asked to “compare” and draw its own conclusions. The results are quite good. I won’t share the whole response, but here is the concluding paragraph:

Overall, while both novels touch on similar themes of inheritance and wills, they do so in distinct ways that reflect the respective societies and characters depicted in each work. Middlemarch delves into the personal and familial implications of inheritance, with a focus on individual motivations and moral dilemmas, while War and Peace explores the broader societal and political consequences of inheritance, with a tone that is more ironic and comedic.

Here we see ChatGPT contrasting the tone of the two samples, and drawing a thematic inference. You can even fish for likely tropes, like “Are there instances of women being unfaithful?”

RAG with Unguided Retrieval

Finally, to make RAG fully autonomous, we must dispense with the guiding hand of the query text, and set it loose using only cosine similarity. This script simply trundles through both databases, using Chroma’s query by vector to find similar passages. There is no need to create any new embeddings.

Once a cross-novel match is found, the script retrieves n_results similar passages on each side, and then passes the unguided “compare” prompt to OpenAI.
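The search loop can be sketched without a database. The version below is a plain-Python stand-in, not the Chroma-based script: a brute-force scan over in-memory lists replaces Chroma’s query by vector, and the threshold, chunk layout, and function names are all assumptions made for illustration. It also assumes that the neighbors on each side are the passages closest to the seed chunk.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_matches(query_vec, chunks, n_results):
    """Return (score, chunk) pairs for the n_results most similar chunks."""
    scored = [(cosine_similarity(query_vec, c["embedding"]), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n_results]

def first_cross_match(chunks_a, chunks_b, threshold, n_results=2):
    """Walk novel A until a chunk has a close enough neighbor in novel B."""
    for seed in chunks_a:
        best_score, _ = top_matches(seed["embedding"], chunks_b, 1)[0]
        if best_score >= threshold:
            # Gather the neighbors of the seed on each side of the match.
            side_a = [c["text"] for _, c in top_matches(seed["embedding"], chunks_a, n_results)]
            side_b = [c["text"] for _, c in top_matches(seed["embedding"], chunks_b, n_results)]
            return best_score, side_a, side_b
    return None

# Toy chunks (assumption): two-number vectors in place of stored embeddings.
novel_a = [
    {"text": "A chunk 1", "embedding": [1.0, 0.0]},
    {"text": "A chunk 2", "embedding": [0.0, 1.0]},
    {"text": "A chunk 3", "embedding": [0.1, 1.0]},
]
novel_b = [
    {"text": "B chunk 1", "embedding": [-1.0, 0.2]},
    {"text": "B chunk 2", "embedding": [0.05, 1.0]},
    {"text": "B chunk 3", "embedding": [-0.5, 0.5]},
]

match = first_cross_match(novel_a, novel_b, threshold=0.9)
print(match)
```

The two lists of passages would then fill the context slots of the unguided “compare” prompt. No new embeddings are computed at any point; the stored vectors serve as the queries.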

Between War and Peace and Middlemarch, it settles on some grim material about “empathy and compassion in a time of hardship.” I didn’t feel like quoting it so, instead, I tried another novel.

It took me about ten minutes to download, parse, and add Vanity Fair to the mix. In common with War and Peace, it has its own Napoleonic war (different campaign) and a great opportunity for guided search: “Is one or more of the protagonists killed in action?”

The unguided search script, predictably, finds the war. Tolstoy treats the war from a historical perspective while, for Thackeray, it’s just the backdrop for his drama.

On the other hand, War and Peace takes a broader perspective, examining the larger geopolitical forces at play during the Napoleonic Wars. The novel delves into the complexities of international relations, diplomacy, and military strategy, showing how the actions of monarchs, diplomats, and military leaders shape the course of history. While War and Peace also portrays the impact of war on individuals, it does so within the context of larger historical forces and political developments.

The histogram above shows the distribution of cosine similarity results across all 4.8 million pairs of chunks between Middlemarch and War and Peace, for two OpenAI embedding models: text-embedding-ada-002 and text-embedding-3-small (ada-002 and 3-small, for short). These are both 1,536-dimensional embeddings. I also experimented with 3-large.
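A histogram like this can be built by scoring every cross-novel pair and counting the results per bin. The sketch below shows the idea on toy two-number vectors; it is not the code that produced the chart, the function names are hypothetical, and at millions of pairs a vectorized library would be the practical choice over plain Python loops.

```python
import math
import statistics
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def cross_similarities(vectors_a, vectors_b):
    """Cosine similarity for every pair drawn from the two collections."""
    return [cosine_similarity(a, b) for a in vectors_a for b in vectors_b]

def histogram(similarities):
    """Count similarities per bin, rounding each score to one decimal place."""
    counts = Counter(round(s, 1) for s in similarities)
    return dict(sorted(counts.items()))

# Toy vectors (assumption): stand-ins for the stored chunk embeddings.
vectors_a = [[1.0, 0.0], [0.0, 1.0]]
vectors_b = [[1.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]

sims = cross_similarities(vectors_a, vectors_b)
hist = histogram(sims)
spread = statistics.pstdev(sims)  # one number to compare models by
print(hist, round(spread, 3))
```

Running the same two functions once per embedding model gives two histograms, and the standard deviation offers a quick way to compare how widely each model spreads its scores.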

Initially, I preferred ada-002 because it was easier for the script to find similar passages. After working for a while with both, and seeing the histogram, I think maybe a wider variance is better. It means that nearby passages really are similar, while those that aren’t are farther apart.

For instance, 3-small gives a better answer on Natasha’s engagement because it’s more discriminating. Because I’ve read the novel (twice) I can infer where the search is going wrong. Also, I wrote a little utility function that displays which chunks it has found, with their distance metrics and metadata.
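A utility of that kind might look like the following sketch. It assumes the result dictionary has the shape Chroma returns from a query that includes documents, metadatas, and distances (one inner list per query); the function name, the sample IDs, and the metadata keys are hypothetical.

```python
def format_hits(result, width=60):
    """Return one display line per retrieved chunk for the first query."""
    rows = zip(
        result["ids"][0],
        result["distances"][0],
        result["metadatas"][0],
        result["documents"][0],
    )
    lines = []
    for chunk_id, distance, metadata, text in rows:
        # Collapse whitespace and truncate so each hit fits on one line.
        snippet = " ".join(text.split())[:width]
        lines.append(f"{chunk_id}  dist={distance:.3f}  {metadata}  {snippet}")
    return lines

# Placeholder result (assumption): invented values in the query-result shape.
sample_result = {
    "ids": [["wp-0042", "wp-0107"]],
    "distances": [[0.25, 0.5]],
    "metadatas": [[{"book": 8, "chapter": 13}, {"book": 6, "chapter": 2}]],
    "documents": [[
        "Placeholder   text\nfor the first chunk.",
        "Placeholder text for the second chunk.",
    ]],
}

for line in format_hits(sample_result):
    print(line)
```

Seeing the distance next to the book and chapter makes it easier to spot when the search has wandered into the wrong part of the novel.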