Generative AI isn’t really my thing. To me, AI means machine learning for quantitative applications, like predictive analytics. Nonetheless, people seem to be having fun over here, so I thought I’d give it a try. Here are some notes from Project Avatar.
Tools in this space overlap a lot. Synthesia, for instance, aims to be a one-stop shop for training videos and the like. The video tools overlap with the image generation tools, while others, like Eleven Labs, have carved out niches where they’re best in class.
Create Your Avatar
With HeyGen, I was able to create custom avatars from photos, video, and generated images. It also features prefab avatars, based on its own actors. For a training video, you can choose from a wide variety of these. If you want to be differentiated, however, you will need to create a custom avatar.
I know of one company that simply adopted “Ada” from Synthesia, and made her the face of their application. Ada is a very popular avatar, so – no differentiation. With Synthesia, you can create custom avatars from video, but not photos.
Avatars from Still Images
Video avatars are generally more expressive, so why would you want to create one from still images? I can think of two reasons. First, making a good video avatar is a lot of work, and you need a model. I did one of myself, and it looks like hell.
Second, you might want to create a completely unique persona – to specifications – that you control. This is how I created Hadley, and here is where it gets interesting. Midjourney has a character reference flag so that, once you have the face you want, it can reliably reproduce her in different settings.
Close shot of a 52-year-old man, strong-looking, bald, square jaw with short beard, captured with a 70-200mm f/2.8E FL ED VR lens, with high-key lighting and a shallow depth of field.
Stability AI is arguably better at images but, without the cref feature, it is useless for this application. On the other hand, while Midjourney is great at creffing AI images, it won’t do photos. The image of me in Project Avatar is from HeyGen.
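To make the character reference concrete, here is the shape of a creffed prompt. The flag names reflect Midjourney v6 as I understand it, and the image URL is a placeholder for wherever you host the reference face:

```
/imagine prompt: Hadley walking through a sunlit newsroom, shallow depth of field
  --cref https://example.com/hadley-face.png --cw 100
```

The `--cw` (character weight) value runs from 0 to 100; higher values try to carry over hair and clothing as well as the face, while lower values stick closer to the face alone.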
After training on a dozen recent photos, HeyGen can reproduce my likeness accurately about 25% of the time. The generated images are called “looks.” So, “Mark seated in a library wearing a dress shirt” is a look. I also did a video avatar, which included cloning my voice.
Synthetic Voice
Eleven Labs has a wide array of professional voices, plus you can clone your own voice. You can generate audio in Eleven Labs, and then upload it to HeyGen for animation. If you’re working from a script, you can paste the script into HeyGen and then link to voices in Eleven Labs using its API. HeyGen also has its own voice library.
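As a sketch of that hand-off, assuming Eleven Labs’ v1 text-to-speech endpoint and a voice ID from your account (the model name and field names are assumptions; check the current API reference before wiring this into anything):

```python
# Sketch: render a script as audio with Eleven Labs, then save the MP3
# for upload to HeyGen. Voice ID, API key, and model name are placeholders.
import json
import urllib.request

ELEVEN_TTS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"

def build_tts_request(script: str, voice_id: str, api_key: str) -> dict:
    """Assemble the URL, headers, and JSON body for one TTS call."""
    return {
        "url": ELEVEN_TTS_URL.format(voice_id=voice_id),
        "headers": {"xi-api-key": api_key, "Content-Type": "application/json"},
        "body": {
            "text": script,
            "model_id": "eleven_multilingual_v2",  # assumption: current default
        },
    }

def synthesize(script: str, voice_id: str, api_key: str, out_path: str) -> None:
    spec = build_tts_request(script, voice_id, api_key)
    req = urllib.request.Request(
        spec["url"],
        data=json.dumps(spec["body"]).encode(),
        headers=spec["headers"],
    )
    with urllib.request.urlopen(req) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())  # MP3 bytes, ready to hand to HeyGen
```

Keeping the request-building separate from the network call makes the payload easy to inspect (and to swap for HeyGen’s own voice library if you skip Eleven Labs).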
Please write a short, motivational speech, roughly 75 words in length, on how to navigate life transitions. Include a narrative hook at the outset, and a punchy conclusion that recaps the hook.
Both systems are a little bit robotic when reading a script. There are some things you can do with the script to improve this. To get the best pacing and intonation, you might want to read the script yourself, and then use speech-to-speech conversion. This, unfortunately, will gum up your automation.
Automate Your Workflow
If you’re doing this at scale, you can’t be the one reading the scripts. This energetic bot is Jordan from LipDub. You can imagine the pace at which he is pumping out these ad spots. LipDub and Akool are both more expressive than HeyGen but, again, you need a live model.
All these tools are richly supplied with APIs so, if you are a non-coder, you can easily string together a workflow using Make (formerly Integromat). Here is my draft workflow for the Bruno channel:
- Call ChatGPT to generate topic and prompt
- Call Powerful Thinking to generate script
- Pass script to Eleven Labs and call TTS to generate audio
- Pass audio to HeyGen, selecting Bruno avatar by ID
- Optional: generate new “look” for Bruno
- Add captions
- Post to Instagram reel using Meta API
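Step four of that list might look like the sketch below. The endpoint path and field names reflect my reading of HeyGen’s v2 API and should be treated as assumptions; the avatar and audio IDs are placeholders you would pull from earlier steps:

```python
# Sketch: ask HeyGen to render a clip pairing the Bruno avatar with a
# pre-generated audio track. Verify paths and fields against current docs.
import json
import urllib.request

HEYGEN_GENERATE_URL = "https://api.heygen.com/v2/video/generate"  # assumption

def build_video_request(avatar_id: str, audio_asset_id: str) -> dict:
    """JSON body pairing an avatar ID with an uploaded audio asset."""
    return {
        "video_inputs": [{
            "character": {"type": "avatar", "avatar_id": avatar_id},
            "voice": {"type": "audio", "audio_asset_id": audio_asset_id},
        }],
        "dimension": {"width": 720, "height": 1280},  # vertical, for Reels
    }

def submit(avatar_id: str, audio_asset_id: str, api_key: str) -> bytes:
    body = json.dumps(build_video_request(avatar_id, audio_asset_id)).encode()
    req = urllib.request.Request(
        HEYGEN_GENERATE_URL,
        data=body,
        headers={"X-Api-Key": api_key, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()  # response carries a video ID you poll for status
```

Rendering is asynchronous, so the workflow has to poll for completion before the caption and posting steps can run.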
By “richly,” I mean that HeyGen wants $100 per month for its API – on top of the monthly Creator subscription. CapCut doesn’t have an API, but JSON2Video does. Also, there are libraries in Python that will do captions. Bold captions add a nice touch.
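For the captions step, you may not even need a library: the SRT subtitle format is plain text. This sketch turns (start, end, text) cues, with times in seconds, into an .srt string; making the captions bold happens downstream, when they are burned into the video:

```python
# Sketch: build an SRT subtitle file from timed caption cues, stdlib only.
def fmt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(cues: list[tuple[float, float, str]]) -> str:
    """Number each cue and join them into one SRT document."""
    blocks = []
    for i, (start, end, text) in enumerate(cues, start=1):
        blocks.append(f"{i}\n{fmt_time(start)} --> {fmt_time(end)}\n{text}\n")
    return "\n".join(blocks)

srt = to_srt([
    (0.0, 2.5, "Life throws curveballs."),
    (2.5, 5.0, "Swing anyway."),
])
```

Dedicated caption libraries add word-level timing and styling, but for a scripted avatar channel, where you already know the text, this is most of the job.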
Step two of the workflow calls my custom GPT, Powerful Thinking. This, unfortunately, does not yet have API support, so I was reduced to using Selenium.
Write the Script
Scripts are the easiest thing in the world to generate, and the LLMs (apart from Powerful Thinking) are equipped with APIs. You can give them specific instructions, and they’ll even do product research for you.
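The script step reduces to a prompt template plus one API call. This sketch packages the speech prompt from earlier as a chat-completions request; the model name is a placeholder, and the endpoint shown is OpenAI’s standard chat completions API:

```python
# Sketch: generate a ~75-word motivational script for a given topic.
import json
import urllib.request

def build_script_prompt(topic: str) -> list[dict]:
    """Wrap the topic in the speech-writing prompt as a chat message list."""
    return [
        {"role": "system", "content": "You are a motivational scriptwriter."},
        {"role": "user", "content": (
            f"Write a short, motivational speech, roughly 75 words in length, "
            f"on {topic}. Include a narrative hook at the outset, and a punchy "
            "conclusion that recaps the hook."
        )},
    ]

def generate_script(topic: str, api_key: str, model: str = "gpt-4o") -> str:
    body = json.dumps({"model": model, "messages": build_script_prompt(topic)})
    req = urllib.request.Request(
        "https://api.openai.com/v1/chat/completions",
        data=body.encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

The output of `generate_script` is exactly what gets pasted, or piped, into the TTS step.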
I am a ChatGPT guy, myself (the other camp is the Claude people), and I have also used Google’s NotebookLM. Hadley’s libertarian scripts, I write myself.
One last caveat: GenAI is a rapidly evolving space. Synthesia had it to itself for a while, and now there is a raft of new entrants. Eleven Labs faces competition from Murf and Speechify. I am on a list for the LipDub beta. If you want to work in this space, you must be ready to learn a new tool every week.