Lexman Artificial Podcast
Excerpt
Looking for something a little different in your podcast listening? Check out The Lexman Artificial Podcast, where an AI named Lexman takes on the roles of both guest and host! Hear Lexman's thoughts on everything from current events to pop culture and beyond.
tl;dr GPT-3 is used to generate transcripts that are then read by TorToiSe.
Transcript generation
The transcripts are generated by a series of prompt completions using OpenAI's GPT-3. First, 2-6 nouns are drawn from a large wordlist, then a guest is drawn from a set of ~200 guests I have voices for (more on that below). Then, using those keywords and the guest's name, a summary of the podcast is generated.
Lexman Artificial Podcast
Episode 123
Guest: Elon Musk
Keywords: lightsaber, horse, bodysnatcher
Summary: Elon Musk joins Lexman for a wide-ranging discussion about his life and work, from his early days as a disruptive entrepreneur to his vision for the future of technology and transportation. They also talk about his love of horses, his passion for video games, and his experience being "bodysnatched" by aliens.
The summary is then used to generate the title.
Lexman Artificial Podcast
Guest: Elon Musk
Keywords: lightsaber, horse, bodysnatcher
Summary: Elon Musk joins Lexman for a wide-ranging discussion about his life and work, from his early days as a disruptive entrepreneur to his vision for the future of technology and transportation. They also talk about his love of horses, his passion for video games, and his experience being "bodysnatched" by aliens.
Title: Elon Musk: A Ride with The Master of Disruption
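In code, the whole chain boils down to something like this. A minimal sketch using the old openai completions client; the exact prompt wording and sampling parameters here are illustrative, not the ones I actually use:

```python
import random

import openai  # assumes openai.api_key is set in the environment

def generate_metadata(wordlist: list[str], guests: list[str]) -> dict:
    """Chain completions: keywords + guest -> summary -> title."""
    keywords = random.sample(wordlist, k=random.randint(2, 6))
    guest = random.choice(guests)

    base = (
        "Lexman Artificial Podcast\n"
        f"Guest: {guest}\n"
        f"Keywords: {', '.join(keywords)}\n"
    )

    # First completion: the episode summary.
    summary = openai.Completion.create(
        engine="curie",
        prompt=base + "Summary:",
        max_tokens=150,
        temperature=0.9,
        stop="\n",
    ).choices[0].text.strip()

    # Second completion: feed the summary back in to get a title.
    title = openai.Completion.create(
        engine="curie",
        prompt=base + f"Summary: {summary}\nTitle:",
        max_tokens=30,
        temperature=0.9,
        stop="\n",
    ).choices[0].text.strip()

    return {"guest": guest, "keywords": keywords, "summary": summary, "title": title}
```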
A similar process is used to generate the introduction and the sponsor message. The actual conversation is where it gets a bit tricky. It's a balancing act between keeping GPT-3 generating and not letting it go off the rails. Left to its own devices it will often close out the show early and start to read from uninitialized memory, so to speak.
I've found that generating the transcript line-by-line, using a linebreak as the stop token, is a very reliable way to generate transcripts. However, this quickly becomes very expensive since you pay for total tokens, prompt included: the last few lines cost as much to generate as the entire transcript up to that point.
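To illustrate, the naive line-by-line loop looks roughly like this; note how the prompt, and therefore the cost of each request, grows with every generated line:

```python
import openai

def generate_dialogue(prompt: str, max_lines: int = 100) -> str:
    """Naive approach: one request per line, newline as the stop token."""
    transcript = prompt
    for _ in range(max_lines):
        resp = openai.Completion.create(
            engine="curie",
            prompt=transcript,  # the full transcript so far is billed every time
            max_tokens=100,
            temperature=0.9,
            stop="\n",  # stop at the end of the current line
        )
        line = resp.choices[0].text.strip()
        if not line:  # the model has nothing more to say
            break
        transcript += line + "\n"
    return transcript
```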
What I've ended up with is a compromise: a bunch of heuristics (read: regexps) that try to detect what's going on and adjust the prompt and parameters to keep GPT-3 generating until some sort of conclusion. This allows me to create the dialogue in 2-3 requests. It also removes garbage output, like when GPT-3 decides it's time to start on the second episode or brings in a third guest.
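The heuristics are just pattern matches over the latest completion, along these lines (the patterns shown here are made-up examples, the real list is longer and messier):

```python
import re

# Illustrative examples of the kind of patterns I mean.
DERAIL_PATTERNS = [
    re.compile(r"end of transcript", re.IGNORECASE),
    re.compile(r"^Episode \d+", re.MULTILINE),  # starting on the next episode
    re.compile(r"thanks for (coming on|listening)", re.IGNORECASE),  # wrapping up
]

def looks_derailed(completion: str) -> bool:
    """Return True if GPT-3 seems to be wrapping up or going off the rails."""
    return any(p.search(completion) for p in DERAIL_PATTERNS)
```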
As you can hear if you listen to a couple of episodes, this works fairly well, but there are occasional mixups where the host and guest say each other's lines or they say something like "Thanks for coming on the show! End of transcript".
The GPT-3 engine I'm using is "curie", the less capable predecessor of the "davinci" engine. This is mostly because it's an order of magnitude cheaper (one transcript costs ~$0.40 with davinci), but also because I find the curie transcripts to be more entertaining. Curie tends to veer off in more bizarre directions, while davinci keeps things more realistic and dry.
All that said, I think there is a lot that could be done to improve the transcript prompts. But what I really want to do is train my own GPT model on podcast transcripts. I'm currently in the process of building up a large dataset of transcripts I can use to train it.
Text-To-Speech
To turn the transcripts into speech I use James Betker's TorToiSe. It's the real magic behind the podcast. It takes short text snippets and some guiding voice samples to create really convincing speech.
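Basic usage looks something like this. A sketch following the stock tortoise-tts API; the "lex_fridman" voice folder is my own and doesn't ship with the repo:

```python
import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voices

tts = TextToSpeech()

# Load a few short WAV clips that guide the generated voice.
voice_samples, conditioning_latents = load_voices(["lex_fridman"])

gen = tts.tts_with_preset(
    "Here's my conversation with Elon Musk. Enjoy!",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset="standard",  # quality/speed trade-off
)
torchaudio.save("clip.wav", gen.squeeze(0).cpu(), 24000)
```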
TorToiSe is aptly named: it creates amazing results, but it is very slow. On my 6900 XT, one 3-minute episode takes about 30 minutes to render. It also has some interesting quirks. One quirk I had to fight with to make this podcast is its tendency to switch voices mid-sentence. To illustrate, here's a Joe Rogan voice taking over from Lexman mid-sentence (headphone warning):
[Audio example: the voice switching from Lexman to Joe Rogan mid-sentence]
This isn't a common occurrence, and when generating things manually it's not a huge problem to just re-generate that particular clip, but for my purposes it was a showstopper. Initially I tried to work around it by training a classifier to detect when this happens and automatically reject the clip. This wasn't very successful: it would catch obvious clips like the one above, but not when the voice changed to what is clearly a different person to my ear but with the same intonation and manner of speech.
What I ended up doing instead was fine-tuning the TorToiSe models on several hundred hours of clips from the Lex Fridman podcast. I picked Lex's podcast because it's one of my favourites and he frequently talks about the concepts I want to explore with this project. His podcast was initially called "the Artificial Intelligence Podcast", which I also find very fitting.
I won't go into the details of how I fine-tuned the model since there are some concerns about this being used maliciously. If you're interested in fine-tuning your own model, to save you some time I will say that a critical component needed to do it has since been removed from the public release. So you won't be able to do it without re-training some of the required models from scratch.
My fine-tuned model developed its own quirks. One quirk that you will hear a lot if you listen to the podcast is that it likes to repeat words at the end of a sentence. For example, it will often read "Here's my conversation with X, enjoy!" as "Here's my conversation with X, enjoy. Enjoy! ENJOY!". It also developed a speech impediment that comes out occasionally. An example:
[Audio example: the speech impediment]
My model is also limited to voices represented in the training set, i.e. people who have been on the Lex Fridman podcast. Using guiding samples that are not part of my fine-tuning data gives very poor results, so I have only ~200 guests for Lexman to speak with. But overall it's a huge improvement in how the episodes sound, and more importantly it never generates something that's totally unusable, so I don't need to worry about having to re-generate anything.
Now all that is needed is a pipeline that uses these two components to fully automate the podcast. But before I get into that, I'd like to give a big shoutout to the author of TorToiSe, James Betker, not only for releasing his amazing work to the world for free but also for being very generous with his time, answering questions on GitHub and helping everyone, myself included.
If you're interested in this, and you probably are since you made it this far into my ramblings, you should definitely read his blog: Non_Interactive.
Automate all the things
The generation pipeline consists of a central Redis server that holds state for a bunch of different worker types. I picked Redis for this job because it gives me a key-value store, a job queue, and distributed locks.
There are 7 different worker types handling the different stages: transcript generation, TTS, stitching the audio, generating the artwork, rendering the video, and finally releasing, which involves uploading the files to Cloudflare's R2 (their cheaper and faster AWS S3 competitor), uploading the video to Vimeo, and posting a tweet (also composed by GPT-3). The last worker has a manager role and is unique (ensured with a Redis lock); it's tasked with keeping track of how many episodes need to be generated, etc. Each episode also has its own Redis lock that ensures only one worker at a time can interact with it. This helps a lot with managing the worker queues, as I don't need to worry about scheduling that much.
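Stripped of error handling, every stage worker is essentially this loop (queue and lock names here are illustrative):

```python
import redis

r = redis.Redis()  # the central server that holds all pipeline state

def synthesize_episode(episode_id: bytes) -> None:
    """Placeholder for this stage's actual work (here: running the TTS)."""
    ...

def tts_worker() -> None:
    while True:
        # Each stage has its own queue; BLPOP blocks until a job arrives.
        _, episode_id = r.blpop("queue:tts")
        # The per-episode lock ensures no other worker touches this episode
        # while we work on it.
        with r.lock(f"lock:episode:{episode_id.decode()}", timeout=3600):
            synthesize_episode(episode_id)
            r.rpush("queue:stitch", episode_id)  # hand off to the next stage
```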
This design was chosen because I wanted to be able to use cheap GPU workers from Google Colab and vast.ai.
Initially I thought synchronizing everything via a Google Drive folder would be a convenient way to do it, since that's easily accessible from Python notebooks. So I built it as a filesystem database, with one directory in the drive representing each episode. This was a horrible mistake: I spent so much time duct-taping everything together and trying to figure out why files went missing that I finally broke down and spent a full day rewriting everything using Redis.
The episode artwork is generated using Latent Diffusion. Currently the worker just picks one image from a big folder of "lexman" images I pre-generated, but it would be cool to have it generate custom artwork for each episode based on the show description or title.
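For now the artwork worker is essentially just this (folder name illustrative):

```python
import random
from pathlib import Path

def pick_artwork(folder: str = "lexman_images") -> Path:
    """Pick one of the pre-generated Latent Diffusion images at random."""
    return random.choice(list(Path(folder).glob("*.png")))
```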
I have a notebook version of the TTS worker that I can just load up in Colab or any iPython instance to start generating chunks. This system allows me to generate episodes as fast as I'm willing to throw compute at it.
I'm making video versions of each episode because I couldn't find a good place (i.e. low-cost and with an API) to host the episodes in a way that would allow them to be played inline in a tweet. The only option I found was SoundCloud, but their API client for Python didn't work, so I chose Vimeo since I already have experience with their APIs and know them to work well.
Long-term I will probably decrease the release frequency to once or twice per day and run the TTS worker on a cheap rental somewhere; in my tests a Ryzen 3600 can eke out an episode every 12 hours or so without a GPU.
That's all there is to say about this project from a technical perspective, I think. If you have any questions, feel free to drop me a line at hello at johan dash nordberg dot com or contact me on Twitter. And if you find yourself asking "but why?", you can read my artist statement.