In my previous post, I wrote about building my own LLM Wiki. I definitely need to include Avocado Toast, the podcast I co-host. The problem is that podcast content is mostly audio, while all the skills I have written are built to process Markdown-based text. So how do I include audio?
After discussing this with ChatGPT and Claude, the solution was to use OpenAI’s open-source Whisper model to generate transcripts from audio files. On macOS, I only need to install Whisper with brew install whisper-cpp, then call whisper-cli to process audio files. It sounds simple, but in practice there are always pitfalls.
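As a minimal sketch of what that call can look like from Python, the snippet below builds a whisper-cli command and runs it. The flags (-m, -f, -l, -otxt, -of) are whisper.cpp options as far as I know. The helper names, the model path, and the file names are placeholders invented for illustration, not the code Claude Code actually wrote. Depending on the whisper.cpp version, the audio may first need to be converted to 16 kHz WAV.

```python
import subprocess
from pathlib import Path


def build_whisper_command(audio_path, model_path, output_base, language="zh"):
    """Return the argument list for one whisper-cli run (hypothetical helper)."""
    return [
        "whisper-cli",
        "-m", str(model_path),    # path to a ggml model file (assumed location)
        "-f", str(audio_path),    # input audio file
        "-l", language,           # spoken language; "zh" is an assumption here
        "-otxt",                  # write a plain-text transcript
        "-of", str(output_base),  # output path without the file extension
    ]


def transcribe(audio_path, model_path, output_dir, language="zh"):
    """Run whisper-cli on one file and return the expected transcript path."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    output_base = output_dir / Path(audio_path).stem
    cmd = build_whisper_command(audio_path, model_path, output_base, language)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return Path(str(output_base) + ".txt")
```

For example, transcribe("ep01.wav", "models/ggml-large-v3.bin", "transcripts") would be expected to leave the transcript at transcripts/ep01.txt, assuming whisper-cli is on the PATH and the model file exists at that path.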
The large-v3 Model Can Produce Loops
Whisper currently has two mainstream models: large-v3 and large-v3-turbo. The former is larger and slower, but has better output quality. The latter is smaller and 2 to 3 times faster, but its output is a little less accurate. I asked Claude Code to write the code that actually calls whisper-cli and test both models. It found that large-v3 was indeed better, so we chose that first, even though processing each podcast episode took about 30 minutes.
After processing a few episodes, Claude Code noticed a loop problem in the transcripts: a sentence was spoken only once in the audio, but the model emitted it multiple times, so the same line appeared repeatedly in the transcript. If Claude Code hadn’t been watching the transcript output for me, I definitely would not have noticed this myself, because there is no practical way to manually read through hundreds or thousands of transcript lines. After noticing the problem, it explained that larger models are more likely to produce loops and suggested trying large-v3-turbo. After switching to the smaller model, the problem did go away.
In the end, Claude Code and I made a decision: process each episode with large-v3 first, then analyze the transcript after it finishes. If the same line appears 5 or more times in a row, treat that as a loop and regenerate the transcript with large-v3-turbo. We used this method to process the remaining audio files. About one third of the transcripts had loops and needed to be regenerated.
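The loop check described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the exact code we used: it assumes the transcript is plain text with one segment per line and no timestamps, it ignores blank lines, and the function names are made up.

```python
def has_loop(lines, threshold=5):
    """Return True if the same non-empty line appears `threshold` or more
    times in a row."""
    previous = None
    run_length = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines do not break a run
        if line == previous:
            run_length += 1
        else:
            previous = line
            run_length = 1
        if run_length >= threshold:
            return True
    return False


def pick_model(transcript_lines, primary="large-v3", fallback="large-v3-turbo"):
    """Keep the primary model's transcript unless it contains a loop."""
    return fallback if has_loop(transcript_lines) else primary
```

The idea is to run large-v3 first, pass the resulting transcript lines to pick_model, and regenerate with large-v3-turbo only when it returns the fallback name.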
Whisper Heard My Name as “Kat”
For episodes I recorded, the opening always includes a line like “大家好,我是 Cat” (“Hello, this is Cat”), and Whisper would hear “Cat” as “Kat”. I asked Claude Code what to do. It found that Whisper accepts a text prompt, so I wrote the core podcast information into the prompt, including the correct spelling of my name:
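《牛油果烤面包》播客聊科技发展趋势,聊各行业来龙去脉。我们坐标硅谷,邀请第一线的资深专家分享给大家听!主持人:Cat、斯图亚特、Sean、Vindy、David。
(Roughly: “The Avocado Toast podcast talks about technology trends and the ins and outs of various industries. We are based in Silicon Valley and invite senior frontline experts to share with our listeners! Hosts: Cat, 斯图亚特, Sean, Vindy, David.”)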
With this prompt, Whisper became more likely to recognize my name correctly, although it is still not perfect.
Since I can provide a text prompt, I also append each episode’s description to the prompt. That way, when Whisper hears related information, it has a better chance of recognizing it correctly. I took a rough look at the results, and they seem pretty good.
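Here is a minimal sketch of how the prompt could be assembled per episode. The function names and the 300-character cap are assumptions for illustration. I added the cap because Whisper only conditions on a limited window of prompt tokens, so a very long description may not be used in full. As far as I know, whisper-cli takes the text through its --prompt option.

```python
BASE_PROMPT = (
    "《牛油果烤面包》播客聊科技发展趋势,聊各行业来龙去脉。"
    "我们坐标硅谷,邀请第一线的资深专家分享给大家听!"
    "主持人:Cat、斯图亚特、Sean、Vindy、David。"
)


def build_prompt(episode_description, base_prompt=BASE_PROMPT,
                 max_description_chars=300):
    """Append a whitespace-normalized episode description to the base prompt.

    The character cap is an arbitrary, hypothetical limit.
    """
    description = " ".join(episode_description.split())[:max_description_chars]
    if not description:
        return base_prompt
    return base_prompt + description


def prompt_args(prompt):
    """Extra whisper-cli arguments that carry the prompt."""
    return ["--prompt", prompt]
```

The list returned by prompt_args(build_prompt(description)) can then be appended to the rest of the whisper-cli arguments for that episode.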