Tuesday, June 02, 2026

Generating Podcast Transcripts with Whisper

In my previous post, I wrote about building my own LLM Wiki. I definitely need to include Avocado Toast, the podcast I co-host. The problem is that podcast content is mostly audio, while all the skills I have written are built to process Markdown-based text. So how do I include audio?

After discussing this with ChatGPT and Claude, I settled on using OpenAI’s open-source Whisper model to generate transcripts from audio files. On macOS, I only need to install whisper.cpp, a C/C++ port of Whisper, with brew install whisper-cpp, then call whisper-cli to process audio files. It sounds simple, but in practice there are always pitfalls.
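For illustration, here is a minimal sketch of calling whisper-cli from Node. It is not the code Claude Code wrote. The function names and all paths are hypothetical, and it assumes a ggml model file has already been downloaded and that the audio is in a format the installed whisper-cli build can read.

```javascript
const { execFileSync } = require("node:child_process");

// Build the argument list for whisper-cli.
// Every path passed in here is a hypothetical placeholder.
function buildWhisperArgs({ modelPath, audioPath, outputBase, language = "auto" }) {
  return [
    "-m", modelPath,   // ggml model file, e.g. a large-v3 download
    "-f", audioPath,   // input audio file
    "-l", language,    // spoken language code, or "auto" to detect it
    "-otxt",           // write a plain-text transcript
    "-of", outputBase, // output path without the file extension
  ];
}

// Run whisper-cli synchronously and stream its progress to the terminal.
function transcribe(options) {
  execFileSync("whisper-cli", buildWhisperArgs(options), { stdio: "inherit" });
}
```

A call such as transcribe({ modelPath: "models/ggml-large-v3.bin", audioPath: "episode.wav", outputBase: "transcripts/episode", language: "zh" }) would then write transcripts/episode.txt, assuming those files exist.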

The large-v3 Model Can Produce Loops

Whisper currently has two mainstream models: large-v3 and large-v3-turbo. The former is larger and slower, but has better output quality. The latter is smaller and 2 to 3 times faster, but its output is a little less accurate. I asked Claude Code to write the code that actually calls whisper-cli and test both models. It found that large-v3 was indeed better, so we chose that first, even though processing each podcast episode took about 30 minutes.

After we had processed a few episodes, Claude Code noticed loops in the transcripts. A loop means a sentence was spoken only once in the audio, but the model emitted it multiple times, so the same line appeared repeatedly in the transcript. If Claude Code hadn’t been watching the transcript output for me, I definitely would not have noticed this myself. There is no practical way to manually read through hundreds or thousands of transcript lines. After noticing the problem, it explained to me that larger models are more likely to produce loops and suggested trying large-v3-turbo. After switching to the smaller model, the problem did go away.

In the end, Claude Code and I made a decision: process each episode with large-v3 first, then analyze the transcript after it finishes. If the same line appears 5 or more times in a row, treat that as a loop and regenerate the transcript with large-v3-turbo. We used this method to process the remaining audio files. About one third of the transcripts had loops and needed to be regenerated.
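The rule above can be sketched in a few lines of JavaScript. This is an illustration under assumptions rather than the actual implementation: it assumes the transcript is plain text with one utterance per line and no timestamps (timestamps would make every line unique), it ignores blank lines, and transcribeWith is a hypothetical callback that runs Whisper with a given model and returns the transcript text.

```javascript
// Return true if any non-empty line repeats `threshold` or more times in a row.
function hasLoop(transcript, threshold = 5) {
  let previous = null;
  let run = 0;
  for (const raw of transcript.split("\n")) {
    const line = raw.trim();
    if (line === "") continue; // blank lines neither extend nor break a run
    run = line === previous ? run + 1 : 1;
    previous = line;
    if (run >= threshold) return true;
  }
  return false;
}

// Try large-v3 first and fall back to large-v3-turbo only when a loop is found.
// `transcribeWith` is a hypothetical callback: (modelName) => transcript text.
function transcribeWithFallback(transcribeWith) {
  const first = transcribeWith("large-v3");
  return hasLoop(first) ? transcribeWith("large-v3-turbo") : first;
}
```

Comparing only consecutive lines keeps the check cheap and avoids flagging phrases that legitimately recur throughout an episode, such as a host's catchphrase.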

Whisper Heard My Name as “Kat”

For episodes I recorded, the opening always includes a line like “大家好,我是 Cat” (“Hello, this is Cat”), and Whisper would hear “Cat” as “Kat”. I asked Claude Code what to do. It found that Whisper accepts a text prompt, so I wrote the core podcast information into the prompt, including the correct spelling of my name:

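《牛油果烤面包》播客聊科技发展趋势,聊各行业来龙去脉。我们坐标硅谷,邀请第一线的资深专家分享给大家听!主持人:Cat、斯图亚特、Sean、Vindy、David。

(In English: “The Avocado Toast podcast covers technology trends and the ins and outs of various industries. We are based in Silicon Valley and invite seasoned experts from the front lines to share with our listeners! Hosts: Cat, Stuart, Sean, Vindy, David.”)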

With this prompt, Whisper became more likely to recognize my name correctly, although it is still not perfect.

Since I can provide a text prompt, I also append each episode’s description to the prompt. That way, when Whisper hears related information, it has a better chance of recognizing it correctly. I took a rough look at the results, and they seem pretty good.
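A minimal sketch of assembling such a prompt, assuming the episode description is available as a plain string. The names BASE_PROMPT and buildPrompt are hypothetical. The result would be passed to whisper-cli through its --prompt option.

```javascript
// The fixed podcast description, including the correct spelling of each host's name.
const BASE_PROMPT =
  "《牛油果烤面包》播客聊科技发展趋势,聊各行业来龙去脉。我们坐标硅谷,邀请第一线的资深专家分享给大家听!主持人:Cat、斯图亚特、Sean、Vindy、David。";

// Append one episode's description to the base prompt.
// Whitespace runs (including newlines) are collapsed so the prompt stays on one line.
function buildPrompt(episodeDescription) {
  const description = (episodeDescription || "").replace(/\s+/g, " ").trim();
  return description ? `${BASE_PROMPT} ${description}` : BASE_PROMPT;
}
```

Whisper conditions on only a limited window of prompt tokens, so a very long description may be partly ignored. Putting the names and terms that matter most into a short prompt is likely the safer choice.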

Wednesday, May 20, 2026

Pitfalls I Hit Writing LLM Skills

I recently started building my own LLM Wiki. My document structure is different enough from existing examples that I had to write my own skills from scratch. (I also just like tinkering.) Along the way I hit a few pitfalls worth documenting. These may not stay relevant for long — LLM capabilities move fast, and what requires a workaround today might just work in six months.

LLMs Make Mistakes Copying Semantic-Free Text

I track which blog posts have changed and need their summaries regenerated by computing a SHA256 hash of each post in JavaScript, writing it into the summary document’s front matter, and regenerating only when the hash doesn’t match. The skill takes the computed hash from JavaScript, gives it to the LLM and asks it to copy the hash into the front matter. Every so often, it would copy the hash with one character wrong.

I asked Claude Code why this happens. It explained that LLMs are good at predicting the next token based on context, but a SHA256 hash has no semantic content — it’s effectively random characters. With no meaning to latch onto, the LLM occasionally produces a wrong character. Claude Code updated the skill to strongly emphasize that the hash must be copied exactly. That reduced the errors. Eventually I’ll rework the flow to have JavaScript generate the front matter with the hash already in place and have the LLM fill in only the content — so there’s nothing to copy wrong.
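The reworked flow could look roughly like the sketch below, in which the LLM never touches the hash. This is a sketch under assumptions: the front matter field name source_hash and all function names are hypothetical, and the summary document is assumed to use a simple key: value front matter.

```javascript
const { createHash } = require("node:crypto");

// SHA256 of the post text as a 64-character lowercase hex string.
function sha256Hex(text) {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// True when the summary has no stored hash or the stored hash no longer
// matches the post. `source_hash` is a hypothetical field name.
function needsRegeneration(postText, summaryDocument) {
  const match = summaryDocument.match(/^source_hash:\s*([0-9a-f]{64})\s*$/m);
  return !match || match[1] !== sha256Hex(postText);
}

// Code writes the front matter; the LLM supplies only `summaryBody`.
function buildSummaryDocument(postText, summaryBody) {
  return `---\nsource_hash: ${sha256Hex(postText)}\n---\n\n${summaryBody.trim()}\n`;
}
```

Because the hash is computed and written by the same code that later compares it, a single wrong character can no longer slip in during copying.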

LLMs Have No Sense of Time Passing

My skill writes the current UTC timestamp into the document front matter. I ran it once, waited an hour, ran it again, and found it had written the same timestamp as before.

An LLM running in the same conversation remembers the current time from when it first looked it up and doesn’t check again. Once it has a timestamp, it treats that as the current time indefinitely — like a stopped clock. To fix this, I had the skill explicitly call a script to get the timestamp instead of asking the LLM to produce it. Since I’m already using JavaScript in the skill, I just have the LLM run:

node -e 'console.log(new Date().toISOString())'

Self-Contradiction and Over-Cleverness

I wrote an interactive skill that finds candidate concept pairs in my wiki that could be merged (synonyms, related concepts, or parent-child relationships) and asks me whether to merge them and in which direction. The final call is mine.

The skill presents a 4-option menu for each pair:

  • Merge A into B
  • Merge B into A
  • Dismiss (don’t suggest merging A and B again)
  • Skip (suggest again next time)

I also asked the LLM to recommend a direction if it could decide, placing that option first with a “(Recommended)” label. So the expected output looks like:

  • Merge B into A (Recommended)
  • Merge A into B
  • Dismiss
  • Skip

What I actually got was:

  • Dismiss (Recommended)
  • Merge B into A
  • Merge A into B
  • Skip

That’s self-contradictory. The skill instructs the LLM to surface only pairs worth merging. If it recommends not merging, it’s undermining its own earlier judgment.

Even funnier: sometimes the LLM would bundle two pairs together and present options like:

  • Skip both (neither A+B nor C+D)
  • Review each pair individually

I went back and had Claude Code tighten the constraints in the skill prompt. The behaviors went away. I still don’t understand exactly why more constraints help, or at what point adding more constraints starts degrading the quality of the skill’s output.
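Separately from the prompt constraints, the menu rules are mechanical enough to check in code before the options reach me. The sketch below is one possible harness-level guard, not something the skill currently does. The function names are hypothetical, and it assumes each option arrives as a plain string using the literal labels shown above.

```javascript
const RECOMMENDED = "(Recommended)";

// Map an option string to its kind, ignoring the "(Recommended)" tag.
function classify(option) {
  const label = option.replace(RECOMMENDED, "").trim();
  if (label === "Merge A into B" || label === "Merge B into A") return "merge";
  if (label.startsWith("Dismiss")) return "dismiss";
  if (label.startsWith("Skip")) return "skip";
  return "unknown";
}

// Return a list of rule violations; an empty list means the menu is acceptable.
function validateMenu(options) {
  const problems = [];
  const kinds = options.map(classify);
  if (options.length !== 4) problems.push("expected exactly 4 options");
  if (kinds.includes("unknown")) problems.push("unexpected option");
  if (kinds.filter((kind) => kind === "merge").length !== 2) {
    problems.push("expected both merge directions");
  }
  options.forEach((option, index) => {
    if (!option.includes(RECOMMENDED)) return;
    if (kinds[index] !== "merge") problems.push("only a merge direction may be recommended");
    if (index !== 0) problems.push("the recommended option must come first");
  });
  return problems;
}
```

A menu that fails validation could be sent back to the LLM with the list of problems, which catches both the “Dismiss (Recommended)” case and the bundled two-pair case.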

Takeaways

All of these happened on Sonnet 4.6. Whether the same issues occur on Opus 4.7, I don’t know. That’s the frustrating thing about LLM pitfalls — you can never be sure if a problem is model-specific or version-specific. A fix that needs to live at the harness level today might be unnecessary six months from now. Whether this post ages well is genuinely unclear.