diff --git a/content/post/20230303_transcribe/arch.png b/content/post/20230303_transcribe/arch.png
new file mode 100644
index 0000000..56679e8
Binary files /dev/null and b/content/post/20230303_transcribe/arch.png differ
diff --git a/content/post/20230303_transcribe/index.md b/content/post/20230303_transcribe/index.md
index 2647f72..3f019a5 100644
--- a/content/post/20230303_transcribe/index.md
+++ b/content/post/20230303_transcribe/index.md
@@ -1,8 +1,8 @@
 +++
-title = "Transcribing Videos with OpenAI's Whisper and ChatGPT APIs"
+title = "Transcribing Videos with OpenAI's Whisper and ChatGPT (and yt-dlp and ffmpeg and pydub and imagehash and pandoc)"
 date = 2023-03-03T00:00:00
 lastmod = 2023-03-03T00:00:00
-draft = true
+draft = false
 
 # Authors. Comma separated list, e.g. `["Bob Smith", "David Jones"]`.
 authors = ["Carl Pearson"]
@@ -61,20 +61,76 @@ ChatGPT is basically free - Whisper is 30x as expensive -- but the whole thing s
 
 Youtube -> file.webm -> Whisper -> file-1.txt...file-N.txt -> ChatGPT -> clean-1.txt...clean-N.txt -> transcript
 
-## Design Considerations
+![](arch.png)
 
+The high-level design is:
+1. Use [yt-dlp] to download a talk from Youtube.
+2. Use [ffmpeg] to extract the audio from the video.
+3. Use [pydub] to detect non-silent regions of audio.
+4. Use [OpenAI's Whisper] to transcribe the audio.
+5. Use [OpenAI's ChatGPT] to clean up the text.
+6. Use [ffmpeg] again to extract frames from the talk.
+7. Use [Pandoc] to stitch the text and frames together into a summary document.
 
-The most pressing limit is ChatGPT's limit context: around $4k tokens.
+## Acquiring the Source Video and Audio
+
+If you don't already have access to a talk, consider something like [yt-dlp], which will allow you to download video from most websites, including Youtube.
+Then, I use [ffmpeg] to extract the audio track from the video.
+This audio track will be provided to OpenAI's Whisper API.
+
+## First Expected Problem: ChatGPT's Context Size
+
+The most pressing limit is ChatGPT's context limit: around 4k tokens.
 For our purposes, we expect to generate slightly more than one output token for each input token, since ChatGPT will be asked to reproduce the input text with added paragraph breaks.
 This means our input is limited to around 2000 tokens per API call.
 At 120 tokens per minute, we'd expect to reach that limit after 15 minutes.
-This means we need to split the input audio into chunks no longer than 15 minutes each.
+In practice, ChatGPT has a hard time reproducing text that is 2000 tokens long, so I use a 5-minute window instead of 15 minutes.
 
+## Second Expected Problem: Transcribing Partial Words
-
-If we're not smart about that splitting, we might end up cutting a word in half, which will limit the accuracy off the Whisper API transcription on those words.
+If we're not smart about where we split audio, we might end up cutting a word in half, which will limit the accuracy of the Whisper API transcription on those words.
 We'd rather make shorter chunks that are split when there is silence in the video.
 From a monetary cost perspective, it actually doesn't matter how short the chunks are -- OpenAI is billing us for each second of audio and for each word processed by ChatGPT.
 Regardless of how short the chunks are, the total audio length and words processed by ChatGPT are the same.
 
+## Splitting Audio
+
+To (attempt to) avoid splitting words, I use [pydub] to detect silence.
+I arbitrarily pick a silence threshold, and relax that threshold until no noisy region is longer than our 5-minute chunk limit.
+That means there is (hopefully) some safe place to split the audio at least every five minutes.
+
+This leaves many very short audio segments.
+OpenAI says Whisper does better with as much context as possible, so I greedily recombine smaller audio chunks into segments no longer than five minutes.
+Combining them largest-to-smallest gives the smallest ones the best chance to be squeezed in beside their larger neighbors.
+These recombined chunks may have some silent regions within them -- that's fine.
+The only downside is that you pay OpenAI to transcribe nothing in these silent regions.
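+
+Putting the steps so far into code, a rough sketch might look like the following.
+The URL, file names, chunk length, and thresholds are placeholders, and the recombination below is an order-preserving simplification of the largest-to-smallest pass described above.
+
+```python
+import subprocess
+
+from pydub import AudioSegment
+from pydub.silence import detect_nonsilent
+
+URL = "https://www.youtube.com/watch?v=..."  # placeholder talk URL
+
+# Download the talk and extract its audio track (yt-dlp and ffmpeg must be on PATH).
+subprocess.run(["yt-dlp", "-o", "talk.webm", URL], check=True)
+subprocess.run(["ffmpeg", "-y", "-i", "talk.webm", "-vn", "talk.mp3"], check=True)
+
+audio = AudioSegment.from_file("talk.mp3")
+LIMIT_MS = 5 * 60 * 1000  # 5-minute chunk limit
+
+# Find non-silent regions, relaxing the silence threshold until no noisy
+# region is longer than the chunk limit.
+thresh_dbfs = audio.dBFS - 16  # arbitrary starting threshold
+while True:
+    regions = detect_nonsilent(audio, min_silence_len=500, silence_thresh=thresh_dbfs)
+    if all(end - start <= LIMIT_MS for start, end in regions):
+        break
+    thresh_dbfs += 2  # relax: treat more of the audio as silence and retry
+
+# Greedily recombine adjacent regions into chunks no longer than the limit.
+chunks = []
+cur_start, cur_end = None, None
+for start, end in regions:
+    if cur_start is None:
+        cur_start, cur_end = start, end
+    elif end - cur_start <= LIMIT_MS:
+        cur_end = end  # absorb this region (and the silence before it)
+    else:
+        chunks.append((cur_start, cur_end))
+        cur_start, cur_end = start, end
+if cur_start is not None:
+    chunks.append((cur_start, cur_end))
+
+# Export each chunk for the Whisper API.
+for i, (start, end) in enumerate(chunks):
+    audio[start:end].export(f"chunk-{i:03d}.mp3", format="mp3")
+```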
+
+## Generating Text
+
+Each five-minute audio file is provided to OpenAI's Whisper API.
+The resulting text is unformatted, with no metadata, but it does have punctuation.
+I then pass it to ChatGPT with the following prompt:
+
+> System:
+> You split text into paragraphs and correct transcription errors
+
+> User:
+> Split the following text into paragraphs WITHOUT adding or removing anything:\n{text}
+
+ChatGPT is quite good at splitting this unformatted text into paragraphs.
+I also consider the breaks between the five-minute chunks to be paragraph breaks, which works fine in practice since there is a silent pause there anyway.
+
+## Generating Screenshots
+
+The final ingredient needed is a screen capture of the video to go along with each paragraph.
+I know what timestamp is associated with each five-minute chunk, and I can look up which five-minute chunk each paragraph came from.
+The source chunk and the paragraph's location within it give a very accurate timestamp for each paragraph of text.
+I use [ffmpeg] to extract a frame from the video for each paragraph.
+
+## The Final Summary
+
+A markdown document is generated by inserting each paragraph in turn.
+A screenshot is inserted as well, *unless* it is too similar to the last inserted screenshot.
+This happens when the speaker lingers on a slide for a while, generating a lot of text without changing the video much.
+Finally, I use [pandoc] to convert that markdown file into a PDF.
\ No newline at end of file
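
For the Whisper and ChatGPT steps described above, a minimal sketch using the `openai` Python package as it looked in early 2023 (the chunk file names follow the splitting sketch earlier, the API key is a placeholder, and newer versions of the package expose a different interface):

```python
import glob

import openai  # pre-1.0 interface, current as of the post's date

openai.api_key = "sk-..."  # placeholder

def transcribe_chunk(path):
    # Whisper returns plain text with punctuation but no other formatting.
    with open(path, "rb") as f:
        return openai.Audio.transcribe("whisper-1", f)["text"]

def add_paragraphs(text):
    # Ask ChatGPT to reproduce the text verbatim, adding only paragraph breaks.
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "You split text into paragraphs and correct transcription errors"},
            {"role": "user",
             "content": f"Split the following text into paragraphs WITHOUT adding or removing anything:\n{text}"},
        ],
    )
    return resp["choices"][0]["message"]["content"]

chunk_files = sorted(glob.glob("chunk-*.mp3"))
cleaned = [add_paragraphs(transcribe_chunk(p)) for p in chunk_files]
```

For the screenshots and the final document, a sketch combining ffmpeg, imagehash, and pandoc; the `paragraphs` list, the frame file names, and the similarity threshold are illustrative:

```python
import subprocess

import imagehash
from PIL import Image

# (paragraph text, timestamp in seconds) pairs derived from the chunk boundaries.
paragraphs = []  # placeholder: filled in from the cleaned chunks above

def grab_frame(video, seconds, out_png):
    # Seek to the paragraph's timestamp and write a single frame.
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(seconds), "-i", video, "-frames:v", "1", out_png],
        check=True,
    )

lines = []
last_hash = None
for i, (text, seconds) in enumerate(paragraphs):
    png = f"frame-{i:03d}.png"
    grab_frame("talk.webm", seconds, png)
    h = imagehash.average_hash(Image.open(png))
    # Only insert a screenshot if it differs enough from the previous one.
    if last_hash is None or (h - last_hash) > 6:  # Hamming distance; 6 is a guess
        lines.append(f"![]({png})\n")
        last_hash = h
    lines.append(text + "\n")

with open("summary.md", "w") as f:
    f.write("\n".join(lines))

# pandoc converts the markdown summary into a PDF (needs a LaTeX engine installed).
subprocess.run(["pandoc", "summary.md", "-o", "summary.pdf"], check=True)
```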