+++
title = "Transcribing Videos with OpenAI's Whisper and ChatGPT"
date = 2023-03-03T00:00:00
lastmod = 2023-03-06T00:00:00
draft = false

# Authors. Comma separated list, e.g. `["Bob Smith", "David Jones"]`.
authors = ["Carl Pearson"]

tags = []

summary = "Transcribing Videos with OpenAI's Whisper and ChatGPT (and yt-dlp and ffmpeg and pydub and imagehash and pandoc)"

# Projects (optional).
# Associate this post with one or more of your projects.

categories = []

+++

At $0.006/min and $0.002/1k tokens, OpenAI's Whisper and ChatGPT APIs are cheap enough to play with.
I developed a "video-to-pdf" transcription system for recorded talks to learn more about them.

Let's do some back-of-the-envelope calculations about this system.
It has two pieces: OpenAI's Whisper for the speech-to-text, and then OpenAI's ChatGPT to clean up any transcription errors and break the text into paragraphs.

A fast English speaker reaches around [160 words per minute](https://virtualspeech.com/blog/average-speaking-rate-words-per-minute).
OpenAI's rule of thumb is that one token is about 0.75 words of standard English (roughly 1.33 tokens per word), meaning our hypothetical fast, non-stop speaker is generating about 213 tokens per minute, or roughly 12,800 per hour.
If we had to pass those through ChatGPT (one token out for each token in), we would get the following costs:

| Service | Price | One hour of speech |
|---------|-------|---------------------|
| Whisper | $0.006 / min | 36 cents |
| ChatGPT | $0.002 / 1k tokens | ~5 cents |

ChatGPT is basically free, and while Whisper is roughly 7x as expensive, the whole thing still comes out to less than $0.50 to transcribe an hour of speech.
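
As a sanity check, here is the same arithmetic as a few lines of Python, using the list prices quoted above:

```python
# Back-of-the-envelope cost of transcribing one hour of fast, non-stop speech.
WHISPER_PER_MIN = 0.006      # dollars per minute of audio
CHATGPT_PER_1K = 0.002       # dollars per 1k tokens (input and output alike)
TOKENS_PER_WORD = 1 / 0.75   # OpenAI's rule of thumb: one token is ~0.75 words

tokens_per_hour = 160 * 60 * TOKENS_PER_WORD   # ~12,800 tokens
whisper_cost = 60 * WHISPER_PER_MIN            # $0.36
chatgpt_cost = 2 * tokens_per_hour / 1000 * CHATGPT_PER_1K  # tokens in + out
print(f"Whisper ${whisper_cost:.2f} + ChatGPT ${chatgpt_cost:.3f}")  # 0.36 + 0.051
```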

## High-Level Design

![the pipeline](whisper-pipeline.png)

The high level design is:

1. Use [yt-dlp](https://github.com/yt-dlp/yt-dlp) to download a talk from Youtube.
2. Use [FFmpeg](https://github.com/FFmpeg/FFmpeg) to extract the audio from the video.
3. Use [pydub](https://github.com/jiaaro/pydub) to detect non-silent regions of audio.
4. Use [OpenAI's Whisper](https://platform.openai.com/docs/guides/speech-to-text) to transcribe the audio.
5. Use [OpenAI's ChatGPT](https://platform.openai.com/docs/guides/chat) to clean up the text.
6. Use [FFmpeg](https://github.com/FFmpeg/FFmpeg) again to extract frames from the talk.
7. Use [Pandoc](https://github.com/jgm/pandoc) to stitch the text and frames together into a summary document.

## Acquiring the Source Video and Audio

If you don't already have access to a talk, consider something like [yt-dlp](https://github.com/yt-dlp/yt-dlp), which will allow you to download video from most websites, including Youtube.
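
For the download itself, a minimal sketch using yt-dlp's Python API; the URL and output template are placeholders, not the exact options I used:

```python
import yt_dlp  # pip install yt-dlp

# Save the talk as input.<ext>, letting yt-dlp pick the container format.
options = {"outtmpl": "input.%(ext)s"}
with yt_dlp.YoutubeDL(options) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])
```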

Then, I use [ffmpeg](https://github.com/FFmpeg/FFmpeg) to extract the audio track from the video.

```bash
ffmpeg -i input.mp4 -map 0:a output.mp3
```

This audio track will be provided to OpenAI's Whisper API.
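
Calling that API is short; a sketch assuming the 0.27-era `openai` Python package and an `OPENAI_API_KEY` in the environment:

```python
import openai  # pip install openai

# Send the extracted audio to the hosted whisper-1 model.
with open("output.mp3", "rb") as audio_file:
    result = openai.Audio.transcribe("whisper-1", audio_file)
print(result["text"])
```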

## First Expected Problem: ChatGPT's Context Size

ChatGPT's context window cannot hold an hour of text at once, so the transcript has to be cleaned up in pieces; I work in five-minute chunks.
Regardless of how short the chunks are, the total audio length and words processed stay the same, so the cost estimate above is unchanged.

## Splitting Audio

To (attempt to) avoid splitting words, I use [pydub](https://github.com/jiaaro/pydub) to detect silence.
I arbitrarily pick a silence threshold, and relax that threshold until no noisy region is longer than our 5-minute chunk.
That means there is (hopefully) some safe place to split the text at least every five minutes.
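
A sketch of that search with pydub's `detect_nonsilent`; the starting threshold, step size, and minimum silence length here are arbitrary illustrative values:

```python
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

CHUNK_MS = 5 * 60 * 1000  # five-minute chunks, in milliseconds

audio = AudioSegment.from_file("output.mp3")
threshold = -40  # dBFS; audio quieter than this counts as silence
while True:
    # Noisy regions as [start_ms, end_ms] pairs.
    noisy = detect_nonsilent(audio, min_silence_len=500, silence_thresh=threshold)
    if all(end - start <= CHUNK_MS for start, end in noisy):
        break  # every noisy region now fits inside one chunk
    threshold += 5  # relax: treat more of the quiet audio as silence
```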

I also considered the breaks between the five-minute chunks to be paragraph breaks.
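
The cleanup request might look something like this, assuming the 0.27-era `ChatCompletion` interface; the prompt wording is a hypothetical stand-in, not my exact prompt:

```python
import openai

def clean_chunk(chunk_text: str) -> str:
    # Ask gpt-3.5-turbo to fix errors and insert paragraph breaks.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Fix transcription errors and break "
             "the text into paragraphs. Reply with only the corrected text."},
            {"role": "user", "content": chunk_text},
        ],
    )
    return response["choices"][0]["message"]["content"]
```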

The final ingredient needed is a screencapture of the video to go along with each paragraph.
I know what timestamp is associated with each five-minute chunk, and I can look up where among the five-minute chunks each paragraph came from.
The source chunk and location within the chunk give a very accurate timestamp for each paragraph of text.
I use [ffmpeg](https://github.com/FFmpeg/FFmpeg) to extract a frame from the video for each paragraph.

```bash
ffmpeg -y -ss 01:23:45 -i input.webm -frames:v 1 -q:v 2 output.jpg
```
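
The `01:23:45` above is a placeholder; the real value comes from the chunk bookkeeping. A sketch of that computation, with hypothetical names (`chunk_start_ms` is the chunk's offset in the full audio, `fraction` is how far into the chunk the paragraph sits):

```python
def frame_timestamp(chunk_start_ms: int, chunk_len_ms: int, fraction: float) -> str:
    """Turn a paragraph's position into an HH:MM:SS argument for ffmpeg -ss."""
    total_s = int((chunk_start_ms + fraction * chunk_len_ms) // 1000)
    hours, rem = divmod(total_s, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"
```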

## The Final Summary

A markdown document is generated by inserting each paragraph in turn.
A screenshot is inserted as well, *unless* it is too similar to the last inserted screenshot.
This happens when the speaker lingers on a slide for a while, generating a lot of text without changing the video much.
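
The similarity test is where [imagehash](https://github.com/JohannesBuchner/imagehash) (from this post's title) comes in; a sketch with an arbitrary distance cutoff:

```python
import imagehash
from PIL import Image

CUTOFF = 8  # arbitrary hamming-distance cutoff between perceptual hashes

def too_similar(frame_a: str, frame_b: str) -> bool:
    # Frames showing the same slide hash to nearly the same value.
    hash_a = imagehash.phash(Image.open(frame_a))
    hash_b = imagehash.phash(Image.open(frame_b))
    return hash_a - hash_b <= CUTOFF
```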

Finally, I use [Pandoc](https://github.com/jgm/pandoc) to convert that markdown file into a PDF.
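
That conversion is a single pandoc invocation; from Python it might look like this (file names are placeholders, and PDF output requires a LaTeX engine to be installed):

```python
import subprocess

# pandoc infers the input and output formats from the file extensions.
subprocess.run(["pandoc", "summary.md", "-o", "summary.pdf"], check=True)
```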