An automated workflow for generating subtitles for tech talks in Thai. The generated subtitles achieve high accuracy, with a word error rate (WER) of around 0.5–2.5%.
Most automatic speech recognizers that support Thai have trouble with technical talks. Most of them don’t know any technical terms, and some can’t even spell a single English word. As a result, the autogenerated captions from these systems tend to be unusable.
This workflow uses a combination of models:
- Speechmatics ASR — Transcribes Thai speech with lower accuracy, but with highly precise word-level timestamps.
- Gemini 1.5 Pro — This multimodal model can listen to and watch a tech talk and generate a highly accurate transcript. Being primarily a language model, it knows many technical terms. It can also handle speakers who use less common Thai words (such as words from the northern Thai region). Being multimodal, it can read the slides in the video and generate a more accurate transcript. However, the model processes media input in one-second chunks, so precise timestamps cannot be obtained from it. Moreover, it often hallucinates timecodes, so the timing information from this model is unusable. It also tends to ignore formatting instructions in the prompt.
- Claude 3.5 Sonnet — Used to post-process the transcript to improve its formatting, and to align the transcript with the ASR results.
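Conceptually, the pipeline combines precise ASR timing with accurate transcript text. The sketch below illustrates that data flow with hypothetical types and a naive greedy alignment; the actual alignment in this workflow is done by Claude, not by this code.

```typescript
// Hypothetical types sketching the data flow between the models.
interface AsrWord {
  text: string;  // often misrecognized for Thai technical terms
  start: number; // precise word-level start time (seconds)
  end: number;   // precise word-level end time (seconds)
}

interface Cue {
  text: string;  // accurate transcript line (Gemini, cleaned up by Claude)
  start: number; // timing borrowed from the nearest ASR words
  end: number;
}

// Naive greedy sketch: pair each transcript line with a run of ASR words
// and borrow their timestamps. The real alignment is model-driven.
function align(lines: string[], words: AsrWord[]): Cue[] {
  const cues: Cue[] = [];
  let i = 0;
  for (const line of lines) {
    // Consume roughly one ASR word per transcript word (crude heuristic).
    const n = Math.max(1, line.split(/\s+/).length);
    const slice = words.slice(i, Math.min(i + n, words.length));
    if (slice.length === 0) break;
    cues.push({ text: line, start: slice[0].start, end: slice[slice.length - 1].end });
    i += n;
  }
  return cues;
}
```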
Putting these models together produces a subtitle file with very few errors. I can review the subtitles by playing the talk at 2x speed and fixing any errors I spot. Previously, with a less accurate transcript, I had to stop and fix the subtitles every few seconds. Now, with this workflow, a whole minute of reviewing can sometimes go by without my having to fix anything.
However, using this workflow can be quite expensive — it costs about 250 THB, or 7 USD, to process one hour of video. This cost can be cut roughly in half by processing just the audio, but without the video the transcription becomes slightly less accurate. (For comparison, Speechmatics costs 0.3 USD per hour of audio.)
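A back-of-envelope estimate using the figures above (rates are approximate and will drift with provider pricing):

```typescript
// Approximate rates quoted in this README.
const GEMINI_VIDEO_USD_PER_HOUR = 7.0; // ~250 THB/hour with video input
const GEMINI_AUDIO_USD_PER_HOUR = 3.5; // roughly half when sending audio only
const SPEECHMATICS_USD_PER_HOUR = 0.3;

// Estimate the total cost in USD for a talk of the given length.
function estimateCostUsd(hours: number, audioOnly = false): number {
  const gemini = audioOnly ? GEMINI_AUDIO_USD_PER_HOUR : GEMINI_VIDEO_USD_PER_HOUR;
  return hours * (gemini + SPEECHMATICS_USD_PER_HOUR);
}
```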
For examples of the prompts and responses from the models, see the Prompt engineering section.
Warning
The code in this repository is very hacky and badly written.
Prompt for generating a transcript

Prompt:
Output:
Prompts for cleaning up the transcript

Prompt:
Output:
Prompts for alignment

Prompt:
Output:
- Node.js
- Bun
- tsx
- A tool that loads the `.env` file into your environment, such as mise-en-place.
Create a `.env` file with the following keys:
SPEECHMATICS_API_KEY=
ANTHROPIC_API_KEY=
GEMINI_API_KEY=
OPENAI_API_KEY=
# Start a project folder.
mkdir projects/my-project
cd projects/my-project
# Put video file in the project folder as "video.mp4".
# Then, extract audio from the video file.
ffmpeg -i video.mp4 -vn -c:a libmp3lame -q:a 4 audio.mp3
# Create a "notes.txt" file. Put in relevant information about the video.
# This helps improve the transcript accuracy.
touch notes.txt
# Perform ASR. This generates an ASR result file, which is quite
# inaccurate for Thai language, but has timing information.
tsx ../../scripts/asr.ts
# Use the ASR result to partition the video/audio into parts.
tsx ../../scripts/partition.ts
# Cut the audio and video into parts.
tsx ../../scripts/video_slice.ts
tsx ../../scripts/audio_slice.ts
# For first part:
export PART_NAME=part_01
tsx ../../scripts/audio_transcribe.ts # - or -
tsx ../../scripts/video_transcribe.ts
# Create an "improvement_notes.txt" file.
# These notes will be used in the transcript improvement process.
touch improvement_notes.txt
# Improve the transcript.
tsx ../../scripts/transcript_improve.ts
# Align the improved transcript with the ASR result.
tsx ../../scripts/align.ts
# For subsequent parts:
tsx ../../scripts/process_part.ts part_02
# Finally, combine the aligned parts into a single transcript.
tsx ../../scripts/create_vtt.ts
# Calculate the price (how much we paid).
tsx ../../scripts/calculate_price.ts
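The final step emits a WebVTT subtitle file. As a rough illustration of what that involves, here is a minimal WebVTT serializer; the cue shape and function names are assumptions for this sketch and do not reflect the internals of `create_vtt.ts`.

```typescript
// A subtitle cue with times in seconds.
interface Cue {
  start: number;
  end: number;
  text: string;
}

// Format seconds as a WebVTT timestamp: HH:MM:SS.mmm
function vttTime(sec: number): string {
  const h = Math.floor(sec / 3600);
  const m = Math.floor((sec % 3600) / 60);
  const s = Math.floor(sec % 60);
  const ms = Math.round((sec - Math.floor(sec)) * 1000);
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)}.${pad(ms, 3)}`;
}

// Serialize cues into a WebVTT document: a "WEBVTT" header,
// then one blank-line-separated block per cue.
function toVtt(cues: Cue[]): string {
  const body = cues
    .map((c) => `${vttTime(c.start)} --> ${vttTime(c.end)}\n${c.text}`)
    .join("\n\n");
  return `WEBVTT\n\n${body}\n`;
}
```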