An automated workflow for generating subtitles for Thai tech talks. The generated subtitles achieve high accuracy, with a word error rate (WER) of around 0.5–2.5%.
Most automatic speech recognizers that support Thai have trouble with technical talks. Most of them don't know technical terms, and some can't even spell a single English word. As a result, autogenerated captions from these systems tend to be unusable.
This workflow uses a combination of models:
- Speechmatics ASR — Transcribes Thai with lower accuracy, but with highly precise word-level timestamps.
- Gemini 1.5 Pro — This multimodal model can listen to and watch a tech talk and generate a highly accurate transcript. Being primarily a language model, it knows many technical terms. It can also handle speakers who use less common Thai words (such as words from the northern Thai region). Being multimodal, it can read the slides in the video and transcribe more accurately. However, the model processes media input in one-second chunks, so precise timestamps cannot be obtained from it. Moreover, it often hallucinates timecodes, so the timing information from this model is unusable. It also tends to ignore formatting instructions in the prompt.
- Claude 3.5 Sonnet — Used to post-process the transcript to improve its formatting and readability. It can also fix some errors in the transcript.
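Conceptually, the pipeline combines the strengths of each model: precise timing from Speechmatics, accurate text from Gemini, and cleanup from Claude. A minimal TypeScript sketch of the flow — all function names here are hypothetical placeholders, not real modules from this repository:

```ts
// Hypothetical pipeline sketch; the declared functions are placeholders.
type TimedWord = { text: string; start: number; end: number };
type Cue = { start: number; end: number; text: string };

declare function speechmaticsTranscribe(audioPath: string): Promise<TimedWord[]>;
declare function geminiTranscribe(audioPath: string): Promise<string>;
declare function claudeCleanUp(transcript: string): Promise<string>;
declare function align(transcript: string, words: TimedWord[]): Cue[];
declare function toVtt(cues: Cue[]): string;

async function generateSubtitles(audioPath: string): Promise<string> {
  // Speechmatics: rough text, but precise word-level timestamps.
  const words = await speechmaticsTranscribe(audioPath);
  // Gemini 1.5 Pro: accurate text (knows technical terms), unreliable timing.
  const transcript = await geminiTranscribe(audioPath);
  // Claude 3.5 Sonnet: fix formatting and residual errors.
  const cleaned = await claudeCleanUp(transcript);
  // Combine accurate words with precise timings into subtitle cues.
  return toVtt(align(cleaned, words));
}
```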
Putting these models together produces a subtitle file with very few errors. I can review the subtitles by playing the talk at 2× speed and fixing the errors I spot. Previously, with a less accurate transcript, I had to stop and fix the subtitles every few seconds. Now, with this workflow, sometimes a whole minute of reviewing goes by without my having to fix anything.
However, using this workflow can be quite expensive: it costs about 250 THB, or 7 USD, to process one hour of video. This cost can be roughly halved by processing just the audio, but without the video the transcription becomes slightly less accurate. (For comparison, Speechmatics costs 0.3 USD per hour of audio.)
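For a rough sense of the numbers, here is a back-of-envelope estimate, assuming the quoted rates scale linearly and the Speechmatics pass is billed on top of the Gemini pass:

```ts
// Assumed rates, taken from the figures above — not authoritative pricing.
const GEMINI_WITH_VIDEO_USD_PER_HOUR = 7;   // ≈ 250 THB
const GEMINI_AUDIO_ONLY_USD_PER_HOUR = 3.5; // roughly half
const SPEECHMATICS_USD_PER_HOUR = 0.3;

function estimateCostUsd(hours: number, audioOnly = false): number {
  const geminiRate = audioOnly
    ? GEMINI_AUDIO_ONLY_USD_PER_HOUR
    : GEMINI_WITH_VIDEO_USD_PER_HOUR;
  return hours * (geminiRate + SPEECHMATICS_USD_PER_HOUR);
}

console.log(estimateCostUsd(1.5)); // 1.5-hour talk with video ≈ 10.95 USD
```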
For examples of prompts and responses from the models, see the Prompt engineering section.
> [!WARNING]
> The code in this repository is very hacky and badly written.
## Prompt engineering

### Prompt for generating a transcript

### Prompts for cleaning up the transcript

### Prompts for alignment
## Prerequisites

- Node.js
- Bun
- tsx
- A tool that loads the `.env` file into your environment, such as mise-en-place.

Create a `.env` file with your API keys:

```
SPEECHMATICS_API_KEY=
ANTHROPIC_API_KEY=
GEMINI_API_KEY=
OPENAI_API_KEY=
```
## Usage

- Create a project folder.

  ```sh
  mkdir projects/my-project
  cd projects/my-project
  ```
- Add an audio file, either:

  - Manually: put an audio file as `audio.mp3` in the project folder.
  - Download from YouTube:

    ```sh
    ../../scripts/download_audio_from_youtube 'https://www.youtube.com/watch?v=vbIWSwz8NxQ'
    ```
- Start the server. This will show a URL to access the web interface.

  ```sh
  bun ../../server/index.ts
  ```
- Enter transcription notes. These help the model generate a more accurate transcript.
- Obtain word-level timestamps. These will be used for partitioning the long audio file into smaller segments, as well as for aligning the transcript with the ASR output. (The Speechmatics call behind this step is sketched after this list.)
- Generate partitions. This splits the audio file into smaller segments based on the gaps found in the audio, as determined by the word-level timestamps. (See the partitioning sketch after this list.)
- Transcribe with Gemini for each part. This generates a transcript for each segment using the Gemini model. (A sketch of such a call appears after this list.)

  > [!TIP]
  > Hold down the Alt key while clicking the menu item to perform the operation in the background. This lets you start multiple transcriptions more easily.
- Once an initial transcript is generated, improve it with Claude. (Also sketched after this list.)
- Once all the parts have transcripts, align them. (An illustration of the alignment idea appears after this list.)
- Once the alignment process is done, you can obtain the resulting VTT file. (The VTT format is sketched after this list.)

  > [!CAUTION]
  > The generated VTT file may contain errors. Please review it before using it.
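The sketches below illustrate, in TypeScript, what some of the steps above involve. They are assumptions about how such steps could be implemented, not the repository's actual code. First, the word-level timestamps step corresponds to a Speechmatics batch transcription job; this sketch follows my reading of the Speechmatics v2 REST API, so verify the endpoint and response shapes against the official docs before relying on it:

```ts
// Sketch: fetch word-level timestamps from the Speechmatics batch API.
import { readFile } from "node:fs/promises";

const API = "https://asr.api.speechmatics.com/v2";
const auth = { Authorization: `Bearer ${process.env.SPEECHMATICS_API_KEY}` };

export async function wordTimestamps(audioPath: string) {
  const form = new FormData();
  form.append("data_file", new Blob([await readFile(audioPath)]), "audio.mp3");
  form.append("config", JSON.stringify({
    type: "transcription",
    transcription_config: { language: "th" },
  }));

  const { id } = await (
    await fetch(`${API}/jobs/`, { method: "POST", headers: auth, body: form })
  ).json();

  // Poll until the job finishes (simplified: no error handling or backoff).
  for (;;) {
    await new Promise((resolve) => setTimeout(resolve, 10_000));
    const { job } = await (await fetch(`${API}/jobs/${id}`, { headers: auth })).json();
    if (job.status === "done") break;
  }

  const transcript = await (
    await fetch(`${API}/jobs/${id}/transcript?format=json-v2`, { headers: auth })
  ).json();

  // Each "word" result carries precise start/end times in seconds.
  return transcript.results
    .filter((r: any) => r.type === "word")
    .map((r: any) => ({
      text: r.alternatives[0].content,
      start: r.start_time,
      end: r.end_time,
    }));
}
```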
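The partitioning step splits on silences found via the word timestamps. A sketch of that idea; the 2-second threshold is an arbitrary assumption, not the repository's actual setting:

```ts
// Split the word stream wherever the gap between consecutive words is long
// enough to suggest a pause in speech.
type TimedWord = { text: string; start: number; end: number };

function partition(words: TimedWord[], minGapSeconds = 2): TimedWord[][] {
  const parts: TimedWord[][] = [];
  let current: TimedWord[] = [];
  for (const word of words) {
    const prev = current[current.length - 1];
    if (prev && word.start - prev.end >= minGapSeconds) {
      parts.push(current); // gap found: close the current segment
      current = [];
    }
    current.push(word);
  }
  if (current.length > 0) parts.push(current);
  return parts;
}
```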
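The per-part transcription step sends audio to Gemini. A sketch using the `@google/generative-ai` SDK; the prompt text is illustrative, not the repository's actual prompt (see the Prompt engineering section):

```ts
// Sketch: transcribe one audio segment with Gemini 1.5 Pro.
import { readFile } from "node:fs/promises";
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);

export async function transcribePart(audioPath: string, notes: string): Promise<string> {
  const model = genAI.getGenerativeModel({ model: "gemini-1.5-pro" });
  const audio = (await readFile(audioPath)).toString("base64");
  const result = await model.generateContent([
    { inlineData: { mimeType: "audio/mp3", data: audio } },
    // Illustrative prompt only; the transcription notes from the web UI
    // would be interpolated here.
    `Transcribe this Thai tech talk segment verbatim.\nNotes about the talk:\n${notes}`,
  ]);
  return result.response.text();
}
```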
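The Claude cleanup pass can be driven through the `@anthropic-ai/sdk` package. Again, the instruction text here is illustrative, not the repository's actual prompt:

```ts
// Sketch: post-process a transcript with Claude 3.5 Sonnet.
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

export async function cleanUpTranscript(transcript: string): Promise<string> {
  const message = await anthropic.messages.create({
    model: "claude-3-5-sonnet-20240620",
    max_tokens: 4096,
    messages: [{
      role: "user",
      content:
        "Fix the formatting and obvious transcription errors in this Thai " +
        "transcript. Reply with the corrected transcript only.\n\n" + transcript,
    }],
  });
  const block = message.content[0];
  return block.type === "text" ? block.text : "";
}
```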
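The repository performs alignment with a model prompt (see Prompts for alignment). For intuition only, here is a classical alternative: a longest-common-subsequence match between transcript tokens and the timed ASR words, which carries timestamps over to matched tokens. It assumes the transcript has already been segmented into word-like tokens comparable to the ASR's word units (Thai has no spaces between words, so segmentation matters):

```ts
type TimedWord = { text: string; start: number; end: number };
type TimedToken = { text: string; start?: number; end?: number };

function alignByLcs(tokens: string[], words: TimedWord[]): TimedToken[] {
  // dp[i][j] = length of the LCS of tokens[i..] and words[j..]
  const dp = Array.from({ length: tokens.length + 1 }, () =>
    new Array<number>(words.length + 1).fill(0),
  );
  for (let i = tokens.length - 1; i >= 0; i--) {
    for (let j = words.length - 1; j >= 0; j--) {
      dp[i][j] = tokens[i] === words[j].text
        ? dp[i + 1][j + 1] + 1
        : Math.max(dp[i + 1][j], dp[i][j + 1]);
    }
  }
  // Walk the DP table; matched tokens inherit the ASR word's timing.
  const out: TimedToken[] = [];
  let i = 0, j = 0;
  while (i < tokens.length) {
    if (j < words.length && tokens[i] === words[j].text) {
      out.push({ text: tokens[i], start: words[j].start, end: words[j].end });
      i++; j++;
    } else if (j < words.length && dp[i][j + 1] >= dp[i + 1][j]) {
      j++; // skip an ASR word with no transcript match
    } else {
      out.push({ text: tokens[i] }); // unmatched token: no timing
      i++;
    }
  }
  return out;
}
```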
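Finally, serializing aligned cues into a WebVTT file is straightforward. A sketch of the format only; cue grouping and line breaking are handled elsewhere in the workflow:

```ts
// Sketch: write subtitle cues as a WebVTT document.
type Cue = { start: number; end: number; text: string };

function toTimestamp(seconds: number): string {
  const h = Math.floor(seconds / 3600);
  const m = Math.floor((seconds % 3600) / 60);
  const s = (seconds % 60).toFixed(3).padStart(6, "0");
  return `${String(h).padStart(2, "0")}:${String(m).padStart(2, "0")}:${s}`;
}

function toVtt(cues: Cue[]): string {
  const body = cues
    .map((c) => `${toTimestamp(c.start)} --> ${toTimestamp(c.end)}\n${c.text}`)
    .join("\n\n");
  return `WEBVTT\n\n${body}\n`;
}

// Example: toVtt([{ start: 0, end: 2.5, text: "สวัสดีครับ" }])
// → "WEBVTT\n\n00:00:00.000 --> 00:00:02.500\nสวัสดีครับ\n"
```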