WhisperX provides word-level timestamps for audio files, but often you'll need to "force align" audio perfectly to source-of-truth transcript text. This capability is offered by stable-ts.
Here we've created an opinionated isolation of stable-ts's alignment methods. We've wrapped this logic in a Cog interface and simplified its outputs so it can be used as a standalone endpoint, e.g., on replicate.com.
If your audio is extremely clean (e.g., AI-generated), you can use a lighter-weight model such as forced-alignment-model, which is based on Meta's MMS model via torchaudio. But even a little background noise can throw off the outputs.
- Transcription: Convert audio files into text using the `stable_whisper` model.
- Alignment: Align provided transcripts with audio files to enhance accuracy.
- Probability Scores: Optionally display word-level probability scores.
- Flexible Inputs: Supports various input configurations, including specifying language and transcript text.
Use the Replicate model as-is.
- Python 3.12
- Cog installed
Clone the Repository
git clone https://github.com/crone-ai/force-align-wordstamps
cd force-align-wordstamps
Create a Virtual Environment
python3.12 -m venv venv
source venv/bin/activate
Install Dependencies
pip install -r requirements.txt
Install Cog
Follow the Cog installation guide to install and set up Cog if you are deploying to Replicate or another containerized environment.
The primary functionality is encapsulated in the `predict.py` file, which defines a `Predictor` class compatible with Cog. Here's how to use it:

- Configure `cog.yaml`

  Ensure that your `cog.yaml` is properly configured to use the `Predictor` class from `predict.py`:

  ```yaml
  build:
    python_version: "3.12"
    python_requirements: requirements.txt
  predict: "predict.py:Predictor"
  ```
- Run Prediction

  Use Cog's CLI to run predictions:

  cog predict -i audio_file=@path/to/audio.mp3 -i transcript="Your transcript here" -i language=en -i show_probabilities=true
- `TEST_STRING`

  A default transcript used for alignment if no transcript is provided:
TEST_STRING = "On that road we heard the song of morning stars; we drank in fragrances aerial and sweet as a May mist; we were rich in gossamer fancies and iris hopes; our hearts sought and found the boon of dreams; the years waited beyond and they were very fair; life was a rose-lipped comrade with purple flowers dripping from her fingers."
- `extract_flat_array(json_data, show_probabilities=False)`

  Extracts a flat array of words with their timings and optional probabilities from the JSON output.

  - Parameters:
    - `json_data` (str or dict): The JSON data to extract from.
    - `show_probabilities` (bool): Whether to include probability scores.
  - Returns: a `list` of word dictionaries.
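A minimal sketch of what this helper might look like, assuming a stable-ts-style result shape (a top-level "segments" list whose entries carry a "words" list with "word", "start", "end", and "probability" fields); the actual implementation in predict.py may differ:

```python
import json

def extract_flat_array(json_data, show_probabilities=False):
    """Flatten alignment JSON into a flat list of word dicts.

    Assumes a stable-ts-style shape: {"segments": [{"words": [...]}]}.
    """
    if isinstance(json_data, str):
        json_data = json.loads(json_data)
    flat = []
    for segment in json_data.get("segments", []):
        for w in segment.get("words", []):
            entry = {
                "word": w["word"].strip(),  # drop Whisper's leading space
                "start": w["start"],
                "end": w["end"],
            }
            if show_probabilities:
                entry["probability"] = w.get("probability")
            flat.append(entry)
    return flat

# Hypothetical sample payload for illustration only.
sample = {"segments": [{"words": [
    {"word": " On", "start": 0.0, "end": 0.1, "probability": 0.98},
    {"word": " that", "start": 0.1, "end": 0.2, "probability": 0.97},
]}]}
words = extract_flat_array(sample)
```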
- `Predictor(BasePredictor)`

  The main predictor class for Cog.

  - Methods:
    - `setup(self)`: Loads the `stable_whisper` model into memory.
    - `predict(self, audio_file, transcript, language, show_probabilities)`: Performs transcription or alignment based on inputs and returns the results.
Here's a simple example of how to use the predictor:
cog predict -i audio_file=@audio.mp3 -i transcript="On that road we heard the song of morning stars; we drank in fragrances aerial and sweet as a May mist; we were rich in gossamer fancies and iris hopes; our hearts sought and found the boon of dreams; the years waited beyond and they were very fair; life was a rose-lipped comrade with purple flowers dripping from her fingers."
Response:
{
"words": [
{
"word": "On",
"start": 0,
"end": 0.1
},
{
"word": "that",
"start": 0.1,
"end": 0.2
},
{
"word": "road",
"start": 0.2,
"end": 0.3
},
...
]
}
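The flat "words" array is straightforward to post-process. For example, a sketch (standard library only, using the truncated response above) that turns the word timings into an SRT-style caption line:

```python
import json

# The response payload from the example above, truncated to three words.
response = ('{"words": ['
            '{"word": "On", "start": 0, "end": 0.1}, '
            '{"word": "that", "start": 0.1, "end": 0.2}, '
            '{"word": "road", "start": 0.2, "end": 0.3}]}')

def to_srt_time(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

words = json.loads(response)["words"]
# One caption spanning from the first word's start to the last word's end.
caption = (f"{to_srt_time(words[0]['start'])} --> "
           f"{to_srt_time(words[-1]['end'])}")
text = " ".join(w["word"] for w in words)
```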
Contributions are welcome! Please follow these steps:
- Fork the repository.
- Create a new branch: `git checkout -b feature/YourFeature`.
- Make your changes and commit them: `git commit -m 'Add some feature'`.
- Push to the branch: `git push origin feature/YourFeature`.
- Open a pull request.
Please ensure your code adheres to the project's coding standards and includes appropriate tests.
This project is licensed under the MIT License.
These key projects are behind the prediction interface:
- stable-ts: Developed by Jian, this project enhances transcription accuracy by stabilizing timestamps in OpenAI's Whisper model.
- faster-whisper: A reimplementation of OpenAI's Whisper model using CTranslate2, offering up to 4 times faster transcription with reduced memory usage.
For any questions, suggestions, or issues, please open an issue in the repository or contact kyle@crone.ai.