whisperX/README.md

<h1 align="center">WhisperX</h1>

<p align="center">
  <a href="https://github.com/m-bain/whisperX/stargazers">
    <img src="https://img.shields.io/github/stars/m-bain/whisperX.svg?colorA=orange&colorB=orange&logo=github"
         alt="GitHub stars">
  </a>
  <a href="https://github.com/m-bain/whisperX/issues">
        <img src="https://img.shields.io/github/issues/m-bain/whisperx.svg"
             alt="GitHub issues">
  </a>
  <a href="https://github.com/m-bain/whisperX/blob/master/LICENSE">
        <img src="https://img.shields.io/github/license/m-bain/whisperX.svg"
             alt="GitHub license">
  </a>
  <a href="https://arxiv.org/abs/2303.00747">
        <img src="http://img.shields.io/badge/Arxiv-2303.00747-B31B1B.svg"
             alt="ArXiv paper">
  </a>
  <a href="https://twitter.com/intent/tweet?text=&url=https%3A%2F%2Fgithub.com%2Fm-bain%2FwhisperX">
  <img src="https://img.shields.io/twitter/url/https/github.com/m-bain/whisperX.svg?style=social" alt="Twitter">
  </a>      
</p>

<img width="1216" align="center" alt="whisperx-arch" src="https://raw.githubusercontent.com/m-bain/whisperX/refs/heads/main/figures/pipeline.png">

<!-- <p align="left">Whisper-Based Automatic Speech Recognition (ASR) with improved timestamp accuracy + quality via forced phoneme alignment and voice-activity based batching for fast inference.</p> -->

<!-- <h2 align="left", id="what-is-it">What is it 🔎</h2> -->

This repository provides fast automatic speech recognition (70x realtime with large-v2) with word-level timestamps and speaker diarization.

- ⚡️ Batched inference for 70x realtime transcription using whisper large-v2
- 🪶 [faster-whisper](https://github.com/guillaumekln/faster-whisper) backend, requires <8GB gpu memory for large-v2 with beam_size=5
- 🎯 Accurate word-level timestamps using wav2vec2 alignment
- 👯‍♂️ Multispeaker ASR using speaker diarization from [pyannote-audio](https://github.com/pyannote/pyannote-audio) (speaker ID labels)
- 🗣️ VAD preprocessing, reduces hallucination & batching with no WER degradation

**Whisper** is an ASR model [developed by OpenAI](https://github.com/openai/whisper), trained on a large dataset of diverse audio. Whilst it does produces highly accurate transcriptions, the corresponding timestamps are at the utterance-level, not per word, and can be inaccurate by several seconds. OpenAI's whisper does not natively support batching.

**Phoneme-Based ASR** A suite of models finetuned to recognise the smallest unit of speech distinguishing one word from another, e.g. the element p in "tap". A popular example model is [wav2vec2.0](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self).

**Forced Alignment** refers to the process by which orthographic transcriptions are aligned to audio recordings to automatically generate phone level segmentation.

**Voice Activity Detection (VAD)** is the detection of the presence or absence of human speech.

**Speaker Diarization** is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker.

<h2 align="left", id="highlights">New🚨</h2>

- 1st place at [Ego4d transcription challenge](https://eval.ai/web/challenges/challenge-page/1637/leaderboard/3931/WER) 🏆
- _WhisperX_ accepted at INTERSPEECH 2023
- v3 transcript segment-per-sentence: using nltk sent_tokenize for better subtitlting & better diarization
- v3 released, 70x speed-up open-sourced. Using batched whisper with [faster-whisper](https://github.com/guillaumekln/faster-whisper) backend!
- v2 released, code cleanup, imports whisper library VAD filtering is now turned on by default, as in the paper.
- Paper drop🎓👨‍🏫! Please see our [ArxiV preprint](https://arxiv.org/abs/2303.00747) for benchmarking and details of WhisperX. We also introduce more efficient batch inference resulting in large-v2 with \*60-70x REAL TIME speed.

<h2 align="left" id="setup">Setup ⚙️</h2>

### 1. Simple Installation (Recommended)

The easiest way to install WhisperX is through PyPi:

```bash
pip install whisperx
```

Or if using [uvx](https://docs.astral.sh/uv/guides/tools/#running-tools):

```bash
uvx whisperx
```

### 2. Advanced Installation Options

These installation methods are for developers or users with specific needs. If you're not sure, stick with the simple installation above.

#### Option A: Install from GitHub

To install directly from the GitHub repository:

```bash
uvx git+https://github.com/m-bain/whisperX.git
```

#### Option B: Developer Installation

If you want to modify the code or contribute to the project:

```bash
git clone https://github.com/m-bain/whisperX.git
cd whisperX
uv sync --all-extras --dev
```

> **Note**: The development version may contain experimental features and bugs. Use the stable PyPI release for production environments.

You may also need to install ffmpeg, rust etc. Follow openAI instructions here https://github.com/openai/whisper#setup.

### Common Issues & Troubleshooting 🔧

#### libcudnn Dependencies (GPU Users)

If you're using WhisperX with GPU support and encounter errors like:

- `Could not load library libcudnn_ops_infer.so.8`
- `Unable to load any of {libcudnn_cnn.so.9.1.0, libcudnn_cnn.so.9.1, libcudnn_cnn.so.9, libcudnn_cnn.so}`
- `libcudnn_ops_infer.so.8: cannot open shared object file: No such file or directory`

This means your system is missing the CUDA Deep Neural Network library (cuDNN). This library is needed for GPU acceleration but isn't always installed by default.

**Install cuDNN (example for apt based systems):**

```bash
sudo apt update
sudo apt install libcudnn8 libcudnn8-dev -y
```

### Speaker Diarization

To **enable Speaker Diarization**, include your Hugging Face access token (read) that you can generate from [Here](https://huggingface.co/settings/tokens) after the `--hf_token` argument and accept the user agreement for the following models: [Segmentation](https://huggingface.co/pyannote/segmentation-3.0) and [Speaker-Diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1) (if you choose to use Speaker-Diarization 2.x, follow requirements [here](https://huggingface.co/pyannote/speaker-diarization) instead.)

> **Note**<br>
> As of Oct 11, 2023, there is a known issue regarding slow performance with pyannote/Speaker-Diarization-3.0 in whisperX. It is due to dependency conflicts between faster-whisper and pyannote-audio 3.0.0. Please see [this issue](https://github.com/m-bain/whisperX/issues/499) for more details and potential workarounds.

<h2 align="left" id="example">Usage 💬 (command line)</h2>

### English

Run whisper on example segment (using default params, whisper small) add `--highlight_words True` to visualise word timings in the .srt file.

    whisperx path/to/audio.wav

Result using _WhisperX_ with forced alignment to wav2vec2.0 large:

https://user-images.githubusercontent.com/36994049/208253969-7e35fe2a-7541-434a-ae91-8e919540555d.mp4

Compare this to original whisper out the box, where many transcriptions are out of sync:

https://user-images.githubusercontent.com/36994049/207743923-b4f0d537-29ae-4be2-b404-bb941db73652.mov

For increased timestamp accuracy, at the cost of higher gpu mem, use bigger models (bigger alignment model not found to be that helpful, see paper) e.g.

    whisperx path/to/audio.wav --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --batch_size 4

To label the transcript with speaker ID's (set number of speakers if known e.g. `--min_speakers 2` `--max_speakers 2`):

    whisperx path/to/audio.wav --model large-v2 --diarize --highlight_words True

To run on CPU instead of GPU (and for running on Mac OS X):

    whisperx path/to/audio.wav --compute_type int8

### Other languages

The phoneme ASR alignment model is _language-specific_, for tested languages these models are [automatically picked from torchaudio pipelines or huggingface](https://github.com/m-bain/whisperX/blob/f2da2f858e99e4211fe4f64b5f2938b007827e17/whisperx/alignment.py#L24-L58).
Just pass in the `--language` code, and use the whisper `--model large`.

Currently default models provided for `{en, fr, de, es, it}` via torchaudio pipelines and many other languages via Hugging Face. Please find the list of currently supported languages under `DEFAULT_ALIGN_MODELS_HF` on [alignment.py](https://github.com/m-bain/whisperX/blob/main/whisperx/alignment.py). If the detected language is not in this list, you need to find a phoneme-based ASR model from [huggingface model hub](https://huggingface.co/models) and test it on your data.

#### E.g. German

    whisperx --model large-v2 --language de path/to/audio.wav

https://user-images.githubusercontent.com/36994049/208298811-e36002ba-3698-4731-97d4-0aebd07e0eb3.mov

See more examples in other languages [here](EXAMPLES.md).

## Python usage 🐍

```python
import whisperx
import gc

device = "cuda"
audio_file = "audio.mp3"
batch_size = 16 # reduce if low on GPU mem
compute_type = "float16" # change to "int8" if low on GPU mem (may reduce accuracy)

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model("large-v2", device, compute_type=compute_type)

# save model to local path (optional)
# model_dir = "/path/"
# model = whisperx.load_model("large-v2", device, compute_type=compute_type, download_root=model_dir)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"]) # before alignment

# delete model if low on GPU resources
# import gc; import torch; gc.collect(); torch.cuda.empty_cache(); del model

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

print(result["segments"]) # after alignment

# delete model if low on GPU resources
# import gc; import torch; gc.collect(); torch.cuda.empty_cache(); del model_a

# 3. Assign speaker labels
diarize_model = whisperx.diarize.DiarizationPipeline(use_auth_token=YOUR_HF_TOKEN, device=device)

# add min/max number of speakers if known
diarize_segments = diarize_model(audio)
# diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)

result = whisperx.assign_word_speakers(diarize_segments, result)
print(diarize_segments)
print(result["segments"]) # segments are now assigned speaker IDs
```

## Demos 🚀

[![Replicate (large-v3](https://img.shields.io/static/v1?label=Replicate+WhisperX+large-v3&message=Demo+%26+Cloud+API&color=blue)](https://replicate.com/victor-upmeet/whisperx)
[![Replicate (large-v2](https://img.shields.io/static/v1?label=Replicate+WhisperX+large-v2&message=Demo+%26+Cloud+API&color=blue)](https://replicate.com/daanelson/whisperx)
[![Replicate (medium)](https://img.shields.io/static/v1?label=Replicate+WhisperX+medium&message=Demo+%26+Cloud+API&color=blue)](https://replicate.com/carnifexer/whisperx)

If you don't have access to your own GPUs, use the links above to try out WhisperX.

<h2 align="left" id="whisper-mod">Technical Details 👷‍♂️</h2>

For specific details on the batching and alignment, the effect of VAD, as well as the chosen alignment model, see the preprint [paper](https://www.robots.ox.ac.uk/~vgg/publications/2023/Bain23/bain23.pdf).

To reduce GPU memory requirements, try any of the following (2. & 3. can affect quality):

1.  reduce batch size, e.g. `--batch_size 4`
2.  use a smaller ASR model `--model base`
3.  Use lighter compute type `--compute_type int8`

Transcription differences from openai's whisper:

1. Transcription without timestamps. To enable single pass batching, whisper inference is performed `--without_timestamps True`, this ensures 1 forward pass per sample in the batch. However, this can cause discrepancies the default whisper output.
2. VAD-based segment transcription, unlike the buffered transcription of openai's. In the WhisperX paper we show this reduces WER, and enables accurate batched inference
3. `--condition_on_prev_text` is set to `False` by default (reduces hallucination)

<h2 align="left" id="limitations">Limitations ⚠️</h2>

- Transcript words which do not contain characters in the alignment models dictionary e.g. "2014." or "£13.60" cannot be aligned and therefore are not given a timing.
- Overlapping speech is not handled particularly well by whisper nor whisperx
- Diarization is far from perfect
- Language specific wav2vec2 model is needed

<h2 align="left" id="contribute">Contribute 🧑‍🏫</h2>

If you are multilingual, a major way you can contribute to this project is to find phoneme models on huggingface (or train your own) and test them on speech for the target language. If the results look good send a pull request and some examples showing its success.

Bug finding and pull requests are also highly appreciated to keep this project going, since it's already diverging from the original research scope.

<h2 align="left" id="coming-soon">TODO 🗓</h2>

- [x] Multilingual init

- [x] Automatic align model selection based on language detection

- [x] Python usage

- [x] Incorporating speaker diarization

- [x] Model flush, for low gpu mem resources

- [x] Faster-whisper backend

- [x] Add max-line etc. see (openai's whisper utils.py)

- [x] Sentence-level segments (nltk toolbox)

- [x] Improve alignment logic

- [ ] update examples with diarization and word highlighting

- [ ] Subtitle .ass output <- bring this back (removed in v3)

- [ ] Add benchmarking code (TEDLIUM for spd/WER & word segmentation)

- [x] Allow silero-vad as alternative VAD option

- [ ] Improve diarization (word level). _Harder than first thought..._

<h2 align="left" id="contact">Contact/Support 📇</h2>

Contact maxhbain@gmail.com for queries.

<a href="https://www.buymeacoffee.com/maxhbain" target="_blank"><img src="https://cdn.buymeacoffee.com/buttons/default-orange.png" alt="Buy Me A Coffee" height="41" width="174"></a>

<h2 align="left" id="acks">Acknowledgements 🙏</h2>

This work, and my PhD, is supported by the [VGG (Visual Geometry Group)](https://www.robots.ox.ac.uk/~vgg/) and the University of Oxford.

Of course, this is builds on [openAI's whisper](https://github.com/openai/whisper).
Borrows important alignment code from [PyTorch tutorial on forced alignment](https://pytorch.org/tutorials/intermediate/forced_alignment_with_torchaudio_tutorial.html)
And uses the wonderful pyannote VAD / Diarization https://github.com/pyannote/pyannote-audio

Valuable VAD & Diarization Models from:

- [pyannote audio][https://github.com/pyannote/pyannote-audio]
- [silero vad][https://github.com/snakers4/silero-vad]

Great backend from [faster-whisper](https://github.com/guillaumekln/faster-whisper) and [CTranslate2](https://github.com/OpenNMT/CTranslate2)

Those who have [supported this work financially](https://www.buymeacoffee.com/maxhbain) 🙏

Finally, thanks to the OS [contributors](https://github.com/m-bain/whisperX/graphs/contributors) of this project, keeping it going and identifying bugs.

<h2 align="left" id="cite">Citation</h2>
If you use this in your research, please cite the paper:

```bibtex
@article{bain2022whisperx,
  title={WhisperX: Time-Accurate Speech Transcription of Long-Form Audio},
  author={Bain, Max and Huh, Jaesung and Han, Tengda and Zisserman, Andrew},
  journal={INTERSPEECH 2023},
  year={2023}
}
```
-												init commit

											
										
										
											2022-12-14 18:59:12 +00:00
+								<h1 align="center">WhisperX</h1>
-												add chinese, dutch. python usage. readme update

											
										
										
											2022-12-23 00:41:12 +00:00
-												add back word .srt, update readme

											
										
										
											2022-12-19 19:12:50 +00:00
+								<p align="center">
 								  <a href="https://github.com/m-bain/whisperX/stargazers">
 								    <img src="https://img.shields.io/github/stars/m-bain/whisperX.svg?colorA=orange&colorB=orange&logo=github"
 								         alt="GitHub stars">
 								  </a>
-												restructure readme,

											
										
										
											2022-12-19 19:41:39 +00:00
+								  <a href="https://github.com/m-bain/whisperX/issues">
 								        <img src="https://img.shields.io/github/issues/m-bain/whisperx.svg"
-												add back word .srt, update readme

											
										
										
											2022-12-19 19:12:50 +00:00
+								             alt="GitHub issues">
 								  </a>
 								  <a href="https://github.com/m-bain/whisperX/blob/master/LICENSE">
 								        <img src="https://img.shields.io/github/license/m-bain/whisperX.svg"
 								             alt="GitHub license">
 								  </a>
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								  <a href="https://arxiv.org/abs/2303.00747">
 								        <img src="http://img.shields.io/badge/Arxiv-2303.00747-B31B1B.svg"
 								             alt="ArXiv paper">
 								  </a>
-												add back word .srt, update readme

											
										
										
											2022-12-19 19:12:50 +00:00
+								  <a href="https://twitter.com/intent/tweet?text=&url=https%3A%2F%2Fgithub.com%2Fm-bain%2FwhisperX">
 								  <img src="https://img.shields.io/twitter/url/https/github.com/m-bain/whisperX.svg?style=social" alt="Twitter">
 								  </a>
 								</p>
-												fix: update README image source and enhance setup.py for long description

											
										
										
											2025-01-02 08:29:13 +01:00
+								<img width="1216" align="center" alt="whisperx-arch" src="https://raw.githubusercontent.com/m-bain/whisperX/refs/heads/main/figures/pipeline.png">
-												restructure readme,

											
										
										
											2022-12-19 19:41:39 +00:00
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								<!-- <p align="left">Whisper-Based Automatic Speech Recognition (ASR) with improved timestamp accuracy + quality via forced phoneme alignment and voice-activity based batching for fast inference.</p> -->
-												init commit

											
										
										
											2022-12-14 18:59:12 +00:00
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								<!-- <h2 align="left", id="what-is-it">What is it 🔎</h2> -->
-												init commit

											
										
										
											2022-12-14 18:59:12 +00:00
-												add translate, fix word_timestamp error

											
										
										
											2023-05-13 12:14:06 +01:00
+								This repository provides fast automatic speech recognition (70x realtime with large-v2) with word-level timestamps and speaker diarization.
-												init commit

											
										
										
											2022-12-14 18:59:12 +00:00
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								- ⚡️ Batched inference for 70x realtime transcription using whisper large-v2
 								- 🪶 [faster-whisper](https://github.com/guillaumekln/faster-whisper) backend, requires <8GB gpu memory for large-v2 with beam_size=5
 								- 🎯 Accurate word-level timestamps using wav2vec2 alignment
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
+								- 👯‍♂️ Multispeaker ASR using speaker diarization from [pyannote-audio](https://github.com/pyannote/pyannote-audio) (speaker ID labels)
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								- 🗣️ VAD preprocessing, reduces hallucination & batching with no WER degradation
-												init commit

											
										
										
											2022-12-14 18:59:12 +00:00
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								**Whisper** is an ASR model [developed by OpenAI](https://github.com/openai/whisper), trained on a large dataset of diverse audio. Whilst it does produces highly accurate transcriptions, the corresponding timestamps are at the utterance-level, not per word, and can be inaccurate by several seconds. OpenAI's whisper does not natively support batching.
-												Update README.md
											
										
										
											2022-12-15 01:01:48 +00:00
 								**Phoneme-Based ASR** A suite of models finetuned to recognise the smallest unit of speech distinguishing one word from another, e.g. the element p in "tap". A popular example model is [wav2vec2.0](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self).
-												init commit

											
										
										
											2022-12-14 18:59:12 +00:00
 								**Forced Alignment** refers to the process by which orthographic transcriptions are aligned to audio recordings to automatically generate phone level segmentation.
-												handle tmp wav file better

											
										
										
											2023-04-01 00:06:40 +01:00
+								**Voice Activity Detection (VAD)** is the detection of the presence or absence of human speech.
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								**Speaker Diarization** is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker.
-												new logic, diarization, vad filtering

											
										
										
											2023-01-24 15:02:08 +00:00
+								<h2 align="left", id="highlights">New🚨</h2>
-												fix list markdown
											
										
										
											2023-10-05 15:14:29 -07:00
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
+								- 1st place at [Ego4d transcription challenge](https://eval.ai/web/challenges/challenge-page/1637/leaderboard/3931/WER) 🏆
 								- _WhisperX_ accepted at INTERSPEECH 2023
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								- v3 transcript segment-per-sentence: using nltk sent_tokenize for better subtitlting & better diarization
-												v3 init

											
										
										
											2023-04-24 21:08:43 +01:00
+								- v3 released, 70x speed-up open-sourced. Using batched whisper with [faster-whisper](https://github.com/guillaumekln/faster-whisper) backend!
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								- v2 released, code cleanup, imports whisper library VAD filtering is now turned on by default, as in the paper.
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
+								- Paper drop🎓👨‍🏫! Please see our [ArxiV preprint](https://arxiv.org/abs/2303.00747) for benchmarking and details of WhisperX. We also introduce more efficient batch inference resulting in large-v2 with \*60-70x REAL TIME speed.
-												new logic, diarization, vad filtering

											
										
										
											2023-01-24 15:02:08 +00:00
-												restructure readme,

											
										
										
											2022-12-19 19:41:39 +00:00
+								<h2 align="left" id="setup">Setup ⚙️</h2>
-												v3 init

											
										
										
											2023-04-24 21:08:43 +01:00
-												docs: update installation instructions

											
										
										
											2025-03-25 17:02:02 +01:00
+								### 1. Simple Installation (Recommended)
-												v3 init

											
										
										
											2023-04-24 21:08:43 +01:00
-												docs: update installation instructions

											
										
										
											2025-03-25 17:02:02 +01:00
+								The easiest way to install WhisperX is through PyPi:
-												fallback on whisper alignment failures, update readme

											
										
										
											2023-01-05 11:15:19 +00:00
-												docs: update installation instructions in README

											
										
										
											2025-01-02 08:35:13 +01:00
+								```bash
 								pip install whisperx
 								```
-												docs: update installation instructions

											
										
										
											2025-03-25 17:02:02 +01:00
+								Or if using [uvx](https://docs.astral.sh/uv/guides/tools/#running-tools):
-												fallback on whisper alignment failures, update readme

											
										
										
											2023-01-05 11:15:19 +00:00
-												docs: update installation instructions in README

											
										
										
											2025-01-02 08:35:13 +01:00
+								```bash
-												docs: update installation instructions

											
										
										
											2025-03-25 17:02:02 +01:00
+								uvx whisperx
-												fallback on whisper alignment failures, update readme

											
										
										
											2023-01-05 11:15:19 +00:00
+								```
-												docs: update installation instructions in README

											
										
										
											2025-01-02 08:35:13 +01:00
-												docs: update installation instructions

											
										
										
											2025-03-25 17:02:02 +01:00
+								### 2. Advanced Installation Options
 								These installation methods are for developers or users with specific needs. If you're not sure, stick with the simple installation above.
 								#### Option A: Install from GitHub
 								To install directly from the GitHub repository:
-												docs: update installation instructions in README

											
										
										
											2025-01-02 08:35:13 +01:00
 								```bash
-												docs: update installation instructions

											
										
										
											2025-03-25 17:02:02 +01:00
+								uvx git+https://github.com/m-bain/whisperX.git
-												fallback on whisper alignment failures, update readme

											
										
										
											2023-01-05 11:15:19 +00:00
+								```
-												docs: update installation instructions

											
										
										
											2025-03-25 17:02:02 +01:00
+								#### Option B: Developer Installation
 								If you want to modify the code or contribute to the project:
-												docs: update installation instructions in README

											
										
										
											2025-01-02 08:35:13 +01:00
+								```bash
 								git clone https://github.com/m-bain/whisperX.git
 								cd whisperX
-												docs: update installation instructions

											
										
										
											2025-03-25 17:02:02 +01:00
+								uv sync --all-extras --dev
-												docs: update installation instructions in README

											
										
										
											2025-01-02 08:35:13 +01:00
+								```
 								> **Note**: The development version may contain experimental features and bugs. Use the stable PyPI release for production environments.
-												init commit

											
										
										
											2022-12-14 18:59:12 +00:00
+								You may also need to install ffmpeg, rust etc. Follow openAI instructions here https://github.com/openai/whisper#setup.
-												docs: add troubleshooting section for libcudnn dependencies in README

											
										
										
											2025-05-31 11:19:36 +00:00
+								### Common Issues & Troubleshooting 🔧
 								#### libcudnn Dependencies (GPU Users)
 								If you're using WhisperX with GPU support and encounter errors like:
 								- `Could not load library libcudnn_ops_infer.so.8`
 								- `Unable to load any of {libcudnn_cnn.so.9.1.0, libcudnn_cnn.so.9.1, libcudnn_cnn.so.9, libcudnn_cnn.so}`
 								- `libcudnn_ops_infer.so.8: cannot open shared object file: No such file or directory`
 								This means your system is missing the CUDA Deep Neural Network library (cuDNN). This library is needed for GPU acceleration but isn't always installed by default.
 								**Install cuDNN (example for apt based systems):**
 								```bash
 								sudo apt update
 								sudo apt install libcudnn8 libcudnn8-dev -y
 								```
-												hf token only for diarization
											
										
										
											2023-03-31 16:15:40 -07:00
+								### Speaker Diarization
-												docs: update installation instructions

											
										
										
											2025-03-25 17:02:02 +01:00
-												Update README to Correct Speaker Diarization Version Link

Currently errors if user just accepts terms for README link version
3.0. Version 3.1 introduced in pull request #586

											
										
										
											2023-12-07 12:33:05 -08:00
+								To **enable Speaker Diarization**, include your Hugging Face access token (read) that you can generate from [Here](https://huggingface.co/settings/tokens) after the `--hf_token` argument and accept the user agreement for the following models: [Segmentation](https://huggingface.co/pyannote/segmentation-3.0) and [Speaker-Diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1) (if you choose to use Speaker-Diarization 2.x, follow requirements [here](https://huggingface.co/pyannote/speaker-diarization) instead.)
-												update readme

											
										
										
											2023-10-11 22:56:38 -04:00
 								> **Note**<br>
 								> As of Oct 11, 2023, there is a known issue regarding slow performance with pyannote/Speaker-Diarization-3.0 in whisperX. It is due to dependency conflicts between faster-whisper and pyannote-audio 3.0.0. Please see [this issue](https://github.com/m-bain/whisperX/issues/499) for more details and potential workarounds.
-												update readme

											
										
										
											2023-02-01 22:09:11 +00:00
-												add chinese, dutch. python usage. readme update

											
										
										
											2022-12-23 00:41:12 +00:00
+								<h2 align="left" id="example">Usage 💬 (command line)</h2>
-												init commit

											
										
										
											2022-12-14 18:59:12 +00:00
-												multilingual init

											
										
										
											2022-12-18 12:21:24 +00:00
+								### English
-												restructure readme,

											
										
										
											2022-12-19 19:41:39 +00:00
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								Run whisper on example segment (using default params, whisper small) add `--highlight_words True` to visualise word timings in the .srt file.
-												init commit

											
										
										
											2022-12-14 18:59:12 +00:00
-												docs: Update README example commands with generic audio path

											
										
										
											2025-02-19 08:24:04 +01:00
+								    whisperx path/to/audio.wav
-												init commit

											
										
										
											2022-12-14 18:59:12 +00:00
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
+								Result using _WhisperX_ with forced alignment to wav2vec2.0 large:
-												Update README.md
											
										
										
											2022-12-15 01:01:48 +00:00
-												Update README.md
											
										
										
											2022-12-17 17:34:38 +00:00
+								https://user-images.githubusercontent.com/36994049/208253969-7e35fe2a-7541-434a-ae91-8e919540555d.mp4
-												Update README.md

add qualitative examples
											
										
										
											2022-12-15 00:31:20 +00:00
-												add back word .srt, update readme

											
										
										
											2022-12-19 19:12:50 +00:00
+								Compare this to original whisper out the box, where many transcriptions are out of sync:
 								https://user-images.githubusercontent.com/36994049/207743923-b4f0d537-29ae-4be2-b404-bb941db73652.mov
-												Update README.md

add qualitative examples
											
										
										
											2022-12-15 00:31:20 +00:00
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								For increased timestamp accuracy, at the cost of higher gpu mem, use bigger models (bigger alignment model not found to be that helpful, see paper) e.g.
-												docs: Update README example commands with generic audio path

											
										
										
											2025-02-19 08:24:04 +01:00
+								    whisperx path/to/audio.wav --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --batch_size 4
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
 								To label the transcript with speaker ID's (set number of speakers if known e.g. `--min_speakers 2` `--max_speakers 2`):
-												docs: Update README example commands with generic audio path

											
										
										
											2025-02-19 08:24:04 +01:00
+								    whisperx path/to/audio.wav --model large-v2 --diarize --highlight_words True
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
-												Document --compute_type command line option
											
										
										
											2023-08-19 08:19:49 +01:00
+								To run on CPU instead of GPU (and for running on Mac OS X):
-												docs: Update README example commands with generic audio path

											
										
										
											2025-02-19 08:24:04 +01:00
+								    whisperx path/to/audio.wav --compute_type int8
-												Document --compute_type command line option
											
										
										
											2023-08-19 08:19:49 +01:00
-												add chinese, dutch. python usage. readme update

											
										
										
											2022-12-23 00:41:12 +00:00
+								### Other languages
-												multilingual init

											
										
										
											2022-12-18 12:21:24 +00:00
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
+								The phoneme ASR alignment model is _language-specific_, for tested languages these models are [automatically picked from torchaudio pipelines or huggingface](https://github.com/m-bain/whisperX/blob/f2da2f858e99e4211fe4f64b5f2938b007827e17/whisperx/alignment.py#L24-L58).
-												add chinese, dutch. python usage. readme update

											
										
										
											2022-12-23 00:41:12 +00:00
+								Just pass in the `--language` code, and use the whisper `--model large`.
-												multilingual init

											
										
										
											2022-12-18 12:21:24 +00:00
-												doc: refer to DEFAULT_ALIGN_MODELS_HF for other langs
											
										
										
											2025-01-17 08:46:38 +01:00
+								Currently default models provided for `{en, fr, de, es, it}` via torchaudio pipelines and many other languages via Hugging Face. Please find the list of currently supported languages under `DEFAULT_ALIGN_MODELS_HF` on [alignment.py](https://github.com/m-bain/whisperX/blob/main/whisperx/alignment.py). If the detected language is not in this list, you need to find a phoneme-based ASR model from [huggingface model hub](https://huggingface.co/models) and test it on your data.
-												multilingual init

											
										
										
											2022-12-18 12:21:24 +00:00
-												add chinese, dutch. python usage. readme update

											
										
										
											2022-12-23 00:41:12 +00:00
+								#### E.g. German
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
-												docs: Update README example commands with generic audio path

											
										
										
											2025-02-19 08:24:04 +01:00
+								    whisperx --model large-v2 --language de path/to/audio.wav
-												add other language examples
											
										
										
											2022-12-18 12:39:16 +00:00
 								https://user-images.githubusercontent.com/36994049/208298811-e36002ba-3698-4731-97d4-0aebd07e0eb3.mov
-												handle negative / tiny duration segments, final

											
										
										
											2023-01-08 14:01:10 +00:00
+								See more examples in other languages [here](EXAMPLES.md).
-												fix starting timestamp for multiple fail-to-aligned words

											
										
										
											2023-01-07 14:59:11 +00:00
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
+								## Python usage 🐍
-												add other language examples
											
										
										
											2022-12-18 12:39:16 +00:00
-												add chinese, dutch. python usage. readme update

											
										
										
											2022-12-23 00:41:12 +00:00
+								```python
 								import whisperx
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
+								import gc
-												add other language examples
											
										
										
											2022-12-18 12:39:16 +00:00
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
+								device = "cuda"
-												add chinese, dutch. python usage. readme update

											
										
										
											2022-12-23 00:41:12 +00:00
+								audio_file = "audio.mp3"
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								batch_size = 16 # reduce if low on GPU mem
 								compute_type = "float16" # change to "int8" if low on GPU mem (may reduce accuracy)
-												add other language examples
											
										
										
											2022-12-18 12:39:16 +00:00
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								# 1. Transcribe with original whisper (batched)
 								model = whisperx.load_model("large-v2", device, compute_type=compute_type)
-												v3 init

											
										
										
											2023-04-24 21:08:43 +01:00
-												Update README.md

Demonstrates use of argument to save model to local path.
											
										
										
											2023-12-15 13:46:32 +00:00
+								# save model to local path (optional)
 								# model_dir = "/path/"
 								# model = whisperx.load_model("large-v2", device, compute_type=compute_type, download_root=model_dir)
-												v3 init

											
										
										
											2023-04-24 21:08:43 +01:00
+								audio = whisperx.load_audio(audio_file)
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								result = model.transcribe(audio, batch_size=batch_size)
-												vad filter

											
										
										
											2023-01-20 12:54:20 +00:00
+								print(result["segments"]) # before alignment
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								# delete model if low on GPU resources
-												docs: add missing torch import to Python usage example in README
											
										
										
											2025-06-07 23:36:49 +02:00
+								# import gc; import torch; gc.collect(); torch.cuda.empty_cache(); del model
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
 								# 2. Align whisper output
-												add chinese, dutch. python usage. readme update

											
										
										
											2022-12-23 00:41:12 +00:00
+								model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)
 								print(result["segments"]) # after alignment
-												add other language examples
											
										
										
											2022-12-18 12:39:16 +00:00
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								# delete model if low on GPU resources
-												docs: add missing torch import to Python usage example in README
											
										
										
											2025-06-07 23:36:49 +02:00
+								# import gc; import torch; gc.collect(); torch.cuda.empty_cache(); del model_a
-												add other language examples
											
										
										
											2022-12-18 12:39:16 +00:00
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								# 3. Assign speaker labels
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
+								diarize_model = whisperx.diarize.DiarizationPipeline(use_auth_token=YOUR_HF_TOKEN, device=device)
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
 								# add min/max number of speakers if known
-												make diarization faster
											
										
										
											2023-08-02 10:11:43 +03:00
+								diarize_segments = diarize_model(audio)
-												more
											
										
										
											2023-08-02 10:32:02 +03:00
+								# diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
-												update python usage example
											
										
										
											2023-05-08 17:20:38 +01:00
+								result = whisperx.assign_word_speakers(diarize_segments, result)
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								print(diarize_segments)
 								print(result["segments"]) # segments are now assigned speaker IDs
-												add chinese, dutch. python usage. readme update

											
										
										
											2022-12-23 00:41:12 +00:00
+								```
-												add other language examples
											
										
										
											2022-12-18 12:39:16 +00:00
-												adding Replicate demo

											
										
										
											2023-06-30 18:11:33 -07:00
+								## Demos 🚀
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
+								[![Replicate (large-v3](https://img.shields.io/static/v1?label=Replicate+WhisperX+large-v3&message=Demo+%26+Cloud+API&color=blue)](https://replicate.com/victor-upmeet/whisperx)
 								[![Replicate (large-v2](https://img.shields.io/static/v1?label=Replicate+WhisperX+large-v2&message=Demo+%26+Cloud+API&color=blue)](https://replicate.com/daanelson/whisperx)
 								[![Replicate (medium)](https://img.shields.io/static/v1?label=Replicate+WhisperX+medium&message=Demo+%26+Cloud+API&color=blue)](https://replicate.com/carnifexer/whisperx)
-												adding Replicate demo

											
										
										
											2023-06-30 18:11:33 -07:00
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
+								If you don't have access to your own GPUs, use the links above to try out WhisperX.
-												handle negative / tiny duration segments, final

											
										
										
											2023-01-08 14:01:10 +00:00
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								<h2 align="left" id="whisper-mod">Technical Details 👷‍♂️</h2>
-												handle negative / tiny duration segments, final

											
										
										
											2023-01-08 14:01:10 +00:00
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								For specific details on the batching and alignment, the effect of VAD, as well as the chosen alignment model, see the preprint [paper](https://www.robots.ox.ac.uk/~vgg/publications/2023/Bain23/bain23.pdf).
-												handle negative / tiny duration segments, final

											
										
										
											2023-01-08 14:01:10 +00:00
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								To reduce GPU memory requirements, try any of the following (2. & 3. can affect quality):
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+.  reduce batch size, e.g. `--batch_size 4`
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
+.  use a smaller ASR model `--model base`
 .  Use lighter compute type `--compute_type int8`
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
 								Transcription differences from openai's whisper:
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+. Transcription without timestamps. To enable single pass batching, whisper inference is performed `--without_timestamps True`, this ensures 1 forward pass per sample in the batch. However, this can cause discrepancies the default whisper output.
-												Fix link in README.md
											
										
										
											2024-01-17 16:58:20 +01:00
+. VAD-based segment transcription, unlike the buffered transcription of openai's. In the WhisperX paper we show this reduces WER, and enables accurate batched inference
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
+. `--condition_on_prev_text` is set to `False` by default (reduces hallucination)
-												handle negative / tiny duration segments, final

											
										
										
											2023-01-08 14:01:10 +00:00
-												restructure readme,

											
										
										
											2022-12-19 19:41:39 +00:00
+								<h2 align="left" id="limitations">Limitations ⚠️</h2>
-												init commit

											
										
										
											2022-12-14 18:59:12 +00:00
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								- Transcript words which do not contain characters in the alignment models dictionary e.g. "2014." or "£13.60" cannot be aligned and therefore are not given a timing.
-												update readme

											
										
										
											2023-02-01 22:09:11 +00:00
+								- Overlapping speech is not handled particularly well by whisper nor whisperx
-												Update README.md
											
										
										
											2024-03-20 15:47:18 +00:00
+								- Diarization is far from perfect
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								- Language specific wav2vec2 model is needed
-												init commit

											
										
										
											2022-12-14 18:59:12 +00:00
-												add chinese, dutch. python usage. readme update

											
										
										
											2022-12-23 00:41:12 +00:00
+								<h2 align="left" id="contribute">Contribute 🧑‍🏫</h2>
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								If you are multilingual, a major way you can contribute to this project is to find phoneme models on huggingface (or train your own) and test them on speech for the target language. If the results look good send a pull request and some examples showing its success.
-												add chinese, dutch. python usage. readme update

											
										
										
											2022-12-23 00:41:12 +00:00
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								Bug finding and pull requests are also highly appreciated to keep this project going, since it's already diverging from the original research scope.
-												add chinese, dutch. python usage. readme update

											
										
										
											2022-12-23 00:41:12 +00:00
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								<h2 align="left" id="coming-soon">TODO 🗓</h2>
-												Update README.md
											
										
										
											2022-12-15 01:01:48 +00:00
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
+								- [x] Multilingual init
-												Update README.md
											
										
										
											2022-12-17 17:34:38 +00:00
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
+								- [x] Automatic align model selection based on language detection
-												add chinese, dutch. python usage. readme update

											
										
										
											2022-12-23 00:41:12 +00:00
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
+								- [x] Python usage
-												multilingual init

											
										
										
											2022-12-18 12:21:24 +00:00
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
+								- [x] Incorporating speaker diarization
-												new logic, diarization, vad filtering

											
										
										
											2023-01-24 15:02:08 +00:00
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
+								- [x] Model flush, for low gpu mem resources
-												v3 init

											
										
										
											2023-04-24 21:08:43 +01:00
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
+								- [x] Faster-whisper backend
-												v3 init

											
										
										
											2023-04-24 21:08:43 +01:00
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
+								- [x] Add max-line etc. see (openai's whisper utils.py)
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
+								- [x] Sentence-level segments (nltk toolbox)
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
+								- [x] Improve alignment logic
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
+								- [ ] update examples with diarization and word highlighting
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
+								- [ ] Subtitle .ass output <- bring this back (removed in v3)
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
+								- [ ] Add benchmarking code (TEDLIUM for spd/WER & word segmentation)
-												v3 init

											
										
										
											2023-04-24 21:08:43 +01:00
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
+								- [x] Allow silero-vad as alternative VAD option
-												Update README.md
											
										
										
											2022-12-15 01:01:48 +00:00
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
+								- [ ] Improve diarization (word level). _Harder than first thought..._
-												init commit

											
										
										
											2022-12-14 18:59:12 +00:00
-												buy-me-a-coffee

											
										
										
											2023-01-27 15:12:49 +00:00
+								<h2 align="left" id="contact">Contact/Support 📇</h2>
-												init commit

											
										
										
											2022-12-14 18:59:12 +00:00
-												Update README.md
											
										
										
											2024-03-20 15:47:18 +00:00
+								Contact maxhbain@gmail.com for queries.
-												add arch figure, citation
											
										
										
											2022-12-18 18:43:33 +00:00
-												buy-me-a-coffee

											
										
										
											2023-01-27 15:12:49 +00:00
+								<a href="https://www.buymeacoffee.com/maxhbain" target="_blank"><img src="https://cdn.buymeacoffee.com/buttons/default-orange.png" alt="Buy Me A Coffee" height="41" width="174"></a>
-												restructure readme,

											
										
										
											2022-12-19 19:41:39 +00:00
+								<h2 align="left" id="acks">Acknowledgements 🙏</h2>
-												init commit

											
										
										
											2022-12-14 18:59:12 +00:00
-												handle tmp wav file better

											
										
										
											2023-04-01 00:06:40 +01:00
+								This work, and my PhD, is supported by the [VGG (Visual Geometry Group)](https://www.robots.ox.ac.uk/~vgg/) and the University of Oxford.
-												update readme

											
										
										
											2023-02-01 22:09:11 +00:00
 								Of course, this is builds on [openAI's whisper](https://github.com/openai/whisper).
-												add v3 pre-release comment, and v4 progress update
											
										
										
											2023-05-02 15:10:40 +01:00
+								Borrows important alignment code from [PyTorch tutorial on forced alignment](https://pytorch.org/tutorials/intermediate/forced_alignment_with_torchaudio_tutorial.html)
 								And uses the wonderful pyannote VAD / Diarization https://github.com/pyannote/pyannote-audio
-												Accept alternative VAD methods. Extend to use Silero VAD.

											
										
										
											2024-09-26 10:28:52 +02:00
+								Valuable VAD & Diarization Models from:
-												fix: remove DiarizationPipeline from public API

											
										
										
											2025-05-02 13:04:24 +02:00
-												Accept alternative VAD methods. Extend to use Silero VAD.

											
										
										
											2024-09-26 10:28:52 +02:00
+								- [pyannote audio][https://github.com/pyannote/pyannote-audio]
 								- [silero vad][https://github.com/snakers4/silero-vad]
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
 								Great backend from [faster-whisper](https://github.com/guillaumekln/faster-whisper) and [CTranslate2](https://github.com/OpenNMT/CTranslate2)
-												v3 init

											
										
										
											2023-04-24 21:08:43 +01:00
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								Those who have [supported this work financially](https://www.buymeacoffee.com/maxhbain) 🙏
-												v3 init

											
										
										
											2023-04-24 21:08:43 +01:00
-												update readme, setup, add option to return char_timestamps

											
										
										
											2023-05-07 20:28:33 +01:00
+								Finally, thanks to the OS [contributors](https://github.com/m-bain/whisperX/graphs/contributors) of this project, keeping it going and identifying bugs.
-												add arch figure, citation
											
										
										
											2022-12-18 18:43:33 +00:00
-												restructure readme,

											
										
										
											2022-12-19 19:41:39 +00:00
+								<h2 align="left" id="cite">Citation</h2>
-												paper drop
											
										
										
											2023-03-02 12:04:16 +00:00
+								If you use this in your research, please cite the paper:
-												add arch figure, citation
											
										
										
											2022-12-18 18:43:33 +00:00
 								```bibtex
-												paper drop
											
										
										
											2023-03-02 12:04:16 +00:00
+								@article{bain2022whisperx,
 								  title={WhisperX: Time-Accurate Speech Transcription of Long-Form Audio},
 								  author={Bain, Max and Huh, Jaesung and Han, Tengda and Zisserman, Andrew},
-												interspeech
											
										
										
											2023-06-01 16:54:16 +01:00
+								  journal={INTERSPEECH 2023},
-												paper drop
											
										
										
											2023-03-02 12:04:16 +00:00
+								  year={2023}
-												add arch figure, citation
											
										
										
											2022-12-18 18:43:33 +00:00
+								}
-												add v3 tag for install
											
										
										
											2023-04-25 10:07:34 +01:00
+								```