update readme, setup, add option to return char_timestamps

2025-07-01 18:17:27 -04:00 · 2023-05-07 20:28:33 +01:00
parent 24008aa1ed
commit 4603f010a5
5 changed files with 103 additions and 65 deletions
--- a/README.md
+++ b/README.md
@ -13,36 +13,36 @@
        <img src="https://img.shields.io/github/license/m-bain/whisperX.svg"
             alt="GitHub license">
  </a>
+  <a href="https://arxiv.org/abs/2303.00747">
+        <img src="http://img.shields.io/badge/Arxiv-2303.00747-B31B1B.svg"
+             alt="ArXiv paper">
+  </a>
  <a href="https://twitter.com/intent/tweet?text=&url=https%3A%2F%2Fgithub.com%2Fm-bain%2FwhisperX">
  <img src="https://img.shields.io/twitter/url/https/github.com/m-bain/whisperX.svg?style=social" alt="Twitter">
  </a>      
 </p>

-<p align="center">
-  <a href="#what-is-it">What is it</a> •
-  <a href="#setup">Setup</a> •
-  <a href="#example">Usage</a> •
-  <a href="#other-languages">Multilingual</a> •
-  <a href="#contribute">Contribute</a> •
-  <a href="EXAMPLES.md">More examples</a> •
-  <a href="https://arxiv.org/abs/2303.00747">Paper</a>
-</p>
-

 <img width="1216" align="center" alt="whisperx-arch" src="figures/pipeline.png">


-<p align="left">Whisper-Based Automatic Speech Recognition (ASR) with improved timestamp accuracy + quality via forced phoneme alignment and speech-activity batching.
-
-</p>
+<!-- <p align="left">Whisper-Based Automatic Speech Recognition (ASR) with improved timestamp accuracy + quality via forced phoneme alignment and voice-activity based batching for fast inference.</p> -->


-<h2 align="left", id="what-is-it">What is it 🔎</h2>
-
-This repository refines the timestamps of openAI's Whisper model via forced aligment with phoneme-based ASR models (e.g. wav2vec2.0) and VAD preprocesssing, multilingual use-case.
+<!-- <h2 align="left", id="what-is-it">What is it 🔎</h2> -->


-**Whisper** is an ASR model [developed by OpenAI](https://github.com/openai/whisper), trained on a large dataset of diverse audio. Whilst it does produces highly accurate transcriptions, the corresponding timestamps are at the utterance-level, not per word, and can be inaccurate by several seconds.
+This repository provides fast automatic speech recognition (70x realtime with large-v2) with word-level timestamps and speaker diarization.
+
+- ⚡️ Batched inference for 70x realtime transcription using whisper large-v2
+- 🪶 [faster-whisper](https://github.com/guillaumekln/faster-whisper) backend, requires <8GB gpu memory for large-v2 with beam_size=5
+- 🎯 Accurate word-level timestamps using wav2vec2 alignment
+- 👯‍♂️ Multispeaker ASR using speaker diarization from [pyannote-audio](https://github.com/pyannote/pyannote-audio) (labels each segment/word with speaker ID) 
+- 🗣️ VAD preprocessing, reduces hallucination and enables batching with no WER degradation
+
+
+
+**Whisper** is an ASR model [developed by OpenAI](https://github.com/openai/whisper), trained on a large dataset of diverse audio. Whilst it does produce highly accurate transcriptions, the corresponding timestamps are at the utterance-level, not per word, and can be inaccurate by several seconds. OpenAI's whisper does not natively support batching.

 **Phoneme-Based ASR** A suite of models finetuned to recognise the smallest unit of speech distinguishing one word from another, e.g. the element p in "tap". A popular example model is [wav2vec2.0](https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self).

@ -50,15 +50,15 @@ This repository refines the timestamps of openAI's Whisper model via forced alig

 **Voice Activity Detection (VAD)** is the detection of the presence or absence of human speech.

+**Speaker Diarization** is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker.
+
+
 <h2 align="left", id="highlights">New🚨</h2>

+- v3 transcript segment-per-sentence: using nltk sent_tokenize for better subtitling & better diarization
 - v3 released, 70x speed-up open-sourced. Using batched whisper with [faster-whisper](https://github.com/guillaumekln/faster-whisper) backend!
- v2 released, code cleanup, imports whisper library, batched inference from paper not included (contact for licensing / batched model API). VAD filtering is now turned on by default, as in the paper.
- Paper drop🎓👨‍🏫! Please see our [ArxiV preprint](https://arxiv.org/abs/2303.00747) for benchmarking and details of WhisperX. We also introduce more efficient batch inference resulting in large-v2 with *60-70x REAL TIME speed (not provided in this repo).
- VAD filtering: Voice Activity Detection (VAD) from [Pyannote.audio](https://huggingface.co/pyannote/voice-activity-detection) is used as a preprocessing step to remove reliance on whisper timestamps and only transcribe audio segments containing speech. add `--vad_filter True` flag, increases timestamp accuracy and robustness (requires more GPU mem due to 30s inputs in wav2vec2)
- Character level timestamps (see `*.char.ass` file output)
- Diarization (still in beta, add `--diarize`)
-
+- v2 released, code cleanup, imports whisper library. VAD filtering is now turned on by default, as in the paper.
+- Paper drop🎓👨‍🏫! Please see our [arXiv preprint](https://arxiv.org/abs/2303.00747) for benchmarking and details of WhisperX. We also introduce more efficient batch inference resulting in large-v2 with 60-70x REAL TIME speed.

 <h2 align="left" id="setup">Setup ⚙️</h2>
 Tested for PyTorch 2.0, Python 3.10 (use other versions at your own risk!)
@ -89,15 +89,13 @@ If already installed, update package to most recent commit

 If wishing to modify this package, clone and install in editable mode:
 ```
-$ git clone https://github.com/m-bain/whisperX.git@v3
+$ git clone https://github.com/m-bain/whisperX.git
 $ cd whisperX
-$ git checkout v3
 $ pip install -e .
 ```

 You may also need to install ffmpeg, rust etc. Follow openAI instructions here https://github.com/openai/whisper#setup.

-
 ### Speaker Diarization
 To **enable Speaker. Diarization**, include your Hugging Face access token that you can generate from [Here](https://huggingface.co/settings/tokens) after the `--hf_token` argument and accept the user agreement for the following models: [Segmentation](https://huggingface.co/pyannote/segmentation) , [Voice Activity Detection (VAD)](https://huggingface.co/pyannote/voice-activity-detection) , and [Speaker Diarization](https://huggingface.co/pyannote/speaker-diarization)

@ -106,15 +104,11 @@ To **enable Speaker. Diarization**, include your Hugging Face access token that

 ### English

-Run whisper on example segment (using default params)
+Run whisper on example segment (using default params, whisper small). Add `--highlight_words True` to visualise word timings in the .srt file.

    whisperx examples/sample01.wav


-For increased timestamp accuracy, at the cost of higher gpu mem, use bigger models (bigger alignment model not found to be that helpful, see paper) e.g.
-
-    whisperx examples/sample01.wav --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H
-
 Result using *WhisperX* with forced alignment to wav2vec2.0 large:

 https://user-images.githubusercontent.com/36994049/208253969-7e35fe2a-7541-434a-ae91-8e919540555d.mp4
@ -123,6 +117,16 @@ Compare this to original whisper out the box, where many transcriptions are out

 https://user-images.githubusercontent.com/36994049/207743923-b4f0d537-29ae-4be2-b404-bb941db73652.mov

+
+For increased timestamp accuracy, at the cost of higher gpu mem, use bigger models (bigger alignment model not found to be that helpful, see paper) e.g.
+
+    whisperx examples/sample01.wav --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --batch_size 4
+
+
+To label the transcript with speaker IDs (set number of speakers if known e.g. `--min_speakers 2` `--max_speakers 2`):
+
+    whisperx examples/sample01.wav --model large-v2 --diarize --highlight_words True
+
 ### Other languages

 The phoneme ASR alignment model is *language-specific*, for tested languages these models are [automatically picked from torchaudio pipelines or huggingface](https://github.com/m-bain/whisperX/blob/e909f2f766b23b2000f2d95df41f9b844ac53e49/whisperx/transcribe.py#L22).
@ -132,7 +136,7 @@ Currently default models provided for `{en, fr, de, es, it, ja, zh, nl, uk, pt}`


 #### E.g. German
-    whisperx --model large --language de examples/sample_de_01.wav
+    whisperx --model large-v2 --language de examples/sample_de_01.wav

 https://user-images.githubusercontent.com/36994049/208298811-e36002ba-3698-4731-97d4-0aebd07e0eb3.mov

@ -143,79 +147,107 @@ See more examples in other languages [here](EXAMPLES.md).

 ```python
 import whisperx
+import gc 

 device = "cuda" 
 audio_file = "audio.mp3"
+batch_size = 16 # reduce if low on GPU mem
+compute_type = "float16" # change to "int8" if low on GPU mem (may reduce accuracy)

-# transcribe with original whisper
-model = whisperx.load_model("large-v2", device)
+# 1. Transcribe with original whisper (batched)
+model = whisperx.load_model("large-v2", device, compute_type=compute_type)

 audio = whisperx.load_audio(audio_file)
-result = model.transcribe(audio, batch_size=8)
-
+result = model.transcribe(audio, batch_size=batch_size)
 print(result["segments"]) # before alignment

-# load alignment model and metadata
+# delete model if low on GPU resources
+# import torch; del model; gc.collect(); torch.cuda.empty_cache()
+
+# 2. Align whisper output
 model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
+result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

-# align whisper output
-result_aligned = whisperx.align(result["segments"], model_a, metadata, audio, device)
+print(result["segments"]) # after alignment

-print(result_aligned["segments"]) # after alignment
-print(result_aligned["word_segments"]) # after alignment
+# delete model if low on GPU resources
+# import torch; del model_a; gc.collect(); torch.cuda.empty_cache()
+
+# 3. Assign speaker labels
+diarize_model = whisperx.DiarizationPipeline(use_auth_token=YOUR_HF_TOKEN, device=device)
+
+# add min/max number of speakers if known
+diarize_segments = diarize_model(audio_file)
+# diarize_model(audio_file, min_speakers=min_speakers, max_speakers=max_speakers)
+
+result = whisperx.assign_word_speakers(diarize_segments, result)
+print(diarize_segments)
+print(result["segments"]) # segments are now assigned speaker IDs
 ```


-<h2 align="left" id="whisper-mod">Whisper Modifications</h2>
+<h2 align="left" id="whisper-mod">Technical Details 👷‍♂️</h2>

-In addition to forced alignment, the following two modifications have been made to the whisper transcription method:
+For specific details on the batching and alignment, the effect of VAD, as well as the chosen alignment model, see the preprint [paper](https://www.robots.ox.ac.uk/~vgg/publications/2023/Bain23/bain23.pdf).

-1. `--condition_on_prev_text` is set to `False` by default (reduces hallucination)
+To reduce GPU memory requirements, try any of the following (2. & 3. can affect quality):
+1. Reduce batch size, e.g. `--batch_size 4`
+2. Use a smaller ASR model, e.g. `--model base`
+3. Use a lighter compute type, e.g. `--compute_type int8`
+
+Transcription differences from openai's whisper:
+1. Transcription without timestamps. To enable single-pass batching, whisper inference is performed with `--without_timestamps True`, which ensures 1 forward pass per sample in the batch. However, this can cause discrepancies with the default whisper output.
+2. VAD-based segment transcription, unlike the buffered transcription of openai's. In the WhisperX paper we show this reduces WER and enables accurate batched inference.
+3. `--condition_on_prev_text` is set to `False` by default (reduces hallucination)

 <h2 align="left" id="limitations">Limitations ⚠️</h2>

- Whisper normalises spoken numbers e.g. "fifty seven" to arabic numerals "57". Need to perform this normalization after alignment, so the phonemes can be aligned. Currently just ignores numbers.
- If setting `--vad_filter False`, then whisperx assumes the initial whisper timestamps are accurate to some degree (within margin of 2 seconds, adjust if needed -- bigger margins more prone to alignment errors)
+- Transcript words which do not contain characters in the alignment model's dictionary e.g. "2014." or "£13.60" cannot be aligned and therefore are not given a timing.
 - Overlapping speech is not handled particularly well by whisper nor whisperx
- Diariazation is far from perfect.
+- Diarization is far from perfect (working on this with a custom model for v4; see the contact section).
+- A language-specific wav2vec2 model is needed for alignment


 <h2 align="left" id="contribute">Contribute 🧑‍🏫</h2>

-If you are multilingual, a major way you can contribute to this project is to find phoneme models on huggingface (or train your own) and test them on speech for the target language. If the results look good send a merge request and some examples showing its success.
+If you are multilingual, a major way you can contribute to this project is to find phoneme models on huggingface (or train your own) and test them on speech for the target language. If the results look good send a pull request and some examples showing its success.

-The next major upgrade we are working on is whisper with speaker diarization, so if you have any experience on this please share.
+Bug finding and pull requests are also highly appreciated to keep this project going, since it's already diverging from the original research scope.

-<h2 align="left" id="coming-soon">Coming Soon 🗓</h2>
+<h2 align="left" id="coming-soon">TODO 🗓</h2>

 * [x] Multilingual init

-* [x] Subtitle .ass output
-
 * [x] Automatic align model selection based on language detection

 * [x] Python usage

-* [x] Character level timestamps
-
 * [x] Incorporating  speaker diarization

 * [x] Model flush, for low gpu mem resources

 * [x] Faster-whisper backend

+* [x] Add max-line etc. see (openai's whisper utils.py)
+
+* [x] Sentence-level segments (nltk toolbox)
+
+* [x] Improve alignment logic
+
+* [ ] update examples with diarization and word highlighting
+
+* [ ] Subtitle .ass output <- bring this back (removed in v3)
+
 * [ ] Add benchmarking code (TEDLIUM for spd/WER & word segmentation)

 * [ ] Allow silero-vad as alternative VAD option

-* [ ] Add max-line etc. see (openai's whisper utils.py)
-
 * [ ] Improve diarization (word level). *Harder than first thought...*


 <h2 align="left" id="contact">Contact/Support 📇</h2>

-Contact maxhbain@gmail.com for queries and licensing / early access to a model API with batched inference (transcribe 1hr audio in under 1min).
+Contact maxhbain@gmail.com for queries. WhisperX v4 development is underway with significantly improved diarization. To support v4 and get early access, get in touch.

 <a href="https://www.buymeacoffee.com/maxhbain" target="_blank"><img src="https://cdn.buymeacoffee.com/buttons/default-orange.png" alt="Buy Me A Coffee" height="41" width="174"></a>

@ -224,14 +256,16 @@ Contact maxhbain@gmail.com for queries and licensing / early access to a model A

 This work, and my PhD, is supported by the [VGG (Visual Geometry Group)](https://www.robots.ox.ac.uk/~vgg/) and the University of Oxford.

-
 Of course, this is builds on [openAI's whisper](https://github.com/openai/whisper).
 And borrows important alignment code from [PyTorch tutorial on forced alignment](https://pytorch.org/tutorials/intermediate/forced_alignment_with_torchaudio_tutorial.html)

-Valuable VAD & Diarization Models from (pyannote.audio)[https://github.com/pyannote/pyannote-audio]
+Valuable VAD & Diarization Models from [pyannote audio](https://github.com/pyannote/pyannote-audio)

-Great backend from (faster-whisper)[https://github.com/guillaumekln/faster-whisper] and (CTranslate2)[https://github.com/OpenNMT/CTranslate2]
+Great backend from [faster-whisper](https://github.com/guillaumekln/faster-whisper) and [CTranslate2](https://github.com/OpenNMT/CTranslate2)

+Those who have [supported this work financially](https://www.buymeacoffee.com/maxhbain) 🙏
+
+Finally, thanks to the OS [contributors](https://github.com/m-bain/whisperX/graphs/contributors) of this project, keeping it going and identifying bugs.

 <h2 align="left" id="cite">Citation</h2>
 If you use this in your research, please cite the paper:
--- a/setup.py
+++ b/setup.py
@ -6,7 +6,7 @@ from setuptools import setup, find_packages
 setup(
    name="whisperx",
    py_modules=["whisperx"],
-    version="3.0.2",
+    version="3.1.0",
    description="Time-Accurate Automatic Speech Recognition using Whisper.",
    readme="README.md",
    python_requires=">=3.8",
--- a/whisperx/__init__.py
+++ b/whisperx/__init__.py
@ -1,3 +1,4 @@
 from .transcribe import load_model
 from .alignment import load_align_model, align
-from .audio import load_audio
+from .audio import load_audio
+from .diarize import assign_word_speakers, DiarizationPipeline
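Note on the new exports: `assign_word_speakers` and `DiarizationPipeline` are now importable from the package root, which is what the README's step 3 relies on. As a rough illustration of what assigning speaker labels to transcript segments involves, here is a minimal sketch that picks, for each segment, the diarization turn with the largest time overlap. It is not the library's implementation; the helper names (`overlap`, `label_segments`) and the dictionary layout of `turns` are assumptions made for this example only.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments, turns):
    """Copy each segment and attach the speaker whose turn overlaps it the most.

    Hypothetical helper: `segments` and `turns` are plain lists of dicts here,
    not the objects returned by whisperx or pyannote.
    """
    labelled = []
    for seg in segments:
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            shared = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = turn["speaker"], shared
        new_seg = dict(seg)
        if best_speaker is not None:
            # Segments with no overlapping turn are left without a "speaker" key.
            new_seg["speaker"] = best_speaker
        labelled.append(new_seg)
    return labelled


turns = [
    {"start": 0.0, "end": 2.0, "speaker": "SPEAKER_00"},
    {"start": 2.0, "end": 5.0, "speaker": "SPEAKER_01"},
]
segments = [
    {"start": 0.2, "end": 1.8, "text": "Hello there."},
    {"start": 1.9, "end": 4.0, "text": "Hi, how are you?"},
    {"start": 6.0, "end": 7.0, "text": "Bye."},
]
labelled = label_segments(segments, turns)
```

The same overlap rule can be applied per word instead of per segment when word-level timings are available.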
--- a/whisperx/alignment.py
+++ b/whisperx/alignment.py
@ -287,6 +287,7 @@ def align(
                curr_chars.fillna(-1, inplace=True)
                curr_chars = curr_chars.to_dict("records")
                curr_chars = [{key: val for key, val in char.items() if val != -1} for char in curr_chars]
+                aligned_subsegments[-1]["chars"] = curr_chars

        aligned_subsegments = pd.DataFrame(aligned_subsegments)
        aligned_subsegments["start"] = interpolate_nans(aligned_subsegments["start"], method=interpolate_method)
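Note on the `chars` field: the hunk above stores the per-character records on the last aligned subsegment after dropping fields that carry no value (the diff does this by filling NaNs with `-1` and filtering on that sentinel). The sketch below shows the same idea in plain Python without pandas, checking for `None`/NaN directly instead of using a sentinel. The function name `drop_missing_fields` and the sample records are assumptions for illustration, not code from the repository.

```python
import math


def drop_missing_fields(char_records):
    """Return copies of the records without keys whose value is None or NaN."""
    cleaned = []
    for record in char_records:
        cleaned.append({
            key: val
            for key, val in record.items()
            if val is not None and not (isinstance(val, float) and math.isnan(val))
        })
    return cleaned


# Hypothetical per-character records: the space could not be aligned,
# so its timing fields are NaN.
chars = [
    {"char": "h", "start": 0.10, "end": 0.15, "score": 0.98},
    {"char": " ", "start": float("nan"), "end": float("nan"), "score": float("nan")},
    {"char": "i", "start": 0.20, "end": 0.26, "score": 0.91},
]

segment = {"text": "h i", "start": 0.10, "end": 0.26}
segment["chars"] = drop_missing_fields(chars)
```

Unaligned characters therefore stay in the list, so the character sequence still matches the segment text, but they carry no `start`, `end` or `score` keys.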
--- a/whisperx/transcribe.py
+++ b/whisperx/transcribe.py
@ -35,6 +35,7 @@ def cli():
    parser.add_argument("--align_model", default=None, help="Name of phoneme-level ASR model to do alignment")
    parser.add_argument("--interpolate_method", default="nearest", choices=["nearest", "linear", "ignore"], help="For word .srt, method to assign timestamps to non-aligned words, or merge them into neighbouring.")
    parser.add_argument("--no_align", action='store_true', help="Do not perform phoneme alignment")
+    parser.add_argument("--return_char_alignments", action='store_true', help="Return character-level alignments in the output json file")

    # vad params
    parser.add_argument("--vad_onset", type=float, default=0.500, help="Onset threshold for VAD (see pyannote.audio), reduce this if speech is not being detected")
@ -42,8 +43,8 @@ def cli():

    # diarization params
    parser.add_argument("--diarize", action="store_true", help="Apply diarization to assign speaker labels to each segment/word")
-    parser.add_argument("--min_speakers", default=None, type=int)
-    parser.add_argument("--max_speakers", default=None, type=int)
+    parser.add_argument("--min_speakers", default=None, type=int, help="Minimum number of speakers in audio file")
+    parser.add_argument("--max_speakers", default=None, type=int, help="Maximum number of speakers in audio file")

    parser.add_argument("--temperature", type=float, default=0, help="temperature to use for sampling")
    parser.add_argument("--best_of", type=optional_int, default=5, help="number of candidates when sampling with non-zero temperature")
@ -85,6 +86,7 @@ def cli():
    align_model: str = args.pop("align_model")
    interpolate_method: str = args.pop("interpolate_method")
    no_align: bool = args.pop("no_align")
+    return_char_alignments: bool = args.pop("return_char_alignments")

    hf_token: str = args.pop("hf_token")
    vad_onset: float = args.pop("vad_onset")
@ -171,7 +173,7 @@ def cli():
                    print(f"New language found ({result['language']})! Previous was ({align_metadata['language']}), loading new alignment model for new language...")
                    align_model, align_metadata = load_align_model(result["language"], device)
                print(">>Performing alignment...")
-                result = align(result["segments"], align_model, align_metadata, input_audio, device, interpolate_method=interpolate_method)
+                result = align(result["segments"], align_model, align_metadata, input_audio, device, interpolate_method=interpolate_method, return_char_alignments=return_char_alignments)

            results.append((result, audio_path))
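Note on the flag plumbing: the `transcribe.py` hunks add a `store_true` flag, pop it from the parsed arguments, and forward it to `align()` as a keyword argument. The sketch below reproduces that pattern in isolation so it can be run without the package installed. `align_stub` is a hypothetical stand-in for the real `align()` and only mimics the presence or absence of a `chars` field; `build_parser` and `run` are likewise names invented for this example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("audio", nargs="+", help="audio file(s) to transcribe")
    # store_true: the option defaults to False and becomes True when the flag is given.
    parser.add_argument("--return_char_alignments", action="store_true",
                        help="Return character-level alignments in the output json file")
    return parser


def align_stub(segments, return_char_alignments=False):
    """Hypothetical stand-in for align(); adds a "chars" list only on request."""
    aligned = [dict(seg) for seg in segments]
    if return_char_alignments:
        for seg in aligned:
            seg["chars"] = [{"char": c} for c in seg["text"]]
    return {"segments": aligned}


def run(argv):
    args = vars(build_parser().parse_args(argv))
    # Pop the option so it is not forwarded to unrelated calls further down.
    return_char_alignments = args.pop("return_char_alignments")
    segments = [{"text": "hi", "start": 0.0, "end": 0.5}]
    return align_stub(segments, return_char_alignments=return_char_alignments)


without_chars = run(["audio.wav"])
with_chars = run(["audio.wav", "--return_char_alignments"])
```

Because the flag is `store_true`, it takes no value on the command line: `--return_char_alignments` alone turns it on, unlike options such as `--highlight_words True`.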