Mirror of https://github.com/m-bain/whisperX.git (synced 2025-07-01 18:17:27 -04:00)
Compare commits: v3.3.2...improve-co (37 commits)
Commits (SHA1): 88939b9e8a, 8c58c54635, 0d9807adc5, 4db839018c, f8d11df727, 44e8bf5bb6, 7b3c9ce629, 36d2622e27, 8bfa12193b, acbeba6057, fca563a782, 2117909bf6, de0d8fe313, 355f8e06f7, 86e2b3ee74, 70c639cdb5, 235536e28d, 12604a48ea, ffbc73664c, 289eadfc76, 22a93f2932, 1027367b79, 5e54b872a9, 6be02cccfa, 2f93e029c7, 024bc8481b, f286e7f3de, 73e644559d, 1ec527375a, 6695426a85, 7a98456321, aaddb83aa5, c288f4812a, 4ebfb078c5, 65b2332e13, 69281f3a29, 79eb8fa53d
README.md (20 changed lines)

@@ -129,7 +129,7 @@ To **enable Speaker Diarization**, include your Hugging Face access token (read)

 Run whisper on example segment (using default params, whisper small) add `--highlight_words True` to visualise word timings in the .srt file.

-    whisperx examples/sample01.wav
+    whisperx path/to/audio.wav

 Result using *WhisperX* with forced alignment to wav2vec2.0 large:

@@ -143,27 +143,27 @@ https://user-images.githubusercontent.com/36994049/207743923-b4f0d537-29ae-4be2-

 For increased timestamp accuracy, at the cost of higher gpu mem, use bigger models (bigger alignment model not found to be that helpful, see paper) e.g.

-    whisperx examples/sample01.wav --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --batch_size 4
+    whisperx path/to/audio.wav --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --batch_size 4

 To label the transcript with speaker ID's (set number of speakers if known e.g. `--min_speakers 2` `--max_speakers 2`):

-    whisperx examples/sample01.wav --model large-v2 --diarize --highlight_words True
+    whisperx path/to/audio.wav --model large-v2 --diarize --highlight_words True

 To run on CPU instead of GPU (and for running on Mac OS X):

-    whisperx examples/sample01.wav --compute_type int8
+    whisperx path/to/audio.wav --compute_type int8

 ### Other languages

-The phoneme ASR alignment model is *language-specific*, for tested languages these models are [automatically picked from torchaudio pipelines or huggingface](https://github.com/m-bain/whisperX/blob/e909f2f766b23b2000f2d95df41f9b844ac53e49/whisperx/transcribe.py#L22).
+The phoneme ASR alignment model is *language-specific*, for tested languages these models are [automatically picked from torchaudio pipelines or huggingface](https://github.com/m-bain/whisperX/blob/f2da2f858e99e4211fe4f64b5f2938b007827e17/whisperx/alignment.py#L24-L58).
 Just pass in the `--language` code, and use the whisper `--model large`.

-Currently default models provided for `{en, fr, de, es, it, ja, zh, nl, uk, pt}`. If the detected language is not in this list, you need to find a phoneme-based ASR model from [huggingface model hub](https://huggingface.co/models) and test it on your data.
+Currently default models provided for `{en, fr, de, es, it}` via torchaudio pipelines and many other languages via Hugging Face. Please find the list of currently supported languages under `DEFAULT_ALIGN_MODELS_HF` on [alignment.py](https://github.com/m-bain/whisperX/blob/main/whisperx/alignment.py). If the detected language is not in this list, you need to find a phoneme-based ASR model from [huggingface model hub](https://huggingface.co/models) and test it on your data.

 #### E.g. German
-    whisperx --model large-v2 --language de examples/sample_de_01.wav
+    whisperx --model large-v2 --language de path/to/audio.wav

 https://user-images.githubusercontent.com/36994049/208298811-e36002ba-3698-4731-97d4-0aebd07e0eb3.mov

@@ -278,7 +278,7 @@ Bug finding and pull requests are also highly appreciated to keep this project g

 * [ ] Add benchmarking code (TEDLIUM for spd/WER & word segmentation)
-* [ ] Allow silero-vad as alternative VAD option
+* [x] Allow silero-vad as alternative VAD option
 * [ ] Improve diarization (word level). *Harder than first thought...*

@@ -300,7 +300,9 @@ Borrows important alignment code from [PyTorch tutorial on forced alignment](htt

 And uses the wonderful pyannote VAD / Diarization https://github.com/pyannote/pyannote-audio

-Valuable VAD & Diarization Models from [pyannote audio](https://github.com/pyannote/pyannote-audio)
+Valuable VAD & Diarization Models from:
+- [pyannote audio][https://github.com/pyannote/pyannote-audio]
+- [silero vad][https://github.com/snakers4/silero-vad]

 Great backend from [faster-whisper](https://github.com/guillaumekln/faster-whisper) and [CTranslate2](https://github.com/OpenNMT/CTranslate2)
whisperx/__init__.py

@@ -1,4 +1,4 @@
-from .transcribe import load_model
 from .alignment import load_align_model, align
 from .audio import load_audio
 from .diarize import assign_word_speakers, DiarizationPipeline
+from .asr import load_model
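After this change the package-level `load_model` resolves to `whisperx.asr` rather than the old `whisperx.transcribe`. A minimal sketch of the top-level API as exposed by this `__init__.py`, mirroring the README's documented Python usage (model size, device and audio path are placeholders):

```python
import whisperx

device = "cpu"
model = whisperx.load_model("small", device, compute_type="int8")   # from .asr
audio = whisperx.load_audio("path/to/audio.wav")                    # from .audio
result = model.transcribe(audio, batch_size=4)

align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)  # from .alignment
```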
whisperx/alignment.py

@@ -1,7 +1,9 @@
-""""
+"""
 Forced Alignment with Whisper
 C. Max Bain
 """
+import math
+
 from dataclasses import dataclass
 from typing import Iterable, Optional, Union, List

@@ -13,8 +15,13 @@ from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

 from .audio import SAMPLE_RATE, load_audio
 from .utils import interpolate_nans
-from .types import AlignedTranscriptionResult, SingleSegment, SingleAlignedSegment, SingleWordSegment
-import nltk
+from .types import (
+    AlignedTranscriptionResult,
+    SingleSegment,
+    SingleAlignedSegment,
+    SingleWordSegment,
+    SegmentData,
+)
 from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

 PUNKT_ABBREVIATIONS = ['dr', 'vs', 'mr', 'mrs', 'prof']

@@ -62,6 +69,8 @@ DEFAULT_ALIGN_MODELS_HF = {
     "eu": "stefan-it/wav2vec2-large-xlsr-53-basque",
     "gl": "ifrz/wav2vec2-large-xlsr-galician",
     "ka": "xsway/wav2vec2-large-xlsr-georgian",
+    "lv": "jimregan/wav2vec2-large-xlsr-latvian-cv",
+    "tl": "Khalsuu/filipino-wav2vec2-l-xls-r-300m-official",
 }

@@ -131,6 +140,8 @@ def align(

     # 1. Preprocess to keep only characters in dictionary
     total_segments = len(transcript)
+    # Store temporary processing values
+    segment_data: dict[int, SegmentData] = {}
     for sdx, segment in enumerate(transcript):
         # strip spaces at beginning / end, but keep track of the amount.
         if print_progress:

@@ -163,10 +174,17 @@ def align(
             elif char_ in model_dictionary.keys():
                 clean_char.append(char_)
                 clean_cdx.append(cdx)
+            else:
+                # add placeholder
+                clean_char.append('*')
+                clean_cdx.append(cdx)

         clean_wdx = []
         for wdx, wrd in enumerate(per_word):
-            if any([c in model_dictionary.keys() for c in wrd]):
+            if any([c in model_dictionary.keys() for c in wrd.lower()]):
+                clean_wdx.append(wdx)
+            else:
+                # index for placeholder
                 clean_wdx.append(wdx)

@@ -175,10 +193,12 @@ def align(
         sentence_splitter = PunktSentenceTokenizer(punkt_param)
         sentence_spans = list(sentence_splitter.span_tokenize(text))

-        segment["clean_char"] = clean_char
-        segment["clean_cdx"] = clean_cdx
-        segment["clean_wdx"] = clean_wdx
-        segment["sentence_spans"] = sentence_spans
+        segment_data[sdx] = {
+            "clean_char": clean_char,
+            "clean_cdx": clean_cdx,
+            "clean_wdx": clean_wdx,
+            "sentence_spans": sentence_spans
+        }

     aligned_segments: List[SingleAlignedSegment] = []

@@ -194,13 +214,14 @@ def align(
             "end": t2,
             "text": text,
             "words": [],
+            "chars": None,
         }

         if return_char_alignments:
             aligned_seg["chars"] = []

         # check we can align
-        if len(segment["clean_char"]) == 0:
+        if len(segment_data[sdx]["clean_char"]) == 0:
             print(f'Failed to align segment ("{segment["text"]}"): no characters in this segment found in model dictionary, resorting to original...')
             aligned_segments.append(aligned_seg)
             continue

@@ -210,8 +231,8 @@ def align(
             aligned_segments.append(aligned_seg)
             continue

-        text_clean = "".join(segment["clean_char"])
-        tokens = [model_dictionary[c] for c in text_clean]
+        text_clean = "".join(segment_data[sdx]["clean_char"])
+        tokens = [model_dictionary.get(c, -1) for c in text_clean]

         f1 = int(t1 * SAMPLE_RATE)
         f2 = int(t2 * SAMPLE_RATE)

@@ -244,7 +265,8 @@ def align(
             blank_id = code

         trellis = get_trellis(emission, tokens, blank_id)
-        path = backtrack(trellis, emission, tokens, blank_id)
+        # path = backtrack(trellis, emission, tokens, blank_id)
+        path = backtrack_beam(trellis, emission, tokens, blank_id, beam_width=2)

         if path is None:
             print(f'Failed to align segment ("{segment["text"]}"): backtrack failed, resorting to original...')

@@ -261,8 +283,8 @@ def align(
         word_idx = 0
         for cdx, char in enumerate(text):
             start, end, score = None, None, None
-            if cdx in segment["clean_cdx"]:
-                char_seg = char_segments[segment["clean_cdx"].index(cdx)]
+            if cdx in segment_data[sdx]["clean_cdx"]:
+                char_seg = char_segments[segment_data[sdx]["clean_cdx"].index(cdx)]
                 start = round(char_seg.start * ratio + t1, 3)
                 end = round(char_seg.end * ratio + t1, 3)
                 score = round(char_seg.score, 3)

@@ -288,9 +310,9 @@ def align(
         aligned_subsegments = []
         # assign sentence_idx to each character index
         char_segments_arr["sentence-idx"] = None
-        for sdx, (sstart, send) in enumerate(segment["sentence_spans"]):
+        for sdx2, (sstart, send) in enumerate(segment_data[sdx]["sentence_spans"]):
             curr_chars = char_segments_arr.loc[(char_segments_arr.index >= sstart) & (char_segments_arr.index <= send)]
-            char_segments_arr.loc[(char_segments_arr.index >= sstart) & (char_segments_arr.index <= send), "sentence-idx"] = sdx
+            char_segments_arr.loc[(char_segments_arr.index >= sstart) & (char_segments_arr.index <= send), "sentence-idx"] = sdx2

             sentence_text = text[sstart:send]
             sentence_start = curr_chars["start"].min()

@@ -360,70 +382,203 @@ def align(
 """
 source: https://pytorch.org/tutorials/intermediate/forced_alignment_with_torchaudio_tutorial.html
 """


 def get_trellis(emission, tokens, blank_id=0):
     num_frame = emission.size(0)
     num_tokens = len(tokens)

-    # Trellis has extra diemsions for both time axis and tokens.
-    # The extra dim for tokens represents <SoS> (start-of-sentence)
-    # The extra dim for time axis is for simplification of the code.
-    trellis = torch.empty((num_frame + 1, num_tokens + 1))
-    trellis[0, 0] = 0
-    trellis[1:, 0] = torch.cumsum(emission[:, 0], 0)
-    trellis[0, -num_tokens:] = -float("inf")
-    trellis[-num_tokens:, 0] = float("inf")
+    trellis = torch.zeros((num_frame, num_tokens))
+    trellis[1:, 0] = torch.cumsum(emission[1:, blank_id], 0)
+    trellis[0, 1:] = -float("inf")
+    trellis[-num_tokens + 1:, 0] = float("inf")

-    for t in range(num_frame):
+    for t in range(num_frame - 1):
         trellis[t + 1, 1:] = torch.maximum(
             # Score for staying at the same token
             trellis[t, 1:] + emission[t, blank_id],
             # Score for changing to the next token
-            trellis[t, :-1] + emission[t, tokens],
+            # trellis[t, :-1] + emission[t, tokens[1:]],
+            trellis[t, :-1] + get_wildcard_emission(emission[t], tokens[1:], blank_id),
         )
     return trellis


+def get_wildcard_emission(frame_emission, tokens, blank_id):
+    """Processing token emission scores containing wildcards (vectorized version)
+
+    Args:
+        frame_emission: Emission probability vector for the current frame
+        tokens: List of token indices
+        blank_id: ID of the blank token
+
+    Returns:
+        tensor: Maximum probability score for each token position
+    """
+    assert 0 <= blank_id < len(frame_emission)
+
+    # Convert tokens to a tensor if they are not already
+    tokens = torch.tensor(tokens) if not isinstance(tokens, torch.Tensor) else tokens
+
+    # Create a mask to identify wildcard positions
+    wildcard_mask = (tokens == -1)
+
+    # Get scores for non-wildcard positions
+    regular_scores = frame_emission[tokens.clamp(min=0)]  # clamp to avoid -1 index
+
+    # Create a mask and compute the maximum value without modifying frame_emission
+    max_valid_score = frame_emission.clone()   # Create a copy
+    max_valid_score[blank_id] = float('-inf')  # Modify the copy to exclude the blank token
+    max_valid_score = max_valid_score.max()
+
+    # Use where operation to combine results
+    result = torch.where(wildcard_mask, max_valid_score, regular_scores)
+
+    return result
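As a quick sanity check of the wildcard scoring introduced above, a toy call (assuming `get_wildcard_emission` is importable from `whisperx.alignment` as defined in this change):

```python
import torch
from whisperx.alignment import get_wildcard_emission

# Log-probabilities for one frame; index 0 is the blank token.
frame = torch.log_softmax(torch.tensor([0.1, 2.0, 0.5, 1.5]), dim=-1)
tokens = [1, -1, 3]   # -1 marks a character missing from the model dictionary

scores = get_wildcard_emission(frame, tokens, blank_id=0)
# scores[0] == frame[1] and scores[2] == frame[3]; the wildcard at position 1
# receives the maximum non-blank score, which here is also frame[1].
print(scores)
```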
 @dataclass
 class Point:
     token_index: int
     time_index: int
     score: float


 def backtrack(trellis, emission, tokens, blank_id=0):
-    # Note:
-    # j and t are indices for trellis, which has extra dimensions
-    # for time and tokens at the beginning.
-    # When referring to time frame index `T` in trellis,
-    # the corresponding index in emission is `T-1`.
-    # Similarly, when referring to token index `J` in trellis,
-    # the corresponding index in transcript is `J-1`.
-    j = trellis.size(1) - 1
-    t_start = torch.argmax(trellis[:, j]).item()
-
-    path = []
-    for t in range(t_start, 0, -1):
+    t, j = trellis.size(0) - 1, trellis.size(1) - 1
+
+    path = [Point(j, t, emission[t, blank_id].exp().item())]
+    while j > 0:
+        # Should not happen but just in case
+        assert t > 0
+
         # 1. Figure out if the current position was stay or change
-        # Note (again):
-        # `emission[J-1]` is the emission at time frame `J` of trellis dimension.
-        # Score for token staying the same from time frame J-1 to T.
-        stayed = trellis[t - 1, j] + emission[t - 1, blank_id]
-        # Score for token changing from C-1 at T-1 to J at T.
-        changed = trellis[t - 1, j - 1] + emission[t - 1, tokens[j - 1]]
-
-        # 2. Store the path with frame-wise probability.
-        prob = emission[t - 1, tokens[j - 1] if changed > stayed else 0].exp().item()
-        # Return token index and time index in non-trellis coordinate.
-        path.append(Point(j - 1, t - 1, prob))
-
-        # 3. Update the token
+        # Frame-wise score of stay vs change
+        p_stay = emission[t - 1, blank_id]
+        # p_change = emission[t - 1, tokens[j]]
+        p_change = get_wildcard_emission(emission[t - 1], [tokens[j]], blank_id)[0]
+
+        # Context-aware score for stay vs change
+        stayed = trellis[t - 1, j] + p_stay
+        changed = trellis[t - 1, j - 1] + p_change
+
+        # Update position
+        t -= 1
         if changed > stayed:
             j -= 1
-            if j == 0:
-                break
-    else:
-        # failed
-        return None
+
+        # Store the path with frame-wise probability.
+        prob = (p_change if changed > stayed else p_stay).exp().item()
+        path.append(Point(j, t, prob))
+
+    # Now j == 0, which means, it reached the SoS.
+    # Fill up the rest for the sake of visualization
+    while t > 0:
+        prob = emission[t - 1, blank_id].exp().item()
+        path.append(Point(j, t - 1, prob))
+        t -= 1
+
     return path[::-1]


+@dataclass
+class Path:
+    points: List[Point]
+    score: float
+
+
+@dataclass
+class BeamState:
+    """State in beam search."""
+    token_index: int   # Current token position
+    time_index: int    # Current time step
+    score: float       # Cumulative score
+    path: List[Point]  # Path history
+
+
+def backtrack_beam(trellis, emission, tokens, blank_id=0, beam_width=5):
+    """Standard CTC beam search backtracking implementation.
+
+    Args:
+        trellis (torch.Tensor): The trellis (or lattice) of shape (T, N), where T is the number of time steps
+                                and N is the number of tokens (including the blank token).
+        emission (torch.Tensor): The emission probabilities of shape (T, N).
+        tokens (List[int]): List of token indices (excluding the blank token).
+        blank_id (int, optional): The ID of the blank token. Defaults to 0.
+        beam_width (int, optional): The number of top paths to keep during beam search. Defaults to 5.
+
+    Returns:
+        List[Point]: the best path
+    """
+    T, J = trellis.size(0) - 1, trellis.size(1) - 1
+
+    init_state = BeamState(
+        token_index=J,
+        time_index=T,
+        score=trellis[T, J],
+        path=[Point(J, T, emission[T, blank_id].exp().item())]
+    )
+
+    beams = [init_state]
+
+    while beams and beams[0].token_index > 0:
+        next_beams = []
+
+        for beam in beams:
+            t, j = beam.time_index, beam.token_index
+
+            if t <= 0:
+                continue
+
+            p_stay = emission[t - 1, blank_id]
+            p_change = get_wildcard_emission(emission[t - 1], [tokens[j]], blank_id)[0]
+
+            stay_score = trellis[t - 1, j]
+            change_score = trellis[t - 1, j - 1] if j > 0 else float('-inf')
+
+            # Stay
+            if not math.isinf(stay_score):
+                new_path = beam.path.copy()
+                new_path.append(Point(j, t - 1, p_stay.exp().item()))
+                next_beams.append(BeamState(
+                    token_index=j,
+                    time_index=t - 1,
+                    score=stay_score,
+                    path=new_path
+                ))
+
+            # Change
+            if j > 0 and not math.isinf(change_score):
+                new_path = beam.path.copy()
+                new_path.append(Point(j - 1, t - 1, p_change.exp().item()))
+                next_beams.append(BeamState(
+                    token_index=j - 1,
+                    time_index=t - 1,
+                    score=change_score,
+                    path=new_path
+                ))
+
+        # sort by score
+        beams = sorted(next_beams, key=lambda x: x.score, reverse=True)[:beam_width]
+
+        if not beams:
+            break
+
+    if not beams:
+        return None
+
+    best_beam = beams[0]
+    t = best_beam.time_index
+    j = best_beam.token_index
+    while t > 0:
+        prob = emission[t - 1, blank_id].exp().item()
+        best_beam.path.append(Point(j, t - 1, prob))
+        t -= 1
+
+    return best_beam.path[::-1]
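A small end-to-end sketch of the new path search, assuming `get_trellis` and `backtrack_beam` are importable from `whisperx.alignment`; with random emissions the beam search may legitimately return `None`, which `align()` treats as a failed segment:

```python
import torch
from whisperx.alignment import get_trellis, backtrack_beam

torch.manual_seed(0)
emission = torch.log_softmax(torch.randn(40, 6), dim=-1)  # (frames, vocab); blank_id = 0
tokens = [2, 5, -1, 3]                                     # -1 is the wildcard placeholder

trellis = get_trellis(emission, tokens, blank_id=0)
path = backtrack_beam(trellis, emission, tokens, blank_id=0, beam_width=2)
print(None if path is None else len(path))
```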
 # Merge the labels
 @dataclass
 class Segment:
whisperx/asr.py (147 changed lines)

@@ -1,6 +1,5 @@
 import os
-import warnings
-from typing import List, NamedTuple, Optional, Union
+from typing import List, Optional, Union
 from dataclasses import replace

 import ctranslate2

@@ -14,10 +13,12 @@ from transformers.pipelines.pt_utils import PipelineIterator

 from .audio import N_SAMPLES, SAMPLE_RATE, load_audio, log_mel_spectrogram
 from .types import SingleSegment, TranscriptionResult
-from .vad import VoiceActivitySegmentation, load_vad_model, merge_chunks
+from .vads import Vad, Silero, Pyannote


 def find_numeral_symbol_tokens(tokenizer):
+    """
+    Finds tokens that represent numeral and symbols.
+    """
     numeral_symbol_tokens = []
     for i in range(tokenizer.eot):
         token = tokenizer.decode([i]).removeprefix(" ")

@@ -27,10 +28,10 @@ def find_numeral_symbol_tokens(tokenizer):
     return numeral_symbol_tokens

 class WhisperModel(faster_whisper.WhisperModel):
-    '''
-    FasterWhisperModel provides batched inference for faster-whisper.
-    Currently only works in non-timestamp mode and fixed prompt for all samples in batch.
-    '''
+    """
+    Wrapper around faster-whisper's WhisperModel to enable batched inference.
+    Currently, it only supports non-timestamp mode and a fixed prompt for all samples in a batch.
+    """

     def generate_segment_batched(
         self,

@@ -39,13 +40,28 @@ class WhisperModel(faster_whisper.WhisperModel):
         options: TranscriptionOptions,
         encoder_output=None,
     ):
+        """
+        Generates transcription for a batch of audio segments.
+
+        Args:
+            features: The input audio features.
+            tokenizer: The tokenizer used to decode the generated tokens.
+            options: Transcription options.
+            encoder_output: Output from the encoder model.
+
+        Returns:
+            The decoded transcription text.
+        """
         batch_size = features.shape[0]
+        # Initialize tokens and prompt for the generation process.
         all_tokens = []
         prompt_reset_since = 0
+        # Check if an initial prompt is provided and handle it.
         if options.initial_prompt is not None:
             initial_prompt = " " + options.initial_prompt.strip()
             initial_prompt_tokens = tokenizer.encode(initial_prompt)
             all_tokens.extend(initial_prompt_tokens)
+        # Prepare the prompt for the current batch.
         previous_tokens = all_tokens[prompt_reset_since:]
         prompt = self.get_prompt(
             tokenizer,

@@ -54,12 +70,15 @@
             prefix=options.prefix,
         )

+        # Encode the features to obtain the encoder output.
         encoder_output = self.encode(features)

+        # Determine the maximum initial timestamp index based on the options.
         max_initial_timestamp_index = int(
             round(options.max_initial_timestamp / self.time_precision)
         )

+        # Generate the transcription result for the batch.
         result = self.model.generate(
             encoder_output,
             [prompt] * batch_size,

@@ -71,100 +90,37 @@
             suppress_tokens=options.suppress_tokens,
         )

+        # Extract the token sequences from the result.
         tokens_batch = [x.sequences_ids[0] for x in result]

+        # Define an inner function to decode the tokens for each batch.
         def decode_batch(tokens: List[List[int]]) -> str:
             res = []
             for tk in tokens:
                 res.append([token for token in tk if token < tokenizer.eot])
-            # text_tokens = [token for token in tokens if token < self.eot]
             return tokenizer.tokenizer.decode_batch(res)

+        # Decode the tokens to get the transcription text.
         text = decode_batch(tokens_batch)

         return text

     def encode(self, features: np.ndarray) -> ctranslate2.StorageView:
-        # When the model is running on multiple GPUs, the encoder output should be moved
-        # to the CPU since we don't know which GPU will handle the next job.
+        """
+        Encodes the audio features using the CTranslate2 storage.
+
+        When the model is running on multiple GPUs, the encoder output should be moved
+        to the CPU since we don't know which GPU will handle the next job.
+        """
+        # When the model is running on multiple GPUs, the encoder output should be moved to the CPU.
         to_cpu = self.model.device == "cuda" and len(self.model.device_index) > 1
-        # unsqueeze if batch size = 1
+        # If the batch size is 1, unsqueeze the features to ensure it is a 3D array.
         if len(features.shape) == 2:
             features = np.expand_dims(features, 0)
         features = get_ctranslate2_storage(features)
+        # call the model
         return self.model.encode(features, to_cpu=to_cpu)

-
-class FasterWhisperPipeline(Pipeline):
-    """
-    Huggingface Pipeline wrapper for FasterWhisperModel.
-    """
-    # TODO:
-    # - add support for timestamp mode
-    # - add support for custom inference kwargs
-
-    def __init__(
-        self,
-        model: WhisperModel,
-        vad: VoiceActivitySegmentation,
-        vad_params: dict,
-        options: TranscriptionOptions,
-        tokenizer: Optional[Tokenizer] = None,
-        device: Union[int, str, "torch.device"] = -1,
-        framework="pt",
-        language: Optional[str] = None,
-        suppress_numerals: bool = False,
-        **kwargs,
-    ):
-        self.model = model
-        self.tokenizer = tokenizer
-        self.options = options
-        self.preset_language = language
-        self.suppress_numerals = suppress_numerals
-        self._batch_size = kwargs.pop("batch_size", None)
-        self._num_workers = 1
-        self._preprocess_params, self._forward_params, self._postprocess_params = self._sanitize_parameters(**kwargs)
-        self.call_count = 0
-        self.framework = framework
-        if self.framework == "pt":
-            if isinstance(device, torch.device):
-                self.device = device
-            elif isinstance(device, str):
-                self.device = torch.device(device)
-            elif device < 0:
-                self.device = torch.device("cpu")
-            else:
-                self.device = torch.device(f"cuda:{device}")
-        else:
-            self.device = device
-
-        super(Pipeline, self).__init__()
-        self.vad_model = vad
-        self._vad_params = vad_params
-
-    def _sanitize_parameters(self, **kwargs):
-        preprocess_kwargs = {}
-        if "tokenizer" in kwargs:
-            preprocess_kwargs["maybe_arg"] = kwargs["maybe_arg"]
-        return preprocess_kwargs, {}, {}
-
-    def preprocess(self, audio):
-        audio = audio['inputs']
-        model_n_mels = self.model.feat_kwargs.get("feature_size")
-        features = log_mel_spectrogram(
-            audio,
-            n_mels=model_n_mels if model_n_mels is not None else 80,
-            padding=N_SAMPLES - audio.shape[0],
-        )
-        return {'inputs': features}
-
-    def _forward(self, model_inputs):
-        outputs = self.model.generate_segment_batched(model_inputs['inputs'], self.tokenizer, self.options)
-        return {'text': outputs}
-
-    def postprocess(self, model_outputs):
-        return model_outputs
-
     def get_iterator(
         self,
         inputs,

@@ -208,7 +164,16 @@
                 # print(f2-f1)
                 yield {'inputs': audio[f1:f2]}

-        vad_segments = self.vad_model({"waveform": torch.from_numpy(audio).unsqueeze(0), "sample_rate": SAMPLE_RATE})
+        # Pre-process audio and merge chunks as defined by the respective VAD child class
+        # In case vad_model is manually assigned (see 'load_model') follow the functionality of pyannote toolkit
+        if issubclass(type(self.vad_model), Vad):
+            waveform = self.vad_model.preprocess_audio(audio)
+            merge_chunks = self.vad_model.merge_chunks
+        else:
+            waveform = Pyannote.preprocess_audio(audio)
+            merge_chunks = Pyannote.merge_chunks
+
+        vad_segments = self.vad_model({"waveform": waveform, "sample_rate": SAMPLE_RATE})
         vad_segments = merge_chunks(
             vad_segments,
             chunk_size,
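The dispatch above relies only on the `Vad` interface (`__call__`, `preprocess_audio`, `merge_chunks`), so a VAD passed manually via `load_model(vad_model=...)` can follow the same shape. The sketch below is hypothetical (the `DummyVad` class and its behaviour are not part of this change) and only illustrates the interface the pipeline expects:

```python
import torch
from whisperx.diarize import Segment as SegmentX
from whisperx.vads import Vad


class DummyVad(Vad):
    """Hypothetical VAD that marks the entire file as a single speech region."""

    def __init__(self, vad_onset=0.5, **kwargs):
        super().__init__(vad_onset)

    def __call__(self, inputs, **kwargs):
        waveform, sample_rate = inputs["waveform"], inputs["sample_rate"]
        return [SegmentX(0.0, waveform.shape[-1] / sample_rate, "UNKNOWN")]

    @staticmethod
    def preprocess_audio(audio):
        # Same shape convention as the Pyannote wrapper: (1, num_samples) tensor.
        return torch.from_numpy(audio).unsqueeze(0)

    @staticmethod
    def merge_chunks(segments, chunk_size, onset=0.5, offset=None):
        if len(segments) == 0:
            return []
        return Vad.merge_chunks(segments, chunk_size, onset, offset)
```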
@@ -296,7 +261,8 @@ def load_model(
     compute_type="float16",
     asr_options: Optional[dict] = None,
     language: Optional[str] = None,
-    vad_model: Optional[VoiceActivitySegmentation] = None,
+    vad_model: Optional[Vad]= None,
+    vad_method: Optional[str] = "pyannote",
     vad_options: Optional[dict] = None,
     model: Optional[WhisperModel] = None,
     task="transcribe",

@@ -309,6 +275,7 @@ def load_model(
         whisper_arch - The name of the Whisper model to load.
         device - The device to load the model on.
         compute_type - The compute type to use for the model.
+        vad_method - The vad method to use. vad_model has higher priority if is not None.
         options - A dictionary of options to use for the model.
         language - The language of the model. (use English for now)
         model - The WhisperModel instance to use.

@@ -374,6 +341,7 @@ def load_model(
     default_asr_options = TranscriptionOptions(**default_asr_options)

     default_vad_options = {
+        "chunk_size": 30,  # needed by silero since binarization happens before merge_chunks
         "vad_onset": 0.500,
         "vad_offset": 0.363
     }

@@ -381,10 +349,17 @@ def load_model(
     if vad_options is not None:
         default_vad_options.update(vad_options)

+    # Note: manually assigned vad_model has higher priority than vad_method!
     if vad_model is not None:
+        print("Use manually assigned vad_model. vad_method is ignored.")
         vad_model = vad_model
     else:
-        vad_model = load_vad_model(torch.device(device), use_auth_token=None, **default_vad_options)
+        if vad_method == "silero":
+            vad_model = Silero(**default_vad_options)
+        elif vad_method == "pyannote":
+            vad_model = Pyannote(torch.device(device), use_auth_token=None, **default_vad_options)
+        else:
+            raise ValueError(f"Invalid vad_method: {vad_method}")

     return FasterWhisperPipeline(
         model=model,
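A hedged sketch of selecting the VAD backend through the new `load_model` signature (the model size and audio path are placeholders):

```python
import whisperx

model = whisperx.load_model(
    "small",
    device="cpu",
    compute_type="int8",
    vad_method="silero",   # or "pyannote" (the default)
    vad_options={"chunk_size": 30, "vad_onset": 0.5, "vad_offset": 0.363},
)
audio = whisperx.load_audio("path/to/audio.wav")
result = model.transcribe(audio, batch_size=4)
```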
whisperx/diarize.py

@@ -79,7 +79,7 @@ def assign_word_speakers(


 class Segment:
-    def __init__(self, start, end, speaker=None):
+    def __init__(self, start:int, end:int, speaker:Optional[str]=None):
         self.start = start
         self.end = end
         self.speaker = speaker
whisperx/transcribe.py

@@ -26,6 +26,7 @@ def cli():
     parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
     parser.add_argument("audio", nargs="+", type=str, help="audio file(s) to transcribe")
     parser.add_argument("--model", default="small", help="name of the Whisper model to use")
+    parser.add_argument("--model_cache_only", type=str2bool, default=False, help="If True, will not attempt to download models, instead using cached models from --model_dir")
     parser.add_argument("--model_dir", type=str, default=None, help="the path to save model files; uses ~/.cache/whisper by default")
     parser.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu", help="device to use for PyTorch inference")
     parser.add_argument("--device_index", default=0, type=int, help="device index to use for FasterWhisper inference")

@@ -46,6 +47,7 @@ def cli():
     parser.add_argument("--return_char_alignments", action='store_true', help="Return character-level alignments in the output json file")

     # vad params
+    parser.add_argument("--vad_method", type=str, default="pyannote", choices=["pyannote", "silero"], help="VAD method to be used")
     parser.add_argument("--vad_onset", type=float, default=0.500, help="Onset threshold for VAD (see pyannote.audio), reduce this if speech is not being detected")
     parser.add_argument("--vad_offset", type=float, default=0.363, help="Offset threshold for VAD (see pyannote.audio), reduce this if speech is not being detected.")
     parser.add_argument("--chunk_size", type=int, default=30, help="Chunk size for merging VAD segments. Default is 30, reduce this if the chunk is too long.")

@@ -89,6 +91,7 @@ def cli():
     model_name: str = args.pop("model")
     batch_size: int = args.pop("batch_size")
     model_dir: str = args.pop("model_dir")
+    model_cache_only: bool = args.pop("model_cache_only")
     output_dir: str = args.pop("output_dir")
     output_format: str = args.pop("output_format")
     device: str = args.pop("device")

@@ -110,6 +113,7 @@ def cli():
     return_char_alignments: bool = args.pop("return_char_alignments")

     hf_token: str = args.pop("hf_token")
+    vad_method: str = args.pop("vad_method")
     vad_onset: float = args.pop("vad_onset")
     vad_offset: float = args.pop("vad_offset")

@@ -175,7 +179,7 @@ def cli():
     results = []
     tmp_results = []
     # model = load_model(model_name, device=device, download_root=model_dir)
-    model = load_model(model_name, device=device, device_index=device_index, download_root=model_dir, compute_type=compute_type, language=args['language'], asr_options=asr_options, vad_options={"vad_onset": vad_onset, "vad_offset": vad_offset}, task=task, threads=faster_whisper_threads)
+    model = load_model(model_name, device=device, device_index=device_index, download_root=model_dir, compute_type=compute_type, language=args['language'], asr_options=asr_options, vad_method=vad_method, vad_options={"chunk_size":chunk_size, "vad_onset": vad_onset, "vad_offset": vad_offset}, task=task, local_files_only=model_cache_only, threads=faster_whisper_threads)

     for audio_path in args.pop("audio"):
         audio = load_audio(audio_path)
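For reference, a CLI invocation exercising the flags added above (the audio path is a placeholder):

    whisperx path/to/audio.wav --vad_method silero --chunk_size 30 --model_cache_only True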
whisperx/types.py

@@ -1,4 +1,4 @@
-from typing import TypedDict, Optional, List
+from typing import TypedDict, Optional, List, Tuple


 class SingleWordSegment(TypedDict):

@@ -30,6 +30,17 @@ class SingleSegment(TypedDict):
     text: str


+class SegmentData(TypedDict):
+    """
+    Temporary processing data used during alignment.
+    Contains cleaned and preprocessed data for each segment.
+    """
+    clean_char: List[str]  # Cleaned characters that exist in model dictionary
+    clean_cdx: List[int]  # Original indices of cleaned characters
+    clean_wdx: List[int]  # Indices of words containing valid characters
+    sentence_spans: List[Tuple[int, int]]  # Start and end indices of sentences
+
+
 class SingleAlignedSegment(TypedDict):
     """
     A single segment (up to multiple sentences) of a speech with word alignment.
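Illustrative only: constructing the new `SegmentData` TypedDict by hand, mirroring what `align()` now stores per segment:

```python
from whisperx.types import SegmentData

seg: SegmentData = {
    "clean_char": ["h", "i"],
    "clean_cdx": [0, 1],
    "clean_wdx": [0],
    "sentence_spans": [(0, 2)],
}
```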
whisperx/utils.py

@@ -106,6 +106,7 @@ LANGUAGES = {
     "jw": "javanese",
     "su": "sundanese",
     "yue": "cantonese",
+    "lv": "latvian",
 }

 # language code lookup by name, with a few language aliases

@@ -241,7 +242,7 @@ class SubtitlesWriter(ResultWriter):
         line_count = 1
         # the next subtitle to yield (a list of word timings with whitespace)
         subtitle: list[dict] = []
-        times = []
+        times: list[tuple] = []
         last = result["segments"][0]["start"]
         for segment in result["segments"]:
             for i, original_timing in enumerate(segment["words"]):
whisperx/vads/__init__.py (new file, 3 lines)

@@ -0,0 +1,3 @@
+from whisperx.vads.pyannote import Pyannote
+from whisperx.vads.silero import Silero
+from whisperx.vads.vad import Vad
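The new package re-exports the three classes that `asr.load_model` dispatches on; a quick check of their relationship:

```python
from whisperx.vads import Pyannote, Silero, Vad

assert issubclass(Pyannote, Vad) and issubclass(Silero, Vad)
```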
whisperx/vad.py → whisperx/vads/pyannote.py

@@ -1,32 +1,31 @@
 import hashlib
 import os
 import urllib
-from typing import Callable, Optional, Text, Union
+from typing import Callable, Text, Union
+from typing import Optional

 import numpy as np
-import pandas as pd
 import torch
 from pyannote.audio import Model
 from pyannote.audio.core.io import AudioFile
 from pyannote.audio.pipelines import VoiceActivityDetection
 from pyannote.audio.pipelines.utils import PipelineModel
-from pyannote.core import Annotation, Segment, SlidingWindowFeature
+from pyannote.core import Annotation, SlidingWindowFeature
+from pyannote.core import Segment
 from tqdm import tqdm

-from .diarize import Segment as SegmentX
+from whisperx.diarize import Segment as SegmentX
+from whisperx.vads.vad import Vad
-# deprecated
-VAD_SEGMENTATION_URL = "https://whisperx.s3.eu-west-2.amazonaws.com/model_weights/segmentation/0b5b3216d60a2d32fc086b47ea8c67589aaeb26b7e07fcbe620d6d0b83e209ea/pytorch_model.bin"

 def load_vad_model(device, vad_onset=0.500, vad_offset=0.363, use_auth_token=None, model_fp=None):
     model_dir = torch.hub._get_torch_home()

-    vad_dir = os.path.dirname(os.path.abspath(__file__))
+    main_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

     os.makedirs(model_dir, exist_ok = True)
     if model_fp is None:
         # Dynamically resolve the path to the model file
-        model_fp = os.path.join(vad_dir, "assets", "pytorch_model.bin")
+        model_fp = os.path.join(main_dir, "assets", "pytorch_model.bin")
         model_fp = os.path.abspath(model_fp)  # Ensure the path is absolute
     else:
         model_fp = os.path.abspath(model_fp)  # Ensure any provided path is absolute

@@ -39,10 +38,6 @@ def load_vad_model(device, vad_onset=0.500, vad_offset=0.363, use_auth_token=Non
         raise RuntimeError(f"{model_fp} exists and is not a regular file")

     model_bytes = open(model_fp, "rb").read()
-    if hashlib.sha256(model_bytes).hexdigest() != VAD_SEGMENTATION_URL.split('/')[-2]:
-        raise RuntimeError(
-            "Model has been downloaded but the SHA256 checksum does not match. Please retry loading the model."
-        )

     vad_model = Model.from_pretrained(model_fp, use_auth_token=use_auth_token)
     hyperparameters = {"onset": vad_onset,

@@ -236,41 +231,26 @@ class VoiceActivitySegmentation(VoiceActivityDetection):
         return segmentations


-def merge_vad(vad_arr, pad_onset=0.0, pad_offset=0.0, min_duration_off=0.0, min_duration_on=0.0):
-
-    active = Annotation()
-    for k, vad_t in enumerate(vad_arr):
-        region = Segment(vad_t[0] - pad_onset, vad_t[1] + pad_offset)
-        active[region, k] = 1
-
-    if pad_offset > 0.0 or pad_onset > 0.0 or min_duration_off > 0.0:
-        active = active.support(collar=min_duration_off)
-
-    # remove tracks shorter than min_duration_on
-    if min_duration_on > 0:
-        for segment, track in list(active.itertracks()):
-            if segment.duration < min_duration_on:
-                del active[segment, track]
-
-    active = active.for_json()
-    active_segs = pd.DataFrame([x['segment'] for x in active['content']])
-    return active_segs
-
-def merge_chunks(
-    segments,
-    chunk_size,
-    onset: float = 0.5,
-    offset: Optional[float] = None,
-):
-    """
-    Merge operation described in paper
-    """
-    curr_end = 0
-    merged_segments = []
-    seg_idxs = []
-    speaker_idxs = []
+class Pyannote(Vad):
+
+    def __init__(self, device, use_auth_token=None, model_fp=None, **kwargs):
+        print(">>Performing voice activity detection using Pyannote...")
+        super().__init__(kwargs['vad_onset'])
+        self.vad_pipeline = load_vad_model(device, use_auth_token=use_auth_token, model_fp=model_fp)
+
+    def __call__(self, audio: AudioFile, **kwargs):
+        return self.vad_pipeline(audio)
+
+    @staticmethod
+    def preprocess_audio(audio):
+        return torch.from_numpy(audio).unsqueeze(0)
+
+    @staticmethod
+    def merge_chunks(segments,
                      chunk_size,
                      onset: float = 0.5,
                      offset: Optional[float] = None,
                      ):

         assert chunk_size > 0
         binarize = Binarize(max_duration=chunk_size, onset=onset, offset=offset)
         segments = binarize(segments)

@@ -281,27 +261,5 @@
         if len(segments_list) == 0:
             print("No active speech found in audio")
             return []
-    # assert segments_list, "segments_list is empty."
-    # Make sur the starting point is the start of the segment.
+        assert segments_list, "segments_list is empty."
+        return Vad.merge_chunks(segments_list, chunk_size, onset, offset)
-    curr_start = segments_list[0].start
-
-    for seg in segments_list:
-        if seg.end - curr_start > chunk_size and curr_end-curr_start > 0:
-            merged_segments.append({
-                "start": curr_start,
-                "end": curr_end,
-                "segments": seg_idxs,
-            })
-            curr_start = seg.start
-            seg_idxs = []
-            speaker_idxs = []
-        curr_end = seg.end
-        seg_idxs.append((seg.start, seg.end))
-        speaker_idxs.append(seg.speaker)
-    # add final
-    merged_segments.append({
-        "start": curr_start,
-        "end": curr_end,
-        "segments": seg_idxs,
-    })
-    return merged_segments
whisperx/vads/silero.py (new file, 66 lines)

@@ -0,0 +1,66 @@
+from io import IOBase
+from pathlib import Path
+from typing import Mapping, Text
+from typing import Optional
+from typing import Union
+
+import torch
+
+from whisperx.diarize import Segment as SegmentX
+from whisperx.vads.vad import Vad
+
+AudioFile = Union[Text, Path, IOBase, Mapping]
+
+
+class Silero(Vad):
+    # check again default values
+    def __init__(self, **kwargs):
+        print(">>Performing voice activity detection using Silero...")
+        super().__init__(kwargs['vad_onset'])
+
+        self.vad_onset = kwargs['vad_onset']
+        self.chunk_size = kwargs['chunk_size']
+        self.vad_pipeline, vad_utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
+                                                      model='silero_vad',
+                                                      force_reload=False,
+                                                      onnx=False,
+                                                      trust_repo=True)
+        (self.get_speech_timestamps, _, self.read_audio, _, _) = vad_utils
+
+    def __call__(self, audio: AudioFile, **kwargs):
+        """use silero to get segments of speech"""
+        # Only accept 16000 Hz for now.
+        # Note: Silero models support both 8000 and 16000 Hz. Although other values are not directly supported,
+        # multiples of 16000 (e.g. 32000 or 48000) are cast to 16000 inside of the JIT model!
+        sample_rate = audio["sample_rate"]
+        if sample_rate != 16000:
+            raise ValueError("Only 16000Hz sample rate is allowed")
+
+        timestamps = self.get_speech_timestamps(audio["waveform"],
+                                                model=self.vad_pipeline,
+                                                sampling_rate=sample_rate,
+                                                max_speech_duration_s=self.chunk_size,
+                                                threshold=self.vad_onset
+                                                # min_silence_duration_ms = self.min_duration_off/1000
+                                                # min_speech_duration_ms = self.min_duration_on/1000
+                                                # ...
+                                                # See silero documentation for full option list
+                                                )
+        return [SegmentX(i['start'] / sample_rate, i['end'] / sample_rate, "UNKNOWN") for i in timestamps]
+
+    @staticmethod
+    def preprocess_audio(audio):
+        return audio
+
+    @staticmethod
+    def merge_chunks(segments_list,
+                     chunk_size,
+                     onset: float = 0.5,
+                     offset: Optional[float] = None,
+                     ):
+        assert chunk_size > 0
+        if len(segments_list) == 0:
+            print("No active speech found in audio")
+            return []
+        assert segments_list, "segments_list is empty."
+        return Vad.merge_chunks(segments_list, chunk_size, onset, offset)
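A hedged sketch of driving this wrapper directly, mirroring how `FasterWhisperPipeline` calls it (the audio path is a placeholder; `whisperx.load_audio` returns 16 kHz float32 audio):

```python
import whisperx
from whisperx.vads import Silero

vad = Silero(vad_onset=0.5, vad_offset=0.363, chunk_size=30)
audio = whisperx.load_audio("path/to/audio.wav")
waveform = Silero.preprocess_audio(audio)            # returned unchanged for Silero
segments = vad({"waveform": waveform, "sample_rate": 16000})
chunks = Silero.merge_chunks(segments, chunk_size=30, onset=0.5, offset=None)
```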
whisperx/vads/vad.py (new file, 74 lines)

@@ -0,0 +1,74 @@
+from typing import Optional
+
+import pandas as pd
+from pyannote.core import Annotation, Segment
+
+
+class Vad:
+    def __init__(self, vad_onset):
+        if not (0 < vad_onset < 1):
+            raise ValueError(
+                "vad_onset is a decimal value between 0 and 1."
+            )
+
+    @staticmethod
+    def preprocess_audio(audio):
+        pass
+
+    # keep merge_chunks as static so it can be also used by manually assigned vad_model (see 'load_model')
+    @staticmethod
+    def merge_chunks(segments,
+                     chunk_size,
+                     onset: float,
+                     offset: Optional[float]):
+        """
+        Merge operation described in paper
+        """
+        curr_end = 0
+        merged_segments = []
+        seg_idxs: list[tuple] = []
+        speaker_idxs: list[Optional[str]] = []
+
+        curr_start = segments[0].start
+        for seg in segments:
+            if seg.end - curr_start > chunk_size and curr_end - curr_start > 0:
+                merged_segments.append({
+                    "start": curr_start,
+                    "end": curr_end,
+                    "segments": seg_idxs,
+                })
+                curr_start = seg.start
+                seg_idxs = []
+                speaker_idxs = []
+            curr_end = seg.end
+            seg_idxs.append((seg.start, seg.end))
+            speaker_idxs.append(seg.speaker)
+        # add final
+        merged_segments.append({
+            "start": curr_start,
+            "end": curr_end,
+            "segments": seg_idxs,
+        })
+
+        return merged_segments
+
+    # Unused function
+    @staticmethod
+    def merge_vad(vad_arr, pad_onset=0.0, pad_offset=0.0, min_duration_off=0.0, min_duration_on=0.0):
+        active = Annotation()
+        for k, vad_t in enumerate(vad_arr):
+            region = Segment(vad_t[0] - pad_onset, vad_t[1] + pad_offset)
+            active[region, k] = 1
+
+        if pad_offset > 0.0 or pad_onset > 0.0 or min_duration_off > 0.0:
+            active = active.support(collar=min_duration_off)
+
+        # remove tracks shorter than min_duration_on
+        if min_duration_on > 0:
+            for segment, track in list(active.itertracks()):
+                if segment.duration < min_duration_on:
+                    del active[segment, track]
+
+        active = active.for_json()
+        active_segs = pd.DataFrame([x['segment'] for x in active['content']])
+        return active_segs
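A toy run of `Vad.merge_chunks` to illustrate the merge operation; `SegmentX` is the lightweight `Segment(start, end, speaker)` from `whisperx.diarize`:

```python
from whisperx.diarize import Segment as SegmentX
from whisperx.vads.vad import Vad

segs = [SegmentX(0.0, 12.0, "UNKNOWN"), SegmentX(14.0, 31.0, "UNKNOWN"), SegmentX(33.0, 40.0, "UNKNOWN")]
for chunk in Vad.merge_chunks(segs, chunk_size=30, onset=0.5, offset=None):
    print(chunk["start"], chunk["end"], chunk["segments"])
# Segments are packed greedily; a new chunk starts once adding the next segment
# would make the current chunk longer than chunk_size seconds.
```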