diff --git a/.gitignore b/.gitignore
index f137a5b..540c132 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,2 +1,3 @@
 whisperx.egg-info/
-**/__pycache__/
\ No newline at end of file
+**/__pycache__/
+.ipynb_checkpoints
diff --git a/README.md b/README.md
index d76a070..8043e02 100644
--- a/README.md
+++ b/README.md
@@ -52,6 +52,12 @@ This repository provides fast automatic speech recognition (70x realtime with la
 **Speaker Diarization** is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker.
+- v3 pre-release on [this branch](https://github.com/m-bain/whisperX/tree/v3): *70x speed-up open-sourced, using batched whisper with the faster-whisper backend*!
+- v2 released: code cleanup, imports the whisper library. VAD filtering is now turned on by default, as in the paper.
+- Paper drop! Please see our [arXiv preprint](https://arxiv.org/abs/2303.00747) for benchmarking and details of WhisperX. We also introduce more efficient batch inference, resulting in large-v2 running at *60-70x REAL TIME speed* (not provided in this repo).
+- VAD filtering: Voice Activity Detection (VAD) from [Pyannote.audio](https://huggingface.co/pyannote/voice-activity-detection) is used as a preprocessing step to remove reliance on Whisper timestamps and to transcribe only the audio segments containing speech. Add the `--vad_filter True` flag; it increases timestamp accuracy and robustness (requires more GPU memory due to 30s inputs in wav2vec2).
+- Character-level timestamps (see the `*.char.ass` file output)
+- Diarization (still in beta, add `--diarize`)