mirror of https://github.com/m-bain/whisperX.git

update readme

README.md (33 lines changed)
@@ -27,7 +27,6 @@
 <a href="EXAMPLES.md">More examples</a>
 </p>
 
-<h6 align="center">Made by Max Bain • :globe_with_meridians: <a href="https://www.maxbain.com">https://www.maxbain.com</a></h6>
 
 <img width="1216" align="center" alt="whisperx-arch" src="https://user-images.githubusercontent.com/36994049/211200186-8b779e26-0bfd-4127-aee2-5a9238b95e1f.png">
 
@@ -55,8 +54,6 @@ This repository refines the timestamps of openAI's Whisper model via forced alignment
 - Character level timestamps (see `*.char.ass` file output)
 - Diarization (still in beta, add `--diarize`)
 
-To enable VAD filtering and Diarization, include your Hugging Face access token that you can generate from [Here](https://huggingface.co/settings/tokens) after the `--hf_token` argument and accept the user agreement for the following models: [Segmentation](https://huggingface.co/pyannote/segmentation) , [Voice Activity Detection (VAD)](https://huggingface.co/pyannote/voice-activity-detection) , and [Speaker Diarization](https://huggingface.co/pyannote/speaker-diarization)
-
 
 <h2 align="left" id="setup">Setup ⚙️</h2>
 Install this package using
@@ -74,9 +71,13 @@ $ cd whisperX
 $ pip install -e .
 ```
 
 
 You may also need to install ffmpeg, rust etc. Follow openAI instructions here https://github.com/openai/whisper#setup.
 
 
+### Voice Activity Detection Filtering & Diarization
+To **enable VAD filtering and Diarization**, include your Hugging Face access token that you can generate from [Here](https://huggingface.co/settings/tokens) after the `--hf_token` argument and accept the user agreement for the following models: [Segmentation](https://huggingface.co/pyannote/segmentation) , [Voice Activity Detection (VAD)](https://huggingface.co/pyannote/voice-activity-detection) , and [Speaker Diarization](https://huggingface.co/pyannote/speaker-diarization)
+
+
 <h2 align="left" id="example">Usage 💬 (command line)</h2>
 
 ### English
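Note (editor's sketch, not part of this commit): the token supplied via `--hf_token` ends up as `use_auth_token` when the gated pyannote models are loaded, mirroring the pipeline construction in the `cli()` hunks further down. A minimal sketch, assuming pyannote.audio 2.x; the token string is a placeholder.

```python
from pyannote.audio import Inference, Pipeline

hf_token = "hf_..."  # placeholder; generate at https://huggingface.co/settings/tokens

# VAD: the pre_aggregation_hook returns raw segmentation scores so whisperx
# can apply its own binarization/merging downstream.
vad_pipeline = Inference(
    "pyannote/segmentation",
    pre_aggregation_hook=lambda segmentation: segmentation,
    use_auth_token=hf_token,
)

# Diarization: gated model; accept the user agreement on the Hub first.
diarize_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization@2.1",
    use_auth_token=hf_token,
)
```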
@@ -152,8 +153,9 @@ In addition to forced alignment, the following two modifications have been made
 
 - Not thoroughly tested, especially for non-english, results may vary -- please post issue to let me know the results on your data
 - Whisper normalises spoken numbers e.g. "fifty seven" to arabic numerals "57". Need to perform this normalization after alignment, so the phonemes can be aligned. Currently just ignores numbers.
-- Assumes the initial whisper timestamps are accurate to some degree (within margin of 2 seconds, adjust if needed -- bigger margins more prone to alignment errors)
-- Hacked this up quite quickly, there might be some errors, please raise an issue if you encounter any.
+- If not using VAD filter, whisperx assumes the initial whisper timestamps are accurate to some degree (within margin of 2 seconds, adjust if needed -- bigger margins more prone to alignment errors)
+- Overlapping speech is not handled particularly well by whisper nor whisperx
+- Diarization is far from perfect.
 
 
 <h2 align="left" id="contribute">Contribute 🧑‍🏫</h2>
@@ -176,29 +178,34 @@ The next major upgrade we are working on is whisper with speaker diarization, so
 
 * [x] Incorporating speaker diarization
 
-* [ ] Improve diarization (word level)
+* [x] Inference speedup with batch processing
 
+* [ ] Improve diarization (word level). *Harder than first thought...*
 
-* [ ] Inference speedup with batch processing
 
 <h2 align="left" id="contact">Contact/Support 📇</h2>
 
-Contact maxbain[at]robots[dot]ox[dot]ac[dot]uk for business things.
+Contact maxbain[at]robots[dot]ox[dot]ac[dot]uk for queries
 
 <a href="https://www.buymeacoffee.com/maxhbain" target="_blank"><img src="https://cdn.buymeacoffee.com/buttons/default-orange.png" alt="Buy Me A Coffee" height="41" width="174"></a>
 
 
 <h2 align="left" id="acks">Acknowledgements 🙏</h2>
 
-Of course, this is mostly just a modification to [openAI's whisper](https://github.com/openai/whisper).
-As well as accreditation to this [PyTorch tutorial on forced alignment](https://pytorch.org/tutorials/intermediate/forced_alignment_with_torchaudio_tutorial.html)
+This work, and my PhD, is supported by the [VGG (Visual Geometry Group)](https://www.robots.ox.ac.uk/~vgg/) and University of Oxford.
 
 
+Of course, this builds on [openAI's whisper](https://github.com/openai/whisper).
 
+And borrows important alignment code from the [PyTorch tutorial on forced alignment](https://pytorch.org/tutorials/intermediate/forced_alignment_with_torchaudio_tutorial.html)
 
 
 <h2 align="left" id="cite">Citation</h2>
-If you use this in your research, just cite the repo,
+If you use this in your research, for now just cite the repo,
 
 ```bibtex
 @misc{bain2022whisperx,
-   author = {Bain, Max},
+   author = {Bain, Max and Han, Tengda},
    title = {WhisperX},
    year = {2022},
    publisher = {GitHub},
@@ -585,10 +585,9 @@ def cli():
     parser.add_argument("--interpolate_method", default="nearest", choices=["nearest", "linear", "ignore"], help="For word .srt, method to assign timestamps to non-aligned words, or merge them into neighbouring.")
     # vad params
     parser.add_argument("--vad_filter", action="store_true", help="Whether to first perform VAD filtering to target only transcribe within VAD. Produces more accurate alignment + timestamp, requires more GPU memory & compute.")
-    parser.add_argument("--vad_input", default=None, type=str)
     parser.add_argument("--parallel_bs", default=-1, type=int, help="Enable parallel transcribing if > 1")
     # diarization params
-    parser.add_argument("--diarize", action='store_true')
+    parser.add_argument("--diarize", action="store_true", help="Apply diarization to assign speaker labels to each segment/word")
     parser.add_argument("--min_speakers", default=None, type=int)
     parser.add_argument("--max_speakers", default=None, type=int)
     # output save params
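Note (editor's sketch, not part of this commit): the flags touched here are plain argparse booleans, so `--vad_filter` and `--diarize` default to `False` when absent, and `--vad_input` no longer exists at all after this change. A self-contained toy of that behaviour:

```python
import argparse

# Flag definitions copied from the hunk above (help text omitted).
parser = argparse.ArgumentParser()
parser.add_argument("--vad_filter", action="store_true")
parser.add_argument("--diarize", action="store_true")
parser.add_argument("--min_speakers", default=None, type=int)
parser.add_argument("--max_speakers", default=None, type=int)

args = parser.parse_args(["--diarize", "--min_speakers", "2"])
print(args.vad_filter, args.diarize, args.min_speakers)  # False True 2
```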
@@ -632,7 +631,6 @@ def cli():
 
     hf_token: str = args.pop("hf_token")
     vad_filter: bool = args.pop("vad_filter")
-    vad_input: bool = args.pop("vad_input")
     parallel_bs: int = args.pop("parallel_bs")
 
     diarize: bool = args.pop("diarize")
@@ -640,9 +638,9 @@ def cli():
     max_speakers: int = args.pop("max_speakers")
 
     vad_pipeline = None
-    if vad_input is not None:
-        vad_input = pd.read_csv(vad_input, header=None, sep= " ")
-    elif vad_filter:
+    if vad_filter:
+        if hf_token is None:
+            print("Warning, no huggingface token used, needs to be saved in environment variable, otherwise will throw error loading VAD model...")
         from pyannote.audio import Inference
         vad_pipeline = Inference("pyannote/segmentation",
                                  pre_aggregation_hook=lambda segmentation: segmentation,
@@ -650,6 +648,8 @@ def cli():
 
     diarize_pipeline = None
     if diarize:
+        if hf_token is None:
+            print("Warning, no --hf_token used, needs to be saved in environment variable, otherwise will throw error loading diarization model...")
         from pyannote.audio import Pipeline
         diarize_pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2.1",
                                                     use_auth_token=hf_token)
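Note (hypothetical usage, not shown in this diff): the `min_speakers`/`max_speakers` values popped earlier are presumably forwarded when the pipeline is applied, along these lines; the audio path is a placeholder.

```python
# Sketch only; variable names follow the diff above.
if diarize_pipeline is not None:
    diarization = diarize_pipeline(
        "audio.wav",                # placeholder input path
        min_speakers=min_speakers,  # from --min_speakers
        max_speakers=max_speakers,  # from --max_speakers
    )
    # pyannote returns an Annotation; iterate over speaker turns
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{turn.start:.2f}\t{turn.end:.2f}\t{speaker}")
```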
@@ -756,7 +756,7 @@ def cli():
         # save word tsv
         if output_type in ["vad"]:
             exp_fp = os.path.join(output_dir, audio_basename + ".sad")
-            wrd_segs = pd.concat([x["word-segments"] for x in result_aligned["segments"]])
+            wrd_segs = pd.concat([x["word-segments"] for x in result_aligned["segments"]])[['start','end']]
             wrd_segs.to_csv(exp_fp, sep='\t', header=None, index=False)
 
 if __name__ == "__main__":
     cli()
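Note (editor's sketch): the added `[['start','end']]` selection keeps only the two time columns, so the `.sad` file stays a clean two-column TSV. A toy illustration with made-up word-segment frames (the `word` column here is a hypothetical stand-in for whatever extra columns `word-segments` carries):

```python
import pandas as pd

seg1 = pd.DataFrame({"start": [0.0, 0.6], "end": [0.5, 1.0], "word": ["hello", "world"]})
seg2 = pd.DataFrame({"start": [2.0], "end": [2.4], "word": ["again"]})

# Without the selection, the extra 'word' column would leak into the .sad
# output; selecting ['start', 'end'] drops it.
wrd_segs = pd.concat([seg1, seg2])[["start", "end"]]
print(wrd_segs.to_csv(sep="\t", header=None, index=False))
```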
@@ -65,8 +65,8 @@ def write_vtt(transcript: Iterator[dict], file: TextIO):
 def write_tsv(transcript: Iterator[dict], file: TextIO):
     print("start", "end", "text", sep="\t", file=file)
     for segment in transcript:
-        print(round(1000 * segment['start']), file=file, end="\t")
-        print(round(1000 * segment['end']), file=file, end="\t")
+        print(segment['start'], file=file, end="\t")
+        print(segment['end'], file=file, end="\t")
         print(segment['text'].strip().replace("\t", " "), file=file, flush=True)
 
 
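Note (editor's sketch): the `write_tsv` change switches the start/end columns from integer milliseconds to floating-point seconds, so anything parsing the `.tsv` output must handle the unit change. A toy before/after, assuming a whisper-style segment dict:

```python
segment = {"start": 1.234, "end": 5.678, "text": "hello\tworld"}
text = segment["text"].strip().replace("\t", " ")

# Before this commit: integer milliseconds
print(round(1000 * segment["start"]), round(1000 * segment["end"]), text, sep="\t")
# After this commit: seconds, unmodified
print(segment["start"], segment["end"], text, sep="\t")
```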
@ -137,8 +137,6 @@ class Binarize:
|
|||||||
|
|
||||||
|
|
||||||
def merge_vad(vad_arr, pad_onset=0.0, pad_offset=0.0, min_duration_off=0.0, min_duration_on=0.0):
|
def merge_vad(vad_arr, pad_onset=0.0, pad_offset=0.0, min_duration_off=0.0, min_duration_on=0.0):
|
||||||
# because of padding, some active regions might be overlapping: merge them.
|
|
||||||
# also: fill same speaker gaps shorter than min_duration_off
|
|
||||||
|
|
||||||
active = Annotation()
|
active = Annotation()
|
||||||
for k, vad_t in enumerate(vad_arr):
|
for k, vad_t in enumerate(vad_arr):
|
||||||
@@ -161,16 +159,27 @@ def merge_vad(vad_arr, pad_onset=0.0, pad_offset=0.0, min_duration_off=0.0, min_duration_on=0.0):
 
 
 if __name__ == "__main__":
-    from pyannote.audio import Inference
-    hook = lambda segmentation: segmentation
-    inference = Inference("pyannote/segmentation", pre_aggregation_hook=hook)
-    audio = "/tmp/11962.wav"
-    scores = inference(audio)
-    binarize = Binarize(max_duration=15)
-    anno = binarize(scores)
-    res = []
-    for ann in anno.get_timeline():
-        res.append((ann.start, ann.end))
+    # from pyannote.audio import Inference
+    # hook = lambda segmentation: segmentation
+    # inference = Inference("pyannote/segmentation", pre_aggregation_hook=hook)
+    # audio = "/tmp/11962.wav"
+    # scores = inference(audio)
+    # binarize = Binarize(max_duration=15)
+    # anno = binarize(scores)
+    # res = []
+    # for ann in anno.get_timeline():
+    #     res.append((ann.start, ann.end))
 
-    res = pd.DataFrame(res)
-    res[2] = res[1] - res[0]
+    # res = pd.DataFrame(res)
+    # res[2] = res[1] - res[0]
+    import pandas as pd
+    input_fp = "tt298650_sync.wav"
+    df = pd.read_csv(f"/work/maxbain/tmp/{input_fp}.sad", sep=" ", header=None)
+    print(len(df))
+    N = 0.15
+    g = df[0].sub(df[1].shift())
+    input_base = input_fp.split('.')[0]
+    df = df.groupby(g.gt(N).cumsum()).agg({0:'min', 1:'max'})
+    df.to_csv(f"/work/maxbain/tmp/{input_base}.lab", header=None, index=False, sep=" ")
+    print(df)
+    import pdb; pdb.set_trace()
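Note (editor's sketch): the new scratch code above merges nearby speech-activity segments with a pandas groupby idiom: the gap before each segment (start minus the previous end) opens a new group whenever it exceeds `N`, and `cumsum()` turns those boundaries into group ids. A self-contained toy run with made-up segments:

```python
import pandas as pd

# Toy (start, end) speech segments in seconds
df = pd.DataFrame([(0.0, 1.0), (1.05, 2.0), (5.0, 6.0)])

N = 0.15                        # merge gaps of at most N seconds
gap = df[0].sub(df[1].shift())  # start minus previous segment's end; NaN for row 0
merged = df.groupby(gap.gt(N).cumsum()).agg({0: "min", 1: "max"})
print(merged)
# Rows 0 and 1 (gap 0.05s) merge to (0.0, 2.0); row 2 stays (5.0, 6.0).
```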