Whisper Speech Recognition Model

2023-06-09 12:21 Author: CiiLii西里站長

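Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multitask model that can perform multilingual speech recognition, speech translation, and language identification.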

Open-source project: https://github.com/openai/whisper

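A Transformer sequence-to-sequence model is trained on a range of speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. These tasks are jointly represented as a sequence of tokens to be predicted by the decoder, which allows a single model to replace many stages of a traditional speech processing pipeline. The multitask training format uses a set of special tokens that serve as task specifiers or classification targets.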

Setup

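We used Python 3.9.9 and PyTorch 1.10.1 to train and test our models, but the codebase is expected to be compatible with Python 3.8-3.11 and recent PyTorch versions. The codebase also depends on a few Python packages, most notably OpenAI's tiktoken for its fast tokenizer implementation. You can download and install (or update to) the latest release of Whisper with the following command: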

pip install -U openai-whisper

Alternatively, the following command pulls and installs the latest commit from the repository, along with its Python dependencies:

pip install git+https://github.com/openai/whisper.git

To update the package to the latest version of the repository, run:

pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git

Whisper also requires the command-line tool ffmpeg to be installed on your system. It is available from most package managers:

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
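Since Whisper relies on ffmpeg being reachable from the command line, it can be useful to confirm that before transcribing anything. The following is a minimal sketch using only the Python standard library; the helper name find_tool is hypothetical and is not part of Whisper.

```python
import shutil
from typing import Optional


def find_tool(name: str) -> Optional[str]:
    """Return the full path of an executable found on PATH, or None if missing.

    Hypothetical helper for illustration; not part of the whisper package.
    """
    return shutil.which(name)


if __name__ == "__main__":
    ffmpeg_path = find_tool("ffmpeg")
    if ffmpeg_path is None:
        print("ffmpeg was not found on PATH; install it with your package manager.")
    else:
        print(f"ffmpeg found at: {ffmpeg_path}")
```

This only checks that an executable named ffmpeg exists on PATH; it does not verify the version or that the build supports a given audio format.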

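You may also need rust installed, in case tiktoken does not provide a pre-built wheel for your platform. If you see installation errors during the pip install command above, follow the Rust Getting Started page to install a Rust development environment. You may also need to configure the PATH environment variable, e.g. export PATH="$HOME/.cargo/bin:$PATH". If the installation fails with No module named 'setuptools_rust', you need to install setuptools_rust, e.g. by running: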

pip install setuptools-rust

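The .en models for English-only applications tend to perform better, especially the tiny.en and base.en models. We observed that the difference becomes less significant for the small.en and medium.en models.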

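Whisper's performance varies by language. The accompanying figure shows a WER (word error rate) breakdown by language on the Fleurs dataset using the large-v2 model (the smaller the number, the better the performance). Additional WER scores for the other models and datasets can be found in Appendices D.1, D.2, and D.4, and additional BLEU (Bilingual Evaluation Understudy) scores can be found in Appendix D.3. Both are in the paper.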
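To make the WER metric concrete, here is a minimal sketch that computes it as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. It assumes plain whitespace tokenization with no text normalization, so it is an illustration of the metric and not necessarily the exact procedure used to produce the published scores. The function name word_error_rate is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: splits on whitespace and applies no normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because the distance is divided by the reference length, a hypothesis with many inserted words can yield a WER above 1.0.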

Command-line usage

The following command transcribes speech in audio files using the medium model:

whisper audio.flac audio.mp3 audio.wav --model medium

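The default setting (which selects the small model) works well for transcribing English. To transcribe an audio file containing non-English speech, specify the language with the --language option: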

whisper japanese.wav --language Japanese

Adding --task translate translates the speech into English:

whisper japanese.wav --language Japanese --task translate

Run the following command to see all available options:

whisper --help

See tokenizer.py for the list of all available languages.
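When the command-line tool is driven from a script, it can help to assemble the argument list programmatically. The sketch below builds the same commands shown above using only the --model, --language, and --task options from this section. The helper build_whisper_command is hypothetical and is not part of Whisper.

```python
import shlex
from typing import List, Optional, Sequence


def build_whisper_command(
    audio_files: Sequence[str],
    model: Optional[str] = None,
    language: Optional[str] = None,
    task: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list for the whisper CLI.

    Hypothetical helper; options left as None are omitted so that the
    CLI falls back to its own defaults.
    """
    if not audio_files:
        raise ValueError("at least one audio file is required")
    cmd = ["whisper", *audio_files]
    if model is not None:
        cmd += ["--model", model]
    if language is not None:
        cmd += ["--language", language]
    if task is not None:
        cmd += ["--task", task]
    return cmd


if __name__ == "__main__":
    cmd = build_whisper_command(["japanese.wav"], language="Japanese", task="translate")
    print(shlex.join(cmd))
    # To actually run it (requires whisper and ffmpeg to be installed):
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.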

Python usage

Transcription can also be performed within Python:

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])

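Internally, the transcribe() method reads the entire file and processes the audio with a sliding 30-second window, performing autoregressive sequence-to-sequence predictions on each window.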
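The idea of covering a long recording with 30-second windows can be sketched as follows. This is a simplified illustration with a fixed stride, not Whisper's actual implementation, which may advance the window differently. The 16 kHz sample rate is an assumption not stated in the text above, and the function name window_bounds is hypothetical.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000   # assumed samples per second
WINDOW_SECONDS = 30    # window length described in the text


def window_bounds(
    num_samples: int,
    sample_rate: int = SAMPLE_RATE,
    window_seconds: int = WINDOW_SECONDS,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of consecutive fixed-size windows.

    The final window may be shorter than the others; in practice it would
    be padded before being passed to a model.
    """
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    window = sample_rate * window_seconds
    return [
        (start, min(start + window, num_samples))
        for start in range(0, num_samples, window)
    ]


if __name__ == "__main__":
    # A 70-second recording is covered by two full windows and one partial one.
    for start, end in window_bounds(70 * SAMPLE_RATE):
        print(f"{start / SAMPLE_RATE:.0f}s to {end / SAMPLE_RATE:.0f}s")
```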

Below is an example usage of whisper.detect_language() and whisper.decode(), which provide lower-level access to the model.

import whisper

model = whisper.load_model("base")

# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)

# print the recognized text
print(result.text)
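The example above picks the single most likely language with max(probs, key=probs.get). When the top candidates are close, it may be more informative to list several. The sketch below assumes probs is a plain dictionary mapping language codes to probabilities; the values shown are made up for illustration, and top_languages is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def top_languages(probs: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k most probable (language, probability) pairs, best first."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


if __name__ == "__main__":
    # Made-up probabilities standing in for the output of detect_language().
    example_probs = {"en": 0.10, "ja": 0.85, "zh": 0.05}
    for code, prob in top_languages(example_probs, k=2):
        print(f"{code}: {prob:.2f}")
```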
