
Revolutionizing Speech Recognition: Rev’s Reverb ASR Model

ully
2 min read · Oct 6, 2024



Rev has just released an open-source speech recognition model dubbed the “Whisper terminator,” setting a new benchmark in speech recognition and speaker diarization.

Named Reverb ASR, this model not only boasts impressive performance but also generously shares its model weights on the Hugging Face Hub.

Reverb ASR: A Super Model Trained on 200K Hours of Data

Reverb ASR is no ordinary model. It was trained on 200,000 hours of human-transcribed audio, and Rev reports that it achieves the industry's lowest word error rate (WER) on its benchmarks.
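WER is the standard yardstick behind that claim: the word-level edit distance between the hypothesis and the reference, divided by the reference length. A minimal sketch of the computation (real evaluations also normalize casing and punctuation first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```

A WER of 0.167 means roughly one word in six was wrong; "lowest WER" claims are only comparable when models are scored on the same test sets with the same text normalization.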

What's more exciting is that this model supports customizable verbatim transcription: users can control how strictly the output preserves every spoken word (fillers, false starts, repetitions) versus a cleaner, more readable style.
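Rev describes this knob as a verbatimicity setting between fully verbatim and cleaned-up output. The snippet below is a purely illustrative stand-in, not Rev's API: the real model applies this inside the network, but a post-hoc filter conveys the idea.

```python
# Illustrative only: mimics the *effect* of a verbatim/clean transcription
# knob. This is NOT Rev's API; Reverb ASR conditions the model itself.
FILLERS = {"um", "uh", "erm"}

def transcribe_style(verbatim_words: list[str], verbatimicity: float) -> str:
    """verbatimicity=1.0 keeps every spoken word; 0.0 drops filler words."""
    if verbatimicity >= 1.0:
        return " ".join(verbatim_words)
    return " ".join(w for w in verbatim_words if w.lower() not in FILLERS)

words = ["So", "um", "the", "model", "uh", "works"]
print(transcribe_style(words, 1.0))  # So um the model uh works
print(transcribe_style(words, 0.0))  # So the model works
```

A verbatim transcript suits legal and captioning use cases; the cleaned style suits meeting notes and subtitles.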

Speaker Diarization: Enhanced with 26K Hours of Labeled Data

Rev’s team didn’t stop at speech recognition. They also made significant strides in speaker diarization.

By leveraging 26,000 hours of labeled data, they fine-tuned the pyannote model, releasing two versions of the speaker diarization model:

  • v1: Based on the pyannote 3.0 architecture, fine-tuned for 17 epochs.
  • v2: An improved version that replaces pyannote’s SincNet features with WavLM, achieving more accurate speaker separation.
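Diarization answers "who spoke when"; combined with word-level timestamps from the ASR model, each word can be tagged with a speaker. A minimal sketch of that merge step (an illustration, not Rev's pipeline), assigning each word to the diarization turn it overlaps most:

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(words, turns):
    """words: [(text, start, end)]; turns: [(speaker, start, end)].
    Tags each word with the speaker whose turn overlaps it the most."""
    out = []
    for text, w_start, w_end in words:
        speaker = max(turns, key=lambda t: overlap(w_start, w_end, t[1], t[2]))[0]
        out.append((speaker, text))
    return out

words = [("hello", 0.0, 0.4), ("there", 0.5, 0.9), ("hi", 1.1, 1.3)]
turns = [("SPEAKER_00", 0.0, 1.0), ("SPEAKER_01", 1.0, 2.0)]
print(assign_speakers(words, turns))
# [('SPEAKER_00', 'hello'), ('SPEAKER_00', 'there'), ('SPEAKER_01', 'hi')]
```

Production systems handle the harder cases this sketch ignores, such as overlapping speech and words that straddle a turn boundary.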

Robust Model Architecture: Carefully Crafted Details

The architecture of Reverb ASR is meticulously designed:

  • Structure: Adopts a powerful CTC/Attention hybrid architecture, including 18 conformer layers and 6 transformer layers, totaling 600 million parameters.
  • Language-specific layers: Used to control the verbatim level of the output, ensuring both transcription accuracy and flexibility.
  • Multiple decoding modes: Supports CTC, Attention, and joint CTC/Attention decoding, catering to different scenario needs.
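In joint CTC/attention decoding, each candidate transcript is scored by a weighted sum of the two branches' log-probabilities: CTC contributes a monotonic alignment signal, while the attention decoder contributes language-model-like context. A minimal sketch of the reranking step (the weight of 0.3 is a common default in such systems, not Reverb's documented value):

```python
import math

def joint_score(logp_ctc: float, logp_att: float, ctc_weight: float = 0.3) -> float:
    """Joint CTC/attention score: a weighted sum of log-probabilities.
    ctc_weight trades CTC's strict alignment against the attention
    decoder's contextual modeling. 0.3 is an assumed default."""
    return ctc_weight * logp_ctc + (1.0 - ctc_weight) * logp_att

def rerank(hyps):
    """hyps: [(text, logp_ctc, logp_att)] -> best text under the joint score."""
    return max(hyps, key=lambda h: joint_score(h[1], h[2]))[0]

hyps = [("the cat sat", math.log(0.6), math.log(0.5)),
        ("the cats at", math.log(0.2), math.log(0.4))]
print(rerank(hyps))  # the cat sat
```

Setting ctc_weight to 1.0 or 0.0 recovers pure CTC or pure attention decoding, which is how a single model can expose all three modes.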

Production-Ready: Optimized Inference Pipeline
