In search of a good ASR model

Recently I was researching models for automatic speech recognition (ASR). It seems that this area is thriving lately and already provides us with very interesting results.

At the moment the most popular trends are:

1. Acoustic model + language model

2. Transformers

There is also a trend to use CNNs as more precise models, but I won't touch on them here. I think one of the important parts that is often missed is the noise cancellation model - I won't concentrate on it either.


Metrics

The main metric used to measure ASR model efficiency is WER (word error rate). It is the ratio of the sum of word insertions (I), deletions (D) and substitutions (S) to the number of words in the ground truth text (N):

WER = (I + D + S) / N

Basically all model creators report the WER of their models.
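
To make the formula concrete, here is a minimal sketch (my own illustration, not code from any ASR framework) that computes WER as a word-level edit distance:

def wer(reference, hypothesis):
  ref = reference.split()
  hyp = hypothesis.split()
  # dp[i][j] = minimal number of edits turning ref[:i] into hyp[:j]
  dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
  for i in range(len(ref) + 1):
    dp[i][0] = i  # i deletions
  for j in range(len(hyp) + 1):
    dp[0][j] = j  # j insertions
  for i in range(1, len(ref) + 1):
    for j in range(1, len(hyp) + 1):
      dp[i][j] = min(dp[i - 1][j] + 1,                               # deletion
                     dp[i][j - 1] + 1,                               # insertion
                     dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))  # substitution or match
  return dp[len(ref)][len(hyp)] / len(ref)

print(wer('it is a truth universally acknowledged',
          'it is the truth universally acknowledge'))  # 2 errors / 6 words = 0.33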


Acoustic models

I think the most common example of such models is DeepSpeech. This framework is maintained under the Mozilla Foundation's wing. It provides one acoustic model and one language model, and these artifacts are also provided in tflite format. Some parts of this paragraph are taken from rosariomoscato/Rosario-Moscato-Lab.

This model can be used in batch mode and in streaming mode.


Model initialization

import numpy as np
from deepspeech import Model

# Paths to the pre-trained acoustic model and the external scorer (language model)
model_file_path = 'deepspeech-0.9.3-models.pbmm'
lm_file_path = 'deepspeech-0.9.3-models.scorer'

# Decoding parameters
beam_width = 100
lm_alpha = 0.93
lm_beta = 1.18

# Model initialization
model = Model(model_file_path)
# Add the language model (scorer)
model.enableExternalScorer(lm_file_path)

model.setScorerAlphaBeta(lm_alpha, lm_beta)
model.setBeamWidth(beam_width)
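
Both the batch and streaming snippets below rely on a read_wav_file helper that is not shown here; a minimal sketch using the standard wave module (assuming a 16 kHz, 16-bit mono PCM WAV, which is the format DeepSpeech expects) could look like this:

import wave

def read_wav_file(filename):
  # Assumed helper: returns the raw PCM bytes and the sample rate of the file
  with wave.open(filename, 'rb') as w:
    rate = w.getframerate()
    buffer = w.readframes(w.getnframes())
  return buffer, rate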

Batch mode

def transcribe_batch(audio_file):
  buffer, rate = read_wav_file(audio_file)
  # Convert binary data into numpy format
  data16 = np.frombuffer(buffer, dtype=np.int16)
  # Call the model
  return model.stt(data16)
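
Calling it is straightforward (the file name here is just a placeholder):

print(transcribe_batch('some_recording.wav'))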

Stream Mode

# clear_output refreshes the printed intermediate transcript in a notebook
from IPython.display import clear_output

# Get stream instance
stream = model.createStream()

def transcribe_streaming(audio_file, stream):
  buffer, rate = read_wav_file(audio_file)
  offset = 0
  batch_size = 8196
  text = ''

  while offset < len(buffer):
    end_offset = offset + batch_size
    chunk = buffer[offset: end_offset] 
    data16 = np.frombuffer(chunk, dtype=np.int16)

    stream.feedAudioContent(data16)
    text = stream.intermediateDecode()
    clear_output(wait=True)
    print(text)
    offset = end_offset
  # result is a final sentence
  result = stream.finishStream()
  return result
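
Note that a stream can only be finished once, so a fresh stream should be created for every new recording. A usage example (the file name is again a placeholder):

stream = model.createStream()
print(transcribe_streaming('some_recording.wav', stream))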

The model shows amazing efficiency on classical literature recordings. I've measured the model's WER on Pickwick Papers audio recordings, and the result is shown in the picture below:


As you can see, WER equals 0%! However, this result may be caused by data the model had already seen before. I know the model was trained on the LibriSpeech dataset, which potentially contains the same LibriVox audio recordings I used to test the model.

However, I also decided to test the model on short recordings from the Dr. House TV series, and it showed a different result:



Transformers

Transformers have already proven themselves as a good alternative for many AI-related tasks, so I was not surprised to see them among the ASR leaders.
For example, the paper Conformer: Convolution-augmented Transformer for Speech Recognition provides the current state of results in the field:


In our case, HuggingFace pipelines make our life super easy:


from transformers import pipeline

# Initialize pipeline with the default ASR model
asr_pipeline = pipeline('automatic-speech-recognition')

# Execute on the recording (pickwick_filepath points to the audio file)
asr_pipeline(pickwick_filepath)
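
If you want to pin an explicit checkpoint instead of relying on the pipeline default, you can pass a model name. The checkpoint below is just one well-known wav2vec2 option, not necessarily what the default pipeline resolves to:

wav2vec2_pipeline = pipeline('automatic-speech-recognition',
                             model='facebook/wav2vec2-base-960h')
wav2vec2_pipeline(pickwick_filepath)
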
The default transformer has shown good and stable results on both types of recordings:
