In Search of a Good ASR Model
Recently I was researching models for automatic speech recognition (ASR). It seems that this area is thriving and already provides us with very interesting results.
At the moment the most popular trends are:
1. Acoustic model + language model
2. Transformers
There is also a trend to use CNNs as more precise models, but I won't touch on it here. I think one of the important parts that is often missed is the speech cancellation model; I won't concentrate on that either.
Metrics
The main metric used to measure ASR model quality is WER (word error rate). It is the sum of word insertions (I), deletions (D) and substitutions (S) divided by the number of words in the ground-truth text (N).
WER = (I + D + S) / N
Practically all model creators report the WER of their models.
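To make the formula concrete, here is a minimal sketch of a WER computation via a word-level edit distance. The wer function and the sample sentences are my own illustration, not taken from any ASR toolkit:

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer('the quick brown fox', 'the quick brown dog'))  # 0.25: one substitution out of four words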
Acoustic models
I think the most common example of such models is DeepSpeech. This framework is maintained under the Mozilla Foundation's "wing". It provides one acoustic model and one language model; these artifacts are also provided in TFLite format. Some parts of this paragraph are taken from rosariomoscato/Rosario-Moscato-Lab.
The model can be invoked in batch mode and in streaming mode.
Model initialization
from deepspeech import Model
import numpy as np

model_file_path = 'deepspeech-0.9.3-models.pbmm'
lm_file_path = 'deepspeech-0.9.3-models.scorer'

# Decoder parameters
beam_width = 100
lm_alpha = 0.93
lm_beta = 1.18

# Model initialization
model = Model(model_file_path)

# Attach the external language model (scorer)
model.enableExternalScorer(lm_file_path)
model.setScorerAlphaBeta(lm_alpha, lm_beta)
model.setBeamWidth(beam_width)
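Both snippets below rely on a read_wav_file helper that DeepSpeech itself does not ship. A minimal sketch using Python's standard wave module could look like this (it assumes a 16 kHz, 16-bit mono WAV, which is what the model expects):

import wave

def read_wav_file(filename):
    # Return the raw PCM bytes and the sample rate of a WAV file
    with wave.open(filename, 'rb') as w:
        rate = w.getframerate()
        buffer = w.readframes(w.getnframes())
    return buffer, rate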
def transcribe_batch(audio_file):
    buffer, rate = read_wav_file(audio_file)
    # Convert the raw bytes into 16-bit samples
    data16 = np.frombuffer(buffer, dtype=np.int16)
    # Run the model over the whole recording at once
    return model.stt(data16)
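Batch transcription is then a single call (the file name here is illustrative):

print(transcribe_batch('sample.wav'))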
# Get stream instance
stream = model.createStream()
from IPython.display import clear_output

def transcribe_streaming(audio_file, stream):
    buffer, rate = read_wav_file(audio_file)
    offset = 0
    batch_size = 8196
    text = ''
    while offset < len(buffer):
        end_offset = offset + batch_size
        chunk = buffer[offset:end_offset]
        data16 = np.frombuffer(chunk, dtype=np.int16)
        # Feed the next chunk and decode what has been heard so far
        stream.feedAudioContent(data16)
        text = stream.intermediateDecode()
        clear_output(wait=True)
        print(text)
        offset = end_offset
    # finishStream() flushes the decoder and returns the final sentence
    result = stream.finishStream()
    return result
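To try it out (the file name is again illustrative; note that a DeepSpeech stream can only be finished once, so create a fresh stream for each recording):

print(transcribe_streaming('sample.wav', stream))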
Transformers
from transformers import pipeline

# Initialize the pipeline with its default ASR model
asr_pipeline = pipeline('automatic-speech-recognition')

# Execute on an audio file
asr_pipeline(pickwick_filepath)
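For reproducibility it may be worth pinning the checkpoint explicitly; at the time of writing the pipeline's default English model was facebook/wav2vec2-base-960h. The model name can be swapped for any compatible checkpoint, and the call returns a dict with a 'text' key:

asr_pipeline = pipeline('automatic-speech-recognition',
                        model='facebook/wav2vec2-base-960h')
result = asr_pipeline(pickwick_filepath)
print(result['text'])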
The default transformer has shown good and stable results on the two types of recordings I tested.