State-of-the-art Acoustic Models (AM) are large, complex deep neural networks that typically comprise millions of model parameters. Deep neural networks can express highly complex input-output relationships and transformations, but the key to getting the best performance out of them is the availability of large amounts of matched acoustic data – matched to the desired dialect, language, environmental/channel condition, microphone characteristic, speaking style, and so on. Since it is both time consuming and expensive to transcribe large amounts of matched acoustic data for every desired condition, we leverage Teacher/Student-based Semi-Supervised Learning technology for improving the AM. Our training leverages vast amounts of un-transcribed data in addition to multi-dialect transcribed data, yielding up to 7% relative word error rate reduction over the baseline model, which has not seen any unlabelled data.
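The core of Teacher/Student training on un-transcribed audio is simple: a strong teacher model produces soft frame-level posteriors, and the student is trained to match them. A minimal NumPy sketch of the distillation loss is below; the function names, the temperature parameter, and the use of a simple cross-entropy objective are illustrative assumptions, not the exact recipe described in the talk.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; a higher T produces softer targets.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def teacher_student_loss(teacher_logits, student_logits, T=2.0):
    # Cross-entropy of the student's posteriors against the teacher's
    # soft targets, averaged over frames. On un-transcribed audio the
    # teacher's output replaces the missing human transcription.
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T) + 1e-12)
    return -(p_teacher * log_p_student).sum(axis=-1).mean()
```

In practice the student would also see a supervised cross-entropy term on the multi-dialect transcribed data, with the two losses interpolated.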
Speaker Biodata:
Sri Garimella is a Senior Manager heading the Alexa Machine Learning/Speech Recognition group at Amazon, India. He has been associated with Amazon for more than 7 years. He received his PhD from the Department of Electrical and Computer Engineering, Center for Language and Speech Processing, at Johns Hopkins University, Baltimore, USA, in 2012, and his Master of Engineering in Signal Processing from the Indian Institute of Science, Bangalore, India, in 2006.
Kishore Nandury is an Applied Scientist on the Alexa ASR team at Amazon, Bangalore. Prior to Amazon, he worked at Intel, Sling Media, and NVIDIA. He obtained a Master's degree in Signal Processing from the Indian Institute of Science in 2005.