Automatic speech recognition for a virtual assistant using deep learning.



Shweta Shrestha Thapa

Organization: Tucuvi

Universidad Rey Juan Carlos

Faculty: Postgraduate Unit

Department: Engineering and Architecture


Deep learning, artificial intelligence, machine learning, speech recognition, speech, Spanish, PyTorch


The challenge of this project was to implement, or at least attempt to implement, the solution using machine learning techniques, because the resources needed for such an implementation are very expensive. What I did was use the free but limited resources of Kaggle. Another challenge was finding data in Spanish, since nearly everything related to artificial intelligence, machine learning, or deep learning is in English. For this particular challenge I also used data available on Kaggle.


The most important lesson here is that, as in any computer science work, one needs to draw on previously existing work for a starting point, and then experiment and persist. In deep learning the possibilities for tuning hyperparameters are practically endless; finding exactly the right values seems like a world of potential solutions that is sometimes out of reach. Perhaps that is because of a lack of time or of computing resources, or perhaps because whatever one has in mind has not been invented yet. Deep learning is still a science in progress, with many ongoing theories in search of efficiency. After all, it is a science, and as in any other science, people are eager to find the one truly optimal theory. It would be marvelous to know the exact learning rate for each type of network, or the number of hidden layers for any purpose. But the search for a copy of the human brain in a machine cannot be an exact science while there is still so much to discover about the human brain itself.


The objective of this project is to design a Speech-to-Text, or Automatic Speech Recognition (ASR), application for Spanish using deep learning, so that the company Tucuvi can use it. Speech recognition applications are pervasive, perhaps even invasive, in our daily lives. We use them instead of typing, for searching, for shopping, to check the weather or the time, and to set an alarm; almost anything you can think of right now can be done with your voice. Machines are becoming more intelligent not year by year but day by day: traffic detection for avoiding jams, face detection, and motion detection, all once seen only in Hollywood movies, can now be found on your smartphone. Netflix suggests which movie to watch next, and Amazon shows you products to buy, or at least browse, that are closely linked to what you were searching for a few seconds earlier.

All of these fall into the category of Artificial Intelligence (AI), and in particular machine learning (ML), which is a subset of AI. Deep learning, in turn, is a subset of ML: it is a technique for performing ML that is loosely modeled on the neural networks of the human brain, and since the first image classifier was implemented with deep learning, it has taken off with an increasing number of applications. The word "deep" refers to the number of layers of neurons stacked in the network. These artificial neurons are connected together and pass information from one to another, similar to how human neurons work. Deep learning methods grow more popular every day because such a system is capable of learning by itself using these neural networks.

Speech recognition is one of the best-performing applications of deep learning. Although in the real world we can only hear speech and cannot see it, it can be measured in mathematical and physical terms. It is from these measurements that a neural network learns by itself, by finding patterns. To find patterns it must be fed as much data as possible, since more data gives it more patterns to learn and compare. Understanding how to apply deep learning is itself a learning process, and one in continuous evolution. There are hundreds of publications to read, and more literature and more methods for efficiency are still being released. Still, there are fields that deep learning has taken over, and speech recognition is one of them. Deep learning has many parameters and options to play with, and finding the optimal ones would have been difficult for this project had there been no previous work to lean on. Implementations like Deep Speech were revolutionary in this area, so why not take inspiration from those models? The method used to accomplish the goal was to draw on several previously created models, build several of my own based on them, and keep trying until one performed best. In fact, more models were tested than those mentioned in this document; most of them, however, were part of my own learning process, so I did not think it appropriate to mention them here.

The best model, which outperformed all the others tested, was a deep neural network built from two convolutional layers and six bidirectional Gated Recurrent Units (GRUs). With limited data and resources, a word error rate of 76.32% was achieved. There are ways to improve this rate: the lack of data was probably the most important handicap in this project, followed by the lack of Graphics Processing Units (GPUs). The main question that needs to be answered is whether competing with companies like Google is worth it. Google has more data than anyone can dream of. Creating a speech recognition system for research or as a learning exercise can be fun, but creating one for a specific purpose may not be as profitable as one might think. Is anybody ready to beat, or even challenge, companies like Google, Amazon, and Facebook in terms of data and technology?
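
To put the 76.32% figure in context, the following is a minimal sketch of how a word error rate is commonly computed: the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the model output, divided by the number of reference words. The function name word_error_rate and the example sentences are illustrative assumptions, not the evaluation code used in this project.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; whitespace tokenization is an
    assumption and real evaluations often normalize case and punctuation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing in output
            insertion = curr[j - 1] + 1   # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (gato -> pato) and one deletion (pescado)
    # over four reference words.
    print(word_error_rate("el gato come pescado", "el pato come"))
```

Under this definition, a rate of 76.32% means that roughly three out of every four reference words required a correction. The value can also exceed 100% when the output contains many inserted words.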
