What is Speech Recognition?
There is a long answer and a short answer to “What is Speech Recognition?” The short answer: speech recognition is a field of Computer Science that enables a computer to listen to verbally spoken words and convert them into text. However, if you are a person who enjoys progressing beyond a simple one-line answer, please feel free to read on.
Speech recognition is a subfield of Computational Linguistics that deals with developing methodologies allowing a computer to automatically recognize spoken language and translate it into text. It is also known as Automatic Speech Recognition (ASR) or Speech-to-Text. Though rudimentary Speech-to-Text systems might handle only a single language, accent or dialect, more advanced systems can handle multiple languages, accents and dialects. Although computer scientists have dedicated decades of research and development to the field throughout the 20th and 21st centuries, it is still far from complete or perfect. That is due to the incredibly complex nature of human language and the fact that it continues to evolve with time.
Important Note:
It is essential to understand the difference between Speech Recognition and Voice Recognition, as the terms are often mistakenly used interchangeably. Speech Recognition refers to a subfield of Computational Linguistics, as discussed above, whereas in Voice Recognition a computer uses the vocal properties of the speaker for identification purposes, similar to a verbal fingerprint.
Voice Recognition is, however, often used alongside Speech Recognition, enabling the computer not only to translate speech into text but also to know who is speaking. For instance, if a computer knows what kind of accent a particular speaker uses, it can transcribe that person more accurately. A technology as ground-breaking as Speech Recognition also has a fascinating developmental history, so let’s spend a few moments on it before we move on.
History of Speech Recognition
Research and development on Speech Recognition technology started as early as the 1950s. In 1952, three researchers from Bell Labs, Stephen Balashek, R. Biddulph and K. H. Davis, built a system called “Audrey” for single-speaker digit recognition.

In 1962, IBM demonstrated a 16-word “Shoebox” machine. Later in the decade, Shuzo Saito proposed a speech coding method called Linear Predictive Coding (LPC).
By the 1980s, IBM had created a voice-activated typewriter called Tangora, which could handle a 20,000-word vocabulary. The 1980s also saw the introduction of the n-gram language model, which was an essential step towards practical Speech Recognition. In the early 2000s, DARPA sponsored two speech recognition programs, namely, Effective Affordable Reusable Speech-to-Text (EARS) and Global Autonomous Language Exploitation (GALE).
Google’s first foray into Speech Recognition came in 2007, after it hired several researchers from Nuance, with GOOG-411, a telephone-based directory service. The data from GOOG-411 proved instrumental in developing Google’s Speech Recognition system even further. Today, Google’s Voice Search feature is supported in over 30 languages.
Hidden Markov Models (more on them later) were long the dominant technique in Speech Recognition systems, often combined with feedforward Artificial Neural Networks. With the introduction of Deep Learning, many aspects have been replaced or augmented by Long Short-Term Memory (LSTM), a recurrent neural network technique introduced in 1997. A prevalent implementation of the technique can be found in the Google speech recognition software in the palm of every smartphone user’s hand.
The use of deep, non-recurrent networks for acoustic modelling was introduced in 2009 and contributed significantly to improving the accuracy of Speech Recognition systems. Through the 1980s, 1990s and 2000s, multiple challenges plagued the performance of neural-network-based techniques, and for a long time ANN techniques could not outperform Hidden Markov Model-based ones. These challenges were finally overcome during the 2010s.
How Does a Modern Speech Recognition Model Work?
Speech Recognition uses acoustic and language modelling to convert the sentences you vocalize into statistical information. For simplicity, you can think of acoustic and language modelling as two different methods of generating that information. In modern speech recognition, both models are used in unison to transcribe speech into text. Let’s build some understanding of each in turn, starting with Acoustic Modelling before briefly discussing Language Modelling.
So what is Acoustic Modelling? In a language, we use specific pronunciations or sounds to separate one word from another. For example, in English, the sounds ‘d’ and ‘t’ distinguish the word ‘bad’ from ‘bat’. Such sounds are called phonemes. In Acoustic Modelling, the computer builds a statistical representation of the relationship between the audio signal and the phonemes or other units of a language. In Language Modelling, by contrast, the computer builds a probability distribution over entire sequences of words. That gives it context to better distinguish between words or phrases that sound the same but mean different things. The n-gram model (mentioned earlier) is a simple application of exactly this idea: it estimates the probability of each word from the few words preceding it, as sketched below.
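To make this concrete, here is a minimal sketch of a bigram (2-gram) language model in Python. The toy corpus and the function names (`train_bigram`, `score`) are invented for illustration, not part of any real system; the point is simply that counting word pairs yields a probability distribution that can rank acoustically confusable transcriptions like ‘bad’ versus ‘bat’.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count word pairs to estimate P(word | previous word)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    # Normalize the counts into conditional probabilities.
    return {
        prev: {w: c / sum(ctr.values()) for w, c in ctr.items()}
        for prev, ctr in counts.items()
    }

def score(model, sentence):
    """Probability the model assigns to a whole sentence."""
    words = ["<s>"] + sentence.lower().split()
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= model.get(prev, {}).get(word, 0.0)
    return prob

# Hypothetical toy corpus.
model = train_bigram([
    "the bat flew away",
    "the bad weather stayed",
    "the bat flew home",
])

# 'bad' and 'bat' sound alike; the language model prefers
# whichever sequence is more probable in context.
print(score(model, "the bat flew away"))  # higher
print(score(model, "the bad flew away"))  # lower (zero here)
```

A real system would smooth these probabilities so that unseen word pairs do not get exactly zero, but the counting principle is the same.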
Moving on, let’s take a brief look at various other models used for speech recognition:
- Hidden Markov Models (HMMs):
HMMs were one of the earliest models used for speech recognition, before Deep Learning models started becoming increasingly popular. An HMM assumes that the system to be modelled is a Markov process. What is a Markov process? Consider the following example. Suppose a Genie is in a room which the observer does not have access to. There are three urns in the room, each containing coloured balls. At each step, the Genie selects an urn, draws a ball from it and hands the ball to the observer. Which urn the Genie selects depends only on the urn selected at the previous step, not on any earlier choices. A process with this ‘memoryless’ property is a Markov process.
Thus, the observer knows what each ball is. However, he or she has no idea which urn the ball was drawn from. In other words, the urn-selection process is ‘hidden’ from the observer, hence ‘Hidden Markov Model’. Clearly, the system of the Genie in the room has unobservable states, yet the sequence of balls the observer receives depends on those states. In HMMs, we try to learn about the unobservable Markov system by observing the dependent system, as in the sketch below.
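To see how one computes with an HMM, here is a minimal sketch of the forward algorithm in Python for the urn example, which gives the probability of an observed ball sequence under the model. Every number in the transition, emission and initial tables below is an invented illustration, not data from any real system.

```python
import numpy as np

# Hidden states: which urn the Genie draws from (3 urns).
# Observations: the colour of the ball handed out (0=red, 1=green, 2=blue).

# P(next urn | current urn) -- each row sums to 1 (values hypothetical).
transition = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
])

# P(ball colour | urn) -- each urn holds its own mix of colours.
emission = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
])

# P(first urn chosen).
initial = np.array([0.5, 0.3, 0.2])

def forward(observations):
    """Forward algorithm: P(observation sequence | model).

    alpha[i] holds the probability of the observations so far
    AND being in hidden state i at the current step.
    """
    alpha = initial * emission[:, observations[0]]
    for obs in observations[1:]:
        alpha = (alpha @ transition) * emission[:, obs]
    return alpha.sum()

# Probability of seeing red, green, green without ever
# knowing which urns were used.
print(forward([0, 1, 1]))
```

In speech recognition, the hidden states play the role of phonemes and the observed balls play the role of acoustic features extracted from the audio.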
- Natural Language Processing (NLP):
Natural Language Processing is actually a subfield of Artificial Intelligence rather than any particular model; it focuses on human-machine interaction through language, text and speech. Most ‘assistant’ software, Siri for instance, makes use of Natural Language Processing algorithms to understand verbal commands; a toy sketch follows below.
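As a toy illustration of the kind of step an assistant performs once speech has been transcribed, here is a hypothetical intent matcher in Python. The intent labels and patterns are invented for illustration; real assistants rely on far more sophisticated statistical NLP models.

```python
import re

# Hypothetical mapping from regex patterns to intent labels.
INTENTS = [
    (re.compile(r"\b(weather|temperature)\b"), "get_weather"),
    (re.compile(r"\bset (an? )?(alarm|timer)\b"), "set_alarm"),
    (re.compile(r"\b(play|put on)\b.*\b(music|song)\b"), "play_music"),
]

def parse_intent(transcript: str) -> str:
    """Map a transcribed utterance to a coarse intent label."""
    text = transcript.lower()
    for pattern, intent in INTENTS:
        if pattern.search(text):
            return intent
    return "unknown"

print(parse_intent("What's the weather like today?"))      # get_weather
print(parse_intent("Please set a timer for ten minutes"))  # set_alarm
```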
- Neural Networks:
Neural Networks, as the name suggests, are modelled after the mechanism of information transfer between neurons in the human brain, and they are the primary tool of Deep Learning. A network of nodes, arranged in layers, makes up the neural network; the more layers, the deeper the network. Whenever the output of a node exceeds a certain threshold, the node ‘fires’, transferring information to the next layer of the network, thus mimicking the brain in its functionality. Neural Networks learn their mapping function through supervised learning, minimizing a loss function using methods like gradient descent, as in the sketch below. While generally more accurate, they also tend to be slower than more traditional language models.
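Here is a minimal sketch of the ideas above: a two-layer neural network trained by gradient descent, written in Python with NumPy. The task (learning the XOR function), the layer sizes and the learning rate are arbitrary illustrative choices; the point is the layered structure and the loss-minimizing weight updates just described.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: XOR (purely illustrative).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer of 4 nodes, one output node.
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for step in range(5000):
    # Forward pass: each layer transforms the previous layer's output.
    h = sigmoid(X @ W1 + b1)      # hidden layer activations
    out = sigmoid(h @ W2 + b2)    # network prediction

    # Gradients of the squared-error loss (backpropagation).
    grad_out = (out - y) * out * (1 - out)
    grad_h = (grad_out @ W2.T) * h * (1 - h)

    # Gradient descent: nudge every weight downhill on the loss.
    W2 -= lr * h.T @ grad_out
    b2 -= lr * grad_out.sum(axis=0)
    W1 -= lr * X.T @ grad_h
    b1 -= lr * grad_h.sum(axis=0)

print(out.round(2).ravel())  # should approach [0, 1, 1, 0]
```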
Conclusion
With so much technology dominating the world we inhabit and surrounding us at all times, it is easy to forget the details of how it works; easy to ignore the millions of lines of code that keep a plane from falling out of the sky or a space station in orbit, along with all the genius that went into creating the incredible techniques that make it possible. It is decades of combined research that allow the device you hold to transcribe your voice into text, understand it and fetch you the most relevant results out of the trillions of answers that any query might generate.