Intento Speech Report: Are We Facing a Critical Juncture in Speech Technology Applications?

Marta García

Partner Marketing Manager

Intento is launching a new blog series reporting on the state of speech technologies in today’s business landscape. In this inaugural post, we discuss the developmental history of these tools and explore how the advent of AI-driven applications will help harness their full potential.

•    •    •

“Voice enables unbelievably simple interaction with technology — the most natural and convenient user interface, and the one we all use every day … Voice is the future” — Jorrit Van der Meulen, VP at Amazon Devices and Alexa EU.

•    •    •

The drive towards certain technologies has become such a ubiquitous part of our collective imagination that their development seems programmed into human history. We have always wanted to fly, looking towards the skies first with curiosity, then determination. We spent millennia observing the mysteries above us, yet less than 70 years after our first sustained flight we were walking on the moon. Now, half a century after that, we functionally have eyes and ears on Mars.

Although perhaps not as easily romanticized as the art of flight, humanity has historically held a similar fixation on mechanisms that transcribe and comprehend human speech. While Senemut, the ancient Egyptian astrologer, painstakingly mapped the stars, the first Egyptian scribes were being put to work writing letters or drawing up accounts of crop surpluses. A few hundred years before da Vinci was sketching blueprints of speculative flying machines, Pope Sylvester II was obsessing over an instrument that could answer spoken ‘yes’ or ‘no’ questions. Seventy years ago, the first speech recognition algorithms were developed. Today, I can use educational apps to develop spoken language skills, or ask Alexa to order take-out. And even though talking devices have begun to dominate our daily lives, we have only begun to realize the potential of speech technology; we have our wings, but advancements in artificial intelligence are helping us take aim at the stars.

Over the next several weeks, Intento will be assessing and reporting on the potential of speech technologies as they are currently being developed for a variety of industries. As we all know, COVID has dramatically altered both global and domestic business practices, and looking down the path at what lies ahead, it is hard to imagine a world trending back towards the standard practices of 2019. The gradual shift towards the online age was already in motion, but this pandemic has been a catalyst for a dizzying acceleration towards an inevitable future. While more familiar habits fall by the wayside, it will be imperative that our industry is prepared to adapt, embracing the proliferation of virtual activity and communication. You know how to globalize your business the traditional way, but how will service desks adapt to handle an influx of internationally based customers looking for digital support? How will companies ensure that all employees have equal access to online training? How will networking and conferences look with a more global audience?

At Intento, we are finding opportunity in these obstacles with the assistance of AI-powered speech technology. We believe that the online, globally oriented future offers a moment to transform many business practices for the better, giving us the chance to develop our relationship with AI colleagues. Our Speech Report series will break down the vendor landscape, discuss the methodology behind our analysis of that landscape, and highlight the key insights it produced. But first, we will build a solid foundational understanding of what these technologies are, and examine how they have developed towards what could be a critical juncture in how they are employed.

•    •    •

Speech Technology, a Developmental History

Torrents of Innovation

Our Speech Report explores the world of technologies that allow humans to be understood when communicating with a computer interface, ideally in a way that mimics a typical human conversation. Take automatic speech recognition, or ASR. It is useful to break this broad concept into two separate functions. The first is the ability of a computer to hear spoken words and accurately transcribe what a human has said. A familiar example is dictating a text for Siri to send from an Apple device. The other (more ambitious) function is for the interface to then understand what has been said. Again using Siri, I could ask aloud for the weather tomorrow and receive the response: “it doesn’t look so nice tomorrow; down to 25 degrees”. This shows that the computer not only comprehended my spoken prompt, but was also able to generate an accurate and suitable response.

ASR, or STT (speech-to-text), is only one component of recent advancements in speech technology; a tributary, and a large one at that, leading to this surging river of innovation. However, there are other streams, coming from different directions yet merging into the same torrent of speech-oriented developments. TTS (text-to-speech) is the functional opposite of ASR: its goal is to map text onto an acoustic format. It will have an equally compelling effect on the technologies of tomorrow, broadening the ability of our machines to read aloud to us. Think of the late Stephen Hawking, who used TTS to continue his contributions to astrophysics after losing his voice to ALS. ASR is itself a composite of individually promising functions such as ‘wake word detection’, ‘speaker recognition’, and ‘language identification’. TTS likewise encompasses subfields such as ‘style transfer’ and ‘voice improvement’. Virtual assistant use cases are practical in a day-to-day setting, but they give us only a short-sighted view of which industries will ultimately be transformed by speech technology.

Looking at the big picture, the potential applications of AI-powered voice-to-voice translation range from medical research and aid in underdeveloped countries to radical improvements in language education. But for your business’s immediate future, you can expect AI colleagues in speech technology to help ease the load on service desk operators, boost accessibility to client training, and broaden your outreach and networking capacities. And that is only the beginning. Mapping ASR development throughout the late 20th century highlights just how fundamentally AI-driven applications are changing our understanding of speech technology, and how quickly the scope of solutions available through it is growing.

From ‘Audrey’ to ‘Alexa’ — Early Speech Recognition Systems

Somewhat predictably, the first iterations of modern speech recognition computer systems were developed to help ease the burden on secretaries taking dictation. Just as human scribes appeared millennia ago to ease the writing of letters or the accounting of crops, we were now aiming to pass the tedious tasks of writing or typing on to the machines themselves. But you can look past the earliest computers to see mechanical iterations of such ideas. ‘Radio-Rex’, a hundred-year-old dog toy, had a primitive harmonic oscillator that responded to acoustic energy at 500 Hz, roughly the ‘eh’ vowel sound in Rex. In theory, when a child called ‘REX!’, a spring would be activated, moving Rex forward.

To engineer the first iteration of speech recognition computer systems, developers had to focus on numbers instead of words. Bell Laboratories developed the ‘Audrey System’ in 1952, which could recognize single digits spoken aloud by a single voice. Ten years later, IBM’s ‘Shoebox’ could understand and respond to 16 words in English, and by the end of the 1960s the technology could support words with four vowels and nine consonants. These basic systems were fundamentally a guessing game, designed to recognize tell-tale sounds known as ‘phonemes’. The characteristic pitch of a phoneme would act as a clue for matching a specific sound bite against a selection of preprogrammed tonal references.
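The ‘guessing game’ described above can be sketched in a few lines of Python. This is only an illustration, not a reconstruction of Audrey or Shoebox: it assumes each vowel sound is summarized by two resonant frequencies, and the reference table `VOWEL_TEMPLATES` and the function `classify_vowel` are hypothetical names holding rough, illustrative values.

```python
import math

# Hypothetical reference table: rough resonant frequencies (Hz) for a few
# vowel sounds. The values are illustrative approximations, not measurements
# taken from any of the systems mentioned in this article.
VOWEL_TEMPLATES = {
    "ee": (270.0, 2290.0),
    "eh": (530.0, 1840.0),
    "ah": (730.0, 1090.0),
    "oo": (300.0, 870.0),
}


def classify_vowel(f1: float, f2: float) -> str:
    """Return the template whose frequencies lie closest to the measured pair."""
    return min(
        VOWEL_TEMPLATES,
        key=lambda name: math.dist((f1, f2), VOWEL_TEMPLATES[name]),
    )


if __name__ == "__main__":
    # A sound with energy near 500 Hz lands on the 'eh' template,
    # the same vowel Radio-Rex was tuned to respond to.
    print(classify_vowel(500.0, 1800.0))
```

Nearest-template matching like this has no notion of context, which is one reason vocabularies stayed so small until probabilistic models arrived.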

This ‘guessing game’ continued to develop slowly over the decades. Research from the Defense Department throughout the 1970s and Carnegie Mellon University’s ‘Harpy’ system helped increase the tonal understanding to about 1,000 words and sentences, roughly the vocabulary of a toddler, which helped make possible the first rudimentary educational tools such as the popular ‘Speak & Spell’ toys. The 80s brought us the ‘Hidden Markov Model’ (HMM), which helped machines determine the probability of a given sound actually being part of a word prompt, moving research towards a more linguistic, rather than acoustic, approach. IBM used the HMM to build Tangora, a voice-activated typewriter capable of handling a 20,000-word vocabulary. IBM’s competitor, Dragon Systems, released the first commercial speech recognition product in 1990: a speech dictation system that you could own for $9,000. In 1997, Dragon Systems made a further contribution to the space with the first continuous speech recognition product, Dragon NaturallySpeaking. NaturallySpeaking removed the need for speech recognition dictation systems to process one word at a time, and it is widely used by medical professionals in both the US and the UK for record documentation.
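To make the HMM idea a little more concrete, here is a minimal sketch of the forward algorithm, which scores how likely a sequence of sounds is under a given word model. Everything in it is an assumption made for illustration: the two toy word models (`YES_MODEL`, `NO_MODEL`), their probabilities, and the coarse sound labels are invented, and systems such as Tangora were vastly larger.

```python
from typing import Dict, List

Model = Dict[str, dict]

# Hypothetical toy word models. Each hidden state is a phoneme; each
# observation is a coarse acoustic label. All probabilities are invented.
YES_MODEL: Model = {
    "start": {"Y": 1.0},
    "trans": {"Y": {"Y": 0.3, "EH": 0.7}, "EH": {"EH": 0.4, "S": 0.6}, "S": {"S": 1.0}},
    "emit": {
        "Y": {"glide": 0.80, "vowel": 0.10, "hiss": 0.05, "nasal": 0.05},
        "EH": {"glide": 0.05, "vowel": 0.85, "hiss": 0.05, "nasal": 0.05},
        "S": {"glide": 0.05, "vowel": 0.05, "hiss": 0.85, "nasal": 0.05},
    },
}
NO_MODEL: Model = {
    "start": {"N": 1.0},
    "trans": {"N": {"N": 0.3, "OW": 0.7}, "OW": {"OW": 1.0}},
    "emit": {
        "N": {"glide": 0.10, "vowel": 0.05, "hiss": 0.05, "nasal": 0.80},
        "OW": {"glide": 0.05, "vowel": 0.85, "hiss": 0.05, "nasal": 0.05},
    },
}


def forward_likelihood(observations: List[str], model: Model) -> float:
    """Probability of the observation sequence, summed over all state paths."""
    states = list(model["emit"])
    # alpha[s]: probability of the sounds heard so far, ending in state s.
    alpha = {
        s: model["start"].get(s, 0.0) * model["emit"][s][observations[0]]
        for s in states
    }
    for obs in observations[1:]:
        alpha = {
            s: model["emit"][s][obs]
            * sum(alpha[p] * model["trans"][p].get(s, 0.0) for p in states)
            for s in states
        }
    return sum(alpha.values())


def best_word(observations: List[str]) -> str:
    """Pick whichever toy word model explains the sounds better."""
    scores = {
        "yes": forward_likelihood(observations, YES_MODEL),
        "no": forward_likelihood(observations, NO_MODEL),
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(best_word(["glide", "vowel", "vowel", "hiss"]))
    print(best_word(["nasal", "vowel", "vowel"]))
```

Because the score is a probability rather than a hard match, a slightly noisy or drawn-out pronunciation can still be assigned to the most plausible word.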

The relatively slow era of early software roll-out came to an end in 2010, when, with the help of cloud computing, Google was able to bring its massive database of human speech examples, gathered across billions of web searches, to bear on its Voice Search app. This opened the door for ASR to enter the deep learning space, leading us past simple dictation and towards the prospect of machines that can fully understand our requests. It is only with the advent of deep learning in the field that we have seen the proliferation of devices designed to actively communicate with their human users. Today, 40% of adults in the US use voice search daily, a number that is sure to grow in line with the snowballing of AI-driven applications. The overall speech recognition market is expected to grow at a compound annual growth rate (CAGR) of 17.2% from 2019 to 2025, reaching $26.8 billion. Between AI and non-AI solutions, AI solutions command a much larger share of the market, driven by AI-powered, voice-enabled applications in industries including healthcare, hospitality, and education.
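For readers who want to sanity-check the market figure, the short sketch below works backwards from the forecast. It assumes 2019 is the base year, so six annual compounding periods lead to 2025; the source forecast may count periods differently, and the function names are hypothetical.

```python
def implied_base(final_value: float, cagr: float, periods: int) -> float:
    """Starting value implied by a final value and a compound annual growth rate."""
    return final_value / (1.0 + cagr) ** periods


def project(base: float, cagr: float, periods: int) -> float:
    """Value after compounding `base` at `cagr` for `periods` years."""
    return base * (1.0 + cagr) ** periods


if __name__ == "__main__":
    # Assumption: 2019 is the base year, giving 6 periods to reach 2025.
    base_2019 = implied_base(26.8, 0.172, 6)
    print(f"Implied 2019 market size: ${base_2019:.1f} billion")
    for year in range(2019, 2026):
        print(year, round(project(base_2019, 0.172, year - 2019), 1))
```

Under that assumption the forecast implies a 2019 market of a little over $10 billion, so the projection amounts to the market growing roughly 2.6 times in six years.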

Facing a Critical Juncture in Business Operations — Three Practical Applications

Speech Technology and Customer Support

The advent of AI-powered voice solutions follows a radical shift towards the digitization and globalization of assets. In a new era where the entire world is moving online, you cannot afford to be offline, and moving online comes with the challenges of globalizing your assets and operations, especially the need for effortless and effective translation. Text translation has made incredible strides with AI assistance; three years ago machine translation was still a questionable prospect for many companies, but today it would be considered naive not to accept its remarkable results.

Voice translation is another beast altogether, with perhaps greater potential but also greater risk. Speech cannot be post-edited, or read twice for clarification. Your speech translation must be perfect in any context the first time, and this level of accuracy and professionalism is only possible with AI-driven voice-to-voice translation. Customer support teams are already familiar with these challenges, facing an increasing need to provide digital support to an expanding array of customers with diverse language needs. Enhanced ASR/TTS now allows service desks to speak directly to international customers or clients and vice versa, translated through an intuitive custom MT platform covering language detection, sentiment analysis, and tone recognition. You can see such a conversation in action in the Intento Speech Demo for AI Service Desks, which gives a brief look into the new normal of global communications.

The Promise of Accessibility: Training Video Subtitling and Translation

It will be the responsibility of the technology industry to ensure that those who would otherwise not have access to an online world do have the same capacity to adapt. And the promise of AI-powered speech technology is the promise of accessibility. The applications of AI-powered ASR and TTS have the ability to expand outreach and remove barriers, transforming who is able to benefit from speech-oriented platforms.

For many businesses, the current pandemic has made online training procedures a necessary aspect of operations, and while this may be viewed as an obstacle, it can also be an opportunity through potential applications in automatic video translation and subtitling. Automated and reliable video translation will extend the reach of these companies to sources of human capital previously out of reach, while also giving those facing language barriers the chance to train for jobs for which they are otherwise qualified.

The same AI colleagues will broaden access to well-suited potential employees whose disabilities impede their capacity to train for certain positions. In 2017, UC Berkeley removed tens of thousands of public video lectures and podcasts due to their lack of accessibility for people with disabilities. If this happened today, video transcription applications offered by intelligent speech technology could allow these lectures to be delivered in whatever form is necessary to cater to hearing or visually impaired students. The same principle can be applied to any business operations team aiming to create and circulate training materials accessible to those with such disabilities.

Revolutionizing Online Conference Platforms

Networking is a cornerstone of growth for many businesses, and conferences have always been an essential platform for networking opportunities. With the expansion of global markets, many companies are looking beyond their own borders for engaging conference opportunities. At the same time, as COVID has forced many conferences into virtual formats, these platforms are inevitably seeing the opportunity to open the door to a global audience. Both voice-to-voice translation and automatic video subtitling and translation have practical applications here, smoothing the transition for conferences that take a more international approach. Presentations can be given effectively to global audiences, and attendees will be able to network with potential business partners without having to worry about how their language skills are affecting their first impression. Conference and networking platforms typically rely on a level of comfort for parties to connect naturally, and what a revolution it will be for these platforms to host a natural and productive conversation with a new connection across the world, while speaking completely different languages.

•    •    •

It’s inspiring to look at the modern business environment and witness such an explosion of adaptation and innovation. The proliferation of online content and global audiences shouldn’t be seen as a daunting obstacle, but as a glimpse into a promising future. Though our human workforce may not be able to handle all of this content alone, our new collaborative, intuitive AI colleagues are ready to help carry the load, and lead us into a new generation of global businesses. Applications in service desks, training, and conference outreach are only a brief glimpse behind the curtain. The next instalment of the Intento Speech Report will discuss the current vendor landscape, so please stay tuned. Voice-to-voice translation is no longer in the domain of fiction, and there has never been a greater need for intelligent speech technology. To find out more or discuss your free demo, please reach us at
