Imagine you work for a late-night comedy show and want to put together a montage of news anchors saying the word “covfefe”. You could employ an army of interns to listen to hundreds of hours of recorded broadcasts, or you could use Cobalt’s Telefol engine to search. Technology to the rescue!
Understanding Keyword Spotting
Many companies and organizations have access to large volumes of recorded speech, but it can be challenging to leverage the full value of that wealth of information because it is costly and time-consuming to search through audio. One strategy is to use automatic speech recognition (ASR) to transcribe the audio, then search the transcript. The “covfefe” example illustrates one limitation of that approach: because it’s not a word in the English lexicon, the transcript would not include it.
Cobalt has developed a product called Telefol to address these limitations using keyword spotting. Telefol can run in real-time mode, listening to live audio streams and triggering events when it recognizes configured phrases. Alternatively, it can create a phonetic index for batches of audio recordings, allowing later ad hoc searches of the audio itself. Companies can use these features either instead of or as a complement to full text transcripts to reap the most benefit from their audio assets.
When Telefol runs as a real-time event service, it allows a company to monitor many streams of audio without as much overhead as full transcription, since accurate transcription with a full conversational model requires a significant CPU and memory footprint. For some use cases, a company might not want to transcribe all the available audio, but only recognize when something relevant is said. As described in our previous blog post, ASR systems include a complex language model to predict what sequences of words are most likely in order to differentiate between similar-sounding phrases. Telefol builds a language model dynamically based on the keywords of the events the system is configured to recognize, rebuilding the model at run-time when events are added or modified. The model is faster, requiring less processing time to keep up with a real-time stream, because it does not have to predict the probability of every possible sequence and can focus on the words that are most relevant.
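To make the idea of a model that is rebuilt whenever events change more concrete, here is a minimal sketch. It is not Telefol’s API: the `EventRegistry` class and its methods are hypothetical, and a plain set of words stands in for the language model, which in a real engine would be far richer.

```python
class EventRegistry:
    """Hypothetical registry that rebuilds a tiny 'model' when events change."""

    def __init__(self):
        self.events = {}               # event name -> list of keyword phrases
        self.vocabulary = frozenset()  # stand-in for the dynamic language model

    def set_event(self, name, phrases):
        """Add or modify an event, then rebuild the vocabulary."""
        self.events[name] = [p.lower() for p in phrases]
        self._rebuild()

    def remove_event(self, name):
        """Remove an event if present, then rebuild the vocabulary."""
        self.events.pop(name, None)
        self._rebuild()

    def _rebuild(self):
        # Only words that appear in configured phrases are kept, so the
        # model stays small compared with a full conversational vocabulary.
        self.vocabulary = frozenset(
            word
            for phrases in self.events.values()
            for phrase in phrases
            for word in phrase.split()
        )


registry = EventRegistry()
registry.set_event("risk", ["risk warning"])
registry.set_event("promise", ["Guarantee"])
vocab_with_both = registry.vocabulary
registry.remove_event("risk")
vocab_after_removal = registry.vocabulary
```

The point of the sketch is only that the search space tracks the configured events: adding or removing an event changes what the recognizer has to consider.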
This streaming functionality is particularly useful for monitoring compliance, raising alerts in real time if certain key phrases are said, or if mandatory disclosures are not said within a specified portion of the audio. For example, financial advisors might have an app on their phones that reminds them if they haven’t mentioned a standard risk warning yet during a call with their client, or perhaps alerts the legal department if they do say “guarantee”!
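The compliance logic described above can be sketched in a few lines. This is an illustration under stated assumptions, not Telefol’s interface: the `Hit` record, the `compliance_alerts` function, and the phrases and deadline used here are all hypothetical, and the hits are assumed to come from some keyword spotter.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    phrase: str    # keyword phrase reported by the (assumed) spotter
    time_s: float  # offset into the call, in seconds


def compliance_alerts(hits, forbidden, required, deadline_s, now_s):
    """Return alert messages for a call, given the hits seen so far."""
    alerts = []
    # Alert on every forbidden phrase that was said.
    for hit in hits:
        if hit.phrase in forbidden:
            alerts.append(f"forbidden phrase '{hit.phrase}' at {hit.time_s:.0f}s")
    # Once the deadline has passed, alert on each required phrase
    # that was not said in time.
    if now_s >= deadline_s:
        for phrase in sorted(required):
            said_in_time = any(
                h.phrase == phrase and h.time_s <= deadline_s for h in hits
            )
            if not said_in_time:
                alerts.append(
                    f"missing disclosure '{phrase}' after {deadline_s:.0f}s"
                )
    return alerts


hits = [Hit("guarantee", 42.0)]
alerts = compliance_alerts(
    hits,
    forbidden={"guarantee"},
    required={"past performance"},
    deadline_s=120.0,
    now_s=130.0,
)
for message in alerts:
    print(message)
```

In this example the advisor said “guarantee” and never gave the (hypothetical) “past performance” disclosure within the first two minutes, so both alerts are raised.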
When Telefol runs on a batch of audio recordings, it decodes the speech not into words, but into the sequence of sounds (phonemes) that make up those words. This phonetic index is more powerful than simply doing a text search of archived transcripts because it can find out-of-vocabulary words, such as unusual proper names. For example, imagine you want to search the archives of your company’s internal meetings and presentations for all mentions of a specific client. That client’s name might not have been recognized or spelled correctly in the transcript (unless you used Cubic for transcription and had added the name to your model beforehand with Cobalt’s tools for adding new vocabulary). Because Telefol indexes the phonemes, it can search for the several most likely pronunciations of your search term.
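As a toy illustration of searching sounds rather than words, the sketch below looks for any of several candidate pronunciations in a phoneme stream. Everything here is an assumption for the sake of the example: the function name, the ARPAbet-style phoneme symbols, the two guessed pronunciations of “covfefe”, and the use of exact matching on a single phoneme string. A production engine would work with richer structures and approximate matching.

```python
def find_phoneme_matches(index, pronunciations):
    """Find where any candidate pronunciation occurs in a phoneme stream.

    index: list of (phoneme, start_ms) pairs for one recording.
    pronunciations: list of phoneme lists for the search term.
    Returns a sorted list of (start_ms, matched_pronunciation) pairs.
    """
    phones = [phone for phone, _ in index]
    matches = []
    for pron in pronunciations:
        n = len(pron)
        # Slide a window over the stream and compare phoneme by phoneme.
        for i in range(len(phones) - n + 1):
            if phones[i:i + n] == pron:
                matches.append((index[i][1], " ".join(pron)))
    return sorted(matches)


# Hypothetical decoded audio: one phoneme every 100 ms.
stream = "DH AH K OW V F EY F EY T W IY T".split()
index = [(phone, i * 100) for i, phone in enumerate(stream)]

# Two guessed pronunciations for the out-of-vocabulary search term.
covfefe = [
    ["K", "AH", "V", "F", "EH", "F", "EY"],
    ["K", "OW", "V", "F", "EY", "F", "EY"],
]
matches = find_phoneme_matches(index, covfefe)
```

The search term never has to exist in any word list: it only needs one or more plausible pronunciations, which is why out-of-vocabulary names can still be found.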
Telefol’s phonetic index also helps overcome another inherent limitation of transcript searches. Similar-sounding phrases can be ambiguous even for the best possible ASR models, and a transcript has to choose one, even if there were multiple reasonably likely alternatives. A text search for the second-best alternative for a given utterance won’t find it in the transcript, even if the model recognized that there was a 40% chance that phrase was said. Telefol, however, indexes all alternatives over a configured confidence threshold, so it can return all likely results. When running a search, the user can choose the confidence level to determine whether the results should be more inclusive or more focused.
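The effect of keeping alternatives and filtering by confidence can be sketched as follows. The `Alternative` record, the threshold values, and the example phrases are hypothetical and chosen only to mirror the 40% scenario above; this is not how Telefol necessarily stores its index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alternative:
    phrase: str
    start_s: float
    confidence: float  # assumed probability that this phrase was said


def build_index(alternatives, threshold):
    """Keep every alternative at or above the indexing threshold."""
    return [a for a in alternatives if a.confidence >= threshold]


def search(index, phrase, min_confidence):
    """Return indexed alternatives for a phrase at the requested confidence."""
    return [
        a for a in index
        if a.phrase == phrase and a.confidence >= min_confidence
    ]


# Three competing hypotheses for the same stretch of audio.
utterance = [
    Alternative("recognize speech", 12.0, 0.55),
    Alternative("wreck a nice beach", 12.0, 0.40),
    Alternative("wreck an ice peach", 12.0, 0.05),
]

index = build_index(utterance, threshold=0.2)
inclusive = search(index, "wreck a nice beach", min_confidence=0.3)
focused = search(index, "wreck a nice beach", min_confidence=0.5)
```

A transcript would contain only “recognize speech”. With the index, an inclusive search still finds the 40% alternative, while a more focused search at 0.5 leaves it out.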
Telefol goes beyond a simple search for a word by letting you define complex events: combining multiple phrases, restricting matches to a particular time segment of the audio, and requiring that phrases occur a certain number of times or within a certain number of seconds of each other. Events can also be chained together to support more sophisticated application logic.
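One way to picture such an event is as a combination of small conditions over keyword hits. The sketch below is a hedged illustration, not Telefol’s event syntax: the `Hit` record, the helper functions, and the specific rule (an “action item” mentioned at least twice in the first ten minutes, with “deadline” said within 15 seconds of one of them) are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    phrase: str
    time_s: float  # offset into the recording, in seconds


def count_in_segment(hits, phrase, start_s, end_s):
    """Count occurrences of a phrase inside a time segment."""
    return sum(
        1 for h in hits
        if h.phrase == phrase and start_s <= h.time_s <= end_s
    )


def within_seconds(hits, first, second, max_gap_s):
    """True if some occurrence of each phrase lies within max_gap_s."""
    first_times = [h.time_s for h in hits if h.phrase == first]
    second_times = [h.time_s for h in hits if h.phrase == second]
    return any(
        abs(a - b) <= max_gap_s for a in first_times for b in second_times
    )


def event_fires(hits):
    """Hypothetical event that combines a count rule and a proximity rule."""
    return (
        count_in_segment(hits, "action item", 0, 600) >= 2
        and within_seconds(hits, "action item", "deadline", 15)
    )


meeting = [
    Hit("action item", 30.0),
    Hit("deadline", 40.0),
    Hit("action item", 300.0),
]
fired = event_fires(meeting)
```

Because each condition is just a function of the hits, chaining events amounts to feeding the result of one rule into the next.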
Here are a few use cases where Telefol’s keyword spotting and phonetic indexing might be more suitable than traditional ASR:
- Indexing recordings of meetings for phrases like “let’s table that” or “take an action item”.
- Monitoring audio-enabled security cameras for sensitive phrases that might require attention.
- In a fast-food drive-through, reminding the clerk to say the appropriate (perhaps seasonal) phrases as directed.
- Monitoring multiple live streams for mentions of a particular person or entity for PR or reputation management.
- Monitoring sensitive areas in a school for bullying phrases without transcribing or recording any of the conversation.
- In an interactive gaming setting, listening for specific commands like “launch the rocket” without having to transcribe all of the other banter between the players.
- Indexing large volumes of video for quick search and retrieval.
How can your business use these powerful tools to reap benefits from your unstructured audio data? Get in touch to see how we can help you.
About the Author
Julie Sheffield has 20 years of experience in the software industry, building enterprise platforms and applications, leading teams, and mentoring talented engineers. At Cobalt, she leads the engineering team to ensure the scalability and robustness of the various speech engines Cobalt deploys, and to support customers’ integration of those engines into their own applications. Julie loves finding solutions, making the complex seem simple, and creating software that empowers people to accomplish more than they ever thought possible.