Why Smart Speakers & In-Audio Search Need a Diverse Range of Digital Voices
With the rise of podcasts, there’s a growing trend of podcast platforms and apps leveraging natural language processing (NLP) to help users find and listen to shows. This is similar to smart speakers that use voice-activated artificial intelligence (AI), such as Alexa, Google Home and Apple HomePod.
With any new trend, where there are opportunities, there are also issues. The issue here is that the increased use of natural language processing and artificial intelligence presents a diversity and inclusion problem. This doesn’t only affect Caribbean people and people of color, but all speakers whose voices fall outside “traditional” or “broadcast English.”
Let’s start with the basics:
What is artificial intelligence?
“Artificial intelligence (AI) makes it possible for machines to learn from experience, adjust to new inputs and perform human-like tasks. Most AI examples that you hear about today – from chess-playing computers to self-driving cars – rely heavily on deep learning and natural language processing. Using these technologies, computers can be trained to accomplish specific tasks by processing large amounts of data and recognizing patterns in the data.” (Definition by SAS)
What is natural language processing?
Simply put, AI and NLP learn from what you input; the system gets smarter and improves with more human interaction, and the learning cycle repeats.
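That learning cycle can be sketched as a toy feedback loop. This is a deliberately simplified illustration (the class and phrases are invented, and real NLP systems use far more sophisticated statistical models), but it shows the basic idea: every confirmed input becomes training data, so the system gets better at whatever it hears most.

```python
from collections import Counter

class ToyRecognizer:
    """A toy illustration of the NLP learning cycle:
    input -> guess -> confirmation -> improved future guesses."""

    def __init__(self):
        self.phrase_counts = Counter()  # what the system has "heard" so far

    def hear(self, phrase):
        # Each confirmed phrase becomes training data for the next cycle.
        self.phrase_counts[phrase] += 1

    def guess(self, fragment):
        # Prefer the completion the system has been exposed to most often.
        matches = [p for p in self.phrase_counts if p.startswith(fragment)]
        if not matches:
            return None  # the dreaded "Sorry, I didn't get that."
        return max(matches, key=lambda p: self.phrase_counts[p])

recognizer = ToyRecognizer()
for _ in range(3):
    recognizer.hear("play the news")
recognizer.hear("play the new Koffee track")

print(recognizer.guess("play the"))  # the more frequent phrase wins
```

Notice the catch: whoever interacts with the system most shapes what it learns, which is exactly why the mix of voices feeding it matters.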
What is SEO?
“SEO is the acronym for Search Engine Optimization. It’s the practice of optimizing websites to make them reach a high position in Google’s – or another search engine’s – search results. SEO focuses on rankings in the organic (non-paid) search results.” (Definition by Yoast)
The content on a website is one way to rank highly in the search results of Google or any other search engine.
Why is this important?
SEO. “Voice SEO” to be exact. The use of AI and NLP in podcasts and voice-activated gadgets like smart speakers is ultimately about search. There’s already a problem with the discoverability of podcasts by people of color even without the NLP and AI aspects, and this just makes it more difficult. And if the tech doesn’t understand you when you speak, your words are lost in translation.
Podcasts & Natural Language Processing
A few months ago there was an article about Castbox, a podcast app that raised $13.5 million to launch its own programming. The article mentioned how Castbox’s use of natural language processing was going to revolutionize podcasting.
“What makes Castbox interesting is the proprietary technology it has under the hood. The platform uses natural language processing and machine learning techniques to power some of its unique features, like personalized recommendations and in-audio search.
The app is capable of making suggestions of what to listen to next based on users’ prior listening behavior, which can help to improve discovery of podcasts people may like. Meanwhile, the in-audio search feature takes advantage of the recent leaps the industry has seen with voice recognition technology, and actually transcribes the audio content inside podcasts, indexes it and makes it available for search within the Castbox app.
That means users no longer have to rely on things like episode titles, descriptions and show notes to find a podcast related to a topic they want to listen to — they can just search the Castbox app for any podcasts where a term was mentioned.” ~ Source: Techcrunch
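The pipeline Techcrunch describes (transcribe the audio, index it, make it searchable by timestamp) can be sketched with a toy inverted index. The transcript segments and function names below are hypothetical, not Castbox’s actual implementation:

```python
from collections import defaultdict

# Hypothetical transcript segments: (episode, timestamp in seconds, text).
# In a real system these would come from speech-to-text over the audio.
segments = [
    ("Episode 12", 95,  "how I get through the present moment"),
    ("Episode 12", 410, "balancing work and family"),
    ("Episode 30", 62,  "tips to get through a rough week"),
]

def build_index(segments):
    """Map each spoken word to the (episode, timestamp) pairs containing it."""
    index = defaultdict(set)
    for episode, ts, text in segments:
        for word in text.lower().split():
            index[word].add((episode, ts))
    return index

def in_audio_search(index, query):
    """Return the segments containing ALL query terms, with timestamps."""
    hits = None
    for word in query.lower().split():
        found = index.get(word, set())
        hits = found if hits is None else hits & found
    return sorted(hits or set())

index = build_index(segments)
print(in_audio_search(index, "get through"))
```

The weak link is the first step: if the speech-to-text model mishears accented speech, the wrong words land in the index, and the episode never surfaces for the terms that were actually spoken.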
Both Carry On Friends The Caribbean American Podcast and The Style & Vibes Podcast are on Castbox. On its face, all these features are great. However, my primary concern with podcast platforms or apps that provide natural language processing via in-audio search is that it will benefit only, or mostly, popular shows backed by networks and shows recorded in standard English. From my career in the legal industry, I was familiar with natural language processing and the limitations driving my concerns. However, since I’m no longer in that industry, perhaps things have improved, so I decided to do some experimenting.
There’s a Castbox demo video on YouTube showing how the in-audio search feature that leverages NLP works. As in the video, I searched “how to get through the present”. The results were similar and were categorized into channels (aka shows), episodes and audio. The audio results, it appears, come from the NLP engine going through the audio of each show and pinpointing the exact timestamp where one or more of the search terms appears in an episode. A similar search using Carry On Friends’ content didn’t produce the same results.
The episode used in my demo wasn’t even in patois, or heavy in it; I spoke in my natural voice. My simple experiment confirmed my concerns: natural language processing has limitations even with standard English, much less with accented speech.
In June, I did a fireside chat during CITE week on the digital voice and had a follow-up lunch with a friend where we discussed the topic further. As it turns out, the Washington Post and its vast resources were already exploring my concerns.
“At first, all accents are new and strange to voice-activated AI, including the accent some Americans think is no accent at all — the predominantly white, nonimmigrant, nonregional dialect of TV newscasters, which linguists call “broadcast English”.
The AI is taught to comprehend different accents, though, by processing data from lots and lots of voices, learning their patterns and forming clear bonds between phrases, words and sounds.
To learn different ways of speaking, the AI needs a diverse range of voices — and experts say it’s not getting them because too many of the people training, testing and working with the systems all sound the same. That means accents that are less common or prestigious end up more likely to be misunderstood, met with silence or the dreaded, “Sorry, I didn’t get that.”
…for people with accents — even the regional lilts, dialects and drawls native to various parts of the United States — the artificially intelligent speakers can seem very different: inattentive, unresponsive, even isolating. For many across the country, the wave of the future has a bias problem, and it’s leaving them behind.” ~ Source: The Washington Post
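The Post’s point about training data can be illustrated with a toy simulation. The numbers and pronunciations below are invented purely for illustration: when one accent dominates the training set, a simple frequency-based recognizer becomes reliable for that accent and rejects the others.

```python
from collections import Counter

# Invented training set: pronunciations of the word "three" labeled by accent.
# "Broadcast English" dominates, mirroring the imbalance the Post describes.
training_samples = (
    [("three", "broadcast")] * 90 +
    [("tree", "caribbean")] * 5 +
    [("thray", "southern")] * 5
)

# A naive model: count how often each spoken form has been heard in training.
heard_forms = Counter(form for form, _accent in training_samples)

def recognize(form):
    """Accept a pronunciation only if the model has seen it often enough."""
    # Hypothetical confidence threshold: forms heard fewer than 10 times fail.
    if heard_forms[form] < 10:
        return "Sorry, I didn't get that."
    return "three"

print(recognize("three"))  # well-represented accent: understood
print(recognize("tree"))   # underrepresented accent: rejected
```

All three speakers said the same word; only the one whose accent filled the training data gets understood. Diversifying the voices in `training_samples` is the fix the experts are calling for.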
Simply put, people who have accents, or who communicate in a way their audience appreciates and loves, are being left out of the smart speaker revolution. And if more podcast platforms use natural language processing to make shows searchable, shows produced by Breadfruit Media are already on the losing end.
Dear @Apple, can we get a Caribbean Siri?
— Melanin Queen 🍫✨ (@__Deedz) July 24, 2018
And instead of Siri she’d be called Shelly Ann https://t.co/L2Ukskk0yr
— Rewind N Come Again (@RACAblog) July 17, 2018
The same way people want to see themselves represented in images, videos, print or text, the same applies to the digital voice. It’s not only an issue with podcasts and voice-activated speakers. Don’t believe me? Have you tried using the mic button on your smartphone to dictate and send a message? Forget using patois (patwa) to send a message; sometimes it doesn’t even catch my proper English correctly!
I see this as a potential opportunity in the rise of the digital voice. Many people get into podcasting to be the talent/host; however, I think there also needs to be diversity among those producing the technology used in the space.
I intentionally use a transcription service that understands patois. I would outsource production if I didn’t want to control the quality. However, say I were ready to outsource: I couldn’t find a production company that I believe could handle and understand the nuances of my patois and code-switching well enough to know when and how to make the edits.
Podcast platforms and apps continue to offer ways to improve the discoverability of a show and reach new audiences. However, some of these new efforts aren’t beneficial to all podcasters. Yes, we should leverage what we can, like Castbox’s commenting feature, which allows for more engagement with the audience. But a content creator shouldn’t have to choose between their show being searchable and discoverable via smart speakers or voice-searchable podcast platforms, and the authenticity in language that their audience loves about their show.
As the Washington Post article pointed out, more strides can be made when there’s diversity among the people developing and testing the technology. I think there are opportunities for minority tech entrepreneurs to enter the space and solve some of the problems content creators are having, or for existing technology companies to engage diverse voices to improve their AI offerings.