Exploring the Emergent Abilities of the Largest Text-to-Speech AI Model Yet
Amazon’s researchers have developed the largest text-to-speech model to date, one that exhibits “emergent” abilities, sounding out complicated sentences in a noticeably more natural way. That advance could be exactly what the technology needs to finally climb out of the uncanny valley.
These models were always expected to improve with more data and scale, but what the researchers hoped to see was the kind of leap that large language models exhibit once they pass a certain size. Past that threshold, LLMs become far more robust and flexible, able to perform tasks they were never explicitly trained for.
That’s not to say they develop consciousness, but past a certain size their performance on certain conversational AI tasks improves dramatically. The team at Amazon AGI (it’s clear what their target is) suspected the same might happen as text-to-speech models grew, and their research suggests that it does.
The new model is named Big Adaptive Streamable TTS with Emergent abilities, cleverly shortened to BASE TTS. The largest version of the model was trained on 100,000 hours of public-domain speech, 90% of it English and the remainder German, Dutch, and Spanish.
At 980 million parameters, BASE-large is the biggest model in its category. For comparison, the team also trained models with 400M and 150M parameters on 10,000 and 1,000 hours of audio, respectively. The idea was to bracket a window: if an emergent behavior appears in one model but not in the next smaller one, the threshold at which it arises must lie somewhere in between.
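The bracketing logic above can be made concrete with a quick sketch. The figures come from the article; the configuration names and the comparison code are illustrative, not Amazon’s actual tooling.

```python
# The three BASE TTS configurations described above. Names are informal
# labels for illustration; the parameter counts and training-data sizes
# are as reported in the article.
configs = {
    "BASE-large":  {"params_millions": 980, "train_hours": 100_000},
    "BASE-medium": {"params_millions": 400, "train_hours": 10_000},
    "BASE-small":  {"params_millions": 150, "train_hours": 1_000},
}

def emergence_window(has_ability: dict) -> tuple:
    """Given which models show an ability, return the (lower, upper) bound
    on the parameter count where that ability emerges."""
    sizes = sorted(c["params_millions"] for c in configs.values())
    shown = sorted(configs[m]["params_millions"] for m, ok in has_ability.items() if ok)
    if not shown:
        return (sizes[-1], None)  # emerges, if at all, above the largest model
    lower = max(s for s in sizes if s < shown[0]) if shown[0] > sizes[0] else 0
    return (lower, shown[0])

# Per the article, the medium model already shows the leap, the small one doesn't:
window = emergence_window({"BASE-large": True, "BASE-medium": True, "BASE-small": False})
print(window)  # the threshold lies between 150M and 400M parameters
```

This is only a toy formalization of the paper’s experimental design: training models of spread-out sizes so that any capability jump can be localized between two of them.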
The mid-sized model was the one that showed the anticipated leap in capability. Its ordinary speech quality was only a few points better than the smaller model’s, but it clearly excelled at the emergent abilities the team probed and measured. The study gives detailed examples of the tricky text involved.
The authors explain that these examples involve difficult tasks: parsing garden-path sentences, applying phrasal stress to lengthy compound nouns, producing emotional or whispered speech, and generating correct phonemes for non-English words like “qi” or symbols like “@”. Importantly, none of these are tasks BASE TTS was specifically trained to accomplish.
Features like these normally lead text-to-speech engines astray, resulting in mispronunciations, dropped words, unusual intonation, and other errors. BASE TTS still encountered some difficulties, but it held its ground and performed far better than comparable models such as Tortoise and VALL-E.
There are a bunch of examples of these difficult texts being spoken quite naturally by the new model on the site the team made for it. Of course these were chosen by the researchers, so they’re necessarily cherry-picked, but it’s impressive regardless. Here are a couple, if you don’t feel like clicking through:
Since the three BASE TTS models share a common architecture, it seems clear that the model’s size and the breadth of its training data are what let it handle these complexities. Bear in mind, though, that this is still an experimental model and process, not something commercially available. Future studies will need to pinpoint the tipping point for emergent capability, and how to train and deploy the resulting model efficiently.
The model can be “streamed,” as its name implies: rather than generating whole sentences at once, it proceeds moment to moment at a relatively low bitrate. The team has also attempted to package speech metadata such as emotion, prosody, and other characteristics into a separate, low-bandwidth channel that can accompany the plain audio.
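A minimal sketch of what that streaming shape implies for a consumer: audio arrives in small chunks, each paired with lightweight metadata, and playback can begin before synthesis finishes. Everything here, from the names to the chunking scheme, is an assumption for illustration and not BASE TTS’s actual interface.

```python
# Toy sketch of chunked streaming with a sidecar metadata channel.
# stream_tts, SpeechChunk, and the chunking scheme are all hypothetical.
from dataclasses import dataclass
from typing import Iterator

@dataclass
class SpeechChunk:
    audio: bytes    # stand-in for a low-bitrate codec frame
    metadata: dict  # sidecar channel, e.g. {"offset": 0, "emotion": "neutral"}

def stream_tts(text: str, chunk_chars: int = 16) -> Iterator[SpeechChunk]:
    """Fake synthesizer: yields one chunk per slice of input text,
    so a caller can start playing audio before the sentence is done."""
    for i in range(0, len(text), chunk_chars):
        piece = text[i:i + chunk_chars]
        yield SpeechChunk(
            audio=piece.encode("utf-8"),  # placeholder for encoded audio
            metadata={"offset": i, "emotion": "neutral"},
        )

# A player consumes chunks as they arrive rather than waiting for the end:
for chunk in stream_tts("The old man the boats."):
    print(chunk.metadata["offset"], len(chunk.audio))
```

The point of the separate metadata channel is that prosody and emotion cues stay cheap to transmit and easy to inspect, while the audio itself can be compressed aggressively.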
It appears that text-to-speech models may have their breakthrough moment in 2024, just in time for the elections! There’s no denying the technology’s utility, though, particularly for accessibility. The team does note that it declined to release the model’s source and other data due to the risk of misuse by malicious actors, though this information will likely become public eventually anyway.