Meta unveils Voicebox, the most versatile generative AI for speech generation

Meta has launched Voicebox, a generative artificial intelligence (AI) model for speech. The model handles an array of speech generation tasks, such as audio editing, sampling, and stylising, that it was not specifically trained to perform. Instead, Voicebox picks up these skills through in-context learning.

Voicebox can create high-quality audio clips and edit pre-recorded audio, removing intrusive sounds such as car horns or barking dogs whilst retaining the original content and style. It is also multilingual, able to generate speech in six languages: English, French, German, Spanish, Polish, and Portuguese.

Voicebox: The Future of Multipurpose Generative AI

Looking forward, Voicebox illustrates the potential of multipurpose generative AI models. Such models could give virtual assistants and non-player characters in the metaverse natural-sounding voices. They could also let visually impaired people hear written messages from friends, read aloud by AI in familiar voices.

Beyond these applications, Voicebox could change content creation by giving creators new tools to produce and edit audio tracks for videos more easily.

Meta’s Voicebox: A Multitude of Applications

Voicebox, Meta’s latest offering, is versatile, with capabilities extending across a wide spectrum of tasks:

In-context text-to-speech synthesis: Voicebox can replicate the style of an audio sample as short as two seconds long for text-to-speech generation.

Speech editing and noise reduction: Voicebox can recreate a segment of speech that was interrupted by noise, or replace mispronounced words, without the need to re-record the whole speech. Think of it as an eraser for audio editing.

Cross-lingual style transfer: Given a sample of speech and a text passage in English, French, German, Spanish, Polish, or Portuguese, Voicebox can generate a reading of the text in any of these languages, even if the sample speech and the text are not in the same language. This function could be instrumental in facilitating natural and authentic communication, even between speakers of different languages.

Diverse speech sampling: Having learned from diverse data, Voicebox can generate speech that reflects how people talk in the real world, in any of the six languages listed above.

The unveiling of Voicebox reflects Meta's commitment to generative AI research. As the company moves further into audio, it says it looks forward to seeing how other researchers and innovators build on Voicebox's capabilities.