Meta unveils Voicebox, the most versatile generative AI for speech generation


Meta is proud to break new ground in the domain of generative artificial intelligence (AI) for speech with the launch of Voicebox. This innovative AI model excels in an array of speech generation tasks such as audio editing, sampling, and stylising, despite not being directly trained for these capabilities. Instead, Voicebox has mastered these skills through a process of in-context learning.

The Voicebox model, a product of Meta’s relentless innovation, can create high-quality audio clips and edit pre-recorded audio, removing intrusive sounds such as car horns or barking dogs whilst retaining the original content and style. Voicebox is also not confined to a single language: it can generate speech in six languages.

Voicebox: The Future of Multipurpose Generative AI

Looking forward, Voicebox illustrates the potential of multipurpose generative AI models. It could be the key to giving virtual assistants and non-player characters in the ever-evolving metaverse natural-sounding voices. The model could also enable visually impaired individuals to hear written messages from friends read aloud by AI in those friends' voices.

Beyond these applications, Voicebox promises to revolutionise content creation, giving creators new tools for easily creating and editing audio tracks for videos.

Meta’s Voicebox: A Multitude of Applications

Voicebox, Meta’s latest offering, is versatile, with capabilities extending across a wide spectrum of tasks:

In-context text-to-speech synthesis: Voicebox can replicate the style of an audio sample as short as two seconds long for text-to-speech generation.

Speech editing and noise reduction: Voicebox can recreate a speech segment interrupted by noise or replace mispronounced words, all without the need to re-record the whole speech. Think of it as an eraser for audio editing.

Cross-lingual style transfer: Given a sample of speech and a text passage in English, French, German, Spanish, Polish, or Portuguese, Voicebox can generate a reading of the text in any of these languages, even if the sample speech and the text are not in the same language. This function could be instrumental in facilitating natural and authentic communication, even between speakers of different languages.

Diverse speech sampling: Having learned from diverse real-world data, Voicebox can generate speech that authentically reflects how people talk in real-world conversations in the six languages listed above.

The unveiling of Voicebox is a testament to Meta’s commitment to pushing the boundaries in generative AI research. As the company ventures further into the audio space, it eagerly awaits the opportunity to see how other researchers and innovators will build upon Voicebox’s groundbreaking capabilities.
