Google DeepMind unveils new AI tool for video soundtrack generation

The new tool generates video soundtracks using not only text prompts but also the content of the video itself, letting users craft scenes with a dramatic score, realistic sound effects, or matching dialogue.


Google DeepMind has unveiled a new AI tool that generates video soundtracks using not only text prompts but also the content of the video itself. By combining the two, DeepMind claims, users can craft scenes with 'a drama score, realistic sound effects, or dialogue that matches the characters and tone of a video'.

For one example, a video of a car driving through a cyberpunk-like cityscape, Google used the prompt 'cars skidding, car engine throttling, angelic electronic music' to generate audio in which the skidding sounds align with the car’s movements.

Another example creates an underwater soundscape using the prompt 'jellyfish pulsating underwater, marine life, ocean.'

Although users can include a text prompt, DeepMind notes that it is optional, and users don’t need to manually match the generated audio to the corresponding scenes. According to DeepMind, the tool can generate an 'unlimited' number of soundtracks for a video, giving users a wide range of audio options to choose from. This capability distinguishes it from other AI tools, such as ElevenLabs' sound effects generator, which also uses text prompts to create audio. It might also make it easier to pair audio with AI-generated video from tools like DeepMind’s Veo and OpenAI’s Sora.

DeepMind states it trained its AI tool using video, audio, and annotations that include 'detailed descriptions of sound and transcripts of spoken dialogue.' This training enables the video-to-audio generator to synchronize audio events with visual scenes.

However, the tool does have some limitations. For instance, DeepMind is still working to improve its ability to synchronize lip movements with dialogue, a weakness visible in its demo video of a claymation family. Additionally, the company points out that the video-to-audio system relies on video quality, so anything grainy or distorted 'can lead to a noticeable drop in audio quality.'

The tool isn’t available yet. When it is released, its audio output will include Google’s SynthID watermark to flag it as AI-generated.

