Wolf howl created by artificial intelligence
Google DeepMind

DeepMind on Tuesday showed off the latest results from its generative AI research into video-to-audio conversion: a new system that combines a video’s on-screen content with a user’s written prompt to create synchronized soundscapes for a given clip.

V2A can be combined with video generation models like Veo to create soundtracks, sound effects, and even dialogue for on-screen action, DeepMind’s generative audio team wrote in a blog post. The team also claims the new system can generate “an unlimited number of soundtracks for any video input,” steering the model with positive and negative prompt signals that encourage or discourage the use of a particular sound, respectively.
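DeepMind has not published how its prompt steering works internally, but the idea of positive and negative signals can be illustrated with a toy example: reward candidate sounds that match wanted terms and penalize those that match unwanted ones. Everything here (the function name, the tag-based candidates) is a hypothetical sketch, not DeepMind’s API.

```python
def rank_soundscapes(candidates, positive, negative):
    """Toy illustration of positive/negative prompt steering:
    score each candidate soundscape by how many of its tags appear
    in the positive set, minus how many appear in the negative set,
    then return the highest-scoring candidate."""
    def score(tags):
        return sum(t in positive for t in tags) - sum(t in negative for t in tags)
    return max(candidates, key=lambda c: score(c["tags"]))

# Hypothetical candidate soundscapes with descriptive tags
candidates = [
    {"name": "city_ambience", "tags": {"traffic", "horns", "crowd"}},
    {"name": "forest_night", "tags": {"wind", "wolf_howl", "crickets"}},
]

# A positive prompt asking for a howling wolf, a negative prompt
# suppressing traffic noise, selects the forest soundscape.
best = rank_soundscapes(
    candidates,
    positive={"wolf_howl", "wind"},
    negative={"traffic"},
)
```

In a real diffusion model the prompts would condition the denoising process itself rather than rank pre-made candidates, but the encourage/discourage trade-off is the same.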

V2A cars

The system works by first encoding and compressing the input video, which the diffusion model then uses to iteratively refine the desired audio out of random noise, guided by the visual input and any additional text prompt from the user. The audio output is finally decoded into a waveform, which can then be combined with the video input.
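The pipeline described above, encode the video, iteratively refine audio from noise under that conditioning, then decode, can be sketched in miniature. This is a heavily simplified stand-in with made-up function names and a trivial "refinement" rule, not DeepMind’s actual model, which is not public.

```python
import random

def encode_video(frames):
    # Stand-in for V2A's video encoder: compress each frame
    # (a list of pixel values) into one conditioning number.
    return [sum(frame) / len(frame) for frame in frames]

def iterative_refinement(conditioning, steps=50, seed=0):
    # Toy diffusion-style loop: start from random noise and nudge
    # the sample a little closer to the conditioning signal on each
    # step, mimicking iterative denoising.
    rng = random.Random(seed)
    audio = [rng.gauss(0.0, 1.0) for _ in conditioning]
    for _ in range(steps):
        audio = [a + 0.2 * (c - a) for a, c in zip(audio, conditioning)]
    return audio

# Three tiny "frames" of fake pixel data
frames = [[0.1, 0.3], [0.5, 0.7], [0.9, 0.9]]
cond = encode_video(frames)
waveform = iterative_refinement(cond)
```

After enough steps the sample converges toward the conditioning signal; in the real system each step would be a learned denoising network, and the output would be decoded back into an audio waveform.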

The best part is that the user does not have to go in and manually (read: tediously) synchronize the audio and video tracks, since the V2A system does this automatically. “By training on video, audio, and additional annotations, our technology learns to associate specific audio events with various visual scenes, while responding to the information provided in the annotations or transcripts,” the DeepMind team writes.

V2A Wolf

However, the system is not yet perfect. For one, the quality of the audio output depends on the fidelity of the video input, and the system degrades when the input contains visual artifacts or other distortions. And according to the DeepMind team, synchronizing generated dialogue with on-screen lip movements remains an ongoing challenge.

“V2A attempts to generate speech from input transcripts and synchronize it with the characters’ lip movements,” the team explained. “But the paired video generation model may not be conditioned on transcripts. This creates a mismatch, often resulting in uncanny lip-syncing, as the video model doesn’t generate mouth movements that match the transcript.”

The system still must undergo “rigorous safety assessments and testing” before the team will consider releasing it to the public, and every video and soundtrack the system produces will carry DeepMind’s SynthID watermark. V2A is far from the only sound-generating AI currently on the market: Stability AI launched a similar product just last week, and ElevenLabs launched its sound-effects tool last month.

Source: Digital Trends

I am Garth Carter and I work at Gadget Onus. I have specialized in writing for the Hot News section, focusing on topics that are trending and highly relevant to readers. My passion is to present news stories accurately, in an engaging manner that captures the attention of my audience.

