Google Unveils SoundStorm for Parallel Audio Generation from Discrete Conditioning Tokens

Google recently introduced a new model called SoundStorm in a paper titled ‘SoundStorm: Efficient Parallel Audio Generation’. It presents a novel approach to efficient, high-quality audio generation.

Read the full paper here.

SoundStorm tackles the problem of generating lengthy audio token sequences through two innovative components:

  • An architecture tailored to the unique nature of audio tokens produced by the SoundStream neural codec.
  • A decoding scheme inspired by MaskGIT, a recently introduced method for image generation, specifically designed to operate on audio tokens.

Compared to the autoregressive decoding approach of AudioLM, SoundStorm achieves parallel generation of tokens, resulting in a 100-fold reduction in inference time for long sequences. Moreover, SoundStorm maintains audio quality while offering increased consistency in voice and acoustic conditions.

Furthermore, the paper demonstrates that by combining SoundStorm with the text-to-semantic modeling stage of SPEAR-TTS, it becomes possible to synthesize high-quality, natural dialogues. This allows for control over the spoken content through transcripts, speaker voices using short voice prompts, and speaker turns via transcript annotations. The provided examples serve as evidence of the capabilities of SoundStorm and its integration with SPEAR-TTS in producing convincing dialogues.

What’s Under the Hood

In their previous work on AudioLM, the researchers demonstrated a two-step process for generating audio. The first step involved semantic modeling, where semantic tokens were generated based on previous semantic tokens or a conditioning signal. The second step, known as acoustic modeling, focused on generating acoustic tokens from the semantic tokens.

However, in SoundStorm, the researchers specifically addressed the acoustic modeling step and aimed to replace the slower autoregressive decoding with a faster parallel decoding method.

SoundStorm used a bidirectional attention-based Conformer, which is a model architecture that combines convolutions with a Transformer. This architecture allows for capturing both local and global structures in a sequence of tokens. The model was trained to predict audio tokens produced by SoundStream, given a sequence of semantic tokens generated by AudioLM as input. The SoundStream model employed a method called residual vector quantization (RVQ), where up to Q tokens were used to represent the audio at each time step. The reconstructed audio quality improved progressively as the number of generated tokens per step increased from 1 to Q.
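To make the token layout concrete, here is a minimal sketch of how a grid of RVQ acoustic tokens can be embedded into one vector per frame, which is what keeps the sequence length independent of the number of levels Q. The sizes and random data are toy stand-ins; this is an illustration of the idea, not Google's code.

```python
import numpy as np

# Toy sizes for illustration only (not the paper's actual configuration).
T, Q, codebook_size, d_model = 250, 12, 1024, 8

rng = np.random.default_rng(0)
# With residual vector quantization, each audio frame is described by Q tokens,
# one per quantizer level, so T frames of audio become a T x Q grid of codes.
acoustic_tokens = rng.integers(0, codebook_size, size=(T, Q))

# The Conformer sees one embedding per frame: the embeddings of the Q tokens
# belonging to the same frame are summed, so the input sequence length stays T
# regardless of how many RVQ levels are used.
embedding_tables = rng.normal(size=(Q, codebook_size, d_model))
frame_embeddings = sum(embedding_tables[q][acoustic_tokens[:, q]] for q in range(Q))
print(frame_embeddings.shape)  # (250, 8)
```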

During inference, SoundStorm started with all audio tokens masked out, and then filled in the masked tokens over multiple iterations. It began with the coarse tokens at RVQ level q = 1 and continued with finer tokens, level by level, until reaching level q = Q. This approach enabled fast generation of audio.
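A stripped-down sketch of that decoding loop might look like the following. The prediction function is a random placeholder for the Conformer, and the iteration counts are illustrative rather than the paper's exact schedule.

```python
import numpy as np

rng = np.random.default_rng(0)
T, Q, codebook_size = 250, 12, 1024
MASK = -1

def model_predict(tokens):
    """Stand-in for the Conformer: a token guess and a confidence per frame."""
    return rng.integers(0, codebook_size, T), rng.random(T)

tokens = np.full((T, Q), MASK)
for q in range(Q):                        # coarse level first, then finer ones
    n_iters = 16 if q == 0 else 1         # finer levels filled in a single pass
    for step in range(n_iters):
        masked = tokens[:, q] == MASK
        if not masked.any():
            break
        guesses, confidence = model_predict(tokens)
        if step == n_iters - 1:
            tokens[masked, q] = guesses[masked]   # final pass: commit the rest
        else:
            # Keep only the most confident new predictions and re-mask the
            # others for the next iteration (MaskGIT-style).
            keep = masked & (confidence >= np.median(confidence[masked]))
            tokens[keep, q] = guesses[keep]
# `tokens` is now a fully populated T x Q grid for the SoundStream decoder.
```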

Two crucial aspects of SoundStorm contributed to its fast generation capability. Firstly, tokens were predicted in parallel within a single iteration at each RVQ level. Secondly, the model architecture was designed in a way that the computational complexity was only mildly affected by the number of levels, Q. To support this inference scheme, a carefully designed masking scheme was used during training to simulate the iterative process used during inference.
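To make the training-side masking concrete, here is a rough sketch of the idea: the model is shown the kind of partially filled token grid it will encounter mid-inference. The exact schedule, level sampling and prompt handling in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
T, Q, codebook_size = 250, 12, 1024
MASK = -1

tokens = rng.integers(0, codebook_size, size=(T, Q))   # ground-truth codes
q = rng.integers(0, Q)                                  # RVQ level to train on
mask_ratio = np.cos(0.5 * np.pi * rng.random())         # cosine-style schedule in (0, 1]
masked_here = rng.random(T) < mask_ratio

inputs = tokens.copy()
inputs[masked_here, q] = MASK     # partially mask the level being trained
inputs[:, q + 1:] = MASK          # everything finer than level q is fully masked
# Coarser levels stay visible; the loss is computed only on the masked
# positions of level q, which the model must predict in parallel.
```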

When compared to AudioLM, SoundStorm is significantly faster, being two orders of magnitude quicker, and it also achieves better consistency over time when generating lengthy audio samples. By combining SoundStorm with a text-to-semantic token model similar to SPEAR-TTS, the text-to-speech synthesis can be scaled to handle longer contexts.

Additionally, it becomes possible to generate natural dialogues with multiple speaker turns, giving control over both the voices of the speakers and the content being generated. It’s worth noting that SoundStorm is not limited to speech synthesis alone; for instance, MusicLM uses SoundStorm to synthesize longer musical outputs efficiently.

Why Is This Important?

The challenge addressed is the slow inference time associated with generating long sequences of audio tokens using autoregressive decoding methods. Autoregressive decoding, although ensuring high acoustic quality, generates tokens one by one, resulting in computationally expensive inference, especially for longer sequences. SoundStorm proposes a new method that addresses this challenge by introducing an architecture adapted to audio tokens and a decoding scheme inspired by MaskGIT, allowing for parallel generation of tokens. By doing so, SoundStorm significantly reduces the inference time, making audio generation more efficient without compromising the quality or consistency of the generated audio.

Many generative audio models, including AudioLM, use autoregressive decoding, which generates tokens one by one. Although this method ensures high acoustic quality, it can be computationally slow, particularly when dealing with long sequences.
