Researchers from the Hong Kong University of Science and Technology and Moonshot AI have teased a new AI model called AudioX, which generates audio and music from multimodal inputs.
AudioX is described as a unified model offering flexible natural language control and seamless processing of inputs that include text, video, image, music, and audio. This differs from the usual domain-specific models, which typically focus on a single modality or a limited set of input conditions.

The research paper mentioned use cases like text-to-audio, text-and-video-to-audio, and video-to-audio with AudioX. Notably, the AI model also lets one refine existing audio through a text prompt, enhance unprocessed music, and generate music from scratch.
Netizens seem excited about the demo shared on the model’s GitHub repo, highlighting interesting use cases like generating audio for a tennis video:
AudioX: Anything-to-Audio Generation
Mindblowing, I couldn’t believe that tennis example it was just too good. pic.twitter.com/EA8clWlqmF
- AshutoshShrivastava (@ai_for_success), March 19, 2025
The researchers mentioned that they aim to address the scarcity of high-quality multi-modal data, which has been a major bottleneck in the development of versatile audio generation systems. To tackle this, they curated two comprehensive datasets: vggsound-caps, with 190K audio captions based on the VGGSound dataset, and V2M-caps, with 6 million music captions derived from the V2M dataset.
“Extensive experimental results show that AudioX not only excels in intra-modal tasks but also significantly improves inter-modal performance, highlighting its potential to advance the field of multi-modal audio generation,” the research paper stated.
Currently, the code for the model is not available. The researchers mentioned it would be made available on the GitHub page, without specifying a timeframe or licence details.
There are many text-to-music models and some text-to-speech models available, which have seen creative use cases in the AI space. It remains to be seen how AudioX opens up more possibilities.
The post "Researchers Unveil AudioX: AI Model That Converts Anything to Audio, Music" appeared first on Analytics India Magazine.