KDnuggets News, August 2: ChatGPT Code Interpreter: Fast Data Science • Can’t Keep Up? Catch up on This Week in AI

Features

  • ChatGPT Code Interpreter: Do Data Science in Minutes by Natassha Selvaraj
  • This Week in AI, July 31: AI Titans Pledge Responsible Innovation • The Beluga Invasion by KDnuggets
  • Introduction to Statistical Learning, Python Edition: Free Book by Bala Priya C
  • 8 Programming Languages For Data Science to Learn in 2023 by Abid Ali Awan

From Our Partners

  • Advance your Career with the 3rd Best Online Master’s in Data Science Program by Bay Path University
  • Unlock the Power of AI — A Special Release by KDnuggets and Machine Learning Mastery by Machine Learning Mastery
  • MOSTLY AI: The most accurate synthetic data generator by Mostly AI

This Week's Posts

  • Mastering GPUs: A Beginner's Guide to GPU-Accelerated DataFrames in Python by KDnuggets
  • Keras 3.0: Everything You Need To Know by Kanwal Mehreen
  • Introduction to Data Science: A Beginner's Guide by Nate Rosidi
  • 5 Mistakes I Made While Switching to Data Science Career by Abid Ali Awan
  • Introducing OpenLLM: Open Source Library for LLMs by Nisha Arya
  • An MLOps Mindset: Always Production-Ready by Abhishek Gupta
  • Multivariate Time-Series Prediction with BQML by JeongMin Kwon
  • LGBMClassifier: A Getting-Started Guide by Vidhi Chugh
  • Clustering Unleashed: Understanding K-Means Clustering by Aryan Garg
  • Mastering Regular Expressions with Python by Matthew Mayo
  • Pythia: A Suite of 16 LLMs for In-Depth Research by Bala Priya C
  • Transforming AI with LangChain: A Text Data Game Changer by Josep Ferrer

From Around The Web

  • Maximizing Productivity with ChatGPT by Machine Learning Mastery
  • An Overview of Feature Selection Techniques in scikit-learn by Data Science Horizons
  • Distributed Llama 2 on CPUs by Jonathan Apple
  • 50+ New Cutting-Edge AI Tools (August 2023) by MarkTechPost
  • Patterns for Building LLM-based Systems & Products by Eugene Yan

More On This Topic

  • ChatGPT Code Interpreter: Do Data Science in Minutes
  • KDnuggets News, August 17: How to Perform Motion Detection Using Python •…
  • KDnuggets News, August 31: The Complete Data Science Study Roadmap • 7…
  • KDnuggets News, November 9: 7 Tips To Produce Readable Data Science Code •…
  • KDnuggets News, August 24: Implementing DBSCAN in Python • How to Avoid…
  • KDnuggets News, August 3: 10 Most Used Tableau Functions • Is Domain…

DSC Webinar Series: OCI & HARC: Modernizing Workloads in the Oracle Cloud

The convergence of Oracle Cloud Infrastructure (OCI) and Hitachi Application Reliability Centers (HARC) to magnify outcomes for customers.

Tech giants Oracle and Hitachi Vantara are marching together to magnify cloud outcomes. Join us for the Oracle and Hitachi Vantara virtual event, where we discuss how businesses can get the most out of OCI and HARC. Oracle experts will discuss the challenges IT leaders face in this new normal: the pressure to step up innovations, move faster in meeting customer demands, and keep the business and its data secure.

Hitachi Vantara will show a path forward with HARC to successfully tackle hybrid and multicloud complexities while redefining cloud operations and application modernization. HARC brings together engineering expertise, repeatable frameworks for DevOps, site-reliability engineering (SRE), Data Reliability Engineering (DRE), and intellectual property around automation and AIOps to help customers build a Cloud Center of Excellence.

Emerging AI statistics and trends to watch


Artificial intelligence, or AI, has often been depicted as a terrifying force, from HAL 9000’s chilling declaration in “2001: A Space Odyssey” to the apocalyptic machine uprising in the Terminator movies. However, in reality, AI has become an integral part of our daily lives, with AI-powered Android devices in our pockets.

Though we may not have fully autonomous androids yet, the future of AI looks promising, especially when considering its impact on various industries, the economy, and the workforce, along with the increasing use of residential IP addresses to power AI systems.

Top AI statistics and trends in 2023

  1. The AI market size is projected to reach a staggering $407 billion by 2027, experiencing substantial growth from its estimated $86.9 billion revenue in 2022.
  2. AI is expected to contribute a significant 21% net increase to the United States GDP by 2030, showcasing its impact on economic growth.
  3. ChatGPT, an AI-powered language model, experienced remarkable adoption, garnering 1 million users within the first five days of its release.
  4. It is expected that 10% of vehicles will be self-driving by 2030, with the global market of self-driving cars forecasted to increase from 20.3 million in 2021 to 62.4 million.
  5. A significant 64% of businesses believe that artificial intelligence will help increase overall productivity, demonstrating growing confidence in AI’s potential to transform business operations.
  6. Voice search is on the rise, with 50% of U.S. mobile users using it daily, showcasing the growing prevalence of AI-powered voice assistants in everyday life.
  7. AI continues to revolutionize various industries, with an expected annual growth rate of 37.3% between 2023 and 2030, emphasizing its increasing impact in the coming years.
  8. As labor shortages become a pressing concern, 25% of companies are turning to AI adoption to address this issue, using AI to optimize operations and compensate for the lack of human resources.
  9. China leads in AI adoption, with 58% of companies deploying AI and 30% considering integration. In comparison, the United States has a lower adoption rate, with 25% of companies using AI and 43% exploring its potential applications.
  10. Concerns about AI-driven job loss persist, with a substantial 77% of people expressing apprehension that AI could lead to job displacement in the near future.
  11. AI could displace 400 million workers worldwide by 2030, but it is also projected to create around 97 million new jobs, potentially countering workforce displacement concerns.
  12. The manufacturing sector is expected to gain $3.8 trillion by 2035 due to AI adoption, indicating the significant financial impact AI can have on industries.

AI usage and consumer sentiment

  1. Half of U.S. mobile users use voice search daily, and more than 3 billion voice assistants were in use by the end of 2019, signaling the growing popularity of AI-powered voice interactions.
  2. Over 60% of business owners believe AI will improve customer relationships, while 43% of consumers believe companies will become more careful with customer data when using AI, highlighting the dual perspectives on AI’s impact on customer experiences and data privacy.
  3. Despite concerns about AI usage, 65% of consumers still trust businesses that employ AI technology, suggesting that responsible and transparent AI deployment can foster consumer confidence.
  4. Over half of respondents (54%) believe that AI can improve written content, showcasing the potential for AI-driven solutions like ChatGPT to enhance text quality and efficiency in content creation.

AI in self-driving vehicles

  1. 67% of Americans believe self-driving cars are safer than regular cars, and 87% prefer autonomous cars with a human driver ready to take control, illustrating the mixed public sentiment towards self-driving vehicles.
  2. 25 countries are currently working on designs for autonomous vehicles, and the self-driving car industry is predicted to reach an annual growth rate of 36% by 2023, with a global revenue nearing $173 billion.

The future of AI

  1. The AI industry could be worth more than $15 trillion by 2030, with China accounting for 26.1% of the global AI market share.
  2. There will be 8 billion voice assistants in use by 2023, and analysts predict the AI industry will generate revenues of $126 billion a year by 2025.
  3. AI is expected to create new jobs and automate laborious processes, making businesses and workers adapt to incorporate AI into their operations.

Artificial intelligence has already made a significant impact on various sectors, and its continued growth promises a transformative future. While concerns about job displacement and consumer sentiment persist, responsible and transparent AI deployment can shape a future where AI enhances productivity, improves customer experiences, and drives economic growth. Embracing AI’s potential while addressing its challenges will be crucial to shaping a bright future empowered by artificial intelligence.

Websites from which we got these statistics:

  • Marketsandmarkets
  • Statista
  • UpCity
  • GrandViewResearch
  • IBM
  • Forbes Advisor
  • Intelligence
  • McKinsey & Company
  • WeForum
  • Accenture

Google pulls its AI Test Kitchen app from Play Store and App Store

By Ivan Mehta

Google has pulled its AI Test Kitchen app from the Play Store and the App Store to focus solely on the web platform.

The company launched the AI Test Kitchen experience last year to let users interact with projects powered by different AI models such as LaMDA 2. The first set of experiments included the model breaking down a goal into different subsets and talking about dogs to check if the system sticks to the topic.

Google confirmed the move to 9to5Google, which first noted the apps being pulled, and said that AI Test Kitchen will focus on just the web experience as it is easy to push updates on just one platform.

Last November, Google announced “Season 2” of the AI Test Kitchen with new experiments, but they were never rolled out. Currently, the Test Kitchen hosts only a solitary text-to-music language model experiment called MusicLM, which was announced earlier this year at Google I/O.

This move is not very surprising given Google has a habit of shutting down apps and experiments without prior notice. Plus, given the rise of large language models (LLMs) and generative AI-focused tools like OpenAI’s ChatGPT and Anthropic’s Claude, the company might want to focus more on testing features for its consumer products.

In May, during Google I/O, the company announced a new portal called Google Labs, which allows users to sign up for generative AI-based experiments. Notably, this page also lists the aforementioned MusicLM experiment.

It’s on-brand for Google to make things confusing by having multiple products for AI experiments. Now, we have an AI Test Kitchen page with one experiment. A Google Labs page shows different projects like Search Labs, the company’s AI-powered note-taking project NotebookLM, AI-focused Workspace features, along with the MusicLM project.

Meta Plans to Integrate AI-Powered “Personas” Into Its Services

Meta is reportedly on the verge of integrating AI-powered “personas” into its services, including Facebook and Instagram. According to the Financial Times, this revolutionary development could launch as early as next month, substantially transforming user interaction with these platforms.

AI-Personas: Combining Personality with Utility

Reportedly, these AI-personas, also referred to as chatbots, will come with their unique personalities. They will serve users with distinct styles and vibes, ranging from a laid-back surfer offering travel advice to a version embodying the persona of Abraham Lincoln. Such a dynamic approach not only personalizes the user experience but also adds a novelty factor to routine user interactions.

The inspiration behind this innovation is Meta CEO Mark Zuckerberg's vision to embed “AI personas” into the company's products. He announced the establishment of a new product group dedicated to generative AI back in February. These personas would be designed to assist users in various ways, with a focus on developing experiences with text, images, and video, and on enhancing interaction in apps like WhatsApp, Messenger, and Instagram.

In fact, early indications of a “Chat with an AI” feature in the Instagram app have already been spotted. This feature will supposedly be capable of responding to questions and offering advice, all in the distinct styles of 30 different AI personalities. The chatbot could also facilitate users in composing messages.

Meta’s Competitive Edge and Data Acquisition

This anticipated launch could significantly boost Meta's competitive position. AI chatbots could increase engagement on platforms like Facebook and Instagram, helping Meta hold its ground against competitors like TikTok. Simultaneously, it could serve as a demonstration of Meta's AI prowess, placing it in direct competition with entities like Microsoft-backed OpenAI and Google's Bard.

These developments are built on Meta's proprietary LLaMA large language model, as mentioned by Zuckerberg in a recent earnings call. He also hinted at a multitude of ways AI can enhance user interaction and creativity in their apps. Detailed plans regarding Meta's AI initiatives are expected to be unveiled at its Connect developer event in September.

However, these AI-powered chatbots don't just offer increased user engagement; they might also serve as valuable data collectors. The Financial Times suggests that this development could provide Meta with more data about user interests, enhancing its ad targeting capabilities.

With these new AI-personas, Meta is blurring the lines between human and AI interaction, setting the stage for a new era of personalized digital experience.

Meta Open Sources Audiocraft: Generative AI Music Studio


Meta today announced that it is open sourcing Audiocraft, a new family of generative AI models built for generating high-quality, realistic audio & music from text.

We're publicly releasing AudioCraft for research purposes and to further understanding of the technology. Responsible innovation can’t happen in isolation. Opening up our research and resulting models helps ensure that everyone has equal access.
GitHub ⬇https://t.co/hu1004mxX4

— Meta AI (@MetaAI) August 2, 2023

AudioCraft introduces a unified code base that encompasses music, sound, compression, and generation functionalities, providing a comprehensive solution in one place. The system comprises three models: MusicGen, AudioGen, and EnCodec.

The new release of AudioCraft is an improvement over the previous MusicGen version. It includes a better EnCodec decoder that allows for higher-quality music generation with fewer glitches. Moreover, it now has pre-trained AudioGen models, enabling the system to create environmental sounds and sound effects like a dog barking, cars honking, or footsteps on a wooden floor.

This release is exciting as it simplifies building on top of the state of the art in audio generation. People can now build things like sound generators and compression algorithms with the same code base.

Meta in its blog stated that AudioCraft represents a significant advancement in generative AI research. They believe that the straightforward approach they developed for audio generation will have a profound influence on future technologies and how we interact with them.

Meta expressed its excitement over the creative potential of people using AudioCraft and looks forward to seeing what they will create with it.

Access the code here: https://bit.ly/3QnMya3


Conversational AI to Fuel Contact Center Market to 16% Growth

Image: A robot AI assistant with headphones (Adobe Stock by Tensor Spark)

The virtual assistant market is heading toward 24% growth next year thanks to advances in, and adoption of, cloud-based contact center services using conversational AI. Tech consultancy Gartner predicts that conversational AI will be an $18.6 billion market in 2023, an increase of 16.2% from 2022.

Gartner’s newly released data shows the global conversational AI/virtual assistant market is the fastest-growing segment of the contact center sector. Megan Marek Fernandez, director analyst at Gartner, said near-term investment growth rates for contact center (CC) and CC conversational AI and virtual assistants will actually take a breather because of business volatility, but the dip in growth, if there is one, is going to be brief.

“Longer-term, generative AI and growing maturity of conversational AI will accelerate contact center platform replacement as customer experience leaders look to simultaneously improve the efficiency of customer service operations and the overall customer experience,” she said.

Organizations are driving growth in virtual assistant tech

Grand View Research is also bullish on the global virtual assistant market because of an almost universal drive for more efficiency. The research firm reported that AI innovations will propel the market, which was already at $2.48 billion in 2022, to a compound annual growth rate of 24.3% through 2030. The firm said the application of AI to virtual assistants in computers and mobile devices will allow them to encroach on human-agent territory: they will offer product information, help with navigation and paying bills, and work within a human-in-the-loop context by directing queries to human agents in customer service.

SEE: Across the board, AI driving software investments for enterprise (TechRepublic)

Gartner, in its report, noted that while customer service interactions involving AI are growing, most of these are augmented contact center engagements rather than a completely autonomous virtual agent. Overall, Gartner estimated that around 3% of interactions will be handled via CC AI in 2023, growing to 14% of interactions in 2027.

The firm also predicted that by 2024, worldwide spending for contact center conversational AI and virtual assistants will reach nearly $23.2 billion.

SEE: IBM’s WatsonX foundation models for enterprise are in the wild (TechRepublic)

Economic uncertainties won’t capsize CC AI

As for this year, in spite of the likely effects of general economic and political turmoil on on-premise CC programs, Gartner’s report suggests organizations will continue to spend on customer engagement tech because, according to the consultancy, customer-facing projects are solidly in the revenue generation and retention column.

“While many IT investment areas will be weakened as budgets tighten, customer service and support initiatives that have the potential to differentiate the customer experience or streamline customer service operations could receive easier investment buy-in,” said Fernandez. “These factors will help contact center as a service projects receive funding associated with broader corporate digital transformation budgets.”

As part of the cloud migration trend, CCaaS will enjoy greater funding because cloud-based CC services have breadth to support communications channels, plus advanced dashboards, analytics, routing, workforce optimization, knowledge and insight, and conversational AI capabilities, according to Gartner.

SEE: AI, yes, but tech leaders are also bullish on 5G, metaverse, big data (TechRepublic)

Will AI skills inoculate employees from AI?

If AI is en route to replacing human contact center employees in many routine customer engagements, workers themselves see AI skills as the key to longevity and efficiency in their roles, according to a recent Salesforce study on the IT workforce. However, the poll also casts doubt on workers’ confidence in their own abilities to use AI tools in their roles.

Eighty-two percent of business leaders polled for the study said generative AI will lower overall business costs, and 80% said it will increase revenue. Employees are also sanguine about AI: 54% of the 4,000 full-time employees Salesforce queried said they see generative AI as a career slingshot, with a caveat: skills. Unfortunately, 62% said they don’t have enough of them to effectively and safely use AI, and their employers agreed: 70% of business leaders don’t believe their teams are sufficiently trained on generative AI.

In Salesforce’s survey — part of the company’s Generative AI Snapshot Research Series in partnership with polling firm YouGov in May, 2023 — 65% of respondents were upbeat about the potential of generative AI as a supporting technology, because they think it will free them up to focus on more strategic work, saving them five hours per week, on average.

But they acknowledged being deficient in key skills:

  • Forty percent said they don’t know how to effectively use generative AI at work.
  • Forty-three percent said they don’t know how to use generative AI with trusted data sources while keeping first-party data secure.
  • Fifty-three percent said they don’t know how to get the most value out of generative AI.

Employees expect their company to close the skills gap

Salesforce study respondents said they want to learn and are looking to their companies for direction, but said employers are falling short:

  • Sixty-seven percent said they expect their employer to provide opportunities to learn how to use generative AI, but …
  • 66% said their employer doesn’t offer training on the technology.


Working with Generative AI Just Got Faster!

Open source models (Falcon, Llama, Stable Diffusion, and GPT J) are not easy to work with. It gets even more complicated when you have to test all of them against your requirements and specific use cases, and it’s definitely an expensive affair.

But, not anymore.

“You can now test Llama 2 in less than 10 minutes,” said AI expert Santiago, introducing Monster API, a new tool that lets you effortlessly access powerful generative AI models such as Falcon, Llama, Stable Diffusion, and GPT J, without having to worry about managing the generative AI models or scaling them up to handle lots of requests.

Santiago said that he has been working with the Monster API platform for a while now, and seems to be impressed with the level of accessibility it provides to open source generative AI models. “They take care of the GPU infrastructure, containerisation, Kubernetes clusters, scalability, etc,” he added, noting that you only need to focus on your code integration.

Further, he said that it leverages a distributed GPU network, so that users can access these models at a fraction of the cost.

Here is the source code of this example: https://t.co/BTWnlaraWu.
You can also join @monsterapis’ Discord server for the latest updates, free credits, and special offers: https://t.co/A7axoM858K.
Thanks to the team @monsterapis for partnering with me on this post.

— Santiago (@svpino) July 31, 2023

Decentralised GPU

Founded by brothers Gaurav Vij and Saurabh Vij in June 2023, Monster API uses the idle computing power of millions of decentralised crypto mining rigs worldwide, optimises them for machine learning, and packages them with popular generative AI models.

In other words, it uses distributed computing to bring down the cost of training a foundational model. Mining bitcoin requires high levels of compute deployed on GPUs, and now that interest in crypto is in decline, many of these devices are gathering dust. Gaurav Vij, founder of QBlocks, said, “We eliminate the need to worry about GPU infrastructure, containerization, setting up a Kubernetes cluster, and managing scalable API deployments as well as offering the benefits of lower costs. One early customer has saved over $300,000 by shifting their ML workloads from AWS to Monster API’s distributed GPU infrastructure.”

His company provides a decentralised GPU network at up to 10x more affordable rates to data scientists, researchers, designers and developers. He said, “Most of the machine learning developers today rely on AWS, Google Cloud, Microsoft Azure to get resources and end up spending a lot of money.”

The artificial intelligence world’s demand for computing power has outstripped the hardware supply, says Saurabh, founder and CEO of Monster API. He further explained, “You can take a pre-trained foundational model; you can take datasets from free datasets like Hugging Face and quickly start fine-tuning these foundational models for your custom dataset. This can be done for under 30 to 40 dollars instead of hundreds of dollars, which you otherwise could spend on fine-tuning these models.” The company has cut fine-tuning costs by up to 90% through optimisation, with fees around $30 per model.

Their website also provides information for developers on no-code fine-tuning of large language models, building on top of Llama 2, Alpaca, Falcon 7B, Stable LM 3B, and more.

Other Similar Platforms

Monster API is not alone. Several tools, including the likes of Gooey.AI and Illusion AI, built on frameworks like PHP, Python, and Java, as well as H2O.ai, are now cropping up around these models, acting as the middle man between the models and the user. They take care of the more difficult processes, such as providing on-demand access to a pool of GPUs, reducing the cost of training and refining models, and providing APIs for natural language and computer vision applications. They also make fine-tuning accessible to a wider audience through no-code visual or graphical interfaces, making it possible for more people to take advantage of state-of-the-art models.

H2O.ai recently introduced Driverless AI, a tool that automates complex data science and machine learning tasks, including feature engineering, model validation, tuning, selection, and deployment. It does things like selecting the best features, fine-tuning the models, and creating a simple and fast way to use the models in real-world applications. AT&T has scaled its adoption of Driverless AI, with more than 380 employees using it across 80 business units and saving the company money.

These are only two examples of an entire ecosystem of no-code/low-code platforms for generative AI models. These platforms have considerable drawbacks: businesses are under pressure to deliver applications faster, and while these options are cost effective, their security and scalability are limited as of now. A number of open source projects and companies, such as Alteryx, KNIME, and Dataiku, are constantly improving on these issues.


Meta open sources framework for generating sounds and music

By Kyle Wiggers

The day’s fast approaching when generative AI won’t only write and create images in a convincingly human-like style, but compose music and sounds that pass for a professional’s work, too.

This morning, Meta announced Audiocraft, a framework to generate what it describes as “high-quality,” “realistic” audio and music from short text descriptions, or prompts. It’s not Meta’s first foray into audio generation — the tech giant open sourced an AI-powered music generator, MusicGen, in June — but Meta claims that it’s made advances that vastly improve the quality of AI-generated sounds, such as dogs barking, cars honking and footsteps on a wooden floor.

In a blog post shared with TechCrunch, Meta explains that the AudioCraft framework was designed to simplify the use of generative models for audio compared to prior work in the field (e.g. Riffusion, Dance Diffusion and OpenAI’s Jukebox). AudioCraft, the code for which is available in open source, provides a collection of sound and music generators plus compression algorithms that can be used to create and encode songs and audio without having to switch between different codebases.

AudioCraft contains three generative AI models: MusicGen, AudioGen and EnCodec.

MusicGen isn’t new. But Meta’s released the training code for it, enabling users to train the model on their own data set of music.

That could raise major ethical and legal issues, considering MusicGen “learns” from existing music to produce similar effects — a fact with which not all artists or generative AI users are comfortable.

Increasingly, homemade tracks that use generative AI to conjure familiar sounds that can be passed off as authentic, or at least close enough, have been going viral. Music labels have been quick to flag them to streaming partners, citing intellectual property concerns — and they’ve generally been victorious. But there’s still a lack of clarity on whether “deepfake” music violates the copyright of artists, labels and other rights holders.

Meta makes it clear that the pretrained, out-of-the-box version of MusicGen was trained with “Meta-owned and specifically licensed music,” specifically 20,000 hours of audio — 400,000 recordings along with text descriptions and metadata — from the company’s own Meta Music Initiative Sound Collection, Shutterstock’s music library and Pond5, a large stock media library. And Meta removed vocals from the training data to prevent the model from replicating artists’ voices. But while the MusicGen terms of use discourage using the model for “out-of-scope” use cases beyond research, Meta doesn’t expressly prohibit any commercial applications.

AudioGen, the other audio-generating model contained in AudioCraft, focuses on generating environmental sounds and sound effects as opposed to music and melodies.

AudioGen is a diffusion-based model, like most modern image generators (see OpenAI’s DALL-E 2, Google’s Imagen and Stable Diffusion). In diffusion, a model learns how to gradually subtract noise from starting data made entirely of noise — for example, audio or images — moving it closer step by step to the target prompt.

Given a text description of an acoustic scene, AudioGen can generate environmental sounds with “realistic recording conditions” and “complex scene content.” Or so Meta says — we weren’t given the chance to test AudioGen or listen to its samples ahead of the model’s release. According to a whitepaper published alongside AudioGen this morning, AudioGen can also generate speech from prompts in addition to music, reflecting the makeup of its diverse training data.

In the whitepaper, Meta acknowledges that AudioCraft could be misused to deepfake a person’s voice. And, given AudioCraft’s generative music capabilities, the model raises the same ethical questions as MusicGen. But, as with MusicGen, Meta isn’t placing much in the way of restrictions on how AudioCraft — and its training code — can be used, for better or worse.

The last of AudioCraft’s three models, EnCodec, is an improvement over a previous Meta model for generating music with fewer artifacts. Meta claims that it more efficiently models audio sequences, capturing different levels of information in training data audio waveforms to help craft novel audio.

“EnCodec is a lossy neural codec that was trained specifically to compress any kind of audio and reconstruct the original signal with high fidelity,” Meta explains in the blog post. “The different streams capture different levels of information of the audio waveform, allowing us to reconstruct the audio with high fidelity from all the streams.”

So what’s one to make of AudioCraft? Meta emphasizes the potential upsides, unsurprisingly, like providing inspiration for musicians and helping people iterate on their compositions “in new ways.” But as the advent of image and text generators has shown us, there are drawbacks — and probably lawsuits — lurking in the shadows.

Consequences be damned, Meta says that it plans to keep investigating better controllability and ways to improve the performance of generative audio models, as well as ways to mitigate the limitations and biases of such models. On the subject of biases, MusicGen, Meta notes, doesn’t perform well on descriptions in languages other than English and musical styles and cultures that aren’t Western — owing to very obvious biases in its training data.

“Rather than keeping the work as an impenetrable black box, being open about how we develop these models and ensuring that they’re easy for people to use — whether it’s researchers or the music community as a whole — helps people understand what these models can do, understand what they can’t do and be empowered to actually use them,” Meta writes in the blog post. “Through the development of more advanced controls, we hope that such models can become useful to both music amateurs and professionals.”

data2vec: A Milestone in Self-Supervised Learning

Machine learning models have heavily relied on labeled data for training, and traditionally speaking, training models on labeled data yields accurate results. However, the main downside of using labeled data is the high annotation costs that rise with an increase in the size of the training data. High annotation costs are a big hurdle for developers, especially when working on a large project with substantial amounts of training data.

To tackle the annotation issue, developers came up with the concept of SSL, or Self Supervised Learning. Self Supervised Learning is a machine learning process in which the model trains itself to learn one portion of the input from another part of the input. A Self Supervised Learning model aims to exploit the relationships within the data instead of using labeled data’s supervised signals.

In addition to Self Supervised Learning, there are several other methods & models to train machine learning models without the use of labeled data. However, most of these methods have two major issues:

  1. They are often specialized for a single modality like an image or a text.
  2. They require a high amount of computational power.

These limitations are a major reason why an average human mind is able to learn from a single type of data much more effectively than an AI model, which relies on separate models & training data to distinguish between an image, text, and speech.

To tackle the issue of single modality, Meta AI released data2vec, a first-of-its-kind, high-performance self supervised algorithm that learns pattern information from three different modalities: image, text, and speech. With the data2vec algorithm, text understanding could be applied to an image segmentation problem, or the algorithm could be deployed in a speech recognition task.

In this article, we will be talking about the data2vec model in depth. We will discuss the method overview, related work, architecture, and results of the model so that you have a clear understanding of the data2vec algorithm.

Data2vec Introduction: The Core Idea

Although the fundamental concept of Self Supervised Learning is applied across modalities, the actual objectives & algorithms differ from each other because they were designed with respect to a single modality. Designing a model for a single modality is the reason why the same self supervised learning algorithm cannot work effectively across different kinds of training data.

To overcome the challenge presented by single-modality models & algorithms, Meta AI released data2vec, an algorithm that uses the same learning methodology for computer vision, NLP, and speech.

The core idea behind the data2vec algorithm is to use a masked view of the input to predict latent representations of the full input data, in a self-distillation setup, with the help of a standard Transformer architecture. So, instead of predicting modality-specific targets such as images, text, or voice that are local in nature, the data2vec algorithm predicts latent representations that contain information from the complete training or input data.

Why Does the AI Industry Need the Data2Vec Algorithm?

Self Supervised Learning models build representations of the training data without relying on human annotated labels, and this is one of the major reasons behind the advancement of NLP, or Natural Language Processing, and Computer Vision technology. These self supervised representations are the reason why tasks like speech recognition & machine translation can deploy unsupervised learning in their models.

Until now, these self supervised learning algorithms have focused on individual modalities, which results in learning biases and modality-specific designs in the models. This individual-modality focus creates challenges in different AI applications, including computer vision & NLP.

For example, there is a vocabulary of speech units in speech processing that can define a self-supervised learning task, just as there is in NLP. Similarly, in computer vision, developers can either regress the input, learn discrete visual tokens, or learn representations invariant to data augmentation. Although these learning biases are handy, it’s difficult to confirm whether they will generalize to other modalities.

The data2vec algorithm is a major milestone in the self-supervised learning industry as it aims at improving multiple modalities rather than just one. Furthermore, the data2vec algorithm is not reliant on reconstructing the input or contrastive learning.

The reason the world needs data2vec is that the algorithm has the potential to accelerate progress in AI and to contribute to developing AI models that can learn about different aspects of their surroundings seamlessly. Scientists hope that the data2vec algorithm will allow them to develop more adaptable AI and ML models that are capable of performing highly advanced tasks beyond what today’s AI models can do.

What is the Data2Vec Algorithm?

Data2vec is a unified framework that aims at implementing self-supervised machine learning across different data modalities, including images, speech, and text.

The data2vec algorithm aims at developing ML models that can learn the general patterns in the environment much better by keeping the learning objective uniform across different modalities. The data2vec model unifies the learning algorithm, but it still learns the representations for each modality individually.

With the introduction of the data2vec algorithm, Meta AI hopes that it will make multimodal learning effective and much simpler.

How Does the Data2Vec Algorithm Work?

The data2vec algorithm combines the learning of latent target representations with masked prediction, and it uses multiple network layers as targets in order to generalize the latent representations. The model specifically trains an off-the-shelf Transformer network that is then used either in teacher or student mode.

In the teacher mode, the model first builds the representations of the input data that serves as targets in the learning task. In the student mode, the model encodes a masked version of the input data that is then used to make predictions on full data representations.

The above picture represents how the data2vec model uses the same learning process for different modalities. In the first step, the model produces representations of the input data (teacher mode). The model then regresses these representations on the basis of a masked version of the input.

Furthermore, as the data2vec algorithm uses latent representations of the input data, it can be viewed as a simplified version of the modality-specific designs like creating suitable targets by normalizing the input or learning a fixed set of visual tokens. But the crucial differentiating point between the data2vec & other algorithms is that the data2vec algorithm uses self-attention to make its target representation contextualized & continuous. On the other hand, other self-supervised learning models use a fixed set of targets that are based on a local context.

Data2vec: Model Method

The data2vec model is trained by predicting the model representations of the input data given a partial view of the input. As you can see in the given figure, the dog’s face is masked, a particular section of the voice note is masked, and the word “with” is masked in the text.

The model first encodes a masked version of the training sample (student mode), and then encodes the unmasked version of the input to construct training targets with the same model, but parameterized as an exponential moving average of the model weights (teacher mode). Furthermore, the target representations encode the information present in the training sample, and in student mode, the learning task is to predict these representations when given a partial view of the input.
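To make the teacher/student setup concrete, here is a minimal PyTorch-style sketch of one training step. It is an illustration only: the `return_all_layers` and `mask` interfaces are assumed placeholders, not the released data2vec code.

```python
import torch

def ema_update(teacher, student, tau):
    # Teacher weights track an exponential moving average of the student weights:
    # delta <- tau * delta + (1 - tau) * theta
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(tau).add_(s_p, alpha=1.0 - tau)

def train_step(student, teacher, x, mask, optimizer, tau=0.999, K=6):
    # Teacher mode: encode the *unmasked* input and build targets by
    # averaging the outputs of the top K Transformer blocks.
    with torch.no_grad():
        layer_outputs = teacher(x, return_all_layers=True)   # hypothetical interface
        targets = torch.stack(layer_outputs[-K:]).mean(dim=0)

    # Student mode: encode the *masked* input and predict the targets,
    # computing the loss only at the masked time-steps.
    predictions = student(x, mask=mask)                      # hypothetical interface
    loss = torch.nn.functional.smooth_l1_loss(predictions[mask], targets[mask])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student, tau)
    return loss.item()
```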

Model Architecture

The data2vec model uses a standard Transformer architecture with modality-specific encoding of the input data. For tasks related to computer vision, the model uses the ViT strategy to encode an image as a sequence of patches, where each patch spans 16×16 pixels and is fed through a linear transformation.

For speech recognition, the model encodes the data using a multi-layer 1-D convolutional neural network that maps 16 kHz waveforms into 50 Hz representations. To process text data, the model preprocesses the data to extract sub-word units and then embeds them in a distributional space via embedding vectors.
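A minimal PyTorch sketch of what such modality-specific input encodings could look like is shown below; the layer choices and shapes follow the description above but are illustrative assumptions, not the exact data2vec implementation.

```python
import torch
import torch.nn as nn

H = 768  # hidden dimension of the Base model

# Vision: a 224x224 image -> 14x14 = 196 patches of 16x16 pixels, each linearly projected.
patch_embed = nn.Conv2d(3, H, kernel_size=16, stride=16)
image = torch.randn(1, 3, 224, 224)
image_tokens = patch_embed(image).flatten(2).transpose(1, 2)   # (1, 196, H)

# Speech: a 16 kHz waveform -> roughly 50 Hz features via a stack of 1-D convolutions.
speech_encoder = nn.Sequential(
    nn.Conv1d(1, 512, kernel_size=10, stride=5), nn.GELU(),
    nn.Conv1d(512, 512, kernel_size=3, stride=2), nn.GELU(),
    nn.Conv1d(512, 512, kernel_size=3, stride=2), nn.GELU(),
    nn.Conv1d(512, 512, kernel_size=3, stride=2), nn.GELU(),
    nn.Conv1d(512, 512, kernel_size=3, stride=2), nn.GELU(),
    nn.Conv1d(512, 512, kernel_size=2, stride=2), nn.GELU(),
    nn.Conv1d(512, 512, kernel_size=2, stride=2), nn.GELU(),
)
waveform = torch.randn(1, 1, 16000)                            # one second of audio
speech_tokens = speech_encoder(waveform).transpose(1, 2)       # roughly (1, 49, 512)

# Text: sub-word token ids -> embedding vectors.
token_embed = nn.Embedding(50_000, H)
text_tokens = token_embed(torch.randint(0, 50_000, (1, 128)))  # (1, 128, H)
```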

Masking

Once the model embeds the input data as a sequence of tokens, it masks parts of these units by replacing them with an embedding token, and then feeds the sequence to the Transformer network. For computer vision, the model uses a block-wise masking strategy; latent speech representations are used to mask spans of speech data; and for language-related tasks, tokens are masked.

Training Targets

The data2vec model aims at predicting the model representations of the unmasked training sample based on an encoding of the masked sample that was originally fed to the model. The model predicts the representations only for masked time-steps.

The model predicts contextualized representations that not only encode the particular time-step but also encode other information from the sample, because it uses self-attention in the Transformer network. The contextualized representations & the use of the Transformer network are what distinguish the data2vec model from existing models like BERT, wav2vec, BEiT, SimMIM, MAE, and MaskFeat, which predict targets without contextual information.

Here is how the data2vec model parameterizes the teacher mode to predict the network representations that then serve as targets.

Teacher Parameterization

The data2vec model parameterizes the encoding of the unmasked training sample using an EMA, or Exponential Moving Average, of the model parameters (θ), where the weights of the model in teacher mode (∆) are updated as follows:

∆ ← τ∆ + (1 − τ)θ

Furthermore, the model uses a schedule for τ that linearly increases the parameter from τ0 to τe (the target value) over the first τn updates; after these updates, the model keeps the value constant until training is over. This EMA strategy updates the teacher more frequently at the beginning of training, when the model is random, and less frequently as training proceeds and good parameters have been learned.
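A hedged sketch of this schedule is shown below; the function name is illustrative, and the default values are the ones the article reports later for the speech Base model.

```python
def tau_schedule(step, tau_0=0.999, tau_e=0.9999, tau_n=30_000):
    """Linearly anneal the EMA decay from tau_0 to tau_e over the first
    tau_n updates, then hold it constant for the rest of training."""
    if step >= tau_n:
        return tau_e
    return tau_0 + (tau_e - tau_0) * step / tau_n
```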

The results show that the model is more efficient & accurate when it shares the parameters of the feature encoder & positional encoder between the student & the teacher mode.

Targets

The construction of the training targets depends on the output of the top K blocks of the teacher network for the time-steps that are masked in student mode. The output of block l at time-step t is denoted a_t^l. The model applies normalization to each block's output to obtain â_t^l, and then averages the top K blocks:

y_t = (1/K) · Σ_{l = L−K+1}^{L} â_t^l

to obtain the training target y_t for time-step t, for a network with L blocks in total.

This produces the training targets that the model regresses when it is in student mode. In initial experiments, averaging the blocks was found to perform as well as predicting each block separately with a dedicated projection, while being much more efficient.

Furthermore, normalizing the targets prevents the data2vec model from collapsing into constant representations for all time-steps, and prevents layers with high norms from dominating the target features. For speech recognition, the model uses instance normalization over the current input sample without any learned parameters, mainly because the stride over the input data is small and neighboring representations are highly correlated.

Additionally, the researchers found that when working with computer vision and NLP, parameter-less normalization does the job. The problem could also be solved with Variance-Invariance-Covariance regularization, but the strategy mentioned above performs well enough and does not require any additional parameters.
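Putting these pieces together, a rough sketch of target construction might look like this, using instance normalization as described for speech; the layer handling is simplified and illustrative rather than the reference implementation.

```python
import torch
import torch.nn.functional as F

def build_targets(teacher_layer_outputs, K=8):
    """teacher_layer_outputs: list of tensors (batch, time, dim), one per block.
    Instance-normalize each of the top K block outputs over the time axis,
    then average them to obtain the training targets y_t."""
    top_k = teacher_layer_outputs[-K:]
    normed = [F.instance_norm(a.transpose(1, 2)).transpose(1, 2) for a in top_k]
    return sum(normed) / K
```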

Objective

For contextualized training targets y_t, the model uses a Smooth L1 loss to regress the targets, as shown below.
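The referenced loss is the standard Smooth L1 (Huber-style) objective; a reconstruction in the notation above, with target y_t and prediction f_t(x), would be:

$$
\mathcal{L}(y_t, f_t(x)) =
\begin{cases}
\dfrac{(y_t - f_t(x))^2}{2\beta} & \text{if } |y_t - f_t(x)| \le \beta \\[6pt]
|y_t - f_t(x)| - \dfrac{\beta}{2} & \text{otherwise}
\end{cases}
$$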

Here, β controls the transition from a squared loss to an L1 loss, depending on the size of the gap between the target y_t and the model prediction f_t(x) at time-step t. The advantage of this loss is that it is comparatively less sensitive to outliers, at the cost of needing to tune the setting of β.

Experimental Setup

The data2vec model is evaluated at two model sizes: data2vec Large and data2vec Base. For numerical stability, the EMA updates are done in fp32, and the models contain L = 12 or L = 24 Transformer blocks with hidden dimension H = 768 or H = 1024. Let’s have a detailed look at the experimental setup for different modalities and purposes.
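For orientation, the two reported configurations can be summarized as a small, unofficial lookup table:

```python
# Reported data2vec model sizes (summarized from the text above).
DATA2VEC_CONFIGS = {
    "base":  {"transformer_blocks": 12, "hidden_dim": 768},
    "large": {"transformer_blocks": 24, "hidden_dim": 1024},
}
```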

Computer Vision

The data2vec model embeds images of 224×224 pixels as patches of 16×16 pixels. Each of these patches is transformed linearly, and a sequence with 196 representations is fed to the standard Transformer.

The model follows BEiT in masking blocks of adjacent patches, with each block containing a minimum of 16 patches with a random aspect ratio. However, instead of masking 40% of the patches as in the original BEiT model, the data2vec model masks 60% of the patches for better accuracy.

Furthermore, the model uses randomly resized image crops, horizontal flips, and color jittering. Finally, the data2vec model uses the same modified image in both teacher & student mode.

The ViT-B models are pre-trained for 800 epochs, and the data2vec model uses a batch size of 8,192 for the ViT-L model and 2,048 for the ViT-B model. The data2vec model also uses Adam and a cosine schedule with a single cycle to warm up the learning rate, for 80 epochs to 0.001 for ViT-L and for 40 epochs to 0.001 for ViT-B.

For both ViT-B and ViT-L, the data2vec model uses β = 2, K = 6, and τ = 0.9998 as a constant with no schedule. The model further uses a stochastic depth rate of 0.2.

Furthermore, for ViT-L, the model trains for 1,600 epochs, where the first 800 epochs use τ = 0.9998; the model then resets the learning rate schedule and continues for the final 800 epochs with τ = 0.9999.

For image classification, the model mean-pools the output of the last Transformer block and feeds it to a softmax-normalized classifier. The model then fine-tunes ViT-L for 50 epochs and ViT-B for 100 epochs, using Adam and a cosine schedule to warm up the learning rate.

Speech Processing

For speech processing, the data2vec model uses fairseq, a sequence-modeling toolkit used to train custom models for summarization, translation, and text generation. The model takes a 16 kHz waveform as input, which is processed by a feature encoder containing temporal convolutions with 512 channels, kernel widths (10, 3, 3, 3, 3, 2, 2), and strides (5, 2, 2, 2, 2, 2, 2).

This results in an encoder output frequency of 50 Hz, with a stride of 20 ms between samples. The receptive field comprises 400 input samples, or 25 ms of audio. The raw waveform fed to the encoder is normalized to zero mean and unit variance.
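These numbers follow directly from the convolution strides; a quick arithmetic check (assuming the stride product determines the downsampling factor):

```python
strides = (5, 2, 2, 2, 2, 2, 2)
hop = 1
for s in strides:
    hop *= s                          # total downsampling factor of the conv stack

sample_rate = 16_000                  # Hz
print(hop)                            # 320 input samples per output frame
print(hop / sample_rate * 1000)       # 20.0 ms stride between frames
print(sample_rate / hop)              # 50.0 Hz output frame rate
print(400 / sample_rate * 1000)       # 25.0 ms receptive field for 400 input samples
```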

The masking strategy used by data2vec for the Base model resembles Baevski et al.’s framework for self-supervised learning in speech recognition. The model samples p = 0.065 of all time-steps to be starting indices and masks the following ten time-steps. For a typical training sequence, this process masks almost 49% of the total time-steps.
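A tiny, purely illustrative simulation confirms that sampling start indices with p = 0.065 and masking the next ten time-steps covers roughly half of a typical sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
T, p, span = 5000, 0.065, 10          # sequence length, start probability, span length

starts = rng.random(T) < p            # sample starting indices for each time-step
mask = np.zeros(T, dtype=bool)
for i in np.flatnonzero(starts):
    mask[i:i + span] = True           # mask the following ten time-steps (spans may overlap)

print(mask.mean())                    # about 0.49, i.e. almost half of the time-steps
```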

During training, the data2vec model linearly anneals τ using τo = 0.999, τe = 0.9999, and τn = 30,000. The data2vec model uses the Adam optimizer with a peak learning rate of 5×10⁻⁴ for the Base model. Furthermore, the Base model uses a tri-stage scheduler that warms up the learning rate linearly for the first 3% of updates, holds it for the next 90%, and then decays it linearly for the remaining 7%.
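A hedged sketch of such a tri-stage schedule, with warm-up, hold, and linear decay phases taken from the description above (the function name and exact decay shape are assumptions):

```python
def tri_stage_lr(step, total_steps, peak_lr=5e-4,
                 warmup_frac=0.03, hold_frac=0.90):
    """Warm the learning rate up linearly for the first 3% of updates,
    hold it for the next 90%, then decay it linearly for the rest."""
    warmup_steps = int(total_steps * warmup_frac)
    hold_steps = int(total_steps * hold_frac)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    if step < warmup_steps + hold_steps:
        return peak_lr
    decay_steps = total_steps - warmup_steps - hold_steps
    progress = (step - warmup_steps - hold_steps) / max(1, decay_steps)
    return peak_lr * (1.0 - progress)
```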

Natural Language Processing

The data2vec model uses byte-pair encoding with 50K types to tokenize the input, and the model then learns an embedding for each type. After the data is encoded, the model applies the BERT masking strategy to 15% of uniformly selected tokens: 80% are replaced by learned mask tokens, 10% are replaced by random vocabulary tokens, and the remaining 10% are left unchanged.
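A minimal sketch of that 80/10/10 masking rule in PyTorch; the token tensor and `mask_id` are assumed integer ids, and this is an illustration rather than the actual fairseq implementation.

```python
import torch

def bert_style_mask(tokens, mask_id, vocab_size, mask_prob=0.15, generator=None):
    """Select 15% of tokens uniformly; of those, replace 80% with the mask
    token, 10% with random vocabulary tokens, and leave 10% unchanged."""
    tokens = tokens.clone()
    selected = torch.rand(tokens.shape, generator=generator) < mask_prob
    decision = torch.rand(tokens.shape, generator=generator)

    tokens[selected & (decision < 0.8)] = mask_id
    random_ids = torch.randint(0, vocab_size, tokens.shape, generator=generator)
    swap = selected & (decision >= 0.8) & (decision < 0.9)
    tokens[swap] = random_ids[swap]
    # The remaining 10% of selected tokens stay as-is.
    return tokens, selected
```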

During pre-training the model uses τo = 0.999, τe = 0.9999, τn = 100,000, K = 10, and β = 4. The model uses the Adam optimizer with a tri-stage learning rate schedule that warms up the learning rate linearly for the first 5% of updates, holds it for the next 80%, and then decays it linearly for the remaining 15%, with a peak learning rate of 2×10⁻⁴.

Furthermore, the model trains on 16 GPUs with a batch size of 256 sequences, each sequence containing about 512 tokens. For downstream tasks, the model is fine-tuned with four different learning rates (1×10⁻⁴, 2×10⁻⁴, 3×10⁻⁴, and 4×10⁻⁴), and the one that performs best is selected for further NLP downstream tasks.

Results

Let’s have a look at how the data2vec model performs when it implements the strategies discussed above for different modalities.

Computer Vision

To evaluate the results for computer vision, the data2vec model is pre-trained on the images obtained from the ImageNet-1K dataset. The resulting model is fine-tuned using the labeled data of the same benchmark. As per the standard practice, the model is then evaluated in terms of top-1 accuracy on validation data.

The results are then distinguished on the basis of whether they use a single self-supervised model, train a separate visual tokenizer on additional data, or rely on other self-supervised learning models.

The table below compares the performance of the data2vec model for computer vision with other existing models, for both ViT-L and ViT-B.

The results from the above table can be summarized as follows.

  • The data2vec model outperforms prior work with both the ViT-L, and ViT-B models in single model setting.
  • The masked prediction setup used in the data2vec algorithm to predict contextualized latent representations performs better when compared to methods that predict local targets such as engineered image features, input pixels, or visual tokens.
  • The data2vec model also outperforms self-distillation methods that regress the final layer of the student network while taking two different augmented versions of an image as inputs.

Audio & Speech Processing

For speech & audio processing, the data2vec model is trained on about 960 hours of audio data obtained from the Librispeech (LS-960) dataset. The dataset contains clean speech audio from audiobooks in English, and it is treated as a standard benchmark in the speech & audio processing industry.

To analyze the model’s performance in different resource settings, researchers have fine-tuned the data2vec model to use different amounts of labeled data (from a few minutes to several hours) for automatic speech recognition. To analyze the model’s performance, data2vec is compared against HuBERT & wav2vec 2.0, two of the most popular algorithms for speech & audio representation learning that rely on discrete speech units.

The above table compares the performance of data2vec in terms of word error rate for speech recognition with other existing models. LM represents the language model used for decoding. The results can be summarized as follows.

  • The data2vec model shows improvements for most labeled-data setups, with the largest gains in the 10-minute labeled data setting for Base models.
  • When it comes to large models, the model performs significantly better on small labeled datasets, and the performance is comparable on resource-rich datasets with 100 & 960 hours of labeled data; this is because performance generally saturates on resource-rich labeled datasets for most models.
  • After analyzing the performance, it can be deduced that when the model uses rich contextualized targets, it’s not essential to learn discrete units.
  • Learning contextualized targets during training helps in improving the overall performance significantly.

Furthermore, to validate data2vec’s approach for speech recognition, the model is also trained on the AudioSet benchmark. Although the pre-training setup for AudioSet is similar to Librispeech, the model is trained for K= 12, and for over 200K updates, where the size of each batch is 94.5 minutes.

The model then applies the DeepNorm framework and layer normalization to the targets to help stabilize training. Additionally, the model is fine-tuned on balanced subsets with a batch size of 21.3 minutes over 13k updates. The model also uses linear softmax pooling and mixup with a probability score of 0.7. The model then adds a single linear projection into 527 unique audio classes, and sets the projection learning rate to 2e-4.

Furthermore, the pre-trained parameters have a learning rate of 3e-5, and the model uses masking techniques for fine-tuning on the dataset. The table below summarizes the results, and it can be seen that the data2vec model is capable of outperforming a comparable setup with the same fine-tuning and pre-training data.

Natural Language Processing

To analyze data2vec’s performance on text, the model follows the same training setup as BERT, pre-training on the English Wikipedia dataset for over 1M updates with a batch size of 256 sequences. The model is evaluated on the GLUE, or General Language Understanding Evaluation, benchmark, which includes natural language inference tasks (MNLI, or Multi-Genre Natural Language Inference), sentence similarity (QQP, or the Quora Question Pairs benchmark; MRPC, or the Microsoft Research Paraphrase Corpus; and STS-B, or the Semantic Textual Similarity Benchmark), sentiment analysis (SST-2, or the Stanford Sentiment Treebank), and grammatical acceptability (CoLA).

Furthermore, to fine tune the data2vec model, the labeled data is provided by each task, and the average accuracy is reported on the development sets with 5 fine-tuning runs. The following table summarizes the performance of the data2vec model for Natural Language Processing tasks, and compares it with other models.

  • The above data shows that the data2vec model outperforms the baseline RoBERTa model, as the data2vec strategy does not use random targets.
  • The data2vec model is the first successful pre-trained NLP model that does not use discrete units like characters, words or sub-words as training targets. Instead, the data2vec framework predicts contextualized latent representation over the complete unmasked text sequence.
  • This enables a learning task in which the model is required to predict targets with specific properties of the current sequence, rather than representations that are generic to every text sequence in which a particular discrete unit occurs.
  • Furthermore, the set of training targets is not fixed; the model is free to define new targets, and it is open-vocabulary.

Data2Vec: Ablations Study

Ablation is a term used for the removal of a component in an AI or ML system. An ablation study is used to investigate the performance of an AI or ML model by removing certain key components, which allows researchers to understand the contribution of each component to the overall system.

Layer Averaged Targets

A major difference between data2vec and other self-supervised learning models is that the data2vec model uses targets that are based on averaging multiple layers from the teacher network. The idea comes from the observation that the top layers of the wav2vec 2.0 model do not perform as well for downstream tasks as the middle layers of the model.

In the following experiment, the performance for all three modalities is measured by averaging K = 1, 2, …, 12 layers, where K = 1 predicts with only the top layer. For faster turnaround time, data2vec trains Base models with 12 layers in total. For speech recognition, the model is pre-trained for over two hundred thousand updates on Librispeech and then fine-tuned on a 10-hour labeled split of Libri-light. For Natural Language Processing, the model reports the average GLUE score on the validation set, and for computer vision it pre-trains the model for 300 epochs and reports the top-1 accuracy obtained on the ImageNet dataset.

The above figure shows that targets based on multiple layers generally improve over using only the top layer (K = 1) for all modalities. Using all the available layers is good practice, as neural networks build different types of features over their many layers, which can then be extracted and used as targets.

Using features from multiple layers helps in boosting accuracy, and enriches the self-supervised learning process.

Target Feature Type

The Transformer blocks in the data2vec model have several layers that can all serve as targets. To analyze how different layers affect performance, speech models are pre-trained on Librispeech using different types of layer outputs as target features.

The figure below clearly indicates that the output of the feed-forward network, or FFN, works best, whereas the output of the self-attention blocks does not result in a usable model.

Target Contextualization

Teacher representations in the data2vec model use self-attention over the entire input to produce contextualized targets. This is what separates data2vec from other self-supervised learning models that construct a learning task by reconstructing or predicting local parts of the input. It naturally raises the question: does the data2vec model require contextualized targets to work well?

To answer the question, the researchers construct target representations that do not have access to the entire input but only to a predetermined fraction of it. The model restricts the self-attention mechanism of the teacher so that it can access only a portion of the surrounding input. After the model has been trained, it is fine-tuned with access to the full context size.

The figure below indicates that larger context sizes often lead to a better performance, and when the entire input sample is visible, it yields the best accuracy. It further proves that richer target representations can yield better performance.

Modality Specific Feature Extractors and Masking

The primary objective of data2vec is to design a simple learning mechanism that can work with different modalities. This is because, although the current models and frameworks have a unified learning regime, they still use modality-specific masking and feature extractors.

It makes sense that frameworks mostly work with a single modality, given that the nature of the input data varies vastly from one modality to another. For example, speech recognition models use a high-resolution input (such as a 16 kHz waveform) that usually has thousands of samples. The waveform is then processed by the framework using a multilayer convolutional neural network to obtain feature sequences at 50 Hz.

Structured and Contextualized Targets

The main differentiating point between data2vec and other masked prediction models is that in the data2vec model, the features of the training targets are contextualized. These features are built using self-attention over the entire unmasked input in teacher mode.

Some other frameworks like BYOL(Bootstrap Your Own Latent) or DINO also use latent representations like the data2vec, but their primary focus is to learn transformation invariant representations.

Final Thoughts

Recent work in the AI and ML industry has indicated that uniform model architectures can be an effective approach for tackling multiple modalities. The data2vec model uses a self-supervised learning approach for working with three modalities: speech, images, and language.

The key concept behind the data2vec model is to use a partial view of the input to regress contextualized representations of the full input data. The approach used by the data2vec framework is effective, as the model performs better than prior self-supervised learning models on the ImageNet-1K dataset for both ViT-B and ViT-L single models.

Data2vec is truly a milestone in the self-supervised learning industry, as it demonstrates that a single learning method for multiple modalities can indeed make it easier for models to learn across modalities.