With the release of its latest Granite 3.0 LLM, IBM continues its practice of transparency at a level previously unheard of in the AI industry.
Recently, the technology company released Granite 3.0, a new variant of its open-source model family. A few days after the launch, Armand Ruiz, VP of Product, AI Platform at IBM, publicly disclosed the datasets on which the model was trained. IBM has followed this practice with past model releases as well.
Ruiz said, “This is true transparency. No other LLM provider shares such detailed information about its training datasets.”
Earlier this year, Stanford’s Foundation Model Transparency Index report stated that IBM’s Granite models achieved a 100% score in tests that measure the transparency and openness of an AI model.
Granite 3.0 language models were trained on Blue Vela supercomputing infrastructure, which is powered entirely by renewable energy. This reinforces IBM’s commitment to sustainability in AI development.
“Our generative AI book of business now stands at more than $3 billion, up more than $1 billion quarter to quarter,” said Arvind Krishna, IBM’s CEO, in the Q3 2024 earnings call.
Big Tech AI, Big on Opacity
A few days ago, Stefano Maffulli, head of the Open Source Initiative (OSI), accused Meta of polluting the term ‘open source’ with its self-described open-source large language models.
This isn’t the first time. About a year ago, the Open Source Initiative had criticised Meta in response to Yann LeCun’s post on X:
“Congratulations but please watch your language: The license authorizes only some commercial uses. The term Open Source has a clear, well understood meaning that excludes putting any restrictions on commercial use. See `2. Additional Commercial Terms` https://t.co/mjZPlxrknL,” the Open Source Initiative (@OpenSourceOrg) posted on X on July 18, 2023.
Meta is doing a phenomenal job with its open-source models to make technology accessible for all. But as per OSI, giving users the ability to download and run the model locally isn’t the only factor contributing to an open-source experience.
On the other hand, companies like Mistral have labelled their large language models as ‘open weights’, in line with the Open Source Initiative’s guidance.
Giants like Apple, OpenAI, and Google have largely remained tight-lipped about the data used to train their large language models. At best, we’ve heard vague responses that say the model was trained on “publicly available data”.
When OpenAI’s former CTO Mira Murati was asked if Sora was trained on YouTube videos, her visibly flustered reaction raised eyebrows among several users in the AI community, especially creators. According to reports, Apple, Nvidia, and Anthropic also used YouTube videos and transcripts to train their AI models.
“OpenAI CTO Mira Murati confirms that the video generation AI model Sora is trained on publicly available data. Might be Youtube videos, Instagram Reels or any video content you might have put in public domain,” posted u/Visdom04 in r/ChatGPT.
“If they’re scraping the subtitles, they’re going to end up scraping from YouTube API, and from what I uploaded. So they’re actively stealing paid content,” said popular tech YouTuber Marques Brownlee in a podcast. Notably, Brownlee paid to have transcripts generated for his videos.
IBM’s approach is a refreshing one. But how can transparency be advantageous?
Ethical With a Business Motive
Publicly disclosing the datasets the model is trained on provides great packaging for IBM’s Granite models. Given the high bar set by OpenAI’s o1 and Anthropic’s Claude 3.5 Sonnet, competing by building a frontier foundation model may not be the best idea for IBM. The company would rather be recognised for playing the good guy in the world of AI.
Moreover, this can help businesses overcome the scepticism and the negative ‘black box’ connotation attached to LLMs.
IBM said, “Making transparent models isn’t just important from an ethical standpoint. It’s a sensible business decision. Many enterprises have been reluctant to roll out LLMs at scale, and the reason is the same as any other supply chain decision: Businesses want to know where their supply of goods or services comes from.”
IBM further said, “By being transparent by default, businesses can spend more time looking for solutions to their problems rather than worrying about the reliability of the models they’re using.”
Most of the datasets on which the Granite model is trained are open-source datasets. This could propel research, development, and efforts to enhance these open-source datasets.
The good Samaritans of the internet may find newfound motivation to contribute to these datasets, helping build solid training data that lets Granite take on the bigger models as far as possible.
This also reinforces the long-standing sentiment, and the goal, of making open source the winner on the internet.
AI Naysayers Will Have a Choice
In a long essay that outlines Vinod Khosla’s vision for a future dominated by AI, he wrote, “I hope in a world with less competition for resources, more humans will be driven by internal motivation and less by external pressure. Society and individuals will have the ability to choose which technology they personally want to leverage, and where they want to spend time.” Khosla’s perception indicates that people will make a strong choice on how they want to use AI, or whether they want to use it in the first place.
Sure, we may not have many transparent and open-source alternatives to leading products today. For example, DuckDuckGo has barely any users compared to Google, and even social platforms like Mastodon can’t compare with Instagram and X. This isn’t because people don’t want to use these platforms; they just aren’t as capable as their bigger and more popular counterparts.
The post IBM Commits to Transparency with Granite 3.0 LLM appeared first on Analytics India Magazine.