Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction


The advent of GPT models, along with other autoregressive (AR) large language models, has ushered in a new epoch in machine learning and artificial intelligence. GPT and autoregressive models often exhibit a general intelligence and versatility that are considered a significant step toward artificial general intelligence (AGI), despite issues like hallucinations. The key to the success of these large models is a self-supervised learning strategy that trains the model to predict the next token in a sequence, a simple yet effective approach. Recent works have demonstrated the success of these large autoregressive models, highlighting their generalizability and scalability. Scalability is exemplified by scaling laws that allow researchers to predict the performance of a large model from the performance of smaller ones, resulting in better allocation of resources. Generalizability, on the other hand, is often evidenced by zero-shot, one-shot, and few-shot learning, highlighting the ability of models trained without task-specific supervision to adapt to diverse and unseen tasks. Together, generalizability and scalability reveal the potential of autoregressive models to learn from a vast amount of unlabeled data.

Building on this, in this article we will discuss Visual AutoRegressive (VAR) modeling, a new generation paradigm that redefines autoregressive learning on images as coarse-to-fine “next-scale prediction” or “next-resolution prediction”. Although simple, the approach is effective: it allows autoregressive transformers to learn visual distributions better and generalize more readily. Furthermore, VAR enables GPT-style autoregressive models to surpass diffusion transformers in image generation for the first time. Experiments indicate that the VAR framework improves significantly on its autoregressive baselines and outperforms the Diffusion Transformer (DiT) framework along multiple dimensions, including data efficiency, image quality, scalability, and inference speed. Scaling up VAR models also demonstrates power-law scaling laws similar to those observed in large language models, and displays zero-shot generalization in downstream tasks including editing, in-painting, and out-painting.

This article covers the Visual AutoRegressive framework in depth: we explore its mechanism, methodology, and architecture, along with comparisons to state-of-the-art frameworks. We will also discuss how the VAR framework demonstrates two important properties of LLMs: scaling laws and zero-shot generalization. So let’s get started.

Visual AutoRegressive Modeling: Scaling Image Generation

A common pattern among recent large language models is the implementation of a self-supervised learning strategy, a simple yet effective approach that predicts the next token in the sequence. Thanks to this approach, autoregressive large language models have demonstrated remarkable scalability and generalizability, properties that reveal their potential to learn from a large pool of unlabeled data. Researchers in computer vision have been working in parallel to develop large autoregressive or world models with the aim of matching this scalability and generalizability, with models like DALL-E and VQGAN already demonstrating the potential of autoregressive models for image generation. These models typically implement a visual tokenizer that approximates continuous images as a grid of 2D tokens, which are then flattened into a 1D sequence for autoregressive learning, mirroring the sequential language modeling process.

However, the scaling laws of these models remain largely unexplored, and their performance often falls behind diffusion models by a significant margin, as demonstrated in the following image. This gap indicates that, compared with large language models, the capabilities of autoregressive models in computer vision are underexplored.

Traditional autoregressive models require a defined order over the data, and it is here that the Visual AutoRegressive (VAR) model distinguishes itself from existing AR methods: it reconsiders how to order an image. Humans typically create or perceive an image hierarchically, capturing the global structure first and the local details afterwards, a multi-scale, coarse-to-fine process that suggests a natural order for images. Drawing inspiration from multi-scale designs, the VAR framework defines autoregressive learning for images as next-scale prediction, as opposed to conventional next-token prediction. The approach begins by encoding an image into multi-scale token maps. The framework then starts the autoregressive process from the 1×1 token map and progressively expands in resolution: at every step, the transformer predicts the next higher-resolution token map conditioned on all previous ones, a methodology the framework refers to as VAR modeling.

The VAR framework leverages a GPT-2-style transformer architecture for visual autoregressive learning, and the results are evident on the ImageNet benchmark, where the VAR model improves significantly on its AR baseline, achieving an FID of 1.80 and an Inception Score of 356, along with a 20× improvement in inference speed. More interestingly, the VAR framework manages to surpass the DiT (Diffusion Transformer) framework in FID and IS scores, scalability, inference speed, and data efficiency. Furthermore, the Visual AutoRegressive model exhibits strong scaling laws similar to those witnessed in large language models.

To sum it up, the VAR framework makes the following contributions:

  1. It proposes a new visual generative framework that uses a multi-scale autoregressive approach with next-scale prediction, contrary to traditional next-token prediction, offering new insights into designing autoregressive algorithms for computer vision tasks.
  2. It empirically validates scaling laws and zero-shot generalization potential for autoregressive models, emulating the appealing properties of LLMs.
  3. It offers a breakthrough in the performance of visual autoregressive models, enabling GPT-style autoregressive frameworks to surpass strong diffusion models in image synthesis for the first time.

It is also worth discussing existing power-law scaling laws, which mathematically describe the relationship between model parameters, dataset size, computational resources, and the performance of machine learning models. First, these power-law scaling laws make it possible to predict a larger model’s performance by extrapolating from smaller models across model size, computational cost, and data size, avoiding unnecessary costs and providing principles for allocating the training budget. Second, scaling laws have demonstrated a consistent and non-saturating increase in performance. Following the principles of scaling laws in neural language models, several LLMs embody the principle that increasing model scale tends to yield enhanced performance. Zero-shot generalization, on the other hand, refers to the ability of a model, particularly an LLM, to perform tasks it has not been explicitly trained on. Within the computer vision domain, there is growing interest in building zero-shot and in-context learning abilities into foundation models.
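The extrapolation idea behind these power-law scaling laws can be sketched in a few lines. A power law is linear in log-log space, so fitting a line to (parameter count, loss) pairs from small models predicts a larger model's loss; the parameter counts and losses below are hypothetical, purely for illustration:

```python
import numpy as np

# Hypothetical (model parameters, test loss) pairs for a family of small models.
params = np.array([1e6, 1e7, 1e8, 1e9])
loss = np.array([4.2, 3.1, 2.3, 1.7])

# A power law L = a * N^b is linear in log-log space: log L = log a + b * log N.
b, log_a = np.polyfit(np.log(params), np.log(loss), 1)

# Extrapolate to predict the loss of a (hypothetical) 10B-parameter model.
predicted = np.exp(log_a) * (1e10) ** b
print(f"exponent b = {b:.3f}, predicted loss at 10B params = {predicted:.2f}")
```

This is exactly why scaling laws help budget allocation: the fit costs only a handful of cheap training runs, yet it estimates what an expensive run would achieve.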

Language models rely on tokenization algorithms such as WordPiece or Byte Pair Encoding (BPE) for text. Visual generation models based on language models likewise rely heavily on encoding 2D images into 1D token sequences. Early works like VQVAE demonstrated the ability to represent images as discrete tokens with moderate reconstruction quality. Its successor, VQGAN, incorporated perceptual and adversarial losses to improve image fidelity, and employed a decoder-only transformer to generate image tokens in a standard raster-scan autoregressive manner. Diffusion models, on the other hand, have long been considered frontrunners for visual synthesis given their diversity and superior generation quality. Advances in diffusion models have centered on improved sampling techniques, architectural enhancements, and faster sampling. Latent diffusion models apply diffusion in a latent space, improving training efficiency and inference. Diffusion Transformer models replace the traditional U-Net with a transformer-based architecture, and have been deployed in recent image and video synthesis models like Sora and Stable Diffusion.

Visual AutoRegressive: Methodology and Architecture

At its core, the VAR framework has two discrete training stages. In the first stage, a multi-scale quantized autoencoder (VQVAE) encodes an image into token maps, and a compound reconstruction loss is used for training. In the figure above, “embedding” refers to converting discrete tokens into continuous embedding vectors. In the second stage, the VAR transformer is trained by minimizing the cross-entropy loss (equivalently, maximizing the likelihood) using the next-scale prediction approach. The trained VQVAE produces the ground-truth token maps for the VAR transformer.

Autoregressive Modeling via Next-Token Prediction

For a given sequence of discrete tokens, where each token is an integer from a vocabulary of size V, the next-token autoregressive model posits that the probability of observing the current token depends only on its prefix. Assuming unidirectional token dependency allows the likelihood of a sequence to be factorized into a product of conditional probabilities. Training an autoregressive model amounts to optimizing this likelihood across a dataset; this optimization process is known as next-token prediction, and the trained model can then generate new sequences. Images, however, are inherently 2D continuous signals, so applying autoregressive modeling to images via next-token prediction has a few prerequisites. First, the image must be tokenized into discrete tokens, usually with a quantized autoencoder that converts the image feature map into discrete tokens. Second, a 1D order of tokens must be defined for unidirectional modeling.
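The unidirectional factorization can be sketched with a toy bigram model. The conditional probabilities below are made-up values standing in for what a trained transformer would output; only the chain-rule bookkeeping is the point:

```python
import math

# Toy vocabulary of size V = 3; a fake conditional table p(token | previous token)
# stands in for a learned model. None marks the start-of-sequence context.
bigram = {
    (None, 0): 0.5, (None, 1): 0.3, (None, 2): 0.2,
    (0, 0): 0.1, (0, 1): 0.6, (0, 2): 0.3,
    (1, 0): 0.4, (1, 1): 0.2, (1, 2): 0.4,
    (2, 0): 0.3, (2, 1): 0.3, (2, 2): 0.4,
}

def sequence_log_likelihood(tokens):
    """Unidirectional factorization: log p(x) = sum over t of log p(x_t | x_<t)."""
    log_p, prev = 0.0, None
    for t in tokens:
        log_p += math.log(bigram[(prev, t)])
        prev = t
    return log_p

# p([0, 1, 2]) = 0.5 * 0.6 * 0.4 = 0.12
print(sequence_log_likelihood([0, 1, 2]))
```

Training by next-token prediction maximizes exactly this quantity over the dataset, one conditional at a time.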

Image tokens are arranged in a 2D grid, and unlike natural-language sentences, which inherently have a left-to-right ordering, the order of image tokens must be defined explicitly for unidirectional autoregressive learning. Prior autoregressive approaches flattened the 2D grid of discrete tokens into a 1D sequence using methods like row-major raster scan, z-curve, or spiral order. Once flattened, the AR models extracted sequences from the dataset and trained an autoregressive model to maximize the likelihood, factorized into the product of T conditional probabilities, via next-token prediction.
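Row-major raster scan is the simplest of these orders: read the grid left to right, top to bottom, exactly as text is read.

```python
# Row-major raster scan: flatten a 2D token grid into a 1D sequence,
# as prior AR image models did before applying next-token prediction.
grid = [
    [0, 1, 2],
    [3, 4, 5],
]

raster = [tok for row in grid for tok in row]
print(raster)  # [0, 1, 2, 3, 4, 5]
```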

Visual-AutoRegressive Modeling via Next-Scale Prediction

The VAR framework reconceptualizes autoregressive modeling on images by shifting from next-token prediction to a next-scale prediction approach, under which the autoregressive unit is an entire token map rather than a single token. The model first quantizes the feature map into multi-scale token maps, each at a higher resolution than the previous one, culminating in a map that matches the resolution of the original feature map. The VAR framework develops a new multi-scale quantization autoencoder to encode an image into the multi-scale discrete token maps necessary for VAR learning. It employs the same architecture as VQGAN, but with a modified multi-scale quantization layer, with the algorithms demonstrated in the following image.
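The coarse-to-fine pyramid idea can be illustrated with a toy sketch. Note the hedge: VAR's actual tokenizer performs residual multi-scale vector quantization over codebook indices; plain average pooling below is a simplification that only shows the 1×1 → 2×2 → full-resolution progression of the autoregressive units:

```python
import numpy as np

def multiscale_maps(feature_map, scales=(1, 2, 4)):
    """Toy coarse-to-fine pyramid: pool a square feature map down to each scale.
    Simplification: VAR's real tokenizer uses residual multi-scale vector
    quantization, not average pooling, but the scale progression is the same."""
    h = feature_map.shape[0]
    maps = []
    for s in scales:
        step = h // s
        # Average each (step x step) block to produce an s x s map.
        pooled = feature_map.reshape(s, step, s, step).mean(axis=(1, 3))
        maps.append(pooled)
    return maps  # resolutions 1x1, 2x2, ..., up to the full map

fmap = np.arange(16, dtype=float).reshape(4, 4)
for m in multiscale_maps(fmap):
    print(m.shape)
```

During generation, the transformer predicts each map in this list conditioned on all coarser ones, so one step emits a whole token map instead of a single token.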

Visual AutoRegressive: Results and Experiments

The VAR framework uses the vanilla VQVAE architecture with a multi-scale quantization scheme and K extra convolutions, sharing a single codebook across all scales with a latent dimension of 32. Since the primary focus is the VAR algorithm itself, the model architecture is kept simple yet effective: a standard decoder-only transformer similar to those in GPT-2 models, with the only modification being the substitution of adaptive layer normalization (AdaLN) for traditional layer normalization. For class-conditional synthesis, the VAR framework uses the class embedding both as the start token and as the condition for the adaptive normalization layers.
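A minimal sketch of what adaptive layer normalization means here: a standard LayerNorm whose scale and shift are predicted from a conditioning vector (for VAR, the class embedding) rather than being fixed learned constants. The weight names and shapes below are illustrative, not VAR's actual parameterization:

```python
import numpy as np

def adaptive_layer_norm(x, cond, w_scale, w_shift, eps=1e-5):
    """AdaLN sketch: normalize x per feature, then modulate with a scale and
    shift predicted from the conditioning vector (e.g. a class embedding).
    Weight matrices here are hypothetical stand-ins for learned projections."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    normed = (x - mu) / np.sqrt(var + eps)
    gamma = cond @ w_scale   # condition-dependent scale, shape (d,)
    beta = cond @ w_shift    # condition-dependent shift, shape (d,)
    return normed * (1 + gamma) + beta

rng = np.random.default_rng(0)
d, c = 8, 4  # feature dim and conditioning dim (toy sizes)
out = adaptive_layer_norm(rng.normal(size=(2, d)), rng.normal(size=c),
                          rng.normal(size=(c, d)), rng.normal(size=(c, d)))
print(out.shape)
```

The point of the design is that class information reaches every transformer block through the normalization statistics, instead of only through the start token.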

State of the Art Image Generation Results

When compared against existing generative frameworks, including GANs (Generative Adversarial Networks), BERT-style masked-prediction models, diffusion models, and GPT-style autoregressive models, the Visual AutoRegressive framework shows promising results, summarized in the following table.

As can be observed, the Visual AutoRegressive framework not only achieves the best FID and IS scores, but also demonstrates remarkable image generation speed, comparable to state-of-the-art models. Furthermore, the VAR framework maintains satisfactory precision and recall scores, confirming its semantic consistency. But the real surprise is that VAR becomes the first autoregressive model to outperform a Diffusion Transformer model, as demonstrated in the following table.

Zero-Shot Task Generalization Results

For in-painting and out-painting tasks, the VAR framework teacher-forces the ground-truth tokens outside the mask and lets the model generate only the tokens within the mask, with no class-label information injected into the model. The results are demonstrated in the following image: the VAR model achieves acceptable results on these downstream tasks without tuning parameters or modifying the network architecture, demonstrating its generalizability.
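The teacher-forcing logic for in-painting can be sketched as follows. The mask layout and token values are illustrative; the only mechanism shown is that, at each position, the model's prediction survives only inside the masked region, while known ground-truth tokens are forced everywhere else:

```python
import random

def inpaint_step(predicted, ground_truth, mask):
    """Teacher-forced inpainting sketch: keep the model's predicted tokens
    only inside the mask; outside it, force the known ground-truth tokens."""
    return [p if m else g for p, g, m in zip(predicted, ground_truth, mask)]

ground_truth = [7, 7, 7, 7, 7]
mask = [False, False, True, True, False]   # True = region to regenerate
predicted = [random.randrange(10) for _ in ground_truth]  # stand-in model output
result = inpaint_step(predicted, ground_truth, mask)
print(result)
```

Because this is pure masking at the token level, no retraining or architecture change is needed, which is why the task counts as zero-shot.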

Final Thoughts

In this article, we have discussed Visual AutoRegressive modeling (VAR), a new visual generative framework that 1) theoretically addresses some issues inherent in standard image autoregressive (AR) models, and 2) makes language-model-style AR models surpass strong diffusion models for the first time in terms of image quality, diversity, data efficiency, and inference speed. Whereas traditional autoregressive models require a defined order over the data, the VAR model reconsiders how to order an image, and this is what distinguishes it from existing AR methods. Upon scaling VAR to 2 billion parameters, its developers observed a clear power-law relationship between test performance and model parameters or training compute, with Pearson coefficients nearing −0.998, indicating a robust framework for performance prediction. These scaling laws, and the possibility of zero-shot task generalization, hallmarks of LLMs, have now been initially verified in VAR transformer models.

Google Unveils RecurrentGemma, Moves Away From Transformer Based Models 

At Google Cloud Next ’24, Google unveiled RecurrentGemma-2B, a new open-weights language model from Google DeepMind based on the novel Griffin architecture.

This architecture achieves fast inference when generating long sequences by replacing global attention with a mixture of local attention and linear recurrences.

Google released a pre-trained model with 2B non-embedding parameters, and an instruction-tuned variant. Both models achieve comparable performance to Gemma-2B despite being trained on fewer tokens: RecurrentGemma-2B is pre-trained on 2T tokens, whereas Gemma-2B was pre-trained on 3T tokens.

Architectural changes enable significantly higher throughput for a RecurrentGemma-variant of the Gemma models.

— Jeff Dean (@🏡) (@JeffDean) April 9, 2024

One of RecurrentGemma’s key strengths lies in its reduced memory footprint. This feature is particularly valuable for generating longer samples on devices with constrained memory capacities, including single GPUs and CPUs.

BREAKING 🔥🤯
Google releases model with new Griffin architecture that outperforms transformers.
Across multiple sizes, Griffin outperforms the benchmark scores of transformer baselines in controlled tests, in both the MMLU score across different parameter sizes as well as the…

— Rohan Paul (@rohanpaul_ai) April 9, 2024

By optimising memory usage, RecurrentGemma empowers users to tackle more complex tasks without encountering memory bottlenecks. The efficiency gains of RecurrentGemma extend to its throughput capabilities. Thanks to its lower memory demands, this model excels at performing inference tasks with larger batch sizes.

This translates into a significant increase in token generation per second, especially when dealing with lengthy sequences. Such enhanced throughput is a boon for tasks requiring rapid and continuous data processing.

Google also released JAX code to evaluate and fine-tune RecurrentGemma, including a specialized Pallas kernel to perform linear recurrence on TPUs. Additionally, the company provided a reference PyTorch implementation.

The post Google Unveils RecurrentGemma, Moves Away From Transformer Based Models appeared first on Analytics India Magazine.

10 GitHub Repositories to Master Python


We all know about free courses on Python that are the best way to learn the language, but have you ever checked out the GitHub platform for learning resources and projects? Learning from courses is great, but hands-on experience with real-world projects and open-source repositories can take your Python skills to the next level.

In this blog, we will cover 10 essential GitHub repositories that will help you master Python and provide you with essential experience for your career. These repositories offer a wealth of knowledge, ranging from beginner-friendly tutorials to advanced coding challenges, and cover a wide range of topics, such as web development, data analysis, machine learning, and more.

1. Asabeneh/30-Days-Of-Python

Asabeneh/30-Days-Of-Python kickstarts your Python journey with a challenge that spans a month. Designed for beginners, this repository introduces you to Python basics and progressively dives into more complex topics such as statistics, data analysis, web development, and database management. By dedicating a few hours each day, you'll gain a solid foundation in Python, opening the door to a transition into any tech role.

2. trekhleb/learn-python

trekhleb/learn-python is a comprehensive resource that emphasizes learning Python through hacking: you can modify or add code and test it using assertions. It covers a wide range of Python features and best practices, making it suitable for learners at different levels, and its interactive approach improves the learning experience.

3. Avik-Jain/100-Days-Of-ML-Code

For those interested in diving into machine learning with Python, Avik-Jain/100-Days-Of-ML-Code provides a structured approach to grasp the fundamentals of machine learning. Over 100 days, it introduces key concepts and algorithms in ML, leveraging Python for practical implementations. This repository is perfect for programmers looking to transition into the machine learning engineering role.

4. realpython/python-guide

realpython/python-guide is The Hitchhiker's Guide to Python, a book freely available on GitHub. The guide covers best practices and the use of Python in various scenarios, offering guidance on topics ranging from setup and installation to advanced subjects like web development and machine learning. It is an invaluable resource for developers seeking to refine their Python skills.

5. zhiwehu/Python-programming-exercises

zhiwehu/Python-programming-exercises challenges you with a collection of 100+ Python exercises that range from easy to difficult. It is designed to test and improve your problem-solving skills in Python. This repository is excellent for learners who want to practice coding and prepare for the coding interview.

6. geekcomputers/Python

geekcomputers/Python is a repository filled with various Python scripts, showcasing the range of things you can build with Python. From simple scripts to complex projects, it offers a practical perspective on how Python can be used for automation, and its scripts serve as educational examples for beginners getting started with Python.

7. practical-tutorials/project-based-learning

The practical-tutorials/project-based-learning repository is a valuable resource that provides links to project-based tutorials for various programming languages, with a particular focus on Python.

Learning through a project-based approach is an effective way to apply Python concepts in real-world scenarios. Additionally, it can help you build your developer portfolio and gain experience to secure your first job.

8. avinashkranjan/Amazing-Python-Scripts

The avinashkranjan/Amazing-Python-Scripts repository is a compilation of various Python scripts that can help automate tasks, perform web scraping, and much more. This resource is particularly useful for students who want to work on small projects independently, as there are plenty of options to choose from. Additionally, these scripts can also be helpful in building more complex projects.

9. TheAlgorithms/Python

If you are interested in learning about algorithms, TheAlgorithms/Python is an excellent repository to check out. It features Python implementations of various algorithms and data structures, which provide a comprehensive understanding of algorithmic learning with Python. This repository is ideal for those who want to explore the fundamentals of computer science and competitive programming. However, note that these implementations are meant for learning purposes only and may not be as efficient as those in the Python standard library.

10. vinta/awesome-python

Lastly, vinta/awesome-python repository is a collection of remarkable Python frameworks, libraries, software, and resources. It is an excellent source for exploring Python tools and libraries that can aid you in your projects and learning journey. Whether you seek web frameworks, data analysis tools, or anything Python-related, you are likely to find it here.

Conclusion

These 10 GitHub repositories introduce you to the world of Python programming, covering basics to advanced topics, including interactive, project-based, and exercise-based learning. By exploring these repositories, you can build a strong foundation in Python, develop problem-solving skills, and work on practical projects that will help you gain experience. Remember, the journey of learning Python is continuous and ever-evolving; these repositories are just the beginning!

Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in technology management and a bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.


AI to Lead 2024 Elections in India


As many countries gear up for elections this year, a battle between AI-generated and manual-generated content campaigns is in the air. Currently, AI is used increasingly to help politicians reach voters through phone conversations or chatbots and further draft ads and messages about political opponents.

Shamaine Daniels, a Democratic House candidate in Pennsylvania, launched an AI volunteer called Ashley to speak to voters about the campaign and ask voters about important issues to be addressed. This innovation is considered the first political phone banker to leverage generative AI technology akin to OpenAI’s ChatGPT.

Another groundbreaking innovation is the VotivateAI tool, which was created through a collaboration between Votivate, LLC, and a Louisiana-based AI startup. It aims to revolutionise political campaigning by offering innovative solutions to campaign managers. The Campaign Assistant is one of the standout features that generates detailed campaign strategy memos based on provided race information, continuously updating them as variables change.

Additionally, VotivateAI provides an AI voice-calling tool that engages in natural conversations, potentially leveling the playing field for underdog candidates.

In an effort to embrace AI in Indian politics, Hari Balasubramaniam, an angel network investor, shared an AI-generated image of Prime Minister Narendra Modi created with Midjourney on LinkedIn. The post reflects a curiosity about reshaping our perception of political leaders. By envisioning and sharing such AI-generated representations, there’s an implicit aspiration to influence the public and empower individuals to reshape the narrative surrounding political leadership.

Source: LinkedIn

Experts suggest that another AI approach can further enhance political campaign efforts. In addition to accessing the party’s manifesto on the website, an AI chatbot can be integrated to respond swiftly to voters’ inquiries, supported by data offering direct campaign experiences.

On the other end of AI content integration

When Pakistan went to the polls early this year, Imran Khan, Pakistan’s former prime minister, campaigned from behind bars. His Pakistan Tehreek-e-Insaf (PTI) party released a video message with a voice clone of the opposition leader giving an emotional speech on his behalf.

“Imran Khan’s voice was cloned during Pakistan elections, with his face superimposed onto an existing video. Such manipulated content, which can sway public opinion and influence voter sentiments, is bound to find its way into India, too”, mentioned Shelly Walia, Executive Editor of The Quint, emphasising the threats of AI deepfakes entering Indian politics.

In another case, New York City Mayor Eric Adams was criticised for using AI to call city residents in languages he doesn’t speak, including Spanish, Yiddish, and Mandarin. This was described as an unethical and misleading way to reach potential voters.

In a recent update, a report from the Microsoft Threat Analysis Centre (MTAC) highlighted that China is attempting to use AI-generated content to influence elections in several countries, including India, the US, and South Korea, having already experimented with the same during Taiwan’s presidential polls.

AI is changing the practice of campaigning

Political campaigns can harness AI capabilities to tailor messaging to specific localities and individuals, thereby eliminating the one-size-fits-all narrative prevalent in traditional campaigning. This hyper-localisation and hyper-personalisation not only enhance engagement but also ensure that constituents receive information tailored to their needs and concerns.

Moreover, AI is redefining the landscape of translation and transcreation by considering context and tone holistically rather than merely translating individual words. This approach enhances communication across languages and cultures, fostering greater understanding and cooperation on a larger scale.

Overall, AI-generated content has the potential to transform political communication and engagement in positive ways.

The post AI to Lead 2024 Elections in India appeared first on Analytics India Magazine.

This 20-year-old AI Researcher Created the much-needed Indic LLM Leaderboard

This 20-year-old AI Researcher Created the much-needed Indic LLM Leaderboard

Ever since the Indian open source community got its hands on Meta’s Llama 2, there has been a surge of several Indic language AI models. But there was no way to compare the capabilities of these models against each other. Just a few weeks back, AIM pointed out that there is a dire need for creating an Indic LLM Leaderboard.

Adithya S Kolavi, the founder, CEO, and AI researcher at CognitiveLab saw this, and took up the task to build an Indic LLM Leaderboard himself.

Studying at PES University and juggling internships, Kolavi founded CognitiveLab around a year ago. “I was covering web development and cloud, but I wanted to focus on generative AI,” Kolavi told AIM. He saw that companies needed AI models fine-tuned for their own tasks. “That is when the idea of CognitiveLab started,” he added.

CognitiveLab, with a lean team of 10, provides fine-tuning as a service for companies globally. It built Cognitune, an enterprise-grade LLMOps platform which reduced the production time for deploying LLMs by around 60% compared to other platforms.

The team also released Ambari, the first bilingual Kannada model built on top of Llama 2.

“After I created Ambari, I asked myself how I even evaluate this model,” Kolavi narrated the story of the genesis of the idea of an Indic LLM leaderboard. “I can use it very well, but how do I compare it with other Indian models? There was no uniformity,” he added.

This led Kolavi to embark on a project to develop a full-stack comprehensive framework encompassing everything from the training process to evaluation, with the added features of seamless integration and accessibility.

A lot of work to be done

The Indic LLM Leaderboard offers support for 7 Indic languages, including Hindi, Kannada, Tamil, Telugu, Malayalam, Marathi, and Gujarati, providing a comprehensive assessment platform. Hosted on Hugging Face, it initially supports 4 Indic benchmarks, with plans for additional benchmarks in the future.

“Presently, I’m heavily focused on refining the training core aspect of this framework, which essentially comprises three main products,” said Kolavi. The first product is ‘Indic LLM,’ an open-source framework designed for fine-tuning models using Mistral and Llama.

The second component is the indic_eval, a tool that simplifies the evaluation process by providing ready-made benchmarks that can be effortlessly incorporated into the platform. The last one is the Indic LLM Leaderboard as an alpha release to encourage usage and gather feedback on the framework.

However, the benchmarks are still a work in progress. Currently, the leaderboard uses benchmarks derived from the ARC, HellaSwag, MMLU, and BoolQ datasets, using AI4Bharat’s IndicTrans model for translation. “This approach isn’t entirely satisfactory, as it is still a translation,” said Kolavi. “Now that everything is open source, researchers can focus on generating the benchmark datasets and not need to build anything from scratch.”

The Need for GPUs and Data

It takes just one NVIDIA A100 GPU to test the model on the Indic LLM Leaderboard, but Kolavi and his team were operating on just three GPUs for building the benchmarks. “The problem with Indic models is that they take more time than English models because the number of tokens is significantly higher,” said Kolavi. Just before the release of the evaluation metric, the company was running 10 GPUs for 10 different evals simultaneously.

“We wanted to see if other people can use it effectively, and there is no fault in the process,” Kolavi added. Since there is a lack of Indic language datasets, CognitiveLab utilised open-source datasets and data from the open web, and now they have been leveraging AI4Bharat’s dataset for training models.

CognitiveLab is part of several startup programs, such as those from AWS and Microsoft Azure, which give the company access to GPUs. However, much of the research that the company does internally is funded by its own resources.

AI Should Become Second-Nature Internet

Kolavi’s next goal is to build a trustable LLM benchmark for Indic models. For this, he has been referencing several Chinese papers that created similar benchmarks for the Chinese language. “The next step will be to build up a benchmark to be added to the leaderboard, to give it more accountability.”

Kolavi loves the idea of open source and admires what Hugging Face and AI4Bharat are doing. If India wants to focus on something, it should be making AI accessible to everyone. “Even the remotest villages in the country should be able to access it,” he added. “It should become like a second-nature Internet where people can openly use and experience the usefulness of AI.”

“I am looking forward to the day when people integrate AI into common apps such as WhatsApp and have a Kannada interface. The whole experience of Indian languages becomes seamless,” Kolavi concluded, adding that researchers should focus on delivering purpose through the models and solving real problems instead of building foundational models in India.

The post This 20-year-old AI Researcher Created the much-needed Indic LLM Leaderboard appeared first on Analytics India Magazine.

Can AI Resolve Bengaluru’s Water Crisis?

Over-concretisation, coupled with depleting groundwater levels, lack of rainfall, and scorching heat, has resulted in a severe water shortage in Bengaluru. The situation is so dire that residents of Parkwest Housing Society in central Bengaluru’s Hosakere Road were heard chanting ‘We want water’ at a protest.

The scarcity of water predominantly stems from the declining groundwater levels, resulting in numerous borewells across the city running dry. Out of approximately 16,000 public borewells in Bengaluru, nearly 7000 of them have already dried up.

To tackle the city’s current water problem, the Bangalore Water Supply and Sewerage Board (BWSSB) has started leveraging AI. The board has deployed Internet of Things (IoT) sensors to analyse flow patterns and transmit data to the cloud for evaluation.

By running an AI model on top of this data, the system will control motor operations to ensure efficient usage. If water levels decrease, automatic signals will trigger shutdowns, reducing the need for manual intervention, officials said.

While it’s commendable that BWSSB is turning its focus to AI to address Bengaluru’s water challenges, it’s important to note that AI alone cannot completely resolve the city’s water crisis. However, it can certainly aid in enhancing planning and increasing the efficiency of existing water systems.

It’s all about data

For AI to be helpful, it relies on a significant amount of data and BWSSB is already data-rich, according to Navaneethan Santhanam, chief scientist at SmarTerra.

He believes AI can help fix some of Bengaluru’s water-related problems. “It is not the only solution, but certainly a very important one. At its heart, AI relies on data—data about water consumption, supply volumes, pressures, and timings, surface and groundwater levels, etc. This data is usually available with utilities, particularly one as advanced as BWSSB,” Santhanam told AIM.

Utilities often collect billing data, meter data, customer information systems data, etc. Moreover, utilities track pressure and flow data across their networks, ensuring adequate supply and detecting leaks. They also possess comprehensive GIS maps of their networks, detailing pipe locations, materials, and maintenance records, including historical leakages.

AI models can also examine historical water consumption data, population growth rates, weather patterns, and other pertinent factors to predict future water demand across various zones within the city. Such analysis will aid BWSSB in strategically planning and distributing water resources for enhanced efficiency and allocation.

Moreover, in Bengaluru’s case, AI can be leveraged to predict groundwater depth and better prepare for a scenario that the city is facing today.

However, there is a need for better data collection as well as data cleansing and visualisation. “We need tools to help make sense of the data. In reality, there is too much ‘raw’ data for the utility to understand in its entirety. The data needs to be cleaned, formatted, and visualised in a way that does not overwhelm,” Santhanam said.

Addressing leakages

The situation could have been even more dire; however, the BWSSB supplies Cauvery water to approximately 70-80% of the city’s 13 million residents.

However, according to officials, the situation appears dire as the Krishnaraja Sagar Dam in Mandya district, the primary source of Cauvery water supply to Bengaluru, is experiencing inadequate water levels exacerbated by the summer.

Santhanam believes his startup’s AI-powered data analytics platform could come to Bengaluru’s aid. SmarTerra uses a mix of generative AI along with modern geospatial analysis, forecasting, and hydraulic modelling to help utilities pinpoint network failures such as leaks, failing pipes, and faulty meters.

“Our focus lies in aiding cities to mitigate water losses. Presently, in India, approximately 40% to 50% of water dissipates before reaching end-users.

“Take Bangalore, for instance, where water is pumped from over 100 kilometres away, treated, and distributed through networks to households and enterprises. However, for every litre dispatched, roughly one litre is lost due to pipe leakages, bursts, or customer-related issues,” he said.

He indicates that nearly half of the water intended for city residents fails to reach its final destination. AI can pinpoint areas where water consumption exceeds normal levels, potentially indicating the presence of leaks that require fixing. Addressing these leaks can conserve more water, resulting in increased availability and reduced wastage.

Addressing Data Silos

Even though municipalities and water bodies are data-rich, their data often remains siloed, hindering effective utilisation. “This is where our solution comes in. We directly ingest utility data into our systems, performing data cleaning and validation automatically using AI.

“We then apply various AI algorithms to detect leaks, failing metres, and pipes. Results are presented through intuitive visualisations, both spatially and temporally, aiding utilities in identifying and addressing issues efficiently.

“Additionally, AI-generated tasks prioritise areas requiring attention, optimising risk management efforts. AI streamlines data management, unifying disparate datasets and highlighting anomalous patterns for enhanced utility operations,” Santhanam pointed out.

While SmarTerra deploys a predictive AI model, the startup has recently started experimenting with Large Language Models (LLMs).

They are currently developing a natural language query engine utilising LLMs. “With this technology, individuals can easily inquire, for example, ‘What are the 10 most leaky areas in Bangalore?’ and the model will provide comprehensive results in the form of charts, maps, and tables.

“Our objective is to leverage LLMs to streamline access to analytics, making it accessible to anyone with an interest.”

The Bengaluru-based startup already operates in many Indian cities and works closely with utility companies such as L&T and Suez, as well as many state-owned water bodies in India and in foreign markets such as Singapore and the Philippines.

The company has previously worked with BWSSB and is looking to enhance its collaboration with the board.

The post Can AI Resolve Bengaluru’s Water Crisis? appeared first on Analytics India Magazine.

Adani Group Acquires 25 Acres in Pune for Data Centre Development


The Adani Group company Terravista Developers has acquired leasehold rights for a 25-acre land parcel in the Pimpri industrial zone of Pune’s Haveli locality from Finolex Industries for around Rs 471 crore. The company plans to develop a large data centre on this land.

Terravista Developers paid a stamp duty of over Rs 23.52 crore for the transaction registration on April 3, 2024.

The Maharashtra Industrial Development Corporation (MIDC) had originally leased the plots to Swastik Rubber, which further leased them to Finolex Group entities in 1982 for the balance lease period and the rights to renew the lease for an additional 95 years.

The Adani Group’s data centre business is led by AdaniConneX, a 50:50 joint venture between Adani Enterprises and US-based global hyperscale data centre provider EdgeConneX.

Formed in February 2021, the joint venture aims to develop and operate data centres across India, starting with Chennai, Navi Mumbai, Noida, Vizag, and Hyderabad markets.

AdaniConneX plans to develop 1 GW of data centre capacity over the next decade.

In recent years, AdaniConneX has made substantial investments totalling $1.5 billion and is currently in the process of securing an additional $400 million offshore loan.

The venture is establishing data centres in key locations like Visakhapatnam, New Delhi, Mumbai, and Chennai; details of two data centres in Andhra Pradesh, involving an aggregate investment of ₹21,844 crore, were released in May 2023.

The Adani Group aims to become one of the top three data centre operators by 2030, capitalising on India’s increasing digitisation and setting up data processing hubs overseas in the UAE, Singapore, Nepal, and Thailand.

The data centres being developed by Adani are designed to provide customised solutions with 5G connectivity, creating a comprehensive digital ecosystem.

India’s data centre market is witnessing significant growth, with investments totalling $10 billion since 2020 and a projected doubling of data centre stock to reach 20 million square feet by 2025, according to property consultant Colliers India.

The post Adani Group Acquires 25 Acres in Pune for Data Centre Development appeared first on Analytics India Magazine.

Who Needs OpenAI’s GPT-4?

Toronto-based Cohere, a leading provider of enterprise-grade AI solutions, recently introduced Command R+, a scalable LLM designed to excel at real-world business applications.

According to the latest Chatbot Arena results, Cohere’s Command R+ has climbed to the sixth spot, matching the GPT-4-0314 level on the strength of more than 13,000 human votes. “It’s undoubtedly the best open model on the leaderboard now.”


Building upon the strengths of its predecessor, Command R, the new model offers advanced retrieval augmented generation (RAG) capabilities, improved performance, and multilingual support.

RAG is an innovative approach that combines the strengths of retrieval-based and generative models. While the former involves accessing and extracting information from a large corpus of sources such as databases, articles, or websites, the latter excels in generating coherent and context-aware text. By combining both these components, RAG stands out in generating more informative and contextually relevant responses.
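The retrieve-then-generate loop described above can be sketched in a few lines. The keyword-overlap scoring and prompt format below are toy stand-ins for this article; production RAG systems use embedding similarity and an actual LLM call, and none of this reflects Cohere's implementation.

```c
// Minimal sketch of the retrieve-then-generate idea behind RAG:
// score each candidate document by word overlap with the query,
// retrieve the best match, and prepend it to the prompt the
// generative model would receive.
#include <assert.h>
#include <stdio.h>
#include <string.h>

// Count how many whitespace-separated query words occur in doc.
static int overlap_score(const char *query, const char *doc) {
    char buf[256];
    strncpy(buf, query, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    int score = 0;
    for (char *tok = strtok(buf, " "); tok; tok = strtok(NULL, " "))
        if (strstr(doc, tok)) score++;
    return score;
}

// Retrieval step: pick the highest-scoring of n candidate documents.
static int retrieve_best(const char *query, const char **docs, int n) {
    int best = 0, best_score = -1;
    for (int i = 0; i < n; i++) {
        int s = overlap_score(query, docs[i]);
        if (s > best_score) { best_score = s; best = i; }
    }
    return best;
}

// Augmentation step: build the context-grounded prompt for generation.
static void build_prompt(char *out, size_t cap,
                         const char *context, const char *question) {
    snprintf(out, cap, "Context: %s\nQuestion: %s", context, question);
}
```

Grounding the answer in retrieved context is what lets a RAG system cite its sources and stay factual.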

Command R+, which features a 128k-token context window, is optimised for advanced RAG to provide business-ready solutions. The new model improves response accuracy and provides in-line source citations that mitigate hallucinations, empowering enterprises to scale with AI across business functions like finance, HR, sales, marketing, and customer support in different sectors.

It also offers support for 10 key languages of global business: English, French, Spanish, Italian, German, Portuguese, Japanese, Korean, Arabic, and Chinese.

Tool Use Capabilities

Command R+ comes with tool use capabilities that are accessible through the Cohere and LangChain APIs. This can help automate complex business workflows, such as updating CRM tasks, activities, and records.

Multi-step tool use, a new feature in Command R+, enables the model to combine multiple tools over multiple steps to accomplish complex tasks. Command R+ also possesses the ability to self-correct when it tries to use a tool and fails, such as when encountering a bug or malfunction in a tool. This empowers the model to make repeated attempts to complete the task and enhances the likelihood of success.
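The self-correction behaviour described above amounts to a retry loop around a tool call: attempt, inspect the result, adjust, and try again. The toy tool and the sign-flip "correction" below are invented for illustration and are not Cohere's API.

```c
// Sketch of retry-on-failure in multi-step tool use: the agent calls a
// tool, and on failure "self-corrects" the argument before retrying,
// up to a maximum number of attempts.
#include <assert.h>

// Toy tool: succeeds only when given a non-negative record id.
static int crm_update(int record_id, int *ok) {
    if (record_id < 0) { *ok = 0; return -1; } // simulated tool failure
    *ok = 1;
    return record_id; // pretend this is the updated record
}

// Agent loop: on failure, apply a correction (here: flip the sign)
// and retry, returning -1 only if every attempt fails.
static int call_with_retries(int record_id, int max_tries, int *tries_used) {
    int ok = 0, result = -1;
    for (int t = 1; t <= max_tries; t++) {
        result = crm_update(record_id, &ok);
        *tries_used = t;
        if (ok) return result;
        record_id = -record_id; // toy correction step
    }
    return -1;
}
```

In a real agent, the "correction" would come from the model re-reading the tool's error message, but the control flow is the same.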


When evaluated on conversational tool-use and single-turn function-calling capabilities, Command R+ outperforms OpenAI’s GPT-4 Turbo, as well as Anthropic’s Claude 3 Sonnet, and Mistral Large in key enterprise AI benchmarks such as Microsoft’s ToolTalk (Hard) and Berkeley’s Function Calling Leaderboard.

Rising Competition in the Enterprise LLM Space

We’re living in the era of GenAI, and LLMs are the dynamite behind the generative AI boom. All big tech companies are either incubating their own AI foundation models or on the lookout to back the popular ones.

Various open-source models like DBRX, Llama 2, Falcon 180B, and DeciLM 6B level the playing field by allowing any enterprise access to powerful GenAI tools that they can build applications upon or modify for their specific uses.

However, even then, those deploying open-source AI models still need cloud servers for storing data and running inference. Any enterprise signing up for Microsoft Azure or AWS would rather use one of the AI models suggested by the cloud provider than try to build or run its own. That is why cloud hyperscalers are the key players in the generative AI race.

This also explains the level of escalation between the three cloud computing leaders, Amazon, Microsoft, and Google, to collaborate with, benefit from, and provide their consumers with the latest cutting-edge generative AI technologies.


Microsoft CEO Satya Nadella announced on LinkedIn that Azure would be the first cloud to offer Cohere’s latest Command R+ LLM, which is also available on Amazon SageMaker. The new model will soon be accessible on Oracle Cloud Infrastructure (OCI) and other cloud platforms. However, there is no collaboration announcement with Google Cloud as of yet.

OpenAI is still dominating, but others are fast catching up


Although OpenAI is still dominating the LLM space, various other alternatives like those by Google, Llama, Anthropic, Mistral AI, and Cohere are booming and increasingly being adopted by cloud providers and enterprises alike.

With advanced capabilities and competitive pricing, Cohere has the potential to emerge as a leader in the enterprise AI market. At $3 per one million input tokens and $15 per one million output tokens, Command R+’s pricing is on par with Claude 3 Sonnet, whereas the latest OpenAI GPT-4 Turbo model costs $10 for one million input tokens and $30 for one million output tokens.

With new competitors entering the steady march of AI innovation, it’s time for OpenAI to move quickly and release GPT-5 if it wants to keep its lead in the AI field.

The post Who Needs OpenAI’s GPT-4? appeared first on Analytics India Magazine.

Andrej Karpathy Trains GPT-2 in Pure C Without PyTorch

Former OpenAI researcher Andrej Karpathy has introduced llm.c, a project aimed at training LLMs in pure C without the hefty dependencies of PyTorch and cPython.

Have you ever wanted to train LLMs in pure C without 245MB of PyTorch and 107MB of cPython? No? Well now you can! With llm.c:https://t.co/w2wkY0Ho5m
To start, implements GPT-2 training on CPU/fp32 in only ~1,000 lines of clean code. It compiles and runs instantly, and exactly…

— Andrej Karpathy (@karpathy) April 8, 2024

The llm.c project, available on GitHub, offers a simple approach to implementing GPT-2 training on CPU/fp32 in just around 1,000 lines of code.

“I chose GPT-2 as the first working example because it is the grand-daddy of LLMs, the first time the modern stack was put together,” wrote Karpathy in his GitHub repository.

One of the key advantages of llm.c is its instant compilation and execution, matching the performance of the PyTorch reference implementation. By allocating memory in a single block at the beginning of training, llm.c maintains a constant memory footprint, enhancing efficiency during data streaming and batch processing.
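The constant-footprint approach can be illustrated with a single allocation carved into per-tensor views. The toy two-layer MLP shapes below are assumptions for the sketch, not GPT-2's actual parameter layout or llm.c's code.

```c
// Sketch of single-block allocation: compute the total parameter count
// up front, allocate once, then carve per-tensor pointers out of that
// contiguous buffer. The memory footprint stays constant for the run.
#include <assert.h>
#include <stdlib.h>

typedef struct {
    float *w1, *b1, *w2, *b2; // views into one contiguous buffer
    float *block;             // the single allocation
    size_t total;             // total number of floats
} Params;

// Allocate every tensor of a toy in->hidden->out MLP in one calloc.
static int params_alloc(Params *p, size_t in, size_t hidden, size_t out) {
    size_t sizes[4] = { in * hidden, hidden, hidden * out, out };
    p->total = sizes[0] + sizes[1] + sizes[2] + sizes[3];
    p->block = calloc(p->total, sizeof(float));
    if (!p->block) return -1;
    float *cur = p->block;
    p->w1 = cur; cur += sizes[0]; // first weight matrix
    p->b1 = cur; cur += sizes[1]; // first bias vector
    p->w2 = cur; cur += sizes[2]; // second weight matrix
    p->b2 = cur;                  // second bias vector
    return 0;
}

// One free releases everything at once.
static void params_free(Params *p) { free(p->block); }
```

A single block also makes checkpointing trivial: the whole parameter state is one contiguous range of memory.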

The core of llm.c lies in manually implementing forward and backward passes for individual layers like layernorm, encoder, matmul, self-attention, gelu, residual, softmax, and cross-entropy loss. This meticulous process ensures accurate pointer arrangements and tensor offsets, crucial for seamless model operation.

“I am curious to learn more about Rust and totally understand the appeal. But I still find C so nice, simple, clean, portable and beautiful, aesthetically. It’s as close as you want to get to direct communion with the machine,” wrote Karpathy.

Karpathy’s next endeavour involves porting llm.c to CUDA layer by layer, aiming for performance comparable to PyTorch but without the heavyweight dependencies. This transition to CUDA opens doors for lowering precision from fp32 to fp16 and below, and for supporting modern architectures like Llama 2, Mistral, Gemma, and more.

The post Andrej Karpathy Trains GPT-2 in Pure C Without PyTorch appeared first on Analytics India Magazine.

OpenAI Launches GPT-4 Turbo with Vision in API

OpenAI has unveiled the latest addition to its AI arsenal with the release of GPT-4 Turbo with Vision, now available in the API. This new version comes with enhanced capabilities, including support for JSON mode and function calling for Vision requests. The upgraded GPT-4 Turbo model promises improved performance and is set to roll out in ChatGPT as well.

Majorly improved GPT-4 Turbo model available now in the API and rolling out in ChatGPT. https://t.co/HMihypFusV

— OpenAI (@OpenAI) April 9, 2024

GPT-4 Turbo is a robust multimodal model capable of processing both text and image inputs, delivering accurate outputs thanks to its extensive general knowledge and advanced reasoning abilities.

OpenAI introduced GPT-4 Turbo last November during DevDay, showcasing its enhanced capabilities and expanded knowledge base up to April 2023. With a 128k context window, this model can accommodate over 300 pages of text in a single prompt, providing users with a comprehensive understanding of diverse topics.

One of the notable highlights of GPT-4 Turbo is its optimised performance, leading to a substantial reduction in costs for users. Input tokens now cost a third of what they did with the previous GPT-4 model, while output tokens cost half as much, making this upgrade both efficient and cost-effective for customers.

OpenAI recently announced Voice Engine, built to generate natural-sounding speech from text input and a mere 15-second audio sample. Notably, Voice Engine can create emotive and realistic voices using this brief audio input. However, it hasn’t been made available to the public yet.

OpenAI has indicated that its next model, GPT-5, is coming soon with better reasoning skills. Brad Lightcap, OpenAI’s chief operating officer, mentioned in an interview with the Financial Times that GPT-5 will focus on tackling tough problems, especially in the area of reasoning.

The post OpenAI Launches GPT-4 Turbo with Vision in API appeared first on Analytics India Magazine.