Get Ready for an AI System That Can Smell the Future

The compounds behind someone’s terrible odour are not known to science. But it turns out that it’s not just them. Mapping any compounds to the scents they make is harder than it looks. A team in the US has set out to solve the problem, creating a ‘principal odour map’ to decode how our brains perceive scents. This research not only bridges the gap in humans’ understanding of the sense of smell but can also change the way we recognise olfactory perceptions.

In vision, we understand the mappings between physical properties and colour perception. In audition, we can identify the relationship between physical vibrations and pitch perception. However, the connection between chemical structures and olfactory percepts has remained a mystery.

This long-standing problem has now been tackled with machine learning. Brian K Lee and his team developed a neural network-based model capable of mapping chemical structures to odour perceptions, creating the “Principal Odor Map” (POM). To train the model, they used a dataset of 5,000 known odorants, each annotated with odour labels.
To validate their findings, they conducted a prospective challenge, demonstrating that the model’s predictions closely mirrored human ratings for novel odorants. The POM preserved the intricate perceptual relationships between odours, outperforming traditional structure-based maps and pointing the way forward for olfactory science.

The team also compared a graph neural network (GNN) model to a traditional count-based fingerprint model for predicting odour preferences. The GNN model emerged as the frontrunner, either matching or surpassing human panellists’ ratings for 55% of the odour labels.

Along the way, the team identified impurities from chemical reactions as potential contributors to odour perception, with a significant 31.5% rate of contamination in the stimulus set. The neural network performed well for labels with clear structural determinants and training examples, while human panellists’ performance varied with their familiarity with the labels. Compared with human testers, the model did better at describing the smell for 53% of the molecules tested.

The post Get Ready for an AI System That Can Smell the Future appeared first on Analytics India Magazine.

Introduction to Databases in Data Science

Data science involves extracting value and insights from large volumes of data to drive business decisions. It also involves building predictive models using historical data. Databases facilitate effective storage, management, retrieval, and analysis of such large volumes of data.

So, as a data scientist, you should understand the fundamentals of databases, because they enable the storage and management of large and complex datasets and allow for efficient data exploration, modeling, and deriving insights. Let’s explore this in greater detail in this article.

We’ll start by discussing the essential database skills for data science, including SQL for data retrieval, database design, optimization, and much more. We’ll then go over the main database types, their advantages, and use cases.

Essential Database Skills for Data Science

Database skills are essential for data scientists, as they provide the foundation for effective data management, analysis, and interpretation.

Here's a breakdown of the key database skills that data scientists should understand:

Though we’ve tried to categorize the database concepts and skills into different buckets, they go together, and you’ll often need to learn them along the way when working on projects.

Now let's go over each of these skills.

1. Database Types and Concepts

As a data scientist, you should have a good understanding of different types of databases, such as relational and NoSQL databases, and their respective use cases.

2. SQL (Structured Query Language) for Data Retrieval

Proficiency in SQL achieved through practice is a must for any role in the data space. You should be able to write and optimize SQL queries to retrieve, filter, aggregate, and join data from databases.

It’s also helpful to understand query execution plans and be able to identify and resolve performance bottlenecks.
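To make the retrieve, filter, aggregate, and join steps concrete, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The "customers" and "orders" tables and their contents are made up for illustration; they are not from any real dataset.

```python
import sqlite3

# In-memory SQLite database; table names, columns, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Asha', 'Pune'), (2, 'Ben', 'Leeds'), (3, 'Chen', 'Pune');
    INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 25.0), (3, 2, 40.0), (4, 3, 10.0);
""")

query = """
    SELECT c.name, COUNT(o.order_id) AS n_orders, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id   -- join the two tables
    WHERE c.city = 'Pune'                                -- filter rows
    GROUP BY c.customer_id, c.name                       -- aggregate per customer
    ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
for name, n_orders, total in rows:
    print(name, n_orders, total)
conn.close()
```

The same query text would work with small changes on most relational databases; only the connection code is specific to SQLite.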

3. Data Modeling and Database Design

Going beyond querying database tables, you should understand the basics of data modeling and database design, including entity-relationship (ER) diagrams, schema design, and data validation constraints.

You should also be able to design database schemas that support efficient querying and data storage for analytical purposes.

4. Data Cleaning and Transformation

As a data scientist, you’ll have to preprocess and transform raw data into a suitable format for analysis. Databases can support data cleaning, transformation, and integration tasks.

So you should know how to extract data from various sources, transform it into a suitable format, and load it into databases for analysis. Familiarity with ETL tools, scripting languages (Python, R), and data transformation techniques is important.

5. Database Optimization

You should be aware of techniques to optimize database performance, such as creating indexes, denormalization, and using caching mechanisms.

Indexes speed up data retrieval: proper indexing improves query response times by letting the database engine locate the required rows quickly instead of scanning the whole table.
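One way to see the effect of an index is to ask the database how it plans to run a query before and after the index exists. The sketch below does this with SQLite's EXPLAIN QUERY PLAN statement; the "sales" table, its data, and the index name are assumptions made for the example, and the exact wording of the plan varies between SQLite versions.

```python
import sqlite3

# Hypothetical "sales" table with 1,000 generated rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north" if i % 4 == 0 else "south", float(i)) for i in range(1000)],
)
conn.commit()


def query_plan(sql):
    """Return SQLite's plan for a query as a single string."""
    plan_rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The fourth column of each plan row holds the human-readable description.
    return " | ".join(row[3] for row in plan_rows)


sql = "SELECT SUM(amount) FROM sales WHERE region = 'north'"

plan_before = query_plan(sql)  # no index on region yet: a full table scan
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
plan_after = query_plan(sql)   # the planner can now search the index

print("before:", plan_before)
print("after: ", plan_after)
```

Indexes are not free: each one takes storage and has to be updated on every write, so it is usually worth indexing only the columns you filter or join on often.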

6. Data Integrity and Quality Checks

Data integrity is maintained through constraints that define rules for data entry. Constraints such as unique, not null, and check constraints ensure the accuracy and reliability of the data.

Transactions are used to ensure data consistency, guaranteeing that multiple operations are treated as a single, atomic unit.
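Here is a small sketch of constraints and transactions working together, again using sqlite3. The "accounts" table and the transfer function are hypothetical. A transfer consists of two updates; if the second one violates the check constraint, the whole transaction is rolled back, so the first update is undone as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Constraints: owner must be present and unique, balance can never go negative.
conn.execute("""
    CREATE TABLE accounts (
        account_id INTEGER PRIMARY KEY,
        owner      TEXT NOT NULL UNIQUE,
        balance    REAL NOT NULL CHECK (balance >= 0)
    )
""")
conn.executemany(
    "INSERT INTO accounts (owner, balance) VALUES (?, ?)",
    [("alice", 100.0), ("bob", 20.0)],
)
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move money between accounts as one atomic unit. Returns True on success."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE owner = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE owner = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        return False


too_much = transfer(conn, "alice", "bob", 150.0)  # would make alice's balance negative
ok = transfer(conn, "alice", "bob", 30.0)
balances = dict(conn.execute("SELECT owner, balance FROM accounts"))
print(too_much, ok, balances)
```

Note that the failed transfer leaves both balances untouched, even though the credit to the receiver had already been executed inside the transaction.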

7. Integration with Tools and Languages

Databases can integrate with popular analytics and visualization tools, allowing data scientists to analyze and present their findings effectively. So you should know how to connect to and interact with databases using programming languages like Python, and perform data analysis.

Familiarity with tools like Python's pandas, R, and visualization libraries is necessary too.

In summary: Understanding various database types, SQL, data modeling, ETL processes, performance optimization, data integrity, and integration with programming languages are key components of a data scientist's skill set.

In the remainder of this introductory guide, we’ll focus on fundamental database concepts and types.

Fundamentals of Relational Databases

Relational databases are a type of database management system (DBMS) that organize and store data in a structured manner using tables with rows and columns. Popular RDBMS include PostgreSQL, MySQL, Microsoft SQL Server, and Oracle.

Let's dive into some key relational database concepts using examples.

Relational Database Tables

In a relational database, each table represents a specific entity, and the relationships between tables are established using keys.

To understand how data is organized in relational database tables, it’s helpful to start with entities and attributes.

You’ll often want to store data about objects: students, customers, orders, products, and the like. These objects are entities and they have attributes.

Let’s take the example of a simple entity: a "Student" object with three attributes, FirstName, LastName, and Grade. When storing the data, the entity becomes the database table and the attributes become the column names, or fields. Each row is an instance of the entity.

Tables in a relational database consist of rows and columns:

  • The rows are also known as records or tuples, and
  • The columns are referred to as attributes or fields.

Here's an example of a simple "Students" table:

StudentID | FirstName | LastName | Grade
1         | Jane      | Smith    | A+
2         | Emily     | Brown    | A
3         | Jake      | Williams | B+

In this example, each row represents a student, and each column represents a piece of information about the student.

Understanding Keys

Keys identify rows within a table and link related tables. The two most important types of keys are:

  • Primary Key: A primary key uniquely identifies each row in a table. It ensures data integrity and provides a way to reference specific records. In the "Students" table, "StudentID" could be the primary key.
  • Foreign Key: A foreign key establishes a relationship between tables. It refers to the primary key of another table and is used to link related data. For example, if we have another table called "Courses," the "StudentID" column in the "Courses" table could be a foreign key referencing the "StudentID" in the "Students" table.
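The two kinds of keys described above can be sketched with sqlite3 as follows. The "Students" rows come from the example table; the "Courses" columns other than StudentID are assumptions for illustration. Note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON is set on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
conn.executescript("""
    CREATE TABLE Students (
        StudentID INTEGER PRIMARY KEY,
        FirstName TEXT,
        LastName  TEXT,
        Grade     TEXT
    );
    CREATE TABLE Courses (
        CourseID   INTEGER PRIMARY KEY,
        CourseName TEXT,
        StudentID  INTEGER REFERENCES Students (StudentID)
    );
    INSERT INTO Students VALUES
        (1, 'Jane', 'Smith', 'A+'),
        (2, 'Emily', 'Brown', 'A'),
        (3, 'Jake', 'Williams', 'B+');
""")


def try_insert(sql, params):
    """Run one insert and report whether the database accepted it."""
    try:
        with conn:
            conn.execute(sql, params)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


results = [
    # Valid: student 1 exists.
    try_insert("INSERT INTO Courses (CourseName, StudentID) VALUES (?, ?)", ("Statistics", 1)),
    # Foreign key violation: there is no student 99.
    try_insert("INSERT INTO Courses (CourseName, StudentID) VALUES (?, ?)", ("Statistics", 99)),
    # Primary key violation: StudentID 1 is already taken.
    try_insert("INSERT INTO Students VALUES (?, ?, ?, ?)", (1, "John", "Doe", "B")),
]
for result in results:
    print(result)
```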

Relationships

Relational databases allow you to establish relationships between tables. Here are the most important and commonly occurring relationships:

  • One-to-One Relationship: In a one-to-one relationship, each record in a table is related to one, and only one, record in another table in the database. For example, a "StudentDetails" table with additional information about each student might have a one-to-one relationship with the "Students" table.
  • One-to-Many Relationship: One record in the first table is related to multiple records in the second table. For instance, a "Courses" table could have a one-to-many relationship with the "Students" table, where each course is associated with multiple students.
  • Many-to-Many Relationship: Multiple records in both tables are related to each other. To represent this, an intermediary table, often called a junction or link table, is used. For example, a "StudentsCourses" table could establish a many-to-many relationship between students and courses.
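As a rough sketch of the many-to-many case, the junction table below holds one row per (student, course) pair, and a query joins through it to list who takes what. The table layouts and sample rows are assumptions chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students (StudentID INTEGER PRIMARY KEY, FirstName TEXT);
    CREATE TABLE Courses  (CourseID  INTEGER PRIMARY KEY, CourseName TEXT);

    -- Junction table: each row links one student to one course.
    CREATE TABLE StudentsCourses (
        StudentID INTEGER REFERENCES Students (StudentID),
        CourseID  INTEGER REFERENCES Courses (CourseID),
        PRIMARY KEY (StudentID, CourseID)
    );

    INSERT INTO Students VALUES (1, 'Jane'), (2, 'Emily');
    INSERT INTO Courses  VALUES (10, 'Statistics'), (20, 'Databases');
    INSERT INTO StudentsCourses VALUES (1, 10), (1, 20), (2, 20);
""")

pairs = conn.execute("""
    SELECT s.FirstName, c.CourseName
    FROM StudentsCourses AS sc
    JOIN Students AS s ON s.StudentID = sc.StudentID
    JOIN Courses  AS c ON c.CourseID  = sc.CourseID
    ORDER BY s.FirstName, c.CourseName
""").fetchall()

for first_name, course_name in pairs:
    print(first_name, "takes", course_name)
```

The composite primary key on the junction table also prevents the same student from being enrolled in the same course twice.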

Normalization

Normalization (often discussed under database optimization techniques) is the process of organizing data in a way that minimizes data redundancy and improves data integrity. It involves breaking down large tables into smaller, related tables. Each table should represent a single entity or concept to avoid duplicating data.

For instance, if the "Students" table also stored each student's address, normalization might involve moving the address data into a separate "Addresses" table with its own primary key and linking it to the "Students" table using a foreign key.

Advantages and Limitations of Relational Databases

Here are some advantages of relational databases:

  • Relational databases provide a structured and organized way to store data, making it easy to define relationships between different types of data.
  • They support ACID properties (Atomicity, Consistency, Isolation, Durability) for transactions, ensuring that data remains consistent.

On the flip side, they have the following limitations:

  • Relational databases are harder to scale horizontally, which makes it challenging to handle massive amounts of data and high traffic loads.
  • They also require a rigid schema, so changes in data structure cannot be accommodated without modifying the schema.
  • Relational databases are designed for structured data with well-defined relationships. They may not be well-suited for storing unstructured or semi-structured data like documents, images, and multimedia content.

Exploring NoSQL Databases

NoSQL databases do not store data in tables in the familiar row-column format (so are non-relational). The term "NoSQL" stands for "not only SQL"—indicating that these databases differ from the traditional relational database model.

The key advantages of NoSQL databases are their scalability and flexibility: they are designed to handle large volumes of unstructured or semi-structured data that do not fit neatly into the tables of a traditional relational database.

NoSQL databases encompass a variety of database types that differ in their data models, storage mechanisms, and query languages. Some common categories of NoSQL databases include:

  • Key-value stores
  • Document databases
  • Column-family databases
  • Graph databases.

Now, let's go over each of the NoSQL database categories, exploring their characteristics, use cases, examples, advantages, and limitations.

Key-Value Stores

Key-value stores store data as simple pairs of keys and values. They are optimized for high-speed read and write operations. They are suitable for applications such as caching, session management, and real-time analytics.

These databases, however, have limited querying capabilities beyond key-based retrieval. So they’re not suitable for complex relationships.

Amazon DynamoDB and Redis are popular key-value stores.
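To illustrate the access pattern, here is a toy in-process key-value store built on a Python dictionary. It is only a sketch of the idea and says nothing about how Redis or DynamoDB are implemented; the class and method names are made up for the example.

```python
class KeyValueStore:
    """Toy key-value store: fast lookups by key, nothing else."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

    def find_keys_by_value(self, predicate):
        # There is no index on values, so this has to scan every pair.
        return [key for key, value in self._data.items() if predicate(value)]


sessions = KeyValueStore()
sessions.put("session:42", {"user": "jane", "cart_items": 3})
sessions.put("session:43", {"user": "emily", "cart_items": 0})

hit = sessions.get("session:42")   # direct lookup by key
miss = sessions.get("session:99")  # unknown key returns the default (None)
non_empty = sessions.find_keys_by_value(lambda value: value["cart_items"] > 0)
print(hit, miss, non_empty)
```

The lookup by key is a single dictionary access, while any question about the values needs a full scan, which mirrors the limited querying described above.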

Document Databases

Document databases store data in document formats such as JSON and BSON. Each document can have varying structures, allowing for nested and complex data. Their flexible schema allows easy handling of semi-structured data, supporting evolving data models and hierarchical relationships.

These are particularly well-suited for content management, e-commerce platforms, catalogs, user profiles, and applications with changing data structures. Document databases may not be as efficient for complex joins or complex queries involving multiple documents.

MongoDB and Couchbase are popular document databases.
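The flexible schema can be sketched in plain Python by treating each document as a dictionary. The documents, field names, and the dotted-path helper below are hypothetical and are not the API of MongoDB or Couchbase; they only show that documents in one collection can have different shapes and still be queried by a nested field.

```python
import json

# Two documents in the same collection with different structures.
documents = [
    {"_id": 1, "name": "Laptop", "price": 900,
     "specs": {"ram_gb": 16, "ports": ["usb-c", "hdmi"]}},
    {"_id": 2, "name": "T-shirt", "price": 15, "sizes": ["S", "M", "L"]},
]


def get_path(doc, dotted_path):
    """Follow a path such as 'specs.ram_gb'; return None if any part is missing."""
    current = doc
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(docs, dotted_path, value):
    """Return the documents whose field at dotted_path equals value."""
    return [doc for doc in docs if get_path(doc, dotted_path) == value]


matches = find(documents, "specs.ram_gb", 16)
names = [doc["name"] for doc in matches]

# Documents map directly to JSON, so they survive a round trip unchanged.
round_trip = json.loads(json.dumps(documents))
print(names, round_trip == documents)
```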

Column-Family Stores (Wide-Column Stores)

Column-family stores, also known as wide-column stores, are a type of NoSQL database that groups related columns into column families. Each row is identified by a key and can hold a different set of columns, unlike the fixed columns of a relational table. Despite the similar name, they are not the same thing as column-oriented (columnar) analytical databases, which store the values of each column together to speed up aggregations.

Column-family stores are suitable for workloads that involve high write throughput and key-based lookups on large, distributed datasets. They’re helpful for managing large amounts of semi-structured or sparse data.

Apache Cassandra, ScyllaDB, and HBase are some column-family stores.

Graph Databases

Graph databases model data as nodes and the relationships between them as edges. These databases support efficient handling of complex relationships and offer powerful graph query languages.

As you can guess, these databases are suitable for social networks, recommendation engines, knowledge graphs, and in general, data with intricate relationships.

Examples of popular graph databases are Neo4j and Amazon Neptune.
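A relationship query such as "who is within two hops of this person?" gives a feel for the nodes-and-edges model. The sketch below uses a plain adjacency dictionary and a breadth-first search; the people and friendships are invented, and real graph databases expose this kind of traversal through their own query languages rather than hand-written code.

```python
from collections import deque

# Hypothetical friendship edges (undirected).
edges = [("jane", "emily"), ("emily", "jake"), ("jake", "maria"), ("jane", "omar")]

graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def within_hops(graph, start, max_hops):
    """Breadth-first search: map each reachable node to its distance from start."""
    seen = {start}
    found = {}
    frontier = deque([(start, 0)])
    while frontier:
        node, distance = frontier.popleft()
        if distance == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                found[neighbor] = distance + 1
                frontier.append((neighbor, distance + 1))
    return found


nearby = within_hops(graph, "jane", 2)
print(nearby)
```

In a relational database the same question would need one self-join per hop, which is why deeply connected data is often easier to handle in a graph model.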

There are many NoSQL database types. So how do we decide which one to use? Well, the answer is: it depends.

Each category of NoSQL database offers unique features and benefits, making them suitable for specific use cases. It's important to choose the appropriate NoSQL database by factoring in access patterns, scalability requirements, and performance considerations.

To sum up: NoSQL databases offer advantages in terms of flexibility, scalability, and performance, making them suitable for a wide range of applications, including big data, real-time analytics, and dynamic web applications. However, they come with trade-offs in terms of data consistency.

Advantages and Limitations of NoSQL Databases

The following are some advantages of NoSQL databases:

  • NoSQL databases are designed for horizontal scalability, allowing them to handle massive amounts of data and traffic.
  • These databases allow for flexible and dynamic schemas, so they can accommodate various data types and structures, including unstructured or semi-structured data.
  • Many NoSQL databases are designed to operate in distributed and fault-tolerant environments, providing high availability even in the presence of hardware failures or network outages.

Some limitations include:

  • Many NoSQL databases prioritize scalability and performance over strict ACID compliance. This can result in eventual consistency, which may not be suitable for applications that require strong data consistency.
  • Because NoSQL databases come in various flavors with different APIs and data models, the lack of standardization can make it challenging to switch between databases or integrate them seamlessly.

It's important to note that NoSQL databases are not a one-size-fits-all solution. The choice between a NoSQL and a relational database depends on the specific needs of your application, including data volume, query patterns, and scalability requirements amongst others.

Relational vs. NoSQL Databases

Let’s sum up the differences we’ve discussed thus far:

Feature | Relational Databases | NoSQL Databases
Data Model | Tabular structure (tables) | Diverse data models (documents, key-value pairs, graphs, columns, etc.)
Data Consistency | Strong consistency | Often eventual consistency
Schema | Well-defined schema | Flexible or schema-less
Data Relationships | Supports complex relationships | Varies by type (limited or explicit relationships)
Query Language | SQL-based queries | Specific query language or APIs
Flexibility | Not as flexible for unstructured data | Suited for diverse data types
Use Cases | Well-structured data, complex transactions | Large-scale, high-throughput, real-time applications

A Note on Time Series Databases

As a data scientist, you’ll also work with time series data. Time series databases are also non-relational databases, but have a more specific use case.

They need to support storing, managing, and querying timestamped data points—data points that are recorded over time—such as sensor readings and stock prices. They offer specialized features for storing, querying, and analyzing time-based data patterns.

Some examples of time series databases include InfluxDB, QuestDB, and TimescaleDB.
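A typical time series operation is downsampling: grouping timestamped points into fixed time buckets and aggregating each bucket. The sketch below computes hourly averages in plain Python; the sensor readings are invented, and a time series database would do this with its own built-in functions.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical temperature readings as (ISO timestamp, value) pairs.
readings = [
    ("2023-09-08T10:05:00", 21.0),
    ("2023-09-08T10:35:00", 23.0),
    ("2023-09-08T11:10:00", 22.5),
]


def hourly_mean(readings):
    """Group readings by the hour they fall in and average each group."""
    buckets = defaultdict(list)
    for timestamp, value in readings:
        hour = datetime.fromisoformat(timestamp).replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {
        hour.isoformat(): sum(values) / len(values)
        for hour, values in sorted(buckets.items())
    }


averages = hourly_mean(readings)
print(averages)
```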

Conclusion

In this guide, we went over relational and NoSQL databases. It’s also worth noting that there are more database types to explore beyond these. NewSQL databases such as CockroachDB aim to provide the traditional benefits of SQL databases along with the scalability and performance of NoSQL databases.

You can also use an in-memory database that stores and manages data primarily in the main memory (RAM) of a computer, as opposed to traditional databases that store data on disk. This approach offers significant performance benefits due to the much faster read and write operations that can be performed in memory compared to disk storage.

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more.

India’s Reliance partners with Nvidia to build large language model

Manish Singh

Reliance Industries’s Jio Platforms has partnered with GPU giant Nvidia to work on building a large language model that is trained on India’s diverse languages, the two firms said Friday, as the largest Indian corporate firm expands into the fast-growing but locally uncontested space.

The companies will also work together to build an AI infrastructure that is “over an order of magnitude more powerful than the fastest supercomputer in India today,” they said, without sharing a timeframe. Reliance said the cloud infrastructure would provide accelerated computing access to researchers, developers, startups, scientists, AI experts, and others throughout India.

As part of the deal, Nvidia will equip Jio with comprehensive AI supercomputer solutions (Nvidia GH200 Grace Hopper Superchip and Nvidia DGX Cloud) as well as frameworks for crafting advanced AI models. Jio, in turn, will be responsible for the management of the AI cloud infrastructure and will also handle customer interactions and access.

“We are delighted to partner with Reliance to build state-of-the-art AI supercomputers in India,” said Nvidia chief Jensen Huang, who met several local entrepreneurs and Prime Minister Narendra Modi during his recent India visit. “India has scale, data and talent. With the most advanced AI computing infrastructure, Reliance can build its own large language models that power generative AI applications made in India, for the people of India.”

India, despite being the world’s most populous country, has yet to make a significant mark in the global AI arena. Most Indian startups and established local companies have primarily focused on developing applications using large language models created by organizations like OpenAI. Elsewhere in the world, companies and countries are racing to secure the highly sought-after Nvidia chips to power their own large language models.

Reliance, whose biggest revenue driver is its oil business, has expanded to numerous sectors in the past decade, including telecom and video streaming, as it has sought to diversify its empire. Jio Platforms — backed by Meta, Google, Qualcomm and Intel — is increasingly also positioning itself as the technology distribution partner for many global giants. It maintains a 10-year deal with Microsoft to launch cloud data centers and resell many business offerings, and just last month the firm deepened its collaboration with Netflix.

“As India advances from a country of data proliferation to creating technology infrastructure for widespread and accelerated growth, computing and technology super centres like the one we envisage with Nvidia will provide the catalytic growth just like Jio did to our nation’s digital march,” said Mukesh Ambani, chairman and managing director of Reliance Industries, in a statement.

Nvidia said separately it has partnered with India’s Tata Group to train 600,000 employees at the consultancy firm TCS with advancements in AI and build AI infrastructure with Tata Communications.

Industry insiders attribute India’s dearth of AI-first startups in part to a skills gap among the nation’s workforce. The advent of generative AI could also displace many service jobs, analysts warn.

“Among its over 5 million employees, IT in India still has a high mix of low-end employees like BPO or system maintenance. While AI isn’t at the level of causing disruptions, the systems are improving rapidly,” Bernstein analysts wrote in a report this year.

In response to it, New Delhi has said that India will not regulate the growth of AI, taking a different approach from many other countries.

Apple Springs a Surprise, Embraces Open-Source Training Method

When the news of Apple working on its generative AI tools and chatbot appeared two months ago, the positive market sentiments pushed Apple’s shares to a record high of $198.23, reflecting a gain of 2.3%. However, apart from Apple using Ajax for its LLM and employees internally naming it AppleGPT, no other details on the model were released.

In a new development, as per a report by The Information, Apple is training Ajax GPT with more than 200 billion parameters, and the model is believed to be more powerful than GPT-3.5. And, hear, hear, Apple did something unusual for the company: it open-sourced its code on GitHub!

An Unprecedented Move

In July, Apple discreetly uploaded the code for AXLearn to GitHub, making it accessible to the public for training their own large language models without the need to start from scratch. AXLearn, internal software developed by Apple over the past year for training Ajax GPT, is a machine-learning framework that serves as a pre-built tool for rapidly training machine-learning models. Ajax is a derivative of JAX, an open-source framework created by Google researchers, and some of the components of AXLearn are specifically designed for optimisation on Google TPUs.

While Apple might be way ahead in bringing innovative solutions, there is a rotten side that puts the company’s priorities before anything else. Apple has been infamous for fostering a closed-source environment, with little of its technology or code open to the public. While others have been releasing open-source models such as Meta’s Llama-2, Falcon, Vicuna and others, Apple has always stuck to its conventional route of secrecy, something OpenAI has also been following. Apple’s closed-source approach has been criticised by the tech community, which labels the company as one that benefits from research released by big tech but never gives anything in return.

Apple’s decision to open-source its training software, AXLearn, is a significant step away from its secretive approach. This move could foster collaboration and innovation within the AI research community and reflect a broader trend of openness in AI development.

While the exact motive behind Apple’s decision to release the code on GitHub remains undisclosed, it is evident that the company’s substantial investment, amounting to millions of dollars spent daily on AI development, reflects its determination to compete vigorously in the AI race.

Interestingly, last month the company filed for the trademark “AXLearn” in Hong Kong.

Emulating Google Culture

Apple’s head of AI John Giannandrea, and Ruoming Pang, the lead of its conversational AI team called ‘Foundational Model’, both bring extensive experience from their previous roles at Google. Giannandrea brought his vision of making Apple like Google where employees had more freedom to conduct diverse research, publish papers and explore innovative ideas. Apple’s prior limitations in these areas had hindered talent growth and recruitment.

Reportedly, Apple has also hired talent from Google and Meta’s AI platform teams. In the past two years, at least seven of the 18 contributors to AXLearn on GitHub previously worked at either Google or Meta. Apple has likely tweaked its approach to foster talent through the research community, which makes open-sourcing the right way ahead.

Decoding The Clues

Piecing together available information, it appears that Apple has formed two new teams that are working on language and image models. Apple’s recent AI research paper hints towards work on software capable of generating images, videos and 3D scenes, also implying a multimodal AI.

However, uncertainties remain about how the LLM will be integrated into Apple’s products. Apple has always leaned towards running its new software on its devices, but integrating a 200-billion-parameter LLM, with its heavy storage and computing requirements, into an iPhone is not plausible. It is possible that the company might work on smaller models for phone integration or that the model will be used for something else, the details of which remain elusive.

The post Apple Springs a Surprise, Embraces Open-Source Training Method appeared first on Analytics India Magazine.

Time 100 AI: The Most Influential?

Yesterday, Time Magazine released its Time 100 AI list, coinciding with the cover story of their latest issue.

[B]ehind every advance in machine learning and large language models are, in fact, people—both the often obscured human labor that makes large language models safer to use, and the individuals who make critical decisions on when and how to best use this technology. Reporting on people and influence is what TIME does best. That led us to the TIME100 AI.

Time's list includes 100 individuals involved in the current AI landscape, grouped into the categories of leaders, innovators, shapers, and thinkers.

While there is an array of impressive names on the list in the various categories, there was some immediate discussion on Twitter of significant exclusions, with Jürgen Schmidhuber and Andrej Karpathy being two names that seemed to come up repeatedly. I had some oversights of my own that I would have added, along with several names I feel could have been left off, but it's not my list so there isn't much I can do about it.

Instead of complaining, I have decided to look at the list as a snapshot of how an influential popular publication is reporting on AI to the masses. As such, I would not endorse this as authoritative in any manner, but still choose to bring it to our readers for informational purposes.

You can find out more about how they put the list together in Time's article "How We Chose the TIME100 Most Influential People in AI".

TIME’s most knowledgeable editors and reporters spent months fielding recommendations from dozens of sources, to put together hundreds of nominations that we whittled down to the group you see today. We interviewed nearly all of the individuals on this list to get their perspective on the path of AI today.

Whether or not you agree fully with the inclusions, I feel that it is still important to know what popular publications are reporting to the masses about the AI industry. I would imagine that even the most well-versed follower of the AI landscape would encounter some unfamiliar names with which to familiarize themselves.

I encourage everyone to have a look at the list themselves, gain some insight into contemporary AI leaders, and — just as importantly — get an understanding of how those outside of the industry see it.

Matthew Mayo (@mattmayo13) holds a Master's degree in computer science and a graduate diploma in data mining. As Editor-in-Chief of KDnuggets, Matthew aims to make complex data science concepts accessible. His professional interests include natural language processing, machine learning algorithms, and exploring emerging AI. He is driven by a mission to democratize knowledge in the data science community. Matthew has been coding since he was 6 years old.

NVIDIA Partners with Reliance for Advancing AI in India

In a groundbreaking announcement for the development of AI in India, NVIDIA has announced that it is collaborating with Reliance Industries to develop India’s foundational large language model trained on the nation’s diverse languages and tailored for generative AI applications to serve the world’s most populous nation.

The collaboration aims to build AI infrastructure in India that the companies claim will be more powerful than the fastest supercomputer in India today. Further, NVIDIA said that it will provide Reliance with access to the GH200 Grace Hopper Superchip and NVIDIA DGX Cloud for exceptional performance.

This NVIDIA infrastructure would be the next frontier for Mukesh Ambani’s Reliance Jio Infocomm, which has already provided the country with high-speed network connectivity at low cost. With this, Reliance aims to create AI applications and services for more than 450 million Jio customers and provide energy-efficient AI infrastructure for research to scientists and startups across India.

To achieve this, the AI infrastructure will be hosted on AI-ready computing data centres that will eventually expand to 2,000 MW. The implementation and execution will be managed by Jio through its already established 5G spectrum and fibre networks.

Jensen Huang, founder and CEO of NVIDIA, said, “We are delighted to partner with Reliance to build SOTA AI supercomputers in India.” He added that India has the scale, data, and talent to create the most advanced AI computing infrastructure, with which “Reliance can build its own LLMs that power generative AI applications made in India, for the people of India.”

Mukesh Ambani, chairman of Reliance, also expressed his delight at the partnership. “As India advances from a country of data proliferation to creating technology infrastructure for widespread and accelerated growth, computing and technology super centres like the one we envisage with NVIDIA will provide the catalytic growth just like Jio did to our nation’s digital march,” he said.

“At Jio, we are committed to fueling India’s technology renaissance by democratising access to cutting-edge technologies, and our collaboration with NVIDIA is a significant step in this direction,” said Akash Ambani, chairman of Reliance Jio Infocomm. “Together, we will develop a state-of-the-art AI cloud infrastructure that is secure, sustainable and deeply relevant across India, accelerating the nation’s journey towards becoming an AI powerhouse.”

This announcement comes just days after Huang visited India for the second time to meet Prime Minister Narendra Modi to discuss AI potential within the country.

Furthermore, at Reliance’s 46th Annual General Meeting (AGM) last month, Mukesh Ambani announced that his networking giant Jio is set to build “India-specific AI models” that will benefit different verticals of the country, including government, business and consumers, and make them accessible for “everyone, everywhere”.

The post NVIDIA Partners with Reliance for Advancing AI in India appeared first on Analytics India Magazine.

Time 100 AI: The Most Influential?

Time 100 AI
Image source: Time 100 AI

Yesterday, Time Magazine released its Time 100 AI list, coinciding with the cover story of their latest issue.

[B]ehind every advance in machine learning and large language models are, in fact, people—both the often obscured human labor that makes large language models safer to use, and the individuals who make critical decisions on when and how to best use this technology. Reporting on people and influence is what TIME does best. That led us to the TIME100 AI.

Time's list includes 100 individuals involved in the current AI landscape, grouped into the categories of leaders, innovators, shapers, and thinkers.

While there is an array of impressive names on the list in the various categories, there was some immediate discussion on Twitter of significant exclusions, with Jürgen Schmidhuber and Andrej Karpathy being two names that seemed to come up repeatedly. I had some oversights of my own that I would have added, along with several names I feel could have been left off, but it's not my list so there isn't much I can do about it.

Instead of complaining, I have decided to look at the list as a snapshot of how an influential popular publication is reporting on AI to the masses. As such, I would not endorse this as authoritative in any manner, but still choose to bring it to our readers for informational purposes.

You can find out more about how they put the list together here.

TIME’s most knowledgeable editors and reporters spent months fielding recommendations from dozens of sources, to put together hundreds of nominations that we whittled down to the group you see today. We interviewed nearly all of the individuals on this list to get their perspective on the path of AI today.

Whether or not you agree fully with the inclusions, I feel that it is still important to know what popular publications are reporting to the masses about the AI industry. I would imagine that even the most well-versed follower of the AI landscape would encounter some unfamiliar names with which to familiarize themselves.

Time 100 AI
Image source: How We Chose the TIME100 Most Influential People in AI

I encourage everyone to have a look at the list themselves, gain some insight into contemporary AI leaders, and — just as importantly — get an understanding of how those outside of the industry see it.

Matthew Mayo (@mattmayo13) holds a Master's degree in computer science and a graduate diploma in data mining. As Editor-in-Chief of KDnuggets, Matthew aims to make complex data science concepts accessible. His professional interests include natural language processing, machine learning algorithms, and exploring emerging AI. He is driven by a mission to democratize knowledge in the data science community. Matthew has been coding since he was 6 years old.

Tencent latest to join China’s GenAI race with foundation model for enterprises

Tencent is the latest player to hop on China's generative artificial intelligence (AI) bandwagon, offering its foundation model on which local organizations can test and build their own applications.

Called Hunyuan, the large language model can be accessed via APIs on Tencent's cloud platform. Chinese enterprises can tweak the model to fit their specific requirements, tapping capabilities that include Chinese language processing and advanced logic reasoning, Tencent said.

The foundation AI model also facilitates a range of functions, such as image creation, text recognition, and copywriting. The Chinese cloud vendor is touting Hunyuan as a platform for various industries, including finance, e-commerce, transport, and games.

The AI model currently has more than 100 billion parameters and more than two trillion tokens in pre-training data.

Tencent said it has integrated Hunyuan with 50 of its own products, spanning its fintech, advertising, games, office productivity, and search applications.

Tencent Meeting, for example, now offers an AI assistant that can carry out common tasks, such as generating minutes from a meeting, via natural language processing and user prompts. Its advertising tools can also be used to create shopping guides that retailers can tap as marketing assets.

"In launching Hunyuan and making it available to domestic enterprises, Tencent has opted for an approach that balances the exciting performance of consumer-facing, large-model AI powered chatbots, with the pragmatic need for the business community to increase operational efficiencies, reduce costs, protect privacy as well as proprietary data," said Dowson Tong, Tencent's senior executive vice president of cloud and smart industries.

The tech giant is among a growing list of domestic players that are heating up China's generative AI market, which now includes Baidu, JD.com, and Alibaba Cloud.

JD.com's ChatRhino also boasts a base of 100 billion parameters, up from the 10 billion-parameter benchmark clocked by its previous model Vega early last year. Vega had led the General Language Understanding Evaluation (GLUE) list, outpacing models from Microsoft and Facebook, according to JD.com.

Alibaba Cloud's Tongyi Qianwen is available to its domestic customers for beta testing as well as to developers via an API. The Chinese cloud vendor also introduced a partnership program in the hope of fuelling the development of AI applications for verticals including finance and petrochemicals.

The accelerated drive toward AI comes amid interim regulations in China, which were pushed out to ensure the healthy development of the technology and safeguard both national security and public interests, the Chinese government said.

Effective from August 15, the interim legislation outlines various measures that aim to facilitate these objectives, including steps to be taken to improve the quality of training data, such as its accuracy, objectivity, and diversity.

Generative AI service providers also assume legal responsibility for the information generated and its security, and they must sign service-level agreements with users of their service, clarifying each party's rights and obligations.

LLMs Ride the Overconfidence Wave

Developers trying to fine-tune LLMs often encounter a plethora of issues. An experiment by Jonathan Whitaker and Jeremy Howard from fast.ai highlighted a rarely examined problem with LLMs: overconfidence, which shouldn’t be confused with the widely discussed LLM hallucination.

Overconfidence is when the model insists on information from its training dataset even when that information is incorrect for the question at hand. It may be caused by two familiar problems: underfitting and overfitting.

To start with, overfitting is when a model becomes overly intricate and tailors itself too closely to the training data. Underfitting is the opposite: the model is too simple, or has learned too little from the data, to capture the underlying patterns. Balancing the two is often referred to as the bias-variance tradeoff.

To tackle these problems, developers apply several techniques; some work, and some introduce other problems. The fast.ai researchers, for their part, tried training the model on a single example, which, to their surprise, produced results very different from what they expected.

Enter Overconfident LLMs

When the model is given new unseen data, it can display unwarranted confidence in its predictions, despite being wrong. This is contrary to the conventional belief that neural networks typically require a multitude of examples due to the bumpy nature of loss surfaces during training.

Imagine a language model that has been fine-tuned on a comprehensive medical dataset to diagnose diseases based on patient descriptions. When provided with cases featuring evident symptoms and clear diagnostic criteria, the model confidently assigns high probabilities to specific diseases. For instance, if a patient describes classic symptoms of the flu, the model might assign a near 1.0 probability to influenza as the diagnosis.

However, when confronted with complex medical cases with ambiguous symptoms or multiple potential diseases, the model might distribute probabilities more evenly among different diagnostic options, indicating its uncertainty about the correct diagnosis.
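The contrast between a confident and an uncertain prediction can be made concrete with a small sketch. It assumes the model's raw scores are turned into probabilities with a softmax and uses Shannon entropy as the uncertainty measure; the three diagnoses and their scores are invented for illustration and do not come from the experiment.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats: 0 for a certain prediction, ln(n) for a uniform one."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical scores over three candidate diagnoses: flu, cold, allergy.
clear_case = softmax([9.0, 1.0, 0.5])      # classic flu symptoms
ambiguous_case = softmax([1.2, 1.0, 0.9])  # overlapping symptoms

print(max(clear_case), entropy(clear_case))
print(max(ambiguous_case), entropy(ambiguous_case))
```

The clear case puts almost all of its probability on one diagnosis and has entropy close to zero, while the ambiguous case spreads probability across the options and has much higher entropy. An overconfident model is one that produces the first kind of distribution even when it is wrong.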

Similarly, when training neural network classifiers, which are typically exposed to extensive datasets repeatedly, Howard and Whitaker noticed that even a single example of an input-output pair had a remarkable impact on these models. It was discovered that during training, the models exhibited overconfidence. As their confidence increased, they assigned close to 1.0 probability to their predictions, even if those predictions were incorrect.

This overconfidence, particularly in the early stages of training, raised concerns about how neural networks handle new information and adapt to it.

They found that the model can learn to make accurate predictions after seeing a single example: it essentially memorised the training data (a single example) yet demonstrated robust generalisation, making it less likely to overfit. The intention was to get the machine to learn efficiently and make reliable predictions while regulating its confidence scores.

Is overfitting the cause of overconfident models?

While overfitting, the phenomenon where a model becomes too specific to the training data, is a well-known challenge in machine learning, the real problem here appears to be overconfidence. The overconfident predictions led to an unexpected result: the validation loss, which measures the model’s performance on unseen data, got worse, even as the model’s training loss improved.
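One way to see how overconfidence alone can push validation loss up is to compare two sets of predictions that are equally accurate. The sketch below assumes the loss is the usual cross-entropy (negative log-likelihood), which the article does not state, and the probabilities are made up for illustration.

```python
import math
from typing import List


def mean_nll(probs_of_true_label: List[float]) -> float:
    """Average negative log-likelihood (cross-entropy) over a set of examples."""
    return sum(-math.log(p) for p in probs_of_true_label) / len(probs_of_true_label)


def accuracy(probs_of_true_label: List[float]) -> float:
    """Binary task: a prediction is correct when the true label gets more than 0.5."""
    return sum(p > 0.5 for p in probs_of_true_label) / len(probs_of_true_label)


# Hypothetical probabilities each model assigns to the TRUE label on four held-out examples.
hedged = [0.7, 0.7, 0.7, 0.3]             # right three times, mildly wrong once
overconfident = [0.99, 0.99, 0.99, 0.01]  # right three times, confidently wrong once

print(accuracy(hedged), mean_nll(hedged))
print(accuracy(overconfident), mean_nll(overconfident))
```

Both sets of predictions get three out of four examples right, but the single confident mistake dominates the overconfident model's average loss, since the penalty grows without bound as the probability given to the true label approaches zero.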

As expected, the experiment sparked several discussions on a Hacker News thread. When the model learns the training data too well, it performs poorly on new data. The researchers explained that they were not pointing out a problem but an opportunity: the possibility of training models with a single example.

Interestingly, the two terms are closely related: overconfidence can be a symptom of overfitting. When a model is overfit, it learns the statistical noise in the training data as well as the underlying patterns. This can lead to the model being overly confident in its predictions, even when those predictions are not accurate.

However, overconfidence is not always caused by overfitting. A model can also be overconfident if it is not trained on enough data, or if the data is not representative of the real world.

Lucas Beyer, a researcher at Google AI, clarified that these findings are specific to fine-tuning pre-trained models. They do not necessarily change how models are initially pre-trained and might not be as relevant to training models entirely from scratch.

While there are other questions and critiques of this experiment, one omission that nobody missed is the lack of any detail about the base model used. It is also unclear whether the same dataset was used again and again to fine-tune the model, which would result in overfitting and, in turn, overconfidence.

The post LLMs Ride the Overconfidence Wave appeared first on Analytics India Magazine.

Meet the Genius behind Med-PaLM 2

In December last year, when OpenAI’s ChatGPT was struggling to find real use cases, Google decided to explore the use of large language models (LLMs) for healthcare, resulting in the creation of Med-PaLM, an open-sourced large language model designed for medical purposes.

Since then, the team has released scaled-up versions of healthcare LLMs, including Med-PaLM-2 and Med-PaLM-M, both of which have had a direct impact on human lives. Currently, Med-PaLM-2 is also undergoing testing at renowned healthcare institutions such as the Mayo Clinic. One of the prominent contributors to these projects is Vivek Natarajan, an AI researcher at Google Health.

Currently based in the San Francisco Bay Area, the Tamilian with deep Bengali roots began his journey as an engineering intern at Qualcomm, progressed to a role with Meta AI, and ultimately found a fulfilling place at Google Health.

However, there is a story behind why he chose to transition into the field of medical AI.

How it All Began

It is 2023, and India’s healthcare system still faces significant hurdles with insufficient medical infrastructure and a severe shortage of medical professionals, especially in rural regions. The ratio of doctors to patients falls well below global standards, with a mere 0.7 doctors per 1,000 people. Adding to that, we have only 0.9 beds per 1,000 population, and out of those, only 30% are in rural areas.

Most people in these regions had to walk tens of kilometres, often in extreme conditions, leading to delayed diagnoses, poorly managed chronic conditions, and even untimely deaths. This healthcare disparity affected both the underprivileged and affluent individuals, underscoring the stark healthcare inequalities in these areas.

Having grown up in different parts of India, Natarajan witnessed these immense challenges faced by people in small towns and villages when it came to accessing medical care. “It always bothered me that people should not have to suffer so much to receive basic healthcare, and I always wanted to do something about it,” Natarajan told AIM in an exclusive interaction.

From building ‘Ask the Doctor, Anytime Anywhere’ in 2013, an app aimed at democratizing healthcare access, to being the research lead behind Google’s state-of-the-art LLM for medicine, Med-PaLM 2, Natarajan has come a long way. “I guess the name gives away what we were trying to do. Ask the Doctor was bootstrapped using older AI techniques and a lot of rules, and it clearly did not work well, leading to its discontinuation,” he said.

The app was made by leveraging pre-deep learning ML techniques — a combination of expert systems and rules. However, even back in 2013, he had this intuition that AI would be the most important piece of solving this healthcare problem.

How Google Happened

After completing a bachelor’s degree at NIT Trichy in Electronics Engineering and graduating with a master’s degree in Computer Science from UT Austin in 2015, Natarajan joined Meta AI. Despite being in the pre-transformer era, Natarajan’s time at Meta AI, which was his first job, taught him the potential of deep learning. At Meta, he worked in various areas, from speech recognition to conversational and multimodal AI, and on various business-critical platforms such as Newsfeed and Messenger.

However, things took a different turn. It was during this period that his father began showing signs of an aggressive form of Parkinson’s disease, which could not be identified sooner due to the limited care options and resources. “That persuaded me to go back to the problem that I always deeply cared about: using AI to democratise access to healthcare and put world-class medical expertise in the pocket of billions,” said Natarajan.

Coincidentally, this was also the time when researchers from Google Brain and DeepMind (now referred to as Google DeepMind), after some seminal medical AI papers, were coming together to form Google Health AI, aligning with his aim. “So when Greg Corrado, co-founder of Google Brain and head of Google Health AI, offered me the chance to join, I took it up without hesitation,” he added.

Since then, he has collaborated with esteemed AI researchers like Greg and Dr Alan Karthikesalingam to work toward the vision of making an AI doctor accessible to billions.

Behind the Making of Med-PaLM

If not an AI researcher, Natarajan would probably have been a cricket commentator like Harsha Bhogle. Thankfully, he didn’t embark on that career; otherwise, we might have missed out on his stellar work in building Med-PaLM, Med-PaLM 2, Med-PaLM M, and related projects.

The core concept driving the development of Med-PaLM is to start from general-purpose language models like PaLM and GPT-4, which excel at predicting text but lack specialised medical knowledge. The challenge lies in transforming these models into medical experts, much as people become doctors through training. “So, we need to do the same with AI and ‘send them to medical school’ if we want to use them for medical applications. Make them learn from high-quality medical domain information spanning human biology to practice of medicine as well as from clinical expert demonstrations and feedback, similar to residency after medical school,” he added.

However, the primary obstacle was the scarcity of large-scale medical datasets due to privacy concerns and healthcare in the global south not being digital. Additionally, there’s a pressing concern about bias in LLMs used in healthcare. These cultural, social, racial, and gender biases can result in unequal access to care, misdiagnoses, and treatment disparities. The root of this problem lies in the reliance of healthcare LLMs on extensive datasets that mirror historical healthcare inequities, potentially leading to inaccurate diagnoses and treatment recommendations for marginalised communities.

The Med-PaLM models, derived from the PaLM general-purpose language models, are tailored for medical applications through fine-tuning with high-quality medical datasets and clinical expert demonstrations, covering areas like professional medical exams, PubMed research, and user-generated medical questions. These datasets, including the openly available HealthSearchQA dataset from Google, are instrumental in the development of Med-PaLM and similar models.

In the Med-PaLM paper, researchers introduced an evaluation rubric for assessing LLMs in medical applications, with bias being one of the key dimensions. “Additionally, in Med-PaLM 2, we introduced adversarial questions evaluation, specifically targeting sensitive topics like vaccine misinformation, COVID-19, obesity, mental health, and suicide. These topics have a high potential to exacerbate bias and healthcare disparities through the spread of medical misinformation,” said Natarajan.

“Our approach to mitigating bias involves rigorous evaluation and expert clinician demonstrations to train the model. While it’s a complex challenge, we are steadily making progress in this area,” he added.

He added that the fine-tuning approach used depends on the available data. In the case of the first Med-PaLM, prompt tuning was employed, wherein the majority of the LLM parameters remained fixed, and only a small set of additional parameters were learned. However, for subsequent versions such as Med-PaLM 2 and Med-PaLM M, the team had access to more data, enabling them to fine-tune the models end-to-end in order to enhance performance and align them more closely with medical expertise.
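The idea behind prompt tuning, keeping the model's weights fixed and learning only a few extra vectors that are prepended to the input, can be sketched with a toy example. Everything here is hypothetical: the "model" is a single frozen weight vector scoring the mean of its input embeddings, and the names `forward` and `prompt_tuning_step` are invented for illustration. It is not how Med-PaLM was implemented.

```python
from typing import List

Vector = List[float]


def forward(frozen_w: Vector, soft_prompt: List[Vector], token_embs: List[Vector]) -> float:
    """Toy model: score the mean of the prompt + token embeddings with frozen weights."""
    sequence = soft_prompt + token_embs  # learned prompt vectors are prepended to the input
    dim = len(frozen_w)
    mean = [sum(vec[i] for vec in sequence) / len(sequence) for i in range(dim)]
    return sum(w_i * m_i for w_i, m_i in zip(frozen_w, mean))


def prompt_tuning_step(frozen_w: Vector, soft_prompt: List[Vector],
                       token_embs: List[Vector], target: float,
                       lr: float = 0.1) -> List[Vector]:
    """One gradient step on squared error that updates ONLY the soft prompt."""
    n = len(soft_prompt) + len(token_embs)
    error = forward(frozen_w, soft_prompt, token_embs) - target
    # d(loss)/d(prompt[j][i]) = 2 * error * frozen_w[i] / n, the same for every prompt vector j
    grad = [2.0 * error * w_i / n for w_i in frozen_w]
    return [[p_i - lr * g_i for p_i, g_i in zip(vec, grad)] for vec in soft_prompt]


frozen_w = [0.5, -1.0, 2.0]                        # stands in for the fixed LLM parameters
token_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # embeddings of the actual input
soft_prompt = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # the only trainable parameters
target = 1.0

loss_before = (forward(frozen_w, soft_prompt, token_embs) - target) ** 2
for _ in range(50):
    soft_prompt = prompt_tuning_step(frozen_w, soft_prompt, token_embs, target)
loss_after = (forward(frozen_w, soft_prompt, token_embs) - target) ** 2

print(loss_before, loss_after)
```

The loss falls even though `frozen_w` is never modified, because the learned prompt vectors shift what the frozen model sees. End-to-end fine-tuning, by contrast, would also update `frozen_w`, which gives more capacity to adapt but generally calls for more data.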

AI Doctor for Everyone

As we continue to ride the generative AI wave, Natarajan believes that understanding LLMs is crucial, as they differ from human intelligence and require specialised methods, such as “mechanistic interpretability or artificial neuroscience”, posing a plethora of new challenges that need to be solved. According to him, there lies immense potential for exciting research beyond large language models. He is particularly excited about LLMs’ potential in biology and neurology, such as analysing the human genome and decoding brain signals.

Although he has no plans to directly revisit building an app like Ask the Doctor, he believes that his work on Med-PaLM and medical AI as a whole at Google will eventually lead to something very similar. “While there is still a long way to go, given the incredible progress made in LLMs just last year, it appears that my dream of making an AI doctor accessible to billions is no longer science fiction. Fingers crossed!” Natarajan concluded.

The post Meet the Genius behind Med-PaLM 2 appeared first on Analytics India Magazine.