Semantic Search with Vector Databases

Image generated with Ideogram.ai

I am sure that most of us have used search engines.

There is even a phrase, “Just Google it,” meaning you should search for the answer using Google's search engine. That’s how thoroughly Google has become synonymous with search.

Why are search engines so valuable? They allow users to easily acquire information on the internet with a short query and organize that information by relevance and quality. In turn, search makes massive amounts of previously inaccessible knowledge accessible.

Traditionally, the search engine approach to finding information is based on lexical matching, or word matching. It works well, but sometimes the results are inaccurate because the user's intention differs from the literal input text.

For example, the input “Red Dress Shot in the Dark” is ambiguous, especially around the word “Shot.” The more probable meaning is that the picture of the red dress was taken in the dark, but a traditional search engine would not understand that. That’s why semantic search is emerging.

Semantic search can be defined as search that considers the meaning of words and sentences. Its output is information that matches the meaning of the query, in contrast to a traditional search that matches the query word for word.

In the NLP (Natural Language Processing) field, vector databases have significantly improved semantic search capabilities by storing, indexing, and retrieving high-dimensional vectors that represent the meaning of text. Semantic search and vector databases are therefore closely related.

This article will discuss semantic search and how to use a Vector Database. With that in mind, let’s get into it.

How Semantic Search Works

Let’s discuss Semantic Search in the context of Vector Databases.

Semantic search is based on the meaning of the text, but how can we capture that information? A computer cannot have feelings or knowledge the way humans do, so the word “meaning” must refer to something else. In semantic search, “meaning” becomes a representation of the text that is suitable for meaningful retrieval.

That representation is an embedding: the transformation of text into a vector of numbers. For example, we can transform the sentence “I want to learn about Semantic Search” using an OpenAI embedding model.

[-0.027598874643445015, 0.005403674207627773, -0.03200408071279526, -0.0026835924945771694, -0.01792600005865097,...]

How is this numerical vector able to capture the meaning, then? Let’s take a step back. The result above is the embedding of the whole sentence. The output would be different if you replaced even a single word, and each individual word has its own distinct embedding as well.

Looking at the whole picture, the embedding of a single word and that of a complete sentence differ significantly, because sentence embeddings account for the relationships between words and the sentence's overall meaning, which is not captured in individual word embeddings. Each word, sentence, and text is unique in its embedding. This is how embeddings capture meaning instead of relying on lexical matching.

So, how does semantic search work with vectors? Semantic search embeds your corpus into a vector space, so that each data point (a text, sentence, document, etc.) becomes a coordinate. At search time, the query is embedded into the same vector space, and we find the corpus embeddings closest to the query using a vector similarity measure such as cosine similarity. To understand better, you can see the image below.

Image by Author

Each document embedding is placed as a coordinate in the vector space, along with the query embedding. The document closest to the query is selected, as it theoretically has the closest semantic meaning to the input.
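As a toy illustration of that nearest-neighbour step (made-up three-dimensional vectors standing in for real embeddings), cosine similarity can be computed with NumPy:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the vectors divided by the
    # product of their lengths; 1.0 means identical direction.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings"; the query points in roughly
# the same direction as doc1, so doc1 should score highest.
query = [0.9, 0.1, 0.0]
doc1 = [0.8, 0.2, 0.1]
doc2 = [0.0, 0.1, 0.9]

scores = {"doc1": cosine_similarity(query, doc1),
          "doc2": cosine_similarity(query, doc2)}
best = max(scores, key=scores.get)
print(best)  # doc1
```

Real embeddings have hundreds or thousands of dimensions, but the comparison works exactly the same way.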

However, maintaining a vector space that contains all the coordinates is a massive task, especially with a larger corpus. A vector database is preferable for storing the vectors, as it allows efficient vector calculations and maintains that efficiency as the data grows.
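To make the storage-and-retrieval idea concrete, here is a deliberately naive in-memory sketch using brute-force search over toy vectors. Real vector databases such as Weaviate instead use approximate-nearest-neighbour indexes (e.g., HNSW) precisely so search stays fast as the corpus grows:

```python
import numpy as np

class TinyVectorStore:
    """Naive in-memory vector store: brute-force cosine search."""

    def __init__(self):
        self.texts = []
        self.vectors = []

    def add(self, text, vector):
        v = np.asarray(vector, dtype=float)
        self.texts.append(text)
        self.vectors.append(v / np.linalg.norm(v))  # store normalized

    def search(self, query_vector, k=1):
        q = np.asarray(query_vector, dtype=float)
        q = q / np.linalg.norm(q)
        sims = np.stack(self.vectors) @ q  # cosine similarity to every doc
        top = np.argsort(sims)[::-1][:k]
        return [(self.texts[i], float(sims[i])) for i in top]

# Toy vectors standing in for embeddings of two documents.
store = TinyVectorStore()
store.add("red dress photographed at night", [0.9, 0.3, 0.1])
store.add("gunshot fired in a dark alley", [0.1, 0.2, 0.95])

results = store.search([0.85, 0.35, 0.15], k=1)
print(results[0][0])  # red dress photographed at night
```

Brute force compares the query against every stored vector, which is O(n) per query; that is the cost a dedicated vector database is built to avoid.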

The high-level process of Semantic Search with Vector Databases can be seen in the image below.

Image by Author

In the next section, we will perform a semantic search with a Python example.

Python Implementation

In this article, we will use Weaviate, an open-source vector database. For tutorial purposes, we will also use Weaviate Cloud Service (WCS) to store our vectors.

First, we need to install the Weaviate Python package.

pip install weaviate-client

Then, please register for their free cluster via Weaviate Console and secure both the Cluster URL and the API Key.

As for the dataset, we will use the Legal Text data from Kaggle. To make things easier, we will also use only the top 100 rows.

import pandas as pd

data = pd.read_csv('legal_text_classification.csv', nrows=100)

Image by Author

Next, we will store all the data in the vector database on Weaviate Cloud Service. To do that, we need to set up the connection to the database.

import weaviate

cluster_url = "YOUR_CLUSTER_URL"
wcs_api_key = "YOUR_WCS_API_KEY"
openai_api_key = "YOUR_OPENAI_API_KEY"

client = weaviate.connect_to_wcs(
    cluster_url=cluster_url,
    auth_credentials=weaviate.auth.AuthApiKey(wcs_api_key),
    headers={
        "X-OpenAI-Api-Key": openai_api_key
    }
)

The next thing we need to do is connect to the Weaviate Cloud Service and create a class (like Table in SQL) to store all the text data.

import weaviate.classes as wvc

client.connect()

legal_cases = client.collections.create(
    name="LegalCases",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
    generative_config=wvc.config.Configure.Generative.openai()
)

In the code above, we create a LegalCases class that uses the OpenAI embedding model. In the background, any text object we store in the LegalCases class passes through the OpenAI embedding model and is stored as an embedding vector.

Let’s try to store the Legal text data in a vector database. To do that, you can use the following code.

sent_to_vdb = data.to_dict(orient='records')
legal_cases.data.insert_many(sent_to_vdb)

You should see in the Weaviate Cluster that your Legal text is already stored there.

With the vector database ready, let’s try semantic search. The Weaviate API makes it easy, as shown in the code below. In this example, we will try to find the cases that happened in Australia.

response = legal_cases.query.near_text(
    query="Cases in Australia",
    limit=2
)

for obj in response.objects:
    print(obj.properties)

The result is shown below.

{'case_title': 'Castlemaine Tooheys Ltd v South Australia [1986] HCA 58 ; (1986) 161 CLR 148', 'case_id': 'Case11', 'case_text': 'Hexal Australia Pty Ltd v Roche Therapeutics Inc (2005) 66 IPR 325, the likelihood of irreparable harm was regarded by Stone J as, indeed, a separate element that had to be established by an applicant for an interlocutory injunction. Her Honour cited the well-known passage from the judgment of Mason ACJ in Castlemaine Tooheys Ltd v South Australia [1986] HCA 58 ; (1986) 161 CLR 148 (at 153) as support for that proposition.', 'case_outcome': 'cited'}    {'case_title': 'Deputy Commissioner of Taxation v ACN 080 122 587 Pty Ltd [2005] NSWSC 1247', 'case_id': 'Case97', 'case_text': 'both propositions are of some novelty in circumstances such as the present, counsel is correct in submitting that there is some support to be derived from the decisions of Young CJ in Eq in Deputy Commissioner of Taxation v ACN 080 122 587 Pty Ltd [2005] NSWSC 1247 and Austin J in Re Currabubula Holdings Pty Ltd (in liq); Ex parte Lord (2004) 48 ACSR 734; (2004) 22 ACLC 858, at least so far as standing is concerned.', 'case_outcome': 'cited'}

As you can see, we got two different results. In the first case, the word “Australia” is mentioned directly in the document, so it is easy to find. The second result, however, does not contain the word “Australia” anywhere. Semantic search still finds it, because the document contains terms related to Australia, such as “NSWSC,” which stands for the New South Wales Supreme Court, and “Currabubula,” a village in Australia.

Traditional lexical matching might miss the second record, but semantic search is more accurate because it takes the document's meaning into account.

That’s all for this simple implementation of semantic search with a vector database.

Conclusion

Search engines dominate information acquisition on the internet, although the traditional lexical-matching approach has a flaw: it fails to capture user intent. This limitation gave rise to semantic search, a search method that interprets the meaning of queries and documents. Enhanced with vector databases, semantic search becomes even more efficient.

In this article, we explored how semantic search works, along with a hands-on Python implementation using the open-source Weaviate vector database. I hope it helps!

Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and data tips via social media and writing media. Cornellius writes on a variety of AI and machine learning topics.

More On This Topic

  • Python Vector Databases and Vector Indexes: Architecting LLM Apps
  • How Semantic Vector Search Transforms Customer Support Interactions
  • An Honest Comparison of Open Source Vector Databases
  • Vector Databases in AI and LLM Use Cases
  • What are Vector Databases and Why Are They Important for LLMs?
  • A Comprehensive Guide to Pinecone Vector Databases

Zuckerberg Doesn’t Want AI to End Up Like Mobile Apps, Controlled by Apple and Google


On the sidelines of the launch of the most powerful open-source model, Llama 3, Meta chief Mark Zuckerberg spoke about his vision for AI, affirming that he doesn’t want it to end up like mobile apps – controlled by Apple and Google.

“One thing that I think generally sucks about the mobile ecosystem is that you have these two gatekeeper companies, Apple and Google, that can tell you what you’re allowed to build,” said Zuckerberg in a recent podcast.

The Concentration of Power

With the duo exercising control over new applications and models, Zuckerberg believes it is better to build a model themselves to ensure they do not end up in that position. “I don’t want any of those other companies telling us what we can build,” he said.

Openly challenging the tech biggies, Zuckerberg expressed his disdain for how Apple or Google stores exert control over what is released. He even spoke about how, in the past, when they wanted to launch new features, Apple didn’t allow it. “Nope, you’re not launching that,” Apple had apparently told Zuckerberg.

With AI in the picture, it becomes even more critical not to have a scenario that puts the power into a few hands where their closed models will control APIs and dictate what others can build with it. Zuck believes that Llama’s philosophy of open source should address this issue.

Meta is doing what OpenAI was supposed to

Source: X

“In a lot of ways, it’s a very permissive open source license, except that we have a limit for the largest companies using it,” said Zuckerberg, who clarified that the limit is not meant to prevent large companies from using it, but rather to initiate a proper partnership with them if they intend to use Meta’s product to build something and resell it to make money.

“If you’re like Microsoft Azure or Amazon, if you’re going to be reselling the model, then we should have some revenue share on that. So just come, talk to us before you go do that,” he said.

Llama 3’s open-source philosophy continues and has witnessed high adoption since its launch last week. Clem Delangue, the co-founder and CEO of Hugging Face, posted that there are almost 1,000 Llama 3 model variations on the HF platform in a matter of four days.

The open sourcing route will also allow a user’s work to be differentiated and unaffected by others. “We’ll be able to do what we do. We’ll benefit, and all the systems – ours, and the communities’ – will be better because it’s open source,” said Zuckerberg.

The thought aligns perfectly with that of Meta AI chief Yann LeCun, who urges that AI become open source. LeCun considers open-sourcing essential, as he does not want a small number of AI assistants run by a few players to dominate the digital world.

Future of AI Inference

The future mode of communication with the digital world will rely heavily on AI agents. Zuckerberg calls Meta AI the most intelligent, freely available AI assistant that people can use and hints at Meta AI’s ‘general assistant’ type of platform.

“I think that will shift from something that feels more like a chatbot, where you ask a question and it formulates an answer, to things where you’re giving it more complicated tasks, and then it goes away and does them,” he said.

Autonomous AI agents have been on the rise with a number of tech leaders and investors predicting the same. Recently, entrepreneur Vinod Khosla predicted that in another 10-15 years, internet access will be mostly done by AI agents.

He believes that most consumer access to the internet will be through agents acting for consumers, doing tasks and fending off bots. Having billions of agents on the internet will be considered normal.

While the future may not be exactly well charted, a scenario in which AI can only show up on app stores under the constraints of big tech companies is something Meta is trying to break with Llama.

Moreover, it’s not new for the Google Play Store to exercise undue control over the apps on its platform. Recently, Google delisted a number of prominent apps, including Naukri, 99acres and Bharat Matrimony, from its store, citing non-compliance with its billing policies.

The post Zuckerberg Doesn’t Want AI to End Up Like Mobile Apps, Controlled by Apple and Google appeared first on Analytics India Magazine.

India Draws Inspiration from Census To Collect Data for AI


In 1950, Jawaharlal Nehru, India’s first prime minister, initiated the National Sample Survey to gather granular data on India’s economy. In 1953, the Hindustan Times dubbed it “the biggest and most comprehensive sampling inquiry ever undertaken in any country in the world”.

Over centuries, India has demonstrated its expertise in conducting successful population censuses. The decennial census of India, too, is often regarded as one of the biggest data collection exercises in the world.

India’s census efforts involved sending trained enumerators to every household in India and collecting data based on various socio-economic parameters.

In today’s era of AI, India draws inspiration from these monumental endeavours as it gathers data to train AI models.

Collecting data to train AI models

Indian IT giant Tech Mahindra, as part of Project Indus, has developed a Hindi LLM consisting of 539 million parameters and 10 billion Hindi+ dialect tokens.

The model can take instructions in 37 different dialects of Hindi, such as Dongri (Jammu & Kashmir), Kinnauri, Kangri, Chambeli, Garhwali (Himachal), Kumaoni, Jaunsari (Uttar Pradesh), Bhojpuri, Maithili, and Magahi (Bihar), among others.

For Tech Mahindra too, the biggest challenge was data. “Despite various efforts, in India, datasets for languages other than Hindi are scarce and incomplete. Additionally, even Hindi data is fragmented,” Nikhil Malhotra, global head-Makers Lab, Tech Mahindra, told AIM.

Hence, Malhotra too sent experts to different geographies in India, especially the northern belt, where Hindi and its different dialects are predominantly spoken.

“Our team went to Madhya Pradesh, Rajasthan, and some remote areas in Bihar, and their job was also to collect data by interacting with professors and speakers of these languages,” Malhotra explained.

Likewise, in Telangana, the Swecha open-source software movement played a key role in constructing the inaugural Telugu small language model (SLM) named ‘AI Chandamama Kathalu‘ from scratch.

To collect data for the model, Swecha held datathons at different educational institutions in Telangana.

“Volunteers at Swecha collaborated with nearly 25-30 colleges, and over 10,000 students were involved in translating, correcting, and digitalising 40,000-45,000 pages of Telugu folk tales.

“Me and my R&D team and Ozonetel supported them with the graphics processing units (GPUs) to train the model,” Chaitanya Chokkareddy, co-founder and chief technology officer at Ozonetel Communications, told AIM.

Recently, Swecha also created a Telugu ASR dataset by sending volunteers to different parts of Telangana and Andhra Pradesh to speak to native speakers and collect voice samples.

The volunteers visited remote villages and schools and even collected data while on the road. Swecha gathered 1.5 million voice samples, which were trained to build a Telugu ASR model.

Similarly, under the Google-supported Project Vaani, the Indian Institute of Science (IISc) is gathering 150,000 hours of speech data spanning 773 districts across India.

To accomplish this, individuals engaged in the project are journeying to remote areas, displaying images to local residents, prompting them to describe the images, and subsequently recording their responses.

Creating employment opportunities

In India, some entrepreneurs also turn these data collection exercises into business opportunities and create rural employment. For example, Bengaluru-based Karya pays Indian citizens in rural and marginalised areas for data labelling and annotation.

“Our goal is to reach 100,000 rural Indians by the end of this fiscal year, 1.5 million rural Indians by next fiscal year and 100 million rural Indians by 2030,” Manu Chopra, co-founder at Karya previously told AIM.

Likewise, NextWealth has set up a network of ten centres across India and has assembled a workforce of nearly 5,000 individuals.

Through these centres, the company delivers a spectrum of services supporting AI/GenAI pipelines end to end, including desk-based human evaluation for complex applications, labelling and annotation of datasets, testing of outputs, and more.

India needs good Indic datasets

Although open-source datasets exist for some popular Indian languages like Hindi, and initiatives are ongoing to enhance these datasets, many languages still lack adequate datasets. This poses a significant challenge in developing LLMs for these languages.

Popular models like the Llama series by Meta or the GPT series by OpenAI are predominantly trained on large English datasets scraped from the web.

Even though we have seen Indic LLMs like Tamil LLama and Telugu LLama, they are also predominantly trained on open-source datasets available on the web.

However, there is a need to gather more data and build even better datasets. “The current volume of this type of data is relatively small; we need to collect even more,” Vivek Raghavan, co-founder of Sarvam AI, told AIM.

While efforts are already underway, for this to happen on a larger scale, an ecosystem needs to develop where different stakeholders, including researchers, startups and corporate houses, need to come together and work towards a common goal.

The post India Draws Inspiration from Census To Collect Data for AI appeared first on Analytics India Magazine.

How to Standout and Safeguard Your Job in the Generative AI Era

Image by Author

Several playbooks, roadmaps, and career tracks boast of helping you land your first job in AI or make the transition into the field. However, the automation that comes with AI advancements is also putting a lot of jobs at risk.

So, how do you build a career in AI, especially in today’s generative AI era?

Firstly, it is important to note that the fundamentals of AI are still very much needed: understanding how algorithms work and what their assumptions are, how to debug them when actual behavior deviates from expected behavior, the difference between a sample and a population, why and how to collect samples, how to conduct hypothesis tests, and more.

Time for Action

Great, so with this understanding of AI fundamentals and their significance, even in the GenAI era, let us quickly cover the roadmap to learning AI.

Starting with the foundational pillars of learning algorithms, i.e., linear algebra, calculus, statistics, and probability, you will be equipped to understand concepts such as the what, why, and how of derivatives, where they are used, and what forward and backward passes are. It will also solidify your understanding of data distributions and probability distributions such as the Gaussian and Poisson.

Most of this knowledge is available for free; the recommended go-to resources are:

  • 3Blue1Brown YouTube channel
  • Khan Academy for Statistics and Probability

Image by Author

Now, we are ready to learn machine learning concepts that would cover key algorithms including linear regression, logistic regression, decision trees, clustering, and more.

Before we proceed further, it is important to note that learning AI has become much easier in today’s times due to the democratization of education. For example, all the suggested readings in this roadmap are available for free.

In addition to developing intuition behind algorithms, learning concepts such as cost functions, regularization, optimization algorithms, and error analysis are important too.

At this point, let’s also start getting a handle on software programming. Learning to code and implement solutions enables you to get hands-on seamlessly. The 4-hour video course on Python (shown in the roadmap image) covers the fundamentals to get you started from the get-go.

Now, we are ready to learn the ropes of deep learning, focusing on fundamental concepts including layers, nodes, activation functions, backpropagation, hyperparameter tuning, etc.

Great, having learned enough, we reach the final stage, which I typically refer to as the playground. This is where you put all your knowledge to use. One excellent way is to practice and participate in Kaggle competitions, where you can also study winning solutions and develop an approach to handling varied business problems.

AI Workflows

This is a typical path to learning AI. Along the way, one internalizes AI workflows that start with data exploration, i.e., dissecting data to understand the patterns underneath. It is during this phase that data scientists learn which data transformations will prepare the data for modeling.

Image by Author

Feature selection and engineering are the most powerful skills of distinguished data scientists. This step, if done right, can accelerate the model’s learning process.

Now comes the time every data scientist looks forward to: building models and selecting the best-performing one. “Best-performing” is defined through evaluation metrics, which are of two types: scientific metrics like precision, recall, and mean squared error, and business metrics like increases in clicks, conversions, or dollar-value impact.
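As a quick refresher on the scientific metrics mentioned above, precision and recall for a binary classifier can be computed by hand (the labels below are made up for illustration):

```python
# Toy ground-truth labels and model predictions (1 = positive class).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Count true positives, false positives, and false negatives.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
print(precision, recall)    # 0.75 0.75
```

Libraries like scikit-learn compute these for you, but working through the counts once makes the trade-off between the two metrics much easier to internalize.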

Reaching this stage while reading an article looks like an easy process, but in practice, it is an extensive process.

Differentiator

So far, we have discussed the conventional path, learning what everyone else is doing. But where is the differentiator that will set you apart in the GenAI era?

One prevalent notion learners have is to keep consuming learning content. While studying fundamentals is important, it is equally important to start practicing and experimenting to build an intuitive understanding of the learned concepts.

Also, a crucial component of building AI solutions is knowing whether AI is the right fit, which includes the ability to map the business problem to the correct technical solution. If this first step is done wrong, one cannot expect the implemented solution to meet business objectives in a meaningful way.

Image by Author

Further, data science is seen as a technical role, but in practice its success depends a lot on an often underrated skill: collaborating with stakeholders. Bringing stakeholders from varied backgrounds and areas of expertise on board plays a key role.

Even if the model shows good results, it may not be adopted due to a lack of clarity and an inability to link those results to business outcomes. This gap can be addressed with effective communication skills.

Lastly, be data-first in your approach to AI. The success of any AI model depends on the data. Also, find AI champions who believe in the capabilities and possibilities of AI while understanding the associated risks.

With these skills on your side, I wish you a stellar career in AI.

Vidhi Chugh is an AI strategist and a digital transformation leader working at the intersection of product, sciences, and engineering to build scalable machine learning systems. She is an award-winning innovation leader, an author, and an international speaker. She is on a mission to democratize machine learning and break the jargon for everyone to be a part of this transformation.

More On This Topic

  • How to Get Hired as Data Scientist in the GPT-4 Era
  • The AI Transformation Strategy in the GenAI Era
  • Job Trends in Data Analytics: NLP for Job Trend Analysis
  • How Generative AI Can Help You Improve Your Data Visualization Charts
  • Will Your Job be Replaced by a Machine?
  • 5 Tips to Get Your First Data Scientist Job

Sakana AI Releases Japanese DALLE-3, Calls it  EvoSDXL-JP

Japanese AI startup Sakana AI introduced EvoSDXL-JP, an image generation model built via Evolutionary Model Merge that delivers 10x faster image generation for Japanese styles. EvoSDXL-JP is now publicly available on the HuggingFace platform for research and educational purposes, accompanied by an accessible demo for immediate testing.

The model supports Japanese and generates Japanese-style images by fusing different open models. Compared to the existing Japanese model, its inference speed is 10 times faster, yet it shows better performance on benchmarks, the company said in its blog post.

EvoSDXL-JP is capable of high-speed, low-cost image generation, making it an easy model to try for experiencing generative AI. The company said it expects the model to be used in educational settings in Japan so that more people can enjoy the benefits of generative AI.

Sakana AI recently introduced an innovative model construction approach using evolutionary algorithms called “Evolutionary Model Merge.” The company says evolutionary model merging is not limited to specific modalities and can, in principle, be applied to models of any modality.

Furthermore, the company has released EvoLLM-JP, a large-scale Japanese language model, and EvoVLM-JP, a vision-language model, both constructed through Evolutionary Model Merge. These models are based on autoregressive Transformer models designed for language generation.

EvoLLM-JP was made by merging a large-scale Japanese language model (LLM) with an LLM for mathematics, and was found to be strong not only in mathematics but also in overall Japanese ability.

In addition, EvoVLM-JP, made by merging a Japanese LLM with a vision-language model (VLM), can respond with knowledge of Japanese culture and achieved the best results on benchmarks using Japanese images and Japanese text.

The post Sakana AI Releases Japanese DALLE-3, Calls it EvoSDXL-JP appeared first on Analytics India Magazine.

Alibaba Launches LLM-R2 to Optimise SQL Query Efficiency


Researchers at Nanyang Technological University, Singapore University of Technology and Design, and Alibaba‘s DAMO Academy recently introduced LLM-R2, a rule-based query rewrite system enhanced with an LLM, to significantly boost SQL query efficiency.

The core objective of query rewrite is to transform an SQL query into a new format that maintains the original results while executing more efficiently. This involves three key criteria: executability, equivalence, and efficiency. Traditional query rewrite systems heavily rely on predefined rules and are often constrained by the computational limitations and inaccuracies of DBMS cost estimators.
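To illustrate what an equivalence-preserving rewrite looks like, here is a toy example using Python's built-in sqlite3 module (an illustration of the concept, not the LLM-R2 system itself): the classic IN-to-EXISTS rule rewrites a subquery into a different form that returns exactly the same rows.

```python
import sqlite3

# Tiny in-memory schema with hypothetical data for the illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers(id INTEGER, country TEXT);
CREATE TABLE orders(id INTEGER, customer_id INTEGER);
INSERT INTO customers VALUES (1, 'AU'), (2, 'US');
INSERT INTO orders VALUES (10, 1), (11, 2), (12, 1);
""")

# Original query: uncorrelated IN subquery.
q_original = """
SELECT id FROM orders
WHERE customer_id IN (SELECT id FROM customers WHERE country = 'AU')
"""

# Rewritten query: semantically equivalent correlated EXISTS form,
# which some query planners can execute more efficiently.
q_rewritten = """
SELECT o.id FROM orders o
WHERE EXISTS (SELECT 1 FROM customers c
              WHERE c.id = o.customer_id AND c.country = 'AU')
"""

r1 = sorted(row[0] for row in conn.execute(q_original))
r2 = sorted(row[0] for row in conn.execute(q_rewritten))
print(r1 == r2)  # True: both forms return the same rows
```

Both queries are executable and return identical results, satisfying the executability and equivalence criteria; whether the rewrite actually runs faster depends on the engine and data, which is the efficiency question LLM-R2 learns to predict.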

LLM-R2 addresses these challenges by integrating an LLM to suggest optimal rewrite rules for SQL queries, which are then applied using an existing database platform. This allows the rewritten queries to maintain their executability and accuracy while improving efficiency.

One of the key advancements of LLM-R2 is its use of contrastive learning models that help in refining the selection of rewrite rules by understanding the structure and context of each query. This allows LLM-R2 to adapt and apply the most appropriate optimizations, leading to a significant reduction in query execution times across various datasets.

This method has proven to significantly cut down query execution times across various datasets including TPC-H, IMDB, and DSB, demonstrating improvements over both traditional rule-based methods and other LLM-based systems.

The results show that LLM-R2 can reduce the execution time of SQL queries to about 52.5% of that of the original queries on average, and to about 40.7% of that of state-of-the-art methods. This performance boost is especially pronounced in complex queries, where traditional methods often struggle to make effective improvements.

The researchers acknowledge that the main limitation is higher rewrite latency compared to DB-only methods, because calling the LLM API and selecting demonstrations takes more time than traditional DB methods.

Despite this delay, the benefits are clear: LLM-R2 greatly reduces query execution time, making the system very effective overall. This shows that LLM-enhanced methods can be an effective solution for efficiency-oriented query rewriting.

The post Alibaba Launches LLM-R2 to Optimise SQL Query Efficiency appeared first on Analytics India Magazine.

5 Free Stanford University Courses to Learn Data Science

Image by Author

Learning data science has never been more accessible. If you’re motivated, you can teach yourself data science—for free—with the courses from elite universities across the world.

We've put together this list of free courses from Stanford University to help you learn all the essential data science skills:

  • Programming fundamentals
  • Databases and SQL
  • Machine Learning
  • Working with large datasets

So start learning today to achieve your learning goals and kickstart your data career. Now let’s go over these courses.

1. Programming Methodology

To get started with data science, it is important to build foundations in a programming language like Python. The Programming Methodology class teaches Python programming from the ground up and does not assume any previous programming experience.

In this course, you’ll learn problem solving with Python while becoming familiar with the features of the language. You’ll start with the basics such as variables and control flow and then learn about built-in data structures like lists and dictionaries.

Along the way, you’ll also learn how to work with images and explore object-oriented programming and memory management in Python.

Link: Programming Methodology

2. Databases

A strong understanding of databases and SQL is important to succeed in any data career. You can take the popular databases course by Prof. Jennifer Widom as a series of five self-paced courses on edX.

Note: You can audit the course and access all course contents for free.

If you are new to databases, take the first course covering the basics of relational databases before you proceed to the courses on more advanced topics. By working through the series of courses, you’ll learn:

  • Relational databases and SQL
  • Query performance
  • Transactions and concurrency control
  • Database constraints, triggers, views
  • OLAP cubes, star schema
  • Database modeling
  • Working with semi-structured data like JSON and XML

Links to the courses:

  1. Databases: Relational Databases and SQL
  2. Databases: Advanced Topics in SQL
  3. Databases: OLAP and Recursion
  4. Databases: Modeling and Theory
  5. Databases: Semistructured Data

3. Machine Learning

As a data scientist, you should be able to analyze data using Python and SQL and answer business questions. But sometimes you may also need to build predictive models, which is why learning machine learning is helpful.

Machine Learning, or CS229: Machine Learning at Stanford University, is one of the most popular and highly recommended ML courses. You’ll learn everything you’d typically cover in a semester-long university course. The course covers the following topics:

  • Supervised learning
  • Unsupervised learning
  • Deep learning
  • Generalization and regularization
  • Reinforcement learning and control

Link: Machine Learning

4. Statistical Learning with Python

An Introduction to Statistical Learning with Applications in Python (or ISL with Python) is the Python edition of the popular ISLR book on statistical learning.

The Statistical Learning with Python course covers all the contents of the ISL with Python book. So you’ll learn essential tools for data science and statistical modeling. Here is an overview of important topics that this course covers:

  • Linear regression
  • Classification
  • Resampling
  • Linear model selection
  • Tree-based methods
  • Unsupervised learning
  • Deep learning

Link: Statistical Learning with Python

5. Mining Massive Data Sets

Mining Massive Data Sets is a course focusing on data mining and machine learning algorithms for working with and analyzing massive datasets.

To make the most out of this course you should be comfortable with programming, preferably with Java or Python. You should also be familiar with math: probability and linear algebra. If you’re a beginner, consider working through the courses mentioned earlier before you take this one.

Here are some topics this course covers:

  • Nearest neighbor search in high-dimensional space
  • Locality Sensitive Hashing (LSH)
  • Dimensionality reduction
  • Large-scale supervised machine learning
  • Clustering
  • Recommendation systems
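One of these topics can be sketched in a few lines. Below is a minimal, illustrative implementation (all names and parameters are my own, not from the course) of random-hyperplane LSH for cosine similarity: vectors pointing in the same direction get identical bit signatures, while opposite vectors disagree on every bit.

```python
# A minimal sketch of random-hyperplane LSH for cosine similarity.
# Each hash bit records which side of a random hyperplane a vector
# falls on; similar directions produce similar bit signatures.
import random

random.seed(0)
DIM, BITS = 8, 16
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(BITS)]

def lsh_signature(v):
    """One bit per hyperplane: the sign of the dot product with v."""
    return tuple(
        int(sum(p * x for p, x in zip(plane, v)) >= 0) for plane in planes
    )

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

base = [random.gauss(0, 1) for _ in range(DIM)]
same_dir = [2 * x for x in base]   # cosine similarity 1 (scale-invariant)
opposite = [-x for x in base]      # cosine similarity -1

d_near = hamming(lsh_signature(base), lsh_signature(same_dir))
d_far = hamming(lsh_signature(base), lsh_signature(opposite))
print(d_near, d_far)  # → 0 16
```

Bucketing items by signature turns nearest-neighbor search over millions of vectors into a handful of bucket lookups, which is the core trick the course develops in depth.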

You can use the Mining Massive Datasets book as a companion to this course. The book is also accessible for free online.

Link: Mining Massive Data Sets

Wrapping Up

This compilation of free courses from Stanford University should help you learn almost everything you need if you ever want to explore data science.

If you’re looking for university courses to learn Python and data science for free, here are a couple of articles you may find helpful:

  • 5 Free University Courses to Learn Python
  • 5 Free University Courses to Learn Data Science

Happy learning!

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.

More On This Topic

  • Learn Probability in Computer Science with Stanford University for FREE
  • A Collection Of Free Data Science Courses From Harvard, Stanford,…
  • 5 Free University Courses to Learn Data Science
  • 5 Free University Courses to Learn Computer Science
  • 5 Free University Courses to Learn Databases and SQL
  • 5 Free University Courses to Learn Python

Hugging Face Already has 1000s of Llama 3 Models – and Counting

Last week, Meta released early versions of its latest large language model, Llama 3, and the reception has been huge. Clem Delangue, co-founder and CEO of Hugging Face, mentioned in a post that, with over 1,000 Llama 3 model variations already shared publicly on Hugging Face, there would be 10,000 variants available by the next weekend.

Llama 3 model variations, Source: LinkedIn

This new model includes an image generator that can update pictures in real time as users type prompts. Meta has released two versions of Llama 3 – one with 8 billion parameters and another with 70 billion parameters.

Meta claims both sizes of Llama 3 beat similarly sized models like Google’s Gemma and Gemini, Mistral 7B, and Anthropic’s Claude 3 on certain benchmarking tests.

The claim, made in a Reddit conversation, that Llama 3’s 8B instruct model outperforms Llama 2’s 70B instruct model on benchmarks is quite remarkable.

The tokenizer vocabulary in Llama 3 has quadrupled from 32,000 tokens (Llama 2) to 128,000. With a larger vocabulary, Llama 3 can compress sequences more efficiently, producing roughly 15% fewer tokens and delivering better downstream performance.
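To see why a larger vocabulary yields fewer tokens, consider a toy greedy tokenizer (not the actual Llama tokenizers; the vocabularies and text below are invented for illustration). Longer vocabulary entries swallow more characters per token:

```python
# A toy greedy longest-match tokenizer showing why a larger vocabulary
# encodes the same text into fewer tokens.

def tokenize(text, vocab):
    """Greedily match the longest vocabulary entry at each position."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # longest candidate first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character: emit it alone
            i += 1
    return tokens

small_vocab = {"th", "er", "in", "an", " "}  # a few short merges
large_vocab = small_vocab | {"token", "izer", "the ", "compress"}

text = "the tokenizer can compress"
print(len(tokenize(text, small_vocab)))  # → 23 tokens
print(len(tokenize(text, large_vocab)))  # → 8 tokens
```

Fewer tokens per sequence means more text fits in the context window and fewer forward passes per generation, which is where the downstream gains come from.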

Andrej Karpathy, former director of AI at Tesla, expressed support in his post for releasing base and fine-tuned models at both the 8B and 70B sizes. He also highlighted the need for smaller models, particularly for educational purposes, unit testing, and potentially for embedded applications.

Congrats to @AIatMeta on Llama 3 release!! 🎉https://t.co/fSw615zE8S
Notes:
Releasing 8B and 70B (both base and finetuned) models, strong-performing in their model class (but we'll see when the rankings come in @ @lmsysorg :))
400B is still training, but already encroaching…

— Andrej Karpathy (@karpathy) April 18, 2024

Karpathy also spoke about the limitations. While the increase in sequence length is a step in the right direction, he noted that it still falls short of industry-leading standards: “The maximum number of tokens in the context window was bumped up to 8192… quite small w.r.t. modern standards.”

Beyond the limitations, Perplexity AI CEO Arvind Srinivas, said, “One thing that impresses me most about Llama 3 is how did they pack so much knowledge and reasoning into a dense 8b and a 70b so well, when everyone else has been scaling sparse MoEs.

This still doesn’t mean having a lot of GPUs is not important. [It’s] probably even more important, considering how many sweeps one has to run to get the right data mixes.”

Pratik Desai, the founder of Kissan AI, released Dhenu Llama 3, fine-tuned on Llama3 8B. “It is available for anyone to tinker with and provide feedback. Feel free to host and share if you have a spare GPU. We will have an instruction version with a dataset five times larger in the near future,” wrote Desai on X.

Meanwhile, Llama 3 is now available to developers via GroqChat and GroqCloud™, with Groq serving Llama 3 8B at 876 tokens/s – which the company claims is the fastest benchmarked speed of any model.

“It is like a GPT-4-level chatbot, available to use completely free, running at over 800 tokens per second on Groq,” says Rowan Cheung, founder of the AI newsletter The Rundown AI.

Brian Roemmele posted that, with Groq producing 800 tokens per second on Llama 3, this portends new use cases where multiple actions take place under a local AI agent.

Going beyond Llama 3

Meta’s chief AI scientist, Yann LeCun revealed that even more powerful language models are currently under development. LeCun noted that the most advanced Llama model, with over 400 billion parameters, is undergoing training.

The newly unveiled AI models are set to be integrated into Meta’s virtual assistant, Meta AI, which the company claims is the most advanced among its free-to-use counterparts.

NVIDIA’s Jim Fan said that the upcoming Llama-3 400B+ will mark the watershed moment when the community gains open-weight access to a GPT-4-class model. Further, he said that it will change the calculus for many research efforts and grassroots startups.

“I pulled the numbers on Claude 3 Opus, GPT-4, and Gemini. Llama 3 400B is still training and will hopefully get even better in the next few months,” he added, saying that there is so much research potential that can be unlocked with such a powerful backbone.

Expect a surge in builder energy across the ecosystem!

The post Hugging Face Already has 1000s of Llama 3 Models – and Counting appeared first on Analytics India Magazine.

‘Many Indian VCs Don’t Even have a Thesis on Deep-tech Investment’

‘Many Indian VCs Don’t Even have a Thesis on Deep-tech Investment’

Even though the global AI hype is creating an ocean of funds, Indian investors are focused on the precious few droplets. In an exclusive interview with AIM, Vishnu Vardhan, founder & CEO of SML and Vizzhy, which is the creator of Hanooman, said that most Indian investors are not ready to spend money on research and deep tech startups.

“Many VCs do not even have a thesis on how to invest in deep tech,” said Vardhan, referring to the country’s ill-informed deep tech investors.

Citing Zepto, Dunzo, and other startups that have run without profits for a long time, Vardhan said: “People are happy losing money there, but they don’t want to lose money here [in AI startups].”

He also said India has created 125 unicorns, but none of them is great tech. “They are all business ideas and consumer apps,” he said.

Vardhan narrated a story about meeting a deep-tech investor who said his ticket size was only $2 million, which is minuscule compared to the investment required to do AI research. “I need at least INR 100 crore to set up a lab in India,” he added.

Once, while discussing an investment in the medical field, an investor asked Vardhan why he needed to set up a lab. “Why don’t you treat 100 patients and tell me how much money you make?” the investor told Vardhan. After discussing it with his sister, Vardhan laughingly added, he concluded that raising money in the US is better, as Indian investors do not understand deep tech.

How true is this?

Amit Sheth, the chair and founding director of the Artificial Intelligence Institute at the University of South Carolina (AIISC), also believes that India needs far more investment in AI than it currently sees. “VCs really don’t take as much risk as we expect/hope,” said Sheth.

“Most VCs do not understand the technology in depth and don’t take risks with undeveloped markets where revenue and payoff are further off,” he added. “It is easy for them to understand consumer tech; most run after the fad and buy into the hype,” he said, highlighting that big-tech companies like Microsoft or Google would be much happier to take the risk than VCs in India.

According to Rajan Anandan, managing director at Peak XV Partners, VCs are sitting on a total of $20 billion cash to invest in Indian startups, and the focus is currently on AI.

Arjun Rao, partner at Speciale Invest, which invests in deep tech startups, believes otherwise. “You could say there’s less investment when compared to Silicon Valley, but that is because Silicon Valley has much deeper pockets,” Rao told AIM, highlighting that VCs in India are only focused on investing in generative AI.

Rao explained that a lot of this is because there is no concrete exit strategy yet for generative AI, as the investment scene in India is still just 1-2 years old. “India is still a young ecosystem. It is an unfair comparison,” said Rao. “We are moving at a very fast pace, which is the most important thing so we can catch up.”

Another great example is Khosla Ventures. The firm has invested in Upliance AI and Sarvam AI, two AI startups in India, and is bullish on AI globally.

“We believe AI has the power to disrupt numerous economic models and change the way we lead our daily lives over the coming years. We invest in deep tech and invest where we can be early, bold and impactful,” believes Khosla Ventures.

It is also the startups’ fault

A few weeks ago, Gaurav Aggarwal, a former Google Research employee building an AI startup called Ananas Labs in India, put forth a similar opinion. He said that VCs are not ready to put money into deep-tech startups and are only interested in OpenAI wrappers and so-called consumer-tech startups.

Arguably, Aggarwal’s point of view makes sense. India’s generative AI scene is on an upswing, but investors are cautious about backing research startups. If India is to focus on foundational research as well, there need to be investors willing to fund deep-tech research, which is clearly not the case today.

Initiatives built from scratch are rare, and the lack of VC interest in them reflects a limited understanding of the field, which, fortunately, is slowly changing as well.

On the other hand, investors are wary as well. Several early AI unicorns have disappeared or are running dry on funding. The fates of Jasper, Stability AI, and the much-discussed Instoried debacle have made VCs tread with caution. However, there is clearly a need for deeper pockets willing to invest in real AI startups and take bigger risks.

The post ‘Many Indian VCs Don’t Even have a Thesis on Deep-tech Investment’ appeared first on Analytics India Magazine.

PhysDreamer Study Reveals Breakthrough in Video Generation for Dynamic 3D Object Interactions

Researchers from MIT, Stanford University, Columbia University, and Cornell University have developed a new framework called PhysDreamer. This system allows static 3D objects to interact dynamically and realistically within a virtual environment, based on their physical properties like stiffness.

PhysDreamer works by using video generation models to predict how objects will respond to different physical interactions, such as being pushed or manipulated.

“By distilling these priors, PhysDreamer enables the synthesis of realistic object responses to novel interactions, such as external forces or agent manipulations,” the authors of the paper wrote.

They demonstrate their approach on various elastic objects like flowers, plants, a hat and a telephone cord.

This method is distinct because it accurately incorporates the material properties of objects into its predictions, which is a significant advancement over previous techniques that did not consider these details.

In experiments, PhysDreamer demonstrated its ability to generate realistic movements of various elastic objects. It was shown to outperform existing methods significantly, providing a more immersive and engaging experience in virtual simulations.

“PhysDreamer takes a step towards more engaging and realistic virtual experiences by enabling static 3D objects to dynamically respond to interactive stimuli in a physically plausible manner,” the authors concluded.

In comparison, OpenAI’s Sora also focuses on generating realistic video, but without explicit physics modeling. Sora is a large text-conditional diffusion model trained on both videos and images at scale.

It can generate high-fidelity videos up to a minute long with consistent 3D motion and long-range coherence. However, Sora does not aim to simulate accurate physical interactions or estimate material properties like PhysDreamer does.

Yann LeCun, VP & Chief AI Scientist at Meta, pointed out that while technologies like Sora are groundbreaking for video generation, they might not be optimal for understanding deep video representations or simulating real-world physics.

PhysDreamer opens new possibilities for applications in virtual reality, gaming, and simulations, promising more realistic and interactive user experiences.

The post PhysDreamer Study Reveals Breakthrough in Video Generation for Dynamic 3D Object Interactions appeared first on Analytics India Magazine.