
Window Functions for Advanced Data Analysis

Window functions are an advanced SQL feature that provides powerful tools for detailed data analysis and manipulation without grouping data into single output rows, as aggregate functions do. These functions operate on a set of rows and return a value for each row based on a calculation against that set.

In this article, we delve into window functions in SQL Server. You will learn how to apply various window functions, including moving averages, ranking, and cumulative sums, to achieve comprehensive analytics on data sets.

You will also see how to partition and filter data using the window functions.

Finally, you will study some best practices to follow and pitfalls to avoid when working with window functions, the kinds of topics typically covered in more advanced SQL workshops available online.

Note: We will use the Microsoft Pubs database as an example to execute various window function queries.

Understanding window functions

Window functions perform calculations across sets of rows related to the current row. Unlike standard aggregate functions, they do not collapse those rows into a single result. This capability is crucial for running totals, moving averages, and cumulative statistics, which are invaluable for time-series analysis, financial data, inventory management, and more.

With window functions, you can specify a “window” of rows related to the current row over which SQL Server performs a calculation. You can define this window using clauses like OVER, PARTITION BY, and ORDER BY.

Basic syntax

The basic syntax for a window function is:

{function_name}() OVER (
[PARTITION BY column_name]
[ORDER BY column_name [ASC|DESC]]
)

Each part of the syntax has a specific purpose:

  • {function_name}(): This is the window function you want to apply. SQL supports various window functions such as SUM(), AVG(), COUNT(), RANK(), ROW_NUMBER(), and more. These functions can compute values over a specified range of rows.
  • OVER: This keyword defines the window over which SQL Server executes the function. It signifies the start of the window specification.
  • PARTITION BY: Divides the data into partitions (or groups) to which the function is applied. If you don’t include the PARTITION BY clause, all the rows will be treated as a single partition.
  • ORDER BY: Defines the order of data within each partition.

Practical scenarios using window functions

Let’s explore some practical scenarios using window functions on the Microsoft Pubs database. We will look into calculating moving averages, ranking, and cumulative sums.

Calculating moving averages for sales quantities

Moving averages smooth out data series and are commonly used to understand trends.
Let’s calculate a moving average for the sales quantities in the sales table of the Pubs database.

USE pubs;
SELECT ord_num, ord_date, qty,
AVG(qty) OVER (ORDER BY qty ROWS UNBOUNDED PRECEDING) AS MovingAvgQty
FROM sales;

Output:


In the above query, we use the AVG window function to calculate the moving average of the qty column. ROWS UNBOUNDED PRECEDING means the average is calculated over all preceding rows up to and including the current row.

You can also calculate the moving average for a specific number of previous rows.

For example, the following script returns the moving average over the previous two rows and the current row. Notice that we cast the qty column to a floating-point type so the average is not truncated to an integer.

USE pubs;
SELECT ord_num, ord_date, qty,
AVG(CAST(qty AS float)) OVER (ORDER BY qty ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS MovingAvgQty
FROM sales;

Output:


Ranking Sales Data by Price

Ranking can help in comparing items, like listing products by sales price.

Let’s see an example where we rank the total sale price for each sale. We will first join the sales and titles tables. Next, we will calculate the total sale price for each record by multiplying the qty column from sales with the price column from titles. Finally, we will use the RANK function to rank all the records in descending order of the total sale price. This tells you which sale made the most money.

SELECT
S.ord_num,
S.ord_date,
S.qty,
T.title,
S.qty * T.price AS TotalSalePrice,
RANK() OVER (ORDER BY S.qty * T.price DESC) AS SalesRank
FROM
sales S
JOIN
titles T ON S.title_id = T.title_id;

Output:


Cumulative Sums of Total Sales Price

Cumulative sums are useful for running totals, which can be essential for inventory or account balance tracking.

For example, let’s calculate the cumulative total sale prices for all the rows in the sales table. As we did previously, we will join the sales and titles tables to calculate the total sale price for each row.

Next, you can use the SUM window function to calculate the cumulative sales, ordering the rows by the ord_date column. This returns the cumulative sales by date.

USE pubs;
SELECT
S.ord_num,
S.ord_date,
S.qty,
T.title,
S.qty * T.price AS TotalSalePrice,
SUM(S.qty * T.price) OVER (ORDER BY s.ord_date ROWS UNBOUNDED PRECEDING) AS CumulativeSales
FROM
sales S
JOIN
titles T ON S.title_id = T.title_id;

Output:


Partitioning and Filtering with Window Functions

You can partition and filter records in a window function using the PARTITION BY clause and a CASE expression.

Partitioning with PARTITION BY Clause

You can use the PARTITION BY clause in conjunction with window functions. This allows you to apply window functions separately for each partition.

For example, the following query returns cumulative prices for various title types in the titles table.

USE pubs;
SELECT
title_id,
title,
type,
price,
SUM(price) OVER (PARTITION BY type ORDER BY price ROWS UNBOUNDED PRECEDING) AS CumulativePriceByType
FROM
titles;

Output:


In the above output, cumulative prices are calculated separately for each title type.

Filtering with CASE Expression

You can use a CASE expression inside a window function to filter records before the window function is applied.

For example, you can use the following query containing a CASE expression if you want to include only titles priced above $10 in the cumulative sum:

SELECT
title_id,
title,
type,
price,
SUM(CASE WHEN price > 10 THEN price ELSE 0 END)
OVER (PARTITION BY type ORDER BY price ROWS UNBOUNDED PRECEDING) AS CumulativePriceByType
FROM
titles
ORDER BY
type,
price;

Output:


Best Practices and Common Pitfalls When Using Window Functions

Let’s now discuss some of the best practices and common pitfalls to avoid when using window functions in SQL Server.

Best practices

  • Indexing for Performance: Ensure columns used in ORDER BY and PARTITION BY are indexed to improve query performance, especially with large datasets.
  • Use PARTITION BY Judiciously: Use PARTITION BY thoughtfully. Overpartitioning, especially by columns with high cardinality, can reduce performance. Balance meaningful data segmentation with efficiency.
  • Limit Window Frames: Use specific boundaries like ROWS or RANGE to limit window sizes instead of defaulting to UNBOUNDED PRECEDING, which improves performance by reducing the number of rows processed.

Common pitfalls

  • Ignoring NULL Values: Window functions include NULL values by default. To ensure accuracy, exclude or handle NULLs as necessary.
  • Forgetting to Order Data: Omitting ORDER BY can yield incorrect results since the order of rows affects calculations like running totals or moving averages.
  • Performance Issues: Be mindful of potential performance issues with large datasets or complex queries. Check execution plans to identify and mitigate bottlenecks.

Conclusion

Window functions in SQL Server are indispensable tools for anyone seeking to perform sophisticated data analysis without the constraints of traditional aggregate functions. Their ability to operate over a set of rows and dynamically compute values makes them essential for various applications—from financial modelling and time-series analysis to inventory management.

In this article, you saw how SQL window functions work with the help of different practical scenarios. You also learned how to partition and filter data using window functions, as well as the best practices to follow and pitfalls to avoid when using them.

Level Up with DataCamp’s New Azure Certification

Azure with DataCamp
Image by Author

When you decided to become a data professional, you were aware of one thing: learning never stops. It can be hard to keep up with learning new things in the market or upskilling to ensure you remain competitive.

KDnuggets is here to help you with that journey.

We want to introduce DataCamp's new Azure learning track where you can get certified and start your new journey.

Should I Learn Azure?

In today's market, Azure certifications and knowledge are some of the most sought-after in the tech industry. If you are looking to pursue a career in cloud computing, infrastructure, and scalability in large enterprises — you need to learn Azure.

But why?

If you look at the current market, more start-ups are entering, and they all need on-demand IT resources. This is where cloud computing becomes an asset that every start-up needs! And with that being said — they need the right professionals to handle the tasks.

Gaining an Azure certificate will not only increase your knowledge and skillset, but will also make you highly competitive in a market that has a high demand for said professionals but little supply.

You know where I’m going with this right?

Job security and high pay!

DataCamp's Azure Certification

Link to Certification: DataCamp's Azure Certification

This certification, which has been co-created with Microsoft, allows you to start from the beginning. You will master the basic concepts of cloud computing, including public, private, and hybrid cloud models, whilst diving into Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Software-as-a-Service (SaaS).

Once you have a good foundational understanding of Azure and its architectural components and services, such as computing, networking, and storage, you will then move on to learning about tools used to secure, govern, and administer Azure.

You will learn all of this without having to code anything.

Once you have a good understanding of everything about Azure and have it all under your belt, your next step will be to complete the AZ-900 certification — for which you can get 50% off with DataCamp!

And just like that — you’re certified!

Wrapping it Up

As a data professional, you should always be looking for new ways to upskill and broaden your knowledge. You want to be competitive in today's market, especially when we’re currently facing a lot of layoffs.

Figure out what you aspire to be and get learning!

Nisha Arya is a data scientist, freelance technical writer, and an editor and community manager for KDnuggets. She is particularly interested in providing data science career advice or tutorials and theory-based knowledge around data science. Nisha covers a wide range of topics and wishes to explore the different ways artificial intelligence can benefit the longevity of human life. A keen learner, Nisha seeks to broaden her tech knowledge and writing skills, while helping guide others.

More On This Topic

  • What’s New in SAS Certification?
  • Hone Your Data Skills With Free Access to DataCamp
  • Become Data-Driven Faster with DataCamp’s Analyst Takeover
  • Black Friday Deal — Master Machine Learning for Less with DataCamp
  • Experience the Joy of Data with DataCamp
  • Get World-class Data Science Learning with DataCamp at 25% off

Investors are growing increasingly wary of AI

By Kyle Wiggers

After years of easy money, the AI industry is facing a reckoning.

A new report from Stanford’s Institute for Human-Centered Artificial Intelligence (HAI), which studies AI trends, found that global investment in AI fell for the second year in a row in 2023.

Both private investment — that is, investments in startups from VCs — and corporate investment — mergers and acquisitions — in the AI industry were on the downswing in 2023 versus the year prior, according to the report, which cites data from market intelligence firm Quid.

AI-related mergers and acquisitions fell from $117.16 billion in 2022 to $80.61 billion in 2023, down 31.2%; private investment dipped from $103.4 billion to $95.99 billion. Factoring in minority stake deals and public offerings, total investment in AI dropped to $189.2 billion last year, a 20% decline compared to 2022.

Yet some AI ventures continue to attract substantial tranches, like Anthropic’s recent multibillion-dollar investment from Amazon and Microsoft’s $650 million acquisition of Inflection AI. And more AI companies are receiving investments than ever before, with 1,812 AI startups announcing funding in 2023, up 40.6% versus 2022, according to the Stanford HAI report.

So what’s going on?

Gartner analyst John-David Lovelock says that he sees AI investing “spreading out” as the largest players — Anthropic, OpenAI and so on — stake out their ground.

“The count of billion-dollar investments has slowed and is all but over,” Lovelock told TechCrunch. “Large AI models require massive investments. The market is now more influenced by the tech companies that’ll utilize existing AI products, services and offerings to build new offerings.”

Umesh Padval, managing director at Thomvest Ventures, attributes the shrinking overall investment in AI to slower-than-expected growth. The initial wave of enthusiasm, he says, has given way to the reality that AI is beset with challenges — some technical, some go-to-market — that will take years to address and fully overcome.

“The deceleration in AI investing reflects the recognition that we’re still navigating the early phases of the AI evolution and its practical implementation across industries,” Padval said. “While the long-term market potential remains immense, the initial exuberance has been tempered by the complexities and challenges of scaling AI technologies in real-world applications … This suggests a more mature and discerning investment landscape.”

Other factors could be afoot.

Greylock partner Seth Rosenberg contends that there’s simply less appetite to fund “a bunch of new players” in the AI space.

“We saw a lot of investment in foundation models during the early part of this cycle, which are very capital intensive,” he said. “Capital required for AI applications and agents is lower than other parts of the stack, which may be why funding on an absolute dollar basis is down.”

Aaron Fleishman, a partner at Tola Capital, says that investors might be coming to the realization that they’ve been too reliant on “projected exponential growth” to justify AI startups’ sky-high valuations. To give one example, AI company Stability AI, which was valued at over $1 billion in late 2022, reportedly brought in just $11 million in revenue in 2023 while spending $153 million on operating expenses.

“The performance trajectories of companies like Stability AI might hint at challenges looming ahead,” Fleishman said. “There’s been a more deliberate approach by investors in evaluating AI investments compared to a year ago. The rapid rise and fall of certain marquee name startups in AI over the past year has illustrated the need for investors to refine and sharpen their view and understanding of the AI value chain and defensibility within the stack.”

“Deliberate” seems to be the name of the game now, indeed.

According to a PitchBook report compiled for TechCrunch, VCs invested $25.87 billion globally in AI startups in Q1 2024, up from $21.69 billion in Q1 2023. But the Q1 2024 investments spanned across only 1,545 deals compared to 1,909 in Q1 2023. Mergers and acquisitions, meanwhile, slowed from 195 in Q1 2023 to 176 in Q1 2024.

Despite the general malaise within AI investor circles, generative AI — AI that creates new content, such as text, images, music and videos — remains a bright spot.

Funding for generative AI startups reached $25.2 billion in 2023, per the Stanford HAI report, nearly ninefold the investment in 2022 and about 30 times the amount from 2019. And generative AI accounted for over a quarter of all AI-related investments in 2023.

Samir Kumar, co-founder of Touring Capital, doesn’t think that the boom times will last, however. “We’ll soon be evaluating whether generative AI delivers the promised efficiency gains at scale and drives top-line growth through AI-integrated products and services,” Kumar said. “If these anticipated milestones aren’t met and we remain primarily in an experimental phase, revenues from ‘experimental run rates’ might not transition into sustainable annual recurring revenue.”

To Kumar’s point, several high-profile VCs, including Meritech Capital — whose bets include Facebook and Salesforce — TCV, General Atlantic and Blackstone, have steered clear of generative AI so far. And generative AI’s largest customers, corporations, seem increasingly skeptical of the tech’s promises, and whether it can deliver on them.

In a pair of recent surveys from Boston Consulting Group, about half of the respondents — all C-suite executives — said that they don’t expect generative AI to bring about substantial productivity gains and that they’re worried about the potential for mistakes and data compromises arising from generative AI-powered tools.

But whether skepticism and the financial downtrends that can stem from it are a bad thing depends on your point of view.

For Padval’s part, he sees the AI industry undergoing a “necessary” correction to “bubble-like investment fervor.” And, in his belief, there’s light at the end of the tunnel.

“We’re moving to a more sustainable and normalized pace in 2024,” he said. “We anticipate this stable investment rhythm to persist throughout the remainder of this year … While there may be periodic adjustments in investment pace, the overall trajectory for AI investment remains robust and poised for sustained growth.”

We shall see.

Geospatial Data Analysis with Geemap

Illustration by Author

Geospatial data analysis is the field concerned with handling, visualizing, and analyzing a special type of data called geospatial data. Compared to ordinary tabular data, geospatial data carries additional location information, such as latitude and longitude.

There are two main types of data: vector data and raster data. When dealing with vector data, you still have a tabular dataset, while raster data are more similar to images, such as satellite images and aerial photographs.

In this article, I am going to focus on raster data provided by Google Earth Engine, a cloud computing platform that provides a huge data catalog of satellite imagery. This kind of data can be handled easily from your Jupyter Notebook using a life-saving Python package called Geemap. Let’s get started!

What is Google Earth Engine?

Screenshot by Author. Home page of Google Earth Engine.

Before getting started with the Python library, we need to understand the potential of Google Earth Engine. This cloud-based platform, powered by Google Cloud Platform, hosts public and free geospatial datasets for academic, non-profit and business purposes.

Screenshot by Author. Overview of Earth Engine Data Catalog.

The beauty of this platform is that it provides a multi-petabyte catalog of raster and vector data, stored on the Earth Engine servers. You can have a fast overview from this link. Moreover, it provides APIs to facilitate the analysis of raster datasets.

What is Geemap?

Illustration by Author. Geemap library.

Geemap is a Python library that allows you to analyze and visualize huge amounts of geospatial data from Google Earth Engine.

Before this package, it was already possible to make computational requests through the JavaScript and Python APIs, but the Python API had limited functionality and lacked documentation.

To fill this gap, Geemap was created to let users access the resources of Google Earth Engine with just a few lines of code. Geemap is built upon earthengine-api, ipyleaflet, and folium.

To install the library, you just need the following command:

pip install geemap

I recommend you experiment with this amazing package in Google Colab to understand its full potential. Take a look at this free book written by Professor Qiusheng Wu to get started with Geemap and Google Earth Engine.

How to Access Earth Engine?

First, we need to import two Python libraries that will be used throughout the tutorial:

import ee
import geemap

In addition to geemap, we have imported the Earth Engine Python client library, called ee.

This Python library can be used to authenticate with Earth Engine, but it is faster to do it directly through the Geemap library:

m = geemap.Map()
m

You need to click the URL returned by this line of code, which will generate the authorization code. First, select the cloud project, then click the “GENERATE TOKEN” button.

Screenshot by Author. Notebook Authenticator.

Next, it will ask you to choose an account. I recommend using the same account as in Google Colab if you are working there.

Screenshot by Author. Choose an account.

Then, click the check box next to Select All and press the “Continue” button. In a nutshell, this step allows the Notebook Client to access the Earth Engine account.

Screenshot by Author. Allow the Notebook Client to access your Earth Engine account.

After this action, the authentication code is generated and you can paste it into the notebook cell.

Screenshot by Author. Copy the Authentication Code.

Once the verification code is entered, you can finally create and visualize this interactive map:

m = geemap.Map()
m


For now, you are just observing the base map on top of ipyleaflet, a Python package that enables the visualization of interactive maps within the Jupyter Notebook.

Create Interactive Maps

Previously, we saw how to authenticate and visualize an interactive map using a single line of code. Now, we can customize the default map by specifying the latitude and longitude of the centroid, the zoom level, and the height. I have chosen the coordinates of Rome as the centre to focus the map on Europe.

m = geemap.Map(center=[41, 12], zoom=6, height=600)
m


If we want to change the base map, there are two possible ways. The first way consists of writing and running the following code line:

m.add_basemap("ROADMAP")
m


Alternatively, you can change the base map manually by clicking the wrench icon on the right.


Moreover, we can print the list of base maps provided by Geemap:

basemaps = geemap.basemaps.keys()
for bm in basemaps:
    print(bm)

This is the output:

OpenStreetMap
Esri.WorldStreetMap
Esri.WorldImagery
Esri.WorldTopoMap
FWS NWI Wetlands
FWS NWI Wetlands Raster
NLCD 2021 CONUS Land Cover
NLCD 2019 CONUS Land Cover
...

As you can notice, there is a long series of base maps, most of them available thanks to OpenStreetMap, ESRI and USGS.

Earth Engine Data Types

Before showing the full potential of Geemap, it’s important to know two main data types in Earth Engine. Take a look at the Google Earth Engine’s documentation for more details.

Illustration by Author. Example of vector data types: Geometry, Feature and FeatureCollection.

When handling vector data, we principally use three data types (a minimal code sketch follows this list):

  • Geometry stores the coordinates needed to draw the vector data on a map. Three main types of geometries are supported by Earth Engine: Point, LineString and Polygon.
  • Feature is essentially a row that combines geometry and non-geographical attributes. It’s very similar to the GeoSeries class of GeoPandas.
  • FeatureCollection is a tabular data structure that contains a set of features. FeatureCollection and GeoDataFrame are almost identical conceptually.
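
To make these three types concrete, here is a minimal sketch of how they fit together in the Earth Engine Python client. The coordinates and property names are made up for illustration, and the snippet assumes you have already authenticated with Earth Engine as described earlier:

import ee

ee.Initialize()  # assumes authentication has already been completed

# A Geometry stores the raw coordinates (a point in Rome: longitude, latitude)
point = ee.Geometry.Point([12.4964, 41.9028])

# A Feature pairs a geometry with non-geographical attributes
feature = ee.Feature(point, {'city': 'Rome', 'country': 'Italy'})

# A FeatureCollection gathers features into a tabular structure
fc = ee.FeatureCollection([feature])

print(fc.first().get('city').getInfo())  # prints: Rome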

Screenshot by Author. Example of Image data type. It shows the Australian Smoothed Digital Elevation Model (DEM-S)

In the world of raster data, we focus on Image objects. Google Earth Engine’s Images are composed of one or more bands, where each band has a specific name, estimated minimum and maximum values, and a description.

If we have a collection or time series of images, ImageCollection is more appropriate as a data type.

Screenshot by Author. Copernicus CORINE Land Cover.

We visualize the satellite imagery showing the land cover map of Europe. This dataset provides the changes between 1986 and 2018.

First, we load the image using ee.Image and then select the “landcover” band. Finally, we visualize the image by adding the loaded dataset to the map as a layer using Map.addLayer.

Map = geemap.Map()
dataset = ee.Image('COPERNICUS/CORINE/V20/100m/2012')
landCover = dataset.select('landcover')
Map.setCenter(16.436, 39.825, 6)
Map.addLayer(landCover, {}, 'Land Cover')
Map

Screenshot by Author.


Screenshot by Author. Offline high-resolution imagery of methane concentrations.

To visualize an Earth Engine ImageCollection, the lines of code are similar, except for ee.ImageCollection.

Map = geemap.Map()
collection = ee.ImageCollection('COPERNICUS/S5P/OFFL/L3_CH4').select('CH4_column_volume_mixing_ratio_dry_air').filterDate('2019-06-01', '2019-07-16')
band_viz = {
    'min': 1750,
    'max': 1900,
    'palette': ['black', 'blue', 'purple', 'cyan', 'green', 'yellow', 'red']
}
Map.addLayer(collection.mean(), band_viz, 'S5P CH4')
Map.setCenter(0.0, 0.0, 2)
Map

Screenshot by Author.

That’s great! From this map, we can see how methane, one of the most important contributors to the greenhouse effect, is distributed across the globe.

Final Thoughts

This was an introductory guide that can help you work with Google Earth Engine data using Python. Geemap is the most complete Python library to visualize and analyze this type of data.

If you want to go deeper into this package, you can take a look at the resources I suggested below.

The code can be found here. I hope you found the article useful. Have a nice day!

Useful resources:

  • Google Earth Engine
  • Geemap Documentation
  • Book Earth Engine and Geemap: Geospatial Data Science with Python

Eugenia Anello is currently a research fellow at the Department of Information Engineering of the University of Padova, Italy. Her research project is focused on Continual Learning combined with Anomaly Detection.

More On This Topic

  • 5 Python Packages For Geospatial Data Analysis
  • Leveraging Geospatial Data in Python with GeoPandas
  • Building a Geospatial Application in Python with Google Earth…
  • Collection of Guides on Mastering SQL, Python, Data Cleaning, Data…
  • A Data Scientist’s Essential Guide to Exploratory Data Analysis
  • Data Cleaning in SQL: How To Prepare Messy Data for Analysis

Why India Needs to Build its Focus on AI Theory


There have been several theories floating around the internet claiming that India has arrived late to the AI party. Even with the alleged late arrival, most AI development in India is focused on building use cases of AI rather than the core technology, adopting AI models that have already been developed in Western countries.

These claims, however, may not be completely true.

To begin with, when one looks at the curriculum, premier institutions in India, such as the IITs, have been heavily focused on the theoretical aspects of AI. Many of the prominent contributions in the field of AI have also been made by professors from these institutes that have been the bedrock of innovation for several decades.

But where are we lacking? Anirbit Mukherjee, assistant professor at the department of computer science, University of Manchester, said that India should focus more on AI theory. “If even a quarter of the mathematical talent in India were to get into AI theory, it would cause a tectonic shift,” he said in a conversation on LinkedIn.

He believes that India’s core theory communities in mathematics, statistics, and physics departments are not actually interested in getting into AI theory. “It’s freaking cool – and more Indians should be doing it,” he added.

‘Honourable mentions’

Nikhil Malhotra, the chief innovation officer of Maker’s Lab, which is building Project Indus, said, “Most LLMs produced in India are built on top of the already-available LLMs. They cannot be called fundamental research or foundational LLMs.” In another comment, he wrote, “Who has challenged the original algorithm? While transformers are a great piece, they have flaws in terms of compute and carbon,” reiterating that most of the research in India is done on fine-tuned models.

Sourav Das, a researcher at IIIT Kalyani, also weighed in with his thoughts. “How many of them have made an algorithm, theory, or model from scratch?” questioned Das, arguing that everything is available on the internet and researchers are simply exploiting those resources. “There is no invention in India, just reusing the things that are already there,” he added, saying that all the fine-tuning earns only “honourable mentions”.

A lot of AI development currently is being driven by young developers who are building AI models on top of existing ones such as LLaMA and Mistral, but nothing concrete has come up yet. Though there are initiatives such as Ola’s Krutrim, Sarvam AI, Tech Mahindra’s Project Indus, and BharatGPT that are focused on building models from scratch, a lot of work still needs to be done.

On the other hand, “Issue that the sceptics don’t realise is that there’s not much capital available in India for the youth to take it to the next level,” rued Sreekanth Sreedharan.

This issue was also highlighted by several others in the conversation, who talked about how many investors are not interested in funding deep research, but only application-based startups that will mint money more easily. “India can’t compete in AI foundational research unless the investment behaviour changes,” added Rishabh Bhardwaj.

Similar thoughts were shared by Hakim Hacid, executive director and acting chief researcher at Technology Innovation Institute (TII). “You need a lot of funding to sustain open source and we believe that not everyone will be able to do it,” Hacid told AIM.

Innovation Requires an Entire Ecosystem

Several researchers from IITs point out that the institutes have mastered the art of publishing papers; however, only a minuscule share of such research is actually fundamental. It might be true that we don’t need more research on LLMs themselves, but rather research on whatever replaces the current paradigm of AI research.

Some experts argue that there is a need for a push from the industry, along with the government, for fundamental research in AI and focusing on AI theory. “The students need incentives [such as placements, internships, and media coverage] to solve difficult problems,” said Abhishek Gupta.

To build the ecosystem, it is necessary to change the curriculum, while also building an ecosystem which supports groundbreaking research in the field, and not just incentivising AI wrappers.

The QS World University Rankings currently feature 72 universities recognised for providing top-notch data science and AI courses. Among these are four Indian institutions that made it to the top 50 list, namely IIT Bombay (30), IIT Kanpur (36), IIT Kharagpur (44), and IISc (45). Additionally, IIT Guwahati was part of the top 72 universities.

These premier Indian institutions’ inclusion in the rankings may suggest a trend of higher education institutions increasingly integrating data science and AI into their academic offerings.

To accelerate this, Amit Sheth, the chair and founding director of the Artificial Intelligence Institute at the University of South Carolina (AIISC), has been continuously working with Indian academic institutions to drive research in the country. He has proposed Ekagrid, a private research university with an ambition to be ranked among the top in the world and contribute to India’s research-driven ecosystem, as Stanford and UC Berkeley have done for Silicon Valley.

This team includes experts from 11 of the 25 top universities in the world. The project is still in its initial stages and is looking to raise funds. “The Prime Minister was very prompt and quick to understand the need for this project and provide actionable guidance,” he told AIM.

“There needs to be a lot more investment in AI research, which is still very low,” Sheth said. “But India is still doing great despite less funding.”


5 Free Courses to Master Math for Data Science

5 Free Courses to Master Math for Data Science
Image by storyset on Freepik

When you’re learning data science, building a good foundation in math will make your learning journey easier and much more effective. Even if you’ve already landed your first data role, learning math fundamentals for data science will only take your skills further.

From exploratory data analysis to building machine learning models, having a good foundation in math topics like linear algebra and statistics will give you a better understanding of why you do what you do. So even if you are a beginner, this list of courses will help you learn:

  • Basic math skills
  • Calculus
  • Linear Algebra
  • Probability and Statistics
  • Optimization

Sounds interesting, yes? Let’s get started!

1. Data Science Math Skills – Duke University

Data science courses require you to be comfortable with math as a prerequisite. To be specific, most courses assume that you're comfortable with high school algebra and calculus. But no worries if you are not there yet.

The Data Science Math Skills course, offered by Duke University on Coursera will help you get up and running with math fundamentals in as little time as possible. The topics covered in this course include:

  • Problem solving
  • Functions and graphs
  • Intro to calculus
  • Intro to probability

It’s recommended that you go through this course before you start the other courses that explore specific math topics in greater depth.

Link: Data Science Math Skills – Duke University on Coursera

2. Calculus – 3Blue1Brown

When we talk about math for data science, calculus is definitely something you should be comfortable with. But most learners find high school calculus intimidating (I’ve been there, too!). This, however, is partly because of how we learn—mostly focusing on concepts, a small number of illustrative examples, and a ton of practice exercises.

But you’ll understand and learn calculus much better if there are helpful visualizations—to help go from intuition to equation—focusing on the why.

The Calculus course by Grant Sanderson of 3Blue1Brown is exactly what all of us need! Through a series of lessons with super helpful visualizations—going from geometry to formula wherever possible—this course will help you learn the following and more:

  • Limits and derivatives
  • Power rule, chain rule, product rule
  • Implicit differentiation
  • Higher order derivatives
  • Taylor series
  • Integration

Link: Calculus — 3Blue1Brown

3. Linear Algebra – 3Blue1Brown

As a data scientist, the datasets that you work with are essentially matrices of dimension num_samples x num_features. You can, therefore, think of each data point as a vector in the feature space. So understanding how matrices work, common operations on matrices, and matrix decomposition techniques is important.
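
As a quick, hypothetical illustration of this framing (a toy example, not tied to any particular course), a dataset is just a matrix whose rows are data points:

import numpy as np

# A toy dataset: 3 samples (rows) x 2 features (columns)
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [6.2, 3.4]])

print(X.shape)  # (num_samples, num_features) -> (3, 2)
print(X[0])     # a single data point as a vector in the feature space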

If you loved the calculus course from 3Blue1Brown, you’ll probably enjoy the linear algebra course from Grant Sanderson just as much, if not more. The Linear Algebra course from 3Blue1Brown will help you learn the following:

  • Fundamentals of vectors and vector spaces
  • Linear combinations, span, and basis
  • Linear transformation and matrices
  • Matrix multiplication
  • 3D linear transformation
  • Determinant
  • Inverses, column space, and null space
  • Dot and cross products
  • Eigenvalues and eigenvectors
  • Abstract vector spaces

Link: Linear Algebra — 3Blue1Brown

4. Probability and Statistics – Khan Academy

Statistics and probability are great skills to add to your data science toolbox, but they are by no means easy to master. It is, however, relatively easy to get your fundamentals down and build on them.

The Statistics and Probability course from Khan Academy will help you learn the probability and statistics you need to start working with data more effectively. Here is an overview of the topics covered:

  • Analyzing categorical and quantitative data
  • Modeling data distributions
  • Probability
  • Counting, permutations, and combinations
  • Random variables
  • Sampling distribution
  • Confidence interval
  • Hypothesis testing
  • Chi-square test
  • ANOVA

If you’re interested in diving deep into statistics, also check out 5 Free Courses to Master Statistics for Data Science.

Link: Statistics and Probability — Khan Academy

5. Optimization for Machine Learning – ML Mastery

If you’ve ever trained a machine learning model, you know that the algorithm learns the optimal values of the parameters of the model. Under the hood, it runs an optimization algorithm to find the optimal value.

The Optimization for Machine Learning Crash Course from Machine Learning Mastery is a comprehensive resource to learn optimization for machine learning.

This course takes a code-first approach using Python. So after understanding the importance of optimization, you’ll write Python code to see popular optimization algorithms in action. Here’s an overview of the topics covered:

  • The need for optimization
  • Grid search
  • Optimization algorithms in SciPy
  • BFGS algorithm
  • Hill climbing algorithm
  • Simulated annealing
  • Gradient descent

Link: Optimization for Machine Learning Crash Course — MachineLearningMastery.com

Wrapping Up

I hope you found these resources helpful. Because most of these courses are tailored towards beginners, you should be able to pick up all the essential math without feeling overwhelmed.

If you’re looking for courses to learn Python for data science, read 5 Free Courses to Master Python for Data Science.

Happy learning!

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.

More On This Topic

  • How To Overcome The Fear of Math and Learn Math For Data Science
  • 25 Free Courses to Master Data Science, Data Engineering, Machine…
  • 5 Free Courses to Master SQL for Data Science
  • 5 Free Courses to Master Data Science
  • 5 Free Courses to Master Python for Data Science
  • 5 Free Courses to Master Statistics for Data Science

Lawhive raises $12M to expand its legaltech AI platform for small firms

By Mike Butcher (@mikebutcher)

UK-based legaltech startup Lawhive, which offers an AI-based, in-house “lawyer” through a software-as-a-service platform targeted at small law firms, has raised £9.5 million ($11.9 million) in a seed round to expand the reach of AI-driven services for “main street” law firms.

To date, most legaltech startups that have deployed AI have concentrated on the big, juicy market of “Big Law” — large law firms that have a presence throughout the country or globally and are keen on pushing AI into their workflows. Such startups include Harvey (U.S.-based; raised $106 million), Robin AI (UK-based; raised $43.4 million), and Spellbook (Canada-based; raised $32.4 million). But startups have paid scant attention to the thousands of “main street” law firms, which have far smaller budgets and are harder to monetize.

Lawhive targets small law firms or solo lawyers running their own shops. Lawyers can use its software to onboard and manage their own clients, or can sign up for Lawhive’s marketplace to be matched with individual customers and small businesses.

The startup says it applies a variety of foundational AI models as well as its own in-house model to summarize documents and speed up repetitive tasks such as KYC/AML, client onboarding and document collection for both lawyers and their clients. The company says its in-house AI lawyer, “Lawrence,” is built on top of its own large language model (LLM), which it claims has passed the Solicitors Qualifying Examination (SQE) with an 81% grade against a passing grade of 55%.

“Pretty much all of the existing legaltech — AI companies like Harvey, Robin AI, or Spellbook — all go after the corporate market,” Pierre Proner, CEO and co-founder of Lawhive, told TechCrunch. “That’s a very small number of big law firms in the U.S. or the UK. We’re trying to solve the problem in the consumer legal space, which is a totally different and separate market. It’s served, at the moment, by — in the UK — 10,000 small law firms.”

Proner said smaller firms have to manage higher costs amid a shrinking market. “They’ve got high costs of staffing, with paralegals, junior lawyers and trainees. They only have one-to-three actual senior lawyers who are earning any money. So the model doesn’t work. There’s this huge exodus of mid-career lawyers from the main-street/high-street model, and a lot of them are going freelance or self-employed. That’s where we’ve sort of seen a lot of traction.”

Although the UK’s consumer legal market is worth an estimated £25 billion, like most legal markets, it’s groaning under the weight of its own costs. This means around 3.6 million people have an unmet legal need involving a dispute each year, and around a million small businesses handle legal issues on their own. So there’s a strong opportunity for automation to help the sector dial up productivity.

Proner added, “We do combine [our model] with foundational models from OpenAI and Anthropic, as well as open source models. But it is our own model, which has been trained on the data that we’ve been able to gather from thousands of cases.”

The startup plans to use the seed round to enter other markets, per Proner. “We have our eyes on other markets yet to be publicly disclosed,” he said.

Lawhive’s lead investor for this round may provide some clues about which markets the company may be considering: the seed round was led by GV, the venture capital investment arm of U.S.-based Alphabet. London-based investor Episode 1 Ventures also participated.

In a statement, Vidu Shanmugarajah, a partner at GV, said, “As a lawyer by training, I have experienced first-hand how needed technology-driven innovation is in the legal sector. Lawhive represents a transformative shift for both lawyers and consumers.”

Previously, Lawhive had raised £1.3 million from Episode 1 Ventures at pre-seed stage.

Alexa Saves Young Girl from Monkey Attack, Aims to Aid Older Adults Too


Alexa’s ability to produce animal sounds through the Wild Planet skill recently helped save a 13-year-old girl and her 15-month-old niece from a monkey attack in Basti, Uttar Pradesh. By asking “Alexa, kutte ki awaz nikalo” (Hindi for “Alexa, make a dog sound”), the girl was able to scare away the monkeys.

“The option to access a number of useful kid-friendly experiences with simple voice commands makes Alexa a great addition for a family with young kids. Parents often tell us how Alexa has become a companion in their parenting journeys,” says Dilip R.S., Director and Country Manager for Alexa, Amazon India.

From listening to Indian folktales to playing animal sounds, Indian households with young kids who use Alexa at home are two times more engaged than other users. Parents of young kids take Alexa’s help in managing their day-to-day parenting tasks and keeping their kids engaged by asking Alexa for rhymes, stories, games, general knowledge (GK) questions, and more.

Users enjoy the ease and convenience of giving simple voice commands to Alexa in Hindi, English, and Hinglish – making the AI a great aid for parents and companion for kids.

“While it is a great learning and entertainment tool for kids, Alexa can help parents manage their day-to-day tasks better. Whether it is controlling smart home appliances with voice while juggling numerous tasks or asking for a bedtime story as part of their child’s daily routine, Alexa’s right there to help them,” Dilip adds.

Today, families across India are asking Alexa for information, games, quizzes, music, help managing day-to-day tasks, stories, and much more. In fact, weekends are family time with Alexa – last year there was a 15% increase in weekend requests to Alexa for music, many of them for kids’ music.

The top five most popular songs for kids on Alexa are: Baby Shark, Lakdi Ki Kathi, Johnny Johnny Yes Papa, Wheels on the Bus, and Twinkle Twinkle Little Star. Indian folktales, like Akbar Birbal, Tenali Raman, and Panchatantra stories, see high interest from customers, especially in Hindi. In 2023, customers asked for these stories an average of 34 times every hour.


7 AI Startups that Featured on Shark Tank India Season 3

As Shark Tank India’s Season 3 wraps up, we look back to assess the surge in AI startups stepping into the tank to pitch their ideas. This was indeed the first time we saw AI startups making an impact on the show. While some secured funding, others were still in their early stages.

Interestingly, only four out of the seven were able to get investment.

The panel featured Aman Gupta from boAt, Ritesh Agarwal of OYO Rooms, Deepinder Goyal of Zomato, Anupam Mittal from Shaadi.com, Namita Thapar of Emcure Pharmaceuticals, and Vineeta Singh from Sugar Cosmetics.

The panel also included Radhika Gupta of Edelweiss Mutual Fund, Peyush Bansal from Lenskart, Amit Jain of CarDekho, Azhar Iqubal from Inshorts, Varun Dua from ACKO, and Ronnie Screwvala from UpGrad.

Let’s look at the AI startups that featured on Shark Tank India.

Model Verse

Model Verse, founded by IIM Kozhikode graduate Srijan Mehrotra, creates images of models for advertising and catalogs using AI. Srijan said that high quality product images are crucial for marketing, but traditional methods like hiring models can be expensive and time-consuming.

In his pitch, he claimed that the platform can produce professional photos of models, saving brands a lot of money and time. Each image created on his platform costs only INR 35, a tiny fraction of the usual INR 3.5-4 lakh that models charge. Mehrotra developed the tool using PyTorch.

The founder of Shaadi.com, Anupam Mittal, questioned whether Model Verse could be easily replaced by OpenAI’s Dall-E 3 and said that building wrapper-based models on OpenAI’s API has become more feasible nowadays.

Mehrotra claimed that he had built Model Verse from scratch and needed GPUs to scale. He secured a deal from Anupam Mittal, Ritesh Agarwal, and Amit Jain for INR 25 lakh in exchange for 10% equity.

AI Kavach

AI Kavach by Panoplia.ai, founded by Pratyusha Vemuri, is a platform that offers AI-powered solutions to tackle online fraud. The tool can detect fraudulent websites, calls, apps, and messages. Online fraud is a significant concern for individuals and businesses in India, resulting in substantial financial losses. To date, the company has detected 200 fake messages and websites.

During her pitch, Vemuri shared her personal story of falling prey to a fraudulent website while purchasing a swing. Before founding AI Kavach, Vemuri served as the head of product identity, security, and privacy at Microsoft.

She claimed that AI Kavach can detect URLs in real time and, with the use of AI, predict whether a website is fraudulent or not. She also added that the algorithm is trained on millions of websites. Her plan is to charge consumers INR 99 per month for the service.

Vemuri secured an INR 1 crore deal from Aman Gupta and Peyush Bansal, offering 5% equity in her company in return.

Beauty GPT

Beauty GPT is a product of Orbo.ai, founded by Abhit Sinha, Manoj Shinde, and Danish Jamil. Similar to a Snapchat filter, BeautyGPT is a virtual makeup simulation which helps users visualise how they would look with makeup on. This includes experimenting with various products such as lipstick, blushes, highlighter, eyeliner, etc.

Using a combination of machine learning algorithms and LLMs, it analyses various data points, including facial attributes, customer demographics (age, skin type, etc.), skin concerns and weather impact (e.g., recommending different moisturizers for dry vs. humid weather).

Personalised recommendations can help customers find the products they need, potentially increasing sales conversions for cosmetic companies. During the Shark Tank pitch, Peyush Bansal offered to buy 51% of the company for INR 15 crore. However, Orbo.ai secured a deal of INR 1 crore for 1% equity from Vineeta Singh.

Ai Cars

The founder of Ai Cars, Harshal Mahadev Nakshane, from Yavatmal, Maharashtra, presented an AI-powered hydrogen fuel cell car prototype. During his Shark Tank pitch, Nakshane said that he built the car in his garage in about 18 months.

Judges Anupam Mittal, Vineeta Singh, and Namita Thapar took a test drive of the vehicle, during which it autonomously made sharp turns and was able to navigate on the road, leaving a strong impression on the judges.

The hydrogen fuel cell-powered AI Car stands out with its rapid 5-minute refueling time and an impressive range exceeding 1,000 kilometers – all achieved with an investment of INR 60 lakh. However, Nakshane didn’t receive funding because the Sharks felt that the product is not scalable and that it would be difficult to compete with established players like Tesla and Google Waymo.

FUTR STUDIOS

FUTR STUDIOS, co-founded by Himanshu Goel and George Tharian, specialises in developing AI influencers for marketing and advertising. Kyra, India’s first AI influencer, is also a product of FUTR Studios. Today, Kyra has more than 2.5 lakh followers on Instagram and has collaborated on campaigns with brands like boAt, Titan, Morris Garages, and more.

FUTR Studios develops high-fidelity, 3D virtual humans that can be used for various applications. They integrate these virtual humans into metaverse platforms, enabling interactive experiences within virtual worlds. The company further plans to build autonomous virtual humans that can think on their own using LLMs.

The company failed to get a deal from the sharks.

upliance.ai

upliance.ai is an AI-powered home appliance company founded by Mahek Mody. upliance.ai’s AI Cooking Assistant comes in the form of a jar with a smart 8-inch screen attached. It can prepare over 500 dishes, including paneer stir fry, chole masala, steamed rice, and chai.

The screen comes with the built-in ‘ChefGPT’, which answers all your queries about the recipes and suggests ingredients for the dish. Moreover, it can suggest new recipes based on the ingredients you have at your disposal. Not only does it provide step-by-step instructions while you are cooking, but it also enhances your overall cooking experience.

The company wasn’t able to secure funding on the tank. However, the company later secured an investment of INR 34 crore in a seed round at a valuation of INR 143 crore from Khosla Ventures.

Kibo

Kibo by Trestle Labs was one of the unique AI companies pitched on Shark Tank 3. Kibo, which stands for Knowledge in a Box, is an AI-based education tool that converts printed documents, handwritten notes, PDFs, and digital text into audio for listening. This can be particularly beneficial for the visually impaired or those who prefer audio learning.

The brand’s journey started during the engineering days of founders Akshita Sachdeva and Bonny Dave. While in her third year of computer science engineering, Sachdeva worked on a project focused on creating reading and mobility hand gloves for the visually impaired.

The company secured a funding of INR 60 lakh for 6% equity from Peyush Bansal and Ronnie Screwvala.


Power of Rerankers and Two-Stage Retrieval for Retrieval Augmented Generation


When it comes to natural language processing (NLP) and information retrieval, the ability to efficiently and accurately retrieve relevant information is paramount. As the field continues to evolve, new techniques and methodologies are being developed to enhance the performance of retrieval systems, particularly in the context of Retrieval Augmented Generation (RAG). One such technique, known as two-stage retrieval with rerankers, has emerged as a powerful solution to address the inherent limitations of traditional retrieval methods.

In this comprehensive blog post, we'll delve into the intricacies of two-stage retrieval and rerankers, exploring their underlying principles, implementation strategies, and the benefits they offer in enhancing the accuracy and efficiency of RAG systems. We'll also provide practical examples and code snippets to illustrate the concepts and facilitate a deeper understanding of this cutting-edge technique.

Understanding Retrieval Augmented Generation (RAG)


Before diving into the specifics of two-stage retrieval and rerankers, let's briefly revisit the concept of Retrieval Augmented Generation (RAG). RAG is a technique that extends the knowledge and capabilities of large language models (LLMs) by providing them with access to external information sources, such as databases or document collections. For a deeper introduction, refer to the article “A Deep Dive into Retrieval Augmented Generation in LLM“.


The typical RAG process involves the following steps (a minimal end-to-end sketch follows the list):

  1. Query: A user poses a question or provides an instruction to the system.
  2. Retrieval: The system queries a vector database or document collection to find information relevant to the user's query.
  3. Augmentation: The retrieved information is combined with the user's original query or instruction.
  4. Generation: The language model processes the augmented input and generates a response, leveraging the external information to enhance the accuracy and comprehensiveness of its output.
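
To make these four steps concrete, here is a small, self-contained sketch of the loop. The toy document list, the TF-IDF retriever, and the generate() stub are illustrative assumptions rather than any particular framework's API; a production system would query a vector database in step 2 and call an LLM in step 4:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy document store standing in for a vector database
documents = [
    "ColBERT reranks documents with late interaction over BERT embeddings.",
    "Two-stage retrieval pairs a fast first pass with a reranking model.",
    "RAG augments a language model's input with retrieved documents.",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)

def retrieve(query, k=2):
    # Step 2: rank documents by similarity to the query and keep the top k
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    return [documents[i] for i in scores.argsort()[::-1][:k]]

def generate(augmented_prompt):
    # Step 4: stub standing in for a call to an LLM
    return "[LLM response conditioned on]\n" + augmented_prompt

query = "How does retrieval augmented generation work?"       # Step 1: query
context = "\n".join(retrieve(query))                          # Step 2: retrieval
augmented = "Context:\n" + context + "\n\nQuestion: " + query # Step 3: augmentation
print(generate(augmented))                                    # Step 4: generation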

While RAG has proven to be a powerful technique, it is not without its challenges. One of the key issues lies in the retrieval stage, where traditional retrieval methods may fail to identify the most relevant documents, leading to suboptimal or inaccurate responses from the language model.

The Need for Two-Stage Retrieval and Rerankers

Traditional retrieval methods, such as those based on keyword matching or vector space models, often struggle to capture the nuanced semantic relationships between queries and documents. This limitation can result in the retrieval of documents that are only superficially relevant or miss crucial information that could significantly improve the quality of the generated response.

To address this challenge, researchers and practitioners have turned to two-stage retrieval with rerankers. This approach involves a two-step process:

  1. Initial Retrieval: In the first stage, a relatively large set of potentially relevant documents is retrieved using a fast and efficient retrieval method, such as a vector space model or a keyword-based search.
  2. Reranking: In the second stage, a more sophisticated reranking model is employed to reorder the initially retrieved documents based on their relevance to the query, effectively bringing the most relevant documents to the top of the list.

The reranking model, often a neural network or a transformer-based architecture, is specifically trained to assess the relevance of a document to a given query. By leveraging advanced natural language understanding capabilities, the reranker can capture the semantic nuances and contextual relationships between the query and the documents, resulting in a more accurate and relevant ranking.
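To make the two stages concrete, here is a minimal runnable sketch using the Sentence Transformers library: a fast bi-encoder produces candidates, and a cross-encoder reranker reorders them. The model names and toy corpus are illustrative assumptions, not the only valid choices:

from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Stage 1: fast bi-encoder retrieval over a toy corpus
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
corpus = [
    "Rerankers reorder retrieved documents by semantic relevance.",
    "Vector databases support approximate nearest neighbor search.",
    "BM25 is a classic keyword-based retrieval function.",
]
corpus_embs = bi_encoder.encode(corpus, convert_to_tensor=True)

query = "How do rerankers improve retrieval quality?"
query_emb = bi_encoder.encode(query, convert_to_tensor=True)
hits = util.semantic_search(query_emb, corpus_embs, top_k=3)[0]
candidates = [corpus[hit["corpus_id"]] for hit in hits]

# Stage 2: slower but more precise cross-encoder reranking
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)]
print(reranked[0])

In production, the toy corpus would be replaced by a vector database, exactly as we do with LanceDB later in this post.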

Benefits of Two-Stage Retrieval and Rerankers

The adoption of two-stage retrieval with rerankers offers several significant benefits in the context of RAG systems:

  1. Improved Accuracy: By reranking the initially retrieved documents and promoting the most relevant ones to the top, the system can provide more accurate and precise information to the language model, leading to higher-quality generated responses.
  2. Mitigated Out-of-Domain Issues: Embedding models used for traditional retrieval are often trained on general-purpose text corpora, which may not adequately capture domain-specific language and semantics. Reranking models, on the other hand, can be trained on domain-specific data, mitigating the “out-of-domain” problem and improving the relevance of retrieved documents within specialized domains.
  3. Scalability: The two-stage approach allows for efficient scaling by leveraging fast and lightweight retrieval methods in the initial stage, while reserving the more computationally intensive reranking process for a smaller subset of documents.
  4. Flexibility: Reranking models can be swapped or updated independently of the initial retrieval method, providing flexibility and adaptability to the evolving needs of the system.

ColBERT: Efficient and Effective Late Interaction

One of the standout models in the realm of reranking is ColBERT (Contextualized Late Interaction over BERT). ColBERT is a document reranker model that leverages the deep language understanding capabilities of BERT while introducing a novel interaction mechanism known as “late interaction.”

ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT

The late interaction mechanism in ColBERT allows for efficient and precise retrieval by processing queries and documents separately until the final stages of the retrieval process. Specifically, ColBERT independently encodes the query and the document using BERT, and then employs a lightweight yet powerful interaction step that models their fine-grained similarity. By delaying but retaining this fine-grained interaction, ColBERT can leverage the expressiveness of deep language models while simultaneously gaining the ability to pre-compute document representations offline, considerably speeding up query processing.
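The core of this interaction step is the “MaxSim” operator: each query token embedding is matched against its most similar document token embedding, and these maxima are summed into a relevance score. Here is a minimal PyTorch sketch, using random normalized vectors as stand-ins for BERT token embeddings:

import torch
import torch.nn.functional as F

def maxsim_score(query_embs: torch.Tensor, doc_embs: torch.Tensor) -> torch.Tensor:
    # query_embs: (num_query_tokens, dim); doc_embs: (num_doc_tokens, dim)
    # Both are assumed L2-normalized, as in ColBERT
    sim = query_embs @ doc_embs.T        # token-to-token similarity matrix
    return sim.max(dim=1).values.sum()   # best match per query token, summed

# Random stand-ins for contextualized token embeddings
q = F.normalize(torch.randn(8, 128), dim=-1)    # 8 query tokens
d = F.normalize(torch.randn(200, 128), dim=-1)  # 200 document tokens
print(maxsim_score(q, d))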

ColBERT's late interaction architecture offers several benefits, including improved computational efficiency, scalability with document collection size, and practical applicability for real-world scenarios. Additionally, ColBERT has been further enhanced with techniques like denoised supervision and residual compression (in ColBERTv2), which refine the training process and reduce the model's space footprint while maintaining high retrieval effectiveness.

As a hedged example, the sketch below shows one plausible way to configure the jina-colbert-v1-en model and index a collection of documents with it, taking advantage of its long-context support. We use the RAGatouille library here as an assumed convenience wrapper (pip install ragatouille); exact arguments may vary across versions.
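from ragatouille import RAGPretrainedModel

# Load the Jina ColBERT checkpoint, which supports long (8k-token) contexts
RAG = RAGPretrainedModel.from_pretrained("jinaai/jina-colbert-v1-en")

# Index a small illustrative collection; the index name is an assumption
documents = [
    "ColBERT performs late interaction over BERT token embeddings.",
    "Rerankers reorder retrieved candidates by semantic relevance.",
]
RAG.index(
    collection=documents,
    index_name="colbert_demo",
    max_document_length=8192,
    split_documents=True,
)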

Implementing Two-Stage Retrieval with Rerankers

Now that we have an understanding of the principles behind two-stage retrieval and rerankers, let's explore their practical implementation within the context of a RAG system. We'll leverage popular libraries and frameworks to demonstrate the integration of these techniques.

Setting up the Environment

Before we dive into the code, let's set up our development environment. We'll be using Python and several popular NLP libraries, including Hugging Face Transformers, Sentence Transformers, and LanceDB.

# Install required libraries
!pip install datasets huggingface_hub sentence_transformers lancedb

Data Preparation

For demonstration purposes, we'll use the “ai-arxiv-chunked” dataset from Hugging Face Datasets, which contains over 400 ArXiv papers on machine learning, natural language processing, and large language models.

from datasets import load_dataset

dataset = load_dataset("jamescalam/ai-arxiv-chunked", split="train")

Next, we'll preprocess the data and split it into smaller chunks to facilitate efficient retrieval and processing.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def chunk_text(text, chunk_size=512, overlap=64):
    # Tokenize without truncation so long documents are chunked in full
    tokens = tokenizer.encode(text, add_special_tokens=False)
    step = chunk_size - overlap
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]
    return [tokenizer.decode(chunk) for chunk in chunks]

chunked_data = []
for doc in dataset:
    chunked_data.extend(chunk_text(doc["chunk"]))
Initial Retrieval

For the initial retrieval stage, we'll use a Sentence Transformer model to encode our documents and queries into dense vector representations, and then perform approximate nearest neighbor search using a vector database like LanceDB.
from sentence_transformers import SentenceTransformer
import lancedb

# Load the Sentence Transformer embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Connect to a LanceDB vector store (the path is illustrative)
db = lancedb.connect("/path/to/store")

# Index the documents: one row per chunk, holding the embedding and raw text
table = db.create_table(
    "docs",
    data=[
        {"vector": model.encode(text).tolist(), "text": text}
        for text in chunked_data
    ],
)

With our documents indexed, we can perform the initial retrieval by finding the nearest neighbors to a given query vector.

# Encode an illustrative query and retrieve its nearest neighbors
query = "How does late interaction improve passage retrieval?"
query_vector = model.encode(query).tolist()

results = table.search(query_vector).limit(25).to_pandas()
initial_docs = results["text"].tolist()

Reranking

After the initial retrieval, we'll employ a reranking model to reorder the retrieved documents based on their relevance to the query. In this example, we'll use the ColBERT reranker, a fast and accurate transformer-based model specifically designed for document ranking.

from lancedb.rerankers import ColbertReranker

reranker = ColbertReranker()

# Rerank the initial candidates with ColBERT. Since the search used a raw
# vector, we also pass the query text so the reranker can score relevance
# (argument names follow recent LanceDB releases).
reranked = (
    table.search(query_vector)
    .limit(25)
    .rerank(reranker=reranker, query_string=query)
    .to_pandas()
)
reranked_docs = reranked["text"].tolist()

The reranked_docs list now contains the documents reordered based on their relevance to the query, as determined by the ColBERT reranker.

Augmentation and Generation

With the reranked and relevant documents in hand, we can proceed to the augmentation and generation stages of the RAG pipeline. We'll use a language model from the Hugging Face Transformers library to generate the final response.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

# Augment the query with the top three reranked documents
augmented_query = query + " " + " ".join(reranked_docs[:3])

# Generate a response; truncate the input to T5's 512-token limit
input_ids = tokenizer.encode(
    augmented_query, return_tensors="pt", truncation=True, max_length=512
)
output_ids = model.generate(input_ids, max_length=500)
response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(response)

In the code snippet above, we augment the original query with the top three reranked documents, creating an augmented_query. We then pass this augmented query to a T5 language model, which generates a response based on the provided context.

The response variable will contain the final output, leveraging the external information from the retrieved and reranked documents to provide a more accurate and comprehensive answer to the original query.

Advanced Techniques and Considerations

While the implementation we've covered provides a solid foundation for integrating two-stage retrieval and rerankers into a RAG system, there are several advanced techniques and considerations that can further enhance the performance and robustness of the approach.

  1. Query Expansion: To improve the initial retrieval stage, you can employ query expansion techniques, which involve augmenting the original query with related terms or phrases. This can help retrieve a more diverse set of potentially relevant documents.
  2. Ensemble Reranking: Instead of relying on a single reranking model, you can combine multiple rerankers into an ensemble, leveraging the strengths of different models to improve overall performance (a minimal sketch follows this list).
  3. Fine-tuning Rerankers: While pre-trained reranking models can be effective, fine-tuning them on domain-specific data can further enhance their ability to capture domain-specific semantics and relevance signals.
  4. Iterative Retrieval and Reranking: In some cases, a single iteration of retrieval and reranking may not be sufficient. You can explore iterative approaches, where the output of the language model is used to refine the query and retrieval process, leading to a more interactive and dynamic system.
  5. Balancing Relevance and Diversity: While rerankers aim to promote the most relevant documents, it's essential to strike a balance between relevance and diversity. Incorporating diversity-promoting techniques can help prevent the system from being overly narrow or biased in its information sources.
  6. Evaluation Metrics: To assess the effectiveness of your two-stage retrieval and reranking approach, you'll need to define appropriate evaluation metrics. These may include traditional information retrieval metrics like precision, recall, and mean reciprocal rank (MRR), as well as task-specific metrics tailored to your use case.
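As a concrete illustration of point 2, here is a hedged sketch of ensemble reranking that averages the min-max-normalized scores of two cross-encoders from the Sentence Transformers library. The model names are illustrative; any rerankers exposing a predict() method over (query, document) pairs would work the same way:

import numpy as np
from sentence_transformers import CrossEncoder

def normalize(scores):
    # Min-max normalize so scores from different rerankers are comparable
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span else np.zeros_like(scores)

def ensemble_rerank(query, docs, rerankers):
    pairs = [(query, doc) for doc in docs]
    # Average the normalized scores across all rerankers in the ensemble
    combined = np.mean([normalize(r.predict(pairs)) for r in rerankers], axis=0)
    return [docs[i] for i in np.argsort(-combined)]

# Illustrative choices; both are public MS MARCO cross-encoders
rerankers = [
    CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2"),
    CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-2-v2"),
]
docs = ["Rerankers reorder candidates.", "LanceDB stores vectors."]
print(ensemble_rerank("What do rerankers do?", docs, rerankers))

Min-max normalization is one simple way to put heterogeneous score scales on a common footing; rank-based fusion methods such as reciprocal rank fusion are a common alternative.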

Conclusion

Retrieval Augmented Generation (RAG) has emerged as a powerful technique for enhancing the capabilities of large language models by leveraging external information sources. However, traditional retrieval methods often struggle to identify the most relevant documents, leading to suboptimal performance.

Two-stage retrieval with rerankers offers a compelling solution to this challenge. By combining an initial fast retrieval stage with a more sophisticated reranking model, this approach can significantly improve the accuracy and relevance of the retrieved documents, ultimately leading to higher-quality generated responses from the language model.

In this blog post, we've explored the principles behind two-stage retrieval and rerankers, highlighting their benefits and providing a practical implementation example using popular NLP libraries and frameworks. We've also discussed advanced techniques and considerations to further enhance the performance and robustness of this approach.