Introduction to Numpy and Pandas

Illustration by Author. Source: Flaticon

Python is the most popular language you’ll encounter in data science, thanks to its simplicity, large community and wide availability of open-source libraries.

If you are working on a data science project, Python packages will make your life easier, since a few lines of code are enough for complicated operations such as manipulating data and applying a machine learning or deep learning model.

When starting your data science journey, it’s recommended to start by learning two of the most useful Python packages: NumPy and Pandas. In this article, we are introducing these two libraries. Let’s get started!

What is NumPy?

NumPy stands for Numerical Python and is used to perform efficient computations on arrays and matrices behind the scenes of machine learning models. The building block of NumPy is the array, a data structure very similar to the list, with the difference that it comes with a large number of mathematical functions. In other words, the NumPy array is a multidimensional array object.

Create Numpy Arrays

We can define NumPy arrays using a list or list of lists:

import numpy as np
l = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
numpy_array = np.array(l)
numpy_array
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

Unlike a list of lists, the array is displayed as a 3×3 matrix, with each row on its own indented line. Moreover, NumPy provides more than 40 built-in functions for array creation.

To create an array filled with zeros, use the function np.zeros, for which you just need to specify the desired shape:

zeros_array = np.zeros((3, 4))
zeros_array
array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In the same way, we can create an array filled with ones:

ones_array = np.ones((3, 4))
ones_array
array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

It is also possible to create the identity matrix, a square array with 1s on the main diagonal and 0s everywhere else:

identity_array = np.identity(3)
identity_array
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

Furthermore, NumPy provides different functions to create random arrays. To create an array filled with random samples from a uniform distribution over [0, 1), we just need the function np.random.rand:

random_array = np.random.rand(3, 4)
random_array
array([[0.84449279, 0.71146992, 0.48159787, 0.04927379],
       [0.03428534, 0.26851667, 0.65718662, 0.52284251],
       [0.1380207 , 0.91146148, 0.74171469, 0.57325424]])

Similarly to the previous function, we can define an array with random values, but this time they are taken from a standard normal distribution:

randn_array = np.random.randn(10)
randn_array
array([-0.68398432, -0.25466784,  0.27020797,  0.29632334, -0.20064897,
        0.7988508 ,  1.34759319, -0.41418478, -0.35223377, -0.10282884])

If we are interested in building an array of random integers that belong to the interval [low, high), we just need the function np.random.randint:

randint_array = np.random.randint(1, 20, 20)
randint_array
array([14,  3,  1,  2, 17, 15,  5, 17, 18,  9,  4, 19, 14, 14,  1, 10, 17,
       19,  4,  6])
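Random outputs like the ones above change on every run. As a minimal sketch, assuming reproducibility is wanted (the seed value 42 is an arbitrary choice, not something from the article), fixing the seed with np.random.seed makes the same values come back, and the half-open interval [low, high) can be checked directly:

```python
import numpy as np

# Fix the seed so the legacy global generator yields the same sequence each time
np.random.seed(42)
first = np.random.randint(1, 20, 20)

# Re-seeding with the same value reproduces the draw
np.random.seed(42)
second = np.random.randint(1, 20, 20)

print(np.array_equal(first, second))  # True

# Every value lies in [1, 20): low is included, high is excluded
print(first.min() >= 1, first.max() < 20)  # True True
```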

Indexing and Slicing

Beyond the built-in functions for array creation, another good point of NumPy is that it’s possible to select elements from the array using a set of square brackets. For example, we can try to take the first row of the matrix:

a1 = np.array([[1, 2, 3], [4, 5, 6]])
a1[0]
array([1, 2, 3])

Let’s suppose that we want to select the third element of the first row. In this case, we need to specify two indices, the index of the row and the index of the column:

print(a1[0,2]) #3

An alternative is to use a1[0][2], but it’s considered less efficient because it first creates an intermediate array containing the first row and then selects the element from that row.
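A short check, using the same a1 as above, suggests that both spellings return the same element; the difference lies only in how many indexing steps are performed:

```python
import numpy as np

a1 = np.array([[1, 2, 3], [4, 5, 6]])

# One indexing operation: row 0, column 2
direct = a1[0, 2]

# Two operations: a1[0] builds an intermediate row array, then [2] indexes it
chained = a1[0][2]

print(direct, chained)  # 3 3
```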

Moreover, we can take slices from the matrix with the syntax start:stop:step inside the brackets, where the stop index is not included. For example, we can select the first row again, but take only its first two elements:

print(a1[0,0:2]) 
[1 2]

To select all the rows but extract only the first element of each row:

print(a1[:,0])
[1 4]

In addition to integer array indexing, there is also boolean array indexing to select elements from an array. Let’s suppose that we want only the elements that satisfy the following condition:

a1>5
array([[False, False, False],
       [False, False,  True]])

If we filter the array based on this condition, the output will contain only the elements for which the condition is True:

a1[a1>5]
array([6])
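Boolean masks can also be combined. The sketch below is an illustration rather than part of the original walkthrough: it joins two conditions with the element-wise & operator, and the parentheses are needed because & binds more tightly than the comparisons.

```python
import numpy as np

a1 = np.array([[1, 2, 3], [4, 5, 6]])

# Keep elements that are greater than 2 AND even
mask = (a1 > 2) & (a1 % 2 == 0)
selected = a1[mask]

print(selected)  # [4 6]
```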

Array Manipulation

When working on data science projects, you often need to give an array a new shape without changing its data.

For example, we start with an array of dimension 2×3. If we are not sure of our array’s shape, the attribute shape can help us:

a1 = np.array([[1, 2, 3], [4, 5, 6]])
print(a1)
print('Shape of Array: ', a1.shape)
[[1 2 3]
 [4 5 6]]
Shape of Array:  (2, 3)

To reshape the array to the dimension 3×2, we can simply use the method reshape:

a1 = a1.reshape(3, 2)
print(a1)
print('Shape of Array: ', a1.shape)
[[1 2]
 [3 4]
 [5 6]]
Shape of Array:  (3, 2)

Another common situation is turning a multidimensional array into a one-dimensional array. This is possible by specifying -1 as the shape:

a1 = a1.reshape(-1)
print(a1)
print('Shape of Array: ', a1.shape)
[1 2 3 4 5 6]
Shape of Array:  (6,)
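The -1 placeholder is not limited to flattening. As a small sketch (the variable names are illustrative), it can stand in for any single dimension, and NumPy infers its size from the number of elements:

```python
import numpy as np

flat = np.arange(1, 7)          # [1 2 3 4 5 6]

two_rows = flat.reshape(2, -1)  # 2 rows given, NumPy infers 3 columns
two_cols = flat.reshape(-1, 2)  # 2 columns given, NumPy infers 3 rows

print(two_rows.shape, two_cols.shape)  # (2, 3) (3, 2)
```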

You may also need to transpose an array:

a1 = np.array([[1, 2, 3, 4, 5, 6]])
print('Before shape of Array: ', a1.shape)
a1 = a1.T
print(a1)
print('After shape of Array: ', a1.shape)
Before shape of Array:  (1, 6)
[[1]
 [2]
 [3]
 [4]
 [5]
 [6]]
After shape of Array:  (6, 1)

Equivalently, you can apply the same transformation using np.transpose(a1).

Array Multiplication

If you try to build machine learning algorithms from scratch, you’ll surely need to calculate the matrix product of two arrays. This is possible using the function np.matmul when the arrays have more than one dimension:

a1 = np.array([[1, 2, 3], [4, 5, 6]])
a2 = np.array([[1, 2], [4, 5], [7, 8]])
print('Shape of Array a1: ', a1.shape)
print('Shape of Array a2: ', a2.shape)
a3 = np.matmul(a1, a2)   # a3 = a1 @ a2
print(a3)
print('Shape of Array a3: ', a3.shape)
Shape of Array a1:  (2, 3)
Shape of Array a2:  (3, 2)
[[30 36]
 [66 81]]
Shape of Array a3:  (2, 2)

@ can be a shorter alternative to np.matmul.
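A quick comparison, reusing the same two arrays, suggests that the operator and the function produce identical results. Note that the inner dimensions must agree: a (2, 3) array times a (3, 2) array gives a (2, 2) result.

```python
import numpy as np

a1 = np.array([[1, 2, 3], [4, 5, 6]])    # shape (2, 3)
a2 = np.array([[1, 2], [4, 5], [7, 8]])  # shape (3, 2)

via_function = np.matmul(a1, a2)
via_operator = a1 @ a2

print(np.array_equal(via_function, via_operator))  # True
print(via_operator.shape)  # (2, 2)
```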

To multiply a matrix by a scalar, np.dot also works:

a1 = np.array([[1, 2, 3], [4, 5, 6]])
a3 = np.dot(a1, 2)  # a3 = a1 * 2
print(a3)
print('Shape of Array a3: ', a3.shape)
[[ 2  4  6]
 [ 8 10 12]]
Shape of Array a3:  (2, 3)

In this case, * (or np.multiply) is a shorter alternative to np.dot, and it is the form the NumPy documentation recommends when one operand is a scalar.
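For scalar multiplication, the three spellings below should give the same array. This is a small check rather than a benchmark; the NumPy documentation for np.dot notes that np.multiply or * is preferred when one argument is a scalar.

```python
import numpy as np

a1 = np.array([[1, 2, 3], [4, 5, 6]])

with_dot = np.dot(a1, 2)            # works, but not the preferred form for scalars
with_multiply = np.multiply(a1, 2)  # element-wise multiplication
with_operator = a1 * 2              # operator form of np.multiply

print(np.array_equal(with_dot, with_operator))       # True
print(np.array_equal(with_multiply, with_operator))  # True
```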

Mathematical Functions

NumPy provides a huge variety of mathematical functions, such as the trigonometric functions, rounding functions, exponentials, logarithms and so on. You can find the full list here. We are going to show the most important functions that you can apply to your problems.

The exponential and the natural logarithm are among the most popular and best-known transformations:

a1 = np.array([[1, 2, 3], [4, 5, 6]])
print(np.exp(a1))
[[  2.71828183   7.3890561   20.08553692]
 [ 54.59815003 148.4131591  403.42879349]]
a1 = np.array([[1, 2, 3], [4, 5, 6]])
print(np.log(a1))
[[0.         0.69314718 1.09861229]
 [1.38629436 1.60943791 1.79175947]]

If we want to extract the minimum and the maximum of the whole array in a single line of code, we just need to call the following functions:

a1 = np.array([[1, 2, 3], [4, 5, 6]])
print(np.min(a1), np.max(a1))  # 1 6

We can also calculate the square root of each element of the array:

a1 = np.array([[1, 2, 3], [4, 5, 6]])
print(np.sqrt(a1))
[[1.         1.41421356 1.73205081]
 [2.         2.23606798 2.44948974]]
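Reductions such as np.min and np.max also accept an axis argument. The sketch below goes slightly beyond the examples above: it shows how the same array can be reduced per column (axis=0) or per row (axis=1) instead of over all elements.

```python
import numpy as np

a1 = np.array([[1, 2, 3], [4, 5, 6]])

col_min = np.min(a1, axis=0)  # minimum of each column
row_max = np.max(a1, axis=1)  # maximum of each row

print(col_min)  # [1 2 3]
print(row_max)  # [3 6]
```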

What is Pandas?

Pandas is built on NumPy and is useful for manipulating datasets. There are two main data structures: Series and DataFrame. While the Series is a sequence of values, the DataFrame is a table with rows and columns. In other words, a Series is a column of the DataFrame.

Create Series and Dataframe

To build a Series, we can just pass a list of values to the constructor:

import pandas as pd
type_house = pd.Series(['Loft', 'Villa'])
type_house
0     Loft
1    Villa
dtype: object

We can create a DataFrame by passing a dictionary of objects, in which the keys correspond to the column names and the values are the entries of the columns:

df = pd.DataFrame({'Price': [100000, 300000], 'date_construction': [1960, 2010]})
df.head()


Once the Dataframe is created, we can check the type of each column:

type(df.Price),type(df.date_construction)
(pandas.core.series.Series, pandas.core.series.Series)

It should be clear that columns are data structures of type Series.
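The same check can be written with bracket access, which also works for column names that are not valid Python identifiers. A minimal sketch on the two-row frame from above:

```python
import pandas as pd

df = pd.DataFrame({'Price': [100000, 300000], 'date_construction': [1960, 2010]})

# Bracket access and attribute access return the same Series
price = df['Price']

print(type(price).__name__)     # Series
print(price.equals(df.Price))   # True
```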

Summary functions

From now on, we show what Pandas can do using the bike sharing dataset, available on Kaggle. We can import the CSV file in the following way:

df = pd.read_csv('/kaggle/input/bike-sharing-demand/train.csv')
df.head()


Pandas can read not only CSV files, but also Excel, JSON, Parquet and other types of files. You can find the full list here.

From the output, we can visualise the first five rows of the dataframe. If we want to display the last four rows of the dataset, we use the tail() method:

df.tail(4)


A few rows are not enough to get a good idea of the data. A good way to start the analysis is by looking at the shape of the dataset:

df.shape                    #(10886, 12)

We have 10886 rows and 12 columns. Do you want to see the column names? It’s very intuitive to do:

df.columns


There is a method that displays all this information in a single output:

df.info()


If we want to display the statistics of each column, we can use the describe method:

df.describe()


It’s also important to extract information from the categorical fields. We can find the unique values and the number of unique values of the season column:

df.season.unique(),df.season.nunique()

Output:

(array([1, 2, 3, 4]), 4)

We can see that the values are 1, 2, 3 and 4, so there are four possible values. This check is crucial for understanding the categorical variables and spotting possible noise in the column.

To display the frequency of each level, we can use the value_counts() method:

df.season.value_counts()

The last step should be the inspection of the missing values in each column:

df.isnull().sum()

Luckily, we don’t have any missing values in any of these fields.

Indexing and Slicing

As in NumPy, data can be selected from the data structure by index. There are two main ways to take entries from the DataFrame:

  • iloc selects elements based on their integer position.
  • loc selects items based on labels or a boolean array.
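One practical difference between the two selectors is how they treat the end of a slice. The sketch below uses a small made-up frame (toy and its values are hypothetical, not the bike sharing data) with the default integer index: iloc excludes the stop position, as in NumPy, while loc includes the end label.

```python
import pandas as pd

# Hypothetical toy data with the default integer index 0..4
toy = pd.DataFrame({
    'season': [1, 1, 2, 2, 3],
    'temp': [9.8, 9.0, 14.8, 15.6, 22.1],
})

by_position = toy.iloc[0:3]  # positions 0, 1, 2 (stop excluded)
by_label = toy.loc[0:3]      # labels 0, 1, 2, 3 (end label included)

print(len(by_position), len(by_label))  # 3 4
```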

To select the first row, iloc is the best choice:

df.iloc[0]


If we want instead to select all the rows and only the second column, we can do the following:

df.iloc[:,1]

It’s also possible to select several columns at the same time:

df.iloc[0:3,[0,1,2,5]]

Selecting columns by integer position quickly becomes hard to read; it is clearer to specify the column names, which is possible using loc. Note that, unlike iloc, loc includes the end label of a slice, so 0:3 below returns four rows:

df.loc[0:3,['datetime','season','holiday','temp']]

As in NumPy, it’s possible to filter the DataFrame based on conditions. For example, we may want to return all the rows where weather is equal to 1:

df[df.weather==1]


In case we want to return an output with specific columns, we can use loc:

df.loc[df.weather==1,['season','holiday']]
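Conditions can be combined inside loc as well. The following sketch uses a small hypothetical frame (toy is made up for illustration) and joins two conditions with &, wrapping each one in parentheses:

```python
import pandas as pd

# Hypothetical toy data, not the Kaggle file
toy = pd.DataFrame({
    'weather': [1, 2, 1, 3],
    'season': [1, 1, 2, 2],
    'holiday': [0, 0, 1, 0],
})

# Rows where weather is 1 AND season is 2, restricted to two columns
subset = toy.loc[(toy.weather == 1) & (toy.season == 2), ['season', 'holiday']]

print(subset)
```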


Create new variables

The creation of new variables has a huge impact on extracting more information from the data and improving the interpretability. We can create a new categorical variable based on the values of workingday:

df['workingday_c'] = df['workingday'].apply(lambda x: 'work' if x == 1 else 'relax')
df[['workingday', 'workingday_c']].head()

If there are more than two possible values, it’s better to map the values using a dictionary and the method map:

diz_season = {1: 'winter', 2: 'spring', 3: 'summer', 4: 'fall'}
df['season_c'] = df['season'].map(lambda x: diz_season[x])
df[['season', 'season_c']].head()
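Series.map also accepts the dictionary directly, which avoids the lambda. A minimal sketch on a hypothetical Series of season codes (the mapping itself is the one used above):

```python
import pandas as pd

diz_season = {1: 'winter', 2: 'spring', 3: 'summer', 4: 'fall'}

# Hypothetical season codes standing in for df['season']
season = pd.Series([1, 2, 3, 4, 1])

# Passing the dict directly looks up each value as a key
season_c = season.map(diz_season)

print(season_c.tolist())  # ['winter', 'spring', 'summer', 'fall', 'winter']
```

One behavioural difference to keep in mind: with a dictionary, values that are missing from the mapping become NaN, whereas the lambda version raises a KeyError.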


Grouping and Sorting

It can happen that you want to group the data based on categorical column(s). This is possible using groupby:

df.groupby('season_c').agg({'count':['median','max']})


For each level of the season, we can observe the median and the maximum count of rented bikes. This output can be confusing without ordering based on a column. We can do it using the sort_values() method:

(df.groupby('season_c')
   .agg({'count': ['median', 'max']})
   .reset_index()
   .sort_values(by=('count', 'median'), ascending=False))

Now, the output makes more sense. We can deduce that the highest number of bikes rented is in summer, while winter is not a good season for renting bikes.
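The grouping and sorting steps can be tried without the Kaggle file. The sketch below uses a tiny made-up frame (toy and its counts are invented for illustration) and the same chain of groupby, agg, reset_index and sort_values:

```python
import pandas as pd

# Hypothetical toy data standing in for the bike sharing dataset
toy = pd.DataFrame({
    'season_c': ['winter', 'winter', 'summer', 'summer', 'fall'],
    'count': [10, 30, 200, 100, 80],
})

summary = (
    toy.groupby('season_c')
       .agg({'count': ['median', 'max']})  # two statistics -> two-level column names
       .reset_index()
       .sort_values(by=('count', 'median'), ascending=False)
)

print(summary)
```

Because two statistics are requested for one column, the result has two-level column names, which is why the sort key is the tuple ('count', 'median').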

Final thoughts

That’s it! I hope you have found this guide useful to learn the basics of NumPy and Pandas. They are often studied separately, but it can be insightful to learn NumPy first and then Pandas, which is built on top of NumPy.

There are surely methods that I didn’t cover within the tutorial, but the goal was to cover the most important and popular methods of these two libraries. The code can be found on Kaggle. Thanks for reading! Have a nice day!
Eugenia Anello is currently a research fellow at the Department of Information Engineering of the University of Padova, Italy. Her research project is focused on Continual Learning combined with Anomaly Detection.


The Go-To Friend for AI Programming

Yes, we are talking about Python. This modern programming language is ubiquitous in machine learning, data analysis and pretty much the entire tech ecosystem. If you scroll through Papers with Code, you’ll find that most machine learning research is done using PyTorch, a framework built with Python. The language isn’t used only in research but also in scripting, automation, web development, testing and more. But why is the language so popular?

It has a simple and readable syntax that resembles natural language. With more than 137,000 libraries covering data analysis, deep learning, computer vision and web development, to name a few, Python serves as a general-purpose language with a wide range of use cases. Python enjoys strong support from a community of active developers who contribute to the growth of the language by creating libraries, frameworks and tools. One example is the Python Package Index (PyPI), which hosts thousands of third-party packages that extend Python’s capabilities, enabling developers to solve complex problems efficiently.

Python and AI

Python is widely used to build AI models, more so than any other language. The language is the second most used one because it is simple, direct, and easy to learn. Python also allows computationally expensive libraries to be written in C and then imported as Python modules, meaning users do not have to write in C, which is clunkier and more difficult to work with.

This is done with Python’s CFFI. This module allows Python to leverage C libraries and, combined with tools like Cython, allows developers to write Pythonic code while achieving speeds comparable to those of C, which is particularly useful for performance-critical applications. Its popularity is evident in its 30 million downloads per month.

This is not limited to C: other programming languages that provide C-compatible interfaces can also be used, by creating a C layer around the functions in those languages.

Most importantly, the Python community is more focused than previous cultures on finding a Pythonic way to proceed and then advocating it. It has multiple independent communities of use: web, data science, ML, devops. It also built the right kind of libraries, like NumPy and pandas (for numerical computing and data analysis respectively), that sealed the deal for it in the scientific and research communities.

The language also saw massive support from corporations: Google invested heavily in building TensorFlow, while PyTorch is primarily developed and maintained by Facebook’s AI Research (FAIR) lab, which is part of Meta Platforms, Inc. It isn’t surprising that a bigger community usually means better support and more libraries, which in turn feeds the growth of that community.

The Python Software Foundation is responsible for maintaining and developing Python, and it is constantly adding new features and functionality. Users can be confident that the language will be supported for the foreseeable future, which makes Python a good choice for AI development.

Other languages catching up?

While none of the other languages hold up to the breadth of development in Python, they are nonetheless used for specific purposes. Rust is gaining attention in AI development due to its focus on memory safety, performance, and concurrent programming. Rust is known for preventing common programming errors that can lead to security vulnerabilities. This is crucial for AI systems that handle sensitive data. Its memory management is more manual compared to Python, but this provides fine-grained control over resources.

Ruby’s adoption in AI is not as widespread as Python, but its ease of use and community support make it an attractive choice for AI development in certain contexts. Ruby has gained attention in AI development, especially in the context of web applications that leverage AI features. Ruby has libraries like TensorFlow.rb, which brings TensorFlow to the Ruby community, and other AI-related gems.

Python still remains a dominant force in AI development, with more than 275,495 companies using it. The language is beginner friendly while at the same time being used by experts for the development of AI thanks to its extensive documentation.

There are many languages you can use, but it's hard to find documentation for anything other than Python.
That's why I think Python is the best bet to start learning.

— Santiago (@svpino) August 26, 2023

The language has a bright future, as it’s now being taught in schools and is part of the curriculum for students as young as 7 years old.

The post The Go-To Friend for AI Programming appeared first on Analytics India Magazine.

7 Frameworks to Build Your AI Ethically

As the field of AI flourishes, the need for some sort of digital Hippocratic oath on the part of the creators of these technologies has become apparent. While ethics in AI has only recently become a subject of discussion in the news, ethicists globally have been exploring the subject for decades, suggesting frameworks to build technology upon.

Out of the lot, we have handpicked 7 ethical AI frameworks for developers to incorporate ethics into their technological innovations.

Responsible Tech Playbook

In 2021, Thoughtworks employees put together a catalogue of techniques that can be used to step back and understand the ethical implications of their work, which they published as the Responsible Tech Playbook.

The 50-slide framework highlights three techniques: soliciting different points of view; identifying and addressing ethical challenges before they become bigger problems; and ensuring the technology is designed to meet the needs and support the values of the people.

There seems to be a fair amount of overlap among the approaches, so the study highlights a three-step subset that gets much of the benefit with a relatively small impact on a business’s current processes.

The PiE Model

Developed by Cansu Canca, the founder of AI Ethics Lab, the Puzzle-solving in Ethics (PiE) Model is a model for integrating ethics into the AI innovation cycle, implemented through consulting with the AI Ethics Lab.

The model, introduced in 2018, focuses on the core question ‘What is the right thing to do?’ in a systematic manner, while integrating ethics analyses and solutions at every step of innovation.

The Aletheia Framework

In 2020, Rolls-Royce released a comprehensive guide for businesses to check and balance their AI projects in a fair, ethical and trustworthy manner. The company said it is essentially a checklist for companies to consider the impacts of using AI before deciding whether to proceed.

The document looks across a total of 32 facets of societal impact, governance and trust, and transparency and requires executives and boards to provide evidence that these have been rigorously considered. The company released an updated version later in 2021.

NIST AI Risk Management Framework

The framework guides government agencies and the private sector on managing new AI risks and promoting responsible AI.

AI’s general-purpose nature makes it challenging for information technology risk management. The framework introduces “socio-technical” dimensions through the approach. Experts have pointed to the depth of the framework, especially its specificity in implementing controls and policies to better govern AI systems within different organisational contexts.

Securing Machine Learning Algorithms by ENISA

The European Union Agency for Cybersecurity (ENISA) released a framework in December 2021 addressing the security of ML algorithms. The report mainly focuses on identifying the potential risks and weaknesses associated with the technology.

Additionally, it recommends a set of security measures to upgrade cybersecurity in machine learning powered systems. One challenge highlighted in the report is the delicate balance between implementing security controls and maintaining the expected performance levels of these systems.

Ethical OS Framework

In collaboration with the Institute for the Future (IFTF), Omidyar Network, a Silicon Valley investment firm, took the first step with the release of the ‘Ethical OS’ toolkit in 2018.

The framework attempts to tackle challenges of great concern that have risen as an unintended consequence of technology. It’s a three-part toolkit to help technologists understand how tech might be compromised down the road, and build safeguards against future risks.

Big tech and societal sustainability: an ethical framework

The 2020 paper authored by Bernard Arogyaswamy argues that sustainability consists of three forces: economic, social, and ecological, in tension with one another.

The framework suggests that companies should have clear rules about what’s right and wrong, and the people who enforce these rules should have enough power. It highlights the importance of ethical criteria, social impacts, and what users think when they use big tech in a way that’s good for the long term and follows ethical principles.

The post 7 Frameworks to Build Your AI Ethically appeared first on Analytics India Magazine.

Indian Bank Partners with IBM to Boost Scalability and Agility

IBM today announced that it has collaborated with Indian Bank, a premier public sector bank, to accelerate its modernization path with flexible and secure compute infrastructure for deploying new front-end banking applications, while addressing the business growth of the Core Banking Solution (CBS) workloads for the bank.

The key objective of this partnership is to achieve greater agility, scalability, and security to meet the bank’s growth requirements in today’s dynamic and hyper-competitive market.

With over 100 million customers and a vast network of 5798 branches spanning India and overseas, Indian Bank as part of their ambitious “Project PARADISE,” sought to consolidate their Front Branch Servers for improved manageability and availability at their Data Centers (DC) and Disaster Recovery (DR) centers. Additionally, Indian Bank aims to run its Core Banking Solution (CBS) workloads on the latest generation of IBM Power® servers.

Deepak Sarda, chief general manager – IT and Partnership at Indian Bank, expressed his enthusiasm for this collaboration, stating, “As part of our capacity management initiative and to manage evolving application requirements, we wanted a cloud-like functionality within our own data center. With benefits like better flexibility, lower costs, and enhanced security, we are creating a value-oriented solution by setting up a proactive and service-oriented model for the bank.”

Viswanath Ramaswamy, vice president, technology, IBM India & South Asia, emphasized the importance of embracing new technologies in the banking sector. He stated, “Evolving customer needs demand a new operating model for banks driven by a nimbler digital environment. Those that embrace new technologies to modernize can gain a significant competitive edge, build market differentiation through innovation, and prepare themselves for the new financial frontier.”

IBM’s proposed solution for Indian Bank includes a hybrid cloud approach for all AIX workloads, aimed at boosting agility. This approach combines Capex and Opex models for the consumption of compute resources, effectively providing Infrastructure-as-a-Service (IaaS). It also creates a private cloud using available resources, establishing a single pool of processors across DC and DR sites, and enabling services equivalent to a pay-as-you-grow model.

To enhance security, Indian Bank is implementing IBM PowerSC for all AIX endpoints, ensuring better support for compliance audits. IBM PowerVC has been deployed for simplified virtualization management, cloud deployments, workload image management, and dynamic resource optimization for IBM AIX Servers.

By implementing Power Enterprise Pools, Indian Bank has streamlined the utilization of compute resources across their DC and DR sites, ultimately reducing the total cost of ownership.

The new cloud environment provided by IBM offers several advanced features, including virtual machine replication for consistency, a self-service portal for streamlined provisioning, and capabilities such as virtual image capture, deployment, and cataloging. Through the consolidation of front-end branch servers, Indian Bank enhances manageability and availability at its data centers in Chennai and Mumbai, ensuring uninterrupted service for customers.

In other news, Indian Bank recently set up 10 startup cells across different centres in the country. Through these, the bank aims to create banking products and services for startups, keeping in mind their unique and specialised banking requirements.

The post Indian Bank Partners with IBM to Boost Scalability and Agility appeared first on Analytics India Magazine.

Why PyTorch is Love


Ask almost any developer: in the ever-evolving world of deep learning and generative AI, the name PyTorch stands tall as a beacon of affection and admiration. While Large Language Models (LLMs) based on Transformers have been the talk of the town, they have not overshadowed the significance of more traditional frameworks like PyTorch.

PyTorch’s popularity among data scientists and engineers remains steadfast, and for good reason. According to several discussions, one of the primary attractions is PyTorch’s “inherent goodness.” It offers an intuitive and dynamic approach to building neural networks, making it an ideal choice for deep learning experiments and prototyping.

Unlike some other frameworks like TensorFlow, PyTorch has a reputation for keeping things simple yet powerful. Though TensorFlow is undoubtedly powerful, it is buggy. Some people even say that TensorFlow is still better in terms of production, but developers have just shifted to PyTorch. Even years ago, it was known to work flawlessly “out of the box” on relatively simple systems. This user-friendly aspect of PyTorch has endeared it to researchers and experimenters alike, making it a no-brainer choice for those in pursuit of innovation.

TensorFlow’s death gave rise to PyTorch’s glow

For researchers, PyTorch is like a trusted companion. Its flexibility and ease of use allow them to quickly implement their ideas, test hypotheses, and iterate on their models. The dynamic computation graph in PyTorch allows for real-time debugging and experimentation, which is crucial for refining algorithms and achieving breakthroughs.

While PyTorch reigns supreme in the realm of research and experimentation, TensorFlow has found its calling in end-user facing applications. It has become the framework of choice for deploying machine learning models in production environments. However, even within the realm of deep learning research, TensorFlow’s popularity has seen a decline.

TensorFlow is dead and if you're using it at work you're on a career dead end https://t.co/CC5h1zjrlc

— Jupyter Meowbooks 🐈/acc (@untitled01ipynb) August 12, 2023

Google learned from Meta’s PyTorch and made TensorFlow 2.0, which is better and easier for research than its previous version. Still, researchers have no reason to return to giving TensorFlow another chance. Now, with PyTorch 2.0 in the picture, hopes for TensorFlow fall even shorter.

Moreover, even Google and DeepMind have shifted away from TensorFlow in many of their projects. Instead, they have embraced JAX and frameworks built on top of it, such as Haiku and Flax. This shift underscores the evolving landscape of deep learning frameworks, with PyTorch and JAX emerging as the preferred options.

Moreover, Python is the king of AI research at the moment. Interestingly, PyTorch is often referred to as “Pythony” by developers. This explains its wide adoption: people shifted to PyTorch (built largely on Python) because it seemed comfortable and easy to use, with a gentler learning curve for new users.

Community is the reason for PyTorch’s success

Another reason for PyTorch’s success is its compatibility with NVIDIA’s CUDA. CUDA is the beloved framework for developing AI models, and PyTorch’s code just made it a lot easier for developers. Earlier, Google was leading it with TensorFlow, but Meta’s PyTorch won hearts with the ease of use.

Handling Cuda and multiple GPU is why I love PyTorch. Tensorflow is an headache lol

— Predstan (@RajiAdeola10) January 28, 2022

One of the other factors contributing to PyTorch’s enduring love is its strong presence within the Hugging Face ecosystem. The dominance of PyTorch in the Hugging Face ecosystem is evident from the statistics on StackOverflow Developer Survey. In 2022, a staggering 45,000 PyTorch exclusive models were added to Hugging Face, while only 4,000 new TensorFlow exclusives made their way onto the platform. This resulted in a whopping 92% of models on Hugging Face being PyTorch exclusive, leaving a mere 8% for TensorFlow.

This disparity in model availability on Hugging Face showcases the widespread preference for PyTorch among developers and researchers. It also underlines the practicality and efficiency that PyTorch offers in creating and deploying state-of-the-art models. Moreover, PyTorch’s core developers are known for their responsiveness to user issues and feature requests. This dynamic interaction fosters a sense of partnership between the framework creators and its users, further solidifying PyTorch’s place in the hearts of many.

The warmth and affection for PyTorch extend beyond its technical merits. It can be attributed to the vibrant and supportive community that has grown around it. PyTorch enthusiasts and experts readily share knowledge, offer assistance, and collaborate on open-source projects.

As we look at the future of deep learning and artificial intelligence, it becomes increasingly clear that PyTorch and JAX are poised to play pivotal roles. These frameworks offer the flexibility and performance needed to tackle the complex challenges of tomorrow. The fusion of PyTorch’s user-centric design and JAX’s efficiency paints a promising picture of what lies ahead.

The post Why PyTorch is Love appeared first on Analytics India Magazine.

Embracing Neuronal Diversity: A Leap in AI Efficiency and Performance

The role of diversity has been a subject of discussion in various fields, from biology to sociology. However, a recent study from North Carolina State University's Nonlinear Artificial Intelligence Laboratory (NAIL) opens an intriguing dimension to this discourse: diversity within artificial intelligence (AI) neural networks.

The Power of Self-Reflection: Tuning Neural Networks Internally

William Ditto, professor of physics at NC State and director of NAIL, and his team built an AI system that can “look inward” and adjust its neural network. The process allows the AI to determine the number, shape, and connection strength between its neurons, offering the potential for sub-networks with different neuronal types and strengths.

“We created a test system with a non-human intelligence, an artificial intelligence, to see if the AI would choose diversity over the lack of diversity and if its choice would improve the performance of the AI,” says Ditto. “The key was giving the AI the ability to look inward and learn how it learns.”

Unlike conventional AI that uses static, identical neurons, Ditto's AI has the “control knob for its own brain,” enabling it to engage in meta-learning, a process that boosts its learning capacity and problem-solving skills. “Our AI could also decide between diverse or homogenous neurons,” Ditto states, “And we found that in every instance the AI chose diversity as a way to strengthen its performance.”

Progression from conventional artificial neural network to diverse neural network to learned diverse neural network. Line thicknesses represent weights

Performance Metrics: Diversity Trumps Uniformity

The research team measured the AI's performance with a standard numerical classifying exercise and found remarkable results. Conventional AIs, with their static and homogenous neural networks, managed a 57% accuracy rate. In contrast, the meta-learning, diverse AI reached a staggering 70% accuracy.

According to Ditto, the diversity-based AI is up to 10 times more accurate at solving more complex tasks, such as predicting a pendulum's swing or the motion of galaxies. “Indeed, we also observed that as the problems become more complex and chaotic, the performance improves even more dramatically over an AI that does not embrace diversity,” he elaborates.

The Implications: A Paradigm Shift in AI Development

The findings of this study have far-reaching implications for the development of AI technologies. They suggest a paradigm shift from the currently prevalent ‘one-size-fits-all' neural network models to dynamic, self-adjusting ones.

“We have shown that if you give an AI the ability to look inward and learn how it learns it will change its internal structure — the structure of its artificial neurons — to embrace diversity and improve its ability to learn and solve problems efficiently and more accurately,” Ditto concludes. This could be especially pertinent in applications that require high levels of adaptability and learning, from autonomous vehicles to medical diagnostics.

This research not only shines a spotlight on the intrinsic value of diversity but also opens up new avenues for AI research and development, underlining the need for dynamic and adaptable neural architectures. With ongoing support from the Office of Naval Research and other collaborators, the next phase of research is eagerly awaited.

By embracing the principles of diversity internally, AI systems stand to gain significantly in terms of performance and problem-solving abilities, potentially revolutionizing our approach to machine learning and AI development.

This veteran VC doesn’t think ARM’s IPO will have the impact that everyone is hoping it will

Connie Loizos (@Cookie)

The startup industry has been whistling a happy tune ever since the British chip designer ARM filed paperwork with the SEC late last month for an initial public offering. The growing expectation is that the hotly anticipated offering will force open the IPO window for many other offerings, too. But while ARM’s beleaguered owner, SoftBank, is likely to wring out a substantial return from ARM once it’s rolled out on the Nasdaq, one “blockbuster IPO” may have far less impact on the rest of the industry than many anticipate, says former operator, entrepreneur, and longtime VC Heidi Roizen.

We recently talked with Roizen — who has spent the last decade with Threshold Ventures — about the offering and what else is happening in the market right now. You can listen to that longer conversation here or read excerpts from it, edited for length, below.

TC: You have a new podcast and recently covered down rounds — a big topic this year. Is there any non-conventional wisdom for founders you can offer? VCs I’ve talked with throughout the year say it’s better to take a lower valuation than accept certain terms, or “structure,” in order to maintain an inflated valuation.

HR: Sure, venture capitalists will say, ‘Just take the lower valuation.’ But I think it’s one thing to tell people, ‘Terms are more important than valuation.’ It’s another thing to show someone, ‘Hey, you’re gonna walk away with 24% if you do this, but you’re gonna walk away with 48% if you do that.’ Entrepreneurs should run the math and make sure [they] understand that when [they’re] giving downside protection [to VCs], that’s probably going to come out of their own pocket. On the podcast, what I’ve tried to do is give them real examples.

“Participating preferred” is a term that no one heard for many years and which resurfaced this year. What else were many founders not exposed to previously and so are struggling with?

There’s a lot going on right now that entrepreneurs need to be aware of. The financing world is just one component. Compensation is another place where [founders] really have to look and say, ‘We need to right size.’ I’m also working on a future episode about secondaries.

Secondaries are interesting in that they were once seen as something shameful that you didn’t discuss, then it was fine to discuss them — you were actually smart taking money off the table. Then things really went haywire, with founders allowed to sell a lot of shares in their company — sometimes at sky-high prices — at the same time they were raising primary capital from investors.

It became Netflix documentary material.

Exactly! What did you make of a recent report that Tiger Global is nearing a sale of part of its stake in a very buzzy AI company called Cohere. According to The Information, it’s selling 2.1% of its stake and keeping 5%. Basically, it’s just pulling out the money that it put into the company and taking it off the table. Tiger is reportedly having liquidity issues, but doesn’t that kind of secondary sale also impact how the market sees Cohere?

I think it’s more of an indicator about Tiger than Cohere. It’s a very small percent [that it’s selling]. Tiger is purportedly in a cash crunch, and they’re portfolio managers. They look around at their holdings and they say, ‘Gee, we have a bunch that if we were to try to sell in a secondary, we’d have to take a loss. Meanwhile, we have Cohere where it’s even money, so we can book that and it doesn’t hit our books that bad. We return the money of the LPs and it’s kind of a wash.’ Part of those are psychological decisions. It’s very hard to sell your losers.

In separate AI news, Salesforce just led a big round in the AI startup Hugging Face, which is just the latest bet for Salesforce, which also has stakes in Cohere and Anthropic. As someone on an AI committee at Stanford, do you think relationships with strategic investors are any more important for today’s AI startups than other types of startups? It’s nice to have the muscle of a Salesforce or an Oracle behind you, but there are downsides as well.

Strategic investors are a huge part of the financial ecosystem for entrepreneurs. Something like 20% of all deals have a strategic investor in them. But as I once said to an entrepreneur, ‘When I invest in you, I only make money if your stock goes up. But when a strategic invests in you, they also make money when their stock goes up.’ To me, that summarizes something really important. I understand Salesforce paid like 100 times revenue and to the best of my knowledge, there is no public company trading at 100 times revenue. Unless you’re planning to sell that stock sometime in the future, that’s a pretty aggressive price.

If you are also doing some sort of coincident biz dev deal that is going to allow you to leverage what [a startup has] into your customer base and into your technology and into your new market segments, that makes your stock go up. So we’re going to have to wait and see, but I would imagine that that’s how [Salesforce] justified paying a price like that.

In the meantime, everyone is waiting on this ARM IPO. The widespread thinking seems to be that this chip design company is going to be worth anywhere from $40 billion to $80 billion and blow open the IPO window. Do you think so, too?

Every company that goes public is different. I’ve never understood this concept of, ‘Well, the market is closed, but you take one super big company, and you put it out there, and all of a sudden everybody gets to go public again.’ I personally don’t understand that. So, no, I don’t think it’s gonna blow the market open and that a whole line is going to march out there and we’re going to have 50 IPOs between now and December.

Getting Started with Python Data Structures in 5 Steps

Introduction to Python Data Structures

When it comes to learning how to program, regardless of the particular programming language you use, you find that most of what you are exposed to falls into a few major topics. A few of these, in general order of grokking, are: syntax (the vocabulary of the language); commands (putting the vocabulary together in useful ways); flow control (how we guide the order of command execution); algorithms (the steps we take to solve specific problems… how did this become such a confounding word?); and, finally, data structures (the virtual storage depots that we use for data manipulation during the execution of algorithms, which are, again, a series of steps).

Essentially, if you want to implement the solution to a problem, by cobbling together a series of commands into the steps of an algorithm, at some point data will need to be processed, and data structures will become essential. Such data structures provide a way to organize and store data efficiently, and are critical for creating fast, modular code that can perform useful functions and scale well. Python, a particular programming language, has a series of built-in data structures of its own.

This tutorial will focus on these four foundational Python data structures:

  • Lists — Ordered, mutable, allows duplicate elements. Useful for storing sequences of data.
  • Tuples — Ordered, immutable, allows duplicate elements. Think of them as immutable lists.
  • Dictionaries — Insertion-ordered (since Python 3.7), mutable, mapped by key-value pairs with unique keys. Useful for storing data in a key-value format.
  • Sets — Unordered, mutable, contains unique elements. Useful for membership testing and eliminating duplicates.

Beyond the fundamental data structures, Python also offers more advanced ones, such as heaps and queues through standard-library modules like heapq, queue, and collections, while others, such as linked lists, are straightforward to implement yourself. These advanced structures, built upon the foundational ones, enable more complex data handling and are often used in specialized scenarios. But you aren't constrained here; you can use all of the existing structures as a base to implement your own structures as well. However, an understanding of lists, tuples, dictionaries, and sets remains paramount, as these are the building blocks for more advanced data structures.
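
As a brief illustration of the heaps and queues mentioned above, here is a minimal sketch using two standard-library modules, heapq and collections.deque. The variable names and values are made up for the example; treat it as one possible way to try these structures, not the only one.

```python
import heapq
from collections import deque

# A heap lives inside an ordinary list; heapq maintains the heap property.
tasks = [5, 1, 4]
heapq.heapify(tasks)               # rearrange the list into a min-heap in place
heapq.heappush(tasks, 2)           # add a new item
smallest = heapq.heappop(tasks)    # remove and return the smallest item

# A deque supports fast appends and pops at both ends, which suits queues.
queue = deque(["a", "b"])
queue.append("c")                  # enqueue on the right
first = queue.popleft()            # dequeue from the left

print(smallest, first, list(queue))
```

Note that the heap is still just a list underneath, which is one sense in which the advanced structures build on the foundational ones.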

This guide aims to provide a clear and concise understanding of these core structures. As you start your Python journey, the following sections will guide you through the essential concepts and practical applications. From creating and manipulating lists to leveraging the unique capabilities of sets, this tutorial will equip you with the skills needed to excel in your coding.

Step 1: Using Lists in Python

What is a List in Python?

A list in Python is an ordered, mutable data type that can store various objects, allowing for duplicate elements. Lists are defined by the use of square brackets [ ], with elements being separated by commas.

For example:

fibs = [0, 1, 1, 2, 3, 5, 8, 13, 21]

Lists are incredibly useful for organizing and storing data sequences.

Creating a List

Lists can contain different data types, like strings, integers, booleans, etc. For example:

mixed_list = [42, "Hello World!", False, 3.14159]

Manipulating a List

Elements in a list can be accessed, added, changed, and removed. For example:

# Access 2nd element (indexing begins at 0)
print(mixed_list[1])

# Append element
mixed_list.append("This is new")

# Change element
mixed_list[0] = 5

# Remove last element (pop() with no index removes the final item)
mixed_list.pop()

Useful List Methods

Some handy built-in methods for lists include:

  • sort() — Sorts list in-place
  • append() — Adds element to end of list
  • insert() — Inserts element at index
  • pop() — Removes element at index
  • remove() — Removes first occurrence of value
  • reverse() — Reverses list in-place
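
To see how the methods listed above behave, the following short sketch applies each of them in turn to a small throwaway list (the name nums and its values are illustrative only). The comments show the state of the list after each call.

```python
nums = [3, 1, 2]

nums.append(5)        # [3, 1, 2, 5]
nums.insert(1, 4)     # [3, 4, 1, 2, 5]
nums.sort()           # [1, 2, 3, 4, 5]
nums.reverse()        # [5, 4, 3, 2, 1]
nums.remove(4)        # removes the first 4 -> [5, 3, 2, 1]
last = nums.pop()     # no index: removes and returns the last item (1)
first = nums.pop(0)   # index 0: removes and returns the first item (5)

print(nums, last, first)
```

Keep in mind that sort() and reverse() modify the list in place and return None, so their result should not be assigned to a variable.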

Hands-on Example with Lists

# Create shopping cart as a list
cart = ["apples", "oranges", "grapes"]

# Sort the list
cart.sort()

# Add new item
cart.append("blueberries")

# Remove first item
cart.pop(0)

print(cart)

Output:

['grapes', 'oranges', 'blueberries']

Step 2: Understanding Tuples in Python

What Are Tuples?

Tuples are another type of sequence data type in Python, similar to lists. However, unlike lists, tuples are immutable, meaning their elements cannot be altered once created. They are defined by enclosing elements in parentheses ( ).

# Defining a tuple
my_tuple = (1, 2, 3, 4)

When to Use Tuples

Tuples are generally used for collections of items that should not be modified. Tuples are typically a little faster to create and use less memory than lists, which makes them a good fit for read-only data. Some common use-cases include:

  • Storing constants or configuration data
  • Function return values with multiple components
  • Dictionary keys, since tuples are hashable as long as all of their elements are hashable
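
Two of the use-cases above can be sketched in a few lines. The function min_max and the coordinate-to-city mapping are hypothetical, made up purely for illustration.

```python
def min_max(values):
    """Return the smallest and largest value as a single tuple."""
    return min(values), max(values)

low, high = min_max([4, 9, 1, 7])   # the returned tuple is split into two names

# Tuples of hashable items can serve as dictionary keys.
# The coordinates and city names are illustrative only.
city_by_coords = {
    (41.88, -87.63): "Chicago",
    (40.71, -74.01): "New York",
}
city = city_by_coords[(41.88, -87.63)]

print(low, high, city)
```

A list could not be used as a key in the same way, because lists are mutable and therefore not hashable.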

Accessing Tuple Elements

Accessing elements in a tuple is done in a similar manner as accessing list elements. Indexing and slicing work the same way.

# Accessing elements
first_element = my_tuple[0]
sliced_tuple = my_tuple[1:3]

Operations on Tuples

Because tuples are immutable, many list operations like append() or remove() are not applicable. However, you can still perform some operations:

  • Concatenation: Combine tuples using the + operator.
concatenated_tuple = my_tuple + (5, 6)
  • Repetition: Repeat a tuple using the * operator.
repeated_tuple = my_tuple * 2
  • Membership: Check if an element exists in a tuple with the in keyword.
exists = 1 in my_tuple

Tuple Methods

Tuples have fewer built-in methods compared to lists, given their immutable nature. Some useful methods include:

  • count(): Count the occurrences of a particular element.
count_of_ones = my_tuple.count(1)
  • index(): Find the index of the first occurrence of a value.
index_of_first_one = my_tuple.index(1)

Tuple Packing and Unpacking

Tuple packing and unpacking are convenient features in Python:

  • Packing: Assigning multiple values to a single tuple.
packed_tuple = 1, 2, 3
  • Unpacking: Assigning tuple elements to multiple variables.
a, b, c = packed_tuple
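
Packing and unpacking also enable a couple of common idioms. The sketch below, with made-up variable names, swaps two values without a temporary variable and uses a starred name to collect the remaining items.

```python
# Swap two variables: the right-hand side is packed into a tuple,
# then unpacked into the names on the left.
x, y = 10, 20
x, y = y, x

# A starred name collects any remaining items into a list.
head, *rest = (1, 2, 3, 4)

print(x, y, head, rest)
```

Without a starred name, the number of variables on the left must match the number of elements in the tuple, otherwise Python raises a ValueError.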

Immutable but Not Strictly

While tuples themselves are immutable, they can contain mutable elements like lists.

# Tuple with mutable list
complex_tuple = (1, 2, [3, 4])

Note that while you can't change the tuple itself, you can modify the mutable elements within it.
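
The distinction can be checked directly. In this small sketch, appending to the inner list succeeds, while assigning to one of the tuple's own slots raises a TypeError; the flag reassigned is just a helper name for the demonstration.

```python
complex_tuple = (1, 2, [3, 4])

# Mutating the list stored inside the tuple is allowed.
complex_tuple[2].append(5)

# Rebinding a slot of the tuple itself is not.
try:
    complex_tuple[0] = 99
    reassigned = True
except TypeError:
    reassigned = False

print(complex_tuple, reassigned)
```

One consequence is that a tuple containing a list is not hashable, so it cannot be used as a dictionary key or as a set element.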

Step 3: Mastering Dictionaries in Python

What is a Dictionary in Python?

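A dictionary in Python is a mutable data type that stores mappings of unique keys to values. Since Python 3.7, dictionaries preserve insertion order, although items are looked up by key rather than by position. Dictionaries are written with curly braces { } and consist of key-value pairs separated by commas.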

For example:

student = {"name": "Michael", "age": 22, "city": "Chicago"}

Dictionaries are useful for storing data in a structured manner and accessing values by keys.

Creating a Dictionary

Dictionary keys must be immutable objects like strings, numbers, or tuples. Dictionary values can be any object.

student = {"name": "Susan", "age": 23}

prices = {"milk": 4.99, "bread": 2.89}

Manipulating a Dictionary

Elements can be accessed, added, changed, and removed via keys.

# Access value by key
print(student["name"])

# Add new key-value
student["major"] = "computer science"

# Change value
student["age"] = 25

# Remove key-value (del raises KeyError if the key is missing,
# so we remove a key that is known to exist)
del student["major"]

Useful Dictionary Methods

Some useful built-in methods include:

  • keys() — Returns a view of the keys
  • values() — Returns a view of the values
  • items() — Returns a view of (key, value) tuples
  • get() — Returns value for key, avoids KeyError
  • pop() — Removes key and returns value
  • update() — Adds multiple key-values
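
Here is a minimal sketch of the methods above in action, using a small prices dictionary with made-up values. It assumes Python 3.7 or later, where dictionaries preserve insertion order.

```python
prices = {"milk": 4.99, "bread": 2.89}

# get() returns a default instead of raising KeyError for a missing key.
egg_price = prices.get("eggs", 0.0)

# update() adds or overwrites several key-value pairs at once.
prices.update({"eggs": 3.49, "milk": 5.29})

# pop() removes a key and returns its value.
bread_price = prices.pop("bread")

# keys(), values(), and items() return views; wrap them in list() to copy.
names = list(prices.keys())
pairs = list(prices.items())

print(egg_price, bread_price, names, pairs)
```

Because the views are live, they reflect later changes to the dictionary, which is why copying them with list() is useful when you need a snapshot.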

Hands-on Example with Dictionaries

scores = {"Francis": 95, "John": 88, "Daniel": 82}

# Add new score
scores["Zoey"] = 97

# Remove John's score
scores.pop("John")

# Get Daniel's score
print(scores.get("Daniel"))

# Print all student names
print(scores.keys())

Step 4: Exploring Sets in Python

What is a Set in Python?

A set in Python is an unordered, mutable collection of unique, hashable (typically immutable) objects. Sets are written with curly braces { } but, unlike dictionaries, do not have key-value pairs.

For example:

numbers = {1, 2, 3, 4}

Sets are useful for membership testing, eliminating duplicates, and mathematical operations.

Creating a Set

Sets can be created from lists by passing the list to the set() constructor:

my_list = [1, 2, 3, 3, 4]
my_set = set(my_list)  # {1, 2, 3, 4}

Sets can contain mixed data types like strings, booleans, etc.
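
As a quick sketch of what a set will and will not accept, the example below builds a mixed-type set, runs a membership test, and shows that an unhashable element such as a list is rejected. All names and values are illustrative.

```python
# Strings, numbers, and tuples are hashable, so they can share one set.
mixed = {"apple", 42, 3.14, (1, 2)}

# Membership tests on a set are fast on average.
has_42 = 42 in mixed

# Lists are mutable and unhashable, so they cannot be set elements.
try:
    bad = {[1, 2], 3}
    list_allowed = True
except TypeError:
    list_allowed = False

# Note: {} creates an empty dict; use set() for an empty set.
empty = set()

print(has_42, list_allowed, type(empty).__name__)
```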

Manipulating a Set

Elements can be added and removed from sets.

numbers.add(5)
numbers.remove(1)

Useful Set Operations

Some useful set operations include:

  • union() — Returns union of two sets
  • intersection() — Returns intersection of sets
  • difference() — Returns difference between sets
  • symmetric_difference() — Returns symmetric difference

Hands-on Example with Sets

A = {1, 2, 3, 4}
B = {2, 3, 5, 6}

# Union - combines sets
print(A | B)

# Intersection
print(A & B)

# Difference
print(A - B)

# Symmetric difference
print(A ^ B)

Step 5: Comparing Lists, Tuples, Dictionaries, and Sets

Comparison of Characteristics

The following is a concise comparison of the four Python data structures we referred to in this tutorial.

Structure  | Ordered                             | Mutable | Duplicate Elements    | Use Cases
List       | Yes                                 | Yes     | Yes                   | Storing sequences
Tuple      | Yes                                 | No      | Yes                   | Storing immutable sequences
Dictionary | Yes (insertion order, Python 3.7+)  | Yes     | Keys: No; Values: Yes | Storing key-value pairs
Set        | No                                  | Yes     | No                    | Eliminating duplicates, membership testing

When to Use Each Data Structure

Treat this as a soft guideline for which structure to turn to first in a particular situation.

  • Use lists for ordered, sequence-based data. Useful for stacks; for queues, collections.deque is usually a better fit because removing from the front of a list is slow.
  • Use tuples for ordered, immutable sequences. Useful when you need a fixed collection of elements that should not be changed.
  • Use dictionaries for key-value data. Useful for storing related properties.
  • Use sets for storing unique elements and mathematical operations.

Hands-on Example Using All Four Data Structures

Let's have a look at how these structures can all work together in an example that is a little more complex than a one-liner.

import random

# Make a list of person names
names = ["John", "Mary", "Bob", "Mary", "Sarah"]

# Make a tuple of additional information (e.g., email)
additional_info = ("john@example.com", "mary@example.com", "bob@example.com", "mary@example.com", "sarah@example.com")

# Make set to remove duplicates
unique_names = set(names)

# Make dictionary of name-age pairs
persons = {}
for name in unique_names:
    persons[name] = random.randint(20, 40)  # random age from 20 to 40, inclusive

print(persons)

Output (the ages are random and set ordering can vary, so your result will differ):

{'John': 34, 'Bob': 29, 'Sarah': 25, 'Mary': 21}

This example utilizes a list for an ordered sequence, a tuple for storing additional immutable information, a set to remove duplicates, and a dictionary to store key-value pairs.

Moving Forward

In this comprehensive tutorial, we've taken a deep look at the foundational data structures in Python, including lists, tuples, dictionaries, and sets. These structures form the building blocks of Python programming, providing a framework for data storage, processing, and manipulation. Understanding these structures is essential for writing efficient and scalable code. From manipulating sequences with lists, to organizing data with key-value pairs in dictionaries, and ensuring uniqueness with sets, these essential tools offer immense flexibility in data handling.

As we've seen through code examples, these data structures can be combined in various ways to solve complex problems. By leveraging these data structures, you can open the doors to a wide range of possibilities in data analysis, machine learning, and beyond. Don't hesitate to explore the official Python data structures documentation for more insights.

Happy coding!

Matthew Mayo (@mattmayo13) holds a Master's degree in computer science and a graduate diploma in data mining. As Editor-in-Chief of KDnuggets, Matthew aims to make complex data science concepts accessible. His professional interests include natural language processing, machine learning algorithms, and exploring emerging AI. He is driven by a mission to democratize knowledge in the data science community. Matthew has been coding since he was 6 years old.


After Getting Banned in Schools, OpenAI Launches ChatGPT Tool for Teachers

When students were finding different ways around ChatGPT and other GPT applications to assist them with school or college work, and even self-learn a thing or two, the question of who needs teachers when you have AI arose. To carve a niche, and probably appease the teacher segment who seemed to be left out, OpenAI announced a new guide for teachers to use ChatGPT in classrooms. The announcement comes within days of releasing an exclusive product for enterprises with ChatGPT Business. Catering to targeted professionals, has OpenAI finally managed to fine-tune use cases for its famed chatbot?

Teacher’s Pet

The newly released guide contains recommended prompts, an overview of ChatGPT’s functioning and limitations, the efficacy of AI detectors, and a discussion on biases. In the announcement blog, the company even shared use cases from professors and teachers who have already been using the chatbot to assist with their teaching. From building quizzes, tests, and lesson plans from curriculum materials to role-playing challenging conversations, ChatGPT has already been functioning as a companion for teaching professionals.

Brand New Packaging

With teachers already using ChatGPT, the ‘teacher’s guide’ released by OpenAI is a mere bundling of ChatGPT features being highlighted to encourage instructors, and normalise the idea of using the chatbot in an education domain. Considering how ChatGPT saw a decline in users a few months ago, probably owing to schools and colleges being shut for summer, the latest blog comes at a time when education institutes have reopened. Furthermore, there is seemingly a shift in the way ChatGPT is adopted today.

Earlier this year, a number of universities banned the use of the chatbot. However, educators are now accepting the practicality of ChatGPT, which can be used in a number of ways that assist students and teachers alike.

Lance Eaton, Director of faculty development & innovation at College Unbound, compiled a spreadsheet on generative AI policies in universities, mainly featuring entries about restricting tools like ChatGPT last spring. However, today, the new list maintained by educators from various universities, both in the US and internationally, is inclusive of ChatGPT. Instead of bans, schools are actively promoting and instructing students on how to effectively utilise these tools.

OpenAI Hits Bull’s-eye

With the educational shift happening, OpenAI has been on a spree of launching products and features targeting professionals from different domains. Hitting bull’s-eye by targeting the right set of groups, OpenAI is pushing its way via product adoption.

The latest offering, ChatGPT Business, provides a range of product functionalities that cater exclusively to enterprises and is poised to be a game-changer. Considering how many organisations had been building products using ChatGPT APIs, the enterprise-specific ChatGPT Business will pose a threat to SaaS startups.

Two months ago, OpenAI released a product that would make a data scientist obsolete. Code Interpreter, now called Advanced Data Analysis, assists with tasks such as visualisation, trend analysis, data transformation, and workflow management, among others.

Not All Problems Have Solutions

While OpenAI continues to innovate through ways that will help people at their jobs, it truly will never be rid of problems.

I am most happy that they included this blunt statement on AI detectors, which promise what they can't deliver, and can hurt students as a result. pic.twitter.com/bmuGluGh81

— Ethan Mollick (@emollick) August 31, 2023

Through a list of FAQs, OpenAI has stated its limitations in various scenarios. The company has clarified that though there are tools to detect AI-generated content, including OpenAI’s own, none of them has been proven to work effectively. ChatGPT also tends to give incorrect information when asked whether a piece of content is human- or AI-generated.

Geetha Venugopal, a high school computer science teacher at American International School in Chennai, emphasises that students should apply critical thinking when using AI tools. She counsels students to keep in mind that ChatGPT’s responses may not always be trustworthy and accurate. She also encourages them to evaluate answers and to verify the information using other primary sources.

Though laden with shortcomings, ChatGPT keeps finding a place in every field. Now that OpenAI is able to fine-tune use cases, it is possible that more improvements and features are in the pipeline.

The post After Getting Banned in Schools, OpenAI Launches ChatGPT Tool for Teachers appeared first on Analytics India Magazine.

Integrating GenAI into “Thinking Like a Data Scientist” Methodology – Part III

This blog post is not the end of my journey to integrate GenAI with my “Thinking Like a Data Scientist” (TLADS) methodology, but it is the last post on this leg of the journey. And the journey has been fascinating. I can’t wait to get this modified material in front of my students.

In part 1 of the series, I used GenAI (Bing AI) to enhance the first three steps in the TLADS methodology (Figure 1):

  • TLADS Step #1: Identify Business Initiative. GenAI augmented my assessment of the targeted business initiative. That included identifying additional desired outcomes, benefits, impediments, failure risks, unintended consequences, and KPIs and metrics I needed to explore.
  • TLADS Step #2: Empathize with Stakeholders. GenAI helped me understand my key internal and external stakeholders more thoroughly (through persona-based prompting). That included the business initiative’s importance to them, their key decisions, and the KPIs and metrics against which they would measure the success of the business initiative. Note: GenAI does not replace conducting stakeholder interviews, but it does prepare one for a more relevant and engaging interview.
  • TLADS Step #3: Understand Business Entities. Finally, GenAI helped me identify, explore, and understand the business initiative’s key business entities. The business entity’s predicted behavioral and performance propensities are vital in optimizing the decisions that support our prioritized use cases.

At this point in the GenAI – TLADS integration process, I now have a very comprehensive prompt that sets the stage for part 2 of the series, where I used GenAI to enhance Steps 4 and 5 of the TLADS methodology:

  • TLADS Step #4: Identify Use Cases. Because the business initiative detail is already entered via prompts into the GenAI tool, I could thoroughly explore the use cases that support the targeted business initiative. That included identifying and exploring the factors that impact use case “value” and “implementation feasibility” scores used in the use case prioritization process.
  • TLADS Step #5: Brainstorm Scores & Features. This was my biggest aha in integrating GenAI with the TLADS methodology. I could leverage the detailed information about my targeted use case to identify and explore analytic scores and their supporting features.

In this final part of this series, I will explore the integration of GenAI (Bing AI) into the final steps of the TLADS process:

  • TLADS Step #6: Explore Analytic Algorithms. We can build off the scores and features identified in Step #5 to explore the data science requirements more deeply. This is an entirely new step of the TLADS process, totally enabled by the GenAI tool.
  • TLADS Step #7: Map Scores to Recommendations. We will use Bing to validate the linkage between analytic scores and the recommendations that will drive the precision decisions that optimize our prioritized use case.
  • TLADS Step #8: Analytic Score Model Feedback. We will use Bing to ensure we have created a holistic feedback loop to build analytic scores that can continuously learn and adapt.

TLADS Step #6: Explore Analytic Algorithms

Nowhere did Bing AI impact the existing TLADS methodology more than this new step. I built on the learnings from Step 5 (Brainstorm Scores & Features) to explore what analytic algorithms I should use to create the analytic scores and why Bing felt these were the most relevant algorithms to generate that score (Figure 1).

Figure 1: TLADS Step 6: Explore Analytic Algorithms

This presents an excellent opportunity to expand the understanding and familiarity of “Citizens of Data Science” with the different analytic algorithms, even if they never program machine learning (ML) models. By exposing them to the various algorithms and how they collaborate to generate an analytic score, their comfort level in working alongside the data science team to develop, design, and implement the analytic scores is greatly enhanced.

But wait, there is more. I leveraged GenAI not only to identify the algorithms for creating a specific analytic score; GenAI even wrote the pseudocode for creating that score. See the Bing response below for the pseudocode that creates the customer loyalty score using the k-means clustering algorithm.

=====

Bing Response: Pseudo code for creating the customer loyalty score using k-means clustering:

Step 1: Define the number of clusters, k, to be generated by the algorithm. For example, we can use two clusters: loyal and non-loyal customers.

k = 2

Step 2: Load the customer data from a CSV file.

customer_data = load_csv("customer_data.csv")

Step 3: Extract the features that affect the customer loyalty score. These are:

  • Number of visits: The more visits a customer makes to Chipotle, the higher their loyalty score is.
  • Average spend per visit: The more a customer spends per visit at Chipotle, the higher their loyalty score is.
  • Time since last visit: The shorter the time since a customer’s last visit to Chipotle, the higher their loyalty score is.
  • Length of visit: The longer a customer stays at Chipotle during their visit, the higher their loyalty score is.
  • Number of referrals: The more referrals a customer makes to their friends and family to visit Chipotle, the higher their loyalty score is.
  • Ratings and reviews: The more positive ratings and reviews a customer gives to Chipotle on online platforms, the higher their loyalty score is.
  • Social media mentions and likes: The more mentions and likes a customer gives to Chipotle on social media platforms, the higher their loyalty score is.

features = ["number_of_visits", "average_spend_per_visit", "time_since_last_visit", "length_of_visit", "number_of_referrals", "ratings_and_reviews", "social_media_mentions_and_likes"]

Step 4: Normalize the features to have a common scale.

normalized_features = normalize(features)

Step 5: Apply the k-means clustering algorithm to the normalized features.

clusters = k_means(normalized_features, k)

Step 6: Assign each customer to a cluster based on their feature values.

for customer in customer_data:
    cluster = assign_cluster(customer, clusters)
    customer["cluster"] = cluster

Step 7: Calculate the customer loyalty score for each customer based on their cluster. For example, we can assign a score of 1 to loyal customers and 0 to non-loyal customers.

for customer in customer_data:
    if customer["cluster"] == 0:
        customer["loyalty_score"] = 1
    else:
        customer["loyalty_score"] = 0

Step 8: Save the customer data with the loyalty score to a new CSV file.

save_csv("customer_data_with_loyalty_score.csv", customer_data)

=====

It’s incredible how much a non-data scientist can leverage GenAI to understand the algorithms and mechanics necessary to build an analytic score. However, it is unlikely that non-data scientists will be able to leverage GenAI tools for developing, deploying, and managing analytical models by themselves (data scientists can take a breath now). There are vital data science tasks that are outside the capabilities of GenAI, including:

  • GenAI cannot perform data engineering tasks such as data preparation, integration, cleansing, wrangling, and munging. These tasks require specific knowledge of the data sources, formats, quality, and transformations needed for the analytic model.
  • GenAI cannot write, test, and debug the actual code for the model. It only provides a high-level outline of the steps and logic in creating the model. You will still need to use a programming language to implement and run the pseudocode on your data.
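
To give a feel for what that implementation work involves, here is a minimal Python sketch of the pseudocode above. It is an illustration under stated assumptions, not a production model: it uses only two of the listed features, a handful of made-up customer records instead of a CSV file, and small hand-written versions of the `normalize`, `k_means`, and `assign_cluster` helpers that the pseudocode leaves undefined. Because k-means cluster numbers are arbitrary, the sketch also assumes that the cluster with the larger centroid values is the "loyal" one, which only holds for features where higher means more loyal.

```python
import math
import random

# Assumption: a two-feature subset of the features named in the pseudocode.
FEATURES = ["number_of_visits", "average_spend_per_visit"]


def normalize(rows, features):
    """Min-max scale each feature to [0, 1]; returns one vector per row."""
    lows = [min(r[f] for r in rows) for f in features]
    highs = [max(r[f] for r in rows) for f in features]
    return [
        [(r[f] - lo) / (hi - lo) if hi > lo else 0.0
         for f, lo, hi in zip(features, lows, highs)]
        for r in rows
    ]


def assign_cluster(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def k_means(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns the final centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[assign_cluster(p, centroids)].append(p)
        # Move each centroid to the mean of its members (keep it if empty).
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


def add_loyalty_scores(customers, features=FEATURES, k=2):
    """Add "cluster" and "loyalty_score" keys to each customer record."""
    points = normalize(customers, features)
    centroids = k_means(points, k)
    # Assumption: the cluster with the largest centroid sum is "loyal".
    loyal = max(range(k), key=lambda i: sum(centroids[i]))
    for customer, point in zip(customers, points):
        customer["cluster"] = assign_cluster(point, centroids)
        customer["loyalty_score"] = 1 if customer["cluster"] == loyal else 0
    return customers


# Made-up records standing in for customer_data.csv.
customers = [
    {"number_of_visits": 30, "average_spend_per_visit": 18.0},
    {"number_of_visits": 28, "average_spend_per_visit": 21.5},
    {"number_of_visits": 2, "average_spend_per_visit": 9.0},
    {"number_of_visits": 1, "average_spend_per_visit": 11.0},
]
for c in add_loyalty_scores(customers):
    print(c["number_of_visits"], c["loyalty_score"])
```

Even this toy version surfaces decisions the pseudocode glosses over: which columns to scale, how to tell which cluster is the loyal one, and how to treat a feature such as time since last visit, where a lower value signals more loyalty.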

TLADS Step #7: Map Scores to Recommendations

Integrating the GenAI tool into this step yielded only a few new insights, because this step summarizes the TLADS workflow: it maps the Analytic Scores to the prescriptive Recommendations that drive the optimization of the Decisions (Figure 2).

Figure 2: Map Analytic Scores to Stakeholder Recommendations to Decisions

One area where GenAI helped was in identifying the stakeholders that would need to apply the analytic scores to make recommendations. This is an area that one would undoubtedly want to verify in interviews.

Here are some of the prompts that I explored in Step 7:

  • Bing Prompt: For each key decision, map the stakeholders who impact that decision, the recommendations that the stakeholder would need to make in support of that decision, and the analytic scores that the stakeholder would leverage to make recommendations.
  • Bing Prompt: For a specific use case, what are the potential linkages between the Business Entities that support that use case, and what might be the ramifications of those linkages?
  • Bing Prompt: How do these linkages impact the information or data I want to capture on each Business Entity?
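
The mapping that the first prompt asks for (decision, to stakeholder, to recommendation, to analytic score) can be captured in a small data structure. The sketch below is one possible way to record it in Python; the class names, the example decision, the stakeholders, and the "event fit score" are all hypothetical placeholders, not outputs from Bing or part of the TLADS canvases.

```python
from dataclasses import dataclass, field


@dataclass
class Recommendation:
    stakeholder: str            # who makes the recommendation
    action: str                 # what they would recommend
    analytic_scores: list[str]  # scores they would rely on


@dataclass
class Decision:
    name: str
    recommendations: list[Recommendation] = field(default_factory=list)

    def stakeholders(self) -> set[str]:
        """Everyone who makes a recommendation in support of this decision."""
        return {r.stakeholder for r in self.recommendations}

    def scores_needed(self) -> set[str]:
        """Every analytic score that must exist to support this decision."""
        return {s for r in self.recommendations for s in r.analytic_scores}


# Hypothetical example values, for illustration only.
decision = Decision(
    name="Which local events to sponsor",
    recommendations=[
        Recommendation("Store manager", "Sponsor the event",
                       ["customer loyalty score"]),
        Recommendation("Marketing manager", "Set the promotion budget",
                       ["customer loyalty score", "event fit score"]),
    ],
)
print(sorted(decision.stakeholders()))
print(sorted(decision.scores_needed()))
```

Writing the mapping down this way makes gaps easy to spot, such as a decision with no stakeholder attached or a recommendation that relies on a score nobody has defined yet. Those gaps are the ones worth verifying in stakeholder interviews.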

TLADS Step #8: Analytic Score Model Feedback

Step #8 is an entirely new step in the TLADS process. The additional time for this new step was created when we leveraged GenAI to combine previous Steps 5 and 6 to create a new Step #5: Brainstorm Scores & Features. What? I wasn’t going to give my students more time off! Time to expand our learning, baby!

After much exploration and testing, I decided to leverage our extra time to formalize the feedback loop that ensures that our analytic models continuously learn and adapt. I read some random book titled “The Economics of Data, Analytics, and Digital Transformation” that advocated for that (Figure 3).

Figure 3: Analytic Score Model Feedback

Some of the prompts that I explored in Step 8 included:

  • Bing Prompt: For each use case recommendation, can you determine what variables and metrics we should monitor to determine the analytic score and recommendation effectiveness?
  • Bing Prompt: What are the best techniques or methods for feeding the decision effectiveness assessment information back to the respective AI / ML model that generated the analytic score and the recommendation?
  • Bing Request: Given the decisions the key stakeholders are trying to make, what are the best KPIs and metrics for measuring the effectiveness of that score to improve decision effectiveness?
  • Bing Request: What are the best methods for measuring that effectiveness and feeding the learnings from that effectiveness assessment back into the analytic algorithms that support the “Improve Local Events Marketing Effectiveness” analytic score?
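
To make the feedback loop concrete, here is a minimal Python sketch of one way it could be wired up: log the outcome of each score-driven recommendation, compute an effectiveness metric over a recent window, and flag the model for retraining when that metric drops. The class name, the window size, the 0.6 threshold, and the choice of "share of acted-on recommendations that succeeded" as the metric are all assumptions made for illustration; the right KPIs depend on the decisions your stakeholders are trying to make.

```python
from collections import deque


class ScoreFeedbackMonitor:
    """Tracks whether recommendations driven by an analytic score worked out."""

    def __init__(self, window=100, retrain_below=0.6):
        self.outcomes = deque(maxlen=window)  # keeps only the most recent outcomes
        self.retrain_below = retrain_below

    def record(self, score, recommended, succeeded):
        """Log one decision: the score, whether we acted on it, and the result."""
        self.outcomes.append((score, recommended, succeeded))

    def effectiveness(self):
        """Share of acted-on recommendations that achieved the desired outcome."""
        acted = [ok for _, recommended, ok in self.outcomes if recommended]
        return sum(acted) / len(acted) if acted else None

    def needs_retraining(self):
        """True when effectiveness has fallen below the chosen threshold."""
        rate = self.effectiveness()
        return rate is not None and rate < self.retrain_below

    def training_examples(self):
        """(score, outcome) pairs that could be fed back to the model as labels."""
        return [(score, ok) for score, _, ok in self.outcomes]


monitor = ScoreFeedbackMonitor(window=50, retrain_below=0.6)
monitor.record(score=0.9, recommended=True, succeeded=True)
monitor.record(score=0.8, recommended=True, succeeded=False)
monitor.record(score=0.7, recommended=True, succeeded=False)
monitor.record(score=0.2, recommended=False, succeeded=False)
print(monitor.effectiveness(), monitor.needs_retraining())
```

The important design choice is that outcomes are captured at the decision level, not just the model level, so the same log answers both questions in the prompts above: how effective was the recommendation, and what should be fed back to the algorithm that produced the score.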

Summary: Integrating GenAI + TLADS – Part 3

The integration of GenAI into my “Thinking Like a Data Scientist” process significantly impacted almost every step and supporting design canvas. This resulted in a more comprehensive understanding of the factors that drive the successful definition, design, development, and adoption of data and analytics to deliver more relevant, meaningful, responsible, and ethical outcomes (refer to Figure 4).

Figure 4: GenAI Impact on Thinking Like a Data Scientist Methodology

This experience was exhilarating. I literally pumped my fist and danced around as I discovered new areas where I could apply GenAI to enhance my TLADS methodology (much to the chagrin of the folks sitting around me at my Caribou Coffee shop).

And maybe the most exciting news is that I’m not done exploring, trying, failing, learning, and trying again as I seek to expand the integration of GenAI into my Thinking Like a Data Scientist methodology.

I hope my university and corporate students are ready to continue this journey together!