NVIDIA Trying to Keep AI Chatbots’ Hallucinations ‘On Track’


AI chatbots and hallucinations go hand-in-hand. Despite all the hype around the technology, with ChatGPT, Bard, and many others being released every other week, we cannot deny that it makes mistakes, some of which can be dangerous. Sometimes the models even seem motivated to lie to and gaslight their users, or to say negative things.

NVIDIA has decided to take matters into its own hands to fix this. The company has released NeMo Guardrails, an open-source toolkit that aims to make applications based on large language models (LLMs) “accurate, appropriate, on topic and secure”, as announced in a blog by the company.

NVIDIA’s toolkit is built to work with LangChain, an open-source toolkit created by Harrison Chase that provides easy-to-use templates and patterns for building LLM-powered applications. Developers can create boundaries around AI apps by adding NeMo Guardrails to apps built with LangChain. It can also work with apps on the Zapier platform.

Chase said that John C, another developer behind LangChain, had the idea of installing guardrails around their own development a few months earlier, and that ideas from that work have been incorporated into the new guardrails by NVIDIA.

Jonathan Cohen, VP of applied research at NVIDIA, said that the company has been working on guardrails around similar systems for quite some time and GPT-4 and ChatGPT gave the right idea to him and the company. Cohen told TechCrunch, “AI model safety tools are critical to deploying models for enterprise use cases.”

The open-source nature of NeMo Guardrails, along with LangChain on Python, will allow any developer to use them, even if they are not a machine learning expert, the company said. They can be added to any tool that an enterprise uses with just a few lines of code. The NeMo framework is available on NVIDIA AI Enterprise and on GitHub for developers.

Existence of Guardrails

Some criticise OpenAI and ChatGPT for their capability of generating harmful results, while others like Elon Musk criticise it for being too woke. Either way, it is important to put up guardrails to ensure that these models stop hallucinating at the least.

Apart from recently publishing brand guidelines that discourage people from developing ‘GPT’-named apps, and filing for a trademark on the name, OpenAI has also been concerned about the safety and reliability of its models. The company updated its usage policy last month to prohibit illegal activity, including hateful content and generation of malware, while also disallowing a lot of other content. It now also allows users to delete their chat history and data.

if your AI product is actively scaring users from saying negative things about it, for fear that the AI will then see what they say afterwards, well, this seems bad
what made OpenAI change course to ship such powerful models so aggressively? https://t.co/3lQO3Ih2Y9

— near (@nearcyan) February 14, 2023

This also brings in the question of biases in chatbots. In February, OpenAI had published the blog, “How should AI systems behave, and who should decide?”, where the company explained the working of ChatGPT and how it is willing to allow more customisation and input from the public on the decision-making process. For this, the company decided to get reviewers to fine-tune their model.

Interestingly, the blog also said that OpenAI will allow ChatGPT to generate content that many people, including them, would strongly disagree with. “In some cases ChatGPT currently refuses outputs that it shouldn’t, and in some cases, it doesn’t refuse when it should. We believe that improvement in both respects is possible.” Striking the right balance is important.

Mira Murati, the CTO of OpenAI, said on The Daily Show with Trevor Noah that governments should be more involved in building regulations around AI products like ChatGPT.

On similar lines, Sundar Pichai said in a CBS “60 Minutes” interview that installing guardrails around AI is not for the company to decide alone. Similarly, former Google head Eric Schmidt said that instead of pausing AI advancements and the training of models, as was proposed in a recent petition, it is more important for everyone to come together and discuss the appropriate guardrails.

Are there any problems?

Essentially, as defined by NVIDIA, these guardrails sit between the user and the conversational AI application. They filter content by topic, keeping responses relevant, and they also filter out content that the developer of the chatbot has specified as unsafe or unethical.
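To make the idea of a layer between user and chatbot concrete, here is a minimal sketch in plain Python. It is not the NeMo Guardrails API; the function name, the topic list, and the blocked terms are all hypothetical and chosen only for illustration.

```python
from typing import Callable

# Hypothetical policy lists; a real deployment would define its own.
ALLOWED_TOPICS = ("billing", "shipping")
BLOCKED_TERMS = ("malware",)


def guarded_reply(user_message: str, chatbot: Callable[[str], str]) -> str:
    """Check the input, call the chatbot, then check the output."""
    text = user_message.lower()
    # Topical rail: refuse questions outside the allowed topics.
    if not any(topic in text for topic in ALLOWED_TOPICS):
        return "Sorry, I can only help with billing or shipping questions."
    # Safety rail on the input.
    if any(term in text for term in BLOCKED_TERMS):
        return "Sorry, I can't help with that request."
    reply = chatbot(user_message)
    # Safety rail on the output.
    if any(term in reply.lower() for term in BLOCKED_TERMS):
        return "Sorry, I can't share that response."
    return reply
```

Note that the developer chooses both lists, which is exactly where the bias concern discussed in this section comes in.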

This brings in the question of human-induced bias in the chatbots as well. While it is true that there should be some guardrails around chatbots to prevent them from generating dangerous content, defining what counts as hateful and banning that content can itself be biased. This might make chatbots like ChatGPT even more restrictive than they are, even though the intention was the opposite.

This could be a social disaster, with NVIDIA dictating the whims of the identity obsessed political correct. So NVIDIA please think about bias, and single point of view when rolling this out, as it might be a version of Thinkpol (1984)

— Trisodium Garrard 🏴‍☠️ (@OrderOfMycelium) April 26, 2023

Guardrails might make the content generated by chatbots more topical, but might also induce more bias, making them less reliable, though “safe”

The two-month-old OpenAI blog about letting people fine-tune its model, and about allowing content that the company does not agree with, sounds somewhat opposed to this. Who knows what would happen with Musk’s TruthGPT? Would it have the right type of guardrails?

The post NVIDIA Trying to Keep AI Chatbots’ Hallucinations ‘On Track’ appeared first on Analytics India Magazine.

Text Summarization Development: A Python Tutorial with GPT-3.5


This is an era where AI breakthroughs arrive daily. A few years ago, we didn’t have much AI-generated content in public, but now the technology is accessible to everyone. That is excellent for individual creators and companies that want to take advantage of the technology to develop something complex that would otherwise take a long time.

One of the most incredible breakthroughs changing how we work is the release of the GPT-3.5 model by OpenAI. What is the GPT-3.5 model? If I let the model speak for itself, the answer is “a highly advanced AI model in the field of natural language processing, with vast improvements in generating contextually accurate and relevant text”.

OpenAI provides an API for the GPT-3.5 model that we can use to develop a simple app, such as a text summarizer. To do that, we can use Python to integrate the model API into our intended application seamlessly. What does the process look like? Let’s get into it.

Prerequisite

There are a few prerequisites before following this tutorial, including:

  • Knowledge of Python, including knowledge of using external libraries and an IDE
  • Understanding of APIs and handling the endpoint with Python
  • Having access to the OpenAI APIs

To obtain access to the OpenAI APIs, we must register on the OpenAI Developer Platform and visit the View API keys page within our profile. There, click the “Create new secret key” button to acquire API access (see image below). Remember to save the key, as it will not be shown again after that.

[Image by Author]

With all the preparation done, let’s try to understand the basics of the OpenAI API.

Understanding GPT-3.5 OpenAI API

The GPT-3.5 model family was designed for many language tasks, and each model in the family excels at some of them. For this tutorial, we will use gpt-3.5-turbo, as it was the recommended model when this article was written because of its capability and cost-efficiency.

OpenAI tutorials often use text-davinci-003, but we will use the current model for this tutorial. We will rely on the ChatCompletion endpoint instead of Completion because the current recommended model is a chat model. Even though it is called a chat model, it works for any language task.

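Let’s try to understand how the API works. First, we need to install the OpenAI package, along with python-dotenv, which we will use later to load the API key. The code in this tutorial uses the openai.ChatCompletion interface, which is available in versions of the package before 1.0.

pip install "openai<1.0" python-dotenv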

After we have finished installing the package, we will try to use the API by connecting via the ChatCompletion endpoint. However, we need to set the environment before we continue.

In your favorite IDE (for me, it’s VS Code), create two files called .env and summarizer_app.py, similar to the image below.

[Image by Author]

The summarizer_app.py file is where we will build our simple summarizer application, and the .env file is where we will store our API key. For security reasons, it is always advised to keep the API key in a separate file rather than hard-coding it in the Python file.

In the .env file, put the following line and save the file. Replace your_api_key_here with your actual API key. Don’t wrap the API key in quotes; leave it as it is.

OPENAI_API_KEY=your_api_key_here

To understand the GPT-3.5 API better, we will use the following code to generate the summary.

openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    max_tokens=100,
    temperature=0.7,
    top_p=0.5,
    frequency_penalty=0.5,
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant for text summarization.",
        },
        {
            "role": "user",
            "content": f"Summarize this for a {person_type}: {prompt}",
        },
    ],
)

The above code is how we interact with the OpenAI APIs GPT-3.5 model. Using the ChatCompletion API, we create a conversation and will get the intended result after passing the prompt.

Let’s break down each part to understand them better. In the first line, we use the openai.ChatCompletion.create code to create the response from the prompt we would pass into the API.

In the next lines, we have the hyperparameters that we use to tune the output for our text task. Here is a summary of what each hyperparameter does:

  • model: The model family we want to use. In this tutorial, we use the current recommended model (gpt-3.5-turbo).
  • max_tokens: The upper limit on the number of tokens generated by the model. It helps to limit the length of the generated text.
  • temperature: The randomness of the model output; a higher temperature means a more diverse and creative result. The API accepts values between 0 and 2.
  • top_p: Top P, or nucleus sampling, is a parameter that controls the sampling pool from the output distribution. For example, a value of 0.1 means the model only samples from the tokens that make up the top 10% of the probability mass. The value ranges between 0 and 1; higher values mean a more diverse result.
  • frequency_penalty: The penalty for repeated tokens in the output. The value ranges between -2 and 2, where positive values discourage the model from repeating tokens while negative values encourage the model to use more repetitive words. 0 means no penalty.
  • messages: The parameter where we pass our text prompt to be processed by the model. We pass a list of dictionaries, each with a "role" key (either "system", "user", or "assistant") that helps the model understand the context and structure, and a "content" key that holds the text.
    • The role “system” is the set guidelines for the model “assistant” behavior,
    • The role “user” represents the prompt from the person interacting with the model,
    • The role “assistant” is the response to the “user” prompt
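To illustrate the top_p description in the list above, here is a small sketch of how a nucleus sampling pool could be chosen from a toy distribution. The function name and the toy probabilities are made up for illustration; this is not how the OpenAI API is implemented internally, only the general idea.

```python
def nucleus_pool(probs: dict[str, float], top_p: float) -> list[str]:
    """Return the smallest set of most likely tokens whose total probability reaches top_p."""
    pool, cumulative = [], 0.0
    # Walk through the tokens from most to least likely.
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        pool.append(token)
        cumulative += p
        if cumulative >= top_p:
            break
    return pool


# Toy next-token distribution (hypothetical values).
toy = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
print(nucleus_pool(toy, 0.5))  # ['the']
print(nucleus_pool(toy, 0.9))  # ['the', 'a', 'an']
```

A lower top_p keeps only the most likely tokens in the pool, which is why it leads to less diverse output.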

Having explained the parameters, we can see that the messages parameter above has two dictionary objects. The first dictionary is how we set the model up as a text summarizer. The second is where we pass our text and ask for the summarization output.

In the second dictionary, you will also see the variables person_type and prompt. The person_type is a variable I use to control the summary style, which I will show later in the tutorial, while the prompt is where we pass the text to be summarized.
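As a small illustration of how the two variables end up in the request, the messages list could be built by a helper like the one below. The helper name build_messages is my own and not part of the OpenAI package.

```python
def build_messages(person_type: str, prompt: str) -> list[dict[str, str]]:
    """Build the messages list for the summarizer (hypothetical helper)."""
    return [
        # The system message sets the assistant's behavior.
        {
            "role": "system",
            "content": "You are a helpful assistant for text summarization.",
        },
        # The user message carries the style and the text to summarize.
        {
            "role": "user",
            "content": f"Summarize this for a {person_type}: {prompt}",
        },
    ]


messages = build_messages("Second-Grader", "Relativity is a theory by Einstein.")
print(messages[1]["content"])
```

Keeping the prompt construction in one place makes it easier to experiment with different summary styles.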

Continuing with the tutorial, place the below code in the summarizer_app.py file and we will try to run through how the function below works.

import os

import openai
from dotenv import load_dotenv

# Load the API key from the .env file
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")


def generate_summarizer(
    max_tokens,
    temperature,
    top_p,
    frequency_penalty,
    prompt,
    person_type,
):
    # Pass the function arguments through to the API call
    res = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        max_tokens=int(max_tokens),  # the API expects an integer
        temperature=temperature,
        top_p=top_p,
        frequency_penalty=frequency_penalty,
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant for text summarization.",
            },
            {
                "role": "user",
                "content": f"Summarize this for a {person_type}: {prompt}",
            },
        ],
    )
    # Return only the generated summary text
    return res["choices"][0]["message"]["content"]

The code above is where we create a Python function that would accept various parameters that we have discussed previously and return the text summary output.
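The last line of the function picks the summary text out of the response. The sketch below shows that step on a hand-written dictionary that imitates the shape of a chat completion response; the sample values are invented, and the extra check on finish_reason is my own addition for spotting summaries that were cut off by max_tokens.

```python
def extract_summary(res: dict) -> str:
    """Pull the generated text out of a chat completion style response."""
    choice = res["choices"][0]
    text = choice["message"]["content"].strip()
    # "length" means the output stopped because it hit max_tokens.
    if choice.get("finish_reason") == "length":
        text += " [truncated]"
    return text


# Hand-written sample imitating the response structure (not real API output).
fake_response = {
    "choices": [
        {
            "message": {"role": "assistant", "content": " Einstein had two big ideas. "},
            "finish_reason": "stop",
        }
    ]
}
print(extract_summary(fake_response))  # Einstein had two big ideas.
```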

Try the function above with your parameter and see the output. Then let’s continue the tutorial to create a simple application with the streamlit package.

Text Summarization Application with Streamlit

Streamlit is an open-source Python package designed for creating machine learning and data science web apps. It’s easy to use and intuitive, so it is recommended for many beginners.

Let’s install the streamlit package before we continue with the tutorial.

pip install streamlit

After the installation is finished, put the following code into the summarizer_app.py.

import streamlit as st

# Set the application title
st.title("GPT-3.5 Text Summarizer")

# Provide the input area for text to be summarized
input_text = st.text_area("Enter the text you want to summarize:", height=200)

# Initiate three columns for sections to be side-by-side
col1, col2, col3 = st.columns(3)

# Sliders to control the model hyperparameters
with col1:
    # max_tokens must be an integer, so this slider uses integer values
    token = st.slider("Token", min_value=1, max_value=200, value=50, step=1)
    temp = st.slider("Temperature", min_value=0.0, max_value=1.0, value=0.0, step=0.01)
    top_p = st.slider("Nucleus Sampling", min_value=0.0, max_value=1.0, value=0.5, step=0.01)
    f_pen = st.slider("Frequency Penalty", min_value=-1.0, max_value=1.0, value=0.0, step=0.01)

# Selection box to select the summarization style
with col2:
    option = st.selectbox(
        "How do you like to be explained?",
        (
            "Second-Grader",
            "Professional Data Scientist",
            "Housewives",
            "Retired",
            "University Student",
        ),
    )

# Show the current parameters used for the model
with col3:
    with st.expander("Current Parameter"):
        st.write("Current Token :", token)
        st.write("Current Temperature :", temp)
        st.write("Current Nucleus Sampling :", top_p)
        st.write("Current Frequency Penalty :", f_pen)

# Button to execute the text summarization
if st.button("Summarize"):
    st.write(generate_summarizer(token, temp, top_p, f_pen, input_text, option))

Try to run the following code in your command prompt to initiate the application.

streamlit run summarizer_app.py

If everything works well, you will see the following application in your default browser.

[Image by Author]

So, what happened in the code above? Let me briefly explain each function we used:

  • st.title: Provides the title text of the web application.
  • st.write: Writes the argument into the application; it could be anything, but is mainly string text.
  • st.text_area: Provides an area for text input that is stored in a variable and used as the prompt for our text summarizer.
  • st.columns: Object containers that provide side-by-side layout.
  • st.slider: Provides a slider widget with set values that the user can interact with. The value is stored in a variable used as a model parameter.
  • st.selectbox: Provides a selection widget for users to select the summarization style they want. In the example above, we use five different styles.
  • st.expander: Provides a container that users can expand and that can hold multiple objects.
  • st.button: Provides a button that runs the intended function when the user presses it.

As streamlit would automatically design the UI following the given code from top to bottom, we could focus more on the interaction.

With all the pieces in place, let’s try our summarization application with a text example. For our example, I would use the Theory of Relativity Wikipedia page text to be summarized. With a default parameter and second-grader style, we obtain the following result.

Albert Einstein was a very smart scientist who came up with two important ideas about how the world works. The first one, called special relativity, talks about how things move when there is no gravity. The second one, called general relativity, explains how gravity works and how it affects things in space like stars and planets. These ideas helped us understand many things in science, like how particles interact with each other and even helped us discover black holes! 

You might obtain a different result than the above one. Let’s try the Housewives style and tweak the parameter a bit (Token 100, Temperature 0.5, Nucleus Sampling 0.5, Frequency Penalty 0.3).

The theory of relativity is a set of physics theories proposed by Albert Einstein in 1905 and 1915. It includes special relativity, which applies to physical phenomena without gravity, and general relativity, which explains the law of gravitation and its relation to the forces of nature. The theory transformed theoretical physics and astronomy in the 20th century, introducing concepts like 4-dimensional spacetime and predicting astronomical phenomena like black holes and gravitational waves.

As we can see, the style differs for the same text we provide. By changing the prompt and parameters, our application can be more functional.

The overall look of our text summarizer application can be seen in the image below.

[Image by Author]

That is the tutorial on developing a text summarizer application with GPT-3.5. You could tweak the application even further and deploy it.

Conclusion

Generative AI is rising, and we should seize the opportunity by creating fantastic applications. In this tutorial, we learned how the GPT-3.5 OpenAI APIs work and how to use them to create a text summarizer application with the help of Python and the streamlit package.
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and Data tips via social media and writing media.


Best Machine Learning Model For Sparse Data


Sparse data refers to datasets in which most feature values are zero. It can cause problems in different fields, especially in machine learning.

Sparse data can occur as a result of inappropriate feature engineering methods, for instance, using one-hot encoding that creates a large number of dummy variables.

Sparsity can be calculated by taking the ratio of zeros in a dataset to the total number of elements. Addressing sparsity will affect the accuracy of your machine-learning model.
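As a quick illustration of both points, the sketch below one-hot encodes a small made-up column in plain Python and computes the sparsity of the result. The helper names and the example values are my own.

```python
def one_hot(values):
    """One-hot encode a list of categories into rows of 0s and 1s."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def sparsity(matrix):
    """Ratio of zero entries to the total number of entries."""
    total = sum(len(row) for row in matrix)
    zeros = sum(row.count(0) for row in matrix)
    return zeros / total


cities = ["Paris", "Tokyo", "Lima", "Cairo", "Paris"]
encoded = one_hot(cities)
print(sparsity(encoded))  # 0.75
```

With four distinct categories, each row holds a single 1 and three 0s, so three quarters of the encoded matrix is zero. The more categories a column has, the sparser its one-hot encoding becomes.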

Also, we should distinguish sparsity from missing data. Missing data simply means that some values are not available. In sparse data, all values are present, but most are zero.

Sparsity also poses unique challenges for machine learning. Specifically, it can cause overfitting, loss of good data, memory problems, and time problems.

This article will explore these common problems related to sparse data. Then we will cover the techniques used to handle this issue.

Finally, we will apply different machine learning models to the sparse data and explain why these models are suitable for sparse data.

Throughout the article, I will predominantly use the scikit-learn library, and if you wish to modify the code and arguments, I will provide the official documentation links too.

Now let's start with the common problems with sparse data.

Common Problems With Sparse Data

Sparse data can pose unique challenges for data analysis. We already mentioned that some of the most common issues include overfitting, losing good data, memory problems, and time problems.

Now, let’s have a detailed look at each.


Overfitting

Overfitting occurs when a model becomes too complex and starts to capture noise in the data instead of the underlying patterns.

In sparse data, there may be a large number of features, but only a few of them are actually relevant to the analysis. This can make it difficult to identify which features are important and which ones are not.

As a result, a model may overfit to noise in the data and perform poorly on new data.

If you are new to machine learning or want to know more, the scikit-learn documentation covers overfitting.

Losing Good Data

One of the biggest problems with sparse data is that it can lead to the loss of potentially useful information.

When we have very limited data, it becomes more difficult to identify meaningful patterns or relationships in that data. This is because the noise and randomness inherent to any data set can more easily obscure essential features when the data is sparse.

Furthermore, because the amount of data available is limited, there is a higher chance that we will miss out on some of the truly valuable patterns or relationships in the data. This is especially true in cases where the data is sparse due to a lack of sampling, as opposed to simply being missing. In such cases, we may not even be aware of the missing data points and thus may not realize we are losing valuable information.

That’s why if too many features are removed, or the data is compressed too much, important information may be lost, resulting in a less accurate model.

Memory Problem

Memory problems can arise due to the large size of the dataset. Sparse data often results in many features, and storing this data can be computationally expensive. This can limit the amount of data that can be processed at once or require significant computing resources.
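To get a feel for the memory cost, the sketch below compares a dense NumPy array with a simple list of (row, column, value) triplets that stores only the non-zero entries. This is only an illustration of the idea behind sparse storage formats such as those in scipy.sparse; the array size and sparsity level are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
dense = rng.random((1000, 100))
dense[dense < 0.9] = 0.0  # roughly 90% of the entries become zero

# Keep only the coordinates and values of the non-zero entries.
rows, cols = np.nonzero(dense)
values = dense[rows, cols]

dense_bytes = dense.nbytes
triplet_bytes = (
    rows.astype(np.int32).nbytes + cols.astype(np.int32).nbytes + values.nbytes
)
print(f"dense: {dense_bytes} bytes, triplets: {triplet_bytes} bytes")

# The dense array can be rebuilt from the triplets without loss.
rebuilt = np.zeros_like(dense)
rebuilt[rows, cols] = values
```

The saving grows with the share of zeros, which is why keeping sparse data in a sparse format is usually the first thing to try before reducing features.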

The scikit-learn documentation describes different strategies for scaling to larger datasets.

Time Problem

The time problem can also occur due to the large size of the dataset. Sparse data may require longer processing times, especially when dealing with a large number of features. This can limit the speed at which data can be processed, which can be problematic in time-sensitive applications.

What Are the Methods for Working With Sparse Features?

Sparse data poses a challenge in data analysis due to its low occurrence of non-zero values. However, there are several methods available to mitigate this issue.

One common approach is removing the feature causing sparsity in the dataset.

Another option is to use Principal Component Analysis (PCA) to reduce the dimensionality of the dataset while retaining important information.

Feature hashing is another technique that can be employed, which involves mapping features to a fixed-length vector.

T-Distributed Stochastic Neighbor Embedding (t-SNE) is another useful method that can be utilized to visualize high-dimensional datasets.

In addition to these techniques, selecting a suitable machine learning model that can handle sparse data, such as SVM or logistic regression, is crucial.

By implementing these strategies, one can effectively address the challenges associated with sparse data in data analysis.

Now let’s start with the tactics used to reduce sparse data first, then we will go deeper into the models.

Remove it!

When working with sparse data, one approach is to remove features that contain mostly zero values. This can be done by setting a threshold on the percentage of non-zero values in each feature. Any feature that falls below this threshold can be removed from the dataset.

This approach can help reduce the dimensionality of the dataset and improve the performance of certain machine learning algorithms.
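The threshold idea described above can be sketched with NumPy as follows. The function name, the default threshold, and the small demo array are my own choices for illustration.

```python
import numpy as np


def drop_sparse_columns(data, min_nonzero_fraction=0.05):
    """Keep only the columns whose share of non-zero values meets the threshold."""
    nonzero_fraction = (data != 0).mean(axis=0)
    keep = nonzero_fraction >= min_nonzero_fraction
    return data[:, keep], keep


demo = np.array(
    [
        [1.0, 0.0, 0.0],
        [2.0, 0.0, 3.0],
        [0.0, 0.0, 4.0],
        [5.0, 0.0, 0.0],
    ]
)
reduced, kept = drop_sparse_columns(demo, min_nonzero_fraction=0.5)
print(kept)           # [ True False  True]
print(reduced.shape)  # (4, 2)
```

The middle column is all zeros, so it is dropped, while the other two columns have at least half of their values non-zero and are kept.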

Code Example

In this example, we set the dimensions of the dataset, as well as the sparsity level, which determines how many values in the dataset will be zero.

Then, we generate random data with the specified sparsity level and calculate its sparsity so we can compare afterward.

Next, the code sets the number of zeros to remove and randomly replaces that many zeros in the dataset with NaN, so they no longer count as zeros.

Finally, we recalculate the sparsity of the modified dataset to see the change.

Here is the code.

import numpy as np

# Set the dimensions of the dataset
num_rows = 1000
num_cols = 100

# Set the sparsity level of the dataset
sparsity = 0.9

# Generate random data with the specified sparsity level
data = np.random.random((num_rows, num_cols))
data[data < sparsity] = 0

# Calculate the sparsity of the dataset
num_zeros = (data == 0).sum()
total_elements = data.shape[0] * data.shape[1]
sparsity = num_zeros / total_elements

print(f"The sparsity of the dataset before removal {sparsity:.4f}")

# Set the number of zeros to remove
num_zeros_to_remove = 50000

# Remove a specific number of zeros randomly from the dataset
zero_indices = np.argwhere(data == 0)
zeros_to_remove = np.random.choice(
    zero_indices.shape[0], num_zeros_to_remove, replace=False
)
data[
    zero_indices[zeros_to_remove, 0], zero_indices[zeros_to_remove, 1]
] = np.nan

# Calculate the sparsity of the modified dataset
num_zeros = (data == 0).sum()
total_elements = data.shape[0] * data.shape[1]
sparsity = num_zeros / total_elements

print(
    "Sparsity after removing {} zeros:".format(num_zeros_to_remove), sparsity
)

Here is the output.

[Image: output of the code above]

PCA

PCA is a popular technique for dimensionality reduction. It identifies the principal components of the data, which are the directions in which the data varies the most.

These principal components can then be used to represent the data in a lower-dimensional space.

In the context of sparse data, PCA can be used to find the directions that capture the most variation in the data.

By keeping only these components, we can reduce the dimensionality of the dataset while still retaining most of the important information. Note that the transformed data is generally dense, because PCA centers the data and mixes the original features.

You can implement PCA by using the scikit-learn library, as we will do next in the code example. The official documentation has more details if you want to learn more about it.
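Before turning to scikit-learn, it may help to see the idea in a few lines of NumPy. The sketch below centers a sparse array, finds the principal directions with a singular value decomposition, and projects the data onto the first five of them. The array size and the number of components are arbitrary, and this is only a minimal illustration, not what scikit-learn does internally in every case.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((200, 30))
data[data < 0.9] = 0.0  # roughly 90% zeros

# Center each feature; this step alone turns the zeros into non-zero values.
centered = data - data.mean(axis=0)

# Rows of vt are the principal directions, ordered by explained variance.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
projected = centered @ vt[:5].T

print(projected.shape)  # (200, 5)
print("share of zeros before:", (data == 0).mean())
print("share of zeros after centering:", (centered == 0).mean())
```

The projected data has fewer columns, but it is no longer sparse, which is worth keeping in mind when memory is the main concern.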

Code Example

To apply PCA to sparse data, we can use the scikit-learn library in Python.

The library provides a PCA class that we can use to fit a PCA model to the data and transform it into lower-dimensional space.

In the first section of the following code, we create a dataset as we did in the previous section, with a given dimension and sparsity.

In the second section, we will apply PCA to reduce the dimension of the dataset to 10. After that, we will recalculate the sparsity.

Here is the code.

import numpy as np
from sklearn.decomposition import PCA

# Set the dimensions of the dataset
num_rows = 1000
num_cols = 100

# Set the sparsity level of the dataset
sparsity = 0.9

# Generate random data with the specified sparsity level
data = np.random.random((num_rows, num_cols))
data[data < sparsity] = 0

# Calculate the sparsity of the dataset
num_zeros = (data == 0).sum()
total_elements = data.shape[0] * data.shape[1]
sparsity = num_zeros / total_elements

print(f"The sparsity of the dataset before PCA: {sparsity:.4f}")

# Apply PCA to the dataset
pca = PCA(n_components=10)
data_pca = pca.fit_transform(data)

# Calculate the sparsity of the reduced dataset
num_zeros = (data_pca == 0).sum()
total_elements = data_pca.shape[0] * data_pca.shape[1]
sparsity = num_zeros / total_elements

print(f"Sparsity after PCA: {sparsity:.4f}")

Here is the output.

[Image: output of the code above]

Feature Hashing

Another method for working with sparse data is called feature hashing. This approach converts each feature into a fixed-length array of values using a hashing function.

The hashing function maps each input feature to a set of indices in the fixed-length array. The values are summed together if multiple input features are mapped to the same index. Feature hashing can be useful for large datasets where storing a large feature dictionary may not be feasible.
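The mechanism can be sketched in plain Python as follows. This toy version uses MD5 from the standard library purely to get a stable hash; scikit-learn's FeatureHasher uses a different hash function (MurmurHash3), so the indices will not match. The function name and the feature names are made up.

```python
import hashlib


def hash_features(features: dict[str, float], n_features: int = 8) -> list[float]:
    """Map named features into a fixed-length vector with the hashing trick."""
    vector = [0.0] * n_features
    for name, value in features.items():
        # A stable hash of the feature name decides the slot.
        digest = hashlib.md5(name.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_features
        # Features that collide on the same index are summed together.
        vector[index] += value
    return vector


row = {"feature0": 1.0, "feature7": 2.0, "feature42": 3.0}
hashed = hash_features(row, n_features=8)
print(hashed)
```

The output length depends only on n_features, not on how many distinct feature names exist, which is what removes the need for a feature dictionary.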

We will walk through an example in the next section; if you want to dig deeper, see the official documentation of the FeatureHasher class in the scikit-learn library.

Code Example

Here, we again use the same method in dataset creation.

Then we apply feature hashing to the dataset using the FeatureHasher class from scikit-learn.

We specify the number of output features with the n_features parameter and the input type as a dictionary with the input_type parameter.

We then transform the input data into hashed arrays using the transform method of the FeatureHasher object.

Finally, we calculate the sparsity of the resulting dataset.

Here is the code.

import numpy as np
from sklearn.feature_extraction import FeatureHasher

# Set the dimensions of the dataset
num_rows = 1000
num_cols = 100

# Set the sparsity level of the dataset
sparsity = 0.9

# Generate random data with the specified sparsity level
data = np.random.random((num_rows, num_cols))
data[data < sparsity] = 0

# Calculate the sparsity of the dataset
num_zeros = (data == 0).sum()
total_elements = data.shape[0] * data.shape[1]
sparsity = num_zeros / total_elements

print(f"The sparsity of the dataset before feature hashing: {sparsity:.4f}")

# Apply feature hashing to the dataset
hasher = FeatureHasher(n_features=10, input_type="dict")
data_dict = [
    dict(("feature" + str(i), val) for i, val in enumerate(row))
    for row in data
]
data_hashed = hasher.transform(data_dict).toarray()

# Calculate the sparsity of the reduced dataset
num_zeros = (data_hashed == 0).sum()
total_elements = data_hashed.shape[0] * data_hashed.shape[1]
sparsity = num_zeros / total_elements

print(f"Sparsity after feature hashing: {sparsity:.4f}")

Here is the output.

[Image: output of the code above]

t-SNE Embedding

t-SNE (t-Distributed Stochastic Neighbor Embedding) is a non-linear dimensionality reduction technique used to visualize high-dimensional data. It reduces the dimensionality of the data while preserving its global structure and has become a popular tool in machine learning for visualizing and clustering high-dimensional data.

t-SNE is particularly useful for working with sparse data because it can effectively reduce the dimensionality of the data while maintaining its structure. The t-SNE algorithm works by calculating pairwise distances between data points in high- and low-dimensional spaces. It then minimizes the difference between these distances in high- and low-dimensional space.

To use t-SNE with sparse data, the data must first be converted into a dense matrix. This can be done using various techniques, such as PCA or feature hashing. Once the data has been converted, t-SNE can be applied to obtain a low-dimensional embedding of the data.

If you are curious about t-SNE, see the official scikit-learn documentation to learn more.

Code Example

The following code first sets the dimensions of the dataset and the sparsity level, generates random data with the specified sparsity level, and calculates the sparsity of the dataset before t-SNE is applied, as we did in the previous examples.

It then applies t-SNE to the dataset with 3 components and calculates the sparsity of the resulting t-SNE embedding. Finally, it prints out the sparsity of the t-SNE embedding.

Here is the code.

import numpy as np
from sklearn.manifold import TSNE

# Set the dimensions of the dataset
num_rows = 1000
num_cols = 100

# Set the sparsity level of the dataset
sparsity = 0.9

# Generate random data with the specified sparsity level
data = np.random.random((num_rows, num_cols))
data[data < sparsity] = 0

# Calculate the sparsity of the dataset
num_zeros = (data == 0).sum()
total_elements = data.shape[0] * data.shape[1]
sparsity = num_zeros / total_elements

print(f"Sparsity before t-SNE: {sparsity:.4f}")

# Apply t-SNE to the dataset
tsne = TSNE(n_components=3)
data_tsne = tsne.fit_transform(data)

# Calculate the sparsity of the t-SNE embedding
num_zeros = (data_tsne == 0).sum()
total_elements = data_tsne.shape[0] * data_tsne.shape[1]
sparsity = num_zeros / total_elements

print(f"Sparsity after t-SNE: {sparsity:.4f}")

Here is the output.

[Image: output of the code above]

Best Machine Learning Model for Sparse Data

Now that we have addressed the challenges of working with sparse data, we can explore machine learning models specifically designed to perform well with sparse data.

These models can handle the unique characteristics of sparse data, such as a high number of features with many zeros and limited information, which can make it challenging to achieve accurate predictions with traditional models.

By using models designed explicitly for sparse data, we can ensure that our predictions are more precise and reliable.

Now let’s look at the models that work well with sparse data.

SVC (Support Vector Classifier)

SVC (Support Vector Classifier) with the linear kernel can perform well with sparse data because it uses a subset of training points, known as support vectors, to make predictions. This means it can handle high-dimensional, sparse data efficiently.

Support vector machines can be used for regression, too.

I explained the Support Vector Machine here if you want to learn more about the Support Vector algorithm, both classification and regression.

Logistic Regression

This can also work well with sparse data because logistic regression uses a regularization term to control the model complexity, which can help prevent overfitting on sparse datasets.

If you want to learn more about logistic regression and other classification algorithms, here is the Overview of Machine Learning Algorithms: Classification.

KNeighborsClassifier

This algorithm can work well with sparse data since it computes distances between data points and can handle high-dimensional data.

You can see KNN and other machine learning algorithms here that you should know for data science.

MLPClassifier

The MLPClassifier can perform well with sparse data when the input data is standardized, as it uses gradient descent for optimization.

Here you can see the implementation of the MLP Classifier, along with a bunch of other algorithms, with the help of ChatGPT.

DecisionTreeClassifier

It can work well with sparse data when the number of features is small. If you do not know about decision trees, I explained decision trees and random forests here, which will be our final model for analyzing the models for sparse data.

RandomForestClassifier

The RandomForestClassifier can work well with sparse data when the number of features is small.

Best Machine Learning Model For Sparse Data
Image by Author

Now, I will show you how these models perform on the generated data. I will also add one more algorithm, which is typically not good for sparse data, to see whether the models above outperform it.

Code Example

In this section, we will test multiple machine learning models on a sparse dataset, which is a dataset with a lot of empty or zero values.

We will calculate the sparsity of the dataset and evaluate the models using the F1 score.
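As a quick reminder, the F1 score is the harmonic mean of precision and recall. The sketch below computes it by hand for a small binary example; the helper name f1_binary and the toy labels are made up for illustration, and in the actual comparison scikit-learn's f1_score does this work.

```python
def f1_binary(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
score = f1_binary(y_true, y_pred)
print(score)  # 3 true positives, 1 false positive, 1 false negative -> 0.75
```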

Then, we will create a data frame with the F1 scores for each model to compare their performance. Also, we will filter out any warnings that may appear during the evaluation process.

import warnings

import numpy as np
import pandas as pd
from scipy.sparse import random
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Generate a sparse dataset
X = random(1000, 20, density=0.1, format="csr", random_state=42)
y = np.random.randint(2, size=1000)

# Calculate the sparsity of the dataset
sparsity = 1.0 - X.nnz / float(X.shape[0] * X.shape[1])
print("Sparsity:", sparsity)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train and evaluate multiple classifiers
classifiers = [
    SVC(kernel="linear"),
    LogisticRegression(),
    # The default algorithm is used here; the algorithm="full" option
    # is no longer accepted by recent scikit-learn releases
    KMeans(
        n_clusters=2,
        init="k-means++",
        max_iter=100,
        random_state=42,
    ),
    KNeighborsClassifier(n_neighbors=5),
    MLPClassifier(
        hidden_layer_sizes=(100, 50),
        max_iter=1000,
        alpha=0.01,
        solver="sgd",
        verbose=0,
        random_state=21,
        tol=0.000000001,
    ),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
]

# Filter out the convergence warning that the MLP classifier may print
warnings.filterwarnings("ignore", category=ConvergenceWarning)

# Collect one row per classifier, then build the DataFrame once
results = []
for clf in classifiers:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    f1 = f1_score(y_test, y_pred)
    results.append({"Classifier": type(clf).__name__, "F1 Score": f1})

df = pd.DataFrame(results).sort_values(by="F1 Score", ascending=True)
df

Here is the output.

[Image: output of the code above]

By now, you might have spotted the algorithm that is not well suited to sparse data. Yes, the answer is KMeans. But why?

KMeans is typically not well suited for sparse data because it is based on distance measures, which can be problematic with high-dimensional, sparse data.

There are also some algorithms that we can’t even try. For instance, if you try to include the GaussianNB classifier in this list, you will get an error stating that it expects dense data instead of sparse data. The GaussianNB classifier assumes that the input data follows a Gaussian distribution, which makes it unsuitable for sparse data.
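The GaussianNB behavior is easy to reproduce. The sketch below assumes a reasonably recent scikit-learn and SciPy; the smaller dataset size and the seeded label generator are choices made for this illustration, not part of the comparison above. It shows the error raised for a SciPy sparse matrix and the workaround of converting to a dense array first.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.naive_bayes import GaussianNB

# Small sparse dataset with random binary labels (illustrative sizes)
X = sparse_random(200, 20, density=0.1, format="csr", random_state=42)
y = np.random.default_rng(42).integers(0, 2, size=200)

# Fitting directly on the sparse matrix is rejected during input validation
try:
    GaussianNB().fit(X, y)
    sparse_error = None
except TypeError as exc:
    sparse_error = exc
print("Error on sparse input:", sparse_error)

# Converting to a dense array works, at the cost of extra memory
dense_model = GaussianNB().fit(X.toarray(), y)
predictions = dense_model.predict(X.toarray())
print(predictions[:10])
```

Keep in mind that calling toarray() on a very large sparse matrix can exhaust memory, which is one of the reasons to prefer models that accept sparse input directly.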

Conclusion

In conclusion, working with sparse data can be challenging due to various problems like overfitting, losing good data, memory, and time problems.

However, several methods are available for working with sparse features, including removing features, using PCA, and feature hashing.

Moreover, certain machine learning models like SVM, Logistic Regression, Lasso, Decision Tree, Random Forest, MLP, and k-nearest neighbors are well-suited for handling sparse data.

These models have been designed to handle high-dimensional and sparse data efficiently, making them the best choices for sparse data problems. Using these methods and models can improve your model's accuracy and save time and resources.
Nate Rosidi is a data scientist and in product strategy. He's also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Connect with him on Twitter: StrataScratch or LinkedIn.


Baize: An Open-Source Chat Model (But Different?)

Image by Author

I think it’s safe to say 2023 is the year of Large Language Models (LLMs). From the widespread adoption of ChatGPT, which is built on the GPT-3.5 family of LLMs, to the release of GPT-4 with enhanced reasoning capabilities, it has been a year of milestones in generative AI. And we wake up every day to new applications in the NLP space that leverage ChatGPT’s capabilities to address novel problems.

In this article, we’ll learn about Baize, a recently released open-source chat model.

What is Baize?

Baize is an open-source chat model. Cool. But why another chat model?

Well, in a typical session with a chatbot, you don't have a single question that you’re seeking an answer to. Rather, you’ll ask a series of questions that the bot answers. This multi-turn conversation continues until you get your answers or an acceptable solution to your problem.

If you want to start building your own chat models, however, such multi-turn chat corpora are hard to come by. Baize aims to make it easier to generate such a corpus using ChatGPT, and uses it to fine-tune a LLaMA model. This helps you build better chatbots with reduced training time.

Project Baize is funded by the McAuley lab at UC San Diego, and is the result of collaboration between researchers at UC San Diego, Sun Yat-sen University, and Microsoft Research Asia.

Baize is named after the Chinese mythical creature Baize that can understand human languages [1]. And understanding human languages is something we’d all like chat models to be able to do, yes? The research paper for Baize was first uploaded to arXiv on 3rd April, 2023. The model’s weights and code have all been made available on GitHub solely for research purposes. So now is a great time to explore this new open-source chat model.

And, yeah, let's learn more about Baize.

How Does Baize Work?

The working of Baize can be (almost) summed up in two key points:

  • Generate a large corpus of multi-turn chat data by leveraging ChatGPT
  • Use the generated corpus to fine-tune LLaMA

The Pipeline for Training Baize | Image source

Data Collection with ChatGPT Self-Chatting

We mentioned that Baize uses ChatGPT to construct the chat corpus. It does so using a process called self-chatting in which ChatGPT has a conversation with itself.

A typical chat session requires a human and an AI. The self-chatting process in the data collection pipeline is designed such that ChatGPT has a conversation with itself—to supply both sides of the conversation. For the self-chatting process, a template is provided along with the requirements.
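To picture what self-chatting involves, here is a rough sketch of the two mechanical pieces: filling a template with a seed, and splitting the generated transcript into turns. The template wording, the [Human]/[AI] markers as used here, and the function names are hypothetical placeholders for illustration, not the project's actual prompt or code. The model call is replaced by a stub that returns a canned transcript.

```python
import re

# Hypothetical template; the wording is illustrative only
TEMPLATE = (
    "The following is a conversation between a human and an AI assistant "
    "about the topic: '{seed}'. Human turns start with [Human] and "
    "assistant turns start with [AI]. Write the complete transcript."
)


def build_prompt(seed):
    """Fill the template with the seed that sets the topic of the chat."""
    return TEMPLATE.format(seed=seed)


def parse_transcript(text):
    """Split a generated transcript into (speaker, utterance) turns."""
    # With one capture group, re.split returns
    # [prefix, speaker, utterance, speaker, utterance, ...]
    parts = re.split(r"\[(Human|AI)\]", text)
    turns = []
    for speaker, utterance in zip(parts[1::2], parts[2::2]):
        utterance = utterance.strip()
        if utterance:
            turns.append((speaker, utterance))
    return turns


def fake_chat_model(prompt):
    """Stand-in for a chat-completion API call; returns a canned transcript."""
    return (
        "[Human] How do I reverse a list in Python?\n"
        "[AI] Use reversed(xs) or the slice xs[::-1].\n"
        "[Human] Which one copies the list?\n"
        "[AI] The slice creates a new list; reversed returns an iterator."
    )


seed = "How do I reverse a list in Python?"
dialogue = parse_transcript(fake_chat_model(build_prompt(seed)))
for speaker, utterance in dialogue:
    print(f"{speaker}: {utterance}")
```

Running this once per seed question, with a real model in place of the stub, would yield one multi-turn dialogue per seed.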

The quality of conversations generated by ChatGPT is quite high (we’ve seen this more in our social media feeds than in our own ChatGPT sessions). So we get a high-quality dialogue corpus.

Let's take a look at the data used by Baize:

  • There is a seed that sets the topic for the chat session. It can be a question or a phrase that supplies the central idea of the conversation. In the training of Baize, questions from StackOverflow and Quora were used as seeds.
  • In the training of Baize, the ChatGPT model (gpt-3.5-turbo) is used in the self-chatting data collection pipeline. The generated corpus has about 115K dialogues, with approximately 55K dialogues coming from each of the above sources.
  • In addition, data from Stanford Alpaca was also used.
  • Currently three versions of the model: Baize-7B, Baize-13B, and Baize-30B have been released. (In Baize-XB, XB denotes X billion parameters.)
  • The seed can also be sampled from a specific domain. Meaning we can run the data collection process to construct a domain-specific chat corpus. In this direction, the Baize-Healthcare model is available, trained on the publicly available MedQuAD dataset to create a corpus of about 47K dialogues.

Fine-Tuning in Low-Resource Settings

The next part is the fine-tuning of the LLaMA model on the generated corpus. Model fine-tuning is generally a resource-intensive task. As tuning all the parameters of a large language model is infeasible under resource constraints, Baize uses Low-Rank Adaptation (LoRA) to fine-tune the LLaMA model.
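The core idea of LoRA can be shown with tiny matrices: the pretrained weight matrix W stays frozen, and only two small matrices A and B are trained, whose product is added to W as a low-rank update. The sketch below is a plain-Python illustration under assumed toy values; the 4096 x 4096 layer size and rank 8 in the parameter count are example numbers, not Baize's actual configuration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha, rank):
    """y = W x + (alpha / rank) * B (A x); W is frozen, A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def param_counts(d_out, d_in, rank):
    """Trainable parameters: full fine-tuning vs. a rank-r LoRA update."""
    return d_out * d_in, rank * (d_out + d_in)


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weights (2 x 2)
A = [[0.5, -0.5]]              # rank-1 down-projection (1 x 2)
B_init = [[0.0], [0.0]]        # B starts at zero, so training starts from W
B_trained = [[1.0], [2.0]]     # made-up values standing in for trained B
x = [2.0, 3.0]

y_start = lora_forward(x, W, A, B_init, alpha=2.0, rank=1)
y_tuned = lora_forward(x, W, A, B_trained, alpha=2.0, rank=1)
full, lora = param_counts(4096, 4096, rank=8)
print(y_start, y_tuned)
print(f"full: {full:,} trainable parameters, LoRA: {lora:,}")
```

With these example numbers, the low-rank update trains 256 times fewer parameters than updating the full matrix, which is what makes fine-tuning feasible on modest hardware.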

In addition, at inference time, there’s a prompt that instructs Baize not to engage in unethical or sensitive conversations. This reduces the need for human intervention in moderation.

The functional app fetches the LLaMA model and LoRA weights from the Hugging Face Hub.

Advantages and Limitations of Baize

Next, let’s go over some of the advantages and limitations of Baize.

Advantages

Let’s start by stating some advantages of Baize:

  • High availability: You can try out Baize-7B on Hugging Face Spaces or run it locally. Baize is not restricted by the number of API calls and alleviates concerns of availability in times of high demand.
  • Built-in moderation support: The inference-time prompt that stops the model from engaging in conversations on sensitive and unethical topics is an advantage, as it minimizes the effort needed to moderate conversations.
  • Chat corpora generation: As mentioned, Baize can help build large corpora of multi-turn conversations. This can be helpful in training chat models at scale.
  • Accessibility in low-resource settings: As mentioned in [1], we can run Baize on a single GPU machine, which makes it accessible in low-resource settings that have limited access to computation resources.
  • Domain-specific applications: By carefully sampling the seed from a specific domain, we can have chat bots for domain-specific applications such as healthcare, agriculture, finance and more.
  • Reproducibility and customization: The code is publicly available and the data collection and training pipeline is reproducible. If you want to collect data from various specific sources to build a custom corpus, you can modify the <code>collection.py</code> script in the project’s codebase.

Limitations

Like all LLM-powered chat apps, Baize has the following limitations:

  • Inaccurate information: Just as ChatGPT’s responses are sometimes prone to inaccuracies resulting from outdated training data and contextual nuances, Baize’s responses may also be technically inaccurate at times.
  • Challenge with up-to-date information: The LLaMA model is not trained on recent data. This makes it challenging for tasks that require up-to-date information for accurate and helpful response.
  • Bias and toxicity: The model declines sensitive and unethical conversations only because the inference prompt tells it to, so this behavior can be manipulated by changing the prompt.

Wrapping Up

That’s all for today! To explore more about Baize, be sure to try out the demo on Hugging Face Spaces or run it locally. ChatGPT and GPT-4 have inspired a wide range of applications in the NLP space.

With novel OpenAI wrappers hitting the developer space almost every day, it can be overwhelming to keep up with these rapid advancements and releases. At the same time, we’re excited to see what the future of generative AI holds.

References and Resources for Further Learning

[1] C Xu, D Guo, N Duan, J McAuley, Baize: An Open-Source Model with Parameter-Efficient Tuning on Self-Chat Data, arXiv, 2023.

[2] Project Baize on GitHub

[3] Demo on Hugging Face Spaces
Bala Priya C is a technical writer who enjoys creating long-form content. Her areas of interest include math, programming, and data science. She shares her learning with the developer community by authoring tutorials, how-to guides, and more.


ChatGPT is coming for your job. Why that’s a good thing

This illustration shows the ChatGPT logo on a phone in front of the OpenAI logo.
Image: gguy/Adobe Stock

According to a new study from researchers at the University of Pennsylvania and OpenAI, if you’re an accountant, translator or writer, your job prospects are bleak. By analyzing which jobs could be done at least 50% faster by generative pre-trained transformers, the report authors suggested that 20% of the U.S. workforce is at risk of being rendered obsolete by large language models like ChatGPT. It’s a scary prospect. It’s also likely wrong.

As we’re learning with developers, yes, large language models can eliminate some repetitive tasks but, no, this doesn’t make software developers obsolete. Done right, it makes them much more productive. The same can be true of other jobs and industries. The trick is to learn how to harness the power of LLMs without being mowed down by them.

Death of the developer?

In some ways, software development should be highly susceptible to GPTs. For any LLM to produce good results, it needs training data — the better the training data, the better the output.

For software, training data is vast and easily accessible. Small wonder, then, that products like GitHub’s Copilot have amazed some developers with how quickly they can improve productivity. For others, these results have prompted doomsday declarations of the death of the software developer.

The mixed reactions are understandable. Take, for example, the ability of Copilot to write code for the developer. You can see that as a replacement for the developer, or you can see it as an enhancement. The developers I follow are in the latter camp. For example, some who have tried Copilot find it quite additive and addictive.

“I have grown used to Copilot uncannily inferring what I am trying to do after writing the first couple of words,” wrote developer Manuel Odendahl.

“You get the LLM to draft some code for you that’s 80% complete/correct [and] you tweak the last 20% by hand,” suggested Sourcegraph developer Steve Yegge.

That’s a significant performance boost, and it’s accruing to those developers who figure out how to put LLMs to good use.

But it’s more than that. For the founder of the open-source project Datasette, Simon Willison, GPTs enable him to be dramatically more ambitious with what he codes because they change how he codes.

“ChatGPT (and GitHub Copilot) save me an enormous amount of ‘figuring things out’ time. For everything from writing a for loop in Bash to remembering how to make a cross-domain CORS request in JavaScript — I don’t need to even look things up anymore, I can just prompt it and get the right answer 80% of the time,” noted Willison.

In other words, a technology that could replace developers hasn’t, and won’t. Not for those developers who learn how to make the LLMs work for them, rather than in place of them.

Which brings us back to one of the report’s central arguments: “Most occupations exhibit some degree of exposure to LLMs, with varying exposure levels across different types of work.”

Furthermore, “Roles heavily reliant on science and critical thinking skills show a negative correlation with exposure, while programming and writing skills are positively associated with LLM exposure.”

This may suggest how little the report authors understand programming and writing; both involve heavy doses of critical thinking.

Bloody awful poetry

In the report, among the list of occupations most exposed to replacement by GPTs/LLMs, there are some head-scratchers. Take public relations specialists. If a PR person’s job is writing press releases, I’d agree with this assessment. The average press release sounds like it was written by a computer, and not a particularly advanced computer. But this isn’t what good PR people do. They build relationships with journalists. They try to understand shifting industry narratives and how to incorporate their company’s products or services therein. They are, in summary, thinking about content and its place in a wider context, rather than just mindlessly outputting press releases.

For example, the report authors conclude that poets, lyricists and creative writers are among the groups most at risk from LLMs mowing them down. Never mind that the training data for the LLMs is human-generated content (in this case, poems, lyrics and prose). That means the machine is always dependent on a person to give it the semblance of smarts.

Going further, it’s superficially impressive to tell ChatGPT to write a talk, short story or poem for you, but in my experience the results sound a bit tinny: a little off or even derivative. I have no doubt that ChatGPT could do some content marketing copy on my behalf because, let’s face it, most content marketing is a bit derivative and dull. It’s designed to speak to machines (SEO, anyone?) and, hence, doesn’t try to offer great writing.

Even great writing is a bit derivative. Steinbeck’s “East of Eden” is a retelling of the biblical Cain and Abel story, for example. But anyone who thinks ChatGPT could come up with that masterpiece of creative writing is way too high on their LLM paint thinners. Great writing emerges from human genius, articulating common themes in uncommon ways. The day I see that come from a prompt I drop into ChatGPT will be the day it’s all over for the human race, but guess what? That day isn’t coming.

Not now. Not soon. Not ever. Machines, as with the development examples above, are good at incorporating human-created input and mimicking it to generate human-acceptable output. But they’re not ever thinking through the all-too-human experience that gives rise to great literature, just as they’re not able to grok and respond to the business problems that great developers resolve with code.

Instead, we have a happy union of people and machines. How happy that union will be for given industries and the people therein depends on how well they use GPTs to remove repetitive tasks or code so that they can focus on the innovative, human side of their jobs.

Disclosure: I work for MongoDB, but the views expressed herein are mine.


DataLang: A New Programming Language for Data Scientists… Created by ChatGPT?

Image created by Author with Midjourney

This article will provide you with an overview of a project that I gave ChatGPT to run with, the creation of a new data science-oriented programming language. The details are all spelled out below, but for reasons that may become evident in subsequent reading, I wanted to give ChatGPT the opportunity to introduce the language in a captivating way. And it's a doozy. So have a read of this first, and then we can catch up on the other side.

The world of data science is about to experience a seismic shift with the advent of a groundbreaking programming language, custom-built for data scientists by data scientists. In today's post, we'll take you behind the scenes of the inception, development, and realization of this revolutionary language: DataLang.

Imagine a language meticulously crafted to streamline your data science tasks, with a razor-sharp focus on usability, efficiency, and collaboration. A language designed to break down barriers and propel your data-driven insights to unprecedented heights. DataLang is the answer to the prayers of data scientists and analysts who yearn for a more powerful toolset, one tailored exclusively for their unique needs and challenges.

In this riveting tell-all, we will explore the conception of DataLang and its journey to becoming the most sought-after language in the data science domain. Join us as we uncover the exclusive insights, unanticipated challenges, and pivotal moments that brought DataLang to life, narrated by the very team that made it possible.

But beware — this is no ordinary blog post. We'll reveal hidden gems and mind-bending features that promise to intrigue, engage, and leave you craving more. The revelation of DataLang is just the beginning, and trust us when we say, you won't want to miss a single line of code.

So, brace yourselves for a thrilling ride into the future of data science, and let DataLang guide your way. Let the journey begin!

At least we didn't start this off by blatantly overselling the concept…

The Initial Design

I recently tasked ChatGPT (running GPT-4) with coming up with a new programming language appropriate for data scientists in their day to day tasks. Here is the initial prompt used to set the stage for its new assignment:

You are a seasoned computer scientist working on systems programming. A data science team has approached you to derive a new programming language designed and optimized for data-oriented tasks, making it perfectly suitable for data scientists to use in their day to day job. Provide a high-level description of this language; how it would differ from other popular data science languages such as Python, R, and SQL; what features would be most important to such a language; issues that a language might have; and anything else you feel would be of importance. Keep in mind that its standard libraries and built-in functionality should provide the most support possible for day to day data science operations. Write a short program to demonstrate the features, syntax, and functionality of this language as you perceive it.

Language Name: DataLang

High-Level Description

DataLang is a language designed specifically for data-oriented tasks and optimized for performance and ease of use in data science applications. It combines the best features of Python, R, and SQL, along with unique features designed to streamline data science workflows.

Key Differentiating Features

Native data structures: DataLang will have built-in data structures like data frames and tensors, which are commonly used in data science tasks. These structures will have integrated functionality for handling missing values, schema enforcement, and data type coercion.

Lazy evaluation: To improve performance, DataLang will use lazy evaluation for data manipulation tasks, where operations are only executed when necessary. This can save memory and computation time, especially when working with large datasets.

Querying capabilities: Inspired by SQL, DataLang will have built-in querying capabilities that allow users to filter, sort, join, and aggregate data without needing an external library.

Integrated visualization: DataLang will have a built-in visualization library that supports various chart types and is easily customizable. The visualization capabilities will integrate seamlessly with the native data structures.

Functional and parallel programming: DataLang will encourage functional programming practices and support parallel processing for better performance on large-scale data tasks.

Extensive libraries: DataLang will have extensive standard libraries focused on statistical analysis, machine learning, deep learning, natural language processing, and time-series analysis.

Interoperability: DataLang will have built-in support for interoperability with other popular languages like Python and R, allowing users to leverage existing libraries and codebases.
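Since DataLang exists only as a description, the lazy evaluation feature listed above is easiest to picture with a language we already have. Here is a small Python analogy using generators; it is a sketch of the general idea, not DataLang syntax, and the function and variable names are made up for the example.

```python
from itertools import islice

rows_read = []  # records which rows were actually produced


def read_rows(n):
    """Pretend data source that yields one row at a time."""
    for i in range(n):
        rows_read.append(i)
        yield {"id": i, "value": i * i}


# Building the pipeline does no work yet: nothing is read or filtered
pipeline = (
    row["value"] for row in read_rows(1_000_000) if row["id"] % 2 == 0
)
rows_read_before = len(rows_read)

# Work happens only when results are requested, and only as much as needed
first_three = list(islice(pipeline, 3))
print(first_three, "rows actually read:", len(rows_read))
```

Only five of the one million rows are ever produced, which is the kind of memory and compute saving the feature description refers to.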

Possible Issues

Adoption: As a new language, DataLang may face challenges in attracting users and building a strong community around it.

Compatibility: Ensuring compatibility with existing tools, libraries, and platforms will be crucial for the success of DataLang.

Performance optimization: Balancing ease of use and performance optimization will be an ongoing challenge.

Top Posts April 3-9: Top 19 Skills You Need to Know in 2023 to Be a Data Scientist

Most Popular Posts Last Week

  1. Top 19 Skills You Need to Know in 2023 to Be a Data Scientist by Nate Rosidi
  2. LangChain 101: Build Your Own GPT-Powered Applications by Bala Priya C
  3. 8 Open-Source Alternative to ChatGPT and Bard by Abid Ali Awan
  4. Automate the Boring Stuff with GPT-4 and Python by Natassha Selvaraj
  5. 4 Ways to Generate Passive Income Using ChatGPT by Youssef Rafaat

Most Popular Posts Past 30 Days

  1. 4 Ways to Generate Passive Income Using ChatGPT by Youssef Rafaat
  2. GPT-4: Everything You Need To Know by Nisha Arya
  3. Automate the Boring Stuff with GPT-4 and Python by Natassha Selvaraj
  4. 5 Free Tools For Detecting ChatGPT, GPT3, and GPT2 by Abid Ali Awan
  5. Top 19 Skills You Need to Know in 2023 to Be a Data Scientist by Nate Rosidi
  6. OpenChatKit: Open-Source ChatGPT Alternative by Abid Ali Awan
  7. ChatGPT for Data Science Cheat Sheet by KDnuggets
  8. 4 Ways to Rename Pandas Columns by Abid Ali Awan
  9. LangChain 101: Build Your Own GPT-Powered Applications by Bala Priya C
  10. 8 Open-Source Alternative to ChatGPT and Bard by Abid Ali Awan


Meta beats revenue expectations, remains committed to metaverse

By Amanda Silberling

Things are looking up for Meta. The company beat revenue expectations, reporting an increase in year-over-year revenue for the first time in three quarters. But this ray of light for the company formerly known as Facebook comes amid harsh restructuring, resulting in more than 10,000 jobs eliminated this year.

Tech moves through buzzwords like clockwork. Though the metaverse was all the rage when Meta changed its name — a sort of recursive and self-referential generation of hype — now, it’s all about AI. Despite big losses in its metaverse investments, CEO Mark Zuckerberg made it a point to tell investors that he is not making a u-turn into the AI lane. Rather, he sees AI as technology that works in tandem with the metaverse.

“A narrative has developed that we’re somehow moving away from focusing on the metaverse vision, so I just want to say up front that that’s not accurate,” Zuckerberg said. “We’ve been focusing on AI and the metaverse, and we will continue to.”

Meta isn’t expecting Reality Labs to make money yet, but investors have voiced concerns that this hefty investment might not pay off. Zuckerberg’s interest in ongoing AI development might quiet some of those worries — Meta’s stock price has increased, for one — but the metaverse is still bleeding money.

Reality Labs, Meta’s department for VR and AR, lost nearly $4 billion this quarter. In all of last year, it lost $13.7 billion. Zuckerberg pointed out, though, that VR and AR technology do in fact involve AI.

“Our vision for AR glasses involves an AI-centric operating system that we think will be the basis for the next generation of computing,” Zuckerberg said on the call.

On Meta’s metaverse vision, Zuckerberg said that half of daily active users on Quest headsets spend more than one hour per day on their devices. Meta did not disclose how many people actually use Quest headsets on a daily basis.

“Building the metaverse is a long-term project, but the rationale for it remains the same, and we remain committed to it,” Zuckerberg said.

AI Career Notes: April 2023 Edition

April 13, 2023 by Mariana Iriarte

In this monthly feature, we bring you up to date on the latest career developments in the enterprise AI community – promotions, new hires and accolades. Here's the place to read about the movers and shakers, your colleagues, your friends, and maybe yourself.

Wayne Allen

atNorth appointed Wayne Allen as its sales director for the United States region. Allen brings over 30 years of experience to atNorth. He was part of the founding team at Digital Realty Trust when it became the first data center company to go public on the New York Stock Exchange.

"I couldn't be more excited to join atNorth as the Sales Director for the US," said Allen. "Being a part of this global team that's dedicated to delivering sustainable data centers for high-performance computing needs is a great honor. With my extensive background in the data center industry, I am confident in my ability to showcase the numerous advantages of utilizing atNorth's facilities in the Nordics to the many companies in the US seeking more sustainable options."

Saty Bahadur

Appen Limited, providers of high-quality training data for artificial intelligence systems, appointed Saty Bahadur as its chief technology officer. Bahadur brings more than 25 years of experience developing groundbreaking technology, products, and platforms to Appen.

“Artificial intelligence is rapidly transitioning from addressing niche challenges to solving mainstream industry sector and ecosystem problems. Appen stands at the vanguard of AI enablement, offering comprehensive solutions across diverse industries and clients, while crafting the next generation of human experiences powered by ethical AI," said Bahadur. "During my in-depth due diligence of Appen, I engaged with top business leaders and technologists who validated our shared ambition to harness Appen’s AI platform. I am genuinely thrilled to join Appen on its mission to become the intelligent backbone that fuels the development of generative AI solutions, ultimately benefiting the industry with Appen’s AI For Good strategy of Do Good, Be Good, and Lead Good to enrich the human experience."

Kit Beall

Cohesity, a data security and management company, appointed Kit Beall as its chief revenue officer. Beall brings more than 30 years of experience to Cohesity, with deep expertise in software, cloud, security, artificial intelligence and managed services.

“I’m thrilled to join Cohesity at this important juncture of the company’s evolution. I have long regarded Cohesity as having the best technology in the space, and deeply respect their CEO, founder, board, and management team,” said Beall. “Cohesity sits at the junction of three of the highest priority business issues today – security, cloud, and data management, and it provides a radically simple yet highly differentiated approach to securing, protecting, and deriving value from data. I look forward to helping customers and partners succeed, and working with this incredibly talented team.”

David Blyth, Nathan Kalyanasundharam, Suresh Ramalingam, Ben Sander, and Ralph Wittig

AMD appointed five technical leaders – David Blyth, Nathan Kalyanasundharam, Suresh Ramalingam, Ben Sander, and Ralph Wittig – to the role of AMD Corporate Fellow. These appointments recognize each leader’s significant impact on semiconductor innovation across various areas, from graphics architecture to advanced packaging.

Peter Brennan

Scality, a leader in distributed file and object storage, appointed former HPE executive, Peter Brennan, as its chief revenue officer. Prior to HPE, Brennan led a worldwide specialist sales organization for VMware, focused on hyper-converged and hybrid-cloud solutions. He has also held channel and sales leadership positions at LeftHand Networks, Opsware, CommVault, and EMC.

“Scality’s technology has the enterprise-scale capabilities that stand above a crowded object storage market space,” said Brennan. “I’ve seen what goes on behind the scenes, and loyal customers trust Scality for guidance. In addition, our affordable mid-market ARTESCA solution has increased revenue opportunities for partners — bringing enterprise-class features within the reach of more customers. It’s a perfect mix for expansion, and I look forward to extending our partner-first approach to achieve Scality’s growth objectives.”

Ali Fenn and Scott McFarland

Lancium, an energy technology and infrastructure company, appointed Ali Fenn as its president. Most recently, Fenn served as president of ITRenew, which provided datacenter solutions and services to the global hyperscale, broader cloud service provider, and enterprise markets.

In addition, Scott McFarland joined Lancium as its chief revenue officer. McFarland joined Lancium from Dell Technologies. He will be responsible for the company’s sales function with a focus on growing revenue across new end markets.

Jason Forget

Cockroach Labs, maker of the enterprise-grade distributed SQL database, CockroachDB, appointed Jason Forget as its president and chief revenue officer. Forget most recently worked at Redis, where he was the first sales hire and led the growth there to over 9,000 paying customers.

"Over the last eight years, Cockroach Labs has created a tremendous team and product, putting them at the helm of changing how organizations manage transactional data," said Forget. "It's been inspiring to see how the company has fundamentally disrupted an industry that had remained unchanged for decades, and I look forward to going after the massive opportunity in front of us as a part of this team."

Christian Guttmann

Pegasystems Inc. appointed Dr. Christian Guttmann as its vice president of engineering, decisions and AI. Guttmann will oversee the architecture and delivery of AI technology that powers the Pega Infinity product suite. Guttmann joined Pega from Tietoevry, a global software and services company where he founded its AI and data business and served as vice president, global head of AI & data.

As a scientist, Dr. Guttmann serves as a senior research fellow (adjunct) of AI and digital health at Karolinska Institutet, one of the world's top medical schools. He also serves as the executive director at the Nordic AI Institute, a top European independent institute on AI thought leadership.

Natalia Harris

Mezmo, an observability data platform provider, appointed Natalia Harris as its vice president of people and inclusion. Harris will be responsible for overseeing the company’s people development strategies and diversity, equity, and inclusion efforts.

“Creating a positive and diverse workplace culture, building strong teams, and fostering employee development are paramount to success,” Harris said. “Although these have always been important at Mezmo, I plan to work with the executive team to elevate, improve, and expand existing programs to ensure we remain a leader on the way to a brighter world of work.”

Sudhir Hasbe

Neo4j, the graph database and analytics leader, appointed Sudhir Hasbe as its chief product officer. Hasbe will oversee the company's software portfolio across its native graph database and data science offerings. He previously led product management for Google Cloud's Data Analytics Platform.

"As the world becomes more connected, so does our data, making the relationships between both data and metadata matter more than ever," said Hasbe. "Graph enables organizations to find hidden relationships and patterns across billions of data connections. It's why I've come to Neo4j, and why graph will one day be foundational for every modern enterprise."

Gus Hunt

Domino Data Lab, provider of the Enterprise MLOps platform, appointed former Central Intelligence Agency Chief Technology Officer Gus Hunt as a strategic advisor to Domino's senior leadership team. Hunt will advise on Domino's hybrid cloud data science strategy for federal agencies.

"We need powerful data science and analytics to continuously identify the next acute threat facing our country," said Hunt. "If I can help the national security community do that job in part by making the right introductions and brokering conversations between the government's technologists and Domino, I'm confident that I'm helping our federal agencies stay current on vital technologies."

Jack Huynh

AMD appointed Jack Huynh as its senior vice president and general manager of computing and graphics. Huynh has been at AMD for more than 24 years and was most recently responsible for leading all aspects of the company’s semi-custom business.

Huynh has served in a variety of leadership roles at AMD, most recently as the senior vice president and general manager for the AMD semi-custom business group, leading strategy, business management, and engineering execution for high-performance custom solutions. Prior to that, Huynh served as corporate vice president and general manager, where he led end-to-end business execution of mobility solutions for the AMD Client PC business group.

Raja Koduri

Raja Koduri is departing Intel after five-and-a-half years to focus on a new software startup. He most recently served as the company’s chief architect and previously led the graphics division (AXG, which in December was integrated into the DCAI group). Prior to joining Intel, Koduri held key roles at Apple and AMD, where he led the Radeon Technologies Group.

Intel CEO Pat Gelsinger wished Koduri well, tweeting, “Thank you @RajaXg for your many contributions to Intel tech & architecture-especially w/high-performance graphics that helped bring 3 new product lines to market in ‘22. Wishing you success as you create a new software co. around generative AI for gaming, media & entertainment.”

Koduri responded, “Thank you Pat and @intel for many cherished memories and incredible learning over the past 5 years. Will be embarking on a new chapter in my life, doing a software startup as noted below. Will have more to share in coming weeks.”

Mike Kropp, Ben Rathsack, and Peter Cleveland

SEMI elected three new members to the SEMI North America Advisory Board: Mike Kropp, president and chief executive officer of PEER Group; Ben Rathsack, vice president of product and technology development at Tokyo Electron America, Inc.; and Peter Cleveland, senior vice president at Taiwan Semiconductor Manufacturing Company Limited.

“The SEMI North America Advisory Board warmly welcomes Mike, Ben and Peter,” said Joe Stockunas, president of SEMI Americas. “Each has extensive industry experience that supports the Board’s interests in diversifying member representation. Mike Kropp of factory automation software supplier PEER Group, a mid-sized Canada-based company, actively contributes to more than 30 SEMI programs. Based in Austin, Texas, home to TEL’s U.S. headquarters, Ben Rathsack has extensive product development experience and passionately supports SEMI’s efforts to attract students to our industry. Peter Cleveland of TSMC has significant experience in advocacy, a key component of SEMI’s efforts to support members.”

Mark Luo

Quantum Brilliance, a developer of room-temperature miniaturized quantum computing products and solutions, appointed Mark Luo as its chief executive officer. Luo is a co-founder of Quantum Brilliance and previously held the executive role of chief operating officer.

“I am proud of Quantum Brilliance’s significant growth and achievements over the past four years and am honored to see our approach to quantum computing validated through investments, top-tier partnerships and technological developments,” said Luo. “I hope to continue fostering the company’s unique take on quantum computing and expand our reach into untapped markets.”

Narayan Menon

Matillion, the leader in data productivity, appointed Narayan Menon as its chief financial officer and chief operating officer. Menon brings over 25 years of experience in finance and operations, having held key positions in various companies, including Microsoft, Skype, Cisco, Intuit, Prezi, and Vimeo.

“As someone who uses data and analytics in almost everything I do at work, I am keenly aware of the importance of data – as well as how the lack of business-ready data can hamper an organization,” said Menon. “Matillion is uniquely positioned to help organizations get data business-ready, faster — accelerating time-to-value and increasing the impact data can have on growth and success. I am excited to be part of a fantastic team and to help expand upon Matillion’s success.”

Bob Metcalfe

The Association for Computing Machinery (ACM) named Bob Metcalfe as the recipient of the 2022 ACM A.M. Turing Award. Metcalfe was recognized for the invention, standardization, and commercialization of Ethernet.

Metcalfe is an Emeritus Professor of Electrical and Computer Engineering at The University of Texas at Austin and a Research Affiliate in Computational Engineering at the Massachusetts Institute of Technology Computer Science & Artificial Intelligence Laboratory.

Aaron Moore

QuSecure, Inc. appointed former Northrop Grumman, Raytheon, and National Security Agency executive Aaron Moore as its executive vice president and head of engineering. Moore will be responsible for leading QuSecure’s engineering and software development activities.

“It’s an honor to join QuSecure at such an exciting and critical time for the company and post-quantum cybersecurity,” said Moore. “The advent of quantum computing has brought our nation to an existential inflection point. If we fail to address the threat or harness the potential of this technology now, we risk nothing less than losing our way of life as we know it. I look forward to helping QuSecure exceed their aggressive goals during this critical time to ensure our nation’s quantum-resiliency readiness.”

Ernie Ostic and Bernard Desarnauts

MANTA, the leader in data lineage and metadata management, promoted Ernie Ostic to chief evangelist. Ostic joined MANTA nearly four years ago, bringing with him over 40 years of industry experience in data integration, including 14 years at IBM.

In addition, MANTA appointed Bernard Desarnauts as its senior vice president of product. Desarnauts brings a strong track record of product leadership, including previous roles as SVP of product at PandaDoc and Rinse.

Stuart Pann

Intel Corp. appointed Stuart Pann as senior vice president and general manager of Intel Foundry Services (IFS), Intel’s commercial foundry business. Pann will be responsible for driving the continued growth of IFS and its differentiated systems foundry offering.

“Intel Foundry Services is a critical pillar of our IDM 2.0 strategy, and it’s been exciting to watch it grow from an idea to an operating business with a world-class IP portfolio and significant customers in less than two years,” Pann said. “I am committed to championing the interests of our foundry customers and to helping them take advantage of Intel’s leading-edge process technology and full stack of open systems foundry offerings so they can succeed in a world that demands ever more computing.”

Jérôme Serve

CGG, a global technology and HPC leader, appointed Jérôme Serve as its group chief financial officer. Serve joined CGG from the Interiors division of Forvia/Faurecia, where he held the role of CFO.

Serve started his career in Research in the Petroleum Engineering department of Stanford University before joining TotalEnergies as a Reservoir Engineer in Abu Dhabi and the UK. He moved into Finance as part of the Oil & Gas Corporate Finance team of ABN Amro, and then joined the M&A and Financing team of Shell.

Gayle Sheppard

Bright Machines, a technology company bringing an innovative approach to intelligent, software-defined manufacturing automation, appointed Gayle Sheppard as its chief executive officer. Sheppard will transition from her current role at the company as co-CEO to CEO. Before joining Bright Machines, Sheppard held leadership positions at Microsoft and Intel.

“After serving as co-CEO with Lior for the past three months, I’m proud to step into the CEO role to lead the company on its journey to fundamentally reshape how and where products are made across the global economy,” said Sheppard. “Witnessing how the company has navigated the complexities of today’s market while cementing its place as a leader with some of the biggest global brands has been inspiring. I am laser-focused on ensuring our customers, technology, and company are well-equipped to meet the needs of modern industry. I look forward to working with, learning from, and growing the company with the outstanding team at Bright Machines.”

Dirk-Peter van Leeuwen

SUSE, a provider of innovative, open and secure infrastructure software solutions for multi-cloud environments, appointed Dirk-Peter van Leeuwen as its chief executive officer. van Leeuwen has spent almost two decades at Red Hat. Most recently, he served as Red Hat’s senior vice president and general manager of North America, and before that of APAC.

“I have admired the organization for many years and am now looking forward to working with the executive team and the entire organization to execute on the various opportunities the market offers and serve our customers and partners,” van Leeuwen said. “I know that SUSE’s people are some of the best in the industry and I am excited to see what we can achieve together.”

Asaf Yigal

Logz.io appointed its co-founder Asaf Yigal to the role of chief technology officer. Yigal has served as vice president of product at Logz.io since the company’s founding in 2015. Prior to that, he co-founded and served as vice president of product at Currensee.

“First, I’m eager to develop partnerships with providers who are adjacent to us in the market ecosystem to build out innovative, holistic solutions that offer value to our mutual customers,” he said. “Secondly, this CTO position is unique in that it bridges the gap between product development and marketing, giving me the rare opportunity to tell the story but also stay well connected to our engineering team. And thirdly, we’re a company that is committed to open source, which is a passion of mine, and I look forward to driving more active participation in the open source community.”

Do you know someone that should be included in next month's list? If so, send us an email at [email protected]. We look forward to hearing from you.

The Future of Healthcare is Data-driven

Sponsored Content by Microsoft/NVIDIA
April 13, 2023 by Rudeon Snell, Global Partner Lead: Customer Experience & Success at Microsoft

As analytics tools and machine learning capabilities mature, healthcare innovators are speeding up the development of enhanced treatments supported by Azure’s GPU-accelerated AI infrastructure powered by NVIDIA.

Improving diagnosis and elevating patient care

Man’s search for cures and treatments for common ailments has driven millennia of healthcare innovation. From the use of traditional medicine in early history to the rapid medical advances of the past few centuries, healthcare providers are locked in a constant search for effective solutions to old and emerging diseases and conditions.

The pace of healthcare innovation has increased exponentially over the past few decades, with the industry absorbing radical changes as it transitions from a health care to a health cure society. From telemedicine, personalized wellbeing, and precision medicine to genomics and proteomics, all powered by AI and advanced analytics, modern medical researchers can access more supercomputing capabilities than ever before. This quantum leap in computational capability, powered by AI, enables healthcare services dissemination and consumption in ways, and at a pace, that were previously unimaginable.

Today, health and life sciences leaders leverage Microsoft Azure high-performance computing (HPC) and purpose-built AI infrastructure to accelerate insights into genomics, precision medicine, medical imaging, and clinical trials, with virtually no limits to the computing power they have at their disposal. These advanced computing capabilities are allowing healthcare providers to gain deeper insights into medical data by deploying analytics and machine learning tools on top of clinical simulation data, increasing the accuracy of mathematical formulas used for molecular dynamics and enhancing clinical trial simulation.

By utilizing the infrastructure as a service (IaaS) capabilities of Azure HPC and AI, healthcare innovators can overcome the challenges of scale, collaboration, and compliance without adding complexity. And with access to the latest GPU-enabled virtual machines, researchers can fuel innovation through high-end remote visualization, deep learning, and predictive analytics.

Data scalability powers rapid testing capabilities

Take the example of the National Health Service, where the use of Azure HPC and AI led to the development of an app that could analyze COVID-19 tests at scale, with a level of accuracy and speed that is simply unattainable for human readers. This drastically improved the efficiency and scalability of analysis as well as capacity management.

Another advance worth noting is Dragon Ambient Experience (DAX), an AI-based clinical solution offered by Nuance that digitizes patient conversations into highly accurate medical notes, helping ensure high-quality care. By freeing up time for doctors to engage with their patients in a more direct and personalized manner, DAX improves the patient experience, reducing patient stress and saving time for doctors.

“With support from Azure and PyTorch, our solution can fundamentally change how doctors and patients engage and how doctors deliver healthcare.”—Guido Gallopyn, Vice President of Healthcare Research at Nuance.

Another exciting partnership, between Nuance and NVIDIA, brings medical imaging AI models developed with MONAI, a domain-specific framework for building and deploying imaging AI, directly into clinical settings. By providing healthcare professionals with much-needed AI-based diagnostic tools, across modalities and at scale, medical centers can optimize patient care at a fraction of the cost of traditional healthcare solutions.

“Adoption of medical imaging AI at scale has traditionally been constrained by the complexity of clinical workflows and the lack of standards, applications, and deployment platforms. Our partnership with Nuance clears those barriers, enabling the extraordinary capabilities of AI to be delivered at the point of care, faster than ever.”—David Niewolny, Director of Healthcare Business Development at NVIDIA.

GPU-accelerated virtual machines are a healthcare game changer

In the field of medical imaging, progress relies heavily on the use of the latest tools and technologies to enable rapid iterations. For example, when Microsoft scientists sought to improve on a state-of-the-art algorithm used to screen blinding retinal diseases, they leveraged the power of the latest NVIDIA GPUs running on Azure virtual machines.

Using Microsoft Azure Machine Learning for computer vision, scientists reduced misclassification by more than 90 percent, from 3.9 percent to a mere 0.3 percent. Deep learning model training was completed in 10 minutes over 83,484 images, achieving better performance than a state-of-the-art AI system. These are the types of improvements that can assist doctors in making more robust and objective decisions, leading to improved patient outcomes.
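
The "more than 90 percent" figure is a relative reduction, not a difference in percentage points. As a quick check of the arithmetic, here is a minimal sketch that uses only the two error rates quoted above; the function name `relative_reduction` is an illustrative assumption and not part of any Azure tooling.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fractional drop from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Misclassification rates quoted in the article, in percent.
error_before = 3.9
error_after = 0.3

reduction = relative_reduction(error_before, error_after)
print(f"Relative reduction: {reduction:.1%}")  # prints "Relative reduction: 92.3%"
```

In absolute terms the drop is 3.6 percentage points; measured against the 3.9 percent baseline, that is roughly a 92 percent reduction, which is consistent with the claim.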

For radiotherapy innovator Elekta, the use of AI could help expand access to life-saving treatments for people around the world. Elekta believes AI technology can help physicians by freeing them up to focus on higher-value activities such as adapting and personalizing treatments. The company accelerates the overall treatment planning process for patients undergoing radiotherapy by automating time-consuming tasks such as advanced analysis services, contouring targets, and optimizing the dose given to patients. In addition, they rely heavily on the agility and power of on-demand infrastructure and services from Microsoft Azure to develop solutions that help empower their clinicians, facilitating the provision of the next generation of personalized cancer treatments.

Elekta uses Azure HPC powered by NVIDIA GPUs to train its machine learning models with the agility to scale storage and compute resources as its research requires. Through Azure’s scalability, Elekta can easily launch experiments in parallel and initiate its entire AI project without any investment in on-premises hardware.

“We rely heavily on Azure cloud infrastructure. With Azure, we can create virtual machines on the fly with specific GPUs, and then scale up as the project demands.”—Silvain Beriault, Lead Research Scientist at Elekta.

With Azure high-performance AI infrastructure, Elekta can dramatically increase the efficiency and effectiveness of its services, helping to reduce the disparity between the many who need radiotherapy treatment and the few who can access it.

Learn more

Leverage Azure HPC and AI infrastructure today or request an Azure HPC demo.

Read more about Azure Machine Learning:

  • Multimodal 3D Brain Tumor Segmentation with Azure ML and MONAI.
  • Practical Federated Learning with Azure Machine Learning.
