How to Get Hired as a Data Scientist in the GPT-4 Era


The world of technology is advancing at an unprecedented pace, and companies are constantly striving to stay ahead of the game, either by integrating generative AI into their products or by developing their own models using open-source models and datasets. As a data scientist seeking employment in this era, it is crucial to acquire a diverse range of tools and skills to remain competitive in the job market.

In this blog, we will discuss the core topics you need to focus on to become an AI data scientist and get hired by your favorite company: statistics, core data science concepts, NLP, prompt engineering, building a data science portfolio, preparing for interviews, and AIOps. By mastering these topics, you'll be well on your way to becoming a successful AI data scientist and securing your dream job.

Statistics

Even though you can ask GPT-4 to interpret a result, you need to understand statistical terminology to draw a conclusion, or even to ask the right question. After interpreting the results, you need to come up with a plan that suits your company, and GPT-4 is not good at finding the right answer when there are multiple moving parts. This is where your knowledge of statistical analysis comes in handy.
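
As a minimal sketch of the kind of analysis you still need to set up and interpret yourself, here is a two-sample t-test with SciPy; the sign-up numbers and variant names are invented for illustration:

```python
from scipy import stats

# Hypothetical daily sign-ups for two landing-page variants
variant_a = [12, 13, 11, 14, 12, 13]
variant_b = [15, 16, 14, 17, 15, 16]

# Two-sample t-test: is the difference in means statistically significant?
t_stat, p_value = stats.ttest_ind(variant_a, variant_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# A small p-value alone is not a business plan: deciding what to do
# with the result is where statistical literacy comes in.
if p_value < 0.05:
    print("Difference is statistically significant at the 5% level")
```

GPT-4 can explain what the p-value means, but choosing the test, checking its assumptions, and turning the result into a decision is still on you.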

Core Data Science Concepts

ChatGPT and GPT-4 are not good at devising customized plans for your data project. You have to write many follow-up prompts just to get the right action plan, and even then, you need to double-check the plan before presenting it to your manager. All of this follow-up prompting requires an understanding of core data concepts like data ingestion, data cleaning, data manipulation, data visualization, data analysis, and data modeling.
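
The stages listed above can be sketched in a few lines of pandas; the CSV contents and column names below are invented for illustration:

```python
import io
import pandas as pd

# Hypothetical raw export: one malformed row (missing revenue) to clean out
raw_csv = io.StringIO(
    "region,month,revenue\n"
    "north,Jan,100\n"
    "south,Jan,\n"
    "north,Feb,120\n"
    "south,Feb,90\n"
)

df = pd.read_csv(raw_csv)                         # ingestion
df = df.dropna(subset=["revenue"])                # cleaning
summary = df.groupby("region")["revenue"].mean()  # manipulation/analysis
print(summary)
```

GPT-4 can generate snippets like this, but you need to know the concepts to spot when the cleaning step silently drops the wrong rows.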


Even then, there are many areas where GPT-4 falls short, such as debugging, research, working with the latest APIs, and writing specialized code.

Learn more about 20 Core Data Science Concepts for Beginners.

Natural Language Processing (NLP)

Both text-to-image and text-to-text generation models require expert knowledge of natural language processing. Without it, you won't be able to fine-tune a model, improve its results, or come up with your own solution. With the launch of ChatGPT, NLP and reinforcement learning have become hot job areas.


Large language models can be used for text classification, language translation, code generation, question answering, summarization, and more. Without knowledge of NLP, you won't be able to perform text analysis or build AI applications for specific tasks.
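
As a minimal sketch of the text-classification task (classical NLP rather than an LLM, and with a toy corpus invented for illustration), a bag-of-words pipeline in scikit-learn shows the kind of model you should be able to build and explain:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented corpus for illustration only
texts = [
    "the service was great and the staff friendly",
    "excellent product, works perfectly",
    "terrible experience, would not recommend",
    "awful quality and rude support",
]
labels = ["positive", "positive", "negative", "negative"]

# Bag-of-words features + Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

prediction = model.predict(["great friendly staff"])[0]
print(prediction)
```

Understanding how tokenization, features, and class probabilities interact here carries over directly to understanding how larger models are trained and evaluated.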

Core NLP concepts are also required for understanding model security, architectures, and datasets. Without them, it will be hard to pass even the initial stages of interviews.

AI Prompt Engineering

AI Prompt engineering is becoming an increasingly essential skill for all tech workers. Mastering this skill can enable you to write code that is both fast and efficient, devise comprehensive project plans, troubleshoot problems effectively, quickly adapt to new technologies, and produce top-quality reports and documentation. The potential applications of this AI are virtually limitless.


Prompt engineering makes you better at communicating with AI, and believe it or not, AI is not here to replace us but to assist us in our work. With its help, we can write a program or a report in five minutes; the only thing you need to do is double-check the results.
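
There is no single standard for structuring prompts, but one common prompt-engineering pattern is a reusable template with explicit role, task, context, and output-format slots. The field names and example values below are made up for illustration:

```python
def build_prompt(role: str, task: str, context: str, output_format: str) -> str:
    """Assemble a structured prompt from reusable parts.

    The role/task/context/format layout is one common pattern,
    not an official specification.
    """
    return (
        f"You are {role}.\n"
        f"Task: {task}\n"
        f"Context: {context}\n"
        f"Respond in the following format: {output_format}"
    )

prompt = build_prompt(
    role="a senior data scientist",
    task="review this SQL query for performance issues",
    context="the table has 10 million rows and no indexes",
    output_format="a bulleted list of issues with suggested fixes",
)
print(prompt)
```

Templating like this makes prompts repeatable and testable, which matters once prompting becomes part of your daily workflow rather than a one-off chat.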

Check out ChatGPT for Data Science Cheat Sheet, or learn about Prompt Engineering by looking at Top Free Resources To Learn ChatGPT.

Data Science Portfolio

Working on portfolio projects and showcasing your portfolio profile is important. You need to have good data science projects on GitHub or DagsHub, Kaggle, and Huggingface. You can even create your website by using templates like mine: Abid's Portfolio, or check out my blog on 7 Free Platforms for Building a Strong Data Science Portfolio.


In today's digital age, maintaining a strong online presence on LinkedIn has become essential. As evidenced by the job offers I continue to receive through LinkedIn and GitHub, being active in online discussions and continuously working on your portfolio can significantly increase your chances of getting hired. Once you've finalized your project, it's important to showcase your results or create a brief tutorial, which you can share on platforms like Medium and KDnuggets. Don't forget to promote your projects on various social media platforms, as well as tech-focused Discord or Slack groups.

Interview Preparation

For data science multiple interview sessions, you need to prepare for behavioral, situational, statistics, Python code, SQL, NLP, machine learning, and data analysis questions.


  1. You can improve your chances of passing the interview stage by working on versatile projects. Check out the complete collection of data science projects – Part 1 and Part 2.
  2. Review mock interviews for every topic. Check out the complete collection of data science interviews – Part 1 and Part 2.
  3. Revise forgotten data science concepts using cheat sheets. Check out the complete collection of data science cheat sheets – Part 1 and Part 2.
  4. Research company profile, product category, and employees to understand what they are looking for and try to curate your answer accordingly.
  5. Showcase knowledge of the latest tech and the ability to use AI to improve your workflow.

AIOps

As I mentioned earlier, many companies are seeking data scientists and engineers to integrate AI into their existing products or build entirely new ones. Therefore, it's crucial to be mentally prepared to answer questions related to AI operations.


For example:

  • How would you deploy a large language model?
  • Do you know how to build, debug, and run data pipelines?
  • Do you know how to use Docker or Kubernetes?
  • Do you have experience with Azure, GCP, or AWS?
  • How would you monitor models in production?
  • How would you update your language model?

These questions are becoming common as companies look for data scientists with knowledge of DevOps or MLOps. You can Learn MLOps with This Free Course.

Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in Technology Management and a bachelor's degree in Telecommunication Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.

More On This Topic

  • Hiring or Looking to Get Hired in Data Science/Analytics? The INFORMS…
  • Data Science Portfolio Project Ideas That Can Get You Hired (Or Not)
  • 20 Machine Learning Projects That Will Get You Hired
  • How to be a 10x data scientist
  • What Does a Data Scientist Do?
  • How I Got My First Job as a Data Scientist


    Data Analytics: The Four Approaches to Analyzing Data and How To Use Them Effectively

    Have you ever wished you had a crystal ball that could tell you the future of your business? While we can’t promise you a mystical glimpse into what’s to come, we do have the next best thing: data analytics.

    In today’s data-driven world, it has become effortless for businesses to collect and generate vast amounts of data. However, just having data is not enough.

    As a business, you need to be able to make sense of the data and use it in a way that will allow you to make better decisions. This is where data analytics comes in. Data analytics refers to the process of examining data to extract insights and make informed decisions.

    The data analytics market is growing rapidly and is expected to exceed 650 billion dollars by 2029, which shows the increasing significance of data analytics in businesses and the global economy.

    The future is data-driven. From predicting customer behavior to identifying areas for optimization, data analytics can help businesses unlock the secrets hidden in their data and drive better outcomes. But with so many tools and techniques available, it can be overwhelming to know where to start.

    This article will take you through data analytics and explore the four approaches to analyzing data. By the end of reading this, you’ll have the knowledge you need to harness the power of data and make informed decisions that can take your business to new heights.

    Descriptive Analytics

    Descriptive analytics is a type of data analysis that focuses on describing and summarizing data to gain insights into what has happened in the past. It is commonly used to answer questions such as “What happened?” and “How many?”.

    Descriptive analytics can help businesses and organizations understand their data and identify patterns and trends that can inform decision-making.

    Here are some real-life examples of descriptive analytics:

    • A retail store might analyze historical sales data to identify popular products and trends. For example, people tend to buy more candy in February.
    • Patient data can be summarized to identify common health issues. For example, most people get the flu from October to June.
    • Student performance data can be analyzed to identify areas for improvement. For example, most students who fail Calculus are frequently late to class.

    To use descriptive analytics effectively, you need to ensure that your data is accurate and of high quality. It’s also crucial to use clear and concise visualizations to communicate insights effectively.
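
The retail example above can be sketched in a few lines of pandas; the monthly sales figures are invented for illustration:

```python
import pandas as pd

# Invented monthly candy sales
sales = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr"],
    "candy_units": [120, 310, 140, 130],
})

# Descriptive analytics: summarize what happened
print(sales["candy_units"].describe())
peak_month = sales.loc[sales["candy_units"].idxmax(), "month"]
print(f"Peak candy month: {peak_month}")
```

    Summary statistics plus a clear chart of this table is often all a stakeholder needs to see the February spike.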

    Predictive Analytics

    Predictive analytics uses statistical and machine learning techniques to analyze historical data and predict future events. It is commonly used to answer questions such as “What is likely to happen?” and “What if?”.

    Predictive analytics is useful as it can help you plan ahead. It can help improve business operations, reduce costs, and increase revenue. For example, you can predict how sales will likely behave based on seasonality and previous sales figures. If your predictive analysis tells you that sales will likely decrease in winter, you can use this information to design an effective marketing campaign for this season.

    Here are some practical examples of predictive analytics in action:

    • A bank might use predictive analytics to assess credit risk and determine whether to grant a loan to a customer. In open banking, predictive analytics can help build highly personalized behavioral models specific to each customer and identify their creditworthiness in new ways. For customers, this may mean better and cheaper access to bank accounts, credit cards, and mortgages.
    • In marketing, predictive analytics can help identify which customers are most likely to respond to a particular offer.
    • In healthcare, predictive analytics can be used to identify patients at risk of developing a particular disease.
    • In manufacturing, predictive analytics can be used to forecast demand and optimize supply chain management.

    However, there are also some challenges to using predictive analytics effectively. One challenge is the availability of high-quality data essential for accurate predictions. Another challenge is selecting appropriate modeling techniques to analyze the data and make accurate predictions. Finally, communicating predictive analytics results to decision-makers can be challenging, as the techniques used can be complex and difficult to understand.
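
A deliberately simplified sketch of the sales-forecasting idea, fitting a linear regression to invented quarterly figures (a real forecast would also model seasonality, trend, and uncertainty):

```python
from sklearn.linear_model import LinearRegression

# Invented quarterly sales figures (units) for quarters 1-4
quarters = [[1], [2], [3], [4]]
sales = [100, 120, 140, 160]

model = LinearRegression()
model.fit(quarters, sales)

# Predict quarter 5 from the fitted trend
forecast = model.predict([[5]])[0]
print(f"Forecast for Q5: {forecast:.0f}")
```

    Even this toy version shows the workflow: fit on historical data, extrapolate, then judge how much to trust the extrapolation.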

    Prescriptive Analytics

    Prescriptive analytics is a type of data analysis that goes beyond descriptive and predictive analytics to provide recommendations for actions you should take. In other words, this approach involves using optimization techniques to identify the best course of action, given a set of constraints and objectives.

    It is commonly used to answer questions such as “What should we do?” and “How can we improve?”

    To be effective, it requires a deep understanding of the data being analyzed and the ability to model and simulate different scenarios to identify the best course of action. As such, this is the most complex approach of the four methods.

    Prescriptive analytics can help you solve various problems, including product mix, workforce planning, marketing mix, capital budgeting, and capacity management.


    The best example of prescriptive analytics in action is using Google Maps for directions during peak hours. The software considers all modes of transport and traffic conditions to calculate the best possible route. A transportation company might use prescriptive analytics in the same way to optimize delivery routes and minimize fuel costs. This is especially important when you consider the rising cost of fuel: in Canada, for example, the average person spends approximately $2,000 annually per vehicle on fuel, while households in the United States spend about 2.24% of their total annual income on fuel.

    However, like with predictive analytics, there are some challenges to using prescriptive analytics effectively. The first challenge is the availability of high-quality data essential for accurate analysis and optimization. Another challenge is the complexity of the optimization algorithms used, which can require specialized skills and knowledge to implement effectively.
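
The product-mix problem mentioned earlier is a classic prescriptive-analytics optimization. A minimal sketch with SciPy's linear-programming solver, using invented profits and resource limits:

```python
from scipy.optimize import linprog

# Maximize profit 20x + 30y for two products, subject to resource limits:
#   labor:    x + 2y <= 100 hours
#   material: 3x + y  <= 120 units
# linprog minimizes, so the objective is negated.
result = linprog(
    c=[-20, -30],
    A_ub=[[1, 2], [3, 1]],
    b_ub=[100, 120],
    bounds=[(0, None), (0, None)],
)
x, y = result.x
print(f"Make {x:.0f} of product A and {y:.0f} of product B")
print(f"Max profit: {-result.fun:.0f}")
```

    Unlike descriptive or predictive work, the output here is a recommended action, which is exactly what makes prescriptive analytics both powerful and harder to get right.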

    Diagnostic Analytics

    Diagnostic analytics is a type of data analysis that goes beyond descriptive analytics to identify the root cause of an issue or problem. It answers questions such as “Why did it happen?” and “What caused it?”. For example, you can use diagnostic analysis to determine why your January sales dropped by 50%.

    Diagnostic analytics involves exploring and analyzing data to identify relationships and correlations that can help explain an issue or problem. This can be done using techniques such as regression analysis, hypothesis testing, and causal analysis.

    Real-life examples include:

    • You can use diagnostic analysis to identify the root cause of a quality issue in your production process.
    • You can also use it to identify the cause behind a customer’s complaint and provide a targeted solution.
    • In case of a cyber threat, you can also use it to identify the source of a security breach and prevent future attacks.

    There are many benefits to using diagnostic analytics, such as identifying the underlying causes of issues and problems and developing targeted solutions. But, like with the previous two data analytics methods, there are some challenges to consider. For one, acquiring high-quality data and ensuring accurate analysis and insights can be difficult. Secondly, the analysis techniques can be quite complex and may require specialized skills and knowledge to be implemented effectively.
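
As a minimal sketch of the correlation step in a diagnostic analysis of the production-quality example, with invented shift data:

```python
import numpy as np

# Invented data: machine temperature (C) vs. defect rate (%) per shift
temperature = np.array([60, 65, 70, 75, 80, 85])
defect_rate = np.array([1.1, 1.4, 2.0, 2.6, 3.1, 3.8])

# A strong correlation suggests (but does not prove) a relationship
# worth investigating as a root cause
corr = np.corrcoef(temperature, defect_rate)[0, 1]
print(f"Correlation: {corr:.2f}")
```

    Correlation alone is not causation, which is why diagnostic analytics pairs checks like this with hypothesis testing and causal analysis.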

    The four approaches, their definitions, and the questions they answer:

    • Descriptive: describes and summarizes data to gain insights into what has happened in the past. Answers "What happened?" and "How many?"
    • Diagnostic: identifies the root cause of an issue or problem. Answers "Why did it happen?" and "What caused it?"
    • Predictive: analyzes historical data and makes predictions about future events. Answers "What is likely to happen?" and "What if?"
    • Prescriptive: provides recommendations for actions you should take based on the analysis. Answers "What should we do?" and "How can we improve?"

    How To Use the Four Approaches Effectively

    While each of the four approaches to analyzing data has its own strengths and weaknesses, choosing the most appropriate approach for a given problem can be critical for achieving the desired results. Some factors to consider when choosing an approach may include the following:

    The nature of the problem being addressed. Different problems will require different approaches. For example, you can use:

    • Descriptive analytics to summarize customer feedback data and identify customer demand patterns
    • Diagnostic analytics to identify the factors that are driving changes in sales performance
    • Predictive analytics to forecast future demand for a product
    • Prescriptive analytics to optimize production schedules in a manufacturing facility

    The type and quality of available data. It is also important to ensure that the data is accurate, complete, and relevant. This may involve cleaning, transforming, or otherwise preparing the data to ensure it is suitable for the chosen approach. In many cases, data preparation may be a time-consuming and iterative process and may require specialized tools or expertise.

    The resources and skills available for analysis. To conduct effective data analytics, it is also important to have the right skills and tools at hand. This may include statistical analysis software, programming languages, and visualization tools. Some common skills that may be useful for data analysts include data wrangling, data visualization, machine learning, and statistical inference.

    Conclusion

    From the discussion above, it’s clear that data analytics is a powerful tool that can provide valuable insights and drive business growth. By understanding and utilizing the four different approaches to data analytics, businesses can better understand their data and make more informed decisions.

    However, it is important to carefully consider your business’s specific needs and goals when choosing an analytics approach and to be aware of the advantages and limitations of each.

    Ultimately, by choosing the right approach and implementing it effectively, businesses can gain a competitive advantage and achieve long-term success. So go forth and explore the exciting world of data analytics — the possibilities are endless!
    Nate Rosidi is a data scientist and works in product strategy. He's also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Connect with him on Twitter: StrataScratch or LinkedIn.


    The Base Rate Fallacy and its Impact on Data Science


    When working with data and different variables, it is easy to assume that one variable or value matters more than another. We may assume that a specific variable or data point had more impact on the output, but how sure are we that the other variables do not have an equal impact?

    What is Base Rate?

    In statistics, the base rate can be seen as probabilities of classes that are unconditional on "featural evidence". You can see the base rate as your prior probability assumption.

    Base rates are important tools in research. For example, suppose we are a pharmaceutical company developing and dispatching a new vaccination, and we want to look into the success of the treatment. If 4,000 people are willing to take this vaccination and our base rate of success is 1/25, that means only 160 of the 4,000 people will be successfully treated. In the pharmaceutical world, this is a very low success rate. This is how base rates can be used to improve research and accuracy and to ensure that a product will perform well.
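
The arithmetic above, written out as a quick check:

```python
participants = 4000
base_rate = 1 / 25  # prior probability of successful treatment

# Expected number of successful treatments given the base rate
expected_successes = participants * base_rate
print(expected_successes)  # 160.0
```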

    What is Base Rate Fallacy?

    If we split the words up, it gives us a better understanding: a fallacy is a mistaken belief or a piece of faulty reasoning. Combining that with our definition of the base rate above gives us the following.

    The base rate fallacy, also known as base rate bias and base rate neglect, is the likelihood of judging a specific situation, whilst not taking into consideration all relevant data.

    In a base rate fallacy, both the base rate and other relevant information are available, but the base rate is neglected. This can happen for various reasons, such as not thoroughly examining and analyzing the data, or ignoring it in favour of a specific part of the data.

    The base rate fallacy describes the tendency to disregard existing base rate information in favour of new information. This goes against the fundamental rules of evidence-based reasoning.

    You will typically hear about this happening in the financial industry. For example, investors may base their buying or selling decisions on irrational information, which leads to fluctuations in the market, despite knowing the base rate.

    Base Rate Fallacy and Data Science

    So now we have a better understanding of the base rate and base rate fallacy. What is its relevance and impact in Data Science?

    We’ve spoken about ‘probabilities of classes’ and ‘taking into consideration all relevant data’. If you are a data scientist, a machine learning engineer, or just getting your foot in the door, you will know how important probabilities and relevant data are to producing accurate outputs, to the learning process of your machine learning model, and to building high-performance models.

    To analyse and make predictions about data, or for your machine learning model to produce accurate outputs, you need to take every bit of data into consideration. As you scan through your data for the first time, you might consider some parts relevant and other parts irrelevant. However, this is your judgement, and it is not factual until proper analysis has taken place.

    As mentioned above, the initial base rate helps you ensure accuracy and produce high-performance models. So how can we do this in Data Science?

    Confusion Matrix

    A confusion matrix is a performance measurement that provides a summary of prediction results on a classification problem. Its entries are all based on the outcomes True, False, Positive, and Negative.

    The confusion matrix represents our model's predictions during the testing phase. The false-negative and false-positive in the confusion matrix are examples of base rate fallacy.

    • True Positive (TP) — your model predicted positive and it’s positive
    • True Negative (TN) — your model predicted negative and it’s negative
    • False Positive (FP) — your model predicted positive and it’s negative
    • False Negative (FN) — your model predicted negative and it’s positive

    A confusion matrix can calculate 5 different metrics to help us measure the validity of our model:

    1. Misclassification = (FP + FN) / (TP + TN + FP + FN)
    2. Precision = TP / (TP + FP)
    3. Accuracy = (TP + TN) / (TP + TN + FP + FN)
    4. Specificity = TN / (TN + FP)
    5. Sensitivity aka Recall = TP / (TP + FN)
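
The five metrics above, written out as code with invented counts:

```python
# Invented counts from a confusion matrix
TP, TN, FP, FN = 40, 45, 5, 10
total = TP + TN + FP + FN

misclassification = (FP + FN) / total
precision = TP / (TP + FP)
accuracy = (TP + TN) / total
specificity = TN / (TN + FP)
recall = TP / (TP + FN)  # aka sensitivity

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, "
      f"recall={recall:.2f}, specificity={specificity:.2f}")
```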

    To better understand a confusion matrix, it helps to visualise it as a 2x2 table of these four outcomes.

    Causes of Base Rate Fallacy

    As you’re going through this article, you can probably think of a variety of causes of base rate fallacy, such as not taking all the relevant data into consideration, human error, or lack of precision.

    Although these are all true and contribute to the base rate fallacy, they all relate to the biggest problem: ignoring the base rate information in the first place. Base rate information is often ignored because it is considered irrelevant; however, it can save people a lot of time and money. Using the available base rate information allows you to be more precise when estimating the probability that a given event will occur.

    Using the base rate information will help you avoid base rate fallacy.

    Being aware of biases, such as relying on opinions or automatic thought processes, will allow you to combat the base rate fallacy and reduce potential errors. When you are measuring the probability of a certain event occurring, Bayesian methods can help reduce the base rate fallacy.
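
Bayes' theorem is what makes the base rate explicit. A sketch of the classic medical-test example (the prevalence, sensitivity, and specificity figures are illustrative, not from any particular study):

```python
# Illustrative numbers: a rare condition and a 99%-accurate test
prevalence = 0.001   # base rate: 1 in 1,000 people have the condition
sensitivity = 0.99   # P(positive | condition)
specificity = 0.99   # P(negative | no condition)

false_positive_rate = 1 - specificity

# Bayes' theorem: P(condition | positive test)
p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
p_condition_given_positive = sensitivity * prevalence / p_positive

print(f"P(condition | positive test) = {p_condition_given_positive:.1%}")
```

    Despite the test being 99% accurate, a positive result implies only about a 9% chance of actually having the condition, precisely because the base rate is so low. Ignoring that prior is the base rate fallacy in action.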

    Conclusion

    The base rate is important in data science as it equips you with a base understanding of how to assess your study or project, and fine-tune your model — providing an overall increase in accuracy and performance.

    If you would like to watch a video about base rate fallacy in the medical field, check out this video: Medical Test Paradox
    Nisha Arya is a Data Scientist, Freelance Technical Writer and Community Manager at KDnuggets. She is particularly interested in providing Data Science career advice or tutorials and theory based knowledge around Data Science. She also wishes to explore the different ways Artificial Intelligence is/can benefit the longevity of human life. A keen learner, seeking to broaden her tech knowledge and writing skills, whilst helping guide others.


    Unveiling the Potential of CTGAN: Harnessing Generative AI for Synthetic Data

    We all know that GANs are gaining traction in the generation of unstructured synthetic data, such as images and texts. However, very little work has been done on generating synthetic tabular data using GANs. Synthetic data has numerous benefits, including its use in machine learning applications, data privacy, data analysis, and data augmentation. There are only a few models available for generating synthetic tabular data, and CTGAN (Conditional Tabular Generative Adversarial Network) is one of them. Like other GANs, it uses a generator and discriminator neural network to create synthetic data with similar statistical properties to real data. CTGAN can preserve the underlying structure of the real data, including correlations between columns. The added benefits of CTGAN include augmentation of the training procedure with mode-specific normalization, a few architectural changes, and addressing data imbalance by employing a conditional generator and training-by-sampling.

    In this blog post, I used CTGAN to generate synthetic data based on a dataset on Credit Analysis collected from Kaggle.

    Pros of CTGAN

    • Generates synthetic tabular data that have similar statistical properties as the real data, including correlations between different columns.
    • Preserves the underlying structure of real data.
    • The synthetic data generated by CTGAN can be used for a variety of applications, such as data augmentation, data privacy, and data analysis.
    • Can handle continuous, discrete, and categorical data.

    Cons of CTGAN

    • CTGAN requires a large amount of real tabular data to train the model and generate synthetic data that have similar statistical properties to the real data.
    • CTGAN is computationally intensive and may require a significant amount of computing resources.
    • The quality of the synthetic data generated by CTGAN may vary depending on the quality of the real data used to train the model.

    Tuning CTGAN

    Like all other machine learning models, CTGAN performs better when it is tuned, and there are multiple parameters to consider while tuning CTGANs. However, for this demonstration, I used all the default parameters that come with the ctgan library:

    • Epochs: Number of times generator and discriminator networks are trained on the dataset.
    • Learning rate: The rate at which the model adjusts the weights during training.
    • Batch size: Number of samples used in each training iteration.
    • Size of the generator and discriminator networks.
    • Choice of the optimization algorithm.

    CTGAN also takes account of hyperparameters, such as the dimensionality of the latent space, the number of layers in the generator and discriminator networks, and the activation functions used in each layer. The choice of parameters and hyperparameters affects the performance and quality of the generated synthetic data.

    Validation of CTGAN

    Validation of CTGAN is tricky, as it comes with limitations such as the difficulty of evaluating the quality of the generated synthetic data, particularly for tabular data. Though there are metrics that can be used to evaluate the similarity between the real and synthetic data, it can still be challenging to determine whether the synthetic data accurately represents the underlying patterns and relationships in the real data. Additionally, CTGAN is vulnerable to overfitting and can produce synthetic data that is too similar to the training data, which may limit its ability to generalize to new data.

    A few common validation techniques include:

    • Statistical tests: Compare the statistical properties of the generated and real data, for example with correlation analysis or with the Kolmogorov-Smirnov, Anderson-Darling, and chi-squared tests for comparing distributions.
    • Visualization: Plot histograms, scatterplots, or heatmaps to visualize the similarities and differences.
    • Application testing: Use the synthetic data in real-world applications to see if it performs similarly to the real data.
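For instance, the Kolmogorov-Smirnov test can be run on a single continuous column with scipy; the random numbers below merely stand in for a real/synthetic column pair:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Stand-ins for one continuous column (e.g. AMT_INCOME_TOTAL) from the
# real data and its synthetic counterpart.
real_col = rng.normal(loc=50_000, scale=10_000, size=5_000)
synthetic_col = rng.normal(loc=50_500, scale=10_300, size=5_000)

# Two-sample KS test: a small statistic (and a large p-value) means the two
# empirical distributions are hard to tell apart.
ks_stat, p_value = stats.ks_2samp(real_col, synthetic_col)
print(f"KS statistic: {ks_stat:.3f}, p-value: {p_value:.3f}")
```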

    Case Study

    About Credit Analysis Data

    Credit analysis data contains client data in continuous and discrete/categorical formats. For demonstration purposes, I pre-processed the data by removing rows with null values and deleting a few columns that were not needed. Running all rows and columns would require more computation power than I have available. Here is the list of continuous and categorical variables (discrete values such as Count of Children (CNT_CHILDREN) are treated as categorical variables):

    Categorical Variables:

    TARGET  NAME_CONTRACT_TYPE  CODE_GENDER  FLAG_OWN_CAR  FLAG_OWN_REALTY  CNT_CHILDREN

    Continuous Variables:

    AMT_INCOME_TOTAL  AMT_CREDIT  AMT_ANNUITY  AMT_GOODS_PRICE

    Generative models require a large amount of clean data to be trained on for better results. However, due to limitations in computation power, I have selected only 10,000 rows (precisely 9,993) from the over 300,000 rows of real data for this demonstration. Although this number may be considered relatively small, it should be sufficient for the purpose of this demonstration.

    Location of the Real Data:

    https://www.kaggle.com/datasets/kapoorshivam/credit-analysis

    Location of the generated synthetic Data:

    • Synthetic Credit Analysis Data by CTGAN (Kaggle)
    • Synthetic Tabular Data Set Generated by CTGAN (Research Gate)
    • DOI: 10.13140/RG.2.2.23275.82728

    Credit Analysis Data | Image by Author

    Results

    I have generated 10k (9,997 to be exact) synthetic data points and compared them to the real data. The results look good, although there is still room for improvement. In my analysis, I used the default parameters, with 'relu' as the activation function and 3000 epochs. Increasing the number of epochs should produce synthetic data that is even closer to the real data. The generator and discriminator losses also look good, with lower losses indicating closer similarity between the synthetic and real data:

    Generator and Discriminator loss | Image by Author

    The dots along the diagonal line in the Absolute Log Mean and Standard Deviation diagram indicate that the quality of the generated data is good.

    Absolute Log Mean and Standard Deviations of Numeric Data | Image by Author

    The cumulative sums in the following figures for continuous columns are not exactly overlapping, but they are close, which indicates a good generation of synthetic data and the absence of overfitting. The overlap in categorical/discrete data suggests that the synthetic data generated is near-real. Further statistical analyses are presented in the following figures:

    Cumulative Sums per Feature | Image by Author

    Distribution of Features| Image by Author
    Distribution of Features | Image by Author

    Principal Component Analysis | Image by Author

    The following correlation diagram shows noticeable correlations between the variables. It is important to note that even after thorough fine-tuning, there may be variations in properties between real and synthetic data. These differences can actually be beneficial, as they may reveal hidden properties within the dataset that can be leveraged to create novel solutions. It has been observed that increasing the number of epochs leads to improvements in the quality of synthetic data.
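One way to quantify what the two correlation heatmaps show is the element-wise absolute difference between the real and synthetic correlation matrices; the toy frame below stands in for the credit columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Toy "real" credit data with a built-in income/credit correlation, plus a
# noisy copy standing in for CTGAN output.
income = rng.normal(50_000, 10_000, 1_000)
real = pd.DataFrame({
    "AMT_INCOME_TOTAL": income,
    "AMT_CREDIT": income * 4 + rng.normal(0, 20_000, 1_000),
})
synthetic = real + rng.normal(0, 5_000, real.shape)

# Element-wise gap between the two correlation matrices; values near zero
# mean the synthetic data reproduced the relationships between variables.
gap = (real.corr() - synthetic.corr()).abs()
print(gap.round(3))
```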

    Correlation among variables (Real Data) | Image by Author
    Correlation among variables (Synthetic Data) | Image by Author

    The summary statistics of both the sample data and real data also appear to be satisfactory.

    Summary Statistics of Real Data and Synthetic Data | Image by Author

    Python Code

    # Install CTGAN
    !pip install ctgan

    # Install table evaluator to analyze generated synthetic data
    !pip install table_evaluator

    # Import libraries
    import torch
    import pandas as pd
    import seaborn as sns
    import torch.nn as nn

    from ctgan import CTGAN
    from ctgan.synthesizers.ctgan import Generator

    # Import training data
    data = pd.read_csv("./application_data_edited_2.csv")

    # Declare categorical columns
    categorical_features = [
        "TARGET",
        "NAME_CONTRACT_TYPE",
        "CODE_GENDER",
        "FLAG_OWN_CAR",
        "FLAG_OWN_REALTY",
        "CNT_CHILDREN",
    ]

    # Declare continuous columns
    continuous_cols = ["AMT_INCOME_TOTAL", "AMT_CREDIT", "AMT_ANNUITY", "AMT_GOODS_PRICE"]

    # Train model
    ctgan = CTGAN(verbose=True)
    ctgan.fit(data, categorical_features, epochs=100000)

    # Generate synthetic data
    synthetic_data = ctgan.sample(10000)

    # Analyze synthetic data
    from table_evaluator import TableEvaluator

    print(data.shape, synthetic_data.shape)
    table_evaluator = TableEvaluator(data, synthetic_data, cat_cols=categorical_features)
    table_evaluator.visual_evaluation()

    # Compute the correlation matrix of the synthetic data
    corr = synthetic_data.corr()

    # Plot the heatmap
    sns.heatmap(corr, annot=True, cmap="coolwarm")

    # Show summary statistics of the synthetic data
    summary = synthetic_data.describe()
    print(summary)

    Conclusion

    The training process of CTGAN is expected to converge to a point where the generated synthetic data becomes indistinguishable from the real data. However, in reality, convergence cannot be guaranteed. Several factors can affect the convergence of CTGAN, including the choice of hyperparameters, the complexity of the data, and the architecture of the models. Furthermore, the instability of the training process can lead to mode collapse, where the generator produces only a limited set of similar samples instead of exploring the full diversity of the data distribution.

    Ray Islam is a Data Scientist (AI and ML) and Advisory Specialist Leader at Deloitte, USA. He holds a PhD in Engineering from the University of Maryland, College Park, MD, USA and has worked with major companies like Lockheed Martin and Raytheon, serving clients such as NASA and the US Air Force. Ray also holds an MASc in Engineering from Canada, an MSc in International Marketing, and an MBA from the UK. He is also the Editor-in-Chief of the upcoming peer-reviewed International Research Journal of Ethics for AI (INTJEAI), and his research interests include generative AI, augmented reality, XAI, and ethics in AI.

    More On This Topic

    • 3 Steps for Harnessing the Power of Data
    • Tapping into the Potential of Data Products in 2023
    • 5 Reasons Why You Need Synthetic Data
    • A Community for Synthetic Data is Here and This is Why We Need It
    • 10 Use Cases for Privacy-Preserving Synthetic Data
    • Easy Synthetic Data in Python with Faker

    Dolly 2.0: ChatGPT Open Source Alternative for Commercial Use

    Dolly 2.0: ChatGPT Open Source Alternative for Commercial Use
    Image from Author | Bing Image Creator

    Dolly 2.0 is an open-source, instruction-following, large language model (LLM) that was fine-tuned on a human-generated dataset. It can be used for both research and commercial purposes.

    Dolly 2.0: ChatGPT Open Source Alternative for Commercial Use
    Image from Hugging Face Space by RamAnanth1

    Previously, the Databricks team released Dolly 1.0, an LLM that exhibits ChatGPT-like instruction-following ability and costs less than $30 to train. It used the Stanford Alpaca team's dataset, which was under a restricted (research-only) license.

    Dolly 2.0 resolves this issue by fine-tuning the 12B-parameter Pythia language model on a high-quality, human-generated instruction-following dataset labeled by Databricks employees. Both the model and the dataset are available for commercial use.

    Why do we need a Commercial License Dataset?

    Dolly 1.0 was trained on a Stanford Alpaca dataset, which was created using OpenAI API. The dataset contains the output from ChatGPT and prevents anyone from using it to compete with OpenAI. In short, you cannot build a commercial chatbot or language application based on this dataset.

    Most of the models released in the last few weeks, such as Alpaca, Koala, GPT4All, and Vicuna, suffered from the same issue. To get around it, we need to create new high-quality datasets that can be used commercially, and that is what the Databricks team has done with the databricks-dolly-15k dataset.

    databricks-dolly-15k Dataset

    The new dataset contains 15,000 high-quality human-labeled prompt/response pairs that can be used to design instruction tuning large language models. The databricks-dolly-15k dataset comes with Creative Commons Attribution-ShareAlike 3.0 Unported License, which allows anyone to use it, modify it, and create a commercial application on it.
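For illustration, a single databricks-dolly-15k record looks roughly like this (the field names follow the published dataset; the values here are invented):

```python
import json

# One illustrative databricks-dolly-15k record. The values are invented for
# this sketch; the field names follow the published dataset schema.
record = {
    "instruction": "When was the first transatlantic telegraph cable completed?",
    "context": "",  # optional reference text; empty for open questions
    "response": "The first transatlantic telegraph cable was completed in 1858.",
    "category": "open_qa",
}

# The dataset is distributed as JSON lines: one such object per line.
line = json.dumps(record)
print(json.loads(line)["category"])
```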

    How did they create the databricks-dolly-15k dataset?

    The OpenAI research paper states that the original InstructGPT model was trained on 13,000 prompts and responses. Using this information, the Databricks team started working on its own dataset, and it turned out that generating 13k questions and answers was a difficult task. They could not use synthetic or AI-generated data; they had to write original answers to every question. This is where they decided to turn to Databricks' 5,000 employees to create human-generated data.

    Databricks set up a contest in which the top 20 labelers would get a big award. In this contest, 5,000 Databricks employees who were very interested in LLMs participated.

    Results

    The dolly-v2-12b is not a state-of-the-art model. It underperforms dolly-v1-6b in some evaluation benchmarks, which might be due to the composition and size of the underlying fine-tuning datasets. The Dolly model family is under active development, so you might see an updated version with better performance in the future.

    Still, the dolly-v2-12b model has performed better than EleutherAI/gpt-neox-20b and EleutherAI/pythia-6.9b.

    Dolly 2.0: ChatGPT Open Source Alternative for Commercial Use
    Image from Free Dolly Getting Started

    Dolly 2.0 is 100% open-source. It comes with training code, dataset, model weights, and inference pipeline. All of the components are suitable for commercial use. You can try out the model on Hugging Face Spaces Dolly V2 by RamAnanth1.

    Dolly 2.0: ChatGPT Open Source Alternative for Commercial Use
    Image from Hugging Face

    Resource:

    • Training and inference code: databrickslabs/dolly
    • Dolly 2.0 model weights: databricks/dolly-v2-12b
    • databricks-dolly-15k dataset: dolly/data

    Dolly 2.0 Demo: Dolly V2 by RamAnanth1
    Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in Technology Management and a bachelor's degree in Telecommunication Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.

    More On This Topic

    • OpenChatKit: Open-Source ChatGPT Alternative
    • 8 Open-Source Alternative to ChatGPT and Bard
    • The 7 Best Open Source AI Libraries You May Not Have Heard Of
    • Top Open Source Large Language Models
    • GitHub Copilot Open Source Alternatives
    • Baize: An Open-Source Chat Model (But Different?)

    Data Scientist Job Salaries Analysis

    Data Scientist Job Salaries Analysis
    Photo by Tima Miroshnichenko

    Data Science and Machine Learning are increasingly gaining popularity in fields such as sports, art, space, medicine, healthcare, and many more. It would be insightful to look at the salaries and current employment status of data scientists across the world.

    The dataset was downloaded from Kaggle (link below); we will perform an exploratory analysis of the data and visualize it. https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries

    The dataset is divided based on Experience level as follows:

    1. EN: Entry Level
    2. MI: Mid Level
    3. SE: Senior Level
    4. EX: Executive Level

    The dataset is divided based on Employment types as follows:

    1. FT: Full Time
    2. PT: Part Time
    3. CT: Contract basis
    4. FL: Freelancer

    The dataset is divided based on Company size as follows:

    1. S: Small
    2. M: Medium
    3. L: Large

    Exploratory Analysis and Visualization

    In this section, we will be doing exploratory data analysis and visualization of a given dataset. The following items are on the agenda:

    1. Distribution of Experience level
    2. Distribution of Work type
    3. Comparing salaries of Data scientist jobs based on Experience Level
    4. Comparing salaries of Data scientist jobs based on Employment type
    5. Comparing salaries based on experience level and Size of Company
    6. Comparing Data scientist salaries across the world
    7. Average salary as a function of currency
    8. Average Salary as a Function of location
    9. Top 10 Data Science job positions
    10. Remote work status as a function of time

    In the dataset, 14.5% of members are entry-level, while the largest share, 46.1%, is made up of senior-level engineers.

    During the period 2020–2022, 62.8% of members shifted to work-from-home arrangements owing to the Covid-19 crisis. Later, we will see the trend shifting back to normalcy.

    Naturally, it can be seen that the more the experience, the better you get paid for it. However, at the highest executive level, the salaries vary much more as compared to other levels.

    It seems contract-based jobs earn the most of all employment types, although the variation in their pay scale is also very high. An interesting observation is that freelancers earn more than part-timers, but the variation in their pay scales looks almost proportional.

    Adding a ‘company size’ hue to the previous ‘experience level’ vs ‘salary’ graph reveals more information. Senior-level salaries on average coincide with executive-level salaries. Moreover, the average senior-level salaries at small companies almost coincide with the executive-level salaries at companies of the same size.

    Summing the salaries column yields data that is heavily skewed toward the USA. This might be due to several factors: most data scientist jobs being created in the USA, the data being mostly collected in the US, or the data collection form being in English and therefore not circulated in non-English-speaking countries. To linearize the data, we take the log10 of the salaries column and use these scaled values to draw the colors of the map.

    The sum of salaries might not be a correct measure for comparison as the entries in a particular country might be more than others. So we plot the mean keeping the log10 scale. This gives a much better idea of salaries across the world.
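The two steps described above, taking the mean instead of the sum and then log10-scaling it, can be sketched with pandas. The toy frame below stands in for the Kaggle dataset and borrows its column names:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Kaggle salary data (column names follow that dataset).
df = pd.DataFrame({
    "company_location": ["US", "US", "IN", "DE", "IN", "GB"],
    "salary_in_usd": [150_000, 120_000, 30_000, 80_000, 25_000, 90_000],
})

# A raw sum is skewed toward countries with many entries, so compare the
# mean salary instead, then log10-scale it before mapping values to colors.
mean_salary = df.groupby("company_location")["salary_in_usd"].mean()
log_scaled = np.log10(mean_salary)
print(log_scaled.sort_values(ascending=False))
```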

    It can be observed that the majority of data science jobs are in the United States of America (US), which also has the highest-paying jobs. Canada (CA), Japan (JP), Germany (DE), the United Kingdom (GB), Spain (ES), France (FR), Greece (GR), and India (IN) follow in terms of salaries and number of jobs (except Japan), in that order.

    Taking average salaries as a function of currency reveals that people earn most in USD, followed by the Swiss franc and Singapore dollar. This graph is heavily influenced by the value of a particular currency as most currencies on the left-hand side of the graph have relatively high values against USD.

    The location of the company also plays a vital role in determining the mean salaries. The top 10 countries in terms of average salaries are plotted.

    It can be observed that Data Scientist is the most common job title followed by Data Engineer and Data Analyst respectively.

    Due to the Covid-19 crisis, most jobs shifted to a work-from-home modality; however, as vaccines started rolling out, everything began returning to normalcy.

    Inferences and Conclusion

    Detailed data analysis is done for the given dataset of Data Science Job Salaries. It can be concluded that:

    1. Data Science is one of the most popular and emerging fields in almost all industries such as Healthcare, Sports, Art, etc.
    2. The variation in the average salary of data scientists across the world is explored.
    3. The variation of salaries across types of employment, such as contract basis, full-time, etc., is substantial.
    4. The variation of salaries as you gain experience is a rising curve.
    5. Owing to the Covid-19 crisis, the work environment shifted to work from home, and back to normalcy as time passed.

    References and Future Work

    All the useful links are listed below:

    1. https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries
    2. https://jovian.ai/
    3. https://plotly.com/python/choropleth-maps/
    4. https://plotly.com/

    Nikhil Purao is currently pursuing a Master of Technology degree at IIT Guwahati with a focus on data and decision sciences. An AI enthusiast, he is passionate about using advanced analytics and artificial intelligence to drive business growth and improve outcomes. Through his studies, he has gained a deep understanding of the latest tools and techniques in the field, and he is committed to staying at the forefront of this exciting discipline. Whether it's uncovering key insights from complex datasets or developing cutting-edge solutions, he is always eager to take on new challenges and collaborate with others to achieve success.

    Original. Reposted with permission.

    More On This Topic

    • Top Jobs and Salaries in Data Science in 2022
    • Data Analysis Using Tableau
    • Data Analysis Using Scala
    • 5 Data Analysis Projects For Beginners
    • How To Collect Data For Customer Sentiment Analysis
    • The Data Matters: Choosing the right data to analyze can make or break your…

    The Ethics of AI: Navigating the Future of Intelligent Machines

    The Ethics of AI: Navigating the Future of Intelligent Machines
    Image by Author

    Everybody has a different opinion on artificial intelligence and its future, shaped by their own experience. Some believed it was just another fad that was going to die out soon, whilst others believed there was huge potential to implement it into our everyday lives.

    At this point, it’s clear to say that AI is having a big impact on our lives and is here to stay.

    With the recent advancements in AI technology such as ChatGPT, and autonomous systems such as Baby AGI, we can count on the continuous advancement of artificial intelligence in the future. It is nothing new; it's the same drastic change we saw with the arrival of computers, the internet, and smartphones.

    A few years ago, there was a survey conducted with 6,000 customers in six countries, where only 36% of consumers were comfortable with businesses using AI and 72% expressed that they had some fear about the use of AI.

    Although it is very interesting, it can also be concerning. Although we expect more to come in the future regarding AI, the big question is ‘What are the ethics around it?’.

    The most developing and implemented area of AI development is machine learning. This allows models to learn and improve using past experience by exploring data and identifying patterns with little human intervention. Machine learning is used in different sectors, from finance to healthcare. We have virtual assistants such as Alexa, and now we have large language models such as ChatGPT.

    So how do we determine the ethics around these AI applications, and how it will affect the economy and society?

    The Ethical Concerns of AI

    There are a few ethical concerns surrounding AI:

    1. Bias and Discrimination

    Although data is the new oil and we have a lot of it, there are still concerns about AI being biased and discriminative with the data it has. For example, the use of facial recognition applications has proven to be highly biased and discriminative to certain ethnic groups, such as people with darker skin tones.

    Even though some of these facial recognition applications showed high racial and gender bias, companies such as Amazon refused to stop selling the product to the government in 2018.

    2. Privacy

    Another concern around the use of AI applications is privacy. These applications require a vast amount of data to produce accurate outputs and have high performance. However, there are concerns regarding data collection, storage, and use.

    3. Transparency

    Although AI applications are fed with data, there is high concern about the transparency of how these applications reach their decisions. This lack of transparency raises the question of who should be held accountable for the outcome.

    4. Autonomous Applications

    We have seen the birth of Baby AGI, an autonomous task manager. Autonomous applications have the ability to make decisions without the help of a human. This naturally raises public concern about leaving decisions to be made by technology, which could be deemed ethically or morally wrong in society's eyes.

    5. Job security

    This concern has been an ongoing conversation since the birth of artificial intelligence. With more and more people seeing that technology can do their job, such as ChatGPT creating content and potentially replacing content creators — what are the social and economic consequences of implementing AI into our everyday lives?

    The Future of Ethical AI

    In April 2021, the European Commission published its proposed legislation on the use of AI, the AI Act. The act aimed to ensure that AI systems met fundamental rights and provided users and society with trust. It contained a framework that grouped AI systems into four risk areas: unacceptable risk, high risk, limited risk, and minimal or no risk. You can learn more about it here: European AI Act: The Simplified Breakdown.

    Other countries such as Brazil also passed a bill in 2021 that created a legal framework around the use of AI. Therefore, we can see that countries and continents around the world are looking further into the use of AI and how it can be ethically used.

    The fast advancements in AI will have to align with the proposed frameworks and standards. Companies building or implementing AI systems will have to follow ethical standards and assess their applications to ensure transparency and privacy, and to account for bias and discrimination.

    These frameworks and standards will need to focus on data governance, documentation, transparency, human oversight, and robust, accurate, cyber-safe AI systems. Companies that fail to comply will, unfortunately, have to deal with fines and penalties.

    Wrapping it up

    The launch of ChatGPT and the development of general-purpose AI applications have prompted scientists and politicians to establish a legal and ethical framework to avoid any potential harm or impact of AI applications.

    This year alone, many papers have been released on the use of AI and the ethics surrounding it, for example, Assessing the Transatlantic Race to Govern AI-Driven Decision-Making through a Comparative Lens. We will continue to see more papers released until governments publish a clear and concise framework for companies to implement.

    Nisha Arya is a Data Scientist, Freelance Technical Writer and Community Manager at KDnuggets. She is particularly interested in providing Data Science career advice or tutorials and theory based knowledge around Data Science. She also wishes to explore the different ways Artificial Intelligence is/can benefit the longevity of human life. A keen learner, seeking to broaden her tech knowledge and writing skills, whilst helping guide others.

    More On This Topic

    • Future Says Series | Discover the Future of AI
    • Making Intelligent Document Processing Smarter: Part 1
    • The Ethics of AI
    • Ethics, Fairness, and Bias in AI
    • Support Vector Machines: An Intuitive Approach
    • Coding Ethics for AI & AIOps: Designing Responsible AI Systems

    MiniGPT-4: A Lightweight Alternative to GPT-4 for Enhanced Vision-language Understanding

    MiniGPT-4: A Lightweight Alternative to GPT-4 for Enhanced Vision-language Understanding
    Image by Author

    We are seeing rapid development of open-source ChatGPT alternatives, but hardly anyone is working on an alternative to GPT-4, which provides multimodality. GPT-4 is an advanced and powerful multimodal model that accepts images and text as input and outputs text responses. It can solve complex problems with greater accuracy and learn from its mistakes.

    In this post, we will learn about MiniGPT-4, an open-source alternative to OpenAI’s GPT-4 that can understand both visual and textual context while being lightweight.

    What is MiniGPT-4?

    Similar to GPT-4, MiniGPT-4 can generate detailed image descriptions, write stories from images, and create a website from a hand-drawn user interface. It achieves this by utilizing a more advanced large language model (LLM).

    You can experience it yourself by trying out the demo: MiniGPT-4 — a Hugging Face Space by Vision-CAIR.

    MiniGPT-4: A Lightweight Alternative to GPT-4 for Enhanced Vision-language Understanding
    Image by Author | MiniGPT-4 Demo

    The authors of MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models found that pre-training on raw image-text pairs could produce poor results that lack coherency, including repetition and fragmented sentences. To counter this issue, they curated a high-quality, well-aligned dataset and fine-tuned the model using a conversational template.

    The MiniGPT-4 model is highly computationally efficient, as they have trained only a projection layer utilizing approximately 5 million aligned image-text pairs.

    How does MiniGPT-4 work?

    MiniGPT-4 aligns a frozen visual encoder with a frozen LLM called Vicuna using just one projection layer. The visual encoder consists of pretrained ViT and Q-Former models that are connected to an advanced Vicuna large language model via a single linear projection layer.

    MiniGPT-4: A Lightweight Alternative to GPT-4 for Enhanced Vision-language Understanding
    Image by Author | The architecture of MiniGPT-4.

    MiniGPT-4 only requires training the linear layer to align the visual features with Vicuna. So, it is lightweight, requires less computational resources, and produces similar results to GPT-4.
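Conceptually, that single trainable piece is just a linear map from the frozen visual features into the LLM's input embedding space. A numpy sketch, with hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: Q-Former visual features -> the LLM's embedding size.
visual_dim, llm_dim, num_tokens = 768, 4096, 32

# In MiniGPT-4 only the projection is trained; the visual encoder and the
# LLM stay frozen. One linear layer is just a matrix multiply plus a bias.
W = rng.normal(scale=0.02, size=(visual_dim, llm_dim))
b = np.zeros(llm_dim)

visual_features = rng.normal(size=(num_tokens, visual_dim))  # frozen encoder output
llm_inputs = visual_features @ W + b  # tokens now live in the LLM's input space

print(llm_inputs.shape)
```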

    Results

    If you look at the official results at minigpt-4.github.io, you will see that the authors have created a website by uploading the hand-drawn UI and asking it to write an HTML/JS website. The MiniGPT-4 understood the context and generated HTML, CSS, and JS code. It is amazing.

    MiniGPT-4: A Lightweight Alternative to GPT-4 for Enhanced Vision-language Understanding
    Image from minigpt-4.github.io

    They have also shown how you can use the model to generate a recipe by providing food images, writing advertisements for the product, describing a complex image, explaining the painting, and more.

    Let’s try this ourselves by heading to the MiniGPT-4 demo. I provided a Bing AI-generated image and asked MiniGPT-4 to write a story using it. The result is amazing.

    The story is coherent.

    MiniGPT-4: A Lightweight Alternative to GPT-4 for Enhanced Vision-language Understanding
    Image by Author | MiniGPT-4 Demo

    I wanted to know more, so I asked it to continue writing, and just like an AI chatbot, it kept writing the plot.

    MiniGPT-4: A Lightweight Alternative to GPT-4 for Enhanced Vision-language Understanding
    Image by Author | MiniGPT-4 Demo

    In the second example, I asked it to help me improve the design of the image and then asked it to generate subtitles for the blog using the image.

    MiniGPT-4: A Lightweight Alternative to GPT-4 for Enhanced Vision-language Understanding
    Image by Author | MiniGPT-4 Demo

    MiniGPT-4 is amazing. It learns from mistakes and produces high-quality responses.

    Limitations

    MiniGPT-4 has many advanced vision-language capabilities, but it still faces several limitations.

    • Currently, model inference is slow even on high-end GPUs, which leads to delayed responses.
    • The model is built upon LLMs, so it inherits their limitations like unreliable reasoning ability and hallucinating non-existent knowledge.
    • The model has limited visual perception and may struggle to recognize detailed textual information in images.

    Getting Started

    The project comes with training, fine-tuning, and inference of source code. It also includes publicly available model weights, dataset, research paper, demo video, and link to the Hugging Face demo.

    You can start hacking, start fine-tuning the model on your dataset, or just experience the model through various instances of the official demo on the official page.

    • Official Page: minigpt-4.github.io
    • Research Paper: MiniGPT-4/MiniGPT_4.pdf
    • GitHub: Vision-CAIR/MiniGPT-4
    • Gradio Demo: Demo of MiniGPT-4
    • Demo Video: MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
    • Model Weights: Vision-CAIR/MiniGPT-4
    • Dataset: Vision-CAIR/cc_sbu_align

    It is the first version of the model. You will see a more improved version in the upcoming days, so stay tuned.
    Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in Technology Management and a bachelor's degree in Telecommunication Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.

    More On This Topic

    • Multimodal Grounded Learning with Vision and Language
    • Vision Transformers: Natural Language Processing (NLP) Increases Efficiency…
    • OpenAI Releases Two Transformer Models that Magically Link Language and…
    • Top Python Libraries for Deep Learning, Natural Language Processing &…
    • How our Obsession with Algorithms Broke Computer Vision: And how Synthetic…
    • N-gram Language Modeling in Natural Language Processing

    Dealing With Noisy Labels in Text Data

    Dealing With Noisy Labels in Text Data
    Image by Editor

    With the rising interest in natural language processing, more and more practitioners are hitting the wall not because they can’t build or fine-tune LLMs, but because their data is messy!

    We will show simple, yet very effective coding procedures for fixing noisy labels in text data. We will deal with 2 common scenarios in real-world text data:

    1. Having a category that contains mixed examples from a few other categories. I love to call this kind of category a meta category.
    2. Having 2 or more categories that should be merged into 1 category because texts belonging to them refer to the same topic.

    We will use the ITSM (IT Service Management) dataset created for this tutorial (CC0 license). It's available on Kaggle at the link below:

    https://www.kaggle.com/datasets/nikolagreb/small-itsm-dataset

    It's time to start by importing all the libraries we need and doing a basic data examination. Brace yourself, code is coming!

    Import and Data Examination

    import pandas as pd
    import numpy as np
    import string

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import ComplementNB
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import train_test_split
    from sklearn import metrics

    df = pd.read_excel("ITSM_data.xlsx")
    df.info()
    RangeIndex: 118 entries, 0 to 117
    Data columns (total 7 columns):
     #   Column                 Non-Null Count  Dtype
    ---  ------                 --------------  -----
     0   ID_request             118 non-null    int64
     1   Text                   117 non-null    object
     2   Category               115 non-null    object
     3   Solution               115 non-null    object
     4   Date_request_recieved  118 non-null    datetime64[ns]
     5   Date_request_solved    118 non-null    datetime64[ns]
     6   ID_agent               118 non-null    int64
    dtypes: datetime64[ns](2), int64(2), object(3)
    memory usage: 6.6+ KB

    Each row represents one entry in the ITSM database. We will try to predict the category of a ticket based on the text written by the user. Let's take a deeper look at the fields most important for the described business use case.

    for text, category in zip(df.Text.sample(3, random_state=2), df.Category.sample(3, random_state=2)):
        print("TEXT:")
        print(text)
        print("CATEGORY:")
        print(category)
        print("-"*100)
    TEXT:
    I just want to talk to an agent, there are too many problems on my pc to be explained in one ticket. Please call me when you see this, whoever you are. (talk to agent)
    CATEGORY:
    Asana
    ----------------------------------------------------------------------------------------------------
    TEXT:
    Asana funktionierte nicht mehr, nachdem ich meinen Laptop neu gestartet hatte. Bitte helfen Sie.
    CATEGORY:
    Help Needed
    ----------------------------------------------------------------------------------------------------
    TEXT:
    My mail stopped to work after I updated Windows.
    CATEGORY:
    Outlook
    ----------------------------------------------------------------------------------------------------

    If we take a look at the first two tickets, although one ticket is in German, we can see that the described problems refer to the same software, Asana, yet they carry different labels. This is the starting distribution of our categories:

    df.Category.value_counts(normalize=True, dropna=False).mul(100).round(1).astype(str) + "%"
    Outlook             19.1%
    Discord             13.9%
    CRM                 12.2%
    Internet Browser    10.4%
    Mail                 9.6%
    Keyboard             9.6%
    Asana                8.7%
    Mouse                8.7%
    Help Needed          7.8%
    Name: Category, dtype: object

    The Help Needed category looks suspicious: it seems like a category that could contain tickets from multiple other categories. Also, the categories Outlook and Mail sound similar; maybe they should be merged into one category. Before diving deeper into these categories, we will get rid of missing values in the columns of interest.

    important_columns = ["Text", "Category"]
    for cat in important_columns:
        df.drop(df[df[cat].isna()].index, inplace=True)
    df.reset_index(inplace=True, drop=True)

    Assigning Tickets to the Proper Category

    There isn't a valid substitute for examining data with the naked eye. The handy function for doing so in pandas is .sample(), so we will use it once more, now on the suspicious category:

    meta = df[df.Category == "Help Needed"]

    for text in meta.Text.sample(5, random_state=2):
        print(text)
        print("-"*100)
    Discord emojis aren't available to me, I would like to have this option enabled like other team members have.
    ----------------------------------------------------------------------------------------------------
    Bitte reparieren Sie mein Hubspot CRM. Seit gestern funktioniert es nicht mehr
    ----------------------------------------------------------------------------------------------------
    My headphones aren't working. I would like to order new.
    ----------------------------------------------------------------------------------------------------
    Bundled problems with Office since restart:
    Messages not sent
    Outlook does not connect, mails do not arrive
    Error 0x8004deb0 appears when Connection attempt, see attachment
    The company account is affected: AB123
    Access via Office.com seems to be possible.
    ----------------------------------------------------------------------------------------------------
    Asana funktionierte nicht mehr, nachdem ich meinen Laptop neu gestartet hatte. Bitte helfen Sie.
    ----------------------------------------------------------------------------------------------------

    Obviously, we have tickets talking about Discord, Asana, and CRM, so these tickets should be moved from "Help Needed" to the existing, more specific categories. As the first step of the reassignment process, we will create a new column, "Keywords", that indicates whether the "Text" column contains a word from the list of categories.

    words_categories = np.unique([word.strip().lower() for word in df.Category])  # list of categories

    def keywords(row):
        list_w = []
        for word in row.translate(str.maketrans("", "", string.punctuation)).lower().split():
            if word in words_categories:
                list_w.append(word)
        return list_w

    df["Keywords"] = df.Text.apply(keywords)

    # since our output is a list, this function will give us a better looking final output
    def clean_row(row):
        row = str(row)
        row = row.replace("[", "")
        row = row.replace("]", "")
        row = row.replace("'", "")
        row = string.capwords(row)
        return row

    df["Keywords"] = df.Keywords.apply(clean_row)

    Also, note that using "if word in str(words_categories)" instead of "if word in words_categories" would also catch words from categories made of more than one word (Internet Browser in our case), but it would require more data preprocessing. To keep things simple and to the point, we will stick with the code that handles single-word categories only. This is how our dataset looks now:
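    The note above can be illustrated with a minimal sketch (the category list here is a shortened stand-in for `words_categories`): membership in the list matches only whole elements, while substring matching against the stringified list also catches pieces of multi-word categories.

    ```python
    # Shortened stand-in for the words_categories list from the tutorial.
    words_categories = ["outlook", "discord", "internet browser"]

    word = "internet"

    # Exact membership: "internet" is not an element of the list.
    print(word in words_categories)       # False

    # Substring match over the string representation of the list:
    # "internet" does appear inside "internet browser", so this is True,
    # but it would also match accidental substrings and needs more care.
    print(word in str(words_categories))  # True
    ```

    This is the trade-off the tutorial mentions: the substring variant finds multi-word categories but requires extra preprocessing to avoid false matches.
    
    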

    df.head(2)

    (Output shown as an image in the original article.)

    After extracting the keywords column, we will assess the quality of the tickets. Our hypotheses:

    1. A ticket with just 1 keyword in the Text field that is the same as the category to which the ticket belongs would be easy to classify.
    2. A ticket with multiple keywords in the Text field, where at least one keyword is the same as the category to which the ticket belongs, would be easy to classify in the majority of cases.
    3. A ticket that has keywords, but none of them equal to the name of the category to which the ticket belongs, is probably a noisy label case.
    4. Other tickets are neutral based on keywords.
    cl_list = []

    for category, keywords in zip(df.Category, df.Keywords):
        if category.lower() == keywords.lower() and keywords != "":
            cl_list.append("easy_classification")
        elif category.lower() in keywords.lower():  # to deal with multiple keywords in the ticket
            cl_list.append("probably_easy_classification")
        elif category.lower() != keywords.lower() and keywords != "":
            cl_list.append("potential_problem")
        else:
            cl_list.append("neutral")

    df["Ease_classification"] = cl_list
    df.Ease_classification.value_counts(normalize=True, dropna=False).mul(100).round(1).astype(str) + "%"
    neutral                         45.6%
    easy_classification             37.7%
    potential_problem                9.6%
    probably_easy_classification     7.0%
    Name: Ease_classification, dtype: object

    We have our new distribution, and now it is time to examine the tickets classified as potential problems. In practice, this step would require much more sampling and looking at larger chunks of data with the naked eye, but the rationale is the same: find problematic tickets and decide whether you can improve their quality or should drop them from the dataset. When you are facing a large dataset, stay calm, and don't forget that data examination and data preparation usually take much more time than building ML algorithms!

    pp = df[df.Ease_classification == "potential_problem"]

    # sample the same number of rows from both columns so text/category pairs stay aligned
    for text, category in zip(pp.Text.sample(3, random_state=2), pp.Category.sample(3, random_state=2)):
        print("TEXT:")
        print(text)
        print("CATEGORY:")
        print(category)
        print("-"*100)
    TEXT:
    outlook issue , I did an update Windows and I have no more outlook on my notebook ? Please help !
    CATEGORY:
    Mail
    ----------------------------------------------------------------------------------------------------
    TEXT:
    Please relase blocked attachements from the mail I got from name.surname@company.com. These are data needed for social media marketing campaing.
    CATEGORY:
    Outlook
    ----------------------------------------------------------------------------------------------------
    TEXT:
    Asana funktionierte nicht mehr, nachdem ich meinen Laptop neu gestartet hatte. Bitte helfen Sie.
    CATEGORY:
    Help Needed
    ----------------------------------------------------------------------------------------------------

    We can see that tickets from the Outlook and Mail categories refer to the same problem, so we will merge these 2 categories and improve the results of our future ML classification algorithm.

    Merging into the Cluster

    mail_categories_to_merge = ["Outlook", "Mail"]

    sum_mail_cluster = 0
    for x in mail_categories_to_merge:
        sum_mail_cluster += len(df[df["Category"] == x])

    print("Number of categories to be merged into new cluster: ", len(mail_categories_to_merge))
    print("Expected number of tickets in the new cluster: ", sum_mail_cluster)

    def rename_to_mail_cluster(category):
        if category in mail_categories_to_merge:
            category = "Mail_CLUSTER"
        return category

    df["Category"] = df["Category"].apply(rename_to_mail_cluster)

    df.Category.value_counts()
    Number of categories to be merged into new cluster:  2
    Expected number of tickets in the new cluster:  33

    Mail_CLUSTER        33
    Discord             15
    CRM                 14
    Internet Browser    12
    Keyboard            11
    Asana               10
    Mouse               10
    Help Needed          9
    Name: Category, dtype: int64

    Last, but not least, we want to relabel some tickets from the meta category “Help Needed” to the proper category.

    # build a boolean mask: True where the ticket text contains a category keyword
    mask = pd.Series([bool(set(x).intersection(words_categories))
                      for x in df["Text"].str.lower().str.replace(r"[^\w\s]", "", regex=True).str.split()])

    df.loc[(df["Category"] == "Help Needed") & mask, "Category"] = "Change"

    def cat_name_change(cat, keywords):
        if cat == "Change":
            cat = keywords
        return cat

    df["Category"] = df.apply(lambda x: cat_name_change(x.Category, x.Keywords), axis=1)
    df["Category"] = df["Category"].replace({"Crm": "CRM"})

    df.Category.value_counts(dropna=False)
    Mail_CLUSTER        33
    Discord             16
    CRM                 15
    Internet Browser    12
    Asana               11
    Keyboard            11
    Mouse               10
    Help Needed          6
    Name: Category, dtype: int64

    We did our data relabeling and cleaning, but we shouldn't call ourselves data scientists if we don't run at least one scientific experiment to test the impact of our work on the final classification. We will do so by implementing the Complement Naive Bayes classifier from sklearn. Feel free to try other, more complex algorithms. Also, be aware that further data cleaning could be done: for example, we could drop all tickets left in the "Help Needed" category.
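    That extra cleaning step is a one-liner in pandas. A minimal sketch, using a tiny hypothetical frame in place of the real ITSM data:

    ```python
    import pandas as pd

    # Hypothetical mini-frame standing in for the cleaned ITSM data.
    df = pd.DataFrame({
        "Text": ["mouse not clicking", "no idea what is wrong", "discord is down"],
        "Category": ["Mouse", "Help Needed", "Discord"],
    })

    # Drop every ticket still labeled with the meta category.
    df = df[df.Category != "Help Needed"].reset_index(drop=True)

    print(df.Category.tolist())  # ['Mouse', 'Discord']
    ```

    Whether dropping these tickets helps depends on how many remain and how much label noise they carry, so it is worth testing both variants against the classifier below.
    
    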

    Testing the Impact of Data Munging

    model = make_pipeline(TfidfVectorizer(), ComplementNB())

    # old df
    df_o = pd.read_excel("ITSM_data.xlsx")

    important_categories = ["Text", "Category"]
    for cat in important_categories:
        df_o.drop(df_o[df_o[cat].isna()].index, inplace=True)

    df_o.name = "dataset just without missing"
    df.name = "dataset after deeper cleaning"

    for dataframe in [df_o, df]:
        # Split dataset into training set and test set
        X_train, X_test, y_train, y_test = train_test_split(dataframe.Text, dataframe.Category, test_size=0.2, random_state=1)

        # Train the model with train data
        model.fit(X_train, y_train)

        # Predict the response for the test dataset
        y_pred = model.predict(X_test)

        print(f"Accuracy of Complement Naive Bayes classifier model on {dataframe.name} is: {round(metrics.accuracy_score(y_test, y_pred), 2)}")
    Accuracy of Complement Naive Bayes classifier model on dataset just without missing is: 0.48
    Accuracy of Complement Naive Bayes classifier model on dataset after deeper cleaning is: 0.65

    Pretty impressive, right? The dataset we used is small (on purpose, so you can easily see what happens in each step), so different random seeds might produce different results, but in the vast majority of cases, the model will perform significantly better on the cleaned dataset than on the original one. We did a good job!
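    To make the comparison less sensitive to a single lucky split, you can repeat the train/test split with several seeds and average the accuracy. A minimal sketch of that idea, using a hypothetical toy corpus in place of the ITSM tickets:

    ```python
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import ComplementNB
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import train_test_split
    from sklearn import metrics

    # Hypothetical toy corpus standing in for the cleaned ticket data.
    texts = ["outlook mail broken", "mail not arriving in outlook",
             "discord emojis missing", "discord voice chat down",
             "mouse not clicking", "new mouse needed"] * 5
    labels = ["Mail_CLUSTER", "Mail_CLUSTER", "Discord", "Discord",
              "Mouse", "Mouse"] * 5

    model = make_pipeline(TfidfVectorizer(), ComplementNB())

    # Repeat the split with several seeds and average, so one lucky
    # (or unlucky) split does not drive the conclusion.
    scores = []
    for seed in range(5):
        X_tr, X_te, y_tr, y_te = train_test_split(
            texts, labels, test_size=0.2, random_state=seed)
        model.fit(X_tr, y_tr)
        scores.append(metrics.accuracy_score(y_te, model.predict(X_te)))

    print(f"mean accuracy over 5 seeds: {np.mean(scores):.2f}")
    ```

    Applying the same loop to both the "just without missing" and the "after deeper cleaning" frames gives a more robust before/after comparison than a single random_state.
    
    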
    Nikola Greb has been coding for more than four years, and for the past two years he has specialized in NLP. Before turning to data science, he was successful in sales, HR, writing, and chess.

    More On This Topic

    • How Noisy Labels Impact Machine Learning Models
    • Dealing with Data Leakage
    • Dealing with Imbalanced Data in Machine Learning
    • Dealing with Position Bias in Recommendations and Search
    • How to Clean Text Data at the Command Line
    • eBook: Vocabularies, Text Mining and FAIR Data: The Strategic Role…