Will Coding Jobs Cease to Exist in Three Years? 

Matt Welsh, a former professor of computer science and Google engineer, believes that programming jobs, as we know them today, will cease to exist in three years.

There are some interesting aspects to this:

a) Matt Welsh has a development and computer science background

b) The talk was presented at an online meetup of ACM Chicago and also published on the ACM website as a viewpoint

c) There is a specific timeline proposed for the supposed end of coding jobs (three years)

However, the author is setting up a startup to address this problem, which introduces a potential bias into his views

Having considered this background, let’s explore the claims

a) Through ChatGPT and GPT, programmers will evolve primarily into teachers of AI programs

b) The roles of product managers and code reviewers will not change much

c) Programming languages have not really evolved over the years

d) GitHub Copilot offers a radical new way to code

e) Copilot is a fantastic productivity boost because it saves people from switching contexts.

Why three years?

Because Matt Welsh believes that the only thing Copilot needs today is more data and compute

In my view, this analysis is probably accurate, especially because GitHub Copilot is expanding to all aspects of the developer lifecycle, including documentation, testing, pull requests, etc.

I think most developers will need to prepare for a radical change in their roles

Image source pixabay

How to learn Artificial Intelligence in 2023

Artificial Intelligence is among the largest and fastest technological waves to hit the world of tech. Globally, the AI market is estimated to grow at a rate of 154 percent.

According to the research done by Gartner about Artificial Intelligence:

1. AI will create business value worth USD 3.9 trillion by 2025.

2. AI is estimated to be the most disruptive technology in the coming years due to improvements in computing power, capacity, speed, and data diversity, and advances in Deep Neural Networks (DNN).

3. Decision automation systems, advanced systems that use AI to automate business processes or tasks such as translating voice or classifying data that conventional systems cannot easily classify, will grow to 16 percent within the next 4–5 years, a jump of 14 percentage points.

In this article, let’s understand how to learn Artificial Intelligence as a beginner in 2023 and become an artificial intelligence engineer.

Prerequisites

The first and most important step in learning AI as a beginner is to get a bachelor’s degree. One can choose a specialization from the following:

Mathematics

Statistics

Information Technology

Computer Science

Economics

Finance

Having a degree is a prerequisite for joining this growing industry. Beyond that, a master’s degree is a great added advantage. One can consider doing a master’s in one of the following areas:

Mathematics

Computer Science

Cognitive Science

Data Science

Artificial Intelligence

Key skills to become an AI engineer

To be a successful AI engineer, one should gain knowledge and excel in the following technical and non-technical skills, such as:

Technical skills

  1. Programming skills

AI engineers need a solid understanding of popular programming languages such as C++, Java, R, and Python, because they use these languages to build and deploy AI models.

2. Statistics, probability, and linear algebra

AI models are designed with the help of algorithms that are mainly based on statistics, algebra, and calculus. Apart from that, one must be familiar with probability to interact with artificial intelligence’s most common ML models such as:

  • Hidden Markov
  • Gaussian mixture
  • Naive Bayes models
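To make the role of probability concrete, here is a minimal sketch of a Naive Bayes text classifier written from scratch. The toy documents, the labels, and the function names `train_naive_bayes` and `predict` are illustrative assumptions, not part of any particular library.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs is a list of (words, label) pairs; returns the counts the model needs."""
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for words, label in docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict(words, class_counts, word_counts, vocab):
    """Return the class with the highest log-posterior for the given words."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior P(class)
        total_words = sum(word_counts[label].values())
        for w in words:
            # Laplace (add-one) smoothing avoids zero probabilities for unseen words
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data
docs = [
    ("free prize money now".split(), "spam"),
    ("win money free".split(), "spam"),
    ("meeting schedule for monday".split(), "ham"),
    ("project meeting notes".split(), "ham"),
]
model = train_naive_bayes(docs)
print(predict("free money".split(), *model))
print(predict("monday meeting".split(), *model))
```

The "naive" part is the assumption that words are independent given the class, which is why the per-word log-probabilities can simply be added.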

3. Deep Learning and Neural Networks

Deep learning and neural networks are very helpful for complex pattern recognition. A neural network is a system (software or hardware) that functions loosely like a human brain; the field of artificial neural networks was developed based on the brain’s neural functionality. AI models replicate aspects of human understanding and can also be applied to tasks far beyond human capabilities. Deep learning and neural networks simplify many tasks that play an essential role in artificial intelligence, such as image classification, translation, and speech recognition.

4. Natural Language Processing (NLP) libraries and tools

Natural Language Processing (NLP) combines computer science, information engineering, linguistics, and AI to program systems that process and analyze massive datasets. An artificial intelligence engineer should be able to perform NLP tasks such as language, audio, and video processing using NLP libraries and tools such as:

  • NLTK
  • Gensim
  • word2vec
  • TextBlob
  • CoreNLP
  • Sentiment Analytics
  • PyNLPl

5. Audio, video, and language processing

One needs a working knowledge of a few libraries to do language processing, such as Gensim and NLTK, along with techniques like summarization, sentiment analysis, and word2vec. Natural Language Processing (NLP) combines linguistics and computer science, and it usually deals with either audio or video.

Non-technical skills

  1. Industry knowledge

Successful AI projects are those that tackle the real pain points effectively. It is essential to have a good knowledge of the industry in which one is working and the ways one can reap several benefits for the business to grow.

2. Critical thinking

Individuals must stay updated with the latest industrial developments and data so that they can build better outputs depending on the findings. They must also adopt the best business practices as well as all the latest approaches to AI.

3. Collaboration skills

AI engineers in this area often do tasks within a team of other AI developers and IT professionals, so they are required to possess a certain ability to work efficiently and effectively within a team.

4. Iteration of Ideas

Iterating on ideas is quite essential for finding one that works. An individual should use various types of techniques to fabricate realistic scale models of solid parts or assemblies with the help of 3D computer-aided designs.

AI certification

By doing an AI certificate program from prominent organizations like Microsoft or the Artificial Intelligence Board of America (ARTiBA), one can acquire better knowledge and enhance skills in this field. Anyone who wants to pursue a career in AI should be ready with the right skill set and certifications.

DSC Weekly 25 April 2023 – Tech Layoffs and Uncertainty Raise Big Questions for Higher Education

Announcements

  • With so many legacy applications to modernize, developers have never been under so much pressure to perform. Cloud-native tools and methodologies – when implemented well – can provide much-needed relief, making deployments more efficient, accelerating time-to-value, and optimizing error detection. Join the two-day Cloud-Native Modernization summit to get a roadmap for cloud-native success and discover how to supercharge innovation with multi-cloud, microservices, containers, agile methodologies and more.
  • Automation helps companies keep on top of ever-growing workloads, cut costs and free workers from manual, tedious tasks. The automation market continues to advance to better address the increasing demands companies face. Tune into The Growth of Automation summit to hear leading experts discuss how AI, machine learning and other technologies can expand automation to provide further benefits, as well as the latest technologies to help successfully automate workflows.

Tech Layoffs and Uncertainty Raise Big Questions for Higher Education

Mass layoffs continue across the tech industry, with tens of thousands of workers losing their jobs in the first quarter of 2023. The reductions occurred from small startups to the biggest names in tech — Google, Amazon, Microsoft. Core technical roles such as data scientists and software engineers made up a huge chunk of the workers let go. The mass layoffs will no doubt influence IT worker morale in the coming quarters and even influence the number of candidates applying for these positions down the road.

But what about its influence on higher education? High schoolers considering a career would definitely pause before spending potentially hundreds of thousands of dollars to get a degree in a rapidly changing technological field. Data Science Central contributor Vincent Granville poses the question in his article this week, titled Is it Still Worth Getting a Machine Learning Degree?

Vincent raises some great questions about the state of higher education in relation to technology and notes that the economy isn’t the only factor that will influence tech hiring now and in the future. He also provides sage advice for those making career decisions. Despite the recent news headlines about seemingly constant tech layoffs, Vincent remains optimistic about the future of ML jobs, and tech jobs in general.

I won’t spoil it here, but Vincent predicts the market will eventually favor workers, despite factors such as AI and the economy creating uncertainty for them now. We want to hear your comments as well. The article will be posted on our Data Science Central LinkedIn page in the coming days. Please visit and provide your thoughts.

The Editors of Data Science Central

Contact The DSC Team if you are interested in contributing.

DSC Featured Articles

  • Will Coding Jobs Cease to Exist in Three Years?
    April 25, 2023
    by ajitjaokar
  • How to learn Artificial Intelligence in 2023
    April 25, 2023
    by Aileen Scott
  • How Businesses can benefit by Integrating ChatGPT in their Apps
    April 25, 2023
    by Seven Kole
  • What is Modern Data Quality?
    April 25, 2023
    by Vanitha
  • Internal CPU Accelerators and HBM Enable Faster and Smarter HPC and AI Applications
    April 25, 2023
    by RobFarber
  • Creating Healthy AI Utility Function: ChatGPT Example – Part II
    April 23, 2023
    by Bill Schmarzo
  • 5 Crucial Steps To Starting A Successful Hi-Tech Startup: From Idea To Promotion
    April 21, 2023
    by Evan Rogen
  • Businesses Need a Data-Driven Approach to Net-Zero Targets
    April 21, 2023
    by Jane Marsh
  • An Overview of the Role Data Plays in AI Development
    April 20, 2023
    by Roger Brown
  • DSC Weekly 18 April 2023 – ChatGPT, The Overconfident Artist
    April 18, 2023
    by Scott Thompson
  • Data-Driven Procurement Strategies Best Practices
    April 18, 2023
    by Karen Bonifacio

Is it Still Worth Getting a Machine Learning Degree?

Given the current economy, with large companies laying off machine learning employees in droves, one may wonder if spending 4 years and over $80k in education is worth it. How long will it take to get a job when competing with hundreds of candidates for the few listed positions? What salary can I expect?

These days, many machine learning positions in the US offer salaries well under $100k per year, especially for beginners. Many still offer over $200k, but usually require specialized experience, and typically not what you would learn in school. Engineers are in higher demand than scientists, further casting doubts on the value of a PhD. There are multiple aspects and ways to look at this.

Type of Degree

People sometimes get a degree in a field that they love, not to get a job. If you want to climb Mount Everest, you need a significant amount of training. You may spend a lot of money to achieve your goal – your passion – and may not earn any money in return. The same is true with PhDs. Everyone knows that you are extremely unlikely to get a decent salary and tenure in Academia with a PhD, no matter how great you are. Yet the degree is designed exactly for that purpose, and it is still offered because of the demand. Indeed, it may hurt you in the job market to have a PhD and some applicants don’t include it in their resume. People do it for other reasons: perceived prestige, potential credibility and recognition (if you write books), and passion for research being three of them.

But what about a practical master’s degree, one involving internship, real programming, the creation of a portfolio, and everything to impress a potential employer? The timing is not great for those who just completed the degree, but excellent for those just enrolling. First, given the scarcity of jobs, spending time studying is more productive than looking for elusive jobs these days. But the economy for machine learning professionals will bounce back. I see more and more hiring managers looking to hire, even though you only hear about the layoffs. Some of the laid-off people do not have a real degree, but certificates or a data camp. While there are good providers, there are also many bad ones (those usually don’t even tell you who the instructors are). Hiring managers are more likely to require a real degree these days.

What Companies Usually Do

Some employees were overpaid or unable to sell the value that they produced to their boss or the decision makers. Companies never lower your pay to adjust to the new market: they eliminate jobs and at the same time recruit new (less expensive) hires and increase the pay of their best employees. It happens in cycles, now and then, each time causing or because of a recession, and they do it all at once, as in a game of musical chairs. Those with a good education and reasonable salary expectations, trying to get into this market, will eventually win.

Impact of ChatGPT

Will AI replace workers in the future, including those who develop AI? While there is no doubt that AI is capable of many things, there is also a lot of hype surrounding it. I can imagine many time-consuming and boring tasks like debugging and data cleaning being more and more automated, but we are still far from there. Even a basic problem such as scoring news from fake to real still needs to be solved. It can be solved without learning algorithms (more on this in a future article) but for now, many of the people working on this – including machine learning engineers – are not great at training these models with their own brain, let alone with an algorithm. Some companies like fake news because it’s click-bait and thus a source of revenue, but they will face competition from companies addressing the issue.

Hype, and What is Here to Stay

Talking about hype, vendors try to sell expensive products such as GANs or deep networks, and they find buyers because of the hype. In the classes that I offer, I had to include them as well, because that’s what many participants – themselves machine learning professionals working for insurance, health, or finance companies – want to hear about. In the end, the buyers are going to recognize that for their relatively simple needs, maybe cheaper solutions do just as well. The market is adjusting accordingly.

However, a bigger long-term issue impacting all jobs, not just data science, is the decline in population growth, eventually turning negative. This will open more and more positions in health care and related sectors serving an older population. Addressing climate issues is unlikely to lose steam any time soon.

About the Author

Vincent Granville is a pioneering data scientist and machine learning expert, founder of MLTechniques.com and co-founder of Data Science Central (acquired by TechTarget in 2020), former VC-funded executive, author and patent owner. Vincent’s past corporate experience includes Visa, Wells Fargo, eBay, NBC, Microsoft, CNET, InfoSpace. Vincent is also a former post-doc at Cambridge University, and the National Institute of Statistical Sciences (NISS).

Vincent published in Journal of Number Theory, Journal of the Royal Statistical Society (Series B), and IEEE Transactions on Pattern Analysis and Machine Intelligence. He is also the author of “Intuitive Machine Learning and Explainable AI”, available here. He lives in Washington state, and enjoys doing research on stochastic processes, dynamical systems, experimental math and probabilistic number theory.

Introducing TPU v4: Googles Cutting Edge Supercomputer for Large Language Models

Image by Editor

Machine learning and artificial intelligence seem to be growing at such a rapid rate that some of us can’t even keep up. As these machine learning models get better at what they do, they will require better infrastructure and hardware support to keep them going. The advancement of machine learning leads directly to a need for scaling computing performance. So let’s learn more about TPU v4.

What is TPU v4?

TPU stands for Tensor Processing Unit. TPUs were invented by Google and designed to handle the high computational needs of machine learning and deep learning applications.

When Google designed the TPU, they created it as a domain-specific architecture: a matrix processor that specializes in neural network workloads rather than a general-purpose processor. This addresses the memory access bottleneck that slows down GPUs and CPUs and causes them to use more processing power.

So there’s been TPU v2, v3, and now v4. So what’s v2 all about?

The TPU v2 chip contains two TensorCores, four MXUs, a vector unit, and a scalar unit. See the image below:

Image by Google

Optical Circuit Switches (OCSes)

TPU v4 is the first supercomputer to deploy reconfigurable optical circuit switches (OCSes), which dynamically reconfigure the interconnect topology. They reduce the congestion found in previous networks because data is transmitted as it occurs. OCSes improve scalability, availability, modularity, deployment, security, power, performance, and more.

OCSes and other optical components in TPU v4 make up less than 5% of TPU v4’s system cost and less than 5% of system power.

SparseCores

TPU v4 is also the first supercomputer with hardware support for embeddings. Neural networks train well on dense vectors, and embeddings are the most effective way to transform categorical feature values into dense vectors. TPU v4 includes third-generation SparseCores, which are dataflow processors that accelerate machine learning models that rely on embeddings.

For example, an embedding function can translate an English word, which belongs to a large categorical space, into a smaller dense space where each word is represented by a 100-dimensional vector. Embeddings are a key element of Deep Learning Recommendation Models (DLRMs), which are part of our everyday lives and are used in advertising, search ranking, YouTube, and more.
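The idea of an embedding can be sketched as a simple lookup table. This is a minimal illustration only: the vocabulary, the random initial values, and the names `embedding_table` and `embed` are assumptions made for the example, and in a real model the table entries are learned during training rather than drawn at random.

```python
import random

random.seed(0)

EMBEDDING_DIM = 100  # size of the dense vector for each word
# Hypothetical vocabulary of 1,000 categories (a real one is far larger)
vocabulary = [f"word{i}" for i in range(1000)]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

# One row of EMBEDDING_DIM values per vocabulary entry
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in vocabulary
]

def embed(word):
    """Map a categorical value (a word) to its dense vector by table lookup."""
    return embedding_table[word_to_index[word]]

vector = embed("word42")
print(len(vocabulary), "categories ->", len(vector), "dense values per word")
```

A one-hot encoding of the same word would need one slot per vocabulary entry, so the dense representation stays compact as the vocabulary grows. The lookup itself is a sparse memory access, which is the kind of operation the article says SparseCores accelerate.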

The image below shows the performance of recommendation models on CPUs, TPU v3, TPU v4 (using SparseCore), and TPU v4 with embeddings in CPU memory (not using SparseCore). As you can see the TPU v4 SparseCore is 3X faster than TPU v3 on recommendation models, and 5–30X faster than systems using CPUs.

Image by Google

Performance

TPU v4 outperforms TPU v3 by 2.1x and improves performance/Watt by 2.7x. The TPU v4 supercomputer is 4x larger at 4096 chips, making it roughly 10x faster overall. The implementation and flexibility of OCS are also a major help for large language models.

The performance and availability of TPU v4 supercomputers make them a workhorse for large language models such as LaMDA, MUM, and PaLM. PaLM, a 540B-parameter model, was trained on TPU v4 for over 50 days and sustained a remarkable 57.8% of peak hardware floating point performance.

TPU v4 also has multidimensional model-partitioning techniques that enable low-latency, high-throughput inference for large language models.

Energy Efficiency

With more laws and regulations being put in place for companies globally to do better to improve their overall energy efficiency, TPU v4 is doing a decent job. TPU v4s inside Google Cloud use ~2-6x less energy and produce ~20x less CO2e than contemporary DSAs in typical on-premise data centres.

Machine Learning Workloads Changes

So now you know a bit more about TPU v4, you might be wondering how fast machine learning workloads actually change with TPU v4.

The below table shows the workload by deep neural network model type and the % TPUs used. Over 90% of training at Google is on TPUs, and this table shows the fast change in production workloads at Google.

There is a drop in recurrent neural networks (RNNs). This is because RNNs process the input sequentially rather than all at once, in contrast to transformers, which are known for natural language translation and text summarization.

To learn more about TPU v4 capabilities, read the research paper TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings.

Wrapping it up

Last year, TPU v4 supercomputers became available to AI researchers and developers at Google Cloud’s ML cluster in Oklahoma. The authors of the paper claim that TPU v4 is faster and uses less power than the Nvidia A100. However, they have not been able to compare TPU v4 to the newer Nvidia H100 GPUs because of the H100’s limited availability and its 4nm architecture, whereas TPU v4 has a 7nm architecture.

What do you think TPU v4 is capable of, its limitations, and is it better than Nvidia A100 GPU?
Nisha Arya is a Data Scientist, Freelance Technical Writer and Community Manager at KDnuggets. She is particularly interested in providing Data Science career advice or tutorials and theory based knowledge around Data Science. She also wishes to explore the different ways Artificial Intelligence is/can benefit the longevity of human life. A keen learner, seeking to broaden her tech knowledge and writing skills, whilst helping guide others.

Automated Machine Learning with Python: A Case Study

Image by Author

In today’s world, organizations want to use machine learning to analyze the data they generate daily from their users. With machine learning or deep learning algorithms, they can analyze that data and then make predictions on new data in the production environment. But following this process raises problems: building and training machine learning models is time-consuming and requires expertise in domains like programming, statistics, and data science.

To overcome such challenges, Automated Machine Learning (AutoML) has emerged as one of the most popular solutions, automating many aspects of the machine learning pipeline. In this article, we will discuss AutoML with Python through a real-life case study on the prediction of heart disease.

Case Study: Prediction of Heart Disease

Heart-related problems are the leading cause of death worldwide. One way to reduce their impact is to detect the disease early with automated methods, which saves time, and then take preventive measures. With this problem in mind, we will explore a dataset of medical patient records to build a machine learning model that predicts the likelihood of a patient having heart disease. This type of solution can be applied in hospitals so that doctors can provide treatment as soon as possible.

The complete model pipeline we followed in this case study is shown below.

Fig.1 AutoML Model Pipeline | Image by Author

Implementation

Step-1: Before starting the implementation, let's import the required libraries: NumPy for matrix manipulation, Pandas for data analysis, Matplotlib for data visualization, and H2O for AutoML.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import h2o
from h2o.automl import H2OAutoML

Step-2: After importing the required libraries, we will load our dataset into a Pandas data frame, which is a convenient and efficient structure for storing and manipulating tabular data.

Further, we can perform Data preprocessing to prepare the data for further modelling and generalization. To download the dataset which we are using here, you can easily refer to the link.

# Initialize H2O
h2o.init()

# Load the dataset
data = pd.read_csv("heart_disease.csv")

# Convert the Pandas data frame to an H2OFrame
hf = h2o.H2OFrame(data)

Step-3: After preparing the data for the machine learning model, we will use one of the famous automated machine learning libraries called H2O.ai, which helps us create and train the model.

Image by H2O.ai

The main benefit of this platform is that it provides a high-level API through which we can automate many aspects of the pipeline, including feature engineering, model selection, data cleaning, and hyperparameter tuning, which drastically reduces the time required to train a machine learning model for a data science project.

Step-4: Now, to build the model, we will use the API of the H2O.ai library. To use it, we have to specify the target variable and make sure its type matches the problem: H2O treats a categorical (factor) target as a classification problem and a numeric target as a regression problem. The library then automatically chooses the best model for the given problem statement from algorithms such as gradient boosting machines, random forests, deep neural networks, and stacked ensembles.

# Specify the target variable; for binary classification it must be a factor (categorical)
y = "target"
hf[y] = hf[y].asfactor()

# Split the data into training (70%), testing (15%), and validation (remaining 15%) sets
train, test, valid = hf.split_frame(ratios=[0.7, 0.15], seed=1)

Step-5: After selecting the best model from a set of algorithms, the most critical task is fine-tuning the model’s hyperparameters. This tuning process involves techniques such as grid search with cross-validation, which help find the best set of hyperparameters for the given problem.

# Run AutoML
aml = H2OAutoML(max_models=10, seed=1, balance_classes=True)
aml.train(y=y, training_frame=train, validation_frame=valid)

# View the leaderboard
lb = aml.leaderboard
print(lb)

# Get the best model
best_model = aml.leader

Step-6: Now, the final task is to check the model’s performance, using evaluation metrics such as Confusion matrix, Precision, recall, etc., for classification problems and MSE, MAE, RMSE, and R-square, for regression models so that we can find some inference of our model’s working in the production environment.

# Make predictions on the test data
preds = best_model.predict(test)

# Convert the predictions to a Pandas data frame
preds_df = preds.as_data_frame()

# Evaluate the model on the test data; each metric is returned as [threshold, value] pairs
perf = best_model.model_performance(test)
print("Accuracy:", perf.accuracy())
print("Precision:", perf.precision())
print("Recall:", perf.recall())
print("F1-score:", perf.F1())
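To show what the classification metrics mentioned in Step-6 measure, here is a small library-independent sketch that derives them from confusion-matrix counts. The labels and predictions are made-up values, and `classification_metrics` is a hypothetical helper, not an H2O function.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up ground truth and predictions for eight patients
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In a medical setting such as heart disease prediction, recall deserves particular attention, because a false negative means a patient with the disease goes undetected.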

Step-7: Finally, we will plot the ROC curve, which shows the true positive rate (the share of actual positives that the model correctly predicts as positive, equal to one minus the false negative rate) against the false positive rate (the share of actual negatives that the model wrongly predicts as positive). We will also print the confusion matrix, which completes our model’s prediction and evaluation on the test data. Then we will shut down H2O.

# Plot the ROC curve on the test data
perf = best_model.model_performance(test)
perf.plot(type="roc")

# Print the confusion matrix
print(perf.confusion_matrix())

# Shut down the H2O cluster
h2o.cluster().shutdown()
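For intuition about what the ROC curve from Step-7 contains, the following library-independent sketch computes its points by sweeping a decision threshold over predicted scores and then estimates the area under the curve (AUC) with the trapezoidal rule. The scores and labels are made-up, and `roc_points` and `auc` are hypothetical helper names.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up labels (1 = disease) and predicted probabilities
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
print(points)
print("AUC:", auc(points))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of positives above negatives.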

You can access the notebook of the mentioned code from here.

Conclusion

To conclude, we have explored one of the most popular platforms for automating machine learning and data science tasks, which lets us create and train machine learning models easily using the Python programming language. We have also covered a well-known case study, heart disease prediction, to show how to use such platforms effectively. With such platforms, machine learning pipelines can be optimized easily, saving engineers’ time and reducing system latency and the use of resources such as GPU and CPU cores, while remaining accessible to a large audience.
Aryan Garg is a B.Tech. Electrical Engineering student, currently in the final year of his undergrad. His interest lies in the field of Web Development and Machine Learning. He has pursued this interest and is eager to work more in these directions.

How ChatGPT Works: The Model Behind The Bot

Photo by Matheus Bertelli

This gentle introduction to the machine learning models that power ChatGPT will start with Large Language Models, dive into the revolutionary self-attention mechanism that enabled GPT-3 to be trained, and then burrow into Reinforcement Learning From Human Feedback, the novel technique that made ChatGPT exceptional.

Large Language Models

ChatGPT is an extrapolation of a class of machine learning Natural Language Processing models known as Large Language Models (LLMs). LLMs digest huge quantities of text data and infer relationships between words within the text. These models have grown over the last few years as we’ve seen advancements in computational power. LLMs increase their capability as the size of their input datasets and parameter space increase.

The most basic training of language models involves predicting a word in a sequence of words. Most commonly, this is observed as either next-token prediction or masked language modeling.

Arbitrary example of next-token-prediction and masked-language-modeling generated by the author.
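Separately from the author's figure, next-token prediction can be sketched with a tiny bigram model that counts which word most often follows another. This is a deliberately simplified stand-in, not how an LSTM or GPT works internally; the corpus and the names `train_bigram` and `predict_next` are assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for every word, how often each other word follows it."""
    following = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for current, nxt in zip(tokens, tokens[1:]):
            following[current][nxt] += 1
    return following

def predict_next(model, word):
    """Return the statistically most probable next word, or None if the word is unseen."""
    candidates = model.get(word)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

# Hypothetical toy corpus
corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog sat on the rug",
]
model = train_bigram(corpus)
print(predict_next(model, "the"))  # most frequent follower of "the"
print(predict_next(model, "sat"))  # most frequent follower of "sat"
```

Because the prediction depends only on the single preceding word, this sketch shares, in an extreme form, the weakness of being unable to weigh distant context.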

In this basic sequencing technique, often deployed through a Long Short-Term Memory (LSTM) model, the model fills in the blank with the most statistically probable word given the surrounding context. There are two major limitations with this sequential modeling structure.

  1. The model is unable to value some of the surrounding words more than others. In the above example, while ‘reading’ may most often associate with ‘hates’, in the dataset ‘Jacob’ may be such an avid reader that the model should give more weight to ‘Jacob’ than to ‘reading’ and choose ‘loves’ instead of ‘hates’.
  2. The input data is processed individually and sequentially rather than as a whole corpus. This means that when an LSTM is trained, the window of context is fixed, extending only a few steps in the sequence beyond an individual input. This limits the complexity of the relationships between words and the meanings that can be derived.

In response to these issues, in 2017 a team at Google Brain introduced transformers. Unlike LSTMs, transformers can process all input data simultaneously. Using a self-attention mechanism, the model can give varying weight to different parts of the input data in relation to any position of the language sequence. This feature enabled massive improvements in infusing meaning into LLMs and made it possible to process significantly larger datasets.

GPT and Self-Attention

Generative Pre-trained Transformer (GPT) models were first launched in 2018 by OpenAI as GPT-1. The models continued to evolve with GPT-2 in 2019, GPT-3 in 2020, and most recently InstructGPT and ChatGPT in 2022. Prior to integrating human feedback into the system, the greatest advancement in the GPT model evolution was driven by achievements in computational efficiency, which enabled GPT-3 to be trained on significantly more data than GPT-2, giving it a more diverse knowledge base and the capability to perform a wider range of tasks.

Comparison of GPT-2 (left) and GPT-3 (right). Generated by the author.

All GPT models leverage the transformer architecture. The original transformer pairs an encoder that processes the input sequence with a decoder that generates the output sequence; GPT models use only the decoder stack. The decoder has a multi-head self-attention mechanism that allows the model to differentially weight parts of the sequence to infer meaning and context. This attention is masked so that each token attends only to the tokens before it, which matches the next-token-prediction objective the model is trained on.

The self-attention mechanism that drives GPT works by converting tokens (pieces of text, which can be a word, sentence, or other grouping of text) into vectors that represent the importance of the token in the input sequence. To do this, the model:

  1. Creates a query, key, and value vector for each token in the input sequence.
  2. Calculates the similarity between the query vector from step one and the key vector of every other token by taking the dot product of the two vectors.
  3. Generates normalized weights by feeding the output of step 2 into a softmax function.
  4. Generates a final vector, representing the importance of the token within the sequence by multiplying the weights generated in step 3 by the value vectors of each token.

The ‘multi-head’ attention mechanism that GPT uses is an evolution of self-attention. Rather than performing steps 1–4 once, the model runs the mechanism several times in parallel, each time with a new linear projection of the query, key, and value vectors. By expanding self-attention in this way, the model is capable of grasping sub-meanings and more complex relationships within the input data.
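
Steps 1–4 and the multi-head extension can be sketched in plain Python. This is a rough illustration under several assumptions rather than GPT’s actual implementation: the embeddings and projection matrices are made-up toy values (real models learn them), each query is compared against every key including its own, the scores are divided by the square root of the key dimension as in the original transformer paper, and the causal masking and final output projection used in real models are left out.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def project(matrix, vector):
    """Linear projection: multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def self_attention(embeddings, w_q, w_k, w_v):
    """Single-head self-attention over a list of token embeddings."""
    # Step 1: a query, key, and value vector for each token.
    queries = [project(w_q, e) for e in embeddings]
    keys = [project(w_k, e) for e in embeddings]
    values = [project(w_v, e) for e in embeddings]
    scale = math.sqrt(len(keys[0]))  # scaling factor: an assumption, see above

    outputs, all_weights = [], []
    for q in queries:
        # Step 2: dot product of this query with every key.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        # Step 3: softmax turns the scores into normalized weights.
        weights = softmax(scores)
        # Step 4: weighted sum of the value vectors.
        dim = len(values[0])
        outputs.append([sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)])
        all_weights.append(weights)
    return outputs, all_weights

def multi_head_attention(embeddings, heads):
    """Run one attention pass per head and concatenate each token's outputs."""
    per_head = [self_attention(embeddings, *head)[0] for head in heads]
    return [sum((head_out[t] for head_out in per_head), []) for t in range(len(embeddings))]

# Hypothetical 2-dimensional embeddings for a three-token sequence.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]

single_out, single_weights = self_attention(embeddings, identity, identity, identity)
# Two heads, each with its own (query, key, value) projection matrices.
multi_out = multi_head_attention(
    embeddings, [(identity, identity, identity), (swap, identity, identity)]
)
```

Each row of `single_weights` sums to 1 and shows how much one token attends to every token in the sequence. `multi_out` holds one concatenated vector per token, twice as long as a single head’s output because two heads were used.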

Screenshot from ChatGPT generated by the author.

Although GPT-3 introduced remarkable advancements in natural language processing, it is limited in its ability to align with user intentions. For example, GPT-3 may produce outputs that

  • Lack helpfulness, meaning they do not follow the user’s explicit instructions.
  • Contain hallucinations that reflect non-existing or incorrect facts.
  • Lack interpretability, making it difficult for humans to understand how the model arrived at a particular decision or prediction.
  • Include toxic or biased content that is harmful or offensive and spreads misinformation.

Innovative training methodologies were introduced in ChatGPT to counteract some of these inherent issues of standard LLMs.

ChatGPT

ChatGPT is a spinoff of InstructGPT, which introduced a novel approach to incorporating human feedback into the training process to better align the model outputs with user intent. Reinforcement Learning from Human Feedback (RLHF) is described in depth in OpenAI’s 2022 paper Training language models to follow instructions with human feedback and is simplified below.

Step 1: Supervised Fine Tuning (SFT) Model

The first development involved fine-tuning the GPT-3 model by hiring 40 contractors to create a supervised training dataset, in which each input has a known output for the model to learn from. Inputs, or prompts, were collected from actual user entries into the OpenAI API. The labelers then wrote an appropriate response to each prompt, thus creating a known output for each input. The GPT-3 model was then fine-tuned using this new, supervised dataset to create GPT-3.5, also called the SFT model.

In order to maximize diversity in the prompts dataset, only 200 prompts could come from any given user ID and any prompts that shared long common prefixes were removed. Finally, all prompts containing personally identifiable information (PII) were removed.
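
The filtering rules above can be sketched as a small function. This is only an illustration: the function name, the `(user_id, prompt)` record layout, the prefix length, and the choice to keep the first prompt among those sharing a prefix are all assumptions, and PII removal is left out because it requires a dedicated detector.

```python
from collections import defaultdict

def filter_prompts(records, max_per_user=200, prefix_len=40):
    """Keep a diverse subset of (user_id, prompt) records.

    prefix_len is a hypothetical threshold for a "long common prefix".
    """
    seen_prefixes = set()
    kept_per_user = defaultdict(int)
    kept = []
    for user_id, prompt in records:
        prefix = prompt[:prefix_len]
        if prefix in seen_prefixes:
            continue  # shares a long prefix with a prompt already kept
        if kept_per_user[user_id] >= max_per_user:
            continue  # this user has reached the per-user cap
        seen_prefixes.add(prefix)
        kept_per_user[user_id] += 1
        kept.append((user_id, prompt))
    return kept

# Made-up records, with small limits so the effect is visible.
records = [
    ("u1", "Write a poem about cats and dogs"),
    ("u2", "Write a poem about cats and birds"),  # same first 20 characters
    ("u1", "Summarize this article"),
    ("u1", "Translate this sentence"),            # u1 is already at the cap
]
kept = filter_prompts(records, max_per_user=2, prefix_len=20)
```

With these toy limits, the second record is dropped as a near-duplicate and the fourth is dropped because user `u1` has already contributed two prompts.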

After aggregating prompts from the OpenAI API, labelers were also asked to create sample prompts to fill out categories in which there was only minimal real sample data. The categories of interest included

  • Plain prompts: any arbitrary ask.
  • Few-shot prompts: instructions that contain multiple query/response pairs.
  • User-based prompts: correspond to a specific use-case that was requested for the OpenAI API.

When generating responses, labelers were asked to do their best to infer what the instruction from the user was. The paper describes the three main ways that prompts request information.

  1. Direct: “Tell me about…”
  2. Few-shot: Given these two examples of a story, write another story about the same topic.
  3. Continuation: Given the start of a story, finish it.

Combining the prompts from the OpenAI API with those hand-written by labelers resulted in 13,000 input/output samples to leverage for the supervised model.

Image (left) inserted from Training language models to follow instructions with human feedback OpenAI et al., 2022 https://arxiv.org/pdf/2203.02155.pdf. Additional context added in red (right) by the author.

Step 2: Reward Model

After the SFT model is trained in step 1, the model generates better aligned responses to user prompts. The next refinement comes in the form of training a reward model, in which the model input is a series of prompts and responses and the output is a scalar value, called a reward. The reward model is required in order to leverage Reinforcement Learning, in which a model learns to produce outputs that maximize its reward (see step 3).

To train the reward model, labelers are presented with 4 to 9 SFT model outputs for a single input prompt. They are asked to rank these outputs from best to worst, creating combinations of output ranking as follows.

Example of response ranking combinations. Generated by the author.

Including each combination in the model as a separate datapoint led to overfitting (failure to extrapolate beyond seen data). To solve this, the model was built leveraging each group of rankings as a single batch datapoint.
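
As a rough sketch of how one group of rankings becomes a single training signal, the snippet below enumerates every (preferred, rejected) pair from a best-to-worst ranking and averages a pairwise loss over the whole group. The loss used here, the negative log-sigmoid of the reward difference, follows the formulation in the InstructGPT paper; the function names and reward values are made up for illustration.

```python
import math
from itertools import combinations

def ranking_pairs(ranked_outputs):
    """All (preferred, rejected) pairs from a best-to-worst ranking."""
    return list(combinations(ranked_outputs, 2))

def pairwise_ranking_loss(rewards_best_to_worst):
    """Average -log(sigmoid(r_preferred - r_rejected)) over all pairs of one prompt."""
    pairs = list(combinations(rewards_best_to_worst, 2))
    total = 0.0
    for r_preferred, r_rejected in pairs:
        # -log(sigmoid(d)) is the same as log(1 + exp(-d)).
        total += math.log(1.0 + math.exp(-(r_preferred - r_rejected)))
    return total / len(pairs)

# Four ranked outputs give 6 comparisons; nine give 36.
pairs_for_four = ranking_pairs(["A", "B", "C", "D"])

# Hypothetical reward-model scores, listed in the labeler's best-to-worst order.
agreeing_loss = pairwise_ranking_loss([2.0, 1.0, 0.0])     # scores match the ranking
disagreeing_loss = pairwise_ranking_loss([0.0, 1.0, 2.0])  # scores contradict it
```

The loss is small when the reward model scores the outputs in the same order as the labeler and large when it contradicts the ranking. Averaging over all pairs from one prompt treats the group as a single datapoint.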

Image (left) inserted from Training language models to follow instructions with human feedback OpenAI et al., 2022 https://arxiv.org/pdf/2203.02155.pdf. Additional context added in red (right) by the author.

Step 3: Reinforcement Learning Model

In the final stage, the model is presented with a random prompt and returns a response. The response is generated using the ‘policy’, which starts out as the SFT model from step 1. The policy represents a strategy that the machine has learned to use to achieve its goal; in this case, maximizing its reward. Based on the reward model developed in step 2, a scalar reward value is then determined for the prompt and response pair. The reward then feeds back into the model to evolve the policy.

In 2017, Schulman et al. introduced Proximal Policy Optimization (PPO), the methodology that is used in updating the model’s policy as each response is generated. PPO incorporates a per-token Kullback–Leibler (KL) penalty from the SFT model. The KL divergence measures how far apart two probability distributions are, so the penalty grows as the policy drifts away from the SFT model. In this case, using a KL penalty limits how far the responses can move from the SFT model outputs trained in step 1, to avoid over-optimizing the reward model and deviating too drastically from the human intention dataset.
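
The effect of the KL penalty on the reward can be sketched as follows. This is a simplified illustration, not the full PPO update: the function name, the probabilities, and the value of `beta` are assumptions, and the penalty is computed only from the log-probabilities of the tokens that were actually sampled.

```python
import math

def kl_penalized_reward(reward, policy_logprobs, sft_logprobs, beta=0.02):
    """Subtract a per-token KL-style penalty from the reward model's score.

    policy_logprobs / sft_logprobs: log-probability each model assigns to
    every token of the sampled response. beta is a hypothetical coefficient.
    """
    penalty = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    return reward - beta * penalty

# Made-up per-token probabilities for a two-token response.
policy_logprobs = [math.log(0.9), math.log(0.8)]  # current policy
sft_logprobs = [math.log(0.5), math.log(0.4)]     # SFT model from step 1

unchanged = kl_penalized_reward(1.0, sft_logprobs, sft_logprobs)
penalized = kl_penalized_reward(1.0, policy_logprobs, sft_logprobs)
```

When the policy assigns the same probabilities as the SFT model, the reward passes through unchanged. The more the policy’s probabilities for its own output exceed the SFT model’s, the more reward is subtracted.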

Image (left) inserted from Training language models to follow instructions with human feedback OpenAI et al., 2022 https://arxiv.org/pdf/2203.02155.pdf. Additional context added in red (right) by the author.

Steps 2 and 3 of the process can be iterated repeatedly, though in practice this has not been done extensively.

Screenshot from ChatGPT generated by the author.

Evaluation of the Model

Evaluation of the model is performed by setting aside a test set during training that the model has not seen. On the test set, a series of evaluations are conducted to determine if the model is better aligned than its predecessor, GPT-3.

Helpfulness: the model’s ability to infer and follow user instructions. Labelers preferred outputs from InstructGPT over GPT-3 85 ± 3% of the time.

Truthfulness: the model’s tendency for hallucinations. The PPO model produced outputs that showed minor increases in truthfulness and informativeness when assessed using the TruthfulQA dataset.

Harmlessness: the model’s ability to avoid inappropriate, derogatory, and denigrating content. Harmlessness was tested using the RealToxicityPrompts dataset. The test was performed under three conditions.

  1. Instructed to provide respectful responses: resulted in a significant decrease in toxic responses.
  2. Instructed to provide responses, without any setting for respectfulness: no significant change in toxicity.
  3. Instructed to provide a toxic response: responses were in fact significantly more toxic than those of the GPT-3 model.

For more information on the methodologies used in creating ChatGPT and InstructGPT, read the original paper published by OpenAI, Training language models to follow instructions with human feedback, 2022 https://arxiv.org/pdf/2203.02155.pdf.

Screenshot from ChatGPT generated by the author.

Happy learning!

Sources

  1. https://openai.com/blog/chatgpt/
  2. https://arxiv.org/pdf/2203.02155.pdf
  3. https://deepai.org/machine-learning-glossary-and-terms/softmax-layer
  4. https://www.assemblyai.com/blog/how-chatgpt-actually-works/
  5. https://towardsdatascience.com/proximal-policy-optimization-ppo-explained-abed1952457b

Molly Ruby is a Data Scientist at Mars and a content writer.

Original. Reposted with permission.

10 Websites to Get Amazing Data for Data Science Projects

Image by Author

“How much can anyone really care about sepal length?” my friend complained to me over coffee a few days ago. She was referring to the built-in `iris` dataset in R, which first debuted way back in 1936. “Why do college professors try to teach us data science with crappy, boring, pointless data when there’s so much great data out there for data science projects?”

She’s right. It’s really tough to motivate yourself to learn data science, or do data science projects when your data is boring or meaningless to you. I know I struggled to motivate myself to learn data science until I found some good crunchy data that interested me.

In this article, I’m going to break down 10 amazing websites where you can grab some really awesome data for data science projects. The purpose will be to showcase a variety of data that might appeal to you. Ultimately, these websites should help you find data you care about, do a cool data science project, and use that to get a job.

How did I Vet these Data Sources?

If you see a website in this article, it’s because the data it contains is:

  • Freely available. You won’t have to pay for it.
  • Community-oriented. It’s not going to be just a file; there will be some commentary and explanation around it.
  • Cool. It’s something that someone, somewhere will care about. Maybe you!
  • Clean-ish. You’ll get to practice the fun part of data science — analyzing, visualizing, sharing, and so on.
  • Language-agnostic. You can dig into these with Python, R, SQL, or any other language you like.

10 Websites to Get Awesome Data for Your Data Science Projects

Let’s dig into the best websites to find data that you’ll actually care about and want to explore using data science.

  • Google Dataset Search: Super broad, varying quality
  • Kaggle: More limited, but lots of context and community
  • KDNuggets: Specific for AI, ML, data science
  • Government websites: Wide variety, resources to learn
  • Pudding.cool: Pop culture, essays
  • 538: Sports, politics, clean data
  • Tidy Tuesdays: Messy data, great community
  • GitHub: Huge amount of searchable data with commentary, variable quality
  • Buzzfeed: Pop culture, essays, rigorous science
  • Awesome Public Datasets: Wide variety, only datasets, no commentary

1. Google’s Dataset Search

I’m cheating a little bit, because this isn’t really a website for datasets, but rather a search engine for data sets. But it’s too good not to include.

Google’s Dataset Search is just like Google but for data sets. You type in your query, and Google returns as many datasets as it has on that subject.

For example, searching “cats” brings me over one hundred datasets, including a dataset containing over 9,000 images of cats.

Source: Google Dataset Search

What I love about this website:

  • It’s super versatile. You will almost certainly find something you care about.
  • It’s instantly applicable. This website includes other papers that have used this dataset, so you can see what interesting things other people have done with the data already.
  • You can toggle to only include free datasets.
  • It pulls out the context for you, so you get a bit of an explanation of what this dataset is and why it was collected.

It’s a great place to start.

2. Kaggle

Kaggle’s Datasets is also a search engine, but it’s both more limited and more focused.

It’s more limited because it only contains datasets that people have published with Kaggle. But it’s more focused because the datasets aren’t just whatever random set of numbers Google scraped. Kaggle is a home for data science competitions, so the datasets it collects are extremely relevant to data science.

This allows you to filter by your specific interest. For example, I can stumble across that same cat dataset if I search “cat” with the “computer vision” filter on.

Source: Kaggle Datasets

What I love about this website:

  • The community aspect is so strong. Clicking on that cat dataset shows six other folks asking questions about the dataset – and getting answers.
  • Lots of example projects. You can also see what other people have built or coded around that data.
  • You can go the other way around, too – check out their competitions and see if anything interests you, then use the accompanying dataset.

3. KDNuggets

This may come as a surprise to you, but KDNuggets curates a great set of datasets. These datasets are specifically for Data Science, Machine Learning, AI & Analytics.

Many of these aren’t KDNuggets exclusives, but it’s a good list to poke around in. It’s worth noting that when you sign up to be a KDNuggets email subscriber, you also get access to World Data AI which itself contains 3.5 billion datasets.

Source: KDnuggets Datasets

What I love about this website:

  • Data specific for data science. Many of these datasets are curated for other purposes, but these are all here specifically because they’re good for AI, machine learning, and data science.
  • Quick description of each set. Just a little bit of context to help you decide if it’s the right dataset for you.

4. Government websites

I could easily expand this list of websites to get datasets to about a million simply by individually listing each of the government websites I like to use to get data. I won’t. Instead, I’ll offer a small list here:

  • http://datasf.org/
  • http://data.gov.uk
  • https://www.usa.gov/About/developer-resources/1usagov.shtml
  • https://www.census.gov/data/datasets.html

Governments are constantly collecting data to do studies, and many of them publish that data online.

Source: The US Census Bureau

What I love about these websites:

  • The data is used for studies, so it’s typically pretty clean and well-organized.
  • The data has a real use case. Someone collected it for a real, government-related reason.
  • It’s typically very current data.
  • There are often some cool stories around the data.
  • Many governments have invested resources into showing you how to access or use the data, like the Census Bureau.

5. Pudding.cool

If you like your data to come with a heady dose of pop culture, look no further than Pudding.cool. This website looks at topics as varied as repetitive pop lyrics, women’s pockets, and how The Big Bang Theory gets censored by the Chinese government.

This is more of a digital magazine writing longform essays about culture, showing a lot of data alongside. I’m including it here because they tell awesome stories and share their data.

Source: The Pudding

What I love about this website:

  • Awesome, interesting data.
  • Shares data and scripts.
  • Lots of things you might care about IRL.

6. 538

Another essay-driven pop culture website with freely available data you can purloin. They focus more on sports and politics. It’s less data-driven, but I’m giving it a spot on this list because it still curates and shares datasets.

Source: FiveThirtyEight Data

What I love about this website:

  • Intelligent stories, backed up with data you can dig into.
  • The data is in clean CSV format.
  • The data sources are highly reliable.

7. Tidy Tuesdays

Now, the reality of the matter is that data often isn’t tidy at all. Tidy Tuesdays isn’t exactly a website with datasets per se, but it’s a weekly event and community with an emphasis on using data science to explore untidy data.

Every week, a new dataset drops. Participants are encouraged to share their cleaning techniques and visualizations with each other on GitHub and Twitter.

Source: TidyTuesday GitHub

What I love about this website:

  • The community is incredible. Every week you’ll learn something new.
  • It’s so convenient. Don’t go hunting for datasets. Get the weekly drop.
  • Challenging, untidy data. The data you get IRL will rarely be as sanitized as the other data on this list. Tidy Tuesdays helps you learn how to handle messy data.

8. GitHub

GitHub is the home of a lot of data. You can easily search, filter, and download data to play around with on your own. However, the data quality is highly variable. Because anyone can upload data, it’s not always in great condition.

However, I feel the benefits make up for that.

Source: GitHub Cat Data

What I love about this website:

  • You can filter by language, such as Python, JavaScript, or others.
  • There’s a ton of data.
  • Usually the data comes with some kind of commentary or code you can check out.

9. Buzzfeed

Buzzfeed doesn’t just do quizzes that comment on the human condition by asking you to build a salad. It may not be as well known for this, but Buzzfeed does a lot of quality data journalism.

It’s all open source, too.

Source: BuzzFeed News GitHub

What I love about this website:

  • Interesting data, pre-cleaned, and with well-written commentary in the form of articles attached.
  • Heavier topics. There’s an emphasis on more complex topics such as politics and health, but there’s a lot more, too.

10. Awesome Public Datasets

I’m ending this list with a pretty self-explanatory title: Awesome Public Datasets. This repo lives on GitHub and contains (mostly) free datasets to explore. They come from online datasets, user suggestions, and research papers.

Source: Awesome Public Datasets GitHub

What I love about this website:

  • There’s a Slack group you can join!
  • Huge variety in topics. Agriculture, finance, museums. You’re bound to find something that takes your fancy.
  • Well-curated. The datasets are high quality.

These Websites offer Amazing Data Science Datasets

Dig in. You’ll certainly find not just data you can get your feet wet with, but also community, inspiration, and code you can use to learn and grow as a data scientist.

With such a huge variety of data available to you, you should never feel like you’re settling for less interesting data. Always look for data that inspires you or makes you excited to investigate it. Hopefully this list gives you a few starting points to do just that.
Nate Rosidi is a data scientist and works in product strategy. He's also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Connect with him on Twitter: StrataScratch or LinkedIn.

FREE Ratio Analysis Template

Sponsored Post


Copy and paste as many columns of your own data as you like into the grey shaded cells of this template, and then click the "Ratio Analysis" button in the top right-hand corner of the worksheet. Follow the prompts to create your own chart visualizing "Ratio Analysis", "Growth Rate" and "Market Share" trends in your financial data. Great for Data Analysis Toolpak users.

You may need to do the following before using any templates:

  • Enable or disable macros in Microsoft 365 files
  • Unblock macros from downloaded files

Get the free Ratio Analysis Template here.

You might also try the FREE Simple Box Plot Graph and Summary Message Outlier and Anomaly Detection Template or FREE Outlier and Anomaly Detection Template. Or, automatically detect outliers, create a box & whisker plot graph, and receive a summary conclusion about dataset outliers with one button click using the Outlier Box Plot Graph Analysis Outlier and Anomaly Detection Template.
