Exploring Meta’s AI Endeavours: From Personas to Advantage+ & More

Meta, the parent company of Facebook and Instagram, is gearing up to unveil a fleet of AI-powered chatbots designed to add a sprinkle of innovation to your social media experience.

These chatbots, set to arrive on platforms like Instagram and Facebook in the coming months, are primed to bring a fresh twist to online interactions and engagements. Meta’s primary motivation is to make your time on these social media channels much more engaging and interactive, steering away from a dull and mundane experience.

These AI-driven companions are not just any ordinary chatbots but are equipped with distinct personalities that will leave an indelible mark on your conversations. For instance, imagine getting travel advice from a surfer-dude-style persona or engaging in a dialogue with a chatbot channeling the wisdom and wit of Abraham Lincoln himself. It’s all part of Meta’s plan to create digital buddies that feel a little more human.

According to reports, these chatbots, being referred to as “personas” within Meta, are all set to make their grand entrance into the social media scene this September.

“Over the longer term, we’ll focus on developing AI personas that can help people in a variety of ways,” Zuckerberg wrote in a Facebook post.

Instagram is also testing a new feature to label AI-generated content, aiming to increase transparency. Screenshots shared on the microblogging platform X by Alessandro Paluzzi reveal that Instagram will flag AI-created text, images, and videos. It’s uncertain whether the label will only apply to content produced using built-in AI tools or whether it can identify all AI-generated content.

The feature could be based on Meta’s Llama 2, an open-source AI model. This move is intended to make it clear when content on the platform has been generated by AI. Meta, along with several other tech companies, has pledged to watermark AI content, and it looks like that research is coming through.

This initiative follows Meta’s launch of Threads, a Twitter rival app that made quite a splash but faced a rather dramatic drop in users shortly after its much-anticipated release. However, despite the challenges, Meta has been raking in profits like a pro.

Advertising & Meta is all for it

Meta’s main revenue comes from advertising, and its straightforward integration of AI across its apps positions it well to capitalize on AI. Additionally, their advertising game has been strong, boasting a 34% increase in ad impressions across their suite of apps during the second quarter of 2023. But here’s the twist – the average price per ad has actually taken a 16% dip during the same period. Change is afoot, and Meta seems to be playing it smart.

Meta’s cooking up some new ad tools to give advertisers a boost with the help of AI. The company’s AI Sandbox project, being tested with a small group of advertisers, introduces features like Text Variation, which generates multiple ad text variations to optimize performance. Background Generation creates product image backgrounds from text inputs, and Image Outcropping adjusts visuals to fit different formats like Stories or Reels.

These tools aim to provide advertisers with more creative options and complement existing processes, utilizing the power of generative AI while offering more choices to consider. However, caution is needed as there can be limitations and occasional errors in the generated content.

Meta is also expanding its Advantage+ targeting for advertisers, adding new options to reach target audiences more effectively. Advertisers will soon be able to switch between manual and Advantage+ campaigns with a single click, and Catalog Ads for Advantage+ campaigns will support video elements. A Performance Comparisons report will provide insights into manual vs Advantage+ campaign performance, and additional manual inputs will guide the system for better audience targeting.

Meta is leveraging larger and more complex AI models within its ad system, enabling optimization across different surfaces (Feed, Story, Explore, and Reels), resulting in improved conversions and ad quality.

Beyond just upping the engagement game, these AI-powered pals have an additional trick up their sleeves – data collection. As you engage in conversations with these chatty companions, they’re silently gathering insights into your interests. This treasure trove of information could then be put to use by Meta to tailor content and ads to fit your preferences, creating a more personalized digital experience.

With all this talk about personas and engagement, let’s not forget the financial front. Meta’s Q2 2023 results have been nothing short of impressive. They’ve notched up a revenue of nearly $32 billion, a cool 11% jump from the same period last year. While expenses have climbed, operational income has also shown a steady rise of 12%, reaching around $9.4 billion. And let’s not forget about the net income that’s shot up by a whopping 16%. But it’s not just about the numbers; it’s about vision. Mark Zuckerberg, the CEO, has his sights set on the horizon, with exciting projects lined up, from Threads to Reels and much more.

As the digital realm evolves, Meta is harnessing the power of AI to bring a little more spark into your social media experience. These AI companions are set to reshape how we interact online, adding a dash of personality to our virtual conversations.

However, it’s interesting how Twitter’s all about getting rid of those pesky bots that clutter up the platform. They want a cleaner space for everyone to chat and share—or so they claim. Zuckerberg’s got his eye on bringing in some bots for ads on his platforms. Meanwhile, Sam Altman has a cool plan up his sleeve. He wants to tell real humans from those tricky bots using his Worldcoin orb. So, it’s like a bot banishing act versus a bot-friendly vibe, with Altman trying to sort them all out in the mix!


Why is Everyone Trying To Become NVIDIA

With AMD, Cerebras, and Intel each working ambitiously to take on chip leader NVIDIA, computing company Tenstorrent is emerging as the newest player to join the opposing force, and it is doing so by partnering with automobile and electronics companies.

Canada-headquartered Tenstorrent, a computing company that develops processors and AI-based deep learning processing units, received a $100 million investment from Hyundai and Samsung. Prior to this funding, Tenstorrent had secured $234.5 million in investments at a valuation of $1 billion, and it stands as one of the many companies aiming to contest NVIDIA. The company secured $30 million from Hyundai, $20 million from Kia, and the remaining $50 million came from Samsung’s Catalyst Fund along with additional contributions from investors such as Fidelity Ventures, Eclipse Ventures, Epiq Capital, and Maverick Capital, among others.

Tenstorrent is led by Jim Keller, a well-known figure in the semiconductor realm. Having played a pivotal role as lead architect of AMD’s K8 microarchitecture and participated in the design of processors such as the Athlon and Apple’s A4/A5, Keller has the experience and vision to put him at the forefront of building something that might give NVIDIA a run for its money.

Jim Keller has said that for anyone seeking to build a high-performance solution integrated with AI, NVIDIA will capture a significant portion of the product’s profit margin, around 60%. “The problem with the winner-take-all strategy is it generates an economic environment where people really want an alternative.”

Tenstorrent not only manufactures its own AI chips but also offers its intellectual property and other technologies to clients interested in creating their own AI chips.

The Golden Automotive Industry

In 2021, the worldwide market size for automotive chips amounted to $49.8 billion, with a projected growth to $121.3 billion by 2031. This expansion is expected to occur at a Compound Annual Growth Rate (CAGR) of 9.6% from 2022 to 2031. With a positive market on the horizon, chip makers can reap benefits from the same.

Hyundai formed a semiconductor development division in the previous year and announced intentions to integrate Tenstorrent’s technology into forthcoming vehicles under the Hyundai, Kia, and Genesis brands. With this investment, the parties look to develop “optimised but differentiated semiconductor technology” which will aid future AI technology development.

As one of the globe’s major semiconductor contract manufacturers, Samsung has an understandable interest in investing. The company has said that the funding will be directed towards expediting Tenstorrent’s product development, advancing the design and creation of AI chiplets, and enhancing its roadmap for ML software.

In May, Tenstorrent collaborated with LG Electronics. The partnership aims to develop chips to fuel consumer electronics such as TVs, automotive solutions and data centres. LG will initially adopt Tenstorrent’s AI chip blueprint for its own chip design and Tenstorrent said that they would look at some of the technology that LG has developed.

A Long Way Before Catching Up

Racing through the computing world, NVIDIA has also branched out to cater to the automobile industry. In collaboration with NVIDIA, Mercedes-Benz will work on creating intelligent cockpits and architectures to accommodate AI-driven driving capabilities. NVIDIA has also partnered with other automobile companies such as Jaguar Land Rover and Volvo.

NVIDIA is currently the undisputed leader, with an 80 to 95% share of the AI computing market. The company even touched a market capitalization of $1 trillion and is continuing to surpass its competitors.

Recently, NVIDIA’s leading H100 chip achieved its highest performance to date on a series of MLPerf training benchmarks. (MLPerf benchmarks assess hardware capabilities by measuring the time required to complete specific workloads.) Partnering with CoreWeave and Inflection AI, the GPU established new records across various parameters in a recent test. The trial employed a cluster of 3,584 H100 GPUs hosted on CoreWeave’s platform, interconnected using InfiniBand technology, enabling exceptional performance both per GPU and at scale.

Tenstorrent and Jim Keller’s ambitious plan to take on NVIDIA may be far-fetched, but charting a path through partnerships with strategic players may bring the company closer to its rival. For now, NVIDIA’s smooth trajectory puts it in another league altogether.


India Gets Its Own TruthGPT, It’s Wholly Untruthful


While Elon Musk has been trying to make TruthGPT in a bid to “understand reality”, the Indian ecosystem is already at it. Mumbai-based The Whole Truth Foods has launched its own, fact-checked TruthGPT to give information about food & fitness.

Shashank Mehta, the founder of The Whole Truth, posted on LinkedIn about how people no longer need to go to Google, rely on influencers on Instagram, or click on click-baity links on the internet for information.

Interestingly, according to Mehta, “Even ChatGPT, the OG GPT, doesn’t help. It’s trained on all of the world wide web. And on all the falsities and misinformation the web contains. There’s a very high chance of GIGO – Garbage In, Garbage Out.”

AIM tried the platform, and surprisingly it is nowhere close to what ChatGPT offers.

But according to Mehta, this TruthGPT platform is an LLM trained on a “fact-checked” dataset, namely the company’s own verified fitness and food data. Interestingly, The Whole Truth Foods has built its brand around honesty, yet this clearly looks like a marketing gimmick from the company.

The platform is powered by Fini, a YC-backed company that uses its proprietary algorithms to build chatbots for companies such as Uber, Lancey, and many others.

Indian influencers and developers have been bullish on building their own LLM-based tools and have developed things like KundliGPT and GitaGPT, which is still fine. But when a company claims it is going to give dietary advice to people, the system has to be foolproof, which the company neither promises nor demonstrates, and that is not being wholly truthful.


How an 8-Character Password Could be Cracked in Just a Few Minutes

Security experts keep advising us to create strong and complex passwords to protect our online accounts and data from savvy cybercriminals. And “complex” typically means using lowercase and uppercase characters, numbers, and even special symbols. But complexity by itself can still leave your password open to cracking if it doesn’t contain enough characters, according to research by security firm Hive Systems.

Jump to:

  • How long does it take to crack a password?
  • What tools do hackers use to crack your passwords?
  • How to protect yourself and your organization from password cracking

How long does it take to crack a password?

As described in a report from April 2023, Hive found that an 8-character complex password could be cracked in only five minutes if the attacker were to take advantage of the latest graphics processing technology and artificial intelligence. Further, a seven-character complex password could be cracked in 4 seconds, while one with six or fewer characters could be cracked instantly. Shorter passwords with only one or two character types, such as only numbers or lowercase letters, or only numbers and letters, could also be cracked in an instant.


On the plus side, even simpler passwords with a greater number of characters are less vulnerable to cracking in a short amount of time, according to Hive’s research. An 18-character password with only numbers would require six days to crack, but one with the same number of characters using lowercase letters would take 481,000 years to crack (Figure A). This piece of data shows why passphrases, which use a long string of real but random words, can be more secure than a complex but short password.

Figure A

Hive’s report shows that passphrases with a mix of 18 uppercase and lowercase letters, numbers, and symbols are the most difficult to brute force. Image: Hive Systems
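
To see why length dominates complexity, here is a rough back-of-the-envelope sketch; the guesses-per-second figure is an assumed, purely illustrative number, not one taken from Hive’s report:

def brute_force_time_years(length, charset_size, guesses_per_second):
    """Worst-case time to exhaust the full keyspace, in years."""
    keyspace = charset_size ** length
    seconds = keyspace / guesses_per_second
    return seconds / (60 * 60 * 24 * 365)

# Assumed speed: 1 trillion guesses per second (illustrative only)
rate = 1e12
print(brute_force_time_years(8, 94, rate))    # 8 chars, full printable set (~94 symbols)
print(brute_force_time_years(18, 26, rate))   # 18 chars, lowercase letters only

Even with this crude model, the 18-character lowercase password dwarfs the 8-character complex one by many orders of magnitude.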

What tools do hackers use to crack your passwords?

A hacker aiming to crack complex yet short passwords quickly enough would need the latest and most advanced graphics processing technology. The more powerful the graphics processing unit, the faster it can perform such tasks as mining cryptocurrencies and cracking passwords.

For example, one of the top GPUs around today is Nvidia’s GeForce RTX 4090, a product that starts at $1,599. But even less powerful and less expensive GPUs can crack passwords of a small length and low complexity in a relatively short amount of time.

Hackers who don’t have the latest and greatest graphics processing on their computers can easily turn to the cloud, according to Hive. By renting computer and graphics hardware through Amazon AWS and other cloud providers, a cybercriminal can tap into multiple virtual instances of a powerful GPU to perform password cracking at a fairly low cost.

Plus, the advances in AI have given hackers another type of tool to crack passwords more quickly and efficiently. An April 2023 report from Home Security Heroes that analyzed 15,600,000 common passwords discovered that by using AI, hackers could crack 81% of them in less than a month, 71% in less than a day, 65% in less than an hour and 51% in less than a minute.

How to protect yourself and your organization from password cracking

Due to the progress in graphics and AI technology, most types of passwords require less time to crack than they did only two years ago. For example, a seven-character password with letters, numbers and symbols would take 7 minutes to crack in 2020 but only 4 seconds in 2023. Given these advances in technology, how can you and your organization better secure your password-protected accounts and data? Here are a few tips.

Use a passphrase instead of a password

A passphrase is a long string of often random words. Passphrases are often more secure than passwords and are usually easier to remember. For example, “sunset-beach-sand” uses three real words separated by dashes and would take 2 billion years to crack, according to Security.org.
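
As a small illustrative sketch (the tiny word list below is a hypothetical stand-in; a real generator would draw from a dictionary of thousands of words), a random passphrase can be generated with Python’s built-in secrets module:

import secrets

# Tiny illustrative word list; use a large dictionary in practice
words = ["sunset", "beach", "sand", "orbit", "maple", "river", "cobalt", "lantern"]

def make_passphrase(n_words=3, separator="-"):
    # secrets.choice uses a cryptographically strong random source
    return separator.join(secrets.choice(words) for _ in range(n_words))

print(make_passphrase())  # e.g. "maple-orbit-river"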

Use a password manager

Since creating and remembering multiple complex and lengthy passwords on your own is impossible, a password manager is your best bet. By using a password manager for yourself or within your organization, you can generate, store and apply strong passwords for websites and online accounts.


Use a strong master password

If you do adopt a password manager, you’ll want to protect your stored passwords as effectively as possible. The way to do that is through a strong master password. Create a complex and long password or passphrase you can remember.

Test your passwords

To gauge the strength of a potential password, enter it at a site such as Security.org. The site will tell you how long it would take to crack that password.


What is ‘AI drift’ and why is it making ChatGPT dumber?


Whether you have experienced it yourself using ChatGPT or read about it, the rumors are true: ChatGPT is getting progressively dumber.

This phenomenon is especially perplexing because generative AI models use user input to continuously train themselves, which should make them more intelligent as they accumulate more user entries over time.


The answer may lie in a concept called "drift."

A "drift" refers to when large language models (LLMs) behave in unexpected or unpredictable ways that stray away from the original parameters. This may happen because attempts to improve parts of complicated AI models cause other parts to perform worse.

Researchers from the University of California, Berkeley and Stanford University conducted a study to evaluate drift and examine how OpenAI's popular large language models (LLMs), GPT-3.5 (the LLM behind ChatGPT) and GPT-4 (the LLM behind Bing Chat and ChatGPT Plus), changed over time.


The study compared the ability of both LLMs to solve math problems, answer sensitive questions, answer opinion surveys, answer multi-hop knowledge-intensive questions, generate code, answer US Medical License exam questions, and complete visual reasoning tasks in March and in June.

As the study results show, GPT-4's March version outperformed the June version in many instances, the most glaring being basic math prompts, where the March version of GPT-4 outperformed the June version in both of the study's examples (a) and (b).

GPT-4 also worsened at code generation, answering medical exam questions, and answering opinion surveys. All of these instances can be attributed to the drift phenomenon.

Regarding the drifts, one of the researchers, James Zou told the Wall Street Journal, "We had the suspicion it could happen here, but we were very surprised at how fast the drift is happening."


Despite the deteriorating intelligence, there were also some instances of improvement in both GPT-4 and GPT-3.5.

As a result, the researchers encourage users to keep using LLMs, but to exercise caution when using them and to evaluate their outputs continually.


Authors are losing their patience with AI, part 349235


On Monday morning, numerous writers woke up to learn that their books had been uploaded and scanned into a massive dataset without their consent. A project of cloud word processor Shaxpir, Prosecraft compiled over 27,000 books, comparing, ranking and analyzing them based on the “vividness” of their language. Many authors — including Young Adult powerhouse Maureen Johnson and “Little Fires Everywhere” author Celeste Ng — spoke out against Prosecraft for training a model on their books without consent. Even books published less than a month ago had already been uploaded.

After a day full of righteous online backlash, Prosecraft creator Benji Smith took down the website, which had existed since 2017.

“I’ve spent thousands of hours working on this project, cleaning up and annotating text, organizing and tweaking things,” Smith wrote. “But in the meantime, ‘AI’ became a thing. And the arrival of AI on the scene has been tainted by early use-cases that allow anyone to create zero-effort impersonations of artists, cutting those creators out of their own creative process.”

I join the chain of those who did not consent to this. https://t.co/PschO2oict pic.twitter.com/nmwh6GylAm

— Maureen Johnson (@maureenjohnson) August 7, 2023

Smith’s Prosecraft was not a generative AI tool, but authors worried it could become one, since he had amassed a dataset of a quarter billion words from published books, which he found by crawling the internet.

Prosecraft would show two paragraphs from a book, one that was “most passive” and one that was “most vivid.” It then placed the books into percentile rankings based on how vivid, how long, or how passive they were.

“If you’re a writer as a career it’s maddening, in part because style is not the same as writing a fucking whitepaper for a business that needs to be in active voice or whatever,” author Ilana Masad said. “Style is style!”

Smith did not respond to multiple requests for comment, but he elaborated on his intentions in his blog post.

“Since I was only publishing summary statistics, and small snippets from the text of those books, I believed I was honoring the spirit of the Fair Use doctrine, which doesn’t require the consent of the original author,” Smith wrote. Some authors noted that the excerpts of their books on Prosecraft included major spoilers, causing further frustration.

Though Smith apologized, authors remain exasperated. For artists and writers, the recent proliferation of AI tools has created a deeply frustrating game of whack-a-mole. As soon as they opt out of one database, they find that their work has been used to train another AI model, and so on.

“It’s pretty much the norm, from what I can tell, for these sites and projects to do whatever they’re doing first and then hope that no one notices and then disappear or get defensive when they inevitably do,” Masad said.

Generative AI and the technology behind self-publishing have created a perfect storm for scammy activities. Amazon has been flooded with low-quality, AI-generated travel guides and even AI-generated children’s books. But tools like ChatGPT are basically trained on the sum total of the internet, so real travel writers or children’s book authors could be getting inadvertently plagiarized.

Author Jane Friedman wrote in a recent blog post — titled “I’d Rather See My Books Get Pirated Than This” — that she is being impersonated on Amazon, where someone is selling books under her name that appear to be written with an AI.

Though Friedman was successful in getting these fake books removed from her Goodreads page, she says that Amazon won’t remove the books for sale unless she has a trademark for her name.

Amazon did not provide a comment before publication.

“I don’t think any writer is seriously convinced that AI is going to ruin books because like, well, that’s not how literature works, and everything I’ve seen ChatGPT write as a ‘story’ is just really fucking boring with no voice or real craft or style,” Masad said.

But she worries that publishers will be convinced otherwise, and possibly replace marketing and publicity teams with AI-generated promotional content.

“It feels really bad,” she said.


Fundamentals Of Statistics For Data Scientists and Analysts


As the British mathematician Karl Pearson once stated, statistics is the grammar of science, and this holds especially true for the Computer and Information Sciences, the Physical Sciences, and the Biological Sciences. When you are getting started on your journey in Data Science or Data Analytics, having statistical knowledge will help you better leverage data insights.

“Statistics is the grammar of science.” Karl Pearson

The importance of statistics in data science and data analytics cannot be overstated. Statistics provides tools and methods to find structure and to give deeper data insights. Both Statistics and Mathematics love facts and hate guesses. Knowing the fundamentals of these two important subjects will allow you to think critically and be creative when using data to solve business problems and make data-driven decisions. In this article, I will cover the following Statistics topics for data science and data analytics:

- Random variables
- Probability distribution functions (PDFs)
- Mean, Variance, Standard Deviation
- Covariance and Correlation
- Bayes Theorem
- Linear Regression and Ordinary Least Squares (OLS)
- Gauss-Markov Theorem
- Parameter properties (Bias, Consistency, Efficiency)
- Confidence intervals
- Hypothesis testing
- Statistical significance
- Type I & Type II Errors
- Statistical tests (Student's t-test, F-test)
- p-value and its limitations
- Inferential Statistics
- Central Limit Theorem & Law of Large Numbers
- Dimensionality reduction techniques (PCA, FA)

If you have no prior statistical knowledge and you want to identify and learn the essential statistical concepts from scratch and prepare for your job interviews, then this article is for you. It will also be a good read for anyone who wants to refresh their statistical knowledge.

Before we start, welcome to LunarTech!

Welcome to LunarTech.ai, where we understand the power of job-searching strategies in the dynamic field of Data Science and AI. We dive deep into the tactics and strategies required to navigate the competitive job search process. Whether it’s defining your career goals, customizing application materials, or leveraging job boards and networking, our insights provide the guidance you need to land your dream job.

Preparing for data science interviews? Fear not! We shine a light on the intricacies of the interview process, equipping you with the knowledge and preparation necessary to increase your chances of success. From initial phone screenings to technical assessments, technical interviews, and behavioral interviews, we leave no stone unturned.

At LunarTech.ai, we go beyond the theory. We’re your springboard to unparalleled success in the tech and data science realm. Our comprehensive learning journey is tailored to fit seamlessly into your lifestyle, allowing you to strike the perfect balance between personal and professional commitments while acquiring cutting-edge skills. With our dedication to your career growth, including job placement assistance, expert resume building, and interview preparation, you’ll emerge as an industry-ready powerhouse.

Join our community of ambitious individuals today and embark on this thrilling data science journey together. With LunarTech.ai, the future is bright, and you hold the keys to unlock boundless opportunities.

Random Variables

The concept of random variables forms the cornerstone of many statistical concepts. Its formal mathematical definition might be hard to digest, but simply put, a random variable is a way to map the outcomes of random processes, such as flipping a coin or rolling a die, to numbers. For instance, we can define the random process of flipping a coin by a random variable X that takes the value 1 if the outcome is heads and 0 if the outcome is tails.

X = 1 if the outcome is heads,  X = 0 if the outcome is tails

In this example, we have a random process of flipping a coin where this experiment can produce two possible outcomes: {0,1}. This set of all possible outcomes is called the sample space of the experiment. Each time the random process is repeated, it is referred to as an event. In this example, flipping a coin and getting a tail as an outcome is an event. The chance or the likelihood of this event occurring with a particular outcome is called the probability of that event. A probability of an event is the likelihood that a random variable takes a specific value of x which can be described by P(x). In the example of flipping a coin, the likelihood of getting heads or tails is the same, that is 0.5 or 50%. So we have the following setting:

Pr(X = heads) = 0.5,  Pr(X = tails) = 0.5

where the probability of an event, in this example, can only take values in the range [0,1].
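
As a quick illustrative sketch, this coin-flip random variable can be simulated with NumPy, and the share of heads comes out close to the 0.5 probability described above:

import numpy as np

np.random.seed(1)
# X = 1 for heads, 0 for tails, each with probability 0.5
flips = np.random.binomial(n=1, p=0.5, size=10_000)
print(flips[:10])    # e.g. [1 0 0 ...]
print(flips.mean())  # close to 0.5, the probability of heads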

The importance of statistics in data science and data analytics cannot be overstated. Statistics provides tools and methods to find structure and to give deeper data insights.

Mean, Variance, Standard Deviation

To understand the concepts of mean, variance, and many other statistical topics, it is important to learn the concepts of population and sample. The population is the set of all observations (individuals, objects, events, or procedures) and is usually very large and diverse, whereas a sample is a subset of observations from the population that ideally is a true representation of the population.

Figure: A population and a sample drawn from it.
Image Source: The Author

Given that experimenting with an entire population is either impossible or simply too expensive, researchers or analysts use samples rather than the entire population in their experiments or trials. To make sure that the experimental results are reliable and hold for the entire population, the sample needs to be a true representation of the population. That is, the sample needs to be unbiased. For this purpose, one can use statistical sampling techniques such as Random Sampling, Systematic Sampling, Clustered Sampling, Weighted Sampling, and Stratified Sampling.

Mean

The mean, also known as the average, is a central value of a finite set of numbers. Let’s assume a random variable X in the data has the following values:

X = {x_1, x_2, x_3, ..., x_N}

where N is the number of observations or data points in the sample set, or simply the data frequency. Then the sample mean, denoted by x̄, which is very often used to approximate the population mean, can be expressed as follows:

x̄ = (x_1 + x_2 + ... + x_N) / N

The mean is also referred to as the expectation, which is often denoted by E() or by the random variable with a bar on top. For example, the expectations of random variables X and Y, that is E(X) and E(Y), respectively, can be expressed as follows:

E(X) = (x_1 + x_2 + ... + x_N) / N,  E(Y) = (y_1 + y_2 + ... + y_N) / N

import numpy as np
import math

x = np.array([1, 3, 5, 6])
mean_x = np.mean(x)

# in case the data contains NaN values
x_nan = np.array([1, 3, 5, 6, math.nan])
mean_x_nan = np.nanmean(x_nan)

Variance

The variance measures how far the data points are spread out from the average value, and is equal to the sum of squares of differences between the data values and the average (the mean). Furthermore, the population variance can be expressed as follows:

σ² = (1/N) · Σ (x_i − μ)²

x = np.array([1, 3, 5, 6])
variance_x = np.var(x)

# here you need to specify the degrees of freedom (ddof): the max number of
# logically independent data points that have freedom to vary
x_nan = np.array([1, 3, 5, 6, math.nan])
variance_x_nan = np.nanvar(x_nan, ddof=1)

For deriving expectations and variances of different popular probability distribution functions, check out this Github repo.

Standard Deviation

The standard deviation is simply the square root of the variance and measures the extent to which data varies from its mean. The standard deviation, denoted by σ (sigma), can be expressed as follows:

σ = √σ² = √( (1/N) · Σ (x_i − μ)² )

Standard deviation is often preferred over the variance because it has the same unit as the data points, which means you can interpret it more easily.

x = np.array([1, 3, 5, 6])
std_x = np.std(x)

x_nan = np.array([1, 3, 5, 6, math.nan])
std_x_nan = np.nanstd(x_nan, ddof=1)

Covariance

The covariance is a measure of the joint variability of two random variables and describes the relationship between these two variables. It is defined as the expected value of the product of the two random variables’ deviations from their means. The covariance between two random variables X and Z can be described by the following expression, where E(X) and E(Z) represent the means of X and Z, respectively.

Cov(X, Z) = E[(X − E(X)) · (Z − E(Z))] = E(XZ) − E(X) · E(Z)

Covariance can take negative or positive values as well as value 0. A positive value of covariance indicates that two random variables tend to vary in the same direction, whereas a negative value suggests that these variables vary in opposite directions. Finally, the value 0 means that they don’t vary together.

x = np.array([1, 3, 5, 6])
y = np.array([-2, -4, -5, -6])

# this returns the covariance matrix of x and y, containing x_variance and
# y_variance on the diagonal and the covariance of x and y off the diagonal
cov_xy = np.cov(x, y)

Correlation

The correlation is also a measure of relationship: it captures both the strength and the direction of the linear relationship between two variables. If a correlation is detected, it means that there is a relationship or a pattern between the values of the two target variables. The correlation between two random variables X and Z is equal to the covariance between these two variables divided by the product of their standard deviations, which can be described by the following expression.

Cor(X, Z) = Cov(X, Z) / (σ_X · σ_Z)

Correlation coefficients’ values range between -1 and 1. Keep in mind that the correlation of a variable with itself is always 1, that is Cor(X, X) = 1. Another thing to keep in mind when interpreting correlation is to not confuse it with causation, given that a correlation is not causation. Even if there is a correlation between two variables, you cannot conclude that one variable causes a change in the other. This relationship could be coincidental, or a third factor might be causing both variables to change.

x = np.array([1, 3, 5, 6])
y = np.array([-2, -4, -5, -6])

corr = np.corrcoef(x, y)

Probability Distribution Functions

A function that describes all the possible values, the sample space, and the corresponding probabilities that a random variable can take within a given range, bounded between the minimum and maximum possible values, is called a probability distribution function (pdf) or probability density. Every pdf needs to satisfy the following two criteria:

0 ≤ Pr(x) ≤ 1   and   Σ Pr(x) = 1

where the first criterion states that all probabilities should be numbers in the range [0,1], and the second criterion states that the sum of all possible probabilities should be equal to 1.

Probability functions are usually classified into two categories: discrete and continuous. A discrete distribution function describes a random process with a countable sample space, as in the example of tossing a coin that has only two possible outcomes. A continuous distribution function describes a random process with a continuous sample space. Examples of discrete distribution functions are the Bernoulli, Binomial, Poisson, and Discrete Uniform distributions. Examples of continuous distribution functions are the Normal, Continuous Uniform, and Cauchy distributions.

Binomial Distribution

The binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each with a boolean-valued outcome: success (with probability p) or failure (with probability q = 1 − p). Let's assume a random variable X follows a Binomial distribution; then the probability of observing k successes in n independent trials can be expressed by the following probability density function:

Pr(X = k) = (n! / (k! · (n − k)!)) · p^k · (1 − p)^(n − k),   k = 0, 1, ..., n

The binomial distribution is useful when analyzing the results of repeated independent experiments, especially if one is interested in the probability of meeting a particular threshold given a specific error rate.

Binomial Distribution Mean & Variance

E(X) = n·p,   Var(X) = n·p·(1 − p)

The figure below visualizes an example of Binomial distribution where the number of independent trials is equal to 8 and the probability of success in each trial is equal to 16%.

Figure: Histogram of 1,000 samples from a Binomial distribution with n = 8 and p = 0.16.
Image Source: The Author

# Random Generation of 1000 independent Binomial samples
import numpy as np
n = 8
p = 0.16
N = 1000
X = np.random.binomial(n, p, N)

# Histogram of Binomial distribution
import matplotlib.pyplot as plt
counts, bins, ignored = plt.hist(X, 20, density=True, rwidth=0.7, color='purple')
plt.title("Binomial distribution with p = 0.16 n = 8")
plt.xlabel("Number of successes")
plt.ylabel("Probability")
plt.show()

Poisson Distribution

The Poisson distribution is the discrete probability distribution of the number of events occurring in a specified time period, given the average number of times the event occurs over that time period. Let's assume a random variable X follows a Poisson distribution; then the probability of observing k events over a time period can be expressed by the following probability function:

Pr(X = k) = (λ^k · e^(−λ)) / k!,   k = 0, 1, 2, ...

where e is Euler’s number and λ (lambda), the arrival rate parameter, is the expected value of X. The Poisson distribution function is very popular for its use in modeling countable events occurring within a given time interval.

Poisson Distribution Mean & Variance

E(X) = λ,   Var(X) = λ

For example, the Poisson distribution can be used to model the number of customers arriving in a shop between 7 and 10 pm, or the number of patients arriving in an emergency room between 11 and 12 pm. The figure below visualizes an example of the Poisson distribution where we count the number of web visitors arriving at a website, with the arrival rate, lambda, assumed to be equal to 7.

Figure: Histogram of 1,000 samples from a Poisson distribution with λ = 7.
Image Source: The Author

# Random Generation of 1000 independent Poisson samples
import numpy as np
lambda_ = 7
N = 1000
X = np.random.poisson(lambda_, N)

# Histogram of Poisson distribution
import matplotlib.pyplot as plt
counts, bins, ignored = plt.hist(X, 50, density=True, color='purple')
plt.title("Randomly generating from Poisson Distribution with lambda = 7")
plt.xlabel("Number of visitors")
plt.ylabel("Probability")
plt.show()

Normal Distribution

The Normal probability distribution is the continuous probability distribution for a real-valued random variable. The Normal distribution, also called the Gaussian distribution, is arguably one of the most popular distributions, commonly used in the social and natural sciences for modeling purposes; for example, it is used to model people’s height or test scores. Let's assume a random variable X follows a Normal distribution; then its probability density function can be expressed as follows.

f(x) = (1 / (σ · √(2π))) · exp(−(x − μ)² / (2σ²))

where the parameter μ (mu) is the mean of the distribution, also referred to as the location parameter, and the parameter σ (sigma) is the standard deviation of the distribution, also referred to as the scale parameter. The number π (pi) is a mathematical constant approximately equal to 3.14.

Normal Distribution Mean & Variance

E(X) = μ,   Var(X) = σ²

The figure below visualizes an example of a Normal distribution with mean 0 (μ = 0) and standard deviation 1 (σ = 1), which is referred to as the Standard Normal distribution and is symmetric.

Figure: Histogram of 1,000 samples from the Standard Normal distribution with the population density overlaid.
Image Source: The Author

# Random Generation of 1000 independent Normal samples
import numpy as np
mu = 0
sigma = 1
N = 1000
X = np.random.normal(mu, sigma, N)

# Population distribution
from scipy.stats import norm
x_values = np.arange(-5, 5, 0.01)
y_values = norm.pdf(x_values)

# Sample histogram with Population distribution
import matplotlib.pyplot as plt
counts, bins, ignored = plt.hist(X, 30, density=True, color='purple', label='Sampling Distribution')
plt.plot(x_values, y_values, color='y', linewidth=2.5, label='Population Distribution')
plt.title("Randomly generating 1000 obs from Normal distribution mu = 0 sigma = 1")
plt.ylabel("Probability")
plt.legend()
plt.show()

Bayes Theorem

The Bayes Theorem, often called Bayes’ Law, is arguably the most powerful rule in probability and statistics, named after the famous English statistician and philosopher Thomas Bayes.


Bayes theorem is a powerful probability law that brings the concept of subjectivity into the world of Statistics and Mathematics where everything is about facts. It describes the probability of an event, based on the prior information of conditions that might be related to that event. For instance, if the risk of getting Coronavirus or Covid-19 is known to increase with age, then Bayes Theorem allows the risk to an individual of a known age to be determined more accurately by conditioning it on the age than simply assuming that this individual is common to the population as a whole.

The concept of conditional probability, which plays a central role in Bayes theory, is a measure of the probability of an event happening, given that another event has already occurred. Bayes theorem can be described by the following expression where the X and Y stand for events X and Y, respectively:

Pr(X|Y) = Pr(Y|X) · Pr(X) / Pr(Y)

  • Pr (X|Y): the probability of event X occurring given that event or condition Y has occurred or is true
  • Pr (Y|X): the probability of event Y occurring given that event or condition X has occurred or is true
  • Pr (X) & Pr (Y): the probabilities of observing events X and Y, respectively

In the case of the earlier example, the probability of getting Coronavirus (event X) conditional on being at a certain age is Pr(X|Y), which is equal to the probability of being at a certain age given that one got Coronavirus, Pr(Y|X), multiplied by the probability of getting Coronavirus, Pr(X), divided by the probability of being at a certain age, Pr(Y).
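
As a small worked sketch, with purely made-up numbers rather than real Covid-19 statistics, Bayes Theorem can be applied directly in Python:

# Illustrative numbers only, not real statistics
p_covid = 0.05            # Pr(X): overall probability of having Covid-19
p_age = 0.20              # Pr(Y): probability of being in the older age group
p_age_given_covid = 0.50  # Pr(Y|X): probability of that age group given Covid-19

# Bayes Theorem: Pr(X|Y) = Pr(Y|X) * Pr(X) / Pr(Y)
p_covid_given_age = p_age_given_covid * p_covid / p_age
print(p_covid_given_age)  # 0.125, higher than the unconditional 0.05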

Linear Regression

Earlier, the concept of causation between variables was introduced, which happens when one variable has a direct impact on another variable. When the relationship between two variables is linear, Linear Regression is a statistical method that can help model the impact of a unit change in one variable, the independent variable, on the values of another variable, the dependent variable.

Dependent variables are often referred to as response variables or explained variables, whereas independent variables are often referred to as regressors or explanatory variables. When the Linear Regression model is based on a single independent variable, then the model is called Simple Linear Regression and when the model is based on multiple independent variables, it’s referred to as Multiple Linear Regression. Simple Linear Regression can be described by the following expression:

Y_i = β0 + β1·X_i + u_i

where Y is the dependent variable, X is the independent variable which is part of the data, β0 is the intercept which is unknown and constant, and β1 is the slope coefficient, or a parameter corresponding to the variable X, which is unknown and constant as well. Finally, u is the error term that the model makes when estimating the Y values. The main idea behind linear regression is to find the best-fitting straight line, the regression line, through a set of paired (X, Y) data. One example of a Linear Regression application is modeling the impact of Flipper Length on penguins’ Body Mass, which is visualized below.

Figure: Simple linear regression of penguin Body Mass (g) on Flipper Length (mm).
Image Source: The Author

# R code for the graph
install.packages("ggplot2")
install.packages("palmerpenguins")
library(palmerpenguins)
library(ggplot2)
View(data(penguins))
ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_smooth(method = "lm", se = FALSE, color = 'purple') +
  geom_point() +
  labs(x = "Flipper Length (mm)", y = "Body Mass (g)")

Multiple Linear Regression with three independent variables can be described by the following expression:

Y_i = β0 + β1·X_1i + β2·X_2i + β3·X_3i + u_i

Ordinary Least Squares

Ordinary least squares (OLS) is a method for estimating the unknown parameters, such as β0 and β1, in a linear regression model. The model is based on the principle of least squares, which minimizes the sum of squares of the differences between the observed dependent variable and its values predicted by the linear function of the independent variable, often referred to as fitted values. This difference between the real and predicted values of the dependent variable Y is referred to as the residual, and what OLS does is minimize the sum of squared residuals. This optimization problem results in the following OLS estimates for the unknown parameters β0 and β1, which are also known as coefficient estimates.

β̂1 = Σ (X_i − X̄)(Y_i − Ȳ) / Σ (X_i − X̄)²,   β̂0 = Ȳ − β̂1·X̄

Once these parameters of the Simple Linear Regression model are estimated, the fitted values of the response variable can be computed as follows:

Ŷ_i = β̂0 + β̂1·X_i

Standard Error

The residuals or the estimated error terms can be determined as follows:

û_i = Y_i − Ŷ_i

It is important to keep in mind the difference between the error terms and the residuals. Error terms are never observed, while the residuals are calculated from the data. OLS estimates an error term for each observation, but not the actual error term, so the true error variance is still unknown. Moreover, these estimates are subject to sampling uncertainty. What this means is that we will never be able to determine the exact estimate, the true value, of these parameters from sample data in an empirical application. However, we can estimate it by calculating the sample residual variance, using the residuals as follows.

σ̂² = Σ û_i² / (N − 2)

This estimate for the variance of sample residuals helps to estimate the variance of the estimated parameters which is often expressed as follows:

Var(β̂1) = σ̂² / Σ (X_i − X̄)²

The square root of this variance term is called the standard error of the estimate, which is a key component in assessing the accuracy of the parameter estimates. It is used to calculate test statistics and confidence intervals. The standard error can be expressed as follows:

SE(β̂1) = √Var(β̂1)

It is important to keep in mind the difference between the error terms and residuals. Error terms are never observed, while the residuals are calculated from the data.

OLS Assumptions

The OLS estimation method makes the following assumptions, which need to be satisfied to get reliable prediction results:

A1: Linearity assumption states that the model is linear in parameters.

A2: Random Sample assumption states that all observations in the sample are randomly selected.

A3: Exogeneity assumption states that independent variables are uncorrelated with the error terms.

A4: Homoskedasticity assumption states that the variance of all error terms is constant.

A5: No Perfect Multi-Collinearity assumption states that none of the independent variables is constant and there are no exact linear relationships between the independent variables.

import numpy as np
from scipy.stats import t

def runOLS(Y, X):
    # OLS estimation Y = Xb + e --> beta_hat = (X'X)^-1(X'Y)
    N = len(Y)
    beta_hat = np.dot(np.linalg.inv(np.dot(np.transpose(X), X)), np.dot(np.transpose(X), Y))

    # OLS prediction
    Y_hat = np.dot(X, beta_hat)
    residuals = Y - Y_hat
    RSS = np.sum(np.square(residuals))
    sigma_squared_hat = RSS / (N - 2)
    TSS = np.sum(np.square(Y - np.repeat(Y.mean(), len(Y))))
    MSE = sigma_squared_hat
    RMSE = np.sqrt(MSE)
    R_squared = (TSS - RSS) / TSS

    # Standard error of estimates: square root of estimate's variance
    var_beta_hat = np.linalg.inv(np.dot(np.transpose(X), X)) * sigma_squared_hat

    SE = []
    t_stats = []
    p_values = []
    CI_s = []

    for i in range(len(beta_hat)):
        # standard errors
        SE_i = np.sqrt(var_beta_hat[i, i])
        SE.append(np.round(SE_i, 3))

        # t-statistics
        t_stat = np.round(beta_hat[i, 0] / SE_i, 3)
        t_stats.append(t_stat)

        # p-value of t-stat p[|t_stat| >= t-threshold two sided]
        p_value = t.sf(np.abs(t_stat), N - 2) * 2
        p_values.append(np.round(p_value, 3))

        # Confidence intervals = beta_hat -+ margin_of_error
        t_critical = t.ppf(q=1 - 0.05 / 2, df=N - 2)
        margin_of_error = t_critical * SE_i
        CI = [np.round(beta_hat[i, 0] - margin_of_error, 3),
              np.round(beta_hat[i, 0] + margin_of_error, 3)]
        CI_s.append(CI)

    return (beta_hat, SE, t_stats, p_values, CI_s, MSE, RMSE, R_squared)
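
As a minimal usage sketch (assuming Y is an N x 1 column vector and X contains a column of ones for the intercept, with synthetic data and made-up true parameters), the function above can be called like this:

import numpy as np

np.random.seed(0)
N = 100
x = np.random.normal(10, 2, size=(N, 1))
X = np.hstack([np.ones((N, 1)), x])                    # add intercept column
Y = 5 + 3 * x + np.random.normal(0, 1, size=(N, 1))    # true beta0 = 5, beta1 = 3

beta_hat, SE, t_stats, p_values, CI_s, MSE, RMSE, R_squared = runOLS(Y, X)
print(beta_hat)   # estimates close to the true values 5 and 3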

Parameter Properties

Under the assumption that the OLS criteria A1 — A5 are satisfied, the OLS estimators of coefficients β0 and β1 are BLUE and Consistent.

Gauss-Markov theorem

This theorem highlights the properties of OLS estimates where the term BLUE stands for Best Linear Unbiased Estimator.

Bias

The bias of an estimator is the difference between its expected value and the true value of the parameter being estimated and can be expressed as follows:

Bias(β̂) = E(β̂) − β

When we state that the estimator is unbiased, what we mean is that the bias is equal to zero, which implies that the expected value of the estimator is equal to the true parameter value, that is:

E(β̂) = β

Unbiasedness does not guarantee that the obtained estimate from any particular sample is equal or close to β. What it means is that, if one repeatedly draws random samples from the population and computes the estimate each time, then the average of these estimates would be equal or very close to β.
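
A small simulation sketch, with made-up true parameters β0 = 2 and β1 = 3, illustrates this: each sample yields a slightly different slope estimate, but the average of the estimates lands very close to the true value.

import numpy as np

np.random.seed(42)
true_beta0, true_beta1 = 2.0, 3.0
estimates = []

for _ in range(5_000):
    x = np.random.normal(0, 1, 100)
    u = np.random.normal(0, 1, 100)          # error term
    y = true_beta0 + true_beta1 * x + u
    # OLS slope estimate for a simple regression
    beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    estimates.append(beta1_hat)

print(np.mean(estimates))  # very close to 3.0, the true beta1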

Efficiency

The term Best in the Gauss-Markov theorem relates to the variance of the estimator and is referred to as efficiency. A parameter can have multiple estimators but the one with the lowest variance is called efficient.

Consistency

The term consistency goes hand in hand with the terms sample size and convergence. If the estimator converges to the true parameter as the sample size becomes very large, then this estimator is said to be consistent, that is:

β̂ → β as N → ∞   (plim β̂ = β)

Under the assumption that the OLS criteria A1 — A5 are satisfied, the OLS estimators of coefficients β0 and β1 are BLUE and Consistent.
Gauss-Markov Theorem

All these properties hold for OLS estimates as summarized in the Gauss-Markov theorem. In other words, OLS estimates have the smallest variance, they are unbiased, linear in parameters, and are consistent. These properties can be mathematically proven by using the OLS assumptions made earlier.

Confidence Intervals

The Confidence Interval is the range that contains the true population parameter with a certain pre-specified probability, referred to as the confidence level of the experiment, and it is obtained by using the sample results and the margin of error.

Margin of Error

The margin of error is the difference between the sample result and what the result would have been if one had used the entire population.

Confidence Level

The Confidence Level describes the level of certainty in the experimental results. For example, a 95% confidence level means that if one were to perform the same experiment repeatedly 100 times, then 95 of those 100 trials would lead to similar results. Note that the confidence level is defined before the start of the experiment because it will affect how big the margin of error will be at the end of the experiment.

Confidence Interval for OLS Estimates

As mentioned earlier, the OLS estimates of the Simple Linear Regression, the estimates for the intercept β0 and the slope coefficient β1, are subject to sampling uncertainty. However, we can construct CIs for these parameters which will contain the true value of these parameters in 95% of all samples. That is, a 95% confidence interval for β can be interpreted as follows:

  • The confidence interval is the set of values for which a hypothesis test cannot be rejected at the 5% level.
  • The confidence interval has a 95% chance of containing the true value of β.

95% confidence interval of OLS estimates can be constructed as follows:

CI_0.95 = [ β̂1 − 1.96·SE(β̂1),  β̂1 + 1.96·SE(β̂1) ]

which is based on the parameter estimate, the standard error of that estimate, and the value 1.96 representing the margin of error corresponding to the 5% rejection rule. This value is determined using the Normal Distribution table, which will be discussed later on in this article. Meanwhile, the following figure illustrates the idea of 95% CI:

Figure: Confidence intervals constructed from repeated samples; roughly 95% of them contain the true parameter value.
Image Source: Wikipedia

Note that the confidence interval depends on the sample size as well, given that it is calculated using the standard error which is based on sample size.
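
As a minimal sketch with made-up numbers, an estimate of 50 with a standard error of 2, a 95% confidence interval can be computed directly:

beta1_hat = 50.0   # illustrative parameter estimate
se = 2.0           # illustrative standard error
z = 1.96           # critical value for a 95% confidence level

ci_lower = beta1_hat - z * se
ci_upper = beta1_hat + z * se
print((ci_lower, ci_upper))  # (46.08, 53.92)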

The confidence level is defined before the start of the experiment because it will affect how big the margin of error will be at the end of the experiment.

Statistical Hypothesis testing

Testing a hypothesis in Statistics is a way to test the results of an experiment or survey to determine how meaningful the results are. Basically, one is testing whether the obtained results are valid by figuring out the odds that they occurred by chance. If it is the latter, then the results are not reliable and neither is the experiment. Hypothesis Testing is part of Statistical Inference.

Null and Alternative Hypothesis

Firstly, you need to determine the thesis you wish to test; then you need to formulate the Null Hypothesis and the Alternative Hypothesis. The test can have two possible outcomes, and based on the statistical results you can either reject or accept the stated hypothesis. As a rule of thumb, statisticians tend to put the version or formulation of the hypothesis that needs to be rejected under the Null Hypothesis, whereas the acceptable and desired version is stated under the Alternative Hypothesis.

Statistical significance

Let’s look at the earlier mentioned example where the Linear Regression model was used to investigate whether a penguin’s Flipper Length, the independent variable, has an impact on Body Mass, the dependent variable. We can formulate this model with the following statistical expression:

Body Mass = β0 + β1·Flipper Length + u

Then, once the OLS estimates of the coefficients are estimated, we can formulate the following Null and Alternative Hypothesis to test whether the Flipper Length has a statistically significant impact on the Body Mass:

H0: Flipper Length has no impact on Body Mass
H1: Flipper Length has an impact on Body Mass

where H0 and H1 represent the Null Hypothesis and the Alternative Hypothesis, respectively. Rejecting the Null Hypothesis would mean that a one-unit increase in Flipper Length has a direct impact on Body Mass, given that the parameter estimate β1 describes this impact of the independent variable, Flipper Length, on the dependent variable, Body Mass. This hypothesis can be reformulated as follows:

H0: β1 = 0
H1: β1 ≠ 0

where H0 states that the parameter estimate of β1 is equal to 0, that is, the Flipper Length effect on Body Mass is statistically insignificant, whereas H1 states that the parameter estimate of β1 is not equal to 0, suggesting that the Flipper Length effect on Body Mass is statistically significant.
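
As a hedged illustration (not the article's original code), the same regression and hypothesis test can be run with statsmodels, here using the penguins dataset bundled with seaborn as a stand-in for the data discussed above:

```python
# Sketch: Flipper Length -> Body Mass regression with seaborn's penguins dataset
# as a stand-in (assumes seaborn, pandas and statsmodels are installed).
import seaborn as sns
import statsmodels.formula.api as smf

penguins = sns.load_dataset("penguins").dropna(subset=["flipper_length_mm", "body_mass_g"])

# body_mass_g = beta_0 + beta_1 * flipper_length_mm + error
fit = smf.ols("body_mass_g ~ flipper_length_mm", data=penguins).fit()

# The summary reports, for each coefficient, the estimate, its standard error,
# the t-statistic for H0: beta = 0, and the corresponding p-value.
print(fit.summary())
```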

Type I and Type II Errors

When performing Statistical Hypothesis Testing one needs to consider two conceptual types of errors: Type I error and Type II error. The Type I error occurs when the Null is wrongly rejected whereas the Type II error occurs when the Null Hypothesis is wrongly not rejected. A confusion matrix can help to clearly visualize the severity of these two types of errors.


Statistical Tests

Once the Null and the Alternative Hypotheses are stated and the test assumptions are defined, the next step is to determine which statistical test is appropriate and to calculate the test statistic. Whether or not to reject the Null can be determined by comparing the test statistic with the critical value. This comparison shows whether or not the observed test statistic is more extreme than the defined critical value, and it can have two possible results:

  • The test statistic is more extreme than the critical value → the null hypothesis can be rejected
  • The test statistic is not as extreme as the critical value → the null hypothesis cannot be rejected

The critical value is based on a prespecified significance level α (usually chosen to be equal to 5%) and the type of probability distribution the test statistic follows. The critical value divides the area under this probability distribution curve into the rejection region(s) and the non-rejection region. There are numerous statistical tests used to test various hypotheses. Examples of statistical tests are the Student’s t-test, F-test, Chi-squared test, Durbin–Wu–Hausman endogeneity test, and White heteroskedasticity test. In this article, we will look at two of these statistical tests.


Student’s t-test

One of the simplest and most popular statistical tests is the Student’s t-test, which can be used for testing various hypotheses, especially when dealing with a hypothesis where the main area of interest is to find evidence for the statistically significant effect of a single variable. The test statistic of the t-test follows the Student’s t distribution and can be determined as follows:

t = (β̂ − h0) / SE(β̂)

where h0 in the numerator is the value against which the parameter estimate is being tested. So, the t-test statistic is equal to the parameter estimate minus the hypothesized value, divided by the standard error of the coefficient estimate. The earlier stated hypothesis, testing whether Flipper Length has a statistically significant impact on Body Mass, can be tested using a t-test, in which case h0 is equal to 0 since the slope coefficient estimate is tested against the value 0.

There are two versions of the t-test: a two-sided t-test and a one-sided t-test. Whether you need the former or the latter version of the test depends entirely on the hypothesis that you want to test.

The two-sided or two-tailed t-test can be used when the hypothesis is testing an equal versus not equal relationship under the Null and Alternative Hypotheses, similar to the following example:

H0: β = h0
H1: β ≠ h0

The two-sided t-test has two rejection regions as visualized in the figure below:

[Figure: Rejection regions of the two-sided t-test. Image Source: Hartmann, K., Krois, J., Waske, B. (2018): E-Learning Project SOGA: Statistics and Geospatial Data Analysis. Department of Earth Sciences, Freie Universitaet Berlin]

In this version of the t-test, the Null is rejected if the calculated t-statistic is either too small or too large.

Reject H0 if t < −t(critical) or t > t(critical), i.e., if |t| > t(critical)

Here, the test statistic is compared to the critical values based on the sample size and the chosen significance level. To determine the exact value of the cutoff point, the two-sided t-distribution table can be used.

The one-sided or one-tailed t-test can be used when the hypothesis is testing a positive versus negative (or negative versus positive) relationship under the Null and Alternative Hypotheses, similar to the following examples:

H0: β ≤ h0    vs.    H1: β > h0
H0: β ≥ h0    vs.    H1: β < h0

The one-sided t-test has a single rejection region and, depending on the hypothesis side, the rejection region is either on the left-hand side or the right-hand side, as visualized in the figure below:

[Figure: Rejection region of the one-sided t-test. Image Source: Hartmann, K., Krois, J., Waske, B. (2018): E-Learning Project SOGA: Statistics and Geospatial Data Analysis. Department of Earth Sciences, Freie Universitaet Berlin]

In this version of the t-test, the Null is rejected if the calculated t-statistic is smaller (left-tailed test) or larger (right-tailed test) than the critical value.

Reject H0 if t > t(critical) (right-tailed test) or t < −t(critical) (left-tailed test)
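
A small sketch of both decision rules, using scipy and purely illustrative numbers for the estimate, standard error, and degrees of freedom (none of these values come from the article):

```python
# Sketch: two-sided and one-sided t-test decisions for a slope estimate.
from scipy import stats

beta_hat = 49.7   # hypothetical slope estimate
se = 1.5          # hypothetical standard error
h0 = 0.0          # value tested under the Null
df = 340          # residual degrees of freedom (n - 2 for simple regression)

t_stat = (beta_hat - h0) / se
alpha = 0.05

# Two-sided test: reject H0 if |t| exceeds the alpha/2 critical value
t_crit_two_sided = stats.t.ppf(1 - alpha / 2, df)
print("two-sided reject H0:", abs(t_stat) > t_crit_two_sided)

# One-sided (right-tailed) test: reject H0 if t exceeds the alpha critical value
t_crit_one_sided = stats.t.ppf(1 - alpha, df)
print("one-sided reject H0:", t_stat > t_crit_one_sided)
```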

F-test

The F-test is another very popular statistical test, often used to test hypotheses about the joint statistical significance of multiple variables. This is the case when you want to test whether multiple independent variables have a statistically significant impact on a dependent variable. Following is an example of a statistical hypothesis that can be tested using the F-test:

H0: β1 = β2 = β3 = 0
H1: at least one of β1, β2, β3 is not equal to 0

where the Null states that the three variables corresponding to these coefficients are jointly statistically insignificant and the Alternative states that these three variables are jointly statistically significant. The test statistic of the F-test follows the F distribution and can be determined as follows:

F = [(SSR(restricted) − SSR(unrestricted)) / q] / [SSR(unrestricted) / (N − k − 1)]

where SSR(restricted) is the sum of squared residuals of the restricted model, which is the same model excluding the target variables stated as insignificant under the Null; SSR(unrestricted) is the sum of squared residuals of the unrestricted model, which is the model that includes all variables; q represents the number of variables that are being jointly tested for insignificance under the Null; N is the sample size; and k is the total number of variables in the unrestricted model. SSR values are provided next to the parameter estimates after running the OLS regression, and the same holds for the F-statistic as well. Following is an example of an MLR model output where the SSR and F-statistic values are marked.

[Figure: MLR model output with the SSR and F-statistic values marked. Image Source: Stock and Watson]

F-test has a single rejection region as visualized below:

[Figure: Rejection region of the F-test. Image Source: University of Michigan]

If the calculated F-statistic is bigger than the critical value, then the Null can be rejected, which suggests that the independent variables are jointly statistically significant. The rejection rule can be expressed as follows:

Reject H0 if F > F(critical)
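
A minimal sketch of this calculation, with placeholder SSR values, sample size, and number of regressors (not figures from the article), might look like this:

```python
# Sketch: F-test for joint significance, computed from restricted and
# unrestricted sums of squared residuals. All numbers are placeholders.
from scipy import stats

ssr_restricted = 1250.0    # SSR of the model without the q tested variables
ssr_unrestricted = 1100.0  # SSR of the full (unrestricted) model
q = 3                      # number of restrictions (variables tested jointly)
n = 420                    # sample size
k = 6                      # number of regressors in the unrestricted model

f_stat = ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / (n - k - 1))
f_crit = stats.f.ppf(0.95, dfn=q, dfd=n - k - 1)
p_value = 1 - stats.f.cdf(f_stat, dfn=q, dfd=n - k - 1)

print(f"F = {f_stat:.2f}, critical value = {f_crit:.2f}, p-value = {p_value:.4f}")
print("reject H0 (jointly significant):", f_stat > f_crit)
```
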
P-Values

Another quick way to determine whether to reject or to support the Null Hypothesis is by using p-values. The p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the test statistic. The smaller the p-value, the stronger the evidence against the Null Hypothesis, suggesting that it can be rejected.

The interpretation of a p-value is dependent on the chosen significance level. Most often, 1%, 5%, or 10% significance levels are used to interpret the p-value. So, instead of using the t-test and the F-test, p-values of these test statistics can be used to test the same hypotheses.

The following figure shows a sample output of an OLS regression with two independent variables. In this table, the p-value of the t-test, testing the statistical significance of the class_size variable’s parameter estimate, and the p-value of the F-test, testing the joint statistical significance of the class_size and el_pct variables’ parameter estimates, are underlined.

[Figure: OLS regression output with the t-test and F-test p-values underlined. Image Source: Stock and Watson]

The p-value corresponding to the class_size variable is 0.011, and when comparing this value to the significance levels 1% (0.01), 5% (0.05), and 10% (0.1), the following conclusions can be made:

  • 0.011 > 0.01 → the Null of the t-test can’t be rejected at the 1% significance level
  • 0.011 < 0.05 → the Null of the t-test can be rejected at the 5% significance level
  • 0.011 < 0.10 → the Null of the t-test can be rejected at the 10% significance level

So, this p-value suggests that the coefficient of the class_size variable is statistically significant at the 5% and 10% significance levels. The p-value corresponding to the F-test is 0.0000, and since it is smaller than all three cutoff values (0.01, 0.05, 0.10), we can conclude that the Null of the F-test can be rejected in all three cases. This suggests that the coefficients of the class_size and el_pct variables are jointly statistically significant at the 1%, 5%, and 10% significance levels.
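
For illustration, the sketch below (with purely illustrative numbers, not the regression output above) shows how a two-sided p-value follows from a t-statistic and how it is compared against the usual cutoffs:

```python
# Sketch: turning a t-statistic into a two-sided p-value and comparing it
# with the usual significance levels. The numbers are illustrative only.
from scipy import stats

t_stat, df = 2.54, 418                             # hypothetical t-statistic and degrees of freedom
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))   # two-sided p-value

for alpha in (0.01, 0.05, 0.10):
    decision = "reject H0" if p_value < alpha else "cannot reject H0"
    print(f"alpha={alpha}: p-value={p_value:.4f} -> {decision}")
```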

Limitation of p-values

Although using p-values has many benefits, it also has limitations. Namely, the p-value depends on both the magnitude of the association and the sample size. If the magnitude of the effect is small and practically insignificant, the p-value might still show a significant impact because the sample size is large. The opposite can occur as well: an effect can be large, but fail to meet the p < 0.01, 0.05, or 0.10 criteria if the sample size is small.

Inferential Statistics

Inferential statistics uses sample data to make reasonable judgments about the population from which the sample data originated. It’s used to investigate the relationships between variables within a sample and make predictions about how these variables will relate to a larger population.

Both the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT) have a significant role in Inferential Statistics because they show that the experimental results hold regardless of the shape of the original population distribution when the data is large enough. The more data is gathered, the more accurate the statistical inferences become and, hence, the more accurate the generated parameter estimates are.

Law of Large Numbers (LLN)

Suppose X1, X2, . . . , Xn are all independent random variables with the same underlying distribution, also called independent identically distributed or i.i.d., where all X’s have the same mean μ and standard deviation σ. As the sample size grows, the sample average of the X’s converges in probability to the mean μ. The Law of Large Numbers can be summarized as follows:

X̄ = (X1 + X2 + … + Xn) / n  →  μ   as n → ∞   (convergence in probability)
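
A quick simulation sketch of this convergence (NumPy assumed; μ and σ are arbitrary choices):

```python
# Sketch: the Law of Large Numbers in simulation. As the sample size grows,
# the sample average of i.i.d. draws approaches the population mean mu.
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 5.0, 2.0

for n in (10, 100, 10_000, 1_000_000):
    sample = rng.normal(mu, sigma, size=n)
    print(f"n = {n:>9,}: sample mean = {sample.mean():.4f} (mu = {mu})")
```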

Central Limit Theorem (CLT)

Suppose X1, X2, . . . , Xn are all independent random variables with the same underlying distribution, also called independent identically distributed or i.i.d., where all X’s have the same mean μ and standard deviation σ. As the sample size grows, the probability distribution of the sample average X̄ converges in distribution to a Normal distribution with mean μ and variance σ²/n. The Central Limit Theorem can be summarized as follows:

√n · (X̄ − μ) / σ  →  N(0, 1)   as n → ∞,   i.e., X̄ is approximately N(μ, σ²/n) for large n

Stated differently, when you have a population with mean ? and standard deviation ? and you take sufficiently large random samples from that population with replacement, then the distribution of the sample means will be approximately normally distributed.
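
A short simulation sketch of this statement, drawing from a deliberately non-Normal (exponential) distribution:

```python
# Sketch: the Central Limit Theorem in simulation. Even though each draw comes
# from a skewed exponential distribution, the distribution of sample means
# is approximately Normal with mean mu and standard deviation sigma / sqrt(n).
import numpy as np

rng = np.random.default_rng(0)
mu = 1.0                 # mean (and standard deviation) of the exponential distribution
n, n_samples = 50, 10_000

sample_means = rng.exponential(mu, size=(n_samples, n)).mean(axis=1)

print("mean of sample means:", sample_means.mean())   # close to mu
print("std of sample means: ", sample_means.std())    # close to sigma / sqrt(n)
print("theoretical std:     ", mu / np.sqrt(n))
```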

Dimensionality Reduction Techniques

Dimensionality reduction is the transformation of data from a high-dimensional space into a low-dimensional space such that this low-dimensional representation of the data still contains the meaningful properties of the original data as much as possible.

With the increase in popularity of Big Data, the demand for these dimensionality reduction techniques, which reduce the amount of unnecessary data and features, increased as well. Examples of popular dimensionality reduction techniques are Principal Component Analysis, Factor Analysis, Canonical Correlation, and Random Forest.

Principal Component Analysis (PCA)

Principal Component Analysis or PCA is a dimensionality reduction technique that is very often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller set that still contains most of the information or the variation in the original large dataset.

Let’s assume we have data X with p variables X1, X2, …, Xp, with eigenvectors e1, …, ep and eigenvalues λ1, …, λp. Eigenvalues show the variance explained by a particular data field out of the total variance. The idea behind PCA is to create new (independent) variables, called Principal Components, that are a linear combination of the existing variables. The ith principal component can be expressed as follows:

Yi = ei' X = ei1·X1 + ei2·X2 + … + eip·Xp

Then using Elbow Rule or Kaiser Rule, you can determine the number of principal components that optimally summarize the data without losing too much information. It is also important to look at the proportion of total variation (PRTV) that is explained by each principal component to decide whether it is beneficial to include or to exclude it. PRTV for the ith principal component can be calculated using eigenvalues as follows:

PRTVi = λi / (λ1 + λ2 + … + λp)
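
As a hedged sketch of how this looks in practice with scikit-learn (the iris dataset is only a stand-in, not data from the article):

```python
# Sketch: PCA on a generic numeric dataset. The data is standardized first so
# that each variable contributes comparably to the total variance.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data                          # any numeric [n_samples x p] matrix works
X_scaled = StandardScaler().fit_transform(X)

pca = PCA()                                   # keep all p components for now
scores = pca.fit_transform(X_scaled)          # the principal component scores

# explained_variance_ratio_ is the PRTV of each component: lambda_i / sum(lambda_j)
print(pca.explained_variance_ratio_)
print(np.cumsum(pca.explained_variance_ratio_))
```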

Elbow Rule

The elbow rule or the elbow method is a heuristic approach used to determine the optimal number of principal components from the PCA results. The idea behind this method is to plot the explained variation as a function of the number of components and pick the elbow of the curve as the optimal number of principal components. Following is an example of such a scatter plot, where the PRTV (Y-axis) is plotted against the number of principal components (X-axis). The elbow corresponds to the X-axis value 2, which suggests that the optimal number of principal components is 2.

[Figure: Elbow plot of PRTV versus the number of principal components; the elbow is at 2. Image Source: Multivariate Statistics GitHub]
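
A minimal sketch of producing such an elbow plot, again with the iris data as a stand-in and assuming matplotlib and scikit-learn are installed:

```python
# Sketch: an elbow (scree) plot of the proportion of variance explained per component.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

prtv = PCA().fit(StandardScaler().fit_transform(load_iris().data)).explained_variance_ratio_

plt.plot(range(1, len(prtv) + 1), prtv, marker="o")
plt.xlabel("Principal component")
plt.ylabel("Proportion of total variation (PRTV)")
plt.title("Elbow plot for choosing the number of components")
plt.show()
```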

Factor Analysis (FA)

Factor Analysis or FA is another statistical method for dimensionality reduction. It is one of the most commonly used inter-dependency techniques and is used when the relevant set of variables shows a systematic inter-dependence and the objective is to find out the latent factors that create a commonality. Let’s assume we have data X with p variables X1, X2, …, Xp. The FA model can be expressed as follows:

X = μ + AF + u

where X is a [p x N] matrix of p variables and N observations, μ is the [p x N] population mean matrix, A is the [p x k] common factor loadings matrix, F [k x N] is the matrix of common factors, and u [p x N] is the matrix of specific factors. Put differently, a factor model is a series of multiple regressions, predicting each of the variables Xi from the values of the unobservable common factors fi:

Xi = μi + ai1·f1 + ai2·f2 + … + aik·fk + ui

Each variable has k of its own common factors, and these are related to the observations via the factor loading matrix for a single observation. In factor analysis, the factors are calculated to maximize between-group variance while minimizing in-group variance. They are called factors because they group the underlying variables. Unlike PCA, in FA the data needs to be normalized, given that FA assumes that the dataset follows a Normal Distribution.
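
As a rough sketch, a k = 2 factor model can be fitted with scikit-learn's FactorAnalysis (iris again serves only as a stand-in dataset):

```python
# Sketch: Factor Analysis extracting k = 2 latent factors. The data is
# standardized because FA assumes (approximately) normally distributed variables.
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

fa = FactorAnalysis(n_components=2, random_state=0)
factors = fa.fit_transform(X)     # the common factor scores F

print(fa.components_)             # factor loadings, shape (k, p): the transpose of A above
print(factors[:5])                # first few observations in factor space
```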

Tatev Karen Aslanyan is an experienced full-stack data scientist with a focus on Machine Learning and AI. She is also the co-founder of LunarTech, an online tech educational platform, and the creator of The Ultimate Data Science Bootcamp. Tatev Karen, with a Bachelor’s and Master’s in Econometrics and Management Science, has grown in the field of Machine Learning and AI, focusing on Recommender Systems and NLP, supported by her scientific research and published papers. Following five years of teaching, Tatev is now channeling her passion into LunarTech, helping shape the future of data science.

Original. Reposted with permission.

More On This Topic

  • Must Know for Data Scientists and Data Analysts: Causal Design Patterns
  • What’s the Difference Between Data Analysts and Data Scientists?
  • The Inferential Statistics Data Scientists Should Know
  • Important Statistics Data Scientists Need to Know
  • Practical Statistics for Data Scientists
  • Will Data Analysts be Replaced by AI?

ChatGPT Security Concerns: Credentials on the Dark Web and More

ChatGPT writing phishing emails.
Image: DIgilife/Adobe Stock

As artificial intelligence technology such as ChatGPT continues to improve, so does its potential for misuse by cybercriminals. According to BlackBerry Global Research, 74% of IT decision-makers surveyed acknowledged ChatGPT’s potential threat to cybersecurity. 51% of the respondents believe there will be a successful cyberattack credited to ChatGPT in 2023.

Here’s a rundown of some of the most significant reported ChatGPT-related cybersecurity issues and risks.

Jump to:

  • ChatGPT credentials and jailbreak prompts on the Dark Web
  • Weaponization of ChatGPT
  • ChatGPT can amplify disinformation or fake news
  • ChatGPT can write malicious code
  • Meet WormGPT, an AI developed for cybercriminals
  • How can you mitigate AI-created cyberthreats?

ChatGPT credentials and jailbreak prompts on the Dark Web

ChatGPT stolen credentials on the Dark Web

Cybersecurity company Group-IB published research in June 2023 on the trade of stolen ChatGPT credentials on the Dark Web. According to the company, more than 100,000 ChatGPT accounts were stolen between June 2022 and March 2023. More than 40,000 of these credentials were stolen from the Asia-Pacific region, followed by the Middle East and Africa (24,925), Europe (16,951), Latin America (12,314) and North America (4,737).

There are two main reasons why cybercriminals want to access ChatGPT accounts. The obvious one is to get their hands on paid accounts, which have no limitations compared to the free versions. However, the main threat is account spying — ChatGPT keeps a detailed history of all prompts and answers, which could potentially leak sensitive data to fraudsters.

Dmitry Shestakov, head of threat intelligence at Group-IB, wrote, “Many enterprises are integrating ChatGPT into their operational flow. Employees enter classified correspondences or use the bot to optimize proprietary code. Given that ChatGPT’s standard configuration retains all conversations, this could inadvertently offer a trove of sensitive intelligence to threat actors if they obtain account credentials.”

Jailbreak prompts on the Dark Web

SlashNext, a cloud email security company, reported the increasing trade of jailbreak prompts on cybercriminal underground forums. Those prompts are dedicated to bypassing ChatGPT’s guardrails, enabling an attacker to craft malicious content with the AI.

Weaponization of ChatGPT

The primary concern regarding the exploitation of ChatGPT is its potential weaponization by cybercriminals. By leveraging the capabilities of this AI chatbot, cybercriminals can easily craft sophisticated phishing attacks, spam and other fraudulent content. ChatGPT can convincingly impersonate individuals or trusted entities/organizations, increasing the likelihood of tricking unsuspecting users into divulging sensitive information or falling victim to scams (Figure A).

Figure A

One result from a prompt to create a phishing email. Image: Written with ChatGPT

As can be read in this example, ChatGPT can enhance the effectiveness of social engineering attacks by offering more realistic and personalized interactions with potential victims. Whether through email, instant messaging or social media platforms, cybercriminals could use ChatGPT to gather information, build trust and eventually deceive individuals into disclosing sensitive data or performing harmful actions.

ChatGPT can amplify disinformation or fake news

The spread of disinformation and fake news is a growing problem on the internet. With the help of ChatGPT, cybercriminals can quickly generate and disseminate large volumes of misleading or harmful content that might be used for influence operations. This could lead to heightened social unrest, political instability and public distrust in reliable information sources.

ChatGPT can write malicious code

ChatGPT has several safeguards that prevent it from responding to prompts related to writing malware or engaging in any harmful, illegal or unethical activities. However, even attackers with low-level programming skills can still bypass these protocols and make it write malware code. Several security researchers have written about this issue.

Cybersecurity company HYAS published research on how they wrote a proof-of-concept malware, called Black Mamba, with the help of ChatGPT. It is polymorphic malware with keylogger functionality.

Mark Stockley wrote on the MalwareBytes Labs website that he made ChatGPT write ransomware, yet concludes the AI is very bad at it. One reason for this is ChatGPT’s word limit of around 3,000 words. Stockley stated that “ChatGPT is essentially mashing up and rephrasing content it found on the Internet,” so the pieces of code it provides are nothing new.

Researcher Aaron Mulgrew from the data security company Forcepoint showed how it was possible to bypass all of ChatGPT’s guardrails by making it write code in snippets. This method produced advanced malware that stayed undetected by 69 antivirus engines on VirusTotal, a platform offering malware detection across various antivirus engines.

Meet WormGPT, an AI developed for cybercriminals

Daniel Kelley from SlashNext exposed a new AI tool advertised on the Dark Web called WormGPT. The tool is being advertised as the “Best GPT alternative for Blackhats” and provides answers to prompts without any ethical limitation, in contrast to ChatGPT, which makes it harder to produce malicious content. Figure B shows a prompt from SlashNext.

Figure B

WormGPT provides illegal content without any limitation. Image: SlashNext

The developer of WormGPT does not provide information on how the AI was created and what data it was fed during its training process.

WormGPT offers a few subscriptions: the monthly subscription costs 100 Euros, with the first 10 prompts for 60 Euros. The yearly subscription is 550 Euros. A private setup with a private dedicated API is priced at 5,000 Euros.

How can you mitigate AI-created cyberthreats?

There are no specific mitigation practices for cyberthreats created by AI. The usual security recommendations apply, in particular regarding social engineering. Employees should be trained to detect phishing emails or any social engineering attempt — whether it is via email, instant messaging or social media. Also, remember that it is not a good security practice to send confidential data to ChatGPT because it might leak.

Disclosure: I work for Trend Micro, but the views expressed in this article are mine.


GitHub Does Not Necessarily Get You a Job


If you are a developer looking for a job, anyone and everyone will suggest one thing – build up your GitHub profile by contributing to projects as regularly as you can. Well, this is no longer an absolute. In certain cases these days, GitHub contributions are making people lose out on jobs, even while they may be making money through the platform.

JLarky, a developer who works at Fogbender, posted a few days ago on X that he is looking for job opportunities as he might be leaving his role. A few days later, a lot of people were sharing their GitHub profiles, showing how many contributions they have made in the last year. JLarky also posted his profile, which shows that he has been making numerous contributions, but he is still not able to land a job.

PSA: my GitHub looks like this and no one hired me yet pic.twitter.com/g2Lq3ZTuFH

— JLarky (@JLarky) August 6, 2023

Interestingly, a user points out that the very reason he thinks should land him a job may be the exact reason he is not landing one. Why would a company want to hire someone who gives away all their code for free?

Isn’t open source the way forward?

Developers earn from $5 to $30,000 from a GitHub repository every month. A lot of the time, people who post on GitHub have been regarded as developers who already have a full-time job. Funnily enough, others say it actually signals someone who has very little experience and too much free time on their hands. Moreover, it is very easy to copy code, pad your GitHub profile with it, and fake your contributions to open source.

On the other hand, it is very necessary to make your presence and capabilities visible online to be hired by companies. So even if you are working at a company, if you ever wish to switch, how else will companies know you’re a “10X Super Duper Hyper software developer”? The more you contribute to open source projects, the more recognition you get, and thus the more likely you are to get noticed by hirers.

While that is true, the other side of this is that you don’t really need a GitHub presence anymore to get a job. People who have barely contributed on GitHub have been able to get jobs easily compared to people with highly rated profiles. But that cannot be said to be the only reason.

As soon as you join a company, you realise that a lot of the projects you work on will not make it to your public profile, simply because they are closed source projects of the company. That is because code is valuable. Some companies won’t even let you use their private repositories. In the current era of AI models, a lot of companies do not want their code to be used by AI models for training auto-code platforms like GitHub Copilot.

Takes time, but you get what you want

On an exciting note, a lot of companies started out as open source software. For example, Redis, the database company, started out as an open source repository on GitHub and eventually converted into a company that offers several services through its premium offerings. There are a lot of examples of startups and projects born out of open source that are now minting money, such as MindsDB, MongoDB, Kubernetes, Hoppscotch, Kafka, and so on. They are even hiring for roles.

On the other hand, there are many high-profile developers who do not care about acquiring jobs. For example, Antonio Cheong, a developer who released an open source Bard API on GitHub for free by reverse engineering it, proudly says that he is an open source developer. He continues to contribute on GitHub and has since built a lot of open source projects like ChatGPT open source, vectordb, and many more.

The best part is that maintaining a GitHub profile by contributing to open source does get recognised by big tech companies. So while some people might be landing jobs with unimpressive GitHub profiles, a developer who waits and keeps building their profile might end up at a company like OpenAI, Google, or Meta.

Thus, while forgetting about your GitHub profile is not the answer, it is also not smart to rely on it completely. It is great to contribute to open source, but if you need a job, it is ideal to also learn the other skills that today’s highly competitive market requires.

However, there are platforms like MachineHack that offer jobs through their platform and also let you build your profile based on how many hackathons and contributions you take part in, something that GitHub is missing.

The post GitHub Does Not Necessarily Get You a Job appeared first on Analytics India Magazine.

Generative AI megatrends: implications of GPT-4 drift and open source models – part one


In this two-part discussion, we will cover two related generative AI megatrends:

  1. What are the implications of GPT-4 in terms of model drift
  2. What is the impact of this limitation on the uptake of open source LLMs

Background

A recent paper, How Is ChatGPT’s Behavior Changing over Time?, from Stanford University and UC Berkeley claims that the performance of GPT-4 has drifted over time. To make this claim, specific tasks were evaluated (e.g., accuracy on maths problems) and the results indicated that, for these tasks, performance degraded from March to June. Data drift in ML models is not new. LLMs are particularly susceptible to drift and other data-related issues due to the manner in which they work: after processing a user query, they leverage the training data to understand the context, and then simply attempt to predict the text output based on this content. This means they can always produce a response even if it is incorrect, i.e., there is no direct validation of the response (as there is, say, in classification or regression).

Analysis

  • In a technical sense, drift is different from degradation.
  • The model is not uniformly getting worse – rather, it is getting worse at specific tasks.
  • There was a separate and unrelated report of a drop in traffic for GPT models.
  • It is possible that because more people are using the system, they are reporting more problems.
  • It is claimed that this drift shows that LLMs are a long way from achieving AGI.
  • Many of these users trusted generative AI solutions to the extent that they were willing to seek financial, medical, and relationship advice from a virtual assistant.

But the most serious finding is that, because ChatGPT is a black-box model, users will not trust it for building systems, since they cannot understand and quantify the nature of the drift.

While this is indeed serious, this discussion also misses the point that LLMs may not necessarily be used in customer-facing situations, i.e., many LLMs will be used to assist humans or within the development stack, where their output may not necessarily be directly exposed to humans.

Image source: drifting sands over time https://pixabay.com/photos/india-desert-sand-pattern-sand-355/