Practical Statistics for Data Scientists

Practical Statistics for Data Scientists
Image by unsplash

Statistical concepts are used widely to extract useful information from data. This article will review essential statistical concepts applicable in data science and machine learning.

Probability Distribution

A probability distribution shows how feature values are distributed around the mean value. Using the iris dataset, the probability distributions for the sepal length, sepal width, petal length, and petal width can be generated using the code below.

import numpy as np  import matplotlib.pyplot as plt  from sklearn import datasets  import seaborn as sns    iris = sns.load_dataset("iris")  sns.kdeplot(data=iris)  plt.show()

Practical Statistics for Data Scientists
Probability distribution of sepal length, sepal width, sepal width, petal length, and petal width | Image by Author Mode

Lets now focus on the sepal length variable. The probability distribution of the sepal length variable is shown below.

Practical Statistics for Data Scientists
Probability distribution of sepal length variable | Image by Author

We observe that the probability distribution of the sepal length variable has a single maximum, hence it is unimodal. The value of the sepal length where the maximum occurs is the mode, which is about 5.8.

A plot of the probability distribution of the petal width variable is shown below.

Practical Statistics for Data Scientists
Probability distribution of the petal width variable | Image by Author

From this plot, we observe that the probability distribution of the petal length variable has 2 maxima, hence it is bimodal. The values of the sepal length where the maxima occurs are the mode, that is at 1.7 and 5.0.

Mean

The mean value is a measure of central tendency. The mean value for the sepal length variable is obtained as follows:

data = datasets.load_iris().data  sepal_length = data[:,0]  mean = np.mean(sepal_length)  >>> 5.843333333333334

Median

The median value is also a measure of central tendency. The median value is less susceptible to the presence of outliers, hence a more reliable measure of central tendency, compared to the mean value. The median value for the sepal length variable is obtained as follows:

data = datasets.load_iris().data  sepal_length = data[:,0]  np.median(sepal_length)  >>> 5.8

Standard Deviation

Standard deviation is a measure of fluctuations of data values around the mean value. It is used to quantify the degree of uncertainty in the dataset. The standard deviation for the sepal length feature is calculated using the code below.

data = datasets.load_iris().data  sepal_length = data[:,0]  std = np.std(sepal_length)  >>> 0.8253012917851409

Confidence Interval

The confidence interval is the range of values around the mean. The 65% confidence interval is the range of values that are one standard deviation from the mean value. The 95% confidence interval is the range of values that are two standard deviations from the mean value. The boxplot below shows the mean value and 65% confidence interval for the sepal length feature.

sns.boxplot(data = iris, y='sepal_length')  plt.show()

Practical Statistics for Data Scientists
Boxplot for the sepal length feature. The blue region indicates the 65% confidence interval | Image by Author Normal Distribution

Probability distributions can be used for predictive modeling. The sepal length feature only has 150 data points. Suppose that we would like to generate more data points. Then assuming that the sepal length feature is normally distributed, we can generate more data points. In the example below, we generate N = 1000 data points for the sepal length feature.

np.random.seed(10**7)  mu = mean  sigma = std  x = np.random.normal(mean, std, N)       num_bins = 50       n, bins, patches = plt.hist(x, num_bins,                               density = 1,                               color ='green',                              alpha = 0.7)       y = ((1 / (np.sqrt(2 * np.pi) * sigma)) *       np.exp(-0.5 * (1 / sigma * (bins - mu))**2))      plt.plot(bins, y, '--', color ='black')      plt.xlabel('sepal length')  plt.ylabel('probability distribution')      plt.title('matplotlib.pyplot.hist() function Examplenn',            fontweight ="bold")      plt.show()

Practical Statistics for Data Scientists
Probability distribution of the sepal length width | Image by Author Bayes’ Theorem and Conditional Probability

Bayes’ theorem is an important theorem in statistics and data science. It is used for evaluating the predictive power of binary classification algorithms. A simple tutorial on how Bayes’ theorem is used in a binary classification algorithm is found here: Bayes’ Theorem in Plain English.

Conclusion

In summary, we’ve reviewed the essential statistical concepts useful for data science such as mode, median, mean, standard deviation, probability distributions, normal distribution, and Bayes’ theorem. Anyone interested in data science must learn the fundamentals of statistics.
Benjamin O. Tayo is a Physicist, Data Science Educator, and Writer, as well as the Owner of DataScienceHub. Previously, Benjamin was teaching Engineering and Physics at U. of Central Oklahoma, Grand Canyon U., and Pittsburgh State U.

Practical Statistics for Data Scientists

More On This Topic

Latest stories

CMS Uses Machine Learning to Fully Reconstruct LHC Collisions

LANL: AI Accelerates Elucidation of Nuclear Forces with Explosive Neutron...

PNNL: Integrating AI into Biological Research

Rick Stevens on the Genesis Mission and the Future of...

Inside the DOE’s 26 AI Challenges for Genesis Mission

You might also like...

CMS Uses Machine Learning to Fully Reconstruct LHC Collisions

LANL: AI Accelerates Elucidation of Nuclear Forces with Explosive Neutron Star Data

PNNL: Integrating AI into Biological Research