Working with Confidence Intervals

In data science and statistics, confidence intervals are useful for quantifying uncertainty in a dataset. For approximately normally distributed data, the 68% confidence interval covers the values that fall within one standard deviation of the mean, and the 95% confidence interval covers the values that fall within two standard deviations of the mean. The spread can also be summarized by the interquartile range, which covers the values between the 25th percentile and the 75th percentile; the 50th percentile is the median, which coincides with the mean when the distribution is symmetric.
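As a quick sanity check of the one- and two-standard-deviation rule, the sketch below draws a synthetic normal sample and measures the fraction of values inside each interval. The mean of 69 and standard deviation of 3.6 are illustrative assumptions, not values taken from the heights dataset, and `coverage` is a hypothetical helper defined only for this example.

```python
import numpy as np

# Synthetic, normally distributed sample; the parameters are illustrative assumptions.
rng = np.random.default_rng(0)
sample = rng.normal(loc=69.0, scale=3.6, size=100_000)

mu = sample.mean()
sigma = sample.std()

def coverage(values, center, half_width):
    """Fraction of values lying within center +/- half_width."""
    return np.mean(np.abs(values - center) <= half_width)

within_1sd = coverage(sample, mu, 1 * sigma)
within_2sd = coverage(sample, mu, 2 * sigma)

# Interquartile range: 25th to 75th percentile, with the median at the 50th.
q25, q50, q75 = np.percentile(sample, [25, 50, 75])

print(f"within 1 SD: {within_1sd:.3f}")
print(f"within 2 SD: {within_2sd:.3f}")
print(f"IQR: [{q25:.2f}, {q75:.2f}], median: {q50:.2f}")
```

For a normal sample the two fractions should come out close to 0.68 and 0.95, while the interquartile range always holds the middle 50% of the values. For strongly skewed data the standard-deviation fractions can differ noticeably.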

In this article, we illustrate how the confidence interval can be calculated using the heights dataset. The heights dataset contains male and female height data.

Visualization of Probability Distribution of Heights

First, we generate the probability distribution of the male and female heights.

# import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# obtain dataset
df = pd.read_csv('https://raw.githubusercontent.com/bot13956/Bayes_theorem/master/heights.csv')

# plot probability distribution of heights
sns.kdeplot(df[df.sex=='Female']['height'], label='Female')
sns.kdeplot(df[df.sex=='Male']['height'], label='Male')
plt.xlabel('height (inch)')
plt.title('probability distribution of Male and Female heights')
plt.legend()
plt.show()

Probability distribution of male and female heights | Image by Author.

From the figure above, we observe that males are on average taller than females.
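The visual impression can be backed up numerically by comparing the group means. The sketch below uses a tiny made-up table named `toy` (a hypothetical stand-in whose values are not real data) with the same `sex` and `height` column names as the heights dataset, so the same `groupby` call should carry over to `df`.

```python
import pandas as pd

# Tiny made-up table with the same column names as the heights dataset
# ('sex' and 'height'); the values are illustrative, not real data.
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "height": [70.0, 68.5, 71.0, 64.0, 65.5, 63.0],
})

# Mean and standard deviation of height per group.
# Note: pandas .std() uses ddof=1 by default, whereas np.std uses ddof=0.
summary = toy.groupby("sex")["height"].agg(["mean", "std"])
print(summary)

# Difference between the male and female mean heights
mean_gap = summary.loc["Male", "mean"] - summary.loc["Female", "mean"]
print(f"difference in mean height: {mean_gap:.2f} inch")
```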

Calculation of Confidence Intervals

The code below illustrates how the 95% confidence intervals for the male and female heights can be calculated.

# calculate confidence intervals for male heights
mu_male = np.mean(df[df.sex=='Male']['height'])
mu_male
>>> 69.31475494143555

std_male = np.std(df[df.sex=='Male']['height'])
std_male
>>> 3.608799452913512

# interval of two standard deviations around the mean (output rounded to four decimals)
conf_int_male = [mu_male - 2*std_male, mu_male + 2*std_male]
conf_int_male
>>> [62.0972, 76.5324]

# calculate confidence intervals for female heights
mu_female = np.mean(df[df.sex=='Female']['height'])
mu_female
>>> 64.93942425064515

std_female = np.std(df[df.sex=='Female']['height'])
std_female
>>> 3.752747269853828

conf_int_female = [mu_female - 2*std_female, mu_female + 2*std_female]
conf_int_female
>>> [57.43392971093749, 72.4449187903528]
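The mean plus or minus two standard deviations calculation can be wrapped in a small helper and cross-checked against a percentile-based interval. This is a minimal sketch: `sd_interval` and `fraction_inside` are hypothetical helper names, and the synthetic `heights` array is an assumed stand-in for one group of the dataset, not the real data.

```python
import numpy as np

def sd_interval(values, k=2.0):
    """Return [mean - k*std, mean + k*std] for a 1-D collection of values."""
    values = np.asarray(values, dtype=float)
    mu = values.mean()
    sigma = values.std()  # population standard deviation (ddof=0), like np.std
    return [mu - k * sigma, mu + k * sigma]

def fraction_inside(values, interval):
    """Fraction of values lying inside the closed interval [low, high]."""
    values = np.asarray(values, dtype=float)
    low, high = interval
    return np.mean((values >= low) & (values <= high))

# Synthetic stand-in for one group of heights (assumed parameters).
rng = np.random.default_rng(42)
heights = rng.normal(loc=69.3, scale=3.6, size=5_000)

interval_2sd = sd_interval(heights, k=2)
# Percentile-based interval that holds the central 95% of the values
pct_interval = np.percentile(heights, [2.5, 97.5]).tolist()

print("mean +/- 2 SD:  ", interval_2sd, fraction_inside(heights, interval_2sd))
print("2.5th to 97.5th:", pct_interval, fraction_inside(heights, pct_interval))
```

When the data are close to normal the two intervals should nearly agree. A large gap between them is a hint that the two-standard-deviation rule is a poor summary for that group.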

Confidence Interval Using Boxplot

Another method to estimate the confidence interval is to use the interquartile range. A boxplot can be used to visualize the interquartile range as illustrated below.

# generate boxplot
data = [df[df.sex=='Male']['height'],
        df[df.sex=='Female']['height']]

fig, ax = plt.subplots()
ax.boxplot(data)
ax.set_ylabel('height (inch)')
xticklabels = ['Male', 'Female']
ax.set_xticklabels(xticklabels)
ax.yaxis.grid(True)
plt.show()

Box plot showing the interquartile range | Image by Author.

The box shows the interquartile range, and the whiskers indicate the minimum and maximum values of the data, excluding outliers. The circles mark the outliers, and the orange line marks the median. From the figure, the interquartile range for male heights is [67 inches, 72 inches], and the interquartile range for female heights is [63 inches, 67 inches]. The median height is 68 inches for males and 65 inches for females.
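The quantities read off the box plot can also be computed directly. The sketch below assumes the common 1.5 x IQR rule for separating whiskers from outliers, which matches matplotlib's default `whis=1.5`. `box_stats` is a hypothetical helper and `toy_heights` is made-up data chosen so the result is easy to verify by hand.

```python
import numpy as np

def box_stats(values, whis=1.5):
    """Quartiles, whisker ends and outliers under the whis * IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    # Whiskers end at the most extreme data points that are still inside the fences.
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": inside.min(), "whisker_high": inside.max(),
        "outliers": outliers,
    }

# Made-up heights in inches; 80 is deliberately extreme.
toy_heights = [60, 63, 64, 65, 65, 66, 67, 68, 80]
stats = box_stats(toy_heights)

print(f"IQR box: [{stats['q1']}, {stats['q3']}], median: {stats['median']}")
print(f"whiskers: [{stats['whisker_low']}, {stats['whisker_high']}]")
print(f"outliers: {stats['outliers']}")
```

Here the box spans 64 to 67 with the median at 65, the whiskers reach 60 and 68, and the value 80 lies beyond the upper fence of 71.5, so it would be drawn as an outlier.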

Summary

In summary, confidence intervals are useful for quantifying uncertainty in a dataset. For approximately normally distributed data, the 95% confidence interval covers the values that fall within two standard deviations of the mean. The spread can also be summarized by the interquartile range, which covers the values between the 25th percentile and the 75th percentile, with the 50th percentile being the median.
Benjamin O. Tayo is a Physicist, Data Science Educator, and Writer, as well as the Owner of DataScienceHub. Previously, Benjamin was teaching Engineering and Physics at U. of Central Oklahoma, Grand Canyon U., and Pittsburgh State U.
