Exploring Statistics: Day 3 of 14

On day 2, I explored descriptive statistics and touched on measures of central tendency, measures of shape, measures of spread, frequency distributions, and graphical representations. It is important to note that descriptive statistics are a core part of Exploratory Data Analysis (EDA), and data professionals often use the two terms interchangeably.

When working with these measures, keep in mind that their values, and the apparent shape of a distribution, can be strongly influenced by outliers: observations that deviate significantly from the other observations in a dataset.

Hence, the points below touch on what to calculate under the three measures and when it is appropriate to use each:

Note: I do not want to bore you with heavy calculations, as they can make the subject look more difficult than it is. Instead, I have summarized the essential (need-to-know) knowledge required to guide you on the right path to analyzing any dataset.

1. For the measures of central tendency:

This is where the mean, median, and mode are calculated. A key thing to note when working with the measures of central tendency is that the mean is sensitive to outliers while the median is resistant to them. For example, suppose I enter a supermarket with $100, and the 100 people shopping at that moment all hold amounts ranging from $80 to $100. If a tax were charged based on the typical amount held, the mean would be a fair basis because we all hold amounts that fall within a similar range. However, let’s say a shopper walks in with $10,000. Using the mean to calculate the tax each shopper pays becomes unfair, because that single amount pulls the mean far above what almost everyone holds. The $10,000 held by the latest shopper is an outlier, and the median becomes the more reasonable basis for what each shopper should pay.
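To make the supermarket example concrete, here is a minimal sketch using Python’s standard library. The individual amounts are made up for illustration (any values between $80 and $100 would do); only the $10,000 outlier comes from the example above.

```python
from statistics import mean, median

# Hypothetical amounts: 100 shoppers each holding between $80 and $100.
amounts = [80 + (i % 21) for i in range(100)]

mean_before = mean(amounts)
median_before = median(amounts)

# One more shopper walks in with $10,000 (the outlier).
with_outlier = amounts + [10_000]

mean_after = mean(with_outlier)
median_after = median(with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
```

With these made-up amounts, the mean roughly doubles after the new shopper arrives, while the median moves by less than a dollar.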

2. For the measures of shape:

This measures the symmetry (skewness) of a data distribution and the number of peaks the distribution has. A significant thing to note is that the form a graph takes is influenced by the presence of outliers and by the number of peaks. A good thing about the measures of shape is that they help you see when the mean, median and mode of a dataset fall at different points. They also help remove any personal bias or initial assumption you might have concerning a dataset.

For the measures of shape, a distribution is either symmetric or asymmetric. The key difference is that in a symmetric distribution the values are spread evenly on both sides of the mean, so the left half mirrors the right half, while in an asymmetric (skewed) distribution one side stretches out further than the other.

A distribution that has a single peak is called a UNIMODAL distribution, whether it is symmetric or not.

A distribution that has two peaks is called a BIMODAL distribution.

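A distribution is SYMMETRIC when no outliers pull it to one side and the MEAN = MEDIAN (at least approximately).

An asymmetric distribution is typically RIGHT-SKEWED when unusually high values (outliers) stretch the tail to the right, so that the MEAN > MEDIAN.

An asymmetric distribution is typically LEFT-SKEWED when unusually low values (outliers) stretch the tail to the left, so that the MEAN < MEDIAN.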
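As a rough illustration of how the mean, the median and skewness relate, the sketch below computes a moment-based skewness by hand and labels three small made-up datasets. The function names, the sample data and the 0.1 cutoff are assumptions chosen for illustration, not a standard.

```python
from statistics import mean, median

def skewness(values):
    """Moment-based skewness: third central moment / second central moment ** 1.5."""
    mu = mean(values)
    n = len(values)
    m2 = sum((x - mu) ** 2 for x in values) / n
    m3 = sum((x - mu) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

def describe_shape(values, tol=0.1):
    """Label a dataset by the sign of its skewness; tol is an arbitrary cutoff."""
    g = skewness(values)
    if g > tol:
        return "right-skewed"
    if g < -tol:
        return "left-skewed"
    return "roughly symmetric"

# Made-up datasets for illustration.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]      # one unusually high value
left_skewed = [-x for x in right_skewed]      # mirror image: one unusually low value

for name, data in [("symmetric", symmetric),
                   ("right_skewed", right_skewed),
                   ("left_skewed", left_skewed)]:
    print(f"{name}: {describe_shape(data)}, mean={mean(data)}, median={median(data)}")
```

In the right-skewed sample the mean sits above the median, and in the left-skewed sample it sits below, which matches the rule of thumb described above.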

3. For the measures of spread:

This measures the variability of data within a dataset. Measures of spread are useful in weather forecasting, stock price prediction, sales forecasting and any other application that deals with making future estimates, because they indicate how much the values tend to vary. Spread is described using the range, interquartile range (IQR), variance, and standard deviation.

Range: This is the difference between the highest value and the lowest value. Though the range gives an estimate of how far apart the data points are, it depends only on the two extreme values, so it cannot tell you whether those extremes are outliers. The IQR is used in addition.

IQR: To check whether a dataset has outliers, we use the IQR (interquartile range), usually together with a boxplot.

To calculate the IQR, subtract the first quartile (Q1) from the third quartile (Q3): IQR = Q3 - Q1.

Using the IQR, we determine the maximum and minimum limits for the boxplot. Data points that fall outside these limits are treated as potential outliers:

Maximum value: Q3 (75th percentile) + 1.5 × IQR

Minimum value: Q1 (25th percentile) - 1.5 × IQR
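The range and IQR limits above can be sketched with the standard library as follows. The data values are made up, and the choice of `method="inclusive"` is an assumption: different quartile methods give slightly different Q1 and Q3, so other tools may report slightly different limits.

```python
from statistics import quantiles

# Made-up amounts, including one extreme value.
data = [82, 85, 88, 90, 91, 93, 95, 97, 100, 10_000]

# Range: highest value minus lowest value.
data_range = max(data) - min(data)

# Quartiles: quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Boxplot limits: points beyond them are flagged as potential outliers.
upper_limit = q3 + 1.5 * iqr
lower_limit = q1 - 1.5 * iqr

outliers = [x for x in data if x < lower_limit or x > upper_limit]

print(f"range = {data_range}")
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"limits = [{lower_limit}, {upper_limit}]")
print(f"potential outliers = {outliers}")
```

Here the range is huge because of a single value, while the IQR limits single out that value as the only potential outlier.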

Standard deviation: The standard deviation is the square root of the variance and describes how far observations typically lie from the mean. You must have heard of the bell-shaped graph (the normal distribution). For data that follow it, the standard deviation obeys the “EMPIRICAL RULE”, which states that:

  • Approximately 68% of all observations fall within 1 standard deviation of the mean, denoted by: μ ± 1σ

  • Approximately 95% of the observations fall within 2 standard deviations of the mean, denoted by: μ ± 2σ

  • Approximately 99.7% of the observations fall within 3 standard deviations of the mean, denoted by: μ ± 3σ

The symbol μ represents the population mean while the symbol σ represents the population standard deviation.
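One way to see the empirical rule in action is to simulate normally distributed data and count how many observations fall within 1, 2 and 3 standard deviations of the mean. This is only a sketch: the mean of 50, standard deviation of 10, sample size and seed are arbitrary choices, and the shares will be close to, not exactly, 68%, 95% and 99.7%.

```python
import random
from statistics import fmean, pstdev

random.seed(42)  # arbitrary seed so the run is repeatable

# Simulated bell-shaped data: mean 50, standard deviation 10 (arbitrary choices).
sample = [random.gauss(50, 10) for _ in range(100_000)]

mu = fmean(sample)          # mean of the simulated data
sigma = pstdev(sample, mu)  # population standard deviation around that mean

def share_within(k):
    """Fraction of observations inside mu - k*sigma .. mu + k*sigma."""
    low, high = mu - k * sigma, mu + k * sigma
    return sum(low <= x <= high for x in sample) / len(sample)

shares = {k: share_within(k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

If the data were strongly skewed instead of bell-shaped, these shares could differ noticeably, which is why the rule is tied to the normal distribution.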

I won’t be diving deep into descriptive statistics, as only the basic knowledge is needed to understand how to go about analyzing a dataset. At the end of the series, I will be sharing key resources for further understanding. Until then, let’s take baby steps.

So this is a wrap on day 3. See you around on day 4.