UNIT III: Exploratory Data Analysis (EDA) – Part I

Domain Label: college courses Published: April 01, 2026 Creator: Sumit Haldar Est. Read: 4 min read

Topics covered

Descriptive Statistics: The Basics

Descriptive statistics help summarize and describe the essential features of a dataset. They are generally divided into Measures of Central Tendency (where the data centers) and Measures of Dispersion (how spread out the data is).

1. Measures of Central Tendency

Mean (Average): The sum of all values divided by the total number of values. It is sensitive to outliers (extreme values).

Median: The middle value when the data is sorted in ascending or descending order. If there is an even number of observations, it is the average of the two middle numbers. It is "robust," meaning it isn't heavily affected by outliers.

Mode: The value that appears most frequently in a dataset. A dataset can have one mode, multiple modes (bimodal/multimodal), or no mode at all.
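The three measures above can be sketched with Python's standard `statistics` module. This is a minimal illustration, not part of the original notes: the numbers are made up, and `statistics.multimode` assumes Python 3.8 or newer.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
values = [10, 12, 12, 13, 15]
with_outlier = values + [100]  # same data plus one extreme value

# Mean: sum of the values divided by how many there are.
mean_clean = statistics.mean(values)            # 62 / 5
mean_outlier = statistics.mean(with_outlier)    # 162 / 6

# Median: middle value of the sorted data. With an even count,
# it is the average of the two middle values.
median_clean = statistics.median(values)
median_outlier = statistics.median(with_outlier)

# Mode: most frequent value(s). multimode returns every value
# tied for the highest count, so it also covers the bimodal case.
modes = statistics.multimode(with_outlier)
bimodal = statistics.multimode([1, 1, 2, 2, 3])

print("mean  :", mean_clean, "->", mean_outlier)
print("median:", median_clean, "->", median_outlier)
print("modes :", modes, "| bimodal example:", bimodal)
```

In this sample the single outlier moves the mean from 12.4 to 27, while the median only moves from 12 to 12.5, which is what "robust" means in practice.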

2. Measures of Dispersion (Spread)

These metrics describe how much the observations vary from the center and from each other.

Variance (σ²): The average of the squared differences from the Mean. Squaring the differences ensures that negative deviations don't cancel out positive ones.

Standard Deviation (σ): The square root of the variance. This is the most commonly used measure of spread because it is expressed in the same units as the original data, making it easier to interpret.

Low Standard Deviation: Data points are close to the mean.
High Standard Deviation: Data points are spread out over a wider range.
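Written as formulas, the two definitions look as follows. This assumes the population form that the symbols σ² and σ suggest; the symbols N (number of observations), x_i (the i-th value) and μ (the mean) are introduced here for illustration.

$$
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2, \qquad \sigma = \sqrt{\sigma^2}
$$

As a small worked example with made-up values 2, 4, 4, 4, 5, 5, 7, 9: the mean is μ = 40 / 8 = 5, the squared differences from the mean are 9, 1, 1, 1, 0, 0, 4, 16, and they sum to 32.

$$
\sigma^2 = \frac{32}{8} = 4, \qquad \sigma = \sqrt{4} = 2
$$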

Example >>>

Coding in pandas:
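The original example is not available here, so what follows is a minimal sketch with made-up values, assuming pandas is installed. One detail worth knowing: `Series.var()` and `Series.std()` default to `ddof=1` (the sample form, dividing by N - 1), so `ddof=0` is passed to match the "average of the squared differences" definition given above.

```python
import pandas as pd

# Hypothetical values, chosen only for illustration.
scores = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# Central tendency
mean = scores.mean()
median = scores.median()
mode = scores.mode().tolist()  # mode() returns a Series: there may be ties

# Dispersion, population form (divide by N)
pop_var = scores.var(ddof=0)
pop_std = scores.std(ddof=0)

# Dispersion, sample form (divide by N - 1): the pandas default
sample_var = scores.var()
sample_std = scores.std()

print(f"mean={mean}, median={median}, mode={mode}")
print(f"population: variance={pop_var:.3f}, std={pop_std:.3f}")
print(f"sample    : variance={sample_var:.3f}, std={sample_std:.3f}")
```

For a quick overview, `scores.describe()` reports the count, mean, sample standard deviation, minimum, quartiles and maximum in one call.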


What are Quantiles?

Quantiles are cut points that divide a sorted dataset into equal-sized groups. The most common type is quartiles, which divide the data into four equal parts (25% of the observations in each).

$Q_1$ (First Quartile / 25th Percentile): The middle number between the smallest value and the median. 25% of the data falls below this point.

$Q_2$ (Second Quartile / 50th Percentile): This is exactly the Median. 50% of the data falls below this point.

$Q_3$ (Third Quartile / 75th Percentile): The middle value between the median and the highest value. 75% of the data falls below this point.
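The three quartiles can be computed with `Series.quantile`. This is a sketch with made-up values, assuming pandas is installed. By default pandas interpolates linearly between neighboring values, so on some datasets its result differs slightly from the "middle of each half" rule described above.

```python
import pandas as pd

# Hypothetical, deliberately unsorted values: quantile() sorts internally.
data = pd.Series([23, 4, 61, 15, 42, 8, 70, 16, 50])

q1 = data.quantile(0.25)  # 25% of the data falls below this point
q2 = data.quantile(0.50)  # the median
q3 = data.quantile(0.75)  # 75% of the data falls below this point

print("sorted:", sorted(data.tolist()))
print(f"Q1={q1}, Q2={q2}, Q3={q3}")
print("Q2 equals the median:", q2 == data.median())
```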

What is IQR (Interquartile Range)?

The IQR is the distance between the third and first quartiles. It covers the middle 50% of your data. Unlike the "Range" (Max - Min), the IQR is not affected by extreme outliers.
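The contrast between the IQR and the range is easy to check in code. This is a sketch with made-up values, assuming pandas is installed; the helper `iqr` is a hypothetical name defined here, not a pandas function.

```python
import pandas as pd


def iqr(series: pd.Series) -> float:
    """Interquartile range: third quartile minus first quartile."""
    return series.quantile(0.75) - series.quantile(0.25)


# Two hypothetical datasets that differ only in their largest value.
typical = pd.Series([10, 12, 13, 15, 16, 18, 20, 22, 25])
with_outlier = pd.Series([10, 12, 13, 15, 16, 18, 20, 22, 200])

for name, s in [("typical", typical), ("with outlier", with_outlier)]:
    value_range = s.max() - s.min()
    print(f"{name:13s} range={value_range:4d}  IQR={iqr(s)}")
```

Replacing the largest value with an extreme one stretches the range from 15 to 190, while the IQR stays at 7 because the quartiles do not depend on the extremes.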

Formula: