UNIT III: Exploratory Data Analysis (EDA) – Part I




Descriptive Statistics: The Basics

Descriptive statistics help summarize and describe the essential features of a dataset. They are generally divided into Measures of Central Tendency (where the data centers) and Measures of Dispersion (how spread out the data is).

1. Measures of Central Tendency

Mean (Average): The sum of all values divided by the total number of values. It is sensitive to outliers (extreme values).


Median: The middle value when the data is sorted in ascending or descending order. If there is an even number of observations, it is the average of the two middle numbers. It is "robust," meaning it isn't heavily affected by outliers.

Mode: The value that appears most frequently in a dataset. A dataset can have one mode, multiple modes (bimodal/multimodal), or no mode at all.
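
The three measures above can be sketched in pandas as follows. The salary figures (in thousands) and variable names are made up for illustration; the point is only to show that one extreme value shifts the mean far more than the median.

```python
import pandas as pd

# Hypothetical salaries in thousands; the second Series adds one extreme value.
salaries = pd.Series([30, 35, 35, 40, 45])
with_outlier = pd.Series([30, 35, 35, 40, 45, 500])

print(salaries.mean(), salaries.median())          # 37.0 35.0
print(with_outlier.mean(), with_outlier.median())  # mean jumps, median barely moves

# mode() returns a Series because a dataset can have several modes.
print(salaries.mode().tolist())                    # [35]
print(pd.Series([1, 1, 2, 2, 3]).mode().tolist())  # [1, 2] (bimodal)
```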


2. Measures of Dispersion (Spread)

These metrics describe how much the observations vary from the center and from each other:

Variance (σ²): The average of the squared differences from the Mean. Squaring the differences ensures that negative deviations don't cancel out positive ones.

Standard Deviation (σ): The square root of the variance. This is the most commonly used measure of spread because it is expressed in the same units as the original data, making it easier to interpret.

  • Low Standard Deviation: Data points are close to the mean.
  • High Standard Deviation: Data points are spread out over a wider range.

Example >>>


Coding in pandas:
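
A minimal sketch of variance and standard deviation in pandas, assuming two invented datasets that share the same mean but differ in spread. Note that pandas uses the sample formula (dividing by n - 1) unless ddof=0 is passed, so the population definition given above needs ddof=0.

```python
import pandas as pd

# Two hypothetical datasets with the same mean (50) but different spread.
tight = pd.Series([48, 49, 50, 51, 52])
wide = pd.Series([10, 30, 50, 70, 90])

# ddof=0 divides by n (population formula, matching the definition above).
tight_var = tight.var(ddof=0)   # 2.0
wide_var = wide.var(ddof=0)     # 800.0
tight_std = tight.std(ddof=0)   # square root of the variance
wide_std = wide.std(ddof=0)

# pandas defaults to ddof=1 (sample formula, divides by n - 1).
tight_sample_var = tight.var()  # 2.5

print(f"tight: var={tight_var}, std={tight_std:.2f}")
print(f"wide:  var={wide_var}, std={wide_std:.2f}")
```

The low standard deviation of the first dataset reflects values that sit close to the mean; the second dataset has the same mean but a much larger standard deviation.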

What are Quantiles?

Quantiles are cut points that divide a sorted dataset into equal-sized groups. The most common type is Quartiles, which divide the data into four equal parts (25% each).

Q₁ (First Quartile / 25th Percentile): The middle number between the smallest value and the median. 25% of the data falls below this point.

Q₂ (Second Quartile / 50th Percentile): This is exactly the Median. 50% of the data falls below this point.

Q₃ (Third Quartile / 75th Percentile): The middle value between the median and the highest value. 75% of the data falls below this point.

What is IQR (Interquartile Range)?

The IQR is the distance between the third and first quartile. It represents the middle 50% of your data. Unlike the "Range" (Max - Min), the IQR is not affected by extreme outliers.

Formula:

IQR = Q₃ - Q₁

Example >>>


Coding Part >>>
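
One possible way to compute quartiles and the IQR in pandas is sketched below, using invented values with a deliberate outlier. By default Series.quantile() interpolates linearly between data points, so its results can differ slightly from the "middle of each half" method used when working by hand.

```python
import pandas as pd

# Hypothetical data; 100 is an obvious outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1 = values.quantile(0.25)   # first quartile
q2 = values.quantile(0.50)   # second quartile, same as the median
q3 = values.quantile(0.75)   # third quartile

iqr = q3 - q1                              # spread of the middle 50%
data_range = values.max() - values.min()   # dominated by the outlier

print(q1, q2, q3)        # 3.0 5.0 7.0
print(iqr, data_range)   # 4.0 99
```

The range is inflated to 99 by the single outlier, while the IQR stays at 4.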

Frequency Tables

A frequency table is a tabular representation that shows the number of occurrences (frequency) of each distinct value in a dataset. It transforms raw, unorganized data into a clear summary.

Components of a Frequency Table:

  • Categories: The unique labels or groups (e.g., Blood Type: A, B, AB, O).
  • Frequency (f): The raw count of how many times each category appears.
  • Relative Frequency: The proportion of the total (Frequency / Total n).
  • Cumulative Frequency: The running total of frequencies.

Value Counts (The Pythonic Way)

In Python, specifically within the pandas library, the .value_counts() method is a concise way to generate a frequency table from a Series or a DataFrame column.
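
The sketch below builds a frequency table with the components listed earlier (frequency, relative frequency, cumulative frequency). The blood-type sample and the column names are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample of blood types.
blood = pd.Series(["A", "O", "B", "O", "A", "O", "AB", "O"], name="blood_type")

# Raw counts, sorted from most to least frequent.
freq = blood.value_counts()

# normalize=True returns proportions instead of counts.
rel_freq = blood.value_counts(normalize=True)

# Assemble a small frequency table from the counts.
table = freq.to_frame("frequency")
table["relative_frequency"] = table["frequency"] / table["frequency"].sum()
table["cumulative_frequency"] = table["frequency"].cumsum()

print(table)
```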





Covariance: Direction of the Relationship

Covariance measures the extent to which two variables change together. It indicates the direction of the linear relationship.

  • Positive Covariance: Both variables tend to move in the same direction (if X goes up, Y goes up).
  • Negative Covariance: Variables move in opposite directions (if X goes up, Y goes down).
  • Zero Covariance: No linear relationship exists.
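
As a rough illustration of the sign of covariance, the sketch below computes the sample covariance by hand and compares it with Series.cov(). The study-hours and exam-score numbers are invented for the example.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])        # hypothetical hours studied
y = pd.Series([52, 58, 61, 70, 74])   # hypothetical exam scores

# Sample covariance: sum of products of deviations, divided by n - 1.
manual_cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

# pandas uses the same sample formula.
pandas_cov = x.cov(y)

# Flipping one variable reverses the direction of the relationship.
negative_cov = x.cov(-y)

print(manual_cov, pandas_cov, negative_cov)   # 14.0 14.0 -14.0
```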







Correlation: Strength and Direction

Correlation is a "normalized" version of covariance. It divides the covariance by the product of the two variables' standard deviations, which scales the measure to a fixed range between -1 and +1 and makes it independent of the units of measurement. The most common type is the Pearson Correlation Coefficient.

  • +1: Perfect positive linear relationship.
  • -1: Perfect negative linear relationship.
  • 0: No linear relationship.




coding part >>>
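
A minimal sketch, assuming a small invented DataFrame, of how correlation stays the same when the units change while covariance does not. Series.corr() and DataFrame.corr() compute the Pearson coefficient by default.

```python
import pandas as pd

# Hypothetical data: study time and exam score for five students.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 61, 70, 74],
})

cov_hours = df["hours_studied"].cov(df["exam_score"])
corr_hours = df["hours_studied"].corr(df["exam_score"])   # Pearson by default

# Express the same study time in minutes instead of hours.
minutes = df["hours_studied"] * 60
cov_minutes = minutes.cov(df["exam_score"])     # 60 times larger
corr_minutes = minutes.corr(df["exam_score"])   # unchanged

print(cov_hours, cov_minutes)
print(round(corr_hours, 3), round(corr_minutes, 3))

# Pairwise correlation matrix for all numeric columns.
print(df.corr())
```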



Data summary reports

In data analysis, a Data Summary Report is the bridge between raw data and actionable insights. It provides a high-level overview of a dataset’s characteristics, helping you identify patterns, detect anomalies (outliers), and understand the distribution of variables before moving into deeper modeling.

A comprehensive report typically breaks down data into two main types of statistics:

A. Measures of Central Tendency

These describe the "center" or typical value of your data:

  • Mean: The average value.
  • Median: The middle value when data is sorted (useful for skewed data).
  • Mode: The most frequent value (essential for categorical data).

B. Measures of Dispersion (Spread)

These describe how "spread out" the data points are:

  • Standard Deviation: How much values deviate from the mean.
  • Variance: The squared deviation from the mean.
  • Range: The difference between the maximum and minimum values.
  • Interquartile Range (IQR): The range between the 25th and 75th percentiles, representing the "middle 50%" of the data.

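.describe() Method: Returns summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numeric columns of a DataFrame.
.info() Method: Prints each column's name, data type, and non-null count, along with the DataFrame's memory usage.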

coding part >>>
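
The sketch below shows one way to produce a quick summary report with .describe() and .info(). The DataFrame contents and column names are hypothetical.

```python
import pandas as pd

# Hypothetical dataset with one numeric and one categorical column.
df = pd.DataFrame({
    "age": [23, 35, 31, 52, 46],
    "city": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi"],
})

# Numeric columns only by default: count, mean, std, min, 25%, 50%, 75%, max.
summary = df.describe()
print(summary)

# include="all" also summarizes non-numeric columns (unique, top, freq).
full_summary = df.describe(include="all")
print(full_summary)

# info() prints column names, dtypes, non-null counts, and memory usage.
df.info()
```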



Skewness: The Asymmetry

Skewness measures the lack of symmetry in a probability distribution. It tells you which way the "tail" of the data is pointing.

  • Zero Skew: The distribution is perfectly symmetrical (e.g., a Normal Distribution). The Mean, Median, and Mode are all equal.
  • Positive Skew (Right-Skewed): The tail on the right side is longer. Most data points are clustered on the left.
      • Example: Household income (a few billionaires pull the "tail" to the right).
      • Relation (typical): Mean > Median > Mode
  • Negative Skew (Left-Skewed): The tail on the left side is longer. Most data points are clustered on the right.
      • Example: Age of retirement (most people retire late, a few retire very early).
      • Relation (typical): Mean < Median < Mode
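
To make the sign of skewness concrete, here is a small sketch using Series.skew() on three invented datasets. The numbers are chosen only so that the direction of the tail is obvious.

```python
import pandas as pd

# Hypothetical datasets: symmetric, long right tail, long left tail.
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 2, 2, 3, 20])
left_skewed = -right_skewed   # mirror image of the right-skewed data

sym_skew = symmetric.skew()       # approximately 0
right_skew = right_skewed.skew()  # positive
left_skew = left_skewed.skew()    # negative, same magnitude

print(sym_skew, right_skew, left_skew)

# In the right-skewed data the long tail pulls the mean above the median.
print(right_skewed.mean(), right_skewed.median())
```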



Kurtosis: The "Peakedness" and Tails

Kurtosis measures the "fatness" of the tails and the sharpness of the peak. It tells you how much of your data sits in the extremes (outliers) versus the center.

  • Mesokurtic (Kurtosis ≈ 3 or Excess = 0): This is the Normal Distribution.
  • Leptokurtic (High Kurtosis / Positive Excess): The distribution has a very sharp, thin peak and fat tails. This indicates a high presence of outliers.
      • Risk: In finance, high kurtosis means a higher chance of extreme "black swan" events.
  • Platykurtic (Low Kurtosis / Negative Excess): The distribution is flat and spread out, with a broad peak and thin tails. Data is spread more evenly.


coding part >>>
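
A minimal sketch of kurtosis in pandas, using two invented datasets. Series.kurt() reports excess kurtosis (a normal distribution scores about 0, not 3), so positive values suggest fat tails and negative values suggest a flatter, thin-tailed shape.

```python
import pandas as pd

# Hypothetical flat data: values spread evenly, no extremes.
flat = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Hypothetical heavy-tailed data: most values at the center, two extremes.
heavy_tailed = pd.Series([-10, 0, 0, 0, 0, 0, 0, 0, 0, 10])

flat_kurt = flat.kurt()            # negative excess (platykurtic)
heavy_kurt = heavy_tailed.kurt()   # positive excess (leptokurtic)

print(flat_kurt, heavy_kurt)
```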


examples>>>







