UNIT IV: Exploratory Data Analysis (EDA) – Part II

Domain Label: college courses Published: April 01, 2026 Creator: Sumit Haldar Est. Read: 4 min

Topics Covered

Types of Analysis

Analysis is categorized by the number of variables involved: univariate (one variable), bivariate (two variables), and multivariate (three or more variables).

Distribution Analysis

What is a Histogram?

A histogram takes a large set of data points and groups them into logical ranges called "bins."
* The Bins (X-axis): These represent the intervals of the data (e.g., age groups 0–10, 11–20, etc.).
* The Frequency (Y-axis): This shows how many data points fall into each bin.
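The result is a series of rectangles whose area is proportional to the number of data points in each bin (with equal-width bins, the height alone shows the frequency). Because the underlying variable is continuous, the bars touch each other, unlike the separated bars of a categorical bar chart.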
In the context of data science and EDA, the importance of a histogram can be summarized into four key points:
* Checks Distribution: It reveals whether your data roughly follows a "Normal Distribution." Several statistical methods assume normality somewhere (Linear Regression, for example, assumes normally distributed residuals for its standard inference), so the shape is worth checking early.
* Spots Outliers: It visually highlights "lonely" bars far from the main group, helping you identify errors or extreme values that could skew your model.
* Identifies Skewness: It shows if your data is "leaning" to one side (left or right). This tells you whether you may need to transform the data (e.g., using a Log transform) before training a model.
* Reveals Data Spread: It provides an instant look at the range and variance of your dataset, showing whether your values are tightly packed or widely scattered.
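The points above can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy and Matplotlib are installed; the log-normal sample is made-up data chosen only because it is right-skewed, and the bin count of 20 is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; remove it to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=0)
# Hypothetical sample: 1,000 right-skewed values drawn from a log-normal distribution
values = rng.lognormal(mean=3.0, sigma=0.5, size=1000)

fig, (ax_raw, ax_log) = plt.subplots(1, 2, figsize=(10, 4))

# hist() returns the frequency per bin, the bin edges, and the drawn bars
counts, edges, _ = ax_raw.hist(values, bins=20, edgecolor="black")
ax_raw.set(title="Raw values", xlabel="Value", ylabel="Frequency")

# The same data after a log transform: the long right tail is pulled in
ax_log.hist(np.log(values), bins=20, edgecolor="black")
ax_log.set(title="Log-transformed values", xlabel="log(value)", ylabel="Frequency")

fig.tight_layout()

# A mean above the median is a quick numeric hint of right skew
print(f"mean={values.mean():.2f}, median={np.median(values):.2f}")
```

Comparing the two panels side by side is one simple way to judge whether a transform is worth applying; other bin counts may show the shape more or less clearly.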
Box Plot (Whisker Plot)
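Definition: A graphical representation of the five-number summary of a dataset: Minimum, First Quartile (Q1), Median (Q2), Third Quartile (Q3), and Maximum. It uses a "box" to represent the middle 50% of the data (from Q1 to Q3) and "whiskers" to show the rest. Most plotting libraries cap the whiskers at 1.5 × IQR beyond the box and draw anything farther out as individual outlier points.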
Short Importance:
* Outlier Detection: Visually isolates data points that fall outside the typical range.
* Comparison: Easily compares the spread and medians of different categories side-by-side.
* Skewness: Shows if data is symmetrical or pushed toward one end.
Short Code (Python):
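The original snippet is not available here, so the following is only a minimal sketch of what such code might look like, assuming NumPy and Matplotlib. The sample values are made up, with 60 planted as an outlier, and the 1.5 × IQR fences follow the common convention rather than anything specific to this course.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove it to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sample; 60 is planted as an obvious outlier
data = np.array([12, 15, 17, 19, 20, 21, 22, 24, 25, 27, 60])

# Five-number summary: min, Q1, median, Q3, max
q1, median, q3 = np.percentile(data, [25, 50, 75])
print("min:", data.min(), "Q1:", q1, "median:", median, "Q3:", q3, "max:", data.max())

# Common 1.5 * IQR rule for flagging outliers
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]
print("outliers:", outliers)

fig, ax = plt.subplots(figsize=(4, 5))
ax.boxplot(data)  # Matplotlib's default whis=1.5 applies the same rule when drawing
ax.set(title="Box plot of the sample", ylabel="Value")
```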
Pair Plot
Definition:
A matrix of scatter plots that visualizes the relationship between every pair of numerical variables in a dataset. The diagonal usually shows a histogram or KDE (kernel density estimate) to represent the distribution of each single variable.

Short Importance:

Feature Correlation: Quickly identifies which variables have a linear or non-linear relationship.
Cluster Discovery: Helps spot distinct groupings or clusters within the data.
Multivariate Insight: Moves beyond looking at one variable to seeing how the entire "system" of data interacts.
Short Code (Python):

Heatmap
Definition: A two-dimensional representation of data where values are depicted by colors. In EDA, it is most commonly used to visualize a Correlation Matrix.

Short Importance:
* Feature Selection: Quickly shows which variables are redundant (highly correlated with each other) and which features correlate most strongly with the target variable.
* Complexity Management: Summarizes relationships between dozens of variables in a single, color-coded grid.
* Pattern Recognition: High-intensity colors immediately draw the eye to the strongest relationships.
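As a rough illustration of a correlation heatmap, the sketch below assumes seaborn, pandas, NumPy, and Matplotlib. The three features are synthetic and their names are hypothetical: `feature_b` is built to be nearly redundant with `feature_a`, and `feature_c` is independent noise.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove it to use plt.show()
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(seed=2)
n = 300
x = rng.normal(size=n)

# Hypothetical features for illustration only
df = pd.DataFrame({
    "feature_a": x,
    "feature_b": 2 * x + rng.normal(scale=0.5, size=n),  # nearly redundant with feature_a
    "feature_c": rng.normal(size=n),                      # unrelated noise
})

corr = df.corr()  # Pearson correlation by default

fig, ax = plt.subplots(figsize=(5, 4))
# Fixing vmin/vmax to -1 and 1 keeps the color scale comparable across datasets
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation matrix")
fig.tight_layout()
```

A diverging color map such as `coolwarm` is a common choice here because it separates positive from negative correlations at a glance; it is a stylistic preference, not a requirement.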

Scatter Plot
Definition: A plot that uses dots to represent the values of two different numerical variables. One variable is plotted on the horizontal axis (X) and the other on the vertical axis (Y).
Short Importance:
* Correlation: Shows if variables move together (positive), in opposite directions (negative), or not at all.
* Patterns: Helps identify clusters or non-linear shapes (like curves) in the data.
* Individual Points: Allows you to see every single data point, making it easy to spot specific anomalies.
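A scatter plot along these lines might be produced as follows. This is a sketch assuming NumPy and Matplotlib; the "study hours" and "exam score" variables are made-up data with a built-in positive relationship.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove it to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=3)

# Hypothetical data: exam score rises with study hours, plus random noise
hours = rng.uniform(0, 10, size=100)
score = 40 + 5 * hours + rng.normal(scale=6, size=100)

# Pearson correlation coefficient summarizes the direction and strength numerically
r = np.corrcoef(hours, score)[0, 1]

fig, ax = plt.subplots()
ax.scatter(hours, score, alpha=0.7)  # one dot per observation
ax.set(title=f"Study hours vs. exam score (r = {r:.2f})",
       xlabel="Study hours (X)", ylabel="Exam score (Y)")
```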

Trend Lines (Regression Lines)

Definition: A line drawn through the data points on a scatter plot to represent the general direction or "best fit" of the relationship.

Short Importance:
* Simplification: Smooths out the "noise" of individual dots to show the underlying movement.
* Prediction: Provides a mathematical basis to estimate the value of Y for a given X.
* Strength: The closer the dots are to the line, the stronger the relationship between variables.
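One simple way to draw a trend line is an ordinary least-squares straight-line fit. The sketch below assumes NumPy and Matplotlib and uses `np.polyfit` with degree 1; the data and the helper name `predict` are hypothetical, and other fitting methods would work equally well.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove it to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=4)

# Hypothetical data generated around the line y = 2x + 1
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

# Degree-1 polynomial fit returns the slope and intercept of the best-fit line
slope, intercept = np.polyfit(x, y, deg=1)

def predict(x_new):
    """Estimate Y for a given X using the fitted line."""
    return slope * x_new + intercept

# Correlation coefficient: values near +1 or -1 mean the dots hug the line
r = np.corrcoef(x, y)[0, 1]

fig, ax = plt.subplots()
ax.scatter(x, y, alpha=0.7, label="Data points")
ax.plot(x, predict(x), color="red", label=f"Fit: y = {slope:.2f}x + {intercept:.2f}")
ax.set(title=f"Scatter plot with trend line (r = {r:.2f})", xlabel="X", ylabel="Y")
ax.legend()
```

Calling `predict(5.0)` gives the fitted estimate of Y at X = 5, which corresponds to the "Prediction" point above.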

When performing Exploratory Data Analysis (EDA), the choice of visualization is determined by the data types of the variables you are analyzing.
The following table summarizes the best plots to use based on the combination of Categorical (labels/groups) and Numerical (continuous numbers) variables.

In Exploratory Data Analysis, the final step is interpreting your visualizations to find Patterns (the rules the data follows) and Anomalies (the exceptions to those rules).