Numerical vs Numerical Relationships
When analyzing relationships between two numerical variables, we look for patterns that reveal how one variable changes as the other changes. This is the foundation for understanding correlation and building predictive relationships in your data (keep in mind that correlation alone never establishes causation). Scatter plots are your primary tool, while correlation coefficients quantify the strength and direction of linear relationships.
Bivariate Analysis
Bivariate analysis examines the relationship between exactly two variables. For numerical-numerical pairs, we assess whether the variables move together (positive correlation), move in opposite directions (negative correlation), or show no apparent relationship.
Scatter Plots - The Foundation
Scatter plots display each observation as a point positioned by its x and y values. Patterns in the point cloud reveal the relationship type and strength.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Create sample data with different relationships
np.random.seed(42)
n = 200
df = pd.DataFrame({
'study_hours': np.random.uniform(1, 10, n),
'experience_years': np.random.uniform(0, 15, n),
'age': np.random.uniform(22, 60, n)
})
df['exam_score'] = 40 + 5 * df['study_hours'] + np.random.normal(0, 8, n)
df['salary'] = 30000 + 4000 * df['experience_years'] + np.random.normal(0, 5000, n)
# Basic scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(df['study_hours'], df['exam_score'], alpha=0.6)
plt.xlabel('Study Hours')
plt.ylabel('Exam Score')
plt.title('Study Hours vs Exam Score')
plt.show()
Seaborn Scatter Plots with Regression Lines
Adding a regression line helps visualize the linear trend. Seaborn's regplot() combines scatter and regression in one call.
# Scatter plot with regression line
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Positive correlation
sns.regplot(x='study_hours', y='exam_score', data=df, ax=axes[0],
scatter_kws={'alpha': 0.5}, line_kws={'color': 'red'})
axes[0].set_title('Positive Correlation: Study Hours vs Score')
# Another positive correlation
sns.regplot(x='experience_years', y='salary', data=df, ax=axes[1],
scatter_kws={'alpha': 0.5}, line_kws={'color': 'red'})
axes[1].set_title('Positive Correlation: Experience vs Salary')
plt.tight_layout()
plt.show()
Calculating Correlation Coefficients
The Pearson correlation coefficient (r) quantifies linear relationships on a scale from -1 to +1.
from scipy import stats
# Pearson correlation
r_study, p_study = stats.pearsonr(df['study_hours'], df['exam_score'])
print(f"Study vs Score: r = {r_study:.3f}, p-value = {p_study:.4f}")
r_exp, p_exp = stats.pearsonr(df['experience_years'], df['salary'])
print(f"Experience vs Salary: r = {r_exp:.3f}, p-value = {p_exp:.4f}")
# Interpretation (matching the table below)
# |r| < 0.2: Very weak or none
# 0.2 <= |r| < 0.4: Weak
# 0.4 <= |r| < 0.7: Moderate
# |r| >= 0.7: Strong (|r| >= 0.9: very strong)
| Correlation Value | Strength | Interpretation |
|---|---|---|
| 0.9 to 1.0 (or -0.9 to -1.0) | Very Strong | Variables move almost perfectly together |
| 0.7 to 0.9 (or -0.7 to -0.9) | Strong | Clear linear pattern visible |
| 0.4 to 0.7 (or -0.4 to -0.7) | Moderate | Noticeable trend with scatter |
| 0.2 to 0.4 (or -0.2 to -0.4) | Weak | Slight trend, lots of variation |
| 0 to 0.2 (or 0 to -0.2) | Very Weak/None | No apparent linear relationship |
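The bands in this table translate directly into a small helper. This function is illustrative (not part of any library); it simply encodes the thresholds above:

```python
def correlation_strength(r):
    """Map a correlation coefficient to the strength labels in the table above."""
    a = abs(r)
    if a >= 0.9:
        return 'Very Strong'
    elif a >= 0.7:
        return 'Strong'
    elif a >= 0.4:
        return 'Moderate'
    elif a >= 0.2:
        return 'Weak'
    return 'Very Weak/None'

print(correlation_strength(0.85))   # Strong
print(correlation_strength(-0.25))  # Weak
```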
Pearson vs Spearman Correlation
Use Pearson for linear relationships and Spearman for monotonic (consistently increasing or decreasing) relationships, especially with outliers.
# Create data with outliers
data_outliers = pd.DataFrame({
'x': list(range(1, 21)) + [25],
'y': list(range(1, 21)) + [100] # Outlier at (25, 100)
})
# Compare correlations
pearson_r, _ = stats.pearsonr(data_outliers['x'], data_outliers['y'])
spearman_r, _ = stats.spearmanr(data_outliers['x'], data_outliers['y'])
print(f"Pearson r: {pearson_r:.3f}") # Affected by outlier
print(f"Spearman r: {spearman_r:.3f}") # More robust
# When to use which:
# Pearson: Linear relationships, normally distributed data
# Spearman: Non-linear monotonic, ordinal data, outliers present
Practice: Numerical Relationships
Task: Given height and weight data: height = [160, 165, 170, 175, 180, 185, 190] and weight = [55, 62, 68, 73, 80, 85, 92], create a scatter plot with a regression line using seaborn.
Show Solution
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
height = [160, 165, 170, 175, 180, 185, 190]
weight = [55, 62, 68, 73, 80, 85, 92]
df = pd.DataFrame({'height': height, 'weight': weight})
plt.figure(figsize=(8, 6))
sns.regplot(x='height', y='weight', data=df,
scatter_kws={'s': 100}, line_kws={'color': 'red'})
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.title('Height vs Weight Relationship')
plt.show()
Task: Generate random data with np.random.seed(42); x = np.random.randn(100); y = 2*x + np.random.randn(100)*0.5. Calculate Pearson correlation, print the r value and p-value, and state whether the correlation is significant (p < 0.05).
Show Solution
import numpy as np
from scipy import stats
np.random.seed(42)
x = np.random.randn(100)
y = 2 * x + np.random.randn(100) * 0.5
r, p_value = stats.pearsonr(x, y)
print(f"Pearson r: {r:.4f}")
print(f"P-value: {p_value:.2e}")
print(f"Significant: {'Yes' if p_value < 0.05 else 'No'}")
print(f"Correlation strength: {'Strong' if abs(r) > 0.7 else 'Moderate' if abs(r) > 0.4 else 'Weak'}")
Task: Create data: x = list(range(1, 11)) + [15] and y = list(range(1, 11)) + [50]. Calculate both Pearson and Spearman correlations, create a scatter plot, and explain in a print statement which correlation is more robust to the outlier.
Show Solution
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
x = list(range(1, 11)) + [15]
y = list(range(1, 11)) + [50]
pearson_r, _ = stats.pearsonr(x, y)
spearman_r, _ = stats.spearmanr(x, y)
plt.figure(figsize=(8, 6))
plt.scatter(x, y, s=100)
plt.scatter([15], [50], color='red', s=150, label='Outlier')
plt.xlabel('X')
plt.ylabel('Y')
plt.title(f'Pearson: {pearson_r:.3f}, Spearman: {spearman_r:.3f}')
plt.legend()
plt.show()
print(f"Pearson r: {pearson_r:.3f}")
print(f"Spearman r: {spearman_r:.3f}")
print("Spearman is more robust because it uses ranks, not actual values.")
Numerical vs Categorical Relationships
Comparing how a numerical variable differs across categorical groups is fundamental to understanding your data. This analysis helps answer questions like "Do salaries differ by department?" or "Is customer spending different across age groups?". Box plots, violin plots, and grouped statistics are your primary tools for this type of analysis.
Box Plots by Category
Box plots show the distribution of a numerical variable for each category, making it easy to compare medians, spread, and outliers across groups.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Create sample employee data
np.random.seed(42)
n = 300
df = pd.DataFrame({
'department': np.random.choice(['Sales', 'Engineering', 'Marketing', 'HR'], n),
'salary': np.random.normal(70000, 15000, n),
'performance': np.random.choice(['Low', 'Medium', 'High'], n)
})
# Adjust salary by department
df.loc[df['department'] == 'Engineering', 'salary'] *= 1.3
df.loc[df['department'] == 'Sales', 'salary'] *= 1.1
# Box plot by department
plt.figure(figsize=(10, 6))
sns.boxplot(x='department', y='salary', data=df, palette='viridis')
plt.title('Salary Distribution by Department')
plt.xlabel('Department')
plt.ylabel('Salary ($)')
plt.show()
Violin Plots - Distribution Shape
Violin plots combine box plots with kernel density estimation, showing the full distribution shape for each category.
# Violin plot shows distribution shape
plt.figure(figsize=(10, 6))
sns.violinplot(x='department', y='salary', data=df, palette='muted')
plt.title('Salary Distribution by Department (Violin)')
plt.xlabel('Department')
plt.ylabel('Salary ($)')
plt.show()
# Nested violins by a second category (split=True requires exactly two hue levels)
plt.figure(figsize=(12, 6))
sns.violinplot(x='department', y='salary', hue='performance',
data=df, split=False, palette='Set2')
plt.title('Salary by Department and Performance')
plt.legend(title='Performance')
plt.show()
Strip and Swarm Plots
These plots show individual data points, useful when you want to see the actual distribution rather than a summary.
# Strip plot - jittered points
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.stripplot(x='department', y='salary', data=df,
ax=axes[0], alpha=0.5, jitter=True)
axes[0].set_title('Strip Plot: Individual Points')
# Swarm plot - non-overlapping points
sns.swarmplot(x='department', y='salary', data=df.sample(100),
ax=axes[1], alpha=0.7)
axes[1].set_title('Swarm Plot: Non-overlapping Points')
plt.tight_layout()
plt.show()
Group Statistics with groupby()
Calculate summary statistics for each group to quantify the differences you see in plots.
# Summary statistics by group
group_stats = df.groupby('department')['salary'].agg([
'count', 'mean', 'median', 'std', 'min', 'max'
]).round(0)
print(group_stats)
# Multiple aggregations
detailed_stats = df.groupby('department').agg({
'salary': ['mean', 'std'],
}).round(0)
print(detailed_stats)
When to Use Box Plots
- Comparing distributions across groups
- Identifying outliers quickly
- Showing median and quartiles
- Many categories (fits more per plot)
When to Use Violin Plots
- Need to see distribution shape
- Checking for bimodality
- Comparing density across groups
- Fewer categories (need more width)
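To see these trade-offs directly, the same data can be drawn with both plot types side by side. This sketch regenerates simplified employee data so it runs on its own:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Simplified employee data, regenerated so this snippet runs standalone
np.random.seed(42)
df = pd.DataFrame({
    'department': np.random.choice(['Sales', 'Engineering', 'Marketing', 'HR'], 300),
    'salary': np.random.normal(70000, 15000, 300)
})

# Box plot (summary statistics) next to violin plot (full distribution shape)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.boxplot(x='department', y='salary', data=df, ax=axes[0])
axes[0].set_title('Box Plot: Summary Statistics')
sns.violinplot(x='department', y='salary', data=df, ax=axes[1])
axes[1].set_title('Violin Plot: Full Distribution Shape')
plt.tight_layout()
plt.show()
```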
Statistical Tests for Group Differences
Use statistical tests to determine if observed differences are statistically significant.
from scipy import stats
# Get salary data for two departments
engineering = df[df['department'] == 'Engineering']['salary']
hr = df[df['department'] == 'HR']['salary']
# Independent t-test (2 groups)
t_stat, p_value = stats.ttest_ind(engineering, hr)
print(f"T-test: t = {t_stat:.3f}, p = {p_value:.4f}")
# ANOVA (3+ groups)
sales = df[df['department'] == 'Sales']['salary']
marketing = df[df['department'] == 'Marketing']['salary']
f_stat, p_anova = stats.f_oneway(engineering, hr, sales, marketing)
print(f"ANOVA: F = {f_stat:.3f}, p = {p_anova:.4f}")
# Interpretation
if p_anova < 0.05:
print("Significant difference exists between at least two departments")
Tip: Use sns.boxplot() followed by sns.stripplot() on the same axes to show both summary statistics and individual points.
Practice: Numerical vs Categorical
Task: Create a DataFrame with columns 'grade' (A, B, C) and 'score' (random values). Create a box plot showing score distribution for each grade.
Show Solution
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(42)
df = pd.DataFrame({
'grade': np.random.choice(['A', 'B', 'C'], 150),
'score': np.random.normal(75, 15, 150)
})
# Adjust scores by grade
df.loc[df['grade'] == 'A', 'score'] += 15
df.loc[df['grade'] == 'C', 'score'] -= 10
plt.figure(figsize=(8, 6))
sns.boxplot(x='grade', y='score', data=df, order=['A', 'B', 'C'])
plt.title('Score Distribution by Grade')
plt.xlabel('Grade')
plt.ylabel('Score')
plt.show()
Task: Using the same DataFrame, use groupby() to calculate mean, median, and std of scores for each grade. Display results rounded to 2 decimal places.
Show Solution
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({
'grade': np.random.choice(['A', 'B', 'C'], 150),
'score': np.random.normal(75, 15, 150)
})
df.loc[df['grade'] == 'A', 'score'] += 15
df.loc[df['grade'] == 'C', 'score'] -= 10
stats = df.groupby('grade')['score'].agg(['mean', 'median', 'std']).round(2)
stats = stats.reindex(['A', 'B', 'C'])
print(stats)
Task: Create data for 3 product categories with different price distributions. Perform one-way ANOVA to test if prices differ significantly. Create a violin plot with individual points overlaid.
Show Solution
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(42)
df = pd.DataFrame({
'category': ['Electronics']*50 + ['Clothing']*50 + ['Food']*50,
'price': np.concatenate([
np.random.normal(500, 100, 50),
np.random.normal(80, 30, 50),
np.random.normal(25, 10, 50)
])
})
# ANOVA test
groups = [df[df['category'] == cat]['price'] for cat in df['category'].unique()]
f_stat, p_value = stats.f_oneway(*groups)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.2e}")
print(f"Significant difference: {'Yes' if p_value < 0.05 else 'No'}")
# Combined visualization
plt.figure(figsize=(10, 6))
sns.violinplot(x='category', y='price', data=df, inner=None, color='0.9')
sns.stripplot(x='category', y='price', data=df, alpha=0.6, size=4)
plt.title(f'Price by Category (ANOVA p = {p_value:.2e})')
plt.show()
Categorical vs Categorical Relationships
Analyzing relationships between two categorical variables reveals associations and patterns in your data. Questions like "Is product preference related to age group?" or "Does customer region affect purchase category?" require categorical-categorical analysis. Contingency tables, stacked bar charts, and the chi-square test are essential tools for this analysis.
Contingency Tables (Cross-tabulation)
A contingency table shows the frequency distribution of two categorical variables simultaneously. Use pd.crosstab() to create these tables quickly.
import pandas as pd
import numpy as np
# Create sample survey data
np.random.seed(42)
n = 500
df = pd.DataFrame({
'age_group': np.random.choice(['18-25', '26-35', '36-50', '50+'], n),
'preference': np.random.choice(['Online', 'In-Store', 'Both'], n),
'region': np.random.choice(['North', 'South', 'East', 'West'], n)
})
# Basic contingency table
crosstab = pd.crosstab(df['age_group'], df['preference'])
print("Frequency Table:")
print(crosstab)
# With row percentages
crosstab_pct = pd.crosstab(df['age_group'], df['preference'], normalize='index') * 100
print("\nRow Percentages:")
print(crosstab_pct.round(1))
Stacked and Grouped Bar Charts
Visualize contingency tables using bar charts to see patterns in category combinations.
import matplotlib.pyplot as plt
import seaborn as sns
# Stacked bar chart
crosstab = pd.crosstab(df['age_group'], df['preference'])
crosstab.plot(kind='bar', stacked=True, figsize=(10, 6), colormap='viridis')
plt.title('Shopping Preference by Age Group (Stacked)')
plt.xlabel('Age Group')
plt.ylabel('Count')
plt.legend(title='Preference')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# Grouped bar chart
crosstab.plot(kind='bar', figsize=(10, 6), colormap='viridis')
plt.title('Shopping Preference by Age Group (Grouped)')
plt.xlabel('Age Group')
plt.ylabel('Count')
plt.legend(title='Preference')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Seaborn Count Plots with Hue
Use sns.countplot() with the hue parameter for a quick grouped visualization.
# Count plot with hue
plt.figure(figsize=(10, 6))
sns.countplot(x='age_group', hue='preference', data=df, palette='Set2')
plt.title('Shopping Preference Distribution by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Count')
plt.legend(title='Preference')
plt.show()
# Proportions within each age group
fig, ax = plt.subplots(figsize=(10, 6))
crosstab_pct = pd.crosstab(df['age_group'], df['preference'], normalize='index')
crosstab_pct.plot(kind='bar', stacked=True, ax=ax, colormap='viridis')
ax.set_ylabel('Proportion')
ax.set_title('Shopping Preference Proportions by Age Group')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Chi-Square Test for Independence
The chi-square test determines if there is a statistically significant association between two categorical variables.
from scipy import stats
# Create contingency table
crosstab = pd.crosstab(df['age_group'], df['preference'])
# Chi-square test
chi2, p_value, dof, expected = stats.chi2_contingency(crosstab)
print(f"Chi-square statistic: {chi2:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
if p_value < 0.05:
print("\nResult: Significant association between age group and preference")
else:
print("\nResult: No significant association (variables are independent)")
# View expected frequencies
print("\nExpected frequencies:")
print(pd.DataFrame(expected,
index=crosstab.index,
columns=crosstab.columns).round(1))
Chi-Square Test of Independence
The chi-square test compares observed frequencies to expected frequencies (what we would expect if variables were independent). A large chi-square value (and small p-value) indicates the variables are associated.
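This comparison can be verified by hand: the statistic is the sum of (observed - expected)² / expected over all cells. A minimal sketch on a small 2x2 table (correction=False disables Yates' continuity adjustment so the manual value matches scipy's exactly):

```python
import numpy as np
from scipy import stats

# Small observed table: rows = groups, columns = outcomes
observed = np.array([[30, 20],
                     [20, 30]])

# Expected counts under independence: (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()
expected = row_totals * col_totals / grand_total

# Chi-square statistic computed by hand
chi2_manual = ((observed - expected) ** 2 / expected).sum()
print(f"Manual chi-square: {chi2_manual:.3f}")

# scipy's value matches when the continuity correction is disabled
chi2, p, dof, exp = stats.chi2_contingency(observed, correction=False)
print(f"scipy chi-square:  {chi2:.3f}, p = {p:.4f}")
```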
Cramer's V - Effect Size
While the chi-square test tells you whether an association exists, Cramer's V measures its strength on a scale from 0 to 1.
def cramers_v(crosstab):
"""Calculate Cramer's V for effect size"""
chi2 = stats.chi2_contingency(crosstab)[0]
n = crosstab.sum().sum()
min_dim = min(crosstab.shape[0] - 1, crosstab.shape[1] - 1)
return np.sqrt(chi2 / (n * min_dim))
v = cramers_v(crosstab)
print(f"Cramer's V: {v:.3f}")
# Interpretation:
# V < 0.1: Negligible association
# 0.1 <= V < 0.3: Weak association
# 0.3 <= V < 0.5: Moderate association
# V >= 0.5: Strong association
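As a sanity check on these bands, V hits its endpoints on constructed tables. This standalone sketch redefines the function with correction=False so the 2x2 perfect-association case lands exactly on 1:

```python
import numpy as np
from scipy import stats

def cramers_v(crosstab):
    """Cramer's V (same formula as above), with Yates' correction disabled."""
    chi2 = stats.chi2_contingency(crosstab, correction=False)[0]
    n = np.asarray(crosstab).sum()
    min_dim = min(crosstab.shape[0] - 1, crosstab.shape[1] - 1)
    return np.sqrt(chi2 / (n * min_dim))

# Perfect association: each row maps to exactly one column -> V = 1
perfect = np.array([[50, 0],
                    [0, 50]])
# No association: every cell equally likely -> V = 0
independent = np.array([[25, 25],
                        [25, 25]])

print(f"Perfect association: V = {cramers_v(perfect):.3f}")      # 1.000
print(f"No association:      V = {cramers_v(independent):.3f}")  # 0.000
```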
Practice: Categorical Relationships
Task: Create a DataFrame with 'gender' (Male, Female) and 'product' (A, B, C) columns. Generate 200 random rows and create a crosstab showing counts.
Show Solution
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({
'gender': np.random.choice(['Male', 'Female'], 200),
'product': np.random.choice(['A', 'B', 'C'], 200)
})
crosstab = pd.crosstab(df['gender'], df['product'])
print(crosstab)
Task: Using the same data, create a stacked bar chart showing the percentage distribution of products within each gender group (rows should sum to 100%).
Show Solution
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
df = pd.DataFrame({
'gender': np.random.choice(['Male', 'Female'], 200),
'product': np.random.choice(['A', 'B', 'C'], 200)
})
# Normalize by row (index)
crosstab_pct = pd.crosstab(df['gender'], df['product'], normalize='index') * 100
crosstab_pct.plot(kind='bar', stacked=True, figsize=(8, 6), colormap='Set2')
plt.title('Product Preference by Gender (%)')
plt.xlabel('Gender')
plt.ylabel('Percentage')
plt.legend(title='Product')
plt.xticks(rotation=0)
plt.show()
Task: Create survey data with 'education' (High School, Bachelor, Master, PhD) and 'income_level' (Low, Medium, High). Perform a chi-square test and calculate Cramer's V. Interpret the results.
Show Solution
import pandas as pd
import numpy as np
from scipy import stats
np.random.seed(42)
n = 400
# Create data with some association
education = np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n)
income = []
for edu in education:
if edu == 'PhD':
income.append(np.random.choice(['Low', 'Medium', 'High'], p=[0.1, 0.3, 0.6]))
elif edu == 'Master':
income.append(np.random.choice(['Low', 'Medium', 'High'], p=[0.2, 0.4, 0.4]))
elif edu == 'Bachelor':
income.append(np.random.choice(['Low', 'Medium', 'High'], p=[0.3, 0.4, 0.3]))
else:
income.append(np.random.choice(['Low', 'Medium', 'High'], p=[0.5, 0.35, 0.15]))
df = pd.DataFrame({'education': education, 'income_level': income})
crosstab = pd.crosstab(df['education'], df['income_level'])
# Chi-square test
chi2, p_value, dof, expected = stats.chi2_contingency(crosstab)
# Cramer's V
min_dim = min(crosstab.shape[0] - 1, crosstab.shape[1] - 1)
cramers_v = np.sqrt(chi2 / (n * min_dim))
print(f"Chi-square: {chi2:.2f}, p-value: {p_value:.4f}")
print(f"Cramer's V: {cramers_v:.3f}")
print(f"Association: {'Significant' if p_value < 0.05 else 'Not significant'}")
print(f"Strength: {'Moderate' if cramers_v >= 0.3 else 'Weak' if cramers_v >= 0.1 else 'Negligible'}")
Correlation Matrices & Heatmaps
When you have many numerical variables, examining all pairwise relationships individually becomes impractical. Correlation matrices and heatmaps provide a comprehensive view of all variable relationships at once. These tools are essential for identifying highly correlated features, detecting multicollinearity, and selecting variables for modeling.
Creating Correlation Matrices
Pandas makes it easy to compute correlation matrices with the .corr() method. By default, it calculates Pearson correlation.
import pandas as pd
import numpy as np
# Create sample dataset with multiple numerical features
np.random.seed(42)
n = 500
df = pd.DataFrame({
'age': np.random.normal(35, 10, n),
'income': np.random.normal(60000, 20000, n),
'experience': np.random.normal(10, 5, n),
'education_years': np.random.normal(16, 2, n),
'hours_worked': np.random.normal(40, 8, n)
})
# Add some correlations
df['income'] = df['income'] + df['experience'] * 3000 + df['education_years'] * 2000
df['experience'] = df['experience'] + df['age'] * 0.3
# Compute correlation matrix
corr_matrix = df.corr()
print(corr_matrix.round(2))
Heatmaps with Seaborn
Heatmaps use color to represent correlation values, making patterns immediately visible.
import matplotlib.pyplot as plt
import seaborn as sns
# Basic heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix Heatmap')
plt.tight_layout()
plt.show()
Customizing Heatmaps
Several customizations make heatmaps more informative and publication-ready.
# Enhanced heatmap with masking upper triangle
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, mask=mask, annot=True, cmap='RdBu_r',
center=0, fmt='.2f', linewidths=0.5,
vmin=-1, vmax=1, square=True,
cbar_kws={'label': 'Correlation Coefficient'})
plt.title('Correlation Matrix (Lower Triangle)')
plt.tight_layout()
plt.show()
# Different color palette options
# 'coolwarm': Blue (negative) to Red (positive)
# 'RdBu_r': Red (negative) to Blue (positive)
# 'viridis': Sequential, good for positive-only
# 'YlGnBu': Sequential yellow to blue
Pair Plots for Visual Exploration
Pair plots show scatter plots for all variable pairs plus distributions on the diagonal.
# Basic pair plot
sns.pairplot(df[['age', 'income', 'experience', 'education_years']],
diag_kind='kde', plot_kws={'alpha': 0.5})
plt.suptitle('Pair Plot of Key Variables', y=1.02)
plt.show()
# Pair plot with categorical hue
df['income_bracket'] = pd.cut(df['income'], bins=3, labels=['Low', 'Medium', 'High'])
sns.pairplot(df[['age', 'experience', 'education_years', 'income_bracket']],
hue='income_bracket', diag_kind='kde', palette='viridis')
plt.suptitle('Pair Plot by Income Bracket', y=1.02)
plt.show()
Identifying Multicollinearity
High correlations between features (multicollinearity) can cause problems in regression models. Use correlation matrices to detect this.
# Find highly correlated pairs
def find_high_correlations(corr_matrix, threshold=0.7):
"""Find feature pairs with correlation above threshold"""
high_corr = []
for i in range(len(corr_matrix.columns)):
for j in range(i+1, len(corr_matrix.columns)):
if abs(corr_matrix.iloc[i, j]) > threshold:
high_corr.append({
'feature1': corr_matrix.columns[i],
'feature2': corr_matrix.columns[j],
'correlation': corr_matrix.iloc[i, j]
})
return pd.DataFrame(high_corr)
high_corr_pairs = find_high_correlations(corr_matrix, threshold=0.6)
print("Highly correlated pairs:")
print(high_corr_pairs)
# Visualize with filtered heatmap
plt.figure(figsize=(10, 8))
strong_corr = corr_matrix.copy()
strong_corr[abs(strong_corr) < 0.5] = np.nan
sns.heatmap(strong_corr, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Strong Correlations Only (|r| > 0.5)')
plt.show()
Correlation with Target Variable
When preparing for modeling, examining correlations between features and the target variable helps identify predictive features.
# Correlation with target (income)
target_corr = df.corr()['income'].drop('income').sort_values(ascending=False)
print("Feature correlations with income:")
print(target_corr.round(3))
# Visualize as bar chart
plt.figure(figsize=(10, 5))
colors = ['green' if x > 0 else 'red' for x in target_corr.values]
target_corr.plot(kind='barh', color=colors)
plt.xlabel('Correlation with Income')
plt.title('Feature Correlations with Target Variable')
plt.axvline(x=0, color='black', linewidth=0.5)
plt.tight_layout()
plt.show()
Tip: Use df.corr(method='spearman') when your data has outliers or non-linear relationships. Spearman correlation is rank-based and more robust.
Practice: Correlation Analysis
Task: Create a DataFrame with columns 'a', 'b', 'c', 'd' containing 100 random values each. Compute the correlation matrix and display it as a heatmap with annotations.
Show Solution
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(42)
df = pd.DataFrame({
'a': np.random.randn(100),
'b': np.random.randn(100),
'c': np.random.randn(100),
'd': np.random.randn(100)
})
corr = df.corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Correlation Matrix')
plt.show()
Task: Using the same DataFrame, create a heatmap showing only the lower triangle (mask the upper triangle). Use 'RdBu_r' colormap and make squares equal-sized.
Show Solution
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(42)
df = pd.DataFrame({
'a': np.random.randn(100),
'b': np.random.randn(100),
'c': np.random.randn(100),
'd': np.random.randn(100)
})
corr = df.corr()
# Create mask for upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
plt.figure(figsize=(8, 6))
sns.heatmap(corr, mask=mask, annot=True, cmap='RdBu_r',
center=0, fmt='.2f', square=True, linewidths=0.5)
plt.title('Correlation Matrix (Lower Triangle)')
plt.show()
Task: Create a dataset with 5 features where some are correlated with a target variable. Find all features with |correlation| > 0.4 with target. Create a horizontal bar chart showing these correlations sorted by absolute value.
Show Solution
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
n = 200
df = pd.DataFrame({
'feature1': np.random.randn(n),
'feature2': np.random.randn(n),
'feature3': np.random.randn(n),
'feature4': np.random.randn(n),
'feature5': np.random.randn(n)
})
# Create target with correlations to some features
df['target'] = (2 * df['feature1'] + 1.5 * df['feature3'] -
0.8 * df['feature5'] + np.random.randn(n) * 0.5)
# Find correlations with target
target_corr = df.corr()['target'].drop('target')
# Filter high correlations
high_corr = target_corr[abs(target_corr) > 0.4].sort_values(key=abs, ascending=True)
# Visualize
plt.figure(figsize=(10, 5))
colors = ['green' if x > 0 else 'red' for x in high_corr.values]
high_corr.plot(kind='barh', color=colors)
plt.xlabel('Correlation with Target')
plt.title('Features Correlated with Target (|r| > 0.4)')
plt.axvline(x=0, color='black', linewidth=0.5)
plt.tight_layout()
plt.show()
print("High correlation features:")
print(high_corr.round(3))
Interactive Demo: Correlation Explorer
Experiment with different correlation strengths and patterns to see how correlation coefficients relate to visual scatter patterns.
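You can reproduce the explorer's behavior in code by sampling from a bivariate normal distribution with a chosen population correlation (a standalone sketch; the sample r only approximates the target in a finite sample):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)

def correlated_sample(r, n=200):
    """Draw n (x, y) pairs from a bivariate normal with population correlation r."""
    cov = [[1, r], [r, 1]]
    x, y = np.random.multivariate_normal([0, 0], cov, n).T
    return x, y

# Strong positive, none, and strong negative patterns side by side
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, r in zip(axes, [0.9, 0.0, -0.9]):
    x, y = correlated_sample(r)
    sample_r, _ = stats.pearsonr(x, y)
    ax.scatter(x, y, alpha=0.5)
    ax.set_title(f'target r = {r}, sample r = {sample_r:.2f}')
plt.tight_layout()
plt.show()
```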
Key Observations
Strong Positive (r > 0.7)
Points form tight upward band. X and Y move together predictably.
Weak/No Correlation (|r| < 0.3)
Points scattered randomly. Knowing X tells little about Y.
Strong Negative (r < -0.7)
Points form tight downward band. As X increases, Y decreases.
Key Takeaways
Scatter Plots
Essential for visualizing relationships between two numerical variables
Correlation Coefficient
Pearson measures linear relationships from -1 to +1
Box Plots by Group
Compare numerical distributions across categorical groups
Contingency Tables
Cross-tabulation reveals associations between categorical variables
Heatmaps
Color-coded matrices show correlation patterns at a glance
Pair Plots
Visualize all pairwise relationships in a single figure
Knowledge Check
Test your understanding of bivariate and multivariate analysis:
Which plot is best for visualizing the relationship between two numerical variables?
A Pearson correlation coefficient of -0.85 indicates:
Which visualization compares a numerical variable across different categories?
What does a contingency table (crosstab) show?
Which Seaborn function creates a grid of scatter plots for all numerical variable pairs?
When should you use Spearman correlation instead of Pearson?