Numerical vs Numerical Relationships
When analyzing relationships between two numerical variables, we look for patterns that reveal how one variable changes as the other changes. This is the foundation for understanding correlation and building predictive relationships in your data (keep in mind that correlation alone never establishes causation). Scatter plots are your primary tool, while correlation coefficients quantify the strength and direction of linear relationships.
Bivariate Analysis
Bivariate analysis examines the relationship between exactly two variables. For numerical-numerical pairs, we assess whether the variables move together (positive correlation), move in opposite directions (negative correlation), or show no apparent relationship.
Scatter Plots - The Foundation
Scatter plots display each observation as a point positioned by its x and y values. Patterns in the point cloud reveal the relationship type and strength.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Create sample data with different relationships
np.random.seed(42)
n = 200
df = pd.DataFrame({
'study_hours': np.random.uniform(1, 10, n),
'experience_years': np.random.uniform(0, 15, n),
'age': np.random.uniform(22, 60, n)
})
df['exam_score'] = 40 + 5 * df['study_hours'] + np.random.normal(0, 8, n)
df['salary'] = 30000 + 4000 * df['experience_years'] + np.random.normal(0, 5000, n)
# Basic scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(df['study_hours'], df['exam_score'], alpha=0.6)
plt.xlabel('Study Hours')
plt.ylabel('Exam Score')
plt.title('Study Hours vs Exam Score')
plt.show()
Seaborn Scatter Plots with Regression Lines
Adding a regression line helps visualize the linear trend. Seaborn's regplot() combines scatter and regression in one call.
# Scatter plot with regression line
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Positive correlation
sns.regplot(x='study_hours', y='exam_score', data=df, ax=axes[0],
scatter_kws={'alpha': 0.5}, line_kws={'color': 'red'})
axes[0].set_title('Positive Correlation: Study Hours vs Score')
# Another positive correlation
sns.regplot(x='experience_years', y='salary', data=df, ax=axes[1],
scatter_kws={'alpha': 0.5}, line_kws={'color': 'red'})
axes[1].set_title('Positive Correlation: Experience vs Salary')
plt.tight_layout()
plt.show()
Calculating Correlation Coefficients
The Pearson correlation coefficient (r) quantifies linear relationships on a scale from -1 to +1.
from scipy import stats
# Pearson correlation
r_study, p_study = stats.pearsonr(df['study_hours'], df['exam_score'])
print(f"Study vs Score: r = {r_study:.3f}, p-value = {p_study:.4f}")
r_exp, p_exp = stats.pearsonr(df['experience_years'], df['salary'])
print(f"Experience vs Salary: r = {r_exp:.3f}, p-value = {p_exp:.4f}")
# Interpretation (matching the table below)
# |r| < 0.2: Very weak or none
# 0.2 <= |r| < 0.4: Weak
# 0.4 <= |r| < 0.7: Moderate
# |r| >= 0.7: Strong (|r| >= 0.9: very strong)
| Correlation Value | Strength | Interpretation |
|---|---|---|
| 0.9 to 1.0 (or -0.9 to -1.0) | Very Strong | Variables move almost perfectly together |
| 0.7 to 0.9 (or -0.7 to -0.9) | Strong | Clear linear pattern visible |
| 0.4 to 0.7 (or -0.4 to -0.7) | Moderate | Noticeable trend with scatter |
| 0.2 to 0.4 (or -0.2 to -0.4) | Weak | Slight trend, lots of variation |
| 0 to 0.2 (or 0 to -0.2) | Very Weak/None | No apparent linear relationship |
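The bands in this table translate directly into a small helper. This function is illustrative (not part of any library); it simply encodes the thresholds above:

```python
def correlation_strength(r):
    """Map a correlation coefficient to the strength labels in the table above."""
    a = abs(r)
    if a >= 0.9:
        return 'Very Strong'
    elif a >= 0.7:
        return 'Strong'
    elif a >= 0.4:
        return 'Moderate'
    elif a >= 0.2:
        return 'Weak'
    return 'Very Weak/None'

print(correlation_strength(0.85))   # Strong
print(correlation_strength(-0.25))  # Weak
```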
Pearson vs Spearman Correlation
Use Pearson for linear relationships and Spearman for monotonic (consistently increasing or decreasing) relationships, especially with outliers.
# Create data with outliers
data_outliers = pd.DataFrame({
'x': list(range(1, 21)) + [25],
'y': list(range(1, 21)) + [100] # Outlier at (25, 100)
})
# Compare correlations
pearson_r, _ = stats.pearsonr(data_outliers['x'], data_outliers['y'])
spearman_r, _ = stats.spearmanr(data_outliers['x'], data_outliers['y'])
print(f"Pearson r: {pearson_r:.3f}") # Affected by outlier
print(f"Spearman r: {spearman_r:.3f}") # More robust
# When to use which:
# Pearson: Linear relationships, normally distributed data
# Spearman: Non-linear monotonic, ordinal data, outliers present
Practice: Numerical Relationships
Task: Given height and weight data: height = [160, 165, 170, 175, 180, 185, 190] and weight = [55, 62, 68, 73, 80, 85, 92], create a scatter plot with a regression line using seaborn.
Show Solution
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
height = [160, 165, 170, 175, 180, 185, 190]
weight = [55, 62, 68, 73, 80, 85, 92]
df = pd.DataFrame({'height': height, 'weight': weight})
plt.figure(figsize=(8, 6))
sns.regplot(x='height', y='weight', data=df,
scatter_kws={'s': 100}, line_kws={'color': 'red'})
plt.xlabel('Height (cm)')
plt.ylabel('Weight (kg)')
plt.title('Height vs Weight Relationship')
plt.show()
Task: Generate random data with np.random.seed(42); x = np.random.randn(100); y = 2*x + np.random.randn(100)*0.5. Calculate Pearson correlation, print the r value and p-value, and state whether the correlation is significant (p < 0.05).
Show Solution
import numpy as np
from scipy import stats
np.random.seed(42)
x = np.random.randn(100)
y = 2 * x + np.random.randn(100) * 0.5
r, p_value = stats.pearsonr(x, y)
print(f"Pearson r: {r:.4f}")
print(f"P-value: {p_value:.2e}")
print(f"Significant: {'Yes' if p_value < 0.05 else 'No'}")
print(f"Correlation strength: {'Strong' if abs(r) > 0.7 else 'Moderate' if abs(r) > 0.4 else 'Weak'}")
Task: Create data: x = list(range(1, 11)) + [15] and y = list(range(1, 11)) + [50]. Calculate both Pearson and Spearman correlations, create a scatter plot, and explain in a print statement which correlation is more robust to the outlier.
Show Solution
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
x = list(range(1, 11)) + [15]
y = list(range(1, 11)) + [50]
pearson_r, _ = stats.pearsonr(x, y)
spearman_r, _ = stats.spearmanr(x, y)
plt.figure(figsize=(8, 6))
plt.scatter(x, y, s=100)
plt.scatter([15], [50], color='red', s=150, label='Outlier')
plt.xlabel('X')
plt.ylabel('Y')
plt.title(f'Pearson: {pearson_r:.3f}, Spearman: {spearman_r:.3f}')
plt.legend()
plt.show()
print(f"Pearson r: {pearson_r:.3f}")
print(f"Spearman r: {spearman_r:.3f}")
print("Spearman is more robust because it uses ranks, not actual values.")
Numerical vs Categorical Relationships
Comparing how a numerical variable differs across categorical groups is fundamental to understanding your data. This analysis helps answer questions like "Do salaries differ by department?" or "Is customer spending different across age groups?". Box plots, violin plots, and grouped statistics are your primary tools for this type of analysis.
Box Plots by Category
Box plots show the distribution of a numerical variable for each category, making it easy to compare medians, spread, and outliers across groups.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Create sample employee data
np.random.seed(42)
n = 300
df = pd.DataFrame({
'department': np.random.choice(['Sales', 'Engineering', 'Marketing', 'HR'], n),
'salary': np.random.normal(70000, 15000, n),
'performance': np.random.choice(['Low', 'Medium', 'High'], n)
})
# Adjust salary by department
df.loc[df['department'] == 'Engineering', 'salary'] *= 1.3
df.loc[df['department'] == 'Sales', 'salary'] *= 1.1
# Box plot by department
plt.figure(figsize=(10, 6))
sns.boxplot(x='department', y='salary', data=df, palette='viridis')
plt.title('Salary Distribution by Department')
plt.xlabel('Department')
plt.ylabel('Salary ($)')
plt.show()
Violin Plots - Distribution Shape
Violin plots combine box plots with kernel density estimation, showing the full distribution shape for each category.
# Violin plot shows distribution shape
plt.figure(figsize=(10, 6))
sns.violinplot(x='department', y='salary', data=df, palette='muted')
plt.title('Salary Distribution by Department (Violin)')
plt.xlabel('Department')
plt.ylabel('Salary ($)')
plt.show()
# Nested violins by a second category (split=True requires exactly two hue levels)
plt.figure(figsize=(12, 6))
sns.violinplot(x='department', y='salary', hue='performance',
data=df, split=False, palette='Set2')
plt.title('Salary by Department and Performance')
plt.legend(title='Performance')
plt.show()
Strip and Swarm Plots
These plots show individual data points, useful when you want to see the actual distribution rather than a summary.
# Strip plot - jittered points
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.stripplot(x='department', y='salary', data=df,
ax=axes[0], alpha=0.5, jitter=True)
axes[0].set_title('Strip Plot: Individual Points')
# Swarm plot - non-overlapping points
sns.swarmplot(x='department', y='salary', data=df.sample(100),
ax=axes[1], alpha=0.7)
axes[1].set_title('Swarm Plot: Non-overlapping Points')
plt.tight_layout()
plt.show()
Group Statistics with groupby()
Calculate summary statistics for each group to quantify the differences you see in plots.
# Summary statistics by group
group_stats = df.groupby('department')['salary'].agg([
'count', 'mean', 'median', 'std', 'min', 'max'
]).round(0)
print(group_stats)
# Multiple aggregations
detailed_stats = df.groupby('department').agg({
'salary': ['mean', 'std'],
}).round(0)
print(detailed_stats)
When to Use Box Plots
- Comparing distributions across groups
- Identifying outliers quickly
- Showing median and quartiles
- Many categories (fits more per plot)
When to Use Violin Plots
- Need to see distribution shape
- Checking for bimodality
- Comparing density across groups
- Fewer categories (need more width)
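To see these trade-offs directly, the same data can be drawn with both plot types side by side. This sketch regenerates simplified employee data so it runs on its own:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Simplified employee data, regenerated so this snippet runs standalone
np.random.seed(42)
df = pd.DataFrame({
    'department': np.random.choice(['Sales', 'Engineering', 'Marketing', 'HR'], 300),
    'salary': np.random.normal(70000, 15000, 300)
})

# Box plot (summary statistics) next to violin plot (full distribution shape)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.boxplot(x='department', y='salary', data=df, ax=axes[0])
axes[0].set_title('Box Plot: Summary Statistics')
sns.violinplot(x='department', y='salary', data=df, ax=axes[1])
axes[1].set_title('Violin Plot: Full Distribution Shape')
plt.tight_layout()
plt.show()
```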
Statistical Tests for Group Differences
Use statistical tests to determine if observed differences are statistically significant.
from scipy import stats
# Get salary data for two departments
engineering = df[df['department'] == 'Engineering']['salary']
hr = df[df['department'] == 'HR']['salary']
# Independent t-test (2 groups)
t_stat, p_value = stats.ttest_ind(engineering, hr)
print(f"T-test: t = {t_stat:.3f}, p = {p_value:.4f}")
# ANOVA (3+ groups)
sales = df[df['department'] == 'Sales']['salary']
marketing = df[df['department'] == 'Marketing']['salary']
f_stat, p_anova = stats.f_oneway(engineering, hr, sales, marketing)
print(f"ANOVA: F = {f_stat:.3f}, p = {p_anova:.4f}")
# Interpretation
if p_anova < 0.05:
print("Significant difference exists between at least two departments")
Tip: Use sns.boxplot() followed by sns.stripplot() on the same axes to show both summary statistics and individual points.
Practice: Numerical vs Categorical
Task: Create a DataFrame with columns 'grade' (A, B, C) and 'score' (random values). Create a box plot showing score distribution for each grade.
Show Solution
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(42)
df = pd.DataFrame({
'grade': np.random.choice(['A', 'B', 'C'], 150),
'score': np.random.normal(75, 15, 150)
})
# Adjust scores by grade
df.loc[df['grade'] == 'A', 'score'] += 15
df.loc[df['grade'] == 'C', 'score'] -= 10
plt.figure(figsize=(8, 6))
sns.boxplot(x='grade', y='score', data=df, order=['A', 'B', 'C'])
plt.title('Score Distribution by Grade')
plt.xlabel('Grade')
plt.ylabel('Score')
plt.show()
Task: Using the same DataFrame, use groupby() to calculate mean, median, and std of scores for each grade. Display results rounded to 2 decimal places.
Show Solution
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({
'grade': np.random.choice(['A', 'B', 'C'], 150),
'score': np.random.normal(75, 15, 150)
})
df.loc[df['grade'] == 'A', 'score'] += 15
df.loc[df['grade'] == 'C', 'score'] -= 10
stats = df.groupby('grade')['score'].agg(['mean', 'median', 'std']).round(2)
stats = stats.reindex(['A', 'B', 'C'])
print(stats)
Task: Create data for 3 product categories with different price distributions. Perform one-way ANOVA to test if prices differ significantly. Create a violin plot with individual points overlaid.
Show Solution
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(42)
df = pd.DataFrame({
'category': ['Electronics']*50 + ['Clothing']*50 + ['Food']*50,
'price': np.concatenate([
np.random.normal(500, 100, 50),
np.random.normal(80, 30, 50),
np.random.normal(25, 10, 50)
])
})
# ANOVA test
groups = [df[df['category'] == cat]['price'] for cat in df['category'].unique()]
f_stat, p_value = stats.f_oneway(*groups)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.2e}")
print(f"Significant difference: {'Yes' if p_value < 0.05 else 'No'}")
# Combined visualization
plt.figure(figsize=(10, 6))
sns.violinplot(x='category', y='price', data=df, inner=None, color='0.9')
sns.stripplot(x='category', y='price', data=df, alpha=0.6, size=4)
plt.title(f'Price by Category (ANOVA p = {p_value:.2e})')
plt.show()
Categorical vs Categorical Relationships
Analyzing relationships between two categorical variables reveals associations and patterns in your data. Questions like "Is product preference related to age group?" or "Does customer region affect purchase category?" require categorical-categorical analysis. Contingency tables, stacked bar charts, and the chi-square test are essential tools for this analysis.
Contingency Tables (Cross-tabulation)
A contingency table shows the frequency distribution of two categorical variables simultaneously. Use pd.crosstab() to create these tables quickly.
import pandas as pd
import numpy as np
# Create sample survey data
np.random.seed(42)
n = 500
df = pd.DataFrame({
'age_group': np.random.choice(['18-25', '26-35', '36-50', '50+'], n),
'preference': np.random.choice(['Online', 'In-Store', 'Both'], n),
'region': np.random.choice(['North', 'South', 'East', 'West'], n)
})
# Basic contingency table
crosstab = pd.crosstab(df['age_group'], df['preference'])
print("Frequency Table:")
print(crosstab)
# With row percentages
crosstab_pct = pd.crosstab(df['age_group'], df['preference'], normalize='index') * 100
print("\nRow Percentages:")
print(crosstab_pct.round(1))
Stacked and Grouped Bar Charts
Visualize contingency tables using bar charts to see patterns in category combinations.
import matplotlib.pyplot as plt
import seaborn as sns
# Stacked bar chart
crosstab = pd.crosstab(df['age_group'], df['preference'])
crosstab.plot(kind='bar', stacked=True, figsize=(10, 6), colormap='viridis')
plt.title('Shopping Preference by Age Group (Stacked)')
plt.xlabel('Age Group')
plt.ylabel('Count')
plt.legend(title='Preference')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# Grouped bar chart
crosstab.plot(kind='bar', figsize=(10, 6), colormap='viridis')
plt.title('Shopping Preference by Age Group (Grouped)')
plt.xlabel('Age Group')
plt.ylabel('Count')
plt.legend(title='Preference')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Seaborn Count Plots with Hue
Use sns.countplot() with the hue parameter for a quick grouped visualization.
# Count plot with hue
plt.figure(figsize=(10, 6))
sns.countplot(x='age_group', hue='preference', data=df, palette='Set2')
plt.title('Shopping Preference Distribution by Age Group')
plt.xlabel('Age Group')
plt.ylabel('Count')
plt.legend(title='Preference')
plt.show()
# Proportions within each age group
fig, ax = plt.subplots(figsize=(10, 6))
crosstab_pct = pd.crosstab(df['age_group'], df['preference'], normalize='index')
crosstab_pct.plot(kind='bar', stacked=True, ax=ax, colormap='viridis')
ax.set_ylabel('Proportion')
ax.set_title('Shopping Preference Proportions by Age Group')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Chi-Square Test for Independence
The chi-square test determines if there is a statistically significant association between two categorical variables.
from scipy import stats
# Create contingency table
crosstab = pd.crosstab(df['age_group'], df['preference'])
# Chi-square test
chi2, p_value, dof, expected = stats.chi2_contingency(crosstab)
print(f"Chi-square statistic: {chi2:.3f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
if p_value < 0.05:
print("\nResult: Significant association between age group and preference")
else:
print("\nResult: No significant association (variables are independent)")
# View expected frequencies
print("\nExpected frequencies:")
print(pd.DataFrame(expected,
index=crosstab.index,
columns=crosstab.columns).round(1))
Chi-Square Test of Independence
The chi-square test compares observed frequencies to expected frequencies (what we would expect if variables were independent). A large chi-square value (and small p-value) indicates the variables are associated.
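This comparison can be verified by hand: the statistic is the sum of (observed - expected)² / expected over all cells. A minimal sketch on a small 2x2 table (correction=False disables Yates' continuity adjustment so the manual value matches scipy's exactly):

```python
import numpy as np
from scipy import stats

# Small observed table: rows = groups, columns = outcomes
observed = np.array([[30, 20],
                     [20, 30]])

# Expected counts under independence: (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()
expected = row_totals * col_totals / grand_total

# Chi-square statistic computed by hand
chi2_manual = ((observed - expected) ** 2 / expected).sum()
print(f"Manual chi-square: {chi2_manual:.3f}")

# scipy's value matches when the continuity correction is disabled
chi2, p, dof, exp = stats.chi2_contingency(observed, correction=False)
print(f"scipy chi-square:  {chi2:.3f}, p = {p:.4f}")
```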
Cramer's V - Effect Size
While the chi-square test tells you whether an association exists, Cramer's V measures its strength on a scale from 0 to 1.
def cramers_v(crosstab):
"""Calculate Cramer's V for effect size"""
chi2 = stats.chi2_contingency(crosstab)[0]
n = crosstab.sum().sum()
min_dim = min(crosstab.shape[0] - 1, crosstab.shape[1] - 1)
return np.sqrt(chi2 / (n * min_dim))
v = cramers_v(crosstab)
print(f"Cramer's V: {v:.3f}")
# Interpretation:
# V < 0.1: Negligible association
# 0.1 <= V < 0.3: Weak association
# 0.3 <= V < 0.5: Moderate association
# V >= 0.5: Strong association
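As a sanity check on these bands, V hits its endpoints on constructed tables. This standalone sketch redefines the function with correction=False so the 2x2 perfect-association case lands exactly on 1:

```python
import numpy as np
from scipy import stats

def cramers_v(crosstab):
    """Cramer's V (same formula as above), with Yates' correction disabled."""
    chi2 = stats.chi2_contingency(crosstab, correction=False)[0]
    n = np.asarray(crosstab).sum()
    min_dim = min(crosstab.shape[0] - 1, crosstab.shape[1] - 1)
    return np.sqrt(chi2 / (n * min_dim))

# Perfect association: each row maps to exactly one column -> V = 1
perfect = np.array([[50, 0],
                    [0, 50]])
# No association: every cell equally likely -> V = 0
independent = np.array([[25, 25],
                        [25, 25]])

print(f"Perfect association: V = {cramers_v(perfect):.3f}")      # 1.000
print(f"No association:      V = {cramers_v(independent):.3f}")  # 0.000
```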
Practice: Categorical Relationships
Task: Create a DataFrame with 'gender' (Male, Female) and 'product' (A, B, C) columns. Generate 200 random rows and create a crosstab showing counts.
Show Solution
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({
'gender': np.random.choice(['Male', 'Female'], 200),
'product': np.random.choice(['A', 'B', 'C'], 200)
})
crosstab = pd.crosstab(df['gender'], df['product'])
print(crosstab)
Task: Using the same data, create a stacked bar chart showing the percentage distribution of products within each gender group (rows should sum to 100%).
Show Solution
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
df = pd.DataFrame({
'gender': np.random.choice(['Male', 'Female'], 200),
'product': np.random.choice(['A', 'B', 'C'], 200)
})
# Normalize by row (index)
crosstab_pct = pd.crosstab(df['gender'], df['product'], normalize='index') * 100
crosstab_pct.plot(kind='bar', stacked=True, figsize=(8, 6), colormap='Set2')
plt.title('Product Preference by Gender (%)')
plt.xlabel('Gender')
plt.ylabel('Percentage')
plt.legend(title='Product')
plt.xticks(rotation=0)
plt.show()
Task: Create survey data with 'education' (High School, Bachelor, Master, PhD) and 'income_level' (Low, Medium, High). Perform a chi-square test and calculate Cramer's V. Interpret the results.
Show Solution
import pandas as pd
import numpy as np
from scipy import stats
np.random.seed(42)
n = 400
# Create data with some association
education = np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n)
income = []
for edu in education:
if edu == 'PhD':
income.append(np.random.choice(['Low', 'Medium', 'High'], p=[0.1, 0.3, 0.6]))
elif edu == 'Master':
income.append(np.random.choice(['Low', 'Medium', 'High'], p=[0.2, 0.4, 0.4]))
elif edu == 'Bachelor':
income.append(np.random.choice(['Low', 'Medium', 'High'], p=[0.3, 0.4, 0.3]))
else:
income.append(np.random.choice(['Low', 'Medium', 'High'], p=[0.5, 0.35, 0.15]))
df = pd.DataFrame({'education': education, 'income_level': income})
crosstab = pd.crosstab(df['education'], df['income_level'])
# Chi-square test
chi2, p_value, dof, expected = stats.chi2_contingency(crosstab)
# Cramer's V
min_dim = min(crosstab.shape[0] - 1, crosstab.shape[1] - 1)
cramers_v = np.sqrt(chi2 / (n * min_dim))
print(f"Chi-square: {chi2:.2f}, p-value: {p_value:.4f}")
print(f"Cramer's V: {cramers_v:.3f}")
print(f"Association: {'Significant' if p_value < 0.05 else 'Not significant'}")
print(f"Strength: {'Moderate' if cramers_v >= 0.3 else 'Weak' if cramers_v >= 0.1 else 'Negligible'}")
Correlation Matrices & Heatmaps
When you have many numerical variables, examining all pairwise relationships individually becomes impractical. Correlation matrices and heatmaps provide a comprehensive view of all variable relationships at once. These tools are essential for identifying highly correlated features, detecting multicollinearity, and selecting variables for modeling.
Creating Correlation Matrices
Pandas makes it easy to compute correlation matrices with the .corr() method. By default, it calculates Pearson correlation.
import pandas as pd
import numpy as np
# Create sample dataset with multiple numerical features
np.random.seed(42)
n = 500
df = pd.DataFrame({
'age': np.random.normal(35, 10, n),
'income': np.random.normal(60000, 20000, n),
'experience': np.random.normal(10, 5, n),
'education_years': np.random.normal(16, 2, n),
'hours_worked': np.random.normal(40, 8, n)
})
# Add some correlations
df['income'] = df['income'] + df['experience'] * 3000 + df['education_years'] * 2000
df['experience'] = df['experience'] + df['age'] * 0.3
# Compute correlation matrix
corr_matrix = df.corr()
print(corr_matrix.round(2))
Heatmaps with Seaborn
Heatmaps use color to represent correlation values, making patterns immediately visible.
import matplotlib.pyplot as plt
import seaborn as sns
# Basic heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0,
fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix Heatmap')
plt.tight_layout()
plt.show()
Customizing Heatmaps
Several customizations make heatmaps more informative and publication-ready.
# Enhanced heatmap with masking upper triangle
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, mask=mask, annot=True, cmap='RdBu_r',
center=0, fmt='.2f', linewidths=0.5,
vmin=-1, vmax=1, square=True,
cbar_kws={'label': 'Correlation Coefficient'})
plt.title('Correlation Matrix (Lower Triangle)')
plt.tight_layout()
plt.show()
# Different color palette options
# 'coolwarm': Blue (negative) to Red (positive)
# 'RdBu_r': Red (negative) to Blue (positive)
# 'viridis': Sequential, good for positive-only
# 'YlGnBu': Sequential yellow to blue
Pair Plots for Visual Exploration
Pair plots show scatter plots for all variable pairs plus distributions on the diagonal.
# Basic pair plot
sns.pairplot(df[['age', 'income', 'experience', 'education_years']],
diag_kind='kde', plot_kws={'alpha': 0.5})
plt.suptitle('Pair Plot of Key Variables', y=1.02)
plt.show()
# Pair plot with categorical hue
df['income_bracket'] = pd.cut(df['income'], bins=3, labels=['Low', 'Medium', 'High'])
sns.pairplot(df[['age', 'experience', 'education_years', 'income_bracket']],
hue='income_bracket', diag_kind='kde', palette='viridis')
plt.suptitle('Pair Plot by Income Bracket', y=1.02)
plt.show()
Identifying Multicollinearity
High correlations between features (multicollinearity) can cause problems in regression models. Use correlation matrices to detect this.
# Find highly correlated pairs
def find_high_correlations(corr_matrix, threshold=0.7):
"""Find feature pairs with correlation above threshold"""
high_corr = []
for i in range(len(corr_matrix.columns)):
for j in range(i+1, len(corr_matrix.columns)):
if abs(corr_matrix.iloc[i, j]) > threshold:
high_corr.append({
'feature1': corr_matrix.columns[i],
'feature2': corr_matrix.columns[j],
'correlation': corr_matrix.iloc[i, j]
})
return pd.DataFrame(high_corr)
high_corr_pairs = find_high_correlations(corr_matrix, threshold=0.6)
print("Highly correlated pairs:")
print(high_corr_pairs)
# Visualize with filtered heatmap
plt.figure(figsize=(10, 8))
strong_corr = corr_matrix.copy()
strong_corr[abs(strong_corr) < 0.5] = np.nan
sns.heatmap(strong_corr, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Strong Correlations Only (|r| > 0.5)')
plt.show()
Correlation with Target Variable
When preparing for modeling, examining correlations between features and the target variable helps identify predictive features.
# Correlation with target (income)
target_corr = df.corr()['income'].drop('income').sort_values(ascending=False)
print("Feature correlations with income:")
print(target_corr.round(3))
# Visualize as bar chart
plt.figure(figsize=(10, 5))
colors = ['green' if x > 0 else 'red' for x in target_corr.values]
target_corr.plot(kind='barh', color=colors)
plt.xlabel('Correlation with Income')
plt.title('Feature Correlations with Target Variable')
plt.axvline(x=0, color='black', linewidth=0.5)
plt.tight_layout()
plt.show()
Tip: Use df.corr(method='spearman') when your data has outliers or non-linear relationships. Spearman correlation is rank-based and more robust.
Practice: Correlation Analysis
Task: Create a DataFrame with columns 'a', 'b', 'c', 'd' containing 100 random values each. Compute the correlation matrix and display it as a heatmap with annotations.
Show Solution
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(42)
df = pd.DataFrame({
'a': np.random.randn(100),
'b': np.random.randn(100),
'c': np.random.randn(100),
'd': np.random.randn(100)
})
corr = df.corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Correlation Matrix')
plt.show()
Task: Using the same DataFrame, create a heatmap showing only the lower triangle (mask the upper triangle). Use 'RdBu_r' colormap and make squares equal-sized.
Show Solution
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
np.random.seed(42)
df = pd.DataFrame({
'a': np.random.randn(100),
'b': np.random.randn(100),
'c': np.random.randn(100),
'd': np.random.randn(100)
})
corr = df.corr()
# Create mask for upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
plt.figure(figsize=(8, 6))
sns.heatmap(corr, mask=mask, annot=True, cmap='RdBu_r',
center=0, fmt='.2f', square=True, linewidths=0.5)
plt.title('Correlation Matrix (Lower Triangle)')
plt.show()
Task: Create a dataset with 5 features where some are correlated with a target variable. Find all features with |correlation| > 0.4 with target. Create a horizontal bar chart showing these correlations sorted by absolute value.
Show Solution
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
n = 200
df = pd.DataFrame({
'feature1': np.random.randn(n),
'feature2': np.random.randn(n),
'feature3': np.random.randn(n),
'feature4': np.random.randn(n),
'feature5': np.random.randn(n)
})
# Create target with correlations to some features
df['target'] = (2 * df['feature1'] + 1.5 * df['feature3'] -
0.8 * df['feature5'] + np.random.randn(n) * 0.5)
# Find correlations with target
target_corr = df.corr()['target'].drop('target')
# Filter high correlations
high_corr = target_corr[abs(target_corr) > 0.4].sort_values(key=abs, ascending=True)
# Visualize
plt.figure(figsize=(10, 5))
colors = ['green' if x > 0 else 'red' for x in high_corr.values]
high_corr.plot(kind='barh', color=colors)
plt.xlabel('Correlation with Target')
plt.title('Features Correlated with Target (|r| > 0.4)')
plt.axvline(x=0, color='black', linewidth=0.5)
plt.tight_layout()
plt.show()
print("High correlation features:")
print(high_corr.round(3))
Interactive Demo: Correlation Explorer
Experiment with different correlation strengths and patterns to see how correlation coefficients relate to visual scatter patterns.
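You can reproduce the explorer's behavior in code by sampling from a bivariate normal distribution with a chosen population correlation (a standalone sketch; the sample r only approximates the target in a finite sample):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)

def correlated_sample(r, n=200):
    """Draw n (x, y) pairs from a bivariate normal with population correlation r."""
    cov = [[1, r], [r, 1]]
    x, y = np.random.multivariate_normal([0, 0], cov, n).T
    return x, y

# Strong positive, none, and strong negative patterns side by side
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, r in zip(axes, [0.9, 0.0, -0.9]):
    x, y = correlated_sample(r)
    sample_r, _ = stats.pearsonr(x, y)
    ax.scatter(x, y, alpha=0.5)
    ax.set_title(f'target r = {r}, sample r = {sample_r:.2f}')
plt.tight_layout()
plt.show()
```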
Key Observations
Strong Positive (r > 0.7)
Points form tight upward band. X and Y move together predictably.
Weak/No Correlation (|r| < 0.3)
Points scattered randomly. Knowing X tells little about Y.
Strong Negative (r < -0.7)
Points form tight downward band. As X increases, Y decreases.
Key Takeaways
Scatter Plots
Essential for visualizing relationships between two numerical variables
Correlation Coefficient
Pearson measures linear relationships from -1 to +1
Box Plots by Group
Compare numerical distributions across categorical groups
Contingency Tables
Cross-tabulation reveals associations between categorical variables
Heatmaps
Color-coded matrices show correlation patterns at a glance
Pair Plots
Visualize all pairwise relationships in a single figure
Knowledge Check
Test your understanding of bivariate and multivariate analysis:
Which plot is best for visualizing the relationship between two numerical variables?
A Pearson correlation coefficient of -0.85 indicates:
Which visualization compares a numerical variable across different categories?
What does a contingency table (crosstab) show?
Which Seaborn function creates a grid of scatter plots for all numerical variable pairs?
When should you use Spearman correlation instead of Pearson?