Visualization Techniques

Introduction to Data Visualization

Why Visualization Matters: Data visualization transforms raw numbers into visual stories. A well-designed chart can reveal patterns, outliers, and trends that would remain hidden in a spreadsheet. In this lesson, you will learn how to select the right chart type, apply color and styling best practices, and build both static and interactive visualizations using Python libraries.

What is Data Visualization?

Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, it provides an accessible way to see and understand trends, outliers, and patterns in data.

Why it matters: A chart lets readers take in a trend or an outlier at a glance, which is much faster than scanning rows of numbers. Effective visualization accelerates decision-making and communicates insights to stakeholders who may not be data experts.

Why Visualization Matters in Analytics

Charts are not just decoration - they are analytical tools. A histogram reveals distribution shape, a scatter plot shows correlation, and a time-series line chart highlights trends. Choosing the wrong chart type can mislead your audience or hide important patterns. The goal is clarity: one glance should answer a business question.

Reveal Insights

Patterns, outliers, and trends become visible instantly when data is visualized correctly

Communicate Clearly

Share findings with non-technical stakeholders through intuitive visual narratives

Drive Decisions

Well-designed dashboards enable faster, data-driven business decisions

The Science Behind Visual Perception: Our brains are wired to process visual information efficiently. When you look at a chart, your visual cortex instantly identifies patterns like trends (lines going up), clusters (grouped points), and outliers (points far from others) - all before your conscious mind even starts analyzing the numbers.

Real-world example: A manager looking at 1,000 rows of sales data might take a long time to find a problem. The same manager looking at a line chart could spot "sales dropped in March" within seconds. This is why visualization is an essential skill in analytics.

Core Python Libraries for Visualization

Python offers three main libraries for visualization, each with distinct strengths. Matplotlib provides low-level control, Seaborn adds statistical elegance, and Plotly enables interactivity. Most analysts combine all three depending on the task.

Think of it this way:

Matplotlib is like a blank canvas - you control every pixel, but you write more code
Seaborn is like a smart template - it makes beautiful charts with minimal code
Plotly is like a web app - charts are interactive and can be embedded in websites

Before creating any chart, you need to import the libraries you plan to use. The code below shows the standard imports for Matplotlib and Seaborn, along with pandas and NumPy for data handling (Plotly is imported separately when you need it). Running this once at the top of your notebook prepares your environment for visualization:

# Standard imports for visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set a clean default style
sns.set_theme(style="whitegrid")

# Quick sample data
sales = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr"],
    "revenue": [12000, 15000, 13500, 17000]
})
print(sales)  # Prints a 4-row table with columns: month, revenue

What Each Import Does:

import pandas as pd - Loads your data handling library (DataFrames)
import numpy as np - Provides mathematical functions and arrays
import matplotlib.pyplot as plt - The core plotting engine
import seaborn as sns - Beautiful statistical plots built on matplotlib
sns.set_theme() - Sets a consistent, professional look for all your charts

Library	Best For	Interactivity	Learning Curve
Matplotlib	Full control, publication-quality static plots	Limited (static images by default; pan and zoom in interactive backends)	Medium - more code, more control
Seaborn	Statistical plots with beautiful defaults	Limited (inherits Matplotlib's behavior)	Easy - smart defaults, less code
Plotly	Interactive, web-ready charts and dashboards	Yes (hover, zoom, export)	Easy - simple API, interactive output

Pro Tip - Which Library to Start With? Start with Seaborn for exploratory analysis (fast, pretty defaults), then switch to Plotly when you need interactivity or web embedding. Use Matplotlib when you need pixel-perfect control for publications or reports.

Your First Seaborn Chart

Let's create a simple bar chart using Seaborn. This example demonstrates why Seaborn is popular with beginners - you get a professional-looking chart with just a few lines of code. Seaborn automatically handles axis labels, colors, and grid styling for you.

What the code does step by step:

sns.barplot() creates a bar chart from your DataFrame
x="month" tells Seaborn which column to use for categories (horizontal axis)
y="revenue" tells Seaborn which column to use for values (bar heights)
palette="Blues_d" applies a blue color scheme for visual appeal
plt.show() displays the chart on your screen

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

sales = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr"],
    "revenue": [12000, 15000, 13500, 17000]
})

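# hue="month" with legend=False colors each bar from the palette
# (seaborn 0.13+ deprecates passing palette without hue)
sns.barplot(data=sales, x="month", y="revenue", hue="month", palette="Blues_d", legend=False)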
plt.title("Monthly Revenue")
plt.ylabel("Revenue ($)")
plt.tight_layout()
plt.show()  # Displays bar chart

Output: A bar chart appears with four vertical bars representing January through April. The bars get progressively darker blue (thanks to Blues_d palette), and the y-axis shows revenue values from 0 to 17,000. Notice how Seaborn automatically added nice gridlines and spacing.

Matplotlib Basics

The Foundation of Python Plotting: Matplotlib is the foundation of much of Python visualization - several popular tools, including Seaborn and the built-in plotting methods of pandas, are built on top of it (Plotly is a notable exception with its own rendering engine). While Seaborn handles most tasks with less code, understanding Matplotlib gives you fine-grained control over every font, color, and position in your chart.

The Matplotlib Philosophy: Matplotlib follows a "blank canvas" approach. Unlike Seaborn which makes smart assumptions, Matplotlib gives you complete control but requires you to specify everything. This makes it more verbose but infinitely customizable.

Two Interfaces - Choose Wisely:

pyplot interface (plt.plot()): Quick and simple for single charts. State-based - each command modifies the "current" figure.
Object-oriented interface (fig, ax = plt.subplots()): More explicit control. Better for multiple charts, complex layouts, or production code.
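
To make the difference concrete, here is a small sketch that draws the same line twice, once through each interface. The revenue numbers are the sample values used earlier in this lesson; the titles are illustrative.

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [12000, 15000, 13500, 17000]

# pyplot interface: each call acts on the "current" figure and axes
plt.figure(figsize=(6, 3))
plt.plot(months, revenue, marker="o")
plt.title("Revenue (pyplot interface)")

# Object-oriented interface: you hold explicit references to fig and ax
fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(months, revenue, marker="o")
ax.set_title("Revenue (object-oriented interface)")

plt.show()
```

Both versions produce the same chart. The object-oriented form pays off once a script manages several figures, because every call names the axes it modifies.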

When to use Matplotlib instead of Seaborn:

When you need exact control over font sizes, line widths, colors, or positions
When creating charts for academic papers, journals, or publications with strict formatting
When building custom visualizations that Seaborn doesn't support (annotations, arrows, shapes)
When you need to combine multiple different chart types in one figure
When performance matters - Matplotlib can be faster for simple, repeated plots

Key Terminology:

Figure: The entire window or page - the top-level container for all plot elements
Axes: The actual plot area where data is drawn (a figure can have multiple axes)
Axis: The x-axis or y-axis with ticks and labels
Artist: Everything visible on the figure (lines, text, patches, etc.)
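
The four terms above map directly onto objects you can inspect. The following sketch creates one figure with two axes and prints the class name of each piece.

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(8, 3))   # one Figure, two Axes
line, = axes[0].plot([0, 1, 2], [0, 1, 4])       # plot() returns Line2D artists

print(type(fig).__name__)             # Figure
print(len(fig.axes))                  # 2
print(type(axes[0].xaxis).__name__)   # XAxis
print(type(line).__name__)            # Line2D

plt.close(fig)
```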

The code below creates a simple line chart. Notice how we explicitly set figure size, colors, labels, and grid - Matplotlib doesn't assume anything for you:

import matplotlib.pyplot as plt
import numpy as np

# Basic line plot with Matplotlib
x = np.linspace(0, 10, 100)
y = np.sin(x)

plt.figure(figsize=(8, 4))  # Set figure size
plt.plot(x, y, color="steelblue", linewidth=2, label="sin(x)")
plt.title("Sine Wave", fontsize=14)
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

How this code works: The np.linspace(0, 10, 100) function creates 100 evenly spaced points between 0 and 10, which become our x-coordinates. We set the figure size to 8×4 inches using figsize=(8, 4) for a wide, readable chart. The line is drawn in steelblue with linewidth=2 to make it stand out, and label="sin(x)" provides text for the legend. The grid uses alpha=0.3 to make it 70% transparent, ensuring it guides the eye without overpowering the actual data.

Key Matplotlib Concepts to Remember:

plt.figure(figsize=(width, height)) - Creates a canvas with custom dimensions in inches
plt.subplots(rows, cols) - Creates a grid of multiple charts at once
plt.tight_layout() - Automatically adjusts spacing to prevent label overlap
plt.savefig("chart.png", dpi=300) - Saves your chart as an image file
plt.show() - Displays the chart (needed at the end of a script; Jupyter notebooks usually render the figure automatically)
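
One ordering detail is worth a short sketch: in a script, save the figure before calling plt.show(), because the figure may no longer be available once its window is closed. The output location below is a hypothetical choice (the system temp directory) so the example does not clutter your project folder.

```python
import os
import tempfile
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 3))   # width and height in inches
ax.plot([1, 2, 3, 4], [12, 15, 13.5, 17], marker="o")
ax.set_title("Revenue (thousands)")
fig.tight_layout()

# Save first, then show
out_path = os.path.join(tempfile.gettempdir(), "chart.png")  # hypothetical location
fig.savefig(out_path, dpi=300)   # 6 x 3 inches at 300 dpi -> 1800 x 900 pixels
plt.show()
```

The pixel size of the saved image is figsize multiplied by dpi, which is why dpi=300 is a common choice for print-quality output.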

Subplots with Matplotlib

When you need to compare multiple views side by side, use subplots to arrange charts in a grid. This is incredibly useful for dashboards or when presenting related data together.

How subplots work: The fig, axes pattern gives you a figure (the overall canvas) and an array of axes (individual chart areas). You then plot on each axes separately:

import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

x = np.arange(5)
y1, y2, y3 = [3, 7, 2, 5, 8], [4, 6, 5, 7, 3], [2, 4, 6, 4, 5]

axes[0].bar(x, y1, color="#6366f1")
axes[0].set_title("Bar Chart")

axes[1].plot(x, y2, marker="o", color="#22c55e")
axes[1].set_title("Line Chart")

axes[2].scatter(x, y3, s=100, color="#f59e0b")
axes[2].set_title("Scatter Plot")

plt.tight_layout()
plt.show()

How this code creates three charts: The plt.subplots(1, 3) call creates a grid with 1 row and 3 columns, returning both the figure object and an array of axes. We access each chart area using axes[0], axes[1], and axes[2], then call different plotting methods (.bar(), .plot(), .scatter()) on each one. Note that when using the object-oriented approach, we use set_title() instead of plt.title() since we're working with specific axes objects.

Understanding the Axes Array: When you write fig, axes = plt.subplots(1, 3), Matplotlib returns:

fig - The entire figure (like a window or page)
axes - A NumPy array of 3 chart areas: axes[0], axes[1], axes[2]

You then use axes[0].bar(...) to draw on the first chart, axes[1].plot(...) on the second, and so on.
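
The shape of the returned array depends on the grid you request, which is a common source of indexing errors. This sketch shows the three cases and a shape-independent way to loop over every chart area.

```python
import matplotlib.pyplot as plt

fig1, axes_row = plt.subplots(1, 3)    # 1-D array: index with axes_row[i]
fig2, axes_grid = plt.subplots(2, 2)   # 2-D array: index with axes_grid[row, col]
fig3, single_ax = plt.subplots()       # one Axes object, not an array

print(axes_row.shape)    # (3,)
print(axes_grid.shape)   # (2, 2)

# .flat visits every Axes in row-major order, whatever the grid shape
for i, ax in enumerate(axes_grid.flat):
    ax.set_title(f"Panel {i + 1}")

plt.close("all")
```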

Seaborn Statistical Plots

Smart Statistics Made Simple: Seaborn excels at statistical visualization - plots that automatically compute and display statistical summaries of your data. While Matplotlib requires you to calculate everything manually, Seaborn handles the math behind the scenes, letting you focus on insights rather than implementation.

Why this matters for data analysts: When exploring a new dataset, you need to quickly answer fundamental questions: "How is this variable distributed?" "Is there a correlation between X and Y?" "Are there outliers I should investigate?" "Do different groups behave differently?" Seaborn's statistical plots answer these questions with just a few lines of code, automatically adding confidence intervals, regression lines, and distribution curves.

The Seaborn advantage: Seaborn is built specifically for statistical data exploration. It integrates seamlessly with pandas DataFrames (just pass your df and column names), uses intelligent defaults for colors and styling, and automatically aggregates data when needed. For example, a bar plot in Seaborn automatically shows the mean of each category with error bars - in Matplotlib, you'd need to calculate this yourself.
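
To see what that automatic aggregation saves you, here is a rough manual equivalent in pandas and Matplotlib. The bill values are invented for illustration, and the error bars use a normal approximation (1.96 × standard error) rather than the bootstrap interval Seaborn computes by default, so the two will not match exactly.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical bills for two days (values invented for illustration)
bills = pd.DataFrame({
    "day": ["Sat", "Sat", "Sat", "Sat", "Sun", "Sun", "Sun", "Sun"],
    "total_bill": [18.0, 22.0, 20.0, 24.0, 25.0, 21.0, 27.0, 23.0],
})

# Step 1: aggregate by hand - mean and standard error per category
summary = bills.groupby("day")["total_bill"].agg(["mean", "sem"])

# Step 2: draw bars with an approximate 95% interval
fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(summary.index, summary["mean"], yerr=1.96 * summary["sem"],
       capsize=4, color="steelblue")
ax.set_ylabel("Mean total bill ($)")
plt.show()

print(summary["mean"].to_dict())  # {'Sat': 21.0, 'Sun': 24.0}
```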

Key Seaborn Plot Types

Seaborn organizes its plots into logical categories based on what you're trying to understand about your data:

histplot / kdeplot - Distribution of a single variable. Histplot shows counts in bins; kdeplot shows a smooth probability curve. Use when asking "what values are common?"
boxplot / violinplot - Distribution with statistical summaries. Box shows quartiles; violin adds density shape. Use when comparing groups or finding outliers.
regplot / lmplot - Scatter plots with regression lines and confidence intervals. Use when asking "does Y increase as X increases?"
heatmap - Color-coded matrices for correlation tables or 2D data. Use when visualizing relationships between many variables at once.
pairplot - All pairwise scatter plots in a single view. Use for initial exploration to spot patterns across all variable combinations.
catplot - Categorical plots with faceting support (bar, strip, swarm, box, violin). Use when your x-axis is categories, not numbers.

Quick decision guide: For distributions → histplot/kdeplot. For comparisons → boxplot/barplot. For relationships → regplot/scatterplot. For correlations → heatmap. For "show me everything" → pairplot.

Working with sample datasets: Seaborn provides several sample datasets that are perfect for learning. The tips dataset contains restaurant bill and tip data; iris has flower measurements; titanic includes passenger survival data. Load them with sns.load_dataset("tips") to practice without preparing your own files (the first call downloads the data, so it needs an internet connection).

import seaborn as sns
import matplotlib.pyplot as plt

# Load sample dataset
tips = sns.load_dataset("tips")

# Statistical plots showcase
fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Distribution with KDE
sns.histplot(tips["total_bill"], kde=True, ax=axes[0, 0], color="purple")
axes[0, 0].set_title("Distribution: Total Bill")

# Box plot by category (hue + legend=False is the seaborn 0.13+ way to apply a palette)
sns.boxplot(data=tips, x="day", y="total_bill", hue="day", palette="Set2", legend=False, ax=axes[0, 1])
axes[0, 1].set_title("Box Plot: Bill by Day")

# Regression plot
sns.regplot(data=tips, x="total_bill", y="tip", ax=axes[1, 0], color="teal")
axes[1, 0].set_title("Regression: Bill vs Tip")

# Violin plot (hue + legend=False is the seaborn 0.13+ way to apply a palette)
sns.violinplot(data=tips, x="day", y="tip", hue="day", palette="muted", legend=False, ax=axes[1, 1])
axes[1, 1].set_title("Violin: Tip by Day")

plt.tight_layout()
plt.show()

What each plot reveals:

Histogram with KDE (top-left): Shows the distribution of bill amounts, with most bills falling between $10 and $25 and a right skew toward higher amounts. Most customers have moderate bills, but some spend significantly more.
Box plot (top-right): Compares bill distributions across days. The box covers the middle 50% of the data (the interquartile range, IQR), each whisker extends to the most extreme value within 1.5 × IQR of the box edge, and dots mark points beyond the whiskers that are worth investigating.
Regression plot (bottom-left): Reveals a positive relationship between bill size and tip - larger bills tend to come with larger tips. The shaded band is the 95% confidence interval for the fitted line; a narrow band means the slope is estimated precisely, not that individual tips sit close to the line.
Violin plot (bottom-right): Combines box plot statistics with the full distribution shape, where wider sections indicate more data points at that tip amount.

Pairplot for Exploration

The pairplot() function is one of Seaborn's most powerful tools for exploratory data analysis. It creates a matrix of scatter plots showing every possible pair of numeric variables in your dataset. This "show me everything" approach is invaluable when you first encounter a new dataset and need to quickly understand the relationships between variables without writing dozens of individual charts.

When to use pairplot: Use pairplot at the start of any analysis to identify which variables are correlated, which might be useful for prediction, and whether there are obvious clusters or patterns. It's especially powerful when you add hue to color points by a categorical variable - this reveals whether different groups have different patterns. Pairplots answer questions like: "Which variables are strongly correlated?" "Are there natural groupings in my data?" "Which features separate my categories best?"

How pairplot is structured: The output is an N×N grid where N is the number of numeric columns. The diagonal cells show the distribution of each individual variable (either histograms or KDE curves). The off-diagonal cells show scatter plots of every possible variable pair. Since the grid is symmetric (row 1, col 2 shows X vs Y, while row 2, col 1 shows Y vs X), you can focus on just the upper or lower triangle.

Key Pairplot Options

hue="column" - Colors all points by a categorical column, revealing group patterns
diag_kind="kde" or "hist" - Controls diagonal plots (smooth curves vs bars)
corner=True - Shows only the lower triangle to reduce redundancy
vars=["col1", "col2"] - Limits which columns to include (faster for large datasets)
kind="reg" - Adds regression lines to all scatter plots
markers=["o", "s", "D"] - Different shapes for each hue category

Performance tip: Pairplots can be slow with many columns or large datasets. Use vars to select only the most relevant columns, or sample your data first with df.sample(1000) for faster iteration.

import seaborn as sns
import matplotlib.pyplot as plt

# Pairplot with hue for categorical grouping
iris = sns.load_dataset("iris")
sns.pairplot(iris, hue="species", palette="husl", diag_kind="kde")
plt.suptitle("Iris Dataset Pairplot", y=1.02)
plt.show()

Understanding the pairplot output: The diagonal shows the distribution of each variable using KDE curves (since we set diag_kind="kde"), allowing you to see how each measurement is distributed within each species - notice how some distributions overlap while others are clearly separated. The off-diagonal cells display scatter plots of every variable pair, colored by species. Look for clusters of the same color - if points group by color, that variable combination helps distinguish species.

Reading the iris pairplot: In the iris example, you'll notice that sepal measurements (top-left area) show significant overlap between species, making them weaker classifiers. However, petal length and petal width (bottom-right area) create distinct, well-separated clusters for each species - setosa is clearly isolated, while versicolor and virginica have minimal overlap. This immediately tells us petal measurements are excellent features for building a species classifier, while sepal measurements alone would struggle to separate the groups.

Real-world applications: In business analytics, pairplots help identify which customer attributes correlate with spending. In healthcare, they reveal which biomarkers cluster together. In manufacturing, they show which process parameters affect quality. The ability to see all relationships at once often reveals unexpected patterns that targeted analysis would miss.

Practice Questions: Seaborn & Matplotlib

Test your understanding with these hands-on exercises.

Task: Given the DataFrame below, create a line chart of temperature over days using Seaborn's lineplot.

import pandas as pd
weather = pd.DataFrame({
    "day": [1, 2, 3, 4, 5],
    "temp": [22, 24, 19, 21, 25]
})

Solution

import seaborn as sns
import matplotlib.pyplot as plt

sns.lineplot(data=weather, x="day", y="temp", marker="o")
plt.title("Daily Temperature")
plt.xlabel("Day")
plt.ylabel("Temperature (C)")
plt.show()

Task: Create a grouped bar chart comparing Q1 and Q2 sales for North and South regions.

import pandas as pd
regions = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales": [100, 120, 90, 110]
})

Solution

import seaborn as sns
import matplotlib.pyplot as plt

sns.barplot(data=regions, x="region", y="sales", hue="quarter", palette="Set2")
plt.title("Regional Sales by Quarter")
plt.ylabel("Sales (k)")
plt.show()

Task: Create a 2x2 grid of subplots showing: (1) bar chart, (2) line chart, (3) scatter plot, (4) histogram using the data below.

import numpy as np
x = [1, 2, 3, 4, 5]
y = [10, 15, 13, 17, 20]
scatter_x = np.random.rand(50)
scatter_y = np.random.rand(50)
hist_data = np.random.normal(50, 10, 200)

Solution

import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

axes[0, 0].bar(x, y, color='steelblue')
axes[0, 0].set_title('Bar Chart')

axes[0, 1].plot(x, y, marker='o', color='green')
axes[0, 1].set_title('Line Chart')

axes[1, 0].scatter(scatter_x, scatter_y, alpha=0.6, color='purple')
axes[1, 0].set_title('Scatter Plot')

axes[1, 1].hist(hist_data, bins=20, color='coral', edgecolor='white')
axes[1, 1].set_title('Histogram')

plt.tight_layout()
plt.show()

Task: Load the 'tips' dataset from Seaborn and create a pairplot showing relationships between numeric columns, colored by the 'time' column (Lunch/Dinner).

Solution

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset('tips')
sns.pairplot(tips, hue='time', palette='Set1', diag_kind='kde')
plt.suptitle('Tips Dataset - Lunch vs Dinner', y=1.02)
plt.show()

Chart Types and Selection

Choose Wisely, Communicate Clearly: Selecting the right chart type is the most critical decision in visualization. A pie chart that should be a bar chart, or a line chart used for categorical data, can confuse or mislead your audience. This section covers the main chart families and provides a decision framework for choosing wisely.

The One Question That Decides Your Chart: Before creating any chart, ask yourself: "What relationship am I trying to show?" The answer determines your chart type. Are you comparing categories? Showing how values are distributed? Revealing a correlation? Or showing parts of a whole? Each relationship has chart types that work best - and chart types that will mislead your audience.

The Four Chart Families

Most charts fall into one of four families based on their purpose:

Comparison: Show differences between categories (bar, column, radar)
Composition: Show parts of a whole (pie, stacked bar, treemap)
Distribution: Show how values spread across a range (histogram, box, violin)
Relationship: Show connections or correlations (scatter, line, heatmap)

Pro tip: If you're unsure which family fits, describe your insight in one sentence. "Sales are higher in Region A than B" = Comparison. "Most customers are aged 25-35" = Distribution. "Revenue increases with ad spend" = Relationship.

Comparison Charts

Use comparison charts when you need to show differences between categories or groups. These are the most common charts in business analytics because executives constantly ask questions like "Which product sells most?" or "How do regions compare?" The human brain is excellent at comparing lengths and heights, making bar charts one of the most effective and universally understood visualizations.

Why comparison charts work: Unlike pie charts where angle comparisons are notoriously difficult, bar charts align everything to a common baseline, making differences immediately obvious. A 10% difference between two bars is easy to spot, while the same difference in pie slices often goes unnoticed. This is why bar charts dominate business dashboards, reports, and presentations across industries.

Comparison Chart Guidelines

Horizontal bar charts: Work best when category names are long (more than 3 words) - labels read naturally left-to-right
Vertical column charts: Suit time-based comparisons (months, quarters) and when you have short category names
Always start the value axis at zero: Bar length encodes the value, so truncating the axis exaggerates differences and misleads viewers
Sort bars by value: Sort descending (largest first) to make patterns obvious - alphabetical sorting hides insights
Limit to 10-12 categories: Too many bars create visual clutter; consider grouping smaller categories into "Other"

Common mistake: Truncating the value axis to "make differences look bigger" is a form of chart manipulation. With values of 100 and 105, an axis that starts at 90 instead of 0 draws bars of length 10 and 15, so a 5% difference looks like a 50% difference. Always show the full scale on bar charts.
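
The distortion is easy to quantify. In this small sketch, apparent_ratio is a hypothetical helper that computes how long one bar looks relative to another for a given axis starting point.

```python
def apparent_ratio(a, b, baseline):
    """Ratio of drawn bar lengths when the axis starts at `baseline`."""
    return (b - baseline) / (a - baseline)

a, b = 100, 105   # b is 5% larger than a

print(round(apparent_ratio(a, b, baseline=0), 2))    # 1.05 -> bars look 5% apart
print(round(apparent_ratio(a, b, baseline=90), 2))   # 1.5  -> bars look 50% apart
```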

Choosing between horizontal and vertical: Use horizontal bars when your category labels are long (like product names or country names) - this gives text room to display without rotation or truncation. Use vertical columns when categories represent time periods (January, Q1 2025) since we naturally read time left-to-right. If you have more than 7-8 categories, horizontal bars usually work better because they can extend vertically without crowding.

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Sample product sales data
products = pd.DataFrame({
    "product": ["Widget A", "Widget B", "Widget C", "Widget D"],
    "units_sold": [340, 290, 450, 380]
})

# Sort descending so the best seller is listed first (top of the chart)
products_sorted = products.sort_values("units_sold", ascending=False)

# Horizontal bar chart - good for long labels
# hue + legend=False is the seaborn 0.13+ way to apply a palette
plt.figure(figsize=(8, 4))
sns.barplot(data=products_sorted, y="product", x="units_sold",
            hue="product", palette="viridis", legend=False)
plt.title("Units Sold by Product")
plt.xlabel("Units Sold")
plt.ylabel("")  # Remove redundant y-label when bars are self-explanatory
plt.tight_layout()
plt.show()

Reading this chart: The bars are sorted descending, so the longest bar appears at the top and the ranking reads naturally from top to bottom. Widget C clearly leads with 450 units, followed by Widget D (380), Widget A (340), and Widget B (290). The viridis color palette provides good contrast and is colorblind-friendly. Notice how the horizontal orientation gives the product names room to display fully without rotation or abbreviation.

Adding context with annotations: Raw bar charts show relative differences, but adding data labels or reference lines provides context. Is 450 units good or bad? Adding a target line at 400 units would instantly show which products met their goal. Consider adding value labels on or near bars when the exact numbers matter - especially in reports where viewers might need to cite specific figures.
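
A minimal sketch of both ideas, using plain Matplotlib and the product figures from the example above. The 400-unit target comes from the paragraph; the colors and label text are illustrative. Note that bar_label requires Matplotlib 3.4 or newer.

```python
import matplotlib.pyplot as plt

products = ["Widget C", "Widget D", "Widget A", "Widget B"]
units_sold = [450, 380, 340, 290]
target = 400  # reference level mentioned in the text

fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.barh(products, units_sold, color="steelblue")
ax.invert_yaxis()                  # keep the first (largest) bar at the top

ax.bar_label(bars, padding=3)      # print each value at the end of its bar
ax.axvline(target, color="crimson", linestyle="--", label=f"Target ({target})")

ax.set_xlabel("Units Sold")
ax.set_title("Units Sold by Product vs Target")
ax.legend(loc="lower right")
fig.tight_layout()
plt.show()

met_target = [p for p, u in zip(products, units_sold) if u >= target]
print(met_target)  # ['Widget C']
```

With the reference line in place, the reader no longer has to judge whether 380 is "good": the bar visibly stops short of the target.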

Grouped vs Stacked Bars: When comparing multiple metrics across categories (like revenue AND profit by product), use grouped bars (side by side) if you want to compare individual values, or stacked bars if the total matters most. Grouped bars make individual comparisons easy but require more horizontal space. Stacked bars show totals clearly but make comparing individual segments harder.
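
The trade-off is easiest to judge side by side. This sketch reuses the regional sales figures from the practice exercise and relies on the pandas plotting method, where stacked=True is the only difference between the two panels.

```python
import pandas as pd
import matplotlib.pyplot as plt

regions = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales": [100, 120, 90, 110],
})

# Reshape to one row per region and one column per quarter
wide = regions.pivot(index="region", columns="quarter", values="sales")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
wide.plot(kind="bar", ax=axes[0], rot=0)                 # grouped (side by side)
axes[0].set_title("Grouped: compare quarters directly")
wide.plot(kind="bar", stacked=True, ax=axes[1], rot=0)   # stacked
axes[1].set_title("Stacked: compare regional totals")
for ax in axes:
    ax.set_ylabel("Sales (k)")
fig.tight_layout()
plt.show()

print(wide.sum(axis=1).to_dict())  # {'North': 220, 'South': 200}
```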

Distribution Charts

Distribution charts answer the question "How are my values spread out?" They reveal patterns like whether data is clustered, spread evenly, or has outliers. Understanding distribution is essential before building any predictive model because many algorithms assume data follows certain patterns (like the normal distribution), and knowing your data's shape helps you choose the right approach.

Why distribution matters: Imagine you're analyzing customer purchase amounts. If most customers spend around $50 but a few spend $5,000, your average ($150) misrepresents typical behavior. Distribution charts expose these patterns instantly. They show whether your data is symmetric (bell curve), skewed (tail on one side), bimodal (two peaks), or uniform (flat). Each shape tells a different story and suggests different analytical approaches.
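
The purchase example can be checked with a few lines of NumPy. The customer counts below are an assumption, chosen so that the mean works out to exactly $150.

```python
import numpy as np

# 97 customers spend $50 and 2 spend $5,000 (counts chosen for illustration)
purchases = np.array([50] * 97 + [5000] * 2)

print(np.mean(purchases))    # 150.0
print(np.median(purchases))  # 50.0
```

The median stays at the typical $50 purchase, while two large orders triple the mean. A histogram of this data would show the gap between the two immediately.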

Distribution Chart Selection Guide

Histogram: Shows frequency of values in "bins" - use for understanding the shape of your data and identifying modes (peaks)
Box plot: Shows median, quartiles, and outliers - use for comparing groups or spotting anomalies quickly
Violin plot: Combines box plot with density estimation - use when distribution shape matters and you need to compare groups
KDE (kernel density): Smooth curve showing probability - use when you want a clean, continuous view without bin artifacts

Quick decision: Comparing 2+ groups? Use box or violin plots. Single variable exploration? Start with histogram + KDE overlay. Looking for outliers? Box plots make them obvious.

Understanding histogram bins: A histogram divides your data range into equal-width "bins" and counts how many values fall into each bin. The number of bins dramatically affects what you see - too few bins hide patterns, too many create noise. The bins parameter controls this. Start with the default (auto-calculated based on data size) and adjust if needed. For 200 data points, 15-20 bins usually works well; for thousands of points, 30-50 bins may reveal more detail.
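
To see the effect of bin count directly, this sketch draws the same synthetic two-peak data with three different settings. The data is generated for illustration. The last lines ask NumPy for its own automatic choice, which is one reasonable rule and may differ from the default Seaborn picks.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Two overlapping groups, so the data has two peaks (synthetic values)
values = np.concatenate([rng.normal(60, 5, 100), rng.normal(80, 5, 100)])

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins in zip(axes, [3, 15, 80]):
    ax.hist(values, bins=bins, color="steelblue", edgecolor="white")
    ax.set_title(f"bins={bins}")
fig.tight_layout()
plt.show()

# NumPy can report the automatic choice it would make
auto_edges = np.histogram_bin_edges(values, bins="auto")
print(len(auto_edges) - 1)  # number of bins picked by NumPy's "auto" rule
```

With 3 bins the two peaks merge into one block, with 80 bins the shape dissolves into noise, and the middle setting shows the two groups clearly.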

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Generate sample data - exam scores with mean 72, std dev 12
# A seeded generator makes the chart reproducible from run to run
rng = np.random.default_rng(42)
scores = rng.normal(loc=72, scale=12, size=200)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Histogram with KDE overlay
sns.histplot(scores, bins=15, kde=True, ax=axes[0], color="steelblue")
axes[0].set_title("Score Distribution (Histogram)")
axes[0].set_xlabel("Exam Score")
axes[0].set_ylabel("Count")

# Box plot for same data
sns.boxplot(x=scores, ax=axes[1], color="coral")
axes[1].set_title("Score Distribution (Box Plot)")
axes[1].set_xlabel("Exam Score")

plt.tight_layout()
plt.show()

Comparing the two views: The histogram on the left shows the bell curve shape with most scores clustering around 72, and the KDE line (smooth curve) overlaid on the bars provides a cleaner view of the distribution shape - notice how it smooths out the bar "steps" into a continuous curve. The box plot on the right condenses the same 200 scores into five key numbers: the lower whisker end, Q1 (25th percentile), median, Q3 (75th percentile), and the upper whisker end. The whisker ends are the minimum and maximum only when no points are flagged as outliers. Box plots are excellent for comparing multiple groups side by side, though they lose the detailed shape information that histograms preserve.

What shapes tell you: A symmetric bell curve (like our exam scores) suggests most values cluster around the center - this is the shape of the normal distribution. A right-skewed distribution (long tail to the right) is common for income or wait times - most values are low but some are very high. A left-skewed distribution (long tail to the left) appears in scores on an easy exam or age at retirement - most values are high with some very low. A bimodal distribution (two peaks) might indicate two distinct groups in your data that should be analyzed separately.

Reading a Box Plot: Box plots pack a lot of information into a small space. Here's how to read one:

Box: Contains the middle 50% of your data (from 25th to 75th percentile) - called the Interquartile Range (IQR)
Line inside box: The median (middle value) - not the mean!
Whiskers: Extend to the most extreme values that are not outliers (by default, the furthest points within 1.5 × IQR of the box edges)
Dots beyond whiskers: Outliers - values that are unusually far from the rest
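
The same quantities can be computed by hand, which helps when you need the exact numbers behind a box plot. This sketch uses a small made-up sample and the 1.5 × IQR rule described above.

```python
import numpy as np

data = np.array([2, 4, 5, 6, 7, 8, 9, 10, 12, 30])  # made-up sample

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, median, q3)              # 5.25 7.5 9.75
print(whisker_low, whisker_high)   # 2 12
print(outliers)                    # [30]
```

Note that the upper whisker ends at 12, the largest observed value inside the fence, not at the fence value of 16.5 itself.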

When to use violin plots: Violin plots combine the benefits of box plots and KDE. They show the same summary statistics as a box plot (median, quartiles) but add a mirrored density curve on each side, revealing the full shape of the distribution. Use them when comparing groups where the shape matters - for example, comparing test score distributions between classes might show one class has a bimodal distribution (two groups of students) while another has a normal distribution.

import seaborn as sns
import matplotlib.pyplot as plt

# Compare distributions across categories
tips = sns.load_dataset("tips")

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Box plot comparison (hue + legend=False is the seaborn 0.13+ way to apply a palette)
sns.boxplot(data=tips, x="day", y="total_bill", hue="day", palette="Set2", legend=False, ax=axes[0])
axes[0].set_title("Bill Amounts by Day (Box Plot)")

# Violin plot comparison (same hue + legend=False pattern)
sns.violinplot(data=tips, x="day", y="total_bill", hue="day", palette="Set2", legend=False, ax=axes[1])
axes[1].set_title("Bill Amounts by Day (Violin Plot)")

plt.tight_layout()
plt.show()

Box vs violin for group comparison: Box plots make it easy to compare medians, spreads, and outliers across days at a glance - for example, you can read off which day has the highest median bill. Violin plots reveal more about shape: a violin with two bulges signals a bimodal distribution that a box plot would hide completely. Be careful when explaining such shapes, though. In the tips dataset, Saturday and Sunday records are dinner-only, so a second bump on those days cannot be a lunch-versus-dinner effect; party size is a more plausible candidate to check. Box plots remain the cleaner choice when you have many groups to compare or when you mainly care about medians and outliers.

Relationship Charts

Relationship charts reveal how two or more variables relate to each other. They answer questions like "Does more advertising lead to more sales?" or "Are test scores correlated with study hours?" These charts are fundamental to predictive analytics because they help identify which variables might be useful predictors in machine learning models.

Why relationships matter in analytics: Understanding how variables relate is the foundation of prediction. If you know that ad spend strongly correlates with conversions, you can forecast future conversions based on planned ad budgets. Scatter plots with regression lines are often the first step in building any predictive model - they help you visualize whether a linear model is appropriate or if you need something more complex.

Patterns

Types of Relationships to Identify

Positive correlation: As one variable increases, so does the other (points trend upward-right). Example: study hours and test scores
Negative correlation: As one variable increases, the other decreases (points trend downward-right). Example: price and quantity sold
No correlation: No clear pattern between variables (points scattered randomly). Example: shoe size and income
Non-linear: Variables are related, but not in a straight line (curved pattern). Example: age and athletic performance

Strength indicator: The tighter the points cluster around the trend line, the stronger the correlation. Widely scattered points suggest a weak relationship even if the trend direction is clear.
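Correlation strength can also be put into a number. As a rough sketch, the snippet below computes the Pearson correlation coefficient r with NumPy for three of the patterns listed above; all y-values are invented for illustration.

```python
import numpy as np

x = np.arange(1, 11)

# Hypothetical y-values illustrating three of the patterns
y_positive = np.array([3, 5, 6, 9, 11, 12, 15, 16, 19, 20])
y_negative = y_positive[::-1]
y_curved = (x - 5.5) ** 2  # U-shaped: clearly related to x, but not linearly


def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]


for name, y in [("positive", y_positive), ("negative", y_negative), ("curved", y_curved)]:
    print(f"{name:>8}: r = {pearson_r(x, y):+.2f}")
```

Values of r near +1 or -1 indicate points hugging a straight line. The curved case yields an r close to zero even though y depends entirely on x, which is why a scatter plot should always accompany the coefficient.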

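The regression line explained: When you add a regression line (using regplot()), Seaborn fits the "best fit" straight line by ordinary least squares, that is, the line that minimizes the sum of squared vertical distances between the points and the line. The slope tells you how much Y changes, on average, for each one-unit increase in X. A steep slope means Y changes a lot per unit of X; a nearly flat line suggests X has little linear effect on Y. Keep in mind that the slope depends on the units of both axes, so it measures the size of the effect, not the strength of the correlation. The shaded band is a confidence interval for the fitted line and shows the uncertainty in this estimate.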

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Sample data
ads = pd.DataFrame({
    "ad_spend": [100, 200, 300, 400, 500, 600],
    "conversions": [12, 25, 35, 40, 55, 60]
})

sns.regplot(data=ads, x="ad_spend", y="conversions", ci=95, color="teal")
plt.title("Ad Spend vs Conversions")
plt.xlabel("Ad Spend ($)")
plt.ylabel("Conversions")
plt.show()

What this chart tells us: Each dot is one observation of ad spend and the conversions achieved. The trend line shows the best-fit relationship - as ad spend increases, conversions increase as well. Looking at the slope, every $100 increase in ad spend is associated with roughly 10 additional conversions. The shaded area is the 95% confidence interval for the fitted line, meaning the true relationship plausibly falls within this range - a narrow band indicates a precise estimate, while a wide band signals more uncertainty. With only six data points, treat the estimate as indicative rather than definitive.
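The slope read off the chart can be checked numerically. This is a minimal sketch that fits the same straight line with np.polyfit on the ad spend data from the example; it is an independent least-squares fit, not a call into Seaborn's internals.

```python
import numpy as np

ad_spend = np.array([100, 200, 300, 400, 500, 600])
conversions = np.array([12, 25, 35, 40, 55, 60])

# A degree-1 polynomial fit is an ordinary least-squares straight line
slope, intercept = np.polyfit(ad_spend, conversions, deg=1)

print(f"Slope: {slope:.4f} conversions per dollar")
print(f"Per $100 of extra spend: {slope * 100:.1f} conversions")
print(f"Intercept: {intercept:.2f}")
```

The fitted slope works out to about 9.6 conversions per extra $100, which matches the "about 10" reading from the chart.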

Customizing relationship charts: Use ci=None to remove the confidence interval for cleaner visuals, or ci=99 for a wider band showing 99% confidence. Add scatter_kws={"s": 100, "alpha": 0.7} to control dot size and transparency. For comparing groups, use lmplot() instead with a hue parameter to show separate regression lines for each category - this reveals whether the relationship differs between groups.

Correlation ≠ Causation: Just because two variables are correlated doesn't mean one causes the other! Ice cream sales and drowning deaths are correlated - but ice cream doesn't cause drowning. Both increase in summer (confounding variable). Always think critically about the underlying cause before drawing conclusions from scatter plots. Consider: Is there a logical mechanism? Could a third variable explain both? Would an experiment confirm causation?

Composition Charts

Composition charts show parts of a whole - how different categories contribute to a total. They answer questions like "What percentage of sales comes from each region?" or "How is our budget allocated?" These charts are essential for understanding market share, portfolio allocation, survey responses with single-choice answers, and any scenario where components must sum to 100%.

When composition matters: Use composition charts when the story is about proportions and relative contributions, not absolute values. "Marketing gets 30% of the budget" is a composition insight. "Marketing spends $300,000" is a comparison insight. Knowing which question you're answering determines which chart type you need. Composition charts are especially powerful for tracking how proportions change over time - for example, how market share shifts year over year.
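Turning absolute values into shares is the first step for any composition chart. The sketch below assumes a hypothetical budget table (department names and amounts are invented), converts each year to percentages with pandas, and draws the result as a 100% stacked bar chart.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical budget (in $k) by department for three years
budget = pd.DataFrame(
    {
        "Marketing": [300, 360, 450],
        "Engineering": [500, 540, 600],
        "Operations": [200, 300, 450],
    },
    index=["2022", "2023", "2024"],
)

# Divide each row by its total so every bar sums to 100%
shares = budget.div(budget.sum(axis=1), axis=0) * 100
print(shares.round(1))

ax = shares.plot(kind="bar", stacked=True, figsize=(8, 5), rot=0)
ax.set_title("Budget Share by Department")
ax.set_xlabel("Year")
ax.set_ylabel("Share of budget (%)")
ax.legend(title="Department", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
```

In this made-up data, Marketing's spending grows every year in absolute terms while its share stays at 30%, which is exactly the distinction between a comparison insight and a composition insight. Call plt.show() to display the chart when running as a script.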

Best Practices

Composition Chart Selection Guide

Pie charts: Use ONLY when you have 2-5 categories that sum to 100% and one slice is clearly dominant. More slices = unreadable
Donut charts: Pie chart with a hole - useful for placing a key metric (like total) in the center
Stacked bar charts: Better than pie for comparing composition across groups (e.g., budget breakdown by department across years)
100% stacked bar: Forces all bars to same height, showing pure proportion changes over time
Treemaps: Great for hierarchical data with many categories - shows nested proportions

The pie chart test: If you can't immediately tell which slice is biggest, or if slices are similar sizes, switch to a bar chart. Our brains compare angles poorly but lengths easily.

import matplotlib.pyplot as plt

# Market share data - sorted by size for best readability
labels = ["Product A", "Product B", "Product C", "Other"]
sizes = [35, 30, 20, 15]  # Must sum to 100
colors = ["#6366f1", "#22c55e", "#f59e0b", "#94a3b8"]
explode = (0.05, 0, 0, 0)  # Slightly separate the largest slice

plt.figure(figsize=(8, 6))
plt.pie(sizes, labels=labels, colors=colors, autopct="%1.0f%%",
        startangle=90, counterclock=False,  # Start at 12 o'clock, run clockwise
        explode=explode, shadow=False)
plt.title("Market Share by Product", fontsize=14, fontweight="bold")
plt.axis("equal")  # Ensures circular pie (not oval)
plt.show()

Key pie chart parameters: The autopct="%1.0f%%" parameter displays percentage labels on each slice with zero decimal places. Setting startangle=90 rotates the chart so the first slice begins at 12 o'clock (the natural starting point for readers). The explode tuple slightly separates specific slices for emphasis - useful for highlighting the market leader or the category you're discussing. Finally, plt.axis("equal") forces equal scaling on both axes so the pie is drawn as a circle rather than an oval; Matplotlib 3.0 and later do this by default, so the call mainly matters on older versions.

Creating effective pie charts: Always sort slices by size, largest first and running clockwise from 12 o'clock, so the visual hierarchy matches the data hierarchy. Matplotlib draws slices counterclockwise by default, so pass counterclock=False to plt.pie() to get the clockwise order. Use contrasting colors for adjacent slices - avoid putting two similar blues next to each other. Limit to 5 slices maximum; group smaller categories into "Other" if needed. If "Other" becomes the largest slice, your categorization needs rethinking. Consider adding a legend instead of labels if slice names are long.
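Grouping small categories into "Other" is easy to automate. The helper below is a hypothetical sketch (top_n_with_other is not a pandas or Matplotlib function, and the market shares are invented): it keeps the largest categories and sums the remainder into a single slice.

```python
import pandas as pd


def top_n_with_other(values: pd.Series, n: int = 4, other_label: str = "Other") -> pd.Series:
    """Keep the n largest categories and sum the rest into one 'Other' slice."""
    ordered = values.sort_values(ascending=False)
    top = ordered.iloc[:n]
    remainder = ordered.iloc[n:].sum()
    if remainder > 0:
        top = pd.concat([top, pd.Series({other_label: remainder})])
    return top


# Hypothetical market shares (%) for seven products
shares = pd.Series({"A": 32, "B": 27, "C": 18, "D": 9, "E": 6, "F": 5, "G": 3})
slices = top_n_with_other(shares, n=4)
print(slices)
```

The result has five slices, with "Other" placed last regardless of its size. The values and index can be passed straight to plt.pie(slices.values, labels=slices.index).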

Your Question	Best Chart	Avoid Using	Why
Which category is largest?	Bar / Column	Pie (if >5 items)	Humans compare lengths better than angles
How has the value changed over time?	Line chart	Pie, Scatter	Lines show trend direction clearly
What's the typical value and spread?	Histogram, Box, Violin	Pie, Bar	These show distribution shape
Are two things related?	Scatter, Heatmap	Bar, Pie	Position reveals correlation patterns
What's the breakdown of the total?	Pie (≤5 items), Stacked Bar	Line, Scatter	Parts-of-whole needs enclosed area

Common Mistake - The Pie Chart Trap: Using pie charts for more than five categories makes slices hard to compare (our brains are bad at comparing angles). If you have more than 5 categories, switch to a horizontal bar chart sorted by value. Also avoid pie charts when slices are similar sizes - a bar chart makes differences clearer.
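As a sketch of the suggested alternative, the snippet below draws eight categories as a horizontal bar chart sorted by value. The product lines and sales figures are invented for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sales by product line: too many categories for a pie chart
sales = pd.Series(
    {"Laptops": 42, "Phones": 38, "Tablets": 21, "Monitors": 17,
     "Audio": 14, "Wearables": 11, "Accessories": 9, "Cameras": 6},
    name="sales",
)

# Sort ascending so the largest bar ends up at the top of a horizontal chart
ordered = sales.sort_values(ascending=True)

fig, ax = plt.subplots(figsize=(8, 5))
ax.barh(ordered.index, ordered.values, color="steelblue")
ax.set_title("Sales by Product Line")
ax.set_xlabel("Sales ($k)")
fig.tight_layout()
```

Horizontal bars also leave room for long category names, which pie labels rarely do. Call plt.show() to display the chart when running as a script.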

Practice Questions: Chart Types

Test your understanding with these hands-on exercises.

Task: Create a histogram of the following ages with 10 bins.

import numpy as np
ages = np.array([22, 25, 27, 29, 31, 33, 35, 38, 40, 42, 45, 48, 50, 55, 60])

Show Solution

import seaborn as sns
import matplotlib.pyplot as plt

sns.histplot(ages, bins=10, color="purple")
plt.title("Age Distribution")
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()

Task: Create a heatmap showing correlations between columns in the DataFrame.

import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({
    "price": np.random.randint(100, 500, 50),
    "quantity": np.random.randint(1, 20, 50),
    "discount": np.random.uniform(0, 0.3, 50)
})
df["revenue"] = df["price"] * df["quantity"] * (1 - df["discount"])

Show Solution

import seaborn as sns
import matplotlib.pyplot as plt

corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.tight_layout()
plt.show()

Task: Create a 2x2 grid of subplots: (1) bar chart, (2) histogram, (3) scatter, (4) box plot using sample data of your choice.

Show Solution

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(0)
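# Bar chart data: one row per category
bar_data = pd.DataFrame({
    "category": ["A", "B", "C", "D"],
    "value": [23, 45, 56, 32]
})

# Histogram, scatter, and box plot data: 100 random observations
# (kept separate because DataFrame columns must all have the same length)
data = pd.DataFrame({
    "x": np.random.rand(100),
    "y": np.random.rand(100),
    "scores": np.random.normal(50, 10, 100)
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

sns.barplot(x="category", y="value", data=bar_data, ax=axes[0, 0], palette="Blues")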
axes[0, 0].set_title("Bar Chart")

sns.histplot(data["scores"], bins=12, ax=axes[0, 1], color="coral")
axes[0, 1].set_title("Histogram")

axes[1, 0].scatter(data["x"], data["y"], alpha=0.6, color="teal")
axes[1, 0].set_title("Scatter Plot")

sns.boxplot(x=data["scores"], ax=axes[1, 1], color="gold")
axes[1, 1].set_title("Box Plot")

plt.tight_layout()
plt.show()

Task: Create a box plot showing exam scores distribution by subject.

import pandas as pd
import numpy as np
np.random.seed(42)
scores = pd.DataFrame({
    "subject": ["Math"]*30 + ["Science"]*30 + ["English"]*30,
    "score": list(np.random.normal(75, 10, 30)) + 
             list(np.random.normal(70, 15, 30)) + 
             list(np.random.normal(80, 8, 30))
})

Show Solution

import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(data=scores, x="subject", y="score", palette="pastel")
plt.title("Exam Scores by Subject")
plt.ylabel("Score")
plt.show()

Task: Create a scatter plot of study hours vs exam scores with a regression line.

import pandas as pd
study = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "score": [45, 55, 58, 65, 72, 78, 85, 92]
})

Show Solution

import seaborn as sns
import matplotlib.pyplot as plt

sns.regplot(data=study, x="hours", y="score", ci=95, color="teal")
plt.title("Study Hours vs Exam Score")
plt.xlabel("Hours Studied")
plt.ylabel("Exam Score")
plt.show()

Styling, Color, and Best Practices

From Good to Professional: Good data visualization is not just about the right chart - it is about clarity, accessibility, and aesthetics. Color choice, typography, and whitespace can make the difference between a chart that informs and one that confuses. This section covers styling fundamentals that separate amateur charts from professional ones.

Why Styling Matters More Than You Think: A poorly styled chart can undermine even the best analysis. Viewers tend to judge the credibility of data within seconds, and chart aesthetics shape that judgment. Professional styling signals competence and makes your insights more persuasive. Think of styling as the "first impression" of your data story.

Principle

The Data-Ink Ratio

Coined by visualization pioneer Edward Tufte, the data-ink ratio is the proportion of ink (or pixels) used to display actual data versus non-data elements. The goal is to maximize this ratio by removing unnecessary elements.

The Simple Rule: If you can remove an element without reducing understanding, remove it. This applies to excessive gridlines, 3D effects, decorative borders, and "chart junk" that adds visual noise without conveying information.

Examples of low data-ink elements to consider removing:

Heavy gridlines: Use light, subtle gridlines or remove them entirely
Chart borders: The data itself defines the chart boundaries
3D effects: These distort perception and add no information
Background patterns: Solid backgrounds are cleaner and more professional
Legends with one item: If there's only one series, label it directly instead
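A minimal sketch of these clean-up steps in Matplotlib, using invented quarterly figures: it removes the top and right borders, keeps only faint horizontal gridlines, and labels the bars directly instead of using a legend.

```python
import matplotlib.pyplot as plt

quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [120, 135, 128, 150]  # hypothetical values, in $k

fig, ax = plt.subplots(figsize=(7, 4))
bars = ax.bar(quarters, revenue, color="steelblue")

# Remove the top and right borders; the bars define the chart area
for side in ("top", "right"):
    ax.spines[side].set_visible(False)

# Keep only faint horizontal gridlines, drawn behind the bars
ax.yaxis.grid(True, color="lightgray", linewidth=0.6)
ax.set_axisbelow(True)

# Label the single series directly instead of adding a one-item legend
for bar, value in zip(bars, revenue):
    ax.text(bar.get_x() + bar.get_width() / 2, value + 2, f"${value}k",
            ha="center", va="bottom")

ax.set_title("Quarterly Revenue")
ax.set_ylabel("Revenue ($k)")
fig.tight_layout()
```

When using Seaborn, sns.despine() removes the top and right borders in one call. Call plt.show() to display the chart when running as a script.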

Color Palettes: The Psychology of Color

Color is not just decoration - it conveys meaning. Choosing the wrong palette can confuse viewers or hide patterns. There are three main types of color palettes, each designed for specific data types:

Sequential

Use for: Continuous data from low to high (e.g., temperature, sales volume)

Examples: Blues, Greens, viridis - colors progress steadily in lightness (light to dark in Blues and Greens, dark to light in viridis)

Diverging

Use for: Data with a meaningful midpoint (e.g., profit/loss, above/below average)

Examples: coolwarm, RdBu - two colors with neutral middle

Categorical

Use for: Distinct categories with no order (e.g., regions, product types)

Examples: Set2, tab10, pastel - distinct, non-sequential colors

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Sample data
regions = pd.DataFrame({
    "region": ["North", "South", "East", "West"],
    "sales": [120, 98, 140, 85]
})

# Using a categorical palette - each region gets a distinct color
sns.barplot(data=regions, x="region", y="sales", palette="Set2")
plt.title("Sales by Region")
plt.ylabel("Sales (k)")
plt.show()

Why Set2 works here: Since regions are categorical (no inherent order), we use a qualitative palette. Set2 provides muted, distinct colors that don't imply any ranking. The colors are also accessible to most colorblind viewers.

Palette Type	When to Use	Seaborn Names	Real Example
Sequential	Continuous data (low to high)	`Blues`, `Greens`, `viridis`	Heatmap of website traffic by hour
Diverging	Data with a midpoint	`coolwarm`, `RdBu`	Profit/loss report, deviation from target
Categorical	Distinct categories	`Set2`, `tab10`, `pastel`	Sales by region, product comparison
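To inspect what each palette type actually contains, the sketch below asks Seaborn for a few colors from one palette of each kind and prints them as hex codes. The palette names come from the table above; the number of colors requested is arbitrary.

```python
import seaborn as sns

sequential = sns.color_palette("Blues", n_colors=5)     # light -> dark
diverging = sns.color_palette("coolwarm", n_colors=5)   # blue -> neutral -> red
categorical = sns.color_palette("Set2", n_colors=4)     # unordered, distinct hues

for name, palette in [("sequential", sequential),
                      ("diverging", diverging),
                      ("categorical", categorical)]:
    print(f"{name:>11}: {palette.as_hex()}")
```

In a Jupyter notebook, sns.palplot(sequential) draws the same colors as swatches, which makes the ordering (or lack of it) easier to see than hex codes.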

Color Accessibility - Designing for Everyone: About 8% of men and 0.5% of women have some form of color blindness. Red-green colorblindness is most common, making red-green gradients problematic.

Use colorblind-safe palettes: viridis, cividis, plasma are designed for accessibility
Don't rely on color alone: Add patterns, labels, or shapes to distinguish categories
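Test your charts: Use a color-blindness simulator such as Color Oracle to preview how colorblind viewers see your charts, and use the "colorblind safe" filter on colorbrewer2.org when choosing palettes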
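One way to avoid relying on color alone is to combine textures with direct labels. This is a rough sketch using the regional sales figures from the earlier palette example; the hatch patterns are an arbitrary choice.

```python
import matplotlib.pyplot as plt

regions = ["North", "South", "East", "West"]
sales = [120, 98, 140, 85]
hatches = ["//", "..", "xx", "--"]  # one texture per category

fig, ax = plt.subplots(figsize=(7, 4))
bars = ax.bar(regions, sales, color="lightgray", edgecolor="black")

for bar, hatch, value in zip(bars, hatches, sales):
    # Texture distinguishes the bars even in grayscale
    bar.set_hatch(hatch)
    # A direct label removes any need to match colors to a legend
    ax.text(bar.get_x() + bar.get_width() / 2, value + 2, str(value),
            ha="center", va="bottom")

ax.set_title("Sales by Region")
ax.set_ylabel("Sales (k)")
fig.tight_layout()
```

The chart stays readable when printed in black and white. For line charts, varying marker shapes and line styles serves the same purpose. Call plt.show() to display the chart when running as a script.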

Seaborn Themes: Instant Professional Styling

Seaborn provides built-in themes that instantly improve chart aesthetics. The set_theme() function changes background, grid, and font defaults with a single line of code. This is far easier than manually configuring matplotlib's dozens of parameters - one line transforms amateur-looking charts into publication-quality visuals.

Why themes matter: Consistent styling across all your charts creates a professional, cohesive look. Themes also ensure readability - proper contrast between data and background, appropriately sized text, and well-chosen gridlines help viewers focus on insights rather than decoding the visual. Setting a theme once at the start of your notebook applies it to all subsequent charts automatically.

Styles

Seaborn Theme Options

darkgrid: Light gray background with white gridlines - good for exploratory work in notebooks, where the grid helps you read values quickly
whitegrid: White background with gray gridlines - clean, professional look for business reports and presentations
white: White background, no gridlines - minimal design that focuses attention entirely on the data
dark: Light gray background, no gridlines - a softer alternative to white when gridlines would distract from the data
ticks: White background with just axis ticks, no grid - publication-ready style preferred by academic journals

Quick recommendation: Use whitegrid for most business contexts. Use ticks for publications. Use darkgrid for data exploration where you're the primary audience.

import seaborn as sns
import matplotlib.pyplot as plt

# Set theme ONCE at the start - applies to all subsequent plots
# Available styles: darkgrid, whitegrid, dark, white, ticks
sns.set_theme(style="whitegrid", palette="muted", font_scale=1.1)

# All plots now automatically use this theme
tips = sns.load_dataset("tips")
sns.boxplot(data=tips, x="day", y="total_bill")
plt.title("Total Bill by Day")
plt.show()

Theme parameter breakdown: The style="whitegrid" setting creates a white background with gray gridlines for a clean, professional appearance - gridlines help readers trace values to axes without cluttering the visual. The palette="muted" option uses softer, less saturated colors that are easier on the eyes for extended viewing and work well in printed documents. Finally, font_scale=1.1 increases all text by 10%, making labels more readable in presentations projected on screens.

Additional theme customizations: Beyond the main style, you can fine-tune specific elements. Use sns.set_context("talk") for presentation-sized elements (larger fonts, thicker lines) or sns.set_context("paper") for publication-sized elements. Override the color cycle with a list of your own hex codes, for example sns.set_palette(["#264653", "#2a9d8f"]). To undo theme changes, call sns.reset_defaults() to restore Matplotlib's default settings, or sns.reset_orig() to restore the settings that were active before Seaborn changed them (including any custom rc file).

Context Presets: Seaborn has four context presets that scale all elements: paper (smallest), notebook (default), talk (larger for presentations), and poster (largest). Use sns.set_context("talk") when preparing charts for slides - everything will be proportionally larger and more visible from a distance.
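To see how much each preset scales the chart elements, the sketch below queries sns.plotting_context() for each context and prints two of the parameters it would apply. Only two keys are shown; the returned dictionary contains more.

```python
import seaborn as sns

# plotting_context() returns the rc parameters a context would apply,
# without changing any global settings
for context in ["paper", "notebook", "talk", "poster"]:
    params = sns.plotting_context(context)
    print(f"{context:>8}: font.size={params['font.size']}, "
          f"lines.linewidth={params['lines.linewidth']}")
```

The same function works as a context manager (with sns.plotting_context("talk"): ...) when only one chart in a notebook needs presentation sizing.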

Annotations and Labels: Telling the Story

A chart without labels is like a book without a title - the viewer has to guess what they're looking at. Good annotations transform a chart from "raw data" into "actionable insight." The difference between an amateur and professional visualization often comes down to thoughtful labeling that guides the viewer to the key takeaway.

Why annotations matter: In a business context, executives don't have time to decode charts. They need the insight immediately visible. An annotation like "Sales peaked in April after the marketing campaign" does the interpretation work for them. Data preparation often takes up the bulk of an analytics project (a figure of 80% is frequently quoted) - don't waste that effort with a chart that fails to communicate the finding.

Essentials

Required Chart Elements

A descriptive title: Tells the viewer what the chart shows AND the key insight. Not "Sales Chart" but "Quarterly Sales Grew 23% After Price Reduction"
Axis labels with units: Always include measurement units - "Revenue ($M)" not just "Revenue", "Time (seconds)" not just "Time"
Annotations for key points: Highlight outliers, milestones, inflection points, or important values that support your narrative
A legend (if needed): Only include if there are multiple series to distinguish - don't clutter single-series charts
Data source: For formal reports, include where the data came from (footer or caption)

The headline test: Your title should work like a news headline - if someone only read the title, they'd know the main finding. "Revenue" is a label. "Revenue Doubled in Q4" is a headline.

Types of annotations: Use text annotations for insights ("Campaign launched here"), reference lines for thresholds (target revenue, average performance), shaded regions for time periods (recession, holiday season), and arrows to draw attention to specific data points. Each annotation type serves a different communication purpose - combine them thoughtfully to build a complete narrative.

Use plt.annotate() to add arrows and text pointing to specific data points:

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May"]
revenue = [45, 52, 48, 61, 55]

plt.figure(figsize=(8, 5))
plt.plot(months, revenue, marker="o", color="#6366f1", linewidth=2)

# Annotate peak with arrow and label
plt.annotate("Peak", xy=("Apr", 61), xytext=("Mar", 65),
             arrowprops=dict(arrowstyle="->", color="gray"),
             fontsize=10, color="green")

plt.title("Monthly Revenue Trend", fontsize=14, fontweight="bold")
plt.xlabel("Month")
plt.ylabel("Revenue ($k)")
plt.tight_layout()
plt.show()

Understanding plt.annotate(): The xy=("Apr", 61) parameter specifies the point being annotated (where the arrow points TO), while xytext=("Mar", 65) determines where the text label is placed (where the arrow starts FROM). The arrowprops dictionary customizes the arrow's style, color, and width - try arrowstyle="fancy" or arrowstyle="wedge" for different looks. Notice how fontweight="bold" in the title makes it stand out with bolder text.

Additional annotation techniques: Use plt.axhline(y=50, color="red", linestyle="--") to add a horizontal reference line (great for showing targets or thresholds). Use plt.axvspan(xmin=2, xmax=4, alpha=0.2, color="yellow") to highlight a region. For adding text without arrows, use plt.text(x, y, "message"). Combine these techniques to create rich, informative visualizations that tell a complete story.
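The reference line, shaded region, and plain text techniques can be combined on one chart. The sketch below reuses the monthly revenue data from the annotate() example; the target value and the "Campaign period" label are invented, and months are plotted at numeric positions so the shaded span has explicit x-coordinates.

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May"]
revenue = [45, 52, 48, 61, 55]
positions = list(range(len(months)))  # numeric x-positions: Jan=0 ... May=4
target = 50                           # hypothetical revenue target, in $k

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(positions, revenue, marker="o", color="#6366f1", linewidth=2)
ax.set_xticks(positions)
ax.set_xticklabels(months)

# Horizontal reference line for the target, with a plain text label
ax.axhline(y=target, color="red", linestyle="--")
ax.text(0, target + 0.5, "Target", color="red")

# Shaded region covering March through May
ax.axvspan(xmin=2, xmax=4, alpha=0.2, color="yellow")
ax.text(2.05, 45.5, "Campaign period")

ax.set_title("Monthly Revenue vs Target")
ax.set_xlabel("Month")
ax.set_ylabel("Revenue ($k)")
fig.tight_layout()
```

Using the Axes methods (ax.axhline, ax.axvspan, ax.text) instead of the plt versions behaves the same on a single chart and carries over directly to subplot grids. Call plt.show() to display the chart when running as a script.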

The 5-Second Test for Chart Titles: Ask yourself: If someone looked at your chart for just 5 seconds, would the title tell them the key insight? A good title like "Sales Dropped 23% After Price Increase" is far more useful than "Sales Data Q3". The title should be the headline of your data story - make every word count.

Don't Over-Annotate: Annotations should guide, not overwhelm. If every point has a label and every region is highlighted, nothing stands out. Follow the rule of three: highlight at most 2-3 key insights per chart. More than that, and you need multiple charts or a different approach entirely.

Saving High-Quality Figures

Once you've created the perfect chart, you need to save it in the right format. Different formats serve different purposes:

Format	Best For	Pros	Cons
PNG	Web, presentations, reports	Widely compatible, transparent backgrounds	Gets blurry if scaled up
SVG	Web, scalable graphics	Scales infinitely, small file size	Not all tools support it
PDF	Print publications, papers	Vector format, preserves quality	Harder to edit
JPEG	Photos, complex images	Small file size	Compression artifacts, no transparency

import matplotlib.pyplot as plt

# After creating your plot...
# For screens and presentations (150 dpi is fine)
plt.savefig("revenue_chart.png", dpi=150, bbox_inches="tight")

# For print publications (300 dpi minimum)
plt.savefig("revenue_chart_print.png", dpi=300, bbox_inches="tight")

# For infinite scalability (vector format)
plt.savefig("revenue_chart.svg", format="svg", bbox_inches="tight")

Key savefig parameters: The dpi setting controls dots per inch - higher values produce sharper images but larger files (use 150 for screens, 300+ for print). Include bbox_inches="tight" to trim excess whitespace around the chart. For vector graphics that stay sharp at any zoom level, save with an .svg or .pdf extension (or pass format="svg"). Important: call savefig() BEFORE show() - in a script, the figure is discarded once the window opened by show() is closed, so a later savefig() writes a blank image.
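Here is a self-contained sketch of the full save workflow. The chart data is invented, and the files go to a temporary folder (an assumption made so the example leaves nothing behind); in real work you would pass your own output paths.

```python
import os
import tempfile
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(["A", "B", "C"], [25, 40, 30], color="steelblue")
ax.set_title("Category Values")

out_dir = tempfile.mkdtemp()  # temporary folder so the example leaves no clutter
paths = {
    "screen": os.path.join(out_dir, "chart_screen.png"),
    "print": os.path.join(out_dir, "chart_print.png"),
    "vector": os.path.join(out_dir, "chart.svg"),
}

# Save through the Figure object, before any call to plt.show()
fig.savefig(paths["screen"], dpi=150, bbox_inches="tight")
fig.savefig(paths["print"], dpi=300, bbox_inches="tight")
fig.savefig(paths["vector"], bbox_inches="tight")  # format inferred from the extension

for label, path in paths.items():
    print(f"{label:>6}: {os.path.getsize(path):,} bytes")
```

Calling fig.savefig() on the Figure object, rather than plt.savefig(), makes it explicit which figure is written when several are open at once.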

Practice Questions: Styling & Customization

Test your understanding with these hands-on exercises.

Task: Set the Seaborn theme to whitegrid and create a scatter plot of any two columns from the tips dataset.

Show Solution

import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="whitegrid")
tips = sns.load_dataset("tips")
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time")
plt.title("Tips vs Total Bill")
plt.show()

Task: Create a line chart and add an annotation pointing to the maximum value.

import pandas as pd
data = pd.DataFrame({"month": range(1, 13), "sales": [20, 22, 25, 28, 35, 42, 38, 40, 45, 50, 48, 55]})

Show Solution

import matplotlib.pyplot as plt

max_idx = data["sales"].idxmax()
max_month = data.loc[max_idx, "month"]
max_sales = data.loc[max_idx, "sales"]

plt.plot(data["month"], data["sales"], marker="o")
plt.annotate(f"Max: {max_sales}", xy=(max_month, max_sales), xytext=(max_month - 2, max_sales + 5),
             arrowprops=dict(arrowstyle="->"), fontsize=10)
plt.title("Monthly Sales")
plt.xlabel("Month")
plt.ylabel("Sales")
plt.show()

Task: Create a bar chart using a custom hex color palette of your choice for five categories.

Show Solution

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"category": ["A", "B", "C", "D", "E"], "value": [10, 25, 15, 30, 20]})
custom_colors = ["#264653", "#2a9d8f", "#e9c46a", "#f4a261", "#e76f51"]

sns.barplot(data=df, x="category", y="value", palette=custom_colors)
plt.title("Custom Palette Bar Chart")
plt.show()

Task: Create a simple bar chart and save it as a high-resolution PNG file (300 dpi).

import matplotlib.pyplot as plt
categories = ["A", "B", "C"]
values = [25, 40, 30]

Show Solution

import matplotlib.pyplot as plt

plt.bar(categories, values, color="steelblue")
plt.title("Category Values")
plt.ylabel("Value")
plt.savefig("category_chart.png", dpi=300, bbox_inches="tight")
plt.show()

Task: Create a correlation heatmap using a diverging color palette (coolwarm) with values centered at 0.

import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(np.random.randn(50, 4), columns=["A", "B", "C", "D"])
df["B"] = df["A"] * 0.8 + np.random.randn(50) * 0.3  # Add correlation

Show Solution

import seaborn as sns
import matplotlib.pyplot as plt

corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", center=0, 
            vmin=-1, vmax=1, fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix with Diverging Palette")
plt.tight_layout()
plt.show()

Interactive Visualizations with Plotly

Beyond Static Images: Static charts are great for reports, but interactive charts shine in dashboards and presentations. Plotly lets users hover for details, zoom into regions, and toggle data series on and off. This section introduces Plotly Express, a high-level API that creates interactive charts with minimal code.

Why Interactive Charts Are Game-Changers: Static charts show one fixed view of data. Interactive charts let viewers explore and discover insights on their own. Consider a sales dashboard:

Static: You show one chart of monthly sales. Viewers can only see what you chose to show.
Interactive: Viewers can hover to see exact values, zoom into specific months, click to highlight regions, and filter by product. They discover their own insights.

Library

Plotly Express: Interactive Charts in One Line

Plotly Express (plotly.express) is a high-level wrapper around Plotly that creates interactive figures in a single function call. It works seamlessly with pandas DataFrames and outputs HTML that can be embedded in web pages or Jupyter notebooks.

Key features built-in: Hover tooltips, zoom/pan controls, legend toggling, data point selection, and export to PNG - all without any extra code!

Getting Started with Plotly

First, install Plotly using pip. The API is similar to Seaborn - you pass a DataFrame and specify column names. But unlike Seaborn, the result is an interactive HTML figure that viewers can explore.

What happens when you run fig.show():

In Jupyter Notebook: The interactive chart appears directly in the notebook
In a Python script: A web browser opens automatically to display the chart
The chart includes a toolbar with zoom, pan, reset, and download buttons

# Install first: pip install plotly
import plotly.express as px
import pandas as pd

sales = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "revenue": [12000, 15000, 13500, 17000, 16000, 19000]
})

# Create interactive bar chart - try hovering over the bars!
fig = px.bar(sales, x="month", y="revenue", title="Monthly Revenue",
             color="revenue", color_continuous_scale="Blues")
fig.show()  # Opens interactive chart in browser or notebook

What makes this interactive: When you hover over any bar, a tooltip displays the exact month and revenue value. You can click and drag to zoom into a specific region, then double-click to reset the view. The color="revenue" parameter creates a gradient effect where bars are colored by their value using the Blues scale. A toolbar in the top-right corner provides icons to download the chart as PNG, zoom, pan, and reset the view.

Plotly vs Seaborn - Quick Comparison:

Feature	Seaborn	Plotly Express
Output	Static image	Interactive HTML
Hover info	No	Yes (automatic)
Zoom/Pan	No	Yes (built-in)
Web embedding	Export as image	Native HTML
Learning curve	Easy	Easy (similar API)

Interactive Scatter Plots: Exploring Relationships

Scatter plots benefit enormously from interactivity. With thousands of data points, static charts become cluttered and unreadable. Interactive scatter plots let you:

Hover to see details about individual points (country name, exact values)
Zoom into crowded regions to see individual points
Use color and size to encode additional dimensions (up to 5D visualization!)
Click legend items to show/hide specific categories

import plotly.express as px

# Built-in dataset with country statistics
df = px.data.gapminder().query("year == 2007")

# Create scatter with 4 dimensions: x, y, color, and size
fig = px.scatter(df, x="gdpPercap", y="lifeExp",
                 size="pop", color="continent",
                 hover_name="country", log_x=True,
                 title="GDP vs Life Expectancy (2007)")
fig.show()

Understanding the Parameters:

size="pop" - Bubble size represents population (larger countries = bigger bubbles)
color="continent" - Each continent gets a distinct color
hover_name="country" - Country name appears prominently when hovering
log_x=True - Uses logarithmic scale for x-axis (better for GDP data with huge range)

Interactive Line Charts: Time Series Exploration

Time-series data becomes truly explorable with Plotly. Users can zoom into specific date ranges, hover to see exact values on any date, and toggle multiple series on/off by clicking the legend.

Real-world use cases:

Stock price charts where users zoom into specific months
Sales dashboards where managers compare this year vs last year
Sensor data where engineers identify exact timestamps of anomalies

import plotly.express as px
import pandas as pd
import numpy as np

# Generate 90 days of sample time series data
np.random.seed(42)  # Fixed seed so the chart looks the same on every run
dates = pd.date_range("2024-01-01", periods=90, freq="D")
df = pd.DataFrame({
    "date": dates,
    "sales": np.cumsum(np.random.randn(90) * 100 + 50)
})

fig = px.line(df, x="date", y="sales", title="Daily Sales Trend")
fig.update_traces(mode="lines+markers")  # Add dots at each data point
fig.show()

Time series interactivity features: You can drag horizontally to select and zoom into any date range, making it easy to focus on specific periods. Hovering over the line shows the exact date and value at that point. The update_traces(mode="lines+markers") call adds circular markers at each data point, making it easier to hover on specific days. Pro tip: add fig.update_layout(hovermode="x unified") to show all series values at once when hovering over a multi-line chart.

Saving and Sharing Interactive Charts

Plotly figures can be saved in two ways, depending on how you want to share them:

HTML (Interactive)

Best for: Sharing via email, embedding in websites, or interactive reports

Keeps: All interactivity (hover, zoom, pan, export)

PNG/PDF (Static)

Best for: Printed reports, presentations, or documents

Loses: Interactivity, but high quality image preserved

# Save as interactive HTML - viewers can explore!
fig.write_html("sales_chart.html")

# Save as static image (requires kaleido: pip install kaleido)
fig.write_image("sales_chart.png", scale=2)  # scale=2 for higher resolution
fig.write_image("sales_chart.pdf")  # Vector format for print

Pro Tips for Plotly:

Dark theme: Use fig.update_layout(template="plotly_dark") for modern dark-themed dashboards
Remove logo: Add config={"displaylogo": False} to fig.show() to remove the Plotly logo
Custom hover: Use hover_data parameter to control exactly what appears in tooltips

Practice Questions: Interactive Visualizations

Test your understanding with these hands-on exercises.

Task: Use Plotly Express to create an interactive bar chart of product sales.

import pandas as pd
products = pd.DataFrame({"product": ["A", "B", "C"], "units": [150, 200, 175]})

Solution:

import plotly.express as px

fig = px.bar(products, x="product", y="units", title="Product Sales",
             color="units", color_continuous_scale="Greens")
fig.show()

Task: Create a scatter plot where hovering shows the city name.

import pandas as pd
cities = pd.DataFrame({
    "city": ["NYC", "LA", "Chicago", "Houston"],
    "population": [8.3, 4.0, 2.7, 2.3],
    "area": [302, 469, 227, 670]
})

Solution:

import plotly.express as px

fig = px.scatter(cities, x="area", y="population", hover_name="city",
                 size="population", title="City Population vs Area")
fig.show()

Task: Create an interactive pie chart showing market share by company.

import pandas as pd
market = pd.DataFrame({
    "company": ["Apple", "Samsung", "Xiaomi", "Others"],
    "share": [27, 21, 14, 38]
})

Solution:

import plotly.express as px

fig = px.pie(market, values="share", names="company", 
             title="Smartphone Market Share",
             color_discrete_sequence=px.colors.qualitative.Set2)
fig.show()

Task: Create an interactive line chart comparing sales of two products over months.

import pandas as pd
sales = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "product_a": [100, 120, 115, 140, 155, 170],
    "product_b": [80, 95, 110, 105, 125, 140]
})

Solution:

import plotly.express as px

# Melt to long format for Plotly
sales_long = sales.melt(id_vars="month", var_name="product", value_name="sales")

fig = px.line(sales_long, x="month", y="sales", color="product",
              markers=True, title="Product Sales Comparison")
fig.show()

Task: Create an interactive scatter plot with the gapminder dataset and save it as an HTML file that can be shared.

Solution:

import plotly.express as px

df = px.data.gapminder().query("year == 2007")
fig = px.scatter(df, x="gdpPercap", y="lifeExp", size="pop",
                 color="continent", hover_name="country",
                 log_x=True, title="World Development Indicators 2007")

# Save as interactive HTML
fig.write_html("gapminder_2007.html")
print("Chart saved as gapminder_2007.html")

Dashboard Design Principles

Design for Decisions: A dashboard is more than a collection of charts - it is a decision-support tool. Effective dashboards answer specific questions, guide the viewer's eye, and update in real time. This section covers layout, hierarchy, and storytelling principles that make dashboards actionable.

What Makes a Dashboard Different from a Report? Reports present data for detailed analysis. Dashboards are designed for quick decision-making at a glance.

Report: 20-page document with tables, charts, and explanations. Read in 30+ minutes.
Dashboard: Single screen with KPIs, trends, and status indicators. Understood in 5-10 seconds.

Principle

The 5-Second Rule

A well-designed dashboard should communicate its main insight within 5 seconds. If viewers need to hunt for the key metric, the layout has failed.

How to apply it: Place the most important KPI (Key Performance Indicator) at the top-left, where eyes naturally start. Use the largest font size for the primary number. Surround it with context (comparison to last period, target, or trend arrow).

Layout and Visual Hierarchy

Great dashboards guide the eye using visual hierarchy - the arrangement of elements by importance. Just like a newspaper puts headlines at the top, dashboards should prioritize information visually.

The hierarchy formula:

Top-left: Primary KPI (the number that matters most)
Top-right: Secondary KPIs or status indicators
Middle: Trend charts (how are things changing?)
Bottom: Detail tables or less critical charts

Focus

One dashboard, one purpose.

Each dashboard should answer 1-2 specific questions - not everything at once. "How are sales this month?" is a good focus. "Tell me everything about the business" is not.

Filters

Let users drill down.

Add filters for date range, region, or category so users can explore without leaving the page. Keep filters at the top or in a sidebar.

Real-Time

Stale data = bad decisions.

For operational dashboards, ensure data refreshes automatically. Always show "Last updated: [timestamp]" so users know data freshness.

The Rule of Seven: Working memory is often said to hold only about 7±2 items at once. As a rule of thumb, dashboards with more than 7-9 visual elements risk cognitive overload. If you need more, split the content into multiple focused dashboards with navigation between them.

KPI Cards: The Heart of Every Dashboard

Key Performance Indicators deserve prominent display. A good KPI card contains three elements:

The number: Current value, large and bold
The comparison: Change from previous period (delta) or progress toward target
The signal: Color-coded status (green = good, yellow = warning, red = critical)
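The signal is usually derived from the number and its target. One way to sketch that logic in plain Python is shown below; the function name kpi_status and the 90% warning threshold are assumptions for illustration, not a standard.

```python
def kpi_status(value, target, warning_ratio=0.9):
    """Return a traffic-light color for a higher-is-better metric.

    warning_ratio is an assumed threshold: at or above target is green,
    at least 90% of target is yellow, anything lower is red.
    """
    if value >= target:
        return "green"
    if value >= target * warning_ratio:
        return "yellow"
    return "red"


target = 85000
for revenue in (87500, 80000, 70000):
    print(revenue, kpi_status(revenue, target))
```

For metrics where lower is better (such as churn or costs), the comparisons would need to be reversed.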

Plotly's Indicator trace creates professional KPI cards with built-in delta calculations:

import plotly.graph_objects as go

# KPI indicator card - shows value with comparison to previous period
fig = go.Figure(go.Indicator(
    mode="number+delta",  # Show number and change
    value=87500,          # Current value
    number={"prefix": "$", "valueformat": ",.0f", "font": {"size": 48}},  # Currency with thousands separator
    delta={"reference": 80000, "relative": True, "valueformat": ".1%"},  # +9.4% vs last month
    title={"text": "Monthly Revenue"},
    domain={"x": [0, 1], "y": [0, 1]}
))
fig.update_layout(height=200)
fig.show()

Anatomy of this KPI card: The current value, $87,500, is displayed large so it draws the viewer's attention first. Below it, a green upward arrow with "9.4%" indicates improvement versus last month, providing crucial context. The "Monthly Revenue" title tells viewers which metric they're looking at. Plotly calculates the delta from the reference value automatically: (87500 - 80000) / 80000 = 0.09375, shown as 9.4% - no manual math required.

Understanding the Indicator Parameters:

mode="number+delta" - Shows both the value and the change indicator
number={"prefix": "$"} - Adds currency symbol before the number
delta={"reference": 80000} - Compares against this baseline (last month's value)
delta={"relative": True} - Shows the delta as a percentage change (9.4%) instead of an absolute change (7,500); "valueformat": ".1%" formats it with one decimal place
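To check the numbers by hand, here is the same arithmetic in plain Python, using the values from the KPI card. The format strings mirror what the card displays; the variable names are only for illustration.

```python
current = 87500
reference = 80000

absolute_delta = current - reference          # shown when "relative" is False (the default)
relative_delta = absolute_delta / reference   # shown when "relative" is True

print(f"Absolute: {absolute_delta:+,}")       # Absolute: +7,500
print(f"Relative: {relative_delta:+.1%}")     # Relative: +9.4%
```

Relative deltas are easier to compare across KPIs of different sizes, while absolute deltas are more concrete for a single metric.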

Combining Charts in a Dashboard

Plotly's make_subplots lets you combine multiple charts into a single figure. This is the foundation for building multi-panel dashboards entirely in Python.

When to use make_subplots:

Creating compact views with related metrics side by side
Building exportable dashboard images for reports or presentations
Prototyping dashboard layouts before building in a tool like Power BI or Tableau

from plotly.subplots import make_subplots
import plotly.graph_objects as go

# Sample data
months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [12, 15, 14, 18]
expenses = [8, 9, 10, 11]

# Create 1 row, 2 columns layout
fig = make_subplots(rows=1, cols=2, subplot_titles=("Revenue", "Expenses"))

# Add revenue bar chart to first column
fig.add_trace(go.Bar(x=months, y=revenue, name="Revenue", marker_color="#22c55e"), row=1, col=1)

# Add expenses bar chart to second column
fig.add_trace(go.Bar(x=months, y=expenses, name="Expenses", marker_color="#ef4444"), row=1, col=2)

fig.update_layout(title_text="Financial Dashboard", showlegend=False)
fig.show()

How make_subplots works: The rows=1, cols=2 parameters create a grid with 1 row and 2 columns for side-by-side charts. The subplot_titles parameter adds a descriptive title above each chart automatically. When adding traces, row=1, col=1 and row=1, col=2 specify which grid cell to place each chart. Finally, showlegend=False hides the legend since the subplot titles already explain what each chart represents.

Storytelling with Data: The Narrative Flow

Great dashboards don't just display data - they tell a story. Think of your dashboard as answering questions in sequence:

Story Element	Dashboard Component	Example
Headline (What's happening?)	Primary KPI card	"Revenue: $87,500 (+9.4%)"
Context (Is this good or bad?)	Comparison to target or benchmark	"Target: $85,000 ✓"
Trend (Where are we headed?)	Line or area chart	12-month revenue trend
Breakdown (Why is this happening?)	Bar or pie chart	"Top region: West (+15%)"
Details (What should I investigate?)	Table or drill-down link	List of top 10 accounts

Common Dashboard Mistakes to Avoid:

Too many charts: More than 7-9 elements overwhelms viewers
No clear hierarchy: Everything the same size = nothing stands out
Missing context: A number without comparison is meaningless ("Is $87K good?")
Inconsistent colors: Red should always mean "bad" - don't use it randomly
No timestamps: Users don't know if data is 1 hour or 1 week old

Practice Questions: Dashboards

Test your understanding with these hands-on exercises.

Task: Create a Plotly Indicator showing "Active Users: 12,450" with a delta of +5% from last month.

Solution:

import plotly.graph_objects as go

fig = go.Figure(go.Indicator(
    mode="number+delta",
    value=12450,
    number={"valueformat": ",.0f"},  # Display as 12,450
    delta={"reference": 11857, "relative": True, "valueformat": ".1%"},
    title={"text": "Active Users"}
))
fig.show()

Task: Use make_subplots to create a dashboard with a line chart (sales trend) and a pie chart (sales by category).

Solution:

from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(rows=1, cols=2, specs=[[{"type": "scatter"}, {"type": "pie"}]],
                    subplot_titles=("Sales Trend", "Sales by Category"))

fig.add_trace(go.Scatter(x=["Q1", "Q2", "Q3", "Q4"], y=[100, 120, 115, 140], mode="lines+markers"), row=1, col=1)
fig.add_trace(go.Pie(labels=["Electronics", "Clothing", "Food"], values=[45, 30, 25]), row=1, col=2)

fig.update_layout(title_text="Sales Dashboard", showlegend=False)
fig.show()

Task: Create a gauge indicator showing "Goal Progress" at 73% completion.

Solution:

import plotly.graph_objects as go

fig = go.Figure(go.Indicator(
    mode="gauge+number",
    value=73,
    title={"text": "Goal Progress"},
    gauge={"axis": {"range": [0, 100]},
           "bar": {"color": "#198754"},
           "steps": [
               {"range": [0, 50], "color": "#f8d7da"},
               {"range": [50, 75], "color": "#fff3cd"},
               {"range": [75, 100], "color": "#d1e7dd"}
           ]}
))
fig.show()

Task: Create a 2-row dashboard: top row has 2 KPI indicators, bottom row has a bar chart spanning full width.

Solution:

from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(
    rows=2, cols=2,
    specs=[[{"type": "indicator"}, {"type": "indicator"}],
           [{"type": "bar", "colspan": 2}, None]],
    row_heights=[0.3, 0.7]
)

# KPIs
fig.add_trace(go.Indicator(mode="number+delta", value=15420, 
              delta={"reference": 14000}, title={"text": "Revenue"}), row=1, col=1)
fig.add_trace(go.Indicator(mode="number+delta", value=892,
              delta={"reference": 850}, title={"text": "Orders"}), row=1, col=2)

# Bar chart
fig.add_trace(go.Bar(x=["Jan", "Feb", "Mar", "Apr"], y=[3200, 3800, 4100, 4320]), row=2, col=1)

fig.update_layout(title_text="Monthly Performance Dashboard", height=500)
fig.show()

Task: Build an executive dashboard with: 3 KPI cards (Revenue, Customers, Satisfaction), a trend line chart, and a category pie chart. Apply a clean layout with proper titles.

Solution:

from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(
    rows=2, cols=3,
    specs=[[{"type": "indicator"}, {"type": "indicator"}, {"type": "indicator"}],
           [{"type": "scatter", "colspan": 2}, None, {"type": "pie"}]],
    row_heights=[0.25, 0.75],
    # Titles are assigned only to non-empty cells, so the None slot gets no entry
    subplot_titles=("", "", "", "Monthly Trend", "Sales by Region")
)

# Row 1: KPIs
fig.add_trace(go.Indicator(mode="number+delta", value=284500,
    delta={"reference": 260000, "valueformat": ",.0f"}, 
    title={"text": "Revenue ($)"}), row=1, col=1)
fig.add_trace(go.Indicator(mode="number+delta", value=1247,
    delta={"reference": 1180}, title={"text": "Customers"}), row=1, col=2)
fig.add_trace(go.Indicator(mode="number", value=4.6,
    number={"suffix": "/5"}, title={"text": "Satisfaction"}), row=1, col=3)

# Row 2: Charts
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
revenue = [42000, 48000, 45000, 51000, 49000, 55000]
fig.add_trace(go.Scatter(x=months, y=revenue, mode="lines+markers", 
    line={"color": "#0d6efd", "width": 3}), row=2, col=1)
fig.add_trace(go.Pie(labels=["North", "South", "East", "West"],
    values=[35, 25, 22, 18]), row=2, col=3)

fig.update_layout(title_text="Q2 Executive Dashboard", height=600, showlegend=False)
fig.show()

Key Takeaways

Choose Charts by Purpose

Match chart type to the data relationship: bar for comparison, line for trends, scatter for correlation, pie for composition

Color Communicates Meaning

Use sequential palettes for continuous data, diverging for midpoints, and categorical for distinct groups - always consider accessibility

Maximize Data-Ink Ratio

Remove unnecessary gridlines, borders, and decorations - let the data speak without visual clutter

Interactivity Enables Exploration

Plotly adds hover, zoom, and filtering so users can explore data without needing new charts

Dashboards Drive Decisions

Place KPIs prominently, limit visual elements to 7-9, and tell a story from headline to detail

Export for Your Audience

Use high-resolution PNG or a vector format (SVG/PDF) for reports, HTML for interactive sharing, and match the theme to the presentation context
