Module 7.1

EDA Process & Data Profiling

Exploratory Data Analysis (EDA) is the critical first step in any data science project. Learn systematic approaches to understand your data, identify patterns, spot anomalies, and prepare for modeling.

40 min read
Intermediate
What You'll Learn
  • The EDA workflow and objectives
  • Data quality assessment techniques
  • Automated profiling tools
  • Summary statistics interpretation
  • Documenting EDA findings
01

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics, often using visual methods. Pioneered by statistician John Tukey in the 1970s, EDA emphasizes looking at data before making assumptions. The goal is to discover patterns, spot anomalies, test hypotheses, and check assumptions through summary statistics and graphical representations.

Exploratory Data Analysis (EDA)

A systematic approach to investigating datasets using statistical summaries and visualizations to understand data structure, detect patterns, identify outliers, and form hypotheses before formal modeling

Why EDA Matters

Skipping EDA is one of the most common mistakes in data science projects. Without understanding your data, you risk building models on flawed assumptions, missing important patterns, or wasting time on irrelevant features. EDA helps you make informed decisions about data cleaning, feature engineering, and model selection.

With EDA
  • Catch data quality issues early
  • Understand variable distributions
  • Identify relationships between features
  • Make informed modeling decisions
Without EDA
  • Build models on corrupted data
  • Miss important patterns
  • Waste time on irrelevant features
  • Get misleading results

Practice: EDA Fundamentals

Scenario: Your team receives new datasets daily and needs a consistent way to assess data quality. Create a function that quickly identifies issues.

# Test with this problematic data
import pandas as pd
df = pd.DataFrame({
    'id': [1, 2, 2, 4, 5],
    'name': ['Alice', 'Bob', 'Bob', 'Diana', None],
    'age': [25, None, None, 45, 55],
    'salary': [50000, 60000, 60000, None, 80000]
})

Task: Write data_quality_report(df) that returns: total rows, duplicate count, missing values per column, and data types.

Show Solution
def data_quality_report(df):
    """Generate comprehensive data quality report."""
    report = {
        'total_rows': len(df),
        'total_columns': len(df.columns),
        'duplicate_rows': df.duplicated().sum(),
        'memory_mb': df.memory_usage(deep=True).sum() / 1024**2
    }
    
    # Missing values summary
    missing = df.isna().sum()
    missing_pct = (missing / len(df) * 100).round(1)
    report['missing_summary'] = pd.DataFrame({
        'missing_count': missing,
        'missing_pct': missing_pct,
        'dtype': df.dtypes
    })
    
    print(f"\n{'='*50}")
    print(f"DATA QUALITY REPORT")
    print(f"{'='*50}")
    print(f"Rows: {report['total_rows']}")
    print(f"Columns: {report['total_columns']}")
    print(f"Duplicates: {report['duplicate_rows']}")
    print(f"Memory: {report['memory_mb']:.3f} MB")
    print(f"\nMissing Values:")
    print(report['missing_summary'][report['missing_summary']['missing_count'] > 0])
    
    return report

# Test
report = data_quality_report(df)

Task: Write a function check_data_quality(df) that returns a dictionary with missing value counts and duplicate row count.

# Test data with issues
df = pd.DataFrame({
    'id': [1, 2, 2, 4],
    'value': [10, None, 30, None],
    'category': ['A', 'B', 'B', 'A']
})
Show Solution
def check_data_quality(df):
    """Check data quality and return summary dict."""
    quality_report = {
        'total_rows': len(df),
        'duplicate_rows': df.duplicated().sum(),
        'missing_by_column': df.isna().sum().to_dict(),
        'total_missing': df.isna().sum().sum(),
        'missing_percent': (df.isna().sum().sum() / df.size * 100).round(2)
    }
    return quality_report

# Test
report = check_data_quality(df)
print(report)
# {'total_rows': 4, 'duplicate_rows': 0, 
#  'missing_by_column': {'id': 0, 'value': 2, 'category': 0},
#  'total_missing': 2, 'missing_percent': 16.67}

Task: Create a class EDASummary that takes a DataFrame and provides methods for: shape info, missing values, numeric stats, and categorical stats.

Show Solution
class EDASummary:
    """A class for comprehensive EDA summary statistics."""
    
    def __init__(self, df):
        self.df = df
        self.numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
        self.categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
    
    def shape_info(self):
        """Return basic shape information."""
        return {
            'rows': self.df.shape[0],
            'columns': self.df.shape[1],
            'memory_mb': self.df.memory_usage(deep=True).sum() / 1024**2
        }
    
    def missing_summary(self):
        """Return missing value summary."""
        missing = self.df.isna().sum()
        return pd.DataFrame({
            'missing_count': missing,
            'missing_pct': (missing / len(self.df) * 100).round(2)
        }).query('missing_count > 0').sort_values('missing_count', ascending=False)
    
    def numeric_summary(self):
        """Return summary stats for numeric columns."""
        if not self.numeric_cols:
            return "No numeric columns"
        return self.df[self.numeric_cols].describe()
    
    def categorical_summary(self):
        """Return summary for categorical columns."""
        summary = {}
        for col in self.categorical_cols:
            summary[col] = {
                'unique': self.df[col].nunique(),
                'top_value': self.df[col].mode()[0] if len(self.df[col].mode()) > 0 else None,
                'top_freq': self.df[col].value_counts().iloc[0] if self.df[col].notna().any() else 0
            }
        return pd.DataFrame(summary).T

# Usage
eda = EDASummary(df)
print(eda.shape_info())
print(eda.missing_summary())
print(eda.numeric_summary())
print(eda.categorical_summary())
02

The EDA Workflow

A structured EDA workflow ensures you do not miss important aspects of your data. While the specific steps may vary by project, most EDA processes follow a similar pattern. Think of it as a conversation with your data, where each question leads to new insights and further questions.

Figure: The systematic 5-step EDA workflow (Understand, Clean, Explore, Visualize, Document) with key activities at each stage
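
As a quick orientation before step 1, here is a minimal sketch of how the five steps might translate into pandas calls; 'data.csv' and the note contents are placeholders, and real projects loop back through these steps repeatedly.

# Minimal sketch of the 5-step workflow ('data.csv' is a placeholder path)
import pandas as pd

df = pd.read_csv('data.csv')

# 1. Understand: structure, types, sample values
print(df.shape)
print(df.dtypes)
print(df.head())

# 2. Clean: drop exact duplicates, take stock of missing values
df = df.drop_duplicates()
print(df.isna().sum())

# 3. Explore: summary statistics and relationships
print(df.describe())
print(df.corr(numeric_only=True))

# 4. Visualize: distributions at a glance
df.hist(figsize=(10, 6))

# 5. Document: record findings and decisions as you go
notes = ["Dropped exact duplicates", "price looks right-skewed"]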

Step 1: Understand the Data (5-10 minutes)

Get familiar with your dataset's structure, size, and basic characteristics. This is your first conversation with the data - learn what columns exist, their data types, and preview sample values.

Goals:
  • Know the dataset dimensions
  • Understand column meanings
  • Identify data types
  • Preview actual values
Pro Tips:
Use df.sample() for random rows
Check df.columns.tolist()
Example Code
# First steps with any new dataset
import pandas as pd

df = pd.read_csv('data.csv')

# 1. Check dimensions
print(f"Dataset: {df.shape[0]:,} rows, {df.shape[1]} cols")

# 2. View column info
print(df.info())

# 3. Preview data
print(df.head())

Key Questions to Answer

During EDA, keep asking questions about your data. Good questions lead to insights that improve your analysis. Here are essential questions to answer for any dataset:

Category        Questions to Ask
Structure       How many rows and columns? What are the data types?
Quality         Are there missing values? Duplicates? Outliers?
Distribution    What does each variable look like? Normal? Skewed?
Relationships   How do variables relate to each other? To the target?
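
Each category above maps to a short pandas check; the sketch below is one possible cheat sheet, with 'target' standing in for whatever outcome column your project has.

# One-line checks for each question category ('target' is a placeholder)
print(df.shape, df.dtypes, sep='\n')             # Structure
print(df.isna().sum(), df.duplicated().sum())    # Quality
print(df.describe())                             # Distribution
print(df.skew(numeric_only=True))                # Distribution: skewness
print(df.corr(numeric_only=True)['target'])      # Relationships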

Practice: EDA Workflow

Scenario: You received an e-commerce dataset and need to perform a complete initial analysis before building any models.

import pandas as pd
import numpy as np
df = pd.DataFrame({
    'product': ['Laptop', 'Phone', 'Tablet', 'Monitor', 'Keyboard'] * 20,
    'price': list(np.random.randint(50, 1200, 100)),
    'quantity': list(np.random.randint(1, 50, 100)),
    'category': ['Electronics', 'Electronics', 'Electronics', 'Accessories', 'Accessories'] * 20,
    'revenue': list(np.random.randint(500, 50000, 100))
})
df.loc[5:10, 'price'] = None  # Add some missing values

Task: Write a comprehensive EDA function that outputs: shape, types, missing values, numeric summaries, categorical value counts, and potential outliers (using IQR).

Show Solution
def complete_eda(df):
    """Perform comprehensive EDA on any DataFrame."""
    print("="*60)
    print("1. DATASET OVERVIEW")
    print("="*60)
    print(f"Shape: {df.shape[0]} rows, {df.shape[1]} columns")
    print(f"Memory: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    
    print("\n" + "="*60)
    print("2. DATA TYPES")
    print("="*60)
    print(df.dtypes)
    
    print("\n" + "="*60)
    print("3. MISSING VALUES")
    print("="*60)
    missing = df.isna().sum()
    if missing.sum() > 0:
        print(missing[missing > 0])
    else:
        print("No missing values!")
    
    print("\n" + "="*60)
    print("4. NUMERIC COLUMNS")
    print("="*60)
    print(df.describe())
    
    print("\n" + "="*60)
    print("5. CATEGORICAL COLUMNS")
    print("="*60)
    for col in df.select_dtypes(include=['object', 'category']).columns:
        print(f"\n{col}:")
        print(df[col].value_counts().head())
    
    print("\n" + "="*60)
    print("6. POTENTIAL OUTLIERS (IQR Method)")
    print("="*60)
    for col in df.select_dtypes(include=['number']).columns:
        Q1, Q3 = df[col].quantile([0.25, 0.75])
        IQR = Q3 - Q1
        outliers = df[(df[col] < Q1 - 1.5*IQR) | (df[col] > Q3 + 1.5*IQR)]
        if len(outliers) > 0:
            print(f"{col}: {len(outliers)} potential outliers")

complete_eda(df)

Task: Write a function analyze_categorical(df, col) that prints value counts, unique count, and most common value for a categorical column.

# Test data
df = pd.DataFrame({
    'region': ['North', 'South', 'North', 'East', 'North', 'South', 'West', 'North'],
    'status': ['Active', 'Active', 'Inactive', 'Active', 'Active', 'Inactive', 'Active', 'Active']
})
Show Solution
def analyze_categorical(df, col):
    """Analyze a categorical column in a DataFrame."""
    print(f"Analysis of '{col}'")
    print("=" * 40)
    print(f"Unique values: {df[col].nunique()}")
    print(f"Most common: {df[col].mode()[0]}")
    print(f"\nValue counts:")
    print(df[col].value_counts())
    print(f"\nValue percentages:")
    print((df[col].value_counts(normalize=True) * 100).round(1))

# Test
analyze_categorical(df, 'region')
# Analysis of 'region'
# ========================================
# Unique values: 4
# Most common: North
# Value counts:
# North    4
# South    2
# East     1
# West     1

Task: Write Python code to perform a complete initial EDA on any DataFrame df.

Show Solution
def quick_eda(df):
    """Perform comprehensive initial EDA on a DataFrame."""
    print("=" * 60)
    print("DATASET OVERVIEW")
    print("=" * 60)
    print(f"Shape: {df.shape[0]} rows, {df.shape[1]} columns")
    print(f"Memory: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    
    print("\n" + "=" * 60)
    print("DATA TYPES")
    print("=" * 60)
    print(df.dtypes.value_counts())
    
    print("\n" + "=" * 60)
    print("MISSING VALUES")
    print("=" * 60)
    missing = df.isnull().sum()
    missing_pct = (missing / len(df) * 100).round(2)
    missing_df = pd.DataFrame({'Missing': missing, 'Percent': missing_pct})
    print(missing_df[missing_df['Missing'] > 0])
    
    print("\n" + "=" * 60)
    print("DUPLICATES")
    print("=" * 60)
    print(f"Duplicate rows: {df.duplicated().sum()}")
    
    print("\n" + "=" * 60)
    print("NUMERICAL SUMMARY")
    print("=" * 60)
    print(df.describe().round(2))
    
    print("\n" + "=" * 60)
    print("CATEGORICAL SUMMARY")
    print("=" * 60)
    for col in df.select_dtypes(include='object').columns:
        print(f"\n{col}: {df[col].nunique()} unique values")
        print(df[col].value_counts().head(5))

# Usage: quick_eda(df)
03

First Look at Data

When you receive a new dataset, your first task is to get a high-level understanding of what you are working with. Pandas provides several methods that give you immediate insights into your data's structure and content. These initial checks take seconds but save hours of confusion later.

Essential First Commands

Start every analysis with these fundamental commands. They reveal the basic shape, types, and sample values in your dataset. Think of this as a quick health check for your data.

import pandas as pd

# Load your dataset
df = pd.read_csv('sales_data.csv')

# Basic structure
print(df.shape)              # (1000, 12) - 1000 rows, 12 columns
print(df.columns.tolist())   # List all column names
print(df.dtypes)             # Data type of each column

# Sample data
print(df.head())             # First 5 rows
print(df.tail())             # Last 5 rows
print(df.sample(5))          # Random 5 rows

Summary Statistics with describe()

The describe() method provides statistical summaries for numerical columns. It shows count, mean, standard deviation, min, max, and quartiles. This single command reveals distribution characteristics and potential outliers at a glance.

# Numerical summary
print(df.describe())

# Include all columns (including categorical)
print(df.describe(include='all'))

# Specific percentiles
print(df.describe(percentiles=[.1, .25, .5, .75, .9]))
Pro Tip: When the mean and median (50%) differ significantly, your data is likely skewed. Large differences between max and 75th percentile often indicate outliers.
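
One way to turn that rule of thumb into code is to compare mean and median directly across all numeric columns; a quick sketch:

# Flag skewed columns: large mean-median gaps (in std units) suggest skew
numeric = df.select_dtypes(include='number')
gap = (numeric.mean() - numeric.median()) / numeric.std()
print(gap.sort_values(ascending=False))   # large positive values hint at right skew

# Or use the skewness statistic directly
print(numeric.skew().sort_values(ascending=False))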

Understanding Data Types

Data types affect how you can analyze and visualize variables. Pandas infers types when loading data, but it does not always get them right. Dates might load as strings, categories as objects, and IDs as numbers.

# Check memory usage and types
print(df.info())

# Count by data type
print(df.dtypes.value_counts())

# Convert types as needed
df['date'] = pd.to_datetime(df['date'])
df['category'] = df['category'].astype('category')
df['customer_id'] = df['customer_id'].astype(str)

Practice: First Look at Data

Scenario: HR gave you this describe() output for the 'salary' column:

count    10000.000000
mean     65000.000000
std      25000.000000
min       5000.000000
25%      45000.000000
50%      60000.000000
75%      80000.000000
max     500000.000000

Task: Identify data quality issues, explain what they suggest, and write code to investigate outliers and visualize the distribution.

Show Solution

Issues Identified:

  • Extreme outlier: max ($500K) is more than 6x the 75th percentile ($80K)
  • Right skew: mean ($65K) > median ($60K)
  • Suspicious minimum: $5K seems too low for a full-time salary
# Investigate outliers using IQR
Q1, Q3 = df['salary'].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"IQR: ${IQR:,.0f}")
print(f"Lower bound: ${lower_bound:,.0f}")
print(f"Upper bound: ${upper_bound:,.0f}")

# Find outliers
outliers = df[(df['salary'] < lower_bound) | (df['salary'] > upper_bound)]
print(f"\nOutliers: {len(outliers)} ({len(outliers)/len(df)*100:.1f}%)")

# Visualize
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
df['salary'].hist(bins=50, ax=axes[0])
axes[0].set_title('Salary Distribution')
df.boxplot(column='salary', ax=axes[1])
axes[1].set_title('Salary Boxplot')
plt.show()

Your DataFrame has these columns with incorrect data types:

  • order_date: object (should be datetime)
  • zip_code: int64 (should be string to preserve leading zeros)
  • is_active: int64 (should be boolean)
  • product_category: object (should be category for memory efficiency)

Task: Write code to convert all columns to their appropriate types.

Show Solution
# Convert date string to datetime
df['order_date'] = pd.to_datetime(df['order_date'])

# Convert zip_code to string, preserving leading zeros
df['zip_code'] = df['zip_code'].astype(str).str.zfill(5)

# Convert to boolean
df['is_active'] = df['is_active'].astype(bool)

# Convert to category for memory efficiency
df['product_category'] = df['product_category'].astype('category')

# Verify changes
print(df.dtypes)
print(f"\nMemory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# For multiple conversions at once
# (to parse raw date strings, pd.to_datetime is more robust than astype)
df = df.astype({
    'order_date': 'datetime64[ns]',
    'is_active': 'bool',
    'product_category': 'category'
})

Task: Write a function called first_look() that takes a DataFrame and prints a comprehensive summary including:

  • Shape and memory usage
  • Column names and types
  • Missing value summary
  • Numeric column statistics
  • Categorical column value counts (top 5)
Show Solution
def first_look(df):
    """
    Perform comprehensive first look analysis on a DataFrame.
    """
    print("=" * 60)
    print("DATASET OVERVIEW")
    print("=" * 60)
    
    # Basic shape and memory
    print(f"\n📊 Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
    memory_mb = df.memory_usage(deep=True).sum() / 1024**2
    print(f"💾 Memory Usage: {memory_mb:.2f} MB")
    
    # Column types summary
    print(f"\n📋 Column Types:")
    for dtype, count in df.dtypes.value_counts().items():
        print(f"   {dtype}: {count} columns")
    
    # Missing values
    print("\n" + "=" * 60)
    print("MISSING VALUES")
    print("=" * 60)
    missing = df.isna().sum()
    missing_pct = (missing / len(df) * 100).round(2)
    missing_df = pd.DataFrame({
        'Missing': missing,
        'Percent': missing_pct
    }).query('Missing > 0').sort_values('Missing', ascending=False)
    
    if len(missing_df) > 0:
        print(missing_df)
    else:
        print("✅ No missing values found!")
    
    # Numeric columns
    numeric_cols = df.select_dtypes(include=['number']).columns
    if len(numeric_cols) > 0:
        print("\n" + "=" * 60)
        print("NUMERIC COLUMN SUMMARY")
        print("=" * 60)
        print(df[numeric_cols].describe().T)
    
    # Categorical columns (top values)
    cat_cols = df.select_dtypes(include=['object', 'category']).columns
    if len(cat_cols) > 0:
        print("\n" + "=" * 60)
        print("CATEGORICAL COLUMNS (Top 5 values)")
        print("=" * 60)
        for col in cat_cols:
            print(f"\n{col} ({df[col].nunique()} unique):")
            print(df[col].value_counts().head())
    
    print("\n" + "=" * 60)

# Usage
first_look(df)
04

Automated Data Profiling

While manual exploration is valuable, automated profiling tools can generate comprehensive reports in seconds. These tools analyze every column, detect data types, calculate statistics, identify correlations, and flag potential issues. They are especially useful for large datasets with many columns.

Using ydata-profiling (formerly pandas-profiling)

The ydata-profiling library generates an interactive HTML report with detailed analysis of each variable. It includes distributions, missing values, correlations, and alerts for potential data quality issues.

# Install: pip install ydata-profiling
from ydata_profiling import ProfileReport

# Generate a comprehensive report
profile = ProfileReport(df, title="Sales Data Analysis")

# Save as HTML
profile.to_file("sales_report.html")

# For large datasets, use minimal mode
profile = ProfileReport(df, minimal=True)
Report Contents: The generated report includes overview statistics, variable analysis, correlation matrices, missing value patterns, and duplicate detection, all in an interactive format.

Quick Profiling with Pandas

For quick profiling without additional libraries, you can create your own summary function. This approach is lightweight and customizable to your specific needs.

def quick_profile(df):
    """Generate a quick data profile summary."""
    print("=" * 50)
    print(f"Dataset Shape: {df.shape[0]} rows, {df.shape[1]} columns")
    print("=" * 50)
    
    for col in df.columns:
        print(f"\n{col}")
        print(f"  Type: {df[col].dtype}")
        print(f"  Missing: {df[col].isna().sum()} ({df[col].isna().mean()*100:.1f}%)")
        print(f"  Unique: {df[col].nunique()}")
        
        if df[col].dtype in ['int64', 'float64']:
            print(f"  Range: [{df[col].min()}, {df[col].max()}]")

quick_profile(df)

Practice: Automated Profiling

Scenario: You have a 50,000-row sales DataFrame and need to generate a quick profiling report for stakeholders.

Task: Write code to generate a ydata-profiling report with settings optimized for large datasets.

Show Solution
from ydata_profiling import ProfileReport

# For large datasets, use minimal mode
profile = ProfileReport(
    sales_df,
    title="Sales Data Profile Report",
    minimal=True,  # Faster for large datasets
    explorative=True
)

# Save to HTML file
profile.to_file("sales_profile_report.html")

# Or display in Jupyter Notebook
profile.to_notebook_iframe()

# For very large datasets (>100K rows), sample first
if len(sales_df) > 100000:
    sample_df = sales_df.sample(n=10000, random_state=42)
    profile = ProfileReport(sample_df, title="Sales Sample Profile")
    profile.to_file("sales_sample_report.html")

You have two DataFrames: train_df (training data) and test_df (test data). You want to compare their distributions to check for data drift.

Task: Write code to generate a comparison report between the two datasets.

Show Solution
from ydata_profiling import ProfileReport

# Generate profiles for both datasets
train_profile = ProfileReport(train_df, title="Training Data")
test_profile = ProfileReport(test_df, title="Test Data")

# Generate comparison report
comparison = train_profile.compare(test_profile)
comparison.to_file("train_test_comparison.html")

# Alternative: Custom comparison function
def compare_datasets(df1, df2, name1="Dataset 1", name2="Dataset 2"):
    """Compare two DataFrames for consistency."""
    print(f"Shape Comparison:")
    print(f"  {name1}: {df1.shape}")
    print(f"  {name2}: {df2.shape}")
    
    print(f"\nColumn Comparison:")
    cols1 = set(df1.columns)
    cols2 = set(df2.columns)
    print(f"  Only in {name1}: {cols1 - cols2}")
    print(f"  Only in {name2}: {cols2 - cols1}")
    
    print(f"\nNumeric Column Statistics:")
    common_numeric = df1.select_dtypes(include=['number']).columns.intersection(
        df2.select_dtypes(include=['number']).columns
    )
    for col in common_numeric:
        print(f"\n  {col}:")
        print(f"    {name1} - Mean: {df1[col].mean():.2f}, Std: {df1[col].std():.2f}")
        print(f"    {name2} - Mean: {df2[col].mean():.2f}, Std: {df2[col].std():.2f}")

compare_datasets(train_df, test_df, "Train", "Test")

Task: Create an enhanced version of quick_profile() that:

  • Detects potential outliers using IQR method
  • Identifies high-cardinality categorical columns
  • Flags columns with high missing percentage (>10%)
  • Calculates correlation with target variable (if provided)
Show Solution
def enhanced_profile(df, target_col=None):
    """
    Generate an enhanced data profile with quality checks.
    
    Parameters:
    -----------
    df : pd.DataFrame
        The DataFrame to profile
    target_col : str, optional
        Target column for correlation analysis
    """
    print("=" * 70)
    print("ENHANCED DATA PROFILE")
    print("=" * 70)
    
    # Basic info
    print(f"\n📊 Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
    
    issues = []
    
    # Analyze each column
    for col in df.columns:
        col_issues = []
        
        # Missing values check
        missing_pct = df[col].isna().mean() * 100
        if missing_pct > 10:
            col_issues.append(f"⚠️ High missing: {missing_pct:.1f}%")
        
        # Numeric column analysis
        if df[col].dtype in ['int64', 'float64']:
            Q1 = df[col].quantile(0.25)
            Q3 = df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower = Q1 - 1.5 * IQR
            upper = Q3 + 1.5 * IQR
            outlier_count = ((df[col] < lower) | (df[col] > upper)).sum()
            
            if outlier_count > 0:
                outlier_pct = outlier_count / len(df) * 100
                col_issues.append(f"⚠️ Outliers: {outlier_count} ({outlier_pct:.1f}%)")
        
        # Categorical column analysis
        if df[col].dtype == 'object':
            cardinality = df[col].nunique()
            if cardinality > 50:
                col_issues.append(f"⚠️ High cardinality: {cardinality} unique")
            if cardinality == len(df):
                col_issues.append("⚠️ Possible ID column (all unique)")
        
        # Store issues
        if col_issues:
            issues.append((col, col_issues))
    
    # Print issues summary
    if issues:
        print("\n" + "=" * 70)
        print("🚨 DATA QUALITY ISSUES")
        print("=" * 70)
        for col, col_issues in issues:
            print(f"\n{col}:")
            for issue in col_issues:
                print(f"  {issue}")
    else:
        print("\n✅ No major data quality issues detected!")
    
    # Correlation with target
    if target_col and target_col in df.columns:
        print("\n" + "=" * 70)
        print(f"📈 CORRELATION WITH TARGET: {target_col}")
        print("=" * 70)
        numeric_cols = df.select_dtypes(include=['number']).columns
        correlations = df[numeric_cols].corrwith(df[target_col]).sort_values(ascending=False)
        print(correlations)
    
    print("\n" + "=" * 70)

# Usage
enhanced_profile(df, target_col='sales')
05

Data Quality Assessment

Data quality issues can silently corrupt your analysis and models. A thorough quality assessment identifies problems like missing values, duplicates, inconsistent formatting, and impossible values. Catching these issues early saves significant time and prevents misleading results.

Figure: Data profiling dashboard with completeness donut, column types, missing values, distributions, correlations, and outliers
Figure: Data quality metric gauges showing completeness, uniqueness, validity, and consistency scores

Checking for Missing Values

Missing data is one of the most common data quality issues. Understanding the pattern and extent of missing values helps you decide how to handle them. Random missing values are easier to handle than systematic patterns that might indicate data collection problems.

# Count missing values
print(df.isna().sum())

# Percentage missing per column
print((df.isna().sum() / len(df) * 100).round(2))

# Visualize missing patterns
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
sns.heatmap(df.isna(), cbar=True, yticklabels=False)
plt.title('Missing Value Patterns')
plt.show()
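
A simple way to probe whether missingness is random or systematic is to check whether one column's missing rate varies across groups of another; 'salary' and 'department' below are placeholder column names.

# Does the missing rate of 'salary' differ by 'department'?
missing_rate = df['salary'].isna().groupby(df['department']).mean()
print((missing_rate * 100).round(1).sort_values(ascending=False))

# Correlate missing-indicators across columns to spot joint patterns
print(df.isna().astype(int).corr().round(2))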

Detecting Duplicates

Duplicate records can inflate statistics and bias models. Check for exact duplicates and near-duplicates (rows that differ only in one or two columns). The appropriate action depends on whether duplicates represent data errors or legitimate repeated observations.

# Check for duplicate rows
print(f"Duplicate rows: {df.duplicated().sum()}")

# View duplicate rows
print(df[df.duplicated(keep=False)])

# Check duplicates based on specific columns
print(f"Duplicate IDs: {df.duplicated(subset=['customer_id']).sum()}")

# Remove duplicates (keep first occurrence)
df_clean = df.drop_duplicates()
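
For near-duplicates, one approach is to test duplication on every column except the one allowed to differ; 'timestamp' below is a placeholder for that column.

# Near-duplicates: rows identical in all columns except 'timestamp'
key_cols = [c for c in df.columns if c != 'timestamp']
near_dupes = df[df.duplicated(subset=key_cols, keep=False)]
print(f"Near-duplicate rows: {len(near_dupes)}")
print(near_dupes.sort_values(key_cols))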

Validating Data Ranges

Domain knowledge helps identify impossible or suspicious values. Ages cannot be negative, percentages should be 0 to 100, and dates should be within expected ranges. Create validation rules based on your understanding of what the data should look like.

# Check for impossible values
print(df[df['age'] < 0])                    # Negative ages
print(df[df['price'] < 0])                  # Negative prices
print(df[df['percentage'] > 100])           # Over 100%

# Validate date ranges
df['date'] = pd.to_datetime(df['date'])
print(df[(df['date'] < '2020-01-01') | (df['date'] > '2024-12-31')])

# Check categorical consistency
print(df['status'].value_counts())          # Look for typos
Common Issues: Watch for mixed case text (NYC vs nyc), trailing spaces, inconsistent date formats, and placeholder values like 999 or -1 used for missing data.
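
Each of these issues can be detected, and usually fixed, in a line or two; a sketch, with 'city', 'date', and 'age' as placeholder columns:

# Mixed case and stray whitespace: normalize before comparing
df['city'] = df['city'].str.strip().str.upper()   # 'nyc ', 'NYC' -> 'NYC'

# Inconsistent date formats: coerce and count what failed to parse
parsed = pd.to_datetime(df['date'], errors='coerce')
print(f"Newly unparseable dates: {parsed.isna().sum() - df['date'].isna().sum()}")

# Sentinel codes masquerading as real values
sentinels = [999, -1]
print(f"Sentinel ages: {df['age'].isin(sentinels).sum()}")
df['age'] = df['age'].replace(sentinels, pd.NA)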

Practice: Data Quality Assessment

Scenario: You received a messy customer dataset with various missing value patterns. You need to understand and visualize the missingness before deciding on imputation strategies.

Task: Write code to: (1) show missing value counts/percentages sorted by severity, (2) identify columns with >10% missing that need attention, and (3) create a horizontal bar chart visualization.

Show Solution
import pandas as pd
import matplotlib.pyplot as plt

def analyze_missing_values(df, threshold=10):
    """Comprehensive missing value analysis."""
    # Calculate statistics
    missing_count = df.isna().sum()
    missing_pct = (missing_count / len(df) * 100).round(2)
    
    # Create summary DataFrame
    summary = pd.DataFrame({
        'Missing Count': missing_count,
        'Missing %': missing_pct,
        'Data Type': df.dtypes
    })
    
    # Filter and sort
    summary = summary[summary['Missing Count'] > 0]
    summary = summary.sort_values('Missing Count', ascending=False)
    
    print("MISSING VALUE REPORT")
    print("="*50)
    print(summary)
    
    # Flag high-missing columns
    high_missing = summary[summary['Missing %'] > threshold]
    if len(high_missing) > 0:
        print(f"\n⚠️ ATTENTION NEEDED (>{threshold}% missing):")
        print(list(high_missing.index))
    
    # Visualization
    if len(summary) > 0:
        plt.figure(figsize=(10, max(4, len(summary)*0.4)))
        colors = ['red' if x > threshold else 'steelblue' 
                  for x in summary['Missing %']]
        plt.barh(summary.index, summary['Missing %'], color=colors)
        plt.xlabel('Missing Percentage')
        plt.title('Missing Values by Column')
        plt.axvline(x=threshold, color='red', linestyle='--', 
                    label=f'{threshold}% threshold')
        plt.legend()
        plt.gca().invert_yaxis()
        plt.tight_layout()
        plt.show()
    
    return summary

# Usage
analyze_missing_values(df, threshold=10)

Your order DataFrame might have duplicate orders. You need to:

  1. Find exact duplicate rows
  2. Find duplicate order_ids (same order entered multiple times)
  3. Identify orders with same customer_id and date (potential duplicates)
  4. Remove duplicates keeping the most recent entry
Show Solution
# 1. Exact duplicate rows
exact_dupes = df[df.duplicated(keep=False)]
print(f"Exact duplicates: {len(exact_dupes)}")

# 2. Duplicate order_ids
dupe_orders = df[df.duplicated(subset=['order_id'], keep=False)]
print(f"Duplicate order_ids: {len(dupe_orders)}")
print(dupe_orders.sort_values('order_id'))

# 3. Same customer and date (potential duplicates)
potential_dupes = df[df.duplicated(subset=['customer_id', 'order_date'], keep=False)]
print(f"Same customer/date: {len(potential_dupes)}")
print(potential_dupes.sort_values(['customer_id', 'order_date']))

# 4. Remove duplicates, keeping most recent
# First, ensure order_date is datetime
df['order_date'] = pd.to_datetime(df['order_date'])

# Sort by order_id and date (newest first), then keep first
df_deduped = df.sort_values(['order_id', 'order_date'], ascending=[True, False])
df_deduped = df_deduped.drop_duplicates(subset=['order_id'], keep='first')

print(f"Original rows: {len(df)}")
print(f"After deduplication: {len(df_deduped)}")
print(f"Removed: {len(df) - len(df_deduped)}")

Your employee DataFrame should follow these business rules:

  • Age must be between 18 and 100
  • Salary must be positive
  • Start_date must be before today
  • Department must be one of: ['Sales', 'Engineering', 'HR', 'Marketing', 'Finance']

Task: Write code to find all rows that violate any of these rules.

Show Solution
import pandas as pd
from datetime import datetime

# Define valid departments
valid_departments = ['Sales', 'Engineering', 'HR', 'Marketing', 'Finance']

# Convert date column
df['start_date'] = pd.to_datetime(df['start_date'])
today = datetime.now()

# Create validation masks
invalid_age = (df['age'] < 18) | (df['age'] > 100)
invalid_salary = df['salary'] <= 0
invalid_date = df['start_date'] > today
invalid_dept = ~df['department'].isin(valid_departments)

# Find all violations
violations = pd.DataFrame({
    'Invalid Age': invalid_age,
    'Invalid Salary': invalid_salary,
    'Future Start Date': invalid_date,
    'Invalid Department': invalid_dept
})

# Any violation
any_violation = violations.any(axis=1)
print(f"Rows with violations: {any_violation.sum()}")

# Show detailed violations
violations_df = df[any_violation].copy()
violations_df['Violations'] = violations.apply(
    lambda row: ', '.join(violations.columns[row]), axis=1
)
print(violations_df[['employee_id', 'age', 'salary', 'start_date', 'department', 'Violations']])

# Summary by violation type
print("\nViolation Summary:")
for col in violations.columns:
    count = violations[col].sum()
    if count > 0:
        print(f"  {col}: {count} rows")

Task: Create a comprehensive data_quality_report() function that generates a complete data quality assessment including:

  • Completeness score (percentage of non-null values)
  • Uniqueness score (for potential ID columns)
  • Validity score (based on custom rules)
  • Consistency checks (format consistency within columns)
  • Overall data quality score
Show Solution
def data_quality_report(df, id_columns=None, rules=None):
    """
    Generate a comprehensive data quality report.
    
    Parameters:
    -----------
    df : pd.DataFrame
    id_columns : list, optional - Columns that should be unique
    rules : dict, optional - Validation rules {column: function}
    
    Returns:
    --------
    dict : Quality scores and issues
    """
    report = {
        'scores': {},
        'issues': [],
        'summary': {}
    }
    
    # 1. COMPLETENESS
    completeness_scores = (1 - df.isna().mean()) * 100
    overall_completeness = completeness_scores.mean()
    report['scores']['completeness'] = overall_completeness
    
    for col in df.columns:
        if completeness_scores[col] < 95:
            report['issues'].append({
                'type': 'completeness',
                'column': col,
                'score': completeness_scores[col],
                'message': f'{col}: {100-completeness_scores[col]:.1f}% missing'
            })
    
    # 2. UNIQUENESS (for ID columns)
    if id_columns:
        uniqueness_scores = []
        for col in id_columns:
            unique_ratio = df[col].nunique() / len(df) * 100
            uniqueness_scores.append(unique_ratio)
            if unique_ratio < 100:
                dupes = len(df) - df[col].nunique()
                report['issues'].append({
                    'type': 'uniqueness',
                    'column': col,
                    'score': unique_ratio,
                    'message': f'{col}: {dupes} duplicate values'
                })
        report['scores']['uniqueness'] = sum(uniqueness_scores) / len(uniqueness_scores) if uniqueness_scores else 100
    else:
        report['scores']['uniqueness'] = 100
    
    # 3. VALIDITY (custom rules)
    if rules:
        validity_scores = []
        for col, rule_func in rules.items():
            if col in df.columns:
                valid = df[col].apply(rule_func)
                validity_score = valid.mean() * 100
                validity_scores.append(validity_score)
                if validity_score < 100:
                    invalid_count = (~valid).sum()
                    report['issues'].append({
                        'type': 'validity',
                        'column': col,
                        'score': validity_score,
                        'message': f'{col}: {invalid_count} invalid values'
                    })
        report['scores']['validity'] = sum(validity_scores) / len(validity_scores) if validity_scores else 100
    else:
        report['scores']['validity'] = 100
    
    # 4. CONSISTENCY (format checks for string columns)
    consistency_scores = []
    for col in df.select_dtypes(include=['object']).columns:
        # Check for mixed case
        if df[col].dropna().str.lower().nunique() < df[col].nunique():
            report['issues'].append({
                'type': 'consistency',
                'column': col,
                'message': f'{col}: Mixed case values detected'
            })
        # Check for leading/trailing spaces
        has_spaces = df[col].dropna().str.strip() != df[col].dropna()
        if has_spaces.any():
            report['issues'].append({
                'type': 'consistency',
                'column': col,
                'message': f'{col}: {has_spaces.sum()} values with extra spaces'
            })
    
    report['scores']['consistency'] = 100 - min(len([i for i in report['issues'] if i['type'] == 'consistency']) * 5, 50)
    
    # 5. OVERALL SCORE
    weights = {'completeness': 0.3, 'uniqueness': 0.2, 'validity': 0.3, 'consistency': 0.2}
    overall = sum(report['scores'][k] * weights[k] for k in weights)
    report['scores']['overall'] = overall
    
    # Print report
    print("=" * 60)
    print("DATA QUALITY REPORT")
    print("=" * 60)
    print(f"\n📊 Overall Quality Score: {overall:.1f}%")
    print(f"\n   Completeness: {report['scores']['completeness']:.1f}%")
    print(f"   Uniqueness:   {report['scores']['uniqueness']:.1f}%")
    print(f"   Validity:     {report['scores']['validity']:.1f}%")
    print(f"   Consistency:  {report['scores']['consistency']:.1f}%")
    
    if report['issues']:
        print(f"\n⚠️ Issues Found: {len(report['issues'])}")
        for issue in report['issues']:
            print(f"   - {issue['message']}")
    
    return report

# Usage example
rules = {
    'age': lambda x: pd.isna(x) or (18 <= x <= 100),
    'email': lambda x: pd.isna(x) or '@' in str(x),
    'salary': lambda x: pd.isna(x) or x > 0
}

report = data_quality_report(
    df,
    id_columns=['employee_id'],
    rules=rules
)

Key Takeaways

EDA First

Always explore your data before building models to avoid flawed assumptions

Systematic Approach

Follow a structured workflow: understand, clean, explore, visualize, document

Start Simple

Use shape, dtypes, describe(), and head() for immediate insights

Automate When Possible

Use profiling tools like ydata-profiling for comprehensive automated reports

Check Quality Early

Identify missing values, duplicates, and invalid data before analysis

Document Findings

Record insights and decisions for reproducibility and communication
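
Documentation does not need heavy tooling; even appending dated notes to a markdown file keeps decisions reproducible. A minimal sketch (the file name and example findings are illustrative):

from datetime import datetime

def log_finding(finding, decision, path='eda_log.md'):
    """Append one dated EDA finding and the decision taken to a log file."""
    stamp = datetime.now().strftime('%Y-%m-%d %H:%M')
    with open(path, 'a') as f:
        f.write(f"- {stamp}: {finding} -> decision: {decision}\n")

# Example entries (illustrative)
log_finding("salary max far exceeds 75th percentile", "investigate before capping")
log_finding("zip_code loses leading zeros on load", "read as str and zfill(5)")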

Knowledge Check

Test your understanding of EDA concepts:

Question 1 of 6

What is the primary purpose of Exploratory Data Analysis (EDA)?

Question 2 of 6

Which pandas method provides statistical summaries including mean, std, min, and max?

Question 3 of 6

In describe() output, what does a large difference between mean and median (50%) typically indicate?

Question 4 of 6

Which method would you use to check for duplicate rows in a DataFrame?

Question 5 of 6

What is ydata-profiling (formerly pandas-profiling) used for?

Question 6 of 6

Which of these is NOT a typical data quality issue to check during EDA?
