Why Data Cleaning Matters
Data cleaning is often the most time-consuming yet critical step in the data science process. Real-world data is messy, with missing values, duplicates, inconsistent formats, and errors that can severely impact your analysis if left unaddressed. Practitioner surveys commonly estimate that data scientists spend 60-80% of their time cleaning and preparing data before any actual analysis can begin. In this topic, you'll learn the essential techniques to transform raw, messy data into clean, analysis-ready datasets using Pandas' powerful data cleaning tools.
The Reality of Messy Data
When you work with real-world datasets, you'll rarely encounter perfectly clean data. Sensors fail and record null values, users skip form fields, data entry errors introduce typos, systems export data in inconsistent formats, and databases accumulate duplicate records over time. Each of these issues can lead to incorrect conclusions if not properly addressed. For example, calculating the average salary when the dataset contains missing values will give you a skewed result unless you decide how to handle those gaps. Understanding the nature and impact of data quality issues is the first step toward effective data cleaning.
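To see concretely why unhandled gaps matter, here is a small sketch (with made-up salary figures) contrasting pandas' default NaN-skipping mean with a naive fill:

```python
import pandas as pd
import numpy as np

# Hypothetical salaries with one missing value
salaries = pd.Series([50000, 60000, np.nan, 70000])

# pandas skips NaN by default, so the mean uses only the 3 known values
print(salaries.mean())            # 60000.0

# Filling the gap with 0 silently drags the average down
print(salaries.fillna(0).mean())  # 45000.0
```

The two results differ by 15000, so the decision about how to treat the gap changes the conclusion, not just the code.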
Common Data Quality Issues
Let's explore the four main categories of data quality problems you'll encounter. Missing values occur when data is absent due to collection failures, user behavior, or system errors. Duplicates happen when the same record appears multiple times, often from merging datasets or data entry mistakes. Type inconsistencies arise when numbers are stored as strings, dates have wrong formats, or categories use mixed capitalization. String problems include extra whitespace, special characters, inconsistent formatting, and data that needs pattern extraction. Each of these issues requires different cleaning strategies, which we'll cover in detail throughout this topic.
Data Cleaning
Data cleaning is the process of detecting and correcting (or removing) corrupt, inaccurate, incomplete, or irrelevant data from a dataset. It involves identifying data quality issues, deciding on appropriate handling strategies, and applying transformations to ensure data integrity and consistency.
Why it matters: Clean data leads to accurate insights, reliable models, and trustworthy business decisions. The garbage-in, garbage-out principle means that no amount of sophisticated analysis can compensate for poor-quality data.
The Data Cleaning Workflow
Effective data cleaning follows a systematic workflow that helps you address quality issues methodically. First, you inspect the data to understand its structure, identify issues, and assess their severity. Next, you develop a cleaning strategy by deciding which issues to fix, which records to keep or remove, and how to handle edge cases. Then you apply transformations using Pandas methods to clean the data according to your strategy. Finally, you validate the results by verifying that cleaning worked as intended and that data quality has improved. This workflow ensures that your cleaning process is reproducible, well-documented, and produces reliable results.
Inspect
Examine data structure, identify missing values, duplicates, and inconsistencies using info(), describe(), and value_counts()
Strategize
Decide how to handle each issue based on domain knowledge, analysis goals, and data characteristics
Transform
Apply Pandas methods like fillna(), drop_duplicates(), astype(), and str operations to clean data
Validate
Verify cleaning results, check data quality metrics, and ensure no unintended changes were introduced
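The four steps above can be sketched in miniature on a tiny made-up dataset:

```python
import pandas as pd
import numpy as np

# Toy dataset (hypothetical) to walk through the four steps
df = pd.DataFrame({'name': ['Ann', 'Ann', None], 'score': [90, 90, np.nan]})

# 1. Inspect: how bad is it?
missing = df.isnull().sum().sum()   # 2 missing cells
dupes = df.duplicated().sum()       # 1 duplicate row

# 2. Strategize: drop duplicates, fill the gaps with a placeholder and the mean
# 3. Transform
clean = df.drop_duplicates().fillna({'name': 'Unknown', 'score': df['score'].mean()})

# 4. Validate: no missing values or duplicates remain
assert clean.isnull().sum().sum() == 0
assert clean.duplicated().sum() == 0
print(clean)
```

The validation step deliberately uses assertions: if a later change to the cleaning logic reintroduces an issue, the script fails loudly instead of silently producing dirty output.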
Tools for Data Cleaning
Pandas provides a comprehensive set of methods specifically designed for data cleaning tasks. For missing values, you'll use isnull(), notnull(), fillna(), and dropna() to detect and handle gaps in your data. For duplicates, duplicated() and drop_duplicates() help you find and remove redundant records. Type conversions are handled by astype(), to_numeric(), to_datetime(), and pd.Categorical. String operations use the .str accessor which provides dozens of methods for text processing, plus full regex support. Throughout this topic, we'll explore each of these tools in depth with practical examples that show you exactly when and how to use them.
Handling Missing Values
Missing values are one of the most common data quality issues you'll encounter. They appear as NaN (Not a Number), None, or empty strings, and can occur for many reasons such as data not being collected, errors during data transfer, or users leaving form fields blank. How you handle missing values can significantly impact your analysis results, so it's crucial to understand detection methods, different strategies for handling them, and when to apply each approach. Pandas provides powerful tools to identify, analyze, and address missing data in various ways.
Detecting Missing Values
Before you can handle missing values, you need to know where they are and how prevalent they are in your dataset. Pandas provides several methods to detect missing data. The isnull() method returns a boolean DataFrame showing True where values are missing and False otherwise. Its inverse, notnull(), returns True for non-missing values. You can combine these with sum() to count missing values per column, or use info() to get a quick overview of non-null counts across all columns. Understanding the pattern and extent of missing data helps you choose the right handling strategy.
import pandas as pd
import numpy as np
# Create sample data with missing values
data = {
'name': ['Alice', 'Bob', None, 'David', 'Eve'],
'age': [25, 30, np.nan, 35, 28],
'salary': [50000, np.nan, 60000, 70000, np.nan],
'department': ['Sales', 'IT', 'Sales', None, 'IT']
}
df = pd.DataFrame(data)
print("DataFrame with missing values:")
print(df)
# name age salary department
# 0 Alice 25.0 50000.0 Sales
# 1 Bob 30.0 NaN IT
# 2 None NaN 60000.0 Sales
# 3 David 35.0 70000.0 None
# 4 Eve 28.0 NaN IT
# Check for missing values
print("\nMissing values (boolean):")
print(df.isnull())
# name age salary department
# 0 False False False False
# 1 False False True False
# 2 True True False False
# 3 False False False True
# 4 False False True False
# Count missing values per column
print("\nMissing count per column:")
print(df.isnull().sum())
# name 1
# age 1
# salary 2
# department 1
# Percentage of missing values
print("\nMissing percentage:")
print((df.isnull().sum() / len(df) * 100).round(2))
# name 20.0
# age 20.0
# salary 40.0
# department 20.0
Removing Missing Values with dropna()
The simplest approach to handling missing values is to remove them entirely using dropna(). This method provides several options to control which rows or columns are dropped. By default, dropna() removes any row containing at least one missing value. You can use axis=1 to drop columns instead of rows, or use the thresh parameter to specify a minimum number of non-null values required to keep a row. The subset parameter lets you check for missing values only in specific columns. The inplace parameter determines whether to modify the original DataFrame or return a new one. Be cautious when dropping data, as you might lose valuable information or introduce bias into your analysis.
# Drop rows with any missing values
df_dropped_any = df.dropna()
print("Drop rows with ANY missing values:")
print(df_dropped_any)
# name age salary department
# 0 Alice 25.0 50000.0 Sales
# Drop rows only if ALL values are missing
df_dropped_all = df.dropna(how='all')
print("\nDrop rows only if ALL missing:")
print(df_dropped_all) # All rows kept (no row is entirely null)
# Drop rows with missing values in specific columns
df_dropped_subset = df.dropna(subset=['age', 'salary'])
print("\nDrop rows with missing age OR salary:")
print(df_dropped_subset)
# name age salary department
# 0 Alice 25.0 50000.0 Sales
# 3 David 35.0 70000.0 None
# Keep rows with at least 3 non-null values
df_dropped_thresh = df.dropna(thresh=3)
print("\nKeep rows with at least 3 non-null values:")
print(df_dropped_thresh)
# name age salary department
# 0 Alice 25.0 50000.0 Sales
# 1 Bob 30.0 NaN IT
# 3 David 35.0 70000.0 None
# 4 Eve 28.0 NaN IT
# Row 2 is dropped: only 2 of its 4 values are non-null
# Drop columns with any missing values
df_dropped_cols = df.dropna(axis=1)
print("\nDrop columns with any missing:")
print(df_dropped_cols)
# Empty DataFrame (all columns have at least one null)
Filling Missing Values with fillna()
Instead of removing missing data, you can fill it with appropriate values using fillna(). This method is more flexible than dropping and helps preserve your data. You can fill with a scalar value (like 0 or 'Unknown'), use statistical measures (mean, median, mode), forward-fill (propagate the last valid value) with ffill(), or backward-fill with bfill(); the older fillna(method='ffill') syntax still works in some pandas versions but is deprecated. You can also fill different columns with different values by passing a dictionary. The choice of filling strategy depends on your data type and the nature of the missing values. For numerical data, the mean or median are common choices. For categorical data, the mode or a placeholder like 'Unknown' works well.
# Fill with a scalar value
df_filled_zero = df.fillna(0)
print("Fill all missing with 0:")
print(df_filled_zero)
# name age salary department
# 0 Alice 25.0 50000.0 Sales
# 1 Bob 30.0 0.0 IT
# 2 0 0.0 60000.0 Sales
# 3 David 35.0 70000.0 0
# 4 Eve 28.0 0.0 IT
# Fill different columns with different values
df_filled_dict = df.fillna({
'name': 'Unknown',
'age': df['age'].mean(),
'salary': df['salary'].median(),
'department': 'Unassigned'
})
print("\nFill with column-specific values:")
print(df_filled_dict)
# name age salary department
# 0 Alice 25.0 50000.0 Sales
# 1 Bob 30.0 60000.0 IT
# 2 Unknown 29.5 60000.0 Sales
# 3 David 35.0 70000.0 Unassigned
# 4 Eve 28.0 60000.0 IT
# Forward fill (propagate last valid value)
# fillna(method='ffill') is deprecated in recent pandas; use ffill() instead
df_ffill = df.ffill()
print("\nForward fill:")
print(df_ffill)
# name age salary department
# 0 Alice 25.0 50000.0 Sales
# 1 Bob 30.0 50000.0 IT
# 2 Bob 30.0 60000.0 Sales
# 3 David 35.0 70000.0 Sales
# 4 Eve 28.0 70000.0 IT
# Backward fill
df_bfill = df.bfill()
print("\nBackward fill:")
print(df_bfill)
# name age salary department
# 0 Alice 25.0 50000.0 Sales
# 1 Bob 30.0 60000.0 IT
# 2 David 35.0 60000.0 Sales
# 3 David 35.0 70000.0 IT
# 4 Eve 28.0 NaN IT
Advanced Missing Value Strategies
Beyond simple filling and dropping, there are more sophisticated strategies for handling missing data. You can use interpolate() to fill missing values using interpolation methods, which works well for time series or sequential data. For categorical data, you might fill with the mode (most frequent value) to maintain the distribution. You can also create indicator columns to flag which values were originally missing, allowing you to track imputation in your analysis. In some cases, missing values themselves carry information (for example, a missing survey response might indicate discomfort with the question), so you might want to treat NaN as its own category. The key is to choose a strategy that makes sense for your specific data and analysis goals.
# Interpolate missing values (useful for time series)
series = pd.Series([1, np.nan, np.nan, 4, 5])
interpolated = series.interpolate()
print("Original series:", series.values) # [ 1. nan nan 4. 5.]
print("Interpolated:", interpolated.values) # [1. 2. 3. 4. 5.]
# Fill with mode (most frequent value)
df_mode = df.copy()
df_mode['department'] = df_mode['department'].fillna(
df_mode['department'].mode()[0]
)
print("\nFill department with mode:")
print(df_mode['department'])
# 0 Sales
# 1 IT
# 2 Sales
# 3 IT (IT and Sales tie for most frequent; mode() returns them sorted, so mode()[0] is 'IT')
# 4 IT
# Create indicator columns for missing values
df_with_indicators = df.copy()
df_with_indicators['age_missing'] = df['age'].isnull()
df_with_indicators['salary_missing'] = df['salary'].isnull()
print("\nDataFrame with missing indicators:")
print(df_with_indicators[['age', 'age_missing', 'salary', 'salary_missing']])
# age age_missing salary salary_missing
# 0 25.0 False 50000.0 False
# 1 30.0 False NaN True
# 2 NaN True 60000.0 False
# 3 35.0 False 70000.0 False
# 4 28.0 False NaN True
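When missingness itself carries information, you can keep it visible as its own category instead of imputing a value. A short sketch with hypothetical survey data:

```python
import pandas as pd

# Hypothetical survey responses where a blank answer may itself be informative
responses = pd.Series(['Yes', 'No', None, 'Yes', None])

# Treat missing answers as their own category instead of discarding them
labeled = responses.fillna('No response')
print(labeled.value_counts())
# 'Yes' and 'No response' each appear twice, 'No' once
```

This keeps the non-responders countable in every downstream groupby or plot, rather than silently shrinking the sample.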
Practice Questions
Task: Create a DataFrame with student data and find the total number of missing values in the entire DataFrame.
Given:
data = {
'student': ['John', 'Mary', 'Peter', 'Jane'],
'math': [85, np.nan, 92, 88],
'english': [78, 85, np.nan, 90],
'science': [np.nan, 88, 86, np.nan]
}
df = pd.DataFrame(data)
Expected output:
4
Show Solution
import pandas as pd
import numpy as np
data = {
'student': ['John', 'Mary', 'Peter', 'Jane'],
'math': [85, np.nan, 92, 88],
'english': [78, 85, np.nan, 90],
'science': [np.nan, 88, 86, np.nan]
}
df = pd.DataFrame(data)
# Count total missing values
total_missing = df.isnull().sum().sum()
print(total_missing) # 4
Task: Fill missing math scores with the mean, missing english scores with the median, and missing science scores with 0.
Given:
data = {
'student': ['John', 'Mary', 'Peter', 'Jane'],
'math': [85, np.nan, 92, 88],
'english': [78, 85, np.nan, 90],
'science': [np.nan, 88, 86, np.nan]
}
df = pd.DataFrame(data)
Expected output:
student math english science
0 John 85.000000 78.0 0.0
1 Mary 88.333333 85.0 88.0
2 Peter 92.000000 85.0 86.0
3 Jane 88.000000 90.0 0.0
Show Solution
import pandas as pd
import numpy as np
data = {
'student': ['John', 'Mary', 'Peter', 'Jane'],
'math': [85, np.nan, 92, 88],
'english': [78, 85, np.nan, 90],
'science': [np.nan, 88, 86, np.nan]
}
df = pd.DataFrame(data)
# Fill with different strategies per column
df_filled = df.fillna({
'math': df['math'].mean(),
'english': df['english'].median(),
'science': 0
})
print(df_filled)
Task: Remove only rows where BOTH math AND science scores are missing, but keep rows where only one subject is missing.
Given:
data = {
'student': ['John', 'Mary', 'Peter', 'Jane', 'Tom'],
'math': [85, np.nan, 92, np.nan, 78],
'science': [90, 88, np.nan, np.nan, 85]
}
df = pd.DataFrame(data)
Expected output:
student math science
0 John 85.0 90.0
1 Mary NaN 88.0
2 Peter 92.0 NaN
4 Tom 78.0 85.0
Show Solution
import pandas as pd
import numpy as np
data = {
'student': ['John', 'Mary', 'Peter', 'Jane', 'Tom'],
'math': [85, np.nan, 92, np.nan, 78],
'science': [90, 88, np.nan, np.nan, 85]
}
df = pd.DataFrame(data)
# Keep rows where at least one of math or science is not null
df_cleaned = df.dropna(subset=['math', 'science'], how='all')
print(df_cleaned)
Detecting & Removing Duplicates
Duplicate records can skew your analysis by inflating counts, distorting statistics, and creating misleading patterns in your data. Duplicates arise from various sources such as data entry errors, system glitches during data transfer, merging datasets without proper validation, or users submitting forms multiple times. Pandas provides powerful methods to identify and remove duplicate rows, giving you control over which duplicates to keep and which columns to consider when determining uniqueness. Understanding how to detect and handle duplicates is essential for maintaining data integrity and producing accurate analysis results.
Identifying Duplicates with duplicated()
The duplicated() method returns a boolean Series indicating which rows are duplicates of earlier rows. By default, it marks all duplicate occurrences except the first one as True, making it easy to see which rows would be removed. You can control this behavior with the keep parameter, which accepts 'first' (default), 'last', or False. Using keep='last' marks all duplicates except the last occurrence, while keep=False marks all duplicates including the first occurrence. You can also specify which columns to consider for duplication using the subset parameter, which is useful when you only care about uniqueness in specific fields rather than entire rows.
import pandas as pd
# Create sample data with duplicates
data = {
'customer_id': [101, 102, 103, 101, 104, 102],
'name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Bob'],
'email': ['alice@email.com', 'bob@email.com', 'charlie@email.com',
'alice@email.com', 'david@email.com', 'bob@email.com'],
'purchase_amount': [150, 200, 175, 150, 225, 200]
}
df = pd.DataFrame(data)
print("DataFrame with duplicates:")
print(df)
# customer_id name email purchase_amount
# 0 101 Alice alice@email.com 150
# 1 102 Bob bob@email.com 200
# 2 103 Charlie charlie@email.com 175
# 3 101 Alice alice@email.com 150
# 4 104 David david@email.com 225
# 5 102 Bob bob@email.com 200
# Check for duplicates (marks duplicates as True)
print("\nDuplicate rows (keep first):")
print(df.duplicated())
# 0 False
# 1 False
# 2 False
# 3 True (duplicate of row 0)
# 4 False
# 5 True (duplicate of row 1)
# Show which rows are duplicates
print("\nDuplicate rows:")
print(df[df.duplicated()])
# customer_id name email purchase_amount
# 3 101 Alice alice@email.com 150
# 5 102 Bob bob@email.com 200
# Mark all occurrences of duplicates (including first)
print("\nAll duplicate occurrences:")
print(df.duplicated(keep=False))
# 0 True (row 0 and 3 are duplicates)
# 1 True (row 1 and 5 are duplicates)
# 2 False
# 3 True
# 4 False
# 5 True
Removing Duplicates with drop_duplicates()
Once you've identified duplicates, you can remove them using drop_duplicates(). This method returns a new DataFrame with duplicate rows removed, or modifies the original DataFrame if you set inplace=True. Like duplicated(), it accepts keep='first', keep='last', or keep=False parameters to control which duplicates are retained. The most common use case is to keep the first occurrence and remove subsequent duplicates, but sometimes you might want to keep the most recent record (using keep='last') or remove all duplicate occurrences entirely. The method also supports the subset parameter to consider only specific columns when determining uniqueness, which is extremely useful when some fields may naturally repeat while others should be unique.
# Remove duplicate rows (keep first occurrence)
df_no_dupes = df.drop_duplicates()
print("After removing duplicates:")
print(df_no_dupes)
# customer_id name email purchase_amount
# 0 101 Alice alice@email.com 150
# 1 102 Bob bob@email.com 200
# 2 103 Charlie charlie@email.com 175
# 4 104 David david@email.com 225
# Keep last occurrence instead
df_keep_last = df.drop_duplicates(keep='last')
print("\nKeep last occurrence:")
print(df_keep_last)
# customer_id name email purchase_amount
# 2 103 Charlie charlie@email.com 175
# 3 101 Alice alice@email.com 150
# 4 104 David david@email.com 225
# 5 102 Bob bob@email.com 200
# Remove all occurrences of duplicates
df_remove_all = df.drop_duplicates(keep=False)
print("\nRemove all duplicate occurrences:")
print(df_remove_all)
# customer_id name email purchase_amount
# 2 103 Charlie charlie@email.com 175
# 4 104 David david@email.com 225
# Count duplicates before removal
print(f"\nOriginal rows: {len(df)}") # 6
print(f"After removing duplicates: {len(df_no_dupes)}") # 4
print(f"Duplicates removed: {len(df) - len(df_no_dupes)}") # 2
Checking Duplicates in Specific Columns
Often, you don't need entire rows to be identical to consider them duplicates. Instead, you might want to ensure uniqueness based on specific key columns like customer_id, email, or order_number. The subset parameter lets you specify which columns to consider when detecting duplicates. This is particularly useful when you have transaction data where some fields naturally repeat (like product names or categories) but you want to ensure uniqueness on identifying fields. You can pass a single column name as a string or multiple columns as a list. This approach is essential for identifying duplicate entries based on business logic rather than exact row matching.
# Create data where some fields repeat naturally
data = {
'order_id': [1001, 1002, 1003, 1001, 1004],
'customer': ['Alice', 'Bob', 'Alice', 'Alice', 'Charlie'],
'product': ['Laptop', 'Mouse', 'Laptop', 'Laptop', 'Keyboard'],
'quantity': [1, 2, 1, 1, 3]
}
df = pd.DataFrame(data)
print("Orders data:")
print(df)
# order_id customer product quantity
# 0 1001 Alice Laptop 1
# 1 1002 Bob Mouse 2
# 2 1003 Alice Laptop 1
# 3 1001 Alice Laptop 1 (duplicate order_id)
# 4 1004 Charlie Keyboard 3
# Check duplicates based on order_id only
print("\nDuplicates based on order_id:")
print(df.duplicated(subset=['order_id']))
# 0 False
# 1 False
# 2 False
# 3 True (order_id 1001 already exists)
# 4 False
# Remove duplicates based on order_id
df_unique_orders = df.drop_duplicates(subset=['order_id'])
print("\nUnique orders:")
print(df_unique_orders)
# order_id customer product quantity
# 0 1001 Alice Laptop 1
# 1 1002 Bob Mouse 2
# 2 1003 Alice Laptop 1
# 4 1004 Charlie Keyboard 3
# Check duplicates based on multiple columns
print("\nDuplicates based on customer AND product:")
print(df.duplicated(subset=['customer', 'product']))
# 0 False
# 1 False
# 2 False
# 3 True (Alice + Laptop combination already exists)
# 4 False
Advanced Duplicate Handling Strategies
Sometimes simply removing duplicates isn't the best approach. You might want to aggregate duplicate records, keep the row with the maximum or minimum value in a certain column, or flag duplicates for manual review. For aggregation, you can group by key columns and use aggregation functions like sum(), mean(), or max() to combine duplicate records meaningfully. To keep the best record among duplicates, you can sort by a priority column before removing duplicates with keep='first'. For auditing purposes, you might want to count occurrences or add a flag column before removing duplicates. These advanced strategies help you handle duplicates in ways that preserve valuable information rather than simply discarding it.
# Create sales data with duplicate customers
data = {
'customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
'date': ['2024-01-15', '2024-01-16', '2024-01-20',
'2024-01-17', '2024-01-18'],
'amount': [100, 150, 200, 175, 125]
}
df = pd.DataFrame(data)
print("Sales data:")
print(df)
# customer date amount
# 0 Alice 2024-01-15 100
# 1 Bob 2024-01-16 150
# 2 Alice 2024-01-20 200
# 3 Charlie 2024-01-17 175
# 4 Bob 2024-01-18 125
# Aggregate duplicates (sum amounts per customer)
df_aggregated = df.groupby('customer', as_index=False)['amount'].sum()
print("\nAggregated by customer:")
print(df_aggregated)
# customer amount
# 0 Alice 300 (100 + 200)
# 1 Bob 275 (150 + 125)
# 2 Charlie 175
# Keep row with highest amount for each customer
df_sorted = df.sort_values('amount', ascending=False)
df_max_amount = df_sorted.drop_duplicates(subset=['customer'], keep='first')
print("\nKeep highest purchase per customer:")
print(df_max_amount.sort_index())
# customer date amount
# 1 Bob 2024-01-16 150
# 2 Alice 2024-01-20 200
# 3 Charlie 2024-01-17 175
# Count occurrences before removing duplicates
df_with_count = df.copy()
df_with_count['occurrence_count'] = df_with_count.groupby('customer')['customer'].transform('count')
print("\nWith occurrence counts:")
print(df_with_count)
# customer date amount occurrence_count
# 0 Alice 2024-01-15 100 2
# 1 Bob 2024-01-16 150 2
# 2 Alice 2024-01-20 200 2
# 3 Charlie 2024-01-17 175 1
# 4 Bob 2024-01-18 125 2
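To flag duplicates for manual review rather than removing them, you can store the result of duplicated() as a new column. A small sketch with made-up transactions:

```python
import pandas as pd

df = pd.DataFrame({
    'customer': ['Alice', 'Bob', 'Alice', 'Charlie'],
    'amount': [100, 150, 200, 175]
})

# keep=False marks every row whose customer appears more than once
df['needs_review'] = df.duplicated(subset=['customer'], keep=False)
print(df)
#   customer  amount  needs_review
# 0    Alice     100          True
# 1      Bob     150         False
# 2    Alice     200          True
# 3  Charlie     175         False
```

A reviewer can then filter with `df[df['needs_review']]` and decide case by case, without any data having been discarded.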
Practice Questions
Task: Create a DataFrame with product data and count how many duplicate rows exist.
Given:
data = {
'product': ['Apple', 'Banana', 'Apple', 'Orange', 'Banana'],
'price': [1.20, 0.50, 1.20, 0.80, 0.50]
}
df = pd.DataFrame(data)
Expected output:
2
Show Solution
import pandas as pd
data = {
'product': ['Apple', 'Banana', 'Apple', 'Orange', 'Banana'],
'price': [1.20, 0.50, 1.20, 0.80, 0.50]
}
df = pd.DataFrame(data)
# Count duplicates (excluding first occurrence)
duplicate_count = df.duplicated().sum()
print(duplicate_count) # 2
Task: Remove duplicate users based on email address only, keeping the first occurrence.
Given:
data = {
'user_id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Alice Smith', 'Charlie', 'Robert'],
'email': ['alice@test.com', 'bob@test.com', 'alice@test.com',
'charlie@test.com', 'bob@test.com']
}
df = pd.DataFrame(data)
Expected output:
user_id name email
0 1 Alice alice@test.com
1 2 Bob bob@test.com
3 4 Charlie charlie@test.com
Show Solution
import pandas as pd
data = {
'user_id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Alice Smith', 'Charlie', 'Robert'],
'email': ['alice@test.com', 'bob@test.com', 'alice@test.com',
'charlie@test.com', 'bob@test.com']
}
df = pd.DataFrame(data)
# Remove duplicates based on email
df_unique = df.drop_duplicates(subset=['email'], keep='first')
print(df_unique)
Task: For duplicate customer transactions, keep only the most recent transaction based on the date.
Given:
data = {
'customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Alice'],
'date': ['2024-01-10', '2024-01-15', '2024-01-20',
'2024-01-12', '2024-01-25', '2024-01-18'],
'amount': [100, 150, 200, 175, 125, 180]
}
df = pd.DataFrame(data)
Expected output:
customer date amount
2 Alice 2024-01-20 200
4 Bob 2024-01-25 125
3 Charlie 2024-01-12 175
Show Solution
import pandas as pd
data = {
'customer': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Alice'],
'date': ['2024-01-10', '2024-01-15', '2024-01-20',
'2024-01-12', '2024-01-25', '2024-01-18'],
'amount': [100, 150, 200, 175, 125, 180]
}
df = pd.DataFrame(data)
# Convert date to datetime for proper sorting
df['date'] = pd.to_datetime(df['date'])
# Sort by date descending, then keep first (most recent) per customer
df_sorted = df.sort_values('date', ascending=False)
df_latest = df_sorted.drop_duplicates(subset=['customer'], keep='first')
# Sort by original index or customer name for display
df_latest = df_latest.sort_values('customer')
print(df_latest)
Data Type Conversions
Data often comes with incorrect or suboptimal data types that can prevent proper analysis or cause unexpected errors. Numbers might be stored as strings, dates as text, or categories as generic objects. Converting data to the correct types is essential for efficient computation, accurate calculations, and enabling type-specific operations. Pandas provides several methods for type conversion including astype() for general conversions, to_numeric() for handling numeric conversions safely, to_datetime() for parsing dates, and categorical types for memory-efficient storage of repeated values. Understanding when and how to use each conversion method helps you prepare data correctly for analysis.
Checking Current Data Types
Before converting types, you need to understand what types your data currently has. The dtypes attribute shows the data type of each column in your DataFrame. Common types include int64 for integers, float64 for decimal numbers, object for strings or mixed types, bool for boolean values, and datetime64 for dates and times. The info() method provides a comprehensive overview showing the data type, non-null count, and memory usage for each column. Understanding your current data types helps you identify which columns need conversion and what target types are appropriate. Remember that object dtype is Pandas' default for text and mixed-type data, but it's often worth converting to more specific types for better performance and functionality.
import pandas as pd
import numpy as np
# Create sample data with mixed types
data = {
'customer_id': ['001', '002', '003', '004'], # Numbers as strings
'age': ['25', '30', 'unknown', '35'], # Mixed: numbers and text
'salary': ['50000', '60000', '55000', '70000'], # Numbers as strings
'join_date': ['2020-01-15', '2019-06-20', '2021-03-10', '2020-11-05'],
'is_active': ['True', 'False', 'True', 'True'], # Booleans as strings
'department': ['Sales', 'IT', 'Sales', 'HR']
}
df = pd.DataFrame(data)
print("DataFrame with string types:")
print(df)
# customer_id age salary join_date is_active department
# 0 001 25 50000 2020-01-15 True Sales
# 1 002 30 60000 2019-06-20 False IT
# 2 003 unknown 55000 2021-03-10 True Sales
# 3 004 35 70000 2020-11-05 True HR
# Check data types
print("\nCurrent data types:")
print(df.dtypes)
# customer_id object
# age object
# salary object
# join_date object
# is_active object
# department object
# Get detailed info
print("\nDataFrame info:")
df.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 4 entries, 0 to 3
# Data columns (total 6 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 customer_id 4 non-null object
# 1 age 4 non-null object
# 2 salary 4 non-null object
# 3 join_date 4 non-null object
# 4 is_active 4 non-null object
# 5 department 4 non-null object
Basic Type Conversion with astype()
The astype() method is the most straightforward way to convert column types. You specify the target type as a string (like 'int', 'float', 'str', 'bool') or a NumPy dtype. This method is fast and efficient but will raise an error if the conversion fails (for example, trying to convert 'abc' to an integer). You can convert a single column by calling astype() on that column, or convert multiple columns at once by passing a dictionary to the DataFrame's astype() method. Use astype() when you're confident that all values in the column can be converted to the target type. For data that might have invalid values, consider using the more forgiving to_numeric() or to_datetime() methods instead.
# Convert customer_id from string to integer
df_converted = df.copy()
df_converted['customer_id'] = df_converted['customer_id'].astype(int)
print("customer_id converted to int:")
print(df_converted['customer_id'])
# 0 1
# 1 2
# 2 3
# 3 4
# Convert salary from string to integer
df_converted['salary'] = df_converted['salary'].astype(int)
print("\nsalary converted to int:")
print(df_converted['salary'])
# 0 50000
# 1 60000
# 2 55000
# 3 70000
# Convert is_active from string to boolean
# Note: astype(bool) would turn every non-empty string (including 'False')
# into True, so map the string values explicitly instead
df_converted['is_active'] = df_converted['is_active'].map({'True': True, 'False': False})
print("\nis_active converted to bool:")
print(df_converted['is_active'])
# 0 True
# 1 False
# 2 True
# 3 True
# Try converting age (will fail due to 'unknown')
try:
df['age'].astype(int)
except ValueError as e:
print(f"\nError converting age: {e}")
# Error: invalid literal for int() with base 10: 'unknown'
# Check new types
print("\nNew data types:")
print(df_converted.dtypes)
# customer_id int64
# age object (not converted yet)
# salary int64
# join_date object (not converted yet)
# is_active bool
# department object
Safe Numeric Conversion with to_numeric()
When converting strings to numbers, you often encounter invalid values that can't be converted. The to_numeric() function provides robust handling of such cases through its errors parameter. Setting errors='coerce' converts invalid values to NaN instead of raising an error, allowing the conversion to proceed. Setting errors='ignore' returns the input unchanged if any value fails to convert, though this option is deprecated in recent pandas versions, so prefer errors='coerce'. You can also specify the downcast parameter to automatically convert to the smallest possible numeric type, saving memory. The to_numeric() function is ideal for cleaning messy data where you expect some values to be non-numeric, and you want to handle them gracefully rather than stopping the entire process.
# Safely convert age to numeric (coerce invalid to NaN)
df_converted['age'] = pd.to_numeric(df['age'], errors='coerce')
print("age converted with coercion:")
print(df_converted['age'])
# 0 25.0
# 1 30.0
# 2 NaN (unknown became NaN)
# 3 35.0
# Example with downcast to save memory
numbers = pd.Series(['100', '200', '300', '400'])
converted_default = pd.to_numeric(numbers)
converted_small = pd.to_numeric(numbers, downcast='integer')
print("\nWithout downcast:", converted_default.dtype) # int64
print("With downcast:", converted_small.dtype) # int16 (smallest signed type that fits 400)
# Handle mixed data with different strategies
mixed_data = pd.Series(['100', '200', 'invalid', '300'])
# Strategy 1: Coerce to NaN
result_coerce = pd.to_numeric(mixed_data, errors='coerce')
print("\nCoerce strategy:")
print(result_coerce)
# 0 100.0
# 1 200.0
# 2 NaN
# 3 300.0
# Strategy 2: Ignore (returns input unchanged if any value fails)
# Note: errors='ignore' is deprecated since pandas 2.2
result_ignore = pd.to_numeric(mixed_data, errors='ignore')
print("\nIgnore strategy:")
print(result_ignore)
# 0 100
# 1 200
# 2 invalid (kept as string)
# 3 300
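Because errors='ignore' is deprecated, a forward-compatible sketch of the same keep-on-failure behavior combines errors='coerce' with where() to fall back to the original values wherever parsing failed:

```python
import pandas as pd

mixed_data = pd.Series(['100', '200', 'invalid', '300'])

# Coerce first, then restore the original value wherever conversion produced NaN
numbers = pd.to_numeric(mixed_data, errors='coerce')
result = numbers.where(numbers.notna(), mixed_data)
print(result)
# 0 100.0
# 1 200.0
# 2 invalid
# 3 300.0
# dtype: object
```

The result has object dtype because it mixes floats and strings, so follow up with a cleaning decision for the leftover strings before doing arithmetic.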
Date and Time Conversion with to_datetime()
Converting strings to datetime objects is crucial for time-based analysis. The to_datetime() function is remarkably flexible, automatically recognizing many common date formats without explicit specification. You can provide a format string using strftime codes for faster parsing and to handle unusual formats. Like to_numeric(), it supports the errors parameter to handle invalid dates gracefully. The dayfirst parameter helps with ambiguous dates like '01-02-2020' (is it January 2nd or February 1st?). Once converted to datetime, you gain access to powerful time-based operations like date arithmetic, time-based indexing, and temporal aggregations. You can also extract components like year, month, day using the dt accessor.
# Convert join_date to datetime
df_converted['join_date'] = pd.to_datetime(df['join_date'])
print("join_date converted to datetime:")
print(df_converted['join_date'])
# 0 2020-01-15
# 1 2019-06-20
# 2 2021-03-10
# 3 2020-11-05
print("Type:", df_converted['join_date'].dtype) # datetime64[ns]
# Extract date components
df_converted['year'] = df_converted['join_date'].dt.year
df_converted['month'] = df_converted['join_date'].dt.month
df_converted['day_of_week'] = df_converted['join_date'].dt.day_name()
print("\nExtracted date components:")
print(df_converted[['join_date', 'year', 'month', 'day_of_week']])
# join_date year month day_of_week
# 0 2020-01-15 2020 1 Wednesday
# 1 2019-06-20 2019 6 Thursday
# 2 2021-03-10 2021 3 Wednesday
# 3 2020-11-05 2020 11 Thursday
# Handle different date formats
dates_various = pd.Series([
'2020-01-15',
'01/15/2020',
'15-Jan-2020',
'January 15, 2020'
])
# pandas 2.x raises on inconsistent formats unless format='mixed' is passed
converted_dates = pd.to_datetime(dates_various, format='mixed')
print("\nVarious formats converted:")
print(converted_dates)
# 0 2020-01-15
# 1 2020-01-15
# 2 2020-01-15
# 3 2020-01-15
# Specify format for faster parsing
dates_custom = pd.Series(['15/01/2020', '20/06/2019'])
parsed = pd.to_datetime(dates_custom, format='%d/%m/%Y')  # explicit format, so dayfirst isn't needed
print("\nCustom format parsed:")
print(parsed)
# 0 2020-01-15
# 1 2019-06-20
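The errors parameter mentioned above works for dates just as it does for numbers; a small sketch with an invalid entry shows that bad values become NaT (Not a Time) instead of raising an error:

```python
import pandas as pd

raw_dates = pd.Series(['2020-01-15', 'not a date', '2021-03-10'])

# errors='coerce' turns unparseable entries into NaT rather than raising
parsed = pd.to_datetime(raw_dates, errors='coerce')
print(parsed)
# 0   2020-01-15
# 1          NaT
# 2   2021-03-10
```

NaT behaves like NaN for datetimes, so isna(), dropna(), and fillna() all work on it the same way.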
Categorical Data Type for Memory Efficiency
When a column has repeated values (like department names, product categories, or status codes), converting it to categorical type can dramatically reduce memory usage and improve performance. Categorical dtype stores the unique values once and uses integer codes internally to represent each occurrence. This is especially beneficial for large datasets with low-cardinality columns. You can convert to categorical using astype('category') or pd.Categorical(). Categorical columns also enable category-specific operations like reordering categories, adding or removing categories, and ensuring only valid values are allowed. The memory savings can be substantial, for example, a column with 1 million rows but only 5 unique values might use 90% less memory as categorical.
# Check memory usage before conversion
print("Memory usage before:")
print(df['department'].memory_usage(deep=True)) # ~280 bytes
# Convert to categorical
df_converted['department'] = df_converted['department'].astype('category')
print("\nMemory usage after categorical conversion:")
print(df_converted['department'].memory_usage(deep=True)) # ~190 bytes
# View categorical information
print("\nCategorical info:")
print(df_converted['department'].cat.categories) # ['HR' 'IT' 'Sales']
print(df_converted['department'].cat.codes)
# 0 2 (Sales)
# 1 1 (IT)
# 2 2 (Sales)
# 3 0 (HR)
# Create ordered categorical (useful for rankings)
df_converted['education'] = pd.Categorical(
['Bachelor', 'Master', 'PhD', 'Bachelor'],
categories=['High School', 'Bachelor', 'Master', 'PhD'],
ordered=True
)
print("\nOrdered categorical:")
print(df_converted['education'])
# 0 Bachelor
# 1 Master
# 2 PhD
# 3 Bachelor
# Comparisons work with ordered categories
print("\nIs education >= Bachelor?")
print(df_converted['education'] >= 'Bachelor')
# 0 True
# 1 True
# 2 True
# 3 True
# Final data types overview
print("\nFinal data types:")
print(df_converted.dtypes)
# customer_id int64
# age float64
# salary int64
# join_date datetime64[ns]
# is_active bool
# department category
# year int32
# month int32
# day_of_week object
# education category
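To see the memory savings described above in practice, here is a quick sketch comparing a low-cardinality object column against its categorical equivalent (exact byte counts vary by platform and pandas version):

```python
import numpy as np
import pandas as pd

# 1 million rows drawn from just 5 unique department names
departments = pd.Series(
    np.random.choice(['HR', 'IT', 'Sales', 'Legal', 'Ops'], size=1_000_000)
)

# Compare deep memory usage before and after converting to categorical
as_object = departments.memory_usage(deep=True)
as_category = departments.astype('category').memory_usage(deep=True)

print(f"object:   {as_object:,} bytes")
print(f"category: {as_category:,} bytes")
print(f"savings:  {1 - as_category / as_object:.0%}")
```

The categorical version stores the 5 unique strings once plus a small integer code per row, which is where the savings come from.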
Practice Questions
Task: Convert the price column from strings to floats.
Given:
data = {
'product': ['Apple', 'Banana', 'Cherry'],
'price': ['1.50', '0.75', '2.25']
}
df = pd.DataFrame(data)
Expected output:
product price
0 Apple 1.50
1 Banana 0.75
2 Cherry 2.25
Type: float64
Solution:
import pandas as pd
data = {
'product': ['Apple', 'Banana', 'Cherry'],
'price': ['1.50', '0.75', '2.25']
}
df = pd.DataFrame(data)
# Convert to float
df['price'] = df['price'].astype(float)
print(df)
print("Type:", df['price'].dtype)
Task: Convert the score column to numeric, handling invalid values by converting them to NaN.
Given:
data = {
'student': ['Alice', 'Bob', 'Charlie', 'David'],
'score': ['85', '92', 'absent', '78']
}
df = pd.DataFrame(data)
Expected output:
student score
0 Alice 85.0
1 Bob 92.0
2 Charlie NaN
3 David 78.0
Solution:
import pandas as pd
data = {
'student': ['Alice', 'Bob', 'Charlie', 'David'],
'score': ['85', '92', 'absent', '78']
}
df = pd.DataFrame(data)
# Convert with coercion
df['score'] = pd.to_numeric(df['score'], errors='coerce')
print(df)
Task: Convert order_id to int, order_date to datetime, amount to float, and status to category.
Given:
data = {
'order_id': ['1001', '1002', '1003'],
'order_date': ['2024-01-15', '2024-01-16', '2024-01-17'],
'amount': ['99.99', '149.50', '75.25'],
'status': ['Pending', 'Shipped', 'Pending']
}
df = pd.DataFrame(data)
Expected output: All columns with appropriate types
Solution:
import pandas as pd
data = {
'order_id': ['1001', '1002', '1003'],
'order_date': ['2024-01-15', '2024-01-16', '2024-01-17'],
'amount': ['99.99', '149.50', '75.25'],
'status': ['Pending', 'Shipped', 'Pending']
}
df = pd.DataFrame(data)
# Convert multiple columns
df['order_id'] = df['order_id'].astype(int)
df['order_date'] = pd.to_datetime(df['order_date'])
df['amount'] = df['amount'].astype(float)
df['status'] = df['status'].astype('category')
print(df)
print("\nData types:")
print(df.dtypes)
String Operations & Regex
Text data is notoriously messy and requires specialized cleaning techniques. String columns often contain extra whitespace, inconsistent capitalization, special characters, and embedded patterns that need extraction. Pandas provides the .str accessor which gives you access to dozens of string methods similar to Python's built-in string operations, but optimized for vectorized operations on entire columns. Additionally, regex (regular expressions) support enables powerful pattern matching and extraction for complex text processing tasks. Mastering string operations is essential for cleaning names, emails, phone numbers, addresses, and any other text data in your datasets.
Basic String Methods
The .str accessor provides methods for common string operations like changing case, removing whitespace, and checking string content. Use str.lower(), str.upper(), and str.title() to standardize capitalization across your data. The str.strip(), str.lstrip(), and str.rstrip() methods remove leading and trailing whitespace, which is crucial for cleaning user input data. You can check string properties with str.startswith(), str.endswith(), and str.contains(). The str.replace() method substitutes text patterns, while str.len() returns string lengths. These operations work element-wise on entire Series, making them much faster than looping through individual values. Remember that str methods return new Series and don't modify the original data unless you assign the result back.
import pandas as pd
# Create sample data with messy strings
data = {
'name': [' Alice Smith ', 'BOB JONES', 'charlie brown', ' David Lee'],
'email': ['ALICE@EMAIL.COM', 'bob@email.com ', 'Charlie@Email.COM', 'david@email.com'],
'phone': ['(555) 123-4567', '555-234-5678', '555.345.6789', '5554567890']
}
df = pd.DataFrame(data)
print("Original messy data:")
print(df)
# name email phone
# 0 Alice Smith ALICE@EMAIL.COM (555) 123-4567
# 1 BOB JONES bob@email.com 555-234-5678
# 2 charlie brown Charlie@Email.COM 555.345.6789
# 3 David Lee david@email.com 5554567890
# Clean whitespace
df['name'] = df['name'].str.strip()
df['email'] = df['email'].str.strip()
print("\nAfter removing whitespace:")
print(df['name'])
# 0 Alice Smith
# 1 BOB JONES
# 2 charlie brown
# 3 David Lee
# Standardize capitalization
df['name'] = df['name'].str.title() # Title case
df['email'] = df['email'].str.lower() # Lowercase
print("\nStandardized case:")
print(df[['name', 'email']])
# name email
# 0 Alice Smith alice@email.com
# 1 Bob Jones bob@email.com
# 2 Charlie Brown charlie@email.com
# 3 David Lee david@email.com
# Check string properties
print("\nNames starting with 'A':")
print(df['name'].str.startswith('A'))
# 0 True
# 1 False
# 2 False
# 3 False
print("\nEmails containing 'alice':")
print(df['email'].str.contains('alice'))
# 0 True
# 1 False
# 2 False
# 3 False
# Get string lengths
print("\nPhone number lengths:")
print(df['phone'].str.len())
# 0 14 (has parentheses and hyphens)
# 1 12 (has hyphens)
# 2 12 (has dots)
# 3 10 (digits only)
String Replacement and Cleaning
Real-world text data often contains unwanted characters, inconsistent formatting, or values that need standardization. The str.replace() method lets you substitute patterns with new values, supporting both literal strings and regex patterns. You can remove characters by replacing them with empty strings. For multiple replacements, chain multiple replace() calls or use a regex pattern with alternation. The str.removeprefix() and str.removesuffix() methods (available on the .str accessor since pandas 1.4) provide clean ways to remove known prefixes or suffixes. When dealing with special characters or punctuation, you might need to escape them in regex patterns or use string methods strategically. These techniques are essential for cleaning phone numbers, product codes, currency values, and other formatted text.
# Clean phone numbers by removing non-digit characters
df['phone_clean'] = df['phone'].str.replace(r'[^0-9]', '', regex=True)
print("Cleaned phone numbers:")
print(df[['phone', 'phone_clean']])
# phone phone_clean
# 0 (555) 123-4567 5551234567
# 1 555-234-5678 5552345678
# 2 555.345.6789 5553456789
# 3 5554567890 5554567890
# Replace multiple characters
text_data = pd.Series(['$100.50', '$2,450.75', '$999.99'])
cleaned = text_data.str.replace('$', '', regex=False)
cleaned = cleaned.str.replace(',', '', regex=False)
print("\nCleaned currency:")
print(cleaned.astype(float))
# 0 100.50
# 1 2450.75
# 2 999.99
# Remove prefixes/suffixes
products = pd.Series(['PRD-001', 'PRD-002', 'PRD-003'])
product_ids = products.str.replace('PRD-', '', regex=False)
print("\nProduct IDs without prefix:")
print(product_ids)
# 0 001
# 1 002
# 2 003
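The str.removeprefix() method mentioned above is safer than replace() for this job, because it only touches the start of the string and leaves values without the prefix untouched:

```python
import pandas as pd

products = pd.Series(['PRD-001', 'PRD-002', 'X-003'])

# Strips the prefix only where it is actually present (pandas >= 1.4)
print(products.str.removeprefix('PRD-'))
# 0 001
# 1 002
# 2 X-003
```

By contrast, str.replace('PRD-', '') would also remove the substring if it appeared in the middle of a value.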
# Replace with mapping (useful for standardization)
status_data = pd.Series(['active', 'Active', 'ACTIVE', 'inactive', 'Inactive'])
status_mapping = {
'active': 'Active',
'Active': 'Active',
'ACTIVE': 'Active',
'inactive': 'Inactive',
'Inactive': 'Inactive'
}
standardized = status_data.replace(status_mapping)
print("\nStandardized status:")
print(standardized)
# 0 Active
# 1 Active
# 2 Active
# 3 Inactive
# 4 Inactive
Splitting and Extracting Strings
Often you need to split text into multiple columns or extract specific parts of strings. The str.split() method divides strings based on a delimiter, returning a Series of lists by default. Use expand=True to create new columns from split results. The str.extract() method uses regex groups to extract specific patterns into new columns, which is perfect for parsing structured text like dates, IDs, or codes. You can use str.slice() or bracket notation to extract substrings by position. For more complex extractions, str.extractall() captures all matches of a pattern, not just the first. These techniques are invaluable for breaking apart full names into first and last names, extracting domain names from emails, or parsing product codes into components.
# Split full names into first and last names
names = pd.Series(['Alice Smith', 'Bob Jones', 'Charlie Brown'])
split_names = names.str.split(' ', expand=True)
split_names.columns = ['first_name', 'last_name']
print("Split names:")
print(split_names)
# first_name last_name
# 0 Alice Smith
# 1 Bob Jones
# 2 Charlie Brown
# Extract domain from email
emails = pd.Series(['alice@gmail.com', 'bob@company.co.uk', 'charlie@test.org'])
domains = emails.str.extract(r'@(.+)', expand=False)
print("\nExtracted domains:")
print(domains)
# 0 gmail.com
# 1 company.co.uk
# 2 test.org
# Extract multiple groups with regex
product_codes = pd.Series(['ABC-2023-001', 'XYZ-2024-042', 'ABC-2023-015'])
extracted = product_codes.str.extract(r'([A-Z]+)-(\d{4})-(\d+)')
extracted.columns = ['category', 'year', 'number']
print("\nExtracted product code parts:")
print(extracted)
# category year number
# 0 ABC 2023 001
# 1 XYZ 2024 042
# 2 ABC 2023 015
# Slice strings by position
ids = pd.Series(['USER001', 'USER002', 'USER003'])
numeric_part = ids.str[4:] # From position 4 to end
print("\nNumeric part of IDs:")
print(numeric_part)
# 0 001
# 1 002
# 2 003
# Extract all matches (not just first)
text = pd.Series(['Prices: $10, $20, $30', 'Cost: $5'])
all_prices = text.str.extractall(r'\$(\d+)')
print("\nAll extracted prices:")
print(all_prices)
# 0
# match
# 0 0 10
# 1 20
# 2 30
# 1 0 5
Regular Expressions (Regex) for Pattern Matching
Regular expressions provide a powerful language for describing text patterns. Common regex patterns include \d for digits, \w for word characters, \s for whitespace, . for any character, and quantifiers like + (one or more), * (zero or more), and {n,m} (between n and m). Character sets like [0-9] or [A-Za-z] match specific ranges. The ^ and $ anchors match the start and end of strings. Parentheses create capture groups for extraction. You can use regex with str.contains() for filtering, str.replace() for substitution, and str.extract() for parsing. Regex is essential for validating email formats, extracting phone numbers, parsing dates, and finding patterns in unstructured text. While regex syntax can be complex, mastering even basic patterns dramatically expands your text processing capabilities.
# Validate email format with regex
emails = pd.Series([
'alice@email.com',
'invalid.email',
'bob@test.co.uk',
'@missing.com',
'charlie@domain.org'
])
valid_emails = emails.str.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')
print("Valid email format:")
print(pd.DataFrame({'email': emails, 'valid': valid_emails}))
# email valid
# 0 alice@email.com True
# 1 invalid.email False
# 2 bob@test.co.uk True
# 3 @missing.com False
# 4 charlie@domain.org True
# Extract dates in various formats
text_with_dates = pd.Series([
'Meeting on 2024-01-15',
'Due: 01/20/2024',
'Event 2024.03.10',
'Deadline 15-Jan-2024'
])
# Match YYYY-MM-DD or YYYY.MM.DD
dates = text_with_dates.str.extract(r'(\d{4}[-./]\d{2}[-./]\d{2})')
print("\nExtracted dates:")
print(dates)
# 0
# 0 2024-01-15
# 1 NaN
# 2 2024.03.10
# 3 NaN
# Find phone numbers in text
text_with_phones = pd.Series([
'Call me at 555-123-4567',
'Office: (555) 234-5678',
'Mobile 555.345.6789'
])
phones = text_with_phones.str.extract(r'(\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4})')
print("\nExtracted phone numbers:")
print(phones)
# 0
# 0 555-123-4567
# 1 (555) 234-5678
# 2 555.345.6789
# Replace using regex groups (format phone numbers)
messy_phones = pd.Series(['5551234567', '555-234-5678', '(555)3456789'])
formatted = messy_phones.str.replace(
r'^\(?(\d{3})\)?[-.\s]?(\d{3})[-.\s]?(\d{4})$',
r'(\1) \2-\3',
regex=True
)
print("\nFormatted phone numbers:")
print(formatted)
# 0 (555) 123-4567
# 1 (555) 234-5678
# 2 (555) 345-6789
# Filter rows using regex patterns
products = pd.DataFrame({
'product': ['Laptop-PRO', 'Mouse-Basic', 'Laptop-Standard', 'Keyboard-PRO']
})
pro_products = products[products['product'].str.contains(r'-PRO$', regex=True)]
print("\nPRO products:")
print(pro_products)
# product
# 0 Laptop-PRO
# 3 Keyboard-PRO
Advanced String Techniques
Beyond basic operations, Pandas offers advanced string methods for complex scenarios. Use str.cat() to concatenate strings across Series or join elements within Series. The str.pad() and str.zfill() methods add padding to create fixed-width strings, useful for IDs or codes. You can normalize text with str.normalize() for Unicode handling. The str.translate() method provides efficient character-level replacements using translation tables. For performance with large datasets, consider pre-compiling regex patterns or using string methods instead of regex when possible. You can also combine string operations with other Pandas functionality like groupby to aggregate text, or use string methods in query conditions for powerful filtering. These advanced techniques help you handle edge cases and optimize text processing in production data pipelines.
# Concatenate strings from multiple columns
df_concat = pd.DataFrame({
'first': ['Alice', 'Bob'],
'last': ['Smith', 'Jones']
})
df_concat['full_name'] = df_concat['first'].str.cat(df_concat['last'], sep=' ')
print("Concatenated names:")
print(df_concat)
# first last full_name
# 0 Alice Smith Alice Smith
# 1 Bob Jones Bob Jones
# Pad strings to fixed width
order_ids = pd.Series([1, 42, 123])
padded = order_ids.astype(str).str.zfill(5) # Zero-fill to 5 characters
print("\nPadded order IDs:")
print(padded)
# 0 00001
# 1 00042
# 2 00123
# Clean and standardize text with multiple operations
messy_data = pd.Series([' Product-A ', ' PRODUCT-B', 'product-c '])
cleaned = (messy_data
    .str.strip()                        # Remove whitespace
    .str.lower()                        # Lowercase
    .str.replace('-', ' ', regex=False) # Replace hyphens with spaces
    .str.title()                        # Title case
)
print("\nCleaned product names:")
print(cleaned)
# 0 Product A
# 1 Product B
# 2 Product C
# Combine string operations with groupby
sales = pd.DataFrame({
'product': ['laptop', 'Mouse', 'LAPTOP', 'Keyboard'],
'quantity': [1, 2, 1, 3]
})
# Normalize the key with a string method, then group on the cleaned values
product_summary = sales.groupby(sales['product'].str.title())['quantity'].sum()
print("\nProduct sales summary:")
print(product_summary)
# product
# Keyboard 3
# Laptop 2
# Mouse 2
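The section above also mentions using groupby to aggregate text; one common pattern joins each group's strings into a single value:

```python
import pandas as pd

orders = pd.DataFrame({
    'customer': ['Alice', 'Bob', 'Alice', 'Bob'],
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor']
})

# Join each customer's product names into one comma-separated string
purchases = orders.groupby('customer')['product'].agg(', '.join)
print(purchases)
# customer
# Alice Laptop, Keyboard
# Bob Mouse, Monitor
```

Any callable that accepts a sequence of strings works here; ', '.join is just the most common choice for readable summaries.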
# Use string methods in boolean indexing
data = pd.DataFrame({
'code': ['A001', 'B002', 'A003', 'C004']
})
a_codes = data[data['code'].str.startswith('A')]
print("\nCodes starting with 'A':")
print(a_codes)
# code
# 0 A001
# 2 A003
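For the pre-compiling tip mentioned above, pandas string methods accept an re.Pattern object, which avoids recompiling the pattern on every call when you reuse it across columns (the codes Series below is made up for illustration):

```python
import re
import pandas as pd

codes = pd.Series(['A001', 'B002', 'A003', 'C004'] * 1000)

# Compile the pattern once, then pass the compiled object to str.contains()
pattern = re.compile(r'^A\d+')
mask = codes.str.contains(pattern)
print(mask.sum())  # 2000
```

Note that when you pass a compiled pattern, you cannot also set the case or flags arguments; bake those into the pattern itself (e.g. re.IGNORECASE).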
Regular Expressions (Regex)
Regular expressions are sequences of characters that define search patterns for text matching and manipulation. They provide a concise and flexible means for identifying strings of text, such as particular characters, words, or patterns. In Pandas, regex is integrated into string methods through the regex=True parameter.
Why it matters: Regex enables powerful pattern matching that would require hundreds of lines of code using basic string methods. It's essential for validating data formats (emails, phone numbers), extracting structured information from unstructured text, and performing complex text transformations.
Practice Questions
Task: Remove leading/trailing whitespace and convert all names to title case.
Given:
names = pd.Series([' alice smith ', 'BOB JONES', ' charlie BROWN'])
df = pd.DataFrame({'name': names})
Expected output:
name
0 Alice Smith
1 Bob Jones
2 Charlie Brown
Solution:
import pandas as pd
names = pd.Series([' alice smith ', 'BOB JONES', ' charlie BROWN'])
df = pd.DataFrame({'name': names})
# Clean and standardize
df['name'] = df['name'].str.strip().str.title()
print(df)
Task: Extract the domain name (everything after @) from each email address.
Given:
emails = pd.Series(['alice@gmail.com', 'bob@company.co.uk', 'charlie@test.org'])
df = pd.DataFrame({'email': emails})
Expected output:
email domain
0 alice@gmail.com gmail.com
1 bob@company.co.uk company.co.uk
2 charlie@test.org test.org
Solution:
import pandas as pd
emails = pd.Series(['alice@gmail.com', 'bob@company.co.uk', 'charlie@test.org'])
df = pd.DataFrame({'email': emails})
# Extract domain using regex
df['domain'] = df['email'].str.extract(r'@(.+)', expand=False)
print(df)
Task: Take messy phone numbers and format them consistently as (XXX) XXX-XXXX.
Given:
phones = pd.Series(['5551234567', '555-234-5678', '(555) 345-6789', '555.456.7890'])
df = pd.DataFrame({'phone': phones})
Expected output:
phone formatted_phone
0 5551234567 (555) 123-4567
1 555-234-5678 (555) 234-5678
2 (555) 345-6789 (555) 345-6789
3 555.456.7890 (555) 456-7890
Solution:
import pandas as pd
phones = pd.Series(['5551234567', '555-234-5678', '(555) 345-6789', '555.456.7890'])
df = pd.DataFrame({'phone': phones})
# Remove all non-digits first, then reformat
df['formatted_phone'] = df['phone'].str.replace(r'[^0-9]', '', regex=True)
df['formatted_phone'] = df['formatted_phone'].str.replace(
r'^(\d{3})(\d{3})(\d{4})$',
r'(\1) \2-\3',
regex=True
)
print(df)
Key Takeaways
Data Cleaning is Critical
Real-world data is rarely clean. Handling missing values, duplicates, type inconsistencies, and text problems is essential before analysis. Clean data leads to accurate insights and reliable models
Missing Value Strategies
Use dropna() to remove incomplete records or fillna() to impute values. Choose strategies based on data context: mean/median for numeric, mode for categorical, or forward/backward fill for time series
Duplicate Detection
Duplicates skew analysis results. Use duplicated() to identify them and drop_duplicates() to remove them. The subset parameter lets you check uniqueness on specific key columns rather than entire rows
Type Conversions Matter
Correct data types enable proper operations and save memory. Use astype() for simple conversions, to_numeric() with errors='coerce' for messy numeric data, to_datetime() for dates, and categorical for repeated values
String Operations Power
The .str accessor provides vectorized string methods for efficient text processing. Use strip() to remove whitespace, lower()/upper() for case standardization, replace() for substitution, and split() for parsing
Regex for Pattern Matching
Regular expressions enable powerful pattern matching for validation, extraction, and transformation. Use str.contains() for filtering, str.extract() for parsing patterns, and str.replace() with regex=True for complex substitutions