Apply your Pandas skills to clean messy datasets, handle missing values, merge multiple data sources, and create comprehensive analytical reports for DataCorp Analytics Firm.
In this assignment, you will work as a Junior Data Analyst at DataCorp Analytics Firm. The company has received messy, inconsistent datasets from a client and needs you to clean, transform, merge, and analyze the data to produce actionable business insights.
"Welcome back to the team! We have a critical project from MegaCorp Industries. They've sent us employee data, expense reports, department information, and performance reviews—but the data is a complete mess. Different formats, missing values, inconsistent naming, duplicate records... you name it.
Your task is to wrangle this data into shape and produce three key reports:
The client needs this by end of week. I know the data is messy, but I trust your Pandas skills. Good luck!"
— Sarah Chen, Lead Data Scientist
Download all four datasets below. Each file has intentional data quality issues that you must identify and fix.
- Employee master data including personal info, department, salary, hire dates, and performance scores.
- Expense transaction records with dates, amounts, categories, vendors, and approval status.
- Department hierarchy with budgets, locations, team information, and active projects.
- Performance review records with ratings across multiple dimensions and reviewer comments.
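Before cleaning anything, load each file and profile it so you know what you are dealing with. A minimal sketch of a reusable profiling function is below; the inline CSV stands in for one of the client files (the real filenames are not given here), and in practice you would call `pd.read_csv("...", encoding=...)` or `pd.read_json(...)` instead:

```python
import io
import pandas as pd

# Hypothetical stand-in for one of the client CSVs; in the assignment you
# would read the actual file, e.g. pd.read_csv(path, encoding="utf-8").
raw = io.StringIO("emp_id,name,salary\n1,Ana,50000\n2,Bo,\n2,Bo,\n")
df = pd.read_csv(raw)

def profile(df: pd.DataFrame) -> dict:
    """Quick data-quality summary: shape, missing values, duplicates, dtypes."""
    return {
        "shape": df.shape,
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
    }

print(profile(df))
```

Running the same function on every dataset gives you a consistent before/after picture of each cleaning step.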
Your solution must use the following Pandas operations:

- `pd.read_csv()` for CSV files, with correct encoding detection
- `pd.read_json()` for the JSON file, handling the nested structure appropriately
- `pd.merge()` with appropriate join types and the `validate` parameter
- `pd.concat()` where combining DataFrames vertically
- `agg()` with named aggregations
- `transform()` for at least one group-level calculation
- `apply()` with a custom function on at least one column

Grading rubric:

| Component | Points | Criteria |
|---|---|---|
| Data Loading & Profiling | 15 | All datasets load correctly, profiling function complete, validation checks comprehensive |
| Data Cleaning | 25 | All cleaning functions work correctly, edge cases handled, data quality significantly improved |
| Data Transformation | 25 | All merges correct, aggregations accurate, advanced Pandas operations used effectively |
| Validation & Testing | 15 | Verification cells demonstrate all functions work, assertions pass |
| Output Files & Report | 20 | All CSV files correct, report complete with insights, recommendations actionable |
| **Total** | **100** | |
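The required merge and aggregation operations can be sketched together on toy data. The column names below (`emp_id`, `dept`, `salary`, `amount`) are assumptions for illustration; the real names depend on the client files:

```python
import pandas as pd

# Hypothetical minimal versions of the cleaned tables.
employees = pd.DataFrame({
    "emp_id": [1, 2, 3],
    "dept": ["Sales", "Sales", "IT"],
    "salary": [50000, 60000, 70000],
})
expenses = pd.DataFrame({
    "emp_id": [1, 1, 2, 3],
    "amount": [100.0, 250.0, 80.0, 40.0],
})

# merge: one employee row may match many expense rows, so validate="1:m"
# raises immediately if duplicate employee IDs slipped through cleaning.
merged = employees.merge(expenses, on="emp_id", how="left", validate="1:m")

# agg with named aggregations: one clearly labeled column per statistic.
dept_stats = merged.groupby("dept").agg(
    total_spend=("amount", "sum"),
    avg_salary=("salary", "mean"),
)

# transform: broadcast a group-level value back onto every row.
merged["dept_avg_salary"] = merged.groupby("dept")["salary"].transform("mean")

# apply with a custom function on a single column.
merged["expense_band"] = merged["amount"].apply(
    lambda a: "high" if a >= 100 else "low"
)

print(dept_stats)
```

`validate` is worth the extra keystrokes on messy data: it turns a silent fan-out from duplicate keys into a loud `MergeError` at the merge site.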
Create a public GitHub repository with the exact name shown below, add all required files, and submit through the submission portal.
github.com/<your-username>/datacorp-pandas-wrangling
All files are required. Submission will fail if any file is missing.
- Use `df.info()` and `df.describe()` frequently
- Check `df.isna().sum()` after each cleaning step
- Use `assert` statements to validate assumptions
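The last two tips combine naturally into a check-after-every-step pattern. A small sketch, assuming a hypothetical `salary` column with thousands separators and sentinel values:

```python
import pandas as pd

# Hypothetical messy column: thousands separators, "N/A" sentinels, blanks.
df = pd.DataFrame({"salary": ["50,000", "N/A", "60000", ""]})

# Cleaning step: strip separators, coerce anything non-numeric to NaN.
df["salary"] = pd.to_numeric(
    df["salary"].str.replace(",", "", regex=False), errors="coerce"
)

# Inspect the damage after the step...
print(df["salary"].isna().sum())

# ...and assert the assumptions the next step relies on.
assert df["salary"].isna().sum() == 2       # only the two known-bad rows
assert (df["salary"].dropna() > 0).all()    # remaining salaries are positive
```

If a later cleaning change breaks an assumption, the assertion fails at that cell rather than corrupting the downstream reports.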