Assignment Overview
In this assignment, you will build PyDataKit, a professional-grade data analysis toolkit that demonstrates mastery of Python's module system, standard library, and essential third-party packages.
Standard Library (9.1)
datetime, time, math, random, collections, itertools
Package Structure (9.2)
Modules, __init__.py, imports, virtual environments
Third-Party (9.3)
NumPy arrays, Pandas DataFrames, Requests HTTP
The Scenario
DataPulse Analytics
You've been hired by DataPulse Analytics, a data consulting firm that needs a reusable Python toolkit for their analysts. The toolkit should handle common tasks like fetching data from web APIs, processing timestamps across timezones, performing statistical computations, and generating data reports.
"We need a well-structured Python package that our analysts can easily install and use. It must demonstrate proper module organization, leverage Python's powerful standard library, and integrate seamlessly with the data science ecosystem."
Your Task
Create a Python package called pydatakit that provides utilities for time handling,
statistics, data processing, API interactions, NumPy operations, and Pandas DataFrame manipulation.
Required Package Structure
python-pydatakit/
├── pydatakit/
│ ├── __init__.py # Package initialization with version
│ ├── time_utils.py # datetime, time, timezone utilities
│ ├── math_utils.py # math, random, statistics functions
│ ├── data_structures.py # collections, itertools helpers
│ ├── api_client.py # HTTP client using requests
│ ├── numpy_ops.py # NumPy array operations
│ └── pandas_ops.py # Pandas DataFrame operations
├── tests/
│ ├── __init__.py
│ ├── test_time_utils.py
│ ├── test_math_utils.py
│ └── test_api_client.py
├── examples/
│ └── demo.py # Usage demonstration
├── main.py # Main entry point
├── requirements.txt # Dependencies
├── output.txt # Sample output
└── README.md # Documentation
Requirements
Your pydatakit package must implement ALL of the following modules and classes.
Each requirement is mandatory and will be tested individually.
Package Initialization
Create a proper Python package with __init__.py that exposes the public API:
# pydatakit/__init__.py
"""PyDataKit - A comprehensive data analysis toolkit."""
__version__ = "1.0.0"
__author__ = "Your Name"
# Import main classes/functions for easy access
from .time_utils import TimeUtils, DateRange
from .math_utils import Statistics, RandomGenerator
from .data_structures import DataProcessor
from .api_client import APIClient
from .numpy_ops import ArrayOperations
from .pandas_ops import DataFrameHelper
# Define what's available with "from pydatakit import *"
__all__ = [
    'TimeUtils', 'DateRange', 'Statistics', 'RandomGenerator',
    'DataProcessor', 'APIClient', 'ArrayOperations', 'DataFrameHelper',
]
TimeUtils Class (datetime & time)
Implement comprehensive datetime utilities:
- now(tz) - Get current datetime with optional timezone
- parse_date(date_string, fmt) - Parse string to datetime
- format_date(dt, fmt) - Format datetime to string
- days_between(start, end) - Calculate days between dates
- add_business_days(start, days) - Add business days
- timestamp_to_datetime(ts) - Convert Unix timestamp
- measure_execution(func) - Decorator for timing functions
Also create a DateRange class with iteration, weekdays, weekends, and monthly split.
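A minimal sketch of what the timing decorator and the DateRange class might look like. The method names come from the spec above; the iteration and weekdays logic shown here is one reasonable interpretation, not the required implementation:

```python
import time
from datetime import date, timedelta
from functools import wraps

def measure_execution(func):
    """Decorator that prints how long the wrapped function took."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.4f}s")
        return result
    return wrapper

class DateRange:
    """Iterable range of dates from start to end, inclusive."""
    def __init__(self, start: date, end: date):
        self.start, self.end = start, end

    def __iter__(self):
        current = self.start
        while current <= self.end:
            yield current
            current += timedelta(days=1)

    def weekdays(self):
        # Monday=0 .. Friday=4
        return [d for d in self if d.weekday() < 5]

r = DateRange(date(2024, 1, 1), date(2024, 1, 7))  # Mon-Sun
print(len(r.weekdays()))  # 5
```

Note that `__iter__` is a generator, so the range can be iterated repeatedly without storing every date in memory.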
Statistics & RandomGenerator Classes (math & random)
Implement math and random utilities:
- Statistics.mean/median/std_dev/percentile/correlation
- RandomGenerator.random_int/random_float/random_choice
- RandomGenerator.random_sample/shuffle/generate_normal
- RandomGenerator.generate_dataset(n, columns) - Generate test data
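To illustrate, mean and std_dev reduce to a few lines with the math module; the percentile shown here uses the nearest-rank convention, which is one common choice among several:

```python
import math
import random

class Statistics:
    @staticmethod
    def mean(data):
        return sum(data) / len(data)

    @staticmethod
    def std_dev(data):
        # Population standard deviation
        m = Statistics.mean(data)
        return math.sqrt(sum((x - m) ** 2 for x in data) / len(data))

    @staticmethod
    def percentile(data, p):
        # Nearest-rank percentile on the sorted data
        ordered = sorted(data)
        k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
        return ordered[k]

class RandomGenerator:
    @staticmethod
    def random_sample(items, k):
        return random.sample(items, k)

data = [10, 20, 30, 40, 50]
print(Statistics.mean(data))                 # 30.0
print(round(Statistics.std_dev(data), 2))    # 14.14
```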
DataProcessor Class (collections & itertools)
Implement data processing utilities:
- count_items(items) - Count occurrences using Counter
- most_common(items, n) - Get n most common items
- group_by(items, key) - Group list of dicts by key using defaultdict
- flatten(nested) - Flatten nested lists using chain
- window(items, size) - Sliding window using deque
- batch(items, size) - Split into batches
- unique_combinations(items, r) - Get combinations
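As a sketch of the intended collections/itertools usage (shown as free functions here; the class simply wraps them as methods):

```python
from collections import Counter, defaultdict, deque
from itertools import chain, islice

def count_items(items):
    return Counter(items)

def group_by(records, key):
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    return dict(groups)

def flatten(nested):
    return list(chain.from_iterable(nested))

def window(items, size):
    # Sliding window backed by a fixed-size deque
    it = iter(items)
    win = deque(islice(it, size), maxlen=size)
    if len(win) == size:
        yield tuple(win)
    for item in it:
        win.append(item)
        yield tuple(win)

print(count_items("abca").most_common(1))  # [('a', 2)]
print(flatten([[1, 2], [3]]))              # [1, 2, 3]
print(list(window([1, 2, 3, 4], 2)))       # [(1, 2), (2, 3), (3, 4)]
```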
APIClient Class (requests library)
Implement HTTP client using requests:
- get(endpoint, params, headers) - Make GET request
- post(endpoint, data, json_data, headers) - Make POST request
- fetch_with_retry(endpoint, max_retries, backoff) - Retry with exponential backoff
- download_file(url, filepath) - Download file
- get_request_stats() - Get request statistics
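The exponential-backoff pattern behind fetch_with_retry can be sketched independently of the network. In the real APIClient the attempt would be a requests.get call on the endpoint; here it is abstracted as a callable so the sketch is self-contained and testable:

```python
import time

def fetch_with_retry(fetch, max_retries=3, backoff=1.0):
    """Call fetch() until it succeeds, sleeping backoff * 2**attempt
    seconds between failures (exponential backoff). Re-raises the
    last exception once the retries are exhausted."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(backoff * (2 ** attempt))

# Simulate a flaky endpoint that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

print(fetch_with_retry(flaky, max_retries=5, backoff=0.01))  # ok
```

With backoff=1.0 the waits would be 1s, 2s, 4s, ... between attempts, which gives an overloaded server time to recover.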
ArrayOperations Class (NumPy)
Implement NumPy array operations:
- create_array/arange/linspace/zeros/ones/random_array
- reshape(arr, new_shape) - Reshape array
- statistics(arr) - Return dict with mean, std, min, max, median, sum
- normalize/standardize - Scale data
- dot_product/matrix_multiply - Linear algebra
- filter_by_condition(arr, condition) - Boolean filtering
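For instance, normalize and standardize are one-line vectorized expressions in NumPy. This is a sketch of the two scaling operations (the real class would wrap them as methods and handle the constant-array edge case):

```python
import numpy as np

def normalize(arr):
    """Min-max scale to the range [0, 1]."""
    lo, hi = arr.min(), arr.max()
    return (arr - lo) / (hi - lo)

def standardize(arr):
    """Zero mean, unit standard deviation (z-scores)."""
    return (arr - arr.mean()) / arr.std()

arr = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
print(normalize(arr).tolist())   # [0.0, 0.25, 0.5, 0.75, 1.0]
print(abs(standardize(arr).mean()) < 1e-12)  # True
```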
DataFrameHelper Class (Pandas)
Implement Pandas DataFrame operations:
- from_csv/from_json/to_csv/to_json - File I/O
- info/describe/head - Data exploration
- filter_rows(column, condition, value) - Filter by condition
- select_columns/add_column/drop_columns/rename_columns
- sort_by(columns, ascending) - Sort data
- group_aggregate(group_by, agg_dict) - Group and aggregate
- fill_missing/drop_duplicates/merge - Data cleaning
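group_aggregate, for example, is a thin wrapper over pandas groupby/agg. A sketch with an illustrative DataFrame (the column names here are made up for the example):

```python
import pandas as pd

def group_aggregate(df, group_by, agg_dict):
    """Group by one or more columns and apply the named aggregations."""
    return df.groupby(group_by).agg(agg_dict).reset_index()

df = pd.DataFrame({
    "team": ["a", "a", "b"],
    "score": [10, 20, 30],
})
result = group_aggregate(df, "team", {"score": "mean"})
print(result.to_dict("list"))  # {'team': ['a', 'b'], 'score': [15.0, 30.0]}
```

reset_index() turns the group keys back into an ordinary column, which keeps the helper's output a plain DataFrame rather than one with a group index.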
Requirements File
Create a proper requirements.txt that specifies minimum versions for each dependency:
# requirements.txt
numpy>=1.24.0
pandas>=2.0.0
requests>=2.31.0
Unit Tests
Create tests in the tests/ directory:
- test_time_utils.py - Test datetime functions
- test_math_utils.py - Test statistics and random
- test_api_client.py - Test HTTP client
Use Python's unittest module with at least 5 test cases per file.
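A test file might be shaped like this. The inline Statistics class is a stand-in for `from pydatakit import Statistics`; your real tests should import from the package:

```python
import unittest

# Stand-in for: from pydatakit import Statistics
class Statistics:
    @staticmethod
    def mean(data):
        return sum(data) / len(data)

class TestStatistics(unittest.TestCase):
    def test_mean_of_integers(self):
        self.assertEqual(Statistics.mean([10, 20, 30]), 20.0)

    def test_mean_single_value(self):
        self.assertEqual(Statistics.mean([7]), 7.0)

    def test_mean_raises_on_empty(self):
        with self.assertRaises(ZeroDivisionError):
            Statistics.mean([])

# Run the suite programmatically; `unittest.main()` under a
# __main__ guard works equally well when run as a script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestStatistics)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Note the edge-case tests (single value, empty input): graders typically probe boundaries, not just the happy path.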
Main Entry Point
Create main.py that demonstrates all modules:
#!/usr/bin/env python3
"""PyDataKit - Main demonstration script."""
from pydatakit import (
    TimeUtils, DateRange, Statistics, RandomGenerator,
    DataProcessor, APIClient, ArrayOperations, DataFrameHelper,
)

def main():
    print("=" * 60)
    print("PyDataKit - Data Analysis Toolkit Demo")
    print("=" * 60)

    # Demonstrate TimeUtils
    print("\n--- DateTime Utilities ---")
    now = TimeUtils.now()
    print(f"Current time: {TimeUtils.format_date(now)}")

    # Demonstrate Statistics
    print("\n--- Statistics ---")
    data = [10, 20, 30, 40, 50]
    print(f"Mean: {Statistics.mean(data)}")
    print(f"Std Dev: {Statistics.std_dev(data):.2f}")

    # Demonstrate NumPy operations
    print("\n--- NumPy Operations ---")
    arr = ArrayOperations.create_array([1, 2, 3, 4, 5])
    stats = ArrayOperations.statistics(arr)
    print(f"Array stats: {stats}")

    # Demonstrate Pandas operations
    print("\n--- Pandas Operations ---")
    df = DataFrameHelper({'name': ['Alice', 'Bob'], 'age': [25, 30]})
    print(df.head())

    print("\n" + "=" * 60)
    print("Demo completed!")
    print("=" * 60)

if __name__ == "__main__":
    main()
Submission
Create a public GitHub repository with the exact name shown below:
Required Repository Name
python-pydatakit
Required Files
python-pydatakit/
├── pydatakit/
│ ├── __init__.py # Package init with version and exports
│ ├── time_utils.py # TimeUtils and DateRange classes
│ ├── math_utils.py # Statistics and RandomGenerator classes
│ ├── data_structures.py # DataProcessor class
│ ├── api_client.py # APIClient class
│ ├── numpy_ops.py # ArrayOperations class
│ └── pandas_ops.py # DataFrameHelper class
├── tests/
│ ├── __init__.py
│ ├── test_time_utils.py # At least 5 test cases
│ ├── test_math_utils.py # At least 5 test cases
│ └── test_api_client.py # At least 5 test cases
├── examples/
│ └── demo.py # Usage examples
├── main.py # Main entry point
├── requirements.txt # Dependencies
├── output.txt # Sample output from running main.py
└── README.md # Documentation
README.md Must Include:
- Your full name and submission date
- Installation instructions (pip install -r requirements.txt)
- Usage examples for each module
- API documentation for main classes
- How to run tests
Do Include
- All 7 module files with complete classes
- Docstrings for every class and method
- Type hints for function parameters
- At least 15 unit tests total
- output.txt from running main.py
- Comprehensive README.md
Do Not Include
- __pycache__ folders
- Virtual environment (venv/)
- .pyc files
- IDE config files (.idea, .vscode)
- Code that doesn't run without errors
Run python main.py and save the output to output.txt before submitting!
Grading Rubric
Your assignment will be graded on the following criteria:
| Criteria | Points | Description |
|---|---|---|
| Package Structure | 20 | Proper __init__.py, exports, module organization |
| TimeUtils & DateRange | 20 | datetime, time module usage, timezone handling |
| Statistics & Random | 15 | math, random modules, statistical calculations |
| DataProcessor | 15 | collections, itertools usage, data transformations |
| NumPy Operations | 20 | Array creation, statistics, transformations |
| Pandas Operations | 25 | DataFrame creation, filtering, grouping, aggregation |
| API Client | 20 | requests library, error handling, retry logic |
| Unit Tests | 15 | Test coverage for core functionality |
| Documentation | 15 | README, docstrings, code comments |
| Code Quality | 10 | Clean code, type hints, best practices |
| Total | 175 | |
Bonus Points (up to 25)
- Create setup.py for pip-installable package
- Add data visualization methods using matplotlib
- Implement async API client with aiohttp
- Add CLI interface using argparse
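For the setup.py bonus, a minimal sketch is enough to make the package pip-installable. The metadata values here are placeholders to adapt:

```python
# setup.py - minimal sketch; name, version, and description are examples
from setuptools import setup, find_packages

setup(
    name="pydatakit",
    version="1.0.0",
    description="A data analysis toolkit",
    packages=find_packages(exclude=("tests",)),
    python_requires=">=3.9",
    install_requires=[
        "numpy>=1.24.0",
        "pandas>=2.0.0",
        "requests>=2.31.0",
    ],
)
```

With this in place, `pip install -e .` installs the package in editable mode for development.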
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.
What You Will Practice
Standard Library (9.1)
datetime, time, math, random, collections, itertools — Python's powerful built-in tools
Package Structure (9.2)
Creating proper Python packages with __init__.py, module exports, and virtual environments
Third-Party Libraries (9.3)
NumPy for numerical computing, Pandas for data manipulation, Requests for HTTP
Testing & Documentation
Writing unit tests with unittest, creating comprehensive documentation and docstrings