Project Overview
The Iris dataset is the "Hello World" of machine learning classification. This project will help you master the fundamentals of classification by working with 150 samples of iris flowers across 3 species (Setosa, Versicolor, Virginica) using 4 features (sepal length, sepal width, petal length, petal width). Though simple, this project teaches core concepts that scale to any classification problem.
Project workflow:
- Explore: visualize features, distributions, and class separability
- Preprocess: scale features and split the data properly
- Train: build and compare multiple classifiers
- Evaluate: analyze results with metrics and confusion matrices
Learning Objectives
Classification Fundamentals
- Understand multi-class classification problems
- Learn train/test splitting strategies
- Apply feature scaling for different algorithms
- Interpret classification metrics correctly
- Visualize decision boundaries
Practical Skills
- Create pair plots and correlation heatmaps
- Implement Logistic Regression, KNN, SVM, Decision Trees
- Use cross-validation for robust evaluation
- Generate and interpret confusion matrices
- Write clean, documented Jupyter notebooks
Business Scenario
FloraID Botanics
You have been hired as a Junior Data Scientist at FloraID Botanics, a botanical research company that develops automated plant identification systems. Your manager has assigned you to work on the iris flower classification module:
"We're building an app that helps amateur botanists identify flowers from measurements. Start with the classic iris dataset - it's small but perfect for learning. Build me a classifier that can accurately identify the three iris species from petal and sepal measurements. Show me which algorithm works best and why."
Questions to Answer
- Which species does a flower belong to given its measurements?
- How accurate is the classification model?
- Which algorithm performs best on this dataset?
- Are all species equally easy to classify?
- Which features are most important for classification?
- Can we separate species using just 2 features?
- Are there overlapping regions between species?
- How do the feature distributions differ by species?
The Dataset
The Iris dataset is one of the most famous datasets in machine learning, introduced by statistician Ronald Fisher in 1936. Download the CSV file or load it directly from scikit-learn:
Dataset Download
Download the Iris dataset from Kaggle (based on UCI ML Repository) or use sklearn's built-in loader.
Original Data Source
This project uses the classic Iris dataset, available from Kaggle and the UCI ML Repository. It contains measurements from 150 iris flowers across 3 species (Setosa, Versicolor, Virginica), with 50 samples per class.
Features Overview
| Feature | Type | Unit | Range | Description |
|---|---|---|---|---|
| sepal_length | float | cm | 4.3 - 7.9 | Length of the sepal |
| sepal_width | float | cm | 2.0 - 4.4 | Width of the sepal |
| petal_length | float | cm | 1.0 - 6.9 | Length of the petal |
| petal_width | float | cm | 0.1 - 2.5 | Width of the petal |
| Class | Label | Count | Characteristics |
|---|---|---|---|
| 0 | Iris-setosa | 50 | Small petals, easily separable from others |
| 1 | Iris-versicolor | 50 | Medium-sized, some overlap with virginica |
| 2 | Iris-virginica | 50 | Large petals, some overlap with versicolor |
Sample Data Preview
| sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 7.0 | 3.2 | 4.7 | 1.4 | versicolor |
| 6.3 | 3.3 | 6.0 | 2.5 | virginica |
Project Requirements
Create a single well-organized Jupyter notebook that covers all the following components with clear documentation and visualizations.
Data Loading & EDA
- Load the iris dataset (CSV or sklearn)
- Display dataset shape, dtypes, and basic statistics
- Check for missing values (should be none)
- Visualize class distribution with bar chart
- Create pair plot colored by species
- Generate correlation heatmap
- Box plots of features by species
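The loading and sanity-check steps above can be sketched as follows, using sklearn's built-in loader (the pair plot, heatmap, and box plots are then drawn from this DataFrame in the notebook):

```python
# Minimal EDA sketch: load iris into a DataFrame, verify class balance,
# check for missing values, and compute feature correlations.
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.copy()  # four feature columns plus 'target'
df['species'] = df['target'].map(dict(enumerate(iris.target_names)))

print(df.shape)                      # (150, 6)
print(df.isna().sum().sum())         # 0 -> no missing values
print(df['species'].value_counts())  # 50 samples per class

# Correlation matrix: the input for the correlation heatmap
corr = df[iris.feature_names].corr()
print(corr.round(2))
```

Note how strongly petal length and petal width correlate; that correlation is one of the EDA insights worth commenting on.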
Data Preprocessing
- Separate features (X) and target (y)
- Split data into train/test sets (80/20 or 70/30)
- Apply StandardScaler to features
- Document your preprocessing choices
Model Training
Train at least 4 different classifiers:
- Logistic Regression
- K-Nearest Neighbors (KNN)
- Support Vector Machine (SVM)
- Decision Tree Classifier
- (Optional: Random Forest, Naive Bayes)
Model Evaluation
- Calculate accuracy, precision, recall, F1-score for each model
- Generate confusion matrix for each model
- Perform 5-fold cross-validation
- Create comparison table of all models
- Visualize decision boundaries (using 2 best features)
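A sketch of the evaluation steps, using Logistic Regression as a stand-in for whichever model is being scored (the same calls apply to each classifier):

```python
# Evaluation sketch: hold-out metrics, confusion matrix, and 5-fold CV.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
y_pred = clf.predict(X_test)

cm = confusion_matrix(y_test, y_pred)          # 3x3: rows = true, cols = predicted
print(cm)
print(classification_report(y_test, y_pred))   # per-class precision/recall/F1

# 5-fold cross-validation on the full dataset
scores = cross_val_score(LogisticRegression(max_iter=200), X, y, cv=5)
print(f'CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}')
```

The confusion matrix `cm` is what gets rendered as a heatmap, and the per-model CV scores feed the comparison table.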
Conclusion
- Summarize which model performed best and why
- Discuss which features are most important
- Note any challenges or interesting findings
- Suggest potential improvements
Model Specifications
Train and compare the following classifiers. Use default parameters first, then optionally tune hyperparameters for your best performer.
- Logistic Regression: Multi-class with 'ovr' or 'multinomial'
- SVM (Linear): kernel='linear', C=1.0
- KNN: Try k=3, 5, 7
- Decision Tree: max_depth=3 to 5
- SVM (RBF): kernel='rbf', gamma='scale'
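Trying the suggested KNN values can be done with a scaler-plus-classifier pipeline, which keeps the scaling fitted inside each cross-validation fold (a sketch; the same pattern works for the other hyperparameter ranges):

```python
# Compare k = 3, 5, 7 for KNN with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

results = {}
for k in (3, 5, 7):
    # Pipeline so the scaler is fit only on each fold's training split
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    results[k] = cross_val_score(pipe, X, y, cv=5).mean()
    print(f'k={k}: mean CV accuracy = {results[k]:.3f}')
```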
Evaluation Metrics
- Accuracy: overall fraction of correct predictions
- Precision: of the samples predicted as a class, the fraction that truly belong to it
- Recall: of the samples that truly belong to a class, the fraction the model found
- F1-Score: harmonic mean of precision and recall
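With three classes, precision, recall, and F1 are computed per class and then averaged. A toy example with macro averaging (the label vectors below are made up for illustration):

```python
# Toy multi-class example: metrics are per-class, then macro-averaged.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]  # one class-1 sample mislabeled as class 2

acc = accuracy_score(y_true, y_pred)                     # 5/6 correct
prec = precision_score(y_true, y_pred, average='macro')  # mean per-class precision
rec = recall_score(y_true, y_pred, average='macro')      # mean per-class recall
f1 = f1_score(y_true, y_pred, average='macro')           # mean per-class F1

print(f'accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}')
```

Macro averaging weights each class equally, which is appropriate here since the iris classes are perfectly balanced.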
Sample Code
# Load and prepare data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {
    'Logistic Regression': LogisticRegression(max_iter=200),
    'KNN (k=5)': KNeighborsClassifier(n_neighbors=5),
    'SVM (RBF)': SVC(kernel='rbf', gamma='scale'),
    'Decision Tree': DecisionTreeClassifier(max_depth=4, random_state=42),
}

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    accuracy = model.score(X_test_scaled, y_test)
    print(f'{name}: {accuracy:.4f}')
Required Visualizations
Create at least 10 visualizations in your notebook. Each should have proper titles, labels, and brief interpretive commentary.
Exploratory Visualizations
- Class distribution bar chart
- Pair plot (seaborn pairplot)
- Correlation heatmap
- Box plots by species
- Violin plots or histograms
Model Evaluation Plots
- Confusion matrices (heatmaps)
- Model accuracy comparison bar chart
- Cross-validation score boxplots
- Decision boundary plot (2D)
- Classification report visualization
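The decision boundary plot is usually drawn from the two petal features, which separate the classes best. A sketch using an RBF SVM (the Agg backend keeps it headless-safe; the output file name is illustrative):

```python
# Decision boundary sketch on the two petal features.
import matplotlib
matplotlib.use('Agg')  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X = iris.data[:, 2:4]  # petal length, petal width
y = iris.target

clf = SVC(kernel='rbf', gamma='scale').fit(X, y)

# Predict on a dense grid covering the feature space
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 200),
    np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 200),
)
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)                 # shaded class regions
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k')  # actual samples
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
plt.title('SVM (RBF) decision regions')
plt.savefig('decision_boundaries.png', dpi=150)
```

The same grid-prediction pattern works for any of the trained classifiers, so the plot can be repeated per model for comparison.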
Submission Requirements
Create a public GitHub repository with the exact name shown below:
Required Repository Name
iris-classification-ml
Required Project Structure
iris-classification-ml/
├── data/
│   └── iris.csv                    # Dataset (or note if using sklearn)
├── notebooks/
│   └── iris_classification.ipynb   # Main analysis notebook
├── visualizations/
│   ├── pair_plot.png
│   ├── confusion_matrices.png
│   ├── model_comparison.png
│   └── decision_boundaries.png
├── requirements.txt                # Python dependencies
└── README.md                       # Project documentation
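A starting point for requirements.txt (package names only; pin the exact versions you actually used):

```text
numpy
pandas
matplotlib
seaborn
scikit-learn
jupyter
```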
README.md Required Sections
- Project title, your name, date
- Project overview and objectives
- Dataset description
- Technologies used (Python, sklearn, etc.)
- Key findings from EDA
- Model comparison results
- Best model and accuracy
- How to run the notebook
Grading Rubric
Your project will be graded on the following criteria. Total: 200 points.
| Criteria | Points | Description |
|---|---|---|
| Exploratory Data Analysis | 40 | Comprehensive EDA with 5+ visualizations and insights |
| Data Preprocessing | 20 | Proper train/test split and feature scaling |
| Model Training | 40 | At least 4 different classifiers trained correctly |
| Model Evaluation | 40 | Metrics, confusion matrices, cross-validation |
| Visualizations | 30 | 10+ clear, labeled visualizations |
| Documentation | 30 | README, code comments, conclusions |
| Total | 200 | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.