Project Overview
The Iris dataset is the "Hello World" of machine learning classification. This project will help you master the fundamentals of classification by working with 150 samples of iris flowers across 3 species (Setosa, Versicolor, Virginica) using 4 features (sepal length, sepal width, petal length, petal width). Though simple, this project teaches core concepts that scale to any classification problem.
Project workflow:
- Explore: visualize features, distributions, and class separability
- Preprocess: scale features and split the data properly
- Train: build and compare multiple classifiers
- Evaluate: analyze results with metrics and confusion matrices
Learning Objectives
Classification Fundamentals
- Understand multi-class classification problems
- Learn train/test splitting strategies
- Apply feature scaling for different algorithms
- Interpret classification metrics correctly
- Visualize decision boundaries
Practical Skills
- Create pair plots and correlation heatmaps
- Implement Logistic Regression, KNN, SVM, Decision Trees
- Use cross-validation for robust evaluation
- Generate and interpret confusion matrices
- Write clean, documented Jupyter notebooks
Business Scenario
FloraID Botanics
You have been hired as a Junior Data Scientist at FloraID Botanics, a botanical research company that develops automated plant identification systems. Your manager has assigned you to work on the iris flower classification module:
"We're building an app that helps amateur botanists identify flowers from measurements. Start with the classic iris dataset - it's small but perfect for learning. Build me a classifier that can accurately identify the three iris species from petal and sepal measurements. Show me which algorithm works best and why."
Questions to Answer
- Which species does a flower belong to given its measurements?
- How accurate is the classification model?
- Which algorithm performs best on this dataset?
- Are all species equally easy to classify?
- Which features are most important for classification?
- Can we separate species using just 2 features?
- Are there overlapping regions between species?
- How do the feature distributions differ by species?
The Dataset
The Iris dataset is one of the most famous datasets in machine learning, introduced by statistician Ronald Fisher in 1936. Download the CSV file or load it directly from scikit-learn:
Dataset Download
Download the Iris dataset from Kaggle (based on UCI ML Repository) or use sklearn's built-in loader.
Original Data Source
This project uses the classic Iris dataset, available from Kaggle and the UCI ML Repository. It contains measurements from 150 iris flowers across 3 species (Setosa, Versicolor, Virginica), with 50 samples per class.
Features Overview
| Feature | Type | Unit | Range | Description |
|---|---|---|---|---|
| sepal_length | float | cm | 4.3 - 7.9 | Length of the sepal |
| sepal_width | float | cm | 2.0 - 4.4 | Width of the sepal |
| petal_length | float | cm | 1.0 - 6.9 | Length of the petal |
| petal_width | float | cm | 0.1 - 2.5 | Width of the petal |
| Class | Label | Count | Characteristics |
|---|---|---|---|
| 0 | Iris-setosa | 50 | Small petals, easily separable from others |
| 1 | Iris-versicolor | 50 | Medium-sized, some overlap with virginica |
| 2 | Iris-virginica | 50 | Large petals, some overlap with versicolor |
Sample Data Preview
| sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 7.0 | 3.2 | 4.7 | 1.4 | versicolor |
| 6.3 | 3.3 | 6.0 | 2.5 | virginica |
Project Requirements
Create a single well-organized Jupyter notebook that covers all the following components with clear documentation and visualizations.
Data Loading & EDA
- Load the iris dataset (CSV or sklearn)
- Display dataset shape, dtypes, and basic statistics
- Check for missing values (should be none)
- Visualize class distribution with bar chart
- Create pair plot colored by species
- Generate correlation heatmap
- Box plots of features by species
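The loading and sanity-check steps above can be sketched as follows, using sklearn's built-in loader (the pair plot, heatmap, and box plots are then drawn from this DataFrame in the notebook):

```python
# Minimal EDA sketch: load iris into a DataFrame, verify class balance,
# check for missing values, and compute feature correlations.
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.copy()  # four feature columns plus 'target'
df['species'] = df['target'].map(dict(enumerate(iris.target_names)))

print(df.shape)                      # (150, 6)
print(df.isna().sum().sum())         # 0 -> no missing values
print(df['species'].value_counts())  # 50 samples per class

# Correlation matrix: the input for the correlation heatmap
corr = df[iris.feature_names].corr()
print(corr.round(2))
```

Note how strongly petal length and petal width correlate; that correlation is one of the EDA insights worth commenting on.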
Data Preprocessing
- Separate features (X) and target (y)
- Split data into train/test sets (80/20 or 70/30)
- Apply StandardScaler to features
- Document your preprocessing choices
Model Training
Train at least 4 different classifiers:
- Logistic Regression
- K-Nearest Neighbors (KNN)
- Support Vector Machine (SVM)
- Decision Tree Classifier
- (Optional: Random Forest, Naive Bayes)
Model Evaluation
- Calculate accuracy, precision, recall, F1-score for each model
- Generate confusion matrix for each model
- Perform 5-fold cross-validation
- Create comparison table of all models
- Visualize decision boundaries (using 2 best features)
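A sketch of the evaluation steps, using Logistic Regression as a stand-in for whichever model is being scored (the same calls apply to each classifier):

```python
# Evaluation sketch: hold-out metrics, confusion matrix, and 5-fold CV.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
y_pred = clf.predict(X_test)

cm = confusion_matrix(y_test, y_pred)          # 3x3: rows = true, cols = predicted
print(cm)
print(classification_report(y_test, y_pred))   # per-class precision/recall/F1

# 5-fold cross-validation on the full dataset
scores = cross_val_score(LogisticRegression(max_iter=200), X, y, cv=5)
print(f'CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}')
```

The confusion matrix `cm` is what gets rendered as a heatmap, and the per-model CV scores feed the comparison table.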
Conclusion
- Summarize which model performed best and why
- Discuss which features are most important
- Note any challenges or interesting findings
- Suggest potential improvements
Model Specifications
Train and compare the following classifiers. Use default parameters first, then optionally tune hyperparameters for your best performer.
- Logistic Regression: Multi-class with 'ovr' or 'multinomial'
- SVM (Linear): kernel='linear', C=1.0
- KNN: Try k=3, 5, 7
- Decision Tree: max_depth=3 to 5
- SVM (RBF): kernel='rbf', gamma='scale'
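Trying the suggested KNN values can be done with a scaler-plus-classifier pipeline, which keeps the scaling fitted inside each cross-validation fold (a sketch; the same pattern works for the other hyperparameter ranges):

```python
# Compare k = 3, 5, 7 for KNN with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

results = {}
for k in (3, 5, 7):
    # Pipeline so the scaler is fit only on each fold's training split
    pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    results[k] = cross_val_score(pipe, X, y, cv=5).mean()
    print(f'k={k}: mean CV accuracy = {results[k]:.3f}')
```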
Evaluation Metrics
- Accuracy: overall fraction of correct predictions
- Precision: of the samples predicted as a class, the fraction that truly belong to it
- Recall: of the samples that truly belong to a class, the fraction the model found
- F1-Score: harmonic mean of precision and recall
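With three classes, precision, recall, and F1 are computed per class and then averaged. A toy example with macro averaging (the label vectors below are made up for illustration):

```python
# Toy multi-class example: metrics are per-class, then macro-averaged.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]  # one class-1 sample mislabeled as class 2

acc = accuracy_score(y_true, y_pred)                     # 5/6 correct
prec = precision_score(y_true, y_pred, average='macro')  # mean per-class precision
rec = recall_score(y_true, y_pred, average='macro')      # mean per-class recall
f1 = f1_score(y_true, y_pred, average='macro')           # mean per-class F1

print(f'accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}')
```

Macro averaging weights each class equally, which is appropriate here since the iris classes are perfectly balanced.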
Sample Code
# Load and prepare data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {
    'Logistic Regression': LogisticRegression(max_iter=200),
    'KNN (k=5)': KNeighborsClassifier(n_neighbors=5),
    'SVM (RBF)': SVC(kernel='rbf', gamma='scale'),
    'Decision Tree': DecisionTreeClassifier(max_depth=4, random_state=42),
}

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    accuracy = model.score(X_test_scaled, y_test)
    print(f'{name}: {accuracy:.4f}')
Required Visualizations
Create at least 10 visualizations in your notebook. Each should have proper titles, labels, and brief interpretive commentary.
Exploratory Visualizations
- Class distribution bar chart
- Pair plot (seaborn pairplot)
- Correlation heatmap
- Box plots by species
- Violin plots or histograms
Model Evaluation Plots
- Confusion matrices (heatmaps)
- Model accuracy comparison bar chart
- Cross-validation score boxplots
- Decision boundary plot (2D)
- Classification report visualization
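The decision boundary plot is usually drawn from the two petal features, which separate the classes best. A sketch using an RBF SVM (the Agg backend keeps it headless-safe; the output file name is illustrative):

```python
# Decision boundary sketch on the two petal features.
import matplotlib
matplotlib.use('Agg')  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X = iris.data[:, 2:4]  # petal length, petal width
y = iris.target

clf = SVC(kernel='rbf', gamma='scale').fit(X, y)

# Predict on a dense grid covering the feature space
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 200),
    np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 200),
)
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)                 # shaded class regions
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k')  # actual samples
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
plt.title('SVM (RBF) decision regions')
plt.savefig('decision_boundaries.png', dpi=150)
```

The same grid-prediction pattern works for any of the trained classifiers, so the plot can be repeated per model for comparison.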
Submission Requirements
Create a public GitHub repository with the exact name shown below:
Required Repository Name
iris-classification-ml
Required Project Structure
iris-classification-ml/
├── data/
│   └── iris.csv                    # Dataset (or note if using sklearn)
├── notebooks/
│   └── iris_classification.ipynb   # Main analysis notebook
├── visualizations/
│   ├── pair_plot.png
│   ├── confusion_matrices.png
│   ├── model_comparison.png
│   └── decision_boundaries.png
├── requirements.txt                # Python dependencies
└── README.md                       # Project documentation
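A starting point for requirements.txt (package names only; pin the exact versions you actually used):

```text
numpy
pandas
matplotlib
seaborn
scikit-learn
jupyter
```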
README.md Required Sections
- Project title, your name, date
- Project overview and objectives
- Dataset description
- Technologies used (Python, sklearn, etc.)
- Key findings from EDA
- Model comparison results
- Best model and accuracy
- How to run the notebook
Grading Rubric
Your project will be graded on the following criteria. Total: 200 points.
| Criteria | Points | Description |
|---|---|---|
| Exploratory Data Analysis | 40 | Comprehensive EDA with 5+ visualizations and insights |
| Data Preprocessing | 20 | Proper train/test split and feature scaling |
| Model Training | 40 | At least 4 different classifiers trained correctly |
| Model Evaluation | 40 | Metrics, confusion matrices, cross-validation |
| Visualizations | 30 | 10+ clear, labeled visualizations |
| Documentation | 30 | README, code comments, conclusions |
| Total | 200 | |
Ready to Submit?
Make sure you have completed all requirements and reviewed the grading rubric above.