Module 3: Classification

Introduction to Classification

Classification is a supervised learning approach used to categorize items into discrete classes. It aims to learn the relationship between feature variables and a target variable, which is categorical.

How Classification Works

Given training data with target labels, a classification model predicts the class label for new, unlabeled data.

Example:

A loan default predictor uses historical data (e.g., age, income) to classify customers as defaulters or non-defaulters.
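A minimal sketch of this idea using a hypothetical customer table and a scikit-learn classifier (the column names and values are invented for illustration):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical historical data: features plus whether the customer defaulted
history = pd.DataFrame({
    'Age':     [25, 40, 35, 50, 23, 60],
    'Income':  [30000, 80000, 45000, 90000, 28000, 75000],
    'Default': [1, 0, 1, 0, 1, 0]   # 1 = defaulter, 0 = non-defaulter
})

# Learn the relationship between the feature variables and the categorical target
model = DecisionTreeClassifier()
model.fit(history[['Age', 'Income']], history['Default'])

# Predict the class of a new, unlabeled customer
new_customer = pd.DataFrame({'Age': [30], 'Income': [40000]})
print(model.predict(new_customer))   # e.g., [1] -> predicted defaulter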

Types of Classification

Binary Classification

Predicts one of two possible classes (e.g., defaulter vs. non-defaulter).

Multi-class Classification

Predicts among more than two classes (e.g., which medication is appropriate for a patient).

Applications

Business Use Cases

Industries

Common Classification Algorithms

Many algorithms can be used for classification; this module focuses on two widely used ones, K-Nearest Neighbors (KNN) and decision trees.


K-Nearest Neighbors (KNN) Algorithm

Overview

The K-Nearest Neighbors (KNN) algorithm is a supervised learning classification technique used to classify a data point based on how its neighbors are classified. It is based on the concept that data points that are close to each other are more likely to belong to the same class. KNN can also be used for regression tasks.

Example Scenario

Consider a telecommunications provider that has segmented its customer base into four groups based on service usage patterns. The goal is to predict which group a new customer belongs to using demographic data such as age and income. This is a classification problem, where the goal is to assign a class label to a new, unknown case based on the known labels of other cases.

How K-Nearest Neighbors Works

  1. Choosing the Number of Neighbors (K): The number of neighbors (K) to consider is specified by the user.
  2. Calculating Distance: For a new data point, the algorithm calculates the distance between this point and all other points in the dataset. Common distance metrics include Euclidean distance.
  3. Finding the Nearest Neighbors: The K data points that are closest to the new data point are identified.
  4. Assigning a Class Label: The new data point is assigned the class label that is most common among its K nearest neighbors.

Example of KNN Classification

Example Code for KNN:

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# df is assumed to be a DataFrame of customer records with
# 'Age' and 'Income' features and a 'Customer Group' label column
X_train = df[['Age', 'Income']]
y_train = df['Customer Group']

# New customer data (wrapped in a DataFrame so feature names match the training data)
new_customer = pd.DataFrame([[30, 55000]], columns=['Age', 'Income'])

# Initialize KNN classifier with K = 3
knn = KNeighborsClassifier(n_neighbors=3)

# Fit the model
knn.fit(X_train, y_train)

# Predict the class of the new customer
predicted_class = knn.predict(new_customer)
print(f'Predicted Customer Group: {predicted_class[0]}')

Choosing the Value of K

Finding the Optimal K

To find the optimal value of K:

  1. Reserve a portion of your data for testing.
  2. Train the model using the training data and evaluate its accuracy on the test data for different values of K.
  3. Choose the value of K that results in the highest accuracy.

Example of Choosing K:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Full feature matrix and labels (same df as above)
X = df[['Age', 'Income']]
y = df['Customer Group']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Evaluate different values of K
accuracies = []
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    accuracies.append(accuracy_score(y_test, y_pred))

# The best K is the one with the highest test accuracy
best_k = accuracies.index(max(accuracies)) + 1
print(f'Best K: {best_k}')

Regression with KNN

KNN can also be used for regression tasks. In this case, instead of assigning a class label, the algorithm predicts a continuous value (e.g., the price of a house). The predicted value is typically the average or median of the K nearest neighbors' values.

Example of KNN Regression
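Below is a minimal sketch of KNN regression using scikit-learn's KNeighborsRegressor; the house sizes and prices are made-up values purely for illustration.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Illustrative data: house size (square feet) and sale price
X_train = np.array([[900], [1100], [1400], [1600], [2000], [2400]])
y_train = np.array([150000, 180000, 210000, 240000, 300000, 360000])

# The prediction is the average of the 3 nearest neighbors' prices
knn_reg = KNeighborsRegressor(n_neighbors=3)
knn_reg.fit(X_train, y_train)

# Predict the price of a 1500 sq ft house
predicted_price = knn_reg.predict([[1500]])
print(f'Predicted price: {predicted_price[0]:.0f}')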

Summary

The KNN algorithm is a simple yet powerful tool for both classification and regression tasks. Its effectiveness depends on the choice of K and the distance metric used. The main challenge lies in finding the right balance between underfitting and overfitting by selecting an appropriate value of K.


Evaluation Metrics for Classifiers

Model evaluation metrics are essential in determining the performance of a classifier. These metrics provide insights into areas where the model might require improvement. In this note, we will explore several evaluation metrics for classification: the Jaccard index, the confusion matrix, the F1 score, log loss, and accuracy.

1. Jaccard Index

Definition

The Jaccard index (also known as the Jaccard similarity coefficient) measures the similarity between the actual labels and the predicted labels by the model. It is calculated as the size of the intersection divided by the size of the union of the two label sets.

Formula

Given that $y$ represents the true labels and $\hat{y}$ represents the predicted labels:

\text{Jaccard Index} = \frac{|y \cap \hat{y}|}{|y \cup \hat{y}|}

Example

For a test set of size 10 with 8 correct predictions ($|y \cap \hat{y}| = 8$, so $|y \cup \hat{y}| = 10 + 10 - 8 = 12$):

\text{Jaccard Index} = \frac{8}{10 + (10 - 8)} = \frac{8}{12} \approx 0.67

Interpretation

Code Example

from sklearn.metrics import jaccard_score

# True labels
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]

# Predicted labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Compute Jaccard Index
jaccard = jaccard_score(y_true, y_pred)
print("Jaccard Index:", jaccard)

2. Confusion Matrix

Definition

A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known. Each row of the matrix represents the instances of an actual class, while each column represents the instances of a predicted class.

Example

Consider a confusion matrix for a binary classification problem evaluated on 40 test samples:

                     Predicted: No (0)    Predicted: Yes (1)
Actual: No (0)              24                    1
Actual: Yes (1)              9                    6

Interpretation

In this matrix the model produced 24 true negatives (TN), 6 true positives (TP), 1 false positive (FP), and 9 false negatives (FN), so most of its errors are missed positives.

Examples

Precision and Recall

\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}

\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}

Code Example

from sklearn.metrics import confusion_matrix

# True labels
y_true = [0, 0, 0, 1, 1, 0, 1, 1, 0, 1]

# Predicted labels
y_pred = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]

# Compute Confusion Matrix
# scikit-learn returns rows as actual classes and columns as predicted classes:
# [[TN, FP],
#  [FN, TP]]
conf_matrix = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:\n", conf_matrix)

3. F1 Score

Definition

The F1 score is the harmonic mean of precision and recall, providing a balance between the two. It is particularly useful when the class distribution is imbalanced.

Formula

\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

Example
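Using the confusion matrix from the previous section (TP = 6, FP = 1, FN = 9):

\text{Precision} = \frac{6}{6 + 1} \approx 0.86, \quad \text{Recall} = \frac{6}{6 + 9} = 0.40

\text{F1 Score} = 2 \times \frac{0.86 \times 0.40}{0.86 + 0.40} \approx 0.55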

Interpretation

Code Example

from sklearn.metrics import f1_score

# True labels
y_true = [0, 0, 0, 1, 1, 0, 1, 1, 0, 1]

# Predicted labels
y_pred = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]

# Compute F1 Score
f1 = f1_score(y_true, y_pred)
print("F1 Score:", f1)

4. Log Loss (Logarithmic Loss)

Definition

Log Loss measures the accuracy of a classifier that outputs probabilities rather than class labels. It penalizes predictions that are confident but wrong more than those that are less confident but wrong.

Formula

\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left( y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right)

Where $N$ is the number of samples, $y_i$ is the true label of sample $i$ (0 or 1), and $\hat{p}_i$ is the predicted probability that sample $i$ belongs to class 1.

Interpretation

Lower log loss indicates better performance; a perfect classifier has a log loss of 0, and confident predictions that turn out to be wrong are penalized heavily.

Code Example

from sklearn.metrics import log_loss

# True labels
y_true = [0, 0, 0, 1, 1, 0, 1, 1, 0, 1]

# Predicted probabilities
y_prob = [0.1, 0.4, 0.2, 0.8, 0.6, 0.2, 0.9, 0.4, 0.7, 0.8]

# Compute Log Loss
logloss = log_loss(y_true, y_prob)
print("Log Loss:", logloss)

5. Accuracy

Definition

Accuracy measures the proportion of correctly predicted instances (both true positives and true negatives) out of the total number of instances. It is a simple and widely used metric for classification tasks.

Formula

\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}

Alternatively, it can also be expressed as:

\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}

Example

If a model has 24 true negatives, 6 true positives, 1 false positive, and 9 false negatives, the accuracy is:

\text{Accuracy} = \frac{24 + 6}{24 + 6 + 1 + 9} = \frac{30}{40} = 0.75

Interpretation

An accuracy of 0.75 means 75% of the test instances were classified correctly. Accuracy can be misleading on imbalanced datasets, which is why metrics such as precision, recall, and the F1 score are often reported alongside it.

Code Example

from sklearn.metrics import accuracy_score

# True labels
y_true = [0, 1, 0, 1, 1, 0, 1, 0, 0, 1]

# Predicted labels
y_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1]

# Compute Accuracy
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy:", accuracy)

Summary

Each metric highlights a different aspect of performance: the Jaccard index and accuracy measure overall agreement with the true labels, the confusion matrix together with precision, recall, and the F1 score show how errors are distributed across classes, and log loss evaluates the quality of predicted probabilities. The appropriate metric depends on the problem and on how balanced the classes are.


Introduction to Decision Trees

Decision trees are a powerful tool in classification that helps in making decisions based on data. In this note, we will explore what a decision tree is, how it is used for classification, and the basic process of building a decision tree.

1. What is a Decision Tree?

A decision tree is a flowchart-like structure used for decision-making. It classifies a dataset by repeatedly breaking it into smaller and smaller subsets, while the corresponding tree is incrementally developed.

Example Scenario:

Imagine a medical researcher compiling data about patients who suffered from the same illness. The patients responded to one of two medications: Drug A or Drug B. The dataset includes features like age, gender, blood pressure, and cholesterol levels, with the target being the drug each patient responded to.

The goal is to build a model that predicts which drug might be appropriate for a future patient with the same illness. This is a binary classification problem where the decision tree will help classify the appropriate drug.

2. Structure of a Decision Tree

Components:

  - Internal (decision) nodes, each of which tests one attribute
  - Branches, which represent the possible outcomes of that test
  - Leaf nodes, which assign the final class label (the decision)

Example:

Consider the dataset with features such as age, gender, blood pressure, and cholesterol. The decision tree may start by testing the age attribute, creating one branch for each of its values.

Each internal node tests an attribute, and the branches represent the possible outcomes of the test. The leaf node assigns the final decision, such as prescribing a specific drug.

3. Building a Decision Tree

Steps Involved:

  1. Choosing an Attribute: Select an attribute from the dataset (e.g., age).
  2. Calculating Significance: Determine how significant the attribute is in splitting the data; this is what identifies the best attribute to split on.
  3. Splitting the Data: Based on the value of the chosen attribute, split the data into different branches.
  4. Repeating the Process: For each branch, repeat the process with the remaining attributes until all attributes are used or a decision can be made.

Outcome:

Once the tree is built, it can be used to predict the class of new or unknown cases. In the context of our example, the tree can help in determining the appropriate drug for a new patient based on their characteristics.

Key Points:

Summary

Decision trees are intuitive and powerful, especially in scenarios where a clear decision-making process is needed based on multiple attributes.


Decision Tree Building Process

Introduction

Decision trees are a key tool in machine learning used for classification tasks. They work by recursively partitioning data based on the most predictive features, creating branches that lead to decision outcomes.

Recursive Partitioning

The process of building a decision tree involves recursive partitioning, where data is split based on the most predictive features. This splitting continues until the subsets of data (or leaves) are sufficiently pure.

Attribute Selection

Choosing the right attribute for splitting the data is crucial. The effectiveness of an attribute is measured by its ability to reduce impurity in the resulting nodes.

Example: Drug Dataset

Consider a dataset with 14 patients where the goal is to decide which drug to prescribe.

Impurity and Entropy

The purity of a node in a decision tree is assessed using entropy, which measures the randomness or disorder in the data.

Entropy Calculation

\text{Entropy} = -\sum_{i} P_i \log_2(P_i)

where $P_i$ is the proportion of data points belonging to class $i$.

Example Calculation
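For example, the 14-patient sample used in the Python example below contains 6 patients who responded to Drug A and 8 who responded to Drug B, so the entropy of the full dataset is:

\text{Entropy} = -\frac{6}{14}\log_2\left(\frac{6}{14}\right) - \frac{8}{14}\log_2\left(\frac{8}{14}\right) \approx 0.985

An entropy of 0 indicates a completely pure node (all samples belong to one class), while an entropy of 1 indicates an even 50/50 split.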

Information Gain

Information gain measures how well an attribute separates the data into pure subsets. It is calculated as the difference between entropy before and after the split.

Calculation

\text{Information Gain} = \text{Entropy}_{\text{before}} - \text{Weighted Entropy}_{\text{after}}

\text{Weighted Entropy}_{\text{after}} = \sum_{i} \frac{N_i}{N} \, \text{Entropy}_i

where $\frac{N_i}{N}$ is the proportion of samples in subset $i$, and $\text{Entropy}_i$ is the entropy of subset $i$.
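For instance, splitting the 14-patient sample from the Python example below on Cholesterol gives a 'Normal' subset (7 patients: 4 Drug A, 3 Drug B) and a 'High' subset (7 patients: 2 Drug A, 5 Drug B):

\text{Entropy}_{\text{Normal}} \approx 0.985, \quad \text{Entropy}_{\text{High}} \approx 0.863

\text{Weighted Entropy}_{\text{after}} = \frac{7}{14}(0.985) + \frac{7}{14}(0.863) \approx 0.924

\text{Information Gain} = 0.985 - 0.924 \approx 0.06

Splitting on Sex instead leaves both subsets with 3 Drug A and 4 Drug B (entropy of about 0.985 each), so its information gain is roughly 0.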

Decision Tree Example Using Python

Here's how to create and visualize a decision tree using Python and scikit-learn.

Code Example

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# Sample dataset
data = {
    'Cholesterol': ['Normal', 'High', 'Normal', 'High', 'Normal', 'High', 'Normal', 'Normal', 'High', 'High', 'Normal', 'High', 'Normal', 'High'],
    'Sex': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male', 'Male', 'Female'],
    'Drug': ['A', 'B', 'A', 'A', 'B', 'B', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'B']
}
df = pd.DataFrame(data)

# Convert categorical features to numeric
df['Cholesterol'] = df['Cholesterol'].map({'Normal': 0, 'High': 1})
df['Sex'] = df['Sex'].map({'Male': 0, 'Female': 1})
df['Drug'] = df['Drug'].map({'A': 0, 'B': 1})

# Features and target
X = df[['Cholesterol', 'Sex']]
y = df['Drug']

# Initialize and fit the model
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(X, y)

# To score held-out data, you could call: y_pred = clf.predict(X_test)

# Plot the decision tree
plt.figure(figsize=(12,8))
plot_tree(clf, feature_names=['Cholesterol', 'Sex'], class_names=['Drug A', 'Drug B'], filled=True)
plt.title('Decision Tree')
plt.show()

Explanation

The classifier is trained with criterion='entropy', so each split is chosen to maximize information gain. The plotted tree shows, at each node, the attribute being tested, the node's entropy, the number of samples reaching it, and the predicted class.

Conclusion

The attribute with the highest information gain is chosen for splitting the data. In this case, cholesterol has a higher information gain than sex, making it a better choice for the initial split.

This process is repeated for each branch of the tree until the data in each node is sufficiently pure.