Module 4: Model Development

Introduction to Model Development

This module delves into the process of model development, focusing on predictive modeling using datasets. It covers various regression techniques, model evaluation methods, and the importance of accurate data in making predictions.

Key Concepts

1. Simple and Multiple Linear Regression

2. Model Evaluation Using Visualization

3. Polynomial Regression and Pipelines

4. R-squared and Mean Squared Error (MSE)

5. Prediction and Decision Making

Importance of Data

A model, or estimator, is essentially a mathematical equation that predicts a value based on one or more other values. It relates one or more independent variables (features) to dependent variables (outcomes). The accuracy of the model often improves with the relevance and quantity of data. Including multiple independent variables can lead to more precise predictions.

For instance, consider a scenario where an outcome is predicted from several features. If the model's independent variables leave out a crucial feature, its predictions may be inaccurate. Gathering more relevant data and trying different types of models are therefore both essential for robust model development.

1. Simple and Multiple Linear Regression

Linear Regression

Linear regression predicts a target value using one or more independent variables.

1.1 Simple Linear Regression (SLR)

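SLR examines the relationship between two variables:

  • The predictor (independent) variable x.
  • The target (dependent) variable y.

The relationship is defined as:

y = b_0 + b_1 x

where b_0 is the intercept and b_1 is the slope.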

Prediction Step

If highway miles per gallon is 20, a linear model might predict the car price as $22,000, assuming a linear relationship.

Training the Model

Data points are stored in data frames or NumPy arrays. The predictor values (x) and target values (y) are stored separately. The model is trained using these data points to determine the parameters b_0 and b_1.
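To make the training step concrete, here is a minimal sketch that computes the two parameters with the closed-form least-squares formulas, using only the Python standard library. The three (highway mpg, price) points are illustrative values rather than real data, and the helper name fit_simple_linear_regression is hypothetical.

```python
def fit_simple_linear_regression(x, y):
    """Return (b0, b1) minimizing the squared error of y = b0 + b1 * x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: sum of cross-deviations divided by sum of squared x deviations
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = s_xy / s_xx
    # Intercept: forces the line through the point (x_mean, y_mean)
    b0 = y_mean - b1 * x_mean
    return b0, b1


highway_mpg = [20, 30, 40]        # illustrative predictor values
price = [15000, 13000, 12000]     # illustrative target values

b0, b1 = fit_simple_linear_regression(highway_mpg, price)
print("intercept b0:", round(b0, 2))
print("slope b1:", round(b1, 2))
print("predicted price at 30 mpg:", round(b0 + b1 * 30, 2))
```

A library estimator fitted on the same points should arrive at the same line; the sketch only spells out the arithmetic.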

Handling Uncertainty

Factors such as car make and age also influence car prices, so real data points do not fall exactly on a line. The model accounts for this uncertainty by assuming that a small random value (noise), either positive or negative, is added to each point on the line.
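The idea of noise around a line can be sketched by generating data that way. In this sketch the intercept, slope, and noise bound are hypothetical values, chosen only so that 20 mpg maps to $22,000 on the noise-free line; a uniform distribution is an assumption made to keep the noise bounded.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

b0, b1 = 38000.0, -800.0   # hypothetical intercept and slope
max_noise = 500.0          # hypothetical bound on the noise, in dollars


def line(x):
    """Noise-free value on the line y = b0 + b1 * x."""
    return b0 + b1 * x


mpg_values = [20, 25, 30, 35, 40]
# Each observed price is the point on the line plus a small random offset
noisy_prices = [line(x) + random.uniform(-max_noise, max_noise) for x in mpg_values]

for x, observed in zip(mpg_values, noisy_prices):
    print(x, round(line(x), 2), round(observed, 2))
```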

Python Implementation

from sklearn.linear_model import LinearRegression

# Create a linear regression object
lm = LinearRegression()

# Define predictor (x) and target (y) variables
x = ...
y = ...

# Fit the model
lm.fit(x, y)

# Make predictions
predicted_values = lm.predict(x)

# Model parameters
intercept = lm.intercept_
slope = lm.coef_

1.2 Multiple Linear Regression (MLR)

Multiple linear regression (MLR) extends SLR to include multiple predictor variables (x_1, x_2, \ldots, x_n) to predict a continuous target variable (y):

y = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_n x_n

Visualization and Training

With two predictor variables (x_1 and x_2), data points are mapped on a 2D plane, and the target (y) values are mapped vertically.

Python Implementation

from sklearn.linear_model import LinearRegression

# Create a linear regression object
lm = LinearRegression()

# Define predictor variables (z) and target (y)
z = ...
y = ...

# Fit the model
lm.fit(z, y)

# Make predictions
predicted_values = lm.predict(z)

# Model parameters
intercept = lm.intercept_
coefficients = lm.coef_
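
As a rough illustration of what fitting several coefficients involves, the sketch below solves the normal equations (X^T X) b = X^T y in plain Python. This is one textbook route to the least-squares solution, not necessarily how any library computes it. The helper names are hypothetical and the data are illustrative; with only three points and three parameters the fit is exact, so this shows the mechanics only.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Move the row with the largest pivot into place
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_multiple_linear_regression(rows, y):
    """Return [b0, b1, ..., bn] from the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Illustrative values: two predictor columns and a price target
z = [[20, 5], [30, 4], [40, 6]]
y = [15000, 13000, 12000]

b0, b1, b2 = fit_multiple_linear_regression(z, y)
prediction = b0 + b1 * 30 + b2 * 5
print("coefficients:", round(b0, 2), round(b1, 2), round(b2, 2))
print("prediction for [30, 5]:", round(prediction, 2))
```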

In summary, SLR and MLR provide methods to model relationships between variables, helping predict outcomes based on data observations.

2. Model Evaluation Using Visualization

1. Regression Plots

Regression plots estimate the relationship between two variables, showing the strength and direction of the correlation.
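
The strength and direction that a regression plot shows visually can also be summarized as a single number, the Pearson correlation coefficient. The sketch below computes it by hand for three illustrative (highway mpg, price) points; the helper name pearson_r is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: sign gives direction, magnitude gives strength."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    s_yy = sum((yi - y_mean) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)


highway_mpg = [20, 30, 40]        # illustrative values
price = [15000, 13000, 12000]     # illustrative values

r = pearson_r(highway_mpg, price)
print("Pearson r:", round(r, 4))
```

A value close to -1 here corresponds to a regression line that slopes downward with points lying near it.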

Creating a Regression Plot

  1. Import Seaborn:
    import seaborn as sns
  2. Use the regplot function:
    • Parameters:
      • x: Independent variable column.
      • y: Dependent variable column.
      • data: Name of the DataFrame.
    sns.regplot(x='feature', y='target', data=df)

2. Residual Plots

Residual plots represent the error between actual and predicted values.
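
A residual is simply the actual value minus the predicted value. The sketch below computes the residuals that a residual plot would display, for three illustrative points and the least-squares line through them (the values of b0 and b1 are that line's parameters for this toy data).

```python
highway_mpg = [20, 30, 40]        # illustrative predictor values
price = [15000, 13000, 12000]     # illustrative actual values

# Least-squares line for these three points
b0, b1 = 53500 / 3, -150.0

predicted = [b0 + b1 * x for x in highway_mpg]
# Residual = actual - predicted; this is what the plot's vertical axis shows
residuals = [actual - pred for actual, pred in zip(price, predicted)]

for x, res in zip(highway_mpg, residuals):
    print(x, round(res, 2))
```

When residuals scatter around zero with no visible pattern, a linear model is usually a reasonable choice; a curved pattern suggests the relationship is not linear.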

Creating a Residual Plot

  1. Import Seaborn:
    import seaborn as sns
  2. Use the residplot function:
    • Parameters:
      • x: Independent variable (feature) column.
      • y: Dependent variable (target) column.
      • data: Name of the DataFrame.
    sns.residplot(x='feature', y='target', data=df)

3. Distribution Plots

Distribution plots compare the distribution of the predicted values with the distribution of the actual values.


Creating a Distribution Plot

  1. Import Pandas and Seaborn:
    import pandas as pd
    import seaborn as sns
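  2. Use Seaborn's kdeplot function, once per series:
    • Parameters:
      • data (first argument): Values to plot (predicted or actual).
      • color: Set to desired color.
      • label: Include label for legend.
    • Note: The hist=False option belongs to the older sns.distplot function, which is deprecated; sns.kdeplot draws only the density curve, so no hist argument is needed.
    sns.kdeplot(predicted_values, color='blue', label='Predicted')
    sns.kdeplot(actual_values, color='red', label='Actual')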

Observation (from the example plot): predicted prices in the 40,000 to 50,000 range are inaccurate, while predictions in the 10,000 to 20,000 range are much closer to the target values.

3. Polynomial Regression and Pipelines

Polynomial Regression

What is Polynomial Regression?

Types of Polynomial Models

Example Model


Implementation in Python

Polynomial Features with Scikit-learn

  1. Import the Library:
    from sklearn.preprocessing import PolynomialFeatures
  2. Create Polynomial Features:
from sklearn.linear_model import LinearRegression

# Create a PolynomialFeatures object with the desired degree
poly = PolynomialFeatures(degree=3)

# Fit and transform your data
X_poly = poly.fit_transform(X)

# Create LinearRegression model
model = LinearRegression()

# Fit the model with the polynomial features
model.fit(X_poly, y)

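In this code:

  • PolynomialFeatures(degree=3) creates a transformer that generates polynomial terms up to degree 3.
  • fit_transform(X) computes those terms for the input data X.
  • LinearRegression is then fit on the transformed features, so the model remains linear in its coefficients.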
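
To see what the polynomial transformation produces, the following sketch expands a single feature by hand. For one input column and default settings, PolynomialFeatures(degree=3) yields the columns 1, x, x^2, x^3; with several input columns it also adds interaction terms, which this sketch does not cover. The helper name polynomial_features and the input values are hypothetical.

```python
def polynomial_features(x, degree):
    """Expand one value x into [1, x, x^2, ..., x^degree]."""
    return [x ** power for power in range(degree + 1)]


X = [20, 30, 40]  # illustrative single-feature values
X_poly = [polynomial_features(x, degree=3) for x in X]

for row in X_poly:
    print(row)
```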


Why Normalize?

How to Normalize


What are Pipelines?


Creating a Pipeline

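  1. Import Pipeline and the steps it will use:
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler, PolynomialFeatures
    from sklearn.linear_model import LinearRegression
  2. Define Steps:
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('poly', PolynomialFeatures(degree=3)),
        ('model', LinearRegression())
    ])
  3. Train and Predict:
    pipeline.fit(x_train, y_train)
    y_pred = pipeline.predict(x_test)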
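
The scaling step of such a pipeline can be illustrated by hand. By default, StandardScaler subtracts each feature's mean and divides by its standard deviation (the population form), so the scaled feature has mean 0 and standard deviation 1. The sketch below does this for one illustrative column; the helper name standardize is hypothetical.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]


highway_mpg = [20, 30, 40]  # illustrative values
scaled = standardize(highway_mpg)
print([round(v, 4) for v in scaled])
```

Putting features on a common scale before generating polynomial terms keeps the higher-degree columns from growing very large relative to the others.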

Key Points

Use polynomial regression and pipelines to enhance model accuracy and streamline your machine learning workflow.


Example Code

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# Simple Linear Regression
X_slr = np.array([[20], [30], [40]])
y_slr = np.array([15000, 13000, 12000])

model_slr = LinearRegression()
model_slr.fit(X_slr, y_slr)
predicted_slr = model_slr.predict([[30]])
print("SLR Predicted Price:", predicted_slr[0])

# Multiple Linear Regression
X_mlr = np.array([[20, 5], [30, 4], [40, 6]])
y_mlr = np.array([15000, 13000, 12000])

model_mlr = LinearRegression()
model_mlr.fit(X_mlr, y_mlr)
predicted_mlr = model_mlr.predict([[30, 5]])
print("MLR Predicted Price:", predicted_mlr[0])

# Polynomial Regression
X_poly = np.array([[20], [30], [40]])
y_poly = np.array([15000, 13000, 12000])

poly = PolynomialFeatures(degree=2)
X_poly_transformed = poly.fit_transform(X_poly)

model_poly = LinearRegression()
model_poly.fit(X_poly_transformed, y_poly)
predicted_poly = model_poly.predict(poly.transform([[30]]))
print("Polynomial Predicted Price:", predicted_poly[0])

4. Model Evaluation Metrics

Mean Squared Error (MSE)


Scenario: Predicting house prices based on size.


\text{MSE} = \frac{1}{3} \left( (200-210)^2 + (250-240)^2 + (300-310)^2 \right) = \frac{1}{3} (100 + 100 + 100) = 100
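
The calculation above is one instance of the general definition. The symbols are assumptions introduced here for illustration: n is the number of observations, y_i is the i-th actual value, and \hat{y}_i is the corresponding predicted value.

```latex
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```

Because the errors are squared, large misses count disproportionately, and the result is expressed in squared units of the target.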

Python Code:

from sklearn.metrics import mean_squared_error

actual = [200, 250, 300]
predicted = [210, 240, 310]
mse = mean_squared_error(actual, predicted)
print("MSE:", mse)  # Output: MSE: 100.0

R-squared (Coefficient of Determination)


Scenario: Predicting test scores based on study hours.


Python Code:

from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1], [2], [3]])
y = np.array([2, 3, 5])
model = LinearRegression()
model.fit(X, y)

r_squared = model.score(X, y)
print("R-squared:", r_squared)  # Output: R-squared: 0.9642857142857143
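
The score can also be reproduced by hand, which shows what it measures: one minus the ratio of the residual sum of squares to the total sum of squares around the mean. This sketch uses the same study-hours data and only the standard library; the variable names are chosen here for illustration.

```python
X = [1, 2, 3]   # study hours
y = [2, 3, 5]   # test scores

n = len(X)
x_mean = sum(X) / n
y_mean = sum(y) / n

# Least-squares line through the data
b1 = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(X, y)) / sum(
    (xi - x_mean) ** 2 for xi in X
)
b0 = y_mean - b1 * x_mean
predicted = [b0 + b1 * xi for xi in X]

# R-squared = 1 - (unexplained variation / total variation)
ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
ss_tot = sum((yi - y_mean) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot
print("R-squared:", round(r_squared, 4))
```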

Example Interpretation

5. Prediction and Decision Making

Model Evaluation

To ensure a model's validity, use:

Example: Predicting Car Price

Coefficient Interpretation

The coefficient on mpg is -821, so each 1 mpg increase lowers the predicted price by $821.
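
This interpretation can be checked with a small sketch: for a linear model, the difference between predictions one unit apart always equals the coefficient. The slope comes from the text; the intercept is a hypothetical value chosen only to make the example runnable.

```python
slope = -821.0        # coefficient from the text: dollars per 1 mpg
intercept = 38000.0   # hypothetical intercept, for illustration only


def predict_price(mpg):
    """Predicted price from the linear model intercept + slope * mpg."""
    return intercept + slope * mpg


change = predict_price(31) - predict_price(30)
print("Change in predicted price for +1 mpg:", change)
```

The same difference results for any starting mpg, which is what distinguishes a linear model from a polynomial one.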

Potential Issues

Unrealistic predictions may indicate:

Generating Predictions

import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data
X = np.array([[20], [30], [40]])
y = np.array([15000, 13000, 12000])

# Model training
model = LinearRegression()
model.fit(X, y)

# Generating a sequence for prediction
mpg_values = np.arange(1, 101, 1)  # From 1 to 100 with step 1
predicted_prices = model.predict(mpg_values.reshape(-1, 1))

# Example prediction for 30 mpg
predicted_price = model.predict([[30]])
print("Predicted Price:", predicted_price[0])


Visualization Techniques

Evaluating Models

Mean Squared Error (MSE)

Example MSEs:

Code Example:

from sklearn.metrics import mean_squared_error

# Actual and predicted values
actual = [15000, 13000, 12000]
predicted = model.predict(X)

# Calculate MSE
mse = mean_squared_error(actual, predicted)
print("MSE:", mse)

R-squared (R²)

Example R² Values:

# Calculate R-squared
r_squared = model.score(X, y)
print("R-squared:", r_squared) 

Model Comparison



Cheat Sheet: Model Development

Linear Regression

Train Linear Regression Model

Generate Output Predictions

Identify the Coefficient and Intercept

Residual Plot

Distribution Plot

Polynomial Regression

Multi-variate Polynomial Regression


R² Value

Mean Squared Error (MSE) Value