Module 4: Model Development

Introduction to Model Development

This module delves into the process of model development, focusing on predictive modeling using datasets. It covers various regression techniques, model evaluation methods, and the importance of accurate data in making predictions.

Key Concepts

1. Simple and Multiple Linear Regression: predicting a continuous target from one or more features.

2. Model Evaluation Using Visualization: regression, residual, and distribution plots.

3. Polynomial Regression and Pipelines: fitting non-linear relationships and streamlining the workflow.

4. R-squared and Mean Squared Error (MSE): numerical measures of in-sample model accuracy.

5. Prediction and Decision Making: using the trained model and its metrics to make sound predictions.

Importance of Data

A model, or estimator, is essentially a mathematical equation that predicts a value based on one or more other values. It relates one or more independent variables (features) to dependent variables (outcomes). The accuracy of the model often improves with the relevance and quantity of data. Including multiple independent variables can lead to more precise predictions.

For instance, consider a scenario where predicting an outcome is based on several features. If the model's independent variables do not include a crucial feature, predictions may be inaccurate. Therefore, gathering more relevant data and exploring different types of models is crucial for robust model development.


1. Simple and Multiple Linear Regression

Linear Regression

Linear regression predicts a target value using one or more independent variables.

1.1 Simple Linear Regression (SLR)

SLR examines the relationship between two variables:

• The predictor (independent) variable, x
• The target (dependent) variable, y

The relationship is defined as:

y = b_0 + b_1 x

Prediction Step

For example, if a car's highway mileage is 20 mpg, a linear model might predict its price as $22,000, assuming the relationship between mileage and price is linear.

Training the Model

Data points are stored in data frames or NumPy arrays. The predictor values (x) and target values (y) are stored separately. The model is trained on these data points to determine the parameters b_0 (intercept) and b_1 (slope).

Handling Uncertainty

Factors such as car make and age also influence car prices. This uncertainty is addressed by adding a small random value (noise) to the line, so the model becomes y = b_0 + b_1 x + noise, where the noise is usually a small positive or negative value, as illustrated below.
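As a rough illustration (the slope, intercept, and noise scale here are made up, not from the module), a noisy linear relationship can be simulated with NumPy:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(10, 50, size=100)            # e.g. highway mpg values
noise = rng.normal(0, 500, size=x.shape)     # small random error term
y = 38000 - 800 * x + noise                  # y = b_0 + b_1*x + noise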

Python Implementation

from sklearn.linear_model import LinearRegression

# Create a linear regression object
lm = LinearRegression()

# Define predictor (x) and target (y) variables
# (assumes df is a pandas DataFrame with 'highway-mpg' and 'price' columns,
#  as in the mpg/price example above)
x = df[['highway-mpg']]
y = df['price']

# Fit the model to estimate b_0 and b_1
lm.fit(x, y)

# Make predictions
predicted_values = lm.predict(x)

# Model parameters
intercept = lm.intercept_  # b_0
slope = lm.coef_           # b_1

1.2 Multiple Linear Regression (MLR)

Multiple linear regression (MLR) extends SLR to include multiple predictor variables (x_1, x_2, ..., x_n) to predict a continuous target variable (y):

y = b_0 + b_1 x_1 + b_2 x_2 + ... + b_n x_n

Visualization and Training

With two predictor variables (x_1 and x_2), data points are mapped on a 2D plane, and the target (y) values are mapped vertically.

Python Implementation

from sklearn.linear_model import LinearRegression

# Create a linear regression object
lm = LinearRegression()

# Define predictor variables (z) and target (y)
# (assumes df is a pandas DataFrame; these column names are example predictors)
z = df[['horsepower', 'curb-weight', 'engine-size', 'highway-mpg']]
y = df['price']

# Fit the model
lm.fit(z, y)

# Make predictions
predicted_values = lm.predict(z)

# Model parameters
intercept = lm.intercept_       # b_0
coefficients = lm.coef_         # b_1 ... b_n

In summary, SLR and MLR provide methods to model relationships between variables, helping predict outcomes based on data observations.


2. Model Evaluation Using Visualization

1. Regression Plots

Regression plots estimate the relationship between two variables, showing the strength and direction of the correlation.

Creating a Regression Plot

  1. Import Seaborn:
    import seaborn as sns
  2. Use the regplot Function:
    • Parameters:
      • x: Independent variable column.
      • y: Dependent variable column.
      • data: Name of the DataFrame.
    sns.regplot(x='feature', y='target', data=df)

2. Residual Plots

Residual plots represent the error between actual and predicted values.

Creating a Residual Plot

  1. Import Seaborn:
    import seaborn as sns
  2. Use the residplot Function:
    • Parameters:
      • x: Series of the independent variable (feature).
      • y: Series of the target variable.
    sns.residplot(x='feature', y='target', data=df)

3. Distribution Plots

Distribution plots visualize predicted versus actual values.

Creating a Distribution Plot

  1. Import Pandas and Seaborn:
    import pandas as pd
    import seaborn as sns
  2. Use Seaborn's kdeplot Function:
    • Parameters:
      • color: Set to the desired color.
      • label: Label to include in the legend.
    sns.kdeplot(predicted_values, color='blue', label='Predicted')
    sns.kdeplot(actual_values, color='red', label='Actual')

(The older distplot API produced the same curve with hist=False.)

Observation: Predicted prices in the $40,000 to $50,000 range are inaccurate, while predictions in the $10,000 to $20,000 range are closer to the actual values.


3. Polynomial Regression and Pipelines

Polynomial Regression

What is Polynomial Regression?

Polynomial regression is a special case of the general linear regression model. It captures non-linear relationships by including squared or higher-order terms of the predictor variable.

Types of Polynomial Models

• Quadratic (2nd order): y = b_0 + b_1 x + b_2 x^2
• Cubic (3rd order): y = b_0 + b_1 x + b_2 x^2 + b_3 x^3
• Higher-order polynomials for more complex curves

Example Model

ŷ = -1.557x^3 + 204.8x^2 + 8965x + 1.37 × 10^5

Implementation in Python

Polynomial Features with Scikit-learn

  1. Import the Library:
    from sklearn.preprocessing import PolynomialFeatures
  2. Create Polynomial Features and Fit a Model:

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Create a PolynomialFeatures object with the desired degree
poly = PolynomialFeatures(degree=3)

# Fit and transform the original feature matrix X
X_poly = poly.fit_transform(X)

# Create a LinearRegression model
model = LinearRegression()

# Fit the model with the polynomial features
model.fit(X_poly, y)

In this code, PolynomialFeatures expands X into polynomial terms up to degree 3, and LinearRegression then fits an ordinary linear model on the expanded features.

Normalization

Why Normalize?

Polynomial terms can differ greatly in magnitude (for example, x versus x^3), and features measured on different scales can distort the fitted coefficients. Normalizing puts each feature on a comparable scale.

How to Normalize

Scikit-learn's StandardScaler transforms each feature so that it has zero mean and unit variance, as sketched below.
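A minimal sketch, assuming X is a NumPy array or DataFrame of features:

from sklearn.preprocessing import StandardScaler

# Create the scaler and learn each feature's mean and standard deviation
scaler = StandardScaler()
scaler.fit(X)

# Transform the features to zero mean and unit variance
X_scaled = scaler.transform(X)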

Pipelines

What are Pipelines?

A pipeline chains a sequence of transformations (such as scaling and polynomial feature generation) with a final estimator, so the whole workflow can be trained and used as a single object.

Benefits

• Less repetitive code and fewer intermediate variables.
• The same preprocessing is applied consistently during training and prediction.

Creating a Pipeline

  1. Import the Classes:
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler, PolynomialFeatures
    from sklearn.linear_model import LinearRegression
  2. Define Steps:
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('poly', PolynomialFeatures(degree=3)),
        ('model', LinearRegression())
    ])
  3. Train and Predict:
    pipeline.fit(x_train, y_train)
    y_pred = pipeline.predict(x_test)

Key Points

• Polynomial features let a linear model capture non-linear relationships.
• Normalize features so that higher-order terms do not dominate.
• Pipelines keep the transformation and modeling steps organized and consistent.

Use polynomial regression and pipelines to enhance model accuracy and streamline your machine learning workflow.

Note:

Simple Linear Regression (SLR)

Uses a single predictor variable to model a continuous target with a straight line.

Multiple Linear Regression (MLR)

Uses two or more predictor variables to model a continuous target.

Polynomial Regression

Adds higher-order terms of the predictor(s), allowing the model to fit curves.

Key Differences

• Number of predictors: SLR uses one; MLR uses several.
• Shape of fit: SLR and MLR fit straight lines (or planes); polynomial regression fits curves.
• Polynomial regression is still trained with linear regression, but on transformed features.

Example Code

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# Simple Linear Regression
X_slr = np.array([[20], [30], [40]])
y_slr = np.array([15000, 13000, 12000])

model_slr = LinearRegression()
model_slr.fit(X_slr, y_slr)
predicted_slr = model_slr.predict([[30]])
print("SLR Predicted Price:", predicted_slr[0])

# Multiple Linear Regression
X_mlr = np.array([[20, 5], [30, 4], [40, 6]])
y_mlr = np.array([15000, 13000, 12000])

model_mlr = LinearRegression()
model_mlr.fit(X_mlr, y_mlr)
predicted_mlr = model_mlr.predict([[30, 5]])
print("MLR Predicted Price:", predicted_mlr[0])

# Polynomial Regression
X_poly = np.array([[20], [30], [40]])
y_poly = np.array([15000, 13000, 12000])

poly = PolynomialFeatures(degree=2)
X_poly_transformed = poly.fit_transform(X_poly)

model_poly = LinearRegression()
model_poly.fit(X_poly_transformed, y_poly)
predicted_poly = model_poly.predict(poly.transform([[30]]))
print("Polynomial Predicted Price:", predicted_poly[0])

4. Model Evaluation Metrics

Mean Squared Error (MSE)

MSE is the average of the squared differences between the actual values and the predictions:

MSE = (1/n) Σ (y_i − ŷ_i)^2

Example

Scenario: Predicting house prices based on size.

Calculation:

MSE = (1/3)((200 − 210)^2 + (250 − 240)^2 + (300 − 310)^2) = (1/3)(100 + 100 + 100) = 100

Python Code:

from sklearn.metrics import mean_squared_error

actual = [200, 250, 300]
predicted = [210, 240, 310]
mse = mean_squared_error(actual, predicted)
print("MSE:", mse)  # Output: MSE: 100.0

R-squared (Coefficient of Determination)

R² measures the proportion of the variation in the target that is explained by the model; values closer to 1 indicate a better fit.

Example

Scenario: Predicting test scores based on study hours.

Interpretation: An R² of about 0.96 means that roughly 96% of the variation in test scores is explained by study hours.

Python Code:

from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1], [2], [3]])
y = np.array([2, 3, 5])
model = LinearRegression()
model.fit(X, y)

r_squared = model.score(X, y)
print("R-squared:", r_squared)  # Output: R-squared: 0.9642857142857143



5. Prediction and Decision Making

Model Evaluation

To ensure a model's validity, use:

• Visualization: regression, residual, and distribution plots.
• Numerical measures: Mean Squared Error (MSE) and R-squared.
• Comparisons between candidate models.

Example: Predicting Car Price

Coefficient Interpretation

In this example model, an increase of 1 mpg decreases the predicted price by $821; this can be read directly from the fitted coefficient, as sketched below.
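A minimal sketch of reading the slope off a model, assuming lm is a LinearRegression already fitted on highway-mpg versus price:

slope = lm.coef_[0]   # e.g. about -821 in this example
print(f"Each extra mpg changes the predicted price by {slope:.0f} dollars")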

Potential Issues

Unrealistic predictions may indicate:

• The linear assumption is incorrect for the data.
• The model is extrapolating beyond the range of the training data.
• Important predictor variables are missing from the model.

Generating Predictions

import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data
X = np.array([[20], [30], [40]])
y = np.array([15000, 13000, 12000])

# Model training
model = LinearRegression()
model.fit(X, y)

# Generating a sequence for prediction
mpg_values = np.arange(1, 101, 1)  # From 1 to 100 with step 1
predicted_prices = model.predict(mpg_values.reshape(-1, 1))

# Example prediction for 30 mpg
predicted_price = model.predict([[30]])
print("Predicted Price:", predicted_price[0])

Explanation

np.arange(1, 101, 1) generates a sequence of mpg values from 1 to 100, and reshape(-1, 1) converts it into the column format scikit-learn expects for prediction.

Visualization Techniques

Plotting the predicted prices against the mpg sequence, together with the training points, shows whether the fitted line behaves sensibly across the whole input range.

Evaluating Models

Mean Squared Error (MSE)

Example MSEs:

Code Example:

from sklearn.metrics import mean_squared_error

# Actual and predicted values
actual = [15000, 13000, 12000]
predicted = model.predict(X)

# Calculate MSE
mse = mean_squared_error(actual, predicted)
print("MSE:", mse)

R-squared (R²)

Example R² Values:

# Calculate R-squared
r_squared = model.score(X, y)
print("R-squared:", r_squared) 

Model Comparison

When comparing candidate models, the one with the lower MSE and the higher R² generally fits the data better.

Conclusion

Combining visual checks with numerical measures such as MSE and R² gives a reliable basis for choosing a model and trusting its predictions.


Notes

Regression Plot

When it comes to simple linear regression, an excellent way to visualize the fit of our model is by using regression plots.

This plot will show a combination of scattered data points (a scatterplot), as well as the fitted linear regression line going through the data. This will give us a reasonable estimate of the relationship between the two variables, the strength of the correlation, as well as the direction (positive or negative correlation).

Residual Plot

A good way to visualize the variance of the data is to use a residual plot.

What is a residual?

The difference between the observed value (y) and the predicted value (ŷ) is called the residual (e). When we look at a regression plot, the residual is the distance from the data point to the fitted regression line.
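A minimal sketch of computing residuals, assuming a fitted estimator model and arrays X and y of observed data:

# Residuals: observed values minus the model's predicted values
residuals = y - model.predict(X)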

So what is a residual plot?

A residual plot is a graph that shows the residuals on the vertical y-axis and the independent variable on the horizontal x-axis.

What do we pay attention to when looking at a residual plot?

We look at the spread of the residuals:

• Randomly spread around the x-axis: a linear model is appropriate.
• Curved or otherwise patterned: a non-linear model may be a better fit.

Why is that? Randomly spread out residuals mean that the variance is constant, and thus the linear model is a good fit for this data.

Simple Linear Regression

One example of a Data Model that we will be using is Simple Linear Regression.

Simple Linear Regression is a method to help us understand the relationship between two variables:

• The predictor/independent variable (x)
• The response/dependent variable (y) that we want to predict

The result of Linear Regression is a linear function that predicts the response (dependent) variable as a function of the predictor (independent) variable.

Multiple Linear Regression

What if we want to predict car price using more than one variable?

If we want to use more variables in our model to predict car price, we can use Multiple Linear Regression. Multiple Linear Regression is very similar to Simple Linear Regression, but this method is used to explain the relationship between one continuous response (dependent) variable and two or more predictor (independent) variables. Most of the real-world regression models involve multiple predictors. We will illustrate the structure by using four predictor variables, but these results can generalize to any number.

Polynomial Regression

Polynomial regression is a particular case of the general linear regression model or multiple linear regression models.

We get non-linear relationships by squaring or setting higher-order terms of the predictor variables.
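For a single predictor, NumPy can fit such higher-order terms directly; a minimal sketch with made-up data points that roughly follow a cubic:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 7.9, 28.1, 63.8, 124.5])

# Fit a 3rd-order polynomial and evaluate it at the observed x values
coefficients = np.polyfit(x, y, 3)
y_hat = np.polyval(coefficients, x)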

Measures for In-Sample Evaluation

When evaluating our models, not only do we want to visualize the results, but we also need a quantitative measure to determine how accurate the model is.

Two very important measures that are often used in statistics to assess the accuracy of a model are:

• R-squared (R²)
• Mean Squared Error (MSE)

R-squared

R-squared, also known as the coefficient of determination, measures how closely the data aligns with the fitted regression line.

The R-squared value represents the percentage of variation in the response variable (y) that is explained by the linear model.
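As a minimal sketch of the underlying calculation, assuming NumPy arrays y of observed values and y_hat of predictions:

import numpy as np

# Residual sum of squares and total sum of squares
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)

# Fraction of the variation in y explained by the model
r_squared = 1 - ss_res / ss_tot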

Mean Squared Error (MSE)

The Mean Squared Error (MSE) measures the average of the squares of the errors: it averages the squared differences between the actual values (y) and the estimated values (ŷ).
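In NumPy terms, a minimal sketch assuming arrays y and y_hat:

import numpy as np

# Average of the squared differences between actual and predicted values
mse = np.mean((y - y_hat) ** 2)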


Cheat Sheet: Model Development
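The snippets below are minimal sketches rather than complete programs. They assume the objects used throughout this module: a pandas DataFrame df, a feature matrix X (or a 1-D array x for NumPy functions), a multi-feature matrix Z, a target y, and predictions Y_hat.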

Linear Regression
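Create a linear regression object:

from sklearn.linear_model import LinearRegression
lr = LinearRegression()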

Train Linear Regression Model
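Fit the model on the features and target:

lr.fit(X, y)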

Generate Output Predictions
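Predict target values for a feature matrix:

Y_hat = lr.predict(X)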

Identify the Coefficient and Intercept
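Read the fitted parameters:

coefficients = lr.coef_
intercept = lr.intercept_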

Residual Plot
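Plot residuals against the independent variable (assumes df has columns named 'feature' and 'target'):

import seaborn as sns
sns.residplot(x='feature', y='target', data=df)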

Distribution Plot
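Compare the distributions of actual and predicted values:

import seaborn as sns
sns.kdeplot(df['target'], color='red', label='Actual')
sns.kdeplot(Y_hat, color='blue', label='Predicted')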

Polynomial Regression
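Fit an nth-order polynomial to a single feature with NumPy (here n = 3):

import numpy as np
f = np.polyfit(x, y, 3)
p = np.poly1d(f)   # polynomial object that can be evaluated like a function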

Multi-variate Polynomial Regression
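Generate polynomial features from several variables with scikit-learn:

from sklearn.preprocessing import PolynomialFeatures
pr = PolynomialFeatures(degree=2)
Z_poly = pr.fit_transform(Z)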

Pipeline
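Chain scaling, polynomial features, and regression into one estimator:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('polynomial', PolynomialFeatures(degree=2)),
    ('model', LinearRegression())
])
pipe.fit(Z, y)
y_pipe = pipe.predict(Z)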

R² Value
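Two equivalent ways to get R²:

# From a fitted model
r_squared = lr.score(X, y)

# From actual and predicted values
from sklearn.metrics import r2_score
r_squared = r2_score(y, Y_hat)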

Mean Squared Error (MSE) Value
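Compute the MSE from actual and predicted values:

from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y, Y_hat)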