Machine Learning Tutorial: Hyperparameter Tuning, Advanced Models & Overfitting/Underfitting


1. Introduction

In Machine Learning (ML), choosing the right settings (called hyperparameters) for a model is crucial for good performance. But if you’re not careful, you can end up with overfitting or underfitting. Let’s break it down.


2. What are Models in Machine Learning?

A model is like a function that learns patterns from data to make predictions.

Common models include:

  • Linear Regression
  • Decision Trees
  • Random Forests
  • Support Vector Machines (SVM)
  • Neural Networks

3. Underfitting vs Overfitting

Concept    | Underfitting                                | Overfitting
Definition | The model is too simple                     | The model is too complex
Symptoms   | Poor performance on training and test data  | Good performance on training, poor on test
Cause      | Not enough features or too little training  | Too many parameters or not enough data
Example    | Linear model on a curved dataset            | Decision tree memorizing training data

Goal: find the sweet spot, with good performance on both the training and test sets.
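
To see both failure modes concretely, here is a minimal sketch comparing a very shallow and an unlimited-depth decision tree (the toy dataset and variable names are illustrative, not from the tutorial):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative toy dataset
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for depth in (1, None):  # a one-split stump vs. an unlimited-depth tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))

# Typical outcome: the depth-1 tree scores modestly on both sets (underfitting),
# while the unlimited tree scores ~1.0 on training but noticeably lower on test (overfitting).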


4. What are Hyperparameters?

Hyperparameters are the settings you choose before training a model.

Examples:

  • For a Decision Tree:
    • max_depth (how deep the tree can go)
  • For a Random Forest:
    • n_estimators (how many trees)
  • For a Neural Network:
    • learning_rate
    • epochs

Unlike model parameters (which are learned during training), hyperparameters must be set manually or tuned.
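
To make the distinction concrete, here is a minimal sketch with a Decision Tree (X_train and y_train are assumed to be defined):

from sklearn.tree import DecisionTreeClassifier

# Hyperparameter: chosen by you before training
model = DecisionTreeClassifier(max_depth=5)

# Parameters (the split features and thresholds inside the tree) are learned here
model.fit(X_train, y_train)  # X_train, y_train assumed defined

print(model.get_depth())  # depth actually learned; at most the max_depth you set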


5. Hyperparameter Tuning

Goal:

Find the combination of hyperparameters that performs best on unseen data. In practice, tune against a validation set or cross-validation scores, and keep the test set for a final, one-time evaluation.

Methods:

1. Grid Search

Try every combination of the values you specify. Exhaustive, but the number of fits grows multiplicatively with each added hyperparameter.

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Every combination is tried: 2 values x 3 values = 6 candidates, each cross-validated
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10, 20],
}

# cv=5: 5-fold cross-validation on the training data (X_train, y_train assumed defined)
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)

2. Random Search

Try a random sample of combinations drawn from the specified distributions. Usually much faster than grid search when the search space is large.

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint

# Each draw samples n_estimators from [50, 200) and max_depth from [5, 20)
param_dist = {
    'n_estimators': randint(50, 200),
    'max_depth': randint(5, 20),
}

# n_iter=10: only 10 random combinations are evaluated, each with 5-fold CV
random_search = RandomizedSearchCV(RandomForestClassifier(), param_dist, n_iter=10, cv=5)
random_search.fit(X_train, y_train)

print(random_search.best_params_)

3. Bayesian Optimization (Advanced)

Uses the results of previous trials to decide which hyperparameters to try next, concentrating the search on promising regions. Tools: Optuna, scikit-optimize, Hyperopt.
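
As an illustration, here is a minimal Optuna sketch for the same Random Forest search (assuming optuna is installed and X_train, y_train are defined):

import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Optuna proposes values; scores from past trials guide the next proposals
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 200),
        'max_depth': trial.suggest_int('max_depth', 5, 20),
    }
    model = RandomForestClassifier(**params)
    return cross_val_score(model, X_train, y_train, cv=5).mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=20)
print(study.best_params)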


6. Advanced Models (for Classification & Regression)

1. Random Forest

  • Ensemble of Decision Trees
  • Less likely to overfit than a single tree

2. Gradient Boosting (e.g., XGBoost, LightGBM)

  • Builds trees sequentially to correct errors
  • Often performs better than Random Forest (see the sketch below)
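
A minimal sketch using XGBoost's scikit-learn-style wrapper (assuming the xgboost package is installed and X_train, y_train, X_test, y_test are defined):

from xgboost import XGBClassifier

# Trees are built sequentially; each new tree corrects the ensemble's current errors
model = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))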

3. Support Vector Machines (SVM)

  • Good for high-dimensional data
  • Needs proper tuning (e.g., kernel, C, gamma); see the sketch below
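
Because SVMs are sensitive to feature scale, a common pattern is to pipeline scaling with the classifier. A minimal sketch (X_train, y_train assumed defined):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scale features first; SVM decision boundaries depend on distances between points
svm_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', C=1.0, gamma='scale')),
])
svm_clf.fit(X_train, y_train)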

4. Neural Networks

  • Powerful for images, text, and more
  • Requires large datasets and tuning of many hyperparameters (see the sketch below)
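
Staying within scikit-learn, a minimal sketch of a small neural network via MLPClassifier (X_train, y_train assumed defined; dedicated libraries such as PyTorch or TensorFlow are typical for larger problems):

from sklearn.neural_network import MLPClassifier

# learning_rate_init and max_iter (training epochs) are the hyperparameters mentioned above
mlp = MLPClassifier(hidden_layer_sizes=(64, 32),
                    learning_rate_init=0.001,
                    max_iter=300)
mlp.fit(X_train, y_train)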

7. How to Detect Overfitting/Underfitting

Use learning curves: plot the training and validation scores as the training set grows. If both curves plateau at a low score, the model is underfitting; if the training score stays high while the validation score lags well behind it, the model is overfitting.

from sklearn.model_selection import learning_curve
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
import numpy as np

# Train on increasingly large subsets of the data (X, y assumed defined)
train_sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(), X, y, cv=5, scoring='accuracy')

# Average the scores across the 5 cross-validation folds
train_mean = np.mean(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)

plt.plot(train_sizes, train_mean, label='Training Score')
plt.plot(train_sizes, test_mean, label='Validation Score')
plt.xlabel('Training Size')
plt.ylabel('Accuracy')
plt.legend()
plt.title('Learning Curve')
plt.show()

8. Best Practices

  • Split data into training, validation, and test sets (see the sketch after this list)
  • Use cross-validation for more reliable results
  • Always check for overfitting/underfitting
  • Try different models and compare
  • Use scaling/normalization where needed (especially for SVM, NN)
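
A minimal sketch of the three-way split from the first bullet (X, y assumed defined; the ratios are illustrative):

from sklearn.model_selection import train_test_split

# Hold out 20% of the data as the final test set
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Carve a validation set out of the remainder: 0.25 * 0.8 = 20% of the original data
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)
# Result: 60% train / 20% validation / 20% test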

9. Summary

Concept         | Key Idea
Hyperparameters | Settings you tune for better performance
Overfitting     | Model memorizes data (too complex)
Underfitting    | Model misses patterns (too simple)
Advanced Models | Random Forest, XGBoost, SVM, Neural Nets
Tuning Tools    | GridSearchCV, RandomizedSearchCV, Optuna