Machine Learning Tutorial: Hyperparameter Tuning, Advanced Models & Overfitting/Underfitting
1. Introduction
In Machine Learning (ML), choosing the right settings (called hyperparameters) for a model is crucial for good performance. But if you’re not careful, you can end up with overfitting or underfitting. Let’s break it down.
2. What are Models in Machine Learning?
A model is like a function that learns patterns from data to make predictions.
Common models include:
- Linear Regression
- Decision Trees
- Random Forests
- Support Vector Machines (SVM)
- Neural Networks
3. Underfitting vs Overfitting
| Concept | Underfitting | Overfitting |
|---|---|---|
| Definition | The model is too simple | The model is too complex |
| Symptoms | Poor performance on training and test data | Good performance on training, poor on test |
| Cause | Not enough features or too little training | Too many parameters or not enough data |
| Example | Linear model on a curved dataset | Decision tree memorizing training data |
Goal: find the sweet spot, where the model performs well on both the training and test sets.
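To make the two failure modes concrete, here is a minimal sketch on synthetic curved data (the dataset and models are illustrative, not from a specific source): a straight line underfits the sine curve, while an unrestricted decision tree memorizes the training noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic curved data: y = sin(x) plus noise
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for name, model in [('Linear model (underfits)', LinearRegression()),
                    ('Unrestricted tree (overfits)', DecisionTreeRegressor())]:
    model.fit(X_train, y_train)
    # Underfitting: low score on both sets; overfitting: high train, lower test
    print(f'{name}: train R2 = {model.score(X_train, y_train):.2f}, '
          f'test R2 = {model.score(X_test, y_test):.2f}')
```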
4. What are Hyperparameters?
Hyperparameters are the settings you choose before training a model.
Examples:
- For a Decision Tree: `max_depth` (how deep the tree can go)
- For a Random Forest: `n_estimators` (how many trees)
- For a Neural Network: `learning_rate` (how large each weight update is) and `epochs` (how many passes over the training data)
Unlike model parameters (which are learned during training), hyperparameters must be set manually or tuned.
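A minimal sketch of the distinction (the dataset here is just for illustration): `max_depth` is fixed by you before training, while the tree's split rules are learned by `fit`.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hyperparameter: set by you before training
tree = DecisionTreeClassifier(max_depth=3)

# Parameters: split features and thresholds learned from the data during fit
tree.fit(X, y)
print(tree.tree_.node_count)  # size of the learned structure
```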
5. Hyperparameter Tuning
Goal: find the combination of hyperparameters that gives the best performance on unseen data. In practice, you tune on a validation set or with cross-validation, keeping the test set for a final check.
Methods:
1. Grid Search
Try every combination of hyperparameters in a predefined grid.
```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Example data; replace with your own features and labels
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Every combination in the grid is evaluated with 5-fold cross-validation
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10, 20],
}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
```
2. Random Search
Sample a fixed number of random combinations instead of trying them all; usually much faster than grid search when the search space is large.
```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Distributions to sample from, rather than a fixed grid
param_dist = {
    'n_estimators': randint(50, 200),
    'max_depth': randint(5, 20),
}
# Evaluate 10 random combinations, each with 5-fold cross-validation
random_search = RandomizedSearchCV(
    RandomForestClassifier(), param_dist, n_iter=10, cv=5, random_state=42)
random_search.fit(X_train, y_train)  # X_train, y_train as defined above
print(random_search.best_params_)
```
3. Bayesian Optimization (Advanced)
Uses the results of past trials to choose the next hyperparameters to try. Tools: Optuna, scikit-optimize, Hyperopt.
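A minimal sketch with Optuna (assuming the same X_train and y_train as above; the search ranges are illustrative):

```python
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Optuna suggests values; results of past trials guide the next suggestions
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 200),
        'max_depth': trial.suggest_int('max_depth', 5, 20),
    }
    model = RandomForestClassifier(**params, random_state=42)
    return cross_val_score(model, X_train, y_train, cv=5).mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=20)
print(study.best_params)
```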
6. Advanced Models (for Classification & Regression)
1. Random Forest
- Ensemble of Decision Trees
- Less likely to overfit than a single tree
2. Gradient Boosting (e.g., XGBoost, LightGBM)
- Builds trees sequentially to correct errors
- Often performs better than Random Forest
3. Support Vector Machines (SVM)
- Good for high-dimensional data
- Needs proper tuning (e.g., `kernel`, `C`, `gamma`)
4. Neural Networks
- Powerful for images, text, and more
- Requires large datasets and tuning many hyperparameters
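As a quick sketch of how these models can be compared on one dataset (using scikit-learn's built-in GradientBoostingClassifier as a stand-in for XGBoost/LightGBM, and the X_train/y_train from earlier):

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

models = {
    'Random Forest': RandomForestClassifier(random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    # SVMs are sensitive to feature scale, hence the scaler in the pipeline
    'SVM (RBF kernel)': make_pipeline(
        StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='scale')),
}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f'{name}: mean accuracy = {scores.mean():.3f}')
```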
7. How to Detect Overfitting/Underfitting
Use learning curves: plot training and validation scores against training set size. A persistent gap (high training score, much lower validation score) suggests overfitting; low scores on both curves suggest underfitting.
```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# X, y: the full feature matrix and labels (e.g., the iris data loaded earlier)
train_sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(), X, y, cv=5, scoring='accuracy')

# Average the scores across the cross-validation folds
train_mean = np.mean(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)

plt.plot(train_sizes, train_mean, label='Training Score')
plt.plot(train_sizes, test_mean, label='Validation Score')
plt.xlabel('Training Set Size')
plt.ylabel('Accuracy')
plt.legend()
plt.title('Learning Curve')
plt.show()
```
8. Best Practices
- Split data into training, validation, and test sets (see the sketch after this list)
- Use cross-validation for more reliable results
- Always check for overfitting/underfitting
- Try different models and compare
- Use scaling/normalization where needed (especially for SVM, NN)
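A common way to get the three-way split mentioned in the first point (the 60/20/20 ratio here is just one reasonable choice):

```python
from sklearn.model_selection import train_test_split

# Hold out 20% as the final test set first...
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# ...then split the remainder into 60% train / 20% validation (0.25 of 80% = 20%)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42)
```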
9. Summary
| Concept | Key Idea |
|---|---|
| Hyperparameters | Settings you tune for better performance |
| Overfitting | Model memorizes the training data (too complex) |
| Underfitting | Model misses patterns (too simple) |
| Advanced Models | Random Forest, XGBoost, SVM, Neural Nets |
| Tuning Tools | GridSearchCV, RandomizedSearchCV, Optuna |