Machine Learning Tutorial: Hyperparameter Tuning, Advanced Models & Overfitting/Underfitting
1. Introduction
In Machine Learning (ML), choosing the right settings (called hyperparameters) for a model is crucial for good performance. But if you’re not careful, you can end up with overfitting or underfitting. Let’s break it down.
2. What are Models in Machine Learning?
A model is like a function that learns patterns from data to make predictions.
Common models include:
- Linear Regression
- Decision Trees
- Random Forests
- Support Vector Machines (SVM)
- Neural Networks
3. Underfitting vs Overfitting
| Concept | Underfitting | Overfitting |
|---|---|---|
| Definition | The model is too simple | The model is too complex |
| Symptoms | Poor performance on training and test data | Good performance on training, poor on test |
| Cause | Not enough features or too little training | Too many parameters or not enough data |
| Example | Linear model on a curved dataset | Decision tree memorizing training data |
Goal: find the sweet spot, where the model performs well on both the training and test sets.
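To make the two failure modes concrete, here is a minimal sketch on synthetic curved data (the dataset and models are illustrative, not from a specific source): a straight line underfits the sine curve, while an unrestricted decision tree memorizes the training noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic curved data: y = sin(x) plus noise
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for name, model in [('Linear model (underfits)', LinearRegression()),
                    ('Unrestricted tree (overfits)', DecisionTreeRegressor())]:
    model.fit(X_train, y_train)
    # Underfitting: low score on both sets; overfitting: high train, lower test
    print(f'{name}: train R2 = {model.score(X_train, y_train):.2f}, '
          f'test R2 = {model.score(X_test, y_test):.2f}')
```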
4. What are Hyperparameters?
Hyperparameters are the settings you choose before training a model.
Examples:
- For a Decision Tree: `max_depth` (how deep the tree can go)
- For a Random Forest: `n_estimators` (how many trees)
- For a Neural Network: `learning_rate` (how large each weight update is) and `epochs` (how many passes over the training data)
Unlike model parameters (which are learned during training), hyperparameters must be set manually or tuned.
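A minimal sketch of the distinction (the dataset here is just for illustration): `max_depth` is fixed by you before training, while the tree's split rules are learned by `fit`.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hyperparameter: set by you before training
tree = DecisionTreeClassifier(max_depth=3)

# Parameters: split features and thresholds learned from the data during fit
tree.fit(X, y)
print(tree.tree_.node_count)  # size of the learned structure
```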
5. Hyperparameter Tuning
Goal: find the combination of hyperparameters that gives the best performance on unseen data. In practice, you tune on a validation set or with cross-validation, keeping the test set for a final check.
Methods:
1. Grid Search
Try every combination of hyperparameters in a predefined grid.
```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Example data; replace with your own features and labels
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Every combination in the grid is evaluated with 5-fold cross-validation
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10, 20],
}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
```
2. Random Search
Sample a fixed number of random combinations instead of trying them all; usually much faster than grid search when the search space is large.
```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Distributions to sample from, rather than a fixed grid
param_dist = {
    'n_estimators': randint(50, 200),
    'max_depth': randint(5, 20),
}
# Evaluate 10 random combinations, each with 5-fold cross-validation
random_search = RandomizedSearchCV(
    RandomForestClassifier(), param_dist, n_iter=10, cv=5, random_state=42)
random_search.fit(X_train, y_train)  # X_train, y_train as defined above
print(random_search.best_params_)
```
3. Bayesian Optimization (Advanced)
Uses the results of past trials to choose the next hyperparameters to try. Tools: Optuna, scikit-optimize, Hyperopt.
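A minimal sketch with Optuna (assuming the same X_train and y_train as above; the search ranges are illustrative):

```python
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Optuna suggests values; results of past trials guide the next suggestions
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 200),
        'max_depth': trial.suggest_int('max_depth', 5, 20),
    }
    model = RandomForestClassifier(**params, random_state=42)
    return cross_val_score(model, X_train, y_train, cv=5).mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=20)
print(study.best_params)
```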
6. Advanced Models (for Classification & Regression)
1. Random Forest
- Ensemble of Decision Trees
- Less likely to overfit than a single tree
2. Gradient Boosting (e.g., XGBoost, LightGBM)
- Builds trees sequentially to correct errors
- Often performs better than Random Forest
3. Support Vector Machines (SVM)
- Good for high-dimensional data
- Needs proper tuning (e.g., `kernel`, `C`, `gamma`)
4. Neural Networks
- Powerful for images, text, and more
- Requires large datasets and tuning many hyperparameters
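As a quick sketch of how these models can be compared on one dataset (using scikit-learn's built-in GradientBoostingClassifier as a stand-in for XGBoost/LightGBM, and the X_train/y_train from earlier):

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

models = {
    'Random Forest': RandomForestClassifier(random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    # SVMs are sensitive to feature scale, hence the scaler in the pipeline
    'SVM (RBF kernel)': make_pipeline(
        StandardScaler(), SVC(kernel='rbf', C=1.0, gamma='scale')),
}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f'{name}: mean accuracy = {scores.mean():.3f}')
```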
7. How to Detect Overfitting/Underfitting
Use learning curves: plot training and validation scores against training set size. A persistent gap (high training score, much lower validation score) suggests overfitting; low scores on both curves suggest underfitting.
```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# X, y: the full feature matrix and labels (e.g., the iris data loaded earlier)
train_sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(), X, y, cv=5, scoring='accuracy')

# Average the scores across the cross-validation folds
train_mean = np.mean(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)

plt.plot(train_sizes, train_mean, label='Training Score')
plt.plot(train_sizes, test_mean, label='Validation Score')
plt.xlabel('Training Set Size')
plt.ylabel('Accuracy')
plt.legend()
plt.title('Learning Curve')
plt.show()
```
8. Best Practices
- Split data into training, validation, and test sets (see the sketch after this list)
- Use cross-validation for more reliable results
- Always check for overfitting/underfitting
- Try different models and compare
- Use scaling/normalization where needed (especially for SVM, NN)
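A common way to get the three-way split mentioned in the first point (the 60/20/20 ratio here is just one reasonable choice):

```python
from sklearn.model_selection import train_test_split

# Hold out 20% as the final test set first...
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# ...then split the remainder into 60% train / 20% validation (0.25 of 80% = 20%)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42)
```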
9. Summary
| Concept | Key Idea |
|---|---|
| Hyperparameters | Settings you tune for better performance |
| Overfitting | Model memorizes the training data (too complex) |
| Underfitting | Model misses patterns (too simple) |
| Advanced Models | Random Forest, XGBoost, SVM, Neural Nets |
| Tuning Tools | GridSearchCV, RandomizedSearchCV, Optuna |