Random Forest in Machine Learning (Simple Explanation + Python Code)
Introduction
Machine learning uses models to make predictions: for example, whether a student will pass, whether a customer will buy a product, or what the price of a house will be.
One of the most powerful and easy-to-understand algorithms is Random Forest.
Even non-technical students can understand it easily because it works like a group decision.
In this tutorial, you will learn:
- What is Random Forest (simple explanation)
- How it works (step-by-step)
- Where we use it
- Python implementation (classification + regression)
- Feature importance
Let’s begin.
What is Random Forest?
Imagine you want to make a decision.
Instead of asking one person, you ask 100 people and take the majority vote.
That gives a more accurate answer, right?
This is exactly how Random Forest works.
- It creates many decision trees
- Each tree learns something different
- All trees vote on the final answer
Random Forest = Many Decision Trees + Voting
Because it combines many trees, the model becomes:
- More accurate
- More stable
- Less overfitted
Why is it called “Random” Forest?
Two things are chosen randomly:
- Random rows: each tree trains on a random sample of the data, drawn with replacement (a "bootstrap" sample).
- Random columns: each tree considers only a random subset of features (columns) when making its splits.
This randomness makes the trees different from each other, which reduces overfitting and improves accuracy. A small sketch of both ideas follows.
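Here is a tiny NumPy sketch of how one tree's random sample might be drawn. The dataset size and seed are made up, and note that scikit-learn actually re-draws the column subset at every split inside a tree, not once per tree; this is only to show the idea:
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_cols = 10, 4  # made-up dataset size

# Random rows: indices drawn WITH replacement (a bootstrap sample),
# so some rows appear twice and some not at all
row_idx = rng.integers(0, n_rows, size=n_rows)

# Random columns: a subset of features for this tree
col_idx = rng.choice(n_cols, size=2, replace=False)

print("Rows for this tree:   ", row_idx)
print("Columns for this tree:", col_idx)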
Where is Random Forest used?
Random Forest is used in:
- Student performance prediction
- Loan approval
- Medical diagnosis
- Fraud detection
- Email spam detection
- House price prediction
- Image classification
It works for both:
✔ Classification (Yes/No, Pass/Fail, Category)
✔ Regression (Numbers like price, marks, salary)
How Random Forest Works (Step-by-Step)
Step 1: Take a dataset
Example: marks, attendance, homework, etc.
Step 2: Create many small random data samples
Tree 1 → random rows
Tree 2 → random rows
Tree 3 → random rows
…
Step 3: Build one decision tree for each sample
Tree 1 → learns different rules
Tree 2 → learns different rules
Tree 3 → learns different rules
Step 4: Make prediction
Give the new data to every tree; each tree makes its own prediction.
Step 5: Combine results
- If classification → Majority vote
- If regression → Average
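Here is a minimal sketch of that combining step, with made-up predictions from three trees:
import numpy as np

# Made-up outputs from 3 trees for one new sample
class_votes = np.array([1, 0, 1])           # classification: predicted labels
tree_values = np.array([52.0, 48.0, 50.0])  # regression: predicted numbers

# Classification -> majority vote (the most common label wins)
print("Majority vote:", np.bincount(class_votes).argmax())  # 1

# Regression -> average of the trees' numbers
print("Average:", tree_values.mean())  # 50.0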
Python Implementation (Classification Example)
We will use the built-in Iris dataset (flower classification).
Step 1: Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
Step 2: Load dataset
iris = load_iris()
X = iris.data
y = iris.target
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)
Step 3: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
Step 4: Create Random Forest Model
rf_model = RandomForestClassifier(
    n_estimators=100,
    random_state=42
)
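Beyond n_estimators and random_state, a few other parameters control the randomness described earlier. This is just a sketch; the values are illustrative, not tuned recommendations:
rf_sketch = RandomForestClassifier(
    n_estimators=200,     # more trees: more stable, but slower to train
    max_features="sqrt",  # random subset of columns tried at each split
    bootstrap=True,       # random rows: each tree gets a bootstrap sample
    max_depth=None,       # None lets every tree grow fully
    random_state=42,
)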
Step 5: Train Model
rf_model.fit(X_train, y_train)
Step 6: Predict and Check Accuracy
y_pred = rf_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Feature Importance (Which column is important?)
Random Forest shows which features matter most.
import numpy as np
feature_importances = rf_model.feature_importances_
for name, importance in zip(iris.feature_names, feature_importances):
    print(f"{name}: {importance:.3f}")
Random Forest Regression (Predict numbers)
Example: Predict house price.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
# Dummy dataset: two made-up features per house (e.g. area and rooms)
X = np.array([
    [500, 1],
    [700, 2],
    [1000, 3],
    [1200, 3],
    [1500, 4],
    [1800, 4],
])
y = np.array([20, 30, 50, 55, 70, 80]) # prices in lakhs
# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
print("Actual:", y_test)
print("Predicted:", y_pred)
print("MSE:", mean_squared_error(y_test, y_pred))
Advantages of Random Forest
✔ Works well on both small and large datasets
✔ Handles missing values in some implementations (scikit-learn may require you to fill them in first)
✔ Reduces overfitting
✔ Gives high accuracy
✔ Works for both classification and regression
Disadvantages
- Slower to train and predict when there are many trees
- Harder to visualize and interpret than a single decision tree
Summary
- Random Forest is a group of many decision trees.
- Each tree sees random rows + random columns.
- Final prediction = majority vote or average.
- It gives stable and accurate results.
- Works for both classification and regression.
