Random Forest in Machine Learning (Simple Explanation + Python Code)
Introduction
Machine learning uses models to make predictions: for example, whether a student will pass, whether a customer will buy a product, or what the price of a house will be.
One of the most powerful and easy-to-understand algorithms is Random Forest.
Even non-technical students can understand it easily because it works like a group decision.
In this tutorial, you will learn:
- What is Random Forest (simple explanation)
- How it works (step-by-step)
- Where we use it
- Python implementation (classification + regression)
- Feature importance
Let’s begin.
What is Random Forest?
Imagine you want to make a decision.
Instead of asking one person, you ask 100 people and take the majority vote.
That gives a more accurate answer, right?
This is exactly how Random Forest works.
- It creates many decision trees
- Each tree learns something different
- All trees vote on the final answer
Random Forest = Many Decision Trees + Voting
Because it combines many trees, the model becomes:
- More accurate
- More stable
- Less overfitted
Why is it called “Random” Forest?
Two things are chosen randomly:
- Random rows: each tree trains on a random sample of the data, drawn with replacement (a "bootstrap" sample).
- Random columns: each tree considers only a random subset of features (columns) when making its splits.
This randomness makes the trees different from each other, which reduces overfitting and improves accuracy. A small sketch of both ideas follows.
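Here is a tiny NumPy sketch of how one tree's random sample might be drawn. The dataset size and seed are made up, and note that scikit-learn actually re-draws the column subset at every split inside a tree, not once per tree; this is only to show the idea:
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_cols = 10, 4  # made-up dataset size

# Random rows: indices drawn WITH replacement (a bootstrap sample),
# so some rows appear twice and some not at all
row_idx = rng.integers(0, n_rows, size=n_rows)

# Random columns: a subset of features for this tree
col_idx = rng.choice(n_cols, size=2, replace=False)

print("Rows for this tree:   ", row_idx)
print("Columns for this tree:", col_idx)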
Where is Random Forest used?
Random Forest is used in:
- Student performance prediction
- Loan approval
- Medical diagnosis
- Fraud detection
- Email spam detection
- House price prediction
- Image classification
It works for both:
✔ Classification (Yes/No, Pass/Fail, Category)
✔ Regression (Numbers like price, marks, salary)
How Random Forest Works (Step-by-Step)
Step 1: Take a dataset
Example: marks, attendance, homework, etc.
Step 2: Create many small random data samples
Tree 1 → random rows
Tree 2 → random rows
Tree 3 → random rows
…
Step 3: Build one decision tree for each sample
Tree 1 → learns different rules
Tree 2 → learns different rules
Tree 3 → learns different rules
Step 4: Make prediction
Give the new data to every tree; each tree makes its own prediction.
Step 5: Combine results
- If classification → Majority vote
- If regression → Average
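Here is a minimal sketch of that combining step, with made-up predictions from three trees:
import numpy as np

# Made-up outputs from 3 trees for one new sample
class_votes = np.array([1, 0, 1])           # classification: predicted labels
tree_values = np.array([52.0, 48.0, 50.0])  # regression: predicted numbers

# Classification -> majority vote (the most common label wins)
print("Majority vote:", np.bincount(class_votes).argmax())  # 1

# Regression -> average of the trees' numbers
print("Average:", tree_values.mean())  # 50.0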
Python Implementation (Classification Example)
We will use the built-in Iris dataset (flower classification).
Step 1: Import libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
Step 2: Load dataset
iris = load_iris()
X = iris.data
y = iris.target
print("Shape of X:", X.shape)
print("Shape of y:", y.shape)
Step 3: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
Step 4: Create Random Forest Model
rf_model = RandomForestClassifier(
    n_estimators=100,
    random_state=42
)
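Beyond n_estimators and random_state, a few other parameters control the randomness described earlier. This is just a sketch; the values are illustrative, not tuned recommendations:
rf_sketch = RandomForestClassifier(
    n_estimators=200,     # more trees: more stable, but slower to train
    max_features="sqrt",  # random subset of columns tried at each split
    bootstrap=True,       # random rows: each tree gets a bootstrap sample
    max_depth=None,       # None lets every tree grow fully
    random_state=42,
)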
Step 5: Train Model
rf_model.fit(X_train, y_train)
Step 6: Predict and Check Accuracy
y_pred = rf_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Feature Importance (Which column is important?)
Random Forest shows which features matter most.
import numpy as np
feature_importances = rf_model.feature_importances_
for name, importance in zip(iris.feature_names, feature_importances):
    print(f"{name}: {importance:.3f}")
Random Forest Regression (Predict numbers)
Example: Predict house price.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
# Dummy dataset: two made-up features per house (e.g. area and rooms)
X = np.array([
    [500, 1],
    [700, 2],
    [1000, 3],
    [1200, 3],
    [1500, 4],
    [1800, 4],
])
y = np.array([20, 30, 50, 55, 70, 80]) # prices in lakhs
# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
print("Actual:", y_test)
print("Predicted:", y_pred)
print("MSE:", mean_squared_error(y_test, y_pred))
Advantages of Random Forest
✔ Works well on both small and large datasets
✔ Handles missing values in some implementations (scikit-learn may require you to fill them in first)
✔ Reduces overfitting
✔ Gives high accuracy
✔ Works for both classification and regression
Disadvantages
- Slower to train and predict when there are many trees
- Harder to visualize and interpret than a single decision tree
Summary
- Random Forest is a group of many decision trees.
- Each tree sees random rows + random columns.
- Final prediction = majority vote or average.
- It gives stable and accurate results.
- Works for both classification and regression.
