1. Introduction to Data Frames
A data frame in R is a two-dimensional, table-like structure in which each column holds values of a single type, but different columns can hold different types (e.g., numeric, character, logical). Each row represents an observation. Data frames are ideal for storing datasets and are widely used in statistical analysis and data science.
2. Creating a Data Frame
Data frames can be created using the data.frame() function. Below is a step-by-step example.
Example: Creating a Simple Data Frame
Suppose we want to create a data frame to store information about employees, including their names, ages, and salaries.
# Step 1: Define vectors for each column
names <- c("Alice", "Bob", "Charlie", "Diana")
ages <- c(25, 30, 28, 35)
salaries <- c(50000, 60000, 55000, 70000)
# Step 2: Combine vectors into a data frame
employees <- data.frame(Name = names, Age = ages, Salary = salaries)
# Step 3: Display the data frame
print(employees)
Output:
Name Age Salary
1 Alice 25 50000
2 Bob 30 60000
3 Charlie 28 55000
4 Diana 35 70000
Explanation:
- Each vector (names, ages, salaries) becomes a column in the data frame.
- The data.frame() function combines these vectors, which must have the same length (or lengths that recycle evenly to the longest one); otherwise it raises an error.
- Column names are specified with name = value arguments (e.g., Name = names).
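The length rule can be sketched as follows. The vectors ids, flags, and scores are hypothetical and not part of the employee example; the sketch assumes only base R. A shorter vector is recycled when its length divides the longest one evenly, and data.frame() stops with an error otherwise.

```r
# Hypothetical vectors for illustration only
ids <- 1:4
flags <- c(TRUE, FALSE)   # length 2 divides 4 evenly, so it is recycled
scores <- c(10, 20, 30)   # length 3 does not divide 4

# Works: flags is repeated twice to fill four rows
recycled <- data.frame(ID = ids, Flag = flags)
print(recycled)

# Fails: tryCatch() captures the error message instead of stopping the script
mismatch <- tryCatch(
  data.frame(ID = ids, Score = scores),
  error = function(e) conditionMessage(e)
)
print(mismatch)
```

Silent recycling can hide mistakes, so it is usually safer to build every column at its full length.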
3. Inspecting a Data Frame
Understanding the structure and content of a data frame is crucial. R provides several functions for this purpose.
Example: Inspecting the Employee Data Frame
# Step 1: View the structure of the data frame
str(employees)
# Step 2: Display summary statistics
summary(employees)
# Step 3: View the first few rows
head(employees, n = 2)
# Step 4: Check dimensions (rows and columns)
dim(employees)
Output:
# str(employees)
'data.frame': 4 obs. of 3 variables:
$ Name : chr "Alice" "Bob" "Charlie" "Diana"
$ Age : num 25 30 28 35
$ Salary: num 50000 60000 55000 70000
# summary(employees)
     Name                Age            Salary     
 Length:4           Min.   :25.00   Min.   :50000  
 Class :character   1st Qu.:27.25   1st Qu.:53750  
 Mode  :character   Median :29.00   Median :57500  
                    Mean   :29.50   Mean   :58750  
                    3rd Qu.:31.25   3rd Qu.:62500  
                    Max.   :35.00   Max.   :70000  
# head(employees, n = 2)
Name Age Salary
1 Alice 25 50000
2 Bob 30 60000
# dim(employees)
[1] 4 3
Explanation:
- str(): Shows the structure, including column names, data types, and a preview of values.
- summary(): Provides statistical summaries (e.g., min, max, mean) for numeric columns; for character columns it reports only the length, class, and mode.
- head(): Displays the first n rows (default is 6).
- dim(): Returns the number of rows and columns.
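A few other base R helpers answer the same questions one at a time. The sketch below rebuilds the employee table under the hypothetical name emp_demo so that it runs on its own.

```r
# Hypothetical copy of the employee table, so the snippet is self-contained
emp_demo <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Diana"),
  Age = c(25, 30, 28, 35),
  Salary = c(50000, 60000, 55000, 70000),
  stringsAsFactors = FALSE
)

n_rows <- nrow(emp_demo)                 # number of rows
n_cols <- ncol(emp_demo)                 # number of columns
col_names <- names(emp_demo)             # column names as a character vector
col_classes <- sapply(emp_demo, class)   # class of each column
last_two <- tail(emp_demo, n = 2)        # counterpart of head(): last rows

print(col_classes)
print(last_two)
```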
4. Accessing and Modifying Data Frames
Data frames can be accessed and modified using indexing, column names, or functions.
Example: Accessing and Modifying Data
# Step 1: Access a specific column
employee_names <- employees$Name
print(employee_names)
# Step 2: Access a specific row
third_employee <- employees[3, ]
print(third_employee)
# Step 3: Access a specific cell
charlie_salary <- employees[3, "Salary"]
print(charlie_salary)
# Step 4: Modify a cell
employees[2, "Salary"] <- 65000
print(employees)
# Step 5: Add a new column
employees$Department <- c("HR", "IT", "Marketing", "Finance")
print(employees)
# Step 6: Remove a column
employees$Department <- NULL
print(employees)
Output:
# employee_names
[1] "Alice" "Bob" "Charlie" "Diana"
# third_employee
Name Age Salary
3 Charlie 28 55000
# charlie_salary
[1] 55000
# After modifying Bob's salary
Name Age Salary
1 Alice 25 50000
2 Bob 30 65000
3 Charlie 28 55000
4 Diana 35 70000
# After adding Department column
Name Age Salary Department
1 Alice 25 50000 HR
2 Bob 30 65000 IT
3 Charlie 28 55000 Marketing
4 Diana 35 70000 Finance
# After removing Department column
Name Age Salary
1 Alice 25 50000
2 Bob 30 65000
3 Charlie 28 55000
4 Diana 35 70000
Explanation:
- Use $ to access a column by name (e.g., employees$Name).
- Use [row, column] for specific rows, columns, or cells.
- Modify values by assigning new data to specific indices.
- Add a column by assigning a new vector to a new column name.
- Remove a column by setting it to NULL.
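One detail worth seeing in code is what type each access form returns. This is a minimal sketch on a hypothetical table emp_demo; the Monthly column is also an illustrative addition, not part of the tutorial's data.

```r
# Hypothetical copy of the employee table
emp_demo <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Diana"),
  Age = c(25, 30, 28, 35),
  Salary = c(50000, 60000, 55000, 70000),
  stringsAsFactors = FALSE
)

ages_vec <- emp_demo[["Age"]]               # [[ ]] returns a plain vector
ages_vec2 <- emp_demo[, "Age"]              # a single column is simplified to a vector
ages_df <- emp_demo[, "Age", drop = FALSE]  # drop = FALSE keeps a one-column data frame

# A new column can be computed from an existing one
emp_demo$Monthly <- emp_demo$Salary / 12

print(class(ages_vec))
print(class(ages_df))
print(emp_demo)
```

Using drop = FALSE is helpful when later code expects a data frame regardless of how many columns were selected.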
5. Subsetting and Filtering Data Frames
Subsetting allows you to extract specific rows or columns based on conditions.
Example: Filtering Employees
# Step 1: Subset employees older than 28
senior_employees <- employees[employees$Age > 28, ]
print(senior_employees)
# Step 2: Subset specific columns
name_salary <- employees[, c("Name", "Salary")]
print(name_salary)
# Step 3: Use subset() function for filtering
high_earners <- subset(employees, Salary > 60000)
print(high_earners)
Output:
# senior_employees
Name Age Salary
2 Bob 30 65000
4 Diana 35 70000
# name_salary
Name Salary
1 Alice 50000
2 Bob 65000
3 Charlie 55000
4 Diana 70000
# high_earners
Name Age Salary
2 Bob 30 65000
4 Diana 35 70000
Explanation:
- Use logical conditions within [] to filter rows (e.g., employees$Age > 28).
- Select specific columns by passing a vector of column names.
- The subset() function provides a more readable way to filter and select columns.
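The two filtering styles differ when the condition evaluates to NA. The sketch below uses a hypothetical table emp_demo with one missing age to show the difference, and also combines two conditions with &.

```r
# Hypothetical table with a missing Age for Charlie
emp_demo <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Diana"),
  Age = c(25, 30, NA, 35),
  Salary = c(50000, 60000, 55000, 70000),
  stringsAsFactors = FALSE
)

# Logical indexing keeps an all-NA row where the condition is NA
by_index <- emp_demo[emp_demo$Age > 28, ]

# which() returns only the positions where the condition is TRUE
by_which <- emp_demo[which(emp_demo$Age > 28), ]

# subset() also drops NA conditions; conditions are combined with &
by_subset <- subset(emp_demo, Age > 28 & Salary < 70000, select = c(Name, Salary))

print(by_index)
print(by_which)
print(by_subset)
```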
6. Merging Data Frames
Data frames can be merged to combine information from multiple sources.
Example: Merging Employee Data
Suppose we have another data frame with employee departments.
# Step 1: Create a department data frame
departments <- data.frame(
Name = c("Alice", "Bob", "Charlie", "Eve"),
Department = c("HR", "IT", "Marketing", "Sales")
)
# Step 2: Merge with employees data frame
merged_data <- merge(employees, departments, by = "Name", all = TRUE)
print(merged_data)
Output:
Name Age Salary Department
1 Alice 25 50000 HR
2 Bob 30 65000 IT
3 Charlie 28 55000 Marketing
4 Diana 35 70000 <NA>
5 Eve NA NA Sales
Explanation:
- The merge() function combines data frames based on a common column (by = "Name").
- all = TRUE performs a full outer join, keeping all rows from both data frames.
- Missing values are filled with NA.
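The other join types use the all.x and all.y arguments of merge(). The sketch below uses two small hypothetical tables, left_tbl and right_tbl, rather than the tutorial's objects.

```r
# Hypothetical tables sharing the key column Name
left_tbl <- data.frame(
  Name = c("Alice", "Bob", "Diana"),
  Age = c(25, 30, 35),
  stringsAsFactors = FALSE
)
right_tbl <- data.frame(
  Name = c("Alice", "Bob", "Eve"),
  Department = c("HR", "IT", "Sales"),
  stringsAsFactors = FALSE
)

inner_join <- merge(left_tbl, right_tbl, by = "Name")                # matching rows only
left_join <- merge(left_tbl, right_tbl, by = "Name", all.x = TRUE)   # keep all of left_tbl
right_join <- merge(left_tbl, right_tbl, by = "Name", all.y = TRUE)  # keep all of right_tbl

print(inner_join)
print(left_join)
print(right_join)
```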
7. Handling Missing Values
Missing values (NA) can affect analysis. R provides functions to handle them.
Example: Managing Missing Values
# Step 1: Identify missing values
is.na(merged_data)
# Step 2: Remove rows with missing values
complete_data <- na.omit(merged_data)
print(complete_data)
# Step 3: Replace missing values in Salary with mean
merged_data$Salary[is.na(merged_data$Salary)] <- mean(merged_data$Salary, na.rm = TRUE)
print(merged_data)
Output:
# is.na(merged_data)
Name Age Salary Department
[1,] FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE TRUE
[5,] FALSE TRUE TRUE FALSE
# complete_data
Name Age Salary Department
1 Alice 25 50000 HR
2 Bob 30 65000 IT
3 Charlie 28 55000 Marketing
# merged_data (after replacing NA in Salary)
Name Age Salary Department
1 Alice 25 50000 HR
2 Bob 30 65000 IT
3 Charlie 28 55000 Marketing
4 Diana 35 70000 <NA>
5 Eve NA 60000 Sales
Explanation:
- is.na() identifies missing values.
- na.omit() removes rows with any NA values.
- Replace NA values in a column using conditional indexing and functions like mean().
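Before removing or replacing anything, it often helps to count the missing values per column. This is a minimal sketch on a hypothetical table na_demo; the "Unknown" label is an illustrative choice, not a convention from the tutorial.

```r
# Hypothetical table with one missing Salary and one missing Department
na_demo <- data.frame(
  Name = c("Alice", "Diana", "Eve"),
  Salary = c(50000, 70000, NA),
  Department = c("HR", NA, "Sales"),
  stringsAsFactors = FALSE
)

na_per_column <- colSums(is.na(na_demo))            # count of NA values in each column
complete_rows <- na_demo[complete.cases(na_demo), ] # rows with no NA at all

# Replace missing text values with a placeholder label
na_demo$Department[is.na(na_demo$Department)] <- "Unknown"

print(na_per_column)
print(complete_rows)
print(na_demo)
```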
8. Sorting Data Frames
Sorting organizes data for better analysis.
Example: Sorting Employees
# Step 1: Sort by Age (ascending)
sorted_by_age <- employees[order(employees$Age), ]
print(sorted_by_age)
# Step 2: Sort by Salary (descending)
sorted_by_salary <- employees[order(-employees$Salary), ]
print(sorted_by_salary)
Output:
# sorted_by_age
Name Age Salary
1 Alice 25 50000
3 Charlie 28 55000
2 Bob 30 65000
4 Diana 35 70000
# sorted_by_salary
Name Age Salary
4 Diana 35 70000
2 Bob 30 65000
3 Charlie 28 55000
1 Alice 25 50000
Explanation:
- The order() function generates indices for sorting.
- Use - for descending order on numeric columns; for other types, use order(..., decreasing = TRUE).
- To sort by multiple columns, pass them to order() in priority order.
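Sorting by several keys can be sketched as follows. The table team and its Dept column are hypothetical, chosen so that ties on the first key are visible.

```r
# Hypothetical table with a grouping column
team <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Diana"),
  Dept = c("IT", "HR", "IT", "HR"),
  Salary = c(50000, 60000, 55000, 70000),
  stringsAsFactors = FALSE
)

# Department ascending, then salary descending within each department
by_dept_salary <- team[order(team$Dept, -team$Salary), ]

# Character columns cannot be negated; use decreasing = TRUE instead
by_name_desc <- team[order(team$Name, decreasing = TRUE), ]

print(by_dept_salary)
print(by_name_desc)
```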
9. Aggregating Data
Aggregation summarizes data, such as calculating group means.
Example: Aggregating by Department
This example reuses merged_data from Section 7, after the missing Salary value was replaced with the mean.
# Step 1: Calculate mean salary by department
avg_salary <- aggregate(Salary ~ Department, data = merged_data, mean, na.rm = TRUE)
print(avg_salary)
Output:
Department Salary
1 HR 50000
2 IT 65000
3 Marketing 55000
4 Sales 60000
Explanation:
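- The aggregate() function groups data by Department and applies mean to Salary.
- na.rm = TRUE is passed on to mean() so that NA values are ignored during calculation.
- The formula interface also drops rows with NA in Department or Salary by default, which is why Diana, who has no department, does not appear in the result.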
10. Exporting and Importing Data Frames
Data frames can be saved to or loaded from files.
Example: Exporting and Importing
# Step 1: Export to CSV
write.csv(employees, "employees.csv", row.names = FALSE)
# Step 2: Import from CSV
imported_data <- read.csv("employees.csv")
print(imported_data)
Output:
# imported_data
Name Age Salary
1 Alice 25 50000
2 Bob 30 65000
3 Charlie 28 55000
4 Diana 35 70000
Explanation:
- write.csv() saves the data frame to a CSV file.
- read.csv() loads a CSV file into a data frame.
- row.names = FALSE prevents writing row indices to the file.
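A round trip through CSV preserves the values but not necessarily the column types, because read.csv() infers types again from the text. The sketch below writes to a temporary file (csv_path is a hypothetical name) so it does not touch the working directory.

```r
# Hypothetical copy of the employee table
emp_demo <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Diana"),
  Age = c(25, 30, 28, 35),
  Salary = c(50000, 60000, 55000, 70000),
  stringsAsFactors = FALSE
)

csv_path <- tempfile(fileext = ".csv")   # temporary file, removed below
write.csv(emp_demo, csv_path, row.names = FALSE)
round_trip <- read.csv(csv_path, stringsAsFactors = FALSE)

# Whole numbers stored as doubles come back as integers
print(class(emp_demo$Age))
print(class(round_trip$Age))

unlink(csv_path)   # clean up the temporary file
```

If exact types matter, the colClasses argument of read.csv() lets you state them explicitly.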
11. Practical Application: Analyzing a Real Dataset
Let’s apply these skills to the built-in mtcars dataset.
Example: Analyzing mtcars
# Step 1: Load the dataset
data(mtcars)
# Step 2: Inspect the data
str(mtcars)
head(mtcars, n = 3)
# Step 3: Filter cars with mpg > 20
efficient_cars <- subset(mtcars, mpg > 20, select = c("mpg", "hp", "wt"))
print(efficient_cars)
# Step 4: Calculate average mpg by number of cylinders
avg_mpg <- aggregate(mpg ~ cyl, data = mtcars, mean)
print(avg_mpg)
Output:
# str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
...
# head(mtcars, n = 3)
               mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
# efficient_cars
                mpg  hp    wt
Mazda RX4      21.0 110 2.620
Mazda RX4 Wag  21.0 110 2.875
Datsun 710     22.8  93 2.320
Hornet 4 Drive 21.4 110 3.215
Merc 240D      24.4  62 3.190
Merc 230       22.8  95 3.150
Fiat 128       32.4  66 2.200
Honda Civic    30.4  52 1.615
Toyota Corolla 33.9  65 1.835
Toyota Corona  21.5  97 2.465
Fiat X1-9      27.3  66 1.935
Porsche 914-2  26.0  91 2.140
Lotus Europa   30.4 113 1.513
Volvo 142E     21.4 109 2.780
# avg_mpg
cyl mpg
1 4 26.66364
2 6 19.74286
3 8 15.10000
Explanation:
- mtcars is a built-in dataset with car performance metrics.
- We inspect, filter, and aggregate to extract insights, demonstrating real-world application.
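A quick way to confirm that a filter did what was intended is to check the result programmatically instead of reading the printout. The sketch below repeats the mpg > 20 filter under the hypothetical name efficient and adds a few checks.

```r
data(mtcars)

# Same filter as above, stored under a separate name
efficient <- subset(mtcars, mpg > 20, select = c(mpg, hp, wt))

n_efficient <- nrow(efficient)         # how many cars passed the filter
all_above <- all(efficient$mpg > 20)   # TRUE if no row violates the condition
cyl_counts <- table(mtcars$cyl)        # number of cars per cylinder count

# The three most fuel-efficient cars among those selected
top_three <- head(efficient[order(-efficient$mpg), ], n = 3)

print(n_efficient)
print(all_above)
print(cyl_counts)
print(top_three)
```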
12. Best Practices
- Use Meaningful Column Names: Ensure clarity in analysis.
- Check Data Types: Use str() to verify column types match expectations.
- Handle Missing Values: Address NA values before analysis.
- Document Code: Comment your code for reproducibility.
- Validate Data: Check for outliers or inconsistencies.
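Several of these checks can be bundled into a small helper. The function validate_employees() below is a hypothetical sketch, not a standard R function, and the rules it applies (required columns, numeric types, no negative salaries) are assumptions chosen for the employee example.

```r
# Hypothetical helper: checks structure, then reports missing and suspicious values
validate_employees <- function(df) {
  # Stop early if the basic structure is wrong
  stopifnot(
    is.data.frame(df),
    all(c("Name", "Age", "Salary") %in% names(df)),
    is.numeric(df$Age),
    is.numeric(df$Salary)
  )
  list(
    missing = colSums(is.na(df)),               # NA count per column
    negative_salary = which(df$Salary < 0)      # row positions with negative salaries
  )
}

# Hypothetical data with one missing age and one negative salary
check_demo <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, NA, 40),
  Salary = c(50000, -10, 60000),
  stringsAsFactors = FALSE
)

report <- validate_employees(check_demo)
print(report)
```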
13. Assignments
To reinforce your understanding of data frames in R, complete the following assignments. Each task builds on the concepts covered in this tutorial.
Assignment 1: Creating and Inspecting a Data Frame
- Create a data frame called students with the following columns: StudentID (1, 2, 3, 4, 5), Name (“John”, “Emma”, “Liam”, “Olivia”, “Noah”), Grade (85, 92, 78, 95, 88).
- Display the structure of the data frame using str().
- Print a summary of the data frame using summary().
- Show the first three rows using head().
Expected Output:
- A data frame with 5 rows and 3 columns.
- Structure showing column types (numeric for StudentID and Grade, character for Name).
- Summary with statistical details for numeric columns.
- First three rows of the data frame.
Assignment 2: Modifying and Subsetting
- Using the students data frame, increase Emma’s grade by 5 points.
- Add a new column called Pass with logical values (TRUE if Grade >= 80, FALSE otherwise).
- Subset the data frame to show only students with grades above 90.
- Extract only the Name and Grade columns for all students.
Expected Output:
- Modified data frame with Emma’s updated grade.
- New Pass column with logical values.
- Subset with students having grades > 90 (Emma and Olivia).
- Data frame with only Name and Grade columns.
Assignment 3: Merging and Handling Missing Values
- Create a new data frame called courses with columns: StudentID (1, 2, 3, 6), Course (“Math”, “Science”, “History”, “English”).
- Merge students and courses by StudentID using a left join (keep all students).
- Identify missing values in the merged data frame.
- Replace missing Course values with “Not Enrolled”.
- Remove rows with missing Grade values.
Expected Output:
- Merged data frame with all students, some with NA for Course.
- Logical matrix showing NA values.
- Merged data frame with “Not Enrolled” for missing Course values.
- Data frame with no missing Grade values.
Assignment 4: Sorting and Aggregating
- Sort the students data frame by Grade in descending order.
- Using the merged data frame from Assignment 3 (after handling missing values), calculate the average grade by Course.
- Save the sorted students data frame to a CSV file named “students_sorted.csv”.
Expected Output:
- Sorted students data frame with highest grades first.
- Data frame with average grades for each course (including “Not Enrolled”).
- Confirmation that the CSV file was created (check your working directory).
Assignment 5: Analyzing a Built-in Dataset
- Load the iris dataset using data(iris).
- Filter the dataset to include only flowers with Sepal.Length > 5.5.
- Calculate the average Petal.Length by Species.
- Export the filtered data frame to a CSV file named “iris_filtered.csv”.
Expected Output:
- Filtered iris data frame with Sepal.Length > 5.5.
- Data frame with average Petal.Length for each species.
- Confirmation that the CSV file was created.
14. Conclusion
Data frames are a cornerstone of data analysis in R. This tutorial covered creating, inspecting, manipulating, and analyzing data frames with practical examples. The assignments provide hands-on practice to solidify your skills. By mastering these techniques, you can efficiently handle datasets and perform robust statistical analyses.