1. Introduction to Data Frames
A data frame in R is a two-dimensional, table-like structure where each column can contain different data types (e.g., numeric, character, logical), and each row represents an observation. Data frames are ideal for storing datasets and are widely used in statistical analysis and data science.
2. Creating a Data Frame
Data frames can be created using the data.frame() function. Below is a step-by-step example.
Example: Creating a Simple Data Frame
Suppose we want to create a data frame to store information about employees, including their names, ages, and salaries.
# Step 1: Define vectors for each column
names <- c("Alice", "Bob", "Charlie", "Diana")
ages <- c(25, 30, 28, 35)
salaries <- c(50000, 60000, 55000, 70000)
# Step 2: Combine vectors into a data frame
employees <- data.frame(Name = names, Age = ages, Salary = salaries)
# Step 3: Display the data frame
print(employees)
Output:
Name Age Salary
1 Alice 25 50000
2 Bob 30 60000
3 Charlie 28 55000
4 Diana 35 70000
Explanation:
- Each vector (names, ages, salaries) becomes a column in the data frame.
- The data.frame() function combines these vectors, which must all have the same length (or a length that can be recycled to it); otherwise it raises an error.
- Column names are specified with the name = value syntax (e.g., Name = names).
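As a hedged illustration of the length requirement described above, the sketch below rebuilds the employees data frame, shows a single value being recycled to every row, and then tries to combine vectors whose lengths do not match. The names emp_names, short_ages, with_flag, and result are hypothetical and introduced only for this example.

```r
# Rebuild the example data frame (emp_names is a hypothetical variable name)
emp_names <- c("Alice", "Bob", "Charlie", "Diana")
ages <- c(25, 30, 28, 35)
salaries <- c(50000, 60000, 55000, 70000)
employees <- data.frame(Name = emp_names, Age = ages, Salary = salaries)

# A length-one value is recycled to all four rows
with_flag <- data.frame(Name = emp_names, Active = TRUE)
print(with_flag)

# Hypothetical vector with only three values (one too few)
short_ages <- c(25, 30, 28)

# data.frame() signals an error when the lengths cannot be recycled;
# tryCatch() captures the error message instead of stopping the script
result <- tryCatch(
  data.frame(Name = emp_names, Age = short_ages),
  error = function(e) conditionMessage(e)
)
print(result)
```

Capturing the message with tryCatch() is only a convenience for demonstration; in an interactive session the error would simply be printed.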
3. Inspecting a Data Frame
Understanding the structure and content of a data frame is crucial. R provides several functions for this purpose.
Example: Inspecting the Employee Data Frame
# Step 1: View the structure of the data frame
str(employees)
# Step 2: Display summary statistics
summary(employees)
# Step 3: View the first few rows
head(employees, n = 2)
# Step 4: Check dimensions (rows and columns)
dim(employees)
Output:
# str(employees)
'data.frame': 4 obs. of 3 variables:
$ Name : chr "Alice" "Bob" "Charlie" "Diana"
$ Age : num 25 30 28 35
$ Salary: num 50000 60000 55000 70000
# summary(employees)
Name Age Salary
Length:4 Min. :25.00 Min. :50000
Class :character 1st Qu.:27.25 1st Qu.:53750
Mode :character Median :29.00 Median :57500
Mean :29.50 Mean :58750
3rd Qu.:31.25 3rd Qu.:62500
Max. :35.00 Max. :70000
# head(employees, n = 2)
Name Age Salary
1 Alice 25 50000
2 Bob 30 60000
# dim(employees)
[1] 4 3
Explanation:
- str(): Shows the structure, including column names, data types, and a preview of values.
- summary(): Provides statistical summaries (e.g., min, max, mean) for numeric columns; for character columns it reports only the length, class, and mode.
- head(): Displays the first n rows (default is 6).
- dim(): Returns the number of rows and columns.
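The functions above can be complemented by a few other base R helpers. The following is a minimal sketch, assuming the same employees data as in the example; the variable names on the left-hand side (n_rows, col_classes, and so on) are hypothetical.

```r
# Same example data; stringsAsFactors = FALSE keeps Name as character
# in older R versions too
employees <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Diana"),
  Age = c(25, 30, 28, 35),
  Salary = c(50000, 60000, 55000, 70000),
  stringsAsFactors = FALSE
)

# Number of rows and number of columns, individually
n_rows <- nrow(employees)
n_cols <- ncol(employees)

# Column names as a character vector
col_names <- names(employees)

# Class of each column, returned as a named character vector
col_classes <- sapply(employees, class)

# Last two rows, the counterpart of head()
last_rows <- tail(employees, n = 2)

print(n_rows)
print(n_cols)
print(col_names)
print(col_classes)
print(last_rows)
```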
4. Accessing and Modifying Data Frames
Data frames can be accessed and modified using indexing, column names, or functions.
Example: Accessing and Modifying Data
# Step 1: Access a specific column
employee_names <- employees$Name
print(employee_names)
# Step 2: Access a specific row
third_employee <- employees[3, ]
print(third_employee)
# Step 3: Access a specific cell
charlie_salary <- employees[3, "Salary"]
print(charlie_salary)
# Step 4: Modify a cell
employees[2, "Salary"] <- 65000
print(employees)
# Step 5: Add a new column
employees$Department <- c("HR", "IT", "Marketing", "Finance")
print(employees)
# Step 6: Remove a column
employees$Department <- NULL
print(employees)
Output:
# employee_names
[1] "Alice" "Bob" "Charlie" "Diana"
# third_employee
Name Age Salary
3 Charlie 28 55000
# charlie_salary
[1] 55000
# After modifying Bob's salary
Name Age Salary
1 Alice 25 50000
2 Bob 30 65000
3 Charlie 28 55000
4 Diana 35 70000
# After adding Department column
Name Age Salary Department
1 Alice 25 50000 HR
2 Bob 30 65000 IT
3 Charlie 28 55000 Marketing
4 Diana 35 70000 Finance
# After removing Department column
Name Age Salary
1 Alice 25 50000
2 Bob 30 65000
3 Charlie 28 55000
4 Diana 35 70000
Explanation:
- Use $ to access a column by name (e.g., employees$Name).
- Use [row, column] for specific rows, columns, or cells.
- Modify values by assigning new data to specific indices.
- Add a column by assigning a new vector to a new column name.
- Remove a column by setting it to NULL.
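One detail the access rules above leave implicit is whether the result is a vector or a data frame. The sketch below, which assumes the original employees data, contrasts the two and then modifies several cells at once with a logical condition. The 10% raise and the variable names are hypothetical and used only for illustration.

```r
employees <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Diana"),
  Age = c(25, 30, 28, 35),
  Salary = c(50000, 60000, 55000, 70000)
)

# Single brackets with a column name keep the data frame structure
salary_df <- employees["Salary"]

# Double brackets (like $) return the underlying vector
salary_vec <- employees[["Salary"]]

# [row, column] with one column drops to a vector by default;
# drop = FALSE keeps a one-column data frame
salary_kept <- employees[, "Salary", drop = FALSE]

# Modify several cells at once: hypothetical 10% raise for everyone under 30
young <- employees$Age < 30
employees$Salary[young] <- employees$Salary[young] * 1.10
print(employees)
```

Keeping drop = FALSE in mind tends to matter most inside functions, where code may otherwise receive a vector when it expects a data frame.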
5. Subsetting and Filtering Data Frames
Subsetting allows you to extract specific rows or columns based on conditions.
Example: Filtering Employees
# Step 1: Subset employees older than 28
senior_employees <- employees[employees$Age > 28, ]
print(senior_employees)
# Step 2: Subset specific columns
name_salary <- employees[, c("Name", "Salary")]
print(name_salary)
# Step 3: Use subset() function for filtering
high_earners <- subset(employees, Salary > 60000)
print(high_earners)
Output:
# senior_employees
Name Age Salary
2 Bob 30 65000
4 Diana 35 70000
# name_salary
Name Salary
1 Alice 50000
2 Bob 65000
3 Charlie 55000
4 Diana 70000
# high_earners
Name Age Salary
2 Bob 30 65000
4 Diana 35 70000
Explanation:
- Use logical conditions within [] to filter rows (e.g., employees$Age > 28).
- Select specific columns by passing a vector of column names.
- The subset() function provides a more readable way to filter and select columns.
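Filters often involve more than one condition. The following sketch, which assumes the employees data after Bob's salary change, combines conditions with & and |, and uses the select argument of subset() to pick columns in the same call. The result names are hypothetical.

```r
employees <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Diana"),
  Age = c(25, 30, 28, 35),
  Salary = c(50000, 65000, 55000, 70000)
)

# Both conditions must hold (&): older than 28 AND salary above 65000
older_and_paid <- employees[employees$Age > 28 & employees$Salary > 65000, ]
print(older_and_paid)

# Either condition may hold (|); select keeps only the listed columns
young_or_low <- subset(employees, Age < 28 | Salary < 60000,
                       select = c(Name, Salary))
print(young_or_low)

# which() turns the logical vector into row positions and skips NA values
rows <- which(employees$Age > 28)
print(rows)
```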
6. Merging Data Frames
Data frames can be merged to combine information from multiple sources.
Example: Merging Employee Data
Suppose we have another data frame with employee departments.
# Step 1: Create a department data frame
departments <- data.frame(
Name = c("Alice", "Bob", "Charlie", "Eve"),
Department = c("HR", "IT", "Marketing", "Sales")
)
# Step 2: Merge with employees data frame
merged_data <- merge(employees, departments, by = "Name", all = TRUE)
print(merged_data)
Output:
Name Age Salary Department
1 Alice 25 50000 HR
2 Bob 30 65000 IT
3 Charlie 28 55000 Marketing
4 Diana 35 70000 <NA>
5 Eve NA NA Sales
Explanation:
- The merge() function combines data frames based on a common column (by = "Name").
- all = TRUE performs a full outer join, keeping all rows from both data frames.
- Missing values are filled with NA.
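Besides the full outer join shown above, merge() supports the other common join types through its all.x and all.y arguments. The sketch below assumes the same two data frames as in the example; the result names inner, left, and right are hypothetical.

```r
employees <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Diana"),
  Age = c(25, 30, 28, 35),
  Salary = c(50000, 65000, 55000, 70000)
)
departments <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Eve"),
  Department = c("HR", "IT", "Marketing", "Sales")
)

# Inner join (the default): only names present in both data frames
inner <- merge(employees, departments, by = "Name")

# Left join: keep every row of employees; Diana gets NA for Department
left <- merge(employees, departments, by = "Name", all.x = TRUE)

# Right join: keep every row of departments; Eve gets NA for Age and Salary
right <- merge(employees, departments, by = "Name", all.y = TRUE)

print(inner)
print(left)
print(right)
```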
7. Handling Missing Values
Missing values (NA) can affect analysis. R provides functions to handle them.
Example: Managing Missing Values
# Step 1: Identify missing values
is.na(merged_data)
# Step 2: Remove rows with missing values
complete_data <- na.omit(merged_data)
print(complete_data)
# Step 3: Replace missing values in Salary with mean
merged_data$Salary[is.na(merged_data$Salary)] <- mean(merged_data$Salary, na.rm = TRUE)
print(merged_data)
Output:
# is.na(merged_data)
Name Age Salary Department
[1,] FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE TRUE
[5,] FALSE TRUE TRUE FALSE
# complete_data
Name Age Salary Department
1 Alice 25 50000 HR
2 Bob 30 65000 IT
3 Charlie 28 55000 Marketing
# merged_data (after replacing NA in Salary)
Name Age Salary Department
1 Alice 25 50000 HR
2 Bob 30 65000 IT
3 Charlie 28 55000 Marketing
4 Diana 35 70000 <NA>
5 Eve NA 60000 Sales
Explanation:
- is.na() identifies missing values.
- na.omit() removes rows with any NA values.
- Replace NA values in a column using conditional indexing and functions like mean().
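Before removing or replacing anything, it often helps to count where the missing values are. The following is a minimal sketch that rebuilds merged_data by hand (so it runs on its own) and then counts NA values per column, flags complete rows, and fills Age with the median. The median imputation is an illustrative choice, not something prescribed by the tutorial, and the result names are hypothetical.

```r
# merged_data rebuilt by hand, as it looks right after the merge
merged_data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
  Age = c(25, 30, 28, 35, NA),
  Salary = c(50000, 65000, 55000, 70000, NA),
  Department = c("HR", "IT", "Marketing", NA, "Sales")
)

# Number of missing values in each column
na_per_column <- colSums(is.na(merged_data))
print(na_per_column)

# TRUE for rows that have no missing value in any column
complete_rows <- complete.cases(merged_data)
print(complete_rows)

# Fill missing ages with the median of the observed ages
age_filled <- merged_data$Age
age_filled[is.na(age_filled)] <- median(age_filled, na.rm = TRUE)
print(age_filled)
```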
8. Sorting Data Frames
Sorting organizes data for better analysis.
Example: Sorting Employees
# Step 1: Sort by Age (ascending)
sorted_by_age <- employees[order(employees$Age), ]
print(sorted_by_age)
# Step 2: Sort by Salary (descending)
sorted_by_salary <- employees[order(-employees$Salary), ]
print(sorted_by_salary)
Output:
# sorted_by_age
Name Age Salary
1 Alice 25 50000
3 Charlie 28 55000
2 Bob 30 65000
4 Diana 35 70000
# sorted_by_salary
Name Age Salary
4 Diana 35 70000
2 Bob 30 65000
3 Charlie 28 55000
1 Alice 25 50000
Explanation:
- The order() function generates indices for sorting.
- Use - for descending order on numeric columns; for other column types, pass decreasing = TRUE to order().
- To sort by multiple columns, pass them to order() in priority order.
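Sorting by more than one column can be sketched as follows. The Department values assigned here are hypothetical, chosen so that each department has two employees; the result names are also hypothetical.

```r
employees <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Diana"),
  Department = c("IT", "HR", "IT", "HR"),  # hypothetical assignment
  Salary = c(50000, 65000, 55000, 70000)
)

# Department ascending, then Salary descending within each department
by_dept_salary <- employees[order(employees$Department, -employees$Salary), ]
print(by_dept_salary)

# Character columns cannot be negated; use decreasing = TRUE instead
by_name_desc <- employees[order(employees$Name, decreasing = TRUE), ]
print(by_name_desc)
```

Note that the original row numbers are kept as row names after sorting, which is why they appear out of sequence in the printed result.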
9. Aggregating Data
Aggregation summarizes data, such as calculating group means.
Example: Aggregating by Department
This example uses merged_data from Section 7, after the missing Salary value was replaced with the mean.
# Step 1: Calculate mean salary by department
avg_salary <- aggregate(Salary ~ Department, data = merged_data, mean, na.rm = TRUE)
print(avg_salary)
Output:
Department Salary
1 HR 50000
2 IT 65000
3 Marketing 55000
4 Sales 60000
Explanation:
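- The aggregate() function groups data by Department and applies mean to Salary.
- na.rm = TRUE is passed on to mean() so that NA values are ignored during calculation.
- With the formula interface, rows with NA in Salary or Department are dropped by default (na.action = na.omit), which is why Diana, who has no department, does not appear. The Sales value is Eve's imputed salary.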
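Group summaries can also be computed with tapply() and table(). The sketch below uses a small hypothetical data frame called staff (not part of the tutorial's data) with two departments and one missing salary, to show how each function treats NA values.

```r
# Hypothetical data: two departments, one missing salary
staff <- data.frame(
  Department = c("HR", "HR", "IT", "IT", "IT"),
  Salary = c(50000, 54000, 65000, 61000, NA)
)

# Formula interface: the row with the NA salary is dropped before averaging
mean_by_dept <- aggregate(Salary ~ Department, data = staff, FUN = mean)
print(mean_by_dept)

# tapply() returns a named result; here NA must be handled with na.rm
mean_tapply <- tapply(staff$Salary, staff$Department, mean, na.rm = TRUE)
print(mean_tapply)

# Number of rows per department (the NA salary row is still counted)
counts <- table(staff$Department)
print(counts)
```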
10. Exporting and Importing Data Frames
Data frames can be saved to or loaded from files.
Example: Exporting and Importing
# Step 1: Export to CSV
write.csv(employees, "employees.csv", row.names = FALSE)
# Step 2: Import from CSV
imported_data <- read.csv("employees.csv")
print(imported_data)
Output:
# imported_data
Name Age Salary
1 Alice 25 50000
2 Bob 30 65000
3 Charlie 28 55000
4 Diana 35 70000
Explanation:
- write.csv() saves the data frame to a CSV file.
- read.csv() loads a CSV file into a data frame.
- row.names = FALSE prevents writing row indices to the file.
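A round trip through a file can be checked programmatically. The following sketch assumes the same employees data and writes to a temporary file (via tempfile()) rather than the working directory, so it leaves nothing behind. The names csv_path, file_written, and imported are hypothetical.

```r
employees <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Diana"),
  Age = c(25, 30, 28, 35),
  Salary = c(50000, 65000, 55000, 70000)
)

# Write to a temporary file instead of the working directory
csv_path <- tempfile(fileext = ".csv")
write.csv(employees, csv_path, row.names = FALSE)
file_written <- file.exists(csv_path)

# Read it back; stringsAsFactors = FALSE keeps Name as character
imported <- read.csv(csv_path, stringsAsFactors = FALSE)

# Compare values column by column; column types may differ after import
# (for example, whole numbers may come back as integer)
print(all(imported$Name == employees$Name))
print(all(imported$Age == employees$Age))
print(all(imported$Salary == employees$Salary))

# Clean up the temporary file
unlink(csv_path)
```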
11. Practical Application: Analyzing a Real Dataset
Let’s apply these skills to the built-in mtcars dataset.
Example: Analyzing mtcars
# Step 1: Load the dataset
data(mtcars)
# Step 2: Inspect the data
str(mtcars)
head(mtcars, n = 3)
# Step 3: Filter cars with mpg > 20
efficient_cars <- subset(mtcars, mpg > 20, select = c("mpg", "hp", "wt"))
print(efficient_cars)
# Step 4: Calculate average mpg by number of cylinders
avg_mpg <- aggregate(mpg ~ cyl, data = mtcars, mean)
print(avg_mpg)
Output:
# str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
...
# head(mtcars, n = 3)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
# efficient_cars
mpg hp wt
Mazda RX4 21.0 110 2.620
Mazda RX4 Wag 21.0 110 2.875
Datsun 710 22.8 93 2.320
Hornet 4 Drive 21.4 110 3.215
Merc 240D 24.4 62 3.190
Merc 230 22.8 95 3.150
Fiat 128 32.4 66 2.200
Honda Civic 30.4 52 1.615
Toyota Corolla 33.9 65 1.835
Toyota Corona 21.5 97 2.465
Fiat X1-9 27.3 66 1.935
Porsche 914-2 26.0 91 2.140
Lotus Europa 30.4 113 1.513
Volvo 142E 21.4 109 2.780
# avg_mpg
cyl mpg
1 4 26.66364
2 6 19.74286
3 8 15.10000
Explanation:
- mtcars is a built-in dataset with car performance metrics.
- We inspect, filter, and aggregate to extract insights, demonstrating real-world application.
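In mtcars the car names are stored as row names rather than in a column, which affects how results are read back. As a hedged follow-up to the analysis above, this sketch counts the efficient cars, ranks them by mpg, and copies the row names into a regular column. The names efficient, top3, and Model are hypothetical.

```r
data(mtcars)

# Same filter as above: cars with mpg > 20, three columns kept
efficient <- subset(mtcars, mpg > 20, select = c(mpg, hp, wt))
n_efficient <- nrow(efficient)
print(n_efficient)

# Three most fuel-efficient cars, highest mpg first
top3 <- head(efficient[order(-efficient$mpg), ], n = 3)
print(top3)

# Car names live in the row names; copy them into a regular column
top3$Model <- rownames(top3)
print(top3$Model)
```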
12. Best Practices
- Use Meaningful Column Names: Ensure clarity in analysis.
- Check Data Types: Use str() to verify column types match expectations.
- Handle Missing Values: Address NA values before analysis.
- Document Code: Comment your code for reproducibility.
- Validate Data: Check for outliers or inconsistencies.
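Several of these practices can be expressed as lightweight checks in code. The following is a minimal sketch, assuming the employees data from earlier sections; the age limits of 18 and 70 are hypothetical thresholds chosen only to illustrate a range check.

```r
employees <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Diana"),
  Age = c(25, 30, 28, 35),
  Salary = c(50000, 65000, 55000, 70000),
  stringsAsFactors = FALSE
)

# Check data types and missing values; stopifnot() stops with an error
# if any condition is FALSE
stopifnot(
  is.character(employees$Name),
  is.numeric(employees$Age),
  is.numeric(employees$Salary),
  !anyNA(employees)
)

# Validate data: flag rows outside a plausible age range
# (18 and 70 are hypothetical limits)
suspicious <- employees[employees$Age < 18 | employees$Age > 70, ]
print(nrow(suspicious))

# Validate data: check for duplicated names
has_duplicates <- any(duplicated(employees$Name))
print(has_duplicates)
```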
13. Assignments
To reinforce your understanding of data frames in R, complete the following assignments. Each task builds on the concepts covered in this tutorial.
Assignment 1: Creating and Inspecting a Data Frame
- Create a data frame called students with the following information:
  - Columns: StudentID (1, 2, 3, 4, 5), Name (“John”, “Emma”, “Liam”, “Olivia”, “Noah”), Grade (85, 92, 78, 95, 88).
- Display the structure of the data frame using str().
- Print a summary of the data frame using summary().
- Show the first three rows using head().
Expected Output:
- A data frame with 5 rows and 3 columns.
- Structure showing column types (numeric for StudentID and Grade, character for Name).
- Summary with statistical details for numeric columns.
- First three rows of the data frame.
Assignment 2: Modifying and Subsetting
- Using the students data frame, increase Emma’s grade by 5 points.
- Add a new column called Pass with logical values (TRUE if Grade >= 80, FALSE otherwise).
- Subset the data frame to show only students with grades above 90.
- Extract only the Name and Grade columns for all students.
Expected Output:
- Modified data frame with Emma’s updated grade.
- New Pass column with logical values.
- Subset with students having grades > 90 (Emma and Olivia).
- Data frame with only Name and Grade columns.
Assignment 3: Merging and Handling Missing Values
- Create a new data frame called courses with columns: StudentID (1, 2, 3, 6), Course (“Math”, “Science”, “History”, “English”).
- Merge students and courses by StudentID using a left join (keep all students).
- Identify missing values in the merged data frame.
- Replace missing Course values with “Not Enrolled”.
- Remove rows with missing Grade values.
Expected Output:
- Merged data frame with all students, some with NA for Course.
- Logical matrix showing NA values.
- Merged data frame with “Not Enrolled” for missing Course values.
- Data frame with no missing Grade values.
Assignment 4: Sorting and Aggregating
- Sort the students data frame by Grade in descending order.
- Using the merged data frame from Assignment 3 (after handling missing values), calculate the average grade by Course.
- Save the sorted students data frame to a CSV file named “students_sorted.csv”.
Expected Output:
- Sorted students data frame with highest grades first.
- Data frame with average grades for each course (including “Not Enrolled”).
- Confirmation that the CSV file was created (check your working directory).
Assignment 5: Analyzing a Built-in Dataset
- Load the iris dataset using data(iris).
- Filter the dataset to include only flowers with Sepal.Length > 5.5.
- Calculate the average Petal.Length by Species.
- Export the filtered data frame to a CSV file named “iris_filtered.csv”.
Expected Output:
- Filtered iris data frame with Sepal.Length > 5.5.
- Data frame with average Petal.Length for each species.
- Confirmation that the CSV file was created.
14. Conclusion
Data frames are a cornerstone of data analysis in R. This tutorial covered creating, inspecting, manipulating, and analyzing data frames with practical examples. The assignments provide hands-on practice to solidify your skills. By mastering these techniques, you can efficiently handle datasets and perform robust statistical analyses.