R Programming Data Frame

1. Introduction to Data Frames

A data frame in R is a two-dimensional, table-like structure in which each column holds values of a single type, but different columns can hold different types (e.g., numeric, character, logical). Each row represents an observation. Data frames are the standard way to store datasets in R and are widely used in statistical analysis and data science.

2. Creating a Data Frame

Data frames can be created using the data.frame() function. Below is a step-by-step example.

Example: Creating a Simple Data Frame

Suppose we want to create a data frame to store information about employees, including their names, ages, and salaries.

# Step 1: Define vectors for each column
names <- c("Alice", "Bob", "Charlie", "Diana")
ages <- c(25, 30, 28, 35)
salaries <- c(50000, 60000, 55000, 70000)

# Step 2: Combine vectors into a data frame
employees <- data.frame(Name = names, Age = ages, Salary = salaries)

# Step 3: Display the data frame
print(employees)

Output:

     Name Age Salary
1   Alice  25  50000
2     Bob  30  60000
3 Charlie  28  55000
4   Diana  35  70000

Explanation:

  • Each vector (names, ages, salaries) becomes a column in the data frame.
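  • The data.frame() function combines these vectors into columns. The vectors should all have the same length; a shorter vector is recycled only if its length divides the number of rows evenly, and otherwise R raises an error.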
  • Column names are specified using the assignment (Name = names).
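
The length requirement can be checked directly. The following is a minimal sketch, not part of the original example: the names names_vec, ages_vec, and short_ages are hypothetical, and short_ages is deliberately one value too short. It shows a length-one value being recycled to every row and an incompatible length raising an error.

```r
# Vectors of equal length combine cleanly
names_vec <- c("Alice", "Bob", "Charlie", "Diana")
ages_vec <- c(25, 30, 28, 35)
ok <- data.frame(Name = names_vec, Age = ages_vec)

# A length-one value is recycled to every row
with_country <- data.frame(Name = names_vec, Country = "US")

# Incompatible lengths (4 vs 3) raise an error;
# tryCatch() captures the error message instead of stopping the script
short_ages <- c(25, 30, 28)  # hypothetical vector, one value too few
result <- tryCatch(
  data.frame(Name = names_vec, Age = short_ages),
  error = function(e) conditionMessage(e)
)
print(result)
```

Checking lengths before calling data.frame() (for example with length()) is usually easier than debugging a recycled column later.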

3. Inspecting a Data Frame

Understanding the structure and content of a data frame is crucial. R provides several functions for this purpose.

Example: Inspecting the Employee Data Frame

# Step 1: View the structure of the data frame
str(employees)

# Step 2: Display summary statistics
summary(employees)

# Step 3: View the first few rows
head(employees, n = 2)

# Step 4: Check dimensions (rows and columns)
dim(employees)

Output:

# str(employees)
'data.frame':	4 obs. of  3 variables:
 $ Name  : chr  "Alice" "Bob" "Charlie" "Diana"
 $ Age   : num  25 30 28 35
 $ Salary: num  50000 60000 55000 70000

# summary(employees)
     Name               Age           Salary    
 Length:4           Min.   :25.00   Min.   :50000  
 Class :character   1st Qu.:27.25   1st Qu.:53750  
 Mode  :character   Median :29.00   Median :57500  
                    Mean   :29.50   Mean   :58750  
                    3rd Qu.:31.25   3rd Qu.:62500  
                    Max.   :35.00   Max.   :70000  

# head(employees, n = 2)
    Name Age Salary
1  Alice  25  50000
2    Bob  30  60000

# dim(employees)
[1] 4 3

Explanation:

  • str(): Shows the structure, including column names, data types, and a preview of values.
  • summary(): Provides statistical summaries (e.g., min, max, mean, quartiles) for numeric columns. For character columns it reports only the length, class, and mode, as shown for Name above.
  • head(): Displays the first n rows (default is 6).
  • dim(): Returns the number of rows and columns.
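
A few other base R helpers answer the same questions one piece at a time. The sketch below rebuilds the employees data frame so it runs on its own; the variable names on the left-hand side are illustrative only.

```r
employees <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Diana"),
  Age = c(25, 30, 28, 35),
  Salary = c(50000, 60000, 55000, 70000)
)

n_rows <- nrow(employees)              # number of rows
n_cols <- ncol(employees)              # number of columns
col_names <- names(employees)          # column names as a character vector
col_types <- sapply(employees, class)  # class of each column
last_two <- tail(employees, n = 2)     # last two rows (counterpart of head())

print(col_types)
print(last_two)
```

nrow() and ncol() are convenient when only one dimension is needed, for example in a loop bound or a sanity check.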

4. Accessing and Modifying Data Frames

Data frames can be accessed and modified using indexing, column names, or functions.

Example: Accessing and Modifying Data

# Step 1: Access a specific column
employee_names <- employees$Name
print(employee_names)

# Step 2: Access a specific row
third_employee <- employees[3, ]
print(third_employee)

# Step 3: Access a specific cell
charlie_salary <- employees[3, "Salary"]
print(charlie_salary)

# Step 4: Modify a cell
employees[2, "Salary"] <- 65000
print(employees)

# Step 5: Add a new column
employees$Department <- c("HR", "IT", "Marketing", "Finance")
print(employees)

# Step 6: Remove a column
employees$Department <- NULL
print(employees)

Output:

# employee_names
[1] "Alice"   "Bob"     "Charlie" "Diana"

# third_employee
     Name Age Salary
3 Charlie  28  55000

# charlie_salary
[1] 55000

# After modifying Bob's salary
     Name Age Salary
1   Alice  25  50000
2     Bob  30  65000
3 Charlie  28  55000
4   Diana  35  70000

# After adding Department column
     Name Age Salary Department
1   Alice  25  50000         HR
2     Bob  30  65000         IT
3 Charlie  28  55000  Marketing
4   Diana  35  70000    Finance

# After removing Department column
     Name Age Salary
1   Alice  25  50000
2     Bob  30  65000
3 Charlie  28  55000
4   Diana  35  70000

Explanation:

  • Use $ to access a column by name (e.g., employees$Name).
  • Use [row, column] for specific rows, columns, or cells.
  • Modify values by assigning new data to specific indices.
  • Add a column by assigning a new vector to a new column name.
  • Remove a column by setting it to NULL.
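
The different indexing forms return different kinds of objects, which matters when the result is passed to another function. The sketch below illustrates this under the same employees data; the MonthlySalary column is a hypothetical addition used only to show a derived column.

```r
employees <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Diana"),
  Age = c(25, 30, 28, 35),
  Salary = c(50000, 60000, 55000, 70000)
)

age_vector <- employees[["Age"]]   # plain numeric vector, same as employees$Age
age_frame <- employees["Age"]      # one-column data frame

# Selecting a single column with [row, column] returns a vector by default;
# drop = FALSE keeps the result as a data frame
salary_frame <- employees[, "Salary", drop = FALSE]

# A new column can be computed from an existing one (hypothetical column)
employees$MonthlySalary <- employees$Salary / 12
print(employees)
```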

5. Subsetting and Filtering Data Frames

Subsetting allows you to extract specific rows or columns based on conditions.

Example: Filtering Employees

# Step 1: Subset employees older than 28
senior_employees <- employees[employees$Age > 28, ]
print(senior_employees)

# Step 2: Subset specific columns
name_salary <- employees[, c("Name", "Salary")]
print(name_salary)

# Step 3: Use subset() function for filtering
high_earners <- subset(employees, Salary > 60000)
print(high_earners)

Output:

# senior_employees
    Name Age Salary
2    Bob  30  65000
4  Diana  35  70000

# name_salary
     Name Salary
1   Alice  50000
2     Bob  65000
3 Charlie  55000
4   Diana  70000

# high_earners
    Name Age Salary
2    Bob  30  65000
4  Diana  35  70000

Explanation:

  • Use logical conditions within [] to filter rows (e.g., employees$Age > 28).
  • Select specific columns by passing a vector of column names.
  • The subset() function provides a more readable way to filter and select columns.
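
Bracket filtering and subset() behave differently when the condition evaluates to NA. The sketch below uses a modified copy of the data in which Charlie's age is set to NA purely for illustration; it also shows how to combine two conditions with &.

```r
employees <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Diana"),
  Age = c(25, 30, NA, 35),   # NA inserted for illustration only
  Salary = c(50000, 65000, 55000, 70000)
)

# Combine conditions with & (and) or | (or)
older_high <- employees[employees$Age > 28 & employees$Salary > 60000, ]

# With [], an NA condition yields a row filled with NA
bracket_rows <- employees[employees$Age > 28, ]

# subset() silently drops rows where the condition is NA
subset_rows <- subset(employees, Age > 28)

# which() keeps only positions where the condition is TRUE
which_rows <- employees[which(employees$Age > 28), ]

print(bracket_rows)
print(subset_rows)
```

When a column may contain missing values, subset() or which() avoids the extra all-NA rows that plain bracket filtering produces.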

6. Merging Data Frames

Data frames can be merged to combine information from multiple sources.

Example: Merging Employee Data

Suppose we have another data frame with employee departments.

# Step 1: Create a department data frame
departments <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Eve"),
  Department = c("HR", "IT", "Marketing", "Sales")
)

# Step 2: Merge with employees data frame
merged_data <- merge(employees, departments, by = "Name", all = TRUE)
print(merged_data)

Output:

     Name Age Salary Department
1   Alice  25  50000         HR
2     Bob  30  65000         IT
3 Charlie  28  55000  Marketing
4   Diana  35  70000       <NA>
5     Eve  NA     NA      Sales

Explanation:

  • The merge() function combines data frames based on a common column (by = "Name").
  • all = TRUE performs a full outer join, keeping all rows from both data frames.
  • Missing values are filled with NA.
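
The all argument has two one-sided variants, all.x and all.y, which give left and right joins; omitting all of them gives an inner join. A short sketch using the same two data frames (rebuilt here so the block runs on its own):

```r
employees <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Diana"),
  Age = c(25, 30, 28, 35),
  Salary = c(50000, 65000, 55000, 70000)
)
departments <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Eve"),
  Department = c("HR", "IT", "Marketing", "Sales")
)

# Inner join: only names present in both data frames
inner <- merge(employees, departments, by = "Name")

# Left join: every row of employees, NA where no department matches
left <- merge(employees, departments, by = "Name", all.x = TRUE)

# Right join: every row of departments, NA where no employee matches
right <- merge(employees, departments, by = "Name", all.y = TRUE)

print(inner)
print(left)
print(right)
```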

7. Handling Missing Values

Missing values (NA) can affect analysis. R provides functions to handle them.

Example: Managing Missing Values

# Step 1: Identify missing values
is.na(merged_data)

# Step 2: Remove rows with missing values
complete_data <- na.omit(merged_data)
print(complete_data)

# Step 3: Replace missing values in Salary with mean
merged_data$Salary[is.na(merged_data$Salary)] <- mean(merged_data$Salary, na.rm = TRUE)
print(merged_data)

Output:

# is.na(merged_data)
     Name   Age Salary Department
[1,] FALSE FALSE  FALSE      FALSE
[2,] FALSE FALSE  FALSE      FALSE
[3,] FALSE FALSE  FALSE      FALSE
[4,] FALSE FALSE  FALSE       TRUE
[5,] FALSE  TRUE   TRUE      FALSE

# complete_data
     Name Age Salary Department
1   Alice  25  50000         HR
2     Bob  30  65000         IT
3 Charlie  28  55000  Marketing

# merged_data (after replacing NA in Salary)
     Name Age Salary Department
1   Alice  25  50000         HR
2     Bob  30  65000         IT
3 Charlie  28  55000  Marketing
4   Diana  35  70000       <NA>
5     Eve  NA  60000      Sales

Explanation:

  • is.na() identifies missing values.
  • na.omit() removes rows with any NA values.
  • Replace NA values in a column using conditional indexing and functions like mean().
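
On larger data frames the full logical matrix from is.na() is hard to read, so it often helps to summarise it. The sketch below recreates the merged data by hand (before any imputation) and counts missing values per column; the variable names are illustrative.

```r
merged_data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Diana", "Eve"),
  Age = c(25, 30, 28, 35, NA),
  Salary = c(50000, 65000, 55000, 70000, NA),
  Department = c("HR", "IT", "Marketing", NA, "Sales")
)

# Number of missing values in each column
na_counts <- colSums(is.na(merged_data))
print(na_counts)

# Logical vector: TRUE for rows with no missing values at all
complete_rows <- complete.cases(merged_data)
print(complete_rows)

# Drop rows only where one particular column is missing
salary_known <- merged_data[!is.na(merged_data$Salary), ]
print(salary_known)
```

Filtering on a single column keeps more data than na.omit(), which discards a row if any column is missing.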

8. Sorting Data Frames

Sorting organizes data for better analysis.

Example: Sorting Employees

# Step 1: Sort by Age (ascending)
sorted_by_age <- employees[order(employees$Age), ]
print(sorted_by_age)

# Step 2: Sort by Salary (descending)
sorted_by_salary <- employees[order(-employees$Salary), ]
print(sorted_by_salary)

Output:

# sorted_by_age
     Name Age Salary
1   Alice  25  50000
3 Charlie  28  55000
2     Bob  30  65000
4   Diana  35  70000

# sorted_by_salary
     Name Age Salary
4   Diana  35  70000
2     Bob  30  65000
3 Charlie  28  55000
1   Alice  25  50000

Explanation:

  • The order() function generates indices for sorting.
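  • Use - for descending order on numeric columns; for character columns, pass decreasing = TRUE to order() instead.
  • To sort by several columns, pass them to order() in priority order (e.g., order(employees$Age, employees$Salary)); later columns break ties in earlier ones.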
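
Sorting on more than one column can be sketched as follows. The Department values here are hypothetical and chosen only so that each department has two employees.

```r
employees <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Diana"),
  Department = c("IT", "HR", "IT", "HR"),  # hypothetical values for illustration
  Salary = c(50000, 65000, 55000, 70000)
)

# Department ascending, then Salary descending within each department
by_dept_salary <- employees[order(employees$Department, -employees$Salary), ]
print(by_dept_salary)

# Unary minus does not apply to character data; use decreasing = TRUE
by_name_desc <- employees[order(employees$Name, decreasing = TRUE), ]
print(by_name_desc)
```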

9. Aggregating Data

Aggregation summarizes data, such as calculating group means.

Example: Aggregating by Department

This example uses merged_data as it stands after Step 3 of Section 7, where Eve's missing Salary was replaced with the mean.

# Step 1: Calculate mean salary by department
avg_salary <- aggregate(Salary ~ Department, data = merged_data, mean, na.rm = TRUE)
print(avg_salary)

Output:

  Department Salary
1         HR  50000
2         IT  65000
3  Marketing  55000
4      Sales  60000

Explanation:

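  • The aggregate() function groups data by Department and applies mean to Salary.
  • With the formula interface, rows containing NA in Salary or Department are dropped before grouping (the default na.action is na.omit), which is why Diana, whose Department is NA, does not appear. na.rm = TRUE is passed on to mean() as an extra safeguard.
  • The Sales figure is the mean that was imputed for Eve in Section 7, not an observed salary.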
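
Grouped summaries are clearer when a group has more than one row. The sketch below uses a hypothetical staff data frame (not the tutorial's data) to aggregate two columns at once and to show two related base R tools, tapply() and table().

```r
# Hypothetical data with repeated departments
staff <- data.frame(
  Department = c("HR", "IT", "IT", "HR", "Sales"),
  Salary = c(50000, 65000, 55000, 72000, 60000),
  Age = c(25, 30, 28, 35, 41)
)

# Mean of Salary and Age per department in one call
means <- aggregate(cbind(Salary, Age) ~ Department, data = staff, FUN = mean)
print(means)

# tapply() returns a named vector rather than a data frame
max_salary <- tapply(staff$Salary, staff$Department, max)
print(max_salary)

# table() counts the rows in each group
headcount <- table(staff$Department)
print(headcount)
```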

10. Exporting and Importing Data Frames

Data frames can be saved to or loaded from files.

Example: Exporting and Importing

# Step 1: Export to CSV
write.csv(employees, "employees.csv", row.names = FALSE)

# Step 2: Import from CSV
imported_data <- read.csv("employees.csv")
print(imported_data)

Output:

# imported_data
     Name Age Salary
1   Alice  25  50000
2     Bob  30  65000
3 Charlie  28  55000
4   Diana  35  70000

Explanation:

  • write.csv() saves the data frame to a CSV file.
  • read.csv() loads a CSV file into a data frame.
  • row.names = FALSE prevents writing row indices to the file.
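
A round trip through a file is a quick way to confirm that nothing was lost on export. The sketch below writes to a temporary file (via tempfile()) rather than the working directory, which is an assumption made here to keep the example free of side effects.

```r
employees <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Diana"),
  Age = c(25, 30, 28, 35),
  Salary = c(50000, 65000, 55000, 70000)
)

# Write to a temporary path so the working directory is left untouched
csv_path <- tempfile(fileext = ".csv")
write.csv(employees, csv_path, row.names = FALSE)

# Read the file back; stringsAsFactors = FALSE keeps Name as character
imported <- read.csv(csv_path, stringsAsFactors = FALSE)

# Column types may differ after import (whole numbers are typically
# read back as integer), so compare values rather than types
str(imported)
same_names <- identical(names(imported), names(employees))
same_salary <- all(imported$Salary == employees$Salary)
print(c(same_names, same_salary))

# Clean up the temporary file
unlink(csv_path)
```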

11. Practical Application: Analyzing a Real Dataset

Let’s apply these skills to the built-in mtcars dataset.

Example: Analyzing mtcars

# Step 1: Load the dataset
data(mtcars)

# Step 2: Inspect the data
str(mtcars)
head(mtcars, n = 3)

# Step 3: Filter cars with mpg > 20
efficient_cars <- subset(mtcars, mpg > 20, select = c("mpg", "hp", "wt"))
print(efficient_cars)

# Step 4: Calculate average mpg by number of cylinders
avg_mpg <- aggregate(mpg ~ cyl, data = mtcars, mean)
print(avg_mpg)

Output:

# str(mtcars)
'data.frame':	32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 ...

# head(mtcars, n = 3)
               mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

# efficient_cars
                mpg  hp    wt
Mazda RX4      21.0 110 2.620
Mazda RX4 Wag  21.0 110 2.875
Datsun 710     22.8  93 2.320
Hornet 4 Drive 21.4 110 3.215
Merc 240D      24.4  62 3.190
Merc 230       22.8  95 3.150
Fiat 128       32.4  66 2.200
Honda Civic    30.4  52 1.615
Toyota Corolla 33.9  65 1.835
Toyota Corona  21.5  97 2.465
Fiat X1-9      27.3  66 1.935
Porsche 914-2  26.0  91 2.140
Lotus Europa   30.4 113 1.513
Volvo 142E     21.4 109 2.780

# avg_mpg
  cyl      mpg
1   4 26.66364
2   6 19.74286
3   8 15.10000

Explanation:

  • mtcars is a built-in dataset with car performance metrics.
  • We inspect, filter, and aggregate to extract insights, demonstrating real-world application.
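
The sorting and counting techniques from earlier sections apply to mtcars in the same way. A brief sketch, with illustrative variable names:

```r
data(mtcars)

# Three most fuel-efficient cars: sort by mpg descending, keep two columns
top3 <- head(mtcars[order(-mtcars$mpg), c("mpg", "cyl")], n = 3)
print(top3)

# Number of cars for each cylinder count
cyl_counts <- table(mtcars$cyl)
print(cyl_counts)
```

Note that mtcars stores the car names as row names rather than as a column, so they are retrieved with rownames(mtcars).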

12. Best Practices

  • Use Meaningful Column Names: Ensure clarity in analysis.
  • Check Data Types: Use str() to verify column types match expectations.
  • Handle Missing Values: Address NA values before analysis.
  • Document Code: Comment your code for reproducibility.
  • Validate Data: Check for outliers or inconsistencies.
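
Several of these practices can be turned into quick checks that run before an analysis. The following is a rough sketch, not a complete validation routine; the age bounds of 18 and 70 are assumed values chosen for illustration.

```r
employees <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "Diana"),
  Age = c(25, 30, 28, 35),
  Salary = c(50000, 65000, 55000, 70000),
  stringsAsFactors = FALSE
)

# Check data types: stop early if a column has an unexpected type
stopifnot(is.character(employees$Name), is.numeric(employees$Age))

# Handle missing values: confirm there are none before analysis
no_missing <- !anyNA(employees)

# Validate data: ages within an assumed plausible range (18 to 70)
age_in_range <- all(employees$Age >= 18 & employees$Age <= 70)

# Validate data: no duplicated employee names
no_duplicates <- !any(duplicated(employees$Name))

print(c(no_missing = no_missing,
        age_in_range = age_in_range,
        no_duplicates = no_duplicates))
```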

13. Assignments

To reinforce your understanding of data frames in R, complete the following assignments. Each task builds on the concepts covered in this tutorial.

Assignment 1: Creating and Inspecting a Data Frame

  1. Create a data frame called students with the following information:
    • Columns: StudentID (1, 2, 3, 4, 5), Name (“John”, “Emma”, “Liam”, “Olivia”, “Noah”), Grade (85, 92, 78, 95, 88).
  2. Display the structure of the data frame using str().
  3. Print a summary of the data frame using summary().
  4. Show the first three rows using head().

Expected Output:

  • A data frame with 5 rows and 3 columns.
  • Structure showing column types (numeric for StudentID and Grade, character for Name).
  • Summary with statistical details for numeric columns.
  • First three rows of the data frame.

Assignment 2: Modifying and Subsetting

  1. Using the students data frame, increase Emma’s grade by 5 points.
  2. Add a new column called Pass with logical values (TRUE if Grade >= 80, FALSE otherwise).
  3. Subset the data frame to show only students with grades above 90.
  4. Extract only the Name and Grade columns for all students.

Expected Output:

  • Modified data frame with Emma’s updated grade.
  • New Pass column with logical values.
  • Subset with students having grades > 90 (Emma and Olivia).
  • Data frame with only Name and Grade columns.

Assignment 3: Merging and Handling Missing Values

  1. Create a new data frame called courses with columns:
    • StudentID (1, 2, 3, 6), Course (“Math”, “Science”, “History”, “English”).
  2. Merge students and courses by StudentID using a left join (keep all students).
  3. Identify missing values in the merged data frame.
  4. Replace missing Course values with “Not Enrolled”.
  5. Remove rows with missing Grade values.

Expected Output:

  • Merged data frame with all students, some with NA for Course.
  • Logical matrix showing NA values.
  • Merged data frame with “Not Enrolled” for missing Course values.
  • Data frame with no missing Grade values.

Assignment 4: Sorting and Aggregating

  1. Sort the students data frame by Grade in descending order.
  2. Using the merged data frame from Assignment 3 (after handling missing values), calculate the average grade by Course.
  3. Save the sorted students data frame to a CSV file named “students_sorted.csv”.

Expected Output:

  • Sorted students data frame with highest grades first.
  • Data frame with average grades for each course (including “Not Enrolled”).
  • Confirmation that the CSV file was created (check your working directory).

Assignment 5: Analyzing a Built-in Dataset

  1. Load the iris dataset using data(iris).
  2. Filter the dataset to include only flowers with Sepal.Length > 5.5.
  3. Calculate the average Petal.Length by Species.
  4. Export the filtered data frame to a CSV file named “iris_filtered.csv”.

Expected Output:

  • Filtered iris data frame with Sepal.Length > 5.5.
  • Data frame with average Petal.Length for each species.
  • Confirmation that the CSV file was created.

14. Conclusion

Data frames are a cornerstone of data analysis in R. This tutorial covered creating, inspecting, manipulating, and analyzing data frames with practical examples. The assignments provide hands-on practice to solidify your skills. By mastering these techniques, you can efficiently handle datasets and perform robust statistical analyses.
