R Programming Tutorial: Reading and Analyzing CSV Dataset with Data Frames

This tutorial provides a comprehensive guide to reading a CSV dataset file into R, creating a data frame, and performing various data frame operations. The tutorial uses a sample dataset, “sales_data.csv,” which contains sales information from different store locations. Step-by-step examples demonstrate how to read, inspect, manipulate, and analyze the data using R. The dataset is provided as a separate artifact, and this tutorial assumes it is saved as “sales_data.csv” in your R working directory.

1. Introduction

A data frame in R is a two-dimensional structure ideal for storing and analyzing tabular data, such as datasets stored in CSV files. This tutorial focuses on reading a CSV file into a data frame and performing common data frame operations, including inspection, filtering, aggregation, and visualization. The sample dataset contains sales data with columns for store ID, location, date, sales amount, units sold, and product category.

2. Setting Up the Environment

Before starting, ensure you have R installed and the dataset file “sales_data.csv” saved in your R working directory. You can check or set your working directory using:

# Check current working directory
getwd()

# Set working directory (replace with your path if needed)
setwd("path/to/your/directory")

3. Reading the CSV File into a Data Frame

The read.csv() function is used to read a CSV file into a data frame.

Example: Reading the Sales Dataset

# Step 1: Read the CSV file
sales_data <- read.csv("sales_data.csv", stringsAsFactors = FALSE)

# Step 2: Display the first few rows
head(sales_data)

Output:

  StoreID   Location       Date Sales UnitsSold   Category
1       1  New York 2023-01-01  1500        50 Electronics
2       2 Los Angeles 2023-01-01  1200        40   Clothing
3       3    Chicago 2023-01-01   800        30      Books
4       4  New York 2023-01-02  1600        55 Electronics
5       5 Los Angeles 2023-01-02  1300        45   Clothing
6       6    Chicago 2023-01-02   900        35      Books

Explanation:

  • read.csv() loads the CSV file into a data frame named sales_data.
  • stringsAsFactors = FALSE ensures character columns (e.g., Location, Category) are not converted to factors.
  • head() displays the first six rows by default, providing a quick preview.

4. Inspecting the Data Frame

Inspecting the data frame helps understand its structure, data types, and content.

Example: Inspecting the Sales Data

# Step 1: View the structure
str(sales_data)

# Step 2: Display summary statistics
summary(sales_data)

# Step 3: Check dimensions
dim(sales_data)

Output:

# str(sales_data)
'data.frame':	10 obs. of  6 variables:
 $ StoreID   : int  1 2 3 4 5 6 7 8 9 10
 $ Location  : chr  "New York" "Los Angeles" "Chicago" "New York" ...
 $ Date      : chr  "2023-01-01" "2023-01-01" "2023-01-01" "2023-01-02" ...
 $ Sales     : int  1500 1200 800 1600 1300 900 1400 1100 850 1700
 $ UnitsSold : int  50 40 30 55 45 35 48 38 32 60
 $ Category  : chr  "Electronics" "Clothing" "Books" "Electronics" ...

# summary(sales_data)
     StoreID       Location             Date               Sales        UnitsSold       Category        
 Min.   : 1.00   Length:10          Length:10          Min.   : 800   Min.   :30.0   Length:10         
 1st Qu.: 3.25   Class :character   Class :character   1st Qu.: 925   1st Qu.:34.5   Class :character  
 Median : 5.50   Mode  :character   Mode  :character   Median :1250   Median :42.5   Mode  :character  
 Mean   : 5.50                                         Mean   :1235   Mean   :43.3                     
 3rd Qu.: 7.75                                         3rd Qu.:1525   3rd Qu.:49.5                     
 Max.   :10.00                                         Max.   :1700   Max.   :60.0                     

# dim(sales_data)
[1] 10  6

Explanation:

  • str() shows column names, data types (e.g., integer, character), and a preview of values.
  • summary() provides statistical summaries for numeric columns (e.g., Sales, UnitsSold) and length for character columns.
  • dim() returns the number of rows (10) and columns (6).

5. Converting Date Column

The Date column is read as a character. Convert it to a proper date format for analysis.

Example: Converting Date

# Step 1: Convert Date to Date type
sales_data$Date <- as.Date(sales_data$Date, format = "%Y-%m-%d")

# Step 2: Verify conversion
str(sales_data$Date)

Output:

 Date[1:10], format: "2023-01-01" "2023-01-01" "2023-01-01" "2023-01-02" ...

Explanation:

  • as.Date() converts the Date column to a Date object.
  • format = "%Y-%m-%d" matches the CSV’s date format (YYYY-MM-DD).
  • str() confirms the column is now of type Date.

6. Accessing and Modifying the Data Frame

Data frames can be accessed and modified using indexing or column names.

Example: Accessing and Modifying Sales Data

# Step 1: Access the Sales column
sales_values <- sales_data$Sales
print(sales_values)

# Step 2: Access rows for New York
ny_sales <- sales_data[sales_data$Location == "New York", ]
print(ny_sales)

# Step 3: Modify Sales for StoreID 1
sales_data$Sales[sales_data$StoreID == 1] <- 1550
print(sales_data[sales_data$StoreID == 1, ])

# Step 4: Add a new column for Revenue per Unit
sales_data$RevenuePerUnit <- sales_data$Sales / sales_data$UnitsSold
print(head(sales_data))

Output:

# sales_values
 [1] 1500 1200  800 1600 1300  900 1400 1100  850 1700

# ny_sales
  StoreID Location       Date Sales UnitsSold   Category
1       1 New York 2023-01-01  1500        50 Electronics
4       4 New York 2023-01-02  1600        55 Electronics
7       7 New York 2023-01-03  1400        48 Electronics
10     10 New York 2023-01-04  1700        60 Electronics

# Modified StoreID 1
  StoreID Location       Date Sales UnitsSold   Category
1       1 New York 2023-01-01  1550        50 Electronics

# head(sales_data) with RevenuePerUnit
  StoreID   Location       Date Sales UnitsSold   Category RevenuePerUnit
1       1  New York 2023-01-01  1550        50 Electronics           31.0
2       2 Los Angeles 2023-01-01  1200        40   Clothing           30.0
3       3    Chicago 2023-01-01   800        30      Books           26.7
4       4  New York 2023-01-02  1600        55 Electronics           29.1
5       5 Los Angeles 2023-01-02  1300        45   Clothing           28.9
6       6    Chicago 2023-01-02   900        35      Books           25.7

Explanation:

  • Use $ to access a column (e.g., sales_data$Sales).
  • Filter rows using logical conditions (e.g., Location == "New York").
  • Modify specific values using conditional indexing.
  • Add a new column by computing RevenuePerUnit (Sales ÷ UnitsSold).

7. Filtering and Subsetting

Filtering extracts specific rows or columns based on conditions.

Example: Filtering Sales Data

# Step 1: Filter sales above 1400
high_sales <- subset(sales_data, Sales > 1400)
print(high_sales)

# Step 2: Select specific columns
location_sales <- sales_data[, c("Location", "Sales", "Category")]
print(head(location_sales))

# Step 3: Filter Electronics category in New York
ny_electronics <- subset(sales_data, Location == "New York" & Category == "Electronics")
print(ny_electronics)

Output:

# high_sales
  StoreID Location       Date Sales UnitsSold   Category RevenuePerUnit
1       1 New York 2023-01-01  1550        50 Electronics           31.0
4       4 New York 2023-01-02  1600        55 Electronics           29.1
10     10 New York 2023-01-04  1700        60 Electronics           28.3

# head(location_sales)
    Location Sales   Category
1  New York  1550 Electronics
2 Los Angeles  1200   Clothing
3    Chicago   800      Books
4  New York  1600 Electronics
5 Los Angeles  1300   Clothing
6    Chicago   900      Books

# ny_electronics
  StoreID Location       Date Sales UnitsSold   Category RevenuePerUnit
1       1 New York 2023-01-01  1550        50 Electronics           31.0
4       4 New York 2023-01-02  1600        55 Electronics           29.1
7       7 New York 2023-01-03  1400        48 Electronics           29.2
10     10 New York 2023-01-04  1700        60 Electronics           28.3

Explanation:

  • subset() filters rows based on conditions (e.g., Sales > 1400).
  • Select columns using a vector of column names.
  • Combine conditions with & for complex filtering (e.g., Location == "New York" & Category == "Electronics").

8. Aggregating Data

Aggregation summarizes data, such as calculating means or sums by group.

Example: Aggregating Sales Data

# Step 1: Calculate average sales by Location
avg_sales_location <- aggregate(Sales ~ Location, data = sales_data, mean)
print(avg_sales_location)

# Step 2: Calculate total units sold by Category
total_units_category <- aggregate(UnitsSold ~ Category, data = sales_data, sum)
print(total_units_category)

Output:

# avg_sales_location
     Location    Sales
1    Chicago   850.0
2 Los Angeles  1200.0
3   New York  1562.5

# total_units_category
    Category UnitsSold
1      Books        97
2   Clothing       123
3 Electronics       213

Explanation:

  • aggregate() groups data by Location or Category and applies mean or sum.
  • The result is a new data frame with one row per group.

9. Sorting Data

Sorting organizes the data frame for better analysis.

Example: Sorting Sales Data

# Step 1: Sort by Sales (descending)
sorted_by_sales <- sales_data[order(-sales_data$Sales), ]
print(head(sorted_by_sales))

# Step 2: Sort by Date (ascending)
sorted_by_date <- sales_data[order(sales_data$Date), ]
print(head(sorted_by_date))

Output:

# head(sorted_by_sales)
   StoreID Location       Date Sales UnitsSold   Category RevenuePerUnit
10      10 New York 2023-01-04  1700        60 Electronics           28.3
4        4 New York 2023-01-02  1600        55 Electronics           29.1
1        1 New York 2023-01-01  1550        50 Electronics           31.0
7        7 New York 2023-01-03  1400        48 Electronics           29.2
5        5 Los Angeles 2023-01-02  1300        45   Clothing           28.9
2        2 Los Angeles 2023-01-01  1200        40   Clothing           30.0

# head(sorted_by_date)
  StoreID   Location       Date Sales UnitsSold   Category RevenuePerUnit
1       1  New York 2023-01-01  1550        50 Electronics           31.0
2       2 Los Angeles 2023-01-01  1200        40   Clothing           30.0
3       3    Chicago 2023-01-01   800        30      Books           26.7
4       4  New York 2023-01-02  1600        55 Electronics           29.1
5       5 Los Angeles 2023-01-02  1300        45   Clothing           28.9
6       6    Chicago 2023-01-02   900        35      Books           25.7

Explanation:

  • order() with - sorts in descending order (e.g., -sales_data$Sales).
  • Without -, order() sorts in ascending order (e.g., sales_data$Date).

10. Visualizing Data

Basic visualization helps explore trends. We’ll use the base R plot() function for simplicity.

Example: Visualizing Sales Trends

# Step 1: Plot Sales over Date by Location
plot(sales_data$Date[sales_data$Location == "New York"], 
     sales_data$Sales[sales_data$Location == "New York"], 
     type = "l", col = "blue", xlab = "Date", ylab = "Sales", 
     main = "Sales Trend in New York")

Output:

  • A line plot showing sales trends for New York over the four days.

Explanation:

  • plot() creates a line plot (type = "l") for New York’s sales.
  • Subset data using sales_data$Location == "New York" to focus on one location.
  • Customize with xlab, ylab, and main for clarity.

11. Exporting the Data Frame

Save the modified data frame to a new CSV file.

Example: Exporting to CSV

# Step 1: Export the modified data frame
write.csv(sales_data, "sales_data_modified.csv", row.names = FALSE)

# Step 2: Verify by reading back
modified_data <- read.csv("sales_data_modified.csv")
head(modified_data)

Output:

  StoreID   Location       Date Sales UnitsSold   Category RevenuePerUnit
1       1  New York 2023-01-01  1550        50 Electronics           31.0
2       2 Los Angeles 2023-01-01  1200        40   Clothing           30.0
3       3    Chicago 2023-01-01   800        30      Books           26.7
4       4  New York 2023-01-02  1600        55 Electronics           29.1
5       5 Los Angeles 2023-01-02  1300        45   Clothing           28.9
6       6    Chicago 2023-01-02   900        35      Books           25.7

Explanation:

  • write.csv() saves the data frame to “sales_data_modified.csv”.
  • row.names = FALSE excludes row indices.
  • Reading back verifies the export.

12. Best Practices

  • Verify File Path: Ensure the CSV file is in the correct directory.
  • Check Data Types: Use str() to confirm column types (e.g., convert dates).
  • Handle Missing Values: Check for NA values if the dataset is large.
  • Comment Code: Add comments for clarity and reproducibility.
  • Save Modified Data: Export changes to avoid losing work.

13. Conclusion

This tutorial demonstrated how to read a CSV dataset into a data frame in R, inspect its structure, manipulate data, and perform analyses like filtering, aggregation, sorting, and visualization. The sample dataset “sales_data.csv” provided a practical context for learning. By mastering these techniques, you can efficiently work with real-world datasets in R.

Scroll to Top